
DeFuse

DeFuse is a software package for gene fusion discovery using RNA-Seq data. The software uses clusters of discordant paired-end alignments to inform a split-read alignment analysis for finding fusion boundaries. It also applies a number of heuristic filters to reduce the number of false positives and produces a fully annotated output for each predicted fusion.

Journal reference: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1001138

Inputs

DeFuse requires 2 fastq files for paired reads: one with the left mate of each pair, and a second with the right mate (with reads in the same order as in the first fastq dataset).

If your fastq files have reads in different orders or include unpaired reads, you can preprocess them with FASTQ interlacer to create a single interlaced fastq dataset containing only the paired reads, then pass that dataset to FASTQ de-interlacer to separate the reads into a left fastq and a right fastq.
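
One way to confirm that two fastq files list their reads in the same order is to compare read names record by record. The following is a minimal sketch, not part of DeFuse or Galaxy. The function names are hypothetical, and it assumes plain 4-line fastq records whose mate names differ only by an optional /1 or /2 suffix.

```python
import io
from itertools import zip_longest


def read_names(handle):
    """Yield the read name of each 4-line fastq record, minus any /1 or /2 suffix."""
    for line_number, line in enumerate(handle):
        if line_number % 4 != 0:
            continue  # skip sequence, '+' and quality lines
        fields = line.split()
        if not fields or not fields[0].startswith("@"):
            raise ValueError(f"line {line_number + 1} is not a fastq header")
        name = fields[0][1:]
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        yield name


def first_mismatch(left_handle, right_handle):
    """Return the 0-based index of the first record whose names differ, or None."""
    pairs = zip_longest(read_names(left_handle), read_names(right_handle))
    for index, (left_name, right_name) in enumerate(pairs):
        if left_name != right_name:
            return index
    return None


# Tiny in-memory example (made-up reads, for illustration only).
left = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r2/1\nACGT\n+\nIIII\n")
right = io.StringIO("@r1/2\nTTTT\n+\nIIII\n@r2/2\nGGGG\n+\nIIII\n")
print(first_mismatch(left, right))  # None means the files are in the same order
```

If the function returns an index rather than None, the files are out of step from that record onward (or one file is shorter), and the interlacer / de-interlacer preprocessing described above is the appropriate fix.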

DeFuse uses a Reference Dataset to search for gene fusions. The Reference Dataset is generated from the following sources in DeFuse_Version_0.4:

genome_fasta from Ensembl
gene_models from Ensembl
repeats_filename from UCSC RepeatMasker rmsk.txt
est_fasta from UCSC
est_alignments from UCSC intronEst.txt
unigene_fasta from NCBI

Outputs

The Galaxy history will contain 5 outputs: the config.txt file that provides DeFuse with its parameters, the defuse.log file, which details what DeFuse has done and can be useful in diagnosing errors, and the 3 results files that DeFuse generates.

DeFuse generates 3 results files: results.txt, results.filtered.txt, and results.classify.txt. All three files have the same format, though results.classify.txt has a probability column from the application of the classifier to results.txt, and results.filtered.txt has been filtered according to the threshold probability as set in config.txt.
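
The relationship between the classified and filtered files can be illustrated with a short sketch that reads a tab-delimited results file by column name and keeps predictions at or above a probability cutoff. This is not DeFuse's own code: the function name, the 0.5 cutoff, and the use of >= rather than > are assumptions, and the demo rows are invented.

```python
import csv
import io


def filter_by_probability(handle, threshold):
    """Return rows (as dicts keyed by column name) with probability >= threshold."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if float(row["probability"]) >= threshold]


# Invented two-row table standing in for results.classify.txt.
demo = io.StringIO(
    "cluster_id\tspan_count\tprobability\n"
    "1\t12\t0.91\n"
    "2\t3\t0.08\n"
)
kept = filter_by_probability(demo, threshold=0.5)  # 0.5 is a hypothetical cutoff
print([row["cluster_id"] for row in kept])  # ['1']
```

Reading by column name rather than by position keeps the sketch independent of the order in which DeFuse writes its columns.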

The file format is tab delimited with one prediction per line, and the following fields per prediction (not necessarily in this order):

Identification

cluster_id : random identifier assigned to each prediction

library_name : library name given on the command line of defuse

gene1 : ensembl id of gene 1

gene2 : ensembl id of gene 2

gene_name1 : name of gene 1

gene_name2 : name of gene 2

Evidence

break_predict : breakpoint prediction method, denovo or splitr, that is considered most reliable

concordant_ratio : proportion of spanning reads considered concordant by blat

denovo_min_count : minimum kmer count across denovo assembled sequence

denovo_sequence : fusion sequence predicted by debruijn based denovo sequence assembly

denovo_span_pvalue : p-value, lower values are evidence the prediction is a false positive

gene_align_strand1 : alignment strand for spanning read alignments to gene 1

gene_align_strand2 : alignment strand for spanning read alignments to gene 2

min_map_count : minimum of the number of genomic mappings for each spanning read

max_map_count : maximum of the number of genomic mappings for each spanning read

mean_map_count : average of the number of genomic mappings for each spanning read

num_multi_map : number of spanning reads that map to more than one genomic location

span_count : number of spanning reads supporting the fusion

span_coverage1 : coverage of spanning reads aligned to gene 1 as a proportion of expected coverage

span_coverage2 : coverage of spanning reads aligned to gene 2 as a proportion of expected coverage

span_coverage_min : minimum of span_coverage1 and span_coverage2

span_coverage_max : maximum of span_coverage1 and span_coverage2

splitr_count : number of split reads supporting the prediction

splitr_min_pvalue : p-value, lower values are evidence the prediction is a false positive

splitr_pos_pvalue : p-value, lower values are evidence the prediction is a false positive

splitr_sequence : fusion sequence predicted by split reads

splitr_span_pvalue : p-value, lower values are evidence the prediction is a false positive
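
The summary columns span_coverage_min and span_coverage_max follow directly from span_coverage1 and span_coverage2. A minimal sketch, using the two coverage values reported for prediction 1169 in the Example section; the helper name is hypothetical:

```python
def span_coverage_summary(row):
    """Return (span_coverage_min, span_coverage_max) for one prediction row."""
    cov1 = float(row["span_coverage1"])
    cov2 = float(row["span_coverage2"])
    return min(cov1, cov2), max(cov1, cov2)


# Values copied from prediction 1169 in the Example section.
row_1169 = {
    "span_coverage1": "0.758602776578432",
    "span_coverage2": "0.569678713445872",
}
cov_min, cov_max = span_coverage_summary(row_1169)
print(cov_min, cov_max)
```

For that prediction the results file reports the same two numbers in its span_coverage_min and span_coverage_max columns.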

Annotation

adjacent : fusion between adjacent genes

altsplice : fusion likely the product of alternative splicing between adjacent genes

break_adj_entropy1 : di-nucleotide entropy of the 40 nucleotides adjacent to the fusion splice in gene 1

break_adj_entropy2 : di-nucleotide entropy of the 40 nucleotides adjacent to the fusion splice in gene 2

break_adj_entropy_min : minimum of break_adj_entropy1 and break_adj_entropy2

breakpoint_homology : number of nucleotides at the fusion splice that align equally well to gene 1 or gene 2

breakseqs_estislands_percident : maximum percent identity of fusion sequence alignments to est islands

cdna_breakseqs_percident : maximum percent identity of fusion sequence alignments to cdna

deletion : fusion produced by a genomic deletion

est_breakseqs_percident : maximum percent identity of fusion sequence alignments to est

eversion : fusion produced by a genomic eversion

exonboundaries : fusion splice at exon boundaries

expression1 : expression of gene 1 as number of concordant pairs aligned to exons

expression2 : expression of gene 2 as number of concordant pairs aligned to exons

gene_chromosome1 : chromosome of gene 1

gene_chromosome2 : chromosome of gene 2

gene_end1 : end position for gene 1

gene_end2 : end position for gene 2

gene_location1 : location of breakpoint in gene 1

gene_location2 : location of breakpoint in gene 2

gene_start1 : start of gene 1

gene_start2 : start of gene 2

gene_strand1 : strand of gene 1

gene_strand2 : strand of gene 2

genome_breakseqs_percident : maximum percent identity of fusion sequence alignments to genome

genomic_break_pos1 : genomic position in gene 1 of fusion splice / breakpoint

genomic_break_pos2 : genomic position in gene 2 of fusion splice / breakpoint

genomic_strand1 : genomic strand in gene 1 of fusion splice / breakpoint, retained sequence upstream on this strand, breakpoint is downstream

genomic_strand2 : genomic strand in gene 2 of fusion splice / breakpoint, retained sequence upstream on this strand, breakpoint is downstream

interchromosomal : fusion produced by an interchromosomal translocation

interrupted_index1 : ratio of coverage before and after the fusion splice / breakpoint in gene 1

interrupted_index2 : ratio of coverage before and after the fusion splice / breakpoint in gene 2

inversion : fusion produced by genomic inversion

orf : fusion combines genes in a way that preserves a reading frame

probability : probability produced by classification using adaboost and example positives/negatives (only given in results.classify.txt)

read_through : fusion involving adjacent genes, potentially resulting from co-transcription rather than genome rearrangement

repeat_proportion1 : proportion of the spanning reads in gene 1 that span a repeat region

repeat_proportion2 : proportion of the spanning reads in gene 2 that span a repeat region

max_repeat_proportion : max of repeat_proportion1 and repeat_proportion2

splice_score : number of nucleotides similar to GTAG at fusion splice

num_splice_variants : number of potential splice variants for this gene pair

splicing_index1 : number of concordant pairs in gene 1 spanning the fusion splice / breakpoint, divided by number of spanning reads supporting the fusion with gene 2

splicing_index2 : number of concordant pairs in gene 2 spanning the fusion splice / breakpoint, divided by number of spanning reads supporting the fusion with gene 1
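
To give a feel for what a di-nucleotide entropy such as break_adj_entropy1 measures, the sketch below computes the Shannon entropy of overlapping di-nucleotides in a sequence. This is an illustration only: the description above does not say whether DeFuse counts overlapping pairs or which logarithm base it uses, so the overlapping windows, the base-2 logarithm, and the function name are all assumptions.

```python
import math
from collections import Counter


def dinucleotide_entropy(sequence):
    """Shannon entropy (bits) of overlapping di-nucleotides in `sequence`."""
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    if not pairs:
        return 0.0  # fewer than 2 nucleotides: no di-nucleotides to count
    counts = Counter(pairs)
    total = len(pairs)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# A homopolymer has a single di-nucleotide type, so its entropy is zero;
# a more varied sequence scores higher.
print(dinucleotide_entropy("AAAAAAAA"))
print(dinucleotide_entropy("ACGTTGCA"))
```

Under these assumptions, low values indicate repetitive, low-complexity sequence next to the fusion splice, and the maximum possible value is 4 bits (all 16 di-nucleotides equally frequent).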

Example

results.tsv:

cluster_id    splitr_sequence splitr_count    splitr_span_pvalue      splitr_pos_pvalue       splitr_min_pvalue       adjacent        altsplice       break_adj_entropy1      break_adj_entropy2      break_adj_entropy_min   break_predict   breakpoint_homology     breakseqs_estislands_percident  cdna_breakseqs_percident        concordant_ratio        deletion        est_breakseqs_percident eversion        exonboundaries  expression1     expression2     gene1   gene2   gene_align_strand1      gene_align_strand2      gene_chromosome1        gene_chromosome2        gene_end1       gene_end2       gene_location1  gene_location2  gene_name1      gene_name2      gene_start1     gene_start2     gene_strand1    gene_strand2    genome_breakseqs_percident      genomic_break_pos1      genomic_break_pos2      genomic_strand1 genomic_strand2 interchromosomal        interrupted_index1      interrupted_index2      inversion       library_name    max_map_count   max_repeat_proportion   mean_map_count  min_map_count   num_multi_map   num_splice_variants     orf     read_through    repeat_proportion1      repeat_proportion2      span_count      span_coverage1  span_coverage2  span_coverage_max       span_coverage_min       splice_score    splicing_index1 splicing_index2
1169  GCTTACTGTATGCCAGGCCCCAGAGGGGCAACCACCCTCTAAAGAGAGCGGCTCCTGCCTCCCAGAAAGCTCACAGACTGTGGGAGGGAAACAGGCAGCAGGTGAAGATGCCAAATGCCAGGATATCTGCCCTGTCCTTGCTTGATGCAGCTGCTGGCTCCCACGTTCTCCCCAGAATCCCCTCACACTCCTGCTGTTTTCTCTGCAGGTTGGCAGAGCCCCATGAGGGCAGGGCAGCCACTTTGTTCTTGGGCGGCAAACCTCCCTGGGCGGCACGGAAACCACGGTGAGAAGGGGGCAGGTCGGGCACGTGCAGGGACCACGCTGCAGG|TGTACCCAACAGCTCCGAAGAGACAGCGACCATCGAGAACGGGCCATGATGACGATGGCGGTTTTGTCGAAAAGAAAAGGGGGAAATGTGGGGAAAAGCAAGAGAGATCAGATTGTTACTGTGTCTGTGTAGAAAGAAGTAGACATGGGAGACTCCATTTTGTTCTGTACTAAGAAAAATTCTTCTGCCTTGAGATTCGGTGACCCCACCCCCAACCCCGTGCTCTCTGAAACATGTGCTGTGTCCACTCAGGGTTGAATGGATTAAGGGCGGTGCGAGACGTGCTTT    2       0.000436307890680442    0.110748295953850       0.0880671602973091      N       Y       3.19872427442695        3.48337348351473        3.19872427442695        splitr  0       0       0       0       Y       0       N       N       0       0       ENSG00000105549 ENSG00000213753 +       -       19      19      376013  59111168        intron  upstream        THEG    AC016629.2      361750  59084870        -       +       0       375099  386594  +       -       N       8.34107429512245        -       N       output_dir      82      0.677852348993289       40.6666666666667        1       11      1       N       N       0.361271676300578       0.677852348993289       12      0.758602776578432       0.569678713445872       0.758602776578432       0.569678713445872       2       0.416666666666667       -
3596  TGGGGGTTGAGGCTTCTGTTCCCAGGTTCCATGACCTCAGAGGTGGCTGGTGAGGTTATGACCTTTGCCCTCCAGCCCTGGCTTAAAACCTCAGCCCTAGGACCTGGTTAAAGGAAGGGGAGATGGAGCTTTGCCCCGACCCCCCCCCGTTCCCCTCACCTGTCAGCCCGAGCTGGGCCAGGGCCCCTAGGTGGGGAACTGGGCCGGGGGGCGGGCACAAGCGGAGGTGGTGCCCCCAAAAGGGCTCCCGGTGGGGTCTTGCTGAGAAGGTGAGGGGTTCCCGGGGCCGCAGCAGGTGGTGGTGGAGGAGCCAAGCGGCTGTAGAGCAAGGGGTGAGCAGGTTCCAGACCGTAGAGGCGGGCAGCGGCCACGGCCCCGGGTCCAGTTAGCTCCTCACCCGCCTCATAGAAGCGGGGTGGCCTTGCCAGGCGTGGGGGTGCTGCC|TTCCTTGGATGTGGTAGCCGTTTCTCAGGCTCCCTCTCCGGAATCGAACCCTGATTCCCCGTCACCCGTGGTCACCATGGTAGGCACGGCGACTACCATCGAAAGTTGATAGGGCAGACGTTCGAATGGGTCGTCGCCGCCACGGGGGGCGTGCGATCAGCCCGAGGTTATCTAGAGTCACCAAAGCCGCCGGCGCCCGCCCCCCGGCCGGGGCCGGAGAGGGGCTGACCGGGTTGGTTTTGATCTGATAAATGCACGCATCCCCCCCGCGAAGGGGGTCAGCGCCCGTCGGCATGTATTAGCTCTAGAATTACCACAGTTATCCAAGTAGGAGAGGAGCGAGCGACCAAAGGAACCATAACTGATTTAATGAGCCATTCGCAGTTTCACTGTACCGGCCGTGCGTACTTAGACATGCATGGCTTAATCTTTGAGACAAGCATATGCTACTGGCAGG  250     7.00711162298275e-72    0.00912124762512338     0.00684237452309549     N       N       3.31745197152461        3.47233119514066        3.31745197152461        splitr  7       0.0157657657657656      0       0       N       0.0135135135135136      N       N       0       0       ENSG00000156860 ENSG00000212932 -       +       16      21      30682131        48111157        coding  upstream        FBRS    RPL23AP4        30670289        48110676        +       +       0.0157657657657656      30680678        9827473 -       +       Y       -       -       N       output_dir      2       1       1.11111111111111        1       1       1       N       N       0       1       9       0.325530693397641       0.296465452915709       0.325530693397641       0.296465452915709       2       -       -
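
In both example rows the splitr_sequence value contains a single "|" character, which appears to mark the fusion boundary between the sequence contributed by each gene. Assuming that reading is correct, the two flanks can be separated as sketched below; the function name is hypothetical, and the demo string is a shortened excerpt around the "|" in prediction 1169.

```python
def split_fusion_sequence(sequence):
    """Split a fusion sequence at its single '|' into (left_flank, right_flank)."""
    left, separator, right = sequence.partition("|")
    if not separator or "|" in right:
        raise ValueError("expected exactly one '|' in the fusion sequence")
    return left, right


# Shortened excerpt of the splitr_sequence for prediction 1169.
left_flank, right_flank = split_fusion_sequence("CACGCTGCAGG|TGTACCCAACAG")
print(left_flank)   # CACGCTGCAGG
print(right_flank)  # TGTACCCAACAG
```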