Filter SPAdes repeats (version 1.0.1)
Contigs or scaffolds output file (fasta) from SPAdes.
The stats file that corresponds to the fasta file selected above.
Coverage cut-off ratio: the fraction of the average coverage below which contigs are eliminated. For example, with an average coverage of 100 and a coverage cut-off ratio of 0.5, any contig with coverage lower than 50 will be eliminated.
Repeat cut-off ratio: the multiple of the average coverage at or above which contigs are marked as repeats. For example, with an average coverage of 100 and a repeat cut-off ratio of 1.75, any contig with coverage greater than or equal to 175 will be marked as a repeat.
Length cut-off: contigs shorter than this length will be eliminated.
Length for average coverage calculation: only contigs of at least this length are used to calculate the average coverage.

What does it do?

Using the output of SPAdes (a fasta file and a stats file, from either contigs or scaffolds), this tool filters the fasta file, discarding all sequences that are shorter than a given length or whose coverage falls below a calculated cut-off. Repeated contigs are detected from their coverage.

Output

  • Filtered sequences (with repeats)
      • Will contain the filtered contigs/scaffolds, including the repeats. These are the sequences that passed the length and minimum coverage cut-offs.
      • For workflows, this output is named output_with_repeats
  • Filtered sequences (no repeats)
      • Will contain the filtered contigs/scaffolds, excluding the repeats. These are the sequences that passed the length, minimum coverage and repeat cut-offs.
      • For workflows, this output is named output_without_repeats
  • Repeat sequences
      • Will contain the repeated contigs/scaffolds only. These are the sequences that were excluded for having high coverage (determined by the repeat cut-off).
      • For workflows, this output is named repeat_sequences_only
  • Discarded sequences
      • If selected, will contain the discarded sequences. These are the sequences that fell below the length cut-off or the minimum coverage cut-off.
      • For workflows, this output is named discarded_sequences
  • Results summary: If selected, will contain a summary of all the results.

Example

Stats file input:

#name length coverage
NODE_1 2500 15.5
NODE_2 102 3.0
NODE_3 1300 50.0
NODE_4 1000 2.3
NODE_5 5000 14.3
NODE_6 450 25.2
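
As a rough illustration of how a stats file like the one above could be read, here is a minimal Python sketch. It assumes whitespace-separated columns and a header line starting with `#`; the names `ContigStat` and `parse_stats` are hypothetical and are not part of the tool.

```python
from typing import List, NamedTuple


class ContigStat(NamedTuple):
    """One row of the stats file (hypothetical helper type)."""
    name: str
    length: int
    coverage: float


def parse_stats(text: str) -> List[ContigStat]:
    """Parse 'name length coverage' rows, skipping blank lines and '#' comments."""
    stats = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumption: columns are separated by spaces or tabs.
        name, length, coverage = line.split()
        stats.append(ContigStat(name, int(length), float(coverage)))
    return stats


EXAMPLE = """\
#name length coverage
NODE_1 2500 15.5
NODE_2 102 3.0
NODE_3 1300 50.0
NODE_4 1000 2.3
NODE_5 5000 14.3
NODE_6 450 25.2
"""

stats = parse_stats(EXAMPLE)
```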

User Inputs:

  • Coverage cut-off ratio = 0.33
  • Repeat cut-off ratio = 1.75
  • Length cut-off = 500
  • Length for average coverage calculation = 1000

Calculations:

Average coverage will be calculated based on contigs with length >= 1000 bp (NODE_1, NODE_3, NODE_4 and NODE_5).

  • Average coverage = (15.5 + 50.0 + 2.3 + 14.3) / 4 = 20.5

Contigs with coverage below the coverage cut-off (0.33 times the average coverage, roughly one third) will be eliminated.

  • Coverage cut-off = 20.5 * 0.33 = 6.8

Contigs with high coverage (at least 1.75 times the average coverage) are considered to be repeated contigs.

  • Repeat cut-off = 20.5 * 1.75 = 35.9

The number of copies is calculated by dividing the sequence coverage by the average coverage.

  • Number of copies for NODE_3 = 50.0 / 20.5 ≈ 2.4, reported as 2 copies
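
The arithmetic in the calculations above can be reproduced with a short Python sketch. The function name `average_coverage` is hypothetical. Truncating the copy number to a whole number is an assumption: the example only shows that 50 / 20.5 is reported as 2 copies, which rounding would also produce.

```python
def average_coverage(contigs, min_length):
    """Mean coverage of the contigs whose length is >= min_length."""
    selected = [cov for _, length, cov in contigs if length >= min_length]
    if not selected:
        raise ValueError("no contigs long enough to estimate average coverage")
    return sum(selected) / len(selected)


# (name, length, coverage) rows from the example stats file
contigs = [
    ("NODE_1", 2500, 15.5),
    ("NODE_2", 102, 3.0),
    ("NODE_3", 1300, 50.0),
    ("NODE_4", 1000, 2.3),
    ("NODE_5", 5000, 14.3),
    ("NODE_6", 450, 25.2),
]

avg = average_coverage(contigs, min_length=1000)
coverage_cutoff = avg * 0.33   # coverage cut-off ratio
repeat_cutoff = avg * 1.75     # repeat cut-off ratio
copies_node3 = int(50.0 / avg)  # assumption: copy number is truncated

print(round(avg, 1), round(coverage_cutoff, 1), round(repeat_cutoff, 1), copies_node3)
# prints: 20.5 6.8 35.9 2
```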

Output (in fasta format):

Filtered sequences (with repeats)

>NODE_1
>NODE_3 (2 copies)
>NODE_5

Filtered sequences (no repeats)

>NODE_1
>NODE_5

Repeat sequences

>NODE_3 (2 copies)

Discarded sequences

>NODE_2
>NODE_4
>NODE_6
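
To show how the four outputs in this example follow from the cut-offs, the sketch below sorts the example contigs into lists keyed by the workflow output names. It is an illustration under stated assumptions, not the tool's actual code: the function `classify` is hypothetical, the average coverage of 20.5 is taken from the calculations above, and the comparisons (`<` for discarding, `>=` for repeats) follow the parameter descriptions.

```python
def classify(contigs, avg_cov, cov_ratio, repeat_ratio, min_length):
    """Sort contig names into the four outputs described above."""
    cov_cutoff = avg_cov * cov_ratio
    repeat_cutoff = avg_cov * repeat_ratio
    out = {
        "output_with_repeats": [],
        "output_without_repeats": [],
        "repeat_sequences_only": [],
        "discarded_sequences": [],
    }
    for name, length, cov in contigs:
        # Too short or too little coverage: discarded.
        if length < min_length or cov < cov_cutoff:
            out["discarded_sequences"].append(name)
            continue
        # Everything that survives goes into the "with repeats" output.
        out["output_with_repeats"].append(name)
        if cov >= repeat_cutoff:
            out["repeat_sequences_only"].append(name)
        else:
            out["output_without_repeats"].append(name)
    return out


# (name, length, coverage) rows from the example stats file
contigs = [
    ("NODE_1", 2500, 15.5),
    ("NODE_2", 102, 3.0),
    ("NODE_3", 1300, 50.0),
    ("NODE_4", 1000, 2.3),
    ("NODE_5", 5000, 14.3),
    ("NODE_6", 450, 25.2),
]

result = classify(contigs, avg_cov=20.5, cov_ratio=0.33,
                  repeat_ratio=1.75, min_length=500)
for output_name, names in result.items():
    print(output_name, names)
```

Note that NODE_6 is discarded on length alone (450 < 500) even though its coverage is above the coverage cut-off, while NODE_4 is discarded on coverage alone.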