Filter SPAdes repeats (version 1.0.1)

Contigs or scaffolds file:

Contigs or scaffolds output file from SPAdes

Stats file:

The stats file that corresponds to the fasta file selected above

Coverage cut-off ratio:

This is the average coverage ratio cutoff. For example: if the average coverage is 100 and a coverage cut-off ratio of 0.5 is used, then any contigs with coverage lower than 50 will be eliminated.

Repeat cut-off ratio:

This is the coverage ratio cutoff used to identify repeats among the contigs. For example: if the average coverage is 100 and a repeat cut-off ratio of 1.75 is used, then any contigs with coverage greater than or equal to 175 will be marked as repeats.

Length cut-off:

Contigs with length under the chosen cut-off will be eliminated.

Length for average coverage calculation:

Only contigs of at least this length will be used to calculate the average coverage.

Print out a fasta file containing the discarded sequences?:

Print out a summary of all the results?:

What does it do?

Using the output of SPAdes (a fasta and a stats file, either from contigs or scaffolds), it filters the fasta files, discarding all sequences that are under a given length or under a calculated coverage. Repeated contigs are detected based on coverage.

Output

Filtered sequences (with repeats)

Will contain the filtered contigs/scaffolds including the repeats. These are the sequences that passed the length and minimum coverage cutoffs.

For workflows, this output is named output_with_repeats

Filtered sequences (no repeats)

Will contain the filtered contigs/scaffolds excluding the repeats. These are the sequences that passed the length, minimum coverage and repeat cutoffs.

For workflows, this output is named output_without_repeats

Repeat sequences

Will contain the repeated contigs/scaffolds only. These are the sequences that were excluded for having high coverage (determined by the repeat cutoff).

For workflows, this output is named repeat_sequences_only

Discarded sequences

If selected, will contain the discarded sequences. These are the sequences that fell below the length cutoff or the minimum coverage cutoff and were discarded.

For workflows, this output is named discarded_sequences

Results summary

If selected, will contain a summary of all the results.

Example

Stats file input:

#name	length	coverage
NODE_1	2500	15.5
NODE_2	102	3.0
NODE_3	1300	50.0
NODE_4	1000	2.3
NODE_5	5000	14.3
NODE_6	450	25.2

User Inputs:

Coverage cut-off ratio = 0.33
Repeat cut-off ratio = 1.75
Length cut-off = 500
Length for average coverage calculation = 1000

Calculations:

Average coverage will be calculated based on contigs with length >= 1000bp (NODE_1, NODE_3, NODE_4 and NODE_5)

Average coverage = (15.5 + 50.0 + 2.3 + 14.3) / 4 = 20.5

Contigs with coverage below 0.33 times (about one third of) the average coverage will be eliminated.

Coverage cut-off = 20.5 * 0.33 = 6.8

Contigs with high coverage (at least 1.75 times the average coverage) are considered to be repeated contigs.

Repeat cut-off = 20.5 * 1.75 = 35.9
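
The threshold arithmetic above can be reproduced with a short Python sketch. This is only an illustration of the calculation described here, not the tool's own code; the function name coverage_thresholds and its parameter names are hypothetical.

```python
# Stats rows from the example above: (name, length, coverage).
stats = [
    ("NODE_1", 2500, 15.5),
    ("NODE_2", 102, 3.0),
    ("NODE_3", 1300, 50.0),
    ("NODE_4", 1000, 2.3),
    ("NODE_5", 5000, 14.3),
    ("NODE_6", 450, 25.2),
]


def coverage_thresholds(stats, cov_ratio, repeat_ratio, min_len_for_avg):
    """Return (average coverage, coverage cut-off, repeat cut-off).

    Hypothetical helper: only contigs with length >= min_len_for_avg
    contribute to the average, as in the worked example.
    """
    used = [cov for _, length, cov in stats if length >= min_len_for_avg]
    if not used:
        raise ValueError("no contigs long enough to compute average coverage")
    avg = sum(used) / len(used)
    return avg, avg * cov_ratio, avg * repeat_ratio


avg, cov_cutoff, repeat_cutoff = coverage_thresholds(stats, 0.33, 1.75, 1000)
print(f"{avg:.1f} {cov_cutoff:.1f} {repeat_cutoff:.1f}")  # prints: 20.5 6.8 35.9
```

The sketch keeps the unrounded average (20.525) when deriving the two cut-offs, so intermediate values differ slightly from the hand calculation, but they round to the same figures.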

The number of copies is calculated by dividing the sequence coverage by the average coverage.

Number of copies for NODE_3 = 50.0 / 20.5 = 2.4, reported as 2 copies
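
A minimal Python sketch of the copy estimate follows. The function name estimate_copies is hypothetical, and rounding to the nearest whole number is an assumption: the example only shows that a ratio of about 2.4 is reported as 2 copies, which truncation would also produce.

```python
def estimate_copies(coverage, average_coverage):
    """Estimate how many copies of a repeat a contig represents.

    Assumption: the ratio is rounded to the nearest whole number.
    The tool may truncate instead; both give 2 for this example.
    """
    return round(coverage / average_coverage)


copies = estimate_copies(50.0, 20.5)
print(copies)  # prints: 2
```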

Output (in fasta format):

Filtered sequences (with repeats)

>NODE_1
>NODE_3 (2 copies)
>NODE_5

Filtered sequences (no repeats)

>NODE_1
>NODE_5

Repeat sequences

>NODE_3 (2 copies)

Discarded sequences

>NODE_2
>NODE_4
>NODE_6
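
The assignment of contigs to the four outputs can be sketched in Python as below. This is an illustrative reconstruction under the rules stated on this page, not the tool's implementation; the function name classify_contigs and its parameters are hypothetical, and the dictionary keys reuse the workflow output names listed above.

```python
# Stats rows from the example above: (name, length, coverage).
stats = [
    ("NODE_1", 2500, 15.5),
    ("NODE_2", 102, 3.0),
    ("NODE_3", 1300, 50.0),
    ("NODE_4", 1000, 2.3),
    ("NODE_5", 5000, 14.3),
    ("NODE_6", 450, 25.2),
]


def classify_contigs(stats, cov_ratio, repeat_ratio, min_len, min_len_for_avg):
    """Sort contig names into the four outputs described on this page."""
    used = [cov for _, length, cov in stats if length >= min_len_for_avg]
    avg = sum(used) / len(used)
    cov_cutoff = avg * cov_ratio
    repeat_cutoff = avg * repeat_ratio

    outputs = {
        "output_with_repeats": [],
        "output_without_repeats": [],
        "repeat_sequences_only": [],
        "discarded_sequences": [],
    }
    for name, length, cov in stats:
        if length < min_len or cov < cov_cutoff:
            # Too short or too thinly covered: discarded.
            outputs["discarded_sequences"].append(name)
            continue
        outputs["output_with_repeats"].append(name)
        if cov >= repeat_cutoff:
            # High coverage relative to the average: treated as a repeat.
            outputs["repeat_sequences_only"].append(name)
        else:
            outputs["output_without_repeats"].append(name)
    return outputs


result = classify_contigs(stats, 0.33, 1.75, 500, 1000)
for output_name, names in result.items():
    print(output_name, names)
```

With the example inputs, NODE_2 fails both the length and coverage checks, NODE_4 fails on coverage only, and NODE_6 fails on length only, which matches the discarded list shown above.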