Using the output of SPAdes (a FASTA file and a stats file, from either contigs or scaffolds), this tool filters the FASTA file, discarding all sequences that are shorter than a given length or that fall below a calculated coverage. Repeated contigs are detected based on coverage.
- Filtered sequences (with repeats)
  - Will contain the filtered contigs/scaffolds, including the repeats. These are the sequences that passed the length and minimum coverage cutoffs.
  - For workflows, this output is named output_with_repeats
- Filtered sequences (no repeats)
  - Will contain the filtered contigs/scaffolds, excluding the repeats. These are the sequences that passed the length, minimum coverage and repeat cutoffs.
  - For workflows, this output is named output_without_repeats
- Repeat sequences
  - Will contain the repeated contigs/scaffolds only. These are the sequences that were excluded for having high coverage (determined by the repeat cutoff).
  - For workflows, this output is named repeat_sequences_only
- Discarded sequences
  - If selected, will contain the discarded sequences. These are the sequences that fell below the length or minimum coverage cutoff and were discarded.
  - For workflows, this output is named discarded_sequences
- Results summary: If selected, will contain a summary of all the results.
#name | length | coverage |
---|---|---|
NODE_1 | 2500 | 15.5 |
NODE_2 | 102 | 3.0 |
NODE_3 | 1300 | 50.0 |
NODE_4 | 1000 | 2.3 |
NODE_5 | 5000 | 14.3 |
NODE_6 | 450 | 25.2 |
Average coverage is calculated from contigs with length >= 1000 bp.
Contigs with coverage below 1/3 of the average coverage will be eliminated.
Contigs with high coverage (larger than 1.75 times the average coverage) are considered to be repeated contigs.
The number of copies is calculated by dividing the sequence coverage by the average coverage.
Filtered sequences (with repeats)
>NODE_1 >NODE_3 (2 copies) >NODE_5
Filtered sequences (no repeats)
>NODE_1 >NODE_5
Repeat sequences
>NODE_3 (2 copies)
Discarded sequences
>NODE_2 >NODE_4 >NODE_6
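The worked example above can be reproduced with a short Python sketch. This is only an illustration of the stated rules, not the tool's actual implementation. Several details are assumptions: the function name `classify`, a length cutoff of 1000 bp (the example only implies a value above 450 and at most 1300), an unweighted mean for the average coverage, and truncation of the copy number to an integer. The dictionary keys reuse the workflow output names listed earlier.

```python
# Illustrative sketch only; names and defaults are assumptions, not the tool's code.
AVG_MIN_LENGTH = 1000   # from the text: average uses contigs with length >= 1000 bp
LENGTH_CUTOFF = 1000    # assumed value; the example only implies 450 < cutoff <= 1300
COV_CUTOFF = 1 / 3      # from the text: fraction of the average coverage
REPEAT_CUTOFF = 1.75    # from the text: multiple of the average coverage

# (name, length, coverage) rows from the example stats table
STATS = [
    ("NODE_1", 2500, 15.5),
    ("NODE_2", 102, 3.0),
    ("NODE_3", 1300, 50.0),
    ("NODE_4", 1000, 2.3),
    ("NODE_5", 5000, 14.3),
    ("NODE_6", 450, 25.2),
]


def classify(stats, length_cutoff=LENGTH_CUTOFF,
             cov_cutoff=COV_CUTOFF, repeat_cutoff=REPEAT_CUTOFF):
    """Sort contigs into the four outputs described above."""
    covs = [cov for _, length, cov in stats if length >= AVG_MIN_LENGTH]
    if not covs:
        raise ValueError("no contigs long enough to compute average coverage")
    avg = sum(covs) / len(covs)  # assumption: plain (unweighted) mean

    result = {
        "average_coverage": avg,
        "output_with_repeats": [],
        "output_without_repeats": [],
        "repeat_sequences_only": [],
        "discarded_sequences": [],
    }
    for name, length, cov in stats:
        # Too short or too low in coverage: discard.
        if length < length_cutoff or cov < cov_cutoff * avg:
            result["discarded_sequences"].append(name)
            continue
        result["output_with_repeats"].append(name)
        if cov > repeat_cutoff * avg:
            copies = int(cov / avg)  # assumption: copy number is truncated
            result["repeat_sequences_only"].append((name, copies))
        else:
            result["output_without_repeats"].append(name)
    return result


result = classify(STATS)
for key, value in result.items():
    print(key, value)
```

Under these assumptions the average coverage comes from NODE_1, NODE_3, NODE_4 and NODE_5, which gives 82.1 / 4 = 20.525. NODE_4 falls below one third of that value and is discarded, while NODE_3 exceeds 1.75 times the average and is reported as a repeat with 2 copies (50.0 / 20.525 is about 2.4).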