Description

Removes duplicate sequences (dereplication) using one of the two modes described below. This tool is part of the USEARCH tool suite.

Input

File of reads in FASTA format.

Parameters

Full length: Sequences are compared over their full length. Within each group of identical sequences, all but one are removed.
Prefix: A sequence (A) is discarded if it is a prefix of another sequence (B), that is, if A is identical to the first len(A) letters of B.
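
The two modes can be illustrated with a minimal Python sketch. This is not USEARCH's implementation: the function names (collapse, derep_full_length, derep_prefix) are hypothetical, records are assumed to be (identifier, sequence) pairs already read from the FASTA file, the first identifier seen is assumed to become the representative, and a prefix that matches several longer sequences is assigned to the first one found, which is an arbitrary choice made for illustration.

```python
def collapse(records):
    """Group identical sequences: {sequence: [first_identifier, count]}, in input order."""
    groups = {}
    for identifier, sequence in records:
        if sequence in groups:
            groups[sequence][1] += 1
        else:
            groups[sequence] = [identifier, 1]
    return groups


def derep_full_length(records):
    """Full length mode: keep one record per group of identical sequences."""
    return [(f"{ident};size={count};", seq)
            for seq, (ident, count) in collapse(records).items()]


def derep_prefix(records):
    """Prefix mode: also discard sequences that are a prefix of a longer kept sequence."""
    groups = collapse(records)
    kept = []  # entries are [identifier, count, sequence], longest sequences first
    for seq in sorted(groups, key=len, reverse=True):
        ident, count = groups[seq]
        for representative in kept:
            if representative[2].startswith(seq):
                # Assumption: fold the prefix into the first matching longer sequence.
                representative[1] += count
                break
        else:
            kept.append([ident, count, seq])
    return [(f"{ident};size={count};", seq) for ident, count, seq in kept]


records = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "ACG"), ("r4", "TTTT")]
full_result = derep_full_length(records)
prefix_result = derep_prefix(records)
```

In this toy input, r2 duplicates r1 in both modes, while r3 is removed only in prefix mode because ACG is the start of ACGT. The quadratic prefix search is chosen for readability and would be too slow for real read sets.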

Output

A FASTA file containing only the unique sequences, as determined by the chosen duplicate-detection mode. The identifier line of each sequence gives the identifier of the representative sequence, followed by the number of sequences it represents.

e.g. >sequenceXXXX;size=1443;

sequenceXXXX is the representative of 1443 identical sequences.
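
For downstream processing, the identifier line can be split into the representative's name and its size. The following is a minimal Python sketch, assuming the annotation has the form shown in the example above (";size=N;" at the end of the line). The name parse_header is hypothetical, and treating a line without an annotation as size 1 is an assumption, not documented behaviour.

```python
import re

# Matches a trailing ";size=N" annotation, with or without the final semicolon.
SIZE_PATTERN = re.compile(r";size=(\d+);?$")


def parse_header(header):
    """Return (identifier, size) from a FASTA identifier line."""
    text = header.lstrip(">").strip()
    match = SIZE_PATTERN.search(text)
    if match is None:
        # Assumption: an unannotated sequence stands for itself only.
        return text, 1
    return text[:match.start()], int(match.group(1))


identifier, size = parse_header(">sequenceXXXX;size=1443;")
```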

Resources

Dereplication

Author

Robert C. Edgar (bob@drive5.com)

Wrapper Author

QFAB Bioinformatics (support@qfab.org)