
cd-hit (version 4.8.1+galaxy0)

Sequences to cluster/compare:

Cluster / Compare (i.e. call cd-hit[-est] / cd-hit[-est]-2d)?:

Sequence type?:

For nucleotides, the -est variant of cd-hit is called.

Sequence identity threshold:

Global sequence identity: number of identical alignment positions divided by the full length of the shorter sequence

Word size:

Suggested word size (-n):
5 for thresholds 0.7 ~ 1.0
4 for thresholds 0.6 ~ 0.7
3 for thresholds 0.5 ~ 0.6
2 for thresholds 0.4 ~ 0.5

Tolerance for redundance:

Advanced options

Print alignment overlap in .clstr file?:
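The identity definition and the word-size guidance in the form fields above can be illustrated with a minimal Python sketch. The function names `global_identity` and `suggested_word_size` are hypothetical, the input is assumed to be an already computed pairwise alignment with `-` as the gap character, and the handling of the overlapping range boundaries (for example 0.7) is an assumption. This is not cd-hit's own implementation.

```python
def global_identity(aligned_a: str, aligned_b: str) -> float:
    """Identical alignment positions divided by the length of the shorter sequence.

    Both arguments are rows of one pairwise alignment (equal length, '-' = gap).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Count columns where both rows carry the same residue (gaps never count).
    identical = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    # Ungapped lengths of the two original sequences.
    len_a = len(aligned_a.replace("-", ""))
    len_b = len(aligned_b.replace("-", ""))
    shorter = min(len_a, len_b)
    if shorter == 0:
        raise ValueError("sequences must not be empty")
    return identical / shorter


def suggested_word_size(threshold: float) -> int:
    """Map an identity threshold to the word size suggested in the form help.

    Assumption: a threshold on a shared boundary (e.g. 0.7) gets the larger word size.
    """
    if not 0.4 <= threshold <= 1.0:
        raise ValueError("the form help only lists thresholds from 0.4 to 1.0")
    if threshold >= 0.7:
        return 5
    if threshold >= 0.6:
        return 4
    if threshold >= 0.5:
        return 3
    return 2
```

For example, `suggested_word_size(0.65)` returns 4, matching the "0.6 ~ 0.7" row of the suggestion list.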

What it does

cd-hit stands for Cluster Database at High Identity with Tolerance. The tool implements four variants: cd-hit, cd-hit-est, cd-hit-2d, and cd-hit-est-2d.

The program cd-hit (resp. cd-hit-est) takes a FASTA-format amino acid (resp. nucleotide) sequence database as input and produces a set of 'non-redundant' (nr) representative sequences as output. In addition, cd-hit outputs a cluster file documenting the members of the sequence cluster for each nr representative sequence. The idea is to reduce the overall size of the database without losing sequence information, by removing only 'redundant' (highly similar) sequences. This is why the resulting database is called non-redundant (nr). Essentially, cd-hit (resp. cd-hit-est) produces a set of closely related protein (resp. nucleotide sequence) families from a given FASTA sequence database.

The program cd-hit-2d (resp. cd-hit-est-2d) compares two amino acid (resp. nucleotide) sequence datasets (db1, db2) in FASTA format. It identifies the sequences in db2 that are similar to db1 at a given identity threshold. It outputs two files: a FASTA file of sequences in db2 that are not similar to db1, and a text file that lists similar sequences between db1 and db2.

Inputs

cd-hit requires one protein FASTA file as input; cd-hit-2d requires two.

cd-hit-est requires one nucleotide FASTA file as input; cd-hit-est-2d requires two.
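How the sequence type and the cluster/compare choice select one of the four programs can be sketched as follows. The helper `build_cdhit_command` is hypothetical, and the flags `-i`, `-i2`, `-o`, `-c` and `-n` are taken from the upstream cd-hit command line rather than from this page (only `-n` is mentioned here), so the command the Galaxy wrapper actually runs may differ. The default values are placeholders.

```python
from typing import List, Optional


def build_cdhit_command(
    db1: str,
    output: str,
    seq_type: str = "protein",
    db2: Optional[str] = None,
    identity: float = 0.9,
    word_size: int = 5,
) -> List[str]:
    """Assemble an argument list for one of the four cd-hit variants."""
    if seq_type not in ("protein", "nucleotide"):
        raise ValueError("seq_type must be 'protein' or 'nucleotide'")
    # Nucleotide input selects the -est variant.
    program = "cd-hit" if seq_type == "protein" else "cd-hit-est"
    # A second database selects the -2d (compare) variant.
    if db2 is not None:
        program += "-2d"
    cmd = [program, "-i", db1]
    if db2 is not None:
        cmd += ["-i2", db2]
    cmd += ["-o", output, "-c", str(identity), "-n", str(word_size)]
    return cmd
```

The returned list could be passed to `subprocess.run(cmd, check=True)` on a machine where cd-hit is installed; the sketch itself only builds the list and runs nothing.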

Outputs

For cd-hit and cd-hit-est:

The first output is a FASTA file containing representative sequences.
The second output is a text file listing the mapping of sequences to the representative sequences:

    >Cluster 0
    0 2799aa, >PF04998.6|RPOC2_CHLRE/275-3073... *
    >Cluster 1
    0 2214aa, >PF06317.1|Q6Y625_9VIRU/1-2214... at 80%
    1 2215aa, >PF06317.1|O09705_9VIRU/1-2215... at 84%
    2 2217aa, >PF06317.1|Q6Y630_9VIRU/1-2217... *
    3 2216aa, >PF06317.1|Q6GWS6_9VIRU/1-2216... at 84%
    4 527aa, >PF06317.1|Q67E14_9VIRU/6-532... at 63%
    >Cluster 2
    0 2202aa, >PF06317.1|Q6UY61_9VIRU/8-2209... at 60%
    1 2208aa, >PF06317.1|Q6IVU4_JUNIN/1-2208... *
    2 2207aa, >PF06317.1|Q6IVU0_MACHU/1-2207... at 73%
    3 2208aa, >PF06317.1|RRPO_TACV/1-2208... at 69%
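A cluster file like the example above can be read back into a simple data structure. The following is a minimal sketch, assuming one `>Cluster N` header line per cluster followed by one member line each, with `*` marking the representative sequence. The names `Member` and `parse_clstr` are hypothetical, and the handling of other tail formats (such as the strand or overlap fields that cd-hit-est or the alignment-overlap option may add) is a best guess that simply takes the first percentage found.

```python
import re
from typing import Dict, List, NamedTuple, Optional


class Member(NamedTuple):
    index: int                  # position of the sequence within its cluster
    length: int                 # sequence length (aa or nt)
    name: str                   # sequence identifier as printed in the file
    is_representative: bool     # True for the line ending in '*'
    identity: Optional[float]   # percent identity to the representative, if given


# e.g. "0 2214aa, >PF06317.1|Q6Y625_9VIRU/1-2214... at 80%"
_MEMBER = re.compile(r"^(\d+)\s+(\d+)(?:aa|nt),\s+>(.+?)\.\.\.\s+(\*|at\s+.*)$")


def parse_clstr(text: str) -> Dict[int, List[Member]]:
    """Parse the text of a .clstr file into {cluster number: [members]}."""
    clusters: Dict[int, List[Member]] = {}
    current: Optional[int] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = []
            continue
        match = _MEMBER.match(line)
        if match is None or current is None:
            raise ValueError(f"unrecognised line: {line!r}")
        index, length, name, tail = match.groups()
        is_rep = tail == "*"
        identity = None
        if not is_rep:
            # Take the first number directly followed by '%'.
            pct = re.search(r"([\d.]+)%", tail)
            if pct:
                identity = float(pct.group(1))
        clusters[current].append(
            Member(int(index), int(length), name, is_rep, identity)
        )
    return clusters
```

With the example above, `parse_clstr` would yield three clusters, and the representative of cluster 1 would be the member named `PF06317.1|Q6Y630_9VIRU/1-2217`.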

For cd-hit-2d and cd-hit-est-2d:

The first output is a FASTA file of sequences in db2 that are not similar to db1.
The second output is a text file that lists similar sequences between db1 and db2.