What it does
cd-hit stands for Cluster Database at High Identity with Tolerance. The tool implements four variants: cd-hit, cd-hit-est, cd-hit-2d, and cd-hit-est-2d.
The program cd-hit (resp. cd-hit-est) takes a FASTA format aminoacid (resp. nucleotide) sequence database as input and produces a set of 'non-redundant' (nr) representative sequences as output. In addition cd-hit outputs a cluster file, documenting the members of the sequence clusters for each nr sequence representative. The idea is to reduce the overall size of the database without removing any sequence information by only removing 'redundant' (or highly similar) sequences. This is why the resulting database is called non-redundant (nr). Essentially, cd-hit (resp. cd-hit-est) produces a set of closely related protein (resp. nucleotide sequence) families from a given FASTA sequence database.
The program cd-hit-2d (resp. cd-hit-est-2d) compares two aminoacid (resp. nucleotide) sequence datasets (db1, db2) in FASTA format. It identifies the sequences in db2 that are similar to db1 at a certain threshold. It outputs two files: a FASTA file of sequences in db2 that are not similar to db1 and a text file that lists similar sequences between db1 & db2.
Inputs
cd-hit/cd-hit-2d requires a (two) protein FASTA file(s) as input.
cd-hit-est/cd-hit-est-2d requires a (two) nucleotide FASTA file(s) as input.
Outputs
For cd-hit and cd-hit-est:
>Cluster 0 0 2799aa, >PF04998.6|RPOC2_CHLRE/275-3073... * >Cluster 1 0 2214aa, >PF06317.1|Q6Y625_9VIRU/1-2214... at 80% 1 2215aa, >PF06317.1|O09705_9VIRU/1-2215... at 84% 2 2217aa, >PF06317.1|Q6Y630_9VIRU/1-2217... * 3 2216aa, >PF06317.1|Q6GWS6_9VIRU/1-2216... at 84% 4 527aa, >PF06317.1|Q67E14_9VIRU/6-532... at 63% >Cluster 2 0 2202aa, >PF06317.1|Q6UY61_9VIRU/8-2209... at 60% 1 2208aa, >PF06317.1|Q6IVU4_JUNIN/1-2208... * 2 2207aa, >PF06317.1|Q6IVU0_MACHU/1-2207... at 73% 3 2208aa, >PF06317.1|RRPO_TACV/1-2208... at 69%
For cd-hit-2d and cd-hit-est-2d: