annotate README @ 3:43724ea1c85f

Add cd-hit for protein fastas
author Jim Johnson <jj@umn.edu>
date Thu, 27 Jun 2013 21:37:08 -0500
parents cca0838c1597
children
2
cca0838c1597 Add an environment variable for the -M and -T options for memory and thread allocation
Jim Johnson <jj@umn.edu>
CD-HIT-EST
CD-HIT-EST clusters a nucleotide dataset into groups that meet a user-defined similarity threshold, usually a sequence identity. The input is a DNA/RNA dataset in FASTA format, and the output is two files: a FASTA file of representative sequences and a text file listing the clusters. Because eukaryotic genes usually have long introns, which cause long alignment gaps, it is difficult to make full-length alignments for these genes. CD-HIT-EST is therefore best suited to intron-free sequences such as ESTs.
Ying Huang, Beifang Niu, Ying Gao, Limin Fu and Weizhong Li. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics (2010) 26:680.
From: http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide
CD-HIT was originally a protein clustering program. Its main advantage is speed: it can be hundreds of times faster than other clustering programs such as BLASTCLUST, so it can handle very large databases such as NR.
The first version of this program, CD-HI, was published and released in 2001. The second version, called CD-HIT, was published in 2002 with significant improvements. Since 2004, CD-HIT has been hosted at bioinformatics.org as an open source project.
Since its release, CD-HIT has become increasingly popular. It has a significant user base, which I estimate at several thousand users, and it is used at many research and educational institutions. For example, UniProt uses CD-HIT to generate the UniRef reference data sets (http://www.pir.uniprot.org/database/DBDescription.shtml), and PDB uses it to handle redundant sequences (http://rutgers.rcsb.org/pdb/redundancy.html).
In 2006, the third major update was published and released, with the ability to perform various jobs such as clustering a protein database, clustering a DNA/RNA database, comparing two databases (protein or DNA/RNA), generating protein families, and many others.
The CD-HIT web server, implemented in 2009, allows users to cluster or compare sequences without using the command-line version of CD-HIT. The server provides an interactive interface and additional visualization tools. It also provides pre-calculated, regularly updated sequence clusters for several widely used databases.
CD-HIT-454, a special version of CD-HIT, was implemented in 2010 to cluster artificial duplicate reads in pyrosequencing (454) data.
Currently, the CD-HIT package includes many programs: cd-hit, cd-hit-2d, cd-hit-est, cd-hit-est-2d, cd-hit-para, cd-hit-2d-para, psi-cd-hit, psi-cd-hit-2d, and cd-hit-454. I also developed some utility tools, written in Perl, to help run and analyze CD-HIT jobs.
NOTE to installer:
The tool_dependency sets an environment variable, CDHIT_SITE_OPTIONS, to "-M 4000 -T 0". Its value is included in the command line.
You can adjust the values of -M (memory) and -T (threads) to match the memory and thread capabilities of your site.
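
As a rough illustration of how a site-wide value of CDHIT_SITE_OPTIONS could end up on the command line, here is a minimal Python sketch. It is not the wrapper's actual code. The helper names (site_options, build_command), the file names, and the override value "-M 16000 -T 8" are assumptions made for this example. The flags themselves are standard cd-hit options: -i is the input file, -o the output prefix, -c the identity threshold, -M the memory limit in MB, and -T the number of threads (0 means all available CPUs).

```python
import os
import shlex

# Default value taken from the installer note; the rest is illustrative only.
DEFAULT_SITE_OPTIONS = "-M 4000 -T 0"


def site_options(env=None):
    """Split CDHIT_SITE_OPTIONS into separate command-line arguments."""
    env = os.environ if env is None else env
    return shlex.split(env.get("CDHIT_SITE_OPTIONS", DEFAULT_SITE_OPTIONS))


def build_command(input_fasta, output_prefix, identity, env=None):
    """Assemble a cd-hit-est argument list (hypothetical helper)."""
    return [
        "cd-hit-est",
        "-i", input_fasta,        # input FASTA file
        "-o", output_prefix,      # output file prefix
        "-c", str(identity),      # sequence identity threshold
    ] + site_options(env)


if __name__ == "__main__":
    # A site with more memory and 8 cores might override the default like this.
    example_env = {"CDHIT_SITE_OPTIONS": "-M 16000 -T 8"}
    print(" ".join(build_command("reads.fasta", "clustered", 0.95, example_env)))
```

Keeping the site-specific options in one environment variable means an administrator can tune memory and threads without editing the tool itself.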