--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/README	Tue Feb 26 12:11:36 2013 -0600
@@ -0,0 +1,29 @@
+CD-HIT-EST
+
+CD-HIT-EST clusters a nucleotide dataset into clusters that meet a user-defined similarity threshold, usually a sequence identity. The input is a DNA/RNA dataset in FASTA format, and the output is two files: a FASTA file of representative sequences and a text file listing the clusters. Since eukaryotic genes usually have long introns, which cause long gaps, it is difficult to make full-length alignments for these genes. CD-HIT-EST is therefore best suited to sequences without introns, such as ESTs.
+
+Ying Huang, Beifang Niu, Ying Gao, Limin Fu and Weizhong Li. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics (2010) 26:680.
+
+
+From: http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide
+
+
+CD-HIT was originally a protein clustering program. The main advantage of this program is its ultra-fast speed. It can be hundreds of times faster than other clustering programs, for example, BLASTCLUST. Therefore it can handle very large databases, like NR.
+
+The 1st version of this program, CD-HI, was published and released in 2001. The 2nd version, called CD-HIT, was published in 2002 with significant improvements. Since 2004, CD-HIT has been hosted at bioinformatics.org as an open source project.
+
+Since its release, CD-HIT has become increasingly popular. It has a significant user base, which I estimate at several thousand users. It is used at many research and educational institutions. For example, at UniProt, CD-HIT is used to generate the UniRef reference data sets (http://www.pir.uniprot.org/database/DBDescription.shtml). It is also used at PDB to handle redundant sequences (http://rutgers.rcsb.org/pdb/redundancy.html).
+
+In 2006, the 3rd major update was published and released, with the ability to perform various jobs such as clustering a protein database, clustering a DNA/RNA database, comparing two databases (protein or DNA/RNA), generating protein families, and many others.
+
+The CD-HIT web server was implemented in 2009. It allows users to cluster or compare sequences without running CD-HIT from the command line. The server provides an interactive interface and additional visualization tools. It also provides pre-calculated and regularly updated sequence clusters for several widely used databases.
+
+CD-HIT-454, a special version of CD-HIT, was implemented in 2010 to cluster artificially duplicated reads in pyrosequencing (454) data.
+
+Currently, the CD-HIT package has many programs: cd-hit, cd-hit-2d, cd-hit-est, cd-hit-est-2d, cd-hit-para, cd-hit-2d-para, psi-cd-hit, psi-cd-hit-2d, cd-hit-454. I also developed some utility tools, written in Perl, to help run and analyze CD-HIT jobs.
+
+
+NOTE to installer:
+The tool_dependency sets the environment variable CDHIT_SITE_OPTIONS to "-M 4000 -T 0", which is passed on the cd-hit-est command line.
+Adjust the values of -M (memory limit in MB) and -T (number of threads; 0 uses all CPUs) to match the memory and thread capabilities of your site.
+
--- a/cd_hit_est.xml	Fri Sep 07 13:52:03 2012 -0500
+++ b/cd_hit_est.xml	Tue Feb 26 12:11:36 2013 -0600
@@ -1,10 +1,10 @@
-<tool id="cd_hit_est" name="CD-HIT-EST" version="1.1">
+<tool id="cd_hit_est" name="CD-HIT-EST" version="1.2">
  <description>Cluster a nucleotide dataset into representative sequences</description>
  <requirements>
   <requirement type="package" version="4.6.1">cd-hit</requirement>
  </requirements>
  <command>
-  cd-hit-est -i $fasta_in -o rep_seq -c $similarity -n $wordsize $strand
+  cd-hit-est \$CDHIT_SITE_OPTIONS -i "$fasta_in" -o rep_seq -c $similarity -n $wordsize $strand
  </command>
  <inputs>
   <param name="fasta_in" type="data" format="fasta" label="EST Sequences to cluster"/>
--- a/tool_dependencies.xml	Fri Sep 07 13:52:03 2012 -0500
+++ b/tool_dependencies.xml	Tue Feb 26 12:11:36 2013 -0600
@@ -7,6 +7,7 @@
                 <action type="shell_command">make openmp=yes</action>
                 <action type="set_environment">
                     <environment_variable name="PATH" action="prepend_to">$INSTALL_DIR</environment_variable>
+                    <environment_variable name="CDHIT_SITE_OPTIONS" action="set_to">"-M 4000 -T 0"</environment_variable>
                 </action>
             </actions>
         </install>
@@ -19,6 +20,9 @@

 https://code.google.com/p/cdhit/source/browse/README

+Change the CDHIT_SITE_OPTIONS variable in the installed env.sh file to adjust
+the memory limit in MB (-M) or to limit the number of threads (-T)
+to match your site.
         </readme>
     </package>
 </tool_dependency>
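
The effect of this changeset on the final command line can be illustrated with a small sketch. It is not part of the Galaxy wrapper: the function name `build_cd_hit_est_command` and the sample argument values are hypothetical, and the sketch assumes the shell splits the unquoted `$CDHIT_SITE_OPTIONS` on whitespace, which `shlex.split` approximates here.

```python
import os
import shlex


def build_cd_hit_est_command(fasta_in, similarity, wordsize, strand="", env=None):
    """Assemble a cd-hit-est argument list the way the patched <command> does.

    Hypothetical helper for illustration only; Galaxy itself fills in the
    template and lets the shell expand $CDHIT_SITE_OPTIONS.
    """
    env = os.environ if env is None else env
    # Site-wide options such as "-M 4000 -T 0" come from the environment
    # variable that tool_dependencies.xml sets at install time.
    site_options = shlex.split(env.get("CDHIT_SITE_OPTIONS", ""))
    cmd = ["cd-hit-est", *site_options,
           "-i", fasta_in,        # input FASTA, passed as a single argument
           "-o", "rep_seq",       # output prefix used by the wrapper
           "-c", str(similarity),
           "-n", str(wordsize)]
    # The strand parameter may be empty or carry extra flags.
    cmd += shlex.split(strand)
    return cmd


if __name__ == "__main__":
    example_env = {"CDHIT_SITE_OPTIONS": "-M 4000 -T 0"}
    print(build_cd_hit_est_command("in.fa", 0.9, 8, "-r 1", env=example_env))
```

With the default value from this changeset, the site options appear immediately after the program name, so a site administrator who edits CDHIT_SITE_OPTIONS changes every job without touching the tool XML.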