Mercurial > repos > iuc > cd_hit

<tool id="cd_hit" name="cd-hit" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="20.01">
    <description>Cluster or compare biological sequence datasets</description>
    <macros>
        <import>macros.xml</import>
    </macros>
    <expand macro="requirements" />
    <expand macro="xrefs"/>
    <version_command><![CDATA[
cd-hit | grep "CD-HIT version" | cut -d" " -f 4
    ]]></version_command>
    <command detect_errors="exit_code"><![CDATA[
cd-hit$est.est_select$twod.twod_select
-i '$fasta_in'
-o rep_seq
-c $est.similarity
-n $est.wordsize
#if $est.est_select == '-est':
    $est.strand
    #if str($est.estadvalign.mask) != 'None':
        -mask '$est.estadvalign.mask'
    #end if
    -match $est.estadvalign.match
    -mismatch $est.estadvalign.mismatch
    -gap $est.estadvalign.gap
    -gap-ext $est.estadvalign.gapext
#else:
    -t $est.redtol
#end if
#if $twod.twod_select == '-2d':
    -i2 '$fasta_in2'
    #if $advanced.advalign.style == 'local':
        -s2 $advanced.advalign.advancedtwod.cutoff_diff_len2
        -S2 $advanced.advalign.advancedtwod.aa_cutoff_diff_len2
    #end if
#end if

-b $advanced.band_width
-l $advanced.throw_away_len
#if $advanced.advalign.style == 'local':
    -G 0
    -aL $advanced.advalign.align_coverage_long
    -AL $advanced.advalign.align_coverage_long_control
    -aS $advanced.advalign.align_coverage_short
    -AS $advanced.advalign.align_coverage_short_control
    -A $advanced.advalign.align_coverage_min
    -s $advanced.advalign.cutoff_diff_len
    -S $advanced.advalign.aa_cutoff_diff_len
#end if
-uL $advanced.max_unmatched_per_l
-uS $advanced.max_unmatched_per_s
-U $advanced.max_unmatched_len
$advanced.accurate
$advanced.inram
$advanced.sort_cluster
$advanced.sort_fasta
#if $print_alnovl.print_alnovl_select == "yes":
    -p 1
    -d $print_alnovl.desclen
#end if

## instead of 800 (default) we use 0:unlimited
-M \${GALAXY_MEMORY_MB:-0}
-T \${GALAXY_SLOTS:-1}
    ]]></command>
    <inputs>
        <param name="fasta_in" argument="-i" type="data" format="fasta" label="Sequences to cluster/compare"/>
        <conditional name="twod">
            <param name="twod_select" type="select" label="Cluster / Compare (i.e. call cd-hit[-est] / cd-hit[-est]-2d)?">
                <option value="" selected="true">Cluster sequences</option>
                <option value="-2d">Compare with 2nd sequence data set</option>
            </param>
            <when value=""/>
            <when value="-2d">
                <param name="fasta_in2" argument="-i2" type="data" format="fasta" label="Other sequences to cluster/compare"/>
            </when>
        </conditional>
        <conditional name="est">
            <param name="est_select" type="select" label="Sequence type?" help="For nucleotides the -est variant of cd-hit is called">
                <option value="" selected="true">Protein</option>
                <option value="-est">Nucleotides</option>
            </param>
            <when value="">
                <param name="similarity" argument="-c" type="float" min="0.4" max="1.0" value="0.9" label="Sequence identity threshold" help="Global sequence identity: number of identical alignment positions divided by the full length of the shorter sequence"/>
                <param name="wordsize" argument="-n" type="integer" min="2" max="5" value="5" label="Word size">
                    <help>Suggested word size:
            5 for thresholds 0.7 ~ 1.0
            4 for thresholds 0.6 ~ 0.7
            3 for thresholds 0.5 ~ 0.6
            2 for thresholds 0.4 ~ 0.5 (-n)
                    </help>
                </param>
                <param name="redtol" argument="-t" type="integer" value="2" label="Tolerance for redundance"/>
            </when>
            <when value="-est">
                <param name="similarity" argument="-c" type="float" min="0.8" max="1.0" value="0.9" label="Sequence identity threshold" help="Global sequence identity: number of identical alignment positions divided by the full length of the shorter sequence"/>
                <param name="wordsize" argument="-n" type="integer" min="4" max="11" value="10" label="Word size">
                    <help>Suggested word size:
            10,11 for threshold in 0.95 ~ 1.0
            8,9   for threshold in 0.9  ~ 0,95
            7     for threshold in 0.88 ~ 0.9
            6     for threshold in 0.85 ~ 0.88
            5     for threshold in 0.80 ~ 0.85
            4     for threshold in 0.75 ~ 0.8 (-n)
                    </help>
                </param>
                <param name="strand" argument="-r" type="boolean" truevalue="-r 1" falsevalue="-r 0" checked="false" label="Compare both strands?"/>
                <section name="estadvalign" title="Advanced EST alignment options">
                    <param argument="-mask" type="text" optional="true" label="Masking letters" help="NX, to mask out both 'N' and 'X'"/>
                    <param argument="-match" type="integer" value="2" label="Match score" help="Default 2 (1 for T-U and N-N)"/>
                    <param argument="-mismatch" type="integer" value="-2" label="Mismatch score"/>
                    <param argument="-gap" type="integer" value="-6" label="Gap opening score"/>
                    <param name="gapext" argument="-gap-ext" type="integer" value="-1" label="Gap extension score"/>
                </section>
            </when>
        </conditional>
        <section name="advanced" title="Advanced options">
            <param name="band_width" argument="-b" type="integer" min="1" value="20" label="Alignment band width"/>
            <param name="throw_away_len" argument="-l" type="integer" min="1" value="10" label="Length of throw away sequences"/>
            <conditional name="advalign">
                <param name="style" type="select" label="Use local/global sequence identity" help="Local sequence identity: number of identical amino alignment positions divided by the length of the alignment. Note: in local mode one of -aL, aS, or A must be > 0!">
                    <option value="global" selected="true">Global</option>
                    <option value="local" >Local</option>
                </param>
                <when value="global"/>
                <when value="local">
                    <param name="align_coverage_long" argument="-aL" type="float" min="0.0" max="1.0" value="0.0" label="Alignment coverage for the longer sequence" help="Fraction of the longer sequence that must be covered"/>
                    <param name="align_coverage_long_control" argument="-AL" type="integer" min="0" value="99999999" label="Alignment coverage control for the longer sequence " help="Maximum number of residues uncovered residues of the longer sequence"/>
                    <param name="align_coverage_short" argument="-aS" type="float" min="0.0" max="1.0" value="0.0" label="Alignment coverage for the shorter sequence" help="Fraction of the shorter sequence that must be covered"/>
                    <param name="align_coverage_short_control" argument="-AS" type="integer" min="0" value="99999999" label="Alignment coverage control for the shorter sequence" help="Maximum number of residues uncovered residues of the shorter sequence"/>
                    <param name="align_coverage_min" argument="-A" type="integer" min="0" value="0" label="Minimal alignment coverage control for the both sequences" help="Minimum number of residues of both sequences that must be covered"/>
                    <param name="cutoff_diff_len" argument="-s" type="float" min="0.0" max="1.0" value="0.0" label="Length difference cutoff" help="If set to 0.9, the shorter sequences need to be at least 90% length of the representative of the cluster"/>
                    <param name="aa_cutoff_diff_len" argument="-S" type="integer" min="0" value="999999" label="Length difference cutoff in residues" help="If set to 60, the length difference between the shorter sequences and the representative of the cluster can not be bigger than 60"/>
                    <section name="advancedtwod" title="Advanced options for 2D local alignment">
                        <param name="cutoff_diff_len2" argument="-s2" type="float" min="0.0" max="1.0" value="1.0" label="Length difference cutoff for other sequences" help="By default, seqs in db1 >= seqs in db2 in a same cluster if set to 0.9, seqs in db1 may just >= 90% seqs in db2. Only effective in local alignment mode"/>
                        <param name="aa_cutoff_diff_len2" argument="-S2" type="integer" min="0" value="0" label="Length difference cutoff in residues for other sequences" help="by default, seqs in db1 >= seqs in db2 in a same cluster if set to 60, seqs in db2 may 60aa longer than seqs in db1. Only effective in local alignment mode"/>
                    </section>
                </when>
            </conditional>
            <param name="max_unmatched_per_l" argument="-uL" type="float" min="0.0" max="1.0" value="1.0" label="Maximum unmatched percentage for the shorter sequence" help="If set to 0.1, the unmatched region (excluding leading and tailing gaps) must not be more than 10% of the sequence"/>
            <param name="max_unmatched_per_s" argument="-uS" type="float" min="0.0" max="1.0" value="1.0" label="Maximum unmatched percentage for the shorter sequence" help="If set to 0.1, the unmatched region (excluding leading and tailing gaps) must not be more than 10% of the sequence"/>
            <param name="max_unmatched_len" argument="-U" type="integer" min="0" value="99999999" label="Maximum unmatched length" help="If set to 10, the unmatched region (excluding leading and tailing gaps) must not be more than 10 bases"/>
            <param name="inram" argument="-B" type="boolean" truevalue="-B 0" falsevalue="-B 1" checked="true" label="Sequences are stored in RAM" help="If false: sequence are stored on hard drive - use for huge data sets"/>
            <param name="accurate" argument="-g" type="boolean" truevalue="-g 1" falsevalue="-g 0" checked="false" label="Accurate but slow mode" help="By cd-hit's default algorithm, a sequence is clustered to the first cluster that meet the threshold (fast cluster). If set to true, the program will cluster it into the most similar cluster that meet the threshold (accurate but slow mode)"/>
            <param name="sort_cluster" argument="-sc" type="boolean" truevalue="-sc 1" falsevalue="-sc 0" label="Sort clusters by size" help="When disabled, clusters are sorted by decreasing length; if enabled, clusters are sorted by decreasing size" />
            <param name="sort_fasta" argument="-sf" type="boolean" truevalue="-sf 1" falsevalue="-sf 0" label="Sort FASTA/FASTQ by cluster size" help="When enabled, output sequences are sorted by decreasing cluster size" />
        </section>

        <conditional name="print_alnovl">
            <param name="print_alnovl_select" type="select" label="Print alignment overlap in .clstr file?">
                <option value="no" selected="true">No</option>
                <option value="yes">Yes</option>
            </param>
            <when value="no"/>
            <when value="yes">
                <param name="desclen" argument="-d" type="integer" min="0" value="20" label="Length of description in .clstr file" help="If set to 0, it takes the fasta defline and stops at first space"/>
            </when>
        </conditional>
    </inputs>

    <outputs>
        <data name="clusters_out" format="txt" label="${tool.name} on ${on_string}: Clusters" from_work_dir="rep_seq.clstr"/>
        <data name="fasta_out" format="fasta" label="${tool.name} on ${on_string}: Representative sequences" from_work_dir="rep_seq"/>
    </outputs>

    <tests>
        <!--cd-hit with default options -->
        <test>
            <param name="fasta_in" value="cd_hit_protein_in.fasta" />
            <conditional name="twod">
                <param name="twod_select" value="" />
            </conditional>
            <conditional name="est">
                <param name="est_select" value="" />
                <param name="wordsize" value="5" />
            </conditional>
            <param name="similarity" value="0.9" />
            <output name="clusters_out" file="protein_clusters_output.txt"/>
            <output name="fasta_out" file="protein_fasta_output.fasta"/>
        </test>
        <!--cd-hit local (default options), changed similarity, and output string length -->
        <test>
            <param name="fasta_in" value="cd_hit_protein_in.fasta" />
            <conditional name="twod">
                <param name="twod_select" value="" />
            </conditional>
            <conditional name="est">
                <param name="est_select" value="" />
                <param name="wordsize" value="5" />
            </conditional>
            <param name="similarity" value="0.8" />
            <section name="advanced">
                <conditional name="advalign">
                    <param name="style" value="local" />
                    <param name="align_coverage_short" value="0.9" />
                </conditional>
            </section>
            <conditional name="print_alnovl">
                <param name="print_alnovl_select" value="yes"/>
                <param name="desclen" value="40"/>
            </conditional>
            <output name="clusters_out" file="protein_clusters_output_local.txt"/>
            <output name="fasta_out" file="protein_fasta_output_local.fasta"/>
        </test>
        <!-- cd-hit-est -->
        <test>
            <param name="fasta_in" value="cd_hit_est_in.fa" />
            <conditional name="twod">
                <param name="twod_select" value="" />
            </conditional>
            <conditional name="est">
                <param name="est_select" value="-est" />
                <param name="wordsize" value="8"/>
                <param name="strand" value="false"/>
            </conditional>
            <param name="similarity" value="0.9"/>
            <output name="clusters_out" file="est_clusters_output.txt"/>
            <output name="fasta_out" file="est_fasta_output.fasta"/>
        </test>
        <!-- cd-hit-est-2d (also changing strand param) -->
        <test>
            <param name="fasta_in" value="db1.fasta" />
            <conditional name="twod">
                <param name="twod_select" value="-2d" />
                <param name="fasta_in2" value="db2.fasta" />
            </conditional>
            <conditional name="est">
                <param name="est_select" value="-est" />
                <param name="wordsize" value="8"/>
                <param name="strand" value="true"/>
            </conditional>
            <param name="similarity" value="0.9"/>
            <output name="clusters_out" file="est-2d.txt.clstr"/>
            <output name="fasta_out" file="est-2d.txt"/>
        </test>
        <!-- Test sort cluster parameter -->
        <test>
            <param name="fasta_in" value="cd_hit_est_in.fa" />
            <conditional name="twod">
                <param name="twod_select" value="" />
            </conditional>
            <conditional name="est">
                <param name="est_select" value="-est" />
                <param name="wordsize" value="8"/>
                <param name="strand" value="false"/>
            </conditional>
            <param name="similarity" value="0.9"/>
            <section name="advanced">
                <param name="sort_cluster" value="true"/>
            </section>
            <output name="clusters_out" file="est_clusters_sorted.txt"/>
            <output name="fasta_out" file="est_fasta_output.fasta"/>
        </test>
        <!-- Test sort fasta parameter -->
        <test>
            <param name="fasta_in" value="cd_hit_est_in.fa" />
            <conditional name="twod">
                <param name="twod_select" value="" />
            </conditional>
            <conditional name="est">
                <param name="est_select" value="-est" />
                <param name="wordsize" value="8"/>
                <param name="strand" value="false"/>
            </conditional>
            <param name="similarity" value="0.9"/>
            <section name="advanced">
                <param name="sort_fasta" value="true"/>
            </section>
            <output name="clusters_out" file="est_clusters_output.txt"/>
            <output name="fasta_out" file="est_fasta_sorted.fasta"/>
        </test>
    </tests>
    <help><![CDATA[
**What it does**

cd-hit stands for Cluster Database at High Identity with Tolerance. The tool implements four variants: cd-hit, cd-hit-est, cd-hit-2d, and cd-hit-est-2d.

The program cd-hit (resp. cd-hit-est) takes a FASTA format aminoacid (resp. nucleotide) sequence database as input and produces a set of 'non-redundant' (nr) representative sequences as output. In addition cd-hit outputs a cluster file, documenting the members of the sequence clusters for each nr sequence representative. The idea is to reduce the overall size of the database without removing any sequence information by only removing 'redundant' (or highly similar) sequences. This is why the resulting database is called non-redundant (nr). Essentially, cd-hit (resp. cd-hit-est) produces a set of closely related protein (resp. nucleotide sequence) families from a given FASTA sequence database.

The program cd-hit-2d (resp. cd-hit-est-2d) compares two aminoacid (resp. nucleotide) sequence datasets (db1, db2) in FASTA format. It identifies the sequences in db2 that are similar to db1 at a certain threshold. It outputs two files: a FASTA file of sequences in db2 that are not similar to db1 and a text file that lists similar sequences between db1 & db2.

.. _CD-HIT: http://weizhongli-lab.org/cd-hit/

------

**Inputs**

cd-hit/cd-hit-2d requires a (two) protein FASTA file(s) as input.

cd-hit-est/cd-hit-est-2d requires a (two) nucleotide FASTA file(s) as input.

------

**Outputs**

For cd-hit and cd-hit-est:

1. The first output is a FASTA file containing representative sequences.

2. The second output is a text file listing the mapping of sequences to the representative sequences:

  >Cluster 0
  0 2799aa, >PF04998.6|RPOC2_CHLRE/275-3073... *
  >Cluster 1
  0 2214aa, >PF06317.1|Q6Y625_9VIRU/1-2214... at 80%
  1 2215aa, >PF06317.1|O09705_9VIRU/1-2215... at 84%
  2 2217aa, >PF06317.1|Q6Y630_9VIRU/1-2217... *
  3 2216aa, >PF06317.1|Q6GWS6_9VIRU/1-2216... at 84%
  4 527aa, >PF06317.1|Q67E14_9VIRU/6-532... at 63%
  >Cluster 2
  0 2202aa, >PF06317.1|Q6UY61_9VIRU/8-2209... at 60%
  1 2208aa, >PF06317.1|Q6IVU4_JUNIN/1-2208... *
  2 2207aa, >PF06317.1|Q6IVU0_MACHU/1-2207... at 73%
  3 2208aa, >PF06317.1|RRPO_TACV/1-2208... at 69%

For cd-hit-2d and cd-hit-est-2d:

1. The first output is a FASTA file of sequences in db2 that are not similar to db1.

2. The second output is a text file that lists similar sequences between db1 & db2
    ]]></help>
    <expand macro="citations" />
</tool>
author	iuc
date	Fri, 05 Nov 2021 08:23:26 +0000
parents	e0da3400ac2f
children