view profrep_db_reducing.xml @ 6:1c25246f6e68 draft default tip

Uploaded
author petr-novak
date Thu, 27 Jun 2019 09:51:41 -0400
parents a5f1638b73be

<tool id="profrep_db_reducing" name="cd-hit-based size reduction of ProfRep database" version="1.0.0">
<description>Reduce a database of read sequences based on their similarity to speed up ProfRep</description>
<requirements>
    <requirement type="package" version="1.0.0">profrep</requirement>
    <requirement type="package" version="4.6.4">cd-hit</requirement>
</requirements>
<command>
python3 '${__tool_directory__}/profrep_db_reducing.py' --reads_all '${reads_all}' --cls '${cls}' --cluster_size ${cluster_size} --identity_th ${identity_th} --reads_reduced '${reads_reduced}' --cls_reduced '${cls_reduced}'
</command>
<inputs>
 <param format="fasta" type="data" name="reads_all" label="NGS reads" help="Choose the input file containing all read sequences to be reduced" />
 <param format="fasta" type="data" name="cls" label="RE clusters" help="Choose the file containing all clusters and the reads belonging to them [RE archive -> seqclust -> clustering -> hitsort.cls]" />
 <param name="cluster_size" type="integer" value="1000" min="1" max="1000000000" label="Min cluster size" help="Only reads from the largest clusters are reduced. This parameter sets the minimum number of reads a cluster must contain to be included in the reduction" />
 <param name="identity_th" type="float" value="0.90" min="0.1" max="1.0" label="Reads identity threshold" help="Minimum proportion of sequence identity between reads required to group and reduce them" />
</inputs>

<outputs>
 <data format="fasta" name="cls_reduced" label="Modified cls file of ${cls.hid}" />
 <data format="fasta" name="reads_reduced" label="Reduced reads database of ${reads_all.hid}" />
</outputs>

 <help>

**WHAT IT DOES**

This tool reduces the database of all reads based on their mutual similarity using **cd-hit**. It groups similar reads and replaces each group with a single representative read in the reduced database. The new read IDs also indicate the number of reads that each representative stands for. The identity threshold for grouping reads (a cd-hit parameter) is set to **0.9** by default; this value usually gives a good balance between the level of reduction and accuracy.

Because a new reads database is produced, the CLS file that assigns reads to clusters has to be modified as well. The result is a reduced database of read sequences and a modified CLS file adjusted to the new database. The actual level of reduction depends on the number of clusters involved and on their size. The default minimum cluster size for reduction is **1000**, which means that all clusters containing 1000 or more reads undergo reduction.
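
To illustrate the cluster size cut-off, the following Python sketch selects the clusters whose reads would be passed on for reduction. It is not the tool's implementation: the CLS layout it assumes (a header line starting with the cluster name, followed by a line of whitespace-separated read IDs) and the helper names ``parse_cls`` and ``clusters_to_reduce`` are assumptions made for the example only.

```python
# Sketch only: the CLS layout assumed here (a ">CLn" header line followed by
# a line of whitespace-separated read IDs) is an assumption, and both helper
# functions are hypothetical; they are not part of profrep_db_reducing.py.

def parse_cls(text):
    """Return {cluster_name: [read_id, ...]} from CLS-formatted text."""
    clusters = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the first token after ">" is taken as the name.
            name = line[1:].split()[0]
            clusters[name] = []
        elif name is not None:
            # Read IDs belonging to the most recent header.
            clusters[name].extend(line.split())
    return clusters


def clusters_to_reduce(clusters, cluster_size=1000):
    """Names of clusters that hold at least cluster_size reads."""
    return [name for name, reads in clusters.items()
            if len(reads) >= cluster_size]


example = ">CL1\t3\nr1 r2 r3\n>CL2\t1\nr4\n"
parsed = parse_cls(example)
selected = clusters_to_reduce(parsed, cluster_size=2)
print(selected)
```

With a cut-off of 2, only the three-read cluster in this toy input is selected; reads in the smaller cluster would be left unchanged.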

 </help>
</tool>