
<?xml version="1.0" encoding="UTF-8"?>
<tool id="hd" name="Duplex Sequencing Analysis:" version="0.0.1">
    <requirements>
        <requirement type="package" version="2.7">python</requirement>
        <requirement type="package" version="1.4">matplotlib</requirement>
    </requirements>
    <description>Hamming distance (HD) analysis of tags</description>
    <command>
        python2 $__tool_directory__/hd.py --inputFile "$inputFile" --inputName1 "$inputFile.name" --inputFile2 "$inputFile2" --inputName2 "$inputFile2.name" --sample_size $sampleSize --sep $separator --subset_tag $subsetTag --nproc $nproc $onlyDCS --minFS $minFS --maxFS $maxFS --output_csv $output_csv --output_pdf $output_pdf
        #if $inputFile2:
        --output_pdf2 $output_pdf2 --output_csv2 $output_csv2
        #end if
    </command>
    <inputs>
        <param name="inputFile" type="data" format="tabular" label="Dataset 1: input tags" optional="false"/>
        <param name="inputFile2" type="data" format="tabular" label="Dataset 2: input tags" optional="true" help="Tabular input with the family size, the tag and the strand direction ('ab' or 'ba') of each family."/>
        <param name="sampleSize" type="integer" label="Number of tags in the sample" value="1000" min="0" help="Number of tags used in one analysis. If the sample size is 0, all tags of the dataset are compared against all tags."/>
        <param name="minFS" type="integer" label="Minimum family size of the tags" min="1" value="1" help="Filters the tags by family size: families smaller than this value are skipped. Default: minimum family size = 1."/>
        <param name="maxFS" type="integer" label="Maximum family size of the tags" min="0" value="0" help="Filters the tags by family size: families larger than this value are skipped. If the maximum family size is 0, no upper bound is applied and the largest family size in the dataset is used as the maximum. Default: maximum family size = 0."/>
        <param name="separator" type="text" label="Separator of the CSV file" help="A single character such as a comma (the default)." value=","/>
        <param name="onlyDCS" type="boolean" label="Only DCS in the analysis?" truevalue="" falsevalue="--only_DCS" checked="False" help="Only tags that have a partner tag in the dataset are included in the analysis."/>
        <param name="subsetTag" type="integer" label="Shorten tag in the analysis?" value="0" help="Simulates an analysis with a shorter tag of the length given by this parameter. If this parameter is 0 (the default), the tag is used at its original length."/>
        <param name="nproc" type="integer" label="Number of processors" value="8" help="Number of processors used for the computation."/>
    </inputs>
    <outputs>
        <data name="output_csv" format="csv"/>
        <data name="output_csv2" format="csv">
            <filter>inputFile2</filter>
        </data>
        <data name="output_pdf" format="pdf" />
        <data name="output_pdf2" format="pdf" >
            <filter>inputFile2</filter>
        </data>
    </outputs>
    <help> <![CDATA[
**What it does**

    This tool calculates the Hamming distance of each tag to all other tags in the dataset and then determines the minimum Hamming distance for each tag.
    The Hamming distances are shown as a histogram separated by family size, or as a family size distribution separated by Hamming distance.
    This similarity measure is calculated for each tag to distinguish whether similar tags truly stem from different molecules or occurred due to sequencing or PCR errors.
    In addition, the tags of chimeric reads can be identified by calculating the Hamming distance for each half of the tag.
    The analysis can be performed on a sample only (default: sample size = 1000) or on the whole dataset (sample size = 0).
    It is also possible to restrict the analysis to tags that have a partner tag in the dataset (DCSs), or to filter the dataset by the family size of the tags.
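
    As a rough illustration of the measure described above, the following Python sketch computes the minimum Hamming distance of each tag and the distances of the two tag halves. It is not the code of hd.py: the function names and the 8-nt tags are made up for this example, and sampling and family sizes are left out.

```python
def hamming(a, b):
    """Count the positions at which two equal-length tags differ."""
    if len(a) != len(b):
        raise ValueError("tags must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_hamming(tag, tags):
    """Smallest Hamming distance from `tag` to any different tag in `tags`."""
    return min(hamming(tag, other) for other in tags if other != tag)


def half_distances(a, b):
    """Hamming distances of the first and second tag halves, as a pair."""
    mid = len(a) // 2
    return hamming(a[:mid], b[:mid]), hamming(a[mid:], b[mid:])


# Made-up 8-nt tags, for illustration only.
tags = ["AAAATTTT", "AAAATTTA", "CCCCTTTT"]

for tag in tags:
    print(tag, min_hamming(tag, tags))

# One identical half combined with one very different half is the kind of
# pattern the half-tag analysis looks at when searching for chimeras.
print(half_distances("AAAATTTT", "CCCCTTTT"))
```

    In this toy data the first two tags differ at a single position, as expected for a sequencing or PCR error, while the third tag shares its second half exactly with the first tag.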

**Input**

    This tool expects a tabular file with the tags of all families, their sizes and information about forward (ab) and reverse (ba) strands.

    +-----+----------------------------+----+
    | 1   | AAAAAAAAAAAATGTTGGAATCTT   | ba |
    +-----+----------------------------+----+
    | 10  | AAAAAAAAAAAGGCGGTCCACCCC   | ab |
    +-----+----------------------------+----+
    | 28  | AAAAAAAAAAATGGTATGGACCGA   | ab |
    +-----+----------------------------+----+
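
    A minimal sketch of how such a file could be read and filtered by family size, assuming tab-separated columns in the order shown in the table above (family size, tag, strand). The function names are hypothetical and do not come from hd.py; the bounds follow the descriptions of the minimum and maximum family size parameters.

```python
# Rows of the example table above, joined with tabs.
example = (
    "1\tAAAAAAAAAAAATGTTGGAATCTT\tba\n"
    "10\tAAAAAAAAAAAGGCGGTCCACCCC\tab\n"
    "28\tAAAAAAAAAAATGGTATGGACCGA\tab\n"
)


def read_families(lines):
    """Yield (family_size, tag, strand) tuples from tab-separated lines."""
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        size, tag, strand = line.rstrip("\n").split("\t")
        yield int(size), tag, strand


def filter_by_size(families, min_fs=1, max_fs=0):
    """Keep families with min_fs <= size <= max_fs; max_fs=0 means no upper bound."""
    return [f for f in families
            if f[0] >= min_fs and (max_fs == 0 or f[0] <= max_fs)]


families = list(read_families(example.splitlines()))
print(filter_by_size(families, min_fs=5))
```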


**Output**

    For each dataset, the output is a PDF file with the Hamming distance plots and a CSV file with the data underlying the plots.


**About Author**

    Author: Monika Heinzl

    Department: Institute of Bioinformatics, Johannes Kepler University Linz, Austria

    Contact: monika.heinzl@edumail.at

   ]]>

    </help>
    <citations>
        <citation type="bibtex">
            @misc{duplex,
            author = {Heinzl, Monika},
            year = {2018},
            title = {Development of algorithms for the analysis of duplex sequencing data}
         }
        </citation>
    </citations>
</tool>
author	mheinzl
date	Thu, 10 May 2018 07:30:27 -0400