<?xml version="1.0" encoding="UTF-8"?>
<tool id="td" name="TD:" version="1.0.5">
    <description>Tag distance analysis of duplex tags</description>
    <requirements>
        <requirement type="package" version="2.7">python</requirement>
        <requirement type="package" version="1.4.0">matplotlib</requirement>
    </requirements>
    <command>
        python2 '$__tool_directory__/td.py' --inputFile '$inputFile' --inputName1 '$inputFile.name' --sample_size $sampleSize --subset_tag $subsetTag --nproc $nproc $onlyDCS $rel_freq --minFS $minFS --maxFS $maxFS
        $nr_above_bars --output_pdf $output_pdf --output_tabular $output_tabular --output_chimeras_tabular $output_chimeras_tabular
    </command>
    <inputs>
        <param name="inputFile" type="data" format="tabular" label="Dataset 1: input tags" optional="false" help="Input in tabular format with the family size, tag and the direction of the strand ('ab' or 'ba') for each family."/>
        <param name="sampleSize" type="integer" label="number of tags in the sample" value="1000" min="0" help="Specifies the number of tags in one analysis. If the sample size is 0, all tags of the dataset are compared against all tags."/>
        <param name="minFS" type="integer" label="minimum family size of the tags" min="1" value="1" help="Filters the tags by family size: families with a smaller size are skipped. Default: min. family size = 1."/>
        <param name="maxFS" type="integer" label="max family size of the tags" min="0" value="0" help="Filters the tags by family size: families with a larger size are skipped. If max. family size is 0, no upper bound is defined and the maximum family size in the analysis is the maximum family size of the whole dataset. Default: max. family size = 0."/>
        <param name="onlyDCS" type="boolean" label="only DCS in the analysis?" truevalue="" falsevalue="--only_DCS" checked="False" help="Only tags that have a partner tag (ab and ba) in the dataset are included in the analysis."/>
        <param name="rel_freq" type="boolean" label="relative frequency?" truevalue="" falsevalue="--rel_freq" checked="False" help="If True, relative frequencies instead of absolute values are displayed in the plots."/>
        <param name="subsetTag" type="integer" label="shorten tag in the analysis?" value="0" help="This parameter simulates an analysis with a shorter tag length. If it is 0 (the default), the tags are used with their original length."/>
        <param name="nproc" type="integer" label="number of processors" value="8" help="Number of processors used for computing."/>
        <param name="nr_above_bars" type="boolean" label="include numbers above bars?" truevalue="--nr_above_bars" falsevalue="" checked="True" help="The absolute and relative values of the data can be included in or removed from the plots."/>

    </inputs>
    <outputs>
        <data name="output_pdf" format="pdf" />
        <data name="output_tabular" format="tabular"/>
        <data name="output_chimeras_tabular" format="tabular"/>

    </outputs>
    <tests>
        <test>
            <param name="inputFile" value="td_data.tab"/>
            <param name="sampleSize" value="0"/>
            <output name="output_pdf" file="td_output.pdf" lines_diff="6"/>
            <output name="output_tabular" file="td_output.tab"/>
            <output name="output_chimeras_tabular" file="td_chimeras_output.tab"/>
        </test>
    </tests>
    <help> <![CDATA[
**What it does**

Tags used in Duplex Sequencing (DS) are randomized barcodes, e.g., 12 base pairs long. Since each DNA fragment is labeled with two tags, one at each end, there are theoretically 4^(12+12) = 4^24 unique combinations. However, the input DNA in a typical DS experiment contains only ~1,000,000 molecules, creating a large tag-to-input excess (4^24 ≫ 1,000,000). Because of this excess, it is highly unlikely that distinct input DNA molecules are tagged with highly similar barcodes.
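
As a rough check of this excess, the number of possible tag combinations can be compared with the approximate number of input molecules. The variable names below are illustrative only and are not part of the tool.

```python
# Arithmetic behind the tag-to-input excess (illustrative names).
tag_length = 12              # bases per tag, as in the example above
alphabet_size = 4            # A, C, G, T
input_molecules = 1000000    # approximate input of a typical DS experiment

# Two tags per fragment give 4^(12+12) possible combinations.
combinations = alphabet_size ** (2 * tag_length)
excess = combinations / float(input_molecules)

print(combinations)          # 281474976710656
print("%.1e" % excess)       # 2.8e+08
```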

This tool calculates the number of nucleotide differences among tags, also known as the `Hamming distance <https://en.wikipedia.org/wiki/Hamming_distance>`_. In this context, the Hamming distance is simply the number of positions at which two tags differ. The tool takes a randomly selected subset of tags (default n=1000) and compares each tag of the subset with the tags of the complete dataset. Each tag differs from the other tags by a certain number of nucleotides; the tool reports only the smallest difference observed with any other tag.
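
The minimum tag distance can be sketched in a few lines of Python. This is a simplified illustration, not the code of the tool itself: the function names are hypothetical, sampling is omitted, and it is assumed that a tag is never compared with an identical copy of itself.

```python
def hamming(tag1, tag2):
    """Number of positions at which two equal-length tags differ."""
    if len(tag1) != len(tag2):
        raise ValueError("tags must have the same length")
    return sum(1 for x, y in zip(tag1, tag2) if x != y)


def min_tag_distance(tag, all_tags):
    """Smallest Hamming distance between `tag` and any other distinct tag."""
    return min(hamming(tag, other) for other in all_tags if other != tag)


# Four made-up tags of length 12.
tags = ["A" * 12, "A" * 11 + "T", "A" * 8 + "CCGG", "T" * 12]
distances = [min_tag_distance(t, tags) for t in tags]
print(distances)  # [1, 1, 4, 11]
```

In the tool, only the tags of the random sample take the role of `tag`, while `all_tags` corresponds to the complete dataset.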

**Input**

This tool expects a tabular file with the tags of all families, the family sizes and information about forward (ab) and reverse (ba) strands::

 1 2                        3
 -----------------------------
 1 AAAAAAAAAAAAAAAAATGGTATG ba
 3 AAAAAAAAAAAAAATGGTATGGAC ab

.. class:: infomark

**How to generate the input**

The first step of the `Du Novo Analysis Pipeline <https://doi.org/10.1186/s13059-016-1039-4>`_ is the **Make Families** tool or the **Correct Barcodes** tool that produces output in this form::

 1                        2  3     4
 ------------------------------------------------------
 AAAAAAAAAAAAAAATAGCTCGAT ab read1 CGCTACGTGACTGGGTCATG
 AAAAAAAAAAAAAAATAGCTCGAT ab read2 CGCTACGTGACTGGGTCATG
 AAAAAAAAAAAAAAATAGCTCGAT ab read3 CGCTACGTGACTGGGTCATG
 AAAAAAAAAAAAAAAAATGGTATG ba read3 CGCTACGTGACTAAAACATG

We only need columns 1 and 2. These two columns can be extracted from this dataset using the **Cut** tool::

 1                        2
 ---------------------------
 AAAAAAAAAAAAAAATAGCTCGAT ab
 AAAAAAAAAAAAAAATAGCTCGAT ab
 AAAAAAAAAAAAAAATAGCTCGAT ab
 AAAAAAAAAAAAAAAAATGGTATG ba

Next, the tags are sorted in ascending or descending order using the **Sort** tool::

 1                        2
 ---------------------------
 AAAAAAAAAAAAAAAAATGGTATG ba
 AAAAAAAAAAAAAAATAGCTCGAT ab
 AAAAAAAAAAAAAAATAGCTCGAT ab
 AAAAAAAAAAAAAAATAGCTCGAT ab

Finally, unique occurrences of each tag are counted. This is done using the **Unique lines** tool, which adds an additional column with the counts that also represent the family size (column 1)::

 1 2                        3
 -----------------------------
 1 AAAAAAAAAAAAAAAAATGGTATG ba
 3 AAAAAAAAAAAAAAATAGCTCGAT ab

These data can now be used in this tool.
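
For readers who prefer to prepare the input outside of Galaxy, the Cut, Sort and Unique lines steps amount to counting identical (tag, strand) rows. The following is a minimal sketch under that assumption; it uses the example rows from above and is not part of the tool.

```python
from collections import Counter

# Columns 1 and 2 of the Make Families output (tag, strand), one row per read.
tag_ab = "AAAAAAAAAAAAAAATAGCTCGAT"
tag_ba = "AAAAAAAAAAAAAAAAATGGTATG"
reads = [(tag_ab, "ab"), (tag_ab, "ab"), (tag_ab, "ab"), (tag_ba, "ba")]

# Counting identical (tag, strand) rows corresponds to Sort + Unique lines.
family_sizes = Counter(reads)

# Emit "family size <TAB> tag <TAB> strand", sorted by tag.
rows = ["%d\t%s\t%s" % (size, tag, strand)
        for (tag, strand), size in sorted(family_sizes.items())]
print("\n".join(rows))
```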

**Output**

The output is one PDF file with various plots of the tag distance, a tabular file with the summarized data of the plots and a tabular file with the chimeras. The PDF file contains several pages:

 1. The first page contains a graph representing the minimum tag distance (smallest number of differences) categorized by family size.

 2. The second page contains the same information as the first page, but plots the family size categorized by the minimum tag distance.

 3. The third page contains the **first step** of the **chimera analysis**, which examines the differences between the tags at both ends of a read (a/b). Chimeras can be recognized by carrying the same tag at one end combined with multiple different tags at the other end of a read. Here, we describe the calculation of the TDs for one tag in detail; the process is repeated for each tag in the sample (default n=1000). First, the tool splits the tag into its upstream and downstream parts (named a and b) and compares the a part with the a parts of all other families in the dataset. Next, the tool determines the sequence differences (TD) among the a parts, extracts the tags with the smallest difference (TD a.min) and calculates the TD of their b parts. Of these, the largest difference is taken as the maximum TD (TD b.max). The process is then repeated starting with the b part instead, yielding TD b.min and TD a.max. Finally, the sum of TD a.min and TD b.max is calculated.

 4. The fourth page contains the **second step** of the **chimera analysis**: the absolute difference (=delta TD) between the partial TDs (TD a.min & TD b.max and TD b.min & TD a.max). The partial TDs of chimeric tags are normally very different, which means that multiple combinations of the same a part with different b parts are likely. However, small delta TDs can also occur when one half of a tag is identical to other halves in the data. For this reason, the relative difference between the partial TDs is estimated in the next step.

 5. The fifth page contains the **third step** of the **chimera analysis**: the relative differences of the partial TDs (=relative delta TD). The delta TD is the absolute difference between TD a.min and TD b.max. Since it is not known whether this absolute difference originates from a low and a very large TD within a tag or from an identical half (TD=0), the tool estimates the relative delta TD as the ratio of the difference to the sum of the partial TDs. In a chimera, it is expected that only one end of the tag contributes to the TD of the whole tag. In other words, if the same a part is observed in combination with several different b parts, then one end will have a TD = 0. Thus, the absolute TD difference between the parts is the same as the sum of the parts (TD a.min + TD b.max), and the ratio of the difference to the sum (relative delta TD = abs(TD a.min - TD b.max) / (TD a.min + TD b.max)) equals 1 in chimeric families. The plot can be interpreted as follows:

    - A low relative difference indicates that the total TD is equally distributed between the two partial TDs. This case would be expected if all tags originate from different molecules.
    - A relative delta TD of 1 means that one part of the tags is identical. Since it is very unlikely that two different tags have a TD of 0 by chance, the TDs in the other half are probably artificially introduced and represent chimeric families.

 6. The sixth page is an analysis of only the **chimeric tags** (relative delta TD = 1) from step 5.

 7. The last page is only generated when the parameter "only DCS in the analysis?" is set to **False (NO)**. The graph represents the **TD of the chimeric tags** that form a DCS (complementary ab and ba).
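
The three steps of the chimera analysis can be sketched as follows. This is a simplified illustration of the description above, not the code of the tool: the function names, the `start_with_a` switch and the handling of a zero sum are assumptions, and the tags are split exactly in the middle.

```python
def hamming(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(s1, s2) if x != y)


def partial_tds(tag, all_tags, start_with_a=True):
    """Return (TD min of the starting half, TD max of the other half)."""
    half = len(tag) // 2
    others = [t for t in all_tags if t != tag]
    if start_with_a:
        first = [hamming(tag[:half], t[:half]) for t in others]
        second = [hamming(tag[half:], t[half:]) for t in others]
    else:
        first = [hamming(tag[half:], t[half:]) for t in others]
        second = [hamming(tag[:half], t[:half]) for t in others]
    td_min = min(first)
    # Among the tags closest in the starting half, keep the largest
    # distance observed in the other half.
    td_max = max(s for f, s in zip(first, second) if f == td_min)
    return td_min, td_max


def relative_delta_td(td_min, td_max):
    """Return (delta TD, relative delta TD) for one pair of partial TDs."""
    delta = abs(td_min - td_max)
    total = td_min + td_max
    # Assumption: a sum of 0 is reported as 0.0 to avoid dividing by zero.
    relative = float(delta) / total if total > 0 else 0.0
    return delta, relative


# The first two tags share their b half, so the first tag looks chimeric.
tags = [
    "GAAAGGGAGG" + "GCGCTTCACG",
    "GCAATCGACG" + "GCGCTTCACG",
    "CCCTCCCTGA" + "GGTTCGTTAT",
]
td_b_min, td_a_max = partial_tds(tags[0], tags, start_with_a=False)
delta, relative = relative_delta_td(td_b_min, td_a_max)
print(td_b_min, td_a_max, delta, relative)
```

Here TD b.min is 0 because the b halves are identical, so the difference equals the sum and the relative delta TD is 1.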

.. class:: infomark

**Note:**
Chimeras can be identical in the first or second part of the tag and can have an identical TD with multiple tags. Therefore, the second column of the output file can have multiple tag entries. The file also contains the family sizes and the direction of the read (ab, ba). The asterisks mark the identical part of the tag::

 1                                      2
 --------------------------------------------------------------------------------------------------
 GAAAGGGAGG GCGCTTCACG	1 ba            GCAATCGACG *GCGCTTCACG* 1 ba
 CCCTCCCTGA GGTTCGTTAT	1 ba            CGTCCTTTTC *GGTTCGTTAT* 1 ba, GCACCTCCTT *GGTTCGTTAT* 1  ba
 ATGCTGATCT CGAATGCATA	55 ba, 59 ab    AGGTGCCGCC *CGAATGCATA* 27 ba, *ATGCTGATCT* GAATGTTTAC 1 ba

**About Author**

Author: Monika Heinzl

Department: Institute of Biophysics, Johannes Kepler University Linz, Austria

Contact: monika.heinzl@edumail.at

   ]]>

    </help>
    <citations>
        <citation type="bibtex">
            @misc{duplex,
            author = {},
            year = {},
            title = {}
         }
        </citation>
    </citations>
</tool>
author	mheinzl
date	Wed, 16 Oct 2019 04:17:59 -0400