view overlapping_reads.xml @ 4:20d28cfdeefe draft

planemo upload for repository https://github.com/ARTbio/tools-artbio/tree/master/tools/small_rna_signatures commit cfdc08418887bfe4a35588cd78d0a2b6ffa6e19e
author artbio
date Fri, 08 Sep 2017 04:44:22 -0400
parents 4d9682bd3a6b
children a7fd04208764

<tool id="overlapping_reads" name="Get overlapping reads" version="0.9.3">
    <description />
    <requirements>
        <requirement type="package" version="0.11.2.1=py27_0">pysam</requirement>
    </requirements>
    <stdio>
        <exit_code range="1:" level="fatal" description="Tool exception" />
    </stdio>
    <command detect_errors="exit_code"><![CDATA[
        samtools index '$input' &&
        python '$__tool_directory__'/overlapping_reads.py
           --input '$input'
           --minquery '$minquery'
           --maxquery '$maxquery'
           --mintarget '$mintarget'
           --maxtarget '$maxtarget'
           --overlap '$overlap'
           --output '$output'
    ]]></command>
    <inputs>
        <param format="bam" label="Compute signature from this bowtie standard output" name="input" type="data" />
        <param help="'23' = 23 nucleotides" label="Min size of query small RNAs" name="minquery" size="3" type="integer" value="23" />
        <param help="'29' = 29 nucleotides" label="Max size of query small RNAs" name="maxquery" size="3" type="integer" value="29" />
        <param help="'23' = 23 nucleotides" label="Min size of target small RNAs" name="mintarget" size="3" type="integer" value="23" />
        <param help="'29' = 29 nucleotides" label="Max size of target small RNAs" name="maxtarget" size="3" type="integer" value="29" />
        <param help="'10' = 10 nucleotides overlap" label="Overlap (in nt)" name="overlap" size="3" type="integer" value="10" />
    </inputs>
    <outputs>
        <data format="fasta" label="pairable reads" name="output" />
    </outputs>
    <tests>
        <test>
            <param ftype="bam" name="input" value="sr_bowtie.bam" />
            <param name="minquery" value="23" />
            <param name="maxquery" value="29" />
            <param name="mintarget" value="23" />
            <param name="maxtarget" value="29" />
            <param name="overlap" value="10" />
            <output file="paired.fa" ftype="fasta" name="output" />
        </test>
        <test>
            <param ftype="bam" name="input" value="sr_bowtie.bam" />
            <param name="minquery" value="20" />
            <param name="maxquery" value="22" />
            <param name="mintarget" value="23" />
            <param name="maxtarget" value="29" />
            <param name="overlap" value="10" />
            <output file="paired_2.fa" ftype="fasta" name="output" />
        </test>
    </tests>
    <help>

**What it does**

Extracts reads that show an overlap signature of the specified size (in nt) and
returns a FASTA file of these "pairable" reads.

See `Antoniewski (2014)`_ for background and details.

.. _Antoniewski (2014): https://link.springer.com/protocol/10.1007%2F978-1-4939-0931-5_12

**Input**

A **sorted** BAM alignment file.

*Query and target sizes:*

For each *query* read (of the specified size), the algorithm searches the BAM alignment for
*target* reads (of the specified size) that align on the opposite strand with the specified
overlap (10 nt by default).

Searching for query reads of 20-22 nt that overlap by 10 nt with target
reads of 23-29 nt is different from searching for query reads of 23-29 nt that overlap by 10 nt
with target reads of 20-22 nt, i.e., searching for siRNAs that pair with piRNAs is distinct
from searching for piRNAs that pair with siRNAs, although of course the number of possible
piRNA/siRNA pairs is the same as the number of possible siRNA/piRNA pairs.

*Overlap:*

The number of nucleotides by which the paired sequences overlap.



**Outputs**

A FASTA file of pairable reads, such as:

>FBgn0000004_17.6|5855|F|23|n=1

TTGACGAAAATGATCGAGTGGAT

>FBgn0000004_17.6|5839|R|26|n=1

TTTTCGTCAATTGTGCCAAATAGGTA

where FBgn0000004_17.6 is the chromosome, 5839 is the 1-based read position,
R stands for the reverse strand (F for the forward strand), 26 is the size of the sequence and
n=1 is the number of reads of the sequence in the dataset.

The second sequence in this example corresponds to 1 read that overlaps by 10 nt with
1 read of the first sequence.

    </help>
    <citations>
        <citation type="doi">10.1007/978-1-4939-0931-5_12</citation>
    </citations>
</tool>
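
The header layout and pairing rule described in the help text can be illustrated with a small Python sketch. This is not the code of `overlapping_reads.py`; the names `Read`, `parse_header`, `five_prime_overlap` and `is_pair` are hypothetical. It assumes that the reported position is the leftmost aligned base for both strands (which is consistent with the example pair in the help, but is not stated there) and that both reads are at least as long as the overlap.

```python
from typing import NamedTuple


class Read(NamedTuple):
    chrom: str
    pos: int     # 1-based position, ASSUMED to be the leftmost aligned base
    strand: str  # "F" (forward) or "R" (reverse)
    size: int    # read length in nt
    count: int   # number of reads with this sequence (the n=... field)


def parse_header(header: str) -> Read:
    """Parse a header such as '>FBgn0000004_17.6|5855|F|23|n=1'."""
    chrom, pos, strand, size, count = header.lstrip(">").split("|")
    return Read(chrom, int(pos), strand, int(size), int(count.split("=")[1]))


def five_prime_overlap(forward: Read, reverse: Read) -> int:
    """Number of positions shared between the 5' ends of the two reads."""
    # Under the leftmost-coordinate assumption, the 5' end of a reverse
    # read is its rightmost aligned base.
    reverse_5p = reverse.pos + reverse.size - 1
    return reverse_5p - forward.pos + 1


def is_pair(a: Read, b: Read, overlap: int = 10) -> bool:
    """True if a and b lie on opposite strands with the requested overlap."""
    if a.chrom != b.chrom or a.strand == b.strand:
        return False
    forward, reverse = (a, b) if a.strand == "F" else (b, a)
    return five_prime_overlap(forward, reverse) == overlap


first = parse_header(">FBgn0000004_17.6|5855|F|23|n=1")
second = parse_header(">FBgn0000004_17.6|5839|R|26|n=1")
print(is_pair(first, second, overlap=10))
```

With the two example records, the reverse read spans 5839 to 5864 and the forward read starts at 5855, so the two reads share the 10 positions 5855 to 5864.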