What it does

Extracts reads that show an overlap signature of the specified length (in nt) and returns a FASTA file of these "pairable" reads.

See Antoniewski (2014) for background and details.

Input

A **sorted** BAM alignment file.

Query and target sizes:

For each query read (of the specified size) in the BAM alignment, the algorithm searches for target reads (of the specified size) that align on the opposite strand with the specified overlap (10 nt in the examples here).

Searching for query reads of 20-22 nt that overlap by 10 nt with target reads of 23-29 nt is equivalent to searching for query reads of 23-29 nt that overlap by 10 nt with target reads of 20-22 nt, i.e., searching for siRNAs that pair with piRNAs is equivalent to searching for piRNAs that pair with siRNAs. In contrast, searching for query reads of 20-22 nt that overlap by 10 nt with target reads of 23-29 nt is different from searching for query reads of 23-29 nt that overlap by 10 nt with target reads of 23-29 nt, since the number of "heterotypic" pairs of reads is likely to differ from the number of "homotypic" pairs of reads.
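The search described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the `Read` record, the helper names and the in-memory list of reads are all assumptions made for the example (the real tool reads a BAM file), and `coord` is assumed to be the 1-based leftmost position of the alignment on either strand. The third read in the sample data is made up to illustrate a heterotypic pair.

```python
from collections import namedtuple

# Hypothetical record type; coord = 1-based leftmost aligned position.
Read = namedtuple("Read", ["chrom", "coord", "strand", "size"])


def five_prime(read):
    """Coordinate of the read's 5' end (leftmost base on +, rightmost on -)."""
    return read.coord if read.strand == "+" else read.coord + read.size - 1


def partner_key(read, overlap):
    """(chrom, strand, 5' coordinate) that a pairing read must have."""
    if read.strand == "+":
        return (read.chrom, "-", five_prime(read) + overlap - 1)
    return (read.chrom, "+", five_prime(read) - overlap + 1)


def find_pairs(reads, query_sizes, target_sizes, overlap=10):
    """List (query, target) pairs on opposite strands with the given overlap."""
    # Index target reads by chromosome, strand and 5' coordinate.
    targets = {}
    for r in reads:
        if r.size in target_sizes:
            key = (r.chrom, r.strand, five_prime(r))
            targets.setdefault(key, []).append(r)
    pairs = []
    for q in reads:
        if q.size in query_sizes:
            for t in targets.get(partner_key(q, overlap), []):
                pairs.append((q, t))
    return pairs


reads = [
    Read("FBgn0000004_17.6", 5839, "-", 26),
    Read("FBgn0000004_17.6", 5855, "+", 23),
    Read("FBgn0000004_17.6", 5855, "+", 21),  # made-up siRNA-sized read
]

# Heterotypic search in both directions, then a homotypic search.
si_vs_pi = find_pairs(reads, range(20, 23), range(23, 30))
pi_vs_si = find_pairs(reads, range(23, 30), range(20, 23))
pi_vs_pi = find_pairs(reads, range(23, 30), range(23, 30))
```

With this sample data, `si_vs_pi` and `pi_vs_si` contain the same pair of reads with the query and target roles swapped, whereas `pi_vs_pi` is a different result, which illustrates the distinction between heterotypic and homotypic searches.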

Overlap:

The number of nucleotides by which the paired sequences overlap.

Outputs

A FASTA file of pairable reads, such as:

>FBgn0000004_17.6|coord=5839|strand -|size=26|nreads=1

TTTTCGTCAATTGTGCCAAATAGGTA

>FBgn0000004_17.6|coord=5855|strand +|size=23|nreads=1

TTGACGAAAATGATCGAGTGGAT

Here, FBgn0000004_17.6 is the chromosome name, 5839 is the 1-based read position, 'strand -' indicates the lower (minus) strand of the chromosome, 26 is the size of the sequence, and nreads=1 is the number of reads with this sequence in the dataset.
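For downstream processing, the header fields described above can be pulled apart with a regular expression. The following is a minimal sketch that assumes every header follows exactly the layout of the example; `HEADER_RE` and `parse_header` are hypothetical names chosen for illustration and are not part of the tool.

```python
import re

# Pattern inferred from the example header only (an assumption).
HEADER_RE = re.compile(
    r"^>(?P<chrom>.+)\|coord=(?P<coord>\d+)\|strand (?P<strand>[+-])"
    r"\|size=(?P<size>\d+)\|nreads=(?P<nreads>\d+)$"
)


def parse_header(line):
    """Split one FASTA header line into a dict of typed fields."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized header: {line!r}")
    fields = match.groupdict()
    # Numeric fields are converted to int; chrom and strand stay strings.
    for name in ("coord", "size", "nreads"):
        fields[name] = int(fields[name])
    return fields


record = parse_header(">FBgn0000004_17.6|coord=5839|strand -|size=26|nreads=1")
```

Raising an error on unrecognized headers, rather than skipping them silently, makes it obvious when a file does not follow the assumed layout.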

The second sequence in this example corresponds to 1 read that overlaps by 10 nt with 1 read of the first sequence.
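The 10 nt figure can be checked by hand from the two headers. The short sketch below assumes that `coord` is the leftmost 1-based position of the read on either strand, which is consistent with the example; the function name is hypothetical.

```python
def five_prime_overlap(plus_coord, minus_coord, minus_size):
    """Count the nt from the plus read's 5' end to the minus read's 5' end."""
    # The 5' end of a minus-strand read is its rightmost aligned base.
    minus_five_prime = minus_coord + minus_size - 1
    return minus_five_prime - plus_coord + 1


# Values taken from the two example headers.
overlap = five_prime_overlap(plus_coord=5855, minus_coord=5839, minus_size=26)
```

Here the minus-strand read spans positions 5839 to 5864, so its 5' end lies 10 nt (positions 5855 to 5864) into the plus-strand read that starts at 5855.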

The tool also reports, in the standard output, the numbers of read pairs that can be formed simultaneously in silico. Note that these numbers are distinct from the numbers of pairs of read alignments (as computed by the small_rna_signature tool) when the analysis is performed with multi-mapping reads.