comparison microsatellite.xml @ 0:07588b899c13 draft

Uploaded
author arkarachai-fungtammasan
date Wed, 01 Apr 2015 17:05:51 -0400
parents
children d5ed5c2e25c3
comparison
-1:000000000000 0:07588b899c13
1 <tool id="microsatellite" name="Microsatellite detection" version="1.0.0">
2 <description>for short read, reference, and mapped data</description>
3 <command interpreter="python2.7"> microsatellite.py
4 "${filePath}"
5 #if $inputFileSource.inputFileType == "fasta"
6 --fasta
7 #elif $inputFileSource.inputFileType == "fastq"
8 --fastq
9 #elif $inputFileSource.inputFileType == "fastq_noquals"
10 --fastq:noquals
11 #elif $inputFileSource.inputFileType == "sam"
12 --sam
13 #end if
14
15 #if $inputFileSource.inputFileType == "sam"
16 #if $inputFileSource.referenceFileSource.requireReference
17 --r --ref="${inputFileSource.referenceFileSource.referencePath}"
18 #end if
19 #end if
20
21 --period="${period}"
22
23 #if $partialmotifs == "true"
24 --partialmotifs
25 #end if
26
27 --minlength="${minlength}"
28
29
30 --prefix="${prefix}"
31 --suffix="${surfix}"
32
33 --hamming="${hammingThreshold}"
34
35 #if $multipleruns
36 --multipleruns
37 #end if
38
39 #if $flankSetting.noflankdisplay
40 --noflankdisplay
41 #else
42 --flankdisplay=${flankSetting.flankdisplay}
43 #end if
44 &gt; $stdout
45 </command>
46
47 <inputs>
48 <param name="filePath" label="Select input file" type="data"/>
49 <conditional name="inputFileSource">
50 <param name="inputFileType" type="select" label="Select input file type">
51 <option value="fasta">Fasta File</option>
52 <option value="fastq">Fastq File</option>
53 <option value="fastq_noquals">Fastq File without Quality Information</option>
54 <option value="sam">SAM File</option>
55 </param>
56 <when value="sam">
57 <conditional name="referenceFileSource">
58 <param name="requireReference" label="Do you want to extract the corresponding microsatellites in the reference for comparison?" type="boolean">
59 </param>
60 <when value="true">
61 <param name="referencePath" label="Select reference file" type="data"/>
62 </when>
63 </conditional>
64 </when>
65 </conditional>
66
67 <param name="period" label="Motif size of microsatellites of interest (e.g. Mononucleotide microsatellite =1) (must be less than 10)" type="integer" size="2" value="1"/>
68 <param name="partialmotifs" label="Consider microsatellites with a partial motif?" type="boolean" checked="True"/>
69 <param name="minlength" label="Minimal length (bp) of microsatellite sequence reported" type="integer" size="2" value="5"/>
70
71
72 <param name="prefix" label="Do not report candidate repeat intervals whose left flanking region is shorter than (bp):" type="integer" size="4" value="20"/>
73 <param name="surfix" label="Do not report candidate repeat intervals whose right flanking region is shorter than (bp):" type="integer" size="4" value="20"/>
74
75
76 <param name="hammingThreshold" label="Hamming distance threshold for microsatellites. If greater than 0, interrupted microsatellites will also be reported" type="integer" size="2" value="0"/>
77 <param name="multipleruns" label="Consider all candidate intervals in a sequence. If unchecked, only the longest one will be considered" type="boolean" checked="True"> </param>
78 <conditional name="flankSetting">
79 <param name="noflankdisplay" label="Show the entire flanking regions" type="boolean" checked="True"/>
80 <when value="false">
81 <param name="flankdisplay" label="Limit length (bp) of flanking regions shown" type="integer" size="4" value="5"/>
82 </when>
83 </conditional>
84
85 </inputs>
86 <outputs>
87 <data name="stdout" format="tabular"/>
88 </outputs>
89 <tests>
90 <!-- Test data with valid values -->
91 <test>
92 <param name="filePath" value="C_sample_fastq"/>
93 <param name="period" value="1"/>
94 <param name="partialmotifs" value="true" />
95 <param name="minlength" value="3" />
96 <param name="prefix" value="5"/>
97 <param name="surfix" value="5"/>
98 <param name="hammingThreshold" value="0"/>
99 <param name="multipleruns" value="true"> </param>
100 <output name="stdout" file="C_sample_snoope"/>
101 </test>
102
103 </tests>
104 <help>
105
106
107 .. class:: infomark
108
109 **What it does**
110
111 This tool uses different algorithms to detect microsatellites depending on the Hamming distance parameter.
112 If the Hamming distance is set to zero, the program considers only uninterrupted microsatellites (tandem repeats, TRs). The process works as follows:
113
114 1) Scan reads using sliding windows. For a given repeat period ‘k’ (e.g. k=2 for dinucleotide TRs), consecutive windows of k bases are compared, with a step size of k. A position whose base matches the base k positions earlier is marked with a plus; a position whose base differs is marked with a minus. The first k positions are left blank.
115
116 2) Because mutations are not allowed in a reported TR, a run of consecutive “+” signals means that a k-mer TR is present in the sequence.
117
118 3) Report a k-mer TR if its length is larger than the threshold provided by the user.
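
The three steps above can be sketched in Python. This is an illustration only and is not taken from microsatellite.py: the function name `find_exact_repeats` and its return format are hypothetical, and the sketch assumes that the length threshold is inclusive and that a repeat's length includes the first copy of the motif.

```python
def find_exact_repeats(seq, k, minlength):
    """Return (start, end, motif) for uninterrupted period-k repeats in seq.

    Illustrative sketch only; not the code used by microsatellite.py.
    """
    # Step 1: signal[i] is "+" when seq[i] equals the base k positions
    # earlier, "-" when it differs, and blank for the first k positions.
    signal = []
    for i, base in enumerate(seq):
        if k > i:
            signal.append(" ")
        elif base == seq[i - k]:
            signal.append("+")
        else:
            signal.append("-")

    # Step 2: every maximal run of "+" corresponds to one repeat.
    repeats = []
    n = len(seq)
    i = 0
    while i != n:
        if signal[i] != "+":
            i += 1
            continue
        run_start = i
        while i != n and signal[i] == "+":
            i += 1
        # The first copy of the motif sits just before the "+" run.
        start = run_start - k
        # Step 3: keep the repeat only if it is long enough
        # (assumption: the threshold is inclusive).
        if i - start >= minlength:
            repeats.append((start, i, seq[start:start + k]))
    return repeats
```

For example, `find_exact_repeats("ACGTTTTTTACG", 1, 5)` returns `[(3, 9, "T")]`: six T bases starting at 0-based position 3.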
119
120 If the Hamming distance is set to an integer greater than zero, the program considers both uninterrupted and interrupted microsatellites. The process works as follows:
121
122 (1) Identify intervals that are highly correlated with the same interval shifted by ‘k’ (the repeat period). These intervals are called "runs" or "candidates". The allowed level of correlation is 6/7. Depending on whether more than one microsatellite per sequence is wanted, the program finds either the longest such run (simple algorithm) or many runs (more complicated algorithm). The following steps are then performed on each run.
123
124 (2) Find the most likely repeat motif in the run. This is done by counting all k-mers (of length P, i.e. the repeat period) and choosing the most frequent one. If that k-mer is itself covered by a sub-repeat, the run is discarded. The idea is that a 6-mer like ACGACG can be ignored because it will be found when looking for 3-mers.
125
126 (3) Once the most likely repeat motif is identified, modify the interval, adjusting its start and end to find the interval that has the fewest mismatches against a sequence of the motif repeated (the Hamming distance).
127
128 (4) At this point the interval is a valid microsatellite (in the eyes of the program). It is subjected to some filtering stages (Hamming distance, or being too close to an end), and if it satisfies those conditions, it is reported to the user.
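
Parts of steps (2) and (3) can be sketched in Python as follows. This is a hedged illustration, not the code in microsatellite.py: all three function names are hypothetical, "covered by a sub-repeat" is interpreted here as "the motif is a shorter unit repeated a whole number of times", and the search over interval starts and ends in step (3) is left out; only the mismatch count it minimizes is shown. The run is assumed to be at least as long as the period.

```python
from collections import Counter


def most_frequent_kmer(run, k):
    """Step (2): return the most common length-k substring of run."""
    counts = Counter(run[i:i + k] for i in range(len(run) - k + 1))
    return counts.most_common(1)[0][0]


def has_sub_repeat(motif):
    """True if motif is a shorter unit repeated, e.g. ACGACG = ACG x 2."""
    k = len(motif)
    for p in range(1, k):
        if k % p == 0 and motif[:p] * (k // p) == motif:
            return True
    return False


def hamming_to_motif(interval, motif):
    """Step (3): mismatches between interval and the motif repeated along it."""
    k = len(motif)
    return sum(1 for i, base in enumerate(interval) if base != motif[i % k])
```

With the run `ACACACTCACAC` and a period of 2, the most frequent 2-mer is `AC`, it has no sub-repeat, and the run differs from `AC` repeated six times at a single position, so its Hamming distance is 1.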
129
130 For more options, the script behind this tool can be downloaded and run with Python independently of Galaxy. The script mode offers additional options, and a help page is built into the script.
131
132 **Citation**
133
134 When you use this tool, please cite **Fungtammasan A, Ananda G, Hile SE, Su MS, Sun C, Harris R, Medvedev P, Eckert K, Makova KD. 2015. Accurate Typing of Short Tandem Repeats from Genome-wide Sequencing Data and its Applications, Genome Research**
135 This tool was developed by Chen Sun (cxs1031@cse.psu.edu) and Bob Harris (rsharris@bx.psu.edu).
136
137 **Input**
138
139 - The input file can be in FASTQ, FASTA, FASTQ without quality scores, or SAM format.
140
141 **Output**
142
143 For fastq, the output will contain the following columns:
144
145 - Column 1 = length of microsatellites (bp)
146 - Column 2 = length of left flanking regions (bp)
147 - Column 3 = length of right flanking regions (bp)
148 - Column 4 = repeat motif (bp)
149 - Column 5 = hamming distance
150 - Column 6 = read name
151 - Column 7 = read sequence with soft masking of microsatellites
152 - Column 8 = read quality (the same Phred score scale as input)
153
154 For FASTA, FASTQ without quality scores, and SAM formats, column 8 is replaced with a dot (.).
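
For downstream scripting, one line of this output could be read as sketched below. The `Record` type and `parse_line` function are hypothetical names introduced for illustration, and the sketch assumes that the columns are tab-separated (the output is declared as tabular) and that columns 1, 2, 3 and 5 hold integers.

```python
from collections import namedtuple

# Field names are illustrative; they follow the eight columns listed above.
Record = namedtuple(
    "Record",
    "length left_flank right_flank motif hamming read_name sequence quality",
)


def parse_line(line):
    """Split one tab-separated output line into the eight documented columns."""
    fields = line.rstrip("\n").split("\t")
    length, left, right, motif, hamming, name, seq, qual = fields[:8]
    # Assumption: the length, flank and Hamming columns are plain integers.
    return Record(int(length), int(left), int(right), motif,
                  int(hamming), name, seq, qual)
```

Because only the first eight fields are used, the same function also accepts the longer lines produced for SAM input with a reference.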
155
156 If users have a mapped file (SAM) and would like to profile microsatellites from premapped data instead of using the flank-based mapping approach, they can select SAM format input and request the corresponding microsatellites in the reference for comparison. The output will be as follows:
157
158 - Column 1 = length of microsatellites (bp)
159 - Column 2 = length of left flanking regions (bp)
160 - Column 3 = length of right flanking regions (bp)
161 - Column 4 = repeat motif (bp)
162 - Column 5 = hamming distance
163 - Column 6 = read name
164 - Column 7 = read sequence with soft masking of microsatellites
165 - Column 8 = read quality (the same Phred score scale as input)
166 - Column 9 = read name (The same as column 6)
167 - Column 10 = chromosome
168 - Column 11 = left flanking region start
169 - Column 12 = left flanking region stop
170 - Column 13 = microsatellite start as inferred from paired-end data
171 - Column 14 = microsatellite stop as inferred from paired-end data
172 - Column 15 = right flanking region start
173 - Column 16 = right flanking region stop
174 - Column 17 = microsatellite length in reference
175 - Column 18 = microsatellite sequence in reference
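
As one possible use of the reference columns, the sketch below compares the microsatellite length observed in a read (column 1) with the length in the reference (column 17). The function name `length_difference` is hypothetical, and the sketch assumes tab-separated lines in which both columns are integers.

```python
def length_difference(line):
    """Return read microsatellite length minus reference length for one line."""
    fields = line.rstrip("\n").split("\t")
    read_length = int(fields[0])   # column 1: length in the read (bp)
    ref_length = int(fields[16])   # column 17: length in the reference (bp)
    return read_length - ref_length
```

A negative result means that the read carries a shorter microsatellite than the reference, and a positive result means a longer one.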
176
177 </help>
178 </tool>