comparison microsatellite.xml @ 2:d5ed5c2e25c3 draft
Uploaded
author | arkarachai-fungtammasan |
---|---|
date | Wed, 22 Apr 2015 12:48:40 -0400 |
parents | 07588b899c13 |
children |
comparison
1:f2bab38e3cbd | 2:d5ed5c2e25c3 |
---|---|
1 <tool id="microsatellite" name="Microsatellite detection" version="1.0.0"> | 1 <tool id="microsatellite" name="STR detection" version="1.0.0"> |
2 <description>for short read, reference, and mapped data</description> | 2 <description>for short read, reference, and mapped data</description> |
3 <command interpreter="python2.7"> microsatellite.py | 3 <command interpreter="python2.7"> microsatellite.py |
4 "${filePath}" | 4 "${filePath}" |
5 #if $inputFileSource.inputFileType == "fasta" | 5 #if $inputFileSource.inputFileType == "fasta" |
6 --fasta | 6 --fasta |
89 <tests> | 89 <tests> |
90 <!-- Test data with valid values --> | 90 <!-- Test data with valid values --> |
91 <test> | 91 <test> |
92 <param name="filePath" value="C_sample_fastq"/> | 92 <param name="filePath" value="C_sample_fastq"/> |
93 <param name="period" value="1"/> | 93 <param name="period" value="1"/> |
94 <param name="inputFileType" value="fastq"/> | |
94 <param name="partialmotifs" value="true" /> | 95 <param name="partialmotifs" value="true" /> |
95 <param name="minlength" value="3" /> | 96 <param name="minlength" value="3" /> |
96 <param name="prefix" value="5"/> | 97 <param name="prefix" value="5"/> |
97 <param name="surfix" value="5"/> | 98 <param name="surfix" value="5"/> |
98 <param name="hammingThreshold" value="0"/> | 99 <param name="hammingThreshold" value="0"/> |
106 | 107 |
107 .. class:: infomark | 108 .. class:: infomark |
108 | 109 |
109 **What it does** | 110 **What it does** |
110 | 111 |
111 We use different algorithms to detect microsatellites depend on hamming distance parameter. | 112 This tool identifies simple as well as interrupted STRs. Choosing a hamming distance of zero will return simple STRs. |
112 If hamming distance is set to zero, the program will only concern about uninterrupted microsatellites. The process works as follows. | 113 Choosing a hamming distance of greater than zero will return both simple and interrupted STRs. |
113 | 114 The algorithms used to identify simple and interrupted STRs are described in the manuscript cited below (see TABLE XXXX). |
114 1) Scanning reads using sliding windows. For a given repeat period ‘k’ (e.g. k=2 for dinucleotide TRs), we compared consecutive k-mer window size sequences, with a step size of k. If a base at a given position matches one k positions earlier it was marked with a plus, if corresponding sites had different bases it was marked with a minus. The first k position is blank. | |
115 | |
116 2) Since we do not allow mutations in reported TR, consecutive “+” signal sequence means that a k-mer TR is present in this sample. | |
117 | |
118 3) Report k-mer TRs if the length is larger than a threshold provided by the user. | |
119 | |
120 If hamming distance is set to integer more than zero, the program will concern both uninterrupted and interrupted microsatellites. The process works as follows: | |
121 | |
122 (1) Identify intervals that are highly correlated with the interval shifted by ‘k’ (the repeat period). These intervals are called "runs" or "candidates". The allowed level of correlation is 6/7. Depending on whether we want to look for more than one microsat, we either find the longest such run (simple algorithm) or many runs (more complicated algorithm). The following steps are then performed on each run. | |
123 | |
124 (2) Find the most likely repeat motif in the run. This is done by counting all kmers (of length P) and choosing the most frequent. If that kmer is itself covered by a sub-repeat we discard this run. The idea is that we can ignore a 6-mer like ACGACG because we will find it when we are looking for 3-mers. | |
125 | |
126 (3) Once we identify the most likely repeat motif, we then modify the interval, adjusting start and end to find the interval that has the fewest mismatches vs. a sequence of the motif repeated (hamming distance). | |
127 | |
128 (4) At this point we have a valid microsat interval (in the eyes of the program). It is subjected to some filtering stages (hamming distance or too close to an end), and if it satisfies those conditions, it's reported to the user | |
129 | |
130 For more option, the script to run this program can be downloaded and run with python independently from Galaxy. There are more option for the script mode. Help page is build-in inside the script. | |
131 | 115 |
132 **Citation** | 116 **Citation** |
133 | 117 |
134 When you use this tool, please cite **Fungtammasan A, Ananda G, Hile SE, Su MS, Sun C, Harris R, Medvedev P, Eckert K, Makova KD. 2015. Accurate Typing of Short Tandem Repeats from Genome-wide Sequencing Data and its Applications, Genome Research** | 118 When you use this tool, please cite **Fungtammasan A, Ananda G, Hile SE, Su MS, Sun C, Harris R, Medvedev P, Eckert K, Makova KD. 2015. Accurate Typing of Short Tandem Repeats from Genome-wide Sequencing Data and its Applications, Genome Research** |
135 This tool is developed by Chen Sun (cxs1031@cse.psu.edu) and Bob Harris (rsharris@bx.psu.edu) | 119 This tool is developed by Chen Sun (cxs1031@cse.psu.edu) and Bob Harris (rsharris@bx.psu.edu) |
140 | 124 |
141 **Output** | 125 **Output** |
142 | 126 |
143 For fastq, the output will contain the following columns: | 127 For fastq, the output will contain the following columns: |
144 | 128 |
145 - Column 1 = length of microsatellites (bp) | 129 - Column 1 = length of STR (bp) |
146 - Column 2 = length of left flanking regions (bp) | 130 - Column 2 = length of left flanking region (bp) |
147 - Column 3 = length of right flanking regions (bp) | 131 - Column 3 = length of right flanking region (bp) |
148 - Column 4 = repeat motif (bp) | 132 - Column 4 = repeat motif (bp) |
149 - Column 5 = hamming distance | 133 - Column 5 = hamming distance |
150 - Column 6 = read name | 134 - Column 6 = read name |
151 - Column 7 = read sequence with soft masking of microsatellites | 135 - Column 7 = read sequence with soft masking of STR |
152 - Column 8 = read quality (the same Phred score scale as input) | 136 - Column 8 = read quality (the same Phred score scale as input) |
153 | 137 |
154 For fasta, fastq without quality score and sam format, column 8 will be replaced with dot(.). | 138 For fasta, fastq without quality score and sam format, column 8 will be replaced with dot(.). |
155 | 139 |
156 If the users have mapped file (SAM) and would like to profile microsatellites from premapped data instead of using flank-based mapping approach, they can select SAM format input and specify that they want correspond microsatellites in reference for comparison. The output will be as follow: | 140 If the users have mapped file (SAM) and would like to profile STRs from premapped data instead of using flank-based mapping approach, they can select SAM format input and specify that they want correspond STRs in reference for comparison. The output will be as follow: |
157 | 141 |
158 - Column 1 = length of microsatellites (bp) | 142 - Column 1 = length of STR (bp) |
159 - Column 2 = length of left flanking regions (bp) | 143 - Column 2 = length of left flanking region (bp) |
160 - Column 3 = length of right flanking regions (bp) | 144 - Column 3 = length of right flanking region (bp) |
161 - Column 4 = repeat motif (bp) | 145 - Column 4 = repeat motif (bp) |
162 - Column 5 = hamming distance | 146 - Column 5 = hamming distance |
163 - Column 6 = read name | 147 - Column 6 = read name |
164 - Column 7 = read sequence with soft masking of microsatellites | 148 - Column 7 = read sequence with soft masking of STR |
165 - Column 8 = read quality (the same Phred score scale as input) | 149 - Column 8 = read quality (the same Phred score scale as input) |
166 - Column 9 = read name (The same as column 6) | 150 - Column 9 = read name (The same as column 6) |
167 - Column 10 = chromosome | 151 - Column 10 = chromosome |
168 - Column 11 = left flanking region start | 152 - Column 11 = left flanking region start |
169 - Column 12 = left flanking region stop | 153 - Column 12 = left flanking region stop |
170 - Column 13 = microsatellite start as infer from pair-end | 154 - Column 13 = STR start as infer from pair-end |
171 - Column 14 = microsatellite stop as infer from pair-end | 155 - Column 14 = STR stop as infer from pair-end |
172 - Column 15 = right flanking region start | 156 - Column 15 = right flanking region start |
173 - Column 16 = right flanking region stop | 157 - Column 16 = right flanking region stop |
174 - Column 17 = microsatellite length in reference | 158 - Column 17 = STR length in reference |
175 - Column 18 = microsatellite sequence in reference | 159 - Column 18 = STR sequence in reference |
176 | 160 |
177 </help> | 161 </help> |
178 </tool> | 162 </tool> |
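
The help text removed in this revision describes the hamming-distance-zero case as a sliding comparison: each base is compared with the base k positions earlier, matches are marked with a plus and mismatches with a minus, and a run of consecutive pluses indicates an uninterrupted k-mer repeat. The sketch below illustrates that description only. It is not the code in microsatellite.py; the function names, the 0-based half-open coordinates, and the choice to report repeats whose length is at least (rather than strictly greater than) the threshold are all assumptions made for the example.

```python
def match_signal(seq, k):
    """Mark each position '+' if it equals the base k positions earlier, else '-'.

    The first k positions have nothing to compare against and are left blank.
    (Hypothetical helper, named for this illustration only.)
    """
    seq = seq.upper()
    return " " * k + "".join(
        "+" if seq[i] == seq[i - k] else "-" for i in range(k, len(seq))
    )


def find_uninterrupted_repeats(seq, k, min_length):
    """Return (start, end, motif) for each perfect period-k repeat.

    start is 0-based and end is exclusive. A repeat is reported when it
    spans at least min_length bases (assumed reading of the threshold).
    """
    signal = match_signal(seq, k)
    n = len(seq)
    repeats = []
    i = k
    while i < n:
        if signal[i] != "+":
            i += 1
            continue
        # Extend over the whole run of consecutive '+' marks.
        j = i
        while j < n and signal[j] == "+":
            j += 1
        # The run of '+' marks starts k bases after the repeat itself begins.
        start, end = i - k, j
        if end - start >= min_length:
            repeats.append((start, end, seq[start:start + k].upper()))
        i = j
    return repeats


print(match_signal("GGACACACACTT", 2))                               # "  --++++++--"
print(find_uninterrupted_repeats("GGACACACACTT", k=2, min_length=4))  # [(2, 10, 'AC')]
```

Note that this sketch does not filter sub-repeats, so a homopolymer run would also be reported when scanning with k=2. The interrupted-STR case (hamming distance greater than zero) is not covered here.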