comparison cmsearch.xml @ 3:2c2c5e5e495b draft

planemo upload for repository https://github.com/bgruening/galaxytools/tree/master/tools/infernal commit 9eeedfaf35c069d75014c5fb2e42046106bf813c-dirty
author bgruening
date Fri, 04 Mar 2016 07:24:53 -0500
parents fac157e22e1b
children 6e18e0b098cd
2:fac157e22e1b 3:2c2c5e5e495b
133 <option value="--incE">Use E-value</option> 133 <option value="--incE">Use E-value</option>
134 <option value="--incT">Use bit score</option> 134 <option value="--incT">Use bit score</option>
135 </param> 135 </param>
136 <when value=""/> 136 <when value=""/>
137 <when value="--incE"> 137 <when value="--incE">
138 <param name="incE" type="float" value="0.01" size="5" label="Use E-value" help="of &lt;= X as the hit inclusion threshold."> 138 <param name="incE" type="float" value="0.01" label="Use E-value" help="of &lt;= X as the hit inclusion threshold.">
139 <sanitizer> 139 <sanitizer>
140 <valid initial="string.printable"> 140 <valid initial="string.printable">
141 <remove value="&apos;"/> 141 <remove value="&apos;"/>
142 </valid> 142 </valid>
143 </sanitizer> 143 </sanitizer>
144 </param> 144 </param>
145 </when> 145 </when>
146 <when value="--incT"> 146 <when value="--incT">
147 <param name="incT" type="integer" size="5" value="0" label="Use bit score" help="of >= X as the hit inclusion threshold."> 147 <param name="incT" type="integer" value="0" label="Use bit score" help="of >= X as the hit inclusion threshold.">
148 <sanitizer> 148 <sanitizer>
149 <valid initial="string.printable"> 149 <valid initial="string.printable">
150 <remove value="&apos;"/> 150 <remove value="&apos;"/>
151 </valid> 151 </valid>
152 </sanitizer> 152 </sanitizer>
163 <option value="-E">Use E-value</option> 163 <option value="-E">Use E-value</option>
164 <option value="-T">Use bit score</option> 164 <option value="-T">Use bit score</option>
165 </param> 165 </param>
166 <when value=""/> 166 <when value=""/>
167 <when value="-E"> 167 <when value="-E">
168 <param name="E" type="float" value="10.0" size="5" label="Use E-value" help="of &lt;= X as the hit reporting threshold. The default is 10.0, meaning that on average, about 10 false positives will be reported per query, so you can see the top of the noise and decide for yourself if it’s really noise."> 168 <param name="E" type="float" value="10.0" label="Use E-value" help="of &lt;= X as the hit reporting threshold. The default is 10.0, meaning that on average, about 10 false positives will be reported per query, so you can see the top of the noise and decide for yourself if it’s really noise.">
169 <sanitizer> 169 <sanitizer>
170 <valid initial="string.printable"> 170 <valid initial="string.printable">
171 <remove value="&apos;"/> 171 <remove value="&apos;"/>
172 </valid> 172 </valid>
173 </sanitizer> 173 </sanitizer>
174 </param> 174 </param>
175 </when> 175 </when>
176 <when value="-T"> 176 <when value="-T">
177 <param name="T" type="integer" size="5" value="0" label="Use bit score" help="of >= X as the hit reporting threshold."> 177 <param name="T" type="integer" value="0" label="Use bit score" help="of >= X as the hit reporting threshold.">
178 <sanitizer> 178 <sanitizer>
179 <valid initial="string.printable"> 179 <valid initial="string.printable">
180 <remove value="&apos;"/> 180 <remove value="&apos;"/>
181 </valid> 181 </valid>
182 </sanitizer> 182 </sanitizer>
200 <![CDATA[ 200 <![CDATA[
201 201
202 202
203 **What it does** 203 **What it does**
204 204
205 Infernal is used to search sequence databases for homologs of structural RNA sequences, and to make 205 cmsearch belongs to the INFERNAL software package that allows you to make consensus RNA secondary structure profiles, and use them to search nucleic acid sequence databases for homologous RNAs, or to create new structure-based multiple sequence alignments.
206 sequence- and structure-based RNA sequence alignments. Infernal needs a profile from a structurally 206 You can use your model to search for new homologues of your RNA family. cmsearch is used to search one or more covariance models (CMs) against a sequence database. cmsearch searches both strands of each sequence in the target database, and returns alignments for high scoring hits.
207 annotated multiple sequence alignment of an RNA family with a position-specific scoring system for substitutions, 207
208 insertions, and deletions. Positions in the profile that are basepaired in the consensus secondary 208 To build CMs from multiple alignments, see cmbuild (build covariance models).
209 structure of the alignment are modeled as dependent on one another, allowing Infernal’s scoring system to 209
210 consider the secondary structure, in addition to the primary sequence, of the family being modeled. Infernal 210
211 profiles are probabilistic models called “covariance models”, a specialized type of stochastic context-free 211 **Input**
212 grammar (SCFG) (Lari and Young, 1990). 212
213 213 The CM query file must have been calibrated for E-values with cmcalibrate. As a special exception, CM query files that have zero basepairs need not be calibrated.
214 Compared to other alignment and database search tools based only on sequence comparison, Infernal 214
215 aims to be significantly more accurate and more able to detect remote homologs because it models sequence 215
216 and structure. 216 **Options**
217 217
218 218 - *Turn on the glocal alignment algorithm*: global with respect to the query model and local with respect to the target database. By default, the local alignment algorithm is used, which is local with respect to both the target sequence and the model. In local mode, the alignment may span two or more subsequences if necessary (e.g. if the structures of the query model and target sequence are only partially shared), allowing certain large insertions and deletions in the structure to be penalized differently than normal indels. Local mode performs better on empirical benchmarks and is significantly more sensitive for remote homology detection. Empirically, glocal searches return many fewer hits than local searches, so glocal may be desired for some applications. With *Turn on the glocal alignment algorithm*, all models must be calibrated, even those with zero basepairs.
219 Output format 219
220 ------------- 220 - *Only search the bottom (Crick) strand of target sequences*: Hits can occur on either the top (Watson) or bottom (Crick) strand of the target sequence. By default, both strands are searched.
221 221
222 (1) target name: The name of the target sequence or profile. 222 - *Only search the top (Watson) strand of target sequences*: Hits can occur on either the top (Watson) or bottom (Crick) strand of the target sequence. By default, both strands are searched.
223 (2) accession: The accession of the target sequence or profile, or ’-’ if none. 223
224 (3) query name: The name of the query sequence or profile. 224 - *Use the CYK algorithm, not Inside, to determine the final score of all hits*: If "yes" is selected, the CYK algorithm is used instead of the CM Inside algorithm (the SCFG analog of the HMM Forward algorithm).
225 (4) accession: The accession of the query sequence or profile, or ’-’ if none. 225
226 (5) mdl (model): Which type of model was used to compute the final score. Either ’cm’ or ’hmm’. A CM is used to compute the final hit scores unless the model has zero basepairs or the --hmmonly option is used, in which case a HMM will be used. 226 - *Use the CYK algorithm to align hits*: By default, the Durbin/Holmes optimal accuracy algorithm
227 (6) mdl from (model coord): The start of the alignment of this hit with respect to the profile (CM or HMM), numbered 1..N for a profile of N consensus positions. 227 is used, which finds the alignment that maximizes the expected accuracy of all aligned
228 (7) mdl to (model coord): The end of the alignment of this hit with respect to the profile (CM or HMM), numbered 1..N for a profile of N consensus positions. 228 residues.
229 (8) seq from (ali coord): The start of the alignment of this hit with respect to the sequence, numbered 1..L for a sequence of L residues. 229
230 (9) seq to (ali coord): The end of the alignment of this hit with respect to the sequence, numbered 1..L for a sequence of L residues. 230 - *Turn off truncated hit detection*: Turns off truncated hit detection and will reduce the running time most significantly for target files that include many short sequences.
231 (10) strand: The strand on which the hit occurs on the sequence. ’+’ if the hit is on the top (Watson) strand, ’-’ if the hit is on the bottom (Crick) strand. If on the top strand, the “seq from” value will be less than or equal to the “seq to” value, else it will be greater than or equal to it. 231
232 (11) trunc: Indicates if this is predicted to be a truncated CM hit or not. This will be “no” if it is a CM hit that is not predicted to be truncated by the end of the sequence, “5’” or “3’” if the hit is predicted to have one or more 5’ or 3’ residues missing due to an artificial truncation of the sequence, or “5’&3’” if the hit is predicted to have one or more 5’ residues missing and one or more 3’ residues missing. If the hit is an HMM hit, this will always be ’-’. 232 - *Turn off all filters, and run non-banded Inside on every full-length target sequence*: This
233 (12) pass: Indicates what “pass” of the pipeline the hit was detected on. This is probably only useful for testing and debugging. Non-truncated hits are found on the first pass, truncated hits are found on successive passes. 233 increases sensitivity somewhat, at an extremely large cost in speed.
234 (13) gc: Fraction of G and C nucleotides in the hit. 234
235 (14) bias: The biased-composition correction: the bit score difference contributed by the null3 model for CM hits, or the null2 model for HMM hits. High bias scores may be a red flag for a false positive. It is difficult to correct for all possible ways in which nonrandom but nonhomologous biological sequences can appear to be similar, such as short-period tandem repeats, so there are cases where the bias correction is not strong enough (creating false positives). 235 - *Turn off all HMM filter stages*: The CYK filter, using QDBs, will be run on every full-length target sequence and will enforce a P-value threshold of 0.0001. Each subsequence that survives CYK will be passed to Inside, which will also use QDBs (but a looser set). This increases sensitivity somewhat, at a very large cost in speed.
236 (15) score: The score (in bits) for this target/query comparison. It includes the biased-composition correction (the “null3” model for CM hits, or the “null2” model for HMM hits). 236
237 (16) E-value: The expectation value (statistical significance) of the target. This is a per query E-value; i.e. calculated as the expected number of false positives achieving this comparison’s score for a single query against the search space Z. For cmsearch Z is defined as the total number of nucleotides in the target dataset multiplied by 2 because both strands are searched. For cmscan Z is the total number of nucleotides in the query sequence multiplied by 2 because both strands are searched and multiplied by the number of models in the target database. If you search with multiple queries and if you want to control the overall false positive rate of that search rather than the false positive rate per query, you will want to multiply this per-query E-value by how many queries you’re doing. 237 - *Turn off the HMM SSV and Viterbi filter stages*: Sets remaining HMM filter
238 (17) inc: Indicates whether or not this hit achieves the inclusion threshold: ’!’ if it does, ’?’ if it does not (and rather only achieves the reporting threshold). By default, the inclusion threshold is an E-value of 0.01 and the reporting threshold is an E-value of 10.0, but these can be changed with command line options as described in the manual pages. 238 thresholds to 0.02 by default. This may increase sensitivity, at a significant cost in speed.
239 (18) description of target: The remainder of the line is the target’s description line, as free text. 239
240 240 - *Inclusion thresholds*: *Use E-value* - Use an E-value as the hit inclusion threshold. The default is 0.01, meaning that on average, about 1 false positive would be expected in every 100 searches with different
241 241 query sequences. *Use Bit Score* - Instead of using E-values for setting the inclusion threshold, use a bit score as the hit inclusion threshold. By default this option is unset.
242 For further questions please refer to the Infernal Userguide_. 242
243 243
244 .. _Userguide: http://selab.janelia.org/software/infernal/Userguide.pdf 244 **Output Options**
245 245
246 246 - *reporting thresholds*: Hits are ranked by statistical significance (E-value). By *default*, all hits with an E-value <= 10 are reported. The following options allow you to change the default *E-value* reporting thresholds, or to use *bit score* thresholds instead.
247 How do I cite Infernal? 247
248 ----------------------- 248
249 249 Output Example:
250 The recommended citation for using Infernal 1.1 is E. P. Nawrocki and S. R. Eddy, Infernal 1.1: 100-fold faster RNA homology searches , Bioinformatics 29:2933-2935 (2013). 250
251 251
252 **Galaxy Wrapper Author**:: 252 # cmsearch :: search CM(s) against a sequence database
253 253 # INFERNAL 1.1.1 (July 2014)
254 * Bjoern Gruening, University of Freiburg 254 # Copyright (C) 2014 Howard Hughes Medical Institute.
255 # Freely distributed under the GNU General Public License (GPLv3).
256 # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
257 # query CM file: tRNA5.cm
258 # target sequence database: tutorial/mrum-genome.fa
259 # number of worker threads: 8
260 # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
261
262
263 The second section is a list of ranked top hits (sorted by E-value, most significant hit first):
264
265 rank E-value score bias sequence start end mdl trunc gc description
266 ---- --------- ------ ----- ----------- ------- ------- --- ----- ---- -----------
267 (1) ! 1.3e-18 71.5 0.0 NC_013790.1 362026 361955 - cm no 0.50 Methanobrevibacter ruminantium M1
268 (2) ! 3.3e-18 70.2 0.0 NC_013790.1 2585265 2585193 - cm no 0.60 Methanobrevibacter ruminantium M1
269
270
271
272 For further questions please refer to the Infernal `Userguide <http://selab.janelia.org/software/infernal/Userguide.pdf>`_.
255 273
256 ]]> 274 ]]>
257 </help> 275 </help>
276 <citations>
277 <citation type="doi">10.1093/bioinformatics/btt509</citation>
278 <citation type="bibtex">
279 @ARTICLE{bgruening_galaxytools,
280 Author = {Björn Grüning and Cameron Smith and Torsten Houwaart and Nicola Soranzo and Eric Rasche},
281 keywords = {bioinformatics, ngs, galaxy, cheminformatics, rna},
282 title = {{Galaxy Tools - A collection of bioinformatics and cheminformatics tools for the Galaxy environment}},
283 url = {https://github.com/bgruening/galaxytools}
284 }
285 </citation>
286 </citations>
287
288
258 </tool> 289 </tool>
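
The ranked hit list shown in the new help text's Output Example could be read programmatically. The following is a minimal sketch, not part of the wrapper or of Infernal: it assumes each hit line is whitespace-separated in the order shown in the example (rank, inclusion flag, E-value, score, bias, sequence, start, end, strand, mdl, trunc, gc, description). The `Hit` type and `parse_hit_line` function are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One row of the ranked hit list (hypothetical container)."""
    rank: int
    included: bool      # True for '!' (meets inclusion threshold), False for '?'
    evalue: float
    score: float
    bias: float
    sequence: str
    start: int
    end: int
    strand: str         # '+' or '-'
    model: str          # 'cm' or 'hmm'
    trunc: str
    gc: float
    description: str


def parse_hit_line(line: str) -> Hit:
    """Parse one hit line, assuming the column order of the example above."""
    # Split at most 12 times so the free-text description stays in one piece.
    fields = line.split(None, 12)
    if len(fields) < 12:
        raise ValueError(f"not a hit line: {line!r}")
    description = fields[12].strip() if len(fields) > 12 else ""
    return Hit(
        rank=int(fields[0].strip("()")),
        included=(fields[1] == "!"),
        evalue=float(fields[2]),
        score=float(fields[3]),
        bias=float(fields[4]),
        sequence=fields[5],
        start=int(fields[6]),
        end=int(fields[7]),
        strand=fields[8],
        model=fields[9],
        trunc=fields[10],
        gc=float(fields[11]),
        description=description,
    )


if __name__ == "__main__":
    example = ("(1) ! 1.3e-18 71.5 0.0 NC_013790.1 362026 361955 - cm no 0.50 "
               "Methanobrevibacter ruminantium M1")
    print(parse_hit_line(example))
```

On a minus-strand hit such as the first example row, `start` is greater than `end`, which matches the strand column; a real parser would also need to skip the `#` header lines and the column-header rows.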