comparison macros.xml @ 0:fb0ce7937a85 draft

"planemo upload commit 50c5525c05d834545335e0273352b1aff79e5702"
author diodupima
date Thu, 15 Jul 2021 16:51:26 +0000
parents
children 6a33bad2db7f
comparison
-1:000000000000 0:fb0ce7937a85
1 <macros>
2 <token name="@TOOL_VERSION@">0.1.2</token>
3 <xml name="requirements">
4 <requirement type="package" version="0.1.2">coast</requirement>
5 </xml>
6 <xml name="citations_coast">
7 <citation type="bibtex">@misc{noauthor_coast_nodate,
8 title = {{COAST} - {Comparative} {Omic} {Alignment} {Search} {Tool}},
9 url = {https://gitlab.com/coast_tool/COAST},
10 abstract = {Alignment search tool that identifies similar proteomes},
11 language = {en},
12 urldate = {2021-06-22},
13 }
14 </citation>
15 </xml>
16 <xml name="citations_taxonkit">
17 <citation type="bibtex">@article{shen_taxonkit_2021,
18 abstract = {The National Center for Biotechnology Information (NCBI) Taxonomy is widely applied in biomedical and ecological studies. Typical demands include querying taxonomy identifier (TaxIds) by taxonomy names, querying complete taxonomic lineages by TaxIds, listing descendants of given TaxIds, and others. However, existed tools are either limited in functionalities or inefficient in terms of runtime. In this work, we present TaxonKit, a command-line toolkit for comprehensive and efficient manipulation of NCBI Taxonomy data. TaxonKit comprises seven core subcommands providing functions, including TaxIds querying, listing, filtering, lineage retrieving and reformatting, lowest common ancestor computation, and TaxIds change tracking. The practical functions, competitive processing performance, scalability with different scales of datasets and good accessibility could facilitate taxonomy data manipulations. TaxonKit provides free access under the permissive MIT license on GitHub, Brewsci, and Bioconda. The documents are also available at https://bioinf.shenwei.me/taxonkit/.},
19 author = {Shen, Wei and Ren, Hong},
20 doi = {10.1016/j.jgg.2021.03.006},
21 file = {ScienceDirect Snapshot:/home/dm/Zotero/storage/Q3KYT6QS/S1673852721000837.html:text/html},
22 issn = {1673-8527},
23 journal = {Journal of Genetics and Genomics},
24 keywords = {Lineage; NCBI Taxonomy; TaxId; TaxId changelog; TaxonKit},
25 language = {en},
26 month = apr,
27 shorttitle = {{TaxonKit}},
28 title = {{TaxonKit}: {A} practical and efficient {NCBI} taxonomy toolkit},
29 url = {https://www.sciencedirect.com/science/article/pii/S1673852721000837},
30 urldate = {2021-06-21},
31 year = {2021}
32 }
33 </citation>
34 </xml>
35 <xml name="citations_diamond">
36 <citation type="bibtex">@article{buchfink_sensitive_2021,
37 title = {Sensitive protein alignments at tree-of-life scale using {DIAMOND}},
38 volume = {18},
39 issn = {1548-7091, 1548-7105},
40 url = {http://www.nature.com/articles/s41592-021-01101-x},
41 doi = {10.1038/s41592-021-01101-x},
42 abstract = {Abstract
43 We are at the beginning of a genomic revolution in which all known species are planned to be sequenced. Accessing such data for comparative analyses is crucial in this new age of data-driven biology. Here, we introduce an improved version of DIAMOND that greatly exceeds previous search performances and harnesses supercomputing to perform tree-of-life scale protein alignments in hours, while matching the sensitivity of the gold standard BLASTP.},
44 language = {en},
45 number = {4},
46 urldate = {2021-04-14},
47 journal = {Nature Methods},
48 author = {Buchfink, Benjamin and Reuter, Klaus and Drost, Hajk-Georg},
49 month = apr,
50 year = {2021},
51 pages = {366--368},
52 file = {Full Text:/home/dm/Zotero/storage/6HKCWF6S/Buchfink et al. - 2021 - Sensitive protein alignments at tree-of-life scale.pdf:application/pdf},
53 }
54 </citation>
55 </xml>
56 <xml name="citations_blast">
57 <citation type="bibtex">@article{camacho_blast_2009,
58 title = {{BLAST}+: architecture and applications},
59 volume = {10},
60 issn = {1471-2105},
61 shorttitle = {{BLAST}+},
62 url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2803857/},
63 doi = {10.1186/1471-2105-10-421},
64 abstract = {Background
65 Sequence similarity searching is a very important bioinformatics task. While Basic Local Alignment Search Tool (BLAST) outperforms exact methods through its use of heuristics, the speed of the current BLAST software is suboptimal for very long queries or database sequences. There are also some shortcomings in the user-interface of the current command-line applications.
66
67 Results
68 We describe features and improvements of rewritten BLAST software and introduce new command-line applications. Long query sequences are broken into chunks for processing, in some cases leading to dramatically shorter run times. For long database sequences, it is possible to retrieve only the relevant parts of the sequence, reducing CPU time and memory usage for searches of short queries against databases of contigs or chromosomes. The program can now retrieve masking information for database sequences from the BLAST databases. A new modular software library can now access subject sequence data from arbitrary data sources. We introduce several new features, including strategy files that allow a user to save and reuse their favorite set of options. The strategy files can be uploaded to and downloaded from the NCBI BLAST web site.
69
70 Conclusion
71 The new BLAST command-line applications, compared to the current BLAST tools, demonstrate substantial speed improvements for long queries as well as chromosome length database sequences. We have also improved the user interface of the command-line applications.},
72 urldate = {2021-04-14},
73 journal = {BMC Bioinformatics},
74 author = {Camacho, Christiam and Coulouris, George and Avagyan, Vahram and Ma, Ning and Papadopoulos, Jason and Bealer, Kevin and Madden, Thomas L},
75 month = dec,
76 year = {2009},
77 pmid = {20003500},
78 pmcid = {PMC2803857},
79 pages = {421},
80 file = {PubMed Central Full Text PDF:/home/dm/Zotero/storage/5FCYSMW5/Camacho et al. - 2009 - BLAST+ architecture and applications.pdf:application/pdf},
81 }
82 </citation>
83 </xml>
84 <xml name="input_query">
85 <conditional name="query_type">
86 <param name="source" type="select" label="Select the type of input file">
87 <option value="coast_gb">COAST from GenBank</option>
88 <option value="coast_fa">COAST from FASTA</option>
89 </param>
90 <when value="coast_gb">
91 <param name="query_file" type="data" format="GenBank" label="Load a query proteome in GenBank format"/>
92 <param name="query_key" type="select" label="Select the GenBank feature to use as the protein source. Choose the one that avoids duplicated proteins.">
93 <option value="CDS" selected="true">CDS</option>
94 <option value="product">product</option>
95 </param>
96 </when>
97 <when value="coast_fa">
98 <param name="query_file" type="data" format="FASTA" label="Load a query proteome in FASTA format"/>
99 </when>
100 </conditional>
101 </xml>
102 <token name="@QUERY@"><![CDATA[
103 "$query_type.query_file"
104 ]]></token>
105 <token name="@QUERY_KEYWORDS@"><![CDATA[
106 #if $query_type.source == 'coast_gb'
107 --keywords '$query_type.query_key'
108 #end if
109 ]]></token>
110
111 <xml name="protein_db">
112 <param name="db" type="select" optional="false" label="BLAST-Ready protein sequences database.">
113 <options from_data_table="blastdb" />
114 </param>
115 </xml>
116 <token name="@DB@"><![CDATA[
117 "$db"
118 ]]></token>
119
120 <xml name="protein_db_diamond">
121 <param name="db" type="select" optional="false" label="Diamond protein sequences database.">
122 <options from_data_table="diamond_database" />
123 </param>
124 </xml>
125
126 <xml name="output_format">
127 <param name="outfmt" type="select" optional="true" multiple="true" display="checkboxes" label="Select outputs">
128 <option value="b" selected="true">Best-hits tabular file</option>
129 <option value="a" selected="true">Results tabular file</option>
130 <option value="r" selected="true">Summarized Report</option>
131 </param>
132 </xml>
133 <token name="@OUTPUT_FORMAT@"><![CDATA[
134 #if $outfmt
135 --outfmt
136 #for $format in $outfmt
137 '${format}'
138 #end for
139 #end if
140 ]]></token>
141 <token name="@OUTPUT@"><![CDATA[
142 --quiet
143 ]]></token>
144
145 <xml name="aai_filter">
146 <param name="aai" type="integer" value="10" label="AAIc filtering score">
147 <validator type="in_range" min="0" max="100" message="Value not in the permitted range. Only values from 0 to 100 are allowed."/>
148 </param>
149 <param name="min_cov" type="integer" value="50" label="Minimum Coverage for AAIbd hit selection">
150 <validator type="in_range" min="0" max="100" message="Value not in the permitted range. Only values from 0 to 100 are allowed."/>
151 </param>
152 <param name="min_id" type="integer" value="40" label="Minimum Amino Acid Identity for AAIbd hit selection">
153 <validator type="in_range" min="0" max="100" message="Value not in the permitted range. Only values from 0 to 100 are allowed."/>
154 </param>
155 </xml>
156 <token name="@AAI_FILTER@"><![CDATA[
157 --aai '$aai'
158 --cov '$min_cov'
159 --id '$min_id'
160 ]]></token>
161
162 <xml name="hypothetical_filter">
163 <param name="hypothetical" type="boolean" checked="false" label="Filter hypothetical proteins from query. Read description for more information." truevalue="--filter_hypothetical" falsevalue=""/>
164 </xml>
165 <token name="@HYPO_FILTER@"><![CDATA[
166 #if $hypothetical
167 '$hypothetical'
168 #end if
169 ]]></token>
170
171 <xml name="results_alignment">
172 <data format_source="tabular" format="tabular" name="blast_results" label="COAST - Batch alignment results" from_work_dir="blast_results.tab"/>
173 </xml>
174
175 <xml name="results_report">
176 <data format_source="html" format="html" name="coast_report" label="COAST - Summarized report" from_work_dir="coast_report.html">
177 <filter>"r" in outfmt</filter>
178 </data>
179 <data format_source="tabular" format="tabular" label="COAST - Best-hits table" name="bh_results" from_work_dir="bh_results.tab">
180 <filter>"b" in outfmt</filter>
181 </data>
182 <data format_source="tabular" format="tabular" label="COAST - Results table" name="coast_results" from_work_dir="coast_results.tab">
183 <filter>"a" in outfmt</filter>
184 </data>
185 </xml>
186
187 <xml name="blast_taxon_filter">
188 <conditional name="filter_type">
189 <param name="taxon_filter_type" type="select" label="Type of taxonomic filter">
190 <option value="taxidlist_dm">Pre-defined taxonomic filters</option>
191 <option value="taxidlist_user">User-provided file based list</option>
192 <option value="taxonlist">Comma separated list</option>
193 </param>
194 <when value="taxidlist_dm">
195 <param name="taxidlist" type="select" optional="true" label="Select pre-defined taxonomic filters">
196 <options from_data_table="coast_taxonomic_filters" />
197 </param>
198 </when>
199 <when value="taxidlist_user">
200 <param name="taxidlist" type="data" format="txt" optional="true" label="Load file with filtering taxids."/>
201 </when>
202 <when value="taxonlist">
203 <param name="taxonlist" type="text" optional="true" label="Comma-separated list of TAXID nodes (rank species or lower)"/>
204 </when>
205 </conditional>
206 </xml>
207 <token name="@BLAST_TAX_FILTER@"><![CDATA[
208 #if $filter_type.taxon_filter_type == "taxidlist_dm"
209 --taxidlist '$filter_type.taxidlist.fields.path'
210 #end if
211 #if $filter_type.taxon_filter_type == "taxidlist_user"
212 --taxidlist '$filter_type.taxidlist'
213 #end if
214 #if $filter_type.taxon_filter_type == "taxonlist"
215 --taxonlist '$filter_type.taxonlist'
216 #end if
217 ]]></token>
218
219 <xml name="diamond_taxon_filter">
220 <conditional name="filter_type">
221 <param name="taxon_filter_type" type="select" label="Type of taxonomic filter">
222 <option value="taxonlist_pre_defined">Pre-defined taxonomic filters</option>
223 <option value="taxonlist">Comma separated list</option>
224 </param>
225 <when value="taxonlist_pre_defined">
226 <param name="taxonlist" type="select" optional="true" label="Select pre-defined taxonomic filters">
227 <option value="10239">Viruses - 10239</option>
228 <option value="2157">Archaea - 2157</option>
229 <option value="2">Bacteria - 2</option>
230 </param>
231 </when>
232 <when value="taxonlist">
233 <param name="taxonlist" type="text" optional="true" label="Comma-separated list of TAXID nodes (rank species or lower)"/>
234 </when>
235 </conditional>
236 </xml>
237 <token name="@DIAMOND_TAX_FILTER@"><![CDATA[
238 #if $filter_type.taxonlist
239 --taxonlist '$filter_type.taxonlist'
240 #end if
241 ]]></token>
242
243 <xml name="generic_aln_options">
244 <param name="threshold_no" type="float" size="15" value="0.001" optional="true" label="E-Value Threshold"/>
245 <param name="scoring_matrix" type="select" optional="true" label="Scoring matrix">
246 <option value="BLOSUM45">BLOSUM45</option>
247 <option value="BLOSUM50">BLOSUM50</option>
248 <option value="BLOSUM62">BLOSUM62</option>
249 <option value="BLOSUM80">BLOSUM80</option>
250 <option value="BLOSUM90">BLOSUM90</option>
251 <option value="PAM250">PAM250</option>
252 <option value="PAM70">PAM70</option>
253 <option value="PAM30">PAM30</option>
254 </param>
255 <param name="gap_open" type="integer" optional="true" label="Gap opening penalty">
256 <validator type="in_range" min="0" max="50" message="Value not in the permitted range. Only values from 0 to 50 are allowed."/>
257 </param>
258 <param name="gap_ext" type="integer" optional="true" label="Gap extension penalty">
259 <validator type="in_range" min="0" max="50" message="Value not in the permitted range. Only values from 0 to 50 are allowed."/>
260 </param>
261 </xml>
262 <token name="@GENERIC_ALN_OPTIONS@"><![CDATA[
263 #if $aln_adv.scoring_matrix
264 --matrix '$aln_adv.scoring_matrix'
265 #end if
266 #if $aln_adv.threshold_no
267 --evalue '$aln_adv.threshold_no'
268 #end if
269 #if str($aln_adv.gap_open) != ''
270 --gapopen '$aln_adv.gap_open'
271 #end if
272 #if str($aln_adv.gap_ext) != ''
273 --gapextend '$aln_adv.gap_ext'
274 #end if
275 ]]></token>
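<!--
Note (not part of the original file): the token above reads its values as
$aln_adv.<name>, so the "generic_aln_options" macro is presumably expanded inside
a section or conditional named "aln_adv" in the tool wrapper. A minimal sketch,
assuming a section; the title text is hypothetical:

<section name="aln_adv" title="Advanced alignment options" expanded="false">
    <expand macro="generic_aln_options"/>
</section>
-->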
276
277 <xml name="blast_aln_options">
278 <param name="task" type="select" optional="true" label="Type of BLAST">
279 <option value="blast">blast</option>
280 <option value="blastp-fast">blastp-fast</option>
281 <option value="blastp-short">blastp-short</option>
282 </param>
283 </xml>
284 <token name="@BLAST_ALN_OPTIONS@"><![CDATA[
285 #if $aln_adv.task
286 --task '$aln_adv.task'
287 #end if
288 ]]></token>
289
290 <xml name="diamond_aln_options">
291 <param name="diamond_sens" type="select" label="Select the desired sensitivity">
292 <option value="sensitive" selected="true">sensitive</option>
293 <option value="more-sensitive">more sensitive</option>
294 <option value="very-sensitive">very sensitive</option>
295 <option value="ultra-sensitive">ultra sensitive</option>
296 </param>
297 </xml>
298 <token name="@DIAMOND_ALN_OPTIONS@"><![CDATA[
299 #if $aln_adv.diamond_sens
300 --sens '$aln_adv.diamond_sens'
301 #end if
302 ]]></token>
303
304 <xml name="merlin_db_selection">
305 <param name="db" type="select" label="Select the desired database">
306 <option value="UniProtKB_SwissProt">SwissProt</option>
307 <option value="UniProtKB_Trembl">Trembl</option>
308 </param>
309 </xml>
310 <token name="@TIME_WARNING@"><![CDATA[
311 .. class:: warningmark
312
313 **WARNING** Proteome-wide search time depends on the size of the query proteome and of the database, so queries can be slow.
314 Use taxonomic filters to reduce search time significantly.
315
316 ]]></token>
317 <token name="@GENERAL_DESC@"><![CDATA[
318
319 COAST is a tool designed to identify proteomes close to a user-provided query, particularly for viruses, using conventional alignment tools.
320 The close proteomes are reported at the level of NCBI Taxonomy nodes. For more information, visit https://coast-tools.readthedocs.io
321
322 ]]></token>
323 <token name="@AAI_DESC@"><![CDATA[
324
325 Indices and Metrics
326 ___________________
327
328 **AAIc - Average Amino Acid Identity coast**
329
330 The AAIc is an attempt to adapt the AAI into a measure that compares proteomes across all annotated proteins.
331 Low-identity hits, which the traditional method usually removes, are taken into account.
332 Proteins with no match at all are also counted, as matches with 0 identity.
333 This makes it possible to compare the actual annotation and to select organisms, even taxonomically distant ones, with proteins that could be
334 relevant, for example, for determining the function of hypothetical proteins.
335 For this metric, the best hit is the one with the highest identity.
336
337 **AAIbd - Average Amino Acid Identity blast-diamond**
338
339 The AAIbd is an implementation of a calculation similar to that of the original
340 AAI, but computed in one direction only. By default it uses a minimum coverage and identity
341 of 50 and 40 respectively. These values are also used by EzAAI, based on the recent study
342 by Nicholson et al. in 2020. The best hit is then selected by the
343 highest identity.
344 The main purpose of this metric is to give the user an
345 estimate of how taxonomically close a given taxonomic node might be. The designation **bd** is used
346 to distinguish it from the original AAIb: it indicates that the score may be
347 produced from either BLAST or DIAMOND results.
348
349 The following options might be used to calibrate this selection to the user's preference:
350
351 - Minimum Identity: Minimum Amino Acid Identity, for hit selection for the AAIbd calculation;
352 - Minimum Coverage: Minimum coverage, for hit selection for the AAIbd calculation.
353
354 **HITSPP - Hits Per Protein**
355
356 The score is the total number of hits obtained by all query proteins divided by the number of proteins in the query
357 proteome.
358 It helps the user understand how well the proteome’s proteins are represented in that particular database.
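
As an illustration with made-up numbers (not taken from a real run)::

    total hits for all query proteins   = 1200
    proteins in the query proteome      = 40
    HITSPP = 1200 / 40                  = 30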
359
360 .. class:: warningmark
361
362 **WARNING** Very high values, above 100, might indicate that the taxonomic node is heavily represented in the database.
363 Intermediate steps only handle up to 500 hits per protein before best-hit selection.
364 As such, a small number of organisms with very high HITSPP scores can reduce the number of organisms returned.
365
366 ]]></token>
367 <token name="@OUT_DESC@"><![CDATA[
368
369 Outputs
370 _______
371
372 **Batch alignment results** This output is always produced. It contains all alignment search results for all proteins in the proteome. It can also be used to generate new outputs with the COAST Report tool, using different parameters.
373
374 **Summarized report** An HTML document that contains a list of filtered results ordered by AAIc. The report includes a heatmap visualization of protein identities.
375 It also contains metadata for the COAST job.
376
377 **Best-hits table** Tabular file with the best hit selected for each protein in the proteome. These are the hits used for the AAIc calculation.
378
379 **Results table** Tabular file with aggregated metrics for each proteome match, aggregated by taxid.
380
381 ]]></token>
382 <token name="@TAX_FILTER_WARNING@"><![CDATA[
383
384 Taxonomic Filtering
385 ___________________
386
387 Taxonomy-based filtering is available for both BLAST and DIAMOND. It is **THE** key to short COAST run times on large databases.
388
389 Most organisms in a database, like nr or Trembl, are not useful in the close proteome identification process.
390 When users try, for example, to identify similar viruses, the bacteria and eukaryotes in the same database only slow the search down.
391 You should decide how wide the search should be and identify the corresponding TAXID node.
392 Some of these filters are provided along with this tool.
393
394 ]]></token>
395 <token name="@HYPO_FILTER_WARNING@"><![CDATA[
396 .. class:: warningmark
397
398 **WARNING - Experimental feature** Hypothetical protein filtering can lead to worse results. It should only be used when few of the proteins have corresponding best hits and the database might lack poorly studied proteins.
399
400 ]]></token>
401
402 </macros>
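<!--
Usage sketch (not part of the original file). A tool wrapper would typically pull
these macros in with <import> and reference them with <expand> and @TOKEN@ names.
The tool id, tool name, executable name and argument order below are hypothetical;
only the macro and token names come from this file.

<tool id="coast_blast_example" name="COAST (example)" version="@TOOL_VERSION@">
    <macros>
        <import>macros.xml</import>
    </macros>
    <expand macro="requirements"/>
    <command><![CDATA[
        coast_example_executable
        @QUERY@ @QUERY_KEYWORDS@
        @DB@
        @AAI_FILTER@ @HYPO_FILTER@
        @OUTPUT_FORMAT@ @OUTPUT@
    ]]></command>
    <inputs>
        <expand macro="input_query"/>
        <expand macro="protein_db"/>
        <expand macro="aai_filter"/>
        <expand macro="hypothetical_filter"/>
        <expand macro="output_format"/>
    </inputs>
    <outputs>
        <expand macro="results_alignment"/>
        <expand macro="results_report"/>
    </outputs>
    <help><![CDATA[
@GENERAL_DESC@
@TIME_WARNING@
@AAI_DESC@
@OUT_DESC@
    ]]></help>
    <citations>
        <expand macro="citations_coast"/>
        <expand macro="citations_blast"/>
    </citations>
</tool>

Keeping parameters, command fragments and help text in one macros file lets
several wrappers (for example a BLAST-based and a DIAMOND-based one) share them.
-->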