<tool name="mashmap" id="mashmap" version="3.1.3" profile="22.05">
  <!--Source in git at: https://github.com/fubar2/galaxy_tf_overlay-->
  <!--Created by toolfactory@galaxy.org at 23/02/2024 21:34:16 using the Galaxy Tool Factory.-->
  <description>Fast local alignment boundaries</description>
  <requirements>
    <requirement version="3.1.3" type="package">mashmap</requirement>
  </requirements>
  <version_command><![CDATA[echo "3.1.3"]]></version_command>
  <command><![CDATA[bash '$runme']]></command>
  <!--The runme script is rendered by Cheetah: several references are written one path per line to a local file named reflist and passed with the rl option; a single reference is passed directly with the r option.-->
  <configfiles>
    <configfile name="runme"><![CDATA[#if len($reflist) > 1:
#for i, mash in enumerate($reflist):
#if i == 0:
echo '$mash' > 'reflist' &&
#else:
echo '$mash' >> 'reflist' &&
#end if
#end for
#end if
mashmap --pi '$perc_identity' -s '$seqLength' -f '$filtermode' $dense \
#if int($sketchSize) > 0:
-J '$sketchSize' \
#end if
#if len($reflist) == 1:
-r '$reflist' -q '$query' &&
#else:
--rl 'reflist' -q '$query' &&
#end if
cp 'mashmap.out' '$mashout']]></configfile>
  </configfiles>
  <inputs>
    <param name="query" type="data" optional="false" label="Query sequences (as fasta) to mash against the references supplied below" help="" format="fasta" multiple="false"/>
    <param name="reflist" type="data" optional="false" label="Reference or references to mash the query sequences on" help="Choose one or more reference sequences to mash the query sequences against." format="fasta" multiple="true"/>
    <param name="perc_identity" type="float" value="85.0" label="Identity threshold" help="Defaults to 85, so mappings with 85% or greater identity are reported. For example, set it to 80 to account for noisier long-read datasets, or to 95 for mapping a human genome assembly to the human reference."/>
    <param name="seqLength" type="integer" value="5000" label="Minimum segment length" help="Default is 5,000 bp. Sequences below this length are ignored. Mashmap provides guarantees on reporting local alignments of length twice this value."/>
    <param name="sketchSize" type="integer" value="0" label="Sketch size - leave at 0 for automatic setting based on the identity threshold" help="This parameter sets the seed density of the winnowing scheme, guaranteeing that the minhash will be calculated from a sample of sketchSize k-mers for each segment. It is set automatically based on --pi but can be set manually as well."/>
    <param name="dense" type="select" label="Dense sketching" help="This flag will increase the seed density substantially, resulting in a density of roughly 0.02 * (1 + (1 - pi) / .05) where pi is the perc_identity threshold. This leads to longer runtimes and higher RAM usage, but significantly more accurate estimates of ANI.">
      <option value="">No dense sketching</option>
      <option value="--dense">Dense sketching</option>
    </param>
    <param name="filtermode" type="select" label="Filter mode" help="Mashmap implements a plane-sweep based algorithm to perform the alignment filtering. Similar to delta-filter in nucmer, different filtering options are provided that are suitable for long read or assembly mapping. Option -f map is suitable for reporting the best mappings for long reads, whereas -f one-to-one is suitable for reporting orthologous mappings among all computed assembly to genome mappings.">
      <option value="map">map - best mapping for long reads</option>
      <option value="one-to-one">one-to-one - orthologous mappings for assembly to genome mapping</option>
      <option value="none">none - no filtering</option>
    </param>
  </inputs>
  <outputs>
    <data name="mashout" format="paf" label="mashmap on $query.element_identifier" hidden="false"/>
  </outputs>
  <tests>
    <test>
      <output name="mashout" value="mashout_sample" compare="diff" lines_diff="0"/>
      <param name="query" value="query_sample"/>
      <param name="reflist" value="reflist_sample"/>
      <param name="perc_identity" value="85.0"/>
      <param name="seqLength" value="5000"/>
      <param name="sketchSize" value="0"/>
      <param name="dense" value=""/>
      <param name="filtermode" value="map"/>
    </test>
  </tests>
  <help><![CDATA[
*MashMap* implements a fast and approximate algorithm for computing local alignment boundaries between long DNA sequences. It can be useful for mapping a genome assembly or long reads (PacBio/ONT) to reference genome(s). Given a minimum alignment length and an identity threshold for the desired local alignments, Mashmap computes alignment boundaries and identity estimates using k-mers. It does not compute the alignments explicitly, but rather estimates an unbiased k-mer based Jaccard similarity using a combination of minmers (a novel winnowing scheme) and MinHash. This is then converted to an estimate of sequence identity using the Mash distance. An appropriate k-mer sampling rate is automatically determined using the given minimum local alignment length and identity thresholds.

As an example, Mashmap can map a human genome assembly to the human reference genome in about one minute total execution time and < 4 GB memory using just 8 CPU threads, achieving more than an order of magnitude improvement in both runtime and memory over alternative methods. We describe the algorithms associated with Mashmap, and report on speed, scalability, and accuracy of the software in the publications listed below. Unlike traditional mappers, MashMap does not compute exact sequence alignments. In the future, we plan to add optional alignment support to generate base-to-base alignments.

Map a set of query sequences against a reference genome:

mashmap -r reference.fna -q query.fa

The output is a paf format file (https://github.com/lh3/miniasm/blob/master/PAF.md).
This is space-delimited, with each line consisting of query name, length, 0-based start, end, strand, target name, length, start, end and mapping nucleotide identity.

Map a set of query sequences against a list of reference genomes:

mashmap --rl referenceList.txt -q query.fa

The file 'referenceList.txt' should contain the paths to the reference genomes, one per line.

Source code: https://github.com/marbl/MashMap
]]></help>
  <citations>
    <citation type="doi">10.1093/bioinformatics/btad512</citation>
    <citation type="doi">10.1093/bioinformatics/bts573</citation>
  </citations>
</tool>