--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/fgbio_group_reads_by_umi.xml	Sun Feb 21 23:40:34 2021 +0000
@@ -0,0 +1,108 @@
+<tool id="fgbio_group_reads_by_umi" name="fgbio GroupReadsByUmi" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" python_template_version="3.5">
+    <description>Groups reads together that appear to have come from the same original molecule</description>
+    <macros>
+        <import>macros.xml</import>
+    </macros>
+    <expand macro="requirements" />
+    <version_command>fgbio --version</version_command>
+    <command detect_errors="exit_code"><![CDATA[
+        fgbio GroupReadsByUmi
+        --input '$input'
+        --strategy=$strategy
+        --output '$output'
+        ## optional settings
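+        ## compare against the empty string so that an explicit 0 is still passed through
+        #if str($optional.edits) != '':
+            --edits $optional.edits
+        #end if
+        #if str($optional.min_umi_length) != '':
+            --min-umi-length $optional.min_umi_length
+        #end if
+        #if str($optional.min_map_q) != '':
+            --min-map-q $optional.min_map_q
+        #end if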
+        #if $optional.raw_tag
+            --raw-tag=$optional.raw_tag
+        #end if
+        #if $optional.assign_tag
+            --assign-tag=$optional.assign_tag
+        #end if
+        #if $optional.include_non_pf_reads
+            --include-non-pf-reads=$optional.include_non_pf_reads
+        #end if
+        #if $output_counts
+            --family-size-histogram='$family_size_histogram'
+        #end if
+    ]]></command>
+    <inputs>
+        <param name="input" type="data" format="bam" label="Input BAM file"/>
+        <param argument="--strategy" type="select" label="UMI assignment strategy">
+            <option value="identity">identity</option>
+            <option value="edit">edit</option>
+            <option value="adjacency">adjacency</option>
+            <option value="paired">paired</option>
+        </param>
+        <section name="optional" title="Optional Settings" expanded="false">
+            <param argument="--edits" type="integer" value="" optional="true" label="Allowable number of edits between UMIs"
+               help="Control the matching of non-identical UMIs. Default: 1"/>
+            <param argument="--min-umi-length" type="integer" value="" optional="true" label="The minimum UMI length" >
+                <help>If not specified then all UMIs must have the same length,
+                      otherwise discard reads with UMIs shorter than this length
+                      and allow for differing UMI lengths.
+                </help>
+            </param>
+            <param argument="--min-map-q" type="integer" value="" optional="true" label="Minimum mapping quality" help="Default: 30"/>
+            <param argument="--raw-tag" type="text" value="" optional="true" label="The tag containing the raw UMI" help="Default: RX">
+                <expand macro="sam_tag_validator"/>
+            </param>
+            <param argument="--assign-tag" type="text" value="" optional="true" label="The output tag for UMI grouping" help="Default: MI">
+                <expand macro="sam_tag_validator"/>
+            </param>
+            <param argument="--include-non-pf-reads" type="select" value="true" optional="true" label="Include non-PF reads">
+                <option value="true">Yes</option>
+                <option value="false">No</option>
+            </param>
+        </section>
+        <param name="output_counts" type="boolean" truevalue="true" falsevalue="false" checked="false" label="Output tag family size counts"/>
+
+    </inputs>
+    <outputs>
+        <data name="family_size_histogram" format="tabular" >
+            <filter>output_counts == True</filter>
+        </data>
+        <data name="output" format="bam" />
+    </outputs>
+    <help><![CDATA[
+**fgbio GroupReadsByUmi**
+
+Groups reads together that appear to have come from the same original molecule. Reads are grouped by template, and then templates are sorted by the 5’ mapping positions of the reads from the template, used from earliest mapping position to latest. Reads that have the same end positions are then sub-grouped by UMI sequence.
+
+Accepts reads in any order (including unsorted) and outputs reads sorted by:
+
+  - The lower genome coordinate of the two outer ends of the templates
+  - The sequencing library
+  - The assigned UMI tag
+  - Read Name
+
+Reads are aggressively filtered out so that only high quality reads/mappings are taken forward. Single-end reads must have mapping quality >= min-map-q. Paired-end reads must have both reads mapped to the same chromosome with both reads having mapping quality >= min-map-q. (Note: the MQ tag is required on reads with mapped mates).
+
+This is done with the expectation that the next step is building consensus reads, where it is undesirable to either:
+
+  - Assign reads together that are really from different source molecules
+  - Build two groups from reads that are really from the same molecule
+
+Errors in mapping reads could lead to both and therefore are minimized.
+
+Grouping of UMIs is performed by one of four strategies:
+
+  - identity: only reads with identical UMI sequences are grouped together. This strategy may be useful for evaluating data, but should generally be avoided as it will generate multiple UMI groups per original molecule in the presence of errors.
+  - edit: reads are clustered into groups such that each read within a group has at least one other read in the group with <= edits differences and there are no inter-group pairings with <= edits differences. Effective when there are small numbers of reads per UMI, but breaks down at very high coverage of UMIs.
+  - adjacency: a version of the directed adjacency method described in umi_tools that allows for errors between UMIs but only when there is a count gradient.
+  - paired: similar to adjacency but for methods that produce templates with a pair of UMIs such that a read with A-B is related to but not identical to a read with B-A. Expects the pair of UMIs to be stored in a single tag, separated by a hyphen (e.g. ACGT-CCGG). The molecular IDs produced have more structure than for single UMI strategies, and are of the form {base}/{AB|BA}. E.g. two UMI pairs would be mapped as follows: AAAA-GGGG -> 1/AB, GGGG-AAAA -> 1/BA.
+
+edit, adjacency and paired make use of the --edits parameter to control the matching of non-identical UMIs.
+
+By default, all UMIs must be the same length. If --min-umi-length=len is specified then reads that have a UMI shorter than len will be discarded, and when comparing UMIs of different lengths, the first len bases will be compared, where len is the length of the shortest UMI. The UMI length is the number of [ACGT] bases in the UMI (i.e. does not count dashes and other non-ACGT characters). This option is not implemented for reads with UMI pairs (i.e. using the paired assigner).
+
+http://fulcrumgenomics.github.io/fgbio/tools/latest/GroupReadsByUmi.html
+    ]]></help>
+    <expand macro="citations" />
+</tool>
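The command template above emits each optional flag only when its parameter has a value. As a rough illustration (not part of the changeset), the following Python sketch assembles the same argument list outside Galaxy. The function name `build_command` and the dictionary keys are hypothetical and simply mirror the parameter names in the wrapper; the flags themselves are the ones used in the template.

```python
import shlex


def build_command(input_bam, output_bam, strategy, optional=None, histogram=None):
    """Assemble an fgbio GroupReadsByUmi argument list, mirroring the template."""
    optional = optional or {}
    args = [
        "fgbio", "GroupReadsByUmi",
        "--input", input_bam,
        f"--strategy={strategy}",
        "--output", output_bam,
    ]
    # Flags the template writes as "--flag value".
    for key, flag in (
        ("edits", "--edits"),
        ("min_umi_length", "--min-umi-length"),
        ("min_map_q", "--min-map-q"),
    ):
        value = optional.get(key)
        # Skip unset values, but keep an explicit 0.
        if value is not None and value != "":
            args += [flag, str(value)]
    # Flags the template writes as "--flag=value".
    for key, flag in (
        ("raw_tag", "--raw-tag"),
        ("assign_tag", "--assign-tag"),
        ("include_non_pf_reads", "--include-non-pf-reads"),
    ):
        value = optional.get(key)
        if value is not None and value != "":
            args.append(f"{flag}={value}")
    # The histogram is only requested when an output path is given.
    if histogram:
        args.append(f"--family-size-histogram={histogram}")
    return args


if __name__ == "__main__":
    cmd = build_command(
        "in.bam", "out.bam", "adjacency",
        optional={"edits": 1, "min_map_q": 0},
        histogram="family_sizes.tsv",
    )
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps quoting out of the picture until the command is printed or handed to a process runner.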
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/macros.xml	Sun Feb 21 23:40:34 2021 +0000
@@ -0,0 +1,56 @@
+<macros>
+    <token name="@TOOL_VERSION@">1.3.0</token>
+    <token name="@VERSION_SUFFIX@">0</token>
+    <xml name="requirements">
+        <requirements>
+            <requirement type="package" version="@TOOL_VERSION@">fgbio</requirement>
+            <yield/>
+        </requirements>
+    </xml>
+    <token name="@READ_STRUCTURE_PATTERN@">(([1-9][0-9]*[TBMS])*([+]|[1-9][0-9]*)[TBMS])</token>
+    <token name="@READ_STRUCTURES_PATTERN@">@READ_STRUCTURE_PATTERN@(\s@READ_STRUCTURE_PATTERN@)*</token>
+    <xml name="read_structures_validator">
+            <validator type="regex" message="Invalid read structure: expected space-separated structures such as 5M5S+T +T +B">^@READ_STRUCTURES_PATTERN@$</validator>
+    </xml>
+    <xml name="sam_tag_validator">
+            <validator type="regex" message="A SAM tag must be two characters: a letter followed by a letter or digit">^[A-Za-z][A-Za-z0-9]$</validator>
+    </xml>
+    <xml name="sam_sort_order">
+        <param argument="--sort-order" type="select" optional="true" label="Sort BAM by">
+            <option value="Coordinate">Coordinate</option>
+            <option value="Queryname">Queryname</option>
+            <option value="Random">Random</option>
+            <option value="RandomQuery">RandomQuery</option>
+        </param>
+    </xml>
+
+    <token name="@READ_STRUCTURES_HELP@"><![CDATA[
+**Read Structures**
+
+Read structures are made up of <number><operator> pairs much like the CIGAR string in BAM files. Four kinds of operators are recognized:
+
+    T identifies a template read
+    B identifies a sample barcode read
+    M identifies a unique molecular index read
+    S identifies a set of bases that should be skipped or ignored
+
+The last <number><operator> pair may be specified using a + sign instead of a number to denote “all remaining bases”. This is useful if, e.g., fastqs have been trimmed and contain reads of varying length. For example, to convert a paired-end run with an index read, where the first 5 bases of R1 are a UMI and the next five bases are monotemplate, you might specify:
+
+::
+
+    --input r1.fq r2.fq i1.fq --read-structures 5M5S+T +T +B
+
+Alternatively, if you know your reads are of fixed length, you could specify:
+
+::
+
+    --input r1.fq r2.fq i1.fq --read-structures 5M5S65T 75T 8B
+
+
+]]></token>
+    <xml name="citations">
+        <citations>
+            <yield />
+        </citations>
+    </xml>
+</macros>
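To see what the `@READ_STRUCTURE_PATTERN@` and `@READ_STRUCTURES_PATTERN@` tokens accept, the regular expressions can be exercised directly. The sketch below copies the two patterns from macros.xml into Python and checks them against the example read structures from the help text. The helper name `is_valid_read_structures` is hypothetical, and this only approximates how Galaxy applies a regex validator.

```python
import re

# Copied from the @READ_STRUCTURE_PATTERN@ token in macros.xml.
READ_STRUCTURE = r"(([1-9][0-9]*[TBMS])*([+]|[1-9][0-9]*)[TBMS])"

# Equivalent of ^@READ_STRUCTURES_PATTERN@$ after token substitution:
# one read structure followed by any number of whitespace-separated ones.
READ_STRUCTURES = re.compile(rf"^{READ_STRUCTURE}(\s{READ_STRUCTURE})*$")


def is_valid_read_structures(text):
    """Return True if text matches the read-structures validator pattern."""
    return READ_STRUCTURES.match(text) is not None


if __name__ == "__main__":
    for candidate in ("5M5S+T +T +B", "5M5S65T 75T 8B", "0T", "+T5M"):
        print(candidate, is_valid_read_structures(candidate))
```

The first two candidates are the examples from the help text; the last two illustrate that a zero length and a `+` anywhere other than the final segment are rejected by the pattern.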