changeset 3:57b3ea22aff3 draft

Uploaded v0.1.0 which was already on the Test Tool Shed. Included Python 3 support.
author peterjc
date Tue, 11 Aug 2020 18:23:05 -0400
parents 7254ece0c0ff
children ed501717f6cd
files coverage_stats-1b5523d3d2c2/test-data/coverage_test.bam coverage_stats-1b5523d3d2c2/test-data/coverage_test.coverage_stats.tabular coverage_stats-1b5523d3d2c2/test-data/ex1.bam coverage_stats-1b5523d3d2c2/test-data/ex1.coverage_stats.md50.tabular coverage_stats-1b5523d3d2c2/test-data/ex1.coverage_stats.tabular coverage_stats-1b5523d3d2c2/tools/coverage_stats/README.rst coverage_stats-1b5523d3d2c2/tools/coverage_stats/coverage_stats.py coverage_stats-1b5523d3d2c2/tools/coverage_stats/coverage_stats.xml test-data/coverage_test.bam test-data/coverage_test.coverage_stats.tabular test-data/ex1.bam test-data/ex1.coverage_stats.md50.tabular test-data/ex1.coverage_stats.tabular tools/coverage_stats/README.rst tools/coverage_stats/coverage_stats.py tools/coverage_stats/coverage_stats.xml
diffstat 16 files changed, 581 insertions(+), 503 deletions(-) [+]
line wrap: on
line diff
Binary file coverage_stats-1b5523d3d2c2/test-data/coverage_test.bam has changed
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/coverage_stats-1b5523d3d2c2/test-data/coverage_test.coverage_stats.tabular	Tue Aug 11 18:23:05 2020 -0400
@@ -0,0 +1,7 @@
+#identifer	length	mapped	placed	min_cov	max_cov	mean_cov
+ref	100	5	0	0	2	0.50
+none	200	0	0	0	0	0.00
+full	10	3	0	2	2	2.00
+head	10	1	0	0	1	0.50
+tail	20	1	0	0	1	0.25
+*	0	0	1	0	0	0.00
Binary file coverage_stats-1b5523d3d2c2/test-data/ex1.bam has changed
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/coverage_stats-1b5523d3d2c2/test-data/ex1.coverage_stats.md50.tabular	Tue Aug 11 18:23:05 2020 -0400
@@ -0,0 +1,4 @@
+#identifer	length	mapped	placed	min_cov	max_cov	mean_cov
+chr1	1575	1446	18	0	50	32.06
+chr2	1584	1789	17	0	50	38.70
+*	0	0	0	0	0	0.00
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/coverage_stats-1b5523d3d2c2/test-data/ex1.coverage_stats.tabular	Tue Aug 11 18:23:05 2020 -0400
@@ -0,0 +1,4 @@
+#identifer	length	mapped	placed	min_cov	max_cov	mean_cov
+chr1	1575	1446	18	0	63	32.32
+chr2	1584	1789	17	0	64	39.78
+*	0	0	0	0	0	0.00
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/coverage_stats-1b5523d3d2c2/tools/coverage_stats/README.rst	Tue Aug 11 18:23:05 2020 -0400
@@ -0,0 +1,124 @@
+BAM coverage statistics using samtools idxstats and depth
+=========================================================
+
+This tool is copyright 2014-2017 by Peter Cock, The James Hutton Institute
+(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved.
+See the licence text below.
+
+Internally this tool uses the command-line samtools suite.
+
+This tool is available from the Galaxy Tool Shed at:
+http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats
+
+
+Automated Installation
+======================
+
+This should be straightforward, provided your Galaxy is setup to use Conda and
+BioConda for dependencies - Galaxy should then automatically download and
+install the expected version of samtools.
+
+
+Manual Installation
+===================
+
+This expects samtools to be on the ``$PATH``, and was tested using v1.3.1.
+
+To install the wrapper copy or move the following files under the Galaxy tools
+folder, e.g. in a ``tools/coverage_stats`` folder:
+
+* ``coverage_stats.xml`` (the Galaxy tool definition)
+* ``coverage_stats.py`` (the Python wrapper script)
+* ``README.rst`` (this file)
+
+You will also need to modify the ``tools_conf.xml`` file to tell Galaxy to offer
+the tool. Just add the line, perhaps under the NGS tools section::
+
+  <tool file="coverage_stats/coverage_stats.xml" />
+
+If you wish to run the unit tests, also move/copy the ``test-data/`` files
+under Galaxy's ``test-data/`` folder. Then::
+
+    $ ./run_tests.sh -id coverage_stats
+
+That's it.
+
+
+History
+=======
+
+======= ======================================================================
+Version Changes
+------- ----------------------------------------------------------------------
+v0.0.1  - Initial public release
+v0.0.2  - Cope with samtools' default depth limit using modified samtools,
+          see https://github.com/samtools/samtools/pull/322
+v0.0.3  - Cope with no coverage in final contigs.
+v0.0.4  - Reorder XML elements (internal change only).
+        - Planemo for Tool Shed upload (``.shed.yml``, internal change only).
+v0.0.5  - Expose new ``samtools depth -d ...`` argument added in samtools v1.3
+        - Depends on samtools v1.4.1 via Conda, which IS NOT AVAILABLE as a
+          legacy Tool Shed package.
+        - Apply the new maximum depth parameter within the script to ensure
+          excess coverage is clear by getting the max coverage equal to the
+          max depth setting (the raw output from samtools is more fuzzy).
+        - Report total length and overall mean coverage in stdout.
+        - Use ``<command detect_errors="aggressive">`` (internal change only).
+        - Single quote command line arguments (internal change only).
+v0.0.6  - Python 3 compatibility fix.
+v0.1.0  - Provide proper command line API for the Python script.
+======= ======================================================================
+
+
+Developers
+==========
+
+Development is on this GitHub repository:
+https://github.com/peterjc/pico_galaxy/tree/master/tools/coverage_stats
+
+For pushing a release to the test or main "Galaxy Tool Shed", use the following
+Planemo commands (which requires you have set your Tool Shed access details in
+``~/.planemo.yml`` and that you have access rights on the Tool Shed)::
+
+    $ planemo shed_update -t testtoolshed --check_diff tools/coverage_stats/
+    ...
+
+or::
+
+    $ planemo shed_update -t toolshed --check_diff tools/coverage_stats/
+    ...
+
+To just build and check the tar ball, use::
+
+    $ planemo shed_upload --tar_only tools/coverage_stats/
+    ...
+    $ tar -tzf shed_upload.tar.gz
+    tools/coverage_stats/README.rst
+    tools/coverage_stats/coverage_stats.xml
+    tools/coverage_stats/coverage_stats.py
+    ...
+
+
+Licence (MIT)
+=============
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
+
+NOTE: This is the licence for the Galaxy Wrapper only.
+samtools is available and licenced separately.
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/coverage_stats-1b5523d3d2c2/tools/coverage_stats/coverage_stats.py	Tue Aug 11 18:23:05 2020 -0400
@@ -0,0 +1,323 @@
+#!/usr/bin/env python
+"""BAM coverage statistics using samtools idxstats and depth.
+
+This script takes exactly three command line arguments:
+ * Input BAM filename
+ * Input BAI filename (via Galaxy metadata)
+ * Output tabular filename
+
+Optional fourth argument:
+ * Max coverage depth (integer)
+
+This messes about with the filenames to make samtools happy, then
+runs "samtools idxstats" and "samtools depth", captures and combines
+the output to the desired output tabular file.
+
+Because "samtools depth" treats the max depth a little fuzzily, this
+tool tries to account for this and applies a clear max-depth cut off.
+"""
+
+import os
+import subprocess
+import sys
+import tempfile
+
+from optparse import OptionParser
+
+usage = """Example usage:
+
+$ python coverage_stats.py -b mapped.bam -i mapped.bai -o summary.tsv -d 10000
+"""
+
+parser = OptionParser(usage=usage)
+parser.add_option(
+    "-b", "--bam", dest="input_bam", default=None, help="Input BAM file", metavar="FILE"
+)
+parser.add_option(
+    "-i",
+    "--bai",
+    dest="input_bai",
+    default="-",
+    help="Input BAI file (BAM index, optional, default or - infer from BAM filename)",
+    metavar="FILE",
+)
+parser.add_option(
+    "-o",
+    dest="output_tabular",
+    default="-",
+    help="Output tabular file (optional, default stdout)",
+    metavar="FILE",
+)
+parser.add_option(
+    "-d",
+    "--maxdepth",
+    dest="max_depth",
+    default=8000,
+    help="Max coverage depth (integer, default 8000)",
+    type="int",
+)
+parser.add_option(
+    "-v",
+    "--version",
+    dest="version",
+    default=False,
+    action="store_true",
+    help="Show version and quit",
+)
+
+options, args = parser.parse_args()
+
+if options.version:
+    # Galaxy seems to invert the order of the two lines
+    print("BAM coverage statistics v0.1.0")
+    cmd = "samtools 2>&1 | grep -i ^Version"
+    sys.exit(os.system(cmd))
+
+bam_filename = options.input_bam
+bai_filename = options.input_bai
+tabular_filename = options.output_tabular
+max_depth = options.max_depth
+
+if args:
+    sys.exit("Sorry, the legacy API has been dropped.")
+
+if not bam_filename:
+    sys.exit("Input BAM filename is required.")
+if not os.path.isfile(bam_filename):
+    sys.exit("Input BAM file not found: %s" % bam_filename)
+if bai_filename == "-":
+    if os.path.isfile(bam_filename + ".bai"):
+        bai_filename = bam_filename + ".bai"
+if not os.path.isfile(bai_filename):
+    if bai_filename == "None":
+        sys.exit("Error: Galaxy did not index your BAM file")
+    sys.exit("Input BAI file not found: %s" % bai_filename)
+
+try:
+    max_depth = int(max_depth)
+except ValueError:
+    sys.exit("Bad argument for max depth: %r" % max_depth)
+if max_depth < 0:
+    sys.exit("Bad argument for max depth: %r" % max_depth)
+
+# fuzz factor to ensure can reach max_depth, e.g. region with
+# many reads having a deletion present could underestimate the
+# coverage by capping the number of reads considered
+depth_margin = 100
+
+# Assign sensible names with real extensions, and setup symlinks:
+tmp_dir = tempfile.mkdtemp()
+bam_file = os.path.join(tmp_dir, "temp.bam")
+bai_file = os.path.join(tmp_dir, "temp.bam.bai")
+idxstats_filename = os.path.join(tmp_dir, "idxstats.tsv")
+depth_filename = os.path.join(tmp_dir, "depth.tsv")
+os.symlink(os.path.abspath(bam_filename), bam_file)
+os.symlink(os.path.abspath(bai_filename), bai_file)
+assert os.path.isfile(bam_file), bam_file
+assert os.path.isfile(bai_file), bai_file
+assert os.path.isfile(bam_file + ".bai"), bam_file
+
+
+def clean_up():
+    """Remove our temporary files and directory."""
+    os.remove(idxstats_filename)
+    os.remove(depth_filename)
+    os.remove(bam_file)
+    os.remove(bai_file)
+    os.rmdir(tmp_dir)
+
+
+def samtools_depth_opt_available():
+    """Determine if samtools depth supports maximum coverage argument."""
+    child = subprocess.Popen(
+        ["samtools", "depth"],
+        universal_newlines=True,
+        stdout=subprocess.PIPE,
+        stderr=subprocess.STDOUT,
+    )
+    # Combined stdout/stderr in case samtools is ever inconsistent
+    output, tmp = child.communicate()
+    assert tmp is None
+    # Expect to find this line in the help text, exact wording could change:
+    #    -d/-m <int>         maximum coverage depth [8000]
+    return " -d/-m " in output
+
+
+depth_hack = False
+if not samtools_depth_opt_available():
+    if max_depth + depth_margin <= 8000:
+        sys.stderr.write(
+            "WARNING: The version of samtools depth installed does not "
+            "support the -d option, however, the requested max-depth "
+            "is safely under the default of 8000.\n"
+        )
+        depth_hack = True
+    else:
+        sys.exit(
+            "The version of samtools depth installed does not support the -d option."
+        )
+
+# Run samtools idxstats:
+cmd = 'samtools idxstats "%s" > "%s"' % (bam_file, idxstats_filename)
+return_code = os.system(cmd)
+if return_code:
+    clean_up()
+    sys.exit("Return code %i from command:\n%s" % (return_code, cmd))
+
+# Run samtools depth:
+# TODO - Parse stdout instead?
+if depth_hack:
+    # Using an old samtools without the -d option, but hard coded default
+    # of 8000 should be fine even allowing a margin for fuzzy output
+    cmd = 'samtools depth "%s" > "%s"' % (bam_file, depth_filename)
+else:
+    cmd = 'samtools depth -d %i "%s" > "%s"' % (
+        max_depth + depth_margin,
+        bam_file,
+        depth_filename,
+    )
+return_code = os.system(cmd)
+if return_code:
+    clean_up()
+    sys.exit("Return code %i from command:\n%s" % (return_code, cmd))
+
+
+def load_total_coverage(depth_handle, identifier, length):
+    """Parse some of the 'samtools depth' output for coverages.
+
+    Returns min_cov (int), max_cov (int) and mean cov (float).
+
+    Uses global variables to cache the first line of output from the
+    next reference sequence.
+    """
+    global depth_ref, depth_pos, depth_reads
+
+    # print("====")
+    # print("%s coverage calculation, length %i, ..." % (identifier, length))
+
+    if depth_ref is None:
+        # Right at start of file / new contig
+        line = depth_handle.readline()
+        # Are we at the end of the file?
+        if not line:
+            # Must be at the end of the file.
+            # This can happen if the file contig(s) had no reads mapped
+            return 0, 0, 0.0, 0
+        depth_ref, depth_pos, depth_reads = line.rstrip("\n").split()
+        depth_pos = int(depth_pos)
+        depth_reads = min(max_depth, int(depth_reads))
+        # Can now treat as later references where first line cached
+    elif identifier != depth_ref:
+        # Infer that identifier had coverage zero,
+        # and so was not in the 'samtools depth'
+        # output.
+        # print("%s appears to have no coverage at all" % identifier)
+        return 0, 0, 0.0, 0
+
+    # Good, at start of expected reference
+    bases = depth_reads
+    if depth_pos == 1:
+        min_cov = depth_reads
+    else:
+        # print("%s has no coverage at start" % identifier)
+        min_cov = 0
+    max_cov = depth_reads
+
+    last_pos = depth_pos
+    depth_ref = None
+    depth_pos = 0
+    depth_reads = 0
+    for line in depth_handle:
+        ref, pos, depth = line.rstrip("\n").split()
+        pos = int(pos)
+        depth = min(max_depth, int(depth))
+        if ref != identifier:
+            # Reached the end of this identifier's coverage
+            # so cache this ready for next identifier
+            depth_ref, depth_pos, depth_reads = ref, pos, depth
+            break
+        bases += depth
+        if last_pos + 1 < pos:
+            # print("%s has no coverage between %i and %i"
+            #       % (identifier, last_pos, pos))
+            min_cov = 0
+        else:
+            min_cov = min(min_cov, depth)
+        max_cov = max(max_cov, depth)
+        last_pos = pos
+
+    # Reach the end of this identifier's coverage or end of file
+    if last_pos < length:
+        # print("%s has no coverage at end" % identifier)
+        min_cov = 0
+    mean_cov = bases / float(length)
+    return min_cov, max_cov, mean_cov, bases
+
+
+# Parse and combine the output
+if tabular_filename == "-":
+    out_handle = sys.stdout
+else:
+    out_handle = open(tabular_filename, "w")
+out_handle.write("#identifer\tlength\tmapped\tplaced\tmin_cov\tmax_cov\tmean_cov\n")
+
+idxstats_handle = open(idxstats_filename)
+depth_handle = open(depth_filename)
+
+depth_ref = None
+depth_pos = 0
+depth_reads = 0
+
+global_bases = 0
+global_length = 0
+for line in idxstats_handle:
+    identifier, length, mapped, placed = line.rstrip("\n").split()
+    length = int(length)
+    mapped = int(mapped)
+    placed = int(placed)
+    if identifier == "*":
+        min_cov = 0
+        max_cov = 0
+        mean_cov = 0.0
+        bases = 0
+    else:
+        min_cov, max_cov, mean_cov, bases = load_total_coverage(
+            depth_handle, identifier, length
+        )
+    if max_cov > max_depth:
+        sys.exit(
+            "Using max depth %i yet saw max coverage %i for %s"
+            % (max_depth, max_cov, identifier)
+        )
+    out_handle.write(
+        "%s\t%i\t%i\t%i\t%i\t%i\t%0.2f\n"
+        % (identifier, length, mapped, placed, min_cov, max_cov, mean_cov)
+    )
+    if not (min_cov <= mean_cov <= max_cov):
+        out_handle.write("ERROR, expect min_cov <= mean_cov <= max_cov\n")
+        idxstats_handle.close()
+        depth_handle.close()
+        out_handle.close()
+        clean_up()
+        sys.exit(
+            "Problem with coverage for %s, expect min_cov <= mean_cov <= max_cov"
+            % identifier
+        )
+    global_length += length
+    global_bases += bases
+
+idxstats_handle.close()
+depth_handle.close()
+if tabular_filename != "-":
+    out_handle.close()
+
+print(
+    "Total reference length %i with overall mean coverage %0.2f"
+    % (global_length, float(global_bases) / global_length)
+)
+
+# Remove the temp symlinks and files:
+clean_up()
+
+if depth_ref is not None:
+    sys.exit("Left over output from 'samtools depth'? %r" % depth_ref)
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/coverage_stats-1b5523d3d2c2/tools/coverage_stats/coverage_stats.xml	Tue Aug 11 18:23:05 2020 -0400
@@ -0,0 +1,119 @@
+<tool id="coverage_stats" name="BAM coverage statistics" version="0.1.0">
+    <description>using samtools idxstats and depth</description>
+    <requirements>
+        <requirement type="package" version="1.4.1">samtools</requirement>
+    </requirements>
+    <version_command>
+python $__tool_directory__/coverage_stats.py --version
+    </version_command>
+    <command detect_errors="aggressive">
+python $__tool_directory__/coverage_stats.py
+-b '$input_bam'
+-i '${input_bam.metadata.bam_index}'
+-o '$out_tabular'
+-d '$max_depth'
+    </command>
+    <inputs>
+        <param name="input_bam" type="data" format="bam" label="Input BAM file" />
+        <param name="max_depth" type="integer" min="0" max="10000000" label="Max depth" value="8000" />
+    </inputs>
+    <outputs>
+        <data name="out_tabular" format="tabular" label="$input_bam.name (coverage stats)" />
+    </outputs>
+    <tests>
+        <test>
+            <param name="input_bam" value="ex1.bam" ftype="bam" />
+            <param name="max_depth" value="123" />
+            <output name="out_tabular" file="ex1.coverage_stats.tabular" ftype="tabular" />
+        </test>
+        <test>
+            <param name="input_bam" value="ex1.bam" ftype="bam" />
+            <param name="max_depth" value="50" />
+            <output name="out_tabular" file="ex1.coverage_stats.md50.tabular" ftype="tabular" />
+        </test>
+        <test>
+            <param name="input_bam" value="coverage_test.bam" ftype="bam" />
+            <param name="max_depth" value="123" />
+            <output name="out_tabular" file="coverage_test.coverage_stats.tabular" ftype="tabular" />
+        </test>
+    </tests>
+    <help>
+**What it does**
+
+This tool runs the commands ``samtools idxstats`` and ``samtools depth`` from the
+SAMtools toolkit, and parses their output to produce a consise summary of the
+coverage information for each reference sequence.
+
+Input is a sorted and indexed BAM file, the output is tabular. The first four
+columns match the output from ``samtools idxstats``, the additional columns are
+calculated from the ``samtools depth`` output. The final row with a star as the
+reference identifier represents unmapped reads, and will have zeros in every
+column except columns one and four.
+
+====== =================================================================================
+Column Description
+------ ---------------------------------------------------------------------------------
+     1 Reference sequence identifier
+     2 Reference sequence length
+     3 Number of mapped reads
+     4 Number of placed but unmapped reads (typically unmapped partners of mapped reads)
+     5 Minimum coverage (per base of reference)
+     6 Maximum coverage (per base of reference)
+     7 Mean coverage (given to 2 dp)
+====== =================================================================================
+
+Example output from a *de novo* assembly:
+
+========== ====== ====== ====== ======= ======= ========
+identiifer length mapped placed min_cov max_cov mean_cov
+---------- ------ ------ ------ ------- ------- --------
+contig_1   833604 436112      0       1     157    71.95
+contig_2    14820   9954      0       1     152    91.27
+contig_3   272099 142958      0       1     150    72.31
+contig_4   135519  73288      0       1     149    75.23
+contig_5    91245  46759      0       1     157    70.92
+contig_6   175604  95744      0       1     146    75.99
+contig_7    90586  48158      0       1     151    72.93
+contig_9   234347 126458      0       1     159    75.40
+contig_10  121515  60211      0       1     152    68.12
+...           ...    ...    ...     ...     ...      ...
+contig_604    712     85      0       1      49    21.97
+\*              0      0 950320       0       0     0.00
+========== ====== ====== ====== ======= ======= ========
+
+In this example there were 604 contigs, each with one line in the output table,
+plus the final row (labelled with an asterisk) representing 950320 unmapped reads.
+In this BAM file, the fourth column was otherwise zero.
+
+.. class:: warningmark
+
+**Note**. If using this on a mapping BAM file, beware that the coverage counting is
+done per base of the reference.  This means if your reference has any extra bases
+compared to the reads being mapped, those bases will be skipped by CIGAR D operators
+and these "extra" bases can have an extremely low coverage, giving a potentially
+misleading ``min_cov`` values. A sliding window coverage may be more appropriate.
+
+**Note**. Up until samtools 1.2, there was an internal hard limit of 8000 for the
+pileup routine, meaning the reported coverage from ``samtools depth`` would show
+maximum coverage depths *around* 8000. This is now a run time option.
+
+
+**Citation**
+
+If you use this Galaxy tool in work leading to a scientific publication please
+cite:
+
+Heng Li et al (2009). The Sequence Alignment/Map format and SAMtools.
+Bioinformatics 25(16), 2078-9.
+https://doi.org/10.1093/bioinformatics/btp352
+
+Peter J.A. Cock (2013), BAM coverage statistics using samtools idxstats and depth.
+http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats
+
+This wrapper is available to install into other Galaxy Instances via the Galaxy
+Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats
+    </help>
+    <citations>
+        <citation type="doi">10.1093/bioinformatics/btp352</citation>
+    </citations>
+</tool>
Binary file test-data/coverage_test.bam has changed
--- a/test-data/coverage_test.coverage_stats.tabular	Thu May 11 12:16:10 2017 -0400
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,7 +0,0 @@
-#identifer	length	mapped	placed	min_cov	max_cov	mean_cov
-ref	100	5	0	0	2	0.50
-none	200	0	0	0	0	0.00
-full	10	3	0	2	2	2.00
-head	10	1	0	0	1	0.50
-tail	20	1	0	0	1	0.25
-*	0	0	1	0	0	0.00
Binary file test-data/ex1.bam has changed
--- a/test-data/ex1.coverage_stats.md50.tabular	Thu May 11 12:16:10 2017 -0400
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,4 +0,0 @@
-#identifer	length	mapped	placed	min_cov	max_cov	mean_cov
-chr1	1575	1446	18	0	50	32.06
-chr2	1584	1789	17	0	50	38.70
-*	0	0	0	0	0	0.00
--- a/test-data/ex1.coverage_stats.tabular	Thu May 11 12:16:10 2017 -0400
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,4 +0,0 @@
-#identifer	length	mapped	placed	min_cov	max_cov	mean_cov
-chr1	1575	1446	18	0	63	32.32
-chr2	1584	1789	17	0	64	39.78
-*	0	0	0	0	0	0.00
--- a/tools/coverage_stats/README.rst	Thu May 11 12:16:10 2017 -0400
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,122 +0,0 @@
-BAM coverage statistics using samtools idxstats and depth
-=========================================================
-
-This tool is copyright 2014-2017 by Peter Cock, The James Hutton Institute
-(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved.
-See the licence text below.
-
-Internally this tool uses the command-line samtools suite.
-
-This tool is available from the Galaxy Tool Shed at:
-http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats
-
-
-Automated Installation
-======================
-
-This should be straightforward, provided your Galaxy is setup to use Conda and
-BioConda for dependencies - Galaxy should then automatically download and
-install the expected version of samtools.
-
-
-Manual Installation
-===================
-
-This expects samtools to be on the ``$PATH``, and was tested using v1.3.1.
-
-To install the wrapper copy or move the following files under the Galaxy tools
-folder, e.g. in a ``tools/coverage_stats`` folder:
-
-* ``coverage_stats.xml`` (the Galaxy tool definition)
-* ``coverage_stats.py`` (the Python wrapper script)
-* ``README.rst`` (this file)
-
-You will also need to modify the ``tools_conf.xml`` file to tell Galaxy to offer
-the tool. Just add the line, perhaps under the NGS tools section::
-
-  <tool file="coverage_stats/coverage_stats.xml" />
-
-If you wish to run the unit tests, also move/copy the ``test-data/`` files
-under Galaxy's ``test-data/`` folder. Then::
-
-    $ ./run_tests.sh -id coverage_stats
-
-That's it.
-
-
-History
-=======
-
-======= ======================================================================
-Version Changes
-------- ----------------------------------------------------------------------
-v0.0.1  - Initial public release
-v0.0.2  - Cope with samtools' default depth limit using modified samtools,
-          see https://github.com/samtools/samtools/pull/322
-v0.0.3  - Cope with no coverage in final contigs.
-v0.0.4  - Reorder XML elements (internal change only).
-        - Planemo for Tool Shed upload (``.shed.yml``, internal change only).
-v0.0.5  - Expose new ``samtools depth -d ...`` argument added in samtools v1.3
-        - Depends on samtools v1.4.1 via Conda, which IS NOT AVAILABLE as a
-          legacy Tool Shed package.
-        - Apply the new maximum depth parameter within the script to ensure
-          excess coverage is clear by getting the max coverage equal to the
-          max depth setting (the raw output from samtools is more fuzzy).
-        - Report total length and overall mean coverage in stdout.
-        - Use ``<command detect_errors="aggressive">`` (internal change only).
-        - Single quote command line arguments (internal change only).
-======= ======================================================================
-
-
-Developers
-==========
-
-Development is on this GitHub repository:
-https://github.com/peterjc/pico_galaxy/tree/master/tools/coverage_stats
-
-For pushing a release to the test or main "Galaxy Tool Shed", use the following
-Planemo commands (which requires you have set your Tool Shed access details in
-``~/.planemo.yml`` and that you have access rights on the Tool Shed)::
-
-    $ planemo shed_update -t testtoolshed --check_diff tools/coverage_stats/
-    ...
-
-or::
-
-    $ planemo shed_update -t toolshed --check_diff tools/coverage_stats/
-    ...
-
-To just build and check the tar ball, use::
-
-    $ planemo shed_upload --tar_only tools/coverage_stats/
-    ...
-    $ tar -tzf shed_upload.tar.gz
-    tools/coverage_stats/README.rst
-    tools/coverage_stats/coverage_stats.xml
-    tools/coverage_stats/coverage_stats.py
-    ...
-
-
-Licence (MIT)
-=============
-
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
-
-The above copyright notice and this permission notice shall be included in
-all copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
-THE SOFTWARE.
-
-NOTE: This is the licence for the Galaxy Wrapper only.
-samtools is available and licenced separately.
--- a/tools/coverage_stats/coverage_stats.py	Thu May 11 12:16:10 2017 -0400
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,251 +0,0 @@
-#!/usr/bin/env python
-"""BAM coverage statistics using samtools idxstats and depth.
-
-This script takes exactly three command line arguments:
- * Input BAM filename
- * Input BAI filename (via Galaxy metadata)
- * Output tabular filename
-
-Optional fourth argument:
- * Max coverage depth (integer)
-
-This messes about with the filenames to make samtools happy, then
-runs "samtools idxstats" and "samtools depth", captures and combines
-the output to the desired output tabular file.
-
-Because "samtools depth" treats the max depth a little fuzzily, this
-tool tries to account for this and applies a clear max-depth cut off.
-"""
-
-import os
-import subprocess
-import sys
-import tempfile
-
-if "-v" in sys.argv or "--version" in sys.argv:
-    # Galaxy seems to invert the order of the two lines
-    print("BAM coverage statistics v0.0.5")
-    cmd = "samtools 2>&1 | grep -i ^Version"
-    sys.exit(os.system(cmd))
-
-# TODO - Proper command line API
-if len(sys.argv) == 4:
-    bam_filename, bai_filename, tabular_filename = sys.argv[1:]
-    max_depth = "8000"
-elif len(sys.argv) == 5:
-    bam_filename, bai_filename, tabular_filename, max_depth = sys.argv[1:]
-else:
-    sys.exit("Require 3 or 4 arguments: BAM, BAI, tabular filename, [max depth]")
-
-if not os.path.isfile(bam_filename):
-    sys.exit("Input BAM file not found: %s" % bam_filename)
-if bai_filename == "-":
-    # Make this optional for ease of use at the command line by hand:
-    if os.path.isfile(bam_filename + ".bai"):
-        bai_filename = bam_filename + ".bai"
-if not os.path.isfile(bai_filename):
-    if bai_filename == "None":
-        sys.exit("Error: Galaxy did not index your BAM file")
-    sys.exit("Input BAI file not found: %s" % bai_filename)
-
-try:
-    max_depth = int(max_depth)
-except ValueError:
-    sys.exit("Bad argument for max depth: %r" % max_depth)
-if max_depth < 0:
-    sys.exit("Bad argument for max depth: %r" % max_depth)
-
-# fuzz factor to ensure can reach max_depth, e.g. region with
-# many reads having a deletion present could underestimate the
-# coverage by capping the number of reads considered
-depth_margin = 100
-
-# Assign sensible names with real extensions, and setup symlinks:
-tmp_dir = tempfile.mkdtemp()
-bam_file = os.path.join(tmp_dir, "temp.bam")
-bai_file = os.path.join(tmp_dir, "temp.bam.bai")
-idxstats_filename = os.path.join(tmp_dir, "idxstats.tsv")
-depth_filename = os.path.join(tmp_dir, "depth.tsv")
-os.symlink(os.path.abspath(bam_filename), bam_file)
-os.symlink(os.path.abspath(bai_filename), bai_file)
-assert os.path.isfile(bam_file), bam_file
-assert os.path.isfile(bai_file), bai_file
-assert os.path.isfile(bam_file + ".bai"), bam_file
-
-
-def clean_up():
-    """Remove our temporary files and directory."""
-    os.remove(idxstats_filename)
-    os.remove(depth_filename)
-    os.remove(bam_file)
-    os.remove(bai_file)
-    os.rmdir(tmp_dir)
-
-
-def samtools_depth_opt_available():
-    """Determine if samtools depth supports maximum coverage argument."""
-    child = subprocess.Popen(["samtools", "depth"],
-                             stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
-    # Combined stdout/stderr in case samtools is ever inconsistent
-    output, tmp = child.communicate()
-    assert tmp is None
-    # Expect to find this line in the help text, exact wording could change:
-    #    -d/-m <int>         maximum coverage depth [8000]
-    return " -d/-m " in output
-
-
-depth_hack = False
-if not samtools_depth_opt_available():
-    if max_depth + depth_margin <= 8000:
-        sys.stderr.write("WARNING: The version of samtools depth installed does not "
-                         "support the -d option, however, the requested max-depth "
-                         "is safely under the default of 8000.\n")
-        depth_hack = True
-    else:
-        sys.exit("The version of samtools depth installed does not support the -d option.")
-
-# Run samtools idxstats:
-cmd = 'samtools idxstats "%s" > "%s"' % (bam_file, idxstats_filename)
-return_code = os.system(cmd)
-if return_code:
-    clean_up()
-    sys.exit("Return code %i from command:\n%s" % (return_code, cmd))
-
-# Run samtools depth:
-# TODO - Parse stdout instead?
-if depth_hack:
-    # Using an old samtools without the -d option, but hard coded default
-    # of 8000 should be fine even allowing a margin for fuzzy output
-    cmd = 'samtools depth "%s" > "%s"' % (bam_file, depth_filename)
-else:
-    cmd = 'samtools depth -d %i "%s" > "%s"' % (max_depth + depth_margin, bam_file, depth_filename)
-return_code = os.system(cmd)
-if return_code:
-    clean_up()
-    sys.exit("Return code %i from command:\n%s" % (return_code, cmd))
-
-
-def load_total_coverage(depth_handle, identifier, length):
-    """Parse some of the 'samtools depth' output for coverages.
-
-    Returns min_cov (int), max_cov (int) and mean cov (float).
-
-    Uses global variables to cache the first line of output from the
-    next reference sequence.
-    """
-    global depth_ref, depth_pos, depth_reads
-
-    # print("====")
-    # print("%s coverage calculation, length %i, ..." % (identifier, length))
-
-    if depth_ref is None:
-        # Right at start of file / new contig
-        line = depth_handle.readline()
-        # Are we at the end of the file?
-        if not line:
-            # Must be at the end of the file.
-            # This can happen if the file contig(s) had no reads mapped
-            return 0, 0, 0.0, 0
-        depth_ref, depth_pos, depth_reads = line.rstrip("\n").split()
-        depth_pos = int(depth_pos)
-        depth_reads = min(max_depth, int(depth_reads))
-        # Can now treat as later references where first line cached
-    elif identifier != depth_ref:
-        # Infer that identifier had coverage zero,
-        # and so was not in the 'samtools depth'
-        # output.
-        # print("%s appears to have no coverage at all" % identifier)
-        return 0, 0, 0.0, 0
-
-    # Good, at start of expected reference
-    bases = depth_reads
-    if depth_pos == 1:
-        min_cov = depth_reads
-    else:
-        # print("%s has no coverage at start" % identifier)
-        min_cov = 0
-    max_cov = depth_reads
-
-    last_pos = depth_pos
-    depth_ref = None
-    depth_pos = 0
-    depth_reads = 0
-    for line in depth_handle:
-        ref, pos, depth = line.rstrip("\n").split()
-        pos = int(pos)
-        depth = min(max_depth, int(depth))
-        if ref != identifier:
-            # Reached the end of this identifier's coverage
-            # so cache this ready for next identifier
-            depth_ref, depth_pos, depth_reads = ref, pos, depth
-            break
-        bases += depth
-        if last_pos + 1 < pos:
-            # print("%s has no coverage between %i and %i" % (identifier, last_pos, pos))
-            min_cov = 0
-        else:
-            min_cov = min(min_cov, depth)
-        max_cov = max(max_cov, depth)
-        last_pos = pos
-
-    # Reach the end of this identifier's coverage or end of file
-    if last_pos < length:
-        # print("%s has no coverage at end" % identifier)
-        min_cov = 0
-    mean_cov = bases / float(length)
-    return min_cov, max_cov, mean_cov, bases
-
-
-# Parse and combine the output
-out_handle = open(tabular_filename, "w")
-out_handle.write("#identifer\tlength\tmapped\tplaced\tmin_cov\tmax_cov\tmean_cov\n")
-
-idxstats_handle = open(idxstats_filename)
-depth_handle = open(depth_filename)
-
-depth_ref = None
-depth_pos = 0
-depth_reads = 0
-
-global_bases = 0
-global_length = 0
-for line in idxstats_handle:
-    identifier, length, mapped, placed = line.rstrip("\n").split()
-    length = int(length)
-    mapped = int(mapped)
-    placed = int(placed)
-    if identifier == "*":
-        min_cov = 0
-        max_cov = 0
-        mean_cov = 0.0
-        bases = 0
-    else:
-        min_cov, max_cov, mean_cov, bases = load_total_coverage(depth_handle, identifier, length)
-    if max_cov > max_depth:
-        sys.exit("Using max depth %i yet saw max coverage %i for %s"
-                 % (max_depth, max_cov, identifier))
-    out_handle.write("%s\t%i\t%i\t%i\t%i\t%i\t%0.2f\n"
-                     % (identifier, length, mapped, placed,
-                        min_cov, max_cov, mean_cov))
-    if not (min_cov <= mean_cov <= max_cov):
-        out_handle.write("ERROR, expect min_cov <= mean_cov <= max_cov\n")
-        idxstats_handle.close()
-        depth_handle.close()
-        out_handle.close()
-        clean_up()
-        sys.exit("Problem with coverage for %s, expect min_cov <= mean_cov <= max_cov"
-                 % identifier)
-    global_length += length
-    global_bases += bases
-
-idxstats_handle.close()
-depth_handle.close()
-out_handle.close()
-
-print("Total reference length %i with overall mean coverage %0.2f" % (global_length, float(global_bases) / global_length))
-
-# Remove the temp symlinks and files:
-clean_up()
-
-if depth_ref is not None:
-    sys.exit("Left over output from 'samtools depth'? %r" % depth_ref)
--- a/tools/coverage_stats/coverage_stats.xml	Thu May 11 12:16:10 2017 -0400
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,115 +0,0 @@
-<tool id="coverage_stats" name="BAM coverage statistics" version="0.0.5">
-    <description>using samtools idxstats and depth</description>
-    <requirements>
-        <requirement type="package" version="1.4.1">samtools</requirement>
-    </requirements>
-    <version_command>
-python $__tool_directory__/coverage_stats.py --version
-    </version_command>
-    <command detect_errors="aggressive">
-python $__tool_directory__/coverage_stats.py '$input_bam' '${input_bam.metadata.bam_index}' '$out_tabular' '$max_depth'
-    </command>
-    <inputs>
-        <param name="input_bam" type="data" format="bam" label="Input BAM file" />
-        <param name="max_depth" type="integer" min="0" max="10000000" label="Max depth" value="8000" />
-    </inputs>
-    <outputs>
-        <data name="out_tabular" format="tabular" label="$input_bam.name (coverage stats)" />
-    </outputs>
-    <tests>
-        <test>
-            <param name="input_bam" value="ex1.bam" ftype="bam" />
-	    <param name="max_depth" value="123" />
-            <output name="out_tabular" file="ex1.coverage_stats.tabular" ftype="tabular" />
-        </test>
-        <test>
-            <param name="input_bam" value="ex1.bam" ftype="bam" />
-            <param name="max_depth" value="50" />
-            <output name="out_tabular" file="ex1.coverage_stats.md50.tabular" ftype="tabular" />
-        </test>
-        <test>
-            <param name="input_bam" value="coverage_test.bam" ftype="bam" />
-	    <param name="max_depth" value="123" />
-            <output name="out_tabular" file="coverage_test.coverage_stats.tabular" ftype="tabular" />
-        </test>
-    </tests>
-    <help>
-**What it does**
-
-This tool runs the commands ``samtools idxstats`` and ``samtools depth`` from the
-SAMtools toolkit, and parses their output to produce a consise summary of the
-coverage information for each reference sequence.
-
-Input is a sorted and indexed BAM file, the output is tabular. The first four
-columns match the output from ``samtools idxstats``, the additional columns are
-calculated from the ``samtools depth`` output. The final row with a star as the
-reference identifier represents unmapped reads, and will have zeros in every
-column except columns one and four.
-
-====== =================================================================================
-Column Description
------- ---------------------------------------------------------------------------------
-     1 Reference sequence identifier
-     2 Reference sequence length
-     3 Number of mapped reads
-     4 Number of placed but unmapped reads (typically unmapped partners of mapped reads)
-     5 Minimum coverage (per base of reference)
-     6 Maximum coverage (per base of reference)
-     7 Mean coverage (given to 2 dp)
-====== =================================================================================
-
-Example output from a *de novo* assembly:
-
-========== ====== ====== ====== ======= ======= ========
-identiifer length mapped placed min_cov max_cov mean_cov
----------- ------ ------ ------ ------- ------- --------
-contig_1   833604 436112      0       1     157    71.95
-contig_2    14820   9954      0       1     152    91.27
-contig_3   272099 142958      0       1     150    72.31
-contig_4   135519  73288      0       1     149    75.23
-contig_5    91245  46759      0       1     157    70.92
-contig_6   175604  95744      0       1     146    75.99
-contig_7    90586  48158      0       1     151    72.93
-contig_9   234347 126458      0       1     159    75.40
-contig_10  121515  60211      0       1     152    68.12
-...           ...    ...    ...     ...     ...      ...
-contig_604    712     85      0       1      49    21.97
-\*              0      0 950320       0       0     0.00
-========== ====== ====== ====== ======= ======= ========
-
-In this example there were 604 contigs, each with one line in the output table,
-plus the final row (labelled with an asterisk) representing 950320 unmapped reads.
-In this BAM file, the fourth column was otherwise zero.
-
-.. class:: warningmark
-
-**Note**. If using this on a mapping BAM file, beware that the coverage counting is
-done per base of the reference.  This means if your reference has any extra bases
-compared to the reads being mapped, those bases will be skipped by CIGAR D operators
-and these "extra" bases can have an extremely low coverage, giving a potentially
-misleading ``min_cov`` values. A sliding window coverage may be more appropriate.
-
-**Note**. Up until samtools 1.2, there was an internal hard limit of 8000 for the
-pileup routine, meaning the reported coverage from ``samtools depth`` would show
-maximum coverage depths *around* 8000. This is now a run time option.
-
-
-**Citation**
-
-If you use this Galaxy tool in work leading to a scientific publication please
-cite:
-
-Heng Li et al (2009). The Sequence Alignment/Map format and SAMtools.
-Bioinformatics 25(16), 2078-9.
-http://dx.doi.org/10.1093/bioinformatics/btp352
-
-Peter J.A. Cock (2013), BAM coverage statistics using samtools idxstats and depth.
-http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats
-
-This wrapper is available to install into other Galaxy Instances via the Galaxy
-Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats
-    </help>
-    <citations>
-        <citation type="doi">10.1093/bioinformatics/btp352</citation>
-    </citations>
-</tool>