Previous changeset 2:7254ece0c0ff (2017-05-11) Next changeset 4:ed501717f6cd (2021-04-16) |
Commit message:
Uploaded v0.1.0 which was already on the Test Tool Shed. Included Python 3 support. |
added:
coverage_stats-1b5523d3d2c2/test-data/coverage_test.bam coverage_stats-1b5523d3d2c2/test-data/coverage_test.coverage_stats.tabular coverage_stats-1b5523d3d2c2/test-data/ex1.bam coverage_stats-1b5523d3d2c2/test-data/ex1.coverage_stats.md50.tabular coverage_stats-1b5523d3d2c2/test-data/ex1.coverage_stats.tabular coverage_stats-1b5523d3d2c2/tools/coverage_stats/README.rst coverage_stats-1b5523d3d2c2/tools/coverage_stats/coverage_stats.py coverage_stats-1b5523d3d2c2/tools/coverage_stats/coverage_stats.xml |
removed:
test-data/coverage_test.bam test-data/coverage_test.coverage_stats.tabular test-data/ex1.bam test-data/ex1.coverage_stats.md50.tabular test-data/ex1.coverage_stats.tabular tools/coverage_stats/README.rst tools/coverage_stats/coverage_stats.py tools/coverage_stats/coverage_stats.xml |
b |
diff -r 7254ece0c0ff -r 57b3ea22aff3 coverage_stats-1b5523d3d2c2/test-data/coverage_test.bam |
b |
Binary file coverage_stats-1b5523d3d2c2/test-data/coverage_test.bam has changed |
b |
diff -r 7254ece0c0ff -r 57b3ea22aff3 coverage_stats-1b5523d3d2c2/test-data/coverage_test.coverage_stats.tabular --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/coverage_stats-1b5523d3d2c2/test-data/coverage_test.coverage_stats.tabular Tue Aug 11 18:23:05 2020 -0400 |
b |
@@ -0,0 +1,7 @@ +#identifer length mapped placed min_cov max_cov mean_cov +ref 100 5 0 0 2 0.50 +none 200 0 0 0 0 0.00 +full 10 3 0 2 2 2.00 +head 10 1 0 0 1 0.50 +tail 20 1 0 0 1 0.25 +* 0 0 1 0 0 0.00 |
b |
diff -r 7254ece0c0ff -r 57b3ea22aff3 coverage_stats-1b5523d3d2c2/test-data/ex1.bam |
b |
Binary file coverage_stats-1b5523d3d2c2/test-data/ex1.bam has changed |
b |
diff -r 7254ece0c0ff -r 57b3ea22aff3 coverage_stats-1b5523d3d2c2/test-data/ex1.coverage_stats.md50.tabular --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/coverage_stats-1b5523d3d2c2/test-data/ex1.coverage_stats.md50.tabular Tue Aug 11 18:23:05 2020 -0400 |
b |
@@ -0,0 +1,4 @@ +#identifer length mapped placed min_cov max_cov mean_cov +chr1 1575 1446 18 0 50 32.06 +chr2 1584 1789 17 0 50 38.70 +* 0 0 0 0 0 0.00 |
b |
diff -r 7254ece0c0ff -r 57b3ea22aff3 coverage_stats-1b5523d3d2c2/test-data/ex1.coverage_stats.tabular --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/coverage_stats-1b5523d3d2c2/test-data/ex1.coverage_stats.tabular Tue Aug 11 18:23:05 2020 -0400 |
b |
@@ -0,0 +1,4 @@ +#identifer length mapped placed min_cov max_cov mean_cov +chr1 1575 1446 18 0 63 32.32 +chr2 1584 1789 17 0 64 39.78 +* 0 0 0 0 0 0.00 |
b |
diff -r 7254ece0c0ff -r 57b3ea22aff3 coverage_stats-1b5523d3d2c2/tools/coverage_stats/README.rst --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/coverage_stats-1b5523d3d2c2/tools/coverage_stats/README.rst Tue Aug 11 18:23:05 2020 -0400 |
b |
@@ -0,0 +1,124 @@ +BAM coverage statistics using samtools idxstats and depth +========================================================= + +This tool is copyright 2014-2017 by Peter Cock, The James Hutton Institute +(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved. +See the licence text below. + +Internally this tool uses the command-line samtools suite. + +This tool is available from the Galaxy Tool Shed at: +http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats + + +Automated Installation +====================== + +This should be straightforward, provided your Galaxy is setup to use Conda and +BioConda for dependencies - Galaxy should then automatically download and +install the expected version of samtools. + + +Manual Installation +=================== + +This expects samtools to be on the ``$PATH``, and was tested using v1.3.1. + +To install the wrapper copy or move the following files under the Galaxy tools +folder, e.g. in a ``tools/coverage_stats`` folder: + +* ``coverage_stats.xml`` (the Galaxy tool definition) +* ``coverage_stats.py`` (the Python wrapper script) +* ``README.rst`` (this file) + +You will also need to modify the ``tools_conf.xml`` file to tell Galaxy to offer +the tool. Just add the line, perhaps under the NGS tools section:: + + <tool file="coverage_stats/coverage_stats.xml" /> + +If you wish to run the unit tests, also move/copy the ``test-data/`` files +under Galaxy's ``test-data/`` folder. Then:: + + $ ./run_tests.sh -id coverage_stats + +That's it. + + +History +======= + +======= ====================================================================== +Version Changes +------- ---------------------------------------------------------------------- +v0.0.1 - Initial public release +v0.0.2 - Cope with samtools' default depth limit using modified samtools, + see https://github.com/samtools/samtools/pull/322 +v0.0.3 - Cope with no coverage in final contigs. +v0.0.4 - Reorder XML elements (internal change only). + - Planemo for Tool Shed upload (``.shed.yml``, internal change only). +v0.0.5 - Expose new ``samtools depth -d ...`` argument added in samtools v1.3 + - Depends on samtools v1.4.1 via Conda, which IS NOT AVAILABLE as a + legacy Tool Shed package. + - Apply the new maximum depth parameter within the script to ensure + excess coverage is clear by getting the max coverage equal to the + max depth setting (the raw output from samtools is more fuzzy). + - Report total length and overall mean coverage in stdout. + - Use ``<command detect_errors="aggressive">`` (internal change only). + - Single quote command line arguments (internal change only). +v0.0.6 - Python 3 compatibility fix. +v0.1.0 - Provide proper command line API for the Python script. +======= ====================================================================== + + +Developers +========== + +Development is on this GitHub repository: +https://github.com/peterjc/pico_galaxy/tree/master/tools/coverage_stats + +For pushing a release to the test or main "Galaxy Tool Shed", use the following +Planemo commands (which requires you have set your Tool Shed access details in +``~/.planemo.yml`` and that you have access rights on the Tool Shed):: + + $ planemo shed_update -t testtoolshed --check_diff tools/coverage_stats/ + ... + +or:: + + $ planemo shed_update -t toolshed --check_diff tools/coverage_stats/ + ... + +To just build and check the tar ball, use:: + + $ planemo shed_upload --tar_only tools/coverage_stats/ + ... + $ tar -tzf shed_upload.tar.gz + tools/coverage_stats/README.rst + tools/coverage_stats/coverage_stats.xml + tools/coverage_stats/coverage_stats.py + ... + + +Licence (MIT) +============= + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in +all copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +THE SOFTWARE. + +NOTE: This is the licence for the Galaxy Wrapper only. +samtools is available and licenced separately. |
b |
diff -r 7254ece0c0ff -r 57b3ea22aff3 coverage_stats-1b5523d3d2c2/tools/coverage_stats/coverage_stats.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/coverage_stats-1b5523d3d2c2/tools/coverage_stats/coverage_stats.py Tue Aug 11 18:23:05 2020 -0400 |
[ |
b'@@ -0,0 +1,323 @@\n+#!/usr/bin/env python\n+"""BAM coverage statistics using samtools idxstats and depth.\n+\n+This script takes exactly three command line arguments:\n+ * Input BAM filename\n+ * Input BAI filename (via Galaxy metadata)\n+ * Output tabular filename\n+\n+Optional fourth argument:\n+ * Max coverage depth (integer)\n+\n+This messes about with the filenames to make samtools happy, then\n+runs "samtools idxstats" and "samtools depth", captures and combines\n+the output to the desired output tabular file.\n+\n+Because "samtools depth" treats the max depth a little fuzzily, this\n+tool tries to account for this and applies a clear max-depth cut off.\n+"""\n+\n+import os\n+import subprocess\n+import sys\n+import tempfile\n+\n+from optparse import OptionParser\n+\n+usage = """Example usage:\n+\n+$ python coverage_stats.py -b mapped.bam -i mapped.bai -o summary.tsv -d 10000\n+"""\n+\n+parser = OptionParser(usage=usage)\n+parser.add_option(\n+ "-b", "--bam", dest="input_bam", default=None, help="Input BAM file", metavar="FILE"\n+)\n+parser.add_option(\n+ "-i",\n+ "--bai",\n+ dest="input_bai",\n+ default="-",\n+ help="Input BAI file (BAM index, optional, default or - infer from BAM filename)",\n+ metavar="FILE",\n+)\n+parser.add_option(\n+ "-o",\n+ dest="output_tabular",\n+ default="-",\n+ help="Output tabular file (optional, default stdout)",\n+ metavar="FILE",\n+)\n+parser.add_option(\n+ "-d",\n+ "--maxdepth",\n+ dest="max_depth",\n+ default=8000,\n+ help="Max coverage depth (integer, default 8000)",\n+ type="int",\n+)\n+parser.add_option(\n+ "-v",\n+ "--version",\n+ dest="version",\n+ default=False,\n+ action="store_true",\n+ help="Show version and quit",\n+)\n+\n+options, args = parser.parse_args()\n+\n+if options.version:\n+ # Galaxy seems to invert the order of the two lines\n+ print("BAM coverage statistics v0.1.0")\n+ cmd = "samtools 2>&1 | grep -i ^Version"\n+ sys.exit(os.system(cmd))\n+\n+bam_filename = options.input_bam\n+bai_filename = options.input_bai\n+tabular_filename = options.output_tabular\n+max_depth = options.max_depth\n+\n+if args:\n+ sys.exit("Sorry, the legacy API has been dropped.")\n+\n+if not bam_filename:\n+ sys.exit("Input BAM filename is required.")\n+if not os.path.isfile(bam_filename):\n+ sys.exit("Input BAM file not found: %s" % bam_filename)\n+if bai_filename == "-":\n+ if os.path.isfile(bam_filename + ".bai"):\n+ bai_filename = bam_filename + ".bai"\n+if not os.path.isfile(bai_filename):\n+ if bai_filename == "None":\n+ sys.exit("Error: Galaxy did not index your BAM file")\n+ sys.exit("Input BAI file not found: %s" % bai_filename)\n+\n+try:\n+ max_depth = int(max_depth)\n+except ValueError:\n+ sys.exit("Bad argument for max depth: %r" % max_depth)\n+if max_depth < 0:\n+ sys.exit("Bad argument for max depth: %r" % max_depth)\n+\n+# fuzz factor to ensure can reach max_depth, e.g. region with\n+# many reads having a deletion present could underestimate the\n+# coverage by capping the number of reads considered\n+depth_margin = 100\n+\n+# Assign sensible names with real extensions, and setup symlinks:\n+tmp_dir = tempfile.mkdtemp()\n+bam_file = os.path.join(tmp_dir, "temp.bam")\n+bai_file = os.path.join(tmp_dir, "temp.bam.bai")\n+idxstats_filename = os.path.join(tmp_dir, "idxstats.tsv")\n+depth_filename = os.path.join(tmp_dir, "depth.tsv")\n+os.symlink(os.path.abspath(bam_filename), bam_file)\n+os.symlink(os.path.abspath(bai_filename), bai_file)\n+assert os.path.isfile(bam_file), bam_file\n+assert os.path.isfile(bai_file), bai_file\n+assert os.path.isfile(bam_file + ".bai"), bam_file\n+\n+\n+def clean_up():\n+ """Remove our temporary files and directory."""\n+ os.remove(idxstats_filename)\n+ os.remove(depth_filename)\n+ os.remove(bam_file)\n+ os.remove(bai_file)\n+ os.rmdir(tmp_dir)\n+\n+\n+def samtools_depth_opt_available():\n+ """Determine if samtools depth supports maximum coverage argument."""\n+ child = subprocess.Popen(\n+ ["samtools", "depth"],\n+ univers'..b'ine = depth_handle.readline()\n+ # Are we at the end of the file?\n+ if not line:\n+ # Must be at the end of the file.\n+ # This can happen if the file contig(s) had no reads mapped\n+ return 0, 0, 0.0, 0\n+ depth_ref, depth_pos, depth_reads = line.rstrip("\\n").split()\n+ depth_pos = int(depth_pos)\n+ depth_reads = min(max_depth, int(depth_reads))\n+ # Can now treat as later references where first line cached\n+ elif identifier != depth_ref:\n+ # Infer that identifier had coverage zero,\n+ # and so was not in the \'samtools depth\'\n+ # output.\n+ # print("%s appears to have no coverage at all" % identifier)\n+ return 0, 0, 0.0, 0\n+\n+ # Good, at start of expected reference\n+ bases = depth_reads\n+ if depth_pos == 1:\n+ min_cov = depth_reads\n+ else:\n+ # print("%s has no coverage at start" % identifier)\n+ min_cov = 0\n+ max_cov = depth_reads\n+\n+ last_pos = depth_pos\n+ depth_ref = None\n+ depth_pos = 0\n+ depth_reads = 0\n+ for line in depth_handle:\n+ ref, pos, depth = line.rstrip("\\n").split()\n+ pos = int(pos)\n+ depth = min(max_depth, int(depth))\n+ if ref != identifier:\n+ # Reached the end of this identifier\'s coverage\n+ # so cache this ready for next identifier\n+ depth_ref, depth_pos, depth_reads = ref, pos, depth\n+ break\n+ bases += depth\n+ if last_pos + 1 < pos:\n+ # print("%s has no coverage between %i and %i"\n+ # % (identifier, last_pos, pos))\n+ min_cov = 0\n+ else:\n+ min_cov = min(min_cov, depth)\n+ max_cov = max(max_cov, depth)\n+ last_pos = pos\n+\n+ # Reach the end of this identifier\'s coverage or end of file\n+ if last_pos < length:\n+ # print("%s has no coverage at end" % identifier)\n+ min_cov = 0\n+ mean_cov = bases / float(length)\n+ return min_cov, max_cov, mean_cov, bases\n+\n+\n+# Parse and combine the output\n+if tabular_filename == "-":\n+ out_handle = sys.stdout\n+else:\n+ out_handle = open(tabular_filename, "w")\n+out_handle.write("#identifer\\tlength\\tmapped\\tplaced\\tmin_cov\\tmax_cov\\tmean_cov\\n")\n+\n+idxstats_handle = open(idxstats_filename)\n+depth_handle = open(depth_filename)\n+\n+depth_ref = None\n+depth_pos = 0\n+depth_reads = 0\n+\n+global_bases = 0\n+global_length = 0\n+for line in idxstats_handle:\n+ identifier, length, mapped, placed = line.rstrip("\\n").split()\n+ length = int(length)\n+ mapped = int(mapped)\n+ placed = int(placed)\n+ if identifier == "*":\n+ min_cov = 0\n+ max_cov = 0\n+ mean_cov = 0.0\n+ bases = 0\n+ else:\n+ min_cov, max_cov, mean_cov, bases = load_total_coverage(\n+ depth_handle, identifier, length\n+ )\n+ if max_cov > max_depth:\n+ sys.exit(\n+ "Using max depth %i yet saw max coverage %i for %s"\n+ % (max_depth, max_cov, identifier)\n+ )\n+ out_handle.write(\n+ "%s\\t%i\\t%i\\t%i\\t%i\\t%i\\t%0.2f\\n"\n+ % (identifier, length, mapped, placed, min_cov, max_cov, mean_cov)\n+ )\n+ if not (min_cov <= mean_cov <= max_cov):\n+ out_handle.write("ERROR, expect min_cov <= mean_cov <= max_cov\\n")\n+ idxstats_handle.close()\n+ depth_handle.close()\n+ out_handle.close()\n+ clean_up()\n+ sys.exit(\n+ "Problem with coverage for %s, expect min_cov <= mean_cov <= max_cov"\n+ % identifier\n+ )\n+ global_length += length\n+ global_bases += bases\n+\n+idxstats_handle.close()\n+depth_handle.close()\n+if tabular_filename != "-":\n+ out_handle.close()\n+\n+print(\n+ "Total reference length %i with overall mean coverage %0.2f"\n+ % (global_length, float(global_bases) / global_length)\n+)\n+\n+# Remove the temp symlinks and files:\n+clean_up()\n+\n+if depth_ref is not None:\n+ sys.exit("Left over output from \'samtools depth\'? %r" % depth_ref)\n' |
b |
diff -r 7254ece0c0ff -r 57b3ea22aff3 coverage_stats-1b5523d3d2c2/tools/coverage_stats/coverage_stats.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/coverage_stats-1b5523d3d2c2/tools/coverage_stats/coverage_stats.xml Tue Aug 11 18:23:05 2020 -0400 |
b |
@@ -0,0 +1,119 @@ +<tool id="coverage_stats" name="BAM coverage statistics" version="0.1.0"> + <description>using samtools idxstats and depth</description> + <requirements> + <requirement type="package" version="1.4.1">samtools</requirement> + </requirements> + <version_command> +python $__tool_directory__/coverage_stats.py --version + </version_command> + <command detect_errors="aggressive"> +python $__tool_directory__/coverage_stats.py +-b '$input_bam' +-i '${input_bam.metadata.bam_index}' +-o '$out_tabular' +-d '$max_depth' + </command> + <inputs> + <param name="input_bam" type="data" format="bam" label="Input BAM file" /> + <param name="max_depth" type="integer" min="0" max="10000000" label="Max depth" value="8000" /> + </inputs> + <outputs> + <data name="out_tabular" format="tabular" label="$input_bam.name (coverage stats)" /> + </outputs> + <tests> + <test> + <param name="input_bam" value="ex1.bam" ftype="bam" /> + <param name="max_depth" value="123" /> + <output name="out_tabular" file="ex1.coverage_stats.tabular" ftype="tabular" /> + </test> + <test> + <param name="input_bam" value="ex1.bam" ftype="bam" /> + <param name="max_depth" value="50" /> + <output name="out_tabular" file="ex1.coverage_stats.md50.tabular" ftype="tabular" /> + </test> + <test> + <param name="input_bam" value="coverage_test.bam" ftype="bam" /> + <param name="max_depth" value="123" /> + <output name="out_tabular" file="coverage_test.coverage_stats.tabular" ftype="tabular" /> + </test> + </tests> + <help> +**What it does** + +This tool runs the commands ``samtools idxstats`` and ``samtools depth`` from the +SAMtools toolkit, and parses their output to produce a consise summary of the +coverage information for each reference sequence. + +Input is a sorted and indexed BAM file, the output is tabular. The first four +columns match the output from ``samtools idxstats``, the additional columns are +calculated from the ``samtools depth`` output. The final row with a star as the +reference identifier represents unmapped reads, and will have zeros in every +column except columns one and four. + +====== ================================================================================= +Column Description +------ --------------------------------------------------------------------------------- + 1 Reference sequence identifier + 2 Reference sequence length + 3 Number of mapped reads + 4 Number of placed but unmapped reads (typically unmapped partners of mapped reads) + 5 Minimum coverage (per base of reference) + 6 Maximum coverage (per base of reference) + 7 Mean coverage (given to 2 dp) +====== ================================================================================= + +Example output from a *de novo* assembly: + +========== ====== ====== ====== ======= ======= ======== +identiifer length mapped placed min_cov max_cov mean_cov +---------- ------ ------ ------ ------- ------- -------- +contig_1 833604 436112 0 1 157 71.95 +contig_2 14820 9954 0 1 152 91.27 +contig_3 272099 142958 0 1 150 72.31 +contig_4 135519 73288 0 1 149 75.23 +contig_5 91245 46759 0 1 157 70.92 +contig_6 175604 95744 0 1 146 75.99 +contig_7 90586 48158 0 1 151 72.93 +contig_9 234347 126458 0 1 159 75.40 +contig_10 121515 60211 0 1 152 68.12 +... ... ... ... ... ... ... +contig_604 712 85 0 1 49 21.97 +\* 0 0 950320 0 0 0.00 +========== ====== ====== ====== ======= ======= ======== + +In this example there were 604 contigs, each with one line in the output table, +plus the final row (labelled with an asterisk) representing 950320 unmapped reads. +In this BAM file, the fourth column was otherwise zero. + +.. class:: warningmark + +**Note**. If using this on a mapping BAM file, beware that the coverage counting is +done per base of the reference. This means if your reference has any extra bases +compared to the reads being mapped, those bases will be skipped by CIGAR D operators +and these "extra" bases can have an extremely low coverage, giving a potentially +misleading ``min_cov`` values. A sliding window coverage may be more appropriate. + +**Note**. Up until samtools 1.2, there was an internal hard limit of 8000 for the +pileup routine, meaning the reported coverage from ``samtools depth`` would show +maximum coverage depths *around* 8000. This is now a run time option. + + +**Citation** + +If you use this Galaxy tool in work leading to a scientific publication please +cite: + +Heng Li et al (2009). The Sequence Alignment/Map format and SAMtools. +Bioinformatics 25(16), 2078-9. +https://doi.org/10.1093/bioinformatics/btp352 + +Peter J.A. Cock (2013), BAM coverage statistics using samtools idxstats and depth. +http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats + +This wrapper is available to install into other Galaxy Instances via the Galaxy +Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats + </help> + <citations> + <citation type="doi">10.1093/bioinformatics/btp352</citation> + </citations> +</tool> |
b |
diff -r 7254ece0c0ff -r 57b3ea22aff3 test-data/coverage_test.bam |
b |
Binary file test-data/coverage_test.bam has changed |
b |
diff -r 7254ece0c0ff -r 57b3ea22aff3 test-data/coverage_test.coverage_stats.tabular --- a/test-data/coverage_test.coverage_stats.tabular Thu May 11 12:16:10 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
b |
@@ -1,7 +0,0 @@ -#identifer length mapped placed min_cov max_cov mean_cov -ref 100 5 0 0 2 0.50 -none 200 0 0 0 0 0.00 -full 10 3 0 2 2 2.00 -head 10 1 0 0 1 0.50 -tail 20 1 0 0 1 0.25 -* 0 0 1 0 0 0.00 |
b |
diff -r 7254ece0c0ff -r 57b3ea22aff3 test-data/ex1.bam |
b |
Binary file test-data/ex1.bam has changed |
b |
diff -r 7254ece0c0ff -r 57b3ea22aff3 test-data/ex1.coverage_stats.md50.tabular --- a/test-data/ex1.coverage_stats.md50.tabular Thu May 11 12:16:10 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
b |
@@ -1,4 +0,0 @@ -#identifer length mapped placed min_cov max_cov mean_cov -chr1 1575 1446 18 0 50 32.06 -chr2 1584 1789 17 0 50 38.70 -* 0 0 0 0 0 0.00 |
b |
diff -r 7254ece0c0ff -r 57b3ea22aff3 test-data/ex1.coverage_stats.tabular --- a/test-data/ex1.coverage_stats.tabular Thu May 11 12:16:10 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
b |
@@ -1,4 +0,0 @@ -#identifer length mapped placed min_cov max_cov mean_cov -chr1 1575 1446 18 0 63 32.32 -chr2 1584 1789 17 0 64 39.78 -* 0 0 0 0 0 0.00 |
b |
diff -r 7254ece0c0ff -r 57b3ea22aff3 tools/coverage_stats/README.rst --- a/tools/coverage_stats/README.rst Thu May 11 12:16:10 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
b |
@@ -1,122 +0,0 @@ -BAM coverage statistics using samtools idxstats and depth -========================================================= - -This tool is copyright 2014-2017 by Peter Cock, The James Hutton Institute -(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved. -See the licence text below. - -Internally this tool uses the command-line samtools suite. - -This tool is available from the Galaxy Tool Shed at: -http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats - - -Automated Installation -====================== - -This should be straightforward, provided your Galaxy is setup to use Conda and -BioConda for dependencies - Galaxy should then automatically download and -install the expected version of samtools. - - -Manual Installation -=================== - -This expects samtools to be on the ``$PATH``, and was tested using v1.3.1. - -To install the wrapper copy or move the following files under the Galaxy tools -folder, e.g. in a ``tools/coverage_stats`` folder: - -* ``coverage_stats.xml`` (the Galaxy tool definition) -* ``coverage_stats.py`` (the Python wrapper script) -* ``README.rst`` (this file) - -You will also need to modify the ``tools_conf.xml`` file to tell Galaxy to offer -the tool. Just add the line, perhaps under the NGS tools section:: - - <tool file="coverage_stats/coverage_stats.xml" /> - -If you wish to run the unit tests, also move/copy the ``test-data/`` files -under Galaxy's ``test-data/`` folder. Then:: - - $ ./run_tests.sh -id coverage_stats - -That's it. - - -History -======= - -======= ====================================================================== -Version Changes -------- ---------------------------------------------------------------------- -v0.0.1 - Initial public release -v0.0.2 - Cope with samtools' default depth limit using modified samtools, - see https://github.com/samtools/samtools/pull/322 -v0.0.3 - Cope with no coverage in final contigs. -v0.0.4 - Reorder XML elements (internal change only). - - Planemo for Tool Shed upload (``.shed.yml``, internal change only). -v0.0.5 - Expose new ``samtools depth -d ...`` argument added in samtools v1.3 - - Depends on samtools v1.4.1 via Conda, which IS NOT AVAILABLE as a - legacy Tool Shed package. - - Apply the new maximum depth parameter within the script to ensure - excess coverage is clear by getting the max coverage equal to the - max depth setting (the raw output from samtools is more fuzzy). - - Report total length and overall mean coverage in stdout. - - Use ``<command detect_errors="aggressive">`` (internal change only). - - Single quote command line arguments (internal change only). -======= ====================================================================== - - -Developers -========== - -Development is on this GitHub repository: -https://github.com/peterjc/pico_galaxy/tree/master/tools/coverage_stats - -For pushing a release to the test or main "Galaxy Tool Shed", use the following -Planemo commands (which requires you have set your Tool Shed access details in -``~/.planemo.yml`` and that you have access rights on the Tool Shed):: - - $ planemo shed_update -t testtoolshed --check_diff tools/coverage_stats/ - ... - -or:: - - $ planemo shed_update -t toolshed --check_diff tools/coverage_stats/ - ... - -To just build and check the tar ball, use:: - - $ planemo shed_upload --tar_only tools/coverage_stats/ - ... - $ tar -tzf shed_upload.tar.gz - tools/coverage_stats/README.rst - tools/coverage_stats/coverage_stats.xml - tools/coverage_stats/coverage_stats.py - ... - - -Licence (MIT) -============= - -Permission is hereby granted, free of charge, to any person obtaining a copy -of this software and associated documentation files (the "Software"), to deal -in the Software without restriction, including without limitation the rights -to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -copies of the Software, and to permit persons to whom the Software is -furnished to do so, subject to the following conditions: - -The above copyright notice and this permission notice shall be included in -all copies or substantial portions of the Software. - -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -THE SOFTWARE. - -NOTE: This is the licence for the Galaxy Wrapper only. -samtools is available and licenced separately. |
b |
diff -r 7254ece0c0ff -r 57b3ea22aff3 tools/coverage_stats/coverage_stats.py --- a/tools/coverage_stats/coverage_stats.py Thu May 11 12:16:10 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
[ |
b'@@ -1,251 +0,0 @@\n-#!/usr/bin/env python\n-"""BAM coverage statistics using samtools idxstats and depth.\n-\n-This script takes exactly three command line arguments:\n- * Input BAM filename\n- * Input BAI filename (via Galaxy metadata)\n- * Output tabular filename\n-\n-Optional fourth argument:\n- * Max coverage depth (integer)\n-\n-This messes about with the filenames to make samtools happy, then\n-runs "samtools idxstats" and "samtools depth", captures and combines\n-the output to the desired output tabular file.\n-\n-Because "samtools depth" treats the max depth a little fuzzily, this\n-tool tries to account for this and applies a clear max-depth cut off.\n-"""\n-\n-import os\n-import subprocess\n-import sys\n-import tempfile\n-\n-if "-v" in sys.argv or "--version" in sys.argv:\n- # Galaxy seems to invert the order of the two lines\n- print("BAM coverage statistics v0.0.5")\n- cmd = "samtools 2>&1 | grep -i ^Version"\n- sys.exit(os.system(cmd))\n-\n-# TODO - Proper command line API\n-if len(sys.argv) == 4:\n- bam_filename, bai_filename, tabular_filename = sys.argv[1:]\n- max_depth = "8000"\n-elif len(sys.argv) == 5:\n- bam_filename, bai_filename, tabular_filename, max_depth = sys.argv[1:]\n-else:\n- sys.exit("Require 3 or 4 arguments: BAM, BAI, tabular filename, [max depth]")\n-\n-if not os.path.isfile(bam_filename):\n- sys.exit("Input BAM file not found: %s" % bam_filename)\n-if bai_filename == "-":\n- # Make this optional for ease of use at the command line by hand:\n- if os.path.isfile(bam_filename + ".bai"):\n- bai_filename = bam_filename + ".bai"\n-if not os.path.isfile(bai_filename):\n- if bai_filename == "None":\n- sys.exit("Error: Galaxy did not index your BAM file")\n- sys.exit("Input BAI file not found: %s" % bai_filename)\n-\n-try:\n- max_depth = int(max_depth)\n-except ValueError:\n- sys.exit("Bad argument for max depth: %r" % max_depth)\n-if max_depth < 0:\n- sys.exit("Bad argument for max depth: %r" % max_depth)\n-\n-# fuzz factor to ensure can reach max_depth, e.g. region with\n-# many reads having a deletion present could underestimate the\n-# coverage by capping the number of reads considered\n-depth_margin = 100\n-\n-# Assign sensible names with real extensions, and setup symlinks:\n-tmp_dir = tempfile.mkdtemp()\n-bam_file = os.path.join(tmp_dir, "temp.bam")\n-bai_file = os.path.join(tmp_dir, "temp.bam.bai")\n-idxstats_filename = os.path.join(tmp_dir, "idxstats.tsv")\n-depth_filename = os.path.join(tmp_dir, "depth.tsv")\n-os.symlink(os.path.abspath(bam_filename), bam_file)\n-os.symlink(os.path.abspath(bai_filename), bai_file)\n-assert os.path.isfile(bam_file), bam_file\n-assert os.path.isfile(bai_file), bai_file\n-assert os.path.isfile(bam_file + ".bai"), bam_file\n-\n-\n-def clean_up():\n- """Remove our temporary files and directory."""\n- os.remove(idxstats_filename)\n- os.remove(depth_filename)\n- os.remove(bam_file)\n- os.remove(bai_file)\n- os.rmdir(tmp_dir)\n-\n-\n-def samtools_depth_opt_available():\n- """Determine if samtools depth supports maximum coverage argument."""\n- child = subprocess.Popen(["samtools", "depth"],\n- stdout=subprocess.PIPE, stderr=subprocess.STDOUT)\n- # Combined stdout/stderr in case samtools is ever inconsistent\n- output, tmp = child.communicate()\n- assert tmp is None\n- # Expect to find this line in the help text, exact wording could change:\n- # -d/-m <int> maximum coverage depth [8000]\n- return " -d/-m " in output\n-\n-\n-depth_hack = False\n-if not samtools_depth_opt_available():\n- if max_depth + depth_margin <= 8000:\n- sys.stderr.write("WARNING: The version of samtools depth installed does not "\n- "support the -d option, however, the requested max-depth "\n- "is safely under the default of 8000.\\n")\n- depth_hack = True\n- else:\n- sys.exit("The version of samtools depth installed does not support the -d option.")\n-\n-# Run samtools idxstats:\n-cmd = \''..b'nt("====")\n- # print("%s coverage calculation, length %i, ..." % (identifier, length))\n-\n- if depth_ref is None:\n- # Right at start of file / new contig\n- line = depth_handle.readline()\n- # Are we at the end of the file?\n- if not line:\n- # Must be at the end of the file.\n- # This can happen if the file contig(s) had no reads mapped\n- return 0, 0, 0.0, 0\n- depth_ref, depth_pos, depth_reads = line.rstrip("\\n").split()\n- depth_pos = int(depth_pos)\n- depth_reads = min(max_depth, int(depth_reads))\n- # Can now treat as later references where first line cached\n- elif identifier != depth_ref:\n- # Infer that identifier had coverage zero,\n- # and so was not in the \'samtools depth\'\n- # output.\n- # print("%s appears to have no coverage at all" % identifier)\n- return 0, 0, 0.0, 0\n-\n- # Good, at start of expected reference\n- bases = depth_reads\n- if depth_pos == 1:\n- min_cov = depth_reads\n- else:\n- # print("%s has no coverage at start" % identifier)\n- min_cov = 0\n- max_cov = depth_reads\n-\n- last_pos = depth_pos\n- depth_ref = None\n- depth_pos = 0\n- depth_reads = 0\n- for line in depth_handle:\n- ref, pos, depth = line.rstrip("\\n").split()\n- pos = int(pos)\n- depth = min(max_depth, int(depth))\n- if ref != identifier:\n- # Reached the end of this identifier\'s coverage\n- # so cache this ready for next identifier\n- depth_ref, depth_pos, depth_reads = ref, pos, depth\n- break\n- bases += depth\n- if last_pos + 1 < pos:\n- # print("%s has no coverage between %i and %i" % (identifier, last_pos, pos))\n- min_cov = 0\n- else:\n- min_cov = min(min_cov, depth)\n- max_cov = max(max_cov, depth)\n- last_pos = pos\n-\n- # Reach the end of this identifier\'s coverage or end of file\n- if last_pos < length:\n- # print("%s has no coverage at end" % identifier)\n- min_cov = 0\n- mean_cov = bases / float(length)\n- return min_cov, max_cov, mean_cov, bases\n-\n-\n-# Parse and combine the output\n-out_handle = open(tabular_filename, "w")\n-out_handle.write("#identifer\\tlength\\tmapped\\tplaced\\tmin_cov\\tmax_cov\\tmean_cov\\n")\n-\n-idxstats_handle = open(idxstats_filename)\n-depth_handle = open(depth_filename)\n-\n-depth_ref = None\n-depth_pos = 0\n-depth_reads = 0\n-\n-global_bases = 0\n-global_length = 0\n-for line in idxstats_handle:\n- identifier, length, mapped, placed = line.rstrip("\\n").split()\n- length = int(length)\n- mapped = int(mapped)\n- placed = int(placed)\n- if identifier == "*":\n- min_cov = 0\n- max_cov = 0\n- mean_cov = 0.0\n- bases = 0\n- else:\n- min_cov, max_cov, mean_cov, bases = load_total_coverage(depth_handle, identifier, length)\n- if max_cov > max_depth:\n- sys.exit("Using max depth %i yet saw max coverage %i for %s"\n- % (max_depth, max_cov, identifier))\n- out_handle.write("%s\\t%i\\t%i\\t%i\\t%i\\t%i\\t%0.2f\\n"\n- % (identifier, length, mapped, placed,\n- min_cov, max_cov, mean_cov))\n- if not (min_cov <= mean_cov <= max_cov):\n- out_handle.write("ERROR, expect min_cov <= mean_cov <= max_cov\\n")\n- idxstats_handle.close()\n- depth_handle.close()\n- out_handle.close()\n- clean_up()\n- sys.exit("Problem with coverage for %s, expect min_cov <= mean_cov <= max_cov"\n- % identifier)\n- global_length += length\n- global_bases += bases\n-\n-idxstats_handle.close()\n-depth_handle.close()\n-out_handle.close()\n-\n-print("Total reference length %i with overall mean coverage %0.2f" % (global_length, float(global_bases) / global_length))\n-\n-# Remove the temp symlinks and files:\n-clean_up()\n-\n-if depth_ref is not None:\n- sys.exit("Left over output from \'samtools depth\'? %r" % depth_ref)\n' |
b |
diff -r 7254ece0c0ff -r 57b3ea22aff3 tools/coverage_stats/coverage_stats.xml --- a/tools/coverage_stats/coverage_stats.xml Thu May 11 12:16:10 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
b |
@@ -1,115 +0,0 @@ -<tool id="coverage_stats" name="BAM coverage statistics" version="0.0.5"> - <description>using samtools idxstats and depth</description> - <requirements> - <requirement type="package" version="1.4.1">samtools</requirement> - </requirements> - <version_command> -python $__tool_directory__/coverage_stats.py --version - </version_command> - <command detect_errors="aggressive"> -python $__tool_directory__/coverage_stats.py '$input_bam' '${input_bam.metadata.bam_index}' '$out_tabular' '$max_depth' - </command> - <inputs> - <param name="input_bam" type="data" format="bam" label="Input BAM file" /> - <param name="max_depth" type="integer" min="0" max="10000000" label="Max depth" value="8000" /> - </inputs> - <outputs> - <data name="out_tabular" format="tabular" label="$input_bam.name (coverage stats)" /> - </outputs> - <tests> - <test> - <param name="input_bam" value="ex1.bam" ftype="bam" /> - <param name="max_depth" value="123" /> - <output name="out_tabular" file="ex1.coverage_stats.tabular" ftype="tabular" /> - </test> - <test> - <param name="input_bam" value="ex1.bam" ftype="bam" /> - <param name="max_depth" value="50" /> - <output name="out_tabular" file="ex1.coverage_stats.md50.tabular" ftype="tabular" /> - </test> - <test> - <param name="input_bam" value="coverage_test.bam" ftype="bam" /> - <param name="max_depth" value="123" /> - <output name="out_tabular" file="coverage_test.coverage_stats.tabular" ftype="tabular" /> - </test> - </tests> - <help> -**What it does** - -This tool runs the commands ``samtools idxstats`` and ``samtools depth`` from the -SAMtools toolkit, and parses their output to produce a consise summary of the -coverage information for each reference sequence. - -Input is a sorted and indexed BAM file, the output is tabular. The first four -columns match the output from ``samtools idxstats``, the additional columns are -calculated from the ``samtools depth`` output. The final row with a star as the -reference identifier represents unmapped reads, and will have zeros in every -column except columns one and four. - -====== ================================================================================= -Column Description ------- --------------------------------------------------------------------------------- - 1 Reference sequence identifier - 2 Reference sequence length - 3 Number of mapped reads - 4 Number of placed but unmapped reads (typically unmapped partners of mapped reads) - 5 Minimum coverage (per base of reference) - 6 Maximum coverage (per base of reference) - 7 Mean coverage (given to 2 dp) -====== ================================================================================= - -Example output from a *de novo* assembly: - -========== ====== ====== ====== ======= ======= ======== -identiifer length mapped placed min_cov max_cov mean_cov ----------- ------ ------ ------ ------- ------- -------- -contig_1 833604 436112 0 1 157 71.95 -contig_2 14820 9954 0 1 152 91.27 -contig_3 272099 142958 0 1 150 72.31 -contig_4 135519 73288 0 1 149 75.23 -contig_5 91245 46759 0 1 157 70.92 -contig_6 175604 95744 0 1 146 75.99 -contig_7 90586 48158 0 1 151 72.93 -contig_9 234347 126458 0 1 159 75.40 -contig_10 121515 60211 0 1 152 68.12 -... ... ... ... ... ... ... -contig_604 712 85 0 1 49 21.97 -\* 0 0 950320 0 0 0.00 -========== ====== ====== ====== ======= ======= ======== - -In this example there were 604 contigs, each with one line in the output table, -plus the final row (labelled with an asterisk) representing 950320 unmapped reads. -In this BAM file, the fourth column was otherwise zero. - -.. class:: warningmark - -**Note**. If using this on a mapping BAM file, beware that the coverage counting is -done per base of the reference. This means if your reference has any extra bases -compared to the reads being mapped, those bases will be skipped by CIGAR D operators -and these "extra" bases can have an extremely low coverage, giving a potentially -misleading ``min_cov`` values. A sliding window coverage may be more appropriate. - -**Note**. Up until samtools 1.2, there was an internal hard limit of 8000 for the -pileup routine, meaning the reported coverage from ``samtools depth`` would show -maximum coverage depths *around* 8000. This is now a run time option. - - -**Citation** - -If you use this Galaxy tool in work leading to a scientific publication please -cite: - -Heng Li et al (2009). The Sequence Alignment/Map format and SAMtools. -Bioinformatics 25(16), 2078-9. -http://dx.doi.org/10.1093/bioinformatics/btp352 - -Peter J.A. Cock (2013), BAM coverage statistics using samtools idxstats and depth. -http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats - -This wrapper is available to install into other Galaxy Instances via the Galaxy -Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats - </help> - <citations> - <citation type="doi">10.1093/bioinformatics/btp352</citation> - </citations> -</tool> |