Next changeset 1:3ee6f4d0ac80 (2017-05-16)
Commit message:
v0.0.4 - Previously only on Test Tool Shed
added:
test-data/SRR639755_mito_pairs_vs_NC_010642_clc.bam
test-data/SRR639755_mito_pairs_vs_NC_010642_clc.bam.bai
test-data/SRR639755_mito_pairs_vs_NC_010642_clc.count-1695-1725.tabular
test-data/coverage_test.bam
test-data/coverage_test.count_roi_variants.tabular
test-data/ex1.bam
test-data/ex1.count_roi_variants.tabular
tools/count_roi_variants/README.rst
tools/count_roi_variants/count_roi_variants.py
tools/count_roi_variants/count_roi_variants.xml
tools/count_roi_variants/tool_dependencies.xml
diff -r 000000000000 -r 95efbdb72961 test-data/SRR639755_mito_pairs_vs_NC_010642_clc.bam
Binary file test-data/SRR639755_mito_pairs_vs_NC_010642_clc.bam has changed
diff -r 000000000000 -r 95efbdb72961 test-data/SRR639755_mito_pairs_vs_NC_010642_clc.bam.bai
Binary file test-data/SRR639755_mito_pairs_vs_NC_010642_clc.bam.bai has changed
diff -r 000000000000 -r 95efbdb72961 test-data/SRR639755_mito_pairs_vs_NC_010642_clc.count-1695-1725.tabular
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/SRR639755_mito_pairs_vs_NC_010642_clc.count-1695-1725.tabular	Wed Feb 01 07:10:26 2017 -0500
@@ -0,0 +1,4 @@
+Variant	Count	Percentage
+AGCCCATGAGATGGGAAGCAATGGGCTACA	14	87.50
+AGCCCATGAGATGGGAAGCAATGGGCTACG	1	6.25
+AGCGCATGAGATGGGAAGCAATGGGCTACG	1	6.25
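The percentages in this expected output follow directly from the counts: 14 + 1 + 1 = 16 reads span the region, so 14/16 gives 87.50 and 1/16 gives 6.25. A minimal Python sketch of that arithmetic and row ordering is below. The function name `format_counts` is hypothetical and is not part of the tool; it only mirrors the "highest count first, ties broken alphabetically" ordering described in the comments of count_roi_variants.py in this changeset.

```python
def format_counts(tally):
    """Return tab-separated rows: variant, count, percentage (2 dp).

    Illustrative helper only (hypothetical name, not from the tool).
    """
    total = sum(tally.values())
    rows = ["Variant\tCount\tPercentage"]
    # Highest count first; ties are broken alphabetically by sequence.
    for seq, count in sorted(tally.items(), key=lambda kv: (-kv[1], kv[0])):
        rows.append("%s\t%i\t%0.2f" % (seq, count, count * 100.0 / total))
    return rows


example = {
    "AGCCCATGAGATGGGAAGCAATGGGCTACA": 14,
    "AGCCCATGAGATGGGAAGCAATGGGCTACG": 1,
    "AGCGCATGAGATGGGAAGCAATGGGCTACG": 1,
}
print("\n".join(format_counts(example)))
```

Sorting on `(-count, sequence)` keeps the output deterministic, which matters when the file is compared byte for byte in a functional test.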
diff -r 000000000000 -r 95efbdb72961 test-data/coverage_test.bam
Binary file test-data/coverage_test.bam has changed
diff -r 000000000000 -r 95efbdb72961 test-data/coverage_test.count_roi_variants.tabular
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/coverage_test.count_roi_variants.tabular	Wed Feb 01 07:10:26 2017 -0500
@@ -0,0 +1,2 @@
+Variant	Count	Percentage
+TAAAGG	1	100.00
diff -r 000000000000 -r 95efbdb72961 test-data/ex1.bam
Binary file test-data/ex1.bam has changed
diff -r 000000000000 -r 95efbdb72961 test-data/ex1.count_roi_variants.tabular
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/ex1.count_roi_variants.tabular	Wed Feb 01 07:10:26 2017 -0500
@@ -0,0 +1,3 @@
+Variant	Count	Percentage
+CTATCTA	39	97.50
+ATATATA	1	2.50
diff -r 000000000000 -r 95efbdb72961 tools/count_roi_variants/README.rst
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/count_roi_variants/README.rst	Wed Feb 01 07:10:26 2017 -0500
@@ -0,0 +1,138 @@
+Count sequence variants in region of interest in BAM file
+=========================================================
+
+This tool is copyright 2016 by Peter Cock, The James Hutton Institute
+(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved.
+See the licence text below.
+
+This tool runs the command ``samtools view`` (taking advantage of an
+indexed BAM file) to access only those reads mapped to the region of
+interest (ROI), and then counts the different sequence variants found.
+
+Internally this tool uses the command-line samtools suite.
+
+This tool is available from the Galaxy Tool Shed at:
+http://toolshed.g2.bx.psu.edu/view/peterjc/count_roi_variants
+
+
+Use outside of Galaxy
+=====================
+
+You just need the ``count_roi_variants.py`` script and to have samtools
+on the ``$PATH``. If you move/copy the script somewhere on your ``$PATH``
+and then you can run it like this::
+
+    $ count_roi_variants.py --help
+
+Or, call the script at an explicit path::
+
+    $ /path/to/my/stuff/count_roi_variants.py --help
+
+Run like this it will use the current default Python. This was written and
+tested under Python 2.7, but should also work under Python 2.6. e.g.::
+
+    $ python2.6 /path/to/my/stuff/count_roi_variants.py --help
+
+This does not yet support Python 3.
+
+The sample data and tests are designed to be run via Galaxy.
+
+
+Automated Galaxy Installation
+=============================
+
+This should be straightforward, Galaxy should automatically download and install
+samtools if required.
+
+
+Manual Galaxy Installation
+==========================
+
+This expects samtools to be on the ``$PATH``, and was tested using v0.1.3
+
+To install the wrapper copy or move the following files under the Galaxy tools
+folder, e.g. in a ``tools/count_roi_variants`` folder:
+
+* ``count_roi_variants.xml`` (the Galaxy tool definition)
+* ``count_roi_variants.py`` (the Python wrapper script)
+* ``README.rst`` (this file)
+
+You will also need to modify the ``tools_conf.xml`` file to tell Galaxy to offer
+the tool. Just add the line, perhaps under the NGS tools section::
+
+    <tool file="count_roi_variants/count_roi_variants.xml" />
+
+If you wish to run the unit tests, also move/copy the ``test-data/`` files
+under Galaxy's ``test-data/`` folder. Then::
+
+    $ ./run_tests.sh -id count_roi_variants
+
+That's it.
+
+
+History
+=======
+
+======= ======================================================================
+Version Changes
+------- ----------------------------------------------------------------------
+v0.0.1  - Initial public release
+v0.0.2  - Cope with pipes in reference name (e.g. NCBI style FASTA naming)
+v0.0.3  - Include a functional test for using an unrecognised reference.
+v0.0.4  - Improved usage text and README for use outside of Galaxy.
+======= ======================================================================
+
+
+Developers
+==========
+
+Development is on this GitHub repository:
+https://github.com/peterjc/pico_galaxy/tree/master/tools/count_roi_variants
+
+For pushing a release to the test or main "Galaxy Tool Shed", use the following
+Planemo commands (which requires you have set your Tool Shed access details in
+``~/.planemo.yml`` and that you have access rights on the Tool Shed)::
+
+    $ planemo shed_update -t testtoolshed --check_diff ~/repositories/pico_galaxy/tools/count_roi_variants/
+    ...
+
+or::
+
+    $ planemo shed_update -t toolshed --check_diff ~/repositories/pico_galaxy/tools/count_roi_variants/
+    ...
+
+To just build and check the tar ball, use::
+
+    $ planemo shed_upload --tar_only ~/repositories/pico_galaxy/tools/count_roi_variants/
+    ...
+    $ tar -tzf shed_upload.tar.gz
+    tools/count_roi_variants/README.rst
+    tools/count_roi_variants/count_roi_variants.xml
+    tools/count_roi_variants/count_roi_variants.py
+    tools/count_roi_variants/tool_dependencies.xml
+    ...
+
+
+Licence (MIT)
+=============
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
+
+NOTE: This is the licence for the Galaxy Wrapper only.
+samtools is available and licenced separately.
diff -r 000000000000 -r 95efbdb72961 tools/count_roi_variants/count_roi_variants.py
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/count_roi_variants/count_roi_variants.py	Wed Feb 01 07:10:26 2017 -0500
b'@@ -0,0 +1,240 @@\n+#!/usr/bin/env python\n+"""Count sequence variants in region of interest in a BAM file.\n+\n+This script takes exactly four command line arguments:\n+ * Input BAM filename\n+ * Input BAI filename (via Galaxy metadata)\n+ * Output tabular filename\n+ * Region of interest (reference:start-end as used in samtools)\n+\n+This messes about with the filenames to make samtools happy, then\n+runs "samtools view" and parses the reads mapped to the ROI, counts\n+the observed variants spanning the ROI, and outputs this as a\n+tabular file.\n+"""\n+import sys\n+import os\n+import subprocess\n+import tempfile\n+\n+if "-v" in sys.argv or "--version" in sys.argv:\n+ # Galaxy seems to invert the order of the two lines\n+ print("BAM coverage statistics v0.0.4 (using samtools)")\n+ cmd = "samtools 2>&1 | grep -i ^Version"\n+ sys.exit(os.system(cmd))\n+\n+# TODO - Proper command line API\n+usage = """Requires 4 arguments: BAM, BAI, tabular filename, samtools-style region\n+\n+For ease of use, you can use a minus sign as the BAI filename which will use the BAM\n+filename with the suffix .bai added to it. 
Example using one of the test-data files:\n+\n+$ count_roi_variants.py "SRR639755_mito_pairs_vs_NC_010642_clc.bam" "-" "counts.txt" "gi|187250362|ref|NC_010642.1|:1695-1725"\n+Counted 3 variants from 16 reads spanning gi|187250362|ref|NC_010642.1|:1695-1725\n+$ cat counts.txt\n+Variant\tCount\tPercentage\n+AGCCCATGAGATGGGAAGCAATGGGCTACA\t14\t87.50\n+AGCCCATGAGATGGGAAGCAATGGGCTACG\t1\t6.25\n+AGCGCATGAGATGGGAAGCAATGGGCTACG\t1\t6.25\n+"""\n+if len(sys.argv) == 5:\n+ bam_filename, bai_filename, tabular_filename, region = sys.argv[1:]\n+else:\n+ sys.exit(usage)\n+\n+if not os.path.isfile(bam_filename):\n+ sys.exit("Input BAM file not found: %s" % bam_filename)\n+if bai_filename == "-":\n+ # Make this optional for ease of use at the command line by hand:\n+ if os.path.isfile(bam_filename + ".bai"):\n+ bai_filename = bam_filename + ".bai"\n+if not os.path.isfile(bai_filename):\n+ if bai_filename == "None":\n+ sys.exit("Error: Galaxy did not index your BAM file")\n+ sys.exit("Input BAI file not found: %s" % bai_filename)\n+\n+try:\n+ # Sanity check we have "ref:start-end" to give clear error message\n+ # Note can have semi-colon in the reference name\n+ # Note can have thousand separator commas in the start/end\n+ ref, start_end = region.rsplit(":", 1)\n+ start, end = start_end.split("-")\n+ start = int(start.replace(",", ""))\n+ end = int(end.replace(",", ""))\n+except ValueError:\n+ sys.exit("Bad argument for region: %r" % region)\n+\n+\n+# Assign sensible names with real extensions, and setup symlinks:\n+tmp_dir = tempfile.mkdtemp()\n+bam_file = os.path.join(tmp_dir, "temp.bam")\n+bai_file = os.path.join(tmp_dir, "temp.bam.bai")\n+os.symlink(os.path.abspath(bam_filename), bam_file)\n+os.symlink(os.path.abspath(bai_filename), bai_file)\n+assert os.path.isfile(bam_file), bam_file\n+assert os.path.isfile(bai_file), bai_file\n+assert os.path.isfile(bam_file + ".bai"), bam_file\n+\n+\n+def clean_up():\n+ os.remove(bam_file)\n+ os.remove(bai_file)\n+ 
os.rmdir(tmp_dir)\n+\n+\n+def decode_cigar(cigar):\n+ """Returns a list of 2-tuples, integer count and operator char."""\n+ count = ""\n+ answer = []\n+ for letter in cigar:\n+ if letter.isdigit():\n+ count += letter # string addition\n+ elif letter in "MIDNSHP=X":\n+ answer.append((int(count), letter))\n+ count = ""\n+ else:\n+ raise ValueError("Invalid character %s in CIGAR %s" % (letter, cigar))\n+ return answer\n+\n+\n+assert decode_cigar("14S15M1P1D3P54M1D34M5S") == [(14, \'S\'), (15, \'M\'), (1, \'P\'), (1, \'D\'), (3, \'P\'), (54, \'M\'), (1, \'D\'), (34, \'M\'), (5, \'S\')]\n+\n+\n+def align_len(cigar_ops):\n+ """Sums the CIGAR M/=/X/D/N operators."""\n+ return sum(count for count, op in cigar_ops if op in "M=XDN")\n+\n+\n+def expand_cigar(seq, cigar_ops):\n+ """Yields (ref_offset, seq_base) pairs."""\n+ ref_offset = 0\n+ seq_offset = 0\n+ for'..b'ar("8M1I4M"))) == [(0, \'A\'), (1, \'A\'), (2, \'A\'), (3, \'A\'), (4, \'G\'), (5, \'G\'), (6, \'G\'), (7, \'G\'), (7.5, "c"), (8, \'T\'), (9, \'T\'), (10, \'T\'), (11, \'T\')]\n+assert list(expand_cigar("AAAAcGGGGcTTTT", decode_cigar("4M1I4M1I4M"))) == [(0, \'A\'), (1, \'A\'), (2, \'A\'), (3, \'A\'), (3.5, \'c\'), (4, \'G\'), (5, \'G\'), (6, \'G\'), (7, \'G\'), (7.5, \'c\'), (8, \'T\'), (9, \'T\'), (10, \'T\'), (11, \'T\')]\n+\n+\n+def get_roi(seq, cigar_ops, start, end):\n+ """Extract region of seq mapping to the ROI.\n+\n+ Expect start and end to be zero based Python style end points.\n+\n+ i.e. The ROI relative to the mapping start recorded in the POS field.\n+ Will return part of the SAM/BAM value SEQ based on interpretting the\n+ passed CIGAR operators.\n+ """\n+ if len(cigar_ops) == 1 and cigar_ops[0][1] in "M=X":\n+ # Easy case, note start/end/pos all one-based\n+ assert cigar_ops[0][0] == len(seq)\n+ return seq[start:end]\n+ # Would use "start <= i < end" if they were all integers, but\n+ # want to exclude e.g. 
3.5 and 7.5 when given start 4 and end 8.\n+ return "".join(base for i, base in expand_cigar(seq, cigar_ops) if start <= i <= end - 1)\n+\n+assert "GGGG" == get_roi("AAAAGGGGTTTT", decode_cigar("12M"), 4, 8)\n+assert "GGGG" == get_roi("AAAAcGGGGTTTT", decode_cigar("4M1I8M"), 4, 8)\n+assert "GGGG" == get_roi("AAAAGGGGcTTTT", decode_cigar("8M1I4M"), 4, 8)\n+assert "GGGG" == get_roi("AAAAcGGGGcTTTT", decode_cigar("4M1I4M1I4M"), 4, 8)\n+assert "GGaGG" == get_roi("AAAAGGaGGTTTT", decode_cigar("6M1I6M"), 4, 8)\n+assert "GGGgA" == get_roi("AAAAGGGgATTTT", decode_cigar("7M1I5M"), 4, 8)\n+\n+\n+def count_region():\n+ # Could recreate the region string (with no commas in start/end)?\n+ # region = "%s:%i-%i" % (ref, start, end)\n+\n+ tally = dict()\n+\n+ # Call samtools view, don\'t need header so no -h added.\n+ # Only want mapped reads, thus flag filter -F 4.\n+ child = subprocess.Popen(["samtools", "view", "-F", "4", bam_file, region],\n+ stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n+ for line in child.stdout:\n+ assert line[0] != "@", "Got unexpected SAM header line: %s" % line\n+ qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, rest = line.split("\\t", 10)\n+ pos = int(pos) # one-based\n+ if start < pos:\n+ # Does not span the ROI\n+ continue\n+ cigar_ops = decode_cigar(cigar)\n+ if pos + align_len(cigar_ops) - 1 < end:\n+ # Does not span the ROI\n+ continue\n+ # All of start/end/pos are currently one-based, making offsets Python style....\n+ roi_seq = get_roi(seq, cigar_ops, start - pos, end - pos + 1)\n+ assert roi_seq, "Error, empty ROI sequence for: %s" % line\n+ try:\n+ tally[roi_seq] += 1\n+ except KeyError:\n+ tally[roi_seq] = 1\n+\n+ stderr = child.stderr.read()\n+ child.stdout.close()\n+ child.stderr.close()\n+ return_code = child.wait()\n+ if return_code:\n+ sys.exit("Got return code %i from samtools view" % return_code)\n+ elif "specifies an unknown reference name. Continue anyway." 
in stderr:\n+ sys.exit(stderr.strip() + "\\n\\nERROR: samtools did not recognise the region requested, can\'t count any variants.")\n+\n+ return tally\n+\n+\n+def record_counts():\n+\n+ tally = count_region()\n+ total = sum(tally.values())\n+\n+ # Using negative count to get sort with highest count first,\n+ # while tie-breaking by the ROI sequence alphabetically.\n+ table = sorted((-count, roi_seq) for (roi_seq, count) in tally.items())\n+ del tally\n+\n+ with open(tabular_filename, "w") as handle:\n+ handle.write("Variant\\tCount\\tPercentage\\n")\n+ for count, roi_seq in table:\n+ handle.write("%s\\t%i\\t%0.2f\\n" % (roi_seq, -count, -count * 100.0 / total))\n+\n+ print("Counted %i variants from %i reads spanning %s" % (len(table), total, region))\n+\n+\n+# Run it!\n+record_counts()\n+# Remove the temp symlinks and files:\n+clean_up()\n' |
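The script above parses the region string, decodes each read's CIGAR string, and keeps only reads whose alignment covers the whole region. The following is a minimal Python 3 sketch of those three steps, not the committed code. It swaps the character-by-character CIGAR loop for a regular expression, and the names `parse_region`, `ref_span` and `spans_roi` are hypothetical helpers introduced only for illustration.

```python
import re

# One CIGAR element: a run length followed by an operator character.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_region(region):
    """Split 'ref:start-end' into (ref, start, end), one-based inclusive.

    rsplit on the last colon, so reference names containing colons or
    pipes survive; commas used as thousands separators are dropped.
    """
    ref, start_end = region.rsplit(":", 1)
    start, end = start_end.split("-")
    return ref, int(start.replace(",", "")), int(end.replace(",", ""))


def decode_cigar(cigar):
    """Return a list of (count, operator) tuples, or raise ValueError."""
    if not re.fullmatch(r"(?:\d+[MIDNSHP=X])+", cigar):
        raise ValueError("Invalid CIGAR %r" % cigar)
    return [(int(n), op) for n, op in CIGAR_ELEMENT.findall(cigar)]


def ref_span(cigar_ops):
    """Number of reference bases covered (operators M, =, X, D and N)."""
    return sum(count for count, op in cigar_ops if op in "M=XDN")


def spans_roi(pos, cigar_ops, start, end):
    """True if a read mapped at one-based pos covers start..end fully."""
    return pos <= start and pos + ref_span(cigar_ops) - 1 >= end


ref, start, end = parse_region("gi|187250362|ref|NC_010642.1|:1,695-1,725")
ops = decode_cigar("20M1I30M")
print(ref, start, end, spans_roi(1690, ops, start, end))
```

Insertions (`I`) and soft clips (`S`) consume read bases but no reference bases, which is why they are left out of `ref_span`.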
diff -r 000000000000 -r 95efbdb72961 tools/count_roi_variants/count_roi_variants.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/count_roi_variants/count_roi_variants.xml	Wed Feb 01 07:10:26 2017 -0500
@@ -0,0 +1,112 @@
+<tool id="count_roi_variants" name="Count sequence variants in region of interest" version="0.0.3">
+    <description>using samtools view</description>
+    <requirements>
+        <requirement type="binary">samtools</requirement>
+        <requirement type="package" version="1.2.2">samtools</requirement>
+    </requirements>
+    <stdio>
+        <!-- Assume anything other than zero is an error -->
+        <exit_code range="1:" />
+        <exit_code range=":-1" />
+    </stdio>
+    <version_command interpreter="python">count_roi_variants.py --version</version_command>
+    <command interpreter="python">count_roi_variants.py "$input_bam" "${input_bam.metadata.bam_index}" "$out_tabular" "$region"</command>
+    <inputs>
+        <param name="input_bam" type="data" format="bam" label="Input BAM file" />
+        <param name="region" type="text" label="Region of interest" help="Use the reference:start-end syntax as in samtools.">
+            <sanitizer>
+                <!--
+                SAM/BAM spec says the name must match regex [!-)+-<>-~][!-~]*
+                but will focus on ASCII 33 ("!") to ASCII 126 ("~"), i.e.
+                !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
+                of which some are going to be potential trouble like !, ", $, &, ', \, `, {, |, }
+                which is how I came to this hopefully safe set..
+                -->
+                <valid initial="string.letters,string.digits">
+                    <add value="#%+,-./:;=@^_|~" />
+                    <remove value=" " />
+                </valid>
+                <mapping initial="none" />
+            </sanitizer>
+        </param>
+    </inputs>
+    <outputs>
+        <data name="out_tabular" format="tabular" label="$input_bam.name $region variants" />
+    </outputs>
+    <tests>
+        <test>
+            <param name="input_bam" value="ex1.bam" ftype="bam" />
+            <param name="region" value="chr2:400-406" />
+            <output name="out_tabular" file="ex1.count_roi_variants.tabular" ftype="tabular" />
+            <assert_stdout>
+                <has_line line="Counted 2 variants from 40 reads spanning chr2:400-406" />
+            </assert_stdout>
+        </test>
+        <test>
+            <param name="input_bam" value="coverage_test.bam" ftype="bam" />
+            <param name="region" value="ref:10-15" />
+            <output name="out_tabular" file="coverage_test.count_roi_variants.tabular" ftype="tabular" />
+            <assert_stdout>
+                <has_line line="Counted 1 variants from 1 reads spanning ref:10-15" />
+            </assert_stdout>
+        </test>
+        <!-- This test is a tricky one due to the NCBI's love of pipe characters in identifiers -->
+        <test>
+            <param name="input_bam" value="SRR639755_mito_pairs_vs_NC_010642_clc.bam" ftype="bam" />
+            <param name="region" value="gi|187250362|ref|NC_010642.1|:1695-1725" />
+            <output name="out_tabular" file="SRR639755_mito_pairs_vs_NC_010642_clc.count-1695-1725.tabular" ftype="tabular" />
+            <assert_stdout>
+                <has_line line="Counted 3 variants from 16 reads spanning gi|187250362|ref|NC_010642.1|:1695-1725" />
+            </assert_stdout>
+        </test>
+        <test expect_failure="true" expect_exit_code="1">
+            <param name="input_bam" value="SRR639755_mito_pairs_vs_NC_010642_clc.bam" ftype="bam" />
+            <param name="region" value="ref:1695-1725" />
+            <assert_stderr>
+                <has_line line="ERROR: samtools did not recognise the region requested, can't count any variants." />
+            </assert_stderr>
+        </test>
+    </tests>
+    <help>
+**What it does**
+
+This tool runs the command ``samtools view`` from the SAMtools toolkit, getting
+all the reads in your BAM file mapped to the given region of interest (ROI).
+It then counts all the different sequence variants in reads spanning that ROI,
+which are returned as a tab-separated table.
+
+Reads mapped to the ROI but not spanning it completely are ignored.
+
+Input is a sorted and indexed BAM file, the output is tabular. The first column
+is the observed sequence variants within the ROI, the second column is the number
+of reads with that sequence, and the third column gives this as a percentage of
+the reads spanning the ROI.
+
+====== =================================================================================
+Column Description
+------ ---------------------------------------------------------------------------------
+     1 Sequence variant from ROI
+     2 Number of reads with that sequence variant
+     3 Percentage of reads with that sequence variant (2 dp)
+====== =================================================================================
+
+
+**Citation**
+
+If you use this Galaxy tool in work leading to a scientific publication please
+cite:
+
+Heng Li et al (2009). The Sequence Alignment/Map format and SAMtools.
+Bioinformatics 25(16), 2078-9.
+http://dx.doi.org/10.1093/bioinformatics/btp352
+
+Peter J.A. Cock (2016), Count sequence variants in region of interest in BAM file.
+http://toolshed.g2.bx.psu.edu/view/peterjc/count_roi_variants
+
+This wrapper is available to install into other Galaxy Instances via the Galaxy
+Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/count_roi_variants
+    </help>
+    <citations>
+        <citation type="doi">10.1093/bioinformatics/btp352</citation>
+    </citations>
+</tool>
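The sanitizer in the tool definition above whitelists letters, digits and the characters `#%+,-./:;=@^_|~` for the region parameter. As a rough illustration of which characters a given region string would fall foul of, here is a small Python 3 sketch. It assumes that Galaxy's `string.letters` preset can be approximated by `string.ascii_letters`, and the function name `disallowed_chars` is hypothetical; it does not model what Galaxy actually does with rejected characters.

```python
import string

# Whitelist taken from the <valid> element in the tool XML.
# Assumption: "string.letters" is approximated by ASCII letters here.
ALLOWED = set(string.ascii_letters + string.digits + "#%+,-./:;=@^_|~")


def disallowed_chars(region):
    """Return the sorted list of characters outside the whitelist."""
    return sorted(set(region) - ALLOWED)


# The NCBI style identifier used in the functional tests passes cleanly,
# because the pipe character is part of the whitelist.
print(disallowed_chars("gi|187250362|ref|NC_010642.1|:1695-1725"))
print(disallowed_chars("chr 2:1-5$"))
```

A check like this can help when choosing test regions, since a reference name with characters outside the set would not reach the script unchanged.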
diff -r 000000000000 -r 95efbdb72961 tools/count_roi_variants/tool_dependencies.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/count_roi_variants/tool_dependencies.xml	Wed Feb 01 07:10:26 2017 -0500
@@ -0,0 +1,6 @@
+<?xml version="1.0"?>
+<tool_dependency>
+    <package name="samtools" version="1.2">
+        <repository changeset_revision="f6ae3ba3f3c1" name="package_samtools_1_2" owner="iuc" toolshed="https://toolshed.g2.bx.psu.edu" />
+    </package>
+</tool_dependency>