changeset 0:c84f12187af9 draft
v0.0.1
author | peterjc |
---|---|
date | Fri, 09 Nov 2018 11:00:03 -0500 |
parents | |
children | 84e483325b04 |
files | test-data/deduplicate.nosortids.fasta test-data/deduplicate.sortids.fasta test-data/duplicates.fasta test-data/duplicates.fasta.gz test-data/duplicates.nr.fasta test-data/more_duplicates.fasta tools/make_nr/README.rst tools/make_nr/make_nr.py tools/make_nr/make_nr.xml tools/make_nr/tool_dependencies.xml |
diffstat | 10 files changed, 447 insertions(+), 0 deletions(-) |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/deduplicate.nosortids.fasta Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,10 @@
+>Quick;Brown;Fox;3;5 representing 5 records
+acgt
+>1 first entry
+act
+>2 The A-Team
+AAaa
+>4
+CCCC
+>6 last!
+GGGG
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/deduplicate.sortids.fasta Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,10 @@
+>3;5;Brown;Fox;Quick representing 5 records
+acgt
+>1 first entry
+act
+>2 The A-Team
+AAaa
+>4
+CCCC
+>6 last!
+GGGG
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/duplicates.fasta Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,12 @@
+>1 first entry
+act
+>2 The A-Team
+AAaa
+>3 not unique...
+ACgt
+>4
+CCCC
+>5 a duplicate
+acgt
+>6 last!
+GGGG
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/duplicates.nr.fasta Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,10 @@
+>1 first entry
+act
+>2 The A-Team
+AAaa
+>3;5 representing 2 records
+ACgt
+>4
+CCCC
+>6 last!
+GGGG
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/more_duplicates.fasta Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,6 @@
+>Quick
+acgt
+>Brown
+ACGT
+>Fox
+ACGT
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/make_nr/README.rst Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,140 @@
+Make a FASTA file non-redundant, with a Galaxy wrapper
+======================================================
+
+This tool is copyright 2018 by Peter Cock, The James Hutton Institute, UK.
+All rights reserved. See the licence text below.
+
+This tool is a short Python script intended to be run prior to calling
+the NCBI BLAST+ command line tool ``makeblastdb`` or in other settings
+where you want to collapse duplicated sequences in a FASTA file to a
+single representative.
+
+The script ``make_nr.py`` can be used directly (without Galaxy).
+It requires Biopython.
+
+It comes with an optional Galaxy tool definition file ``make_nr.xml``
+allowing the Python script to be run from within Galaxy. It is available
+from the Galaxy Tool Shed at:
+http://toolshed.g2.bx.psu.edu/view/peterjc/make_nr
+
+
+Citation
+========
+
+If you cannot cite the GitHub repository directly, please cite one of the
+following papers:
+
+Cock et al 2009. Biopython: freely available Python tools for computational
+molecular biology and bioinformatics. *Bioinformatics* 25(11) 1422-3.
+https://doi.org/10.1093/bioinformatics/btp163 pmid:19304878.
+
+or (and this would be more appropriate in a Galaxy setting):
+
+NCBI BLAST+ integrated into Galaxy.
+P.J.A. Cock, J.M. Chilton, B. Gruening, J.E. Johnson, N. Soranzo
+*GigaScience* 2015, 4:39
+https://doi.org/10.1186/s13742-015-0080-7
+
+
+Standalone Installation (outside Galaxy)
+========================================
+
+Outside of Galaxy, you will need Python and Biopython; the latter can usually
+be installed with ``pip install biopython`` or if you are using Conda, try
+``conda install biopython`` instead. Then to run the script, simply call it
+using ``python /full/path/to/make_nr.py -h`` or similar.
+
+
+Automated Installation
+======================
+
+Installation via the Galaxy Tool Shed should take care of the Galaxy side of
+things, including the dependency on Biopython.
+
+
+Manual Installation
+===================
+
+There are just two files to install:
+
+- ``make_nr.py`` (the Python script)
+- ``make_nr.xml`` (the Galaxy tool definition)
+
+The suggested location is in a ``tools/make_nr/`` folder. You will then
+need to modify the ``tools_conf.xml`` file to tell Galaxy to offer the tool
+by adding the line::
+
+    <tool file="make_nr/make_nr.xml" />
+
+If you want to run the functional tests, copy the sample test files under
+Galaxy's ``test-data/`` directory. Then::
+
+    ./run_tests.sh -id make_nr
+
+
+History
+=======
+
+TODO:
+
+ - Option to follow BLAST NR style with ctrl+a separator?
+ - Option to give representative sequences in upper case?
+
+======= ======================================================================
+Version Changes
+------- ----------------------------------------------------------------------
+v0.0.0  - Initial version
+======= ======================================================================
+
+
+Developers
+==========
+
+This tool is developed on the following GitHub repository:
+https://github.com/peterjc/galaxy_blast/tree/master/tools/make_nr
+
+For pushing a release to the test or main "Galaxy Tool Shed", use the following
+Planemo commands (which require you to have set your Tool Shed access details in
+``~/.planemo.yml`` and that you have access rights on the Tool Shed)::
+
+    $ planemo shed_update -t testtoolshed --check_diff tools/make_nr/
+    ...
+
+or::
+
+    $ planemo shed_update -t toolshed --check_diff tools/make_nr/
+    ...
+
+To just build and check the tar ball, use::
+
+    $ planemo shed_upload --tar_only tools/make_nr/
+    ...
+    $ tar -tzf shed_upload.tar.gz
+    tools/make_nr/README.rst
+    tools/make_nr/make_nr.py
+    tools/make_nr/make_nr.xml
+    tools/make_nr/tool_dependencies.xml
+    test-data/duplicates.fasta
+    test-data/duplicates.nr.fasta
+
+
+Licence (MIT)
+=============
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/make_nr/make_nr.py Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,133 @@
+#!/usr/bin/env python3
+"""Make FASTA files non-redundant by combining duplicated sequences.
+
+This script takes one or more (optionally gzipped) FASTA filenames as input,
+and will return a non-zero error if any duplicate identifiers are found.
+
+Writes output to stdout by default.
+
+Keeps all the sequences in memory, beware!
+"""
+from __future__ import print_function
+
+import gzip
+import os
+import sys
+
+from optparse import OptionParser
+
+
+if "-v" in sys.argv or "--version" in sys.argv:
+    print("v0.0.1")
+    sys.exit(0)
+
+
+# Parse Command Line
+usage = """Use as follows:
+
+$ python make_nr.py [options] A.fasta [B.fasta ...]
+
+For example,
+
+$ python make_nr.py -o dedup.fasta -s ";" input1.fasta input2.fasta
+
+The input files should be plain text FASTA format, optionally gzipped.
+
+The -a option controls how the representative replacement record for
+duplicated records are named. By default the identifiers are taken
+in the input file order, combined with the separator. If the -a or
+alphasort option is picked, the identifiers are alphabetically sorted
+first. This ensures the same names are used even if the input file
+order (or the record order within the input files) is randomised.
+
+There is additional guidance in the help text in the make_nr.xml file,
+which is shown to the user via the Galaxy interface to this tool.
+"""
+
+parser = OptionParser(usage=usage)
+parser.add_option("-s", "--sep", dest="sep",
+                  default=";",
+                  help="Separator character for combining identifiers "
+                  "of duplicated records e.g. '|' or ';' (required)")
+parser.add_option("-a", "--alphasort", action="store_true",
+                  help="When merging duplicated records sort their "
+                  "identifiers alphabetically before combining them. "
+                  "Default is input file order.")
+parser.add_option("-o", "--output", dest="output",
+                  default="/dev/stdout", metavar="FILE",
+                  help="Output filename (defaults to stdout)")
+options, args = parser.parse_args()
+
+if not args:
+    sys.exit("Expects at least one input FASTA filename")
+
+
+def gzip_open(filename):
+    """Open a possibly gzipped text file."""
+    with open(filename, "rb") as h:
+        magic = h.read(2)
+    if magic == b'\x1f\x8b':
+        return gzip.open(filename, "rt")
+    else:
+        return open(filename)
+
+
+def make_nr(input_fasta, output_fasta, sep=";", sort_ids=False):
+    """Make the sequences in FASTA files non-redundant.
+
+    Argument input_fasta is a list of filenames.
+    """
+    by_seq = dict()
+    try:
+        from Bio.SeqIO.FastaIO import SimpleFastaParser
+    except ImportError:
+        sys.exit("Missing Biopython")
+    for f in input_fasta:
+        with gzip_open(f) as handle:
+            for title, seq in SimpleFastaParser(handle):
+                idn = title.split(None, 1)[0]  # first word only
+                seq = seq.upper()
+                try:
+                    by_seq[seq].append(idn)
+                except KeyError:
+                    by_seq[seq] = [idn]
+    unique = 0
+    representatives = dict()
+    duplicates = set()
+    for cluster in by_seq.values():
+        if len(cluster) > 1:
+            # Is it useful to offer to sort here?
+            # if sort_ids:
+            #     cluster.sort()
+            representatives[cluster[0]] = cluster
+            duplicates.update(cluster[1:])
+        else:
+            unique += 1
+    del by_seq
+    if duplicates:
+        # TODO - refactor as a generator with single SeqIO.write(...) call
+        with open(output_fasta, "w") as handle:
+            for f in input_fasta:
+                with gzip_open(f) as in_handle:
+                    for title, seq in SimpleFastaParser(in_handle):
+                        idn = title.split(None, 1)[0]  # first word only
+                        if idn in representatives:
+                            cluster = representatives[idn]
+                            if sort_ids:
+                                cluster.sort()
+                            idn = sep.join(cluster)
+                            title = "%s representing %i records" % (idn, len(cluster))
+                        elif idn in duplicates:
+                            continue
+                        # TODO - line wrapping
+                        handle.write(">%s\n%s\n" % (title, seq))
+        sys.stderr.write("%i unique entries; removed %i duplicates "
+                         "leaving %i representative records\n"
+                         % (unique, len(duplicates), len(representatives)))
+    else:
+        os.symlink(os.path.abspath(input_fasta), output_fasta)
+        sys.stderr.write("No perfect duplicates in file, %i unique entries\n"
+                         % unique)
+
+
+make_nr(args, options.output, options.sep, options.alphasort)
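Note on the no-duplicates branch of ``make_nr.py``: it passes the list ``input_fasta`` to ``os.path.abspath``, which expects a single path, so that branch looks likely to fail when no duplicates are found. One possible alternative, shown here only as a minimal sketch, is to copy every input into the output file instead of creating a symlink. The helper names ``open_maybe_gzipped`` and ``copy_inputs`` are hypothetical and are not part of this changeset; the gzip detection mirrors the magic-byte check used in ``gzip_open``.

```python
import gzip


def open_maybe_gzipped(filename):
    """Open a plain or gzipped text file, detected via the gzip magic bytes."""
    with open(filename, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(filename, "rt")
    return open(filename)


def copy_inputs(input_fasta, output_fasta):
    """Concatenate all input FASTA files into output_fasta as plain text.

    Hypothetical helper (an assumption, not the committed code).
    input_fasta is a list of filenames, as in make_nr().
    """
    with open(output_fasta, "w") as out_handle:
        for filename in input_fasta:
            with open_maybe_gzipped(filename) as in_handle:
                for line in in_handle:
                    # Guard against an input lacking a final newline, which
                    # would otherwise fuse two records across files.
                    if not line.endswith("\n"):
                        line += "\n"
                    out_handle.write(line)
```

Copying rather than symlinking also covers gzipped inputs (the output is declared as plain FASTA) and more than one input file, at the cost of rewriting the data.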
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/make_nr/make_nr.xml Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,120 @@
+<tool id="make_nr" name="Make FASTA non-redundant" version="0.0.1">
+    <description>by combining duplicated sequences</description>
+    <requirements>
+        <requirement type="package" version="1.67">biopython</requirement>
+    </requirements>
+    <version_command>
+python $__tool_directory__/make_nr.py --version
+    </version_command>
+    <command detect_errors="aggressive">
+python $__tool_directory__/make_nr.py $alphasort -s '$separator' -o '$output'
+#for $f in $input
+'$f'
+#end for
+    </command>
+    <inputs>
+        <param name="input" type="data" format="fasta,fasta.gz" multiple="True"
+               label="Input FASTA sequence file(s)"/>
+        <param argument="separator" type="text" size="10" area="False" value=";"
+               label="Separator string to use when combining the identifiers of duplicate sequences"
+               help="A single character is recommended, e.g. the semi-colon, or comma">
+            <sanitizer>
+                <valid initial="default">
+                    <add value=";"/>
+                    <add value="|"/>
+                </valid>
+            </sanitizer>
+        </param>
+        <param argument="alphasort" type="select" label="Treatment of identifiers when combining duplicates with the separator">
+            <option value="">Use the order they appear in the input file(s)</option>
+            <option value="-a">Sort alphabetically before combining them</option>
+        </param>
+    </inputs>
+    <outputs>
+        <data name="output" format="fasta" label="$on_string (NR)" />
+    </outputs>
+    <tests>
+        <test>
+            <param name="input" value="duplicates.fasta" ftype="fasta"/>
+            <output name="output" file="duplicates.nr.fasta" ftype="fasta"/>
+        </test>
+        <test>
+            <param name="input" value="duplicates.fasta.gz" ftype="fasta.gz"/>
+            <output name="output" file="duplicates.nr.fasta" ftype="fasta"/>
+        </test>
+        <test>
+            <param name="input" value="more_duplicates.fasta,duplicates.fasta" ftype="fasta"/>
+            <output name="output" file="deduplicate.nosortids.fasta" ftype="fasta"/>
+        </test>
+        <test>
+            <param name="input" value="more_duplicates.fasta,duplicates.fasta" ftype="fasta"/>
+            <param name="alphasort" value="-a"/>
+            <output name="output" file="deduplicate.sortids.fasta" ftype="fasta"/>
+        </test>
+    </tests>
+    <help>
+**What it does**
+
+Takes one or more input FASTA files, checks them to find any duplicate sequences
+(ignoring the case), and writes an output FASTA file where any duplicates appear
+once with a combined identifier.
+
+For example, using the default separator of a semi-colon::
+
+    >1 first entry
+    act
+    >2 The A-Team
+    AAaa
+    >3 not unique...
+    ACgt
+    >4
+    CCCC
+    >5 a duplicate
+    acgt
+    >6 last!
+    GGGG
+
+In this simple example ``ACGT`` appears twice (ignoring case) as entries ``3``
+and ``5``. Entry ``3`` is renamed as ``3;5`` and entry ``5`` is omitted::
+
+    >1 first entry
+    act
+    >2 The A-Team
+    AAaa
+    >3;5 representing 2 records
+    ACgt
+    >4
+    CCCC
+    >6 last!
+    GGGG
+
+This means that the representative records take the position and sequence case
+from the first entry with that sequence.
+
+In this case the combined entry is labelled as ``3;5``, so the sort option
+has no effect. However, if the records appear in the file with ``5`` before
+``3`` you can choose to get ``5;3`` (order from file, default) or ``3;5``
+(ordered alphabetically).
+
+Notice the unique sequences are preserved as they were with any description
+or mixed case.
+
+
+**References**
+
+If you cannot cite this tool directly via the GitHub URL
+https://github.com/peterjc/galaxy_blast/tree/master/tools/make_nr
+and need a traditional paper, then please cite:
+
+P.J.A. Cock, J.M. Chilton, B. Gruening, J.E. Johnson, N. Soranzo (2015).
+NCBI BLAST+ integrated into Galaxy.
+*GigaScience* 4:39
+https://doi.org/10.1186/s13742-015-0080-7
+
+This wrapper is available to install into other Galaxy Instances via the Galaxy
+Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/make_nr
+    </help>
+    <citations>
+        <citation type="doi">10.1186/1471-2105-10-421</citation>
+    </citations>
+</tool>
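The merging and naming behaviour described in the tool help can be illustrated on in-memory records, without Biopython or any files. This is a minimal sketch under stated assumptions, not the code in ``make_nr.py``: the function name ``merge_duplicates`` and the use of ``(title, sequence)`` tuples are choices made for illustration only. Like the script, it takes the first word of each title as the identifier, compares sequences ignoring case, and keeps the position and sequence case of the first record in each cluster.

```python
def merge_duplicates(records, sep=";", sort_ids=False):
    """Collapse records sharing a sequence (ignoring case).

    records is a list of (title, sequence) tuples in input order.
    Hypothetical helper written for illustration only.
    """
    # First pass: group identifiers by upper-cased sequence.
    clusters = {}
    for title, seq in records:
        idn = title.split(None, 1)[0]  # first word only
        clusters.setdefault(seq.upper(), []).append(idn)

    # Second pass: emit the first record of each cluster, renamed if needed.
    output = []
    seen = set()
    for title, seq in records:
        key = seq.upper()
        if key in seen:
            continue  # a later duplicate, already represented
        seen.add(key)
        ids = clusters[key]
        if len(ids) > 1:
            if sort_ids:
                ids = sorted(ids)
            title = "%s representing %i records" % (sep.join(ids), len(ids))
        output.append((title, seq))
    return output


# The records from test-data/duplicates.fasta in this changeset.
records = [
    ("1 first entry", "act"),
    ("2 The A-Team", "AAaa"),
    ("3 not unique...", "ACgt"),
    ("4", "CCCC"),
    ("5 a duplicate", "acgt"),
    ("6 last!", "GGGG"),
]
for title, seq in merge_duplicates(records):
    print(">%s\n%s" % (title, seq))
```

Run on these records, the sketch should reproduce the content of ``test-data/duplicates.nr.fasta``. Prepending the three records from ``more_duplicates.fasta`` and toggling ``sort_ids`` should give the combined titles seen in ``deduplicate.nosortids.fasta`` and ``deduplicate.sortids.fasta``.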
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/make_nr/tool_dependencies.xml Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,6 @@
+<?xml version="1.0" ?>
+<tool_dependency>
+    <package name="biopython" version="1.67">
+        <repository changeset_revision="a12f73c3b116" name="package_biopython_1_67" owner="biopython" toolshed="https://toolshed.g2.bx.psu.edu"/>
+    </package>
+</tool_dependency>
\ No newline at end of file