--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/deduplicate.nosortids.fasta	Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,10 @@
+>Quick;Brown;Fox;3;5 representing 5 records
+acgt
+>1 first entry
+act
+>2 The A-Team
+AAaa
+>4
+CCCC
+>6 last!
+GGGG
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/deduplicate.sortids.fasta	Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,10 @@
+>3;5;Brown;Fox;Quick representing 5 records
+acgt
+>1 first entry
+act
+>2 The A-Team
+AAaa
+>4
+CCCC
+>6 last!
+GGGG
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/duplicates.fasta	Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,12 @@
+>1 first entry
+act
+>2 The A-Team
+AAaa
+>3 not unique...
+ACgt
+>4
+CCCC
+>5 a duplicate
+acgt
+>6 last!
+GGGG
Binary file test-data/duplicates.fasta.gz has changed
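
Note on the gzipped test input: the functional tests in ``make_nr.xml`` expect the same output from ``duplicates.fasta.gz`` as from ``duplicates.fasta``. That relies on the script checking the two-byte gzip magic number instead of trusting the file extension. A minimal standalone sketch of that check follows; the names ``open_maybe_gzipped`` and ``demo`` are illustrative assumptions and not part of the tool.

```python
import gzip
import os
import tempfile


def open_maybe_gzipped(filename):
    """Return a text-mode handle, decompressing gzip input transparently."""
    with open(filename, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # every gzip stream starts with these two bytes
        return gzip.open(filename, "rt")
    return open(filename)


def demo():
    """Write the same FASTA text plain and gzipped, then read both back."""
    text = ">1 first entry\nact\n"
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        plain = os.path.join(tmp, "example.fasta")
        zipped = os.path.join(tmp, "example.fasta.gz")
        with open(plain, "w") as handle:
            handle.write(text)
        with gzip.open(zipped, "wt") as handle:
            handle.write(text)
        for name in (plain, zipped):
            with open_maybe_gzipped(name) as handle:
                results.append(handle.read())
    return results
```

Both reads in ``demo()`` should return identical text, which is why one expected output file can serve both the plain and the gzipped test case.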
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/duplicates.nr.fasta	Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,10 @@
+>1 first entry
+act
+>2 The A-Team
+AAaa
+>3;5 representing 2 records
+ACgt
+>4
+CCCC
+>6 last!
+GGGG
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/more_duplicates.fasta	Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,6 @@
+>Quick
+acgt
+>Brown
+ACGT
+>Fox
+ACGT
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/make_nr/README.rst	Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,140 @@
+Make a FASTA file non-redundant, with a Galaxy wrapper
+======================================================
+
+This tool is copyright 2018 by Peter Cock, The James Hutton Institute, UK.
+All rights reserved. See the licence text below.
+
+This tool is a short Python script intended to be run prior to calling
+the NCBI BLAST+ command line tool ``makeblastdb`` or in other settings
+where you want to collapse duplicated sequences in a FASTA file to a
+single representative.
+
+The script ``make_nr.py`` can be used directly (without Galaxy).
+It requires Biopython.
+
+It comes with an optional Galaxy tool definition file ``make_nr.xml``
+allowing the Python script to be run from within Galaxy. It is available
+from the Galaxy Tool Shed at:
+http://toolshed.g2.bx.psu.edu/view/peterjc/make_nr
+
+
+Citation
+========
+
+If you cannot cite the GitHub repository directly, please cite one of the
+following papers:
+
+Cock et al 2009. Biopython: freely available Python tools for computational
+molecular biology and bioinformatics. *Bioinformatics* 25(11) 1422-3.
+https://doi.org/10.1093/bioinformatics/btp163 pmid:19304878.
+
+or (and this would be more appropriate in a Galaxy setting):
+
+NCBI BLAST+ integrated into Galaxy.
+P.J.A. Cock, J.M. Chilton, B. Gruening, J.E. Johnson, N. Soranzo
+*GigaScience* 2015, 4:39
+https://doi.org/10.1186/s13742-015-0080-7
+
+
+Standalone Installation (outside Galaxy)
+========================================
+
+Outside of Galaxy, you will need Python and Biopython. The latter can usually
+be installed with ``pip install biopython`` or, if you are using Conda, with
+``conda install biopython`` instead. Then, to run the script, simply call it
+using ``python /full/path/to/make_nr.py -h`` or similar.
+
+
+Automated Installation
+======================
+
+Installation via the Galaxy Tool Shed should take care of the Galaxy side of
+things, including the dependency on Biopython.
+
+
+Manual Installation
+===================
+
+There are just two files to install:
+
+- ``make_nr.py`` (the Python script)
+- ``make_nr.xml`` (the Galaxy tool definition)
+
+The suggested location is in a ``tools/make_nr/`` folder. You will then
+need to modify the ``tool_conf.xml`` file to tell Galaxy to offer the tool
+by adding the line::
+
+    <tool file="make_nr/make_nr.xml" />
+
+If you want to run the functional tests, copy the sample test files into
+Galaxy's ``test-data/`` directory. Then::
+
+    ./run_tests.sh -id make_nr
+
+
+History
+=======
+
+TODO:
+
+ - Option to follow BLAST NR style with ctrl+a separator?
+ - Option to give representative sequences in upper case?
+
+======= ======================================================================
+Version Changes
+------- ----------------------------------------------------------------------
+v0.0.0  - Initial version
+======= ======================================================================
+
+
+Developers
+==========
+
+This tool is developed on the following GitHub repository:
+https://github.com/peterjc/galaxy_blast/tree/master/tools/make_nr
+
+For pushing a release to the test or main "Galaxy Tool Shed", use the following
+Planemo commands (which require you to have set your Tool Shed access details in
+``~/.planemo.yml`` and that you have access rights on the Tool Shed)::
+
+    $ planemo shed_update -t testtoolshed --check_diff tools/make_nr/
+    ...
+
+or::
+
+    $ planemo shed_update -t toolshed --check_diff tools/make_nr/
+    ...
+
+To just build and check the tar ball, use::
+
+    $ planemo shed_upload --tar_only tools/make_nr/
+    ...
+    $ tar -tzf shed_upload.tar.gz
+    tools/make_nr/README.rst
+    tools/make_nr/make_nr.py
+    tools/make_nr/make_nr.xml
+    tools/make_nr/tool_dependencies.xml
+    test-data/duplicates.fasta
+    test-data/duplicates.nr.fasta
+
+
+Licence (MIT)
+=============
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/make_nr/make_nr.py	Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,133 @@
+#!/usr/bin/env python3
+"""Make FASTA files non-redundant by combining duplicated sequences.
+
+This script takes one or more (optionally gzipped) FASTA filenames as input,
+and will return a non-zero error if any duplicate identifiers are found.
+
+Writes output to stdout by default.
+
+Keeps all the sequences in memory, beware!
+"""
+from __future__ import print_function
+
+import gzip
+import os
+import sys
+
+from optparse import OptionParser
+
+
+if "-v" in sys.argv or "--version" in sys.argv:
+    print("v0.0.1")
+    sys.exit(0)
+
+
+# Parse Command Line
+usage = """Use as follows:
+
+$ python make_nr.py [options] A.fasta [B.fasta ...]
+
+For example,
+
+$ python make_nr.py -o dedup.fasta -s ";" input1.fasta input2.fasta
+
+The input files should be plain text FASTA format, optionally gzipped.
+
+The -a option controls how the representative replacement record for
+duplicated records is named. By default the identifiers are taken
+in the input file order, combined with the separator. If the -a or
+alphasort option is picked, the identifiers are alphabetically sorted
+first. This ensures the same names are used even if the input file
+order (or the record order within the input files) is randomised.
+
+There is additional guidance in the help text in the make_nr.xml file,
+which is shown to the user via the Galaxy interface to this tool.
+"""
+
+parser = OptionParser(usage=usage)
+parser.add_option("-s", "--sep", dest="sep",
+                  default=";",
+                  help="Separator character for combining identifiers "
+                  "of duplicated records e.g. '|' or ';' (default ';')")
+parser.add_option("-a", "--alphasort", action="store_true",
+                  help="When merging duplicated records sort their "
+                  "identifiers alphabetically before combining them. "
+                  "Default is input file order.")
+parser.add_option("-o", "--output", dest="output",
+                  default="/dev/stdout", metavar="FILE",
+                  help="Output filename (defaults to stdout)")
+options, args = parser.parse_args()
+
+if not args:
+    sys.exit("Expects at least one input FASTA filename")
+
+
+def gzip_open(filename):
+    """Open a possibly gzipped text file."""
+    with open(filename, "rb") as h:
+        magic = h.read(2)
+    if magic == b'\x1f\x8b':
+        return gzip.open(filename, "rt")
+    else:
+        return open(filename)
+
+
+def make_nr(input_fasta, output_fasta, sep=";", sort_ids=False):
+    """Make the sequences in FASTA files non-redundant.
+
+    Argument input_fasta is a list of filenames.
+    """
+    by_seq = dict()
+    try:
+        from Bio.SeqIO.FastaIO import SimpleFastaParser
+    except ImportError:
+        sys.exit("Missing Biopython")
+    for f in input_fasta:
+        with gzip_open(f) as handle:
+            for title, seq in SimpleFastaParser(handle):
+                idn = title.split(None, 1)[0]  # first word only
+                seq = seq.upper()
+                try:
+                    by_seq[seq].append(idn)
+                except KeyError:
+                    by_seq[seq] = [idn]
+    unique = 0
+    representatives = dict()
+    duplicates = set()
+    for cluster in by_seq.values():
+        if len(cluster) > 1:
+            # Is it useful to offer to sort here?
+            # if sort_ids:
+            #     cluster.sort()
+            representatives[cluster[0]] = cluster
+            duplicates.update(cluster[1:])
+        else:
+            unique += 1
+    del by_seq
+    if duplicates or len(input_fasta) > 1:  # several inputs must be merged
+        # TODO - refactor as a generator with single SeqIO.write(...) call
+        with open(output_fasta, "w") as handle:
+            for f in input_fasta:
+                with gzip_open(f) as in_handle:
+                    for title, seq in SimpleFastaParser(in_handle):
+                        idn = title.split(None, 1)[0]  # first word only
+                        if idn in representatives:
+                            cluster = representatives[idn]
+                            if sort_ids:
+                                cluster.sort()
+                            idn = sep.join(cluster)
+                            title = "%s representing %i records" % (idn, len(cluster))
+                        elif idn in duplicates:
+                            continue
+                        # TODO - line wrapping
+                        handle.write(">%s\n%s\n" % (title, seq))
+        sys.stderr.write("%i unique entries; removed %i duplicates "
+                         "leaving %i representative records\n"
+                         % (unique, len(duplicates), len(representatives)))
+    else:
+        os.symlink(os.path.abspath(input_fasta[0]), output_fasta)
+        sys.stderr.write("No perfect duplicates in file, %i unique entries\n"
+                         % unique)
+
+
+make_nr(args, options.output, options.sep, options.alphasort)
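
The grouping logic above can be sketched without Biopython as follows. This is an illustrative, restructured version and not the script itself: ``parse_fasta`` and ``dedup`` are hypothetical names, the parser is a simplified stand-in for ``SimpleFastaParser``, and the whole input is assumed to fit in memory as one string.

```python
EXAMPLE = """>1 first entry
act
>2 The A-Team
AAaa
>3 not unique...
ACgt
>4
CCCC
>5 a duplicate
acgt
>6 last!
GGGG
"""


def parse_fasta(text):
    """Yield (title, sequence) pairs from FASTA text (hypothetical helper)."""
    title, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)


def dedup(text, sep=";", sort_ids=False):
    """Return FASTA text with case-insensitive duplicate sequences merged."""
    records = list(parse_fasta(text))
    # First pass: collect the identifiers (first word of each title) per sequence.
    by_seq = {}
    for title, seq in records:
        by_seq.setdefault(seq.upper(), []).append(title.split(None, 1)[0])
    # Second pass: keep the first record of each sequence, renaming clusters.
    out = []
    seen = set()
    for title, seq in records:
        key = seq.upper()
        if key in seen:
            continue  # a later copy of a sequence already written
        seen.add(key)
        cluster = by_seq[key]
        if len(cluster) > 1:
            ids = sorted(cluster) if sort_ids else cluster
            title = "%s representing %i records" % (sep.join(ids), len(cluster))
        out.append(">%s\n%s\n" % (title, seq))
    return "".join(out)


print(dedup(EXAMPLE), end="")
```

Run on the ``duplicates.fasta`` content, the sketch keeps entry ``3`` with its original mixed case, renames it ``3;5``, and drops entry ``5``, in line with ``duplicates.nr.fasta``.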
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/make_nr/make_nr.xml	Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,120 @@
+<tool id="make_nr" name="Make FASTA non-redundant" version="0.0.1">
+    <description>by combining duplicated sequences</description>
+    <requirements>
+        <requirement type="package" version="1.67">biopython</requirement>
+    </requirements>
+    <version_command>
+python $__tool_directory__/make_nr.py --version
+    </version_command>
+    <command detect_errors="aggressive">
+python $__tool_directory__/make_nr.py $alphasort -s '$separator' -o '$output'
+#for $f in $input
+'$f'
+#end for
+    </command>
+    <inputs>
+        <param name="input" type="data" format="fasta,fasta.gz" multiple="True"
+               label="Input FASTA sequence file(s)"/>
+        <param argument="separator" type="text" size="10" area="False" value=";"
+               label="Separator string to use when combining the identifiers of duplicate sequences"
+               help="A single character is recommended, e.g. the semi-colon, or comma">
+            <sanitizer>
+                <valid initial="default">
+                    <add value=";"/>
+                    <add value="|"/>
+                </valid>
+            </sanitizer>
+        </param>
+        <param argument="alphasort" type="select" label="Treatment of identifiers when combining duplicates with the separator">
+            <option value="">Use the order they appear in the input file(s)</option>
+            <option value="-a">Sort alphabetically before combining them</option>
+        </param>
+    </inputs>
+    <outputs>
+        <data name="output" format="fasta" label="$on_string (NR)" />
+    </outputs>
+    <tests>
+        <test>
+            <param name="input" value="duplicates.fasta" ftype="fasta"/>
+            <output name="output" file="duplicates.nr.fasta" ftype="fasta"/>
+        </test>
+        <test>
+            <param name="input" value="duplicates.fasta.gz" ftype="fasta.gz"/>
+            <output name="output" file="duplicates.nr.fasta" ftype="fasta"/>
+        </test>
+        <test>
+            <param name="input" value="more_duplicates.fasta,duplicates.fasta" ftype="fasta"/>
+            <output name="output" file="deduplicate.nosortids.fasta" ftype="fasta"/>
+        </test>
+        <test>
+            <param name="input" value="more_duplicates.fasta,duplicates.fasta" ftype="fasta"/>
+            <param name="alphasort" value="-a"/>
+            <output name="output" file="deduplicate.sortids.fasta" ftype="fasta"/>
+        </test>
+    </tests>
+    <help>
+**What it does**
+
+Takes one or more input FASTA files, checks them to find any duplicate sequences
+(ignoring the case), and writes an output FASTA file where any duplicates appear
+once with a combined identifier.
+
+For example, using the default separator of a semi-colon::
+
+    >1 first entry
+    act
+    >2 The A-Team
+    AAaa
+    >3 not unique...
+    ACgt
+    >4
+    CCCC
+    >5 a duplicate
+    acgt
+    >6 last!
+    GGGG
+
+In this simple example ``ACGT`` appears twice (ignoring case) as entries ``3``
+and ``5``. Entry ``3`` is renamed as ``3;5`` and entry ``5`` is omitted::
+
+    >1 first entry
+    act
+    >2 The A-Team
+    AAaa
+    >3;5 representing 2 records
+    ACgt
+    >4
+    CCCC
+    >6 last!
+    GGGG
+
+This means that the representative records take the position and sequence case
+from the first entry with that sequence.
+
+In this case the identifiers ``3`` and ``5`` are already in sorted order, so the
+sort option has no effect. However, if the records appear in the file with ``5``
+before ``3`` you can choose to get ``5;3`` (order from file, default) or ``3;5``
+(ordered alphabetically).
+
+Notice the unique sequences are preserved as they were with any description
+or mixed case.
+
+
+**References**
+
+If you cannot cite this tool directly via the GitHub URL
+https://github.com/peterjc/galaxy_blast/tree/master/tools/make_nr
+and need a traditional paper, then please cite:
+
+P.J.A. Cock, J.M. Chilton, B. Gruening, J.E. Johnson, N. Soranzo (2015).
+NCBI BLAST+ integrated into Galaxy.
+*GigaScience* 4:39
+https://doi.org/10.1186/s13742-015-0080-7
+
+This wrapper is available to install into other Galaxy Instances via the Galaxy
+Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/make_nr
+    </help>
+    <citations>
+        <citation type="doi">10.1186/s13742-015-0080-7</citation>
+    </citations>
+</tool>
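
The effect of the ``alphasort`` option on the combined identifier can be illustrated with a short sketch. The helper ``combined_name`` is a hypothetical name used only here; the cluster is the one that arises in the tests when ``more_duplicates.fasta`` is given before ``duplicates.fasta``.

```python
def combined_name(identifiers, sep=";", sort_ids=False):
    """Build the title of a representative record from its cluster of IDs."""
    ids = sorted(identifiers) if sort_ids else list(identifiers)
    return "%s representing %i records" % (sep.join(ids), len(ids))


# Identifiers in input order: more_duplicates.fasta first, then duplicates.fasta
cluster = ["Quick", "Brown", "Fox", "3", "5"]

print(combined_name(cluster))
# Quick;Brown;Fox;3;5 representing 5 records

print(combined_name(cluster, sort_ids=True))
# 3;5;Brown;Fox;Quick representing 5 records
```

The sort is a plain string sort, so digits come before upper case letters, and an identifier such as ``10`` would sort before ``2``.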
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/make_nr/tool_dependencies.xml	Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,6 @@
+<?xml version="1.0" ?>
+<tool_dependency>
+    <package name="biopython" version="1.67">
+        <repository changeset_revision="a12f73c3b116" name="package_biopython_1_67" owner="biopython" toolshed="https://toolshed.g2.bx.psu.edu"/>
+    </package>
+</tool_dependency>
\ No newline at end of file