# HG changeset patch
# User peterjc
# Date 1541779203 18000
# Node ID c84f12187af92968a14a317515212b8f4eb451c4
v0.0.1
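
For orientation, the merging behaviour exercised by the new test data can be
sketched in plain Python. This is an illustrative sketch only, not the code of
make_nr.py: the function name collapse_duplicates and its in-memory list of
(title, sequence) pairs are assumptions made for the example, and it leaves out
Biopython, gzip support and file handling.

```python
def collapse_duplicates(records, sep=";", sort_ids=False):
    """Collapse case-insensitive duplicate sequences into one record each.

    records is a list of (title, sequence) pairs, where title is the FASTA
    header line without the leading '>' (hypothetical input format).
    """
    # First pass: group identifiers (first word of each title) by sequence.
    by_seq = {}
    for title, seq in records:
        idn = title.split(None, 1)[0]
        by_seq.setdefault(seq.upper(), []).append(idn)

    # Second pass: keep the first record of each cluster, skip the rest.
    output = []
    seen = set()
    for title, seq in records:
        key = seq.upper()
        if key in seen:
            continue  # a later duplicate, already represented
        seen.add(key)
        cluster = by_seq[key]
        if len(cluster) > 1:
            ids = sorted(cluster) if sort_ids else cluster
            title = "%s representing %i records" % (sep.join(ids), len(cluster))
        output.append((title, seq))
    return output


# The six records of test-data/duplicates.fasta.
records = [
    ("1 first entry", "act"),
    ("2 The A-Team", "AAaa"),
    ("3 not unique...", "ACgt"),
    ("4", "CCCC"),
    ("5 a duplicate", "acgt"),
    ("6 last!", "GGGG"),
]
for title, seq in collapse_duplicates(records):
    print(">%s\n%s" % (title, seq))
```

Entries 3 and 5 collapse into a single record named "3;5" that keeps the
position and letter case of entry 3. Passing sort_ids=True joins the
identifiers in alphabetical order instead, which corresponds to the -a option
of the script.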
diff -r 000000000000 -r c84f12187af9 test-data/deduplicate.nosortids.fasta
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/deduplicate.nosortids.fasta Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,10 @@
+>Quick;Brown;Fox;3;5 representing 5 records
+acgt
+>1 first entry
+act
+>2 The A-Team
+AAaa
+>4
+CCCC
+>6 last!
+GGGG
diff -r 000000000000 -r c84f12187af9 test-data/deduplicate.sortids.fasta
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/deduplicate.sortids.fasta Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,10 @@
+>3;5;Brown;Fox;Quick representing 5 records
+acgt
+>1 first entry
+act
+>2 The A-Team
+AAaa
+>4
+CCCC
+>6 last!
+GGGG
diff -r 000000000000 -r c84f12187af9 test-data/duplicates.fasta
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/duplicates.fasta Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,12 @@
+>1 first entry
+act
+>2 The A-Team
+AAaa
+>3 not unique...
+ACgt
+>4
+CCCC
+>5 a duplicate
+acgt
+>6 last!
+GGGG
diff -r 000000000000 -r c84f12187af9 test-data/duplicates.fasta.gz
Binary file test-data/duplicates.fasta.gz has changed
diff -r 000000000000 -r c84f12187af9 test-data/duplicates.nr.fasta
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/duplicates.nr.fasta Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,10 @@
+>1 first entry
+act
+>2 The A-Team
+AAaa
+>3;5 representing 2 records
+ACgt
+>4
+CCCC
+>6 last!
+GGGG
diff -r 000000000000 -r c84f12187af9 test-data/more_duplicates.fasta
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/more_duplicates.fasta Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,6 @@
+>Quick
+acgt
+>Brown
+ACGT
+>Fox
+ACGT
diff -r 000000000000 -r c84f12187af9 tools/make_nr/README.rst
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/make_nr/README.rst Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,140 @@
+Make a FASTA file non-redundant, with a Galaxy wrapper
+======================================================
+
+This tool is copyright 2018 by Peter Cock, The James Hutton Institute, UK.
+All rights reserved. See the licence text below.
+
+This tool is a short Python script intended to be run prior to calling
+the NCBI BLAST+ command line tool ``makeblastdb`` or in other settings
+where you want to collapse duplicated sequences in a FASTA file to a
+single representative.
+
+The script ``make_nr.py`` can be used directly (without Galaxy).
+It requires Biopython.
+
+It comes with an optional Galaxy tool definition file ``make_nr.xml``
+allowing the Python script to be run from within Galaxy. It is available
+from the Galaxy Tool Shed at:
+http://toolshed.g2.bx.psu.edu/view/peterjc/make_nr
+
+
+Citation
+========
+
+If you cannot cite the GitHub repository directly, please cite one of the
+following papers:
+
+Cock et al 2009. Biopython: freely available Python tools for computational
+molecular biology and bioinformatics. *Bioinformatics* 25(11) 1422-3.
+https://doi.org/10.1093/bioinformatics/btp163 pmid:19304878.
+
+or (and this would be more appropriate in a Galaxy setting):
+
+NCBI BLAST+ integrated into Galaxy.
+P.J.A. Cock, J.M. Chilton, B. Gruening, J.E. Johnson, N. Soranzo
+*GigaScience* 2015, 4:39
+https://doi.org/10.1186/s13742-015-0080-7
+
+
+Standalone Installation (outside Galaxy)
+========================================
+
+Outside of Galaxy, you will need Python and Biopython. The latter can usually
+be installed with ``pip install biopython`` or, if you are using Conda, with
+``conda install biopython`` instead. Then run the script using
+``python /full/path/to/make_nr.py -h`` or similar.
+
+
+Automated Installation
+======================
+
+Installation via the Galaxy Tool Shed should take care of the Galaxy side of
+things, including the dependency on Biopython.
+
+
+Manual Installation
+===================
+
+There are just two files to install:
+
+- ``make_nr.py`` (the Python script)
+- ``make_nr.xml`` (the Galaxy tool definition)
+
+The suggested location is in a ``tools/make_nr/`` folder. You will then
+need to modify the ``tool_conf.xml`` file to tell Galaxy to offer the tool
+by adding the line::
+
+  <tool file="make_nr/make_nr.xml" />
+
+If you want to run the functional tests, copy the sample test files under
+Galaxy's ``test-data/`` directory. Then::
+
+ ./run_tests.sh -id make_nr
+
+
+History
+=======
+
+TODO:
+
+ - Option to follow BLAST NR style with ctrl+a separator?
+ - Option to give representative sequences in upper case?
+
+======= ======================================================================
+Version Changes
+------- ----------------------------------------------------------------------
+v0.0.0 - Initial version
+======= ======================================================================
+
+
+Developers
+==========
+
+This tool is developed on the following GitHub repository:
+https://github.com/peterjc/galaxy_blast/tree/master/tools/make_nr
+
+For pushing a release to the test or main "Galaxy Tool Shed", use the following
+Planemo commands (which require that you have set your Tool Shed access details
+in ``~/.planemo.yml`` and that you have access rights on the Tool Shed)::
+
+ $ planemo shed_update -t testtoolshed --check_diff tools/make_nr/
+ ...
+
+or::
+
+ $ planemo shed_update -t toolshed --check_diff tools/make_nr/
+ ...
+
+To just build and check the tar ball, use::
+
+ $ planemo shed_upload --tar_only tools/make_nr/
+ ...
+ $ tar -tzf shed_upload.tar.gz
+ tools/make_nr/README.rst
+ tools/make_nr/make_nr.py
+ tools/make_nr/make_nr.xml
+ tools/make_nr/tool_dependencies.xml
+ test-data/duplicates.fasta
+ test-data/duplicates.nr.fasta
+
+
+Licence (MIT)
+=============
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
diff -r 000000000000 -r c84f12187af9 tools/make_nr/make_nr.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/make_nr/make_nr.py Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,133 @@
+#!/usr/bin/env python3
+"""Make FASTA files non-redundant by combining duplicated sequences.
+
+This script takes one or more (optionally gzipped) FASTA filenames as input,
+and will return a non-zero error if any duplicate identifiers are found.
+
+Writes output to stdout by default.
+
+Keeps all the sequences in memory, beware!
+"""
+from __future__ import print_function
+
+import gzip
+import os
+import sys
+
+from optparse import OptionParser
+
+
+if "-v" in sys.argv or "--version" in sys.argv:
+    print("v0.0.1")
+    sys.exit(0)
+
+
+# Parse Command Line
+usage = """Use as follows:
+
+$ python make_nr.py [options] A.fasta [B.fasta ...]
+
+For example,
+
+$ python make_nr.py -o dedup.fasta -s ";" input1.fasta input2.fasta
+
+The input files should be plain text FASTA format, optionally gzipped.
+
+The -a option controls how the representative record that replaces
+a set of duplicated records is named. By default the identifiers are
+taken in input file order and joined with the separator. If the -a
+(--alphasort) option is given, the identifiers are sorted
+alphabetically first. This ensures the same names are used even if the
+input file order (or the record order within the files) is randomised.
+
+There is additional guidance in the help text in the make_nr.xml file,
+which is shown to the user via the Galaxy interface to this tool.
+"""
+
+parser = OptionParser(usage=usage)
+parser.add_option("-s", "--sep", dest="sep",
+                  default=";",
+                  help="Separator character for combining identifiers "
+                  "of duplicated records e.g. '|' or ';' (default ';')")
+parser.add_option("-a", "--alphasort", action="store_true",
+                  help="When merging duplicated records sort their "
+                  "identifiers alphabetically before combining them. "
+                  "Default is input file order.")
+parser.add_option("-o", "--output", dest="output",
+                  default="/dev/stdout", metavar="FILE",
+                  help="Output filename (defaults to stdout)")
+options, args = parser.parse_args()
+
+if not args:
+    sys.exit("Expects at least one input FASTA filename")
+
+
+def gzip_open(filename):
+    """Open a possibly gzipped text file."""
+    with open(filename, "rb") as h:
+        magic = h.read(2)
+    if magic == b'\x1f\x8b':
+        return gzip.open(filename, "rt")
+    else:
+        return open(filename)
+
+
+def make_nr(input_fasta, output_fasta, sep=";", sort_ids=False):
+    """Make the sequences in FASTA files non-redundant.
+
+    Argument input_fasta is a list of filenames.
+    """
+    by_seq = dict()
+    try:
+        from Bio.SeqIO.FastaIO import SimpleFastaParser
+    except ImportError:
+        sys.exit("Missing Biopython")
+    for f in input_fasta:
+        with gzip_open(f) as handle:
+            for title, seq in SimpleFastaParser(handle):
+                idn = title.split(None, 1)[0]  # first word only
+                seq = seq.upper()
+                try:
+                    by_seq[seq].append(idn)
+                except KeyError:
+                    by_seq[seq] = [idn]
+    unique = 0
+    representatives = dict()
+    duplicates = set()
+    for cluster in by_seq.values():
+        if len(cluster) > 1:
+            # Is it useful to offer to sort here?
+            # if sort_ids:
+            #     cluster.sort()
+            representatives[cluster[0]] = cluster
+            duplicates.update(cluster[1:])
+        else:
+            unique += 1
+    del by_seq
+    # Always rewrite the records (a plain copy if nothing is duplicated).
+    # TODO - refactor as a generator with single SeqIO.write(...) call
+    with open(output_fasta, "w") as handle:
+        for f in input_fasta:
+            with gzip_open(f) as in_handle:
+                for title, seq in SimpleFastaParser(in_handle):
+                    idn = title.split(None, 1)[0]  # first word only
+                    if idn in representatives:
+                        cluster = representatives[idn]
+                        if sort_ids:
+                            cluster.sort()
+                        idn = sep.join(cluster)
+                        title = "%s representing %i records" % (idn, len(cluster))
+                    elif idn in duplicates:
+                        continue
+                    # TODO - line wrapping
+                    handle.write(">%s\n%s\n" % (title, seq))
+    if duplicates:
+        sys.stderr.write("%i unique entries; removed %i duplicates "
+                         "leaving %i representative records\n"
+                         % (unique, len(duplicates), len(representatives)))
+    else:
+        sys.stderr.write("No perfect duplicates in file, %i unique entries\n"
+                         % unique)
+
+
+make_nr(args, options.output, options.sep, options.alphasort)
diff -r 000000000000 -r c84f12187af9 tools/make_nr/make_nr.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/make_nr/make_nr.xml Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,120 @@
+
+ by combining duplicated sequences
+
+ biopython
+
+
+python $__tool_directory__/make_nr.py --version
+
+
+python $__tool_directory__/make_nr.py $alphasort -s '$separator' -o '$output'
+#for $f in $input
+'$f'
+#end for
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+**What it does**
+
+Takes one or more input FASTA files, checks them to find any duplicate sequences
+(ignoring the case), and writes an output FASTA file where any duplicates appear
+once with a combined identifier.
+
+For example, using the default separator of a semi-colon::
+
+ >1 first entry
+ act
+ >2 The A-Team
+ AAaa
+ >3 not unique...
+ ACgt
+ >4
+ CCCC
+ >5 a duplicate
+ acgt
+ >6 last!
+ GGGG
+
+In this simple example ``ACGT`` appears twice (ignoring case) as entries ``3``
+and ``5``. Entry ``3`` is renamed as ``3;5`` and entry ``5`` is omitted::
+
+ >1 first entry
+ act
+ >2 The A-Team
+ AAaa
+ >3;5 representing 2 records
+ ACgt
+ >4
+ CCCC
+ >6 last!
+ GGGG
+
+This means that the representative records take the position and sequence case
+from the first entry with that sequence.
+
+In this case the combined entry is labelled as ``3;5`` either way, so the sort
+option has no effect. However, if the records appeared in the file with ``5``
+before ``3`` you could choose to get ``5;3`` (order from file, default) or
+``3;5`` (ordered alphabetically).
+
+Notice the unique sequences are preserved as they were with any description
+or mixed case.
+
+
+**References**
+
+If you cannot cite this tool directly via the GitHub URL
+https://github.com/peterjc/galaxy_blast/tree/master/tools/make_nr
+and need a traditional paper, then please cite:
+
+P.J.A. Cock, J.M. Chilton, B. Gruening, J.E. Johnson, N. Soranzo (2015).
+NCBI BLAST+ integrated into Galaxy.
+*GigaScience* 4:39
+https://doi.org/10.1186/s13742-015-0080-7
+
+This wrapper is available to install into other Galaxy Instances via the Galaxy
+Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/make_nr
+
+
+ 10.1186/1471-2105-10-421
+
+
diff -r 000000000000 -r c84f12187af9 tools/make_nr/tool_dependencies.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/make_nr/tool_dependencies.xml Fri Nov 09 11:00:03 2018 -0500
@@ -0,0 +1,6 @@
+
+
+
+
+
+
\ No newline at end of file