# HG changeset patch
# User peterjc
# Date 1486054143 18000
# Node ID e1398f2ba9fe7dc2de4ca0b36213503f5535148f
# Parent 7c0642fc57ad33b45f168baf926e789b0711405a
v0.0.8 galaxy_sequence_utils dependency etc
diff -r 7c0642fc57ad -r e1398f2ba9fe test-data/four_human_proteins.rename.tabular
--- a/test-data/four_human_proteins.rename.tabular Fri Oct 11 04:39:16 2013 -0400
+++ b/test-data/four_human_proteins.rename.tabular Thu Feb 02 11:49:03 2017 -0500
@@ -1,5 +1,5 @@
#FASTA ID
-sp|Q9BS26|ERP44_HUMAN Q9BS26
+sp|Q9BS26|ERP44_HUMAN Q9BS26 and ignore this description
sp|Q9NSY1|BMP2K_HUMAN Q9NSY1
-sp|P06213|INSR_HUMAN P06213
+sp|P06213|INSR_HUMAN and ignore this description P06213
sp|P08100|OPSD_HUMAN P08100
diff -r 7c0642fc57ad -r e1398f2ba9fe tools/seq_rename/README.rst
--- a/tools/seq_rename/README.rst Fri Oct 11 04:39:16 2013 -0400
+++ b/tools/seq_rename/README.rst Thu Feb 02 11:49:03 2017 -0500
@@ -1,7 +1,7 @@
Galaxy tool to rename FASTA, QUAL, FASTQ or SFF sequences
=========================================================
-This tool is copyright 2011-2013 by Peter Cock, The James Hutton Institute
+This tool is copyright 2011-2017 by Peter Cock, The James Hutton Institute
(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved.
See the licence text below.
@@ -35,20 +35,20 @@
There are just two files to install to use this tool from within Galaxy:
-* seq_rename.py (the Python script)
-* seq_rename.xml (the Galaxy tool definition)
+* ``seq_rename.py`` (the Python script)
+* ``seq_rename.xml`` (the Galaxy tool definition)
-The suggested location is in a dedicated tools/seq_rename folder.
+The suggested location is in a dedicated ``tools/seq_rename`` folder.
-You will also need to modify the tools_conf.xml file to tell Galaxy to offer the
+You will also need to modify the ``tool_conf.xml`` file to tell Galaxy to offer the
tool. One suggested location is in the filters section. Simply add the line::
-If you wish to run the unit tests, also add this to tools_conf.xml.sample
-and move/copy the test-data files under Galaxy's test-data folder. Then::
+If you wish to run the unit tests, also move/copy the ``test-data/`` files
+under Galaxy's ``test-data/`` folder. Then::
- $ ./run_functional_tests.sh -id seq_rename
+ $ ./run_tests.sh -id seq_rename
You will also need to install Biopython 1.54 or later. That's it.
@@ -70,6 +70,18 @@
- Updated citation information (Cock et al. 2013).
- Development moved to GitHub, https://github.com/peterjc/pico_galaxy
- Renamed folder and adopted README.rst naming.
+v0.0.5 - Correct automated dependency definition.
+v0.0.6 - Simplified XML to apply input format to output data.
+ - Tool definition now embeds citation information.
+ - If white space is found in the requested tabular field then only
+ the first word is used as the identifier (with a warning to stderr).
+v0.0.7 - Use the ``format_source=...`` tag.
+ - Reorder XML elements (internal change only).
+ - Planemo for Tool Shed upload (``.shed.yml``, internal change only).
+ - Capture the tool version via Galaxy (bug fix).
+v0.0.8 - Updated to point at Biopython 1.67 (latest version in Tool Shed).
+ - Explicit dependency on ``galaxy_sequence_utils``.
+ - Python style updates (internal change only).
======= ======================================================================
@@ -82,21 +94,30 @@
Development has now moved to a dedicated GitHub repository:
https://github.com/peterjc/pico_galaxy/tree/master/tools
-For making the "Galaxy Tool Shed" http://toolshed.g2.bx.psu.edu/ tarball use
-the following command from the Galaxy root folder::
+For pushing a release to the test or main "Galaxy Tool Shed", use the following
+Planemo commands (which require that you have set your Tool Shed access details
+in ``~/.planemo.yml`` and that you have access rights on the Tool Shed)::
+
+ $ planemo shed_update -t testtoolshed --check_diff ~/repositories/pico_galaxy/tools/seq_rename/
+ ...
+
+or::
- $ tar -czf seq_rename.tar.gz tools/seq_rename/README.rst tools/seq_rename/seq_rename.* tools/seq_rename/repository_dependencies.xml test-data/four_human_proteins.fasta test-data/four_human_proteins.rename.tabular test-data/four_human_proteins.rename.fasta
+ $ planemo shed_update -t toolshed --check_diff ~/repositories/pico_galaxy/tools/seq_rename/
+ ...
+
+To just build and check the tarball, use::
-Check this worked::
-
- $ tar -tzf seq_rename.tar.gz
+ $ planemo shed_upload --tar_only ~/repositories/pico_galaxy/tools/seq_rename/
+ ...
+ $ tar -tzf shed_upload.tar.gz
+ test-data/four_human_proteins.fasta
+ test-data/four_human_proteins.rename.fasta
+ test-data/four_human_proteins.rename.tabular
tools/seq_rename/README.rst
tools/seq_rename/seq_rename.py
tools/seq_rename/seq_rename.xml
- tools/seq_rename/repository_dependencies.xml
- test-data/four_human_proteins.fasta
- test-data/four_human_proteins.rename.tabular
- test-data/four_human_proteins.rename.fasta
+ tools/seq_rename/tool_dependencies.xml
Licence (MIT)
diff -r 7c0642fc57ad -r e1398f2ba9fe tools/seq_rename/repository_dependencies.xml
--- a/tools/seq_rename/repository_dependencies.xml Fri Oct 11 04:39:16 2013 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,6 +0,0 @@
-
-
-
-
-
diff -r 7c0642fc57ad -r e1398f2ba9fe tools/seq_rename/seq_rename.py
--- a/tools/seq_rename/seq_rename.py Fri Oct 11 04:39:16 2013 -0400
+++ b/tools/seq_rename/seq_rename.py Thu Feb 02 11:49:03 2017 -0500
@@ -17,64 +17,85 @@
molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878.
-This script is copyright 2011-2013 by Peter Cock, The James Hutton Institute UK.
+This script is copyright 2011-2017 by Peter Cock, The James Hutton Institute UK.
All rights reserved. See accompanying text file for licence details (MIT
license).
-
-This is version 0.0.4 of the script.
"""
import sys
if "-v" in sys.argv or "--version" in sys.argv:
- print "v0.0.4"
+ print "v0.0.8"
sys.exit(0)
-def stop_err(msg, err=1):
- sys.stderr.write(msg.rstrip() + "\n")
- sys.exit(err)
-
-#Parse Command Line
+# Parse Command Line
try:
tabular_file, old_col_arg, new_col_arg, in_file, seq_format, out_file = sys.argv[1:]
except ValueError:
- stop_err("Expected six arguments (tabular file, old col, new col, input file, format, output file), got %i:\n%s" % (len(sys.argv)-1, " ".join(sys.argv)))
+ sys.exit("Expected six arguments (tabular file, old col, new col, input file, format, output file), got %i:\n%s" % (len(sys.argv) - 1, " ".join(sys.argv)))
try:
if old_col_arg.startswith("c"):
- old_column = int(old_col_arg[1:])-1
+ old_column = int(old_col_arg[1:]) - 1
else:
- old_column = int(old_col_arg)-1
+ old_column = int(old_col_arg) - 1
except ValueError:
- stop_err("Expected column number, got %s" % old_col_arg)
+ sys.exit("Expected column number, got %s" % old_col_arg)
try:
- if old_col_arg.startswith("c"):
+ if new_col_arg.startswith("c"):
- new_column = int(new_col_arg[1:])-1
+ new_column = int(new_col_arg[1:]) - 1
else:
- new_column = int(new_col_arg)-1
+ new_column = int(new_col_arg) - 1
except ValueError:
- stop_err("Expected column number, got %s" % new_col_arg)
+ sys.exit("Expected column number, got %s" % new_col_arg)
if old_column == new_column:
- stop_err("Old and new column arguments are the same!")
+ sys.exit("Old and new column arguments are the same!")
+
def parse_ids(tabular_file, old_col, new_col):
- """Read tabular file and record all specified ID mappings."""
+ """Read tabular file and record all specified ID mappings.
+
+ Prints at most one warning per column to stderr if any of the old/new
+ column entries have internal white space (only the first word is
+ used as the identifier).
+
+ This first-word rule applies to both the old and the new column.
+ """
handle = open(tabular_file, "rU")
+ old_warn = False
+ new_warn = False
for line in handle:
+ if not line.strip():
+ # Ignore blank lines
+ continue
if not line.startswith("#"):
parts = line.rstrip("\n").split("\t")
- yield parts[old_col].strip(), parts[new_col].strip()
+ old = parts[old_col].strip().split(None, 1)
+ new = parts[new_col].strip().split(None, 1)
+ if not old_warn and len(old) > 1:
+ old_warn = "WARNING: Some of your old identifiers had white space in them, " + \
+ "using first word only. e.g.:\n%s\n" % parts[old_col].strip()
+ if not new_warn and len(new) > 1:
+ new_warn = "WARNING: Some of your new identifiers had white space in them, " + \
+ "using first word only. e.g.:\n%s\n" % parts[new_col].strip()
+ yield old[0], new[0]
handle.close()
+ if old_warn:
+ sys.stderr.write(old_warn)
+ if new_warn:
+ sys.stderr.write(new_warn)
-#Load the rename mappings
+
+# Load the rename mappings
rename = dict(parse_ids(tabular_file, old_column, new_column))
print "Loaded %i ID mappings" % len(rename)
-
-#Rewrite the sequence file
-if seq_format.lower()=="sff":
- #Use Biopython for this format
+
+# Rewrite the sequence file
+if seq_format.lower() == "sff":
+ # Use Biopython for this format
renamed = 0
+
def rename_seqrecords(records, mapping):
- global renamed #nasty, but practical!
+ global renamed # nasty, but practical!
for record in records:
try:
record.id = mapping[record.id]
@@ -82,33 +103,33 @@
except KeyError:
pass
yield record
-
+
try:
from Bio.SeqIO.SffIO import SffIterator, SffWriter
except ImportError:
- stop_err("Requires Biopython 1.54 or later")
+ sys.exit("Requires Biopython 1.54 or later")
try:
from Bio.SeqIO.SffIO import ReadRocheXmlManifest
except ImportError:
- #Prior to Biopython 1.56 this was a private function
+ # Prior to Biopython 1.56 this was a private function
from Bio.SeqIO.SffIO import _sff_read_roche_index_xml as ReadRocheXmlManifest
- in_handle = open(in_file, "rb") #must be binary mode!
+ in_handle = open(in_file, "rb") # must be binary mode!
try:
manifest = ReadRocheXmlManifest(in_handle)
except ValueError:
manifest = None
out_handle = open(out_file, "wb")
writer = SffWriter(out_handle, xml=manifest)
- in_handle.seek(0) #start again after getting manifest
+ in_handle.seek(0) # start again after getting manifest
count = writer.write_file(rename_seqrecords(SffIterator(in_handle), rename))
out_handle.close()
in_handle.close()
else:
- #Use Galaxy for FASTA, QUAL or FASTQ
+ # Use Galaxy for FASTA, QUAL or FASTQ
if seq_format.lower() in ["fasta", "csfasta"] \
- or seq_format.lower().startswith("qual"):
+ or seq_format.lower().startswith("qual"):
from galaxy_utils.sequence.fasta import fastaReader, fastaWriter
reader = fastaReader(open(in_file, "rU"))
writer = fastaWriter(open(out_file, "w"))
@@ -119,13 +140,13 @@
writer = fastqWriter(open(out_file, "w"))
marker = "@"
else:
- stop_err("Unsupported file type %r" % seq_format)
- #Now do the renaming
+ sys.exit("Unsupported file type %r" % seq_format)
+ # Now do the renaming
count = 0
renamed = 0
for record in reader:
- #The [1:] is because the fastaReader leaves the > on the identifier,
- #likewise the fastqReader leaves the @ on the identifier
+ # The [1:] is because the fastaReader leaves the > on the identifier,
+ # likewise the fastqReader leaves the @ on the identifier
try:
idn, descr = record.identifier[1:].split(None, 1)
except ValueError:
diff -r 7c0642fc57ad -r e1398f2ba9fe tools/seq_rename/seq_rename.xml
--- a/tools/seq_rename/seq_rename.xml Fri Oct 11 04:39:16 2013 -0400
+++ b/tools/seq_rename/seq_rename.xml Thu Feb 02 11:49:03 2017 -0500
@@ -1,18 +1,19 @@
-
+
with ID mapping from a tabular file
- biopython
+ galaxy_sequence_utils
+ biopython
Bio
- seq_rename.py --version
-
-seq_rename.py $input_tabular $old_column $new_column $input_file $input_file.ext $output_file
-
+ seq_rename.py --version
+
+seq_rename.py $input_tabular $old_column $new_column $input_file $input_file.ext $output_file
+
@@ -20,17 +21,7 @@
-
-
-
-
-
-
-
-
-
-
-
+
@@ -55,12 +46,17 @@
new sequence file (of the same format) where the sequence identifiers have been
renamed according to the specified columns in your tabular file.
+Any original description is preserved (N/A for the SFF file format).
+
WARNING: If you have any duplicates in the input sequence file, you will still
have duplicate sequences in the output.
WARNING: If the tabular file has more than one new name for any old ID, the
last one is used.
+WARNING: The old and new names in your tabular file should not contain white space.
+If they do, only the first word is used as the identifier.
+
**References**
If you use this Galaxy tool in work leading to a scientific publication please
@@ -81,4 +77,8 @@
This tool is available to install into other Galaxy Instances via the Galaxy
Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/seq_rename
+
+ 10.7717/peerj.167
+ 10.1093/bioinformatics/btp163
+
diff -r 7c0642fc57ad -r e1398f2ba9fe tools/seq_rename/tool_dependencies.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/seq_rename/tool_dependencies.xml Thu Feb 02 11:49:03 2017 -0500
@@ -0,0 +1,9 @@
+
+
+
+
+
+
+
+
+