# HG changeset patch
# User peterjc
# Date 1375186048 14400
# Node ID 6aafa0ced80211d626dece068a581d18162e935b
# Parent a8ef75aab1f94d90ae19fb433e97b393a2464303
Uploaded v0.0.8, moved development to GitHub; no functional changes

diff -r a8ef75aab1f9 -r 6aafa0ced802 blastxml_to_top_descr/README.rst
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/blastxml_to_top_descr/README.rst	Tue Jul 30 08:07:28 2013 -0400
@@ -0,0 +1,120 @@
+Galaxy tool to extract top BLAST hit descriptions from BLAST XML
+================================================================
+
+This tool is copyright 2012-2013 by Peter Cock, The James Hutton Institute
+(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved.
+See the licence text below.
+
+This tool is a short Python script to parse a BLAST XML file, and extract the
+identifiers with description for the top matches (by default the top 3), and
+output these as a simple tabular file along with the query identifiers.
+
+It is available from the Galaxy Tool Shed at:
+http://toolshed.g2.bx.psu.edu/view/peterjc/blastxml_to_top_descr
+
+This requires the 'blast_datatypes' repository from the Galaxy Tool Shed
+to provide the 'blastxml' file format definition.
+
+
+Automated Installation
+======================
+
+This should be straightforward, Galaxy should automatically install the
+'blast_datatypes' dependency.
+
+
+Manual Installation
+===================
+
+If you haven't done so before, first install the 'blast_datatypes' repository.
+
+There are just two files to install (if doing this manually):
+
+* blastxml_to_top_descr.py (the Python script)
+* blastxml_to_top_descr.xml (the Galaxy tool definition)
+
+The suggested location is in the Galaxy folder tools/ncbi_blast_plus next to
+the NCBI BLAST+ tool wrappers.
+
+You will also need to modify the tools_conf.xml file to tell Galaxy to offer
+the tool. e.g. next to the NCBI BLAST+ tools. Simply add the line::
+
+
+
+To run the tool's tests, also add this line to tools_conf.xml.sample then::
+
+    $ sh run_functional_tests.sh -id blastxml_to_top_descr
+
+
+History
+=======
+
+======= ======================================================================
+Version Changes
+------- ----------------------------------------------------------------------
+v0.0.1  - Initial version.
+v0.0.2  - Since BLAST+ was moved out of the Galaxy core, now have a dependency
+          on the 'blast_datatypes' repository in the Tool Shed.
+v0.0.3  - Include the test files required to run the unit tests
+v0.0.4  - Quote filenames in case they contain spaces (internal change)
+v0.0.5  - Include number of queries with BLAST matches in stdout (peek text)
+v0.0.6  - Check for errors via the script's return code (internal change)
+v0.0.7  - Link to Tool Shed added to help text and this documentation.
+        - Tweak dependency on blast_datatypes to also work on Test Tool Shed
+        - Adopt standard MIT License.
+v0.0.8  - Development moved to GitHub, https://github.com/peterjc/galaxy_blast
+======= ======================================================================
+
+
+Bug Reports
+===========
+
+You can file an issue here https://github.com/peterjc/galaxy_blast/issues or ask
+us on the Galaxy development list http://lists.bx.psu.edu/listinfo/galaxy-dev
+
+
+Developers
+==========
+
+This script and related tools were originally developed on the 'tools' branch of
+the following Mercurial repository: https://bitbucket.org/peterjc/galaxy-central/
+
+As of July 2013, development is continuing on a dedicated GitHub repository:
+https://github.com/peterjc/galaxy_blast
+
+For making the "Galaxy Tool Shed" http://toolshed.g2.bx.psu.edu/ tarball use
+the following command from the GitHub repository root folder::
+
+    $ tar -czf blastxml_to_top_descr.tar.gz blastxml_to_top_descr/README.rst blastxml_to_top_descr/blastxml_to_top_descr.* blastxml_to_top_descr/repository_dependencies.xml test-data/blastp_four_human_vs_rhodopsin.xml test-data/blastp_four_human_vs_rhodopsin_top3.tabular
+
+Check this worked::
+
+    $ tar -tzf blastxml_to_top_descr.tar.gz
+    blastxml_to_top_descr/README.rst
+    blastxml_to_top_descr/blastxml_to_top_descr.py
+    blastxml_to_top_descr/blastxml_to_top_descr.xml
+    blastxml_to_top_descr/repository_dependencies.xml
+    test-data/blastp_four_human_vs_rhodopsin.xml
+    test-data/blastp_four_human_vs_rhodopsin_top3.tabular
+
+
+Licence (MIT)
+=============
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
diff -r a8ef75aab1f9 -r 6aafa0ced802 blastxml_to_top_descr/blastxml_to_top_descr.py
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/blastxml_to_top_descr/blastxml_to_top_descr.py	Tue Jul 30 08:07:28 2013 -0400
@@ -0,0 +1,122 @@
+#!/usr/bin/env python
+"""Convert a BLAST XML file to a top hits description table.
+
+Takes three command line options, input BLAST XML filename, output tabular
+BLAST filename, number of hits to collect the descriptions of.
+"""
+import sys
+import re
+
+if "-v" in sys.argv or "--version" in sys.argv:
+    print "v0.0.5"
+    sys.exit(0)
+
+if sys.version_info[:2] >= ( 2, 5 ):
+    import xml.etree.cElementTree as ElementTree
+else:
+    from galaxy import eggs
+    import pkg_resources; pkg_resources.require( "elementtree" )
+    from elementtree import ElementTree
+
+def stop_err( msg ):
+    sys.stderr.write("%s\n" % msg)
+    sys.exit(1)
+
+#Parse Command Line
+try:
+    in_file, out_file, topN = sys.argv[1:]
+except:
+    stop_err("Expect 3 arguments: input BLAST XML file, output tabular file, number of hits")
+
+
+try:
+    topN = int(topN)
+except ValueError:
+    stop_err("Number of hits argument should be an integer (at least 1)")
+if topN < 1:
+    stop_err("Number of hits argument should be an integer (at least 1)")
+
+# get an iterable
+try:
+    context = ElementTree.iterparse(in_file, events=("start", "end"))
+except:
+    stop_err("Invalid data format.")
+# turn it into an iterator
+context = iter(context)
+# get the root element
+try:
+    event, root = context.next()
+except:
+    stop_err( "Invalid data format." )
+
+
+re_default_query_id = re.compile("^Query_\d+$")
+assert re_default_query_id.match("Query_101")
+assert not re_default_query_id.match("Query_101a")
+assert not re_default_query_id.match("MyQuery_101")
+re_default_subject_id = re.compile("^Subject_\d+$")
+assert re_default_subject_id.match("Subject_1")
+assert not re_default_subject_id.match("Subject_")
+assert not re_default_subject_id.match("Subject_12a")
+assert not re_default_subject_id.match("TheSubject_1")
+
+
+count = 0
+pos_count = 0
+outfile = open(out_file, 'w')
+outfile.write("#Query\t%s\n" % "\t".join("BLAST hit %i" % (i+1) for i in range(topN)))
+for event, elem in context:
+    # for every <Iteration> tag
+    if event == "end" and elem.tag == "Iteration":
+        #Expecting either this, from BLAST 2.2.25+ using FASTA vs FASTA
+        # <Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID>
+        # <Iteration_query-def>Endoplasmic reticulum resident protein 44 OS=Homo sapiens GN=ERP44 PE=1 SV=1</Iteration_query-def>
+        # <Iteration_query-len>406</Iteration_query-len>
+        # <Iteration_hits></Iteration_hits>
+        #
+        #Or, from BLAST 2.2.24+ run online
+        # <Iteration_query-ID>Query_1</Iteration_query-ID>
+        # <Iteration_query-def>Sample</Iteration_query-def>
+        # <Iteration_query-len>516</Iteration_query-len>
+        # <Iteration_hits>...
+        qseqid = elem.findtext("Iteration_query-ID")
+        if qseqid is None:
+            stop_err("Missing <Iteration_query-ID> (could be really old BLAST XML data?)")
+        if re_default_query_id.match(qseqid):
+            #Place holder ID, take the first word of the query definition
+            qseqid = elem.findtext("Iteration_query-def").split(None,1)[0]
+        # for every <Hit> within <Iteration>
+        hit_descrs = []
+        for hit in elem.findall("Iteration_hits/Hit"):
+            #Expecting either this,
+            # <Hit_id>gi|3024260|sp|P56514.1|OPSD_BUFBU</Hit_id>
+            # <Hit_def>RecName: Full=Rhodopsin</Hit_def>
+            # <Hit_accession>P56514</Hit_accession>
+            #or,
+            # <Hit_id>Subject_1</Hit_id>
+            # <Hit_def>gi|57163783|ref|NP_001009242.1| rhodopsin [Felis catus]</Hit_def>
+            # <Hit_accession>Subject_1</Hit_accession>
+            #
+            #apparently depending on the parse_deflines switch
+            sseqid = hit.findtext("Hit_id").split(None,1)[0]
+            hit_def = sseqid + " " + hit.findtext("Hit_def")
+            if re_default_subject_id.match(sseqid) \
+            and sseqid == hit.findtext("Hit_accession"):
+                #Place holder ID, take the first word of the subject definition
+                hit_def = hit.findtext("Hit_def")
+                sseqid = hit_def.split(None,1)[0]
+            assert hit_def not in hit_descrs
+            hit_descrs.append(hit_def)
+        #print "%r has %i hits" % (qseqid, len(hit_descrs))
+        if hit_descrs:
+            pos_count += 1
+        hit_descrs = hit_descrs[:topN]
+        while len(hit_descrs) < topN:
+            hit_descrs.append("")
+        outfile.write("%s\t%s\n" % (qseqid, "\t".join(hit_descrs)))
+        count += 1
+        # prevents ElementTree from growing large datastructure
+        root.clear()
+        elem.clear()
+outfile.close()
+print "Of %i queries, %i had BLAST results" % (count, pos_count)
diff -r a8ef75aab1f9 -r 6aafa0ced802 blastxml_to_top_descr/blastxml_to_top_descr.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/blastxml_to_top_descr/blastxml_to_top_descr.xml	Tue Jul 30 08:07:28 2013 -0400
@@ -0,0 +1,58 @@
+
+ Make a table from BLAST XML
+ blastxml_to_top_descr.py --version
+
+ blastxml_to_top_descr.py "${blastxml_file}" "${tabular_file}" ${topN}
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+**What it does**
+
+NCBI BLAST+ (and the older NCBI 'legacy' BLAST) can output in a range of
+formats including text, tabular and a more detailed XML format. You can
+do a lot of things with tabular files in Galaxy (sorting, filtering, joins,
+etc) however currently the BLAST tabular output omits the hit descriptions
+found in the other output formats.
+
+This tool turns a BLAST XML file into a simple tabular file containing
+one row per query sequence, containing the query identifier and then
+the three (by default) top hit descriptions. If a query doesn't have
+that many hits, then these entries are left blank.
+
+**Example Usage**
+
+One simple usage would be to take a transcriptome assembly or set of
+gene predictions, run a BLAST search against the NCBI NR database, and
+then use this tool to make a table of the top three BLAST hits. This
+can give you a 'quick and dirty' crude annotation, potentially enough
+to spot some problems (e.g. bacterial contaimination could be very
+obvious).
+
+**Citation**
+
+This wrapper is available to install into other Galaxy Instances via the Galaxy
+Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/blastxml_to_top_descr
+
+
+
diff -r a8ef75aab1f9 -r 6aafa0ced802 blastxml_to_top_descr/repository_dependencies.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/blastxml_to_top_descr/repository_dependencies.xml	Tue Jul 30 08:07:28 2013 -0400
@@ -0,0 +1,5 @@
+
+
+
+
+
diff -r a8ef75aab1f9 -r 6aafa0ced802 tools/ncbi_blast_plus/blastxml_to_top_descr.py
--- a/tools/ncbi_blast_plus/blastxml_to_top_descr.py	Wed Jul 24 11:56:22 2013 -0400
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,122 +0,0 @@
-#!/usr/bin/env python
-"""Convert a BLAST XML file to a top hits description table.
-
-Takes three command line options, input BLAST XML filename, output tabular
-BLAST filename, number of hits to collect the descriptions of.
-"""
-import sys
-import re
-
-if "-v" in sys.argv or "--version" in sys.argv:
-    print "v0.0.5"
-    sys.exit(0)
-
-if sys.version_info[:2] >= ( 2, 5 ):
-    import xml.etree.cElementTree as ElementTree
-else:
-    from galaxy import eggs
-    import pkg_resources; pkg_resources.require( "elementtree" )
-    from elementtree import ElementTree
-
-def stop_err( msg ):
-    sys.stderr.write("%s\n" % msg)
-    sys.exit(1)
-
-#Parse Command Line
-try:
-    in_file, out_file, topN = sys.argv[1:]
-except:
-    stop_err("Expect 3 arguments: input BLAST XML file, output tabular file, number of hits")
-
-
-try:
-    topN = int(topN)
-except ValueError:
-    stop_err("Number of hits argument should be an integer (at least 1)")
-if topN < 1:
-    stop_err("Number of hits argument should be an integer (at least 1)")
-
-# get an iterable
-try:
-    context = ElementTree.iterparse(in_file, events=("start", "end"))
-except:
-    stop_err("Invalid data format.")
-# turn it into an iterator
-context = iter(context)
-# get the root element
-try:
-    event, root = context.next()
-except:
-    stop_err( "Invalid data format." )
-
-
-re_default_query_id = re.compile("^Query_\d+$")
-assert re_default_query_id.match("Query_101")
-assert not re_default_query_id.match("Query_101a")
-assert not re_default_query_id.match("MyQuery_101")
-re_default_subject_id = re.compile("^Subject_\d+$")
-assert re_default_subject_id.match("Subject_1")
-assert not re_default_subject_id.match("Subject_")
-assert not re_default_subject_id.match("Subject_12a")
-assert not re_default_subject_id.match("TheSubject_1")
-
-
-count = 0
-pos_count = 0
-outfile = open(out_file, 'w')
-outfile.write("#Query\t%s\n" % "\t".join("BLAST hit %i" % (i+1) for i in range(topN)))
-for event, elem in context:
-    # for every <Iteration> tag
-    if event == "end" and elem.tag == "Iteration":
-        #Expecting either this, from BLAST 2.2.25+ using FASTA vs FASTA
-        # <Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID>
-        # <Iteration_query-def>Endoplasmic reticulum resident protein 44 OS=Homo sapiens GN=ERP44 PE=1 SV=1</Iteration_query-def>
-        # <Iteration_query-len>406</Iteration_query-len>
-        # <Iteration_hits></Iteration_hits>
-        #
-        #Or, from BLAST 2.2.24+ run online
-        # <Iteration_query-ID>Query_1</Iteration_query-ID>
-        # <Iteration_query-def>Sample</Iteration_query-def>
-        # <Iteration_query-len>516</Iteration_query-len>
-        # <Iteration_hits>...
-        qseqid = elem.findtext("Iteration_query-ID")
-        if qseqid is None:
-            stop_err("Missing <Iteration_query-ID> (could be really old BLAST XML data?)")
-        if re_default_query_id.match(qseqid):
-            #Place holder ID, take the first word of the query definition
-            qseqid = elem.findtext("Iteration_query-def").split(None,1)[0]
-        # for every <Hit> within <Iteration>
-        hit_descrs = []
-        for hit in elem.findall("Iteration_hits/Hit"):
-            #Expecting either this,
-            # <Hit_id>gi|3024260|sp|P56514.1|OPSD_BUFBU</Hit_id>
-            # <Hit_def>RecName: Full=Rhodopsin</Hit_def>
-            # <Hit_accession>P56514</Hit_accession>
-            #or,
-            # <Hit_id>Subject_1</Hit_id>
-            # <Hit_def>gi|57163783|ref|NP_001009242.1| rhodopsin [Felis catus]</Hit_def>
-            # <Hit_accession>Subject_1</Hit_accession>
-            #
-            #apparently depending on the parse_deflines switch
-            sseqid = hit.findtext("Hit_id").split(None,1)[0]
-            hit_def = sseqid + " " + hit.findtext("Hit_def")
-            if re_default_subject_id.match(sseqid) \
-            and sseqid == hit.findtext("Hit_accession"):
-                #Place holder ID, take the first word of the subject definition
-                hit_def = hit.findtext("Hit_def")
-                sseqid = hit_def.split(None,1)[0]
-            assert hit_def not in hit_descrs
-            hit_descrs.append(hit_def)
-        #print "%r has %i hits" % (qseqid, len(hit_descrs))
-        if hit_descrs:
-            pos_count += 1
-        hit_descrs = hit_descrs[:topN]
-        while len(hit_descrs) < topN:
-            hit_descrs.append("")
-        outfile.write("%s\t%s\n" % (qseqid, "\t".join(hit_descrs)))
-        count += 1
-        # prevents ElementTree from growing large datastructure
-        root.clear()
-        elem.clear()
-outfile.close()
-print "Of %i queries, %i had BLAST results" % (count, pos_count)
diff -r a8ef75aab1f9 -r 6aafa0ced802 tools/ncbi_blast_plus/blastxml_to_top_descr.rst
--- a/tools/ncbi_blast_plus/blastxml_to_top_descr.rst	Wed Jul 24 11:56:22 2013 -0400
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,109 +0,0 @@
-Galaxy tool to extract top BLAST hit descriptions from BLAST XML
-================================================================
-
-This tool is copyright 2012-2013 by Peter Cock, The James Hutton Institute
-(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved.
-See the licence text below.
-
-This tool is a short Python script to parse a BLAST XML file, and extract the
-identifiers with description for the top matches (by default the top 3), and
-output these as a simple tabular file along with the query identifiers.
-
-It is available from the Galaxy Tool Shed at:
-http://toolshed.g2.bx.psu.edu/view/peterjc/blastxml_to_top_descr
-
-This requires the 'blast_datatypes' repository from the Galaxy Tool Shed
-to provide the 'blastxml' file format definition.
-
-
-Automated Installation
-======================
-
-This should be straightforward, Galaxy should automatically install the
-'blast_datatypes' dependency.
-
-
-Manual Installation
-===================
-
-If you haven't done so before, first install the 'blast_datatypes' repository.
-
-There are just two files to install (if doing this manually):
-
-* blastxml_to_top_descr.py (the Python script)
-* blastxml_to_top_descr.xml (the Galaxy tool definition)
-
-The suggested location is in the Galaxy folder tools/ncbi_blast_plus next to
-the NCBI BLAST+ tool wrappers.
-
-You will also need to modify the tools_conf.xml file to tell Galaxy to offer
-the tool. e.g. next to the NCBI BLAST+ tools. Simply add the line::
-
-
-
-To run the tool's tests, also add this line to tools_conf.xml.sample then::
-
-    $ sh run_functional_tests.sh -id blastxml_to_top_descr
-
-
-History
-=======
-
-======= ======================================================================
-Version Changes
-------- ----------------------------------------------------------------------
-v0.0.1  - Initial version.
-v0.0.2  - Since BLAST+ was moved out of the Galaxy core, now have a dependency
-          on the 'blast_datatypes' repository in the Tool Shed.
-v0.0.3  - Include the test files required to run the unit tests
-v0.0.4  - Quote filenames in case they contain spaces (internal change)
-v0.0.5  - Include number of queries with BLAST matches in stdout (peek text)
-v0.0.6  - Check for errors via the script's return code (internal change)
-v0.0.7  - Link to Tool Shed added to help text and this documentation.
-        - Tweak dependency on blast_datatypes to also work on Test Tool Shed
-        - Adopt standard MIT License.
-======= ======================================================================
-
-
-Developers
-==========
-
-This script and related tools are being developed on the following hg branch:
-http://bitbucket.org/peterjc/galaxy-central/src/tools
-
-For making the "Galaxy Tool Shed" http://toolshed.g2.bx.psu.edu/ tarball use
-the following command from the Galaxy root folder::
-
-    $ tar -czf blastxml_to_top_descr.tar.gz tools/ncbi_blast_plus/blastxml_to_top_descr.* tools/ncbi_blast_plus/repository_dependencies.xml test-data/blastp_four_human_vs_rhodopsin.xml test-data/blastp_four_human_vs_rhodopsin_top3.tabular
-
-Check this worked::
-
-    $ tar -tzf blastxml_to_top_descr.tar.gz
-    tools/ncbi_blast_plus/blastxml_to_top_descr.py
-    tools/ncbi_blast_plus/blastxml_to_top_descr.rst
-    tools/ncbi_blast_plus/blastxml_to_top_descr.xml
-    tools/ncbi_blast_plus/repository_dependencies.xml
-    test-data/blastp_four_human_vs_rhodopsin.xml
-    test-data/blastp_four_human_vs_rhodopsin_top3.tabular
-
-
-Licence (MIT)
-=============
-
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
-
-The above copyright notice and this permission notice shall be included in
-all copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
-THE SOFTWARE.
diff -r a8ef75aab1f9 -r 6aafa0ced802 tools/ncbi_blast_plus/blastxml_to_top_descr.xml
--- a/tools/ncbi_blast_plus/blastxml_to_top_descr.xml	Wed Jul 24 11:56:22 2013 -0400
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,58 +0,0 @@
-
- Make a table from BLAST XML
- blastxml_to_top_descr.py --version
-
- blastxml_to_top_descr.py "${blastxml_file}" "${tabular_file}" ${topN}
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-**What it does**
-
-NCBI BLAST+ (and the older NCBI 'legacy' BLAST) can output in a range of
-formats including text, tabular and a more detailed XML format. You can
-do a lot of things with tabular files in Galaxy (sorting, filtering, joins,
-etc) however currently the BLAST tabular output omits the hit descriptions
-found in the other output formats.
-
-This tool turns a BLAST XML file into a simple tabular file containing
-one row per query sequence, containing the query identifier and then
-the three (by default) top hit descriptions. If a query doesn't have
-that many hits, then these entries are left blank.
-
-**Example Usage**
-
-One simple usage would be to take a transcriptome assembly or set of
-gene predictions, run a BLAST search against the NCBI NR database, and
-then use this tool to make a table of the top three BLAST hits. This
-can give you a 'quick and dirty' crude annotation, potentially enough
-to spot some problems (e.g. bacterial contaimination could be very
-obvious).
-
-**Citation**
-
-This wrapper is available to install into other Galaxy Instances via the Galaxy
-Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/blastxml_to_top_descr
-
-
-
diff -r a8ef75aab1f9 -r 6aafa0ced802 tools/ncbi_blast_plus/repository_dependencies.xml
--- a/tools/ncbi_blast_plus/repository_dependencies.xml	Wed Jul 24 11:56:22 2013 -0400
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,5 +0,0 @@
-
-
-
-
-
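
Reviewer note (not part of the changeset): the script moved by this patch targets Python 2. For readers who want to try the per-query extraction logic on a current interpreter, here is a minimal Python 3 sketch of the same idea. It is an illustration under stated assumptions, not the author's code: the names `top_hit_rows` and `SAMPLE` are hypothetical, the sample XML is a cut-down fragment assembled from the example values in the script's comments, and the sketch leaves out the script's command-line handling, its `root.clear()` call, and its assertion against duplicate descriptions.

```python
import io
import re
import xml.etree.ElementTree as ElementTree

# Placeholder identifiers that BLAST emits when it does not parse deflines.
RE_DEFAULT_QUERY_ID = re.compile(r"^Query_\d+$")
RE_DEFAULT_SUBJECT_ID = re.compile(r"^Subject_\d+$")


def top_hit_rows(xml_source, top_n=3):
    """Yield (query_id, descriptions) per Iteration, padded to top_n entries.

    Hypothetical helper: mirrors the loop in blastxml_to_top_descr.py but
    returns rows instead of writing a tabular file.
    """
    for _event, elem in ElementTree.iterparse(xml_source, events=("end",)):
        if elem.tag != "Iteration":
            continue
        qseqid = elem.findtext("Iteration_query-ID")
        if qseqid is None:
            raise ValueError("Missing Iteration_query-ID element")
        if RE_DEFAULT_QUERY_ID.match(qseqid):
            # Placeholder ID, take the first word of the query definition.
            qseqid = elem.findtext("Iteration_query-def").split(None, 1)[0]
        descrs = []
        for hit in elem.findall("Iteration_hits/Hit"):
            sseqid = hit.findtext("Hit_id").split(None, 1)[0]
            hit_def = sseqid + " " + hit.findtext("Hit_def")
            if (RE_DEFAULT_SUBJECT_ID.match(sseqid)
                    and sseqid == hit.findtext("Hit_accession")):
                # Placeholder ID, the definition already starts with the real ID.
                hit_def = hit.findtext("Hit_def")
            descrs.append(hit_def)
        # Keep the first top_n hits and pad short lists with empty strings.
        descrs = descrs[:top_n]
        descrs += [""] * (top_n - len(descrs))
        yield qseqid, descrs
        # Free the parsed children of this Iteration before moving on.
        elem.clear()


# Made-up, minimal BLAST XML fragment for demonstration only.
SAMPLE = b"""<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-ID>Query_1</Iteration_query-ID>
      <Iteration_query-def>seqA example query</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_id>Subject_1</Hit_id>
          <Hit_def>gi|57163783|ref|NP_001009242.1| rhodopsin [Felis catus]</Hit_def>
          <Hit_accession>Subject_1</Hit_accession>
        </Hit>
        <Hit>
          <Hit_id>gi|3024260|sp|P56514.1|OPSD_BUFBU</Hit_id>
          <Hit_def>RecName: Full=Rhodopsin</Hit_def>
          <Hit_accession>P56514</Hit_accession>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""

if __name__ == "__main__":
    for query_id, descriptions in top_hit_rows(io.BytesIO(SAMPLE), top_n=3):
        print("\t".join([query_id] + descriptions))
```

Returning rows from a generator rather than writing straight to a file keeps the parsing logic testable on its own; a caller that needs the tabular output can join each row with tabs, as the `__main__` block does.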