annotate tools/blast2go/massage_xml_for_blast2go.py @ 8:e23b621eb7bb draft

Uploaded v0.0.9, embed citation, updated README
author peterjc
date Thu, 26 Mar 2015 11:15:22 -0400
parents
children 887adf823bc0
#!/usr/bin/env python
"""Script for reformatting Blast XML to suit Blast2GO.

This script takes exactly two command line arguments:
 * Input BLAST XML filename
 * Output BLAST XML filename

Sadly b2g4pipe (at least v2.3.5 to v2.5.0) cannot cope with current
style large BLAST XML files (e.g. from BLAST 2.2.25+), so we reformat
these to avoid it crashing with a Java heap space OutOfMemoryError.

As part of this reformatting, we check for BLASTP or BLASTX output
(otherwise raise an error), and print the query count.

This script is called from my Galaxy wrapper for Blast2GO for pipelines,
available from the Galaxy Tool Shed here:
http://toolshed.g2.bx.psu.edu/view/peterjc/blast2go

This script is under version control here:
https://github.com/peterjc/galaxy_blast/tree/master/blast2go
"""
import sys
import os

def stop_err(msg, error_level=1):
    """Print error message to stderr and quit with given error level."""
    sys.stderr.write("%s\n" % msg)
    sys.exit(error_level)

def prepare_xml(original_xml, mangled_xml):
    """Reformat BLAST XML to suit Blast2GO.

    Blast2GO can't cope with 1000s of <Iteration> tags within a
    single <BlastOutput> tag, so instead split this into one
    full XML record per iteration (i.e. per query). This gives
    a concatenated XML file mimicking old versions of BLAST.

    This also checks for BLASTP or BLASTX output, and outputs
    the number of queries. Galaxy will show this as "info".
    """
    in_handle = open(original_xml)
    footer = " </BlastOutput_iterations>\n</BlastOutput>\n"
    header = ""
    # Everything before the first <Iteration> line is the shared XML header.
    while True:
        line = in_handle.readline()
        if not line:
            # Reached end of file without finding any <Iteration> (no hits?)
            in_handle.close()
            stop_err("Problem with XML file?")
        if line.strip() == "<Iteration>":
            break
        header += line

    if "<BlastOutput_program>blastx</BlastOutput_program>" in header:
        print("BLASTX output identified")
    elif "<BlastOutput_program>blastp</BlastOutput_program>" in header:
        print("BLASTP output identified")
    else:
        in_handle.close()
        stop_err("Expect BLASTP or BLASTX output")

    out_handle = open(mangled_xml, "w")
    out_handle.write(header)
    out_handle.write(line)
    count = 1
    while True:
        line = in_handle.readline()
        if not line:
            break
        elif line.strip() == "<Iteration>":
            # Close the previous record, then start a new one with its own header
            out_handle.write(footer)
            out_handle.write(header)
            count += 1
        out_handle.write(line)

    out_handle.close()
    in_handle.close()
    print("Input has %i queries" % count)


if __name__ == "__main__":
    # Run the conversion...
    if len(sys.argv) != 3:
        stop_err("Require two arguments: XML input filename, XML output filename")

    xml_file, out_xml_file = sys.argv[1:]

    if not os.path.isfile(xml_file):
        stop_err("Input BLAST XML file not found: %s" % xml_file)

    prepare_xml(xml_file, out_xml_file)
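
The header/footer splitting done by prepare_xml can be tried out on a small in-memory example. The following is only a sketch of that idea, not part of the script: the helper name split_iterations, the FOOTER constant and the SAMPLE document are made up for illustration, and it assumes (as the script does) that each <Iteration> tag sits on a line of its own.

```python
import io

# Same closing text the script writes after each query's record.
FOOTER = " </BlastOutput_iterations>\n</BlastOutput>\n"


def split_iterations(in_handle, out_handle):
    """Copy BLAST XML, repeating the header before every <Iteration>.

    Hypothetical helper for illustration. Returns the number of
    <Iteration> blocks seen (i.e. the number of queries).
    """
    header = ""
    count = 0
    for line in in_handle:
        if line.strip() == "<Iteration>":
            if count:
                # Close the previous single-query document first.
                out_handle.write(FOOTER)
            out_handle.write(header)
            count += 1
        elif count == 0:
            # Still before the first query: collect header text only.
            header += line
            continue
        out_handle.write(line)
    return count


# Tiny made-up input with two queries (not real BLAST output).
SAMPLE = (
    '<?xml version="1.0"?>\n'
    "<BlastOutput>\n"
    " <BlastOutput_program>blastp</BlastOutput_program>\n"
    " <BlastOutput_iterations>\n"
    "<Iteration>\n"
    " <Iteration_query-def>query1</Iteration_query-def>\n"
    "</Iteration>\n"
    "<Iteration>\n"
    " <Iteration_query-def>query2</Iteration_query-def>\n"
    "</Iteration>\n"
    " </BlastOutput_iterations>\n"
    "</BlastOutput>\n"
)

out = io.StringIO()
n_queries = split_iterations(io.StringIO(SAMPLE), out)
result = out.getvalue()
print("Input has %i queries" % n_queries)
```

With this sample, result holds two complete XML documents back to back, one per query, which is the concatenated layout the prepare_xml docstring describes. Unlike the script, this sketch skips the BLASTP/BLASTX check and simply returns 0 when no <Iteration> line is found.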