annotate tools/ncbi_blast_plus/blastxml_to_tabular.py @ 3:643338ac83c0 draft

Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
author peterjc
date Thu, 23 Aug 2012 09:00:40 -0400
parents
children 70e7dcbf6573
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
3
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
1 #!/usr/bin/env python
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
2 """Convert a BLAST XML file to 12 column tabular output
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
3
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
4 Takes three command line options, input BLAST XML filename, output tabular
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
5 BLAST filename, output format (std for standard 12 columns, or ext for the
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
6 extended 24 columns offered in the BLAST+ wrappers).
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
7
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
8 The 12 columns output are 'qseqid sseqid pident length mismatch gapopen qstart
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
9 qend sstart send evalue bitscore' or 'std' at the BLAST+ command line, which
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
10 mean:
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
11
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
12 ====== ========= ============================================
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
13 Column NCBI name Description
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
14 ------ --------- --------------------------------------------
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
15 1 qseqid Query Seq-id (ID of your sequence)
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
16 2 sseqid Subject Seq-id (ID of the database hit)
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
17 3 pident Percentage of identical matches
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
18 4 length Alignment length
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
19 5 mismatch Number of mismatches
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
20 6 gapopen Number of gap openings
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
21 7 qstart Start of alignment in query
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
22 8 qend End of alignment in query
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
23 9 sstart Start of alignment in subject (database hit)
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
24 10 send End of alignment in subject (database hit)
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
25 11 evalue Expectation value (E-value)
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
26 12 bitscore Bit score
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
27 ====== ========= ============================================
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
28
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
29 The additional columns offered in the Galaxy BLAST+ wrappers are:
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
30
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
31 ====== ============= ===========================================
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
32 Column NCBI name Description
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
33 ------ ------------- -------------------------------------------
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
34 13 sallseqid All subject Seq-id(s), separated by a ';'
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
35 14 score Raw score
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
36 15 nident Number of identical matches
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
37 16 positive Number of positive-scoring matches
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
38 17 gaps Total number of gaps
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
39 18 ppos Percentage of positive-scoring matches
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
40 19 qframe Query frame
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
41 20 sframe Subject frame
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
42 21 qseq Aligned part of query sequence
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
43 22 sseq Aligned part of subject sequence
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
44 23 qlen Query sequence length
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
45 24 slen Subject sequence length
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
46 ====== ============= ===========================================
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
47
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
48 Most of these fields are given explicitly in the XML file, others some like
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
49 the percentage identity and the number of gap openings must be calculated.
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
50
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
51 Be aware that the sequence in the extended tabular output or XML direct from
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
52 BLAST+ may or may not use XXXX masking on regions of low complexity. This
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
53 can throw the off the calculation of percentage identity and gap openings.
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
54 [In fact, both BLAST 2.2.24+ and 2.2.25+ have a subtle bug in this regard,
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
55 with these numbers changing depending on whether or not the low complexity
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
56 filter is used.]
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
57
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
58 This script attempts to produce identical output to what BLAST+ would have done.
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
59 However, check this with "diff -b ..." since BLAST+ sometimes includes an extra
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
60 space character (probably a bug).
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
61 """
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
62 import sys
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
63 import re
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
64
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
65 if sys.version_info[:2] >= ( 2, 5 ):
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
66 import xml.etree.cElementTree as ElementTree
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
67 else:
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
68 from galaxy import eggs
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
69 import pkg_resources; pkg_resources.require( "elementtree" )
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
70 from elementtree import ElementTree
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
71
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
72 def stop_err( msg ):
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
73 sys.stderr.write("%s\n" % msg)
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
74 sys.exit(1)
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
75
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
76 #Parse Command Line
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
77 try:
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
78 in_file, out_file, out_fmt = sys.argv[1:]
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
79 except:
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
80 stop_err("Expect 3 arguments: input BLAST XML file, output tabular file, out format (std or ext)")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
81
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
82 if out_fmt == "std":
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
83 extended = False
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
84 elif out_fmt == "x22":
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
85 stop_err("Format argument x22 has been replaced with ext (extended 24 columns)")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
86 elif out_fmt == "ext":
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
87 extended = True
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
88 else:
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
89 stop_err("Format argument should be std (12 column) or ext (extended 24 columns)")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
90
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
91
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
92 # get an iterable
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
93 try:
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
94 context = ElementTree.iterparse(in_file, events=("start", "end"))
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
95 except:
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
96 stop_err("Invalid data format.")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
97 # turn it into an iterator
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
98 context = iter(context)
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
99 # get the root element
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
100 try:
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
101 event, root = context.next()
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
102 except:
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
103 stop_err( "Invalid data format." )
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
104
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
105
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
106 re_default_query_id = re.compile("^Query_\d+$")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
107 assert re_default_query_id.match("Query_101")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
108 assert not re_default_query_id.match("Query_101a")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
109 assert not re_default_query_id.match("MyQuery_101")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
110 re_default_subject_id = re.compile("^Subject_\d+$")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
111 assert re_default_subject_id.match("Subject_1")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
112 assert not re_default_subject_id.match("Subject_")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
113 assert not re_default_subject_id.match("Subject_12a")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
114 assert not re_default_subject_id.match("TheSubject_1")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
115
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
116
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
117 outfile = open(out_file, 'w')
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
118 blast_program = None
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
119 for event, elem in context:
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
120 if event == "end" and elem.tag == "BlastOutput_program":
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
121 blast_program = elem.text
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
122 # for every <Iteration> tag
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
123 if event == "end" and elem.tag == "Iteration":
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
124 #Expecting either this, from BLAST 2.2.25+ using FASTA vs FASTA
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
125 # <Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID>
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
126 # <Iteration_query-def>Endoplasmic reticulum resident protein 44 OS=Homo sapiens GN=ERP44 PE=1 SV=1</Iteration_query-def>
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
127 # <Iteration_query-len>406</Iteration_query-len>
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
128 # <Iteration_hits></Iteration_hits>
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
129 #
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
130 #Or, from BLAST 2.2.24+ run online
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
131 # <Iteration_query-ID>Query_1</Iteration_query-ID>
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
132 # <Iteration_query-def>Sample</Iteration_query-def>
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
133 # <Iteration_query-len>516</Iteration_query-len>
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
134 # <Iteration_hits>...
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
135 qseqid = elem.findtext("Iteration_query-ID")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
136 if re_default_query_id.match(qseqid):
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
137 #Place holder ID, take the first word of the query definition
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
138 qseqid = elem.findtext("Iteration_query-def").split(None,1)[0]
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
139 qlen = int(elem.findtext("Iteration_query-len"))
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
140
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
141 # for every <Hit> within <Iteration>
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
142 for hit in elem.findall("Iteration_hits/Hit"):
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
143 #Expecting either this,
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
144 # <Hit_id>gi|3024260|sp|P56514.1|OPSD_BUFBU</Hit_id>
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
145 # <Hit_def>RecName: Full=Rhodopsin</Hit_def>
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
146 # <Hit_accession>P56514</Hit_accession>
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
147 #or,
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
148 # <Hit_id>Subject_1</Hit_id>
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
149 # <Hit_def>gi|57163783|ref|NP_001009242.1| rhodopsin [Felis catus]</Hit_def>
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
150 # <Hit_accession>Subject_1</Hit_accession>
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
151 #
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
152 #apparently depending on the parse_deflines switch
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
153 sseqid = hit.findtext("Hit_id").split(None,1)[0]
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
154 hit_def = sseqid + " " + hit.findtext("Hit_def")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
155 if re_default_subject_id.match(sseqid) \
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
156 and sseqid == hit.findtext("Hit_accession"):
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
157 #Place holder ID, take the first word of the subject definition
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
158 hit_def = hit.findtext("Hit_def")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
159 sseqid = hit_def.split(None,1)[0]
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
160 # for every <Hsp> within <Hit>
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
161 for hsp in hit.findall("Hit_hsps/Hsp"):
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
162 nident = hsp.findtext("Hsp_identity")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
163 length = hsp.findtext("Hsp_align-len")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
164 pident = "%0.2f" % (100*float(nident)/float(length))
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
165
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
166 q_seq = hsp.findtext("Hsp_qseq")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
167 h_seq = hsp.findtext("Hsp_hseq")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
168 m_seq = hsp.findtext("Hsp_midline")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
169 assert len(q_seq) == len(h_seq) == len(m_seq) == int(length)
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
170 gapopen = str(len(q_seq.replace('-', ' ').split())-1 + \
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
171 len(h_seq.replace('-', ' ').split())-1)
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
172
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
173 mismatch = m_seq.count(' ') + m_seq.count('+') \
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
174 - q_seq.count('-') - h_seq.count('-')
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
175 #TODO - Remove this alternative mismatch calculation and test
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
176 #once satisifed there are no problems
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
177 expected_mismatch = len(q_seq) \
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
178 - sum(1 for q,h in zip(q_seq, h_seq) \
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
179 if q == h or q == "-" or h == "-")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
180 xx = sum(1 for q,h in zip(q_seq, h_seq) if q=="X" and h=="X")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
181 if not (expected_mismatch - q_seq.count("X") <= int(mismatch) <= expected_mismatch + xx):
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
182 stop_err("%s vs %s mismatches, expected %i <= %i <= %i" \
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
183 % (qseqid, sseqid, expected_mismatch - q_seq.count("X"),
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
184 int(mismatch), expected_mismatch))
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
185
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
186 #TODO - Remove this alternative identity calculation and test
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
187 #once satisifed there are no problems
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
188 expected_identity = sum(1 for q,h in zip(q_seq, h_seq) if q == h)
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
189 if not (expected_identity - xx <= int(nident) <= expected_identity + q_seq.count("X")):
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
190 stop_err("%s vs %s identities, expected %i <= %i <= %i" \
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
191 % (qseqid, sseqid, expected_identity, int(nident),
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
192 expected_identity + q_seq.count("X")))
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
193
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
194
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
195 evalue = hsp.findtext("Hsp_evalue")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
196 if evalue == "0":
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
197 evalue = "0.0"
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
198 else:
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
199 evalue = "%0.0e" % float(evalue)
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
200
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
201 bitscore = float(hsp.findtext("Hsp_bit-score"))
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
202 if bitscore < 100:
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
203 #Seems to show one decimal place for lower scores
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
204 bitscore = "%0.1f" % bitscore
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
205 else:
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
206 #Note BLAST does not round to nearest int, it truncates
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
207 bitscore = "%i" % bitscore
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
208
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
209 values = [qseqid,
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
210 sseqid,
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
211 pident,
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
212 length, #hsp.findtext("Hsp_align-len")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
213 str(mismatch),
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
214 gapopen,
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
215 hsp.findtext("Hsp_query-from"), #qstart,
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
216 hsp.findtext("Hsp_query-to"), #qend,
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
217 hsp.findtext("Hsp_hit-from"), #sstart,
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
218 hsp.findtext("Hsp_hit-to"), #send,
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
219 evalue, #hsp.findtext("Hsp_evalue") in scientific notation
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
220 bitscore, #hsp.findtext("Hsp_bit-score") rounded
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
221 ]
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
222
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
223 if extended:
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
224 sallseqid = ";".join(name.split(None,1)[0] for name in hit_def.split(">"))
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
225 #print hit_def, "-->", sallseqid
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
226 positive = hsp.findtext("Hsp_positive")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
227 ppos = "%0.2f" % (100*float(positive)/float(length))
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
228 qframe = hsp.findtext("Hsp_query-frame")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
229 sframe = hsp.findtext("Hsp_hit-frame")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
230 if blast_program == "blastp":
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
231 #Probably a bug in BLASTP that they use 0 or 1 depending on format
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
232 if qframe == "0": qframe = "1"
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
233 if sframe == "0": sframe = "1"
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
234 slen = int(hit.findtext("Hit_len"))
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
235 values.extend([sallseqid,
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
236 hsp.findtext("Hsp_score"), #score,
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
237 nident,
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
238 positive,
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
239 hsp.findtext("Hsp_gaps"), #gaps,
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
240 ppos,
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
241 qframe,
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
242 sframe,
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
243 #NOTE - for blastp, XML shows original seq, tabular uses XXX masking
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
244 q_seq,
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
245 h_seq,
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
246 str(qlen),
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
247 str(slen),
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
248 ])
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
249 #print "\t".join(values)
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
250 outfile.write("\t".join(values) + "\n")
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
251 # prevents ElementTree from growing large datastructure
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
252 root.clear()
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
253 elem.clear()
643338ac83c0 Uploaded v0.0.12b, same code but moving folders around to match all my other tools and make future development simpler.
peterjc
parents:
diff changeset
254 outfile.close()