annotate corebio/seq_io/nbrf_io.py @ 15:981eb8c3a756 default tip

Uploaded
author davidmurphy
date Sat, 31 Mar 2012 16:07:07 -0400
parents c55bdc2fb9fa
children
# Copyright (c) 2006, The Regents of the University of California, through
# Lawrence Berkeley National Laboratory (subject to receipt of any required
# approvals from the U.S. Dept. of Energy). All rights reserved.

# This software is distributed under the new BSD Open Source License.
# <http://www.opensource.org/licenses/bsd-license.html>
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# (1) Redistributions of source code must retain the above copyright notice,
# this list of conditions and the following disclaimer.
#
# (2) Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and or other materials provided with the distribution.
#
# (3) Neither the name of the University of California, Lawrence Berkeley
# National Laboratory, U.S. Dept. of Energy nor the names of its contributors
# may be used to endorse or promote products derived from this software
# without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.

"""Sequence IO for NBRF/PIR format.

The format is similar to fasta. The header line consists of '>', a two-
letter sequence type (P1, F1, DL, DC, RL, RC, N3, N1, or XX), a semicolon,
and a sequence ID. The next line is a textual description of the sequence,
followed by one or more lines containing the sequence data. The end of
the sequence is marked by a "*" (asterisk) character.

type_code -- A map between NBRF two letter type codes and Alphabets.


see: http://www.cmbi.kun.nl/bioinf/tools/crab_pir.html

--- Example NBRF File ---

>P1;CRAB_ANAPL
ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).
  MDITIHNPLI RRPLFSWLAP SRIFDQIFGE HLQESELLPA SPSLSPFLMR
  SPIFRMPSWL ETGLSEMRLE KDKFSVNLDV KHFSPEELKV KVLGDMVEIH
  GKHEERQDEH GFIAREFNRK YRIPADVDPL TITSSLSLDG VLTVSAPRKQ
  SDVPERSIPI TREEKPAIAG AQRK*

>P1;CRAB_BOVIN
ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).
  MDIAIHHPWI RRPFFPFHSP SRLFDQFFGE HLLESDLFPA STSLSPFYLR
  PPSFLRAPSW IDTGLSEMRL EKDRFSVNLD VKHFSPEELK VKVLGDVIEV
  HGKHEERQDE HGFISREFHR KYRIPADVDP LAITSSLSSD GVLTVNGPRK
  QASGPERTIP ITREEKPAVT AAPKK*

"""

from corebio.utils import *
from corebio.seq import *
from corebio.seq_io import *

names = ("nbrf", "pir",)
extensions = ('nbrf', 'pir', 'ali')

# Map from NBRF two-letter type codes to sequence alphabets.
type_code = {
    'P1' : protein_alphabet,    # Protein (complete)
    'F1' : protein_alphabet,    # Protein (fragment)
    'DL' : dna_alphabet,        # DNA (linear)
    'DC' : dna_alphabet,        # DNA (circular)
    'RC' : rna_alphabet,        # RNA (circular)
    'RL' : rna_alphabet,        # RNA (linear)
    'N3' : rna_alphabet,        # tRNA
    'N1' : rna_alphabet,        # other functional RNA
    'XX' : generic_alphabet
    }

def read(fin, alphabet=None):
    """Read and parse an NBRF sequence file.

    Args:
        fin -- A stream or file to read
        alphabet -- The expected alphabet of the data. If not supplied, then
            an appropriate alphabet will be inferred from each record's
            two-letter type code.
    Returns:
        SeqList -- A list of sequences
    Raises:
        ValueError -- If the file is unparsable
    """
    seqs = list(iterseq(fin, alphabet))
    return SeqList(seqs)


def iterseq(fin, alphabet=None):
    """Generate sequences from an NBRF file.

    arguments:
        fin -- A stream or file to read
        alphabet -- The expected alphabet of the data. If not supplied, the
            alphabet is looked up in type_code for each record.
    yields :
        Seq
    raises :
        ValueError -- On a parse error.
    """

    body, header, sequence = range(3)  # Internal states

    state = body
    seq_id = None
    seq_desc = None
    seq_alpha = None
    seqs = []

    for lineno, line in enumerate(fin):
        if state == body:
            # Between records: skip blank lines and wait for a '>' header.
            if line == "" or line.isspace():
                continue
            if line[0] == '>':
                seq_type, seq_id = line[1:].split(';')
                if alphabet:
                    seq_alpha = alphabet
                else:
                    seq_alpha = type_code[seq_type]
                state = header
                continue
            raise ValueError("Parse error on line: %d" % lineno)

        elif state == header:
            # The line after the '>' header is the free-text description.
            seq_desc = line.strip()
            state = sequence
            continue

        elif state == sequence:
            data = "".join(line.split())  # Strip out white space
            # endswith() is safe on an empty string, so a blank line inside
            # the sequence data no longer raises IndexError.
            if data.endswith('*'):
                # End of sequence data
                seqs.append(data[:-1])

                seq = Seq("".join(seqs), name=seq_id.strip(),
                          description=seq_desc, alphabet=seq_alpha)

                yield seq
                state = body
                seq_id = None
                seq_desc = None
                seqs = []
                continue
            else:
                seqs.append(data)
                continue
        else:
            # If we ever get here something has gone terribly wrong
            assert False

    # end for
168
c55bdc2fb9fa Uploaded
davidmurphy
parents:
diff changeset
169