annotate corebio/seq_io/__init__.py @ 0:c55bdc2fb9fa

Uploaded
author davidmurphy
date Thu, 27 Oct 2011 12:09:09 -0400
parents
children
rev   line source
0
c55bdc2fb9fa Uploaded
davidmurphy
parents:
diff changeset

# Copyright (c) 2005 Gavin E. Crooks <gec@threeplusone.com>
# Copyright (c) 2006, The Regents of the University of California, through
# Lawrence Berkeley National Laboratory (subject to receipt of any required
# approvals from the U.S. Dept. of Energy). All rights reserved.

# This software is distributed under the new BSD Open Source License.
# <http://www.opensource.org/licenses/bsd-license.html>
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# (1) Redistributions of source code must retain the above copyright notice,
# this list of conditions and the following disclaimer.
#
# (2) Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and or other materials provided with the distribution.
#
# (3) Neither the name of the University of California, Lawrence Berkeley
# National Laboratory, U.S. Dept. of Energy nor the names of its contributors
# may be used to endorse or promote products derived from this software
# without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.


""" Sequence file reading and writing.

Biological sequence data is stored and transmitted using a wide variety of
different file formats. This package provides convenient methods to read and
write several of these file formats.

CoreBio is often capable of guessing the correct file type, either from the
file extension or the structure of the file:
>>> import corebio.seq_io
>>> afile = open("test_corebio/data/cap.fa")
>>> seqs = corebio.seq_io.read(afile)

Alternatively, each sequence file type has a separate module named FILETYPE_io
(e.g. fasta_io, clustal_io).
>>> import corebio.seq_io.fasta_io
>>> afile = open("test_corebio/data/cap.fa")
>>> seqs = corebio.seq_io.fasta_io.read( afile )

Sequence data can also be written back to files:
>>> fout = open("out.fa", "w")
>>> corebio.seq_io.fasta_io.write( fout, seqs )


Supported File Formats
----------------------

Module              Name             Extension  read  write  features
---------------------------------------------------------------------------
array_io            array, flatfile             yes   yes    none
clustal_io          clustalw         aln        yes   yes
fasta_io            fasta, Pearson   fa         yes   yes    none
genbank_io          genbank          gb         yes
intelligenetics_io  intelligenetics  ig         yes   yes
msf_io              msf              msf        yes
nbrf_io             nbrf, pir        pir        yes
nexus_io            nexus            nexus      yes
phylip_io           phylip           phy        yes
plain_io            plain, raw       txt        yes   yes    none
table_io            table            tbl        yes   yes    none

Each IO module defines one or more of the following functions and variables:

read(afile, alphabet=None)
    Read a file of sequence data and return a SeqList, a collection
    of Seq's (Alphabetic strings) and features.

read_seq(afile, alphabet=None)
    Read a single sequence from a file.

iter_seq(afile, alphabet=None)
    Iterate over the sequences in a file.

index(afile, alphabet=None)
    Instead of loading all of the sequences into memory, scan the file and
    return an index map that will load sequences on demand. Typically not
    implemented for formats with interleaved sequences.

write(afile, seqlist)
    Write a collection of sequences to the specified file.

write_seq(afile, seq)
    Write one sequence to the file. Only implemented for non-interleaved,
    headerless formats, such as fasta and plain.

example
    A string containing a short example of the file format.

names
    A list of synonyms for the file format, e.g. for fasta_io, ('fasta',
    'pearson', 'fa'). The first entry is the preferred format name.

extensions
    A list of file name extensions used for this file format, e.g.
    fasta_io.extensions is ('fa', 'fasta', 'fast', 'seq', 'fsa', 'fst', 'nt',
    'aa', 'fna', 'mpfa'). The preferred or standard extension is first in the
    list.


Attributes :
    - formats -- Available seq_io format parsers.
    - format_names() -- Return a map between format names and format parsers.
    - format_extensions() -- Return a map between filename extensions and
      parsers.

"""

# Dev. References :
#
# - http://iubio.bio.indiana.edu/soft/molbio/readseq/java/Readseq2-help.html
# - http://www.ebi.ac.uk/help/formats_frame.html
# - http://www.cmbi.kun.nl/bioinf/tools/crab_pir.html
# - http://bioperl.org/HOWTOs/html/SeqIO.html
# - http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
# - http://www.cse.ucsc.edu/research/compbio/a2m-desc.html (a2m)
# - http://www.genomatix.de/online_help/help/sequence_formats.html

from corebio.seq import *

import clustal_io
import fasta_io
import msf_io
import nbrf_io
import nexus_io
import plain_io
import phylip_io
#import null_io
import stockholm_io
import intelligenetics_io
import table_io
import array_io
import genbank_io

__all__ = [
    'clustal_io',
    'fasta_io',
    'msf_io',
    'nbrf_io',
    'nexus_io',
    'plain_io',
    'phylip_io',
    # 'null_io',  # Not exported: the null_io import above is commented out.
    'stockholm_io',
    'intelligenetics_io',
    'table_io',
    'array_io',
    'genbank_io',
    'read',
    'formats',
    'format_names',
    'format_extensions',
    ]

formats = (clustal_io, fasta_io, plain_io, msf_io, genbank_io, nbrf_io,
           nexus_io, phylip_io, stockholm_io, intelligenetics_io, table_io,
           array_io)
"""Available seq_io formats"""


def format_names():
    """Return a map between format names and format modules"""
    fnames = {}
    for f in formats:
        for name in f.names:
            assert name not in fnames  # Insanity check
            fnames[name] = f
    return fnames

def format_extensions():
    """Return a map between filename extensions and sequence file types"""
    fext = {}
    for f in formats:
        for ext in f.extensions:
            assert ext not in fext  # Insanity check
            fext[ext] = f
    return fext


# seq_io._parsers is an ordered list of sequence parsers that are tried, in
# turn, on files of unknown format. Each parser must raise an exception when
# fed a format further down the list.
#
# The general trend is most common to least common file format. However,
# 'nbrf_io' is before 'fasta_io' because nbrf looks like fasta with extras, and
# 'array_io' is last, since it is very general.
_parsers = (nbrf_io, fasta_io, clustal_io, phylip_io, genbank_io,
            stockholm_io, msf_io, nexus_io, table_io, array_io)


def _get_parsers(fin):
    fnames = format_names()
    fext = format_extensions()
    parsers = list(_parsers)
    best_guess = parsers[0]

    # If a filename is supplied use the extension to guess the format.
    if hasattr(fin, "name") and '.' in fin.name:
        extension = fin.name.split('.')[-1]
        if extension in fnames:
            best_guess = fnames[extension]
        elif extension in fext:
            best_guess = fext[extension]

    # Move the best guess to the front of the list.
    if best_guess in parsers:
        parsers.remove(best_guess)
    parsers.insert(0, best_guess)

    return parsers



def read(fin, alphabet=None):
    """ Read a sequence file and attempt to guess its format.

    First the filename extension (if available) is used to infer the format.
    If that fails, then we attempt to parse the file using several common
    formats.

    returns :
        SeqList
    raises :
        ValueError - If the file cannot be parsed.
        ValueError - If the sequences do not conform to the alphabet.
    """

    alphabet = Alphabet(alphabet)
    parsers = _get_parsers(fin)

    for p in parsers:
        try:
            return p.read(fin, alphabet)
        except ValueError:
            pass
        fin.seek(0)  # FIXME: fails on non-seekable streams such as stdin.

    names = ", ".join([p.names[0] for p in parsers])
    raise ValueError("Cannot parse sequence file: Tried %s " % names)
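
The format-guessing strategy used by read() and _get_parsers() can be tried in isolation with the rough, self-contained sketch below. It is not CoreBio code: ToyFormat, NamedStringIO, build_map, order_parsers and read_any are hypothetical names made up for illustration, and the two toy formats only check the first characters of the text. The sketch omits the alphabet argument and the lookup by format name. It assumes Python 3, whereas the module above is written for Python 2.

```python
import io


class ToyFormat:
    """Hypothetical stand-in for a format module such as fasta_io."""

    def __init__(self, names, extensions, marker):
        self.names = names            # synonyms, preferred name first
        self.extensions = extensions  # filename extensions, preferred first
        self.marker = marker          # leading text this toy format requires

    def read(self, fin):
        text = fin.read()
        if not text.startswith(self.marker):
            raise ValueError("not a %s file" % self.names[0])
        return text.splitlines()[1:]  # drop the header line


class NamedStringIO(io.StringIO):
    """In-memory text stream that carries a filename, as an open file does."""

    def __init__(self, text, name):
        super().__init__(text)
        self.name = name


def build_map(formats, attribute):
    """Map each name (or extension) to its format, refusing duplicates."""
    table = {}
    for fmt in formats:
        for key in getattr(fmt, attribute):
            if key in table:
                raise ValueError("duplicate key: %r" % key)
            table[key] = fmt
    return table


def order_parsers(fin, parsers, by_extension):
    """Return the parsers with the extension-based guess moved to the front."""
    ordered = list(parsers)
    name = getattr(fin, "name", "")
    if isinstance(name, str) and "." in name:
        guess = by_extension.get(name.rsplit(".", 1)[-1])
        if guess in ordered:
            ordered.remove(guess)
            ordered.insert(0, guess)
    return ordered


def read_any(fin, parsers, by_extension):
    """Try each parser in turn, rewinding the stream after every failure."""
    ordered = order_parsers(fin, parsers, by_extension)
    for parser in ordered:
        try:
            return parser.read(fin)
        except ValueError:
            fin.seek(0)
    tried = ", ".join(p.names[0] for p in ordered)
    raise ValueError("Cannot parse sequence file: Tried %s" % tried)


toy_fasta = ToyFormat(("fasta", "pearson"), ("fa", "fasta"), ">")
toy_nbrf = ToyFormat(("nbrf", "pir"), ("pir",), ">P1;")
toy_parsers = (toy_nbrf, toy_fasta)  # the more specific format goes first
by_ext = build_map(toy_parsers, "extensions")

stream = NamedStringIO(">seq1\nACGT\n", "example.fa")
print(read_any(stream, toy_parsers, by_ext))  # prints ['ACGT']
```

Listing the stricter toy format first mirrors the ordering comment above ('nbrf_io' before 'fasta_io'): a permissive parser placed early would accept input that a later, more specific parser should have handled.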