annotate tools/filters/seq_select_by_id.py @ 3:19e26966ed3e draft

Uploaded v0.0.6, handles Biopython dependency via the ToolShed, adopted MIT license, using reStructuredTest for the README file. No functional changes.
author peterjc
date Mon, 29 Jul 2013 09:13:13 -0400
parents 28d52478ace9
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
1 #!/usr/bin/env python
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
2 """Select FASTA, QUAL, FASTQ or SSF sequences by IDs from a tabular file.
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
3
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
4 Takes five command line options, tabular filename, ID column number (using
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
5 one based counting), input filename, input type (e.g. FASTA or SFF) and the
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
6 output filename (same format as input sequence file).
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
7
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
8 When selecting from an SFF file, any Roche XML manifest in the input file is
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
9 preserved in both output files.
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
10
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
11 This tool is a short Python script which requires Biopython 1.54 or later
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
12 for SFF file support. If you use this tool in scientific work leading to a
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
13 publication, please cite the Biopython application note:
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
14
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
15 Cock et al 2009. Biopython: freely available Python tools for computational
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
16 molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
17 http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878.
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
18
2
28d52478ace9 Uploaded v0.0.4 which adds a unit test.
peterjc
parents: 1
diff changeset
19 This script is copyright 2011-2013 by Peter Cock, The James Hutton Institute UK.
3
19e26966ed3e Uploaded v0.0.6, handles Biopython dependency via the ToolShed, adopted MIT license, using reStructuredTest for the README file.
peterjc
parents: 2
diff changeset
20 All rights reserved. See accompanying text file for licence details (MIT
19e26966ed3e Uploaded v0.0.6, handles Biopython dependency via the ToolShed, adopted MIT license, using reStructuredTest for the README file.
peterjc
parents: 2
diff changeset
21 license).
0
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
22
3
19e26966ed3e Uploaded v0.0.6, handles Biopython dependency via the ToolShed, adopted MIT license, using reStructuredTest for the README file.
peterjc
parents: 2
diff changeset
23 This is version 0.0.6 of the script.
0
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
24 """
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
25 import sys
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
26
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
27 def stop_err(msg, err=1):
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
28 sys.stderr.write(msg.rstrip() + "\n")
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
29 sys.exit(err)
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
30
2
28d52478ace9 Uploaded v0.0.4 which adds a unit test.
peterjc
parents: 1
diff changeset
31 if "-v" in sys.argv or "--version" in sys.argv:
3
19e26966ed3e Uploaded v0.0.6, handles Biopython dependency via the ToolShed, adopted MIT license, using reStructuredTest for the README file.
peterjc
parents: 2
diff changeset
32 print "v0.0.6"
2
28d52478ace9 Uploaded v0.0.4 which adds a unit test.
peterjc
parents: 1
diff changeset
33 sys.exit(0)
28d52478ace9 Uploaded v0.0.4 which adds a unit test.
peterjc
parents: 1
diff changeset
34
0
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
35 #Parse Command Line
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
36 try:
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
37 tabular_file, col_arg, in_file, seq_format, out_file = sys.argv[1:]
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
38 except ValueError:
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
39 stop_err("Expected five arguments, got %i:\n%s" % (len(sys.argv)-1, " ".join(sys.argv)))
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
40 try:
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
41 if col_arg.startswith("c"):
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
42 column = int(col_arg[1:])-1
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
43 else:
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
44 column = int(col_arg)-1
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
45 except ValueError:
1
50a8a6917a9c Uploaded update (v0.0.3) to ignore blank lines in the ID file
peterjc
parents: 0
diff changeset
46 stop_err("Expected column number, got %s" % col_arg)
0
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
47
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
48 if seq_format == "fastqcssanger":
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
49 stop_err("Colorspace FASTQ not supported.")
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
50 elif seq_format.lower() in ["sff", "fastq", "qual", "fasta"]:
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
51 seq_format = seq_format.lower()
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
52 elif seq_format.lower().startswith("fastq"):
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
53 #We don't care how the qualities are encoded
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
54 seq_format = "fastq"
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
55 elif seq_format.lower().startswith("qual"):
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
56 #We don't care what the scores are
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
57 seq_format = "qual"
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
58 else:
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
59 stop_err("Unrecognised file format %r" % seq_format)
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
60
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
61
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
62 try:
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
63 from Bio import SeqIO
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
64 except ImportError:
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
65 stop_err("Biopython 1.54 or later is required")
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
66
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
67
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
68 def parse_ids(tabular_file, col):
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
69 """Read tabular file and record all specified identifiers."""
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
70 handle = open(tabular_file, "rU")
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
71 for line in handle:
1
50a8a6917a9c Uploaded update (v0.0.3) to ignore blank lines in the ID file
peterjc
parents: 0
diff changeset
72 if line.strip() and not line.startswith("#"):
0
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
73 yield line.rstrip("\n").split("\t")[col].strip()
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
74 handle.close()
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
75
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
76 #Index the sequence file.
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
77 #If very big, could use SeqIO.index_db() to avoid memory bottleneck...
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
78 records = SeqIO.index(in_file, seq_format)
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
79 print "Indexed %i sequences" % len(records)
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
80
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
81 if seq_format.lower()=="sff":
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
82 #Special case to try to preserve the XML manifest
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
83 try:
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
84 from Bio.SeqIO.SffIO import SffIterator, SffWriter
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
85 except ImportError:
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
86 stop_err("Requires Biopython 1.54 or later")
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
87
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
88 try:
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
89 from Bio.SeqIO.SffIO import ReadRocheXmlManifest
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
90 except ImportError:
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
91 #Prior to Biopython 1.56 this was a private function
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
92 from Bio.SeqIO.SffIO import _sff_read_roche_index_xml as ReadRocheXmlManifest
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
93
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
94 in_handle = open(in_file, "rb") #must be binary mode!
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
95 try:
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
96 manifest = ReadRocheXmlManifest(in_handle)
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
97 except ValueError:
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
98 manifest = None
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
99 in_handle.close()
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
100
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
101 out_handle = open(out_file, "wb")
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
102 writer = SffWriter(out_handle, xml=manifest)
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
103 count = 0
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
104 #This does have the overhead of parsing into SeqRecord objects,
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
105 #but doing the header and index at the low level is too fidly.
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
106 iterator = (records[name] for name in parse_ids(tabular_file, column))
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
107 try:
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
108 count = writer.write_file(iterator)
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
109 except KeyError, err:
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
110 out_handle.close()
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
111 if name not in records:
1
50a8a6917a9c Uploaded update (v0.0.3) to ignore blank lines in the ID file
peterjc
parents: 0
diff changeset
112 stop_err("Identifier %r not found in sequence file" % name)
0
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
113 else:
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
114 raise err
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
115 out_handle.close()
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
116 else:
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
117 #Avoid overhead of parsing into SeqRecord objects,
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
118 #just re-use the original formatting from the input file.
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
119 out_handle = open(out_file, "w")
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
120 count = 0
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
121 for name in parse_ids(tabular_file, column):
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
122 try:
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
123 out_handle.write(records.get_raw(name))
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
124 except KeyError:
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
125 out_handle.close()
1
50a8a6917a9c Uploaded update (v0.0.3) to ignore blank lines in the ID file
peterjc
parents: 0
diff changeset
126 stop_err("Identifier %r not found in sequence file" % name)
0
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
127 count += 1
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
128 out_handle.close()
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
129
838b9bebfa3c Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
130 print "Selected %i sequences by ID" % count