Mercurial > repos > peterjc > seq_select_by_id
annotate tools/filters/seq_select_by_id.py @ 0:838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
author | peterjc |
---|---|
date | Tue, 07 Jun 2011 17:43:38 -0400 |
parents | |
children | 50a8a6917a9c |
rev | line source |
---|---|
0
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
1 #!/usr/bin/env python |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
2 """Select FASTA, QUAL, FASTQ or SSF sequences by IDs from a tabular file. |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
3 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
4 Takes five command line options, tabular filename, ID column number (using |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
5 one based counting), input filename, input type (e.g. FASTA or SFF) and the |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
6 output filename (same format as input sequence file). |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
7 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
8 When selecting from an SFF file, any Roche XML manifest in the input file is |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
9 preserved in both output files. |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
10 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
11 This tool is a short Python script which requires Biopython 1.54 or later |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
12 for SFF file support. If you use this tool in scientific work leading to a |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
13 publication, please cite the Biopython application note: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
14 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
15 Cock et al 2009. Biopython: freely available Python tools for computational |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
16 molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3. |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
17 http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878. |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
18 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
19 This script is copyright 2011 by Peter Cock, The James Hutton Institute UK. |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
20 All rights reserved. See accompanying text file for licence details (MIT/BSD |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
21 style). |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
22 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
23 This is version 0.0.1 of the script. |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
24 """ |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
25 import sys |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
26 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
27 def stop_err(msg, err=1): |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
28 sys.stderr.write(msg.rstrip() + "\n") |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
29 sys.exit(err) |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
30 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
31 #Parse Command Line |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
32 try: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
33 tabular_file, col_arg, in_file, seq_format, out_file = sys.argv[1:] |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
34 except ValueError: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
35 stop_err("Expected five arguments, got %i:\n%s" % (len(sys.argv)-1, " ".join(sys.argv))) |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
36 try: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
37 if col_arg.startswith("c"): |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
38 column = int(col_arg[1:])-1 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
39 else: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
40 column = int(col_arg)-1 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
41 except ValueError: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
42 stop_err("Expected column number, got %s" % cols_arg) |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
43 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
44 if seq_format == "fastqcssanger": |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
45 stop_err("Colorspace FASTQ not supported.") |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
46 elif seq_format.lower() in ["sff", "fastq", "qual", "fasta"]: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
47 seq_format = seq_format.lower() |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
48 elif seq_format.lower().startswith("fastq"): |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
49 #We don't care how the qualities are encoded |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
50 seq_format = "fastq" |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
51 elif seq_format.lower().startswith("qual"): |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
52 #We don't care what the scores are |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
53 seq_format = "qual" |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
54 else: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
55 stop_err("Unrecognised file format %r" % seq_format) |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
56 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
57 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
58 try: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
59 from Bio import SeqIO |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
60 except ImportError: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
61 stop_err("Biopython 1.54 or later is required") |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
62 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
63 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
64 def parse_ids(tabular_file, col): |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
65 """Read tabular file and record all specified identifiers.""" |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
66 handle = open(tabular_file, "rU") |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
67 for line in handle: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
68 if not line.startswith("#"): |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
69 yield line.rstrip("\n").split("\t")[col].strip() |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
70 handle.close() |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
71 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
72 #Index the sequence file. |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
73 #If very big, could use SeqIO.index_db() to avoid memory bottleneck... |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
74 records = SeqIO.index(in_file, seq_format) |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
75 print "Indexed %i sequences" % len(records) |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
76 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
77 if seq_format.lower()=="sff": |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
78 #Special case to try to preserve the XML manifest |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
79 try: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
80 from Bio.SeqIO.SffIO import SffIterator, SffWriter |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
81 except ImportError: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
82 stop_err("Requires Biopython 1.54 or later") |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
83 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
84 try: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
85 from Bio.SeqIO.SffIO import ReadRocheXmlManifest |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
86 except ImportError: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
87 #Prior to Biopython 1.56 this was a private function |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
88 from Bio.SeqIO.SffIO import _sff_read_roche_index_xml as ReadRocheXmlManifest |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
89 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
90 in_handle = open(in_file, "rb") #must be binary mode! |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
91 try: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
92 manifest = ReadRocheXmlManifest(in_handle) |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
93 except ValueError: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
94 manifest = None |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
95 in_handle.close() |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
96 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
97 out_handle = open(out_file, "wb") |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
98 writer = SffWriter(out_handle, xml=manifest) |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
99 count = 0 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
100 #This does have the overhead of parsing into SeqRecord objects, |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
101 #but doing the header and index at the low level is too fidly. |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
102 iterator = (records[name] for name in parse_ids(tabular_file, column)) |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
103 try: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
104 count = writer.write_file(iterator) |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
105 except KeyError, err: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
106 out_handle.close() |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
107 if name not in records: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
108 stop_err("Identifier %s not found in sequence file" % name) |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
109 else: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
110 raise err |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
111 out_handle.close() |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
112 else: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
113 #Avoid overhead of parsing into SeqRecord objects, |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
114 #just re-use the original formatting from the input file. |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
115 out_handle = open(out_file, "w") |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
116 count = 0 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
117 for name in parse_ids(tabular_file, column): |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
118 try: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
119 out_handle.write(records.get_raw(name)) |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
120 except KeyError: |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
121 out_handle.close() |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
122 stop_err("Identifier %s not found in sequence file" % name) |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
123 count += 1 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
124 out_handle.close() |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
125 |
838b9bebfa3c
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
126 print "Selected %i sequences by ID" % count |