annotate tools/seq_filter_by_id/seq_filter_by_id.py @ 12:85ef5f5a0562 draft default tip
v0.2.9 - Fixed file open mode for Python 3.11 onwards.
author | peterjc |
---|---|
date | Thu, 21 Dec 2023 10:47:58 +0000 |
parents | 4a7d8ad2a983 |
children |
#!/usr/bin/env python
"""Filter a FASTA, FASTQ or SFF file with IDs from a tabular file.

Takes six command line options, tabular filename, ID column numbers (comma
separated list using one based counting), input filename, input type (e.g.
FASTA or SFF) and up to two output filenames (for records with and without
the given IDs, same format as input sequence file).

When filtering an SFF file, any Roche XML manifest in the input file is
preserved in both output files.

Note in the default NCBI BLAST+ tabular output, the query sequence ID is
in column one, and the ID of the match from the database is in column two.
Here sensible values for the column numbers would therefore be "1" or "2".

This tool is a short Python script which requires Biopython 1.54 or later.
If you use this tool in scientific work leading to a publication, please
cite the Biopython application note:

Cock et al 2009. Biopython: freely available Python tools for computational
molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
https://doi.org/10.1093/bioinformatics/btp163 pmid:19304878.

This script is copyright 2010-2023 by Peter Cock, The James Hutton Institute
(formerly the Scottish Crop Research Institute, SCRI), UK. All rights reserved.
See accompanying text file for licence details (MIT license).

Use -v or --version to get the version, -h or --help for help.
"""

from __future__ import print_function

import os
import re
import sys

from optparse import OptionParser

# Parse Command Line
usage = """Use as follows:

$ python seq_filter_by_id.py [options] tab1 cols1 [, tab2 cols2, ...]

e.g. Positive matches using column one from tabular file:

$ seq_filter_by_id.py -i my_seqs.fastq -f fastq -p matches.fastq ids.tabular 1

Multiple tabular files and column numbers may be given, or replaced with
the -t or --text option.
"""
parser = OptionParser(usage=usage)
parser.add_option(
    "-i",
    "--input",
    dest="input",
    default=None,
    help="Input sequences filename",
    metavar="FILE",
)
parser.add_option(
    "-f",
    "--format",
    dest="format",
    default=None,
    help="Input sequence format (e.g. fasta, fastq, sff)",
)
parser.add_option(
    "-t",
    "--text",
    dest="id_list",
    default=None,
    help="Lists of white space separated IDs (instead of a tabular file)",
)
parser.add_option(
    "-p",
    "--positive",
    dest="output_positive",
    default=None,
    help="Output filename for matches",
    metavar="FILE",
)
parser.add_option(
    "-n",
    "--negative",
    dest="output_negative",
    default=None,
    help="Output filename for non-matches",
    metavar="FILE",
)
parser.add_option(
    "-l",
    "--logic",
    dest="logic",
    default="UNION",
    help="How to combine multiple ID columns (UNION or INTERSECTION)",
)
parser.add_option(
    "-s",
    "--suffix",
    dest="suffix",
    action="store_true",
    help="Ignore pair-read suffixes for matching names",
)
parser.add_option(
    "-v",
    "--version",
    dest="version",
    default=False,
    action="store_true",
    help="Show version and quit",
)

options, args = parser.parse_args()

if options.version:
    print("v0.2.9")
    sys.exit(0)

in_file = options.input
seq_format = options.format
out_positive_file = options.output_positive
out_negative_file = options.output_negative
logic = options.logic
drop_suffixes = bool(options.suffix)

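As an illustration of how the option handling above separates named options from the positional tabular file and column arguments, here is a minimal sketch using only a subset of the options. The argument list mirrors the example in the usage text; it is passed to parse_args explicitly instead of being read from the real command line, and nothing here touches the file system.

```python
from optparse import OptionParser

# Reduced parser with a subset of the script's options, for illustration only.
parser = OptionParser()
parser.add_option("-i", "--input", dest="input", default=None, metavar="FILE")
parser.add_option("-f", "--format", dest="format", default=None)
parser.add_option("-p", "--positive", dest="output_positive", default=None, metavar="FILE")
parser.add_option("-s", "--suffix", dest="suffix", action="store_true")

# Same arguments as the usage example, given as a list rather than via sys.argv.
argv = ["-i", "my_seqs.fastq", "-f", "fastq", "-p", "matches.fastq", "ids.tabular", "1"]
options, args = parser.parse_args(argv)

# Named options end up as attributes; everything else is left in args.
input_name = options.input
leftover = args
# A store_true option with no default is None when absent, hence the bool() call.
drop_suffixes = bool(options.suffix)
```

The leftover positional arguments are what the script later interprets as pairs of tabular filename and column list.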
if in_file is None or not os.path.isfile(in_file):
    sys.exit("Missing input file: %r" % in_file)
if out_positive_file is None and out_negative_file is None:
    sys.exit("Neither output file requested")
if seq_format is None:
    sys.exit("Missing sequence format")
if logic not in ["UNION", "INTERSECTION"]:
    sys.exit("Logic argument should be 'UNION' or 'INTERSECTION', not %r" % logic)
if options.id_list and args:
    sys.exit("Cannot accept IDs via both -t in the command line, and as tabular files")
elif not options.id_list and not args:
    sys.exit("Expected matched pairs of tabular files and columns (or -t given)")
if len(args) % 2:
    sys.exit("Expected matched pairs of tabular files and columns, not: %r" % args)


# Cope with three widely used suffix naming conventions,
# Illumina: /1 or /2
# Forward/reverse: .f or .r
# Sanger, e.g. .p1k and .q1k
# See http://staden.sourceforge.net/manual/pregap4_unix_50.html
# re_f = re.compile(r"(/1|\.f|\.[sfp]\d\w*)$")
# re_r = re.compile(r"(/2|\.r|\.[rq]\d\w*)$")
re_suffix = re.compile(r"(/1|\.f|\.[sfp]\d\w*|/2|\.r|\.[rq]\d\w*)$")
assert re_suffix.search("demo.f")
assert re_suffix.search("demo.s1")
assert re_suffix.search("demo.f1k")
assert re_suffix.search("demo.p1")
assert re_suffix.search("demo.p1k")
assert re_suffix.search("demo.p1lk")
assert re_suffix.search("demo/2")
assert re_suffix.search("demo.r")
assert re_suffix.search("demo.q1")
assert re_suffix.search("demo.q1lk")

identifiers = []
for i in range(len(args) // 2):
    tabular_file = args[2 * i]
    cols_arg = args[2 * i + 1]
    if not os.path.isfile(tabular_file):
        sys.exit("Missing tabular identifier file %r" % tabular_file)
    try:
        columns = [int(arg) - 1 for arg in cols_arg.split(",")]
    except ValueError:
        sys.exit(
            "Expected list of columns (comma separated integers), got %r" % cols_arg
        )
    if min(columns) < 0:
        sys.exit(
            "Expect one-based column numbers (not zero-based counting), got %r"
            % cols_arg
        )
    identifiers.append((tabular_file, columns))

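The pairing of positional arguments and the conversion from one-based to zero-based column numbers can be tried in isolation. The following is a minimal sketch, not the script's own code: the helper name parse_id_sources and the filenames are hypothetical, it raises ValueError where the script calls sys.exit, and it skips the check that each file exists.

```python
def parse_id_sources(args):
    """Turn [file1, cols1, file2, cols2, ...] into [(file, zero_based_cols), ...]."""
    if len(args) % 2:
        raise ValueError("Expected matched pairs of files and columns, not: %r" % (args,))
    sources = []
    for i in range(0, len(args), 2):
        filename, cols_arg = args[i], args[i + 1]
        # "1,2" on the command line means the first and second columns,
        # which are indices 0 and 1 after splitting a line on tabs.
        columns = [int(col) - 1 for col in cols_arg.split(",")]
        if min(columns) < 0:
            raise ValueError("Expected one-based column numbers, got %r" % cols_arg)
        sources.append((filename, columns))
    return sources


# Hypothetical filenames, for illustration only.
sources = parse_id_sources(["blast_hits.tabular", "1,2", "extra_ids.tabular", "3"])
```

For a default NCBI BLAST+ tabular file, "1,2" would therefore collect both the query IDs and the database match IDs.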
name_warn = False


def check_white_space(name):
    """Check identifier for white space, take first word only."""
    parts = name.split(None, 1)
    global name_warn
    if not name_warn and len(parts) > 1:
        name_warn = (
            "WARNING: Some of your identifiers had white space in them, "
            + "using first word only. e.g.:\n%s\n" % name
        )
    return parts[0]


if drop_suffixes:

    def clean_name(name):
        """Remove suffix."""
        name = check_white_space(name)
        match = re_suffix.search(name)
        if match:
            # Use the fact this is a suffix, and regular expression will be
            # anchored to the end of the name:
            return name[: match.start()]
        else:
            # Nothing to do
            return name

    assert clean_name("foo/1") == "foo"
    assert clean_name("foo/2") == "foo"
    assert clean_name("bar.f") == "bar"
    assert clean_name("bar.r") == "bar"
    assert clean_name("baz.p1") == "baz"
    assert clean_name("baz.q2") == "baz"
else:
    # Just check the white space
    clean_name = check_white_space


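To show why ignoring the pair suffix is useful, the sketch below groups read names by their suffix-free base name, so that both members of a pair share one key. It reuses the regular expression from the script; the helper strip_pair_suffix and the read names are made up for the example, and the white space handling of check_white_space is left out.

```python
import re
from collections import defaultdict

# Same pattern as re_suffix in the script.
re_suffix = re.compile(r"(/1|\.f|\.[sfp]\d\w*|/2|\.r|\.[rq]\d\w*)$")


def strip_pair_suffix(name):
    """Return name without a trailing paired-read suffix, if it has one."""
    return re_suffix.sub("", name)


# Hypothetical read names, for illustration only.
pairs = defaultdict(list)
for read_name in ["frag7/1", "frag7/2", "clone3.f", "clone3.r", "solo"]:
    pairs[strip_pair_suffix(read_name)].append(read_name)
```

With the suffix removed, an ID list containing only "frag7" would match both "frag7/1" and "frag7/2".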
mapped_chars = {
    ">": "__gt__",
    "<": "__lt__",
    "'": "__sq__",
    '"': "__dq__",
    "[": "__ob__",
    "]": "__cb__",
    "{": "__oc__",
    "}": "__cc__",
    "@": "__at__",
    "\n": "__cn__",
    "\r": "__cr__",
    "\t": "__tc__",
    "#": "__pd__",
}

# Read tabular file(s) and record all specified identifiers
ids = None  # Will be a set
if options.id_list:
    assert not identifiers
    ids = set()
    id_list = options.id_list
    # Galaxy turns \r into __cr__ (CR) etc
    for k in mapped_chars:
        id_list = id_list.replace(mapped_chars[k], k)
    # Split the decoded text (not the raw option value) on white space
    for x in id_list.split():
        ids.add(clean_name(x.strip()))
    print("Have %i unique identifiers from list" % len(ids))
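The decoding of a -t/--text value can be sketched on its own. This assumes, as the comment in the script says, that Galaxy has replaced special characters with tokens such as __cn__ before the text reaches the script. Only the white space tokens from mapped_chars are included, and the function name decode_galaxy_text and the sample text are hypothetical.

```python
# Subset of the script's mapped_chars table (character -> Galaxy token).
white_space_tokens = {
    "\n": "__cn__",
    "\r": "__cr__",
    "\t": "__tc__",
}


def decode_galaxy_text(text):
    """Replace each Galaxy token in text with the character it stands for."""
    for char, token in white_space_tokens.items():
        text = text.replace(token, char)
    return text


# Hypothetical parameter value: three IDs, one of them repeated.
raw = "id1__cn__id2__tc__id3 id1"
decoded_ids = set(decode_galaxy_text(raw).split())
```

Splitting has to happen on the decoded text; splitting the raw value would leave "id1__cn__id2__tc__id3" as a single bogus identifier.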
for tabular_file, columns in identifiers:
    file_ids = set()
    handle = open(tabular_file)
    if len(columns) > 1:
        # General case of many columns
        for line in handle:
            if line.startswith("#"):
                # Ignore comments
                continue
            parts = line.rstrip("\n").split("\t")
            for col in columns:
                name = clean_name(parts[col])
                if name:
                    file_ids.add(name)
    else:
        # Single column, special case speed up
        col = columns[0]
        for line in handle:
            if not line.strip():  # skip empty lines
                continue
            if not line.startswith("#"):
                name = clean_name(line.rstrip("\n").split("\t")[col])
|
270 if name: |
fb1313d79396
Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents:
6
diff
changeset
|
271 file_ids.add(name) |
10 | 272 print( |
273 "Using %i IDs from column %s in tabular file" | |
274 % (len(file_ids), ", ".join(str(col + 1) for col in columns)) | |
275 ) | |
3
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
276 if ids is None: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
277 ids = file_ids |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
278 if logic == "UNION": |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
279 ids.update(file_ids) |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
280 else: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
281 ids.intersection_update(file_ids) |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
282 handle.close() |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
283 if len(identifiers) > 1: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
284 if logic == "UNION": |
10 | 285 print( |
286 "Have %i IDs combined from %i tabular files" % (len(ids), len(identifiers)) | |
287 ) | |
3
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
288 else: |
10 | 289 print( |
290 "Have %i IDs in common from %i tabular files" % (len(ids), len(identifiers)) | |
291 ) | |
5
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
292 if name_warn: |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
293 sys.stderr.write(name_warn) |
3
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
294 |
6
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
295 |
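The loop above folds the IDs from each tabular file into a single set, using either a union or an intersection depending on `logic`. A minimal sketch of that combining step in isolation follows; `combine_id_sets` is a hypothetical helper written for illustration and is not part of the script.

```python
def combine_id_sets(id_sets, logic="UNION"):
    """Combine per-file ID sets (hypothetical helper, for illustration only)."""
    ids = None
    for file_ids in id_sets:
        if ids is None:
            # First file: start from a copy so the caller's set is not mutated
            ids = set(file_ids)
        elif logic == "UNION":
            ids.update(file_ids)
        else:
            ids.intersection_update(file_ids)
    return ids if ids is not None else set()


first = {"seq1", "seq2", "seq3"}
second = {"seq2", "seq3", "seq4"}
union_ids = combine_id_sets([first, second], "UNION")
common_ids = combine_id_sets([first, second], "INTERSECTION")
print(sorted(union_ids))
print(sorted(common_ids))
```

With the union every ID seen in any file is kept, while the intersection keeps only IDs present in all of the files.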
def crude_fasta_iterator(handle):
    """Parse FASTA file yielding tuples of (identifier, raw record text)."""
    while True:
        line = handle.readline()
        if line == "":
            return  # Premature end of file, or just empty?
        if line[0] == ">":
            break

    no_id_warned = False
    while True:
        if line[0] != ">":
            raise ValueError("Records in Fasta files should start with '>' character")
        try:
            id = line[1:].split(None, 1)[0]
        except IndexError:
            if not no_id_warned:
                sys.stderr.write("WARNING - Malformed FASTA entry with no identifier\n")
            no_id_warned = True
            id = None
        lines = [line]
        line = handle.readline()
        while True:
            if not line:
                break
            if line[0] == ">":
                break
            lines.append(line)
            line = handle.readline()
        yield id, "".join(lines)
        if not line:
            return  # StopIteration

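To illustrate what the iterator yields, here is a simplified stand-in run on an in-memory handle. `iter_fasta_records` is a hypothetical name and a shorter rewrite for illustration, not the script's own function; like the original it takes the first word after `>` as the identifier and returns the whole record, header line included.

```python
import io


def iter_fasta_records(handle):
    """Yield (identifier, record_text) pairs; simplified sketch of the parser."""
    identifier, lines = None, []
    for line in handle:
        if line.startswith(">"):
            if lines:
                yield identifier, "".join(lines)
            words = line[1:].split(None, 1)
            identifier = words[0] if words else None  # None for a bare ">" header
            lines = [line]
        elif lines:
            # Sequence line of the current record; any text before the
            # first ">" header is skipped.
            lines.append(line)
    if lines:
        yield identifier, "".join(lines)


example = io.StringIO(">alpha first record\nACGT\nACGT\n>beta\nTTTT\n")
records = list(iter_fasta_records(example))
for identifier, text in records:
    print(identifier, len(text))
```

Because the record text is kept verbatim, a filter can write it straight to an output file without re-wrapping the sequence.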
def fasta_filter(in_file, pos_file, neg_file, wanted):
    """FASTA filter; records are written unchanged, keeping the input's line wrapping."""
    pos_count = neg_count = 0
    # Galaxy now requires Python 2.5+, so we can use with statements.
    with open(in_file) as in_handle:
        # Doing the if statement outside the loop for speed
        # (with the downside of three very similar loops).
        if pos_file is not None and neg_file is not None:
            print("Generating two FASTA files")
            with open(pos_file, "w") as pos_handle:
                with open(neg_file, "w") as neg_handle:
                    for identifier, record in crude_fasta_iterator(in_handle):
                        if clean_name(identifier) in wanted:
                            pos_handle.write(record)
                            pos_count += 1
                        else:
                            neg_handle.write(record)
                            neg_count += 1
        elif pos_file is not None:
            print("Generating matching FASTA file")
            with open(pos_file, "w") as pos_handle:
                for identifier, record in crude_fasta_iterator(in_handle):
                    if clean_name(identifier) in wanted:
                        pos_handle.write(record)
                        pos_count += 1
                    else:
                        neg_count += 1
        else:
            print("Generating non-matching FASTA file")
            assert neg_file is not None
            with open(neg_file, "w") as neg_handle:
                for identifier, record in crude_fasta_iterator(in_handle):
                    if clean_name(identifier) in wanted:
                        pos_count += 1
                    else:
                        neg_handle.write(record)
                        neg_count += 1
    return pos_count, neg_count

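The three loops above differ only in which output handles exist. As a rough sketch of the same partitioning with a single loop, the example below routes each record to an optional handle. `split_fasta` is a hypothetical function; it checks the handles inside the loop (which the original avoids for speed) and it does not apply `clean_name`, whose definition is outside this part of the listing.

```python
import io


def split_fasta(in_handle, wanted, pos_handle=None, neg_handle=None):
    """Route each FASTA record to pos_handle or neg_handle (either may be None)."""
    pos_count = neg_count = 0
    target = None  # handle receiving the record currently being copied
    for line in in_handle:
        if line.startswith(">"):
            words = line[1:].split(None, 1)
            identifier = words[0] if words else None
            if identifier in wanted:
                pos_count += 1
                target = pos_handle
            else:
                neg_count += 1
                target = neg_handle
        if target is not None:
            target.write(line)
    return pos_count, neg_count


source = io.StringIO(">a desc\nAC\n>b\nGG\n>c\nTT\n")
pos, neg = io.StringIO(), io.StringIO()
counts = split_fasta(source, {"a", "c"}, pos, neg)
print(counts)
```

Both counts are returned even when only one output handle is given, matching the behaviour of `fasta_filter`.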
def fastq_filter(in_file, pos_file, neg_file, wanted):
    """FASTQ filter."""
    from Bio.SeqIO.QualityIO import FastqGeneralIterator

    handle = open(in_file, "r")
    if pos_file is not None and neg_file is not None:
        print("Generating two FASTQ files")
        positive_handle = open(pos_file, "w")
        negative_handle = open(neg_file, "w")
        print(in_file)
        for title, seq, qual in FastqGeneralIterator(handle):
            print("%s --> %s" % (title, clean_name(title.split(None, 1)[0])))
            if clean_name(title.split(None, 1)[0]) in wanted:
                positive_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
            else:
                negative_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
        positive_handle.close()
        negative_handle.close()
    elif pos_file is not None:
        print("Generating matching FASTQ file")
        positive_handle = open(pos_file, "w")
        for title, seq, qual in FastqGeneralIterator(handle):
            if clean_name(title.split(None, 1)[0]) in wanted:
                positive_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
        positive_handle.close()
    elif neg_file is not None:
        print("Generating non-matching FASTQ file")
        negative_handle = open(neg_file, "w")
        for title, seq, qual in FastqGeneralIterator(handle):
            if clean_name(title.split(None, 1)[0]) not in wanted:
                negative_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
        negative_handle.close()
    handle.close()
    # This does not currently bother to track record counts (faster)

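The FASTQ filter opens and closes its handles by hand and does not count records. One possible alternative, sketched here under assumptions, uses `contextlib.ExitStack` to manage the optional outputs and returns both counts. `fastq_split` and `read_four_line_fastq` are hypothetical names; the reader assumes strict four-line records and stands in for Biopython's `FastqGeneralIterator` only so that the example runs without Biopython, and `clean_name` is again left out.

```python
import os
import tempfile
from contextlib import ExitStack


def read_four_line_fastq(handle):
    """Yield (title, seq, qual); assumes strict 4-line records (no wrapping)."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the "+" separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:].rstrip("\n"), seq, qual


def fastq_split(in_file, pos_file, neg_file, wanted):
    """Write matching and non-matching reads; either output name may be None."""
    pos_count = neg_count = 0
    with ExitStack() as stack:
        # Every handle entered here is closed when the with block exits
        handle = stack.enter_context(open(in_file))
        pos = stack.enter_context(open(pos_file, "w")) if pos_file else None
        neg = stack.enter_context(open(neg_file, "w")) if neg_file else None
        for title, seq, qual in read_four_line_fastq(handle):
            matched = title.split(None, 1)[0] in wanted
            if matched:
                pos_count += 1
            else:
                neg_count += 1
            out = pos if matched else neg
            if out is not None:
                out.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
    return pos_count, neg_count


with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "reads.fastq")
    pos_path = os.path.join(tmp, "match.fastq")
    with open(in_path, "w") as f:
        f.write("@r1 extra\nACGT\n+\nIIII\n@r2\nTTTT\n+\n!!!!\n")
    counts = fastq_split(in_path, pos_path, None, {"r1"})
    with open(pos_path) as f:
        matched_text = f.read()
```

The per-record branch on `out` costs a little speed compared with the three specialised loops above, in exchange for a single loop and guaranteed handle cleanup.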
def sff_filter(in_file, pos_file, neg_file, wanted):
    """SFF filter."""
    try:
        from Bio.SeqIO.SffIO import SffIterator, SffWriter
    except ImportError:
        sys.exit("SFF filtering requires Biopython 1.54 or later")

    try:
        from Bio.SeqIO.SffIO import ReadRocheXmlManifest
    except ImportError:
        # Prior to Biopython 1.56 this was a private function
        from Bio.SeqIO.SffIO import _sff_read_roche_index_xml as ReadRocheXmlManifest

    in_handle = open(in_file, "rb")  # must be binary mode!
    try:
        manifest = ReadRocheXmlManifest(in_handle)
    except ValueError:
        manifest = None

    # This makes two passes through the SFF file, which isn't so efficient,
    # but it keeps the code simple.
    pos_count = neg_count = 0
    if pos_file is not None:
        out_handle = open(pos_file, "wb")
        writer = SffWriter(out_handle, xml=manifest)
        in_handle.seek(0)  # start again after getting manifest
        pos_count = writer.write_file(
            rec for rec in SffIterator(in_handle) if clean_name(rec.id) in wanted
        )
        out_handle.close()
    if neg_file is not None:
        out_handle = open(neg_file, "wb")
        writer = SffWriter(out_handle, xml=manifest)
        in_handle.seek(0)  # start again
        neg_count = writer.write_file(
            rec for rec in SffIterator(in_handle) if clean_name(rec.id) not in wanted
        )
        out_handle.close()
    # And we're done
    in_handle.close()
    # At the time of writing, Galaxy doesn't show SFF file read counts,
    # so it is useful to put them in stdout and thus shown in job info.
    return pos_count, neg_count

if seq_format.lower() == "sff":
    # Now write filtered SFF file based on IDs wanted
    pos_count, neg_count = sff_filter(
        in_file, out_positive_file, out_negative_file, ids
    )
    # At the time of writing, Galaxy doesn't show SFF file read counts,
    # so it is useful to put them in stdout and thus shown in job info.
elif seq_format.lower() == "fasta":
    # Write filtered FASTA file based on IDs from tabular file
    pos_count, neg_count = fasta_filter(
        in_file, out_positive_file, out_negative_file, ids
    )
    print("%i with and %i without specified IDs" % (pos_count, neg_count))
elif seq_format.lower().startswith("fastq"):
    # Write filtered FASTQ file based on IDs from tabular file
    fastq_filter(in_file, out_positive_file, out_negative_file, ids)
    # This does not currently track the counts
else:
    sys.exit("Unsupported file type %r" % seq_format)
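The format dispatch above compares the lower-cased format name and treats anything starting with `fastq` as FASTQ. A small sketch of the same selection rule as a lookup follows; `choose_filter` is a hypothetical helper, and the dictionary values are placeholder strings standing in for the real filter functions.

```python
def choose_filter(seq_format, filters):
    """Return the entry registered for seq_format (hypothetical helper).

    `filters` maps "sff", "fasta" and "fastq" to filter callables.
    """
    key = seq_format.lower()
    if key.startswith("fastq"):
        key = "fastq"  # covers variants such as fastqsanger
    try:
        return filters[key]
    except KeyError:
        raise ValueError("Unsupported file type %r" % seq_format)


# Placeholder strings are used here instead of the real functions
filters = {"sff": "sff_filter", "fasta": "fasta_filter", "fastq": "fastq_filter"}
chosen = choose_filter("FASTQSanger", filters)
print(chosen)
```

Unlike the script, which exits via `sys.exit`, this sketch raises `ValueError` so that a caller can decide how to report an unsupported format.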