annotate tools/seq_filter_by_id/seq_filter_by_id.py @ 12:85ef5f5a0562 draft default tip

v0.2.9 - Fixed file open mode for Python 3.11 onwards.
author peterjc
date Thu, 21 Dec 2023 10:47:58 +0000
parents 4a7d8ad2a983
children
#!/usr/bin/env python
"""Filter a FASTA, FASTQ or SFF file with IDs from a tabular file.

Takes six command line options, tabular filename, ID column numbers (comma
separated list using one based counting), input filename, input type (e.g.
FASTA or SFF) and up to two output filenames (for records with and without
the given IDs, same format as input sequence file).

When filtering an SFF file, any Roche XML manifest in the input file is
preserved in both output files.

Note in the default NCBI BLAST+ tabular output, the query sequence ID is
in column one, and the ID of the match from the database is in column two.
Here sensible values for the column numbers would therefore be "1" or "2".

This tool is a short Python script which requires Biopython 1.54 or later.
If you use this tool in scientific work leading to a publication, please
cite the Biopython application note:

Cock et al 2009. Biopython: freely available Python tools for computational
molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
https://doi.org/10.1093/bioinformatics/btp163 pmid:19304878.

This script is copyright 2010-2023 by Peter Cock, The James Hutton Institute
(formerly the Scottish Crop Research Institute, SCRI), UK. All rights reserved.
See accompanying text file for licence details (MIT license).

Use -v or --version to get the version, -h or --help for help.
"""

from __future__ import print_function

import os
import re
import sys

from optparse import OptionParser

# Parse Command Line
usage = """Use as follows:

$ python seq_filter_by_id.py [options] tab1 cols1 [, tab2 cols2, ...]

e.g. Positive matches using column one from tabular file:

$ seq_filter_by_id.py -i my_seqs.fastq -f fastq -p matches.fastq ids.tabular 1

Multiple tabular files and column numbers may be given, or replaced with
the -t or --text option.
"""
parser = OptionParser(usage=usage)
parser.add_option(
    "-i",
    "--input",
    dest="input",
    default=None,
    help="Input sequences filename",
    metavar="FILE",
)
parser.add_option(
    "-f",
    "--format",
    dest="format",
    default=None,
    help="Input sequence format (e.g. fasta, fastq, sff)",
)
parser.add_option(
    "-t",
    "--text",
    dest="id_list",
    default=None,
    help="Lists of white space separated IDs (instead of a tabular file)",
)
parser.add_option(
    "-p",
    "--positive",
    dest="output_positive",
    default=None,
    help="Output filename for matches",
    metavar="FILE",
)
parser.add_option(
    "-n",
    "--negative",
    dest="output_negative",
    default=None,
    help="Output filename for non-matches",
    metavar="FILE",
)
parser.add_option(
    "-l",
    "--logic",
    dest="logic",
    default="UNION",
    help="How to combine multiple ID columns (UNION or INTERSECTION)",
)
parser.add_option(
    "-s",
    "--suffix",
    dest="suffix",
    action="store_true",
    help="Ignore pair-read suffixes for matching names",
)
parser.add_option(
    "-v",
    "--version",
    dest="version",
    default=False,
    action="store_true",
    help="Show version and quit",
)

options, args = parser.parse_args()

if options.version:
    print("v0.2.9")
    sys.exit(0)

in_file = options.input
seq_format = options.format
out_positive_file = options.output_positive
out_negative_file = options.output_negative
logic = options.logic
drop_suffixes = bool(options.suffix)

if in_file is None or not os.path.isfile(in_file):
    sys.exit("Missing input file: %r" % in_file)
if out_positive_file is None and out_negative_file is None:
    sys.exit("Neither output file requested")
if seq_format is None:
    sys.exit("Missing sequence format")
if logic not in ["UNION", "INTERSECTION"]:
    sys.exit("Logic argument should be 'UNION' or 'INTERSECTION', not %r" % logic)
if options.id_list and args:
    sys.exit("Cannot accept IDs via both -t in the command line, and as tabular files")
elif not options.id_list and not args:
    sys.exit("Expected matched pairs of tabular files and columns (or -t given)")
if len(args) % 2:
    sys.exit("Expected matched pairs of tabular files and columns, not: %r" % args)


# Cope with three widely used suffix naming conventions,
# Illumina: /1 or /2
# Forward/reverse: .f or .r
# Sanger, e.g. .p1k and .q1k
# See http://staden.sourceforge.net/manual/pregap4_unix_50.html
# re_f = re.compile(r"(/1|\.f|\.[sfp]\d\w*)$")
# re_r = re.compile(r"(/2|\.r|\.[rq]\d\w*)$")
re_suffix = re.compile(r"(/1|\.f|\.[sfp]\d\w*|/2|\.r|\.[rq]\d\w*)$")
assert re_suffix.search("demo.f")
assert re_suffix.search("demo.s1")
assert re_suffix.search("demo.f1k")
assert re_suffix.search("demo.p1")
assert re_suffix.search("demo.p1k")
assert re_suffix.search("demo.p1lk")
assert re_suffix.search("demo/2")
assert re_suffix.search("demo.r")
assert re_suffix.search("demo.q1")
assert re_suffix.search("demo.q1lk")

identifiers = []
for i in range(len(args) // 2):
    tabular_file = args[2 * i]
    cols_arg = args[2 * i + 1]
    if not os.path.isfile(tabular_file):
        sys.exit("Missing tabular identifier file %r" % tabular_file)
    try:
        # Convert one-based column numbers to zero-based indices
        columns = [int(arg) - 1 for arg in cols_arg.split(",")]
    except ValueError:
        sys.exit(
            "Expected list of columns (comma separated integers), got %r" % cols_arg
        )
    if min(columns) < 0:
        sys.exit(
            "Expect one-based column numbers (not zero-based counting), got %r"
            % cols_arg
        )
    identifiers.append((tabular_file, columns))

5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
180 name_warn = False
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
181
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
182
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
183 def check_white_space(name):
9
141612f8c3e3 v0.2.7 Python 3 compatible print etc
peterjc
parents: 8
diff changeset
184 """Check identifier for white space, take first word only."""
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
185 parts = name.split(None, 1)
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
186 global name_warn
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
187 if not name_warn and len(parts) > 1:
10
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
188 name_warn = (
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
189 "WARNING: Some of your identifiers had white space in them, "
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
190 + "using first word only. e.g.:\n%s\n" % name
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
191 )
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
192 return parts[0]
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
193
8
2d4537dbf0bc v0.2.6 Depend on Biopython 1.67 from Tool Shed or (Bio)conda
peterjc
parents: 7
diff changeset
194
10
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
195 if drop_suffixes:
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
196
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
197 def clean_name(name):
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
198 """Remove suffix."""
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
199 name = check_white_space(name)
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
200 match = re_suffix.search(name)
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
201 if match:
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
202 # Use the fact this is a suffix, and regular expression will be
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
203 # anchored to the end of the name:
10
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
204 return name[: match.start()]
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
205 else:
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
206 # Nothing to do
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
207 return name
10
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
208
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
209 assert clean_name("foo/1") == "foo"
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
210 assert clean_name("foo/2") == "foo"
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
211 assert clean_name("bar.f") == "bar"
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
212 assert clean_name("bar.r") == "bar"
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
213 assert clean_name("baz.p1") == "baz"
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
214 assert clean_name("baz.q2") == "baz"
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
215 else:
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
216 # Just check the white space
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
217 clean_name = check_white_space
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
218
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
219
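The assertions above pin down what the suffix-stripping `clean_name` is expected to return. As a rough illustration only, the sketch below reproduces those six cases with a single anchored regular expression. The pattern and the name `strip_pair_suffix` are assumptions made for this example, not the expressions the tool itself defines.

```python
import re

# Hypothetical pattern: /1, /2, .f, .r, or .p<digits> / .q<digits> at the end.
_PAIR_SUFFIX = re.compile(r"(/[12]|\.[fr]|\.[pq]\d+)$")


def strip_pair_suffix(name):
    """Return name without a trailing paired-read suffix, if one is present."""
    match = _PAIR_SUFFIX.search(name)
    if match:
        # The pattern is anchored at the end of the string, so slicing at
        # the match start removes exactly the suffix.
        return name[: match.start()]
    # Nothing to do
    return name
```

Names without a recognised suffix, such as `foo.fasta`, pass through unchanged under this pattern.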
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
220 mapped_chars = {
10
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
221 ">": "__gt__",
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
222 "<": "__lt__",
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
223 "'": "__sq__",
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
224 '"': "__dq__",
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
225 "[": "__ob__",
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
226 "]": "__cb__",
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
227 "{": "__oc__",
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
228 "}": "__cc__",
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
229 "@": "__at__",
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
230 "\n": "__cn__",
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
231 "\r": "__cr__",
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
232 "\t": "__tc__",
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
233 "#": "__pd__",
7
fb1313d79396 Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents: 6
diff changeset
234 }
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
235
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
236 # Read tabular file(s) and record all specified identifiers
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
237 ids = None # Will be a set
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
238 if options.id_list:
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
239 assert not identifiers
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
240 ids = set()
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
241 id_list = options.id_list
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
242 # Galaxy turns \r into __cr__ (CR) etc
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
243 for k in mapped_chars:
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
244 id_list = id_list.replace(mapped_chars[k], k)
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
245 for x in id_list.split():
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
246 ids.add(clean_name(x.strip()))
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
247 print("Have %i unique identifiers from list" % len(ids))
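To illustrate the decoding step for IDs passed as a text parameter, a minimal sketch follows. It repeats only a subset of the `mapped_chars` table and uses a hypothetical helper name, `decode_galaxy_text`; it splits the decoded string, so escaped newlines and tabs act as separators.

```python
# Subset of the token map shown above: Galaxy replaces the characters on
# the left with the tokens on the right when passing text parameters.
MAPPED_CHARS = {
    "\n": "__cn__",
    "\r": "__cr__",
    "\t": "__tc__",
    "@": "__at__",
}


def decode_galaxy_text(text):
    """Undo the token escaping, then split the result on white space."""
    for char, token in MAPPED_CHARS.items():
        text = text.replace(token, char)
    # str.split() with no argument splits on any run of white space and
    # never returns empty strings.
    return set(text.split())
```

For example, `decode_galaxy_text("id1__cn__id2")` gives a set holding `id1` and `id2` as separate identifiers.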
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
248 for tabular_file, columns in identifiers:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
249 file_ids = set()
12
85ef5f5a0562 v0.2.9 - Fixed file open mode for Python 3.11 onwards.
peterjc
parents: 10
diff changeset
250 handle = open(tabular_file)
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
251 if len(columns) > 1:
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
252 # General case of many columns
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
253 for line in handle:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
254 if line.startswith("#"):
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
255 # Ignore comments
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
256 continue
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
257 parts = line.rstrip("\n").split("\t")
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
258 for col in columns:
7
fb1313d79396 Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents: 6
diff changeset
259 name = clean_name(parts[col])
fb1313d79396 Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents: 6
diff changeset
260 if name:
fb1313d79396 Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents: 6
diff changeset
261 file_ids.add(name)
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
262 else:
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
263 # Single column, special case speed up
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
264 col = columns[0]
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
265 for line in handle:
7
fb1313d79396 Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents: 6
diff changeset
266 if not line.strip(): # skip empty lines
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
267 continue
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
268 if not line.startswith("#"):
7
fb1313d79396 Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents: 6
diff changeset
269 name = clean_name(line.rstrip("\n").split("\t")[col])
fb1313d79396 Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents: 6
diff changeset
270 if name:
fb1313d79396 Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents: 6
diff changeset
271 file_ids.add(name)
10
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
272 print(
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
273 "Using %i IDs from column %s in tabular file"
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
274 % (len(file_ids), ", ".join(str(col + 1) for col in columns))
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
275 )
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
276 if ids is None:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
277 ids = file_ids
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
278 if logic == "UNION":
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
279 ids.update(file_ids)
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
280 else:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
281 ids.intersection_update(file_ids)
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
282 handle.close()
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
283 if len(identifiers) > 1:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
284 if logic == "UNION":
10
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
285 print(
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
286 "Have %i IDs combined from %i tabular files" % (len(ids), len(identifiers))
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
287 )
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
288 else:
10
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
289 print(
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
290 "Have %i IDs in common from %i tabular files" % (len(ids), len(identifiers))
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
291 )
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
292 if name_warn:
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
293 sys.stderr.write(name_warn)
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
294
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
295
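The way the per-file ID sets are merged can be sketched in isolation. This is a simplified stand-alone version, assuming a hypothetical `combine_ids` helper that receives already-parsed sets; as in the loop above, any `logic` value other than "UNION" is treated as an intersection.

```python
def combine_ids(id_sets, logic="UNION"):
    """Combine an iterable of sets of identifiers by union or intersection."""
    ids = None
    for file_ids in id_sets:
        if ids is None:
            # First file: start from a copy so the caller's set is untouched.
            ids = set(file_ids)
        elif logic == "UNION":
            ids.update(file_ids)
        else:
            ids.intersection_update(file_ids)
    # With no input files at all there are no identifiers to report.
    return ids if ids is not None else set()
```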
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
296 def crude_fasta_iterator(handle):
10
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
297 """Parse FASTA file yielding tuples of (name, sequence)."""
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
298 while True:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
299 line = handle.readline()
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
300 if line == "":
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
301 return # Premature end of file, or just empty?
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
302 if line[0] == ">":
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
303 break
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
304
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
305 no_id_warned = False
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
306 while True:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
307 if line[0] != ">":
10
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
308 raise ValueError("Records in Fasta files should start with '>' character")
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
309 try:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
310 id = line[1:].split(None, 1)[0]
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
311 except IndexError:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
312 if not no_id_warned:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
313 sys.stderr.write("WARNING - Malformed FASTA entry with no identifier\n")
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
314 no_id_warned = True
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
315 id = None
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
316 lines = [line]
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
317 line = handle.readline()
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
318 while True:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
319 if not line:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
320 break
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
321 if line[0] == ">":
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
322 break
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
323 lines.append(line)
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
324 line = handle.readline()
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
325 yield id, "".join(lines)
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
326 if not line:
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
327 return # StopIteration
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
328
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
329
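For comparison, a compact variant of the same idea is sketched below, written as a `for` loop over the handle instead of explicit `readline()` calls. The name `iter_fasta_records` is an assumption for this example, and unlike `crude_fasta_iterator` it neither warns about missing identifiers nor raises on malformed input.

```python
import io


def iter_fasta_records(handle):
    """Yield (identifier, raw_record_text) pairs from a FASTA handle."""
    identifier, lines = None, []
    for line in handle:
        if line.startswith(">"):
            if lines:
                # A new title line closes the previous record.
                yield identifier, "".join(lines)
            # The first word after ">" is the identifier (None if absent).
            words = line[1:].split(None, 1)
            identifier = words[0] if words else None
            lines = [line]
        elif lines:
            # Sequence line of the current record; any text before the
            # first ">" is skipped because lines is still empty.
            lines.append(line)
    if lines:
        yield identifier, "".join(lines)


example = io.StringIO(">seq1 first\nACGT\nAC\n>seq2\nGGTT\n")
records = list(iter_fasta_records(example))
```

Each yielded record keeps its original title line and line breaks, so writing it back out reproduces the input text for that entry.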
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
330 def fasta_filter(in_file, pos_file, neg_file, wanted):
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
331 """FASTA filter producing 60 character line wrapped output."""
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
332 pos_count = neg_count = 0
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
333 # Galaxy now requires Python 2.5+, so we can use with statements.
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
334 with open(in_file) as in_handle:
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
335 # Doing the if statement outside the loop for speed
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
336 # (with the downside of three very similar loops).
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
337 if pos_file is not None and neg_file is not None:
9
141612f8c3e3 v0.2.7 Python 3 compatible print etc
peterjc
parents: 8
diff changeset
338 print("Generating two FASTA files")
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
339 with open(pos_file, "w") as pos_handle:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
340 with open(neg_file, "w") as neg_handle:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
341 for identifier, record in crude_fasta_iterator(in_handle):
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
342 if clean_name(identifier) in wanted:
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
343 pos_handle.write(record)
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
344 pos_count += 1
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
345 else:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
346 neg_handle.write(record)
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
347 neg_count += 1
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
348 elif pos_file is not None:
9
141612f8c3e3 v0.2.7 Python 3 compatible print etc
peterjc
parents: 8
diff changeset
349 print("Generating matching FASTA file")
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
350 with open(pos_file, "w") as pos_handle:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
351 for identifier, record in crude_fasta_iterator(in_handle):
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
352 if clean_name(identifier) in wanted:
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
353 pos_handle.write(record)
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
354 pos_count += 1
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
355 else:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
356 neg_count += 1
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
357 else:
9
141612f8c3e3 v0.2.7 Python 3 compatible print etc
peterjc
parents: 8
diff changeset
358 print("Generating non-matching FASTA file")
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
359 assert neg_file is not None
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
360 with open(neg_file, "w") as neg_handle:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
361 for identifier, record in crude_fasta_iterator(in_handle):
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
362 if clean_name(identifier) in wanted:
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
363 pos_count += 1
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
364 else:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
365 neg_handle.write(record)
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
366 neg_count += 1
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
367 return pos_count, neg_count
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
368
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
369
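The three near-identical loops in `fasta_filter` trade brevity for speed. A single-loop alternative might look like the sketch below; it is only an illustration, with a hypothetical `split_records` helper that takes optional writable handles rather than filenames, matches identifiers as given (no `clean_name` step), and pays for an extra `is not None` check per record.

```python
import io


def split_records(records, wanted, pos_handle=None, neg_handle=None):
    """Route (identifier, text) records to up to two handles; return counts."""
    pos_count = neg_count = 0
    for identifier, text in records:
        if identifier in wanted:
            pos_count += 1
            if pos_handle is not None:
                pos_handle.write(text)
        else:
            neg_count += 1
            if neg_handle is not None:
                neg_handle.write(text)
    return pos_count, neg_count


matched = io.StringIO()
counts = split_records(
    [("a", ">a\nAC\n"), ("b", ">b\nGT\n")], {"a"}, pos_handle=matched
)
```

Both counts are still reported when only one output handle is supplied, which matches the return value of `fasta_filter`.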
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
370 def fastq_filter(in_file, pos_file, neg_file, wanted):
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
371 """FASTQ filter."""
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
372 from Bio.SeqIO.QualityIO import FastqGeneralIterator
10
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
373
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
374 handle = open(in_file, "r")
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
375 if pos_file is not None and neg_file is not None:
9
141612f8c3e3 v0.2.7 Python 3 compatible print etc
peterjc
parents: 8
diff changeset
376 print("Generating two FASTQ files")
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
377 positive_handle = open(pos_file, "w")
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
378 negative_handle = open(neg_file, "w")
9
141612f8c3e3 v0.2.7 Python 3 compatible print etc
peterjc
parents: 8
diff changeset
379 print(in_file)
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
380 for title, seq, qual in FastqGeneralIterator(handle):
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
381 print("%s --> %s" % (title, clean_name(title.split(None, 1)[0])))
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
382 if clean_name(title.split(None, 1)[0]) in wanted:
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
383 positive_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
384 else:
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
385 negative_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
386 positive_handle.close()
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
387 negative_handle.close()
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
388 elif pos_file is not None:
9
141612f8c3e3 v0.2.7 Python 3 compatible print etc
peterjc
parents: 8
diff changeset
389 print("Generating matching FASTQ file")
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
390 positive_handle = open(pos_file, "w")
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
391 for title, seq, qual in FastqGeneralIterator(handle):
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
392 if clean_name(title.split(None, 1)[0]) in wanted:
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
393 positive_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
394 positive_handle.close()
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
395 elif neg_file is not None:
9
141612f8c3e3 v0.2.7 Python 3 compatible print etc
peterjc
parents: 8
diff changeset
396 print("Generating non-matching FASTQ file")
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
397 negative_handle = open(neg_file, "w")
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
398 for title, seq, qual in FastqGeneralIterator(handle):
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
399 if clean_name(title.split(None, 1)[0]) not in wanted:
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
400 negative_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
401 negative_handle.close()
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
402 handle.close()
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
403 # This does not currently bother to record record counts (faster)
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
404
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
405
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
406 def sff_filter(in_file, pos_file, neg_file, wanted):
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
407 """SFF filter."""
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
408 try:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
409 from Bio.SeqIO.SffIO import SffIterator, SffWriter
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
410 except ImportError:
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
411 sys.exit("SFF filtering requires Biopython 1.54 or later")
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
412
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
413 try:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
414 from Bio.SeqIO.SffIO import ReadRocheXmlManifest
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
415 except ImportError:
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
416 # Prior to Biopython 1.56 this was a private function
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
417 from Bio.SeqIO.SffIO import _sff_read_roche_index_xml as ReadRocheXmlManifest
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
418
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
419 in_handle = open(in_file, "rb") # must be binary mode!
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
420 try:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
421 manifest = ReadRocheXmlManifest(in_handle)
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
422 except ValueError:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
423 manifest = None
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
424
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
425 # This makes two passes through the SFF file, which isn't so efficient,
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
426 # but this makes the code simple.
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
427 pos_count = neg_count = 0
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
428 if pos_file is not None:
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
429 out_handle = open(pos_file, "wb")
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
430 writer = SffWriter(out_handle, xml=manifest)
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
431 in_handle.seek(0) # start again after getting manifest
10
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
432 pos_count = writer.write_file(
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
433 rec for rec in SffIterator(in_handle) if clean_name(rec.id) in wanted
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
434 )
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
435 out_handle.close()
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
436 if neg_file is not None:
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
437 out_handle = open(neg_file, "wb")
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
438 writer = SffWriter(out_handle, xml=manifest)
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
439 in_handle.seek(0) # start again
10
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
440 neg_count = writer.write_file(
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
441 rec for rec in SffIterator(in_handle) if clean_name(rec.id) not in wanted
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
442 )
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
443 out_handle.close()
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
444 # And we're done
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
445 in_handle.close()
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
446 # At the time of writing, Galaxy doesn't show SFF file read counts,
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
447 # so it is useful to write them to stdout, where they show up in the job info.
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
448 return pos_count, neg_count
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
449
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
450
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
451 if seq_format.lower() == "sff":
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
452 # Now write filtered SFF file based on IDs wanted
10
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
453 pos_count, neg_count = sff_filter(
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
454 in_file, out_positive_file, out_negative_file, ids
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
455 )
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
456 # At the time of writing, Galaxy doesn't show SFF file read counts,
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
457 # so it is useful to write them to stdout, where they show up in the job info.
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
458 elif seq_format.lower() == "fasta":
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
459 # Write filtered FASTA file based on IDs from tabular file
10
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
460 pos_count, neg_count = fasta_filter(
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
461 in_file, out_positive_file, out_negative_file, ids
4a7d8ad2a983 Bump Biopython dependency
peterjc
parents: 9
diff changeset
462 )
9
141612f8c3e3 v0.2.7 Python 3 compatible print etc
peterjc
parents: 8
diff changeset
463 print("%i with and %i without specified IDs" % (pos_count, neg_count))
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
464 elif seq_format.lower().startswith("fastq"):
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
465 # Write filtered FASTQ file based on IDs from tabular file
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
466 fastq_filter(in_file, out_positive_file, out_negative_file, ids)
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
467 # This does not currently track the counts
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
468 else:
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
469 sys.exit("Unsupported file type %r" % seq_format)
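The two-pass pattern used by sff_filter above (rewind the input handle, write the wanted records, rewind again, write the rest, return both counts) can be sketched without Biopython. This is only an illustration under stated assumptions: the function name two_pass_filter and the one-record-per-line, tab-separated layout are hypothetical stand-ins for SffIterator and SffWriter, not the real SFF format or the tool's own code.

```python
import io


def two_pass_filter(in_handle, pos_handle, neg_handle, wanted):
    """Split line-based records into wanted / unwanted outputs in two passes.

    Hypothetical stand-in for sff_filter: each record is one line whose
    first tab-separated field is the ID (this is NOT the SFF format).
    """
    pos_count = neg_count = 0
    if pos_handle is not None:
        in_handle.seek(0)  # rewind before the first pass
        for line in in_handle:
            if line.split(b"\t", 1)[0].strip() in wanted:
                pos_handle.write(line)
                pos_count += 1
    if neg_handle is not None:
        in_handle.seek(0)  # rewind again for the second pass
        for line in in_handle:
            if line.split(b"\t", 1)[0].strip() not in wanted:
                neg_handle.write(line)
                neg_count += 1
    return pos_count, neg_count


# In-memory binary handles play the role of the input and output files.
data = io.BytesIO(b"read1\tACGT\nread2\tGGCC\nread3\tTTAA\n")
pos, neg = io.BytesIO(), io.BytesIO()
counts = two_pass_filter(data, pos, neg, {b"read1", b"read3"})
print("%i with and %i without specified IDs" % counts)
```

As in the original function, either output may be None, in which case that pass is skipped and its count stays at zero. Reading the input twice costs extra I/O but keeps each pass a simple filter over one iterator.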