annotate tools/seq_filter_by_id/seq_filter_by_id.py @ 8:2d4537dbf0bc draft
v0.2.6 Depend on Biopython 1.67 from Tool Shed or (Bio)conda
author | peterjc |
---|---|
date | Wed, 10 May 2017 13:18:01 -0400 |
parents | fb1313d79396 |
children | 141612f8c3e3 |
#!/usr/bin/env python
"""Filter a FASTA, FASTQ or SFF file with IDs from a tabular file.

Takes six command line options, tabular filename, ID column numbers (comma
separated list using one based counting), input filename, input type (e.g.
FASTA or SFF) and up to two output filenames (for records with and without
the given IDs, same format as input sequence file).

When filtering an SFF file, any Roche XML manifest in the input file is
preserved in both output files.

Note in the default NCBI BLAST+ tabular output, the query sequence ID is
in column one, and the ID of the match from the database is in column two.
Here sensible values for the column numbers would therefore be "1" or "2".

This tool is a short Python script which requires Biopython 1.54 or later.
If you use this tool in scientific work leading to a publication, please
cite the Biopython application note:

Cock et al 2009. Biopython: freely available Python tools for computational
molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878.

This script is copyright 2010-2013 by Peter Cock, The James Hutton Institute
(formerly the Scottish Crop Research Institute, SCRI), UK. All rights reserved.
See accompanying text file for licence details (MIT license).

Use -v or --version to get the version, -h or --help for help.
"""
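The docstring's note on BLAST+ tabular output can be illustrated with a small, self-contained sketch. It is not part of the script: `ids_from_column` and the sample rows are hypothetical, and it assumes tab-separated lines and one-based column numbers as described above. Skipping blank and `#` lines is likewise an assumption made for the example.

```python
def ids_from_column(lines, column):
    """Collect the IDs found in a one-based column of tab-separated lines.

    Hypothetical helper for illustration only.
    """
    ids = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # assumption: ignore blank lines and comment lines
        fields = line.split("\t")
        ids.add(fields[column - 1])  # convert one-based to zero-based index
    return ids


# Two made-up rows laid out like default BLAST+ tabular output
# (query ID in column one, database match ID in column two).
rows = [
    "query1\tsubjectA\t98.5\n",
    "query2\tsubjectB\t91.0\n",
]
query_ids = ids_from_column(rows, 1)
subject_ids = ids_from_column(rows, 2)
```

Passing `1` selects the query IDs and `2` the matched database IDs, which is why those are the sensible column numbers for BLAST+ output.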

import os
import re
import sys

from optparse import OptionParser

# Parse Command Line
usage = """Use as follows:

$ python seq_filter_by_id.py [options] tab1 cols1 [, tab2 cols2, ...]

e.g. Positive matches using column one from tabular file:

$ seq_filter_by_id.py -i my_seqs.fastq -f fastq -p matches.fastq ids.tabular 1

Multiple tabular files and column numbers may be given, or replaced with
the -t or --text option.
"""
parser = OptionParser(usage=usage)
parser.add_option('-i', '--input', dest='input',
                  default=None, help='Input sequences filename',
                  metavar="FILE")
parser.add_option('-f', '--format', dest='format',
                  default=None,
                  help='Input sequence format (e.g. fasta, fastq, sff)')
parser.add_option('-t', '--text', dest='id_list',
                  default=None, help="List of white space separated IDs (instead of a tabular file)")
parser.add_option('-p', '--positive', dest='output_positive',
                  default=None,
                  help='Output filename for matches',
                  metavar="FILE")
parser.add_option('-n', '--negative', dest='output_negative',
                  default=None,
                  help='Output filename for non-matches',
                  metavar="FILE")
parser.add_option("-l", "--logic", dest="logic",
                  default="UNION",
                  help="How to combine multiple ID columns (UNION or INTERSECTION)")
parser.add_option("-s", "--suffix", dest="suffix",
                  action="store_true",
                  help="Ignore paired-read suffixes when matching names")
parser.add_option("-v", "--version", dest="version",
                  default=False, action="store_true",
                  help="Show version and quit")

options, args = parser.parse_args()

if options.version:
    # Parenthesised so the same line works under Python 2 and Python 3
    print("v0.2.5")
    sys.exit(0)

in_file = options.input
seq_format = options.format
out_positive_file = options.output_positive
out_negative_file = options.output_negative
logic = options.logic
drop_suffices = bool(options.suffix)

if in_file is None or not os.path.isfile(in_file):
    sys.exit("Missing input file: %r" % in_file)
if out_positive_file is None and out_negative_file is None:
    sys.exit("Neither output file requested")
if seq_format is None:
    sys.exit("Missing sequence format")
if logic not in ["UNION", "INTERSECTION"]:
    sys.exit("Logic argument should be 'UNION' or 'INTERSECTION', not %r" % logic)
if options.id_list and args:
    sys.exit("Cannot accept IDs via both -t in the command line, and as tabular files")
elif not options.id_list and not args:
    sys.exit("Expected matched pairs of tabular files and columns (or -t given)")
if len(args) % 2:
    sys.exit("Expected matched pairs of tabular files and columns, not: %r" % args)


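The `--logic` option accepts UNION or INTERSECTION for combining several ID columns. As a minimal sketch of what those two modes mean in terms of Python sets, consider the following; `combine_id_sets` and the sample IDs are hypothetical and are not taken from the script itself.

```python
def combine_id_sets(id_sets, logic="UNION"):
    """Combine several collections of IDs into one set.

    Hypothetical helper for illustration only, not the script's own code.
    """
    if logic not in ("UNION", "INTERSECTION"):
        raise ValueError("logic should be 'UNION' or 'INTERSECTION', not %r" % logic)
    combined = None
    for id_set in id_sets:
        if combined is None:
            combined = set(id_set)  # the first collection seeds the result
        elif logic == "UNION":
            combined.update(id_set)  # keep IDs seen in any column
        else:
            combined.intersection_update(id_set)  # keep IDs seen in every column
    return combined if combined is not None else set()


column_one = {"seq1", "seq2", "seq3"}
column_two = {"seq2", "seq3", "seq4"}
either = combine_id_sets([column_one, column_two], "UNION")
both = combine_id_sets([column_one, column_two], "INTERSECTION")
```

With UNION a record is selected if its ID appears in any of the columns; with INTERSECTION it must appear in all of them.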
# Cope with three widely used suffix naming conventions,
# Illumina: /1 or /2
# Forward/reverse: .f or .r
# Sanger, e.g. .p1k and .q1k
# See http://staden.sourceforge.net/manual/pregap4_unix_50.html
# re_f = re.compile(r"(/1|\.f|\.[sfp]\d\w*)$")
# re_r = re.compile(r"(/2|\.r|\.[rq]\d\w*)$")
re_suffix = re.compile(r"(/1|\.f|\.[sfp]\d\w*|/2|\.r|\.[rq]\d\w*)$")
assert re_suffix.search("demo.f")
assert re_suffix.search("demo.s1")
assert re_suffix.search("demo.f1k")
assert re_suffix.search("demo.p1")
assert re_suffix.search("demo.p1k")
assert re_suffix.search("demo.p1lk")
assert re_suffix.search("demo/2")
assert re_suffix.search("demo.r")
assert re_suffix.search("demo.q1")
assert re_suffix.search("demo.q1lk")

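To show why ignoring these suffixes is useful, here is a small sketch that groups read names by their shared template name, so that both mates of a pair resolve to one identifier. The pattern is copied from `re_suffix` above; `template_name` and the sample read names are hypothetical.

```python
import re
from collections import defaultdict

# Same pattern as re_suffix in the listing above.
re_suffix = re.compile(r"(/1|\.f|\.[sfp]\d\w*|/2|\.r|\.[rq]\d\w*)$")


def template_name(read_name):
    """Return the read name with any recognised pair suffix removed."""
    return re_suffix.sub("", read_name)


# Made-up read names covering the three conventions, plus an unpaired name.
reads = ["frag1/1", "frag1/2", "clone7.f", "clone7.r",
         "est9.p1k", "est9.q1k", "single"]
by_template = defaultdict(list)
for read in reads:
    by_template[template_name(read)].append(read)
```

Each paired name ends up under a single key such as `frag1`, while a name without a recognised suffix is left unchanged.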
identifiers = []
for i in range(len(args) // 2):
    tabular_file = args[2 * i]
    cols_arg = args[2 * i + 1]
    if not os.path.isfile(tabular_file):
        sys.exit("Missing tabular identifier file %r" % tabular_file)
    try:
        columns = [int(arg) - 1 for arg in cols_arg.split(",")]
    except ValueError:
        sys.exit("Expected list of columns (comma separated integers), got %r" % cols_arg)
    if min(columns) < 0:
        sys.exit("Expect one-based column numbers (not zero-based counting), got %r" % cols_arg)
    identifiers.append((tabular_file, columns))

name_warn = False


def check_white_space(name):
    """Return the first word of name, noting a warning if anything was dropped."""
    parts = name.split(None, 1)
    global name_warn
    if not name_warn and len(parts) > 1:
        name_warn = "WARNING: Some of your identifiers had white space in them, " + \
            "using first word only. e.g.:\n%s\n" % name
    return parts[0]


if drop_suffices:
    def clean_name(name):
        """Remove suffix."""
        name = check_white_space(name)
        match = re_suffix.search(name)
        if match:
            # Use the fact this is a suffix, and regular expression will be
            # anchored to the end of the name:
            return name[:match.start()]
        else:
            # Nothing to do
            return name
    assert clean_name("foo/1") == "foo"
    assert clean_name("foo/2") == "foo"
    assert clean_name("bar.f") == "bar"
    assert clean_name("bar.r") == "bar"
    assert clean_name("baz.p1") == "baz"
    assert clean_name("baz.q2") == "baz"
else:
    # Just check the white space
    clean_name = check_white_space


mapped_chars = {
    '>': '__gt__',
    '<': '__lt__',
    "'": '__sq__',
    '"': '__dq__',
    '[': '__ob__',
    ']': '__cb__',
    '{': '__oc__',
    '}': '__cc__',
    '@': '__at__',
    '\n': '__cn__',
    '\r': '__cr__',
    '\t': '__tc__',
    '#': '__pd__',
}

6
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
189 # Read tabular file(s) and record all specified identifiers |
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
190 ids = None # Will be a set |
5
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
191 if options.id_list: |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
192 assert not identifiers |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
193 ids = set() |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
194 id_list = options.id_list |
6
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
195 # Galaxy turns \r into __cr__ (CR) etc |
5
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
196 for k in mapped_chars: |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
197 id_list = id_list.replace(mapped_chars[k], k) |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
198 for x in id_list.split(): |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
199 ids.add(clean_name(x.strip())) |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
200 print("Have %i unique identifiers from list" % len(ids)) |
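The block above reverses Galaxy's escaping of special characters in a text parameter before splitting it into identifiers. A minimal sketch of that step in isolation follows; the function name `decode_galaxy_text` and the sample input are hypothetical, and the table is copied from `mapped_chars` in the listing. Note that the sketch splits the decoded string, not the raw option value.

```python
# Escape table copied from mapped_chars in the listing above.
MAPPED_CHARS = {
    '>': '__gt__',
    '<': '__lt__',
    "'": '__sq__',
    '"': '__dq__',
    '[': '__ob__',
    ']': '__cb__',
    '{': '__oc__',
    '}': '__cc__',
    '@': '__at__',
    '\n': '__cn__',
    '\r': '__cr__',
    '\t': '__tc__',
    '#': '__pd__',
}


def decode_galaxy_text(text, mapping=MAPPED_CHARS):
    """Replace each escape token with the character it stands for.

    Hypothetical helper, not part of the original script.
    """
    for char, token in mapping.items():
        text = text.replace(token, char)
    return text


# Hypothetical escaped parameter value: newline and tab arrive as tokens.
raw = "seq1__cn__seq2__tc__seq3"
decoded = decode_galaxy_text(raw)

# Split the *decoded* text on white space to recover the identifiers.
ids = set(decoded.split())
```

Splitting the undecoded value would leave `seq1__cn__seq2__tc__seq3` as a single identifier, since it contains no real white space.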
3
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
201 for tabular_file, columns in identifiers: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
202 file_ids = set() |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
203 handle = open(tabular_file, "rU") |
6
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
204 if len(columns) > 1: |
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
205 # General case of many columns |
3
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
206 for line in handle: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
207 if line.startswith("#"): |
6
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
208 # Ignore comments |
3
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
209 continue |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
210 parts = line.rstrip("\n").split("\t") |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
211 for col in columns: |
7
fb1313d79396
Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents:
6
diff
changeset
|
212 name = clean_name(parts[col]) |
fb1313d79396
Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents:
6
diff
changeset
|
213 if name: |
fb1313d79396
Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents:
6
diff
changeset
|
214 file_ids.add(name) |
3
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
215 else: |
6
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
216 # Single column, special case speed up |
3
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
217 col = columns[0] |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
218 for line in handle: |
7
fb1313d79396
Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents:
6
diff
changeset
|
219 if not line.strip(): # skip empty lines |
6
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
220 continue |
3
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
221 if not line.startswith("#"): |
7
fb1313d79396
Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents:
6
diff
changeset
|
222 name = clean_name(line.rstrip("\n").split("\t")[col]) |
fb1313d79396
Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents:
6
diff
changeset
|
223 if name: |
fb1313d79396
Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents:
6
diff
changeset
|
224 file_ids.add(name) |
6
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
225 print("Using %i IDs from column %s in tabular file" % (len(file_ids), ", ".join(str(col + 1) for col in columns))) |
3
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
226 if ids is None: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
227 ids = file_ids |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
228 if logic == "UNION": |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
229 ids.update(file_ids) |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
230 else: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
231 ids.intersection_update(file_ids) |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
232 handle.close() |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
233 if len(identifiers) > 1: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
234 if logic == "UNION": |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
235 print("Have %i IDs combined from %i tabular files" % (len(ids), len(identifiers))) |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
236 else: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
237 print("Have %i IDs in common from %i tabular files" % (len(ids), len(identifiers))) |
5
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
238 if name_warn: |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
239 sys.stderr.write(name_warn) |
3
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
240 |
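The loop above merges the identifiers from each tabular file into one set, either as a union or as an intersection. A small sketch of that combining rule, assuming the per-file sets are already built: the function name `combine_id_sets` is hypothetical, and so is the label `"INTERSECTION"`, since the listing treats any value other than `"UNION"` as intersection.

```python
def combine_id_sets(id_sets, logic="UNION"):
    """Combine several sets of identifiers into one.

    Hypothetical helper: "UNION" keeps IDs seen in any set; any other
    value keeps only the IDs common to all sets.
    """
    combined = None
    for file_ids in id_sets:
        if combined is None:
            # First set seeds the result (copied so the input is untouched).
            combined = set(file_ids)
        elif logic == "UNION":
            combined.update(file_ids)
        else:
            combined.intersection_update(file_ids)
    return combined if combined is not None else set()


first = {"a", "b"}
second = {"b", "c"}
any_file = combine_id_sets([first, second], "UNION")
all_files = combine_id_sets([first, second], "INTERSECTION")
```

Here `any_file` holds the three IDs seen in either set, while `all_files` holds only `"b"`.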
6
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
241 |
3
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
242 def crude_fasta_iterator(handle): |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
243 """Yields tuples, record ID and the full record as a string.""" |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
244 while True: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
245 line = handle.readline() |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
246 if line == "": |
6
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
247 return # Premature end of file, or just empty? |
3
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
248 if line[0] == ">": |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
249 break |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
250 |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
251 no_id_warned = False |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
252 while True: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
253 if line[0] != ">": |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
254 raise ValueError( |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
255 "Records in Fasta files should start with '>' character") |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
256 try: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
257 id = line[1:].split(None, 1)[0] |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
258 except IndexError: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
259 if not no_id_warned: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
260 sys.stderr.write("WARNING - Malformed FASTA entry with no identifier\n") |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
261 no_id_warned = True |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
262 id = None |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
263 lines = [line] |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
264 line = handle.readline() |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
265 while True: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
266 if not line: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
267 break |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
268 if line[0] == ">": |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
269 break |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
270 lines.append(line) |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
271 line = handle.readline() |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
272 yield id, "".join(lines) |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
273 if not line: |
6
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
274 return # StopIteration |
3
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
275 |
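`crude_fasta_iterator` yields each record's identifier with the untouched record text, so a filter can write records back without re-wrapping the sequence. The sketch below is a simplified rewrite of that idea, not the function from the listing: the name `iter_fasta_records` and the sample data are hypothetical, and it omits the warning for entries with no identifier.

```python
import io


def iter_fasta_records(handle):
    """Yield (identifier, raw_record_text) pairs from a FASTA handle.

    Simplified, hypothetical variant: text before the first ">" line is
    skipped, and an empty header gives an identifier of None.
    """
    header = None
    lines = []
    for line in handle:
        if line.startswith(">"):
            if header is not None:
                yield _first_word(header), "".join(lines)
            header = line
            lines = [line]
        elif header is not None:
            lines.append(line)
    if header is not None:
        yield _first_word(header), "".join(lines)


def _first_word(header):
    """Return the first white-space-delimited word after ">", or None."""
    parts = header[1:].split(None, 1)
    return parts[0] if parts else None


example = ">seq1 first record\nACGT\nACGT\n>seq2\nTTTT\n"
records = list(iter_fasta_records(io.StringIO(example)))

# Keep only the wanted records, writing their text back unchanged.
wanted = {"seq2"}
kept = "".join(text for ident, text in records if ident in wanted)
```

Because each record is carried as raw text, the output keeps whatever line wrapping the input file used.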
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
276 |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
277 def fasta_filter(in_file, pos_file, neg_file, wanted): |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
278 """FASTA filter producing 60 character line wrapped output.""" |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
279 pos_count = neg_count = 0 |
6
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
280 # Galaxy now requires Python 2.5+ so we can use with statements. |
3
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
281 with open(in_file) as in_handle: |
6
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
282 # Doing the if statement outside the loop for speed |
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
283 # (with the downside of three very similar loops). |
5
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
284 if pos_file is not None and neg_file is not None: |
3
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
285 print("Generating two FASTA files") |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
286 with open(pos_file, "w") as pos_handle: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
287 with open(neg_file, "w") as neg_handle: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
288 for identifier, record in crude_fasta_iterator(in_handle): |
5
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
289 if clean_name(identifier) in wanted: |
3
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
290 pos_handle.write(record) |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
291 pos_count += 1 |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
292 else: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
293 neg_handle.write(record) |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
294 neg_count += 1 |
5
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
295 elif pos_file is not None: |
3
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
296 print("Generating matching FASTA file") |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
297 with open(pos_file, "w") as pos_handle: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
298 for identifier, record in crude_fasta_iterator(in_handle): |
5
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
299 if clean_name(identifier) in wanted: |
3
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
300 pos_handle.write(record) |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
301 pos_count += 1 |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
302 else: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
303 neg_count += 1 |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
304 else: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
305 print("Generating non-matching FASTA file") |
5
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
306 assert neg_file is not None |
3
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
307 with open(neg_file, "w") as neg_handle: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
308 for identifier, record in crude_fasta_iterator(in_handle): |
5
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
309 if clean_name(identifier) in wanted: |
3
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
310 pos_count += 1 |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
311 else: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
312 neg_handle.write(record) |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
313 neg_count += 1 |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
314 return pos_count, neg_count |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
315 |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
316 |
5
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
317 def fastq_filter(in_file, pos_file, neg_file, wanted): |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
318 """FASTQ filter.""" |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
319 from Bio.SeqIO.QualityIO import FastqGeneralIterator |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
320 handle = open(in_file, "r") |
6
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
321 if pos_file is not None and neg_file is not None: |
5
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
322 print("Generating two FASTQ files") |
6
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
323 positive_handle = open(pos_file, "w") |
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
324 negative_handle = open(neg_file, "w") |
5
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
325 print(in_file) |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
326 for title, seq, qual in FastqGeneralIterator(handle): |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
327 print("%s --> %s" % (title, clean_name(title.split(None, 1)[0]))) |
6
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
328 if clean_name(title.split(None, 1)[0]) in wanted: |
5
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
329 positive_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual)) |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
330 else: |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
331 negative_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual)) |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
332 positive_handle.close() |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
333 negative_handle.close() |
6
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
334 elif pos_file is not None: |
5
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
335 print("Generating matching FASTQ file") |
6
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
336 positive_handle = open(pos_file, "w") |
5
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
337 for title, seq, qual in FastqGeneralIterator(handle): |
6
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
338 if clean_name(title.split(None, 1)[0]) in wanted: |
5
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
339 positive_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual)) |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
340 positive_handle.close() |
6
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
341 elif neg_file is not None: |
5
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
342 print("Generating non-matching FASTQ file") |
6
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
343 negative_handle = open(neg_file, "w") |
5
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
344 for title, seq, qual in FastqGeneralIterator(handle): |
6
03e134cae41a
v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents:
5
diff
changeset
|
345 if clean_name(title.split(None, 1)[0]) not in wanted: |
5
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
346 negative_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual)) |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
347 negative_handle.close() |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
348 handle.close() |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
349 # This does not currently keep track of record counts (faster) |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
350 |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
351 |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
352 def sff_filter(in_file, pos_file, neg_file, wanted): |
832c1fd57852
v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents:
3
diff
changeset
|
353 """SFF filter.""" |
3
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
354 try: |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
355 from Bio.SeqIO.SffIO import SffIterator, SffWriter |
44ab4c0f7683
Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff
changeset
|
356 except ImportError: |
6
|
357 sys.exit("SFF filtering requires Biopython 1.54 or later") |
3
|
358 |
|
359 try: |
|
360 from Bio.SeqIO.SffIO import ReadRocheXmlManifest |
|
361 except ImportError: |
6
|
362 # Prior to Biopython 1.56 this was a private function |
3
|
363 from Bio.SeqIO.SffIO import _sff_read_roche_index_xml as ReadRocheXmlManifest |
5
|
364 |
6
|
365 in_handle = open(in_file, "rb") # must be binary mode! |
3
|
366 try: |
|
367 manifest = ReadRocheXmlManifest(in_handle) |
|
368 except ValueError: |
|
369 manifest = None |
5
|
370 |
6
|
371 # This makes two passes through the SFF file, which isn't so efficient, |
|
372 # but this makes the code simple. |
3
|
373 pos_count = neg_count = 0 |
6
|
374 if pos_file is not None: |
|
375 out_handle = open(pos_file, "wb") |
3
|
376 writer = SffWriter(out_handle, xml=manifest) |
6
|
377 in_handle.seek(0) # start again after getting manifest |
|
378 pos_count = writer.write_file(rec for rec in SffIterator(in_handle) if clean_name(rec.id) in wanted) |
3
|
379 out_handle.close() |
6
|
380 if neg_file is not None: |
|
381 out_handle = open(neg_file, "wb") |
3
|
382 writer = SffWriter(out_handle, xml=manifest) |
6
|
383 in_handle.seek(0) # start again |
|
384 neg_count = writer.write_file(rec for rec in SffIterator(in_handle) if clean_name(rec.id) not in wanted) |
3
|
385 out_handle.close() |
6
|
386 # And we're done |
3
|
387 in_handle.close() |
6
|
388 # At the time of writing, Galaxy doesn't show SFF file read counts, |
|
389 # so it is useful to put them in stdout and thus shown in job info. |
5
|
390 return pos_count, neg_count |
|
391 |
|
392 |
6
|
393 if seq_format.lower() == "sff": |
5
|
394 # Now write filtered SFF file based on IDs wanted |
|
395 pos_count, neg_count = sff_filter(in_file, out_positive_file, out_negative_file, ids) |
|
396 # At the time of writing, Galaxy doesn't show SFF file read counts, |
|
397 # so it is useful to put them in stdout and thus shown in job info. |
6
|
398 elif seq_format.lower() == "fasta": |
5
|
399 # Write filtered FASTA file based on IDs from tabular file |
3
|
400 pos_count, neg_count = fasta_filter(in_file, out_positive_file, out_negative_file, ids) |
|
401 print("%i with and %i without specified IDs" % (pos_count, neg_count)) |
|
402 elif seq_format.lower().startswith("fastq"): |
5
|
403 # Write filtered FASTQ file based on IDs from tabular file |
|
404 fastq_filter(in_file, out_positive_file, out_negative_file, ids) |
|
405 # This does not currently track the counts |
3
|
406 else: |
6
|
407 sys.exit("Unsupported file type %r" % seq_format) |
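
The comments on the FASTQ path note that it does not track record counts, unlike the FASTA and SFF paths. The following is a minimal sketch of how a counting variant might look. It is not the script's own implementation: `fastq_filter_counting` is a hypothetical name, the `clean_name` defined here is an identity stand-in for the script's real helper, and the reader assumes well-formed four-line FASTQ records rather than using Biopython's parser.

```python
def clean_name(name):
    """Hypothetical stand-in for the script's clean_name helper (identity here)."""
    return name


def fastq_filter_counting(in_file, pos_file, neg_file, wanted):
    """Split four-line FASTQ records by ID; return (pos_count, neg_count).

    Assumes each record is exactly four lines (no wrapped sequence lines).
    Either output filename may be None, in which case records are only counted.
    """
    pos_count = neg_count = 0
    pos_handle = open(pos_file, "w") if pos_file is not None else None
    neg_handle = open(neg_file, "w") if neg_file is not None else None
    try:
        with open(in_file) as handle:
            while True:
                title = handle.readline().rstrip("\n")
                if not title:
                    break  # end of file (or trailing blank line)
                seq = handle.readline().rstrip("\n")
                handle.readline()  # "+" separator line, not needed
                qual = handle.readline().rstrip("\n")
                record = "%s\n%s\n+\n%s\n" % (title, seq, qual)
                # Drop the leading "@"; the ID is the first word of the title
                if clean_name(title[1:].split(None, 1)[0]) in wanted:
                    pos_count += 1
                    if pos_handle is not None:
                        pos_handle.write(record)
                else:
                    neg_count += 1
                    if neg_handle is not None:
                        neg_handle.write(record)
    finally:
        for out_handle in (pos_handle, neg_handle):
            if out_handle is not None:
                out_handle.close()
    return pos_count, neg_count
```

With a function of this shape, the FASTQ branch could report counts on stdout in the same way as the FASTA branch, at the cost of a little extra bookkeeping per record.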