annotate tools/seq_filter_by_id/seq_filter_by_id.py @ 8:2d4537dbf0bc draft

v0.2.6 Depend on Biopython 1.67 from Tool Shed or (Bio)conda
author peterjc
date Wed, 10 May 2017 13:18:01 -0400
parents fb1313d79396
children 141612f8c3e3
rev   line source
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
1 #!/usr/bin/env python
2 """Filter a FASTA, FASTQ or SFF file with IDs from a tabular file.
3
4 Takes six command line options, tabular filename, ID column numbers (comma
5 separated list using one based counting), input filename, input type (e.g.
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
6 FASTA or SFF) and up to two output filenames (for records with and without
7 the given IDs, same format as input sequence file).
3
8
9 When filtering an SFF file, any Roche XML manifest in the input file is
10 preserved in both output files.
11
12 Note in the default NCBI BLAST+ tabular output, the query sequence ID is
13 in column one, and the ID of the match from the database is in column two.
14 Here sensible values for the column numbers would therefore be "1" or "2".
15
5
16 This tool is a short Python script which requires Biopython 1.54 or later.
17 If you use this tool in scientific work leading to a publication, please
18 cite the Biopython application note:
3
19
20 Cock et al 2009. Biopython: freely available Python tools for computational
21 molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
22 http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878.
23
24 This script is copyright 2010-2013 by Peter Cock, The James Hutton Institute
25 (formerly the Scottish Crop Research Institute, SCRI), UK. All rights reserved.
26 See accompanying text file for licence details (MIT licence).
27
5
28 Use -v or --version to get the version, -h or --help for help.
3
29 """
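As a rough illustration of the one-based column numbering described in the docstring above, the following sketch pulls query and subject IDs out of BLAST+ style tabular text. The helper name `ids_from_tabular` and the sample rows are invented for this example and are not part of the script.

```python
def ids_from_tabular(lines, one_based_columns):
    """Collect the IDs found in the given one-based columns (hypothetical helper)."""
    found = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        for col in one_based_columns:
            found.add(fields[col - 1])  # convert to a zero-based index
    return found


# Made-up rows in the default BLAST+ tabular layout (query ID, subject ID, ...)
sample = [
    "query1\tsubjectA\t98.5\n",
    "query2\tsubjectB\t91.0\n",
    "query1\tsubjectC\t88.2\n",
]
query_ids = ids_from_tabular(sample, [1])    # column one: query IDs
subject_ids = ids_from_tabular(sample, [2])  # column two: database match IDs
print(sorted(query_ids))
print(sorted(subject_ids))
```

Passing `[1]` would select sequences that were used as queries, while `[2]` would select the database entries they matched.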
8
2d4537dbf0bc v0.2.6 Depend on Biopython 1.67 from Tool Shed or (Bio)conda
peterjc
parents: 7
diff changeset
30
3
31 import os
8
32 import re
3
33 import sys
8
34
5
35 from optparse import OptionParser
3
36
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
37 # Parse Command Line
5
38 usage = """Use as follows:
39
40 $ python seq_filter_by_id.py [options] tab1 cols1 [, tab2 cols2, ...]
41
42 e.g. Positive matches using column one from tabular file:
43
44 $ seq_filter_by_id.py -i my_seqs.fastq -f fastq -p matches.fastq ids.tabular 1
45
46 Multiple tabular files and column numbers may be given, or replaced with
47 the -t or --text option.
48 """
49 parser = OptionParser(usage=usage)
50 parser.add_option('-i', '--input', dest='input',
51                   default=None, help='Input sequences filename',
52                   metavar="FILE")
53 parser.add_option('-f', '--format', dest='format',
54                   default=None,
55                   help='Input sequence format (e.g. fasta, fastq, sff)')
56 parser.add_option('-t', '--text', dest='id_list',
57                   default=None, help="Lists of white space separated IDs (instead of a tabular file)")
58 parser.add_option('-p', '--positive', dest='output_positive',
59                   default=None,
60                   help='Output filename for matches',
61                   metavar="FILE")
62 parser.add_option('-n', '--negative', dest='output_negative',
63                   default=None,
64                   help='Output filename for non-matches',
65                   metavar="FILE")
66 parser.add_option("-l", "--logic", dest="logic",
67                   default="UNION",
68                   help="How to combine multiple ID columns (UNION or INTERSECTION)")
69 parser.add_option("-s", "--suffix", dest="suffix",
70                   action="store_true",
71                   help="Ignore paired-read suffixes for matching names")
72 parser.add_option("-v", "--version", dest="version",
73                   default=False, action="store_true",
74                   help="Show version and quit")
75
76 options, args = parser.parse_args()
77
78 if options.version:
7
fb1313d79396 Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents: 6
diff changeset
79     print("v0.2.5")
3
80     sys.exit(0)
81
5
82 in_file = options.input
83 seq_format = options.format
84 out_positive_file = options.output_positive
85 out_negative_file = options.output_negative
86 logic = options.logic
87 drop_suffices = bool(options.suffix)
3
88
5
89 if in_file is None or not os.path.isfile(in_file):
6
90     sys.exit("Missing input file: %r" % in_file)
5
91 if out_positive_file is None and out_negative_file is None:
6
92     sys.exit("Neither output file requested")
5
93 if seq_format is None:
6
94     sys.exit("Missing sequence format")
5
95 if logic not in ["UNION", "INTERSECTION"]:
6
96     sys.exit("Logic argument should be 'UNION' or 'INTERSECTION', not %r" % logic)
5
97 if options.id_list and args:
7
98     sys.exit("Cannot accept IDs via both -t in the command line, and as tabular files")
5
99 elif not options.id_list and not args:
6
100     sys.exit("Expected matched pairs of tabular files and columns (or -t given)")
5
101 if len(args) % 2:
6
102     sys.exit("Expected matched pairs of tabular files and columns, not: %r" % args)
5
103
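The `--logic` option validated above accepts UNION or INTERSECTION. How the script applies it is not shown in this part of the listing, so the following is only a sketch of what combining several ID sets under each setting could look like. The function `combine_id_sets` and the sample IDs are hypothetical.

```python
def combine_id_sets(id_sets, logic="UNION"):
    """Combine several sets of IDs (hypothetical helper, not from the script)."""
    if logic not in ("UNION", "INTERSECTION"):
        raise ValueError("logic should be 'UNION' or 'INTERSECTION', not %r" % logic)
    combined = None
    for ids in id_sets:
        if combined is None:
            combined = set(ids)   # first set seeds the result
        elif logic == "UNION":
            combined |= set(ids)  # keep IDs seen in any column
        else:
            combined &= set(ids)  # keep only IDs seen in every column
    return combined if combined is not None else set()


col1 = {"seq1", "seq2", "seq3"}
col2 = {"seq2", "seq3", "seq4"}
union_ids = combine_id_sets([col1, col2], "UNION")
common_ids = combine_id_sets([col1, col2], "INTERSECTION")
print(sorted(union_ids))
print(sorted(common_ids))
```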
3
104
6
105 # Cope with three widely used suffix naming conventions,
106 # Illumina: /1 or /2
107 # Forward/reverse: .f or .r
108 # Sanger, e.g. .p1k and .q1k
109 # See http://staden.sourceforge.net/manual/pregap4_unix_50.html
110 # re_f = re.compile(r"(/1|\.f|\.[sfp]\d\w*)$")
111 # re_r = re.compile(r"(/2|\.r|\.[rq]\d\w*)$")
5
112 re_suffix = re.compile(r"(/1|\.f|\.[sfp]\d\w*|/2|\.r|\.[rq]\d\w*)$")
113 assert re_suffix.search("demo.f")
114 assert re_suffix.search("demo.s1")
115 assert re_suffix.search("demo.f1k")
116 assert re_suffix.search("demo.p1")
117 assert re_suffix.search("demo.p1k")
118 assert re_suffix.search("demo.p1lk")
119 assert re_suffix.search("demo/2")
120 assert re_suffix.search("demo.r")
121 assert re_suffix.search("demo.q1")
122 assert re_suffix.search("demo.q1lk")
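To make the effect of the suffix pattern concrete, this small sketch reuses the same regular expression to reduce both mates of a read pair to a shared template name. The helper `base_name` and the read names are illustrative only.

```python
import re

# Same pattern as re_suffix in the listing above
re_suffix = re.compile(r"(/1|\.f|\.[sfp]\d\w*|/2|\.r|\.[rq]\d\w*)$")


def base_name(read_id):
    """Return read_id without a trailing paired-read suffix (illustrative helper)."""
    return re_suffix.sub("", read_id)


pairs = [
    ("read7/1", "read7/2"),      # Illumina style
    ("clone3.f", "clone3.r"),    # forward/reverse style
    ("tmpl9.p1k", "tmpl9.q1k"),  # Sanger style
]
for forward, reverse in pairs:
    # Both mates reduce to the same template name
    print(forward, reverse, base_name(forward), base_name(reverse))
```

A name without a recognised suffix, such as `read7`, passes through unchanged because the pattern is anchored to the end of the string.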
3
123
124 identifiers = []
5
125 for i in range(len(args) // 2):
6
126     tabular_file = args[2 * i]
127     cols_arg = args[2 * i + 1]
3
128     if not os.path.isfile(tabular_file):
6
129         sys.exit("Missing tabular identifier file %r" % tabular_file)
3
130     try:
6
131         columns = [int(arg) - 1 for arg in cols_arg.split(",")]
3
132     except ValueError:
6
133         sys.exit("Expected list of columns (comma separated integers), got %r" % cols_arg)
3
134     if min(columns) < 0:
6
135         sys.exit("Expect one-based column numbers (not zero-based counting), got %r" % cols_arg)
3
136     identifiers.append((tabular_file, columns))
137
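The loop above walks the positional arguments two at a time and converts one-based column numbers to zero-based indices. A stand-alone sketch of the same idea follows. It raises exceptions instead of calling `sys.exit` and skips the file-existence check, and the names `parse_columns` and `pair_up` are made up for this example.

```python
def parse_columns(cols_arg):
    """Turn a string such as "1,3" into zero-based indices (illustrative helper)."""
    try:
        columns = [int(arg) - 1 for arg in cols_arg.split(",")]
    except ValueError:
        raise ValueError("Expected comma separated integers, got %r" % cols_arg)
    if min(columns) < 0:
        raise ValueError("Expected one-based column numbers, got %r" % cols_arg)
    return columns


def pair_up(args):
    """Group a flat [file, cols, file, cols, ...] list into (file, columns) pairs."""
    if len(args) % 2:
        raise ValueError("Expected matched pairs, not: %r" % (args,))
    return [(args[i], parse_columns(args[i + 1])) for i in range(0, len(args), 2)]


# Hypothetical file names, used only to show the resulting structure
pairs = pair_up(["blast.tabular", "1,2", "extra_ids.tabular", "1"])
print(pairs)
```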
5
138 name_warn = False
6
139
140
5
141 def check_white_space(name):
142     parts = name.split(None, 1)
143     global name_warn
144     if not name_warn and len(parts) > 1:
145         name_warn = "WARNING: Some of your identifiers had white space in them, " + \
146             "using first word only. e.g.:\n%s\n" % name
147     return parts[0]
148
8
149
5
150 if drop_suffices:
151     def clean_name(name):
152         """Remove suffix."""
153         name = check_white_space(name)
154         match = re_suffix.search(name)
155         if match:
156             # Use the fact this is a suffix, and regular expression will be
157             # anchored to the end of the name:
158             return name[:match.start()]
159         else:
160             # Nothing to do
161             return name
162     assert clean_name("foo/1") == "foo"
163     assert clean_name("foo/2") == "foo"
164     assert clean_name("bar.f") == "bar"
165     assert clean_name("bar.r") == "bar"
166     assert clean_name("baz.p1") == "baz"
167     assert clean_name("baz.q2") == "baz"
168 else:
169     # Just check the white space
170     clean_name = check_white_space
171
172
6
173 mapped_chars = {
174     '>': '__gt__',
175     '<': '__lt__',
176     "'": '__sq__',
177     '"': '__dq__',
178     '[': '__ob__',
179     ']': '__cb__',
180     '{': '__oc__',
181     '}': '__cc__',
182     '@': '__at__',
183     '\n': '__cn__',
184     '\r': '__cr__',
185     '\t': '__tc__',
186     '#': '__pd__',
7
187 }
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
188
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
189 # Read tabular file(s) and record all specified identifiers
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
190 ids = None # Will be a set
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
191 if options.id_list:
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
192 assert not identifiers
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
193 ids = set()
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
194 id_list = options.id_list
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
195 # Galaxy turns \r into __cr__ (CR) etc
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
196 for k in mapped_chars:
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
197 id_list = id_list.replace(mapped_chars[k], k)
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
198 for x in id_list.split():
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
199 ids.add(clean_name(x.strip()))
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
200 print("Have %i unique identifiers from list" % len(ids))
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
201 for tabular_file, columns in identifiers:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
202 file_ids = set()
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
203 handle = open(tabular_file, "rU")
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
204 if len(columns) > 1:
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
205 # General case of many columns
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
206 for line in handle:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
207 if line.startswith("#"):
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
208 # Ignore comments
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
209 continue
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
210 parts = line.rstrip("\n").split("\t")
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
211 for col in columns:
7
fb1313d79396 Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents: 6
diff changeset
212 name = clean_name(parts[col])
fb1313d79396 Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents: 6
diff changeset
213 if name:
fb1313d79396 Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents: 6
diff changeset
214 file_ids.add(name)
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
215 else:
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
216 # Single column, special case speed up
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
217 col = columns[0]
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
218 for line in handle:
7
fb1313d79396 Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents: 6
diff changeset
219 if not line.strip(): # skip empty lines
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
220 continue
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
221 if not line.startswith("#"):
7
fb1313d79396 Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents: 6
diff changeset
222 name = clean_name(line.rstrip("\n").split("\t")[col])
fb1313d79396 Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents: 6
diff changeset
223 if name:
fb1313d79396 Uploaded v0.2.5, ignore blank names in tabular files (based on contribution from Gildas Le Corguille)
peterjc
parents: 6
diff changeset
224 file_ids.add(name)
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
225 print("Using %i IDs from column %s in tabular file" % (len(file_ids), ", ".join(str(col + 1) for col in columns)))
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
226 if ids is None:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
227 ids = file_ids
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
228 if logic == "UNION":
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
229 ids.update(file_ids)
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
230 else:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
231 ids.intersection_update(file_ids)
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
232 handle.close()
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
233 if len(identifiers) > 1:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
234 if logic == "UNION":
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
235 print("Have %i IDs combined from %i tabular files" % (len(ids), len(identifiers)))
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
236 else:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
237 print("Have %i IDs in common from %i tabular files" % (len(ids), len(identifiers)))
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
238 if name_warn:
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
239 sys.stderr.write(name_warn)
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
240
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
241
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
242 def crude_fasta_iterator(handle):
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
243 """Yields tuples, record ID and the full record as a string."""
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
244 while True:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
245 line = handle.readline()
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
246 if line == "":
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
247 return # Premature end of file, or just empty?
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
248 if line[0] == ">":
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
249 break
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
250
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
251 no_id_warned = False
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
252 while True:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
253 if line[0] != ">":
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
254 raise ValueError(
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
255 "Records in Fasta files should start with '>' character")
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
256 try:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
257 id = line[1:].split(None, 1)[0]
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
258 except IndexError:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
259 if not no_id_warned:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
260 sys.stderr.write("WARNING - Malformed FASTA entry with no identifier\n")
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
261 no_id_warned = True
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
262 id = None
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
263 lines = [line]
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
264 line = handle.readline()
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
265 while True:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
266 if not line:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
267 break
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
268 if line[0] == ">":
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
269 break
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
270 lines.append(line)
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
271 line = handle.readline()
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
272 yield id, "".join(lines)
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
273 if not line:
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
274 return # StopIteration
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
275
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
276
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
277 def fasta_filter(in_file, pos_file, neg_file, wanted):
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
278 """FASTA filter producing 60 character line wrapped output."""
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
279 pos_count = neg_count = 0
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
280 # Galaxy now requires Python 2.5+ so we can use with statements.
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
281 with open(in_file) as in_handle:
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
282 # Doing the if statement outside the loop for speed
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
283 # (with the downside of three very similar loops).
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
284 if pos_file is not None and neg_file is not None:
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
285 print("Generating two FASTA files")
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
286 with open(pos_file, "w") as pos_handle:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
287 with open(neg_file, "w") as neg_handle:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
288 for identifier, record in crude_fasta_iterator(in_handle):
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
289 if clean_name(identifier) in wanted:
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
290 pos_handle.write(record)
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
291 pos_count += 1
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
292 else:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
293 neg_handle.write(record)
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
294 neg_count += 1
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
295 elif pos_file is not None:
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
296 print("Generating matching FASTA file")
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
297 with open(pos_file, "w") as pos_handle:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
298 for identifier, record in crude_fasta_iterator(in_handle):
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
299 if clean_name(identifier) in wanted:
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
300 pos_handle.write(record)
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
301 pos_count += 1
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
302 else:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
303 neg_count += 1
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
304 else:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
305 print("Generating non-matching FASTA file")
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
306 assert neg_file is not None
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
307 with open(neg_file, "w") as neg_handle:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
308 for identifier, record in crude_fasta_iterator(in_handle):
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
309 if clean_name(identifier) in wanted:
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
310 pos_count += 1
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
311 else:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
312 neg_handle.write(record)
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
313 neg_count += 1
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
314 return pos_count, neg_count
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
315
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
316
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
317 def fastq_filter(in_file, pos_file, neg_file, wanted):
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
318 """FASTQ filter."""
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
319 from Bio.SeqIO.QualityIO import FastqGeneralIterator
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
320 handle = open(in_file, "r")
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
321 if pos_file is not None and neg_file is not None:
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
322 print("Generating two FASTQ files")
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
323 positive_handle = open(pos_file, "w")
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
324 negative_handle = open(neg_file, "w")
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
325 print(in_file)
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
326 for title, seq, qual in FastqGeneralIterator(handle):
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
327 print("%s --> %s" % (title, clean_name(title.split(None, 1)[0])))
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
328 if clean_name(title.split(None, 1)[0]) in wanted:
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
329 positive_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
330 else:
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
331 negative_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
332 positive_handle.close()
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
333 negative_handle.close()
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
334 elif pos_file is not None:
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
335 print("Generating matching FASTQ file")
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
336 positive_handle = open(pos_file, "w")
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
337 for title, seq, qual in FastqGeneralIterator(handle):
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
338 if clean_name(title.split(None, 1)[0]) in wanted:
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
339 positive_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
340 positive_handle.close()
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
341 elif neg_file is not None:
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
342 print("Generating non-matching FASTQ file")
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
343 negative_handle = open(neg_file, "w")
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
344 for title, seq, qual in FastqGeneralIterator(handle):
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
345 if clean_name(title.split(None, 1)[0]) not in wanted:
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
346 negative_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
347 negative_handle.close()
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
348 handle.close()
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
349 # This does not currently bother to record record counts (faster)
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
350
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
351
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
352 def sff_filter(in_file, pos_file, neg_file, wanted):
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
353 """SFF filter."""
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
354 try:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
355 from Bio.SeqIO.SffIO import SffIterator, SffWriter
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
356 except ImportError:
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
357 sys.exit("SFF filtering requires Biopython 1.54 or later")
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
358
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
359 try:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
360 from Bio.SeqIO.SffIO import ReadRocheXmlManifest
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
361 except ImportError:
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
362 # Prior to Biopython 1.56 this was a private function
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
363 from Bio.SeqIO.SffIO import _sff_read_roche_index_xml as ReadRocheXmlManifest
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
364
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
365 in_handle = open(in_file, "rb") # must be binary mode!
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
366 try:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
367 manifest = ReadRocheXmlManifest(in_handle)
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
368 except ValueError:
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
369 manifest = None
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
370
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
371 # This makes two passes through the SFF file, which isn't so efficient,
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
372 # but this makes the code simple.
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
373 pos_count = neg_count = 0
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
374 if pos_file is not None:
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
375 out_handle = open(pos_file, "wb")
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
376 writer = SffWriter(out_handle, xml=manifest)
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
377 in_handle.seek(0) # start again after getting manifest
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
378 pos_count = writer.write_file(rec for rec in SffIterator(in_handle) if clean_name(rec.id) in wanted)
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
379 out_handle.close()
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
380 if neg_file is not None:
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
381 out_handle = open(neg_file, "wb")
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
382 writer = SffWriter(out_handle, xml=manifest)
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
383 in_handle.seek(0) # start again
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
384 neg_count = writer.write_file(rec for rec in SffIterator(in_handle) if clean_name(rec.id) not in wanted)
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
385 out_handle.close()
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
386 # And we're done
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
387 in_handle.close()
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
388 # At the time of writing, Galaxy doesn't show SFF file read counts,
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
389 # so it is useful to put them in stdout and thus shown in job info.
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
390 return pos_count, neg_count
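# Sketch only, not part of seq_filter_by_id.py: the comment in sff_filter
# notes that two passes over the input are simple but not efficient. A
# single-pass alternative, assuming the records fit in memory, could
# partition them as below. The names partition_by_id and get_id are
# hypothetical, and plain tuples stand in for Biopython SeqRecord objects.

```python
def partition_by_id(records, wanted, get_id=lambda rec: rec.id):
    """Split records into (matched, unmatched) lists in one pass.

    Hypothetical helper for illustration; it buffers every record,
    whereas the two-pass generator approach streams them.
    """
    matched, unmatched = [], []
    for rec in records:
        # Route each record by whether its identifier is wanted.
        if get_id(rec) in wanted:
            matched.append(rec)
        else:
            unmatched.append(rec)
    return matched, unmatched


# Usage with (id, sequence) tuples standing in for real records.
reads = [("r1", "ACGT"), ("r2", "GGCC"), ("r3", "TTAA")]
pos, neg = partition_by_id(reads, {"r1", "r3"}, get_id=lambda r: r[0])
pos_count, neg_count = len(pos), len(neg)
```

# The trade-off is memory: the two-pass version above never holds more than
# one record at a time, which matters for large SFF files.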
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
391
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
392
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
393 if seq_format.lower() == "sff":
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
394 # Now write filtered SFF file based on IDs wanted
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
395 pos_count, neg_count = sff_filter(in_file, out_positive_file, out_negative_file, ids)
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
396 # At the time of writing, Galaxy doesn't show SFF file read counts,
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
397 # so it is useful to put them in stdout and thus shown in job info.
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
398 elif seq_format.lower() == "fasta":
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
399 # Write filtered FASTA file based on IDs from tabular file
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
400 pos_count, neg_count = fasta_filter(in_file, out_positive_file, out_negative_file, ids)
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
401 print("%i with and %i without specified IDs" % (pos_count, neg_count))
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
402 elif seq_format.lower().startswith("fastq"):
5
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
403 # Write filtered FASTQ file based on IDs from tabular file
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
404 fastq_filter(in_file, out_positive_file, out_negative_file, ids)
832c1fd57852 v0.2.2; New options for IDs via text parameter, ignore paired read suffix; misc changes
peterjc
parents: 3
diff changeset
405 # This does not currently track the counts
3
44ab4c0f7683 Uploaded v0.0.6, automatic dependency on Biopython 1.62, new README file, citation information, MIT licence
peterjc
parents:
diff changeset
406 else:
6
03e134cae41a v0.2.3, ignore blank lines in ID file (contributed by Gildas Le Corguille)
peterjc
parents: 5
diff changeset
407 sys.exit("Unsupported file type %r" % seq_format)