annotate tools/sample_seqs/sample_seqs.py @ 10:bdaefd241921 draft default tip

Remove legacy tool_dependencies.xml
author peterjc
date Thu, 30 Nov 2023 09:59:08 +0000
parents 5f505ed46e16
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
1 #!/usr/bin/env python
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
2 """Sub-sample sequence from a FASTA, FASTQ or SFF file.
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
3
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
4 This tool is a short Python script which requires Biopython 1.62 or later
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
5 for sequence parsing. If you use this tool in scientific work leading to a
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
6 publication, please cite the Biopython application note:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
7
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
8 Cock et al 2009. Biopython: freely available Python tools for computational
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
9 molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
9
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
10 https://doi.org/10.1093/bioinformatics/btp163 pmid:19304878.
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
11
9
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
12 This script is copyright 2014-2021 by Peter Cock, The James Hutton Institute
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
13 (formerly the Scottish Crop Research Institute, SCRI), UK. All rights reserved.
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
14 See accompanying text file for licence details (MIT license).
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
15
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
16 Use -v or --version to get the version, -h or --help for help.
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
17 """
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
18 import os
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
19 import sys
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
20 from optparse import OptionParser
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
21
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
22 # Parse Command Line
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
23 usage = """Use as follows:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
24
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
25 $ python sample_seqs.py [options]
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
26
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
27 e.g. Sample 20% of the reads:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
28
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
29 $ python sample_seqs.py -i my_seq.fastq -f fastq -p 20.0 -o sample.fastq
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
30
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
31 This samples uniformly though the file, rather than at random, and therefore
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
32 should be reproducible.
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
33
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
34 If you have interleaved paired reads, use the --interleaved switch. If
9
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
35 instead you have two matched files (one for each pair), run this tool
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
36 on each with the same sampling options to make two matched smaller files.
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
37 """
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
38 parser = OptionParser(usage=usage)
9
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
39 parser.add_option(
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
40 "-i",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
41 "--input",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
42 dest="input",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
43 default=None,
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
44 help="Input sequences filename",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
45 metavar="FILE",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
46 )
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
47 parser.add_option(
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
48 "-f",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
49 "--format",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
50 dest="format",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
51 default=None,
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
52 help="Input sequence format (e.g. fasta, fastq, sff)",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
53 )
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
54 parser.add_option(
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
55 "-o",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
56 "--output",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
57 dest="output",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
58 default=None,
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
59 help="Output sampled sequenced filename",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
60 metavar="FILE",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
61 )
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
62 parser.add_option(
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
63 "-p",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
64 "--percent",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
65 dest="percent",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
66 default=None,
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
67 help="Take this percent of the reads",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
68 )
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
69 parser.add_option(
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
70 "-n", "--everyn", dest="everyn", default=None, help="Take every N-th read"
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
71 )
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
72 parser.add_option(
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
73 "-c", "--count", dest="count", default=None, help="Take exactly N reads"
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
74 )
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
75 parser.add_option(
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
76 "--interleaved",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
77 dest="interleaved",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
78 default=False,
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
79 action="store_true",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
80 help="Input is interleaved reads, preserve the pairings",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
81 )
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
82 parser.add_option(
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
83 "-v",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
84 "--version",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
85 dest="version",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
86 default=False,
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
87 action="store_true",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
88 help="Show version and quit",
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
89 )
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
90 options, args = parser.parse_args()
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
91
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
92 if options.version:
6
31f5701cd2e9 v0.2.4 Depends on Biopython 1.67 via legacy Tool Shed package or bioconda.
peterjc
parents: 5
diff changeset
93 print("v0.2.4")
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
94 sys.exit(0)
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
95
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
96 try:
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
97 from Bio import SeqIO
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
98 from Bio.SeqIO.QualityIO import FastqGeneralIterator
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
99 from Bio.SeqIO.FastaIO import SimpleFastaParser
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
100 from Bio.SeqIO.SffIO import SffIterator, SffWriter
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
101 except ImportError:
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
102 sys.exit("This script requires Biopython.")
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
103
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
104 in_file = options.input
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
105 out_file = options.output
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
106 interleaved = options.interleaved
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
107
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
108 if not in_file:
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
109 sys.exit("Require an input filename")
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
110 if in_file != "/dev/stdin" and not os.path.isfile(in_file):
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
111 sys.exit("Missing input file %r" % in_file)
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
112 if not out_file:
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
113 sys.exit("Require an output filename")
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
114 if not options.format:
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
115 sys.exit("Require the sequence format")
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
116 seq_format = options.format.lower()
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
117
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
118
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
119 def count_fasta(filename):
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
120 count = 0
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
121 with open(filename) as handle:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
122 for title, seq in SimpleFastaParser(handle):
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
123 count += 1
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
124 return count
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
125
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
126
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
127 def count_fastq(filename):
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
128 count = 0
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
129 with open(filename) as handle:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
130 for title, seq, qual in FastqGeneralIterator(handle):
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
131 count += 1
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
132 return count
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
133
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
134
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
135 def count_sff(filename):
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
136 # If the SFF file has a built in index (which is normal),
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
137 # this will be parsed and is the quicker than scanning
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
138 # the whole file.
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
139 return len(SeqIO.index(filename, "sff"))
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
140
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
141
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
142 def count_sequences(filename, format):
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
143 if format == "sff":
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
144 return count_sff(filename)
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
145 elif format == "fasta":
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
146 return count_fasta(filename)
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
147 elif format.startswith("fastq"):
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
148 return count_fastq(filename)
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
149 else:
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
150 sys.exit("Unsupported file type %r" % format)
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
151
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
152
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
153 if options.percent and options.everyn:
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
154 sys.exit("Cannot combine -p and -n options")
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
155 elif options.everyn and options.count:
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
156 sys.exit("Cannot combine -p and -c options")
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
157 elif options.percent and options.count:
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
158 sys.exit("Cannot combine -n and -c options")
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
159 elif options.everyn:
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
160 try:
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
161 N = int(options.everyn)
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
162 except ValueError:
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
163 sys.exit("Bad -n argument %r" % options.everyn)
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
164 if N < 2:
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
165 sys.exit("Bad -n argument %r" % options.everyn)
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
166 if (N % 10) == 1:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
167 sys.stderr.write("Sampling every %ist sequence\n" % N)
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
168 elif (N % 10) == 2:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
169 sys.stderr.write("Sampling every %ind sequence\n" % N)
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
170 elif (N % 10) == 3:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
171 sys.stderr.write("Sampling every %ird sequence\n" % N)
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
172 else:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
173 sys.stderr.write("Sampling every %ith sequence\n" % N)
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
174
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
175 def sampler(iterator):
6
31f5701cd2e9 v0.2.4 Depends on Biopython 1.67 via legacy Tool Shed package or bioconda.
peterjc
parents: 5
diff changeset
176 """Sample every Nth sequence."""
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
177 global N
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
178 count = 0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
179 for record in iterator:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
180 count += 1
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
181 if count % N == 1:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
182 yield record
9
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
183
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
184 elif options.percent:
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
185 try:
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
186 percent = float(options.percent) / 100.0
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
187 except ValueError:
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
188 sys.exit("Bad -p percent argument %r" % options.percent)
9
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
189 if not (0.0 <= percent <= 1.0):
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
190 sys.exit("Bad -p percent argument %r" % options.percent)
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
191 sys.stderr.write("Sampling %0.3f%% of sequences\n" % (100.0 * percent))
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
192
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
193 def sampler(iterator):
6
31f5701cd2e9 v0.2.4 Depends on Biopython 1.67 via legacy Tool Shed package or bioconda.
peterjc
parents: 5
diff changeset
194 """Sample given percentage of sequences."""
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
195 global percent
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
196 count = 0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
197 taken = 0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
198 for record in iterator:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
199 count += 1
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
200 if percent * count > taken:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
201 taken += 1
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
202 yield record
9
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
203
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
204 elif options.count:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
205 try:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
206 N = int(options.count)
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
207 except ValueError:
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
208 sys.exit("Bad -c count argument %r" % options.count)
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
209 if N < 1:
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
210 sys.exit("Bad -c count argument %r" % options.count)
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
211 total = count_sequences(in_file, seq_format)
3
02c13ef1a669 Uploaded v0.2.1, fixed missing test file, more tests.
peterjc
parents: 2
diff changeset
212 sys.stderr.write("Input file has %i sequences\n" % total)
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
213 if interleaved:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
214 # Paired
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
215 if total % 2:
9
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
216 sys.exit(
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
217 "Paired mode, but input file has an odd number of sequences: %i" % total
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
218 )
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
219 elif N > total // 2:
9
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
220 sys.exit(
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
221 "Requested %i sequence pairs, "
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
222 "but file only has %i pairs (%i sequences)." % (N, total // 2, total)
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
223 )
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
224 total = total // 2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
225 if N == 1:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
226 sys.stderr.write("Sampling just first sequence pair!\n")
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
227 elif N == total:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
228 sys.stderr.write("Taking all the sequence pairs\n")
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
229 else:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
230 sys.stderr.write("Sampling %i sequence pairs\n" % N)
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
231 else:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
232 # Not paired
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
233 if total < N:
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
234 sys.exit("Requested %i sequences, but file only has %i." % (N, total))
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
235 if N == 1:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
236 sys.stderr.write("Sampling just first sequence!\n")
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
237 elif N == total:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
238 sys.stderr.write("Taking all the sequences\n")
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
239 else:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
240 sys.stderr.write("Sampling %i sequences\n" % N)
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
241 if N == total:
9
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
242
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
243 def sampler(iterator):
9
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
244 """No-operation dummy filter, taking everything."""
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
245 global N
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
246 taken = 0
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
247 for record in iterator:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
248 taken += 1
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
249 yield record
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
250 assert taken == N, "Picked %i, wanted %i" % (taken, N)
9
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
251
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
252 else:
9
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
253
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
254 def sampler(iterator):
6
31f5701cd2e9 v0.2.4 Depends on Biopython 1.67 via legacy Tool Shed package or bioconda.
peterjc
parents: 5
diff changeset
255 """Sample given number of sequences."""
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
256 # Mimic the percentage sampler, with double check on final count
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
257 global N, total
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
258 # Do we need a floating point fudge factor epsilon?
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
259 # i.e. What if percentage comes out slighty too low, and
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
260 # we could end up missing last few desired sequences?
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
261 percentage = float(N) / float(total)
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
262 # print("DEBUG: Want %i out of %i sequences/pairs, as a percentage %0.2f"
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
263 # % (N, total, percentage * 100.0))
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
264 count = 0
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
265 taken = 0
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
266 for record in iterator:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
267 count += 1
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
268 # Do we need the extra upper bound?
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
269 if percentage * count > taken and taken < N:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
270 taken += 1
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
271 yield record
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
272 elif total - count + 1 <= N - taken:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
273 # remaining records (incuding this one) <= what we still need.
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
274 # This is a safey check for floating point edge cases where
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
275 # we need to take all remaining sequences to meet target
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
276 taken += 1
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
277 yield record
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
278 assert taken == N, "Picked %i, wanted %i" % (taken, N)
9
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
279
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
280 else:
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
281 sys.exit("Must use either -n, -p or -c")
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
282
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
283
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
284 def pair(iterator):
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
285 """Quick and dirty pair batched iterator."""
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
286 while True:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
287 a = next(iterator)
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
288 b = next(iterator)
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
289 if not b:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
290 assert not a, "Odd number of records?"
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
291 break
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
292 yield (a, b)
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
293
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
294
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
295 def raw_fasta_iterator(handle):
9
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
296 """Yield raw FASTA records as multi-line strings."""
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
297 while True:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
298 line = handle.readline()
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
299 if line == "":
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
300 return # Premature end of file, or just empty?
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
301 if line[0] == ">":
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
302 break
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
303
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
304 no_id_warned = False
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
305 while True:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
306 if line[0] != ">":
9
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
307 raise ValueError("Records in Fasta files should start with '>' character")
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
308 try:
6
31f5701cd2e9 v0.2.4 Depends on Biopython 1.67 via legacy Tool Shed package or bioconda.
peterjc
parents: 5
diff changeset
309 line[1:].split(None, 1)[0]
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
310 except IndexError:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
311 if not no_id_warned:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
312 sys.stderr.write("WARNING - Malformed FASTA entry with no identifier\n")
6
31f5701cd2e9 v0.2.4 Depends on Biopython 1.67 via legacy Tool Shed package or bioconda.
peterjc
parents: 5
diff changeset
313 no_id_warned = True
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
314 lines = [line]
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
315 line = handle.readline()
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
316 while True:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
317 if not line:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
318 break
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
319 if line[0] == ">":
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
320 break
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
321 lines.append(line)
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
322 line = handle.readline()
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
323 yield "".join(lines)
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
324 if not line:
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
325 return # StopIteration
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
326
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
327
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
328 def fasta_filter(in_file, out_file, iterator_filter, inter):
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
329 count = 0
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
330 # Galaxy now requires Python 2.5+ so can use with statements,
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
331 with open(in_file) as in_handle:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
332 with open(out_file, "w") as pos_handle:
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
333 if inter:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
334 for r1, r2 in iterator_filter(pair(raw_fasta_iterator(in_handle))):
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
335 count += 1
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
336 pos_handle.write(r1)
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
337 pos_handle.write(r2)
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
338 else:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
339 for record in iterator_filter(raw_fasta_iterator(in_handle)):
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
340 count += 1
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
341 pos_handle.write(record)
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
342 return count
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
343
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
344
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
345 def fastq_filter(in_file, out_file, iterator_filter, inter):
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
346 count = 0
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
347 with open(in_file) as in_handle:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
348 with open(out_file, "w") as pos_handle:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
349 if inter:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
350 for r1, r2 in iterator_filter(pair(FastqGeneralIterator(in_handle))):
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
351 count += 1
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
352 pos_handle.write("@%s\n%s\n+\n%s\n" % r1)
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
353 pos_handle.write("@%s\n%s\n+\n%s\n" % r2)
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
354 else:
9
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
355 for title, seq, qual in iterator_filter(
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
356 FastqGeneralIterator(in_handle)
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
357 ):
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
358 count += 1
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
359 pos_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
360 return count
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
361
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
362
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
363 def sff_filter(in_file, out_file, iterator_filter, inter):
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
364 count = 0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
365 try:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
366 from Bio.SeqIO.SffIO import ReadRocheXmlManifest
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
367 except ImportError:
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
368 # Prior to Biopython 1.56 this was a private function
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
369 from Bio.SeqIO.SffIO import _sff_read_roche_index_xml as ReadRocheXmlManifest
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
370 with open(in_file, "rb") as in_handle:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
371 try:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
372 manifest = ReadRocheXmlManifest(in_handle)
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
373 except ValueError:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
374 manifest = None
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
375 in_handle.seek(0)
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
376 with open(out_file, "wb") as out_handle:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
377 writer = SffWriter(out_handle, xml=manifest)
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
378 in_handle.seek(0) # start again after getting manifest
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
379 if inter:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
380 from itertools import chain
9
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
381
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
382 count = writer.write_file(
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
383 chain.from_iterable(iterator_filter(pair(SffIterator(in_handle))))
5f505ed46e16 Bump Biopython dependency
peterjc
parents: 6
diff changeset
384 )
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
385 assert count % 2 == 0, "Odd number of records? %i" % count
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
386 count /= 2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
387 else:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
388 count = writer.write_file(iterator_filter(SffIterator(in_handle)))
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
389 return count
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
390
6
31f5701cd2e9 v0.2.4 Depends on Biopython 1.67 via legacy Tool Shed package or bioconda.
peterjc
parents: 5
diff changeset
391
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
392 if seq_format == "sff":
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
393 count = sff_filter(in_file, out_file, sampler, interleaved)
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
394 elif seq_format == "fasta":
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
395 count = fasta_filter(in_file, out_file, sampler, interleaved)
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
396 elif seq_format.startswith("fastq"):
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
397 count = fastq_filter(in_file, out_file, sampler, interleaved)
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
398 else:
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
399 sys.exit("Unsupported file type %r" % seq_format)
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
400
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
401 if interleaved:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
402 sys.stderr.write("Selected %i pairs\n" % count)
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
403 else:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
404 sys.stderr.write("Selected %i records\n" % count)