annotate tools/sample_seqs/sample_seqs.py @ 6:31f5701cd2e9 draft

v0.2.4 Depends on Biopython 1.67 via legacy Tool Shed package or bioconda.
author peterjc
date Thu, 11 May 2017 07:24:38 -0400
parents 6b71ad5d43fb
children 5f505ed46e16
rev   changeset     description
0     3a807e5ea6c8  Uploaded v0.0.1
2     da64f6a9e32b  Uploaded v0.2.0, adds desired count mode
3     02c13ef1a669  Uploaded v0.2.1, fixed missing test file, more tests.
5     6b71ad5d43fb  v0.2.3 clarified help, internal cleanup of Python script
6     31f5701cd2e9  v0.2.4 Depends on Biopython 1.67 via legacy Tool Shed package or bioconda.

#!/usr/bin/env python
"""Sub-sample sequences from a FASTA, FASTQ or SFF file.

This tool is a short Python script which requires Biopython 1.62 or later
for sequence parsing. If you use this tool in scientific work leading to a
publication, please cite the Biopython application note:

Cock et al 2009. Biopython: freely available Python tools for computational
molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878.

This script is copyright 2014-2015 by Peter Cock, The James Hutton Institute
(formerly the Scottish Crop Research Institute, SCRI), UK. All rights reserved.
See accompanying text file for licence details (MIT license).

Use -v or --version to get the version, -h or --help for help.
"""
import os
import sys
from optparse import OptionParser

# Parse Command Line
usage = """Use as follows:

$ python sample_seqs.py [options]

e.g. Sample 20% of the reads:

$ python sample_seqs.py -i my_seq.fastq -f fastq -p 20.0 -o sample.fastq

This samples uniformly through the file, rather than at random, and therefore
should be reproducible.

If you have interleaved paired reads, use the --interleaved switch. If
instead you have two matched files (one for each pair), run the tool
twice with the same sampling options to make two matched smaller files.
"""
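
The uniform sampling described in the usage text can be tried in isolation. The following is a minimal sketch rather than the script itself: `take_fraction` is a hypothetical name, and it accepts any iterable instead of parsed sequence records. A record is kept whenever the running target (fraction times records seen) exceeds the number already taken, so the same input always gives the same subset.

```python
def take_fraction(records, fraction):
    """Yield roughly `fraction` of the records, spaced evenly through the input."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    seen = 0
    taken = 0
    for record in records:
        seen += 1
        # Keep this record if we have fallen behind the running target.
        if fraction * seen > taken:
            taken += 1
            yield record


# Illustrative data only: ten made-up read names, sampled at 20%.
reads = ["read%i" % i for i in range(1, 11)]
sample = list(take_fraction(reads, 0.2))
print(sample)
```

Because no random numbers are involved, running the same sampling over two matched files should pick records at the same positions in both.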
parser = OptionParser(usage=usage)
parser.add_option('-i', '--input', dest='input',
                  default=None, help='Input sequences filename',
                  metavar="FILE")
parser.add_option('-f', '--format', dest='format',
                  default=None,
                  help='Input sequence format (e.g. fasta, fastq, sff)')
parser.add_option('-o', '--output', dest='output',
                  default=None, help='Output sampled sequences filename',
                  metavar="FILE")
parser.add_option('-p', '--percent', dest='percent',
                  default=None,
                  help='Take this percent of the reads')
parser.add_option('-n', '--everyn', dest='everyn',
                  default=None,
                  help='Take every N-th read')
parser.add_option('-c', '--count', dest='count',
                  default=None,
                  help='Take exactly N reads')
parser.add_option("--interleaved", dest="interleaved",
                  default=False, action="store_true",
                  help="Input is interleaved reads, preserve the pairings")
parser.add_option("-v", "--version", dest="version",
                  default=False, action="store_true",
                  help="Show version and quit")
options, args = parser.parse_args()

if options.version:
    print("v0.2.4")
    sys.exit(0)

try:
    from Bio import SeqIO
    from Bio.SeqIO.QualityIO import FastqGeneralIterator
    from Bio.SeqIO.FastaIO import SimpleFastaParser
    from Bio.SeqIO.SffIO import SffIterator, SffWriter
except ImportError:
    sys.exit("This script requires Biopython.")

in_file = options.input
out_file = options.output
interleaved = options.interleaved

if not in_file:
    sys.exit("Require an input filename")
if in_file != "/dev/stdin" and not os.path.isfile(in_file):
    sys.exit("Missing input file %r" % in_file)
if not out_file:
    sys.exit("Require an output filename")
if not options.format:
    sys.exit("Require the sequence format")
seq_format = options.format.lower()


def count_fasta(filename):
    """Count the records in a FASTA file."""
    count = 0
    with open(filename) as handle:
        for title, seq in SimpleFastaParser(handle):
            count += 1
    return count


def count_fastq(filename):
    """Count the records in a FASTQ file."""
    count = 0
    with open(filename) as handle:
        for title, seq, qual in FastqGeneralIterator(handle):
            count += 1
    return count


def count_sff(filename):
    """Count the reads in an SFF file."""
    # If the SFF file has a built in index (which is normal),
    # this will be parsed and is quicker than scanning
    # the whole file.
    return len(SeqIO.index(filename, "sff"))


def count_sequences(filename, format):
    """Count the sequences in a file of the given format."""
    if format == "sff":
        return count_sff(filename)
    elif format == "fasta":
        return count_fasta(filename)
    elif format.startswith("fastq"):
        return count_fastq(filename)
    else:
        sys.exit("Unsupported file type %r" % format)


if options.percent and options.everyn:
    sys.exit("Cannot combine -p and -n options")
elif options.everyn and options.count:
    sys.exit("Cannot combine -n and -c options")
elif options.percent and options.count:
    sys.exit("Cannot combine -p and -c options")
elif options.everyn:
    try:
        N = int(options.everyn)
    except ValueError:
        sys.exit("Bad -n argument %r" % options.everyn)
    if N < 2:
        sys.exit("Bad -n argument %r" % options.everyn)
    # Pick the English ordinal suffix; 11, 12 and 13 take "th", not "st/nd/rd".
    if (N % 10) == 1 and (N % 100) != 11:
        sys.stderr.write("Sampling every %ist sequence\n" % N)
    elif (N % 10) == 2 and (N % 100) != 12:
        sys.stderr.write("Sampling every %ind sequence\n" % N)
    elif (N % 10) == 3 and (N % 100) != 13:
        sys.stderr.write("Sampling every %ird sequence\n" % N)
    else:
        sys.stderr.write("Sampling every %ith sequence\n" % N)

    def sampler(iterator):
        """Sample every Nth sequence."""
        global N
        count = 0
        for record in iterator:
            count += 1
            if count % N == 1:
                yield record
elif options.percent:
    try:
        percent = float(options.percent) / 100.0
    except ValueError:
        sys.exit("Bad -p percent argument %r" % options.percent)
    if not (0.0 <= percent <= 1.0):
        sys.exit("Bad -p percent argument %r" % options.percent)
    sys.stderr.write("Sampling %0.3f%% of sequences\n" % (100.0 * percent))

    def sampler(iterator):
        """Sample given percentage of sequences."""
        global percent
        count = 0
        taken = 0
        for record in iterator:
            count += 1
            if percent * count > taken:
                taken += 1
                yield record
elif options.count:
    try:
        N = int(options.count)
    except ValueError:
        sys.exit("Bad -c count argument %r" % options.count)
    if N < 1:
        sys.exit("Bad -c count argument %r" % options.count)
    total = count_sequences(in_file, seq_format)
    sys.stderr.write("Input file has %i sequences\n" % total)
    if interleaved:
        # Paired
        if total % 2:
            sys.exit("Paired mode, but input file has an odd number of sequences: %i"
                     % total)
        elif N > total // 2:
            sys.exit("Requested %i sequence pairs, but file only has %i pairs (%i sequences)."
                     % (N, total // 2, total))
        total = total // 2
        if N == 1:
            sys.stderr.write("Sampling just first sequence pair!\n")
        elif N == total:
            sys.stderr.write("Taking all the sequence pairs\n")
        else:
            sys.stderr.write("Sampling %i sequence pairs\n" % N)
    else:
        # Not paired
        if total < N:
            sys.exit("Requested %i sequences, but file only has %i." % (N, total))
        if N == 1:
            sys.stderr.write("Sampling just first sequence!\n")
        elif N == total:
            sys.stderr.write("Taking all the sequences\n")
        else:
            sys.stderr.write("Sampling %i sequences\n" % N)
    if N == total:
        def sampler(iterator):
            """Dummy filter to filter nothing, taking everything."""
            global N
            taken = 0
            for record in iterator:
                taken += 1
                yield record
            assert taken == N, "Picked %i, wanted %i" % (taken, N)
    else:
        def sampler(iterator):
            """Sample given number of sequences."""
            # Mimic the percentage sampler, with double check on final count
            global N, total
            # Do we need a floating point fudge factor epsilon?
            # i.e. What if percentage comes out slightly too low, and
            # we could end up missing last few desired sequences?
            percentage = float(N) / float(total)
            # print("DEBUG: Want %i out of %i sequences/pairs, as a percentage %0.2f"
            #       % (N, total, percentage * 100.0))
            count = 0
            taken = 0
            for record in iterator:
                count += 1
                # Do we need the extra upper bound?
                if percentage * count > taken and taken < N:
                    taken += 1
                    yield record
                elif total - count + 1 <= N - taken:
                    # remaining records (including this one) <= what we still need.
                    # This is a safety check for floating point edge cases where
                    # we need to take all remaining sequences to meet target
                    taken += 1
                    yield record
            assert taken == N, "Picked %i, wanted %i" % (taken, N)
else:
    sys.exit("Must use either -n, -p or -c")


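The count mode above raises the question of whether a floating point fudge factor is needed. One way to sidestep it, shown here only as a sketch and not as the script's own method, is to compare integers instead of a floating point ratio. The name `take_exactly` and its arguments are hypothetical; `total` is assumed to be the true number of records in the input.

```python
def take_exactly(records, wanted, total):
    """Yield exactly `wanted` of `total` records, spread evenly through the input."""
    if not 0 <= wanted <= total:
        raise ValueError("wanted must be between 0 and total")
    seen = 0
    taken = 0
    for record in records:
        seen += 1
        # Integer form of (wanted / total) * seen > taken, so no rounding error.
        if wanted * seen > taken * total:
            taken += 1
            yield record
    if taken != wanted:
        # Only reachable if the input did not really hold `total` records.
        raise ValueError("Picked %i, wanted %i" % (taken, wanted))


# Illustrative data only: pick 3 of the integers 1..10.
picked = list(take_exactly(range(1, 11), 3, 10))
print(picked)
```

With exact integer comparisons the running count always equals the rounded-up target, so the final total matches `wanted` without a separate catch-up branch.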
def pair(iterator):
    """Quick and dirty pair batched iterator."""
    while True:
        try:
            a = next(iterator)
        except StopIteration:
            # Clean end of input. A StopIteration escaping a generator
            # becomes a RuntimeError on Python 3.7+ (PEP 479).
            return
        try:
            b = next(iterator)
        except StopIteration:
            raise ValueError("Odd number of records?")
        yield (a, b)


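To illustrate why batching records into pairs preserves the pairings of interleaved reads, here is a small self-contained sketch. The names `pair_up` and `every_nth` and the read names are made up for the example; it assumes that sampling is applied to whole pairs and that the kept pairs are then flattened back into single reads.

```python
def pair_up(records):
    """Yield consecutive records as (first, second) tuples."""
    iterator = iter(records)
    for first in iterator:
        try:
            second = next(iterator)
        except StopIteration:
            raise ValueError("Odd number of records")
        yield first, second


def every_nth(items, n):
    """Yield the 1st, (n+1)th, (2n+1)th, ... item."""
    for index, item in enumerate(items):
        if index % n == 0:
            yield item


# Illustrative data only: four interleaved read pairs.
interleaved_reads = ["r1/1", "r1/2", "r2/1", "r2/2",
                     "r3/1", "r3/2", "r4/1", "r4/2"]
# Sample every 2nd pair, then flatten the kept pairs back into reads.
kept = [read
        for read_pair in every_nth(pair_up(interleaved_reads), 2)
        for read in read_pair]
print(kept)
```

Because the sampler only ever sees whole pairs, a forward read can never be kept without its mate.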
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
259 def raw_fasta_iterator(handle):
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
260 """Yields raw FASTA records as multi-line strings."""
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
261 while True:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
262 line = handle.readline()
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
263 if line == "":
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
264 return # Premature end of file, or just empty?
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
265 if line[0] == ">":
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
266 break
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
267
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
268 no_id_warned = False
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
269 while True:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
270 if line[0] != ">":
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
271 raise ValueError(
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
272 "Records in FASTA files should start with '>' character")
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
273 try:
6
31f5701cd2e9 v0.2.4 Depends on Biopython 1.67 via legacy Tool Shed package or bioconda.
peterjc
parents: 5
diff changeset
274 line[1:].split(None, 1)[0]
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
275 except IndexError:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
276 if not no_id_warned:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
277 sys.stderr.write("WARNING - Malformed FASTA entry with no identifier\n")
6
31f5701cd2e9 v0.2.4 Depends on Biopython 1.67 via legacy Tool Shed package or bioconda.
peterjc
parents: 5
diff changeset
278 no_id_warned = True
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
279 lines = [line]
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
280 line = handle.readline()
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
281 while True:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
282 if not line:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
283 break
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
284 if line[0] == ">":
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
285 break
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
286 lines.append(line)
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
287 line = handle.readline()
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
288 yield "".join(lines)
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
289 if not line:
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
290 return # StopIteration
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
291
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
292
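Note on raw_fasta_iterator above: a minimal sketch of the same record-splitting idea, assuming only a text handle that can be iterated line by line. It is a simplified illustration, not a replacement for the function: it omits the missing-identifier warning, and the name raw_fasta_records is hypothetical.

```python
import io


def raw_fasta_records(handle):
    """Yield each FASTA record (header line plus sequence lines) as one string."""
    lines = []
    for line in handle:
        if line.startswith(">"):
            if lines:
                yield "".join(lines)  # emit the previous record
            lines = [line]
        elif lines:
            lines.append(line)
        # Lines before the first '>' are skipped, as in the function above.
    if lines:
        yield "".join(lines)  # emit the final record


demo = io.StringIO("; comment\n>seq1 first\nACGT\nTTGA\n>seq2\nGGCC\n")
records = list(raw_fasta_records(demo))
print(len(records))
```

Because each record is kept as the original text, writing the selected records back out preserves the input's line wrapping.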
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
293 def fasta_filter(in_file, out_file, iterator_filter, inter):
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
294 count = 0
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
295 # Galaxy now requires Python 2.5+, so we can use with statements.
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
296 with open(in_file) as in_handle:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
297 with open(out_file, "w") as pos_handle:
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
298 if inter:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
299 for r1, r2 in iterator_filter(pair(raw_fasta_iterator(in_handle))):
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
300 count += 1
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
301 pos_handle.write(r1)
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
302 pos_handle.write(r2)
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
303 else:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
304 for record in iterator_filter(raw_fasta_iterator(in_handle)):
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
305 count += 1
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
306 pos_handle.write(record)
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
307 return count
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
308
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
309
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
310 def fastq_filter(in_file, out_file, iterator_filter, inter):
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
311 count = 0
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
312 with open(in_file) as in_handle:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
313 with open(out_file, "w") as pos_handle:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
314 if inter:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
315 for r1, r2 in iterator_filter(pair(FastqGeneralIterator(in_handle))):
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
316 count += 1
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
317 pos_handle.write("@%s\n%s\n+\n%s\n" % r1)
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
318 pos_handle.write("@%s\n%s\n+\n%s\n" % r2)
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
319 else:
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
320 for title, seq, qual in iterator_filter(FastqGeneralIterator(in_handle)):
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
321 count += 1
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
322 pos_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
323 return count
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
324
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
325
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
326 def sff_filter(in_file, out_file, iterator_filter, inter):
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
327 count = 0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
328 try:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
329 from Bio.SeqIO.SffIO import ReadRocheXmlManifest
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
330 except ImportError:
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
331 # Prior to Biopython 1.56 this was a private function
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
332 from Bio.SeqIO.SffIO import _sff_read_roche_index_xml as ReadRocheXmlManifest
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
333 with open(in_file, "rb") as in_handle:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
334 try:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
335 manifest = ReadRocheXmlManifest(in_handle)
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
336 except ValueError:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
337 manifest = None
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
338 in_handle.seek(0)
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
339 with open(out_file, "wb") as out_handle:
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
340 writer = SffWriter(out_handle, xml=manifest)
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
341 in_handle.seek(0) # start again after getting manifest
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
342 if inter:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
343 from itertools import chain
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
344 count = writer.write_file(chain.from_iterable(iterator_filter(pair(SffIterator(in_handle)))))
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
345 assert count % 2 == 0, "Odd number of records? %i" % count
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
346 count //= 2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
347 else:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
348 count = writer.write_file(iterator_filter(SffIterator(in_handle)))
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
349 return count
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
350
6
31f5701cd2e9 v0.2.4 Depends on Biopython 1.67 via legacy Tool Shed package or bioconda.
peterjc
parents: 5
diff changeset
351
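Note on the interleaved branch of sff_filter above: a minimal sketch of how chain.from_iterable flattens sampled pairs into a single record stream, and how the number of records written maps back to a pair count. The strings stand in for SFF records and len() stands in for the count returned by the writer; both are assumptions for illustration.

```python
from itertools import chain

# Each tuple stands in for one sampled read pair.
sampled_pairs = [("r1/1", "r1/2"), ("r2/1", "r2/2")]

# Flatten the pairs into the single stream of records a writer would consume.
flat = list(chain.from_iterable(sampled_pairs))

written = len(flat)  # stands in for the record count reported by the writer
assert written % 2 == 0, "Odd number of records? %i" % written
pair_count = written // 2  # floor division keeps the count an int on Python 3
print(pair_count)
```

Floor division matters here because plain division of two ints returns a float on Python 3.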
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
352 if seq_format == "sff":
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
353 count = sff_filter(in_file, out_file, sampler, interleaved)
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
354 elif seq_format == "fasta":
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
355 count = fasta_filter(in_file, out_file, sampler, interleaved)
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
356 elif seq_format.startswith("fastq"):
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
357 count = fastq_filter(in_file, out_file, sampler, interleaved)
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
358 else:
5
6b71ad5d43fb v0.2.3 clarified help, internal cleanup of Python script
peterjc
parents: 3
diff changeset
359 sys.exit("Unsupported file type %r" % seq_format)
0
3a807e5ea6c8 Uploaded v0.0.1
peterjc
parents:
diff changeset
360
2
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
361 if interleaved:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
362 sys.stderr.write("Selected %i pairs\n" % count)
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
363 else:
da64f6a9e32b Uploaded v0.2.0, adds desired count mode
peterjc
parents: 0
diff changeset
364 sys.stderr.write("Selected %i records\n" % count)