annotate FilterUncorrectabledPEfastq.py @ 1:6703b98884a2 draft default tip

"planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit 65ada0f9589f3ffebad1db6636ccb50d58082606"
author iuc
date Thu, 26 Dec 2019 05:21:50 -0500
parents 9a0b65ad3c84
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
1 """
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
2 author: adam h freedman
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
3 afreedman405 at gmail.com
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
4 data: Fri Aug 26 10:55:18 EDT 2016
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
5 This script takes as an input Rcorrector error corrected Illumina paired-reads
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
6 in fastq format and:
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
7 1. Removes any reads that Rcorrector indentifes as containing an error,
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
8 but can't be corrected, typically low complexity sequences. For these,
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
9 the header contains 'unfixable'.
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
10 2. Strips the ' cor' from headers of reads that Rcorrector fixed, to avoid
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
11 issues created by certain header formats for downstream tools.
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
12 3. Write a log with counts of (a) read pairs that were removed because one end
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
13 was unfixable, (b) corrected left and right reads, (c) total number of
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
14 read pairs containing at least one corrected read.
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
15 Currently, this script only handles paired-end data, and handle either unzipped
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
16 or gzipped files on the fly, so long as the gzipped files end with 'gz'.
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
17 """
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
18
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
19 import argparse
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
20 import gzip
1
6703b98884a2 "planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit 65ada0f9589f3ffebad1db6636ccb50d58082606"
iuc
parents: 0
diff changeset
21 from itertools import zip_longest
0
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
22 from os.path import basename
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
23
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
24
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
25 def get_input_streams(r1file, r2file):
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
26 if r1file[-2:] == 'gz':
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
27 r1handle = gzip.open(r1file, 'rb')
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
28 r2handle = gzip.open(r2file, 'rb')
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
29 else:
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
30 r1handle = open(r1file, 'r')
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
31 r2handle = open(r2file, 'r')
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
32 return r1handle, r2handle
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
33
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
34
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
35 def grouper(iterable, n, fillvalue=None):
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
36 "Collect data into fixed-length chunks or blocks"
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
37 # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
38 args = [iter(iterable)] * n
1
6703b98884a2 "planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit 65ada0f9589f3ffebad1db6636ccb50d58082606"
iuc
parents: 0
diff changeset
39 return zip_longest(fillvalue=fillvalue, * args)
0
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
40
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
41
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
42 if __name__ == "__main__":
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
43 parser = argparse.ArgumentParser(description="options for filtering and logging rCorrector fastq outputs")
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
44 parser.add_argument('-1', '--left_reads', dest='leftreads', type=str, help='R1 fastq file')
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
45 parser.add_argument('-2', '--right_reads', dest='rightreads', type=str, help='R2 fastq file')
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
46 parser.add_argument('-o', '--out_prefix', dest='outprefix', type=str, help="prefix for filtered fastq output")
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
47 opts = parser.parse_args()
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
48 r1out = open(opts.outprefix + '_' + basename(opts.leftreads).replace('.gz', ''), 'w')
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
49 r2out = open(opts.outprefix + '_' + basename(opts.rightreads).replace('.gz', ''), 'w')
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
50 r1_cor_count = 0
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
51 r2_cor_count = 0
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
52 pair_cor_count = 0
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
53 unfix_count = 0
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
54 r1_stream, r2_stream = get_input_streams(opts.leftreads, opts.rightreads)
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
55 with r1_stream as f1, r2_stream as f2:
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
56 R1 = grouper(f1, 4)
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
57 R2 = grouper(f2, 4)
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
58 counter = 0
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
59 for entry in R1:
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
60 counter += 1
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
61 if counter % 100000 == 0:
1
6703b98884a2 "planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit 65ada0f9589f3ffebad1db6636ccb50d58082606"
iuc
parents: 0
diff changeset
62 print("%s reads processed" % counter)
0
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
63 head1, seq1, placeholder1, qual1 = [i.strip() for i in entry]
1
6703b98884a2 "planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit 65ada0f9589f3ffebad1db6636ccb50d58082606"
iuc
parents: 0
diff changeset
64 head2, seq2, placeholder2, qual2 = [j.strip() for j in next(R2)]
0
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
65 if 'unfixable' in head1 or 'unfixable' in head2:
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
66 unfix_count += 1
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
67 else:
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
68 if 'cor' in head1:
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
69 r1_cor_count += 1
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
70 if 'cor' in head2:
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
71 r2_cor_count += 1
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
72 if 'cor' in head1 or 'cor' in head2:
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
73 pair_cor_count += 1
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
74 head1 = head1.split('l:')[0][:-1] # keeps all before the low kmer count statistic and removes the trailing whitespace character
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
75 head2 = head2.split('l:')[0][:-1]
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
76 # head1 = head1.replace(' cor', '')
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
77 # head2 = head2.replace(' cor', '')
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
78 r1out.write('%s\n' % '\n'.join([head1, seq1, placeholder1, qual1]))
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
79 r2out.write('%s\n' % '\n'.join([head2, seq2, placeholder2, qual2]))
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
80 unfix_log = open('rmunfixable.log', 'w')
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
81 unfix_log.write('total PE reads:%s\nremoved PE reads:%s\nretained PE reads:%s\nR1 corrected:%s\nR2 corrected:%s\npairs corrected:%s\n' % (counter, unfix_count, counter - unfix_count, r1_cor_count, r2_cor_count, pair_cor_count))
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
82 r1out.close()
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
83 r2out.close()