annotate FilterUncorrectabledPEfastq.py @ 0:9a0b65ad3c84 draft

planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
author iuc
date Thu, 13 Sep 2018 07:00:00 -0400
parents
children 6703b98884a2
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
1 """
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
2 author: adam h freedman
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
3 afreedman405 at gmail.com
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
4 data: Fri Aug 26 10:55:18 EDT 2016
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
5 This script takes as an input Rcorrector error corrected Illumina paired-reads
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
6 in fastq format and:
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
7 1. Removes any reads that Rcorrector indentifes as containing an error,
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
8 but can't be corrected, typically low complexity sequences. For these,
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
9 the header contains 'unfixable'.
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
10 2. Strips the ' cor' from headers of reads that Rcorrector fixed, to avoid
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
11 issues created by certain header formats for downstream tools.
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
12 3. Write a log with counts of (a) read pairs that were removed because one end
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
13 was unfixable, (b) corrected left and right reads, (c) total number of
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
14 read pairs containing at least one corrected read.
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
15 Currently, this script only handles paired-end data, and handle either unzipped
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
16 or gzipped files on the fly, so long as the gzipped files end with 'gz'.
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
17 """
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
18
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
19 # import sys
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
20 import argparse
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
21 import gzip
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
22 from itertools import izip_longest
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
23 # izip
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
24 from os.path import basename
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
25
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
26
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
27 def get_input_streams(r1file, r2file):
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
28 if r1file[-2:] == 'gz':
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
29 r1handle = gzip.open(r1file, 'rb')
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
30 r2handle = gzip.open(r2file, 'rb')
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
31 else:
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
32 r1handle = open(r1file, 'r')
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
33 r2handle = open(r2file, 'r')
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
34 return r1handle, r2handle
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
35
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
36
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
37 def grouper(iterable, n, fillvalue=None):
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
38 "Collect data into fixed-length chunks or blocks"
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
39 # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
40 args = [iter(iterable)] * n
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
41 return izip_longest(fillvalue=fillvalue, * args)
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
42
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
43
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
44 if __name__ == "__main__":
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
45 parser = argparse.ArgumentParser(description="options for filtering and logging rCorrector fastq outputs")
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
46 parser.add_argument('-1', '--left_reads', dest='leftreads', type=str, help='R1 fastq file')
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
47 parser.add_argument('-2', '--right_reads', dest='rightreads', type=str, help='R2 fastq file')
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
48 parser.add_argument('-o', '--out_prefix', dest='outprefix', type=str, help="prefix for filtered fastq output")
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
49 opts = parser.parse_args()
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
50 r1out = open(opts.outprefix + '_' + basename(opts.leftreads).replace('.gz', ''), 'w')
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
51 r2out = open(opts.outprefix + '_' + basename(opts.rightreads).replace('.gz', ''), 'w')
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
52 r1_cor_count = 0
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
53 r2_cor_count = 0
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
54 pair_cor_count = 0
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
55 unfix_count = 0
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
56 r1_stream, r2_stream = get_input_streams(opts.leftreads, opts.rightreads)
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
57 with r1_stream as f1, r2_stream as f2:
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
58 R1 = grouper(f1, 4)
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
59 R2 = grouper(f2, 4)
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
60 counter = 0
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
61 for entry in R1:
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
62 counter += 1
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
63 if counter % 100000 == 0:
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
64 print "%s reads processed" % counter
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
65 head1, seq1, placeholder1, qual1 = [i.strip() for i in entry]
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
66 head2, seq2, placeholder2, qual2 = [j.strip() for j in R2.next()]
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
67 if 'unfixable' in head1 or 'unfixable' in head2:
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
68 unfix_count += 1
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
69 else:
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
70 if 'cor' in head1:
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
71 r1_cor_count += 1
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
72 if 'cor' in head2:
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
73 r2_cor_count += 1
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
74 if 'cor' in head1 or 'cor' in head2:
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
75 pair_cor_count += 1
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
76 head1 = head1.split('l:')[0][:-1] # keeps all before the low kmer count statistic and removes the trailing whitespace character
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
77 head2 = head2.split('l:')[0][:-1]
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
78 # head1 = head1.replace(' cor', '')
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
79 # head2 = head2.replace(' cor', '')
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
80 r1out.write('%s\n' % '\n'.join([head1, seq1, placeholder1, qual1]))
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
81 r2out.write('%s\n' % '\n'.join([head2, seq2, placeholder2, qual2]))
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
82 unfix_log = open('rmunfixable.log', 'w')
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
83 unfix_log.write('total PE reads:%s\nremoved PE reads:%s\nretained PE reads:%s\nR1 corrected:%s\nR2 corrected:%s\npairs corrected:%s\n' % (counter, unfix_count, counter - unfix_count, r1_cor_count, r2_cor_count, pair_cor_count))
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
84 r1out.close()
9a0b65ad3c84 planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff changeset
85 r2out.close()