Mercurial > repos > iuc > rcorrector
annotate FilterUncorrectabledPEfastq.py @ 1:6703b98884a2 draft default tip
"planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit 65ada0f9589f3ffebad1db6636ccb50d58082606"
author | iuc |
---|---|
date | Thu, 26 Dec 2019 05:21:50 -0500 |
parents | 9a0b65ad3c84 |
children |
rev | line source |
---|---|
0
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
1 """ |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
2 author: adam h freedman |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
3 afreedman405 at gmail.com |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
4 data: Fri Aug 26 10:55:18 EDT 2016 |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
5 This script takes as an input Rcorrector error corrected Illumina paired-reads |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
6 in fastq format and: |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
7 1. Removes any reads that Rcorrector indentifes as containing an error, |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
8 but can't be corrected, typically low complexity sequences. For these, |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
9 the header contains 'unfixable'. |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
10 2. Strips the ' cor' from headers of reads that Rcorrector fixed, to avoid |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
11 issues created by certain header formats for downstream tools. |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
12 3. Write a log with counts of (a) read pairs that were removed because one end |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
13 was unfixable, (b) corrected left and right reads, (c) total number of |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
14 read pairs containing at least one corrected read. |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
15 Currently, this script only handles paired-end data, and handle either unzipped |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
16 or gzipped files on the fly, so long as the gzipped files end with 'gz'. |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
17 """ |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
18 |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
19 import argparse |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
20 import gzip |
1
6703b98884a2
"planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit 65ada0f9589f3ffebad1db6636ccb50d58082606"
iuc
parents:
0
diff
changeset
|
21 from itertools import zip_longest |
0
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
22 from os.path import basename |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
23 |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
24 |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
25 def get_input_streams(r1file, r2file): |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
26 if r1file[-2:] == 'gz': |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
27 r1handle = gzip.open(r1file, 'rb') |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
28 r2handle = gzip.open(r2file, 'rb') |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
29 else: |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
30 r1handle = open(r1file, 'r') |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
31 r2handle = open(r2file, 'r') |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
32 return r1handle, r2handle |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
33 |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
34 |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
35 def grouper(iterable, n, fillvalue=None): |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
36 "Collect data into fixed-length chunks or blocks" |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
37 # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
38 args = [iter(iterable)] * n |
1
6703b98884a2
"planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit 65ada0f9589f3ffebad1db6636ccb50d58082606"
iuc
parents:
0
diff
changeset
|
39 return zip_longest(fillvalue=fillvalue, * args) |
0
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
40 |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
41 |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
42 if __name__ == "__main__": |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
43 parser = argparse.ArgumentParser(description="options for filtering and logging rCorrector fastq outputs") |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
44 parser.add_argument('-1', '--left_reads', dest='leftreads', type=str, help='R1 fastq file') |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
45 parser.add_argument('-2', '--right_reads', dest='rightreads', type=str, help='R2 fastq file') |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
46 parser.add_argument('-o', '--out_prefix', dest='outprefix', type=str, help="prefix for filtered fastq output") |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
47 opts = parser.parse_args() |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
48 r1out = open(opts.outprefix + '_' + basename(opts.leftreads).replace('.gz', ''), 'w') |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
49 r2out = open(opts.outprefix + '_' + basename(opts.rightreads).replace('.gz', ''), 'w') |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
50 r1_cor_count = 0 |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
51 r2_cor_count = 0 |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
52 pair_cor_count = 0 |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
53 unfix_count = 0 |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
54 r1_stream, r2_stream = get_input_streams(opts.leftreads, opts.rightreads) |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
55 with r1_stream as f1, r2_stream as f2: |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
56 R1 = grouper(f1, 4) |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
57 R2 = grouper(f2, 4) |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
58 counter = 0 |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
59 for entry in R1: |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
60 counter += 1 |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
61 if counter % 100000 == 0: |
1
6703b98884a2
"planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit 65ada0f9589f3ffebad1db6636ccb50d58082606"
iuc
parents:
0
diff
changeset
|
62 print("%s reads processed" % counter) |
0
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
63 head1, seq1, placeholder1, qual1 = [i.strip() for i in entry] |
1
6703b98884a2
"planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit 65ada0f9589f3ffebad1db6636ccb50d58082606"
iuc
parents:
0
diff
changeset
|
64 head2, seq2, placeholder2, qual2 = [j.strip() for j in next(R2)] |
0
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
65 if 'unfixable' in head1 or 'unfixable' in head2: |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
66 unfix_count += 1 |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
67 else: |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
68 if 'cor' in head1: |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
69 r1_cor_count += 1 |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
70 if 'cor' in head2: |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
71 r2_cor_count += 1 |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
72 if 'cor' in head1 or 'cor' in head2: |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
73 pair_cor_count += 1 |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
74 head1 = head1.split('l:')[0][:-1] # keeps all before the low kmer count statistic and removes the trailing whitespace character |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
75 head2 = head2.split('l:')[0][:-1] |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
76 # head1 = head1.replace(' cor', '') |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
77 # head2 = head2.replace(' cor', '') |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
78 r1out.write('%s\n' % '\n'.join([head1, seq1, placeholder1, qual1])) |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
79 r2out.write('%s\n' % '\n'.join([head2, seq2, placeholder2, qual2])) |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
80 unfix_log = open('rmunfixable.log', 'w') |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
81 unfix_log.write('total PE reads:%s\nremoved PE reads:%s\nretained PE reads:%s\nR1 corrected:%s\nR2 corrected:%s\npairs corrected:%s\n' % (counter, unfix_count, counter - unfix_count, r1_cor_count, r2_cor_count, pair_cor_count)) |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
82 r1out.close() |
9a0b65ad3c84
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/rcorrector commit eba2cb86379f6ee6899a1141226e191c63fadfa6
iuc
parents:
diff
changeset
|
83 r2out.close() |