annotate fastq_subset.py @ 8:4e625d3672ba draft

Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
author pjbriggs
date Wed, 16 May 2018 07:39:16 -0400
parents
children 52dbe2089d14
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
8
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
1 #!/usr/bin/env python
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
2
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
3 import argparse
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
4 import random
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
5 from Bio.SeqIO.QualityIO import FastqGeneralIterator
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
6
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
7 def count_reads(fastq):
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
8 """
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
9 Count number of reads in a Fastq file
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
10 """
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
11 n = 0
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
12 with open(fastq,'r') as fq:
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
13 while True:
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
14 buf = fq.read()
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
15 n += buf.count('\n')
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
16 if buf == "": break
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
17 return n/4
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
18
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
19 def fastq_subset(fastq_in,fastq_out,indices):
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
20 """
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
21 Output a subset of reads from a Fastq file
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
22
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
23 The reads to output are specifed by a list
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
24 of integer indices; only reads at those
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
25 positions in the input file will be written
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
26 to the output.
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
27 """
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
28 with open(fastq_in,'r') as fq_in:
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
29 fq_out = open(fastq_out,'w')
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
30 i = 0
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
31 for title,seq,qual in FastqGeneralIterator(fq_in):
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
32 if i in indices:
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
33 fq_out.write("@%s\n%s\n+\n%s\n" % (title,
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
34 seq,
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
35 qual))
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
36 i += 1
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
37 fq_out.close()
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
38
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
39 if __name__ == "__main__":
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
40
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
41 p = argparse.ArgumentParser()
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
42 p.add_argument("fastq_r1")
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
43 p.add_argument("fastq_r2")
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
44 p.add_argument("-n",
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
45 dest="subset_size",
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
46 default=None,
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
47 help="subset size")
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
48 p.add_argument("-s",
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
49 dest="seed",
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
50 type=int,
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
51 default=None,
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
52 help="seed for random number generator")
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
53 args = p.parse_args()
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
54
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
55 print "Processing fastq pair:"
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
56 print "\t%s" % args.fastq_r1
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
57 print "\t%s" % args.fastq_r2
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
58
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
59 nreads = count_reads(args.fastq_r1)
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
60 print "Counted %d reads in %s" % (nreads,args.fastq_r1)
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
61
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
62 if args.subset_size is not None:
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
63 subset_size = float(args.subset_size)
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
64 if subset_size < 1.0:
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
65 subset_size = int(nreads*subset_size)
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
66 else:
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
67 subset_size = int(subset_size)
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
68 print "Extracting subset of reads: %s" % subset_size
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
69 if args.seed is not None:
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
70 print "Random number generator seed: %d" % args.seed
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
71 random.seed(args.seed)
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
72 subset = random.sample(xrange(nreads),subset_size)
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
73 fastq_subset(args.fastq_r1,"subset_r1.fq",subset)
4e625d3672ba Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
pjbriggs
parents:
diff changeset
74 fastq_subset(args.fastq_r2,"subset_r2.fq",subset)