annotate fastq_subset.py @ 8:4e625d3672ba draft
Pal_finder tool version 0.02.04.7: add detection/reporting of bad ranges; enable subset of reads to be used; check n-mers.
author | pjbriggs |
---|---|
date | Wed, 16 May 2018 07:39:16 -0400 |
parents | |
children | 52dbe2089d14 |
#!/usr/bin/env python

import argparse
import random
from Bio.SeqIO.QualityIO import FastqGeneralIterator

def count_reads(fastq):
    """
    Count number of reads in a Fastq file
    """
    n = 0
    with open(fastq,'r') as fq:
        while True:
            # Read in fixed-size chunks so large files aren't loaded whole
            buf = fq.read(1024*1024)
            if buf == "": break
            n += buf.count('\n')
    # Four lines per read; floor division keeps the count an integer
    return n//4

def fastq_subset(fastq_in,fastq_out,indices):
    """
    Output a subset of reads from a Fastq file

    The reads to output are specified by a list
    of integer indices (zero-based); only reads at
    those positions in the input file will be written
    to the output.
    """
    # Set membership tests are O(1), unlike tests against a list
    indices = set(indices)
    with open(fastq_in,'r') as fq_in, open(fastq_out,'w') as fq_out:
        for i,(title,seq,qual) in enumerate(FastqGeneralIterator(fq_in)):
            if i in indices:
                fq_out.write("@%s\n%s\n+\n%s\n" % (title,
                                                   seq,
                                                   qual))

if __name__ == "__main__":

    p = argparse.ArgumentParser()
    p.add_argument("fastq_r1")
    p.add_argument("fastq_r2")
    p.add_argument("-n",
                   dest="subset_size",
                   default=None,
                   help="subset size: a fraction of the reads if less "
                   "than 1.0, otherwise a number of reads")
    p.add_argument("-s",
                   dest="seed",
                   type=int,
                   default=None,
                   help="seed for random number generator")
    args = p.parse_args()

    print("Processing fastq pair:")
    print("\t%s" % args.fastq_r1)
    print("\t%s" % args.fastq_r2)

    nreads = count_reads(args.fastq_r1)
    print("Counted %d reads in %s" % (nreads,args.fastq_r1))

    if args.subset_size is not None:
        subset_size = float(args.subset_size)
        if subset_size < 1.0:
            subset_size = int(nreads*subset_size)
        else:
            subset_size = int(subset_size)
        print("Extracting subset of reads: %s" % subset_size)
        if args.seed is not None:
            print("Random number generator seed: %d" % args.seed)
            random.seed(args.seed)
        # The same indices are used for both files so read pairs stay in sync
        subset = random.sample(range(nreads),subset_size)
        fastq_subset(args.fastq_r1,"subset_r1.fq",subset)
        fastq_subset(args.fastq_r2,"subset_r2.fq",subset)
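
The script draws one random set of read positions and applies it to both the R1 and R2 files, which appears to be what keeps the read pairs matched in the two outputs. The sketch below illustrates that idea on in-memory data only. It is not part of the tool: `read_fastq`, `subset_records` and `make_reads` are hypothetical helpers, and the simple parser assumes strict four-line records rather than using Biopython's `FastqGeneralIterator`.

```python
import random


def read_fastq(lines):
    """Yield (title, seq, qual) tuples, assuming strict 4-line records."""
    lines = [line.rstrip("\n") for line in lines]
    for k in range(0, len(lines) - 3, 4):
        # Line k is '@title', k+1 the sequence, k+2 the '+' line, k+3 the quality
        yield lines[k][1:], lines[k + 1], lines[k + 3]


def subset_records(records, indices):
    """Keep only the records whose zero-based position is in 'indices'."""
    wanted = set(indices)
    return [rec for i, rec in enumerate(records) if i in wanted]


def make_reads(suffix, n):
    """Build a toy Fastq file (as a list of lines) with n reads."""
    lines = []
    for i in range(n):
        lines += ["@read%d/%s\n" % (i, suffix), "ACGT\n", "+\n", "IIII\n"]
    return lines


r1 = make_reads("1", 10)
r2 = make_reads("2", 10)

# One sample of positions, reused for both mates
random.seed(42)
subset = random.sample(range(10), 3)

sub_r1 = subset_records(read_fastq(r1), subset)
sub_r2 = subset_records(read_fastq(r2), subset)

# Each R1 read is printed next to its R2 mate
for (t1, _, _), (t2, _, _) in zip(sub_r1, sub_r2):
    print(t1, t2)
```

Records come out in input-file order regardless of the order in which the positions were sampled, so corresponding lines in the two outputs belong to the same pair.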