annotate tools/fastq/fastq_paired_unpaired.py @ 3:6a14074bc810 draft

Uploaded v0.0.8, automated Biopython dependency handling via ToolShed; MIT license; reST markup for README file.
author peterjc
date Mon, 29 Jul 2013 09:28:55 -0400
parents
children
#!/usr/bin/env python
"""Divides a FASTQ into paired and single (orphan reads) as separate files.

The input file should be a valid FASTQ file which has been sorted so that
any partner forward+reverse reads are consecutive. The output files all
preserve this sort order. Pairings are recognised based on standard name
suffixes. See below or run the tool with no arguments for more details.

Note that the FASTQ variant is unimportant (Sanger, Solexa, Illumina, or even
Color Space should all work equally well).

This script is copyright 2010-2013 by Peter Cock, The James Hutton Institute
(formerly SCRI), Scotland, UK. All rights reserved.

See accompanying text file for licence details (MIT license).
"""
import os
import sys
import re
from galaxy_utils.sequence.fastq import fastqReader, fastqWriter

if "-v" in sys.argv or "--version" in sys.argv:
    print "Version 0.0.8"
    sys.exit(0)

def stop_err(msg, err=1):
    sys.stderr.write(msg.rstrip() + "\n")
    sys.exit(err)

msg = """Expect either 4 or 5 arguments: the FASTQ variant, then FASTQ filenames.

If you want two output files, use four arguments:
- FASTQ variant (e.g. sanger, solexa, illumina or cssanger)
- Sorted input FASTQ filename,
- Output paired FASTQ filename (forward then reverse interleaved),
- Output singles FASTQ filename (orphan reads)

If you want three output files, use five arguments:
- FASTQ variant (e.g. sanger, solexa, illumina or cssanger)
- Sorted input FASTQ filename,
- Output forward paired FASTQ filename,
- Output reverse paired FASTQ filename,
- Output singles FASTQ filename (orphan reads)

The input file should be a valid FASTQ file which has been sorted so that
any partner forward+reverse reads are consecutive. The output files all
preserve this sort order.

Any reads where the forward/reverse naming suffix used is not recognised
are treated as orphan reads. The tool supports the /1 and /2 convention
originally used by Illumina, the .f and .r convention, and the Sanger
convention (see http://staden.sourceforge.net/manual/pregap4_unix_50.html
for details), and the new Illumina convention where the reads have the
same identifier with the fragment at the start of the description, e.g.

@HWI-ST916:79:D04M5ACXX:1:1101:10000:100326 1:N:0:TGNCCA
@HWI-ST916:79:D04M5ACXX:1:1101:10000:100326 2:N:0:TGNCCA

Note that this does support multiple forward and reverse reads per template
(which is quite common with Sanger sequencing), e.g. this which is sorted
alphabetically:

WTSI_1055_4p17.p1kapIBF
WTSI_1055_4p17.p1kpIBF
WTSI_1055_4p17.q1kapIBR
WTSI_1055_4p17.q1kpIBR

or this where the reads already come in pairs:

WTSI_1055_4p17.p1kapIBF
WTSI_1055_4p17.q1kapIBR
WTSI_1055_4p17.p1kpIBF
WTSI_1055_4p17.q1kpIBR

both become:

WTSI_1055_4p17.p1kapIBF paired with WTSI_1055_4p17.q1kapIBR
WTSI_1055_4p17.p1kpIBF paired with WTSI_1055_4p17.q1kpIBR
"""
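
The pairing behaviour described in the usage text can be tried out on bare read names. The following is only a minimal sketch, not the script's own loop: `pair_names` is a hypothetical helper, it works on names rather than FASTQ records, and it assumes that forward reads are buffered until a reverse read with the same template arrives. The two suffix patterns are the ones the script compiles as `re_f` and `re_r`.

```python
import re

# Suffix patterns as compiled in the script (re_f and re_r).
re_f = re.compile(r"(/1|\.f|\.[sfp]\d\w*)$")
re_r = re.compile(r"(/2|\.r|\.[rq]\d\w*)$")


def pair_names(names):
    """Return (pairs, singles) for read names given in sorted order.

    Hypothetical helper for illustration only.
    """
    pairs, singles = [], []
    last_template, buffered = None, []
    for name in names:
        f = re_f.search(name)
        r = re_r.search(name)
        if f:
            template = name[:f.start()]
            if template != last_template:
                # Forward reads still waiting for a partner are orphans.
                singles.extend(buffered)
                buffered = []
                last_template = template
            buffered.append(name)
        elif r and buffered and name[:r.start()] == last_template:
            # Oldest waiting forward read pairs with this reverse read.
            pairs.append((buffered.pop(0), name))
        else:
            # Unpartnered reverse read, or an unrecognised suffix.
            singles.extend(buffered)
            buffered = []
            last_template = None
            singles.append(name)
    singles.extend(buffered)
    return pairs, singles


sorted_names = ["WTSI_1055_4p17.p1kapIBF", "WTSI_1055_4p17.p1kpIBF",
                "WTSI_1055_4p17.q1kapIBR", "WTSI_1055_4p17.q1kpIBR"]
paired_names = ["WTSI_1055_4p17.p1kapIBF", "WTSI_1055_4p17.q1kapIBR",
                "WTSI_1055_4p17.p1kpIBF", "WTSI_1055_4p17.q1kpIBR"]
```

Under these assumptions both orderings from the usage text give the same two pairs and no orphans.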

if len(sys.argv) == 5:
    format, input_fastq, pairs_fastq, singles_fastq = sys.argv[1:]
elif len(sys.argv) == 6:
    pairs_fastq = None
    format, input_fastq, pairs_f_fastq, pairs_r_fastq, singles_fastq = sys.argv[1:]
else:
    stop_err(msg)

format = format.replace("fastq", "").lower()
if not format:
    format="sanger" #safe default
elif format not in ["sanger","solexa","illumina","cssanger"]:
    stop_err("Unrecognised format %s" % format)

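The format handling above can be read as a small pure function. This is a sketch only: `normalise_format` is a hypothetical name, and it raises `ValueError` where the script calls `stop_err`. It keeps the script's order of operations (remove "fastq" first, then lower-case).

```python
KNOWN_VARIANTS = ("sanger", "solexa", "illumina", "cssanger")


def normalise_format(name):
    """Map a datatype name such as 'fastqsanger' to a FASTQ variant."""
    # Same order as the script: strip "fastq", then lower-case.
    name = name.replace("fastq", "").lower()
    if not name:
        return "sanger"  # plain "fastq" falls back to the safe default
    if name not in KNOWN_VARIANTS:
        raise ValueError("Unrecognised format %s" % name)
    return name
```

Because the replacement happens before lower-casing, an upper-case name such as "FASTQSANGER" is not stripped and ends up rejected as unrecognised.
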
def f_match(name):
    if name.endswith("/1") or name.endswith(".f"):
        return True

#Cope with three widely used suffix naming conventions,
#Illumina: /1 or /2
#Forward/reverse: .f or .r
#Sanger, e.g. .p1k and .q1k
#See http://staden.sourceforge.net/manual/pregap4_unix_50.html
re_f = re.compile(r"(/1|\.f|\.[sfp]\d\w*)$")
re_r = re.compile(r"(/2|\.r|\.[rq]\d\w*)$")

#assert re_f.match("demo/1")
assert re_f.search("demo.f")
assert re_f.search("demo.s1")
assert re_f.search("demo.f1k")
assert re_f.search("demo.p1")
assert re_f.search("demo.p1k")
assert re_f.search("demo.p1lk")
assert re_r.search("demo/2")
assert re_r.search("demo.r")
assert re_r.search("demo.q1")
assert re_r.search("demo.q1lk")
assert not re_r.search("demo/1")
assert not re_r.search("demo.f")
assert not re_r.search("demo.p")
assert not re_f.search("demo/2")
assert not re_f.search("demo.r")
assert not re_f.search("demo.q")

re_illumina_f = re.compile(r"^@[a-zA-Z0-9_:-]+ 1:.*$")
re_illumina_r = re.compile(r"^@[a-zA-Z0-9_:-]+ 2:.*$")
assert re_illumina_f.match("@HWI-ST916:79:D04M5ACXX:1:1101:10000:100326 1:N:0:TGNCCA")
assert re_illumina_r.match("@HWI-ST916:79:D04M5ACXX:1:1101:10000:100326 2:N:0:TGNCCA")
assert not re_illumina_f.match("@HWI-ST916:79:D04M5ACXX:1:1101:10000:100326 2:N:0:TGNCCA")
assert not re_illumina_r.match("@HWI-ST916:79:D04M5ACXX:1:1101:10000:100326 1:N:0:TGNCCA")


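The four patterns defined so far can be combined into one classifier for a FASTQ title line. This is a sketch: `classify` is a hypothetical helper, and it assumes the reverse-read checks mirror the forward-read ones (suffix pattern first, then the newer Illumina description style).

```python
import re

# Patterns as defined in the script.
re_f = re.compile(r"(/1|\.f|\.[sfp]\d\w*)$")
re_r = re.compile(r"(/2|\.r|\.[rq]\d\w*)$")
re_illumina_f = re.compile(r"^@[a-zA-Z0-9_:-]+ 1:.*$")
re_illumina_r = re.compile(r"^@[a-zA-Z0-9_:-]+ 2:.*$")


def classify(identifier):
    """Return (direction, template) for a title line starting with '@'."""
    # The name is the first whitespace-separated word of the title line.
    name = identifier.split(None, 1)[0]
    checks = (("forward", re_f, re_illumina_f),
              ("reverse", re_r, re_illumina_r))
    for direction, re_suffix, re_illumina in checks:
        suffix = re_suffix.search(name)
        if suffix:
            # Template is the name with the pairing suffix removed.
            return direction, name[:suffix.start()]
        if re_illumina.match(identifier):
            # Newer Illumina style: the name itself is the template.
            return direction, name
    return "neither", name
```

Reads classified as "neither" would be the ones the usage text says are treated as orphans.
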
count, forward, reverse, neither, pairs, singles = 0, 0, 0, 0, 0, 0
in_handle = open(input_fastq)
if pairs_fastq:
    pairs_f_writer = fastqWriter(open(pairs_fastq, "w"), format)
    pairs_r_writer = pairs_f_writer
else:
    pairs_f_writer = fastqWriter(open(pairs_f_fastq, "w"), format)
    pairs_r_writer = fastqWriter(open(pairs_r_fastq, "w"), format)
singles_writer = fastqWriter(open(singles_fastq, "w"), format)
last_template, buffered_reads = None, []

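The writer setup above binds both pair writers to the same object when a single paired output file is requested, so forward and reverse records end up interleaved in one file. The sketch below illustrates that aliasing with in-memory handles. `LineWriter` and `make_pair_writers` are hypothetical stand-ins; the real script uses `fastqWriter` from `galaxy_utils`, which is not imported here.

```python
import io


class LineWriter(object):
    """Hypothetical stand-in for fastqWriter: one line per record."""

    def __init__(self, handle):
        self.handle = handle

    def write(self, record):
        self.handle.write(record + "\n")


def make_pair_writers(interleaved_handle=None, f_handle=None, r_handle=None):
    """Return (forward_writer, reverse_writer) for either output layout."""
    if interleaved_handle is not None:
        writer = LineWriter(interleaved_handle)
        return writer, writer  # same object, so pairs interleave
    return LineWriter(f_handle), LineWriter(r_handle)


out = io.StringIO()
fwd, rev = make_pair_writers(interleaved_handle=out)
for f_read, r_read in [("a/1", "a/2"), ("b/1", "b/2")]:
    fwd.write(f_read)
    rev.write(r_read)
interleaved = out.getvalue().split()
```

The code that writes a pair does not need to know which layout was chosen; it always writes the forward read to one writer and the reverse read to the other.
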
6a14074bc810 Uploaded v0.0.8, automated Biopython dependency handling via ToolShed; MIT license; reST markup for README file.
peterjc
parents:
diff changeset
144 for record in fastqReader(in_handle, format):
6a14074bc810 Uploaded v0.0.8, automated Biopython dependency handling via ToolShed; MIT license; reST markup for README file.
peterjc
parents:
diff changeset
145 count += 1
6a14074bc810 Uploaded v0.0.8, automated Biopython dependency handling via ToolShed; MIT license; reST markup for README file.
peterjc
parents:
diff changeset
146 name = record.identifier.split(None,1)[0]
6a14074bc810 Uploaded v0.0.8, automated Biopython dependency handling via ToolShed; MIT license; reST markup for README file.
peterjc
parents:
diff changeset
147 assert name[0]=="@", record.identifier #Quirk of the Galaxy parser
6a14074bc810 Uploaded v0.0.8, automated Biopython dependency handling via ToolShed; MIT license; reST markup for README file.
peterjc
parents:
diff changeset
148 is_forward = False
6a14074bc810 Uploaded v0.0.8, automated Biopython dependency handling via ToolShed; MIT license; reST markup for README file.
peterjc
parents:
diff changeset
149 suffix = re_f.search(name)
6a14074bc810 Uploaded v0.0.8, automated Biopython dependency handling via ToolShed; MIT license; reST markup for README file.
peterjc
parents:
diff changeset
150 if suffix:
6a14074bc810 Uploaded v0.0.8, automated Biopython dependency handling via ToolShed; MIT license; reST markup for README file.
peterjc
parents:
diff changeset
151 #============
6a14074bc810 Uploaded v0.0.8, automated Biopython dependency handling via ToolShed; MIT license; reST markup for README file.
peterjc
parents:
diff changeset
152 #Forward read
6a14074bc810 Uploaded v0.0.8, automated Biopython dependency handling via ToolShed; MIT license; reST markup for README file.
peterjc
parents:
diff changeset
153 #============
6a14074bc810 Uploaded v0.0.8, automated Biopython dependency handling via ToolShed; MIT license; reST markup for README file.
peterjc
parents:
diff changeset
154 template = name[:suffix.start()]
6a14074bc810 Uploaded v0.0.8, automated Biopython dependency handling via ToolShed; MIT license; reST markup for README file.
peterjc
parents:
diff changeset
155 is_forward = True
6a14074bc810 Uploaded v0.0.8, automated Biopython dependency handling via ToolShed; MIT license; reST markup for README file.
peterjc
parents:
diff changeset
156 elif re_illumina_f.match(record.identifier):
6a14074bc810 Uploaded v0.0.8, automated Biopython dependency handling via ToolShed; MIT license; reST markup for README file.
peterjc
parents:
diff changeset
157 template = name #No suffix
6a14074bc810 Uploaded v0.0.8, automated Biopython dependency handling via ToolShed; MIT license; reST markup for README file.
peterjc
parents:
diff changeset
158 is_forward = True
    if is_forward:
        #print name, "forward", template
        forward += 1
        if last_template == template:
            buffered_reads.append(record)
        else:
            #Any old buffered reads are orphans
            for old in buffered_reads:
                singles_writer.write(old)
                singles += 1
            #Save this read in buffer
            buffered_reads = [record]
            last_template = template
    else:
        is_reverse = False
        suffix = re_r.search(name)
        if suffix:
            #============
            #Reverse read
            #============
            template = name[:suffix.start()]
            is_reverse = True
        elif re_illumina_r.match(record.identifier):
            template = name #No suffix
            is_reverse = True
        if is_reverse:
            #print name, "reverse", template
            reverse += 1
            if last_template == template and buffered_reads:
                #We have a pair!
                #If there are multiple buffered forward reads, want to pick
                #the first one (although we could try and do something more
                #clever looking at the suffix to match them up...)
                old = buffered_reads.pop(0)
                pairs_f_writer.write(old)
                pairs_r_writer.write(record)
                pairs += 2
            else:
                #As this is a reverse read, this and any buffered read(s) are
                #all orphans
                for old in buffered_reads:
                    singles_writer.write(old)
                    singles += 1
                buffered_reads = []
                singles_writer.write(record)
                singles += 1
                last_template = None
        else:
            #===========================
            #Neither forward nor reverse
            #===========================
            singles_writer.write(record)
            singles += 1
            neither += 1
            for old in buffered_reads:
                singles_writer.write(old)
                singles += 1
            buffered_reads = []
            last_template = None
if last_template:
    #Left over singles...
    for old in buffered_reads:
        singles_writer.write(old)
        singles += 1
#Call close() (the bare attribute reference did nothing)
in_handle.close()
singles_writer.close()
if pairs_fastq:
    pairs_f_writer.close()
    assert pairs_r_writer.file.closed
else:
    pairs_f_writer.close()
    pairs_r_writer.close()

if neither:
    print "%i reads (%i forward, %i reverse, %i neither), %i in pairs, %i as singles" \
          % (count, forward, reverse, neither, pairs, singles)
else:
    print "%i reads (%i forward, %i reverse), %i in pairs, %i as singles" \
          % (count, forward, reverse, pairs, singles)

assert count == pairs + singles == forward + reverse + neither, \
       "%i vs %i+%i=%i vs %i+%i+%i=%i" \
       % (count,pairs,singles,pairs+singles,forward,reverse,neither,forward+reverse+neither)
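
The buffering logic in the loop above can be sketched on plain read names, without the Galaxy FASTQ reader and writer objects. This is a minimal illustration, not the script's own code: `split_pairs`, `RE_FORWARD` and `RE_REVERSE` are hypothetical names, only the `/1` and `/2` suffixes are recognised (the script's `re_f`, `re_r` and Illumina patterns are defined outside this excerpt), and results are returned as lists rather than written to files.

```python
import re

# Assumption: only "/1" and "/2" suffixes are handled in this sketch.
RE_FORWARD = re.compile(r"/1$")
RE_REVERSE = re.compile(r"/2$")


def split_pairs(names):
    """Return (pairs, singles) for names sorted so partners are adjacent."""
    pairs = []
    singles = []
    buffered = []          # forward reads still waiting for a partner
    last_template = None
    for name in names:
        fwd = RE_FORWARD.search(name)
        rev = RE_REVERSE.search(name)
        if fwd:
            template = name[:fwd.start()]
            if template == last_template:
                buffered.append(name)
            else:
                # New template, so anything still buffered is an orphan
                singles.extend(buffered)
                buffered = [name]
                last_template = template
        elif rev:
            template = name[:rev.start()]
            if template == last_template and buffered:
                # Pair the reverse read with the oldest buffered forward read
                pairs.append((buffered.pop(0), name))
            else:
                # Unmatched reverse read: it and any buffered reads are orphans
                singles.extend(buffered)
                buffered = []
                singles.append(name)
                last_template = None
        else:
            # Neither forward nor reverse: recorded first, then the buffer
            singles.append(name)
            singles.extend(buffered)
            buffered = []
            last_template = None
    # Forward reads left in the buffer at end of input never found a partner
    singles.extend(buffered)
    return pairs, singles


pairs, singles = split_pairs(["a/1", "a/2", "b/1", "c/1", "c/2", "d/2", "e"])
```

With this input, `a` and `c` are reported as pairs, while `b/1` (forward without a partner), `d/2` (reverse without a partner) and `e` (no recognised suffix) end up as singles. When two forward reads share a template, the reverse read is paired with the first of them and the second becomes a single once the input ends.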