annotate tools/seq_primer_clip/seq_primer_clip.py @ 5:530c8d6fedd8 draft

v0.0.15 - internal changes
author peterjc
date Wed, 10 May 2017 13:09:52 -0400
parents 9b074c1db68e
children b9dc7c967ee6
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
1 #!/usr/bin/env python
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
2 """Looks for the given primer sequences and clips matching SFF reads.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
3
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
4 Takes eight command line options, input read filename, input read format,
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
5 input primer FASTA filename, type of primers (forward, reverse or reverse-
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
6 complement), number of mismatches (currently only 0, 1 and 2 are supported),
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
7 minimum length to keep a read (after primer trimming), should primer-less
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
8 reads be kept (boolean), and finally the output sequence filename.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
9
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
10 Both the primer and read sequences can contain IUPAC ambiguity codes like N.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
11
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
12 This supports FASTA, FASTQ and SFF sequence files. Colorspace reads are not
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
13 supported.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
14
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
15 The mismatch parameter does not consider gapped alignemnts, however the
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
16 special case of missing bases at the very start or end of the read is handled.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
17 e.g. a primer sequence CCGACTCGAG will match a read starting CGACTCGAG...
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
18 if one or more mismatches are allowed.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
19
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
20 This can also be used for stripping off (and optionally filtering on) barcodes.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
21
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
22 Note that only the trim/clip values in the SFF file are changed, not the flow
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
23 information of the full read sequence.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
24
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
25 This script is copyright 2011-2013 by Peter Cock, The James Hutton Institute
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
26 (formerly the Scottish Crop Research Institute, SCRI), UK. All rights reserved.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
27 See accompanying text file for licence details (MIT/BSD style).
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
28
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
29 NOTE: Currently it uses Python's regular expression engine for finding the
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
30 primers, which for my needs is fast enough.
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
31 """
5
530c8d6fedd8 v0.0.15 - internal changes
peterjc
parents: 4
diff changeset
32
530c8d6fedd8 v0.0.15 - internal changes
peterjc
parents: 4
diff changeset
33 import re
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
34 import sys
5
530c8d6fedd8 v0.0.15 - internal changes
peterjc
parents: 4
diff changeset
35
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
36 from galaxy_utils.sequence.fasta import fastaReader, fastaWriter
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
37 from galaxy_utils.sequence.fastq import fastqReader, fastqWriter
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
38
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
39 if "-v" in sys.argv or "--version" in sys.argv:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
40 print "v0.0.12"
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
41 sys.exit(0)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
42
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
43 try:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
44 from Bio.Seq import reverse_complement
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
45 from Bio.SeqIO.SffIO import SffIterator, SffWriter
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
46 except ImportError:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
47 sys.exit("Requires Biopython 1.54 or later")
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
48 try:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
49 from Bio.SeqIO.SffIO import ReadRocheXmlManifest
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
50 except ImportError:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
51 # Prior to Biopython 1.56 this was a private function
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
52 from Bio.SeqIO.SffIO import _sff_read_roche_index_xml as ReadRocheXmlManifest
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
53
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
54 # Parse Command Line
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
55 try:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
56 in_file, seq_format, primer_fasta, primer_type, mm, min_len, keep_negatives, out_file = sys.argv[1:]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
57 except ValueError:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
58 sys.exit("Expected 8 arguments, got %i:\n%s" % (len(sys.argv) - 1, " ".join(sys.argv)))
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
59
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
60 if in_file == primer_fasta:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
61 sys.exit("Same file given as both primer sequences and sequences to clip!")
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
62 if in_file == out_file:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
63 sys.exit("Same file given as both sequences to clip and output!")
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
64 if primer_fasta == out_file:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
65 sys.exit("Same file given as both primer sequences and output!")
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
66
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
67 try:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
68 mm = int(mm)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
69 except ValueError:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
70 sys.exit("Expected non-negative integer number of mismatches (e.g. 0 or 1), not %r" % mm)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
71 if mm < 0:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
72 sys.exit("Expected non-negtive integer number of mismatches (e.g. 0 or 1), not %r" % mm)
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
73 if mm not in [0, 1, 2]:
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
74 raise NotImplementedError
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
75
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
76 try:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
77 min_len = int(min_len)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
78 except ValueError:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
79 sys.exit("Expected non-negative integer min_len (e.g. 0 or 1), not %r" % min_len)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
80 if min_len < 0:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
81 sys.exit("Expected non-negtive integer min_len (e.g. 0 or 1), not %r" % min_len)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
82
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
83
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
84 if keep_negatives.lower() in ["true", "yes", "on"]:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
85 keep_negatives = True
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
86 elif keep_negatives.lower() in ["false", "no", "off"]:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
87 keep_negatives = False
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
88 else:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
89 sys.exit("Expected boolean for keep_negatives (e.g. true or false), not %r" % keep_negatives)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
90
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
91
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
92 if primer_type.lower() == "forward":
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
93 forward = True
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
94 rc = False
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
95 elif primer_type.lower() == "reverse":
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
96 forward = False
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
97 rc = False
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
98 elif primer_type.lower() == "reverse-complement":
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
99 forward = False
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
100 rc = True
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
101 else:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
102 sys.exit("Expected foward, reverse or reverse-complement not %r" % primer_type)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
103
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
104
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
105 ambiguous_dna_values = {
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
106 "A": "A",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
107 "C": "C",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
108 "G": "G",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
109 "T": "T",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
110 "M": "ACM",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
111 "R": "AGR",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
112 "W": "ATW",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
113 "S": "CGS",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
114 "Y": "CTY",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
115 "K": "GTK",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
116 "V": "ACGMRSV",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
117 "H": "ACTMWYH",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
118 "D": "AGTRWKD",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
119 "B": "CGTSYKB",
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
120 "X": ".", # faster than [GATCMRWSYKVVHDBXN] or even [GATC]
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
121 "N": ".",
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
122 }
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
123
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
124 ambiguous_dna_re = {}
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
125 for letter, values in ambiguous_dna_values.iteritems():
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
126 if len(values) == 1:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
127 ambiguous_dna_re[letter] = values
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
128 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
129 ambiguous_dna_re[letter] = "[%s]" % values
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
130
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
131
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
132 def make_reg_ex(seq):
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
133 return "".join(ambiguous_dna_re[letter] for letter in seq)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
134
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
135
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
136 def make_reg_ex_mm(seq, mm):
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
137 if mm > 2:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
138 raise NotImplementedError("At most 2 mismatches allowed!")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
139 seq = seq.upper()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
140 yield make_reg_ex(seq)
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
141 for i in range(1, mm + 1):
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
142 # Missing first/last i bases at very start/end of sequence
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
143 for reg in make_reg_ex_mm(seq[i:], mm - i):
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
144 yield "^" + reg
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
145 for reg in make_reg_ex_mm(seq[:-i], mm - i):
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
146 yield "$" + reg
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
147 if mm >= 1:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
148 for i, letter in enumerate(seq):
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
149 # We'll use a set to remove any duplicate patterns
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
150 # if letter not in "NX":
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
151 pattern = seq[:i] + "N" + seq[i + 1:]
5
530c8d6fedd8 v0.0.15 - internal changes
peterjc
parents: 4
diff changeset
152 assert len(pattern) == len(seq), ("Len %s is %i, len %s is %i"
530c8d6fedd8 v0.0.15 - internal changes
peterjc
parents: 4
diff changeset
153 % (pattern, len(pattern), seq, len(seq)))
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
154 yield make_reg_ex(pattern)
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
155 if mm >= 2:
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
156 for i, letter in enumerate(seq):
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
157 # We'll use a set to remove any duplicate patterns
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
158 # if letter not in "NX":
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
159 for k, letter in enumerate(seq[i + 1:]):
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
160 # We'll use a set to remove any duplicate patterns
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
161 # if letter not in "NX":
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
162 pattern = seq[:i] + "N" + seq[i + 1:i + 1 + k] + "N" + seq[i + k + 2:]
5
530c8d6fedd8 v0.0.15 - internal changes
peterjc
parents: 4
diff changeset
163 assert len(pattern) == len(seq), ("Len %s is %i, len %s is %i"
530c8d6fedd8 v0.0.15 - internal changes
peterjc
parents: 4
diff changeset
164 % (pattern, len(pattern), seq, len(seq)))
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
165 yield make_reg_ex(pattern)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
166
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
167
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
168 def load_primers_as_re(primer_fasta, mm, rc=False):
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
169 # Read primer file and record all specified sequences
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
170 primers = set()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
171 in_handle = open(primer_fasta, "rU")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
172 reader = fastaReader(in_handle)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
173 count = 0
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
174 for record in reader:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
175 if rc:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
176 seq = reverse_complement(record.sequence)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
177 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
178 seq = record.sequence
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
179 # primers.add(re.compile(make_reg_ex(seq)))
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
180 count += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
181 for pattern in make_reg_ex_mm(seq, mm):
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
182 primers.add(pattern)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
183 in_handle.close()
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
184 # Use set to avoid duplicates, sort to have longest first
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
185 # (so more specific primers found before less specific ones)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
186 primers = sorted(set(primers), key=lambda p: -len(p))
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
187 return count, re.compile("|".join(primers)) # make one monster re!
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
188
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
189
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
190 # Read primer file and record all specified sequences
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
191 count, primer = load_primers_as_re(primer_fasta, mm, rc)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
192 print "%i primer sequences" % count
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
193
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
194 short_neg = 0
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
195 short_clipped = 0
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
196 clipped = 0
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
197 negs = 0
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
198
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
199 if seq_format.lower() == "sff":
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
200 # SFF is different because we just change the trim points
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
201 if forward:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
202 def process(records):
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
203 global short_clipped, short_neg, clipped, negs
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
204 for record in records:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
205 left_clip = record.annotations["clip_qual_left"]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
206 right_clip = record.annotations["clip_qual_right"]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
207 seq = str(record.seq)[left_clip:right_clip].upper()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
208 result = primer.search(seq)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
209 if result:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
210 # Forward primer, take everything after it
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
211 # so move the left clip along
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
212 if len(seq) - result.end() >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
213 record.annotations["clip_qual_left"] = left_clip + result.end()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
214 clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
215 yield record
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
216 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
217 short_clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
218 elif keep_negatives:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
219 if len(seq) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
220 negs += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
221 yield record
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
222 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
223 short_neg += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
224 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
225 def process(records):
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
226 global short_clipped, short_neg, clipped, negs
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
227 for record in records:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
228 left_clip = record.annotations["clip_qual_left"]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
229 right_clip = record.annotations["clip_qual_right"]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
230 seq = str(record.seq)[left_clip:right_clip].upper()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
231 result = primer.search(seq)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
232 if result:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
233 # Reverse primer, take everything before it
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
234 # so move the right clip back
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
235 new_len = result.start()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
236 if new_len >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
237 record.annotations["clip_qual_right"] = left_clip + new_len
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
238 clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
239 yield record
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
240 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
241 short_clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
242 elif keep_negatives:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
243 if len(seq) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
244 negs += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
245 yield record
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
246 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
247 short_neg += 1
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
248
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
249 in_handle = open(in_file, "rb")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
250 try:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
251 manifest = ReadRocheXmlManifest(in_handle)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
252 except ValueError:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
253 manifest = None
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
254 in_handle.seek(0)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
255 out_handle = open(out_file, "wb")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
256 writer = SffWriter(out_handle, xml=manifest)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
257 writer.write_file(process(SffIterator(in_handle)))
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
258 # End of SFF code
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
259 elif seq_format.lower().startswith("fastq"):
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
260 in_handle = open(in_file, "rU")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
261 out_handle = open(out_file, "w")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
262 reader = fastqReader(in_handle)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
263 writer = fastqWriter(out_handle)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
264 if forward:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
265 for record in reader:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
266 seq = record.sequence.upper()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
267 result = primer.search(seq)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
268 if result:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
269 # Forward primer, take everything after it
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
270 cut = result.end()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
271 record.sequence = seq[cut:]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
272 if len(record.sequence) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
273 record.quality = record.quality[cut:]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
274 clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
275 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
276 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
277 short_clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
278 elif keep_negatives:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
279 if len(record) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
280 negs += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
281 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
282 else:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
283 short_neg += 1
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
284 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
285 for record in reader:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
286 seq = record.sequence.upper()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
287 result = primer.search(seq)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
288 if result:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
289 # Reverse primer, take everything before it
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
290 cut = result.start()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
291 record.sequence = seq[:cut]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
292 if len(record.sequence) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
293 record.quality = record.quality[:cut]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
294 clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
295 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
296 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
297 short_clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
298 elif keep_negatives:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
299 if len(record) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
300 negs += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
301 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
302 else:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
303 short_neg += 1
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
304 elif seq_format.lower() == "fasta":
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
305 in_handle = open(in_file, "rU")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
306 out_handle = open(out_file, "w")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
307 reader = fastaReader(in_handle)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
308 writer = fastaWriter(out_handle)
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
309 # Following code is identical to that for FASTQ but without editing qualities
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
310 if forward:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
311 for record in reader:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
312 seq = record.sequence.upper()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
313 result = primer.search(seq)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
314 if result:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
315 # Forward primer, take everything after it
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
316 cut = result.end()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
317 record.sequence = seq[cut:]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
318 if len(record.sequence) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
319 clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
320 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
321 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
322 short_clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
323 elif keep_negatives:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
324 if len(record) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
325 negs += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
326 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
327 else:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
328 short_neg += 1
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
329 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
330 for record in reader:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
331 seq = record.sequence.upper()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
332 result = primer.search(seq)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
333 if result:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
334 # Reverse primer, take everything before it
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
335 cut = result.start()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
336 record.sequence = seq[:cut]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
337 if len(record.sequence) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
338 clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
339 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
340 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
341 short_clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
342 elif keep_negatives:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
343 if len(record) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
344 negs += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
345 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
346 else:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
347 short_neg += 1
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
348 else:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
349 sys.exit("Unsupported file type %r" % seq_format)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
350 in_handle.close()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
351 out_handle.close()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
352
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
353 print "Kept %i clipped reads," % clipped
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
354 print "discarded %i short." % short_clipped
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
355 if keep_negatives:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
356 print "Kept %i non-matching reads," % negs
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
357 print "discarded %i short." % short_neg