Mercurial > repos > peterjc > seq_primer_clip
annotate tools/seq_primer_clip/seq_primer_clip.py @ 6:b9dc7c967ee6 draft default tip
v0.0.16 Python 3 compatible print function
author | peterjc |
---|---|
date | Tue, 16 May 2017 09:36:50 -0400 |
parents | 530c8d6fedd8 |
children |
rev | line source |
---|---|
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
1 #!/usr/bin/env python |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
2 """Looks for the given primer sequences and clips matching SFF reads. |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
3 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
4 Takes eight command line options, input read filename, input read format, |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
5 input primer FASTA filename, type of primers (forward, reverse or reverse- |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
6 complement), number of mismatches (currently only 0, 1 and 2 are supported), |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
7 minimum length to keep a read (after primer trimming), should primer-less |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
8 reads be kept (boolean), and finally the output sequence filename. |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
9 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
10 Both the primer and read sequences can contain IUPAC ambiguity codes like N. |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
11 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
12 This supports FASTA, FASTQ and SFF sequence files. Colorspace reads are not |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
13 supported. |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
14 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
15 The mismatch parameter does not consider gapped alignemnts, however the |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
16 special case of missing bases at the very start or end of the read is handled. |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
17 e.g. a primer sequence CCGACTCGAG will match a read starting CGACTCGAG... |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
18 if one or more mismatches are allowed. |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
19 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
20 This can also be used for stripping off (and optionally filtering on) barcodes. |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
21 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
22 Note that only the trim/clip values in the SFF file are changed, not the flow |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
23 information of the full read sequence. |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
24 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
25 This script is copyright 2011-2013 by Peter Cock, The James Hutton Institute |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
26 (formerly the Scottish Crop Research Institute, SCRI), UK. All rights reserved. |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
27 See accompanying text file for licence details (MIT/BSD style). |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
28 |
4 | 29 NOTE: Currently it uses Python's regular expression engine for finding the |
30 primers, which for my needs is fast enough. | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
31 """ |
5 | 32 |
33 import re | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
34 import sys |
5 | 35 |
6 | 36 if "-v" in sys.argv or "--version" in sys.argv: |
37 print("v0.0.16") | |
38 sys.exit(0) | |
39 | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
40 from galaxy_utils.sequence.fasta import fastaReader, fastaWriter |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
41 from galaxy_utils.sequence.fastq import fastqReader, fastqWriter |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
42 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
43 try: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
44 from Bio.Seq import reverse_complement |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
45 from Bio.SeqIO.SffIO import SffIterator, SffWriter |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
46 except ImportError: |
4 | 47 sys.exit("Requires Biopython 1.54 or later") |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
48 try: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
49 from Bio.SeqIO.SffIO import ReadRocheXmlManifest |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
50 except ImportError: |
4 | 51 # Prior to Biopython 1.56 this was a private function |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
52 from Bio.SeqIO.SffIO import _sff_read_roche_index_xml as ReadRocheXmlManifest |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
53 |
4 | 54 # Parse Command Line |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
55 try: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
56 in_file, seq_format, primer_fasta, primer_type, mm, min_len, keep_negatives, out_file = sys.argv[1:] |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
57 except ValueError: |
4 | 58 sys.exit("Expected 8 arguments, got %i:\n%s" % (len(sys.argv) - 1, " ".join(sys.argv))) |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
59 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
60 if in_file == primer_fasta: |
4 | 61 sys.exit("Same file given as both primer sequences and sequences to clip!") |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
62 if in_file == out_file: |
4 | 63 sys.exit("Same file given as both sequences to clip and output!") |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
64 if primer_fasta == out_file: |
4 | 65 sys.exit("Same file given as both primer sequences and output!") |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
66 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
67 try: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
68 mm = int(mm) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
69 except ValueError: |
4 | 70 sys.exit("Expected non-negative integer number of mismatches (e.g. 0 or 1), not %r" % mm) |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
71 if mm < 0: |
4 | 72 sys.exit("Expected non-negtive integer number of mismatches (e.g. 0 or 1), not %r" % mm) |
73 if mm not in [0, 1, 2]: | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
74 raise NotImplementedError |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
75 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
76 try: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
77 min_len = int(min_len) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
78 except ValueError: |
4 | 79 sys.exit("Expected non-negative integer min_len (e.g. 0 or 1), not %r" % min_len) |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
80 if min_len < 0: |
4 | 81 sys.exit("Expected non-negtive integer min_len (e.g. 0 or 1), not %r" % min_len) |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
82 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
83 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
84 if keep_negatives.lower() in ["true", "yes", "on"]: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
85 keep_negatives = True |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
86 elif keep_negatives.lower() in ["false", "no", "off"]: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
87 keep_negatives = False |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
88 else: |
4 | 89 sys.exit("Expected boolean for keep_negatives (e.g. true or false), not %r" % keep_negatives) |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
90 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
91 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
92 if primer_type.lower() == "forward": |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
93 forward = True |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
94 rc = False |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
95 elif primer_type.lower() == "reverse": |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
96 forward = False |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
97 rc = False |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
98 elif primer_type.lower() == "reverse-complement": |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
99 forward = False |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
100 rc = True |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
101 else: |
4 | 102 sys.exit("Expected foward, reverse or reverse-complement not %r" % primer_type) |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
103 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
104 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
105 ambiguous_dna_values = { |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
106 "A": "A", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
107 "C": "C", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
108 "G": "G", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
109 "T": "T", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
110 "M": "ACM", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
111 "R": "AGR", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
112 "W": "ATW", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
113 "S": "CGS", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
114 "Y": "CTY", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
115 "K": "GTK", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
116 "V": "ACGMRSV", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
117 "H": "ACTMWYH", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
118 "D": "AGTRWKD", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
119 "B": "CGTSYKB", |
4 | 120 "X": ".", # faster than [GATCMRWSYKVVHDBXN] or even [GATC] |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
121 "N": ".", |
4 | 122 } |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
123 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
124 ambiguous_dna_re = {} |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
125 for letter, values in ambiguous_dna_values.iteritems(): |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
126 if len(values) == 1: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
127 ambiguous_dna_re[letter] = values |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
128 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
129 ambiguous_dna_re[letter] = "[%s]" % values |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
130 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
131 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
132 def make_reg_ex(seq): |
6 | 133 """Make regular expression for ambiguous DNA.""" |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
134 return "".join(ambiguous_dna_re[letter] for letter in seq) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
135 |
4 | 136 |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
137 def make_reg_ex_mm(seq, mm): |
6 | 138 """Make regular expression for mis-matches.""" |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
139 if mm > 2: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
140 raise NotImplementedError("At most 2 mismatches allowed!") |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
141 seq = seq.upper() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
142 yield make_reg_ex(seq) |
4 | 143 for i in range(1, mm + 1): |
144 # Missing first/last i bases at very start/end of sequence | |
145 for reg in make_reg_ex_mm(seq[i:], mm - i): | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
146 yield "^" + reg |
4 | 147 for reg in make_reg_ex_mm(seq[:-i], mm - i): |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
148 yield "$" + reg |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
149 if mm >= 1: |
4 | 150 for i, letter in enumerate(seq): |
151 # We'll use a set to remove any duplicate patterns | |
152 # if letter not in "NX": | |
153 pattern = seq[:i] + "N" + seq[i + 1:] | |
5 | 154 assert len(pattern) == len(seq), ("Len %s is %i, len %s is %i" |
155 % (pattern, len(pattern), seq, len(seq))) | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
156 yield make_reg_ex(pattern) |
4 | 157 if mm >= 2: |
158 for i, letter in enumerate(seq): | |
159 # We'll use a set to remove any duplicate patterns | |
160 # if letter not in "NX": | |
161 for k, letter in enumerate(seq[i + 1:]): | |
162 # We'll use a set to remove any duplicate patterns | |
163 # if letter not in "NX": | |
164 pattern = seq[:i] + "N" + seq[i + 1:i + 1 + k] + "N" + seq[i + k + 2:] | |
5 | 165 assert len(pattern) == len(seq), ("Len %s is %i, len %s is %i" |
166 % (pattern, len(pattern), seq, len(seq))) | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
167 yield make_reg_ex(pattern) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
168 |
4 | 169 |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
170 def load_primers_as_re(primer_fasta, mm, rc=False): |
6 | 171 """Load primers as regular expressions. |
172 | |
173 Read primer file and record all specified sequences. | |
174 """ | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
175 primers = set() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
176 in_handle = open(primer_fasta, "rU") |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
177 reader = fastaReader(in_handle) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
178 count = 0 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
179 for record in reader: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
180 if rc: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
181 seq = reverse_complement(record.sequence) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
182 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
183 seq = record.sequence |
4 | 184 # primers.add(re.compile(make_reg_ex(seq))) |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
185 count += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
186 for pattern in make_reg_ex_mm(seq, mm): |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
187 primers.add(pattern) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
188 in_handle.close() |
4 | 189 # Use set to avoid duplicates, sort to have longest first |
190 # (so more specific primers found before less specific ones) | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
191 primers = sorted(set(primers), key=lambda p: -len(p)) |
4 | 192 return count, re.compile("|".join(primers)) # make one monster re! |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
193 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
194 |
4 | 195 # Read primer file and record all specified sequences |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
196 count, primer = load_primers_as_re(primer_fasta, mm, rc) |
6 | 197 print("%i primer sequences" % count) |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
198 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
199 short_neg = 0 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
200 short_clipped = 0 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
201 clipped = 0 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
202 negs = 0 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
203 |
4 | 204 if seq_format.lower() == "sff": |
205 # SFF is different because we just change the trim points | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
206 if forward: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
207 def process(records): |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
208 global short_clipped, short_neg, clipped, negs |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
209 for record in records: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
210 left_clip = record.annotations["clip_qual_left"] |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
211 right_clip = record.annotations["clip_qual_right"] |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
212 seq = str(record.seq)[left_clip:right_clip].upper() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
213 result = primer.search(seq) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
214 if result: |
4 | 215 # Forward primer, take everything after it |
216 # so move the left clip along | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
217 if len(seq) - result.end() >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
218 record.annotations["clip_qual_left"] = left_clip + result.end() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
219 clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
220 yield record |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
221 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
222 short_clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
223 elif keep_negatives: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
224 if len(seq) >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
225 negs += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
226 yield record |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
227 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
228 short_neg += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
229 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
230 def process(records): |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
231 global short_clipped, short_neg, clipped, negs |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
232 for record in records: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
233 left_clip = record.annotations["clip_qual_left"] |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
234 right_clip = record.annotations["clip_qual_right"] |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
235 seq = str(record.seq)[left_clip:right_clip].upper() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
236 result = primer.search(seq) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
237 if result: |
4 | 238 # Reverse primer, take everything before it |
239 # so move the right clip back | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
240 new_len = result.start() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
241 if new_len >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
242 record.annotations["clip_qual_right"] = left_clip + new_len |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
243 clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
244 yield record |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
245 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
246 short_clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
247 elif keep_negatives: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
248 if len(seq) >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
249 negs += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
250 yield record |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
251 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
252 short_neg += 1 |
4 | 253 |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
254 in_handle = open(in_file, "rb") |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
255 try: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
256 manifest = ReadRocheXmlManifest(in_handle) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
257 except ValueError: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
258 manifest = None |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
259 in_handle.seek(0) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
260 out_handle = open(out_file, "wb") |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
261 writer = SffWriter(out_handle, xml=manifest) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
262 writer.write_file(process(SffIterator(in_handle))) |
4 | 263 # End of SFF code |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
264 elif seq_format.lower().startswith("fastq"): |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
265 in_handle = open(in_file, "rU") |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
266 out_handle = open(out_file, "w") |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
267 reader = fastqReader(in_handle) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
268 writer = fastqWriter(out_handle) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
269 if forward: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
270 for record in reader: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
271 seq = record.sequence.upper() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
272 result = primer.search(seq) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
273 if result: |
4 | 274 # Forward primer, take everything after it |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
275 cut = result.end() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
276 record.sequence = seq[cut:] |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
277 if len(record.sequence) >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
278 record.quality = record.quality[cut:] |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
279 clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
280 writer.write(record) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
281 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
282 short_clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
283 elif keep_negatives: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
284 if len(record) >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
285 negs += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
286 writer.write(record) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
287 else: |
4 | 288 short_neg += 1 |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
289 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
290 for record in reader: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
291 seq = record.sequence.upper() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
292 result = primer.search(seq) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
293 if result: |
4 | 294 # Reverse primer, take everything before it |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
295 cut = result.start() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
296 record.sequence = seq[:cut] |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
297 if len(record.sequence) >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
298 record.quality = record.quality[:cut] |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
299 clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
300 writer.write(record) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
301 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
302 short_clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
303 elif keep_negatives: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
304 if len(record) >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
305 negs += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
306 writer.write(record) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
307 else: |
4 | 308 short_neg += 1 |
309 elif seq_format.lower() == "fasta": | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
310 in_handle = open(in_file, "rU") |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
311 out_handle = open(out_file, "w") |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
312 reader = fastaReader(in_handle) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
313 writer = fastaWriter(out_handle) |
4 | 314 # Following code is identical to that for FASTQ but without editing qualities |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
315 if forward: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
316 for record in reader: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
317 seq = record.sequence.upper() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
318 result = primer.search(seq) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
319 if result: |
4 | 320 # Forward primer, take everything after it |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
321 cut = result.end() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
322 record.sequence = seq[cut:] |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
323 if len(record.sequence) >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
324 clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
325 writer.write(record) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
326 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
327 short_clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
328 elif keep_negatives: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
329 if len(record) >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
330 negs += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
331 writer.write(record) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
332 else: |
4 | 333 short_neg += 1 |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
334 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
335 for record in reader: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
336 seq = record.sequence.upper() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
337 result = primer.search(seq) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
338 if result: |
4 | 339 # Reverse primer, take everything before it |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
340 cut = result.start() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
341 record.sequence = seq[:cut] |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
342 if len(record.sequence) >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
343 clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
344 writer.write(record) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
345 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
346 short_clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
347 elif keep_negatives: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
348 if len(record) >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
349 negs += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
350 writer.write(record) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
351 else: |
4 | 352 short_neg += 1 |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
353 else: |
4 | 354 sys.exit("Unsupported file type %r" % seq_format) |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
355 in_handle.close() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
356 out_handle.close() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
357 |
6 | 358 print("Kept %i clipped reads," % clipped) |
359 print("discarded %i short." % short_clipped) | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
360 if keep_negatives: |
6 | 361 print("Kept %i non-matching reads," % negs) |
362 print("discarded %i short." % short_neg) |