Mercurial > repos > peterjc > seq_primer_clip
annotate tools/seq_primer_clip/seq_primer_clip.py @ 4:9b074c1db68e draft
v0.0.14 galaxy_sequence_utils dependency etc
author | peterjc |
---|---|
date | Thu, 02 Feb 2017 11:52:37 -0500 |
parents | ee5acea162a7 |
children | 530c8d6fedd8 |
rev | line source |
---|---|
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
1 #!/usr/bin/env python |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
2 """Looks for the given primer sequences and clips matching SFF reads. |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
3 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
4 Takes eight command line options, input read filename, input read format, |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
5 input primer FASTA filename, type of primers (forward, reverse or reverse- |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
6 complement), number of mismatches (currently only 0, 1 and 2 are supported), |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
7 minimum length to keep a read (after primer trimming), should primer-less |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
8 reads be kept (boolean), and finally the output sequence filename. |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
9 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
10 Both the primer and read sequences can contain IUPAC ambiguity codes like N. |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
11 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
12 This supports FASTA, FASTQ and SFF sequence files. Colorspace reads are not |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
13 supported. |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
14 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
15 The mismatch parameter does not consider gapped alignemnts, however the |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
16 special case of missing bases at the very start or end of the read is handled. |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
17 e.g. a primer sequence CCGACTCGAG will match a read starting CGACTCGAG... |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
18 if one or more mismatches are allowed. |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
19 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
20 This can also be used for stripping off (and optionally filtering on) barcodes. |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
21 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
22 Note that only the trim/clip values in the SFF file are changed, not the flow |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
23 information of the full read sequence. |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
24 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
25 This script is copyright 2011-2013 by Peter Cock, The James Hutton Institute |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
26 (formerly the Scottish Crop Research Institute, SCRI), UK. All rights reserved. |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
27 See accompanying text file for licence details (MIT/BSD style). |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
28 |
4 | 29 NOTE: Currently it uses Python's regular expression engine for finding the |
30 primers, which for my needs is fast enough. | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
31 """ |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
32 import sys |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
33 import re |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
34 from galaxy_utils.sequence.fasta import fastaReader, fastaWriter |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
35 from galaxy_utils.sequence.fastq import fastqReader, fastqWriter |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
36 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
37 if "-v" in sys.argv or "--version" in sys.argv: |
4 | 38 print "v0.0.12" |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
39 sys.exit(0) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
40 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
41 try: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
42 from Bio.Seq import reverse_complement |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
43 from Bio.SeqIO.SffIO import SffIterator, SffWriter |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
44 except ImportError: |
4 | 45 sys.exit("Requires Biopython 1.54 or later") |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
46 try: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
47 from Bio.SeqIO.SffIO import ReadRocheXmlManifest |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
48 except ImportError: |
4 | 49 # Prior to Biopython 1.56 this was a private function |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
50 from Bio.SeqIO.SffIO import _sff_read_roche_index_xml as ReadRocheXmlManifest |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
51 |
4 | 52 # Parse Command Line |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
53 try: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
54 in_file, seq_format, primer_fasta, primer_type, mm, min_len, keep_negatives, out_file = sys.argv[1:] |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
55 except ValueError: |
4 | 56 sys.exit("Expected 8 arguments, got %i:\n%s" % (len(sys.argv) - 1, " ".join(sys.argv))) |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
57 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
58 if in_file == primer_fasta: |
4 | 59 sys.exit("Same file given as both primer sequences and sequences to clip!") |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
60 if in_file == out_file: |
4 | 61 sys.exit("Same file given as both sequences to clip and output!") |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
62 if primer_fasta == out_file: |
4 | 63 sys.exit("Same file given as both primer sequences and output!") |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
64 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
65 try: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
66 mm = int(mm) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
67 except ValueError: |
4 | 68 sys.exit("Expected non-negative integer number of mismatches (e.g. 0 or 1), not %r" % mm) |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
69 if mm < 0: |
4 | 70 sys.exit("Expected non-negtive integer number of mismatches (e.g. 0 or 1), not %r" % mm) |
71 if mm not in [0, 1, 2]: | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
72 raise NotImplementedError |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
73 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
74 try: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
75 min_len = int(min_len) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
76 except ValueError: |
4 | 77 sys.exit("Expected non-negative integer min_len (e.g. 0 or 1), not %r" % min_len) |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
78 if min_len < 0: |
4 | 79 sys.exit("Expected non-negtive integer min_len (e.g. 0 or 1), not %r" % min_len) |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
80 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
81 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
82 if keep_negatives.lower() in ["true", "yes", "on"]: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
83 keep_negatives = True |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
84 elif keep_negatives.lower() in ["false", "no", "off"]: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
85 keep_negatives = False |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
86 else: |
4 | 87 sys.exit("Expected boolean for keep_negatives (e.g. true or false), not %r" % keep_negatives) |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
88 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
89 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
90 if primer_type.lower() == "forward": |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
91 forward = True |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
92 rc = False |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
93 elif primer_type.lower() == "reverse": |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
94 forward = False |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
95 rc = False |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
96 elif primer_type.lower() == "reverse-complement": |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
97 forward = False |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
98 rc = True |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
99 else: |
4 | 100 sys.exit("Expected foward, reverse or reverse-complement not %r" % primer_type) |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
101 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
102 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
103 ambiguous_dna_values = { |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
104 "A": "A", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
105 "C": "C", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
106 "G": "G", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
107 "T": "T", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
108 "M": "ACM", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
109 "R": "AGR", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
110 "W": "ATW", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
111 "S": "CGS", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
112 "Y": "CTY", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
113 "K": "GTK", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
114 "V": "ACGMRSV", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
115 "H": "ACTMWYH", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
116 "D": "AGTRWKD", |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
117 "B": "CGTSYKB", |
4 | 118 "X": ".", # faster than [GATCMRWSYKVVHDBXN] or even [GATC] |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
119 "N": ".", |
4 | 120 } |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
121 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
122 ambiguous_dna_re = {} |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
123 for letter, values in ambiguous_dna_values.iteritems(): |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
124 if len(values) == 1: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
125 ambiguous_dna_re[letter] = values |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
126 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
127 ambiguous_dna_re[letter] = "[%s]" % values |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
128 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
129 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
130 def make_reg_ex(seq): |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
131 return "".join(ambiguous_dna_re[letter] for letter in seq) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
132 |
4 | 133 |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
134 def make_reg_ex_mm(seq, mm): |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
135 if mm > 2: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
136 raise NotImplementedError("At most 2 mismatches allowed!") |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
137 seq = seq.upper() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
138 yield make_reg_ex(seq) |
4 | 139 for i in range(1, mm + 1): |
140 # Missing first/last i bases at very start/end of sequence | |
141 for reg in make_reg_ex_mm(seq[i:], mm - i): | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
142 yield "^" + reg |
4 | 143 for reg in make_reg_ex_mm(seq[:-i], mm - i): |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
144 yield "$" + reg |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
145 if mm >= 1: |
4 | 146 for i, letter in enumerate(seq): |
147 # We'll use a set to remove any duplicate patterns | |
148 # if letter not in "NX": | |
149 pattern = seq[:i] + "N" + seq[i + 1:] | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
150 assert len(pattern) == len(seq), "Len %s is %i, len %s is %i" \ |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
151 % (pattern, len(pattern), seq, len(seq)) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
152 yield make_reg_ex(pattern) |
4 | 153 if mm >= 2: |
154 for i, letter in enumerate(seq): | |
155 # We'll use a set to remove any duplicate patterns | |
156 # if letter not in "NX": | |
157 for k, letter in enumerate(seq[i + 1:]): | |
158 # We'll use a set to remove any duplicate patterns | |
159 # if letter not in "NX": | |
160 pattern = seq[:i] + "N" + seq[i + 1:i + 1 + k] + "N" + seq[i + k + 2:] | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
161 assert len(pattern) == len(seq), "Len %s is %i, len %s is %i" \ |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
162 % (pattern, len(pattern), seq, len(seq)) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
163 yield make_reg_ex(pattern) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
164 |
4 | 165 |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
166 def load_primers_as_re(primer_fasta, mm, rc=False): |
4 | 167 # Read primer file and record all specified sequences |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
168 primers = set() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
169 in_handle = open(primer_fasta, "rU") |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
170 reader = fastaReader(in_handle) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
171 count = 0 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
172 for record in reader: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
173 if rc: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
174 seq = reverse_complement(record.sequence) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
175 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
176 seq = record.sequence |
4 | 177 # primers.add(re.compile(make_reg_ex(seq))) |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
178 count += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
179 for pattern in make_reg_ex_mm(seq, mm): |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
180 primers.add(pattern) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
181 in_handle.close() |
4 | 182 # Use set to avoid duplicates, sort to have longest first |
183 # (so more specific primers found before less specific ones) | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
184 primers = sorted(set(primers), key=lambda p: -len(p)) |
4 | 185 return count, re.compile("|".join(primers)) # make one monster re! |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
186 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
187 |
4 | 188 # Read primer file and record all specified sequences |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
189 count, primer = load_primers_as_re(primer_fasta, mm, rc) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
190 print "%i primer sequences" % count |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
191 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
192 short_neg = 0 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
193 short_clipped = 0 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
194 clipped = 0 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
195 negs = 0 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
196 |
4 | 197 if seq_format.lower() == "sff": |
198 # SFF is different because we just change the trim points | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
199 if forward: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
200 def process(records): |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
201 global short_clipped, short_neg, clipped, negs |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
202 for record in records: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
203 left_clip = record.annotations["clip_qual_left"] |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
204 right_clip = record.annotations["clip_qual_right"] |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
205 seq = str(record.seq)[left_clip:right_clip].upper() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
206 result = primer.search(seq) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
207 if result: |
4 | 208 # Forward primer, take everything after it |
209 # so move the left clip along | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
210 if len(seq) - result.end() >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
211 record.annotations["clip_qual_left"] = left_clip + result.end() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
212 clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
213 yield record |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
214 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
215 short_clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
216 elif keep_negatives: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
217 if len(seq) >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
218 negs += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
219 yield record |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
220 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
221 short_neg += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
222 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
223 def process(records): |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
224 global short_clipped, short_neg, clipped, negs |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
225 for record in records: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
226 left_clip = record.annotations["clip_qual_left"] |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
227 right_clip = record.annotations["clip_qual_right"] |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
228 seq = str(record.seq)[left_clip:right_clip].upper() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
229 result = primer.search(seq) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
230 if result: |
4 | 231 # Reverse primer, take everything before it |
232 # so move the right clip back | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
233 new_len = result.start() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
234 if new_len >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
235 record.annotations["clip_qual_right"] = left_clip + new_len |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
236 clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
237 yield record |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
238 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
239 short_clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
240 elif keep_negatives: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
241 if len(seq) >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
242 negs += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
243 yield record |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
244 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
245 short_neg += 1 |
4 | 246 |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
247 in_handle = open(in_file, "rb") |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
248 try: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
249 manifest = ReadRocheXmlManifest(in_handle) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
250 except ValueError: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
251 manifest = None |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
252 in_handle.seek(0) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
253 out_handle = open(out_file, "wb") |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
254 writer = SffWriter(out_handle, xml=manifest) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
255 writer.write_file(process(SffIterator(in_handle))) |
4 | 256 # End of SFF code |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
257 elif seq_format.lower().startswith("fastq"): |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
258 in_handle = open(in_file, "rU") |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
259 out_handle = open(out_file, "w") |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
260 reader = fastqReader(in_handle) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
261 writer = fastqWriter(out_handle) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
262 if forward: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
263 for record in reader: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
264 seq = record.sequence.upper() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
265 result = primer.search(seq) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
266 if result: |
4 | 267 # Forward primer, take everything after it |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
268 cut = result.end() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
269 record.sequence = seq[cut:] |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
270 if len(record.sequence) >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
271 record.quality = record.quality[cut:] |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
272 clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
273 writer.write(record) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
274 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
275 short_clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
276 elif keep_negatives: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
277 if len(record) >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
278 negs += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
279 writer.write(record) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
280 else: |
4 | 281 short_neg += 1 |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
282 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
283 for record in reader: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
284 seq = record.sequence.upper() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
285 result = primer.search(seq) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
286 if result: |
4 | 287 # Reverse primer, take everything before it |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
288 cut = result.start() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
289 record.sequence = seq[:cut] |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
290 if len(record.sequence) >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
291 record.quality = record.quality[:cut] |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
292 clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
293 writer.write(record) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
294 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
295 short_clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
296 elif keep_negatives: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
297 if len(record) >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
298 negs += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
299 writer.write(record) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
300 else: |
4 | 301 short_neg += 1 |
302 elif seq_format.lower() == "fasta": | |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
303 in_handle = open(in_file, "rU") |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
304 out_handle = open(out_file, "w") |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
305 reader = fastaReader(in_handle) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
306 writer = fastaWriter(out_handle) |
4 | 307 # Following code is identical to that for FASTQ but without editing qualities |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
308 if forward: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
309 for record in reader: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
310 seq = record.sequence.upper() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
311 result = primer.search(seq) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
312 if result: |
4 | 313 # Forward primer, take everything after it |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
314 cut = result.end() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
315 record.sequence = seq[cut:] |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
316 if len(record.sequence) >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
317 clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
318 writer.write(record) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
319 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
320 short_clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
321 elif keep_negatives: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
322 if len(record) >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
323 negs += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
324 writer.write(record) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
325 else: |
4 | 326 short_neg += 1 |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
327 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
328 for record in reader: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
329 seq = record.sequence.upper() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
330 result = primer.search(seq) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
331 if result: |
4 | 332 # Reverse primer, take everything before it |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
333 cut = result.start() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
334 record.sequence = seq[:cut] |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
335 if len(record.sequence) >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
336 clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
337 writer.write(record) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
338 else: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
339 short_clipped += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
340 elif keep_negatives: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
341 if len(record) >= min_len: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
342 negs += 1 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
343 writer.write(record) |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
344 else: |
4 | 345 short_neg += 1 |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
346 else: |
4 | 347 sys.exit("Unsupported file type %r" % seq_format) |
2
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
348 in_handle.close() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
349 out_handle.close() |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
350 |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
351 print "Kept %i clipped reads," % clipped |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
352 print "discarded %i short." % short_clipped |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
353 if keep_negatives: |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
354 print "Kept %i non-matching reads," % negs |
ee5acea162a7
Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff
changeset
|
355 print "discarded %i short." % short_neg |