annotate tools/seq_primer_clip/seq_primer_clip.py @ 6:b9dc7c967ee6 draft default tip

v0.0.16 Python 3 compatible print function
author peterjc
date Tue, 16 May 2017 09:36:50 -0400
parents 530c8d6fedd8
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
1 #!/usr/bin/env python
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
2 """Looks for the given primer sequences and clips matching SFF reads.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
3
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
4 Takes eight command line options, input read filename, input read format,
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
5 input primer FASTA filename, type of primers (forward, reverse or reverse-
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
6 complement), number of mismatches (currently only 0, 1 and 2 are supported),
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
7 minimum length to keep a read (after primer trimming), should primer-less
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
8 reads be kept (boolean), and finally the output sequence filename.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
9
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
10 Both the primer and read sequences can contain IUPAC ambiguity codes like N.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
11
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
12 This supports FASTA, FASTQ and SFF sequence files. Colorspace reads are not
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
13 supported.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
14
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
15 The mismatch parameter does not consider gapped alignemnts, however the
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
16 special case of missing bases at the very start or end of the read is handled.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
17 e.g. a primer sequence CCGACTCGAG will match a read starting CGACTCGAG...
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
18 if one or more mismatches are allowed.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
19
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
20 This can also be used for stripping off (and optionally filtering on) barcodes.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
21
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
22 Note that only the trim/clip values in the SFF file are changed, not the flow
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
23 information of the full read sequence.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
24
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
25 This script is copyright 2011-2013 by Peter Cock, The James Hutton Institute
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
26 (formerly the Scottish Crop Research Institute, SCRI), UK. All rights reserved.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
27 See accompanying text file for licence details (MIT/BSD style).
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
28
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
29 NOTE: Currently it uses Python's regular expression engine for finding the
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
30 primers, which for my needs is fast enough.
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
31 """
5
530c8d6fedd8 v0.0.15 - internal changes
peterjc
parents: 4
diff changeset
32
530c8d6fedd8 v0.0.15 - internal changes
peterjc
parents: 4
diff changeset
33 import re
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
34 import sys
5
530c8d6fedd8 v0.0.15 - internal changes
peterjc
parents: 4
diff changeset
35
6
b9dc7c967ee6 v0.0.16 Python 3 compatible print function
peterjc
parents: 5
diff changeset
36 if "-v" in sys.argv or "--version" in sys.argv:
b9dc7c967ee6 v0.0.16 Python 3 compatible print function
peterjc
parents: 5
diff changeset
37 print("v0.0.16")
b9dc7c967ee6 v0.0.16 Python 3 compatible print function
peterjc
parents: 5
diff changeset
38 sys.exit(0)
b9dc7c967ee6 v0.0.16 Python 3 compatible print function
peterjc
parents: 5
diff changeset
39
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
40 from galaxy_utils.sequence.fasta import fastaReader, fastaWriter
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
41 from galaxy_utils.sequence.fastq import fastqReader, fastqWriter
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
42
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
43 try:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
44 from Bio.Seq import reverse_complement
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
45 from Bio.SeqIO.SffIO import SffIterator, SffWriter
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
46 except ImportError:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
47 sys.exit("Requires Biopython 1.54 or later")
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
48 try:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
49 from Bio.SeqIO.SffIO import ReadRocheXmlManifest
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
50 except ImportError:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
51 # Prior to Biopython 1.56 this was a private function
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
52 from Bio.SeqIO.SffIO import _sff_read_roche_index_xml as ReadRocheXmlManifest
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
53
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
54 # Parse Command Line
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
55 try:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
56 in_file, seq_format, primer_fasta, primer_type, mm, min_len, keep_negatives, out_file = sys.argv[1:]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
57 except ValueError:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
58 sys.exit("Expected 8 arguments, got %i:\n%s" % (len(sys.argv) - 1, " ".join(sys.argv)))
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
59
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
60 if in_file == primer_fasta:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
61 sys.exit("Same file given as both primer sequences and sequences to clip!")
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
62 if in_file == out_file:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
63 sys.exit("Same file given as both sequences to clip and output!")
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
64 if primer_fasta == out_file:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
65 sys.exit("Same file given as both primer sequences and output!")
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
66
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
67 try:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
68 mm = int(mm)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
69 except ValueError:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
70 sys.exit("Expected non-negative integer number of mismatches (e.g. 0 or 1), not %r" % mm)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
71 if mm < 0:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
72 sys.exit("Expected non-negtive integer number of mismatches (e.g. 0 or 1), not %r" % mm)
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
73 if mm not in [0, 1, 2]:
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
74 raise NotImplementedError
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
75
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
76 try:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
77 min_len = int(min_len)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
78 except ValueError:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
79 sys.exit("Expected non-negative integer min_len (e.g. 0 or 1), not %r" % min_len)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
80 if min_len < 0:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
81 sys.exit("Expected non-negtive integer min_len (e.g. 0 or 1), not %r" % min_len)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
82
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
83
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
84 if keep_negatives.lower() in ["true", "yes", "on"]:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
85 keep_negatives = True
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
86 elif keep_negatives.lower() in ["false", "no", "off"]:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
87 keep_negatives = False
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
88 else:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
89 sys.exit("Expected boolean for keep_negatives (e.g. true or false), not %r" % keep_negatives)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
90
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
91
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
92 if primer_type.lower() == "forward":
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
93 forward = True
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
94 rc = False
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
95 elif primer_type.lower() == "reverse":
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
96 forward = False
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
97 rc = False
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
98 elif primer_type.lower() == "reverse-complement":
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
99 forward = False
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
100 rc = True
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
101 else:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
102 sys.exit("Expected foward, reverse or reverse-complement not %r" % primer_type)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
103
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
104
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
105 ambiguous_dna_values = {
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
106 "A": "A",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
107 "C": "C",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
108 "G": "G",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
109 "T": "T",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
110 "M": "ACM",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
111 "R": "AGR",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
112 "W": "ATW",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
113 "S": "CGS",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
114 "Y": "CTY",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
115 "K": "GTK",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
116 "V": "ACGMRSV",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
117 "H": "ACTMWYH",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
118 "D": "AGTRWKD",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
119 "B": "CGTSYKB",
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
120 "X": ".", # faster than [GATCMRWSYKVVHDBXN] or even [GATC]
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
121 "N": ".",
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
122 }
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
123
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
124 ambiguous_dna_re = {}
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
125 for letter, values in ambiguous_dna_values.iteritems():
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
126 if len(values) == 1:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
127 ambiguous_dna_re[letter] = values
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
128 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
129 ambiguous_dna_re[letter] = "[%s]" % values
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
130
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
131
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
132 def make_reg_ex(seq):
6
b9dc7c967ee6 v0.0.16 Python 3 compatible print function
peterjc
parents: 5
diff changeset
133 """Make regular expression for ambiguous DNA."""
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
134 return "".join(ambiguous_dna_re[letter] for letter in seq)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
135
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
136
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
137 def make_reg_ex_mm(seq, mm):
6
b9dc7c967ee6 v0.0.16 Python 3 compatible print function
peterjc
parents: 5
diff changeset
138 """Make regular expression for mis-matches."""
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
139 if mm > 2:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
140 raise NotImplementedError("At most 2 mismatches allowed!")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
141 seq = seq.upper()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
142 yield make_reg_ex(seq)
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
143 for i in range(1, mm + 1):
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
144 # Missing first/last i bases at very start/end of sequence
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
145 for reg in make_reg_ex_mm(seq[i:], mm - i):
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
146 yield "^" + reg
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
147 for reg in make_reg_ex_mm(seq[:-i], mm - i):
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
148 yield "$" + reg
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
149 if mm >= 1:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
150 for i, letter in enumerate(seq):
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
151 # We'll use a set to remove any duplicate patterns
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
152 # if letter not in "NX":
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
153 pattern = seq[:i] + "N" + seq[i + 1:]
5
530c8d6fedd8 v0.0.15 - internal changes
peterjc
parents: 4
diff changeset
154 assert len(pattern) == len(seq), ("Len %s is %i, len %s is %i"
530c8d6fedd8 v0.0.15 - internal changes
peterjc
parents: 4
diff changeset
155 % (pattern, len(pattern), seq, len(seq)))
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
156 yield make_reg_ex(pattern)
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
157 if mm >= 2:
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
158 for i, letter in enumerate(seq):
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
159 # We'll use a set to remove any duplicate patterns
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
160 # if letter not in "NX":
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
161 for k, letter in enumerate(seq[i + 1:]):
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
162 # We'll use a set to remove any duplicate patterns
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
163 # if letter not in "NX":
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
164 pattern = seq[:i] + "N" + seq[i + 1:i + 1 + k] + "N" + seq[i + k + 2:]
5
530c8d6fedd8 v0.0.15 - internal changes
peterjc
parents: 4
diff changeset
165 assert len(pattern) == len(seq), ("Len %s is %i, len %s is %i"
530c8d6fedd8 v0.0.15 - internal changes
peterjc
parents: 4
diff changeset
166 % (pattern, len(pattern), seq, len(seq)))
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
167 yield make_reg_ex(pattern)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
168
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
169
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
170 def load_primers_as_re(primer_fasta, mm, rc=False):
6
b9dc7c967ee6 v0.0.16 Python 3 compatible print function
peterjc
parents: 5
diff changeset
171 """Load primers as regular expressions.
b9dc7c967ee6 v0.0.16 Python 3 compatible print function
peterjc
parents: 5
diff changeset
172
b9dc7c967ee6 v0.0.16 Python 3 compatible print function
peterjc
parents: 5
diff changeset
173 Read primer file and record all specified sequences.
b9dc7c967ee6 v0.0.16 Python 3 compatible print function
peterjc
parents: 5
diff changeset
174 """
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
175 primers = set()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
176 in_handle = open(primer_fasta, "rU")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
177 reader = fastaReader(in_handle)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
178 count = 0
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
179 for record in reader:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
180 if rc:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
181 seq = reverse_complement(record.sequence)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
182 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
183 seq = record.sequence
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
184 # primers.add(re.compile(make_reg_ex(seq)))
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
185 count += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
186 for pattern in make_reg_ex_mm(seq, mm):
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
187 primers.add(pattern)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
188 in_handle.close()
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
189 # Use set to avoid duplicates, sort to have longest first
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
190 # (so more specific primers found before less specific ones)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
191 primers = sorted(set(primers), key=lambda p: -len(p))
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
192 return count, re.compile("|".join(primers)) # make one monster re!
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
193
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
194
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
195 # Read primer file and record all specified sequences
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
196 count, primer = load_primers_as_re(primer_fasta, mm, rc)
6
b9dc7c967ee6 v0.0.16 Python 3 compatible print function
peterjc
parents: 5
diff changeset
197 print("%i primer sequences" % count)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
198
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
199 short_neg = 0
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
200 short_clipped = 0
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
201 clipped = 0
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
202 negs = 0
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
203
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
204 if seq_format.lower() == "sff":
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
205 # SFF is different because we just change the trim points
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
206 if forward:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
207 def process(records):
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
208 global short_clipped, short_neg, clipped, negs
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
209 for record in records:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
210 left_clip = record.annotations["clip_qual_left"]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
211 right_clip = record.annotations["clip_qual_right"]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
212 seq = str(record.seq)[left_clip:right_clip].upper()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
213 result = primer.search(seq)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
214 if result:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
215 # Forward primer, take everything after it
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
216 # so move the left clip along
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
217 if len(seq) - result.end() >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
218 record.annotations["clip_qual_left"] = left_clip + result.end()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
219 clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
220 yield record
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
221 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
222 short_clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
223 elif keep_negatives:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
224 if len(seq) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
225 negs += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
226 yield record
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
227 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
228 short_neg += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
229 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
230 def process(records):
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
231 global short_clipped, short_neg, clipped, negs
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
232 for record in records:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
233 left_clip = record.annotations["clip_qual_left"]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
234 right_clip = record.annotations["clip_qual_right"]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
235 seq = str(record.seq)[left_clip:right_clip].upper()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
236 result = primer.search(seq)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
237 if result:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
238 # Reverse primer, take everything before it
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
239 # so move the right clip back
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
240 new_len = result.start()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
241 if new_len >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
242 record.annotations["clip_qual_right"] = left_clip + new_len
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
243 clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
244 yield record
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
245 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
246 short_clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
247 elif keep_negatives:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
248 if len(seq) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
249 negs += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
250 yield record
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
251 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
252 short_neg += 1
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
253
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
254 in_handle = open(in_file, "rb")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
255 try:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
256 manifest = ReadRocheXmlManifest(in_handle)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
257 except ValueError:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
258 manifest = None
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
259 in_handle.seek(0)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
260 out_handle = open(out_file, "wb")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
261 writer = SffWriter(out_handle, xml=manifest)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
262 writer.write_file(process(SffIterator(in_handle)))
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
263 # End of SFF code
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
264 elif seq_format.lower().startswith("fastq"):
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
265 in_handle = open(in_file, "rU")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
266 out_handle = open(out_file, "w")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
267 reader = fastqReader(in_handle)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
268 writer = fastqWriter(out_handle)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
269 if forward:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
270 for record in reader:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
271 seq = record.sequence.upper()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
272 result = primer.search(seq)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
273 if result:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
274 # Forward primer, take everything after it
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
275 cut = result.end()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
276 record.sequence = seq[cut:]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
277 if len(record.sequence) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
278 record.quality = record.quality[cut:]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
279 clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
280 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
281 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
282 short_clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
283 elif keep_negatives:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
284 if len(record) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
285 negs += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
286 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
287 else:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
288 short_neg += 1
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
289 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
290 for record in reader:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
291 seq = record.sequence.upper()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
292 result = primer.search(seq)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
293 if result:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
294 # Reverse primer, take everything before it
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
295 cut = result.start()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
296 record.sequence = seq[:cut]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
297 if len(record.sequence) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
298 record.quality = record.quality[:cut]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
299 clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
300 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
301 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
302 short_clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
303 elif keep_negatives:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
304 if len(record) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
305 negs += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
306 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
307 else:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
308 short_neg += 1
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
309 elif seq_format.lower() == "fasta":
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
310 in_handle = open(in_file, "rU")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
311 out_handle = open(out_file, "w")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
312 reader = fastaReader(in_handle)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
313 writer = fastaWriter(out_handle)
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
314 # Following code is identical to that for FASTQ but without editing qualities
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
315 if forward:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
316 for record in reader:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
317 seq = record.sequence.upper()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
318 result = primer.search(seq)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
319 if result:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
320 # Forward primer, take everything after it
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
321 cut = result.end()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
322 record.sequence = seq[cut:]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
323 if len(record.sequence) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
324 clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
325 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
326 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
327 short_clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
328 elif keep_negatives:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
329 if len(record) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
330 negs += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
331 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
332 else:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
333 short_neg += 1
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
334 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
335 for record in reader:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
336 seq = record.sequence.upper()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
337 result = primer.search(seq)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
338 if result:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
339 # Reverse primer, take everything before it
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
340 cut = result.start()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
341 record.sequence = seq[:cut]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
342 if len(record.sequence) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
343 clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
344 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
345 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
346 short_clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
347 elif keep_negatives:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
348 if len(record) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
349 negs += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
350 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
351 else:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
352 short_neg += 1
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
353 else:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
354 sys.exit("Unsupported file type %r" % seq_format)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
355 in_handle.close()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
356 out_handle.close()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
357
6
b9dc7c967ee6 v0.0.16 Python 3 compatible print function
peterjc
parents: 5
diff changeset
358 print("Kept %i clipped reads," % clipped)
b9dc7c967ee6 v0.0.16 Python 3 compatible print function
peterjc
parents: 5
diff changeset
359 print("discarded %i short." % short_clipped)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
360 if keep_negatives:
6
b9dc7c967ee6 v0.0.16 Python 3 compatible print function
peterjc
parents: 5
diff changeset
361 print("Kept %i non-matching reads," % negs)
b9dc7c967ee6 v0.0.16 Python 3 compatible print function
peterjc
parents: 5
diff changeset
362 print("discarded %i short." % short_neg)