annotate tools/seq_primer_clip/seq_primer_clip.py @ 4:9b074c1db68e draft

v0.0.14 galaxy_sequence_utils dependency etc
author peterjc
date Thu, 02 Feb 2017 11:52:37 -0500
parents ee5acea162a7
children 530c8d6fedd8
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
1 #!/usr/bin/env python
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
2 """Looks for the given primer sequences and clips matching SFF reads.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
3
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
4 Takes eight command line options, input read filename, input read format,
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
5 input primer FASTA filename, type of primers (forward, reverse or reverse-
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
6 complement), number of mismatches (currently only 0, 1 and 2 are supported),
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
7 minimum length to keep a read (after primer trimming), should primer-less
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
8 reads be kept (boolean), and finally the output sequence filename.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
9
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
10 Both the primer and read sequences can contain IUPAC ambiguity codes like N.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
11
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
12 This supports FASTA, FASTQ and SFF sequence files. Colorspace reads are not
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
13 supported.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
14
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
15 The mismatch parameter does not consider gapped alignemnts, however the
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
16 special case of missing bases at the very start or end of the read is handled.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
17 e.g. a primer sequence CCGACTCGAG will match a read starting CGACTCGAG...
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
18 if one or more mismatches are allowed.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
19
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
20 This can also be used for stripping off (and optionally filtering on) barcodes.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
21
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
22 Note that only the trim/clip values in the SFF file are changed, not the flow
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
23 information of the full read sequence.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
24
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
25 This script is copyright 2011-2013 by Peter Cock, The James Hutton Institute
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
26 (formerly the Scottish Crop Research Institute, SCRI), UK. All rights reserved.
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
27 See accompanying text file for licence details (MIT/BSD style).
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
28
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
29 NOTE: Currently it uses Python's regular expression engine for finding the
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
30 primers, which for my needs is fast enough.
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
31 """
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
32 import sys
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
33 import re
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
34 from galaxy_utils.sequence.fasta import fastaReader, fastaWriter
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
35 from galaxy_utils.sequence.fastq import fastqReader, fastqWriter
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
36
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
37 if "-v" in sys.argv or "--version" in sys.argv:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
38 print "v0.0.12"
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
39 sys.exit(0)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
40
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
41 try:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
42 from Bio.Seq import reverse_complement
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
43 from Bio.SeqIO.SffIO import SffIterator, SffWriter
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
44 except ImportError:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
45 sys.exit("Requires Biopython 1.54 or later")
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
46 try:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
47 from Bio.SeqIO.SffIO import ReadRocheXmlManifest
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
48 except ImportError:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
49 # Prior to Biopython 1.56 this was a private function
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
50 from Bio.SeqIO.SffIO import _sff_read_roche_index_xml as ReadRocheXmlManifest
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
51
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
52 # Parse Command Line
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
53 try:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
54 in_file, seq_format, primer_fasta, primer_type, mm, min_len, keep_negatives, out_file = sys.argv[1:]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
55 except ValueError:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
56 sys.exit("Expected 8 arguments, got %i:\n%s" % (len(sys.argv) - 1, " ".join(sys.argv)))
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
57
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
58 if in_file == primer_fasta:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
59 sys.exit("Same file given as both primer sequences and sequences to clip!")
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
60 if in_file == out_file:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
61 sys.exit("Same file given as both sequences to clip and output!")
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
62 if primer_fasta == out_file:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
63 sys.exit("Same file given as both primer sequences and output!")
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
64
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
65 try:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
66 mm = int(mm)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
67 except ValueError:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
68 sys.exit("Expected non-negative integer number of mismatches (e.g. 0 or 1), not %r" % mm)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
69 if mm < 0:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
70 sys.exit("Expected non-negtive integer number of mismatches (e.g. 0 or 1), not %r" % mm)
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
71 if mm not in [0, 1, 2]:
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
72 raise NotImplementedError
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
73
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
74 try:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
75 min_len = int(min_len)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
76 except ValueError:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
77 sys.exit("Expected non-negative integer min_len (e.g. 0 or 1), not %r" % min_len)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
78 if min_len < 0:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
79 sys.exit("Expected non-negtive integer min_len (e.g. 0 or 1), not %r" % min_len)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
80
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
81
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
82 if keep_negatives.lower() in ["true", "yes", "on"]:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
83 keep_negatives = True
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
84 elif keep_negatives.lower() in ["false", "no", "off"]:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
85 keep_negatives = False
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
86 else:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
87 sys.exit("Expected boolean for keep_negatives (e.g. true or false), not %r" % keep_negatives)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
88
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
89
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
90 if primer_type.lower() == "forward":
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
91 forward = True
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
92 rc = False
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
93 elif primer_type.lower() == "reverse":
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
94 forward = False
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
95 rc = False
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
96 elif primer_type.lower() == "reverse-complement":
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
97 forward = False
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
98 rc = True
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
99 else:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
100 sys.exit("Expected foward, reverse or reverse-complement not %r" % primer_type)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
101
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
102
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
103 ambiguous_dna_values = {
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
104 "A": "A",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
105 "C": "C",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
106 "G": "G",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
107 "T": "T",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
108 "M": "ACM",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
109 "R": "AGR",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
110 "W": "ATW",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
111 "S": "CGS",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
112 "Y": "CTY",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
113 "K": "GTK",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
114 "V": "ACGMRSV",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
115 "H": "ACTMWYH",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
116 "D": "AGTRWKD",
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
117 "B": "CGTSYKB",
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
118 "X": ".", # faster than [GATCMRWSYKVVHDBXN] or even [GATC]
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
119 "N": ".",
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
120 }
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
121
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
122 ambiguous_dna_re = {}
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
123 for letter, values in ambiguous_dna_values.iteritems():
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
124 if len(values) == 1:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
125 ambiguous_dna_re[letter] = values
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
126 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
127 ambiguous_dna_re[letter] = "[%s]" % values
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
128
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
129
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
130 def make_reg_ex(seq):
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
131 return "".join(ambiguous_dna_re[letter] for letter in seq)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
132
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
133
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
134 def make_reg_ex_mm(seq, mm):
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
135 if mm > 2:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
136 raise NotImplementedError("At most 2 mismatches allowed!")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
137 seq = seq.upper()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
138 yield make_reg_ex(seq)
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
139 for i in range(1, mm + 1):
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
140 # Missing first/last i bases at very start/end of sequence
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
141 for reg in make_reg_ex_mm(seq[i:], mm - i):
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
142 yield "^" + reg
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
143 for reg in make_reg_ex_mm(seq[:-i], mm - i):
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
144 yield "$" + reg
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
145 if mm >= 1:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
146 for i, letter in enumerate(seq):
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
147 # We'll use a set to remove any duplicate patterns
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
148 # if letter not in "NX":
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
149 pattern = seq[:i] + "N" + seq[i + 1:]
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
150 assert len(pattern) == len(seq), "Len %s is %i, len %s is %i" \
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
151 % (pattern, len(pattern), seq, len(seq))
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
152 yield make_reg_ex(pattern)
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
153 if mm >= 2:
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
154 for i, letter in enumerate(seq):
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
155 # We'll use a set to remove any duplicate patterns
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
156 # if letter not in "NX":
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
157 for k, letter in enumerate(seq[i + 1:]):
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
158 # We'll use a set to remove any duplicate patterns
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
159 # if letter not in "NX":
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
160 pattern = seq[:i] + "N" + seq[i + 1:i + 1 + k] + "N" + seq[i + k + 2:]
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
161 assert len(pattern) == len(seq), "Len %s is %i, len %s is %i" \
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
162 % (pattern, len(pattern), seq, len(seq))
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
163 yield make_reg_ex(pattern)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
164
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
165
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
166 def load_primers_as_re(primer_fasta, mm, rc=False):
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
167 # Read primer file and record all specified sequences
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
168 primers = set()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
169 in_handle = open(primer_fasta, "rU")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
170 reader = fastaReader(in_handle)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
171 count = 0
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
172 for record in reader:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
173 if rc:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
174 seq = reverse_complement(record.sequence)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
175 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
176 seq = record.sequence
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
177 # primers.add(re.compile(make_reg_ex(seq)))
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
178 count += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
179 for pattern in make_reg_ex_mm(seq, mm):
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
180 primers.add(pattern)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
181 in_handle.close()
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
182 # Use set to avoid duplicates, sort to have longest first
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
183 # (so more specific primers found before less specific ones)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
184 primers = sorted(set(primers), key=lambda p: -len(p))
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
185 return count, re.compile("|".join(primers)) # make one monster re!
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
186
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
187
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
188 # Read primer file and record all specified sequences
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
189 count, primer = load_primers_as_re(primer_fasta, mm, rc)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
190 print "%i primer sequences" % count
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
191
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
192 short_neg = 0
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
193 short_clipped = 0
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
194 clipped = 0
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
195 negs = 0
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
196
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
197 if seq_format.lower() == "sff":
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
198 # SFF is different because we just change the trim points
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
199 if forward:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
200 def process(records):
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
201 global short_clipped, short_neg, clipped, negs
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
202 for record in records:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
203 left_clip = record.annotations["clip_qual_left"]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
204 right_clip = record.annotations["clip_qual_right"]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
205 seq = str(record.seq)[left_clip:right_clip].upper()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
206 result = primer.search(seq)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
207 if result:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
208 # Forward primer, take everything after it
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
209 # so move the left clip along
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
210 if len(seq) - result.end() >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
211 record.annotations["clip_qual_left"] = left_clip + result.end()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
212 clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
213 yield record
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
214 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
215 short_clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
216 elif keep_negatives:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
217 if len(seq) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
218 negs += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
219 yield record
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
220 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
221 short_neg += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
222 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
223 def process(records):
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
224 global short_clipped, short_neg, clipped, negs
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
225 for record in records:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
226 left_clip = record.annotations["clip_qual_left"]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
227 right_clip = record.annotations["clip_qual_right"]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
228 seq = str(record.seq)[left_clip:right_clip].upper()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
229 result = primer.search(seq)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
230 if result:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
231 # Reverse primer, take everything before it
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
232 # so move the right clip back
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
233 new_len = result.start()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
234 if new_len >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
235 record.annotations["clip_qual_right"] = left_clip + new_len
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
236 clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
237 yield record
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
238 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
239 short_clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
240 elif keep_negatives:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
241 if len(seq) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
242 negs += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
243 yield record
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
244 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
245 short_neg += 1
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
246
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
247 in_handle = open(in_file, "rb")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
248 try:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
249 manifest = ReadRocheXmlManifest(in_handle)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
250 except ValueError:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
251 manifest = None
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
252 in_handle.seek(0)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
253 out_handle = open(out_file, "wb")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
254 writer = SffWriter(out_handle, xml=manifest)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
255 writer.write_file(process(SffIterator(in_handle)))
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
256 # End of SFF code
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
257 elif seq_format.lower().startswith("fastq"):
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
258 in_handle = open(in_file, "rU")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
259 out_handle = open(out_file, "w")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
260 reader = fastqReader(in_handle)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
261 writer = fastqWriter(out_handle)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
262 if forward:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
263 for record in reader:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
264 seq = record.sequence.upper()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
265 result = primer.search(seq)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
266 if result:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
267 # Forward primer, take everything after it
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
268 cut = result.end()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
269 record.sequence = seq[cut:]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
270 if len(record.sequence) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
271 record.quality = record.quality[cut:]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
272 clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
273 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
274 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
275 short_clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
276 elif keep_negatives:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
277 if len(record) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
278 negs += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
279 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
280 else:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
281 short_neg += 1
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
282 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
283 for record in reader:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
284 seq = record.sequence.upper()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
285 result = primer.search(seq)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
286 if result:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
287 # Reverse primer, take everything before it
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
288 cut = result.start()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
289 record.sequence = seq[:cut]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
290 if len(record.sequence) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
291 record.quality = record.quality[:cut]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
292 clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
293 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
294 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
295 short_clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
296 elif keep_negatives:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
297 if len(record) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
298 negs += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
299 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
300 else:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
301 short_neg += 1
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
302 elif seq_format.lower() == "fasta":
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
303 in_handle = open(in_file, "rU")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
304 out_handle = open(out_file, "w")
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
305 reader = fastaReader(in_handle)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
306 writer = fastaWriter(out_handle)
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
307 # Following code is identical to that for FASTQ but without editing qualities
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
308 if forward:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
309 for record in reader:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
310 seq = record.sequence.upper()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
311 result = primer.search(seq)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
312 if result:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
313 # Forward primer, take everything after it
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
314 cut = result.end()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
315 record.sequence = seq[cut:]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
316 if len(record.sequence) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
317 clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
318 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
319 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
320 short_clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
321 elif keep_negatives:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
322 if len(record) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
323 negs += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
324 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
325 else:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
326 short_neg += 1
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
327 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
328 for record in reader:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
329 seq = record.sequence.upper()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
330 result = primer.search(seq)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
331 if result:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
332 # Reverse primer, take everything before it
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
333 cut = result.start()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
334 record.sequence = seq[:cut]
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
335 if len(record.sequence) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
336 clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
337 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
338 else:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
339 short_clipped += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
340 elif keep_negatives:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
341 if len(record) >= min_len:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
342 negs += 1
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
343 writer.write(record)
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
344 else:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
345 short_neg += 1
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
346 else:
4
9b074c1db68e v0.0.14 galaxy_sequence_utils dependency etc
peterjc
parents: 2
diff changeset
347 sys.exit("Unsupported file type %r" % seq_format)
2
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
348 in_handle.close()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
349 out_handle.close()
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
350
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
351 print "Kept %i clipped reads," % clipped
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
352 print "discarded %i short." % short_clipped
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
353 if keep_negatives:
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
354 print "Kept %i non-matching reads," % negs
ee5acea162a7 Uploaded v0.0.10, README now using RST, MIT licence, automatic Biopython dependency
peterjc
parents:
diff changeset
355 print "discarded %i short." % short_neg