annotate microsatellite.py @ 0:70f8259b0b30 draft

Uploaded
author arkarachai-fungtammasan
date Wed, 01 Apr 2015 16:48:58 -0400
parents
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1 #!/usr/bin/env python
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
2 """
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
3 Snoop thru a fasta file looking for microsatellite repeats of given periods
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
4 Output format: length_of_repeat left_flank_length right_flank_length repeat_motif hamming_distance read_name read_sequence read_quality (additional columns)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
5
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
6 If --r option turned on, output format will have additional columns behind:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
7 read_name read_chr pre_s pre_e tr_s tr_e suf_s suf_e tr_len tr_ref_seq
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
8
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
9 pre_s where the read start
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
10 pre_e the last position before microsatellite
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
11 tr_s where microsatellite start
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
12 tr_e where microsatellite end
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
13 suf_s first base after microsatellite
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
14 tr_ref_seq reference sequence corresponding to microsatellite
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
15
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
16 * output positions are 0 based
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
17
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
18 :Author: Chen Sun (cxs1031@cse.psu.edu); Bob Harris (rsharris@bx.psu.edu)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
19
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
20 modifing log:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
21
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
22 09/27/2013
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
23 replace function dense_intervals with function non_negative_intervals, which do not need to import such file.
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
24
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
25 10/18/2013
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
26 modify function find_repeat_element to get a quick speed, under the condition that hamming_distance = 0, which means do not allowed any mutation/indel
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
27
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
28 02/25/2014
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
29 add function that can deal with mapped reads
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
30 with additional output
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
31
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
32 02/28/2014
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
33 modify the 0-based end point, as in 0-base area, it is half-open [ )
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
34 so the 0-based site, should always be added by 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
35
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
36 03/05/2014
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
37 deal with multi-fasta
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
38 """
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
39 from sys import argv,stdin,stderr,exit
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
40 from string import maketrans
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
41 from md5 import new as md5_new
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
42 import re
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
43 #from pyfracluster import dense_intervals
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
44
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
45 def usage(s=None):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
46 message = """
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
47 usage: microsat_snoop [fasta_file] [options]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
48 <fasta_file> Name of file to read sequences from; if absent,
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
49 sequences are read from stdin
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
50 --fasta Input file is in fasta format
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
51 (this is the default)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
52 --fastq Input file is in fastq format
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
53 (default is fasta unless filename is .fastq)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
54 --fastq:noquals Input file is in fastq format, but discard quals
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
55 --sam Input file is SAM file
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
56 --r Indicate additional output information, if indicated,
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
57 --ref option is mendatory
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
58 --ref=<filepath> Reference file (absolute) path
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
59 --period=<length> (mandatory,cumulative) repeat length(s) to be
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
60 searched for
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
61 <length> is expected to be small, less than 10
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
62 <length> can also be a comma-separated list, or
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
63 a range <low>..<high>
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
64 --rate=<fraction> control the candidate repeat interval detector;
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
65 it will consider intervals with at least
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
66 <fraction> of matches when shifted by the period;
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
67 <fraction> is between 0 and 1 and can be either a
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
68 real number or <n>/<d>
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
69 (default is 6/7)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
70 --minlength=<length> minimum length of intervals reported, in bp
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
71 (default is 20)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
72 --progress=<count> how often to report the sequence we're searching
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
73 (default is no progress report)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
74 --allowduplicates process all input sequences
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
75 (this is the default)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
76 --noduplicates ignore any input sequence that's the same as an
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
77 earlier sequence
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
78 --nonearduplicates ignore any input sequence that has the same first
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
79 100 bp as an earlier sequence
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
80 --nonearduplicate=<length> ignore any input sequence that has the same first
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
81 <length> bp as an earlier sequence
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
82 --hamming=<count> Don't report candidate repeat intervals that have
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
83 more than <count> mismatches
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
84 (default is to do no such filtering)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
85 --prefix=<length> Don't report candidate repeat intervals that
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
86 start within <length> of the sequence start
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
87 (default is to do no such filtering)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
88 --suffix=<length> Don't report candidate repeat intervals that
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
89 end within <length> of the sequence end
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
90 (default is to do no such filtering)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
91 --subsample=<k>/<n> Process only the <k>th sequence of every group of
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
92 <n> sequences; <k> ranges from 1 to <n>
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
93 --multipleruns Consider all candidate intervals in a sequence
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
94 (default is to consider only the longest)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
95 --partialmotifs Consider microatelites with a partial motif
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
96 (default is to consider only whole motifs)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
97 --splitbyvalidity Preprocess sequences, splitting at Ns; this
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
98 prevents candidates from including Ns
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
99 (default is not to split)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
100 --noflankdisplay Show entire sequence as flanking regions
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
101 (this is the default)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
102 --flankdisplay=<length> Limit length of flanking regions shown
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
103 --readnamesuffix=<string> Root of suffix to append to read names; e.g. 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
104 for forward, 2 for reverse; this triggers other
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
105 info to be included in the suffix
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
106 (default is "1" for fastq; no suffix for fasta)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
107 --head=<number> limit the number of sequences processed
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
108 --markend Write a marker line upon completion
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
109 (default is not to write a marker)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
110 --help=details Describe the process, and quit"""
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
111
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
112 if (s == None): exit (message)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
113 else: exit ("%s\n%s" % (s,message))
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
114
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
115
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
116 detailedDescription = """In broad terms, the process works as follows:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
117
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
118 (1) Identify intervals that are highly correlated with the interval shifted by
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
119 P (the repeat period). These intervals are called "runs" or "candidates".
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
120 The level of correlation required is controlled by rateThreshold.
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
121 Depending on whether we want to look for more than one microsat, we either
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
122 find the longest such run (simple algorithm) or many runs (more complicated
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
123 algorithm). The following steps are then performed on each run.
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
124
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
125 (2) Find the most likely repeat motif in the run. This is done by counting
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
126 all kmers (of length P) and choosing the most frequent. If that kmer is
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
127 itself covered by a sub-repeat we discard this run. The idea is that we
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
128 can ignore a 6-mer like ACGACG because we will find it when we are looking
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
129 for 3-mers.
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
130
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
131 (3) Once we identify the most likely repeat motif, we then modify the
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
132 interval, adjusting start and end to find the interval that has the fewest
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
133 mismatches vs. a sequence of the motif repeated (hamming distance). Only
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
134 whole copies of the motif are considered.
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
135
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
136 (4) At this point we have a valid microsat interval (in the eyes of the
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
137 program). It is subjected to some filtering stages (hamming distance or too
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
138 close to an end), and if it satisfies those conditions, it's reported to
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
139 the user."""
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
140
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
141 def main():
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
142 global debug
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
143
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
144 #=== parse the command line ===
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
145
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
146 inputFilename = None
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
147 referenceFileName = None #add by Chen Sun on 02/25
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
148 inputFormat = None
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
149 repeatPeriods = []
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
150 rateThreshold = 6 / 7.0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
151 lengthThreshold = 20
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
152 reportProgress = None
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
153 discardDuplicates = False
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
154 discardNearDuplicates = False
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
155 nearDuplicatePrefix = 100
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
156 hammingThreshold = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
157 prefixThreshold = None
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
158 suffixThreshold = None
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
159 subsampleK = None
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
160 subsampleN = None
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
161 reportMultipleRuns = False
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
162 allowPartialMotifs = False
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
163 splitByValidity = False
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
164 flankDisplayLimit = None
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
165 readNameSuffix = None
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
166 headLimit = None
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
167 markEndOfFile = False
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
168 additionalInfo = False
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
169 debug = []
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
170
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
171 for arg in argv[1:]:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
172 if (arg == "--fasta"):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
173 inputFormat = "fasta"
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
174 elif (arg == "--fastq"):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
175 inputFormat = "fastq"
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
176 elif (arg == "--fastq:noquals"):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
177 inputFormat = "fastq:noquals"
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
178 elif (arg == "--sam"):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
179 inputFormat = "sam"
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
180 elif (arg == "--r"):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
181 additionalInfo = True
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
182 elif (arg.startswith("--ref=")):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
183 referenceFileName = arg.split("=",1)[1]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
184 elif (arg.startswith("--period=")):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
185 val = arg.split("=",1)[1]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
186 for period in val.split(","):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
187 if (".." in period):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
188 (lowPeriod,highPeriod) = period.split("..",1)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
189 lowPeriod = int(lowPeriod)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
190 highPeriod = int(highPeriod)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
191 for period in xrange(lowPeriod,highPeriod+1):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
192 repeatPeriods += [period]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
193 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
194 repeatPeriods += [int(period)]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
195 elif (arg.startswith("--rate=")):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
196 val = arg.split("=",1)[1]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
197 rateThreshold = float_or_fraction(val)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
198 assert (0.0 < rateThreshold <= 1.0), "%s not a valid rate" % val
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
199 elif (arg.startswith("--minlength=")):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
200 val = arg.split("=",1)[1]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
201 lengthThreshold = int(val)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
202 assert (lengthThreshold >= 0)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
203 elif (arg.startswith("--progress=")):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
204 val = arg.split("=",1)[1]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
205 reportProgress = int(val)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
206 elif (arg == "--allowduplicates"):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
207 discardDuplicates = False
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
208 discardNearDuplicates = False
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
209 elif (arg == "--noduplicates"):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
210 discardDuplicates = True
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
211 discardNearDuplicates = False
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
212 elif (arg == "--nonearduplicates"):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
213 discardDuplicates = False
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
214 discardNearDuplicates = True
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
215 elif (arg.startswith("--nonearduplicate=")):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
216 val = arg.split("=",1)[1]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
217 discardDuplicates = False
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
218 discardNearDuplicates = True
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
219 nearDuplicatePrefix = int(val)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
220 assert (nearDuplicatePrefix > 0)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
221 elif (arg.startswith("--hamming=")):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
222 val = arg.split("=",1)[1]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
223 hammingThreshold = int(val)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
224 assert (hammingThreshold >= 0)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
225 elif (arg.startswith("--prefix=")):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
226 val = arg.split("=",1)[1]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
227 prefixThreshold = int(val)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
228 assert (prefixThreshold >= 0)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
229 elif (arg.startswith("--suffix=")):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
230 val = arg.split("=",1)[1]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
231 suffixThreshold = int(val)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
232 assert (suffixThreshold >= 0)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
233 elif (arg.startswith("--subsample=")):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
234 val = arg.split("=",1)[1]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
235 (k,n) = val.split("/",2)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
236 subsampleK = int(k)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
237 subsampleN = int(n)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
238 assert (0 < subsampleK <= subsampleN)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
239 elif (arg == "--multipleruns"):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
240 reportMultipleRuns = True
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
241 elif (arg == "--partialmotifs"):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
242 allowPartialMotifs = True
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
243 elif (arg == "--splitbyvalidity"):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
244 splitByValidity = True
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
245 elif (arg == "--noflankdisplay"):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
246 flankDisplayLimit = None
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
247 elif (arg.startswith("--flankdisplay=")):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
248 val = arg.split("=",1)[1]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
249 flankDisplayLimit = int(val)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
250 assert (flankDisplayLimit >= 0)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
251 elif (arg.startswith("--readnamesuffix")):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
252 readNameSuffix = arg.split("=",1)[1]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
253 elif (arg.startswith("--head=")):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
254 headLimit = int_with_unit(arg.split("=",1)[1])
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
255 elif (arg == "--markend"):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
256 markEndOfFile = True
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
257 elif (arg == "--help=details"):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
258 exit (detailedDescription)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
259 elif (arg.startswith("--debug=")):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
260 debug += (arg.split("=",1)[1]).split(",")
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
261 elif (arg.startswith("--")):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
262 usage("unrecognized option: %s" % arg)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
263 elif (inputFilename == None):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
264 inputFilename = arg
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
265 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
266 usage("unrecognized option: %s" % arg)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
267
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
268 #=== determine periods of interest ===
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
269
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
270 if (repeatPeriods == []):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
271 usage("you gotta give me a repeat period")
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
272
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
273 if (additionalInfo == True):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
274 if (referenceFileName == None):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
275 usage("reference file path needed. use --ref=<reference> to indicate")
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
276
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
277 periodSeed = {}
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
278 for period in repeatPeriods:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
279 if (period < 1): usage("period %d is not valid" % period)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
280 periodSeed[period] = True
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
281
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
282 repeatPeriods = [period for period in periodSeed]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
283 repeatPeriods.sort()
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
284
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
285 #=== determine input format ===
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
286
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
287 if (inputFormat == "fasta"): sequence_reader = fasta_sequences
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
288 elif (inputFormat == "fastq"): sequence_reader = fastq_sequences
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
289 elif (inputFormat == "fastq:noquals"): sequence_reader = fastq_sequences
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
290 elif (inputFormat == "sam"): sequence_reader = sam_sequences
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
291 elif (inputFilename == None): sequence_reader = fasta_sequences
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
292 elif (inputFilename.endswith(".fastq")): sequence_reader = fastq_sequences
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
293 elif (inputFilename.endswith(".fq")): sequence_reader = fastq_sequences
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
294 elif (inputFilename.endswith(".sam")): sequence_reader = sam_sequences
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
295 else: sequence_reader = fasta_sequences
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
296
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
297 if (inputFilename != None): inputF = file(inputFilename,"rt")
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
298 else: inputF = stdin
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
299
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
300 if (readNameSuffix == None) \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
301 and (sequence_reader == fastq_sequences) \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
302 and (inputFormat != "fastq:noquals"):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
303 readNameSuffix = "1"
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
304
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
305 #=== process the sequences ===
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
306
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
307 refSequence = {}
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
308 rightName = ""
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
309 sequence = ""
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
310 if additionalInfo:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
311 firstFasta = True
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
312 originalRefF = open(referenceFileName)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
313 for line in originalRefF.readlines():
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
314 line = line.replace('\r','')
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
315 line = line.replace('\n','')
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
316 if line.startswith(">"):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
317 if firstFasta:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
318 firstFasta = False
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
319 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
320 refSequence[rightName] = sequence
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
321 rightName = line[1:]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
322 sequence = ""
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
323 continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
324 sequence += line
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
325 originalRefF.close()
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
326 refSequence[rightName] = sequence
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
327
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
328 sequenceSeen = {}
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
329
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
330 numSequences = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
331 for seqInfo in sequence_reader(inputF):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
332 numSequences += 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
333 if (headLimit != None) and (numSequences > headLimit):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
334 print >>stderr, "limit of %d sequences reached" % headLimit
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
335 break
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
336
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
337 if (sequence_reader == sam_sequences):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
338 #seqName,"".join(seqNucs).upper().translate(nonDnaMap), refName, pre_s, cigar
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
339 (name, sequence, refName, pre_s, cigar) = seqInfo
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
340 quals = None
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
341 elif (sequence_reader == fastq_sequences):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
342 (name,sequence,quals) = seqInfo
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
343 if (inputFormat == "fastq:noquals"): quals = None
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
344 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
345 (name,sequence) = seqInfo
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
346 quals = None
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
347
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
348 if (reportProgress != None) and (numSequences % reportProgress == 0):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
349 print >>stderr, "%s %d" % (name,numSequences)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
350
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
351 # if we're subsampling and not interested in this sequence, skip it
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
352
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
353 if (subsampleN != None):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
354 if ((numSequences-1) % subsampleN != (subsampleK-1)):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
355 continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
356
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
357 # if this sequence is shorter than the length of interest, skip it
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
358
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
359 seqLen = len(sequence)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
360 if (seqLen < period) or (seqLen < lengthThreshold): continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
361
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
362 # if we're not interested in duplicates and this is one, skip it;
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
363 # note that we assume no hash collisions occur, i.e. that all hash
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
364 # matches are truly sequence matches
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
365
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
366 if (discardDuplicates):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
367 h = hash108(sequence)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
368 if (h in sequenceSeen): continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
369 sequenceSeen[h] = True
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
370 elif (discardNearDuplicates):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
371 h = hash108(sequence[:nearDuplicatePrefix])
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
372 if (h in sequenceSeen): continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
373 sequenceSeen[h] = True
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
374
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
375 # split the sequence into chunks of valid nucleotides
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
376
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
377 if (splitByValidity):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
378 chunks = [(start,end) for (start,end) in nucleotide_runs(sequence)]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
379 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
380 chunks = [(0,len(sequence))]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
381
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
382 # evaluate for each period of interest
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
383
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
384 for period in repeatPeriods:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
385
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
386 # operate on each chunk
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
387
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
388 for (chunkStart,chunkEnd) in chunks:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
389 chunkLen = chunkEnd - chunkStart
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
390 if (chunkLen < period) or (chunkLen < lengthThreshold): continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
391
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
392 if ("validity" in debug) or ("correlation" in debug) or ("runs" in debug):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
393 print >>stderr, ">%s_%d_%d" % (name,chunkStart,chunkEnd)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
394
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
395 # compute correlation sequence
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
396
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
397 corr = correlation_sequence(sequence,period,chunkStart,chunkEnd)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
398
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
399 if ("correlation" in debug) or ("runs" in debug):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
400 print >>stderr, sequence[chunkStart:chunkEnd]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
401 print >>stderr, corr
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
402
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
403 # find runs (candidates for being a microsat)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
404
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
405 if (reportMultipleRuns):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
406 runs = all_suitable_runs(corr,lengthThreshold-period,rateThreshold, hammingThreshold)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
407 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
408 runs = longest_suitable_run(corr,lengthThreshold,rateThreshold)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
409 if (runs == []): continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
410
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
411
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
412 if ("runs" in debug):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
413 for (start,end) in runs:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
414 run = [" "] * seqLen
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
415 for ix in xrange(start-period,end):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
416 run[ix] = "*"
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
417 print >>stderr, "".join(run)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
418
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
419 if ("candidates" in debug):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
420 for (start,end) in runs:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
421 print >>stderr, "%s %d %d" % (name,start,end)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
422
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
423 # process runs and report those that pass muster
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
424
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
425 runCount = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
426 for (start,end) in runs:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
427 runCount += 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
428
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
429 start = chunkStart + start - period
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
430 end = chunkStart + end
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
431
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
432 (kmer,d,start,end) = find_repeat_element(hammingThreshold, period,sequence,start,end,allowPartials=allowPartialMotifs)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
433 if (kmer == None): continue # (no useful repeat kmer was found)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
434
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
435 rptExtent = end - start
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
436 prefixLen = start
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
437 suffixLen = seqLen - end
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
438 if (rptExtent <= period): continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
439 if (hammingThreshold != None) and (d > hammingThreshold): continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
440 if (prefixThreshold != None) and (prefixLen < prefixThreshold): continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
441 if (suffixThreshold != None) and (suffixLen < suffixThreshold): continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
442
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
443 if (flankDisplayLimit == None):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
444 seq = sequence[:start] \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
445 + sequence[start:end].lower() \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
446 + sequence[end:]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
447 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
448 seq = sequence[max(chunkStart,start-flankDisplayLimit):start] \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
449 + sequence[start:end].lower() \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
450 + sequence[end:min(chunkEnd,end+flankDisplayLimit)]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
451 reportName = name
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
452 if (readNameSuffix != None):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
453 reportName += "_"+readNameSuffix+"_per"+str(period)+"_"+str(runCount)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
454 if (quals == None or quals == "." or quals == "\t."): quals = "\t."
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
455 else: quals = "\t" + quals
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
456 if not additionalInfo:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
457 print "%d\t%d\t%d\t%s\t%d\t%s\t%s%s" \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
458 % (rptExtent,prefixLen,suffixLen,kmer,d,reportName,seq,quals)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
459 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
460 #pre_e = pre_s + prefixLen - 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
461 refPoint = pre_s
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
462 donorPoint = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
463
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
464 donorBeforeStart = prefixLen - 1 #pre_e
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
465 donorMicroStart = prefixLen #tr_s
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
466 donorMicroEnd = donorMicroStart + rptExtent - 1 #tr_e
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
467 donorAfterMicro = donorMicroEnd + 1 #suf_s
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
468 donorEnd = len(seq) - 1 #suf_e
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
469
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
470 set_pre_e = False
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
471 set_tr_s = False
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
472 set_tr_e = False
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
473 set_suf_s = False
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
474 set_suf_e = False
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
475
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
476 pre_e = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
477 tr_s = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
478 tr_e = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
479 suf_s = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
480 suf_e = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
481
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
482 matchList = re.findall('(\d+)([IDM])', cigar)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
483 unCognitiveCigar = False
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
484 for matchN, matchType in matchList:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
485 matchNum = int(matchN)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
486 if matchType == "M":
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
487 donorPoint = donorPoint + matchNum
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
488 refPoint = refPoint + matchNum
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
489 elif matchType == "D":
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
490 refPoint = refPoint + matchNum
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
491 continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
492 elif matchType == "I":
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
493 donorPoint = donorPoint + matchNum
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
494 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
495 unCognitiveCigar = True
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
496 break
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
497
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
498 if not set_pre_e:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
499 if donorPoint >= donorBeforeStart:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
500 pre_e = refPoint - (donorPoint - donorBeforeStart)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
501 set_pre_e = True
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
502 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
503 continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
504
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
505 if not set_tr_s:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
506 if donorPoint >= donorMicroStart:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
507 tr_s = refPoint - (donorPoint - donorMicroStart)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
508 set_tr_s = True
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
509 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
510 continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
511
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
512 if not set_tr_e:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
513 if donorPoint >= donorMicroEnd:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
514 tr_e = refPoint - (donorPoint - donorMicroEnd)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
515 set_tr_e = True
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
516 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
517 continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
518
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
519 if not set_suf_s:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
520 if donorPoint >= donorAfterMicro:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
521 suf_s = refPoint - (donorPoint - donorAfterMicro)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
522 set_suf_s = True
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
523 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
524 continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
525
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
526 if not set_suf_e:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
527 if donorPoint >= donorEnd:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
528 suf_e = refPoint - (donorPoint - donorEnd)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
529 set_suf_e = True
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
530 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
531 continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
532
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
533 if unCognitiveCigar:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
534 break
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
535 tr_len = tr_e - tr_s + 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
536
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
537 if refName not in refSequence:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
538 tr_ref_seq = "."
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
539 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
540 if refSequence[refName] == "":
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
541 tr_ref_seq = "."
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
542 elif len(refSequence[refName]) <= tr_e:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
543 tr_ref_seq = "."
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
544 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
545 tr_ref_seq = refSequence[refName][tr_s:tr_e+1]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
546
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
547 pre_e += 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
548 tr_e += 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
549 suf_e += 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
550 print "%d\t%d\t%d\t%s\t%d\t%s\t%s%s\t%s\t%s\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%s" \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
551 % (rptExtent,prefixLen,suffixLen,kmer,d,reportName,seq,quals,reportName,refName,pre_s,pre_e,tr_s,tr_e,suf_s,suf_e,tr_len,tr_ref_seq)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
552
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
553 if (markEndOfFile):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
554 print "# microsat_snoop end-of-file"
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
555
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
556 if (inputF != stdin):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
557 inputF.close()
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
558
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
559 # non_negative_intervals
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
560 # find intervals with exactly + and no -
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
561 # from string like this : +++++++++---+++++++++
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
562 def non_negative_intervals(seq, minLength=None):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
563
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
564 start = -1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
565 end = -1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
566 firstPlus = 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
567 #print seq
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
568 for ix in range(len(seq)): # for every char in seq
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
569 ch = seq[ix]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
570 if(ch == "+"):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
571 if(firstPlus):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
572 firstPlus = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
573 start = ix
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
574 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
575 continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
576 elif(ch == "-"):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
577 if(start >= 0):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
578 end = ix-1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
579 if((end - start + 1) >= minLength):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
580 yield (start,end+1)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
581 start = -1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
582 firstPlus = 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
583 if(start > 0):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
584 if((ix - start + 1) >= minLength):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
585 yield (start, ix+1)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
586
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
587
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
588 ###################################################################
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
589 # modified by Chen Sun on 7/11/2014
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
590 # We do not want other modules, so parse these functions inside
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
591 #
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
592 ###################################################################
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
593
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
594 # parse a string of the form {positives}/{positives_and_neutrals}
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
595
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
596 def parse_spec(s):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
597 if ("/" not in s): raise ValueError
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
598 (n,d) = s.split("/",1)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
599 if (not n.startswith("{")) or (not n.endswith("}")): raise ValueError
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
600 if (not d.startswith("{")) or (not d.endswith("}")): raise ValueError
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
601
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
602 positives = n[1:-1]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
603 d = d[1:-1]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
604
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
605 for ch in positives:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
606 if (ch not in d): raise ValueError
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
607
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
608 neutrals = [ch for ch in d if (ch not in positives)]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
609 return (positives,neutrals)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
610
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
611
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
612 # convert a string to a number, allowing fractions
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
613
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
614 def float_or_fraction(s):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
615 if ("/" in s):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
616 (numer,denom) = s.split("/",1)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
617 return float(numer)/float(denom)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
618 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
619 return float(s)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
620
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
621
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
622 # dense_intervals--
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
623 # Find all non-overlapping runs with a good enough rate (of positives), and
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
624 # which meet our length threshold.
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
625 #
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
626 # The algorithm used is adapted from Zhang, Berman, Miller, "Post-processing
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
627 # long pairwise alignments", Bioinformatics Vol. 15 no. 12 1999.
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
628 #
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
629 # $$$ we use the denominator as the threshold, but we really should use the
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
630 # $$$ .. numerator, comparing it to minLength*rate
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
631
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
632 def dense_intervals(seq,rate,positives,neutrals,blockers="",minLength=None):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
633
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
634 if (blockers == None):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
635 blockers = "".join([chr(n) for n in range(1,256)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
636 if (chr(n) not in positives)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
637 and (chr(n) not in neutrals)])
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
638
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
639 stackLeft = [None] # stack with each entry containing five
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
640 stackRight = [None] # .. elements; note that entry zero is not
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
641 stackLeftScore = [None] # .. used
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
642 stackRightScore = [None]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
643 stackLower = [None]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
644 top = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
645 score = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
646
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
647 for ix in range(len(seq)):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
648 ch = seq[ix]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
649 if (ch in blockers):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
650 # emit intervals
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
651
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
652 for sp in range(1,top+1):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
653 left = stackLeft [sp] + 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
654 right = stackRight[sp]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
655
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
656 while (left < right) and (seq[left] not in positives): left += 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
657 while (right > left) and (seq[right] not in positives): right -= 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
658
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
659 right += 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
660 if (minLength == None) or (right - left >= minLength):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
661 yield (left,right)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
662
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
663 #empty stack
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
664
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
665 stackLeft = [None]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
666 stackRight = [None]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
667 stackLeftScore = [None]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
668 stackRightScore = [None]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
669 stackLower = [None]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
670 top = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
671 score = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
672 continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
673
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
674 if (ch in positives): weight = 1-rate
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
675 elif (ch in neutrals): weight = -rate
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
676 else: raise ValueError
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
677
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
678 score += weight
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
679 #if ("algorithm" in debug):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
680 # print >>sys.stderr, "%3d: %c %5.2f" % (ix, ch, score),
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
681
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
682 if (weight < 0):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
683 #if ("algorithm" in debug):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
684 # print >>sys.stderr
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
685 continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
686
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
687 if (top > 0) and (stackRight[top] == ix-1):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
688 # add this site to the interval on top of the stack
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
689
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
690 stackRight [top] = ix
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
691 stackRightScore[top] = score
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
692
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
693 #if ("algorithm" in debug):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
694 # print >>sys.stderr, \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
695 # " extending [%d] %d-%d %4.1f %4.1f" \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
696 # % (top,
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
697 # stackLeft [top], stackRight [top],
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
698 # stackLeftScore[top], stackRightScore[top]),
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
699
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
700 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
701 # create a one site interval
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
702
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
703 top += 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
704 if (top >= len(stackLeft)):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
705 stackLeft += [None]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
706 stackRight += [None]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
707 stackLeftScore += [None]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
708 stackRightScore += [None]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
709 stackLower += [None]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
710
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
711 stackLeft [top] = ix - 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
712 stackLeftScore [top] = score - weight
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
713 stackRight [top] = ix
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
714 stackRightScore[top] = score
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
715 stackLower [top] = top - 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
716
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
717 while (stackLower[top] > 0) \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
718 and (stackLeftScore[stackLower[top]] > stackLeftScore[top]):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
719 stackLower[top] = stackLower[stackLower[top]]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
720
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
721 #if ("algorithm" in debug):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
722 # print >>sys.stderr, \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
723 # " creating [%d] %d-%d %4.1f %4.1f -> %d" \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
724 # % (top,
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
725 # stackLeft [top], stackRight [top],
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
726 # stackLeftScore[top], stackRightScore[top],
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
727 # stackLower [top]),
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
728
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
729 # merge intervals; if there is a previous interval with a no-higher
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
730 # left score and no-higher right score, merge this interval (and all
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
731 # intervening ones) into that one
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
732
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
733 while (top > 1) \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
734 and (stackLower[top] > 0) \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
735 and (stackRightScore[stackLower[top]] <= stackRightScore[top]):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
736 stackRight [stackLower[top]] = stackRight [top]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
737 stackRightScore[stackLower[top]] = stackRightScore[top]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
738 top = stackLower[top]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
739
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
740 #if ("algorithm" in debug):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
741 # print >>sys.stderr, \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
742 # "\n%*s merging [%d] %d-%d %4.1f %4.1f" \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
743 # % (13, "", top,
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
744 # stackLeft[top], stackRight [top],
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
745 # stackLeftScore[top], stackRightScore[top]),
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
746
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
747 #if ("algorithm" in debug):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
748 # print >>sys.stderr
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
749
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
750 # emit intervals
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
751
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
752 for sp in range(1,top+1):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
753 left = stackLeft [sp] + 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
754 right = stackRight[sp]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
755
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
756 while (left < right) and (seq[left] not in positives): left += 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
757 while (right > left) and (seq[right] not in positives): right -= 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
758
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
759 right += 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
760 if (minLength == None) or (right - left >= minLength):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
761 yield (left,right)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
762
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
763
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
764 ###################################################################
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
765 # modified by Chen Sun on 7/11/2014
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
766 #
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
767 ###################################################################
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
768
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
769 # correlation_sequence--
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
770 # Compute the correlation sequence for a given period. This is a sequence
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
771 # of + and - indicating whether the base at a given position matches the one
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
772 # P positions earlier (where P is the period). The first P positions are
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
773 # blank. Positions with single character runs longer than the period are
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
774 # considered as non-matches, unless the period is 1.
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
775
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
776 def correlation_sequence(sequence,period,start=None,end=None):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
777 if (start == None): start = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
778 if (end == None): end = len(sequence)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
779
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
780 prevCh = sequence[start]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
781 run = 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
782 for ix in xrange(start+1,start+period):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
783 ch = sequence[ix]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
784 if (ch != prevCh): run = 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
785 else: run += 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
786 prevCh = ch
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
787
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
788 corr = [" "] * period
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
789 for ix in xrange(start+period,end):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
790 rptCh = sequence[ix-period]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
791 ch = sequence[ix]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
792 if (ch != prevCh): run = 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
793 else: run += 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
794 if (ch in "ACGT") \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
795 and (ch == rptCh) \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
796 and ((period == 1) or (run < period)):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
797 corr += ["+"]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
798 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
799 corr += ["-"]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
800 prevCh = ch
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
801
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
802 return "".join(corr)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
803
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
804
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
805 # longest_suitable_run--
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
806 # Find longest run with a good enough rate (of positives).
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
807 #
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
808 # We score a "+" as 1-r and anything else as -r. This is based on the fol-
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
809 # lowing derivation (p is the number of "+"s, n is the number of non-"+"s):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
810 # p/(p+n) >= r
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
811 # ==> p >= rp + rn
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
812 # ==> (1-r)p - rn >= 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
813 #
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
814 # We adapt an algorithm from "Programming Pearls", pg. 81 (2000 printing).
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
815 #
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
816 # $$$ we use the denominator as the threshold, but we really should use the
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
817 # $$$ .. numerator, comparing it to minLength*rate
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
818 #
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
819 # $$$ this needs to account for $$$ this situation:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
820 # $$$ sequence: ACGACGACGACGTTATTATTATTA
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
821 # $$$ matches: +++++++++---+++++++++
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
822 # $$$ this is currently considered to be one interval (if rate <= 6/7), but it
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
823 # $$$ ought to be two; we can't just post-process, though, because some other
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
824 # $$$ interval might be longer than the longest half of this; maybe what we
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
825 # $$$ need to do is consider matches at distances -P and -2P, or if we match
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
826 # $$$ -P but that itself was a mismatch, we should carry the mismatch forward
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
827
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
828 def longest_suitable_run(seq,minLength,rate):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
829 maxEndingHere = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
830 maxSoFar = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
831 start = None
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
832
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
833 for ix in xrange(len(seq)):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
834 if (seq[ix] == "+"): s = 1-rate
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
835 else: s = -rate
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
836
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
837 if (maxEndingHere+s < 0):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
838 maxEndingHere = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
839 block = ix
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
840 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
841 maxEndingHere += s
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
842 if (maxEndingHere >= maxSoFar):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
843 maxSoFar = maxEndingHere
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
844 start = block + 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
845 end = ix + 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
846
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
847 if (start == None) or (end - start < minLength):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
848 return []
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
849 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
850 return [(start,end)]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
851
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
852
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
853 # all_suitable_runs--
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
854 # Find all non-overlapping runs with a good enough rate (of positives), and
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
855 # which meet our length threshold.
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
856 # $$$ this needs to post-process the intervals, splitting them to account for
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
857 # $$$ this situation:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
858 # $$$ sequence: ACGACGACGACGTTATTATTATTA
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
859 # $$$ matches: +++++++++---+++++++++
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
860 # $$$ this is currently reported as one interval (if rate <= 6/7), but it
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
861 # $$$ ought to be two
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
862
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
863 def all_suitable_runs(seq,minCorrLength,rate, hammingThreshold):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
864
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
865 ################################################################
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
866 # modified by Chen Sun on 07/11/2014
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
867 #
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
868 ################################################################
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
869
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
870 if hammingThreshold > 0:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
871 return [(start,end) for (start,end) in dense_intervals(seq,rate,"+","-",blockers=None,minLength=minCorrLength)]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
872 elif hammingThreshold == 0:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
873 return [(start,end) for (start,end) in non_negative_intervals(seq, minLength=minCorrLength)]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
874
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
875
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
876 # find_repeat_element--
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
877 # Find the most plausible repeat element for a run, and nudge the ends of
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
878 # the run if needed. Note that we will not consider kmers that represent
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
879 # shorter repeats. For example, we won't report ACTACT as a 6-mer since we
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
880 # consider this to have a shorter period than 6.
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
881
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
882 def find_repeat_element(hammingThreshold, period,seq,start,end,allowPartials=False):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
883
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
884 if hammingThreshold > 0:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
885 (kmer,bestD,bestStart,bestEnd) = find_hamming_repeat_element(period,seq,start,end,allowPartials)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
886 return (kmer,bestD,bestStart,bestEnd)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
887 # count the number of occurences of each k-mer; note that we can't
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
888 # reject kmers containing smaller repeats yet, since for a sequence like
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
889 # ACACACACACAAACACACACACACACACAC we must first discover ACACAC as the best
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
890 # 6-mer, and THEN reject it; if we reject ACACAC while counting, we'd end
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
891 # up reporting something like ACACAA as the best motif
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
892
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
893 if ("element" in debug):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
894 print >>stderr, "find_repeat_element(%d,%d,%d)" % (period,start,end)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
895
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
896 if ("partial" in debug):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
897 print period, seq, start, end, allowPartials;
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
898 print seq[start:end]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
899
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
900 kmerToCount = {}
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
901 kmerToFirst = {}
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
902 for ix in xrange(start,end-(period-1)):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
903 kmer = seq[ix:ix+period]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
904 if ("N" in kmer): continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
905 if (kmer not in kmerToCount):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
906 kmerToCount[kmer] = 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
907 kmerToFirst[kmer] = ix
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
908 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
909 kmerToCount[kmer] += 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
910 #if ("element" in debug):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
911 # print >>stderr, " %d: %s" % (ix,kmer)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
912
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
913 # choose the best k-mer; this is simply the most frequently occurring one,
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
914 # with ties broken by whichever one came first
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
915
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
916 kmers = [(-kmerToCount[kmer],kmerToFirst[kmer],kmer) for kmer in kmerToCount]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
917 if (kmers == []): return (None,None,start,end)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
918 kmers.sort()
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
919
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
920 if ("element" in debug):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
921 for (count,first,kmer) in kmers:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
922 print >>stderr, " %s: %d" % (kmer,-count)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
923
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
924 (count,first,kmer) = kmers[0]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
925 if (contains_repeat(kmer)): return (None,None,start,end)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
926
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
927 # determine the hamming distance between the run and a simple repeat, for
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
928 # each "plausible" start and end; we compute the distance for each such
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
929 # interval, and choose the one with the lowest hamming distance; ties are
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
930 # broken in a deterministic-but-unspecified manner
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
931
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
932 bestD = bestStart = bestEnd = None
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
933 ###################################################################################
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
934 # modified by Chen Sun(cxs1031@cse.psu.edu) on 10/18/2013
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
935 # since we do not allow hamming_distance > 0, which means we do not allow mutation,
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
936 # we do not need this section to produce bestStart and End
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
937 ###################################################################################
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
938
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
939 #for (s,e) in plausible_intervals(start,end,period,len(seq),allowPartials=allowPartials):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
940 # d = hamming_distance(seq,s,e,kmer)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
941 # if (d == None): continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
942 # if (bestD == None) or (d <= bestD):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
943 # (bestD,bestStart,bestEnd) = (d,s,e)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
944
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
945
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
946
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
947 bestStart = start
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
948
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
949 if(allowPartials):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
950 bestEnd = end
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
951 elif(not allowPartials):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
952 bestEnd = start
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
953 pattern = seq[start:start+period]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
954 if ("partial" in debug):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
955 print "kmer:", kmer
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
956 if(pattern != kmer):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
957 print "pattern:", pattern
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
958
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
959 while(bestEnd <= end-period):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
960 bestEnd += period
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
961
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
962 # bestD will always be 0, as we do not allow mutation
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
963 bestD = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
964
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
965 if ("partial" in debug):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
966 print bestD, bestStart, bestEnd
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
967
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
968 ###################################################################################
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
969 # modified by Chen Sun(cxs1031@cse.psu.edu) on 10/10
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
970 #
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
971 ###################################################################################
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
972 return (kmer,bestD,bestStart,bestEnd)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
973
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
974
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
975 def find_hamming_repeat_element(period,seq,start,end,allowPartials=False):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
976
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
977 # count the number of occurences of each k-mer; note that we can't
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
978 # reject kmers containing smaller repeats yet, since for a sequence like
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
979 # ACACACACACAAACACACACACACACACAC we must first discover ACACAC as the best
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
980 # 6-mer, and THEN reject it; if we reject ACACAC while counting, we'd end
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
981 # up reporting something like ACACAA as the best motif
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
982
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
983 if ("element" in debug):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
984 print >>stderr, "find_repeat_element(%d,%d,%d)" % (period,start,end)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
985
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
986 kmerToCount = {}
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
987 kmerToFirst = {}
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
988 for ix in xrange(start,end-(period-1)):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
989 kmer = seq[ix:ix+period]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
990 if ("N" in kmer): continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
991 if (kmer not in kmerToCount):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
992 kmerToCount[kmer] = 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
993 kmerToFirst[kmer] = ix
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
994 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
995 kmerToCount[kmer] += 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
996 #if ("element" in debug):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
997 # print >>stderr, " %d: %s" % (ix,kmer)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
998
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
999 # choose the best k-mer; this is simply the most frequently occurring one,
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1000 # with ties broken by whichever one came first
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1001
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1002 kmers = [(-kmerToCount[kmer],kmerToFirst[kmer],kmer) for kmer in kmerToCount]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1003 if (kmers == []): return (None,None,start,end)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1004 kmers.sort()
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1005
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1006 if ("element" in debug):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1007 for (count,first,kmer) in kmers:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1008 print >>stderr, " %s: %d" % (kmer,-count)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1009
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1010 (count,first,kmer) = kmers[0]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1011 if (contains_repeat(kmer)): return (None,None,start,end)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1012
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1013 # determine the hamming distance between the run and a simple repeat, for
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1014 # each "plausible" start and end; we compute the distance for each such
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1015 # interval, and choose the one with the lowest hamming distance; ties are
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1016 # broken in a deterministic-but-unspecified manner
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1017
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1018 bestD = bestStart = bestEnd = None
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1019
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1020 for (s,e) in plausible_intervals(start,end,period,len(seq),allowPartials=allowPartials):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1021 d = hamming_distance(seq,s,e,kmer)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1022 if (d == None): continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1023 if (bestD == None) or (d <= bestD):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1024 (bestD,bestStart,bestEnd) = (d,s,e)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1025
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1026 return (kmer,bestD,bestStart,bestEnd)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1027
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1028 # plausible_intervals--
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1029 # Yield all plausible intervals intersecting with a run. We generate all
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1030 # starts within P bp of the run's start. For each of these, we either (a) try
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1031 # all ends within P bp of run's end, or (b) trim the new interval to a whole
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1032 # multiple of the period, and report this short interval and the longer
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1033 # interval with one more period appended. Case (a) allows partial motifs,
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1034 # while case (b) only allows whole motifs.
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1035
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1036 def plausible_intervals(start,end,period,seqLen,allowPartials=False):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1037
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1038 # generate intervals that allow a partial copy of the motif
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1039
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1040 if (allowPartials):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1041 for candStart in xrange(start-(period-1),start+period):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1042 if (candStart < 0): continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1043 for candEnd in xrange(end-(period-1),end+period):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1044 if (candEnd > seqLen): continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1045 if (candEnd <= candStart+period): continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1046 yield (candStart,candEnd)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1047
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1048 # -OR- generate intervals that allow only whole copies of the motif
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1049
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1050 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1051 for candStart in xrange(start-(period-1),start+period):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1052 if (candStart < 0): continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1053 candEnd = candStart + ((end-candStart)/period)*period
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1054 yield (candStart,candEnd)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1055 candEnd += period
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1056 if (candEnd <= seqLen): yield (candStart,candEnd)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1057
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1058
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1059 # hamming_distance--
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1060 # Determine the hamming distance between the run and a simple repeat.
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1061 # $$$ improve this by allowing gaps, and stopping when we reach a threshold
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1062
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1063 kmerToDiffs = {} # (this is used for memo-ization)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1064
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1065 def hamming_distance(seq,start,end,kmer):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1066 period = len(kmer)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1067 if (end < start + period): return None
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1068
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1069 wholeEnd = start + ((end-start)/period)*period
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1070
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1071 if (kmer not in kmerToDiffs):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1072 kmerToDiffs[kmer] = { kmer:0 }
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1073
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1074 d = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1075 for ix in xrange(start,wholeEnd,period):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1076 qmer = seq[ix:ix+period] # same size as the kmer motif
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1077 if (qmer in kmerToDiffs[kmer]):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1078 d += kmerToDiffs[kmer][qmer]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1079 continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1080 diffs = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1081 for iy in xrange(0,period):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1082 if (qmer[iy] != kmer[iy]): diffs += 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1083 kmerToDiffs[kmer][qmer] = diffs
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1084 d += diffs
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1085
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1086 if (end > wholeEnd):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1087 qmer = seq[wholeEnd:end] # shorter than the kmer motif
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1088 if (qmer in kmerToDiffs[kmer]):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1089 d += kmerToDiffs[kmer][qmer]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1090 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1091 diffs = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1092 for iy in xrange(0,len(qmer)):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1093 if (qmer[iy] != kmer[iy]): diffs += 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1094 kmerToDiffs[kmer][qmer] = diffs
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1095 d += diffs
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1096
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1097 return d
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1098
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1099
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1100 # fasta_sequences--
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1101 # Read the fasta sequences from a file. Note that we convert to upper case,
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1102 # and convert any letter other than ACGT to N.
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1103
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1104 nonDnaMap = maketrans("BDEFHIJKLMOPQRSUVWXYZ","NNNNNNNNNNNNNNNNNNNNN")
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1105
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1106 def fasta_sequences(f):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1107 seqName = None
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1108 seqNucs = None
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1109
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1110 for line in f:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1111 line = line.strip()
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1112 if (line.startswith(">")):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1113 if (seqName != None):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1114 yield (seqName,"".join(seqNucs))
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1115 seqName = sequence_name(line)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1116 seqNucs = []
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1117 elif (seqName == None):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1118 assert (False), "first sequence has no header"
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1119 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1120 seqNucs += [line]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1121
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1122 if (seqName != None):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1123 yield (seqName,"".join(seqNucs).upper().translate(nonDnaMap))
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1124
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1125
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1126 # fastq_sequences--
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1127 # Read the fastq sequences from a file. Note that we convert to upper case,
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1128 # and convert any letter other than ACGT to N.
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1129
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1130 def fastq_sequences(f):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1131 lineNum = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1132 for line in f:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1133 lineNum += 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1134 line = line.strip()
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1135
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1136 if (lineNum % 4 == 1):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1137 assert (line.startswith("@")), \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1138 "bad read name at line %d" % lineNum
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1139 seqName = line[1:]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1140 continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1141
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1142 if (lineNum % 4 == 2):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1143 seqNucs = line
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1144 continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1145
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1146 if (lineNum % 4 == 3):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1147 assert (line.startswith("+")), \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1148 "can't understand line %d:\n%s" % (lineNum,line)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1149 continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1150
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1151 quals = line
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1152 assert (len(quals) == len(seqNucs)), \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1153 "length mismatch read vs. qualities at line %d" % lineNum
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1154 yield (seqName,"".join(seqNucs).upper().translate(nonDnaMap),quals)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1155
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1156 assert (lineNum % 4 == 0), \
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1157 "incomplete read at end of file"
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1158
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1159 def sam_sequences(f):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1160 lineNum = 0
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1161 for line in f:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1162 lineNum += 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1163 line = line.strip()
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1164
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1165 if line.startswith("@"):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1166 continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1167
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1168 columns = line.split("\t")
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1169 seqName = columns[0]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1170 refName = columns[2]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1171 pre_s = int(columns[3]) - 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1172 cigar = columns[5]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1173 seqNucs = columns[9]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1174
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1175 yield (seqName,"".join(seqNucs).upper().translate(nonDnaMap), refName, pre_s, cigar)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1176
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1177 # sequence_name--
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1178 # Extract the sequence name from a fasta header.
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1179 # $$$ this may need to be improved $$$
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1180
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1181 def sequence_name(s):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1182 s = s[1:].strip()
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1183 if (s == ""): return ""
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1184 else: return s.split()[0]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1185
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1186
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1187 # nucleotide_runs--
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1188 # Yield (start,end) for all runs of valid nucleotides in a sequence.
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1189
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1190 def nucleotide_runs(s):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1191 runs = []
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1192 start = None
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1193 for (ix,nuc) in enumerate(s):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1194 if (nuc in "ACGT"):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1195 if (start == None):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1196 start = ix
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1197 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1198 if (start != None):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1199 yield (start,ix)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1200 start = None
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1201
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1202 if (start != None): yield (start,len(s))
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1203
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1204
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1205 # contains_repeat--
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1206 # Determine whether a short sequence contains a repeated element, such as a
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1207 # 6-mer containing a repeated 2-mer (ACACAC) or 3-mer (ACTACT). The repeat
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1208 # must cover the entire sequence, without mismatches.
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1209
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1210 def contains_repeat(kmer):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1211 kmerLength = len(kmer)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1212 hasRepeat = False
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1213 rptLen = 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1214 while (not hasRepeat) and (2 * rptLen <= kmerLength):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1215 if (kmerLength % rptLen != 0):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1216 rptLen += 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1217 continue
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1218 isRepeat = True
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1219 for i in xrange(rptLen,kmerLength,rptLen):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1220 if (kmer[i:i+rptLen] != kmer[:rptLen]):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1221 isRepeat = False
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1222 break
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1223 if (isRepeat):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1224 hasRepeat = True
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1225 break
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1226 rptLen += 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1227 return hasRepeat
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1228
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1229
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1230 # hash108--
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1231 # Return a 108-bit hash "value" of a string
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1232
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1233 def hash108(s):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1234 m = md5_new()
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1235 m.update(s)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1236 return m.hexdigest()[:27]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1237
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1238
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1239 # float_or_fraction--
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1240 # Convert a string to a number, allowing fractions
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1241
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1242 def float_or_fraction(s):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1243 if ("/" in s):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1244 (numer,denom) = s.split("/",1)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1245 return float(numer)/float(denom)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1246 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1247 return float(s)
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1248
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1249
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1250 # int_with_unit--
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1251 # Parse a string as an integer, allowing unit suffixes
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1252
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1253 def int_with_unit(s):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1254 if (s.endswith("K")):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1255 multiplier = 1000
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1256 s = s[:-1]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1257 elif (s.endswith("M")):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1258 multiplier = 1000 * 1000
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1259 s = s[:-1]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1260 elif (s.endswith("G")):
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1261 multiplier = 1000 * 1000 * 1000
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1262 s = s[:-1]
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1263 else:
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1264 multiplier = 1
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1265
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1266 try: return int(s) * multiplier
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1267 except ValueError: return int(math.ceil(float(s) * multiplier))
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1268
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1269
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1270 if __name__ == "__main__": main()
70f8259b0b30 Uploaded
arkarachai-fungtammasan
parents:
diff changeset
1271