annotate lastz_paired_reads_wrapper.py @ 1:39f974d0884e draft

Updated functional test definition to reference data copied to the test data repository.
author devteam
date Wed, 20 Mar 2013 15:19:25 -0400
parents 96825cee5c25
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
1 #!/usr/bin/env python
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
2
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
3 """
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
4 Runs Lastz paired read alignment process
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
5 Written for Lastz v. 1.02.00.
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
6
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
7 # Author(s): based on various scripts written by Bob Harris (rsharris@bx.psu.edu),
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
8 # then tweaked to this form by Greg Von Kuster (greg@bx.psu.edu)
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
9
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
10 This tool takes the following input:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
11 a. A collection of 454 paired end reads ( a fasta file )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
12 b. A linker sequence ( a very small fasta file )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
13 c. A reference genome ( nob, 2bit or fasta )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
14
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
15 and uses the following process:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
16 1. Split reads into mates: the input to this step is the read file XXX.fasta, and the output is three
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
17 files; XXX.short.fasta, XXX.long.fasta and XXX.mapping. The mapping file records the information necessary
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
18 to convert mate coordinates back into the original read, which is needed later in the process.
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
19
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
20 2. Align short mates to the reference: this runs lastz against every chromosome. The input is XXX.short.fasta
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
21 and the reference genome, and the output is a SAM file, XXX.short.sam.
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
22
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
23 3. Align long mates to the reference: this runs lastz against every chromosome. The input is XXX.long.fasta
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
24 and the reference genome, and the output is a SAM file, XXX.long.sam.
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
25
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
26 4. Combine, and convert mate coordinates back to read coordinates. The input is XXX.mapping, XXX.short.sam and
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
27 XXX.long.sam, and the output is XXX.sam.
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
28
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
29 usage: lastz_paired_reads_wrapper.py [options]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
30 --ref_name: The reference name to change all output matches to
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
31 --ref_source: The reference is cached or from the history
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
32 --source_select: Use pre-set or cached reference file
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
33 --input1: The name of the reference file if using history or reference base name if using cached
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
34 --input2: The reads file to align
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
35 --input3: The sequencing linker file
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
36 --input4: The base quality score 454 file
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
37 --ref_sequences: The number of sequences in the reference file if using one from history
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
38 --output: The name of the output file
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
39 --lastz_seqs_file_dir: Directory of local lastz_seqs.loc file
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
40 """
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
41 import optparse, os, subprocess, shutil, sys, tempfile, time
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
42 from string import maketrans
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
43
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
44 from galaxy import eggs
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
45 import pkg_resources
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
46 pkg_resources.require( 'bx-python' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
47 from bx.seq.twobit import *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
48 from bx.seq.fasta import FastaReader
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
49 from galaxy.util.bunch import Bunch
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
50 from galaxy.util import string_as_bool
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
51
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
52 # Column indexes for SAM required fields
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
53 SAM_QNAME_COLUMN = 0
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
54 SAM_FLAG_COLUMN = 1
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
55 SAM_RNAME_COLUMN = 2
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
56 SAM_POS_COLUMN = 3
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
57 SAM_MAPQ_COLUMN = 4
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
58 SAM_CIGAR_COLUMN = 5
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
59 SAM_MRNM_COLUMN = 6
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
60 SAM_MPOS_COLUMN = 7
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
61 SAM_ISIZE_COLUMN = 8
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
62 SAM_SEQ_COLUMN = 9
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
63 SAM_QUAL_COLUMN = 10
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
64 SAM_MIN_COLUMNS = 11
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
65 # SAM bit-encoded flags
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
66 BAM_FPAIRED = 1 # the read is paired in sequencing, no matter whether it is mapped in a pair
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
67 BAM_FPROPER_PAIR = 2 # the read is mapped in a proper pair
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
68 BAM_FUNMAP = 4 # the read itself is unmapped; conflictive with BAM_FPROPER_PAIR
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
69 BAM_FMUNMAP = 8 # the mate is unmapped
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
70 BAM_FREVERSE = 16 # the read is mapped to the reverse strand
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
71 BAM_FMREVERSE = 32 # the mate is mapped to the reverse strand
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
72 BAM_FREAD1 = 64 # this is read1
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
73 BAM_FREAD2 = 128 # this is read2
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
74 BAM_FSECONDARY = 256 # not primary alignment
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
75 BAM_FQCFAIL = 512 # QC failure
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
76 BAM_FDUP = 1024 # optical or PCR duplicate
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
77
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
78 # Keep track of all created temporary files so they can be deleted
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
79 global tmp_file_names
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
80 tmp_file_names = []
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
81 # The values in the skipped_lines dict are tuples consisting of:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
82 # - the number of skipped lines for that error
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
83 # If not a sequence error:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
84 # - the 1st line number on which the error was found
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
85 # - the text of the 1st line on which the error was found
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
86 # If a sequence error:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
87 # - The number of the sequence in the file
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
88 # - the sequence name on which the error occurred
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
89 # We may need to improve dealing with file position and text as
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
90 # much of it comes from temporary files that are created from the
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
91 # inputs, and not the inputs themselves, so this could be confusing
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
92 # to the user.
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
93 global skipped_lines
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
94 skipped_lines = dict( bad_interval=( 0, 0, '' ),
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
95 inconsistent_read_lengths=( 0, 0, '' ),
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
96 inconsistent_reads=( 0, 0, '' ),
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
97 inconsistent_sizes=( 0, 0, '' ),
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
98 missing_mate=( 0, 0, '' ),
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
99 missing_quals=( 0, 0, '' ),
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
100 missing_seq=( 0, 0, '' ),
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
101 multiple_seqs=( 0, 0, '' ),
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
102 no_header=( 0, 0, '' ),
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
103 num_fields=( 0, 0, '' ),
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
104 reads_paired=( 0, 0, '' ),
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
105 sam_flag=( 0, 0, '' ),
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
106 sam_headers=( 0, 0, '' ),
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
107 sam_min_columns=( 0, 0, '' ),
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
108 two_mate_names=( 0, 0, '' ),
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
109 wrong_seq_len=( 0, 0, '' ) )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
110 global total_skipped_lines
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
111 total_skipped_lines = 0
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
112
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
113 def stop_err( msg ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
114 sys.stderr.write( "%s" % msg )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
115 sys.exit()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
116
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
117 def skip_line( error_key, position, text ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
118 if not skipped_lines[ error_key ][2]:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
119 skipped_lines[ error_key ][1] = position
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
120 skipped_lines[ error_key ][2] = text
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
121 skipped_lines[ error_key ][0] += 1
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
122 total_skipped_lines += 1
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
123
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
124 def get_tmp_file_name( dir=None, suffix=None ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
125 """
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
126 Return a unique temporary file name that can be managed. The
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
127 file must be manually removed after use.
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
128 """
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
129 if dir and suffix:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
130 tmp_fd, tmp_name = tempfile.mkstemp( dir=dir, suffix=suffix )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
131 elif dir:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
132 tmp_fd, tmp_name = tempfile.mkstemp( dir=dir )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
133 elif suffix:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
134 tmp_fd, tmp_name = tempfile.mkstemp( suffix=suffix )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
135 os.close( tmp_fd )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
136 tmp_file_names.append( tmp_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
137 return tmp_name
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
138
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
139 def run_command( command ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
140 proc = subprocess.Popen( args=command, shell=True, stderr=subprocess.PIPE, )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
141 proc.wait()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
142 stderr = proc.stderr.read()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
143 proc.wait()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
144 if stderr:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
145 stop_err( stderr )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
146
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
147 def split_paired_reads( input2, combined_linker_file_name ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
148 """
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
149 Given a fasta file of allegedly paired end reads ( input2 ), and a list of intervals
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
150 showing where the linker is on each read ( combined_linker_file_name ), split the reads into left and right
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
151 halves.
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
152
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
153 The input intervals look like this. Note that they may include multiple intervals for the same read
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
154 ( which should overlap ), and we use the union of them as the linker interval. Non-overlaps are
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
155 reported to the user, and those reads are not processed. Starts are origin zero.
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
156
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
157 #name strand start len size
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
158 FG3OYDA05FTEES + 219 42 283
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
159 FG3OYDA05FVOLL + 263 41 416
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
160 FG3OYDA05FFL7J + 81 42 421
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
161 FG3OYDA05FOQWE + 55 42 332
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
162 FG3OYDA05FV4DW + 297 42 388
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
163 FG3OYDA05FWAQV + 325 42 419
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
164 FG3OYDA05FVLGA + 90 42 367
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
165 FG3OYDA05FWJ71 + 58 42 276
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
166
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
167 The output gives each half-sequence on a separate line, like this. This allows easy sorting of the
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
168 sequences by length, after the fact.
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
169
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
170 219 FG3OYDA05FTEES_L TTTAGTTACACTTAACTCACTTCCATCCTCTAAATACGTGATTACCTTTC...
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
171 22 FG3OYDA05FTEES_R CCTTCCTTAAGTCCTAAAACTG
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
172 """
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
173 # Bob says these should be hard-coded.
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
174 seq_len_lower_threshold = 17
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
175 short_mate_cutoff = 50
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
176 # We need to pass the name of this file back to the caller.
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
177 tmp_mates_file_name = get_tmp_file_name( suffix='mates.txt' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
178 mates_file = file( tmp_mates_file_name, "w+b" )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
179 # Read the linker intervals
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
180 combined_linker_file = file( combined_linker_file_name, "rb" )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
181 read_to_linker_dict = {}
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
182 i = 0
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
183 for i, line in enumerate( combined_linker_file ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
184 line = line.strip()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
185 if line.startswith( "#" ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
186 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
187 if line.find( '#' ) >= 0:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
188 line = line.split( "#", 1 )[0].rstrip()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
189 fields = line.split()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
190 if len( fields ) != 4:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
191 skip_line( 'num_fields', i+1, line )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
192 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
193 name, start, length, size = fields
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
194 start = int( start )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
195 length = int( length )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
196 size = int( size )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
197 end = start + length
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
198 if end > size:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
199 skip_line[ 'bad_interval' ] += 1
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
200 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
201 if name not in read_to_linker_dict:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
202 read_to_linker_dict[ name ] = ( start, end, size )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
203 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
204 if read_to_linker_dict[ name ] == None:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
205 # Read previously marked as non-overlapping intervals, so skip this sequence - see below
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
206 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
207 ( s, e, sz ) = read_to_linker_dict[ name ]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
208 if sz != size:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
209 skip_line( 'inconsistent_sizes', i+1, name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
210 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
211 if s > end or e < start:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
212 # Non-overlapping intervals, so skip this sequence
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
213 read_to_linker_dict[ name ] = None
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
214 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
215 read_to_linker_dict[ name ] = ( min( s, start ), max( e, end ), size )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
216 combined_linker_file.close()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
217 # We need to pass the name of this file back to the caller.
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
218 tmp_mates_mapping_file_name = get_tmp_file_name( suffix='mates.mapping' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
219 mates_mapping_file = file( tmp_mates_mapping_file_name, 'w+b' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
220 # Process the sequences
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
221 seqs = 0
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
222 fasta_reader = FastaReader( file( input2, 'rb' ) )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
223 while True:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
224 seq = fasta_reader.next()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
225 if not seq:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
226 break
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
227 seqs += 1
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
228 if seq.name not in read_to_linker_dict:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
229 if seq.length > seq_len_lower_threshold:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
230 mates_file.write( "%-3d %s %s\n" % ( seq.length, seq.name, seq.text ) )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
231 read_to_linker_dict[ seq.name ] = ""
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
232 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
233 if read_to_linker_dict[ seq.name ] == "":
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
234 skip_line( 'multiple_seqs', seqs, seq.name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
235 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
236 if read_to_linker_dict[ seq.name ] == None:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
237 # Read previously marked as non-overlapping intervals, so skip this sequence - see above
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
238 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
239 ( start, end, size ) = read_to_linker_dict[ seq.name ]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
240 if seq.length != size:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
241 skip_line( 'wrong_seq_len', seqs, seq.name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
242 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
243 left = seq.text[ :start ]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
244 right = seq.text[ end: ]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
245 left_is_small = len( left ) <= seq_len_lower_threshold
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
246 right_is_small = len( right ) <= seq_len_lower_threshold
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
247 if left_is_small and right_is_small:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
248 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
249 if not left_is_small:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
250 mates_file.write( "%-3d %s %s\n" % ( len( left ), seq.name + "_L", left ) )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
251 mates_mapping_file.write( "%s %s %s %s\n" % ( seq.name + "_L", seq.name, 0, size - start ) )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
252 if not right_is_small:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
253 mates_file.write( "%-3d %s %s\n" % ( len( right ), seq.name + "_R", right ) )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
254 mates_mapping_file.write( "%s %s %s %s\n" % ( seq.name + "_R", seq.name, end, 0 ) )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
255 read_to_linker_dict[ seq.name ] = ""
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
256 combined_linker_file.close()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
257 mates_file.close()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
258 mates_mapping_file.close()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
259 # Create temporary files for short and long mates
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
260 tmp_mates_short_file_name = get_tmp_file_name( suffix='mates.short' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
261 tmp_mates_long_file_name = get_tmp_file_name( suffix='mates.long' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
262 tmp_mates_short = open( tmp_mates_short_file_name, 'w+b' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
263 tmp_mates_long = open( tmp_mates_long_file_name, 'w+b' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
264 i = 0
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
265 for i, line in enumerate( file( tmp_mates_file_name, 'rb' ) ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
266 fields = line.split()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
267 seq_len = int( fields[0] )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
268 seq_name = fields[1]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
269 seq_text = fields[2]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
270 if seq_len <= short_mate_cutoff:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
271 tmp_mates_short.write( ">%s\n%s\n" % ( seq_name, seq_text ) )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
272 else:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
273 tmp_mates_long.write( ">%s\n%s\n" % ( seq_name, seq_text ) )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
274 tmp_mates_short.close()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
275 tmp_mates_long.close()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
276 return tmp_mates_mapping_file_name, tmp_mates_file_name, tmp_mates_short_file_name, tmp_mates_long_file_name
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
277
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
278 def align_mates( input1, ref_source, ref_name, ref_sequences, tmp_mates_short_file_name, tmp_mates_long_file_name ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
279 tmp_align_file_names = []
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
280 if ref_source == 'history':
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
281 # Reference is a fasta dataset from the history
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
282 # Create temporary files to contain the output from lastz executions
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
283 tmp_short_file_name = get_tmp_file_name( suffix='short_out' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
284 tmp_align_file_names.append( tmp_short_file_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
285 tmp_long_file_name = get_tmp_file_name( suffix='long_out' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
286 tmp_align_file_names.append( tmp_long_file_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
287 seqs = 0
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
288 fasta_reader = FastaReader( open( input1 ) )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
289 while True:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
290 # Read the next sequence from the reference dataset. Note that if the reference contains
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
291 # a small number of chromosomes this loop is ok, but in many cases the genome has a bunch
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
292 # of small straggler scaffolds and contigs and it is a computational waste to do each one
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
293 # of these in its own run. There is an I/O down side to running by subsets (even if they are
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
294 # one sequence per subset), compared to splitting the reference into sizes of 250 mb. With
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
295 # the subset action, lastz still has to read and parse the entire file for every run (this
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
296 # is true for fasta, but for .2bit files it can access each sequence directly within the file,
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
297 # so the overhead is minimal).
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
298 """
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
299 :> output_file (this creates the output file, empty)
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
300 while there are more sequences to align
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
301 find the next sequences that add up to 250M, put their names in farf.names
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
302 lastz ${refFile}[subset=farf.names][multi][unmask] ${matesPath}/${matesFile} ...
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
303 >> output_file
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
304 """
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
305 seq = fasta_reader.next()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
306 if not seq:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
307 break
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
308 seqs += 1
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
309 # Create a temporary file to contain the current sequence as input to lastz.
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
310 # We're doing this a bit differently here since we could be generating a huge
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
311 # number of temporary files.
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
312 tmp_in_fd, tmp_in_file_name = tempfile.mkstemp( suffix='seq_%d_in' % seqs )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
313 tmp_in_file = os.fdopen( tmp_in_fd, 'w+b' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
314 tmp_in_file.write( '>%s\n%s\n' % ( seq.name, seq.text ) )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
315 tmp_in_file.close()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
316 # Align short mates
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
317 command = 'lastz %s[unmask]%s %s ' % ( tmp_in_file_name, ref_name, tmp_mates_short_file_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
318 command += 'Z=1 --seed=1111111011111 --notrans --maxwordcount=90% --match=1,3 O=1 E=3 X=15 K=10 Y=12 L=18 --ambiguousn --noytrim --identity=95 --coverage=80 --continuity=95 --format=softsam- '
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
319 command += '>> %s' % tmp_short_file_name
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
320 run_command( command )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
321 # Align long mates
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
322 command = 'lastz %s[unmask]%s %s ' % ( tmp_in_file_name, ref_name, tmp_mates_long_file_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
323 command += 'Z=15 W=13 --notrans --exact=18 --maxwordcount=90% --match=1,3 O=1 E=3 Y=10 L=18 --ambiguousn --noytrim --identity=95 --coverage=90 --continuity=95 --format=softsam- '
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
324 command += '>> %s' % tmp_long_file_name
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
325 run_command( command )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
326 # Remove the temporary file that contains the current sequence
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
327 os.remove( tmp_in_file_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
328 else:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
329 # Reference is a locally cached 2bit file, split lastz calls across number of chroms in 2bit file
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
330 tbf = TwoBitFile( open( input1, 'rb' ) )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
331 for chrom in tbf.keys():
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
332 # Align short mates
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
333 tmp_short_file_name = get_tmp_file_name( suffix='short_vs_%s' % chrom )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
334 tmp_align_file_names.append( tmp_short_file_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
335 command = 'lastz %s/%s[unmask]%s %s ' % ( input1, chrom, ref_name, tmp_mates_short_file_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
336 command += 'Z=1 --seed=1111111011111 --notrans --maxwordcount=90% --match=1,3 O=1 E=3 X=15 K=10 Y=12 L=18 --ambiguousn --noytrim --identity=95 --coverage=80 --continuity=95 --format=softsam- '
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
337 command += '> %s' % tmp_short_file_name
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
338 run_command( command )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
339 # Align long mates
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
340 tmp_long_file_name = get_tmp_file_name( suffix='long_vs_%s' % chrom )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
341 tmp_align_file_names.append( tmp_long_file_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
342 command = 'lastz %s/%s[unmask]%s %s ' % ( input1, chrom, ref_name, tmp_mates_long_file_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
343 command += 'Z=15 W=13 --notrans --exact=18 --maxwordcount=90% --match=1,3 O=1 E=3 Y=10 L=18 --ambiguousn --noytrim --identity=95 --coverage=90 --continuity=95 --format=softsam- '
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
344 command += '> %s' % tmp_long_file_name
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
345 run_command( command )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
346 return tmp_align_file_names
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
347
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
348 def paired_mate_unmapper( input2, input4, tmp_mates_mapping_file_name, tmp_align_file_name_list, output ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
349 """
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
350 Given a SAM file corresponding to alignments of *subsegments* of paired 'reads' to a reference sequence,
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
351 convert the positions on the subsegments to positions on the reads. Also (optionally) add quality values.
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
352
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
353 The input file is in SAM format, as shown below. Each line represents the alignment of a part of a read
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
354 to a reference sequence. Read pairs are indicated by suffixes in their names. Normally, the suffixes _L
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
355 and _R indicate the left and right mates of reads (this can be overridden with the --left and --right
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
356 options). Reads that were not mates have no suffix.
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
357
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
358 (SAM header lines omitted)
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
359 F2YP0BU02G7LK5_R 16 chr21 15557360 255 40M * 0 0 ATTTTATTCTCTTTGAAGCAATTGTGAATGGGAGTTTACT *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
360 F2YP0BU02HXV58_L 16 chr21 15952091 255 40M6S * 0 0 GCAAATTGTGCTGCTTTAAACATGCGTGTGCAAGTATCTTtttcat *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
361 F2YP0BU02HREML_R 0 chr21 16386077 255 33M5S * 0 0 CCAAAGTTCTGGGATTACAGGCGTGAGCCATCGcgccc *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
362 F2YP0BU02IOF1F_L 0 chr21 17567321 255 7S28M * 0 0 taaagagAAGAATTCTCAACCCAGAATTTCATATC *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
363 F2YP0BU02IKX84_R 16 chr21 18491628 255 22M1D18M9S * 0 0 GTCTCTACCAAAAAATACAAAAATTAGCCGGGCGTGGTGGcatgtctgt *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
364 F2YP0BU02GW5VA_L 16 chr21 20255344 255 6S32M * 0 0 caagaaCAAACACATTCAAAAGCTAGTAGAAGGCAAGA *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
365 F2YP0BU02JIMJ4_R 0 chr21 22383051 255 19M * 0 0 CCCTTTATCATTTTTTATT *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
366 F2YP0BU02IXZGF_L 16 chr21 23094798 255 13M1I18M * 0 0 GCAAGCTCCACTTCCCGGGTTCACGCCATTCT *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
367 F2YP0BU02IODR5_L 0 chr21 30935325 255 37M * 0 0 GAAATAAAGGGTATTCAATTAGGAAAAGAGGAAGTCA *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
368 F2YP0BU02IMZBL_L 16 chr21 31603486 255 28M1D1M * 0 0 ATACAAAAATTAGCCGGGCACAGTGGCAG *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
369 F2YP0BU02JA9PR_L 16 chr21 31677159 255 23M * 0 0 CACACCTGTAACCCCAGCACTTT *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
370 F2YP0BU02HKC61_R 0 chr21 31678718 255 40M * 0 0 CACTGCACTCCAGCCTGGGTGACAAAGCAAGACTCTGTCT *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
371 F2YP0BU02HKC61_R 0 chr21 31678718 255 40M * 0 0 CACTGCACTCCAGCCTGGGTGACAAAGCAAGACTCTGTCT *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
372 F2YP0BU02HVA88 16 chr21 31703558 255 1M1D35M8S * 0 0 TGGGATTACAGGCGTGAGCTACCACACCCAGCCAGAgttcaaat *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
373 F2YP0BU02JDCF1_L 0 chr21 31816600 255 38M * 0 0 AGGAGAATCGCTTGAACCCAGGAGGCAGAGGTTGCGGT *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
374 F2YP0BU02GZ1GO_R 0 chr21 33360122 255 6S38M * 0 0 cctagaCTTCACACACACACACACACACACACACACACACACAC *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
375 F2YP0BU02FX387_L 16 chr22 14786201 255 26M * 0 0 TGGATGAAGCTGGAAACCATCATTCT *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
376 F2YP0BU02IF2NE_R 0 chr22 16960842 255 40M10S * 0 0 TGGCATGCACCTGTAGTCTCAGCTACTTGGGAGGCTGAGGtgggaggatc *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
377 F2YP0BU02F4TVA 0 chr22 19200522 255 49M * 0 0 CCTGGGAGGCGGAGGTTGCAGTGAGCCGAGATCACGCCATTGCACTCCA *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
378 F2YP0BU02HKC61_R 16 chr22 29516998 255 8S32M * 0 0 agacagagTCTTGCTTTGTCACCCAGGCTGGAGTGCAGTG *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
379 F2YP0BU02FS4EM_R 0 chr22 30159364 255 29M * 0 0 CTCCTGCCTCAGCCTCCCGAGTAGTTGGG *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
380 F2YP0BU02G197P_L 0 chr22 32044496 255 40M10S * 0 0 TTGTTGGACATTTGGGTTGGTTCCAAGTCTTTGCTATTGTgaataatgcc *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
381 F2YP0BU02FIING 16 chr22 45959944 255 3M1I11M1I26M * 0 0 AGCTATGGTACTGGCTATGAAAGCAGACACATAGACCAATGG *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
382 F2YP0BU02GUB9L_L 16 chr22 49198404 255 16M1I20M * 0 0 CACCACGCTCGGCTAATTTTTGTATTTTTAGTAGAGA *
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
383
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
384 The user must provide a mapping file (which might better be called an unmapping file). This file is usually
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
385 created by split_paired_reads, and tells us how to map the subsegments back to original coordinates in a single
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
386 read (this means the left and right mates were part of a single read). The mapping file contains four columns.
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
387 The first two give the mates's name (including the suffix) and the read name. The last two columns describe how
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
388 much of the full original sequence is missing from the mate. For example, in the read below, the left mate is
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
389 missing 63 on the right (42 for the linker and 21 for the right half). The right mate is missing 339 on the left.
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
390
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
391 left half: TTTCAACATATGCAAATCAATAAATGTAATCCAGCATATAAACAGAACCA
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
392 AAGACAAAAACCACATGATTATCTCAATAGATGCAGAAAAGGCCTTCGGC
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
393 AAAATTCAACAAAACTCCATGCTAAAACTCTCAATAAGGTATTGATGGGA
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
394 CATGCCGCATAATAATAAGACATATCTATGACAAACCCACAGCCAATATC
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
395 ATGCTGAATGCACAAAAATTGGAAGCATTCCCTTTGAAAACTGGCACAAG
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
396 ACTGGGATGCCCTCTCTCACAACTCCTATTCAACATAGTGTTGGAAG
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
397 linker: CGTAATAACTTCGTATAGCATACATTATACGAAGTCATACGA
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
398 right half: CTCCTGCCTCAGCCTCCCGAG
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
399
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
400 mate_name read_name offset_to_start offset_from_end
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
401 F2YP0BU02FS4EM_L F2YP0BU02FS4EM 0 71
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
402 F2YP0BU02FS4EM_R F2YP0BU02FS4EM 339 0
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
403
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
404 The user can also specify a quality scores file, which should look something like this. Quality values are presumed
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
405 to be PHRED scores, written in space-delimited decimal.
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
406
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
407 >F2YP0BU02FS4EM
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
408 38 38 38 40 40 40 40 40 40 40 40 40 40 39 39 39 40 40 40 40 38 21 21 21 40
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
409 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 39 39 40 40 40 40 40 40 40 33
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
410 32 32 40 40 40 21 21 18 18 21 34 34 31 40 40 40 40 40 40 40 40 40 40 40 40
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
411 40 40 40 40 40 40 40 40 40 40 40 32 32 32 32 40 40 40 40 40 40 40 34 34 35
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
412 31 31 28 28 33 33 33 36 36 36 17 17 17 19 26 36 36 36 40 40 40 40 40 33 34
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
413 34 34 39 39 39 40 40 40 40 40 33 33 34 34 40 40 40 40 40 40 40 39 39 39 40
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
414 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
415 40 40 40 40 40 40 40 39 39 39 39 39 39 40 40 40 39 39 39 40 40 40 40 40 40
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
416 40 40 40 40 40 40 40 40 40 40 40 40 40 26 26 26 26 26 40 40 38 38 37 35 33
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
417 36 40 19 17 17 17 17 19 19 23 30 20 20 20 23 35 40 36 36 36 36 36 36 36 36
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
418 39 40 34 20 27 27 35 39 40 37 40 40 40 40 40 40 40 40 40 40 34 34 35 39 40
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
419 40 40 40 40 40 40 39 39 39 40 40 40 40 36 36 32 32 28 28 29 30 36 40 30 26
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
420 26 26 34 39 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 39 39
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
421 40 39 35 34 34 40 40 40 40 30 30 30 35 40 40 40 40 40 39 39 36 40 40 40 40
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
422 39 39 39 39 30 30 28 35 35 39 40 40 40 40 40 35 35 35
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
423 >F2YP0BU02G197P
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
424 40 40 40 40 40 40 40 40 40 40 39 39 39 39 39 39 40 40 40 40 40 40 40 40 40
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
425 40 40 40 40 26 26 26 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
426 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
427 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
428 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 34 34 34 40 40
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
429 40 40 40 40 40 40 39 39 39 40 40 40 40 40 40 40 40 40 40 39 39 39 40 40 40
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
430 40 40 40 40 40 40 40 34 34 34 34 40 40 40 40 34 34 34 34 40 40 40 40 40 40
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
431 40 40 40 40 40 39 39 39 34 34 34 34 40 40 40 40 39 39 25 25 26 39 40 40 40
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
432 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
433 33 33 33 33 40 35 21 21 21 30 38 40 40 40 40 40 40 40 40 35 35 30 30 30 40
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
434 40 40 39 39 39 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
435 40 40 40 40 40 40 40 40 40 40 40 40 39 39 39 40 40 40 40 40 40 40 40 40 40
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
436 40 40 40 39 39 39 40 40
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
437 >F2YP0BU02FIING
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
438 32 32 32 25 25 25 25 24 25 30 31 30 27 27 27 28 28 21 19 19 13 13 13 14 19
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
439 19 17 19 16 16 25 28 22 21 17 17 18 25 24 25 25 25
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
440
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
441 The output file is also SAM:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
442
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
443 (SAM header lines omitted)
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
444 F2YP0BU02G7LK5 81 chr21 15557360 255 40M303H * 0 0 ATTTTATTCTCTTTGAAGCAATTGTGAATGGGAGTTTACT D>>>>IIIIIIHHG???IIIIIIIIIHHHFFEIH999HII
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
445 F2YP0BU02HXV58 145 chr21 15952091 255 226H40M6S * 0 0 GCAAATTGTGCTGCTTTAAACATGCGTGTGCAAGTATCTTtttcat AA===DDDDAAAAD???:::ABBBBBAAA:888ECF;F>>>?8??@
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
446 F2YP0BU02HREML 65 chr21 16386077 255 320H33M5S * 0 0 CCAAAGTTCTGGGATTACAGGCGTGAGCCATCGcgccc HH???HHIIIHFHIIIIIIICDDHHIIIIIIHHHHHHH
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
447 F2YP0BU02IOF1F 129 chr21 17567321 255 7S28M409H * 0 0 taaagagAAGAATTCTCAACCCAGAATTTCATATC 4100<<A>4113:<EFGGGFFFHHHHHHDFFFFED
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
448 F2YP0BU02IKX84 81 chr21 18491628 255 22M1D18M9S341H * 0 0 GTCTCTACCAAAAAATACAAAAATTAGCCGGGCGTGGTGGcatgtctgt ;;;=7@.55------?2?11112GGB=CCCCDIIIIIIIIIHHHHHHII
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
449 F2YP0BU02GW5VA 145 chr21 20255344 255 286H6S32M * 0 0 caagaaCAAACACATTCAAAAGCTAGTAGAAGGCAAGA IIIIIIIHHHIIIIIIICCCCIIIIIIIIIIIIIIIII
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
450 F2YP0BU02JIMJ4 65 chr21 22383051 255 208H19M * 0 0 CCCTTTATCATTTTTTATT 555544E?GE113344I22
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
451 F2YP0BU02IXZGF 145 chr21 23094798 255 291H13M1I18M * 0 0 GCAAGCTCCACTTCCCGGGTTCACGCCATTCT IIIIIIIIIIIGG;;;GGHIIIIIGGGIIIII
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
452 F2YP0BU02IODR5 129 chr21 30935325 255 37M154H * 0 0 GAAATAAAGGGTATTCAATTAGGAAAAGAGGAAGTCA 6...7/--..,30;9<<>@BFFFAAAAHIIIIIH@@@
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
453 F2YP0BU02IMZBL 145 chr21 31603486 255 342H28M1D1M * 0 0 ATACAAAAATTAGCCGGGCACAGTGGCAG BB1552222<<>9==8;;?AA=??A???A
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
454 F2YP0BU02JA9PR 145 chr21 31677159 255 229H23M * 0 0 CACACCTGTAACCCCAGCACTTT IIIIIIIIIIICCCCIIIIIHHH
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
455 F2YP0BU02HKC61 65 chr21 31678718 255 300H40M * 0 0 CACTGCACTCCAGCCTGGGTGACAAAGCAAGACTCTGTCT AA@BD:::==AAA@A?8888:<90004<>>?><<<<4442
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
456 F2YP0BU02HKC61 65 chr21 31678718 255 300H40M * 0 0 CACTGCACTCCAGCCTGGGTGACAAAGCAAGACTCTGTCT AA@BD:::==AAA@A?8888:<90004<>>?><<<<4442
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
457 F2YP0BU02HVA88 16 chr21 31703558 255 1M1D35M8S * 0 0 TGGGATTACAGGCGTGAGCTACCACACCCAGCCAGAgttcaaat >8888DFFHHGFHHHH@@?@?DDC96666HIIIFFFFFFFFFFF
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
458 F2YP0BU02JDCF1 129 chr21 31816600 255 38M103H * 0 0 AGGAGAATCGCTTGAACCCAGGAGGCAGAGGTTGCGGT IIIIIIIIIIIHHHIIHHHIIIIIIIIIIIIIIIIIII
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
459 F2YP0BU02GZ1GO 65 chr21 33360122 255 76H6S38M * 0 0 cctagaCTTCACACACACACACACACACACACACACACACACAC BBBBD?:688CFFFFFFFFFFFFFFFFFFFFFFFFFFDDBBB51
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
460 F2YP0BU02FX387 145 chr22 14786201 255 201H26M * 0 0 TGGATGAAGCTGGAAACCATCATTCT IIHHHHHHHHHHHHHFFFFFFFFFFF
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
461 F2YP0BU02IF2NE 65 chr22 16960842 255 209H40M10S * 0 0 TGGCATGCACCTGTAGTCTCAGCTACTTGGGAGGCTGAGGtgggaggatc BAAADDDDFDDDDDDBBA889<A?4444000@<>AA?9444;;8>77<7-
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
462 F2YP0BU02F4TVA 0 chr22 19200522 255 49M * 0 0 CCTGGGAGGCGGAGGTTGCAGTGAGCCGAGATCACGCCATTGCACTCCA FFF???FFFFFIIIIIIIIIIIIIIIIIIIIIIIHHIIFHFFFGDDB=5
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
463 F2YP0BU02HKC61 81 chr22 29516998 255 8S32M300H * 0 0 agacagagTCTTGCTTTGTCACCCAGGCTGGAGTGCAGTG 2444<<<<>?>><40009<:8888?A@AAA==:::DB@AA
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
464 F2YP0BU02FS4EM 65 chr22 30159364 255 339H29M * 0 0 CTCCTGCCTCAGCCTCCCGAGTAGTTGGG IIIIHHEIIIIHHHH??=DDHIIIIIDDD
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
465 F2YP0BU02G197P 129 chr22 32044496 255 40M10S258H * 0 0 TTGTTGGACATTTGGGTTGGTTCCAAGTCTTTGCTATTGTgaataatgcc IIIIIIIIIIHHHHHHIIIIIIIIIIIII;;;IIIIIIIIIIIIIIIIII
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
466 F2YP0BU02FIING 16 chr22 45959944 255 3M1I11M1I26M * 0 0 AGCTATGGTACTGGCTATGAAAGCAGACACATAGACCAATGG :::9:32267=:114244/...446==<<<?@?:9::::AAA
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
467 F2YP0BU02GUB9L 145 chr22 49198404 255 176H16M1I20M * 0 0 CACCACGCTCGGCTAATTTTTGTATTTTTAGTAGAGA IIIIIIIIIHAAC;<</////@4F5778;IIIIIIII
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
468
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
469 """
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
470 left_suffix = "_L"
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
471 right_suffix = "_R"
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
472 # Read the mapping
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
473 mate_to_read_dict = {}
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
474 i = 0
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
475 for i, line in enumerate( file( tmp_mates_mapping_file_name, 'rb' ) ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
476 line = line.strip()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
477 if not line.startswith( "#" ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
478 fields = line.split()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
479 if len( fields ) != 4:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
480 skip_line( "num_fields", i+1, line )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
481 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
482 mate_name, read_name, s_offset, e_offset = fields
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
483 if mate_name in mate_to_read_dict:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
484 skip_line( 'two_mate_names', i+1, mate_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
485 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
486 mate_to_read_dict[ mate_name ] = ( read_name, int( s_offset ), int( e_offset ) )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
487 # Read sequence data
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
488 read_to_nucs_dict = {}
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
489 seqs = 0
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
490 fasta_reader = FastaReader( file( input2, 'rb' ) )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
491 while True:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
492 seq = fasta_reader.next()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
493 if not seq:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
494 break
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
495 seqs += 1
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
496 seq_text_upper = seq.text.upper()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
497 if seq.name in read_to_nucs_dict:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
498 if seq_text_upper != read_to_nucs_dict[ seq.name ]:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
499 skip_line( 'inconsistent_reads', seqs, seq.name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
500 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
501 read_to_nucs_dict[ seq.name ] = seq_text_upper
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
502 # Read quality data
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
503 def quality_sequences( f ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
504 seq_name = None
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
505 seq_quals = None
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
506 line_number = 0
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
507 for line in f:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
508 line_number += 1
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
509 line = line.strip()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
510 if line.startswith( ">" ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
511 if seq_name != None:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
512 yield ( seq_name, seq_quals, seq_line )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
513 seq_name = sequence_name( line )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
514 seq_line = line_number
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
515 seq_quals = []
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
516 elif seq_name is None:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
517 skip_line( 'no_header', line_number, line )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
518 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
519 else:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
520 seq_quals += [ int( q ) for q in line.split() ]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
521 if seq_name is not None:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
522 yield ( seq_name, seq_quals, seq_line )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
523 def sequence_name( s ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
524 s = s[ 1: ].strip()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
525 if not s:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
526 return ""
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
527 else:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
528 return s.split()[ 0 ]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
529 read_to_quals_dict = {}
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
530 # TODO: should we use Dan's fastaNamedReader here?
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
531 for seq_name, quals, line_number in quality_sequences( file( input4 ) ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
532 quals = samify_phred_scores( quals )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
533 if seq_name in read_to_quals_dict:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
534 if quals != read_to_quals_dict[ seq_name ]:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
535 skip_line( 'inconsistent_reads', line_number, seq_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
536 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
537 if len( quals ) != len( read_to_nucs_dict[ seq_name ] ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
538 skip_line( 'inconsistent_read_lengths', line_number, seq_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
539 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
540 read_to_quals_dict[ seq_name ] = quals
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
541 # process the SAM file
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
542 tmp_align_file_names = ' '.join( tmp_align_file_name_list )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
543 combined_chrom_file_name = get_tmp_file_name( suffix='combined_chrom' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
544 command = 'cat %s | grep -v "^@" | sort -k 1 > %s' % ( tmp_align_file_names, combined_chrom_file_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
545 run_command( command )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
546 fout = file( output, 'w+b' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
547 has_non_header = False
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
548 i = 0
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
549 for i, line in enumerate( file( combined_chrom_file_name, 'rb' ) ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
550 line = line.strip()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
551 if line.startswith( "@" ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
552 if has_non_header:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
553 skip_line( 'sam_headers', i+1, line )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
554 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
555 fout.write( "%s\n" % line )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
556 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
557 has_non_header = True
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
558 fields = line.split()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
559 num_fields = len( fields )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
560 if num_fields < SAM_MIN_COLUMNS:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
561 skip_line( 'sam_min_columns', i+1, line )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
562 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
563 # Set flags for mates
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
564 try:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
565 flag = int( fields[ SAM_FLAG_COLUMN ] )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
566 except ValueError:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
567 skip_line( 'sam_flag', i+1, line )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
568 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
569 if not( flag & ( BAM_FPAIRED + BAM_FREAD1 + BAM_FREAD2 ) == 0 ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
570 skip_line( 'reads_paired', i+1, line )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
571 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
572 mate_name = fields[ SAM_QNAME_COLUMN ]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
573 unmap_it = False
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
574 half = None
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
575 if mate_name.endswith( left_suffix ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
576 flag += BAM_FPAIRED + BAM_FREAD2
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
577 fields[ SAM_FLAG_COLUMN ] = "%d" % flag
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
578 unmap_it = True
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
579 half = "L"
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
580 elif mate_name.endswith( right_suffix ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
581 flag += BAM_FPAIRED + BAM_FREAD1
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
582 fields[ SAM_FLAG_COLUMN ] = "%d" % flag
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
583 unmap_it = True
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
584 half = "R"
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
585 on_plus_strand = ( flag & BAM_FREVERSE == 0 )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
586 # Convert position from mate to read by adding clipping to cigar
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
587 if not unmap_it:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
588 read_name = mate_name
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
589 else:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
590 try:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
591 read_name, s_offset, e_offset = mate_to_read_dict[ mate_name ]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
592 except KeyError:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
593 skip_line( 'missing_mate', i+1, mate_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
594 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
595 cigar = fields[ SAM_CIGAR_COLUMN ]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
596 cigar_prefix = None
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
597 cigar_suffix = None
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
598 if half == "L":
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
599 if on_plus_strand:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
600 if s_offset > 0:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
601 cigar_prefix = ( s_offset, "S" )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
602 if e_offset > 0:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
603 cigar_suffix = ( e_offset, "H" )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
604 else:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
605 if e_offset > 0:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
606 cigar_prefix = ( e_offset, "H" )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
607 if s_offset > 0:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
608 cigar_suffix = ( s_offset, "S" )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
609 elif half == "R":
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
610 if on_plus_strand:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
611 if s_offset > 0:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
612 cigar_prefix = ( s_offset, "H" )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
613 if e_offset > 0:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
614 cigar_suffix = ( e_offset, "S" )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
615 else:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
616 if e_offset > 0:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
617 cigar_prefix = ( e_offset, "S" )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
618 if s_offset > 0:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
619 cigar_suffix = ( s_offset, "H" )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
620 else:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
621 if on_plus_strand:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
622 if s_offset > 0:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
623 cigar_prefix = ( s_offset, "S" )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
624 if e_offset > 0:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
625 cigar_suffix = ( e_offset, "S" )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
626 else:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
627 if e_offset > 0:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
628 cigar_prefix = ( e_offset, "S" )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
629 if s_offset > 0:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
630 cigar_suffix = ( s_offset, "S" )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
631 if cigar_prefix != None:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
632 count, op = cigar_prefix
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
633 cigar = prefix_cigar( "%d%s" % ( count, op ), cigar )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
634 if op == "S":
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
635 refPos = int( fields[ SAM_POS_COLUMN ] ) - count
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
636 fields[ SAM_POS_COLUMN ] = "%d" % refPos
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
637 if cigar_suffix != None:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
638 count, op = cigar_suffix
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
639 cigar = suffix_cigar( cigar,"%d%s" % ( count, op) )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
640 fields[ SAM_QNAME_COLUMN ] = read_name
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
641 fields[ SAM_CIGAR_COLUMN ] = cigar
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
642 # Fetch sequence and quality values, and flip/clip them
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
643 if read_name not in read_to_nucs_dict:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
644 skip_line( 'missing_seq', i+1, read_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
645 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
646 nucs = read_to_nucs_dict[ read_name ]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
647 if not on_plus_strand:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
648 nucs = reverse_complement( nucs )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
649 quals = None
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
650 if read_to_quals_dict != None:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
651 if read_name not in read_to_quals_dict:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
652 skip_line( 'missing_quals', i+1, read_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
653 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
654 quals = read_to_quals_dict[ read_name ]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
655 if not on_plus_strand:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
656 quals = reverse_string( quals )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
657 cigar = split_cigar( fields[ SAM_CIGAR_COLUMN ] )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
658 nucs, quals = clip_for_cigar( cigar, nucs, quals )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
659 fields[ SAM_SEQ_COLUMN ] = nucs
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
660 if quals != None:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
661 fields[ SAM_QUAL_COLUMN ] = quals
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
662 # Output the line
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
663 fout.write( "%s\n" % "\t".join( fields ) )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
664 fout.close()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
665
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
666 def prefix_cigar( prefix, cigar ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
667 ix = 0
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
668 while cigar[ ix ].isdigit():
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
669 ix += 1
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
670 if cigar[ ix ] != prefix[ -1 ]:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
671 return prefix + cigar
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
672 count = int( prefix[ :-1 ] ) + int( cigar[ :ix ] )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
673 return "%d%s%s" % ( count, prefix[ -1 ], cigar[ ix+1: ] )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
674
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
675 def suffix_cigar( cigar, suffix ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
676 if cigar[ -1 ] != suffix[ -1 ]:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
677 return cigar + suffix
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
678 ix = len( cigar ) - 2
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
679 while cigar[ix].isdigit():
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
680 ix -= 1
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
681 ix += 1
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
682 count = int( cigar[ ix:-1 ] ) + int( suffix[ :-1 ] )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
683 return "%s%d%s" % ( cigar[ :ix ], count, suffix[ -1 ] )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
684
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
685 def split_cigar( text ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
686 fields = []
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
687 field = []
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
688 for ch in text:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
689 if ch not in "MIDHS":
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
690 field += ch
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
691 continue
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
692 if field == []:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
693 raise ValueError
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
694 fields += [ ( int( "".join( field ) ), ch ) ]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
695 field = []
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
696 if field != []:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
697 raise ValueError
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
698 return fields
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
699
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
700 def clip_for_cigar( cigar, nucs, quals ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
701 # Hard clip prefix
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
702 count, op = cigar[0]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
703 if op == "H":
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
704 nucs = nucs[ count: ]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
705 if quals != None:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
706 quals = quals[ count: ]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
707 count, op = cigar[ 1 ]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
708 # Soft clip prefix
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
709 if op == "S":
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
710 nucs = nucs[ :count ].lower() + nucs[ count: ]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
711 # Hard clip suffix
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
712 count,op = cigar[ -1 ]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
713 if op == "H":
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
714 nucs = nucs[ :-count ]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
715 if quals != None:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
716 quals = quals[ :-count ]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
717 count, op = cigar[ -2 ]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
718 # Soft clip suffix
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
719 if op == "S":
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
720 nucs = nucs[ :-count ] + nucs[ -count: ].lower()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
721 return nucs, quals
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
722
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
723 def samify_phred_scores( quals ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
724 """
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
725 Convert a decimal list of phred base-quality scores to a sam quality string.
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
726 Note that if a quality is outside the dynamic range of sam's ability to
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
727 represent it, we clip the value to the max allowed. SAM quality scores
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
728 range from chr(33) to chr(126).
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
729 """
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
730 if min( quals ) >= 0 and max( quals ) <= 126-33:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
731 return "".join( [ chr( 33 + q ) for q in quals ] )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
732 else:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
733 return "".join( [ chr( max( 33, min( 126, 33+q ) ) ) for q in quals ] )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
734
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
735 def reverse_complement( nucs ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
736 complementMap = maketrans( "ACGTacgt", "TGCAtgca" )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
737 return nucs[ ::-1 ].translate( complementMap )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
738
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
739 def reverse_string( s ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
740 return s[ ::-1 ]
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
741
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
742 def __main__():
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
743 # Parse command line
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
744 # input1: a reference genome ( 2bit or fasta )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
745 # input2: a collection of 454 paired end reads ( a fasta file )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
746 # input3: a linker sequence ( a very small fasta file )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
747 # input4: a base quality score 454 file ( qual454 )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
748 parser = optparse.OptionParser()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
749 parser.add_option( '', '--ref_name', dest='ref_name', help='The reference name to change all output matches to' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
750 parser.add_option( '', '--ref_source', dest='ref_source', help='The reference is cached or from the history' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
751 parser.add_option( '', '--ref_sequences', dest='ref_sequences', help='Number of sequences in the reference dataset' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
752 parser.add_option( '', '--source_select', dest='source_select', help='Use pre-set or cached reference file' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
753 parser.add_option( '', '--input1', dest='input1', help='The name of the reference file if using history or reference base name if using cached' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
754 parser.add_option( '', '--input2', dest='input2', help='The 454 reads file to align' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
755 parser.add_option( '', '--input3', dest='input3', help='The sequencing linker file' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
756 parser.add_option( '', '--input4', dest='input4', help='The base quality score 454 file' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
757 parser.add_option( '', '--output', dest='output', help='The output file' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
758 parser.add_option( '', '--lastz_seqs_file_dir', dest='lastz_seqs_file_dir', help='Directory of local lastz_seqs.loc file' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
759 ( options, args ) = parser.parse_args()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
760
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
761 # output version # of tool
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
762 try:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
763 tmp = tempfile.NamedTemporaryFile().name
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
764 tmp_stdout = open( tmp, 'wb' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
765 proc = subprocess.Popen( args='lastz -v', shell=True, stdout=tmp_stdout )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
766 tmp_stdout.close()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
767 returncode = proc.wait()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
768 stdout = None
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
769 for line in open( tmp_stdout.name, 'rb' ):
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
770 if line.lower().find( 'version' ) >= 0:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
771 stdout = line.strip()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
772 break
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
773 if stdout:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
774 sys.stdout.write( '%s\n' % stdout )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
775 else:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
776 raise Exception
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
777 except:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
778 sys.stdout.write( 'Could not determine Lastz version\n' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
779
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
780 if options.ref_name:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
781 ref_name = '[nickname=%s]' % options.ref_name
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
782 else:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
783 ref_name = ''
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
784 if options.ref_source == 'history':
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
785 # Reference is a fasta dataset from the history
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
786 try:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
787 # Ensure there is at least 1 sequence in the dataset ( this may not be necessary ).
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
788 error_msg = "The reference dataset is missing metadata, click the pencil icon in the history item and 'auto-detect' the metadata attributes."
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
789 ref_sequences = int( options.ref_sequences )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
790 if ref_sequences < 1:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
791 stop_err( error_msg )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
792 except:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
793 stop_err( error_msg )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
794 else:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
795 ref_sequences = 0
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
796 tmp_w12_name = get_tmp_file_name( suffix='vs_linker.W12' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
797 tmp_T1_name = get_tmp_file_name( suffix='vs_linker.T1' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
798 # Run lastz twice ( with different options ) on the linker sequence and paired end reads,
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
799 # looking for the linker ( each run finds some the other doesn't )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
800 command = 'lastz %s %s W=12 --notrans --exact=18 --match=1,3 O=1 E=3 Y=10 L=18 --ambiguousn --coverage=85 --format=general-:name2,zstart2+,length2,size2 > %s' % \
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
801 ( options.input3, options.input2, tmp_w12_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
802 run_command( command )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
803 command = 'lastz %s %s T=1 --match=1,2 O=1 E=2 X=15 K=10 Y=15 L=18 --ambiguousn --coverage=85 --format=general-:name2,zstart2+,length2,size2 > %s' % \
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
804 ( options.input3, options.input2, tmp_T1_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
805 run_command( command )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
806 # Combine the alignment output from the two lastz runs
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
807 tmp_combined_linker_file_name = get_tmp_file_name( suffix='vs_linker' )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
808 command = 'cat %s %s | sort -u > %s' % ( tmp_w12_name, tmp_T1_name, tmp_combined_linker_file_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
809 run_command( command )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
810 # Use the alignment info to split reads into left and right mates
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
811 tmp_mates_mapping_file_name, tmp_mates_file_name, tmp_mates_short_file_name, tmp_mates_long_file_name = split_paired_reads( options.input2, tmp_combined_linker_file_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
812 # Align mates to the reference - tmp_align_file_names is a list of file names created by align_mates()
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
813 tmp_align_file_name_list = align_mates( options.input1, options.ref_source, ref_name, ref_sequences, tmp_mates_short_file_name, tmp_mates_long_file_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
814 # Combine and convert mate coordinates back to read coordinates
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
815 paired_mate_unmapper( options.input2, options.input4, tmp_mates_mapping_file_name, tmp_align_file_name_list, options.output )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
816 # Delete all temporary files
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
817 for file_name in tmp_file_names:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
818 os.remove( file_name )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
819 # Handle any invalid lines in the input data
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
820 if total_skipped_lines:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
821 msgs = dict( bad_interval="Bad interval in line",
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
822 inconsistent_read_lengths="Inconsistent read/quality lengths for seq #",
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
823 inconsistent_reads="Inconsistent reads for seq #",
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
824 inconsistent_sizes="Inconsistent sizes for seq #",
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
825 missing_mate="Mapping file does not include mate on line",
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
826 missing_quals="Missing quality values for name on line",
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
827 missing_seq="Missing sequence for name on line",
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
828 multiple_seqs="Multiple names for seq #",
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
829 no_header="First quality sequence has no header",
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
830 num_fields="Must have 4 fields in line",
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
831 reads_paired="SAM flag indicates reads already paired on line",
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
832 sam_flag="Bad SAM flag on line",
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
833 sam_headers="SAM headers on line",
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
834 sam_min_columns="Need 11 columns on line",
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
835 two_mate_names="Mate name already seen, line",
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
836 wrong_seq_len="Size differs from length of seq #" )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
837 print "Skipped %d invalid lines: "
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
838 msg = ""
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
839 for k, v in skipped_lines.items():
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
840 if v[0]:
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
841 # v[0] is the number of times the error occurred
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
842 # v[1] is the position of the line or sequence in the file
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
843 # v[2] is the name of the sequence or the text of the line
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
844 msg += "(%d)%s %d:%s. " % ( v[0], msgs[k], v[1], v[2] )
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
845 print msg
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
846
96825cee5c25 Uploaded tarball
devteam
parents:
diff changeset
847 if __name__=="__main__": __main__()