comparison bwa-0.6.2/bwa.1 @ 2:a294fbfcb1db draft default tip
Uploaded BWA
author | ashvark |
---|---|
date | Fri, 18 Jul 2014 07:55:59 -0400 |
parents | dd1186b11b3b |
children |
1:a9636dc1e99a | 2:a294fbfcb1db |
---|---|
1 .TH bwa 1 "19 June 2012" "bwa-0.6.2" "Bioinformatics tools" | |
2 .SH NAME | |
3 .PP | |
4 bwa - Burrows-Wheeler Alignment Tool | |
5 .SH SYNOPSIS | |
6 .PP | |
7 bwa index -a bwtsw database.fasta | |
8 .PP | |
9 bwa aln database.fasta short_read.fastq > aln_sa.sai | |
10 .PP | |
11 bwa samse database.fasta aln_sa.sai short_read.fastq > aln.sam | |
12 .PP | |
13 bwa sampe database.fasta aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln.sam | |
14 .PP | |
15 bwa bwasw database.fasta long_read.fastq > aln.sam | |
16 | |
17 .SH DESCRIPTION | |
18 .PP | |
19 BWA is a fast, lightweight tool that aligns relatively short sequences | |
20 (queries) to a sequence database (target), such as the human reference | |
21 genome. It implements two different algorithms, both based on | |
22 Burrows-Wheeler Transform (BWT). The first algorithm is designed for | |
23 short queries up to ~150bp with low error rate (<3%). It does gapped | |
24 global alignment w.r.t. queries, supports paired-end reads, and is one | |
25 of the fastest short read alignment algorithms to date while also | |
26 visiting suboptimal hits. The second algorithm, BWA-SW, is designed for | |
27 reads longer than 100bp with more errors. It performs a heuristic Smith-Waterman-like | |
28 alignment to find high-scoring local hits and split hits. On | |
29 low-error short queries, BWA-SW is a little slower and less accurate than the | |
30 first algorithm, but on long queries, it is better. | |
31 .PP | |
32 For both algorithms, the database file in the FASTA format must be | |
33 first indexed with the | |
34 .B `index' | |
35 command, which typically takes a few hours for a 3GB genome. The first algorithm is | |
36 implemented via the | |
37 .B `aln' | |
38 command, which finds the suffix array (SA) coordinates of good hits of | |
39 each individual read, and the | |
40 .B `samse/sampe' | |
41 command, which converts SA coordinates to chromosomal coordinates and | |
42 pairs reads (for `sampe'). The second algorithm is invoked by the | |
43 .B `bwasw' | |
44 command. It works for single-end reads only. | |
45 | |
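The single-end workflow described above (index, then aln, then samse) can be sketched as a small Python helper that only builds the command lines. The function names and the default output file names are illustrative assumptions taken from the SYNOPSIS; nothing here is part of BWA itself.

```python
import subprocess


def single_end_commands(ref, reads, sai="aln_sa.sai", sam="aln.sam"):
    """Return (argv, stdout_path) pairs for index -> aln -> samse.

    stdout_path is None when the command writes its own output files.
    """
    return [
        (["bwa", "index", "-a", "bwtsw", ref], None),
        (["bwa", "aln", ref, reads], sai),
        (["bwa", "samse", ref, sai, reads], sam),
    ]


def run(commands):
    """Run each command in order, redirecting stdout where requested.

    Requires a bwa executable on PATH; not called in this sketch.
    """
    for argv, out_path in commands:
        if out_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(out_path, "wb") as out:
                subprocess.run(argv, stdout=out, check=True)


if __name__ == "__main__":
    for argv, out_path in single_end_commands("database.fasta", "short_read.fastq"):
        print(" ".join(argv), "" if out_path is None else "> " + out_path)
```

Keeping command construction separate from execution makes it possible to inspect or log the pipeline without BWA installed.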
46 .SH COMMANDS AND OPTIONS | |
47 .TP | |
48 .B index | |
49 bwa index [-p prefix] [-a algoType] <in.db.fasta> | |
50 | |
51 Index database sequences in the FASTA format. | |
52 | |
53 .B OPTIONS: | |
54 .RS | |
55 .TP 10 | |
56 .B -c | |
57 Build color-space index. The input FASTA should be in nucleotide space. (Disabled since 0.6.x) | |
58 .TP | |
59 .BI -p \ STR | |
60 Prefix of the output database [same as db filename] | |
61 .TP | |
62 .BI -a \ STR | |
63 Algorithm for constructing BWT index. Available options are: | |
64 .RS | |
65 .TP | |
66 .B is | |
67 IS linear-time algorithm for constructing suffix array. It requires | |
68 5.37N memory where N is the size of the database. IS is moderately fast, | |
69 but does not work with databases larger than 2GB. IS is the default | |
70 algorithm due to its simplicity. The current code for the IS algorithm was | |
71 reimplemented by Yuta Mori. | |
72 .TP | |
73 .B bwtsw | |
74 Algorithm implemented in BWT-SW. This method works with the whole human | |
75 genome. | |
76 .RE | |
77 .RE | |
78 | |
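As a rough illustration of the trade-off between the two index algorithms, the sketch below picks an algorithm from the database size and estimates IS memory use from the 5.37N figure above. It assumes that N is measured in bytes and that "2GB" means 2 * 1024**3 bytes; both are assumptions, and the function names are hypothetical.

```python
IS_MEMORY_FACTOR = 5.37          # from the text: IS requires 5.37N memory
IS_MAX_BYTES = 2 * 1024 ** 3     # assumption: the 2GB limit read as 2 GiB


def suggest_index_algorithm(db_bytes):
    """Return the value to pass to `bwa index -a` for a database of this size."""
    return "is" if db_bytes < IS_MAX_BYTES else "bwtsw"


def is_memory_estimate(db_bytes):
    """Estimated memory (bytes) needed by the IS algorithm."""
    return IS_MEMORY_FACTOR * db_bytes


if __name__ == "__main__":
    size = 100 * 1024 ** 2  # a 100 MiB database
    print(suggest_index_algorithm(size), round(is_memory_estimate(size) / 1024 ** 2), "MiB")
```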
79 .TP | |
80 .B aln | |
81 bwa aln [-n maxDiff] [-o maxGapO] [-e maxGapE] [-d nDelTail] [-i | |
82 nIndelEnd] [-k maxSeedDiff] [-l seedLen] [-t nThrds] [-cRN] [-M misMsc] | |
83 [-O gapOsc] [-E gapEsc] [-q trimQual] <in.db.fasta> <in.query.fq> > | |
84 <out.sai> | |
85 | |
86 Find the SA coordinates of the input reads. Maximum | |
87 .I maxSeedDiff | |
88 differences are allowed in the first | |
89 .I seedLen | |
90 subsequence and maximum | |
91 .I maxDiff | |
92 differences are allowed in the whole sequence. | |
93 | |
94 .B OPTIONS: | |
95 .RS | |
96 .TP 10 | |
97 .BI -n \ NUM | |
98 Maximum edit distance if the value is INT, or the fraction of missing | |
99 alignments given 2% uniform base error rate if FLOAT. In the latter | |
100 case, the maximum edit distance is automatically chosen for different | |
101 read lengths. [0.04] | |
102 .TP | |
103 .BI -o \ INT | |
104 Maximum number of gap opens [1] | |
105 .TP | |
106 .BI -e \ INT | |
107 Maximum number of gap extensions, -1 for k-difference mode (disallowing | |
108 long gaps) [-1] | |
109 .TP | |
110 .BI -d \ INT | |
111 Disallow a long deletion within INT bp towards the 3'-end [16] | |
112 .TP | |
113 .BI -i \ INT | |
114 Disallow an indel within INT bp towards the ends [5] | |
115 .TP | |
116 .BI -l \ INT | |
117 Take the first INT subsequence as seed. If INT is larger than the query | |
118 sequence, seeding will be disabled. For long reads, this option | |
119 typically ranges from 25 to 35 for `-k 2'. [inf] | |
120 .TP | |
121 .BI -k \ INT | |
122 Maximum edit distance in the seed [2] | |
123 .TP | |
124 .BI -t \ INT | |
125 Number of threads (multi-threading mode) [1] | |
126 .TP | |
127 .BI -M \ INT | |
128 Mismatch penalty. BWA will not search for suboptimal hits with a score | |
129 lower than (bestScore-misMsc). [3] | |
130 .TP | |
131 .BI -O \ INT | |
132 Gap open penalty [11] | |
133 .TP | |
134 .BI -E \ INT | |
135 Gap extension penalty [4] | |
136 .TP | |
137 .BI -R \ INT | |
138 Proceed with suboptimal alignments if there are no more than INT equally | |
139 best hits. This option only affects paired-end mapping. Increasing this | |
140 threshold helps to improve the pairing accuracy at the cost of speed, | |
141 especially for short reads (~32bp). | |
142 .TP | |
143 .B -c | |
144 Reverse query but not complement it, which is required for alignment in | |
145 the color space. (Disabled since 0.6.x) | |
146 .TP | |
147 .B -N | |
148 Disable iterative search. All hits with no more than | |
149 .I maxDiff | |
150 differences will be found. This mode is much slower than the default. | |
151 .TP | |
152 .BI -q \ INT | |
153 Parameter for read trimming. BWA trims a read down to | |
154 argmax_x{\\sum_{i=x+1}^l(INT-q_i)} if q_l<INT where l is the original | |
155 read length. [0] | |
156 .TP | |
157 .B -I | |
158 The input is in the Illumina 1.3+ read format (quality equals ASCII-64). | |
159 .TP | |
160 .BI -B \ INT | |
161 Length of barcode starting from the 5'-end. When | |
162 .I INT | |
163 is positive, the barcode of each read will be trimmed before mapping and will | |
164 be written at the | |
165 .B BC | |
166 SAM tag. For paired-end reads, the barcode from both ends are concatenated. [0] | |
167 .TP | |
168 .B -b | |
169 Specify that the input read sequence file is in the BAM format. For paired-end | |
170 data, two ends in a pair must be grouped together and options | |
171 .B -1 | |
172 or | |
173 .B -2 | |
174 are usually applied to specify which end should be mapped. Typical | |
175 command lines for mapping paired-end data in the BAM format are: | |
176 | |
177 bwa aln ref.fa -b1 reads.bam > 1.sai | |
178 bwa aln ref.fa -b2 reads.bam > 2.sai | |
179 bwa sampe ref.fa 1.sai 2.sai reads.bam reads.bam > aln.sam | |
180 .TP | |
181 .B -0 | |
182 When | |
183 .B -b | |
184 is specified, only use single-end reads in mapping. | |
185 .TP | |
186 .B -1 | |
187 When | |
188 .B -b | |
189 is specified, only use the first read in a read pair in mapping (skip | |
190 single-end reads and the second reads). | |
191 .TP | |
192 .B -2 | |
193 When | |
194 .B -b | |
195 is specified, only use the second read in a read pair in mapping. | |
196 .B | |
197 .RE | |
198 | |
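The trimming rule given for -q can be read literally as the following Python sketch. It evaluates argmax_x of the sum over i=x+1..l of (INT - q_i) and returns x as the new read length. This is a direct reading of the formula only: the tie-breaking (prefer the longest read) is an assumption, and any additional safeguards in the actual BWA implementation, such as a minimum retained length, are not modelled.

```python
def trimmed_length(quals, threshold):
    """Return the read length after quality trimming.

    quals holds the Phred qualities q_1..q_l (quals[i-1] is q_i);
    threshold is the INT given to -q.
    """
    l = len(quals)
    # No trimming unless the last base is below the threshold.
    if not quals or quals[-1] >= threshold:
        return l
    best_x, best_sum, running = l, 0, 0
    # Walk x from l-1 down to 0, accumulating sum_{i=x+1}^{l} (INT - q_i).
    for x in range(l - 1, -1, -1):
        running += threshold - quals[x]  # quals[x] is q_{x+1}
        if running > best_sum:           # strict: ties keep the longer read
            best_sum, best_x = running, x
    return best_x


if __name__ == "__main__":
    print(trimmed_length([30, 30, 30, 10, 5], 20))
```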
199 .TP | |
200 .B samse | |
201 bwa samse [-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam> | |
202 | |
203 Generate alignments in the SAM format given single-end reads. Repetitive | |
204 hits will be randomly chosen. | |
205 | |
206 .B OPTIONS: | |
207 .RS | |
208 .TP 10 | |
209 .BI -n \ INT | |
210 Maximum number of alignments to output in the XA tag for reads paired | |
211 properly. If a read has more than INT hits, the XA tag will not be | |
212 written. [3] | |
213 .TP | |
214 .BI -r \ STR | |
215 Specify the read group in a format like `@RG\\tID:foo\\tSM:bar'. [null] | |
216 .RE | |
217 | |
218 .TP | |
219 .B sampe | |
220 bwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N maxHitDis] | |
221 [-P] <in.db.fasta> <in1.sai> <in2.sai> <in1.fq> <in2.fq> > <out.sam> | |
222 | |
223 Generate alignments in the SAM format given paired-end reads. Repetitive | |
224 read pairs will be placed randomly. | |
225 | |
226 .B OPTIONS: | |
227 .RS | |
228 .TP 8 | |
229 .BI -a \ INT | |
230 Maximum insert size for a read pair to be considered being mapped | |
231 properly. Since 0.4.5, this option is only used when there are not | |
232 enough good alignments to infer the distribution of insert sizes. [500] | |
233 .TP | |
234 .BI -o \ INT | |
235 Maximum occurrences of a read for pairing. A read with more occurrences | |
236 will be treated as a single-end read. Reducing this parameter speeds up | |
237 pairing. [100000] | |
238 .TP | |
239 .B -P | |
240 Load the entire FM-index into memory to reduce disk operations | |
241 (base-space reads only). With this option, at least 1.25N bytes of | |
242 memory are required, where N is the length of the genome. | |
243 .TP | |
244 .BI -n \ INT | |
245 Maximum number of alignments to output in the XA tag for reads paired | |
246 properly. If a read has more than INT hits, the XA tag will not be | |
247 written. [3] | |
248 .TP | |
249 .BI -N \ INT | |
250 Maximum number of alignments to output in the XA tag for discordant | |
251 read pairs (excluding singletons). If a read has more than INT hits, the | |
252 XA tag will not be written. [10] | |
253 .TP | |
254 .BI -r \ STR | |
255 Specify the read group in a format like `@RG\\tID:foo\\tSM:bar'. [null] | |
256 .RE | |
257 | |
258 .TP | |
259 .B bwasw | |
260 bwa bwasw [-a matchScore] [-b mmPen] [-q gapOpenPen] [-r gapExtPen] [-t | |
261 nThreads] [-w bandWidth] [-T thres] [-s hspIntv] [-z zBest] [-N | |
262 nHspRev] [-c thresCoef] <in.db.fasta> <in.fq> [mate.fq] | |
263 | |
264 Align query sequences in the | |
265 .I in.fq | |
266 file. When | |
267 .I mate.fq | |
268 is present, perform paired-end alignment. The paired-end mode only works | |
269 for reads from Illumina short-insert libraries. In the paired-end mode, BWA-SW | |
270 may still output split alignments but they are all marked as not properly | |
271 paired; the mate positions will not be written if the mate has multiple | |
272 local hits. | |
273 | |
274 .B OPTIONS: | |
275 .RS | |
276 .TP 10 | |
277 .BI -a \ INT | |
278 Score of a match [1] | |
279 .TP | |
280 .BI -b \ INT | |
281 Mismatch penalty [3] | |
282 .TP | |
283 .BI -q \ INT | |
284 Gap open penalty [5] | |
285 .TP | |
286 .BI -r \ INT | |
287 Gap extension penalty. The penalty for a contiguous gap of size k is | |
288 q+k*r. [2] | |
289 .TP | |
290 .BI -t \ INT | |
291 Number of threads in the multi-threading mode [1] | |
292 .TP | |
293 .BI -w \ INT | |
294 Band width in the banded alignment [33] | |
295 .TP | |
296 .BI -T \ INT | |
297 Minimum score threshold divided by a [37] | |
298 .TP | |
299 .BI -c \ FLOAT | |
300 Coefficient for threshold adjustment according to query length. Given an | |
301 l-long query, the threshold for a hit to be retained is | |
302 a*max{T,c*log(l)}. [5.5] | |
303 .TP | |
304 .BI -z \ INT | |
305 Z-best heuristics. Higher -z increases accuracy at the cost of speed. [1] | |
306 .TP | |
307 .BI -s \ INT | |
308 Maximum SA interval size for initiating a seed. Higher -s increases | |
309 accuracy at the cost of speed. [3] | |
310 .TP | |
311 .BI -N \ INT | |
312 Minimum number of seeds supporting the resultant alignment to skip | |
313 reverse alignment. [5] | |
314 .RE | |
315 | |
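The two formulas in the bwasw options, the gap penalty q+k*r and the hit threshold a*max{T,c*log(l)}, can be evaluated with the short sketch below. Defaults are the ones listed above. The text does not state the base of the logarithm, so the natural logarithm is used here as an assumption; the function names are illustrative.

```python
import math


def gap_penalty(k, q=5, r=2):
    """Penalty for a contiguous gap of size k: q + k*r."""
    return q + k * r


def hit_threshold(l, a=1, T=37, c=5.5):
    """Score threshold for retaining a hit on an l-long query.

    Assumes log() is the natural logarithm.
    """
    return a * max(T, c * math.log(l))


if __name__ == "__main__":
    print(gap_penalty(3))
    for length in (100, 1000, 10000):
        print(length, round(hit_threshold(length), 2))
```

With these defaults the length-dependent term only exceeds T for queries of roughly a thousand bases or more, so for shorter queries the threshold is simply a*T.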
316 .SH SAM ALIGNMENT FORMAT | |
317 .PP | |
318 The output of the | |
319 .B `aln' | |
320 command is binary and designed for BWA use only. BWA outputs the final | |
321 alignment in the SAM (Sequence Alignment/Map) format. Each line consists | |
322 of: | |
323 | |
324 .TS | |
325 center box; | |
326 cb | cb | cb | |
327 n | l | l . | |
328 Col Field Description | |
329 _ | |
330 1 QNAME Query (pair) NAME | |
331 2 FLAG bitwise FLAG | |
332 3 RNAME Reference sequence NAME | |
333 4 POS 1-based leftmost POSition/coordinate of clipped sequence | |
334 5 MAPQ MAPping Quality (Phred-scaled) | |
335 6 CIGAR extended CIGAR string | |
336 7 MRNM Mate Reference sequence NaMe (`=' if same as RNAME) | |
337 8 MPOS 1-based Mate POSition | |
338 9 ISIZE Inferred insert SIZE | |
339 10 SEQ query SEQuence on the same strand as the reference | |
340 11 QUAL query QUALity (ASCII-33 gives the Phred base quality) | |
341 12 OPT variable OPTional fields in the format TAG:VTYPE:VALUE | |
342 .TE | |
343 | |
344 .PP | |
345 Each bit in the FLAG field is defined as: | |
346 | |
347 .TS | |
348 center box; | |
349 cb | cb | cb | |
350 c | l | l . | |
351 Chr Flag Description | |
352 _ | |
353 p 0x0001 the read is paired in sequencing | |
354 P 0x0002 the read is mapped in a proper pair | |
355 u 0x0004 the query sequence itself is unmapped | |
356 U 0x0008 the mate is unmapped | |
357 r 0x0010 strand of the query (1 for reverse) | |
358 R 0x0020 strand of the mate | |
359 1 0x0040 the read is the first read in a pair | |
360 2 0x0080 the read is the second read in a pair | |
361 s 0x0100 the alignment is not primary | |
362 f 0x0200 QC failure | |
363 d 0x0400 optical or PCR duplicate | |
364 .TE | |
365 | |
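To make the FLAG table concrete, the sketch below decodes an integer FLAG into the one-character codes from the Chr column. The table contents come from the text above; the function and variable names are illustrative.

```python
# (bit, one-character code, description), in the order of the table above.
FLAG_BITS = [
    (0x0001, "p", "the read is paired in sequencing"),
    (0x0002, "P", "the read is mapped in a proper pair"),
    (0x0004, "u", "the query sequence itself is unmapped"),
    (0x0008, "U", "the mate is unmapped"),
    (0x0010, "r", "strand of the query (1 for reverse)"),
    (0x0020, "R", "strand of the mate"),
    (0x0040, "1", "the read is the first read in a pair"),
    (0x0080, "2", "the read is the second read in a pair"),
    (0x0100, "s", "the alignment is not primary"),
    (0x0200, "f", "QC failure"),
    (0x0400, "d", "optical or PCR duplicate"),
]


def decode_flag(flag):
    """Return the codes of all bits set in flag, e.g. 99 -> 'pPR1'."""
    return "".join(code for bit, code, _ in FLAG_BITS if flag & bit)


def describe_flag(flag):
    """Return the descriptions of all bits set in flag."""
    return [text for bit, _, text in FLAG_BITS if flag & bit]


if __name__ == "__main__":
    print(decode_flag(99))
    for line in describe_flag(99):
        print(" -", line)
```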
366 .PP | |
367 Please check <http://samtools.sourceforge.net> for the format | |
368 specification and the tools for post-processing the alignment. | |
369 | |
370 BWA generates the following optional fields. Tags starting with `X' are | |
371 specific to BWA. | |
372 | |
373 .TS | |
374 center box; | |
375 cb | cb | |
376 cB | l . | |
377 Tag Meaning | |
378 _ | |
379 NM Edit distance | |
380 MD Mismatching positions/bases | |
381 AS Alignment score | |
382 BC Barcode sequence | |
383 _ | |
384 X0 Number of best hits | |
385 X1 Number of suboptimal hits found by BWA | |
386 XN Number of ambiguous bases in the reference | |
387 XM Number of mismatches in the alignment | |
388 XO Number of gap opens | |
389 XG Number of gap extensions | |
390 XT Type: Unique/Repeat/N/Mate-sw | |
391 XA Alternative hits; format: (chr,pos,CIGAR,NM;)* | |
392 _ | |
393 XS Suboptimal alignment score | |
394 XF Support from forward/reverse alignment | |
395 XE Number of supporting seeds | |
396 .TE | |
397 | |
398 .PP | |
399 Note that XO and XG are generated by BWT search while the CIGAR string | |
400 is generated by Smith-Waterman alignment. These two tags may be inconsistent with the | |
401 CIGAR string. This is not a bug. | |
402 | |
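The XA format listed above, (chr,pos,CIGAR,NM;)*, is simple to split into records. The following is a minimal sketch, assuming the value has already been stripped of its `XA:Z:` prefix and that pos parses as a (possibly signed) integer. The sample string is made up for illustration.

```python
def parse_xa(value):
    """Parse an XA tag value into (chr, pos, CIGAR, NM) tuples."""
    hits = []
    for entry in value.split(";"):
        if not entry:  # skip the empty piece after the trailing ';'
            continue
        # rsplit keeps any commas inside the sequence name intact.
        chrom, pos, cigar, nm = entry.rsplit(",", 3)
        hits.append((chrom, int(pos), cigar, int(nm)))
    return hits


if __name__ == "__main__":
    for hit in parse_xa("chr1,+100,36M,1;chr2,-200,36M,2;"):
        print(hit)
```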
403 .SH NOTES ON SHORT-READ ALIGNMENT | |
404 .SS Alignment Accuracy | |
405 .PP | |
406 When seeding is disabled, BWA guarantees to find an alignment | |
407 containing maximum | |
408 .I maxDiff | |
409 differences including | |
410 .I maxGapO | |
411 gap opens which do not occur within | |
412 .I nIndelEnd | |
413 bp towards either end of the query. Longer gaps may be found if | |
414 .I maxGapE | |
415 is positive, but it is not guaranteed to find all hits. When seeding is | |
416 enabled, BWA further requires that the first | |
417 .I seedLen | |
418 subsequence contains no more than | |
419 .I maxSeedDiff | |
420 differences. | |
421 .PP | |
422 When gapped alignment is disabled, BWA is expected to generate the same | |
423 alignment as Eland version 1, the Illumina alignment program. However, as BWA | |
424 changes `N' in the database sequence to random nucleotides, hits to these | |
425 random sequences will also be counted. As a consequence, BWA may mark a | |
426 unique hit as a repeat, if the random sequences happen to be identical | |
427 to the sequences which should be unique in the database. | |
428 .PP | |
429 By default, if the best hit is not highly repetitive (controlled by -R), BWA | |
430 also finds all hits containing one more mismatch; otherwise, BWA finds all | |
431 equally best hits only. Base quality is NOT considered in evaluating | |
432 hits. In the paired-end mode, BWA pairs all hits it found. It further | |
433 performs Smith-Waterman alignment for unmapped reads to rescue reads with a | |
434 high error rate, and for high-quality anomalous pairs to fix potential alignment | |
435 errors. | |
436 | |
437 .SS Estimating Insert Size Distribution | |
438 .PP | |
439 BWA estimates the insert size distribution per 256*1024 read pairs. It | |
440 first collects pairs of reads with both ends mapped with a single-end | |
441 quality 20 or higher and then calculates median (Q2), lower and higher | |
442 quartile (Q1 and Q3). It estimates the mean and the variance of the | |
443 insert size distribution from pairs whose insert sizes are within | |
444 interval [Q1-2(Q3-Q1), Q3+2(Q3-Q1)]. The maximum distance x for a pair | |
445 considered to be properly paired (SAM flag 0x2) is calculated by solving | |
446 equation Phi((x-mu)/sigma)=x/L*p0, where mu is the mean, sigma is the | |
447 standard deviation of the insert size distribution, L is the length of the | |
448 genome, p0 is prior of anomalous pair and Phi() is the standard | |
449 cumulative distribution function. For mapping Illumina short-insert | |
450 reads to the human genome, x is about 6-7 sigma away from the | |
451 mean. Quartiles, mean, variance and x will be printed to the standard | |
452 error output. | |
453 | |
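The quartile-based filtering step described above can be sketched as follows. The sketch takes a list of observed insert sizes (assumed to come from pairs that already passed the mapping-quality filter), computes Q1, Q2 and Q3, keeps sizes within [Q1-2(Q3-Q1), Q3+2(Q3-Q1)], and reports the mean and variance of what remains. The rounding rule used to pick quartiles and the use of the population variance are assumptions, and solving for the maximum distance x is not attempted.

```python
def quartile(sorted_vals, frac):
    """Pick a quartile by rounded rank (an assumed convention)."""
    idx = min(int(len(sorted_vals) * frac + 0.5), len(sorted_vals) - 1)
    return sorted_vals[idx]


def estimate_insert_size(sizes):
    """Return quartiles plus mean/variance of the outlier-filtered sizes."""
    if not sizes:
        raise ValueError("need at least one insert size")
    vals = sorted(sizes)
    q1, q2, q3 = (quartile(vals, f) for f in (0.25, 0.5, 0.75))
    low, high = q1 - 2 * (q3 - q1), q3 + 2 * (q3 - q1)
    kept = [v for v in vals if low <= v <= high]
    mean = sum(kept) / len(kept)
    variance = sum((v - mean) ** 2 for v in kept) / len(kept)
    return {"q1": q1, "q2": q2, "q3": q3, "kept": len(kept),
            "mean": mean, "variance": variance}


if __name__ == "__main__":
    # Made-up sizes with one obvious outlier.
    print(estimate_insert_size([180, 190, 195, 200, 200, 205, 210, 220, 5000]))
```

The outlier at 5000 falls outside the interval and therefore does not inflate the estimated mean or variance.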
454 .SS Memory Requirement | |
455 .PP | |
456 With bwtsw algorithm, 5GB memory is required for indexing the complete | |
457 human genome sequences. For short reads, the | |
458 .B aln | |
459 command uses ~3.2GB memory and the | |
460 .B sampe | |
461 command uses ~5.4GB. | |
462 | |
463 .SS Speed | |
464 .PP | |
465 Indexing the human genome sequences takes 3 hours with bwtsw | |
466 algorithm. Indexing smaller genomes with the IS algorithm is | |
467 faster, but requires more memory. | |
468 .PP | |
469 The speed of alignment is largely determined by the error rate of the query | |
470 sequences (r). Firstly, BWA runs much faster for near perfect hits than | |
471 for hits with many differences, and it stops searching for a hit with | |
472 l+2 differences if an l-difference hit is found. This means BWA will be | |
473 very slow if r is high because in this case BWA has to visit hits with | |
474 many differences and looking for these hits is expensive. Secondly, the | |
475 underlying alignment algorithm makes the speed sensitive to [k log(N)/m], | |
476 where k is the maximum allowed differences, N the size of database and m | |
477 the length of a query. In practice, we choose k w.r.t. r and therefore r | |
478 is the leading factor. I would not recommend using BWA on data with | |
479 r>0.02. | |
480 .PP | |
481 Pairing is slower for shorter reads. This is mainly because shorter | |
482 reads have more spurious hits and converting SA coordinates to | |
483 chromosomal coordinates is very costly. | |
484 | |
485 .SH NOTES ON LONG-READ ALIGNMENT | |
486 .PP | |
487 Command | |
488 .B bwasw | |
489 is designed for long-read alignment. BWA-SW essentially aligns the trie | |
490 of the reference genome against the directed acyclic word graph (DAWG) of a | |
491 read to find seeds not highly repetitive in the genome, and then performs a | |
492 standard Smith-Waterman algorithm to extend the seeds. A key heuristic, called | |
493 the Z-best heuristic, is that at each vertex in the DAWG, BWA-SW only keeps the | |
494 top Z reference suffix intervals that match the vertex. BWA-SW is more accurate | |
495 if the resultant alignment is supported by more seeds, and therefore BWA-SW | |
496 usually performs better on long queries or queries with low divergence to the | |
497 reference genome. | |
498 | |
499 BWA-SW is perhaps a better choice than BWA-short for 100bp single-end HiSeq reads | |
500 mainly because it gives better gapped alignment. For paired-end reads, it is not yet | |
501 known whether BWA-short or BWA-SW yields overall better results. | |
502 | |
503 .SH CHANGES IN BWA-0.6 | |
504 .PP | |
505 Since version 0.6, BWA has been able to work with a reference genome longer than 4GB. | |
506 This feature makes it possible to integrate the forward and reverse complemented | |
507 genome in one FM-index, which speeds up both BWA-short and BWA-SW. As a tradeoff, | |
508 BWA uses more memory because it has to keep all positions and ranks in 64-bit | |
509 integers, twice as large as the 32-bit integers used in the previous versions. | |
510 | |
511 The latest BWA-SW also works for paired-end reads longer than 100bp. In | |
512 comparison to BWA-short, BWA-SW tends to be more accurate for highly unique | |
513 reads and more robust to relatively long INDELs and structural variants. | |
514 Nonetheless, BWA-short usually has higher power to distinguish the optimal hit | |
515 from many suboptimal hits. The choice of the mapping algorithm may depend on | |
516 the application. | |
517 | |
518 .SH SEE ALSO | |
519 BWA website <http://bio-bwa.sourceforge.net>, Samtools website | |
520 <http://samtools.sourceforge.net> | |
521 | |
522 .SH AUTHOR | |
523 Heng Li at the Sanger Institute wrote the key source code and | |
524 integrated the following code for BWT construction: bwtsw | |
525 <http://i.cs.hku.hk/~ckwong3/bwtsw/>, implemented by Chi-Kwong Wong at | |
526 the University of Hong Kong and IS | |
527 <http://yuta.256.googlepages.com/sais> originally proposed by Nong Ge | |
528 <http://www.cs.sysu.edu.cn/nong/> at the Sun Yat-Sen University and | |
529 implemented by Yuta Mori. | |
530 | |
531 .SH LICENSE AND CITATION | |
532 .PP | |
533 The full BWA package is distributed under GPLv3 as it uses source code | |
534 from BWT-SW which is covered by GPL. Sorting, hash table, BWT and IS | |
535 libraries are distributed under the MIT license. | |
536 .PP | |
537 If you use the short-read alignment component, please cite the following | |
538 paper: | |
539 .PP | |
540 Li H. and Durbin R. (2009) Fast and accurate short read alignment with | |
541 Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID: 19451168] | |
542 .PP | |
543 If you use the long-read component (BWA-SW), please cite: | |
544 .PP | |
545 Li H. and Durbin R. (2010) Fast and accurate long-read alignment with | |
546 Burrows-Wheeler transform. Bioinformatics, 26, 589-595. [PMID: 20080505] | |
547 | |
548 .SH HISTORY | |
549 BWA is largely influenced by BWT-SW. It uses source code from BWT-SW | |
550 and mimics its binary file formats; BWA-SW resembles BWT-SW in several | |
551 ways. The initial idea about BWT-based alignment also came from the | |
552 group who developed BWT-SW. At the same time, BWA is different enough | |
553 from BWT-SW. The short-read alignment algorithm bears no similarity to | |
554 Smith-Waterman algorithm any more. While BWA-SW learns from BWT-SW, it | |
555 introduces heuristics that can hardly be applied to the original | |
556 algorithm. In all, BWA does not guarantee to find all local hits as | |
557 BWT-SW is designed to do, but it is much faster than BWT-SW on both | |
558 short and long query sequences. | |
559 | |
560 I started to write the first piece of code on 24 May 2008 and got the | |
561 initial stable version on 02 June 2008. During this period, I | |
562 learned that Professor Tak-Wah Lam, the first author of the BWT-SW paper, | |
563 was collaborating with Beijing Genomics Institute on SOAP2, the successor | |
564 to SOAP (Short Oligonucleotide Analysis Package). SOAP2 came out in | |
565 November 2008. According to the SourceForge download page, the third | |
566 BWT-based short read aligner, bowtie, was first released in August | |
567 2008. At the time of writing this manual, at least three more BWT-based | |
568 short-read aligners are being implemented. | |
569 | |
570 The BWA-SW algorithm is a new component of BWA. It was conceived in | |
571 November 2008 and implemented ten months later. |