comparison pyPRADA_1.2/tools/bwa-0.5.7-mh/bwa.1 @ 0:acc2ca1a3ba4

Uploaded
author siyuan
date Thu, 20 Feb 2014 00:44:58 -0500
parents
children
comparison
equal deleted inserted replaced
-1:000000000000 0:acc2ca1a3ba4
1 .TH bwa 1 "10 Feburuary 2010" "bwa-0.5.6" "Bioinformatics tools"
2 .SH NAME
3 .PP
4 bwa - Burrows-Wheeler Alignment Tool
5 .SH SYNOPSIS
6 .PP
7 bwa index -a bwtsw database.fasta
8 .PP
9 bwa aln database.fasta short_read.fastq > aln_sa.sai
10 .PP
11 bwa samse database.fasta aln_sa.sai short_read.fastq > aln.sam
12 .PP
13 bwa sampe database.fasta aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln.sam
14 .PP
15 bwa bwasw database.fasta long_read.fastq > aln.sam
16
17 .SH DESCRIPTION
18 .PP
19 BWA is a fast light-weighted tool that aligns relatively short sequences
20 (queries) to a sequence database (targe), such as the human reference
21 genome. It implements two different algorithms, both based on
22 Burrows-Wheeler Transform (BWT). The first algorithm is designed for
23 short queries up to ~200bp with low error rate (<3%). It does gapped
24 global alignment w.r.t. queries, supports paired-end reads, and is one
25 of the fastest short read alignment algorithms to date while also
26 visiting suboptimal hits. The second algorithm, BWA-SW, is designed for
27 long reads with more errors. It performs heuristic Smith-Waterman-like
28 alignment to find high-scoring local hits (and thus chimera). On
29 low-error short queries, BWA-SW is slower and less accurate than the
30 first algorithm, but on long queries, it is better.
31 .PP
32 For both algorithms, the database file in the FASTA format must be
33 first indexed with the
34 .B `index'
35 command, which typically takes a few hours. The first algorithm is
36 implemented via the
37 .B `aln'
38 command, which finds the suffix array (SA) coordinates of good hits of
39 each individual read, and the
40 .B `samse/sampe'
41 command, which converts SA coordinates to chromosomal coordinate and
42 pairs reads (for `sampe'). The second algorithm is invoked by the
43 .B `dbtwsw'
44 command. It works for single-end reads only.
45
46 .SH COMMANDS AND OPTIONS
47 .TP
48 .B index
49 bwa index [-p prefix] [-a algoType] [-c] <in.db.fasta>
50
51 Index database sequences in the FASTA format.
52
53 .B OPTIONS:
54 .RS
55 .TP 10
56 .B -c
57 Build color-space index. The input fast should be in nucleotide space.
58 .TP
59 .B -p STR
60 Prefix of the output database [same as db filename]
61 .TP
62 .B -a STR
63 Algorithm for constructing BWT index. Available options are:
64 .RS
65 .TP
66 .B is
67 IS linear-time algorithm for constructing suffix array. It requires
68 5.37N memory where N is the size of the database. IS is moderately fast,
69 but does not work with database larger than 2GB. IS is the default
70 algorithm due to its simplicity. The current codes for IS algorithm are
71 reimplemented by Yuta Mori.
72 .TP
73 .B bwtsw
74 Algorithm implemented in BWT-SW. This method works with the whole human
75 genome, but it does not work with database smaller than 10MB and it is
76 usually slower than IS.
77 .RE
78 .RE
79
80 .TP
81 .B aln
82 bwa aln [-n maxDiff] [-o maxGapO] [-e maxGapE] [-d nDelTail] [-i
83 nIndelEnd] [-k maxSeedDiff] [-l seedLen] [-t nThrds] [-cRN] [-M misMsc]
84 [-O gapOsc] [-E gapEsc] [-q trimQual] <in.db.fasta> <in.query.fq> >
85 <out.sai>
86
87 Find the SA coordinates of the input reads. Maximum
88 .I maxSeedDiff
89 differences are allowed in the first
90 .I seedLen
91 subsequence and maximum
92 .I maxDiff
93 differences are allowed in the whole sequence.
94
95 .B OPTIONS:
96 .RS
97 .TP 10
98 .B -n NUM
99 Maximum edit distance if the value is INT, or the fraction of missing
100 alignments given 2% uniform base error rate if FLOAT. In the latter
101 case, the maximum edit distance is automatically chosen for different
102 read lengths. [0.04]
103 .TP
104 .B -o INT
105 Maximum number of gap opens [1]
106 .TP
107 .B -e INT
108 Maximum number of gap extensions, -1 for k-difference mode (disallowing
109 long gaps) [-1]
110 .TP
111 .B -d INT
112 Disallow a long deletion within INT bp towards the 3'-end [16]
113 .TP
114 .B -i INT
115 Disallow an indel within INT bp towards the ends [5]
116 .TP
117 .B -l INT
118 Take the first INT subsequence as seed. If INT is larger than the query
119 sequence, seeding will be disabled. For long reads, this option is
120 typically ranged from 25 to 35 for `-k 2'. [inf]
121 .TP
122 .B -k INT
123 Maximum edit distance in the seed [2]
124 .TP
125 .B -t INT
126 Number of threads (multi-threading mode) [1]
127 .TP
128 .B -M INT
129 Mismatch penalty. BWA will not search for suboptimal hits with a score
130 lower than (bestScore-misMsc). [3]
131 .TP
132 .B -O INT
133 Gap open penalty [11]
134 .TP
135 .B -E INT
136 Gap extension penalty [4]
137 .TP
138 .B -R INT
139 Proceed with suboptimal alignments if there are no more than INT equally
140 best hits. This option only affects paired-end mapping. Increasing this
141 threshold helps to improve the pairing accuracy at the cost of speed,
142 especially for short reads (~32bp).
143 .TP
144 .B -c
145 Reverse query but not complement it, which is required for alignment in
146 the color space.
147 .TP
148 .B -N
149 Disable iterative search. All hits with no more than
150 .I maxDiff
151 differences will be found. This mode is much slower than the default.
152 .TP
153 .B -q INT
154 Parameter for read trimming. BWA trims a read down to
155 argmax_x{\\sum_{i=x+1}^l(INT-q_i)} if q_l<INT where l is the original
156 read length. [0]
157 .RE
158
159 .TP
160 .B samse
161 bwa samse [-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam>
162
163 Generate alignments in the SAM format given single-end reads. Repetitive
164 hits will be randomly chosen.
165
166 .B OPTIONS:
167 .RS
168 .TP 10
169 .B -n INT
170 Maximum number of alignments to output in the XA tag for reads paired
171 properly. If a read has more than INT hits, the XA tag will not be
172 written. [3]
173 .RE
174
175 .TP
176 .B sampe
177 bwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N maxHitDis]
178 [-P] <in.db.fasta> <in1.sai> <in2.sai> <in1.fq> <in2.fq> > <out.sam>
179
180 Generate alignments in the SAM format given paired-end reads. Repetitive
181 read pairs will be placed randomly.
182
183 .B OPTIONS:
184 .RS
185 .TP 8
186 .B -a INT
187 Maximum insert size for a read pair to be considered being mapped
188 properly. Since 0.4.5, this option is only used when there are not
189 enough good alignment to infer the distribution of insert sizes. [500]
190 .TP
191 .B -o INT
192 Maximum occurrences of a read for pairing. A read with more occurrneces
193 will be treated as a single-end read. Reducing this parameter helps
194 faster pairing. [100000]
195 .TP
196 .B -P
197 Load the entire FM-index into memory to reduce disk operations
198 (base-space reads only). With this option, at least 1.25N bytes of
199 memory are required, where N is the length of the genome.
200 .TP
201 .B -n INT
202 Maximum number of alignments to output in the XA tag for reads paired
203 properly. If a read has more than INT hits, the XA tag will not be
204 written. [3]
205 .TP
206 .B -N INT
207 Maximum number of alignments to output in the XA tag for disconcordant
208 read pairs (excluding singletons). If a read has more than INT hits, the
209 XA tag will not be written. [10]
210 .RE
211
212 .TP
213 .B bwasw
214 bwa bwasw [-a matchScore] [-b mmPen] [-q gapOpenPen] [-r gapExtPen] [-t
215 nThreads] [-w bandWidth] [-T thres] [-s hspIntv] [-z zBest] [-N
216 nHspRev] [-c thresCoef] <in.db.fasta> <in.fq>
217
218 Align query sequences in the <in.fq> file.
219
220 .B OPTIONS:
221 .RS
222 .TP 10
223 .B -a INT
224 Score of a match [1]
225 .TP
226 .B -b INT
227 Mismatch penalty [3]
228 .TP
229 .B -q INT
230 Gap open penalty [5]
231 .TP
232 .B -r INT
233 Gap extension penalty. The penalty for a contiguous gap of size k is
234 q+k*r. [2]
235 .TP
236 .B -t INT
237 Number of threads in the multi-threading mode [1]
238 .TP
239 .B -w INT
240 Band width in the banded alignment [33]
241 .TP
242 .B -T INT
243 Minimum score threshold divided by a [37]
244 .TP
245 .B -c FLOAT
246 Coefficient for threshold adjustment according to query length. Given an
247 l-long query, the threshold for a hit to be retained is
248 a*max{T,c*log(l)}. [5.5]
249 .TP
250 .B -z INT
251 Z-best heuristics. Higher -z increases accuracy at the cost of speed. [1]
252 .TP
253 .B -s INT
254 Maximum SA interval size for initiating a seed. Higher -s increases
255 accuracy at the cost of speed. [3]
256 .TP
257 .B -N INT
258 Minimum number of seeds supporting the resultant alignment to skip
259 reverse alignment. [5]
260 .RE
261
262 .SH SAM ALIGNMENT FORMAT
263 .PP
264 The output of the
265 .B `aln'
266 command is binary and designed for BWA use only. BWA outputs the final
267 alignment in the SAM (Sequence Alignment/Map) format. Each line consists
268 of:
269
270 .TS
271 center box;
272 cb | cb | cb
273 n | l | l .
274 Col Field Description
275 _
276 1 QNAME Query (pair) NAME
277 2 FLAG bitwise FLAG
278 3 RNAME Reference sequence NAME
279 4 POS 1-based leftmost POSition/coordinate of clipped sequence
280 5 MAPQ MAPping Quality (Phred-scaled)
281 6 CIAGR extended CIGAR string
282 7 MRNM Mate Reference sequence NaMe (`=' if same as RNAME)
283 8 MPOS 1-based Mate POSistion
284 9 ISIZE Inferred insert SIZE
285 10 SEQ query SEQuence on the same strand as the reference
286 11 QUAL query QUALity (ASCII-33 gives the Phred base quality)
287 12 OPT variable OPTional fields in the format TAG:VTYPE:VALUE
288 .TE
289
290 .PP
291 Each bit in the FLAG field is defined as:
292
293 .TS
294 center box;
295 cb | cb | cb
296 c | l | l .
297 Chr Flag Description
298 _
299 p 0x0001 the read is paired in sequencing
300 P 0x0002 the read is mapped in a proper pair
301 u 0x0004 the query sequence itself is unmapped
302 U 0x0008 the mate is unmapped
303 r 0x0010 strand of the query (1 for reverse)
304 R 0x0020 strand of the mate
305 1 0x0040 the read is the first read in a pair
306 2 0x0080 the read is the second read in a pair
307 s 0x0100 the alignment is not primary
308 f 0x0200 QC failure
309 d 0x0400 optical or PCR duplicate
310 .TE
311
312 .PP
313 The Please check <http://samtools.sourceforge.net> for the format
314 specification and the tools for post-processing the alignment.
315
316 BWA generates the following optional fields. Tags starting with `X' are
317 specific to BWA.
318
319 .TS
320 center box;
321 cb | cb
322 cB | l .
323 Tag Meaning
324 _
325 NM Edit distance
326 MD Mismatching positions/bases
327 AS Alignment score
328 _
329 X0 Number of best hits
330 X1 Number of suboptimal hits found by BWA
331 XN Number of ambiguous bases in the referenece
332 XM Number of mismatches in the alignment
333 XO Number of gap opens
334 XG Number of gap extentions
335 XT Type: Unique/Repeat/N/Mate-sw
336 XA Alternative hits; format: (chr,pos,CIGAR,NM;)*
337 _
338 XS Suboptimal alignment score
339 XF Support from forward/reverse alignment
340 XE Number of supporting seeds
341 .TE
342
343 .PP
344 Note that XO and XG are generated by BWT search while the CIGAR string
345 by Smith-Waterman alignment. These two tags may be inconsistent with the
346 CIGAR string. This is not a bug.
347
348 .SH NOTES ON SHORT-READ ALIGNMENT
349 .SS Alignment Accuracy
350 .PP
351 When seeding is disabled, BWA guarantees to find an alignment
352 containing maximum
353 .I maxDiff
354 differences including
355 .I maxGapO
356 gap opens which do not occur within
357 .I nIndelEnd
358 bp towards either end of the query. Longer gaps may be found if
359 .I maxGapE
360 is positive, but it is not guaranteed to find all hits. When seeding is
361 enabled, BWA further requires that the first
362 .I seedLen
363 subsequence contains no more than
364 .I maxSeedDiff
365 differences.
366 .PP
367 When gapped alignment is disabled, BWA is expected to generate the same
368 alignment as Eland, the Illumina alignment program. However, as BWA
369 change `N' in the database sequence to random nucleotides, hits to these
370 random sequences will also be counted. As a consequence, BWA may mark a
371 unique hit as a repeat, if the random sequences happen to be identical
372 to the sequences which should be unqiue in the database. This random
373 behaviour will be avoided in future releases.
374 .PP
375 By default, if the best hit is no so repetitive (controlled by -R), BWA
376 also finds all hits contains one more mismatch; otherwise, BWA finds all
377 equally best hits only. Base quality is NOT considered in evaluating
378 hits. In paired-end alignment, BWA pairs all hits it found. It further
379 performs Smith-Waterman alignment for unmapped reads with mates mapped
380 to rescue mapped mates, and for high-quality anomalous pairs to fix
381 potential alignment errors.
382
383 .SS Estimating Insert Size Distribution
384 .PP
385 BWA estimates the insert size distribution per 256*1024 read pairs. It
386 first collects pairs of reads with both ends mapped with a single-end
387 quality 20 or higher and then calculates median (Q2), lower and higher
388 quartile (Q1 and Q3). It estimates the mean and the variance of the
389 insert size distribution from pairs whose insert sizes are within
390 interval [Q1-2(Q3-Q1), Q3+2(Q3-Q1)]. The maximum distance x for a pair
391 considered to be properly paired (SAM flag 0x2) is calculated by solving
392 equation Phi((x-mu)/sigma)=x/L*p0, where mu is the mean, sigma is the
393 standard error of the insert size distribution, L is the length of the
394 genome, p0 is prior of anomalous pair and Phi() is the standard
395 cumulative distribution function. For mapping Illumina short-insert
396 reads to the human genome, x is about 6-7 sigma away from the
397 mean. Quartiles, mean, variance and x will be printed to the standard
398 error output.
399
400 .SS Memory Requirement
401 .PP
402 With bwtsw algorithm, 2.5GB memory is required for indexing the complete
403 human genome sequences. For short reads, the
404 .B `aln'
405 command uses ~2.3GB memory and the
406 .B `sampe'
407 command uses ~3.5GB.
408
409 .SS Speed
410 .PP
411 Indexing the human genome sequences takes 3 hours with bwtsw
412 algorithm. Indexing smaller genomes with IS or divsufsort algorithms is
413 several times faster, but requires more memory.
414 .PP
415 Speed of alignment is largely determined by the error rate of the query
416 sequences (r). Firstly, BWA runs much faster for near perfect hits than
417 for hits with many differences, and it stops searching for a hit with
418 l+2 differences if a l-difference hit is found. This means BWA will be
419 very slow if r is high because in this case BWA has to visit hits with
420 many differences and looking for these hits is expensive. Secondly, the
421 alignment algorithm behind makes the speed sensitive to [k log(N)/m],
422 where k is the maximum allowed differences, N the size of database and m
423 the length of a query. In practice, we choose k w.r.t. r and therefore r
424 is the leading factor. I would not recommend to use BWA on data with
425 r>0.02.
426 .PP
427 Pairing is slower for shorter reads. This is mainly because shorter
428 reads have more spurious hits and converting SA coordinates to
429 chromosomal coordinates are very costly.
430 .PP
431 In a practical experiment, BWA is able to map 2 million 32bp reads to a
432 bacterial genome in several minutes, map the same amount of reads to
433 human X chromosome in 8-15 minutes and to the human genome in 15-25
434 minutes. This result implies that the speed of BWA is insensitive to the
435 size of database and therefore BWA is more efficient when the database
436 is sufficiently large. On smaller genomes, hash based algorithms are
437 usually much faster.
438
439 .SH NOTES ON LONG-READ ALIGNMENT
440 .PP
441 Command
442 .B `bwasw'
443 is designed for long-read alignment. The algorithm behind, BWA-SW, is
444 similar to BWT-SW, but does not guarantee to find all local hits due to
445 the heuristic acceleration. It tends to be faster and more accurate if
446 the resultant alignment is supported by more seeds, and therefore
447 BWA-SW usually performs better on long queries than on short ones.
448
449 On 350-1000bp reads, BWA-SW is several to tens of times faster than the
450 existing programs. Its accuracy is comparable to SSAHA2, more accurate
451 than BLAT. Like BLAT, BWA-SW also finds chimera which may pose a
452 challenge to SSAHA2. On 10-100kbp queries where chimera detection is
453 important, BWA-SW is over 10X faster than BLAT while being more
454 sensitive.
455
456 BWA-SW can also be used to align ~100bp reads, but it is slower than
457 the short-read algorithm. Its sensitivity and accuracy is lower than
458 SSAHA2 especially when the sequencing error rate is above 2%. This is
459 the trade-off of the 30X speed up in comparison to SSAHA2's -454 mode.
460
461 .SH SEE ALSO
462 BWA website <http://bio-bwa.sourceforge.net>, Samtools website
463 <http://samtools.sourceforge.net>
464
465 .SH AUTHOR
466 Heng Li at the Sanger Institute wrote the key source codes and
467 integrated the following codes for BWT construction: bwtsw
468 <http://i.cs.hku.hk/~ckwong3/bwtsw/>, implemented by Chi-Kwong Wong at
469 the University of Hong Kong and IS
470 <http://yuta.256.googlepages.com/sais> originally proposed by Nong Ge
471 <http://www.cs.sysu.edu.cn/nong/> at the Sun Yat-Sen University and
472 implemented by Yuta Mori.
473
474 .SH LICENSE AND CITATION
475 .PP
476 The full BWA package is distributed under GPLv3 as it uses source codes
477 from BWT-SW which is covered by GPL. Sorting, hash table, BWT and IS
478 libraries are distributed under the MIT license.
479 .PP
480 If you use the short-read alignment component, please cite the following
481 paper:
482 .PP
483 Li H. and Durbin R. (2009) Fast and accurate short read alignment with
484 Burrows-Wheeler transform. Bioinformatics, 25, 1754-60. [PMID: 19451168]
485 .PP
486 If you use the long-read component (BWA-SW), please cite:
487 .PP
488 Li H. and Durbin R. (2010) Fast and accurate long-read alignment with
489 Burrows-Wheeler transform. Bioinformatics. [PMID: 20080505]
490
491 .SH HISTORY
492 BWA is largely influenced by BWT-SW. It uses source codes from BWT-SW
493 and mimics its binary file formats; BWA-SW resembles BWT-SW in several
494 ways. The initial idea about BWT-based alignment also came from the
495 group who developed BWT-SW. At the same time, BWA is different enough
496 from BWT-SW. The short-read alignment algorithm bears no similarity to
497 Smith-Waterman algorithm any more. While BWA-SW learns from BWT-SW, it
498 introduces heuristics that can hardly be applied to the original
499 algorithm. In all, BWA does not guarantee to find all local hits as what
500 BWT-SW is designed to do, but it is much faster than BWT-SW on both
501 short and long query sequences.
502
503 I started to write the first piece of codes on 24 May 2008 and got the
504 initial stable version on 02 June 2008. During this period, I was
505 acquainted that Professor Tak-Wah Lam, the first author of BWT-SW paper,
506 was collaborating with Beijing Genomics Institute on SOAP2, the successor
507 to SOAP (Short Oligonucleotide Analysis Package). SOAP2 has come out in
508 November 2008. According to the SourceForge download page, the third
509 BWT-based short read aligner, bowtie, was first released in August
510 2008. At the time of writing this manual, at least three more BWT-based
511 short-read aligners are being implemented.
512
513 The BWA-SW algorithm is a new component of BWA. It was conceived in
514 November 2008 and implemented ten months later.