qiime_1_8_0: bwa-0.6.2/bwa.1 comparison

comparison bwa-0.6.2/bwa.1 @ 2:a294fbfcb1db draft default tip

Uploaded BWA

author	ashvark
date	Fri, 18 Jul 2014 07:55:59 -0400
parents	dd1186b11b3b
children

comparison

equal deleted inserted replaced

-:a9636dc1e99a
+:a294fbfcb1db
+.TH bwa 1 "19 June 2012" "bwa-0.6.2" "Bioinformatics tools"
+.SH NAME
+.PP
+bwa - Burrows-Wheeler Alignment Tool
+.SH SYNOPSIS
+.PP
+bwa index -a bwtsw database.fasta
+.PP
+bwa aln database.fasta short_read.fastq > aln_sa.sai
+.PP
+bwa samse database.fasta aln_sa.sai short_read.fastq > aln.sam
+.PP
+bwa sampe database.fasta aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln.sam
+.PP
+bwa bwasw database.fasta long_read.fastq > aln.sam
+.SH DESCRIPTION
+.PP
+BWA is a fast light-weighted tool that aligns relatively short sequences
+(queries) to a sequence database (targe), such as the human reference
+genome. It implements two different algorithms, both based on
+Burrows-Wheeler Transform (BWT). The first algorithm is designed for
+short queries up to ~150bp with low error rate (<3%). It does gapped
+global alignment w.r.t. queries, supports paired-end reads, and is one
+of the fastest short read alignment algorithms to date while also
+visiting suboptimal hits. The second algorithm, BWA-SW, is designed for
+reads longer than 100bp with more errors. It performs a heuristic Smith-Waterman-like
+alignment to find high-scoring local hits and split hits. On
+low-error short queries, BWA-SW is a little slower and less accurate than the
+first algorithm, but on long queries, it is better.
+.PP
+For both algorithms, the database file in the FASTA format must be
+first indexed with the
+.B `index'
+command, which typically takes a few hours for a 3GB genome. The first algorithm is
+implemented via the
+.B `aln'
+command, which finds the suffix array (SA) coordinates of good hits of
+each individual read, and the
+.B `samse/sampe'
+command, which converts SA coordinates to chromosomal coordinate and
+pairs reads (for `sampe'). The second algorithm is invoked by the
+.B `bwasw'
+command. It works for single-end reads only.
+.SH COMMANDS AND OPTIONS
+.TP
+.B index
+bwa index [-p prefix] [-a algoType] <in.db.fasta>
+Index database sequences in the FASTA format.
+.B OPTIONS:
+.RS
+.TP 10
+.B -c
+Build color-space index. The input fast should be in nucleotide space. (Disabled since 0.6.x)
+.TP
+.BI -p \ STR
+Prefix of the output database [same as db filename]
+.TP
+.BI -a \ STR
+Algorithm for constructing BWT index. Available options are:
+.RS
+.TP
+.B is
+IS linear-time algorithm for constructing suffix array. It requires
+5.37N memory where N is the size of the database. IS is moderately fast,
+but does not work with database larger than 2GB. IS is the default
+algorithm due to its simplicity. The current codes for IS algorithm are
+reimplemented by Yuta Mori.
+.TP
+.B bwtsw
+Algorithm implemented in BWT-SW. This method works with the whole human
+genome.
+.RE
+.RE
+.TP
+.B aln
+bwa aln [-n maxDiff] [-o maxGapO] [-e maxGapE] [-d nDelTail] [-i
+nIndelEnd] [-k maxSeedDiff] [-l seedLen] [-t nThrds] [-cRN] [-M misMsc]
+[-O gapOsc] [-E gapEsc] [-q trimQual] <in.db.fasta> <in.query.fq> >
+<out.sai>
+Find the SA coordinates of the input reads. Maximum
+.I maxSeedDiff
+differences are allowed in the first
+.I seedLen
+subsequence and maximum
+.I maxDiff
+differences are allowed in the whole sequence.
+.B OPTIONS:
+.RS
+.TP 10
+.BI -n \ NUM
+Maximum edit distance if the value is INT, or the fraction of missing
+alignments given 2% uniform base error rate if FLOAT. In the latter
+case, the maximum edit distance is automatically chosen for different
+read lengths. [0.04]
+.TP
+.BI -o \ INT
+Maximum number of gap opens [1]
+.TP
+.BI -e \ INT
+Maximum number of gap extensions, -1 for k-difference mode (disallowing
+long gaps) [-1]
+.TP
+.BI -d \ INT
+Disallow a long deletion within INT bp towards the 3'-end [16]
+.TP
+.BI -i \ INT
+Disallow an indel within INT bp towards the ends [5]
+.TP
+.BI -l \ INT
+Take the first INT subsequence as seed. If INT is larger than the query
+sequence, seeding will be disabled. For long reads, this option is
+typically ranged from 25 to 35 for `-k 2'. [inf]
+.TP
+.BI -k \ INT
+Maximum edit distance in the seed [2]
+.TP
+.BI -t \ INT
+Number of threads (multi-threading mode) [1]
+.TP
+.BI -M \ INT
+Mismatch penalty. BWA will not search for suboptimal hits with a score
+lower than (bestScore-misMsc). [3]
+.TP
+.BI -O \ INT
+Gap open penalty [11]
+.TP
+.BI -E \ INT
+Gap extension penalty [4]
+.TP
+.BI -R \ INT
+Proceed with suboptimal alignments if there are no more than INT equally
+best hits. This option only affects paired-end mapping. Increasing this
+threshold helps to improve the pairing accuracy at the cost of speed,
+especially for short reads (~32bp).
+.TP
+.B -c
+Reverse query but not complement it, which is required for alignment in
+the color space. (Disabled since 0.6.x)
+.TP
+.B -N
+Disable iterative search. All hits with no more than
+.I maxDiff
+differences will be found. This mode is much slower than the default.
+.TP
+.BI -q \ INT
+Parameter for read trimming. BWA trims a read down to
+argmax_x{\\sum_{i=x+1}^l(INT-q_i)} if q_l<INT where l is the original
+read length. [0]
+.TP
+.B -I
+The input is in the Illumina 1.3+ read format (quality equals ASCII-64).
+.TP
+.BI -B \ INT
+Length of barcode starting from the 5'-end. When
+.I INT
+is positive, the barcode of each read will be trimmed before mapping and will
+be written at the
+.B BC
+SAM tag. For paired-end reads, the barcode from both ends are concatenated. [0]
+.TP
+.B -b
+Specify the input read sequence file is the BAM format. For paired-end
+data, two ends in a pair must be grouped together and options
+.B -1
+or
+.B -2
+are usually applied to specify which end should be mapped. Typical
+command lines for mapping pair-end data in the BAM format are:
+bwa aln ref.fa -b1 reads.bam > 1.sai
+bwa aln ref.fa -b2 reads.bam > 2.sai
+bwa sampe ref.fa 1.sai 2.sai reads.bam reads.bam > aln.sam
+.TP
+.B -0
+When
+.B -b
+is specified, only use single-end reads in mapping.
+.TP
+.B -1
+When
+.B -b
+is specified, only use the first read in a read pair in mapping (skip
+single-end reads and the second reads).
+.TP
+.B -2
+When
+.B -b
+is specified, only use the second read in a read pair in mapping.
+.B
+.RE
+.TP
+.B samse
+bwa samse [-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam>
+Generate alignments in the SAM format given single-end reads. Repetitive
+hits will be randomly chosen.
+.B OPTIONS:
+.RS
+.TP 10
+.BI -n \ INT
+Maximum number of alignments to output in the XA tag for reads paired
+properly. If a read has more than INT hits, the XA tag will not be
+written. [3]
+.TP
+.BI -r \ STR
+Specify the read group in a format like `@RG\\tID:foo\\tSM:bar'. [null]
+.RE
+.TP
+.B sampe
+bwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N maxHitDis]
+[-P] <in.db.fasta> <in1.sai> <in2.sai> <in1.fq> <in2.fq> > <out.sam>
+Generate alignments in the SAM format given paired-end reads. Repetitive
+read pairs will be placed randomly.
+.B OPTIONS:
+.RS
+.TP 8
+.BI -a \ INT
+Maximum insert size for a read pair to be considered being mapped
+properly. Since 0.4.5, this option is only used when there are not
+enough good alignment to infer the distribution of insert sizes. [500]
+.TP
+.BI -o \ INT
+Maximum occurrences of a read for pairing. A read with more occurrneces
+will be treated as a single-end read. Reducing this parameter helps
+faster pairing. [100000]
+.TP
+.B -P
+Load the entire FM-index into memory to reduce disk operations
+(base-space reads only). With this option, at least 1.25N bytes of
+memory are required, where N is the length of the genome.
+.TP
+.BI -n \ INT
+Maximum number of alignments to output in the XA tag for reads paired
+properly. If a read has more than INT hits, the XA tag will not be
+written. [3]
+.TP
+.BI -N \ INT
+Maximum number of alignments to output in the XA tag for disconcordant
+read pairs (excluding singletons). If a read has more than INT hits, the
+XA tag will not be written. [10]
+.TP
+.BI -r \ STR
+Specify the read group in a format like `@RG\\tID:foo\\tSM:bar'. [null]
+.RE
+.TP
+.B bwasw
+bwa bwasw [-a matchScore] [-b mmPen] [-q gapOpenPen] [-r gapExtPen] [-t
+nThreads] [-w bandWidth] [-T thres] [-s hspIntv] [-z zBest] [-N
+nHspRev] [-c thresCoef] <in.db.fasta> <in.fq> [mate.fq]
+Align query sequences in the
+.I in.fq
+file. When
+.I mate.fq
+is present, perform paired-end alignment. The paired-end mode only works
+for reads Illumina short-insert libraries. In the paired-end mode, BWA-SW
+may still output split alignments but they are all marked as not properly
+paired; the mate positions will not be written if the mate has multiple
+local hits.
+.B OPTIONS:
+.RS
+.TP 10
+.BI -a \ INT
+Score of a match [1]
+.TP
+.BI -b \ INT
+Mismatch penalty [3]
+.TP
+.BI -q \ INT
+Gap open penalty [5]
+.TP
+.BI -r \ INT
+Gap extension penalty. The penalty for a contiguous gap of size k is
+q+k*r. [2]
+.TP
+.BI -t \ INT
+Number of threads in the multi-threading mode [1]
+.TP
+.BI -w \ INT
+Band width in the banded alignment [33]
+.TP
+.BI -T \ INT
+Minimum score threshold divided by a [37]
+.TP
+.BI -c \ FLOAT
+Coefficient for threshold adjustment according to query length. Given an
+l-long query, the threshold for a hit to be retained is
+a*max{T,c*log(l)}. [5.5]
+.TP
+.BI -z \ INT
+Z-best heuristics. Higher -z increases accuracy at the cost of speed. [1]
+.TP
+.BI -s \ INT
+Maximum SA interval size for initiating a seed. Higher -s increases
+accuracy at the cost of speed. [3]
+.TP
+.BI -N \ INT
+Minimum number of seeds supporting the resultant alignment to skip
+reverse alignment. [5]
+.RE
+.SH SAM ALIGNMENT FORMAT
+.PP
+The output of the
+.B `aln'
+command is binary and designed for BWA use only. BWA outputs the final
+alignment in the SAM (Sequence Alignment/Map) format. Each line consists
+of:
+.TS
+center box;
+cb | cb | cb
+n | l | l .
+Col	Field	Description
+_
+1	QNAME	Query (pair) NAME
+2	FLAG	bitwise FLAG
+3	RNAME	Reference sequence NAME
+4	POS	1-based leftmost POSition/coordinate of clipped sequence
+5	MAPQ	MAPping Quality (Phred-scaled)
+6	CIAGR	extended CIGAR string
+7	MRNM	Mate Reference sequence NaMe (`=' if same as RNAME)
+8	MPOS	1-based Mate POSistion
+9	ISIZE	Inferred insert SIZE
+10	SEQ	query SEQuence on the same strand as the reference
+11	QUAL	query QUALity (ASCII-33 gives the Phred base quality)
+12	OPT	variable OPTional fields in the format TAG:VTYPE:VALUE
+.TE
+.PP
+Each bit in the FLAG field is defined as:
+.TS
+center box;
+cb | cb | cb
+c | l | l .
+Chr	Flag	Description
+_
+p	0x0001	the read is paired in sequencing
+P	0x0002	the read is mapped in a proper pair
+u	0x0004	the query sequence itself is unmapped
+U	0x0008	the mate is unmapped
+r	0x0010	strand of the query (1 for reverse)
+R	0x0020	strand of the mate
+1	0x0040	the read is the first read in a pair
+2	0x0080	the read is the second read in a pair
+s	0x0100	the alignment is not primary
+f	0x0200	QC failure
+d	0x0400	optical or PCR duplicate
+.TE
+.PP
+The Please check <http://samtools.sourceforge.net> for the format
+specification and the tools for post-processing the alignment.
+BWA generates the following optional fields. Tags starting with `X' are
+specific to BWA.
+.TS
+center box;
+cb | cb
+cB | l .
+Tag	Meaning
+_
+NM	Edit distance
+MD	Mismatching positions/bases
+AS	Alignment score
+BC	Barcode sequence
+_
+X0	Number of best hits
+X1	Number of suboptimal hits found by BWA
+XN	Number of ambiguous bases in the referenece
+XM	Number of mismatches in the alignment
+XO	Number of gap opens
+XG	Number of gap extentions
+XT	Type: Unique/Repeat/N/Mate-sw
+XA	Alternative hits; format: (chr,pos,CIGAR,NM;)*
+_
+XS	Suboptimal alignment score
+XF	Support from forward/reverse alignment
+XE	Number of supporting seeds
+.TE
+.PP
+Note that XO and XG are generated by BWT search while the CIGAR string
+by Smith-Waterman alignment. These two tags may be inconsistent with the
+CIGAR string. This is not a bug.
+.SH NOTES ON SHORT-READ ALIGNMENT
+.SS Alignment Accuracy
+.PP
+When seeding is disabled, BWA guarantees to find an alignment
+containing maximum
+.I maxDiff
+differences including
+.I maxGapO
+gap opens which do not occur within
+.I nIndelEnd
+bp towards either end of the query. Longer gaps may be found if
+.I maxGapE
+is positive, but it is not guaranteed to find all hits. When seeding is
+enabled, BWA further requires that the first
+.I seedLen
+subsequence contains no more than
+.I maxSeedDiff
+differences.
+.PP
+When gapped alignment is disabled, BWA is expected to generate the same
+alignment as Eland version 1, the Illumina alignment program. However, as BWA
+change `N' in the database sequence to random nucleotides, hits to these
+random sequences will also be counted. As a consequence, BWA may mark a
+unique hit as a repeat, if the random sequences happen to be identical
+to the sequences which should be unqiue in the database.
+.PP
+By default, if the best hit is not highly repetitive (controlled by -R), BWA
+also finds all hits contains one more mismatch; otherwise, BWA finds all
+equally best hits only. Base quality is NOT considered in evaluating
+hits. In the paired-end mode, BWA pairs all hits it found. It further
+performs Smith-Waterman alignment for unmapped reads to rescue reads with a
+high erro rate, and for high-quality anomalous pairs to fix potential alignment
+errors.
+.SS Estimating Insert Size Distribution
+.PP
+BWA estimates the insert size distribution per 256*1024 read pairs. It
+first collects pairs of reads with both ends mapped with a single-end
+quality 20 or higher and then calculates median (Q2), lower and higher
+quartile (Q1 and Q3). It estimates the mean and the variance of the
+insert size distribution from pairs whose insert sizes are within
+interval [Q1-2(Q3-Q1), Q3+2(Q3-Q1)]. The maximum distance x for a pair
+considered to be properly paired (SAM flag 0x2) is calculated by solving
+equation Phi((x-mu)/sigma)=x/L*p0, where mu is the mean, sigma is the
+standard error of the insert size distribution, L is the length of the
+genome, p0 is prior of anomalous pair and Phi() is the standard
+cumulative distribution function. For mapping Illumina short-insert
+reads to the human genome, x is about 6-7 sigma away from the
+mean. Quartiles, mean, variance and x will be printed to the standard
+error output.
+.SS Memory Requirement
+.PP
+With bwtsw algorithm, 5GB memory is required for indexing the complete
+human genome sequences. For short reads, the
+.B aln
+command uses ~3.2GB memory and the
+.B sampe
+command uses ~5.4GB.
+.SS Speed
+.PP
+Indexing the human genome sequences takes 3 hours with bwtsw
+algorithm. Indexing smaller genomes with IS algorithms is
+faster, but requires more memory.
+.PP
+The speed of alignment is largely determined by the error rate of the query
+sequences (r). Firstly, BWA runs much faster for near perfect hits than
+for hits with many differences, and it stops searching for a hit with
+l+2 differences if a l-difference hit is found. This means BWA will be
+very slow if r is high because in this case BWA has to visit hits with
+many differences and looking for these hits is expensive. Secondly, the
+alignment algorithm behind makes the speed sensitive to [k log(N)/m],
+where k is the maximum allowed differences, N the size of database and m
+the length of a query. In practice, we choose k w.r.t. r and therefore r
+is the leading factor. I would not recommend to use BWA on data with
+r>0.02.
+.PP
+Pairing is slower for shorter reads. This is mainly because shorter
+reads have more spurious hits and converting SA coordinates to
+chromosomal coordinates are very costly.
+.SH NOTES ON LONG-READ ALIGNMENT
+.PP
+Command
+.B bwasw
+is designed for long-read alignment. BWA-SW essentially aligns the trie
+of the reference genome against the directed acyclic word graph (DAWG) of a
+read to find seeds not highly repetitive in the genome, and then performs a
+standard Smith-Waterman algorithm to extend the seeds. A key heuristic, called
+the Z-best heuristic, is that at each vertex in the DAWG, BWA-SW only keeps the
+top Z reference suffix intervals that match the vertex. BWA-SW is more accurate
+if the resultant alignment is supported by more seeds, and therefore BWA-SW
+usually performs better on long queries or queries with low divergence to the
+reference genome.
+BWA-SW is perhaps a better choice than BWA-short for 100bp single-end HiSeq reads
+mainly because it gives better gapped alignment. For paired-end reads, it is yet
+to know whether BWA-short or BWA-SW yield overall better results.
+.SH CHANGES IN BWA-0.6
+.PP
+Since version 0.6, BWA has been able to work with a reference genome longer than 4GB.
+This feature makes it possible to integrate the forward and reverse complemented
+genome in one FM-index, which speeds up both BWA-short and BWA-SW. As a tradeoff,
+BWA uses more memory because it has to keep all positions and ranks in 64-bit
+integers, twice larger than 32-bit integers used in the previous versions.
+The latest BWA-SW also works for paired-end reads longer than 100bp. In
+comparison to BWA-short, BWA-SW tends to be more accurate for highly unique
+reads and more robust to relative long INDELs and structural variants.
+Nonetheless, BWA-short usually has higher power to distinguish the optimal hit
+from many suboptimal hits. The choice of the mapping algorithm may depend on
+the application.
+.SH SEE ALSO
+BWA website <http://bio-bwa.sourceforge.net>, Samtools website
+<http://samtools.sourceforge.net>
+.SH AUTHOR
+Heng Li at the Sanger Institute wrote the key source codes and
+integrated the following codes for BWT construction: bwtsw
+<http://i.cs.hku.hk/~ckwong3/bwtsw/>, implemented by Chi-Kwong Wong at
+the University of Hong Kong and IS
+<http://yuta.256.googlepages.com/sais> originally proposed by Nong Ge
+<http://www.cs.sysu.edu.cn/nong/> at the Sun Yat-Sen University and
+implemented by Yuta Mori.
+.SH LICENSE AND CITATION
+.PP
+The full BWA package is distributed under GPLv3 as it uses source codes
+from BWT-SW which is covered by GPL. Sorting, hash table, BWT and IS
+libraries are distributed under the MIT license.
+.PP
+If you use the short-read alignment component, please cite the following
+paper:
+.PP
+Li H. and Durbin R. (2009) Fast and accurate short read alignment with
+Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID: 19451168]
+.PP
+If you use the long-read component (BWA-SW), please cite:
+.PP
+Li H. and Durbin R. (2010) Fast and accurate long-read alignment with
+Burrows-Wheeler transform. Bioinformatics, 26, 589-595. [PMID: 20080505]
+.SH HISTORY
+BWA is largely influenced by BWT-SW. It uses source codes from BWT-SW
+and mimics its binary file formats; BWA-SW resembles BWT-SW in several
+ways. The initial idea about BWT-based alignment also came from the
+group who developed BWT-SW. At the same time, BWA is different enough
+from BWT-SW. The short-read alignment algorithm bears no similarity to
+Smith-Waterman algorithm any more. While BWA-SW learns from BWT-SW, it
+introduces heuristics that can hardly be applied to the original
+algorithm. In all, BWA does not guarantee to find all local hits as what
+BWT-SW is designed to do, but it is much faster than BWT-SW on both
+short and long query sequences.
+I started to write the first piece of codes on 24 May 2008 and got the
+initial stable version on 02 June 2008. During this period, I was
+acquainted that Professor Tak-Wah Lam, the first author of BWT-SW paper,
+was collaborating with Beijing Genomics Institute on SOAP2, the successor
+to SOAP (Short Oligonucleotide Analysis Package). SOAP2 has come out in
+November 2008. According to the SourceForge download page, the third
+BWT-based short read aligner, bowtie, was first released in August
+2008. At the time of writing this manual, at least three more BWT-based
+short-read aligners are being implemented.
+The BWA-SW algorithm is a new component of BWA. It was conceived in
+November 2008 and implemented ten months later.

Mercurial > repos > ashvark > qiime_1_8_0

comparison bwa-0.6.2/bwa.1 @ 2:a294fbfcb1db draft default tip