What it does

BWA is a fast, lightweight tool that aligns relatively short sequences (queries) to a large sequence database, such as the human reference genome. It was developed by Heng Li at the Sanger Institute. Reference: Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-60.

Know what you are doing

There is no such thing (yet) as an automated gearshift in short read mapping. It is all like stick-shift driving in San Francisco. In other words, running this tool with default parameters will probably not give you meaningful results. The way to deal with this is to understand the parameters by carefully reading the documentation and experimenting. Fortunately, Galaxy makes experimenting easy.

Input formats

BWA accepts files in either Sanger FASTQ format (galaxy type fastqsanger) or Illumina FASTQ format (galaxy type fastqillumina). Use the FASTQ Groomer to prepare your files.
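The two accepted formats differ only in how quality scores are encoded as characters. As a minimal sketch, assuming that fastqillumina stores Phred scores with an ASCII offset of 64 and fastqsanger with an offset of 33, the re-encoding looks roughly like this. The function name illumina_to_sanger is hypothetical, and the FASTQ Groomer remains the recommended way to convert real files.

```python
def illumina_to_sanger(qual):
    """Re-encode a Phred+64 quality string as Phred+33 (assumed offsets)."""
    converted = []
    for ch in qual:
        phred = ord(ch) - 64  # assumed Illumina offset
        if phred < 0:
            # Characters below '@' cannot be Phred+64 scores.
            raise ValueError("not a Phred+64 quality character: %r" % ch)
        converted.append(chr(phred + 33))  # Sanger offset
    return "".join(converted)


if __name__ == "__main__":
    print(illumina_to_sanger("hB@"))
```

A quality string that raises ValueError here is a hint that the file is not in the format it was labelled as.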

A Note on Built-in Reference Genomes

The default variant for all genomes is "Full", defined as all primary chromosomes (or scaffolds/contigs) including mitochondrial plus associated unmapped, plasmid, and other segments. When only one version of a genome is available in this tool, it represents the default "Full" variant. Some genomes will have more than one variant available. The "Canonical Male" or sometimes simply "Canonical" variant contains the primary chromosomes for a genome. For example, a human "Canonical" variant contains chr1-chr22, chrX, chrY, and chrM. The "Canonical Female" variant contains the primary chromosomes excluding chrY.

Outputs

The output is in SAM format, and has the following columns:

  Column  Description
--------  --------------------------------------------------------
1  QNAME  Query (pair) NAME
2  FLAG   bitwise FLAG
3  RNAME  Reference sequence NAME
4  POS    1-based leftmost POSition/coordinate of clipped sequence
5  MAPQ   MAPping Quality (Phred-scaled)
6  CIGAR  extended CIGAR string
7  MRNM   Mate Reference sequence NaMe ('=' if same as RNAME)
8  MPOS   1-based Mate POSition
9  ISIZE  Inferred insert SIZE
10 SEQ    query SEQuence on the same strand as the reference
11 QUAL   query QUALity (ASCII-33 gives the Phred base quality)
12 OPT    variable OPTional fields in the format TAG:VTYPE:VALUE

The flags are as follows:

  Flag  Description
------  -------------------------------------
0x0001  the read is paired in sequencing
0x0002  the read is mapped in a proper pair
0x0004  the query sequence itself is unmapped
0x0008  the mate is unmapped
0x0010  strand of the query (1 for reverse)
0x0020  strand of the mate
0x0040  the read is the first read in a pair
0x0080  the read is the second read in a pair
0x0100  the alignment is not primary
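Because FLAG is a bitwise field, a single integer can carry several of these meanings at once. The following sketch decodes a FLAG value using only the bits listed in the table above; the names SAM_FLAGS and decode_flag are hypothetical helpers, not part of BWA or Galaxy.

```python
# (bit, meaning) pairs taken from the flag table above.
SAM_FLAGS = [
    (0x0001, "paired in sequencing"),
    (0x0002, "mapped in a proper pair"),
    (0x0004, "query unmapped"),
    (0x0008, "mate unmapped"),
    (0x0010, "query on reverse strand"),
    (0x0020, "mate on reverse strand"),
    (0x0040, "first read in pair"),
    (0x0080, "second read in pair"),
    (0x0100, "not primary alignment"),
]


def decode_flag(flag):
    """Return the meanings of all bits set in a SAM FLAG value."""
    return [meaning for bit, meaning in SAM_FLAGS if flag & bit]


if __name__ == "__main__":
    for value in (4, 99):
        print(value, decode_flag(value))
```

For instance, 99 is 0x0001 + 0x0002 + 0x0020 + 0x0040, so it describes the first read of a properly paired fragment whose mate is on the reverse strand.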

It looks like this (scroll sideways to see the entire example):

QNAME FLAG    RNAME   POS     MAPQ    CIGAR   MRNM    MPOS    ISIZE   SEQ     QUAL    OPT
HWI-EAS91_1_30788AAXX:1:1:1761:343    4       *       0       0       *       *       0       0       AAAAAAANNAAAAAAAAAAAAAAAAAAAAAAAAAAACNNANNGAGTNGNNNNNNNGCTTCCCACAGNNCTGG        hhhhhhh;;hhhhhhhhhhh^hOhhhhghhhfhhhgh;;h;;hhhh;h;;;;;;;hhhhhhghhhh;;Phhh
HWI-EAS91_1_30788AAXX:1:1:1578:331    4       *       0       0       *       *       0       0       GTATAGANNAATAAGAAAAAAAAAAATGAAGACTTTCNNANNTCTGNANNNNNNNTCTTTTTTCAGNNGTAG        hhhhhhh;;hhhhhhhhhhhhhhhhhhhhhhhhhhhh;;h;;hhhh;h;;;;;;;hhhhhhhhhhh;;hhVh
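Records like the ones above are tab-separated, with the eleven mandatory columns followed by any number of optional fields. Here is a minimal parsing sketch under that assumption; parse_sam_line is a hypothetical helper, and the record in the usage example is made up for illustration.

```python
SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INT_COLUMNS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}


def parse_sam_line(line):
    """Split one SAM alignment line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(SAM_COLUMNS):
        raise ValueError("expected at least 11 tab-separated fields")
    record = {}
    for name, value in zip(SAM_COLUMNS, fields):
        record[name] = int(value) if name in INT_COLUMNS else value
    # Everything after column 11 is an optional TAG:VTYPE:VALUE field.
    record["OPT"] = fields[len(SAM_COLUMNS):]
    return record


if __name__ == "__main__":
    # Hypothetical unmapped read, not taken from real data.
    example = "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\thhhh"
    rec = parse_sam_line(example)
    print(rec["QNAME"], rec["FLAG"], rec["SEQ"])
```

Header lines, which start with "@", are not alignment records and should be skipped before calling the parser.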

BWA settings

All of the options have a default value. You can change any of them. All of the options in BWA have been implemented here.

BWA parameter list

This is an exhaustive list of BWA options:

For aln:

-n NUM  Maximum edit distance if the value is INT, or the fraction of missing
        alignments given 2% uniform base error rate if FLOAT. In the latter
        case, the maximum edit distance is automatically chosen for different
        read lengths. [0.04]
-o INT  Maximum number of gap opens [1]
-e INT  Maximum number of gap extensions, -1 for k-difference mode
        (disallowing long gaps) [-1]
-d INT  Disallow a long deletion within INT bp towards the 3'-end [16]
-i INT  Disallow an indel within INT bp towards the ends [5]
-l INT  Take the first INT subsequence as seed. If INT is larger than the
        query sequence, seeding will be disabled. For long reads, this option
        is typically set between 25 and 35 for '-k 2'. [inf]
-k INT  Maximum edit distance in the seed [2]
-t INT  Number of threads (multi-threading mode) [1]
-M INT  Mismatch penalty. BWA will not search for suboptimal hits with a score
        lower than (bestScore-misMsc). [3]
-O INT  Gap open penalty [11]
-E INT  Gap extension penalty [4]
-c      Reverse query but not complement it, which is required for alignment
        in the color space.
-R      Proceed with suboptimal alignments even if the top hit is a repeat. By
        default, BWA only searches for suboptimal alignments if the top hit is
        unique. Using this option has no effect on accuracy for single-end
        reads. It is mainly designed for improving the alignment accuracy of
        paired-end reads. However, the pairing procedure will be slowed down,
        especially for very short reads (~32bp).
-N      Disable iterative search. All hits with no more than maxDiff
        differences will be found. This mode is much slower than the default.
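
The FLOAT form of -n can be read as a statistical threshold. As a rough sketch, assuming the number of errors in a read follows a Poisson distribution with mean (read length x 0.02), the chosen edit distance would be the smallest k for which fewer than the given fraction of reads have more than k errors. The Poisson model is an assumption made for illustration; this page does not state the exact formula, and the function name max_diff_for_length is hypothetical.

```python
import math


def max_diff_for_length(read_len, err=0.02, threshold=0.04):
    """Smallest k with P(errors > k) < threshold under a Poisson model."""
    lam = read_len * err          # expected number of errors per read
    term = math.exp(-lam)         # P(exactly 0 errors)
    cumulative = term
    for k in range(1, 1000):
        term *= lam / k           # P(exactly k errors)
        cumulative += term
        if 1.0 - cumulative < threshold:
            return k
    raise RuntimeError("no edit distance found below the threshold")


if __name__ == "__main__":
    for length in (36, 50, 76, 100):
        print(length, max_diff_for_length(length))
```

Under these assumptions longer reads are allowed more differences, which is why a single FLOAT value can serve reads of mixed lengths.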

For samse:

-n INT  Maximum number of alignments to output in the XA tag for reads paired
        properly. If a read has more than INT hits, the XA tag will not be
        written. [3]
-r STR  Specify the read group in a format like '@RG\tID:foo\tSM:bar' [null]

For sampe:

-a INT  Maximum insert size for a read pair to be considered as being mapped
        properly. Since version 0.4.5, this option is only used when there
        are not enough good alignments to infer the distribution of insert
        sizes. [500]
-n INT  Maximum number of alignments to output in the XA tag for reads paired
        properly. If a read has more than INT hits, the XA tag will not be
        written. [3]
-N INT  Maximum number of alignments to output in the XA tag for discordant
        read pairs (excluding singletons). If a read has more than INT hits,
        the XA tag will not be written. [10]
-o INT  Maximum occurrences of a read for pairing. A read with more
        occurrences will be treated as a single-end read. Reducing this
        parameter helps faster pairing. [100000]
-r STR  Specify the read group in a format like '@RG\tID:foo\tSM:bar' [null]
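
To show how the options above fit together outside Galaxy, here is a hedged sketch that only builds argument lists for the aln and sampe steps; it does not run BWA. The positional argument order (index prefix, then .sai files, then FASTQ files) follows BWA's usual command-line usage and is not described on this page, so treat it as an assumption. All file names and both function names are hypothetical.

```python
def aln_command(index_prefix, fastq, max_diff=0.04, gap_opens=1, threads=1):
    """Argument list for 'bwa aln' using the -n, -o and -t options."""
    return ["bwa", "aln",
            "-n", str(max_diff),
            "-o", str(gap_opens),
            "-t", str(threads),
            index_prefix, fastq]


def sampe_command(index_prefix, sai1, sai2, fastq1, fastq2, max_insert=500):
    """Argument list for 'bwa sampe' using the -a option."""
    return ["bwa", "sampe",
            "-a", str(max_insert),
            index_prefix, sai1, sai2, fastq1, fastq2]


if __name__ == "__main__":
    # Hypothetical file names; each aln call writes its result to stdout,
    # which would be redirected to the matching .sai file.
    print(" ".join(aln_command("ref.fa", "reads_1.fq", threads=4)))
    print(" ".join(aln_command("ref.fa", "reads_2.fq", threads=4)))
    print(" ".join(sampe_command("ref.fa", "reads_1.sai", "reads_2.sai",
                                 "reads_1.fq", "reads_2.fq")))
```

Building lists rather than one shell string keeps each option and its value together, which makes it easier to vary one parameter at a time when experimenting.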

For specifying the read group in samse or sampe, use the following:

@RG   Read group. Unordered multiple @RG lines are allowed.
ID    Read group identifier. Each @RG line must have a unique ID. The value of
      ID is used in the RG tags of alignment records. Must be unique among all
      read groups in header section. Read group IDs may be modified when
      merging SAM files in order to handle collisions.
CN    Name of sequencing center producing the read.
DS    Description.
DT    Date the run was produced (ISO8601 date or date/time).
FO    Flow order. The array of nucleotide bases that correspond to the
      nucleotides used for each flow of each read. Multi-base flows are encoded
      in IUPAC format, and non-nucleotide flows by various other characters.
      Format : /\*|[ACMGRSVTWYHKDBN]+/
KS    The array of nucleotide bases that correspond to the key sequence of each read.
LB    Library.
PG    Programs used for processing the read group.
PI    Predicted median insert size.
PL    Platform/technology used to produce the reads. Valid values : CAPILLARY,
      LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT and PACBIO.
PU    Platform unit (e.g. flowcell-barcode.lane for Illumina or slide for
      SOLiD). Unique identifier.
SM    Sample. Use pool name where a pool is being sequenced.
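
The read group string passed to -r is the @RG tag followed by TAG:VALUE pairs, separated by a literal backslash-t as in '@RG\tID:foo\tSM:bar'. The sketch below assembles such a string and checks PL against the platform values listed above; read_group_string is a hypothetical helper and performs no validation beyond what is shown.

```python
VALID_PLATFORMS = {"CAPILLARY", "LS454", "ILLUMINA", "SOLID",
                   "HELICOS", "IONTORRENT", "PACBIO"}
KNOWN_TAGS = {"CN", "DS", "DT", "FO", "KS", "LB", "PG", "PI", "PL", "PU", "SM"}


def read_group_string(rg_id, **tags):
    """Build an @RG string with literal backslash-t separators."""
    if not rg_id:
        raise ValueError("ID is required")
    parts = ["@RG", "ID:" + rg_id]
    for tag, value in tags.items():
        if tag not in KNOWN_TAGS:
            raise ValueError("unknown read group tag: " + tag)
        if tag == "PL" and value not in VALID_PLATFORMS:
            raise ValueError("invalid platform: " + str(value))
        parts.append("%s:%s" % (tag, value))
    # Two characters (backslash, t), matching the format shown for -r.
    return "\\t".join(parts)


if __name__ == "__main__":
    print(read_group_string("foo", SM="bar", PL="ILLUMINA"))
```

The separator is kept as the two characters backslash and t rather than a real tab, matching the example format given for the -r option.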