Galaxy | Tool Preview

Freebayes4Workflow (version 0.5)
Sample BAM files
Sample BAM file 0
Sets --targets or --region options
Select how much control over the freebayes run you need

What it does

FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment.

See https://github.com/ekg/freebayes for details on FreeBayes.

This Galaxy instance of FreeBayes corresponds to release 0.9.20

This tool is a fork of Freebayes revision 22 (99684adf84de)


Description

Privided BAM file(s) and a reference. FreeBayes will provide VCF output on standard out describing SNPs, indels, and complex variants in samples in the input alignments.

By default, FreeBayes will consider variants supported by at least 2 observations in a single sample (-C) and also by at least 20% of the reads from a single sample (-F). These settings are suitable to low to high depth sequencing in haploid and diploid samples, but users working with polyploid or pooled samples may wish to adjust them depending on the characteristics of their sequencing data.

FreeBayes is capable of calling variant haplotypes shorter than a read length where multiple polymorphisms segregate on the same read. The maximum distance between polymorphisms phased in this way is determined by the --max-complex-gap, which defaults to 3bp. In practice, this can comfortably be set to half the read length.

Ploidy may be set to any level (-p), but by default all samples are assumed to be diploid. FreeBayes can model per-sample and per-region variation in copy-number (-A) using a copy-number variation map.

FreeBayes can act as a frequency-based pooled caller and describe variants and haplotypes in terms of observation frequency rather than called genotypes. To do so, use --pooled-continuous and set input filters to a suitable level. Allele observation counts will be described by AO and RO fields in the VCF output.


Galaxy-specific options

Galaxy allows six levels of control over FreeBayes options provided by Choose parameter selection level menu option. These are:

  1. Simple diploid calling: The simples possible FreeBayes application. Equvalent of using FreeBayes with only a BAM input and no other parameter options.
  2. Simple diploid calling with filtering and coverage: Same as #1 plus two additional options: -0 (standard filters: --min-mapping-quality 30 --min-base-quality 20 --min-supporting-allele-qsum 0 --genotype-varinat-threshold 0) and --min-coverage.
  3. Frequency-based pooled calling: This is equivalent to using FreeBayes with the following options: --haplotype-length 0 --min-alternate-count 1 --min-alternate-fraction 0 --pooled-continuous --report-monomorphic. This is the best choice for calling varinats in mixtures such as viral, bacterial, or organellar genomes.
  4. Frequency-based pooled calling with filtering and coverage: Same as #3 but adds -0 and --min-coverage like in #2.
  5. Complete list of all options: Gives you full control by exposing all FreeBayes options as Galaxy widgets.

FreeBayes options

Note that each Galaxy parameter widget corresponding to command line flags listed below:

Input and output:

-t --targets FILE
                Limit analysis to targets listed in the BED-format FILE.
-r --region chrom:start_position-end_position
                Limit analysis to the specified region, 0-base coordinates,
                end_position included.  Either '-' or '..' maybe used as a separator.
-s --samples FILE
                Limit analysis to samples listed (one per line) in the FILE.
                By default FreeBayes will analyze all samples in its input
                BAM files.
--populations FILE
                Each line of FILE should list a sample and a population which
                it is part of.  The population-based bayesian inference model
                will then be partitioned on the basis of the populations.
-A --cnv-map FILE
                Read a copy number map from the BED file FILE, which has
                the format:
                   reference sequence, start, end, sample name, copy number
                ... for each region in each sample which does not have the
                default copy number as set by --ploidy.
--trace FILE    Output an algorithmic trace to FILE.
--failed-alleles FILE
                Write a BED file of the analyzed positions which do not
                pass --pvar to FILE.
-@ --variant-input VCF
                Use variants reported in VCF file as input to the algorithm.
                Variants in this file will be treated as putative variants
                even if there is not enough support in the data to pass
                input filters.
-l --only-use-input-alleles
                Only provide variant calls and genotype likelihoods for sites
                and alleles which are provided in the VCF input, and provide
                output in the VCF for all input alleles, not just those which
                have support in the data.
--haplotype-basis-alleles VCF
                When specified, only variant alleles provided in this input
                VCF will be used for the construction of complex or haplotype
                alleles.
--report-all-haplotype-alleles
                At sites where genotypes are made over haplotype alleles,
                provide information about all alleles in output, not only
                those which are called.
--report-monomorphic
                Report even loci which appear to be monomorphic, and report all
                considered alleles, even those which are not in called genotypes.
                Loci which do not have any potential alternates have '.' for ALT.

Reporting:

-P --pvar N     Report sites if the probability that there is a polymorphism
                at the site is greater than N.  default: 0.0.  Note that post-
                filtering is generally recommended over the use of this parameter.

Population model:

-T --theta N    The expected mutation rate or pairwise nucleotide diversity
                among the population under analysis.  This serves as the
                single parameter to the Ewens Sampling Formula prior model
                default: 0.001
-p --ploidy N   Sets the default ploidy for the analysis to N.  default: 2
-J --pooled-discrete
                Assume that samples result from pooled sequencing.
                Model pooled samples using discrete genotypes across pools.
                When using this flag, set --ploidy to the number of
                alleles in each sample or use the --cnv-map to define
                per-sample ploidy.
-K --pooled-continuous
                Output all alleles which pass input filters, regardles of
                genotyping outcome or model.

Reference allele:

-Z --use-reference-allele
                This flag includes the reference allele in the analysis as
                if it is another sample from the same population.
--reference-quality MQ,BQ
                Assign mapping quality of MQ to the reference allele at each
                site and base quality of BQ.  default: 100,60

Allele scope:

-I --no-snps    Ignore SNP alleles.
-i --no-indels  Ignore insertion and deletion alleles.
-X --no-mnps    Ignore multi-nuceotide polymorphisms, MNPs.
-u --no-complex Ignore complex events (composites of other classes).
-n --use-best-n-alleles N
                Evaluate only the best N SNP alleles, ranked by sum of
                supporting quality scores.  (Set to 0 to use all; default: all)
-E --max-complex-gap N
   --haplotype-length N
                Allow haplotype calls with contiguous embedded matches of up
                to this length.  (default: 3)
--min-repeat-size N
                When assembling observations across repeats, require the total repeat
                length at least this many bp.  (default: 5)
--min-repeat-entropy N
                To detect interrupted repeats, build across sequence until it has
                entropy > N bits per bp.  (default: 0, off)
--no-partial-observations
                Exclude observations which do not fully span the dynamically-determined
                detection window.  (default, use all observations, dividing partial
                support across matching haplotypes when generating haplotypes.)

Indel realignment:

-O --dont-left-align-indels
                Turn off left-alignment of indels, which is enabled by default.

Input filters:

-4 --use-duplicate-reads
                Include duplicate-marked alignments in the analysis.
                default: exclude duplicates marked as such in alignments
-m --min-mapping-quality Q
                Exclude alignments from analysis if they have a mapping
                quality less than Q.  default: 1
-q --min-base-quality Q
                Exclude alleles from analysis if their supporting base
                quality is less than Q.  default: 0
-R --min-supporting-allele-qsum Q
                Consider any allele in which the sum of qualities of supporting
                observations is at least Q.  default: 0
-Y --min-supporting-mapping-qsum Q
                Consider any allele in which and the sum of mapping qualities of
                supporting reads is at least Q.  default: 0
-Q --mismatch-base-quality-threshold Q
                Count mismatches toward --read-mismatch-limit if the base
                quality of the mismatch is >= Q.  default: 10
-U --read-mismatch-limit N
                Exclude reads with more than N mismatches where each mismatch
                has base quality >= mismatch-base-quality-threshold.
                default: ~unbounded
-z --read-max-mismatch-fraction N
                Exclude reads with more than N [0,1] fraction of mismatches where
                each mismatch has base quality >= mismatch-base-quality-threshold
                default: 1.0
-$ --read-snp-limit N
                Exclude reads with more than N base mismatches, ignoring gaps
                with quality >= mismatch-base-quality-threshold.
                default: ~unbounded
-e --read-indel-limit N
                Exclude reads with more than N separate gaps.
                default: ~unbounded
-0 --standard-filters  Use stringent input base and mapping quality filters
                Equivalent to -m 30 -q 20 -R 0 -S 0
-F --min-alternate-fraction N
                Require at least this fraction of observations supporting
                an alternate allele within a single individual in the
                in order to evaluate the position.  default: 0.2
-C --min-alternate-count N
                Require at least this count of observations supporting
                an alternate allele within a single individual in order
                to evaluate the position.  default: 2
-3 --min-alternate-qsum N
                Require at least this sum of quality of observations supporting
                an alternate allele within a single individual in order
                to evaluate the position.  default: 0
-G --min-alternate-total N
                Require at least this count of observations supporting
                an alternate allele within the total population in order
                to use the allele in analysis.  default: 1
-! --min-coverage N
                Require at least this coverage to process a site.  default: 0

Population priors:

-k --no-population-priors
                Equivalent to --pooled-discrete --hwe-priors-off and removal of
                Ewens Sampling Formula component of priors.

Mappability priors:

-w --hwe-priors-off
                Disable estimation of the probability of the combination
                arising under HWE given the allele frequency as estimated
                by observation frequency.
-V --binomial-obs-priors-off
                Disable incorporation of prior expectations about observations.
                Uses read placement probability, strand balance probability,
                and read position (5'-3') probability.
-a --allele-balance-priors-off
                Disable use of aggregate probability of observation balance between alleles
                as a component of the priors.

Genotype likelihoods:

--observation-bias FILE
                Read length-dependent allele observation biases from FILE.
                The format is [length] [alignment efficiency relative to reference]
                where the efficiency is 1 if there is no relative observation bias.
--base-quality-cap Q
                Limit estimated observation quality by capping base quality at Q.
--experimental-gls
                Generate genotype likelihoods using 'effective base depth' metric
                qual = 1-BaseQual * 1-MapQual.  Incorporate partial observations.
                This is the default when contamination estimates are provided.
                Optimized for diploid samples.
--prob-contamination F
                An estimate of contamination to use for all samples.  default: 10e-9
--contamination-estimates FILE
                A file containing per-sample estimates of contamination, such as
                those generated by VerifyBamID.  The format should be:
                    sample p(read=R|genotype=AR) p(read=A|genotype=AA)
                Sample '*' can be used to set default contamination estimates.

Algorithmic features:

--report-genotype-likelihood-max
                Report genotypes using the maximum-likelihood estimate provided
                from genotype likelihoods.
-B --genotyping-max-iterations N
                Iterate no more than N times during genotyping step. default: 1000.
--genotyping-max-banddepth N
                Integrate no deeper than the Nth best genotype by likelihood when
                genotyping. default: 6.
-W --posterior-integration-limits N,M
                Integrate all genotype combinations in our posterior space
                which include no more than N samples with their Mth best
                data likelihood. default: 1,3.
-N --exclude-unobserved-genotypes
                Skip sample genotypings for which the sample has no supporting reads.
-S --genotype-variant-threshold N
                Limit posterior integration to samples where the second-best
                genotype likelihood is no more than log(N) from the highest
                genotype likelihood for the sample.  default: ~unbounded
-j --use-mapping-quality
                Use mapping quality of alleles when calculating data likelihoods.
-H --harmonic-indel-quality
                Use a weighted sum of base qualities around an indel, scaled by the
                distance from the indel.  By default use a minimum BQ in flanking sequence.
-D --read-dependence-factor N
                Incorporate non-independence of reads by scaling successive
                observations by this factor during data likelihood
                calculations.  default: 0.9
-= --genotype-qualities
                Calculate the marginal probability of genotypes and report as GQ in
                each sample field in the VCF output.

Citation

For the underlying tool, please cite Erik Garrison and Gabor Marth. Haplotype-based variant detection from short-read sequencing.

The initial version of the wrapper was produced by Dan Blankenberg and upgraded by Anton Nekrutenko.