
Admixture (version 1.2.0)
Note: The best choice for the Genotype switch penalty depends on the density of SNPs and the age of the admixture events. With 50,000 SNPs in a vertebrate genome, 10.0 might be appropriate; with millions of SNPs, 100.0 might work better. We recommend experimenting with various thresholds on the minimal spacing between SNVs (to increase independence), the minimal FST between the source populations (to identify "ancestry informative markers"), and the Genotype switch penalty, so that the conclusions reached are robust to changes in analysis parameters.
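To build intuition for why a larger penalty suits denser SNP sets, the following is a minimal, generic sketch of penalized segmentation. It is not the tool's actual algorithm, which is not described here. The function name `segment`, the per-SNP `scores` matrix, and the toy values are all hypothetical.

```python
def segment(scores, switch_penalty):
    """Choose one state per SNP, maximizing summed scores minus a penalty per switch.

    scores[i][s] is a hypothetical support value for state s at SNP i.
    This is a generic dynamic-programming illustration, not the tool's code.
    """
    n_states = len(scores[0])
    best = list(scores[0])   # best total for paths ending in each state
    back = []                # back-pointers, one list per SNP after the first
    for row in scores[1:]:
        top = max(range(n_states), key=lambda s: best[s])
        new_best, pointers = [], []
        for s in range(n_states):
            stay = best[s]
            switch = best[top] - switch_penalty
            if switch > stay:
                # Switching from the best previous state pays off despite the penalty.
                new_best.append(switch + row[s])
                pointers.append(top)
            else:
                new_best.append(stay + row[s])
                pointers.append(s)
        best = new_best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy data: a single SNP (index 2) weakly favors state 1.
toy_scores = [[1, 0], [1, 0], [0, 1], [1, 0], [1, 0]]
no_penalty = segment(toy_scores, 0)     # follows every SNP, so it switches twice
high_penalty = segment(toy_scores, 10)  # ignores the isolated SNP
```

In this toy setting, a single discordant SNP is enough to cause a switch when the penalty is zero, while a penalty of 10 requires more accumulated evidence. With more SNPs per interval, more evidence accumulates, which is consistent with the advice to raise the penalty for denser data.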

Dataset formats

The input datasets are in gd_snp, gd_genotype, and gd_indivs formats. It is important for the Individuals datasets to have unique names and for there to be no overlap between the two populations. Rename these datasets if needed to make them unique. There are two output datasets, one tabular and one composite.


What it does

The user specifies two or three source populations (i.e., sources for chromosomes) and a set of potentially admixed individuals, and chooses between the sequence coverage and the estimated genotypes to measure the similarity of genomic intervals in admixed individuals to the two or three classes of source chromosomes. The user also specifies a "switch penalty", which controls the strength of evidence needed to switch between source populations as the program scans along a chromosome. The choice of an appropriate value depends on the number of SNPs and, to a lesser extent, on the time since the admixture events. With several million SNPs genome-wide, reasonable values might fall between 10 and 100. If there are three source populations, then for each potentially admixed individual the program divides the genome into six "genotypes":

  1. homozygous for the first source population (i.e., both chromosomes from that population),
  2. homozygous for the second source population,
  3. homozygous for the third source population,
  4. heterozygous for the first and second populations (i.e., one chromosome from each),
  5. heterozygous for the first and third populations, or
  6. heterozygous for the second and third populations.

Parts of a reference chromosome that are labeled as "heterochromatic" are given the "non-genotype" 0. With two source populations, only "genotypes" 1, 2, and 3 are possible, where 3 now means heterozygous for the two source populations.
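The numbering above can be summarized as a small lookup table. The following sketch is only an illustration of the coding scheme as described here; the function name `genotype_codes` and the use of `None` for the "non-genotype" 0 are assumptions, not part of the tool.

```python
from itertools import combinations


def genotype_codes(n_sources):
    """Map each "genotype" code to a pair of source populations.

    Homozygous codes come first (1..n), followed by the heterozygous pairs.
    Code 0 is the "non-genotype" used for heterochromatic regions.
    """
    if n_sources not in (2, 3):
        raise ValueError("expected two or three source populations")
    pops = range(1, n_sources + 1)
    pairs = [(p, p) for p in pops] + list(combinations(pops, 2))
    codes = {0: None}
    for code, pair in enumerate(pairs, start=1):
        codes[code] = pair
    return codes


three = genotype_codes(3)  # codes 0 through 6
two = genotype_codes(2)    # codes 0 through 3; 3 is heterozygous
```

For example, `three[5]` is `(1, 3)`, heterozygous for the first and third populations, while `two[3]` is `(1, 2)`.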

Two output datasets are generated. The first is a tabular dataset with chromosome, start, and stop columns, followed by pairs of columns containing the "genotype" from the list above and the label of the admixed individual. The second is a composite dataset with general information from the run and two links. The first link is to a PDF that graphically shows the source population along each of the chromosomes. The second link is to a text file with summary information on the "genotypes" over the whole genome.
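As a rough sketch of how the tabular dataset might be post-processed, the code below totals the interval length assigned to each "genotype" per individual. It assumes tab-separated columns in the order chromosome, start, stop, then one (genotype, label) pair per admixed individual, and it measures interval length as stop minus start. These details, the function name `genotype_lengths`, and the example rows are assumptions for illustration and should be checked against real output.

```python
import io
from collections import defaultdict


def genotype_lengths(handle):
    """Sum interval lengths per individual label and "genotype" code.

    Assumed layout (not confirmed by the tool documentation):
    chrom <TAB> start <TAB> stop <TAB> genotype <TAB> label [<TAB> genotype <TAB> label ...]
    """
    totals = defaultdict(lambda: defaultdict(int))
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        start, stop = int(fields[1]), int(fields[2])
        rest = fields[3:]
        # Walk the remaining columns two at a time: genotype, then label.
        for genotype, label in zip(rest[0::2], rest[1::2]):
            totals[label][int(genotype)] += stop - start
    return {label: dict(by_code) for label, by_code in totals.items()}


# Invented example rows with two hypothetical individuals.
example = io.StringIO(
    "chr1\t0\t1000\t1\tindA\t3\tindB\n"
    "chr1\t1000\t1500\t3\tindA\t3\tindB\n"
)
lengths = genotype_lengths(example)
```

Dividing each total by the individual's overall covered length would give genome-wide proportions comparable in spirit to the text summary, though the tool's own summary may be computed differently.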