What it does

This program exports a Stacks data set either as a set of observed haplotypes at each locus in the population, or with the haplotypes encoded into genotypes. The -r option restricts the export to loci that are present in at least a given number of individuals in the population. In a mapping context, raising or lowering this limit is an effective way to control the quality of the exported markers, because genuine markers will be found in a large number of progeny. When exporting a set of observed haplotypes in a population, the "min stack depth" option can be used to restrict exported loci to those that have a minimum depth of reads.
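
The effect of the -r limit can be illustrated with a minimal Python sketch. It is not the genotypes implementation: the function name, the dictionary layout and the locus and individual IDs below are hypothetical, chosen only to show how raising the limit removes loci seen in few individuals.

```python
def filter_loci_by_progeny(locus_to_samples, min_progeny):
    """Keep loci observed in at least `min_progeny` distinct individuals."""
    return {
        locus: samples
        for locus, samples in locus_to_samples.items()
        if len(set(samples)) >= min_progeny
    }


# Hypothetical catalog: locus ID -> IDs of the individuals carrying the locus.
catalog = {
    101: [1, 2, 3, 4],
    102: [1],
    103: [2, 3],
}

# With a limit of 2, locus 102 (seen in a single individual) is dropped.
kept = filter_loci_by_progeny(catalog, min_progeny=2)
print(sorted(kept))
```

With a limit of 1, which is the pipeline default described below, every locus in this toy catalog would be kept.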

By default, when the pipeline is executed (either denovo_map or ref_map), the genotypes program runs last. It identifies mappable markers in the population and exports both a set of observed haplotypes and a set of generic genotypes, with the "min number of progeny" option set to 1.

Making Corrections

If enabled with the "make automated corrections to the data" option, the genotypes program will make automated corrections to the data. Since loci are matched up in the population, the script can correct false-negative heterozygote alleles since it knows the existence of alleles at a particular locus in the other individuals. For example, the program will identify loci with SNPs that didn’t have high enough coverage to be identified by the SNP caller. It will also check that homozygous tags have a minimum depth of coverage, since a low-coverage polymorphic locus may appear homozygous simply because the other allele wasn’t sequenced.

Correction Thresholds

The thresholds for automated corrections can be modified by using the "automated corrections option" and changing the default values of the "min number of reads for homozygous genotype", "homozygote minor minimum allele frequency" and "heterozygote minor minimum allele frequency" parameters to genotypes. "min number of reads for homozygous genotype" is the minimum number of reads required to consider a stack homozygous (default of 5). The "homozygote minor minimum allele frequency" and "heterozygote minor minimum allele frequency" parameters are fractions. If the ratio of the depth of the smaller allele to that of the bigger allele is greater than "heterozygote minor minimum allele frequency" (default of 1/10), the stack is called heterozygous. If the ratio is less than "homozygote minor minimum allele frequency" (default of 1/20), the stack is called homozygous. If the ratio falls between the two values, the genotype is unknown and will not be assigned.

Automated corrections made by the program are shown in the output file in capital letters.
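
The threshold rule described above can be written as a short Python sketch. This illustrates the rule as stated on this page and is not the genotypes source code. The function and parameter names are hypothetical, and applying the minimum read count to the total depth of the stack is an assumption.

```python
from fractions import Fraction


def call_genotype(major_depth, minor_depth,
                  min_hom_reads=5,
                  hom_max_ratio=Fraction(1, 20),
                  het_min_ratio=Fraction(1, 10)):
    """Classify a stack as 'het', 'hom' or 'unknown' from its allele depths."""
    if major_depth <= 0:
        return "unknown"  # no reads to base a call on
    # Ratio of the smaller allele depth to the bigger allele depth.
    ratio = Fraction(minor_depth, major_depth)
    if ratio > het_min_ratio:
        return "het"
    if ratio < hom_max_ratio:
        # Assumption: the read minimum applies to the total depth of the stack.
        if major_depth + minor_depth >= min_hom_reads:
            return "hom"
        return "unknown"
    # Between the two thresholds: no genotype is assigned.
    return "unknown"


print(call_genotype(20, 5))   # ratio 1/4, above 1/10
print(call_genotype(40, 1))   # ratio 1/40, below 1/20, enough reads
print(call_genotype(40, 3))   # ratio 3/40, between the two thresholds
print(call_genotype(3, 0))    # ratio 0, but only 3 reads in total
```

Exact fractions are used instead of floating-point numbers so that a ratio exactly equal to a threshold falls in the unassigned zone, as the strict "greater than" and "less than" wording implies.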

Created by:

Stacks was developed by Julian Catchen with contributions from Angel Amores, Paul Hohenlohe, and Bill Cresko.

Example:

Input files:

FASTQ, FASTA, zip, tar.gz

Output files:

XXX.tags.tsv file:

Column    Name                     Description
1         Sql ID                   This field will always be "0", however the MySQL database will assign an ID when it is loaded.
2         Sample ID                Each sample passed through Stacks gets a unique id for that sample.
3         Stack ID                 Each stack formed gets an ID.
4         Chromosome               If aligned to a reference genome using pstacks, otherwise it is blank.
5         Basepair                 If aligned to ref genome using pstacks.
6         Strand                   If aligned to ref genome using pstacks.
7         Sequence Type            Either 'consensus', 'primary' or 'secondary', see the Stacks paper for definitions of these terms.
8         Sequence ID              The individual sequence read that was merged into this stack.
9         Sequence                 The raw sequencing read.
10        Deleveraged Flag         If "1", this stack was processed by the deleveraging algorithm and was broken down from a larger stack.
11        Blacklisted Flag         If "1", this stack was still confounded despite processing by the deleveraging algorithm.
12        Lumberjackstack Flag     If "1", this stack was set aside due to having an extreme depth of coverage.

Notes: For the tags file, each stack will start in the file with a consensus sequence for the entire stack followed by the flags for that stack. Then, each individual read that was merged into that stack will follow. The next stack will start with another consensus sequence.
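
The layout described in the note above can be read with a few lines of Python. This is a minimal sketch, assuming the file is tab-separated with no header line and follows the twelve columns listed in the table. The function name and the sample rows are hypothetical.

```python
import csv
import io


def group_reads_by_stack(handle):
    """Return {(sample_id, stack_id): {'consensus': str, 'reads': [str, ...]}}."""
    stacks = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 9:
            continue  # skip blank or truncated lines
        sample_id, stack_id = row[1], row[2]   # columns 2 and 3
        seq_type, sequence = row[6], row[8]    # columns 7 and 9
        entry = stacks.setdefault((sample_id, stack_id),
                                  {"consensus": None, "reads": []})
        if seq_type == "consensus":
            entry["consensus"] = sequence
        elif seq_type in ("primary", "secondary"):
            entry["reads"].append(sequence)
    return stacks


# Hypothetical rows: one consensus line, then the reads merged into the stack.
rows = [
    ["0", "1", "7", "", "", "", "consensus", "", "ACGT", "0", "0", "0"],
    ["0", "1", "7", "", "", "", "primary", "read_1", "ACGT", "", "", ""],
    ["0", "1", "7", "", "", "", "secondary", "read_2", "ACGA", "", "", ""],
    ["0", "1", "8", "", "", "", "consensus", "", "TTGA", "0", "0", "0"],
]
text = "\n".join("\t".join(fields) for fields in rows) + "\n"

stacks = group_reads_by_stack(io.StringIO(text))
for key, entry in sorted(stacks.items()):
    print(key, entry["consensus"], len(entry["reads"]))
```

To read a real file, pass an open file handle in place of the in-memory `io.StringIO` object.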

XXX.snps.tsv file:

Column    Name                     Description
1         Sql ID                   This field will always be "0", however the MySQL database will assign an ID when it is loaded.
2         Sample ID
3         Stack ID
4         SNP Column
5         Likelihood ratio         From the SNP-calling model.
6         Rank_1                   Majority nucleotide.
7         Rank_2                   Alternative nucleotide.

Notes: If a stack has two SNPs called within it, then there will be two lines in this file listing each one.

XXX.alleles.tsv file:

Column    Name                     Description
1         Sql ID                   This field will always be "0", however the MySQL database will assign an ID when it is loaded.
2         Sample ID
3         Stack ID
4         Haplotype                The haplotype, as constructed from the called SNPs at each locus.
5         Percent                  Percentage of reads that have this haplotype
6         Count                    Raw number of reads that have this haplotype

XXX.matches.tsv file:

Column    Name                     Description
1         Sql ID                   This field will always be "0", however the MySQL database will assign an ID when it is loaded.
2         Batch ID
3         Catalog ID
4         Sample ID
5         Stack ID
6         Haplotype
7         Stack Depth

Notes: Each line in this file records a match between a catalog locus and a locus in an individual, for a particular haplotype. The Batch ID plus the Catalog ID together represent a unique locus in the entire population, while the Sample ID and the Stack ID together represent a unique locus in an individual sample.
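
The two composite keys described in the note can be used to count how many individuals carry each catalog locus. The Python sketch below assumes each record is a sequence of the seven columns listed in the table. The function name and the example records are hypothetical.

```python
from collections import defaultdict


def samples_per_catalog_locus(match_rows):
    """Count distinct samples matched to each (batch ID, catalog ID) locus."""
    locus_samples = defaultdict(set)
    for row in match_rows:
        batch_id, catalog_id, sample_id = row[1], row[2], row[3]
        # Several haplotypes from one sample collapse into one set entry.
        locus_samples[(batch_id, catalog_id)].add(sample_id)
    return {locus: len(samples) for locus, samples in locus_samples.items()}


# Hypothetical records: Sql ID, Batch ID, Catalog ID, Sample ID, Stack ID,
# Haplotype, Stack Depth.
matches = [
    ("0", "1", "10", "1", "7", "AC", "12"),
    ("0", "1", "10", "1", "7", "AT", "9"),   # same sample, second haplotype
    ("0", "1", "10", "2", "3", "AC", "20"),
    ("0", "1", "11", "2", "5", "GG", "15"),
]

counts = samples_per_catalog_locus(matches)
print(counts)
```

Counts of this kind are what a limit on the minimum number of individuals per locus would be compared against.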

batch_X.sumstats.tsv Summary Statistics Output:

Name                       Description
Batch ID                   The batch identifier for this data set.
Locus ID                   Catalog locus identifier.
Chromosome                 If aligned to a reference genome.
Basepair                   If aligned to a reference genome. This is the alignment of the whole catalog locus. The exact basepair reported is aligned to the location of the RAD site (depending on whether alignment is to the positive or negative strand).
Column                     The nucleotide site within the catalog locus.
Population ID              The ID supplied to the populations program, as written in the population map file.
P Nucleotide               The most frequent allele at this position in this population.
Q Nucleotide               The alternative allele.
Number of Individuals      Number of individuals sampled in this population at this site.
P                          Frequency of the most frequent allele.
Observed Heterozygosity    The proportion of individuals that are heterozygotes in this population.
Observed Homozygosity      The proportion of individuals that are homozygotes in this population.
Expected Heterozygosity    Heterozygosity expected under Hardy-Weinberg equilibrium.
Expected Homozygosity      Homozygosity expected under Hardy-Weinberg equilibrium.
pi                         An estimate of nucleotide diversity.
Smoothed pi                A weighted average of pi depending on the surrounding 3σ of sequence in both directions.
Smoothed pi P-value        If bootstrap resampling is enabled, a p-value ranking the significance of pi within this population.
FIS                        The inbreeding coefficient of an individual (I) relative to the subpopulation (S).
Smoothed FIS               A weighted average of FIS depending on the surrounding 3σ of sequence in both directions.
Smoothed FIS P-value       If bootstrap resampling is enabled, a p-value ranking the significance of FIS within this population.
Private allele             True (1) or false (0), depending on whether this allele occurs only in this population.
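
The Hardy-Weinberg columns can be related to the P column with the textbook formulas for a biallelic site. The Python sketch below uses those textbook definitions only. The estimators actually used by Stacks may differ (for example through sample-size corrections), and the function name is hypothetical.

```python
def hardy_weinberg_stats(p, observed_het):
    """Return (expected het, expected hom, FIS) for a biallelic site.

    p            -- frequency of the most frequent allele
    observed_het -- observed proportion of heterozygous individuals
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    q = 1.0 - p
    expected_het = 2.0 * p * q          # 2pq under Hardy-Weinberg
    expected_hom = 1.0 - expected_het   # p^2 + q^2
    # Textbook FIS = 1 - Hobs / Hexp; undefined at a fixed site.
    fis = None if expected_het == 0 else 1.0 - observed_het / expected_het
    return expected_het, expected_hom, fis


print(hardy_weinberg_stats(0.5, 0.25))
```

A positive FIS, as in this example, indicates fewer heterozygotes than expected under Hardy-Weinberg equilibrium.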

batch_X.fst_Y-Z.tsv Pairwise FST Output:

Name                       Description
Batch ID                   The batch identifier for this data set.
Locus ID                   Catalog locus identifier.
Population ID 1            The ID supplied to the populations program, as written in the population map file.
Population ID 2            The ID supplied to the populations program, as written in the population map file.
Chromosome                 If aligned to a reference genome.
Basepair                   If aligned to a reference genome. This is the alignment of the whole catalog locus. The exact basepair reported is aligned to the location of the RAD site (depending on whether alignment is to the positive or negative strand).
Column                     The nucleotide site within the catalog locus.
Overall pi                 An estimate of nucleotide diversity across the two populations.
FST                        A measure of population differentiation.
FET p-value                P-value describing if the FST measure is statistically significant according to Fisher's Exact Test.
Odds Ratio                 Fisher's Exact Test odds ratio.
CI High                    Fisher's Exact Test confidence interval.
CI Low                     Fisher's Exact Test confidence interval.
LOD Score                  Logarithm of odds score.
Expected Heterozygosity    Heterozygosity expected under Hardy-Weinberg equilibrium.
Expected Homozygosity      Homozygosity expected under Hardy-Weinberg equilibrium.
Corrected FST              FST with either the FET p-value, or a window-size or genome size Bonferroni correction.
Smoothed FST               A weighted average of FST depending on the surrounding 3σ of sequence in both directions.
Smoothed FST P-value       If bootstrap resampling is enabled, a p-value ranking the significance of FST within this pair of populations.
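
The smoothed columns are described as weighted averages over the surrounding sequence. The Python sketch below shows one way such an average can be computed. It assumes Gaussian weights and reads the window as three standard deviations (sigma) on each side of the central site. The function name, the sigma value and the example sites are hypothetical, not values taken from Stacks.

```python
import math


def smooth_statistic(sites, center_bp, sigma):
    """Gaussian-weighted average of a statistic within 3 sigma of center_bp.

    sites -- iterable of (basepair, value) pairs on one chromosome
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for bp, value in sites:
        distance = abs(bp - center_bp)
        if distance > 3 * sigma:
            continue  # outside the window in either direction
        weight = math.exp(-(distance ** 2) / (2 * sigma ** 2))
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None


# Hypothetical per-site FST values: (basepair, FST).
fst_sites = [(1000, 0.10), (1200, 0.30), (5000, 0.90)]

# With sigma = 200 the site at 5000 bp lies outside the window around 1000 bp.
print(smooth_statistic(fst_sites, center_bp=1000, sigma=200))
```

Sites close to the centre dominate the average, and distant sites contribute little or nothing, which is the behaviour a kernel-smoothed statistic is meant to have.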

Instructions for adding archive management functionality to Galaxy are available on the eBiogenouest HUB wiki.

Output type:

Output type details:

No compression                  All files will be added to the current history.
Compressed by categories        Files will be compressed by category (snps, alleles, matches and tags) into 4 zip archives. These archives and the batch files will be added to the current history.
Compressed all outputs          All files will be compressed into a single zip archive. Batch files will be added to the current history along with the archive.
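
The "Compressed by categories" grouping can be pictured with a short Python sketch. It illustrates the grouping rule only and is not the wrapper's code. The function name and the file names are hypothetical, and matching on the `.<category>.tsv` suffix is an assumption based on the file names listed above.

```python
def group_outputs(filenames):
    """Split output files into the four category archives and batch files."""
    categories = ("snps", "alleles", "matches", "tags")
    archives = {category: [] for category in categories}
    batch_files = []
    for name in filenames:
        for category in categories:
            if name.endswith(f".{category}.tsv"):
                archives[category].append(name)
                break
        else:
            # Not one of the four categories: left uncompressed in the history.
            batch_files.append(name)
    return archives, batch_files


# Hypothetical output file names.
outputs = [
    "sample_1.tags.tsv",
    "sample_1.snps.tsv",
    "sample_1.alleles.tsv",
    "sample_1.matches.tsv",
    "sample_2.tags.tsv",
    "batch_1.sumstats.tsv",
]

archives, batch_files = group_outputs(outputs)
print(archives["tags"])
print(batch_files)
```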

Project links:

STACKS website

STACKS manual

STACKS Google group

References:

-J. Catchen, P. Hohenlohe, S. Bassham, A. Amores, and W. Cresko. Stacks: an analysis tool set for population genomics. Molecular Ecology. 2013.

-J. Catchen, S. Bassham, T. Wilson, M. Currey, C. O'Brien, Q. Yeates, and W. Cresko. The population structure and recent colonization history of Oregon threespine stickleback determined using restriction-site associated DNA-sequencing. Molecular Ecology. 2013.

-J. Catchen, A. Amores, P. Hohenlohe, W. Cresko, and J. Postlethwait. Stacks: building and genotyping loci de novo from short-read sequences. G3: Genes, Genomes, Genetics, 1:171-182, 2011.

-A. Amores, J. Catchen, A. Ferrara, Q. Fontenot and J. Postlethwait. Genome evolution and meiotic maps by massively parallel DNA sequencing: Spotted gar, an outgroup for the teleost genome duplication. Genetics, 188:799-808, 2011.

-P. Hohenlohe, S. Amish, J. Catchen, F. Allendorf, G. Luikart. RAD sequencing identifies thousands of SNPs for assessing hybridization between rainbow trout and westslope cutthroat trout. Molecular Ecology Resources, 11(s1):117-122, 2011.

-K. Emerson, C. Merz, J. Catchen, P. Hohenlohe, W. Cresko, W. Bradshaw, C. Holzapfel. Resolving postglacial phylogeography using high-throughput sequencing. Proceedings of the National Academy of Sciences, 107(37):16196-200, 2010.

Integrated by:

Yvan Le Bras and Cyril Monjeaud

GenOuest Bio-informatics Core Facility

UMR 6074 IRISA INRIA-CNRS-UR1 Rennes (France)

support@genouest.org

If you use this tool in Galaxy, please cite:

Y. Le Bras, A. Roult, C. Monjeaud, M. Bahin, O. Quenez, C. Heriveau, A. Bretaudeau, O. Sallou, O. Collin. Towards a Life Sciences Virtual Research Environment: an e-Science initiative in Western France. JOBIM 2013.