Galaxy | Tool Preview

Filter SNPs (version 1.2.0)

Dataset formats

The input datasets are in gd_snp, gd_genotype, and gd_indivs formats. The output dataset is in gd_snp or gd_genotype format. (Dataset missing?)


What it does

For a gd_snp dataset, the user specifies that some of the individuals form a "population", by supplying a list that has been previously created using the Specify Individuals tool. SNPs are then discarded if their total coverage for the population is too low or too high, or if their coverage or quality score for any individual in the population is too low.

The upper and lower bounds on total population coverage can be specified either as read counts or as percentiles (e.g. "5%", with no decimal places). For percentile bounds the SNPs are ranked by read count, so for example, a lower bound of "10%" means that the least-covered 10% of the SNPs will be discarded, while an upper bound of, say, "80%" will discard all SNPs above the 80% mark, i.e. the top 20%. The threshold for the lower bound on individual coverage can only be specified as a plain read count.

For either a gd_snp or gd_genotype dataset, the user can specify a minimum number of defined genotypes (i.e., not -1) and/or a minimum spacing relative to the reference sequence. An error is reported if the user requests a minimum spacing but no reference sequence is available.


Example