Dataset formats
The input dataset is in gd_snp or gd_genotype format.
The output is a composite dataset, containing the tree in both text (Newick)
and PostScript formats, as well as supplemental text information.
What it does
This tool uses a gd_snp dataset to determine a kind of "genetic distance"
between each pair of individuals. That information is used to
produce a tree-shaped figure that depicts how the individuals are related,
both as text files and as a diagram.
The text files include a common tree format, Newick, as well as distance
matrices and counts of informative SNPs for each pairwise comparison.
The informative SNPs can be used as a guide to how reliable the tree is.
The input parameters are:
- SNP dataset
- A table of SNPs for various individuals, in gd_snp format.
- Individuals
- By default all individuals are included in the analysis, but this can
optionally be restricted to a subset that has been defined using the
Specify Individuals tool.
- Minimum SNP coverage
- For each pair of individuals, the tool looks for informative SNPs, i.e.,
where the sequence data for both individuals is adequate. Specifying,
say, 7 for this option instructs the tool to consider only SNPs with
at least 7 reads in each of the two individuals (regardless of the
alleles) when estimating their genetic distance.
- Minimum SNP quality
- Specifying, say, 37 for this option instructs the tool to consider
only SNPs with a quality score of at least 37 in both individuals
when estimating their genetic distance.
- Include reference sequence
- For gd_snp datasets containing columns for a reference sequence, the
user can ask that the reference be indicated in the tree, to help with
rooting it. If the dataset has no reference columns, this option has
no effect.
- Distance metric
- The genetic distance between two individuals at a given SNP can
be estimated in two ways. One method is to use the absolute value of the
difference in the frequency of the first allele (or equivalently, the
second allele). For instance, if the first individual has 5 reads of
each allele and the second individual has 3 and 6 reads, respectively,
then the frequencies are 1/2 and 1/3, giving a distance of 1/6 at that
SNP. The other approach is to use the genotype calls to estimate
the difference in the number of occurrences of the first allele.
For instance, if the two genotypes are 2 and 1, i.e., the individuals
are estimated to have respectively 2 and 1 occurrences of the first
allele at this location, then the distance is 1 (the absolute value
of the difference of the two numbers).
- Output options
- The final four options apply mostly to the graphical drawing of the
tree, except that the branch lengths are also added to the Newick text
file.
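
The coverage and quality filters and the two distance metrics described
above can be sketched in Python as follows. This is a minimal illustration,
not the tool's actual code: the function names and the (first allele,
second allele) read-count tuples are hypothetical, and it assumes that
coverage means the total number of reads for both alleles.

```python
from fractions import Fraction


def is_informative(reads_a, reads_b, qual_a, qual_b, min_coverage, min_quality):
    """Return True if a SNP passes both filters in both individuals.

    reads_a and reads_b are hypothetical (first_allele, second_allele)
    read-count tuples; coverage is taken to be their sum.
    """
    enough_reads = sum(reads_a) >= min_coverage and sum(reads_b) >= min_coverage
    enough_quality = qual_a >= min_quality and qual_b >= min_quality
    return enough_reads and enough_quality


def frequency_distance(reads_a, reads_b):
    """Absolute difference in the frequency of the first allele.

    Assumes each individual has at least one read, which holds for any
    SNP that passed is_informative with min_coverage >= 1.
    """
    freq_a = Fraction(reads_a[0], sum(reads_a))
    freq_b = Fraction(reads_b[0], sum(reads_b))
    return abs(freq_a - freq_b)


def genotype_distance(genotype_a, genotype_b):
    """Absolute difference in the number of first-allele occurrences (0, 1 or 2)."""
    return abs(genotype_a - genotype_b)
```

With the numbers from the text, frequency_distance((5, 5), (3, 6)) gives
1/6 and genotype_distance(2, 1) gives 1. How the tool combines the per-SNP
distances into one value per pair of individuals is not covered here.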
Acknowledgments
To convert the distance matrix to a Newick-formatted tree, we use the
QuickTree program from
http://www.sanger.ac.uk/resources/software/quicktree/ .
To make the diagram we use draw_tree, available at
http://compgen.bscb.cornell.edu/phast/ .