
Phylogenetic Tree (version 1.1.0)

Dataset formats

The input dataset is in gd_snp or gd_genotype format. The output is a composite dataset, containing the tree in both text (Newick) and PostScript formats, as well as supplemental text information.


What it does

This tool uses a gd_snp dataset to estimate a "genetic distance" between each pair of individuals. Those distances are used to build a tree that depicts how the individuals are related, reported both as text files and as a diagram. The text files include the tree in the common Newick format, as well as distance matrices and counts of informative SNPs for each pairwise comparison. The counts of informative SNPs can be used as a guide to how reliable the tree is.
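As a rough illustration of how per-SNP distances might be condensed into the distance matrix and the informative-SNP counts, here is a minimal Python sketch. It assumes the per-SNP distances for each pair are already available, with None marking SNPs that are not informative for that pair, and it assumes the pairwise distance is the mean over the informative SNPs. That aggregation rule, and the names summarize_pair and build_matrices, are illustrative assumptions and are not taken from the tool itself.

```python
from itertools import combinations


def summarize_pair(snp_distances):
    """Return (mean distance, informative SNP count) for one pair.

    Assumption: the pairwise distance is the mean of the per-SNP distances;
    the real tool may combine them differently.
    """
    informative = [d for d in snp_distances if d is not None]
    if not informative:
        return None, 0  # no informative SNPs, so no distance estimate
    return sum(informative) / len(informative), len(informative)


def build_matrices(names, snp_distances_by_pair):
    """Build symmetric distance and informative-count matrices.

    snp_distances_by_pair maps (name_i, name_j), in the order given by
    names, to that pair's list of per-SNP distances.
    """
    n = len(names)
    distance = [[0.0] * n for _ in range(n)]
    counts = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d, count = summarize_pair(snp_distances_by_pair[(names[i], names[j])])
        distance[i][j] = distance[j][i] = d
        counts[i][j] = counts[j][i] = count
    return distance, counts


# Hypothetical data: three individuals, three SNPs per pair.
names = ["A", "B", "C"]
by_pair = {
    ("A", "B"): [0.5, None, 0.25],
    ("A", "C"): [1.0, 1.0, None],
    ("B", "C"): [None, None, None],
}
distance, counts = build_matrices(names, by_pair)
```

In this sketch a pair with few informative SNPs still gets a distance, which is why the counts matrix is worth checking alongside the tree: a distance based on a handful of SNPs deserves less trust than one based on thousands.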

The input parameters are:

SNP dataset
A table of SNPs for various individuals, in gd_snp format.
Individuals
By default all individuals are included in the analysis, but this can optionally be restricted to a subset that has been defined using the Specify Individuals tool.
Minimum SNP coverage
For each pair of individuals, the tool looks for informative SNPs, i.e., where the sequence data for both individuals is adequate. Specifying, say, 7 for this option instructs the tool to consider only SNPs with at least 7 reads in each of the two individuals (regardless of the alleles) when estimating their genetic distance.
Minimum SNP quality
Specifying, say, 37 for this option instructs the tool to consider only SNPs with a quality score of at least 37 in both individuals when estimating their genetic distance.
Include reference sequence
For gd_snp datasets containing columns for a reference sequence, the user can ask that the reference be indicated in the tree, to help with rooting it. If the dataset has no reference columns, this option has no effect.
Distance metric
The genetic distance between two individuals at a given SNP can be estimated in two ways. One method is to use the absolute value of the difference in the frequency of the first allele (or equivalently, the second allele). For instance, if the first individual has 5 reads of each allele and the second individual has 3 and 6 reads respectively, then the frequencies of the first allele are 5/10 = 1/2 and 3/9 = 1/3, giving a distance of 1/6 at that SNP. The other approach is to use the genotype calls to estimate the difference in the number of occurrences of the first allele. For instance, if the two genotypes are 2 and 1, i.e., the individuals are estimated to have respectively 2 and 1 occurrences of the first allele at this location, then the distance is 1 (the absolute value of the difference of the two numbers).
Output options
The final four options apply mostly to the graphical drawing of the tree, except that the branch lengths are also added to the Newick text file.
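
The coverage and quality thresholds and the two distance metrics described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the Call record and the function names are hypothetical, the record does not reproduce the actual gd_snp column layout, and the tool's own implementation may differ.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Optional, Union


@dataclass(frozen=True)
class Call:
    """One individual's data at one SNP (hypothetical record, not the gd_snp layout)."""
    reads_first: int   # reads supporting the first allele
    reads_second: int  # reads supporting the second allele
    genotype: int      # estimated occurrences of the first allele (0, 1 or 2)
    quality: int       # quality score of the call


def is_informative(a: Call, b: Call, min_coverage: int, min_quality: int) -> bool:
    """True if both individuals meet the coverage and quality thresholds."""
    for call in (a, b):
        # Coverage counts all reads, regardless of allele.
        if call.reads_first + call.reads_second < min_coverage:
            return False
        if call.quality < min_quality:
            return False
    return True


def frequency_distance(a: Call, b: Call) -> Fraction:
    """Absolute difference in the read frequency of the first allele."""
    freq_a = Fraction(a.reads_first, a.reads_first + a.reads_second)
    freq_b = Fraction(b.reads_first, b.reads_first + b.reads_second)
    return abs(freq_a - freq_b)


def genotype_distance(a: Call, b: Call) -> int:
    """Absolute difference in the called number of first alleles."""
    return abs(a.genotype - b.genotype)


def snp_distance(a: Call, b: Call, min_coverage: int = 0, min_quality: int = 0,
                 use_genotypes: bool = False) -> Optional[Union[Fraction, int]]:
    """Distance at one SNP, or None if the SNP is not informative for this pair."""
    if not is_informative(a, b, min_coverage, min_quality):
        return None
    if use_genotypes:
        return genotype_distance(a, b)
    # A frequency is undefined when an individual has no reads at all.
    if 0 in (a.reads_first + a.reads_second, b.reads_first + b.reads_second):
        return None
    return frequency_distance(a, b)


# The worked example from the text: 5 and 5 reads versus 3 and 6 reads.
first = Call(reads_first=5, reads_second=5, genotype=1, quality=40)
second = Call(reads_first=3, reads_second=6, genotype=1, quality=40)
print(frequency_distance(first, second))            # 1/6
print(snp_distance(first, second, min_coverage=7))  # 1/6 (10 and 9 reads both pass)
print(snp_distance(first, second, min_coverage=10)) # None (second has only 9 reads)
```

Using Fraction keeps the frequencies exact, so the 1/6 from the example is reproduced without floating-point rounding.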

Acknowledgments

To convert the distance matrix to a Newick-formatted tree, we use the QuickTree program from http://www.sanger.ac.uk/resources/software/quicktree/ .

To make the diagram we use draw_tree, available at http://compgen.bscb.cornell.edu/phast/ .