diff pca.xml @ 14:8ae67e9fb6ff
Uploaded Miller Lab Devshed version a51c894f5bed again [possible toolshed.g2 bug]
author | miller-lab |
---|---|
date | Fri, 28 Sep 2012 11:35:56 -0400 |
parents | |
children | 95a05c1ef5d5 |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/pca.xml Fri Sep 28 11:35:56 2012 -0400
@@ -0,0 +1,116 @@
+<tool id="gd_pca" name="PCA" version="1.0.0">
+  <description>: Principal Component Analysis of genotype data</description>
+
+  <command interpreter="python">
+    pca.py "$input" "$input.extra_files_path" "$output" "$output.files_path"
+  </command>
+
+  <inputs>
+    <param name="input" type="data" format="gd_ped" label="Dataset" />
+  </inputs>
+
+  <outputs>
+    <data name="output" format="html" />
+  </outputs>
+
+  <!--
+  <tests>
+    <test>
+      <param name="input" value="fake" ftype="gd_ped" >
+        <metadata name="base_name" value="admix" />
+        <composite_data value="test_out/prepare_population_structure/prepare_population_structure.html" />
+        <composite_data value="test_out/prepare_population_structure/admix.ped" />
+        <composite_data value="test_out/prepare_population_structure/admix.map" />
+        <edit_attributes type="name" value="fake" />
+      </param>
+
+      <output name="output" file="test_out/pca/pca.html" ftype="html" compare="diff" lines_diff="2">
+        <extra_files type="file" name="admix.geno" value="test_out/pca/admix.geno" />
+        <extra_files type="file" name="admix.gd_indivs" value="test_out/pca/admix.gd_indivs" />
+        <extra_files type="file" name="admix.gd_snp" value="test_out/pca/admix.gd_snp" />
+        <extra_files type="file" name="coordinates.txt" value="test_out/pca/coordinates.txt" />
+        <extra_files type="file" name="explained.txt" value="test_out/pca/explained.txt" />
+        <extra_files type="file" name="par.admix" value="test_out/pca/par.admix" compare="diff" lines_diff="10" />
+        <extra_files type="file" name="PCA.pdf" value="test_out/pca/PCA.pdf" compare="sim_size" delta="1000" />
+      </output>
+
+    </test>
+  </tests>
+  -->
+
+  <help>
+
+**Dataset formats**
+
+The input dataset is in gd_ped_ format.
+The output dataset is html_ with links to a pdf for a graphical output and
+text files. (`Dataset missing?`_)
+
+.. _gd_ped: ./static/formatHelp.html#gd_ped
+.. _html: ./static/formatHelp.html#html
+.. _Dataset missing?: ./static/formatHelp.html
+
+-----
+
+**What it does**
+
+The user selects a gd_ped dataset generated by the Prepare Input tool.
+The PCA tool runs a
+Principal Component Analysis on the input genotype data and constructs
+a plot of the top two principal components. It also reports the
+following estimates of the statistical significance of the analysis.
+
+1. Average divergence between each pair of populations. Specifically,
+from the covariance matrix X whose eigenvectors were computed, we can
+compute a "distance", d, for each pair of individuals (i,j): d(i,j) =
+X(i,i) + X(j,j) - 2X(i,j). For each pair of populations (a,b) now
+define an average distance: D(a,b) = \sum d(i,j) (i in pop a, j in pop b)
+/ (\|pop a\| * \|pop b\|). We then normalize D so that the diagonal
+has mean 1 and report it.
+
+2. Anova statistics for population differences along each
+eigenvector. For each eigenvector, a P-value for statistical
+significance of differences between each pair of populations along
+that eigenvector is printed. +++ is used to highlight P-values less
+than 1e-06. \*\*\* is used to highlight P-values between 1e-06 and
+1e-03. If there are more than 2 populations, then an overall P-value
+is also printed for that eigenvector, as are the populations with
+minimum (minv) and maximum (maxv) eigenvector coordinate. [If there is
+only 1 population, no Anova statistics are printed.]
+
+3. Statistical significance of differences between populations. For
+each pair of populations, the above Anova statistics are summed across
+eigenvectors. The result is approximately chisq with d.o.f. equal to
+the number of eigenvectors. The chisq statistic and its p-value are
+printed. [If there is only 1 population, no statistics are printed.]
+
+We post-process the output of the PCA tool to estimate "admixture
+fractions". For this, we take three populations at a time and
+determine each one's average point in the PCA plot (by separately
+averaging first and second coordinates). For each combination of two
+center points, modeling two ancestral populations, we try to model the
+third central point as having a certain fraction, r, of its SNP
+genotypes from the second ancestral population and the remainder from
+the first ancestral population, where we estimate r. The output file
+"coordinates.txt" then contains pairs of lines like
+
+projection along chord Population1 -> Population2
+    Population3: 0.12345
+
+where the number (in this case 0.12345) is the estimate of r.
+Computations with simulated data suggest that the true r is
+systematically underestimated, with the estimate roughly 0.6 times the true r.
+
+-----
+
+**Acknowledgments**
+
+We use the programs "smartpca" and "ploteig" downloaded from
+
+http://genepath.med.harvard.edu/~reich/Software.htm
+
+and described in the paper "Population structure and eigenanalysis"
+by Nick Patterson, Alkes L. Price, and David Reich, PLoS Genetics, 2 (2006), e190.
+
+  </help>
+</tool>
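
The average-divergence statistic described in item 1 of the help text can be sketched in a few lines of Python. This is only an illustration of the stated formulas, not the code that smartpca or pca.py runs. The function name `average_divergence`, the list-of-lists matrix layout, and the `populations` mapping are all hypothetical, and the sketch assumes that the diagonal entries D(a,a) average over every ordered pair of individuals, including i = j, which the help text does not specify.

```python
def average_divergence(X, populations):
    """Return the normalized average divergence D between populations.

    X           -- square covariance matrix as a list of lists (assumed layout)
    populations -- dict mapping population name -> list of individual indices
    """
    names = sorted(populations)

    def d(i, j):
        # Pairwise "distance" from the help text: d(i,j) = X(i,i) + X(j,j) - 2X(i,j)
        return X[i][i] + X[j][j] - 2 * X[i][j]

    D = {}
    for a in names:
        for b in names:
            # Sum d(i,j) over i in pop a and j in pop b, then divide by |a| * |b|.
            total = sum(d(i, j) for i in populations[a] for j in populations[b])
            D[a, b] = total / (len(populations[a]) * len(populations[b]))

    # Normalize so that the diagonal entries D(a,a) have mean 1.
    diag_mean = sum(D[a, a] for a in names) / len(names)
    if diag_mean == 0:
        raise ValueError("diagonal mean is zero; cannot normalize")
    return {pair: value / diag_mean for pair, value in D.items()}


# Toy input: identity covariance matrix, two populations of two individuals each.
X = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]
result = average_divergence(X, {"A": [0, 1], "B": [2, 3]})
```

With this toy input, every pair of distinct individuals has d = 2, so the within-population entries normalize to 1 and the between-population entry to 2.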
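
The "projection along chord" value written to coordinates.txt can be illustrated with a small geometric sketch. The help text does not give the exact formula, so this assumes an orthogonal projection of the third population's center point onto the line through the first two center points, expressed as a fraction of the chord length. The helper names `centroid` and `chord_fraction` are hypothetical and are not taken from pca.py.

```python
def centroid(points):
    """Average the first and second PCA coordinates separately."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def chord_fraction(p1, p2, p3):
    """Fraction r along the chord p1 -> p2 at which p3 projects.

    Assumption: orthogonal projection. r = 0 places p3 at p1 (the first
    ancestral population) and r = 1 places it at p2 (the second).
    """
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        raise ValueError("the two ancestral center points coincide")
    # Dot product of (p3 - p1) with the chord, divided by the squared chord length.
    return ((p3[0] - p1[0]) * dx + (p3[1] - p1[1]) * dy) / length_sq


# Toy coordinates on the top two principal components.
pop1 = centroid([(-1.0, 0.0), (1.0, 0.0)])   # center at (0, 0)
pop2 = centroid([(3.0, 0.0), (5.0, 0.0)])    # center at (4, 0)
pop3 = centroid([(1.0, 2.0), (1.0, 4.0)])    # center at (1, 3)
r = chord_fraction(pop1, pop2, pop3)
```

Here `r` comes out as 0.25, meaning the third center point projects a quarter of the way from Population1 toward Population2 under the assumed projection.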