genome_diversity: pca.xml comparison

comparison pca.xml @ 14:8ae67e9fb6ff

Uploaded Miller Lab Devshed version a51c894f5bed again [possible toolshed.g2 bug]

author	miller-lab
date	Fri, 28 Sep 2012 11:35:56 -0400
parents
children	95a05c1ef5d5

-:fdb4240fb565
+:8ae67e9fb6ff
+<tool id="gd_pca" name="PCA" version="1.0.0">
+<description>: Principal Component Analysis of genotype data</description>
+<command interpreter="python">
+pca.py "$input" "$input.extra_files_path" "$output" "$output.files_path"
+</command>
+<inputs>
+<param name="input" type="data" format="gd_ped" label="Dataset" />
+</inputs>
+<outputs>
+<data name="output" format="html" />
+</outputs>
+<!--
+<tests>
+<test>
+<param name="input" value="fake" ftype="gd_ped" >
+<metadata name="base_name" value="admix" />
+<composite_data value="test_out/prepare_population_structure/prepare_population_structure.html" />
+<composite_data value="test_out/prepare_population_structure/admix.ped" />
+<composite_data value="test_out/prepare_population_structure/admix.map" />
+<edit_attributes type="name" value="fake" />
+</param>
+<output name="output" file="test_out/pca/pca.html" ftype="html" compare="diff" lines_diff="2">
+<extra_files type="file" name="admix.geno" value="test_out/pca/admix.geno" />
+<extra_files type="file" name="admix.gd_indivs" value="test_out/pca/admix.gd_indivs" />
+<extra_files type="file" name="admix.gd_snp" value="test_out/pca/admix.gd_snp" />
+<extra_files type="file" name="coordinates.txt" value="test_out/pca/coordinates.txt" />
+<extra_files type="file" name="explained.txt" value="test_out/pca/explained.txt" />
+<extra_files type="file" name="par.admix" value="test_out/pca/par.admix" compare="diff" lines_diff="10" />
+<extra_files type="file" name="PCA.pdf" value="test_out/pca/PCA.pdf" compare="sim_size" delta = "1000" />
+</output>
+</test>
+</tests>
+-->
+<help>
+**Dataset formats**
+The input dataset is in gd_ped_ format.
+The output dataset is html_ with links to a pdf for a graphical output and
+text files.  (`Dataset missing?`_)
+.. _gd_ped: ./static/formatHelp.html#gd_ped
+.. _html: ./static/formatHelp.html#html
+.. _Dataset missing?: ./static/formatHelp.html
+-----
+**What it does**
+The user selects a gd_ped dataset generated by the Prepare Input tool.
+The PCA tool runs a
+Principal Component Analysis on the input genotype data and constructs
+a plot of the top two principal components. It also reports the
+following estimates of the statistical significance of the analysis.
+1. Average divergence between each pair of populations.  Specifically,
+from the covariance matrix X whose eigenvectors were computed, we can
+compute a "distance", d, for each pair of individuals (i,j): d(i,j) =
+X(i,i) + X(j,j) - 2X(i,j).  For each pair of populations (a,b) we
+define an average distance: D(a,b) = \sum d(i,j) (i in pop a, j in pop b)
+/ (\|pop a\| * \|pop b\|).  We then normalize D so that the diagonal
+has mean 1 and report it.
+2. Anova statistics for population differences along each
+eigenvector. For each eigenvector, a P-value for statistical
+significance of differences between each pair of populations along
+that eigenvector is printed.  +++ is used to highlight P-values less
+than 1e-06.  \*\*\* is used to highlight P-values between 1e-06 and
+1e-03.  If there are more than 2 populations, then an overall P-value
+is also printed for that eigenvector, as are the populations with
+minimum (minv) and maximum (maxv) eigenvector coordinate. [If there is
+only 1 population, no Anova statistics are printed.]
+3. Statistical significance of differences between populations. For
+each pair of populations, the above Anova statistics are summed across
+eigenvectors. The result is approximately chi-square distributed, with
+degrees of freedom equal to the number of eigenvectors. The chi-square
+statistic and its P-value are printed. [If there is only 1 population,
+no statistics are printed.]
+We post-process the output of the PCA tool to estimate "admixture
+fractions".  For this, we take three populations at a time and
+determine each one's average point in the PCA plot (by separately
+averaging first and second coordinates).  For each combination of two
+center points, modeling two ancestral populations, we try to model the
+third central point as having a certain fraction, r, of its SNP
+genotypes from the second ancestral population and the remainder from
+the first ancestral population, where we estimate r.  The output file
+"coordinates.txt" then contains pairs of lines like
+projection along chord Population1 -> Population2
+Population3: 0.12345
+where the number (in this case 0.12345) is the estimate of r.
+Computations with simulated data suggest that the true r is
+systematically underestimated, with the estimate perhaps coming out at
+roughly 0.6 times the true r.
+-----
+**Acknowledgments**
+We use the programs "smartpca" and "ploteig" downloaded from
+http://genepath.med.harvard.edu/~reich/Software.htm
+and described in the paper "Population structure and eigenanalysis"
+by Nick Patterson, Alkes L. Price, and David Reich, PLoS Genetics, 2 (2006), e190.
+</help>
+</tool>
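
The average-divergence calculation described in item 1 of the help text can be sketched in plain Python as follows. This is an illustration only, not code from `pca.py` or smartpca: the function name `average_divergence`, the list-of-lists matrix layout, and the toy data are all assumptions, and the sum is taken over every (i, j) pair exactly as the formula is written, including i = j within a population.

```python
def average_divergence(X, pops):
    """Normalized average divergence D(a, b) between populations.

    X    : square covariance matrix as a list of lists (hypothetical layout)
    pops : population label for each individual, in the same order as X
    """
    labels = sorted(set(pops))
    members = {p: [i for i, q in enumerate(pops) if q == p] for p in labels}

    D = {}
    for a in labels:
        for b in labels:
            # d(i,j) = X(i,i) + X(j,j) - 2X(i,j), summed over i in a, j in b
            total = sum(X[i][i] + X[j][j] - 2 * X[i][j]
                        for i in members[a] for j in members[b])
            D[a, b] = total / (len(members[a]) * len(members[b]))

    # Normalize so that the diagonal entries D(a, a) have mean 1.
    # This fails if every population has a single individual (diagonal all 0).
    diag_mean = sum(D[a, a] for a in labels) / len(labels)
    return {pair: value / diag_mean for pair, value in D.items()}


# Toy data: one coordinate per individual, X(i,j) = x_i * x_j,
# so d(i,j) reduces to the squared difference (x_i - x_j)^2.
coords = [0.0, 1.0, 3.0, 4.0]
X = [[u * v for v in coords] for u in coords]
D = average_divergence(X, ["A", "A", "B", "B"])
```

With this toy input the within-population entries normalize to 1 and the between-population entry is much larger, which is the pattern the report is meant to expose.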

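
The admixture-fraction post-processing can be pictured with a similar sketch. It assumes that "projection along chord" means the scalar projection of the third population's center onto the segment joining the two ancestral centers, expressed as a fraction of that segment's length; the help text does not spell out the formula, so this reading, the function names `centroid` and `chord_fraction`, and the sample points are all hypothetical.

```python
def centroid(points):
    """Average the first and second PCA coordinates separately."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def chord_fraction(c1, c2, c3):
    """Fraction r of the way from c1 to c2 at which c3 projects onto the chord."""
    dx, dy = c2[0] - c1[0], c2[1] - c1[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        raise ValueError("ancestral population centers coincide")
    # Dot product of (c3 - c1) with the chord, divided by the chord length squared.
    return ((c3[0] - c1[0]) * dx + (c3[1] - c1[1]) * dy) / length_sq


# Hypothetical (PC1, PC2) coordinates for three populations.
pop1 = [(0.0, 0.0), (0.0, 2.0)]
pop2 = [(4.0, 0.0), (4.0, 2.0)]
pop3 = [(1.0, 3.0), (1.0, 5.0)]

r = chord_fraction(centroid(pop1), centroid(pop2), centroid(pop3))
print("projection along chord Population1 -> Population2")
print(f"Population3: {r:.5f}")
```

Under this reading, r = 0 places the third population at the first ancestral center and r = 1 at the second; values outside that range are possible when the third center lies beyond either end of the chord.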