diff pca.xml @ 14:8ae67e9fb6ff

Uploaded Miller Lab Devshed version a51c894f5bed again [possible toolshed.g2 bug]
author miller-lab
date Fri, 28 Sep 2012 11:35:56 -0400
parents
children 95a05c1ef5d5
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/pca.xml	Fri Sep 28 11:35:56 2012 -0400
@@ -0,0 +1,116 @@
+<tool id="gd_pca" name="PCA" version="1.0.0">
+  <description>: Principal Component Analysis of genotype data</description>
+
+  <command interpreter="python">
+    pca.py "$input" "$input.extra_files_path" "$output" "$output.files_path"
+  </command>
+
+  <inputs>
+    <param name="input" type="data" format="gd_ped" label="Dataset" />
+  </inputs>
+
+  <outputs>
+    <data name="output" format="html" />
+  </outputs>
+
+  <!--
+  <tests>
+    <test>
+      <param name="input" value="fake" ftype="gd_ped" >
+        <metadata name="base_name" value="admix" />
+        <composite_data value="test_out/prepare_population_structure/prepare_population_structure.html" />
+        <composite_data value="test_out/prepare_population_structure/admix.ped" />
+        <composite_data value="test_out/prepare_population_structure/admix.map" />
+        <edit_attributes type="name" value="fake" />
+      </param>
+
+      <output name="output" file="test_out/pca/pca.html" ftype="html" compare="diff" lines_diff="2">
+        <extra_files type="file" name="admix.geno" value="test_out/pca/admix.geno" />
+        <extra_files type="file" name="admix.gd_indivs" value="test_out/pca/admix.gd_indivs" />
+        <extra_files type="file" name="admix.gd_snp" value="test_out/pca/admix.gd_snp" />
+        <extra_files type="file" name="coordinates.txt" value="test_out/pca/coordinates.txt" />
+        <extra_files type="file" name="explained.txt" value="test_out/pca/explained.txt" />
+        <extra_files type="file" name="par.admix" value="test_out/pca/par.admix" compare="diff" lines_diff="10" />
+        <extra_files type="file" name="PCA.pdf" value="test_out/pca/PCA.pdf" compare="sim_size" delta = "1000" />
+      </output>
+      
+    </test>
+  </tests>
+  -->
+
+  <help>
+
+**Dataset formats**
+
+The input dataset is in gd_ped_ format.
+The output dataset is html_ with links to a PDF of the plot and to
+text files.  (`Dataset missing?`_)
+
+.. _gd_ped: ./static/formatHelp.html#gd_ped
+.. _html: ./static/formatHelp.html#html
+.. _Dataset missing?: ./static/formatHelp.html
+
+-----
+
+**What it does**
+
+The user selects a gd_ped dataset generated by the Prepare Input tool.
+The PCA tool runs a
+Principal Component Analysis on the input genotype data and constructs
+a plot of the top two principal components. It also reports the
+following estimates of the statistical significance of the analysis.
+
+1. Average divergence between each pair of populations.  Specifically,
+from the covariance matrix X whose eigenvectors were computed, we can
+compute a "distance", d, for each pair of individuals (i,j): d(i,j) =
+X(i,i) + X(j,j) - 2X(i,j).  For each pair of populations (a,b), we
+define an average distance: D(a,b) = \sum d(i,j) / (\|pop a\| * \|pop b\|),
+where the sum runs over i in pop a and j in pop b.  Finally, we normalize
+D so that its diagonal has mean 1 and report it.
+
+2. Anova statistics for population differences along each
+eigenvector. For each eigenvector, a P-value for statistical
+significance of differences between each pair of populations along
+that eigenvector is printed.  +++ is used to highlight P-values less
+than 1e-06.  \*\*\* is used to highlight P-values between 1e-06 and
+1e-03.  If there are more than 2 populations, then an overall P-value
+is also printed for that eigenvector, as are the populations with
+minimum (minv) and maximum (maxv) eigenvector coordinate. [If there is
+only 1 population, no Anova statistics are printed.]
+
+3. Statistical significance of differences between populations. For
+each pair of populations, the above Anova statistics are summed across
+eigenvectors. The result is approximately chi-squared distributed with
+degrees of freedom equal to the number of eigenvectors. The statistic and
+its P-value are printed. [With only 1 population, no statistics are printed.]
+
+We post-process the output of the PCA tool to estimate "admixture
+fractions".  For this, we take three populations at a time and
+determine each one's average point in the PCA plot (by separately
+averaging first and second coordinates).  For each combination of two
+center points, representing two ancestral populations, we model the
+third center point as having a certain fraction, r, of its SNP
+genotypes from the second ancestral population and the remainder from
+the first ancestral population, and we estimate r.  The output file
+"coordinates.txt" then contains pairs of lines like
+
+projection along chord Population1 -> Population2
+  Population3: 0.12345
+
+where the number (in this case 0.12345) is the estimate of r.
+Computations with simulated data suggest that the true r is
+systematically underestimated, with the estimate roughly 0.6 times r.
+
+-----
+
+**Acknowledgments**
+
+We use the programs "smartpca" and "ploteig" downloaded from
+
+http://genepath.med.harvard.edu/~reich/Software.htm
+
+and described in the paper "Population structure and eigenanalysis"
+by Nick Patterson, Alkes L. Price, and David Reich, PLoS Genetics, 2 (2006), e190.
+
+  </help>
+</tool>
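
Note on the help text above: the divergence described in item 1 and the chord projection behind "coordinates.txt" can be illustrated with a small Python sketch. This is not the smartpca or pca.py implementation. The function names (pair_distance, population_divergence, centroid, chord_fraction), the dict-of-index-lists input, the inclusion of i = j pairs in within-population averages, and the reading of "projection along chord" as an orthogonal projection of the third center point onto the line through the other two are all assumptions made for illustration.

```python
from itertools import product


def pair_distance(X, i, j):
    """d(i,j) = X(i,i) + X(j,j) - 2X(i,j) for a covariance matrix X."""
    return X[i][i] + X[j][j] - 2.0 * X[i][j]


def population_divergence(X, pops):
    """Return D(a,b) for every ordered pair of populations.

    X is a square covariance matrix given as a list of lists.
    pops maps a population name to the list of row indices of its members
    (hypothetical input layout, chosen for this sketch).
    """
    names = sorted(pops)
    D = {}
    for a, b in product(names, repeat=2):
        # Sum d(i,j) over i in pop a and j in pop b, then average.
        total = sum(pair_distance(X, i, j) for i in pops[a] for j in pops[b])
        D[a, b] = total / (len(pops[a]) * len(pops[b]))
    # Normalize so that the diagonal entries D(a,a) have mean 1.
    # Assumption: the diagonal mean is nonzero (it is zero if every
    # population has a single member, which this sketch does not handle).
    diag_mean = sum(D[a, a] for a in names) / len(names)
    return {key: value / diag_mean for key, value in D.items()}


def centroid(points):
    """Average the first and second PCA coordinates separately."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def chord_fraction(pop1, pop2, pop3):
    """Estimate r by projecting pop3's center onto the chord pop1 -> pop2.

    Each argument is a list of (pc1, pc2) coordinates. The result is 0 at
    pop1's center and 1 at pop2's center. The centers of pop1 and pop2 must
    differ, otherwise the chord has zero length.
    """
    c1, c2, c3 = centroid(pop1), centroid(pop2), centroid(pop3)
    chord = (c2[0] - c1[0], c2[1] - c1[1])
    offset = (c3[0] - c1[0], c3[1] - c1[1])
    length_sq = chord[0] ** 2 + chord[1] ** 2
    return (offset[0] * chord[0] + offset[1] * chord[1]) / length_sq
```

For example, three populations whose center points are (0, 1), (4, 1) and (1, 6) give chord_fraction = 0.25 under this reading: the third center lies a quarter of the way along the chord once its perpendicular offset is ignored.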