comparison pca.xml @ 14:8ae67e9fb6ff

Uploaded Miller Lab Devshed version a51c894f5bed again [possible toolshed.g2 bug]
author miller-lab
date Fri, 28 Sep 2012 11:35:56 -0400
parents
children 95a05c1ef5d5
13:fdb4240fb565 14:8ae67e9fb6ff
<tool id="gd_pca" name="PCA" version="1.0.0">
  <description>: Principal Component Analysis of genotype data</description>

  <command interpreter="python">
    pca.py "$input" "$input.extra_files_path" "$output" "$output.files_path"
  </command>

  <inputs>
    <param name="input" type="data" format="gd_ped" label="Dataset" />
  </inputs>

  <outputs>
    <data name="output" format="html" />
  </outputs>

  <!--
  <tests>
    <test>
      <param name="input" value="fake" ftype="gd_ped">
        <metadata name="base_name" value="admix" />
        <composite_data value="test_out/prepare_population_structure/prepare_population_structure.html" />
        <composite_data value="test_out/prepare_population_structure/admix.ped" />
        <composite_data value="test_out/prepare_population_structure/admix.map" />
        <edit_attributes type="name" value="fake" />
      </param>

      <output name="output" file="test_out/pca/pca.html" ftype="html" compare="diff" lines_diff="2">
        <extra_files type="file" name="admix.geno" value="test_out/pca/admix.geno" />
        <extra_files type="file" name="admix.gd_indivs" value="test_out/pca/admix.gd_indivs" />
        <extra_files type="file" name="admix.gd_snp" value="test_out/pca/admix.gd_snp" />
        <extra_files type="file" name="coordinates.txt" value="test_out/pca/coordinates.txt" />
        <extra_files type="file" name="explained.txt" value="test_out/pca/explained.txt" />
        <extra_files type="file" name="par.admix" value="test_out/pca/par.admix" compare="diff" lines_diff="10" />
        <extra_files type="file" name="PCA.pdf" value="test_out/pca/PCA.pdf" compare="sim_size" delta="1000" />
      </output>

    </test>
  </tests>
  -->

  <help>

**Dataset formats**

The input dataset is in gd_ped_ format.
The output dataset is html_ with links to a PDF of the plot and to
text files. (`Dataset missing?`_)

.. _gd_ped: ./static/formatHelp.html#gd_ped
.. _html: ./static/formatHelp.html#html
.. _Dataset missing?: ./static/formatHelp.html

-----

**What it does**

The user selects a gd_ped dataset generated by the Prepare Input tool.
The PCA tool runs a Principal Component Analysis on the input genotype
data and constructs a plot of the top two principal components. It also
reports the following estimates of the statistical significance of the
analysis.

1. Average divergence between each pair of populations. Specifically,
from the covariance matrix X whose eigenvectors were computed, we can
compute a "distance", d, for each pair of individuals (i,j): d(i,j) =
X(i,i) + X(j,j) - 2X(i,j). For each pair of populations (a,b) now
define an average distance: D(a,b) = \sum d(i,j) (i in pop a, j in pop b)
/ (\|pop a\| * \|pop b\|). We then normalize D so that the diagonal
has mean 1 and report it.

2. ANOVA statistics for population differences along each
eigenvector. For each eigenvector, a P-value for statistical
significance of differences between each pair of populations along
that eigenvector is printed. +++ is used to highlight P-values less
than 1e-06. \*\*\* is used to highlight P-values between 1e-06 and
1e-03. If there are more than 2 populations, then an overall P-value
is also printed for that eigenvector, as are the populations with
minimum (minv) and maximum (maxv) eigenvector coordinate. [If there is
only 1 population, no ANOVA statistics are printed.]

3. Statistical significance of differences between populations. For
each pair of populations, the above ANOVA statistics are summed across
eigenvectors. The result is approximately chisq with d.o.f. equal to
the number of eigenvectors. The chisq statistic and its P-value are
printed. [If there is only 1 population, no statistics are printed.]

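The divergence in item 1 can be sketched in a few lines of Python. This is
only an illustration of the formulas as stated above, not the smartpca
implementation: the function names and the dict-based layout are hypothetical,
and it assumes that within-population sums include the i = j pairs (for which
d is zero).

```python
from itertools import product


def pair_distance(X, i, j):
    """d(i,j) = X(i,i) + X(j,j) - 2X(i,j) for a covariance matrix X."""
    return X[i][i] + X[j][j] - 2 * X[i][j]


def population_divergence(X, pops):
    """Return the normalized D as a dict keyed by (a, b) population pairs.

    X    : square covariance matrix as a list of lists
    pops : dict mapping a population name to a list of individual indices
    """
    names = sorted(pops)
    D = {}
    for a, b in product(names, repeat=2):
        # Assumption: every (i, j) pair is summed, including i == j.
        total = sum(pair_distance(X, i, j) for i in pops[a] for j in pops[b])
        D[(a, b)] = total / (len(pops[a]) * len(pops[b]))
    # Normalize so that the diagonal entries D(a,a) have mean 1.
    diag_mean = sum(D[(a, a)] for a in names) / len(names)
    if diag_mean == 0:
        raise ValueError("diagonal of D is zero; cannot normalize")
    return {pair: value / diag_mean for pair, value in D.items()}


# Toy example: identity covariance, two populations of two individuals.
X = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
D = population_divergence(X, {"A": [0, 1], "B": [2, 3]})
print(D[("A", "A")], D[("A", "B")])  # prints 1.0 2.0
```

In the toy example the between-population divergence is twice the
within-population divergence, which is what the normalized value 2.0 reports.
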
We post-process the output of the PCA tool to estimate "admixture
fractions". For this, we take three populations at a time and
determine each one's average point in the PCA plot (by separately
averaging first and second coordinates). For each combination of two
center points, modeling two ancestral populations, we try to model the
third center point as having a certain fraction, r, of its SNP
genotypes from the second ancestral population and the remainder from
the first ancestral population, where we estimate r. The output file
"coordinates.txt" then contains pairs of lines like

    projection along chord Population1 -> Population2
    Population3: 0.12345

where the number (in this case 0.12345) is the estimate of r.
Computations with simulated data suggest that the true r is
systematically underestimated, with the estimate perhaps roughly
0.6 times the true r.

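One way to picture the "projection along chord" estimate is the sketch below.
It assumes that r is the scalar projection of the third center point onto the
chord between the two ancestral center points, measured as a fraction of the
chord length; the description above does not spell out the exact formula, so
this is an assumption, and the function names are hypothetical.

```python
def centroid(points):
    """Average the first and second PCA coordinates separately."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def chord_fraction(pop1, pop2, pop3):
    """Estimate r for pop3 along the chord from pop1's center to pop2's.

    Each argument is a list of (pc1, pc2) coordinates for one population.
    r = 0 places pop3 at pop1's center and r = 1 at pop2's center.
    """
    c1, c2, c3 = centroid(pop1), centroid(pop2), centroid(pop3)
    chord = (c2[0] - c1[0], c2[1] - c1[1])
    offset = (c3[0] - c1[0], c3[1] - c1[1])
    length_sq = chord[0] ** 2 + chord[1] ** 2
    if length_sq == 0:
        raise ValueError("the two ancestral center points coincide")
    # Assumed formula: orthogonal projection of the offset onto the chord.
    return (offset[0] * chord[0] + offset[1] * chord[1]) / length_sq


# Toy coordinates, made up for illustration only.
pop1 = [(0.0, 0.0), (0.0, 2.0)]
pop2 = [(4.0, 0.0), (4.0, 2.0)]
pop3 = [(1.0, 3.0), (1.0, 5.0)]
print(chord_fraction(pop1, pop2, pop3))  # prints 0.25
```

Because only the component along the chord is used, a third population that
sits off the line between the two ancestral centers still receives a value of
r, so the number is best read together with the PCA plot.
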
-----

**Acknowledgments**

We use the programs "smartpca" and "ploteig" downloaded from

http://genepath.med.harvard.edu/~reich/Software.htm

and described in the paper "Population structure and eigenanalysis"
by Nick Patterson, Alkes L. Price, and David Reich, PLoS Genetics, 2 (2006), e190.

  </help>
</tool>