annotate pca.xml @ 14:8ae67e9fb6ff

Uploaded Miller Lab Devshed version a51c894f5bed again [possible toolshed.g2 bug]
author miller-lab
date Fri, 28 Sep 2012 11:35:56 -0400
parents
children 95a05c1ef5d5
<tool id="gd_pca" name="PCA" version="1.0.0">
  <description>: Principal Component Analysis of genotype data</description>

  <command interpreter="python">
    pca.py "$input" "$input.extra_files_path" "$output" "$output.files_path"
  </command>

  <inputs>
    <param name="input" type="data" format="gd_ped" label="Dataset" />
  </inputs>

  <outputs>
    <data name="output" format="html" />
  </outputs>

  <!--
  <tests>
    <test>
      <param name="input" value="fake" ftype="gd_ped" >
        <metadata name="base_name" value="admix" />
        <composite_data value="test_out/prepare_population_structure/prepare_population_structure.html" />
        <composite_data value="test_out/prepare_population_structure/admix.ped" />
        <composite_data value="test_out/prepare_population_structure/admix.map" />
        <edit_attributes type="name" value="fake" />
      </param>

      <output name="output" file="test_out/pca/pca.html" ftype="html" compare="diff" lines_diff="2">
        <extra_files type="file" name="admix.geno" value="test_out/pca/admix.geno" />
        <extra_files type="file" name="admix.gd_indivs" value="test_out/pca/admix.gd_indivs" />
        <extra_files type="file" name="admix.gd_snp" value="test_out/pca/admix.gd_snp" />
        <extra_files type="file" name="coordinates.txt" value="test_out/pca/coordinates.txt" />
        <extra_files type="file" name="explained.txt" value="test_out/pca/explained.txt" />
        <extra_files type="file" name="par.admix" value="test_out/pca/par.admix" compare="diff" lines_diff="10" />
        <extra_files type="file" name="PCA.pdf" value="test_out/pca/PCA.pdf" compare="sim_size" delta="1000" />
      </output>

    </test>
  </tests>
  -->

  <help>

**Dataset formats**

The input dataset is in gd_ped_ format.
The output dataset is html_ with links to a PDF of the plot and to
text files. (`Dataset missing?`_)

.. _gd_ped: ./static/formatHelp.html#gd_ped
.. _html: ./static/formatHelp.html#html
.. _Dataset missing?: ./static/formatHelp.html

-----

**What it does**

The user selects a gd_ped dataset generated by the Prepare Input tool.
The PCA tool runs a Principal Component Analysis on the input genotype
data and constructs a plot of the top two principal components. It also
reports the following estimates of the statistical significance of the
analysis.

1. Average divergence between each pair of populations. Specifically,
   from the covariance matrix X whose eigenvectors were computed, we
   compute a "distance", d, for each pair of individuals (i,j): d(i,j) =
   X(i,i) + X(j,j) - 2X(i,j). For each pair of populations (a,b), we then
   define an average distance: D(a,b) = \sum d(i,j) (i in pop a, j in pop b)
   / (\|pop a\| * \|pop b\|). We then normalize D so that the diagonal
   has mean 1 and report it.

2. ANOVA statistics for population differences along each
   eigenvector. For each eigenvector, a P-value for the statistical
   significance of differences between each pair of populations along
   that eigenvector is printed. +++ is used to highlight P-values less
   than 1e-06. \*\*\* is used to highlight P-values between 1e-06 and
   1e-03. If there are more than 2 populations, then an overall P-value
   is also printed for that eigenvector, as are the populations with
   minimum (minv) and maximum (maxv) eigenvector coordinate. [If there is
   only 1 population, no ANOVA statistics are printed.]

3. Statistical significance of differences between populations. For
   each pair of populations, the above ANOVA statistics are summed across
   eigenvectors. The result is approximately chi-square distributed, with
   degrees of freedom equal to the number of eigenvectors. The chi-square
   statistic and its P-value are printed. [If there is only 1 population,
   no statistics are printed.]
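
The average divergence in item 1 can be illustrated with a minimal Python sketch. This is not the code that smartpca runs; the function name ``population_divergence`` and the input layout (X as a list of rows, one population label per individual) are assumptions made for illustration, and the sketch includes the i = j terms when averaging within a population.

```python
def population_divergence(X, labels):
    """Return normalized D(a,b) for every pair of population labels.

    X      : square covariance matrix as a list of rows (assumed layout)
    labels : population label of each individual, in the same order as X
    """
    n = len(labels)
    # d(i,j) = X(i,i) + X(j,j) - 2X(i,j)
    d = [[X[i][i] + X[j][j] - 2.0 * X[i][j] for j in range(n)]
         for i in range(n)]

    pops = sorted(set(labels))
    members = {p: [i for i in range(n) if labels[i] == p] for p in pops}

    # D(a,b) = sum of d(i,j) over i in pop a, j in pop b, divided by |a| * |b|
    D = {}
    for a in pops:
        for b in pops:
            total = sum(d[i][j] for i in members[a] for j in members[b])
            D[(a, b)] = total / (len(members[a]) * len(members[b]))

    # Normalize so that the diagonal entries D(a,a) have mean 1.
    diag_mean = sum(D[(p, p)] for p in pops) / len(pops)
    if diag_mean == 0:
        raise ValueError("diagonal of D is zero; cannot normalize")
    return {pair: value / diag_mean for pair, value in D.items()}
```

For example, with two individuals per population at one-dimensional coordinates 0, 2 (population A) and 10, 12 (population B), and X(i,j) taken as the product of the coordinates, the sketch gives D(A,A) = D(B,B) = 1 and D(A,B) = 51.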

We post-process the output of the PCA tool to estimate "admixture
fractions". For this, we take three populations at a time and
determine each one's average point in the PCA plot (by separately
averaging first and second coordinates). For each combination of two
center points, modeling two ancestral populations, we try to model the
third center point as having a certain fraction, r, of its SNP
genotypes from the second ancestral population and the remainder from
the first ancestral population, where we estimate r. The output file
"coordinates.txt" then contains pairs of lines like

projection along chord Population1 -> Population2
Population3: 0.12345

where the number (in this case 0.12345) is the estimate of r.
Computations with simulated data suggest that the true r is
systematically underestimated, perhaps giving roughly 0.6 times r.
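
One plausible reading of "projection along chord" is sketched below in Python: r is taken as the position of the third center point when projected orthogonally onto the line from the first center to the second. This is an assumption for illustration, not the tool's actual post-processing code, and the names ``centroid`` and ``chord_fraction`` are hypothetical.

```python
def centroid(points):
    """Average the first and second PCA coordinates separately."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def chord_fraction(pop1, pop2, pop3):
    """Estimate r for pop3 along the chord from pop1's center to pop2's.

    Each argument is a list of (pc1, pc2) coordinates. r = 0 places the
    third center at pop1's center and r = 1 places it at pop2's center.
    """
    c1, c2, c3 = centroid(pop1), centroid(pop2), centroid(pop3)
    chord = (c2[0] - c1[0], c2[1] - c1[1])
    offset = (c3[0] - c1[0], c3[1] - c1[1])
    length_sq = chord[0] ** 2 + chord[1] ** 2
    if length_sq == 0:
        raise ValueError("the two ancestral centers coincide")
    # Scalar projection of the offset onto the chord, as a fraction of it.
    return (offset[0] * chord[0] + offset[1] * chord[1]) / length_sq
```

With centers at (0, 1), (4, 1) and (1, 6), the sketch returns r = 0.25: only the displacement along the chord contributes, and the displacement perpendicular to it is ignored.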

-----

**Acknowledgments**

We use the programs "smartpca" and "ploteig" downloaded from

http://genepath.med.harvard.edu/~reich/Software.htm

and described in the paper "Population structure and eigenanalysis"
by Nick Patterson, Alkes L. Price, and David Reich, PLoS Genetics, 2 (2006), e190.

  </help>
</tool>