
<tool id="gd_pca" name="PCA" version="1.0.0">
  <description>: Principal Components Analysis of genotype data</description>

  <command interpreter="python">
    pca.py "$input" "$input.extra_files_path" "$output" "$output.files_path"
  </command>

  <inputs>
    <param name="input" type="data" format="gd_ped" label="Dataset" />
  </inputs>

  <outputs>
    <data name="output" format="html" />
  </outputs>

  <!--
  <tests>
    <test>
      <param name="input" value="fake" ftype="gd_ped" >
        <metadata name="base_name" value="admix" />
        <composite_data value="test_out/prepare_population_structure/prepare_population_structure.html" />
        <composite_data value="test_out/prepare_population_structure/admix.ped" />
        <composite_data value="test_out/prepare_population_structure/admix.map" />
        <edit_attributes type="name" value="fake" />
      </param>

      <output name="output" file="test_out/pca/pca.html" ftype="html" compare="diff" lines_diff="2">
        <extra_files type="file" name="admix.geno" value="test_out/pca/admix.geno" />
        <extra_files type="file" name="admix.gd_indivs" value="test_out/pca/admix.gd_indivs" />
        <extra_files type="file" name="admix.gd_snp" value="test_out/pca/admix.gd_snp" />
        <extra_files type="file" name="coordinates.txt" value="test_out/pca/coordinates.txt" />
        <extra_files type="file" name="explained.txt" value="test_out/pca/explained.txt" />
        <extra_files type="file" name="par.admix" value="test_out/pca/par.admix" compare="diff" lines_diff="10" />
        <extra_files type="file" name="PCA.pdf" value="test_out/pca/PCA.pdf" compare="sim_size" delta = "1000" />
      </output>

    </test>
  </tests>
  -->

  <help>

**Dataset formats**

The input dataset is in gd_ped_ format.
The output dataset is html_ with links to a PDF plot and to
supporting text files.  (`Dataset missing?`_)

.. _gd_ped: ./static/formatHelp.html#gd_ped
.. _html: ./static/formatHelp.html#html
.. _Dataset missing?: ./static/formatHelp.html

-----

**What it does**

The user selects a gd_ped dataset generated by the Prepare Input tool.
The PCA tool runs a Principal Components Analysis on the input genotype
data and plots the top two principal components. It also reports the
following estimates of the statistical significance of the analysis.

1. Average divergence between each pair of populations.  Specifically,
from the covariance matrix X whose eigenvectors were computed, we
compute a "distance", d, for each pair of individuals (i,j): d(i,j) =
X(i,i) + X(j,j) - 2X(i,j).  For each pair of populations (a,b) we then
define an average distance: D(a,b) = \sum d(i,j) / (\|pop a\| * \|pop b\|),
where the sum runs over all i in pop a and all j in pop b.  Finally, we
normalize D so that its diagonal has mean 1 and report it.

2. Anova statistics for population differences along each
eigenvector. For each eigenvector, a P-value for statistical
significance of differences between each pair of populations along
that eigenvector is printed.  +++ is used to highlight P-values less
than 1e-06.  \*\*\* is used to highlight P-values between 1e-06 and
1e-03.  If there are more than 2 populations, then an overall P-value
is also printed for that eigenvector, as are the populations with
minimum (minv) and maximum (maxv) eigenvector coordinate. [If there is
only 1 population, no Anova statistics are printed.]

3. Statistical significance of differences between populations. For
each pair of populations, the above Anova statistics are summed across
eigenvectors. The result is approximately chi-square distributed, with
degrees of freedom equal to the number of eigenvectors. The chi-square
statistic and its P-value are printed. [If there is only 1 population,
no statistics are printed.]
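
The average divergence in item 1 can be sketched in a few lines of
Python.  This is only an illustration of the formulas above, not the
code that smartpca runs: the function names, the toy covariance matrix
and the population labels are all hypothetical, and we assume that the
diagonal entries D(a,a) include the zero distances d(i,i).

```python
from itertools import product


def pairwise_distance(X, i, j):
    """d(i,j) = X(i,i) + X(j,j) - 2X(i,j) for a covariance matrix X."""
    return X[i][i] + X[j][j] - 2 * X[i][j]


def average_divergence(X, populations):
    """Return the normalized matrix D as a dict keyed by (pop_a, pop_b).

    populations maps a population name to the list of row indices
    (individuals) that belong to it.
    """
    names = sorted(populations)
    D = {}
    for a, b in product(names, repeat=2):
        members_a, members_b = populations[a], populations[b]
        total = sum(pairwise_distance(X, i, j)
                    for i in members_a for j in members_b)
        D[(a, b)] = total / (len(members_a) * len(members_b))

    # Normalize so that the diagonal entries D(a,a) have mean 1.
    diagonal_mean = sum(D[(a, a)] for a in names) / len(names)
    if diagonal_mean == 0:
        raise ValueError("diagonal of D is zero; cannot normalize")
    return {pair: value / diagonal_mean for pair, value in D.items()}


# Toy covariance matrix for four individuals (illustrative values only).
X = [
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]
populations = {"popA": [0, 1], "popB": [2, 3]}
D = average_divergence(X, populations)
```

With this toy input the within-population entries are both 1 after
normalization and the between-population entry D(popA, popB) is 4, so
the two groups are four times as divergent from each other as they are
internally.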

We post-process the output of the PCA tool to estimate "admixture
fractions".  For this, we take three populations at a time and
determine each one's average point in the PCA plot (by separately
averaging first and second coordinates).  For each choice of two
center points, taken to represent two ancestral populations, we model
the third center point as deriving a certain fraction, r, of its SNP
genotypes from the second ancestral population and the remainder from
the first ancestral population, and we estimate r.  The output file
"coordinates.txt" then contains pairs of lines like

projection along chord Population1 -> Population2
  Population3: 0.12345

where the number (in this case 0.12345) is the estimate of r.
Computations with simulated data suggest that the true r is
systematically underestimated, with the estimate perhaps coming out at
roughly 0.6 times r.
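
One plausible reading of "projection along chord" is sketched below in
Python.  It assumes that r is the position of the third center point
when projected orthogonally onto the line from the first center to the
second, measured as a fraction of the chord length.  The function names
and the sample coordinates are hypothetical, and the tool's actual
post-processing may differ in detail.

```python
def centroid(points):
    """Average the first and second PCA coordinates separately."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def projection_fraction(c1, c2, c3):
    """Fraction of the way from c1 to c2 at which c3 projects onto the chord."""
    chord = (c2[0] - c1[0], c2[1] - c1[1])
    offset = (c3[0] - c1[0], c3[1] - c1[1])
    chord_sq = chord[0] ** 2 + chord[1] ** 2
    if chord_sq == 0:
        raise ValueError("the two ancestral centers coincide")
    # Scalar projection of the offset onto the chord, divided by chord length.
    return (offset[0] * chord[0] + offset[1] * chord[1]) / chord_sq


# Made-up (PC1, PC2) coordinates for the individuals of three populations.
pop1 = [(0.0, 0.0), (0.0, 0.2)]
pop2 = [(1.0, 0.0), (1.0, 0.2)]
pop3 = [(0.25, 0.3), (0.25, 0.5)]

r = projection_fraction(centroid(pop1), centroid(pop2), centroid(pop3))
print("projection along chord Population1 -> Population2")
print("  Population3: %.5f" % r)
```

Here the third center lies a quarter of the way along the chord, so r
is 0.25.  Values outside the range 0 to 1 are possible under this
reading when the third center projects beyond either ancestral center.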

-----

**Acknowledgments**

We use the programs "smartpca" and "ploteig" downloaded from

http://genepath.med.harvard.edu/~reich/Software.htm

and described in the paper "Population structure and eigenanalysis"
by Nick Patterson, Alkes L. Price, and David Reich, PLoS Genetics, 2 (2006), e190.

  </help>
</tool>
author	Richard Burhans <burhans@bx.psu.edu>
date	Mon, 03 Jun 2013 12:29:29 -0400
parents	95a05c1ef5d5
children	8997f2ca8c7a