CADD scores licensing
CADD scores are freely available for non-commercial applications only. Make sure you contact the developers before using them in any commercial application.
What it does
Before we can use GEMINI to explore genetic variation, we must first load the variant information stored in VCF format into the GEMINI database framework.
To fully leverage the power of GEMINI, you should first annotate your VCF dataset with the functional consequences of the variants using either VEP or snpEff.
To avoid problems during annotation, and also during later variant queries with GEMINI tools, it is good practice to preprocess your VCF dataset before annotation: split records with multiple alternate alleles, and left-align and trim indels. The authors of GEMINI recommend the tool vt for this purpose. An equally good option is bcftools norm, and Galaxy wrappers exist for both tools.
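Outside of Galaxy, the bcftools norm preprocessing step could be scripted roughly as follows. This is only a sketch: the helper name bcftools_norm_command and the file names are hypothetical, and it assumes that bcftools is installed and that the reference FASTA matches the genome build of the VCF.

```python
def bcftools_norm_command(vcf_in, vcf_out, reference_fasta):
    """Build a bcftools norm command line (hypothetical helper)."""
    return [
        "bcftools", "norm",
        "-m", "-any",           # split multiallelic records into biallelic ones
        "-f", reference_fasta,  # reference needed to left-align and normalize indels
        "-o", vcf_out,          # write the normalized records to this file
        vcf_in,
    ]


# Example call with placeholder file names (not run here):
#   import subprocess
#   subprocess.run(bcftools_norm_command("in.vcf", "norm.vcf", "ref.fa"), check=True)
```

Returning the command as a list rather than a shell string avoids quoting problems with unusual file names.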
In addition, you are encouraged to provide family and sample phenotype information in PED format, if you are planning to use GEMINI for any kind of variant identification based on inheritance patterns.
A PED file is simply a tabular text file (columns can be separated by either spaces or TABs, but not a mixture of the two within the same file) with the header:
#family_id name paternal_id maternal_id sex phenotype
and optional additional columns. The actual column names in the header are not fixed, but there have to be at least six columns that are interpreted as detailed next.
Subsequent lines describe one sample from the VCF input dataset each, where
family_id is an alphanumeric identifier of a family
If the family to which the sample belongs is unknown, a placeholder of 0, -9 or None can be used to indicate this fact.
name is the identifier of the sample described by the line
paternal_id is the identifier of the sample's father
If the sample's father is not available in the VCF, a placeholder of 0, -9 or None can be used to indicate this fact.
maternal_id is the identifier of the sample's mother
If the sample's mother is not available in the VCF, a placeholder of 0, -9 or None can be used to indicate this fact.
sex is a numeric code for the sample's sex (1=male, 2=female, any other number=unknown sex)
phenotype is a numeric code for the sample's phenotypic affection status (1=unaffected, 2=affected)
If the sample's phenotype is unknown, a placeholder of 0 or -9 can be used to indicate this fact.
Optional additional columns can have any column name you like, and accept any per-sample value. The data from such extra columns will be added to the samples table of the GEMINI database so you can use them in queries. Extra columns can be used, e.g., to describe additional phenotypes.
If no extra columns are present in a PED file, then the header line is optional.
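The column rules above can be checked before loading with a small script. The following is a minimal sketch, not GEMINI's own parser: the function name parse_ped and the returned dictionary layout are hypothetical, values are assumed to contain no whitespace, and rejecting phenotype codes other than 1, 2, 0 and -9 is a choice of this sketch.

```python
MISSING_ID = {"0", "-9", "None"}   # placeholders for unknown family or parent
MISSING_PHENOTYPE = {"0", "-9"}    # placeholders for unknown phenotype


def parse_ped(text):
    """Parse PED content into a list of per-sample dicts (hypothetical helper)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    # Columns are separated by spaces or by TABs, never by a mixture of both.
    if "\t" in text and " " in text:
        raise ValueError("mixed TAB and space separators")
    sep = "\t" if "\t" in text else None  # None splits on runs of spaces
    rows = [ln.split(sep) for ln in lines]

    extra_names = []
    if rows[0][0].startswith("#"):
        header = rows.pop(0)
        if len(header) < 6:
            raise ValueError("header needs at least six columns")
        extra_names = header[6:]  # names of the optional extra columns

    samples = []
    for fields in rows:
        # Without a header there can be no extra columns.
        if len(fields) != 6 + len(extra_names):
            raise ValueError("unexpected column count: %r" % (fields,))
        family, name, father, mother, sex, phenotype = fields[:6]
        if phenotype in MISSING_PHENOTYPE:
            affected = None
        elif phenotype in ("1", "2"):
            affected = phenotype == "2"  # 1=unaffected, 2=affected
        else:
            raise ValueError("unknown phenotype code: " + phenotype)
        sample = {
            "family_id": None if family in MISSING_ID else family,
            "name": name,
            "paternal_id": None if father in MISSING_ID else father,
            "maternal_id": None if mother in MISSING_ID else mother,
            # 1=male, 2=female, anything else counts as unknown sex
            "sex": {"1": "male", "2": "female"}.get(sex, "unknown"),
            "affected": affected,
        }
        sample.update(zip(extra_names, fields[6:]))
        samples.append(sample)
    return samples
```

Placeholders are mapped to None so that downstream code does not have to know all three spellings of "unknown".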
Here are two examples of valid PED file contents:
#family_id name paternal_id maternal_id sex phenotype hair_color
1 M10475 -9 -9 1 1 brown
1 M10478 M10475 M10500 2 2 brown
1 M10500 -9 -9 2 2 black
1 M128215 M10475 M10500 1 1 blue
This describes a family with two kids, in which the mother and daughter, but not the father and son, are phenotypically affected. The file also stores the hair color of all family members.
#family_id name paternal_id maternal_id sex phenotype
0 M10475 0 0 -1 1
0 M10478 0 0 -1 2
0 M10500 0 0 -1 2
0 M128215 0 0 -1 1
This describes the same samples as above, but without recording family structure, sex or additional traits. Only the sample phenotypes are provided. In this case (no extra columns), the header line could be omitted.
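A phenotype-only PED file of this kind is easy to generate from a list of sample names. The sketch below assumes a hypothetical helper minimal_ped that takes a mapping from sample name to affection status (True, False, or None for unknown) and writes TAB-separated lines without a header.

```python
def minimal_ped(phenotypes):
    """Build header-less PED text from {sample_name: True/False/None}."""
    codes = {True: "2", False: "1", None: "0"}  # affected, unaffected, unknown
    lines = []
    for name, affected in phenotypes.items():
        # Family 0, parents 0 and 0, sex -1: all "unknown" placeholders.
        lines.append("\t".join(["0", name, "0", "0", "-1", codes[affected]]))
    return "\n".join(lines) + "\n"


ped_text = minimal_ped({
    "M10475": False,
    "M10478": True,
    "M10500": True,
    "M128215": False,
})
```

The sample names must match the sample identifiers used in the VCF dataset.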