
dbNSFP is an integrated database of functional predictions from multiple algorithms (SIFT, Polyphen2, LRT, MutationTaster, PhyloP, GERP++, etc.). It contains variant annotations such as:

1000Gp1_AC

Alternative allele counts in the whole 1000 genomes phase 1 (1000Gp1) data

1000Gp1_AF

Alternative allele frequency in the whole 1000Gp1 data

1000Gp1_AFR_AC

Alternative allele counts in the 1000Gp1 African descendent samples

1000Gp1_AFR_AF

Alternative allele frequency in the 1000Gp1 African descendent samples

1000Gp1_AMR_AC

Alternative allele counts in the 1000Gp1 American descendent samples

1000Gp1_AMR_AF

Alternative allele frequency in the 1000Gp1 American descendent samples

1000Gp1_ASN_AC

Alternative allele counts in the 1000Gp1 Asian descendent samples

1000Gp1_ASN_AF

Alternative allele frequency in the 1000Gp1 Asian descendent samples

1000Gp1_EUR_AC

Alternative allele counts in the 1000Gp1 European descendent samples

1000Gp1_EUR_AF

Alternative allele frequency in the 1000Gp1 European descendent samples

aaalt

Alternative amino acid. "." if the variant is a splicing site SNP (2bp on each end of an intron)

aapos

Amino acid position relative to the protein. "-1" if the variant is a splicing site SNP (2bp on each end of an intron)

aapos_SIFT

ENSP id and amino acid positions corresponding to SIFT scores. Multiple entries separated by ";"

aapos_FATHMM

ENSP id and amino acid positions corresponding to FATHMM scores. Multiple entries separated by ";"

aaref

Reference amino acid. "." if the variant is a splicing site SNP (2bp on each end of an intron)

alt

Alternative nucleotide allele (as on the + strand)

Ancestral_allele

Ancestral allele (based on 1000 genomes reference data)

cds_strand

Coding sequence (CDS) strand (+ or -)

chr

Chromosome number

codonpos

Position on the codon (1, 2 or 3)

Ensembl_geneid

Ensembl gene ID

Ensembl_transcriptid

Ensembl transcript IDs (separated by ";")

ESP6500_AA_AF

Alternative allele frequency in the African American samples of the NHLBI GO Exome Sequencing Project (ESP6500 data set)

ESP6500_EA_AF

Alternative allele frequency in the European American samples of the NHLBI GO Exome Sequencing Project (ESP6500 data set)

FATHMM_pred

If a FATHMM_score is <=-1.5 (or rankscore <=0.81415) the corresponding non-synonymous SNP is predicted as "D(AMAGING)"; otherwise it is predicted as "T(OLERATED)". Multiple predictions separated by ";"

FATHMM_rankscore

FATHMMori scores were ranked among all FATHMMori scores in dbNSFP. The rankscore is the ratio of the rank of the score over the total number of FATHMMori scores in dbNSFP. If there are multiple scores, only the most damaging (largest) rankscore is presented. The scores range from 0 to 1

FATHMM_score

FATHMM default score (FATHMMori)

fold-degenerate

Degenerate type (0, 2 or 3)

genename

Gene name; if the non-synonymous SNP can be assigned to multiple genes, gene names are separated by ";"

GERP++_NR

GERP++ neutral rate

GERP++_RS

GERP++ RS score, the larger the score, the more conserved the site

GERP++_RS_rankscore

GERP++ RS scores were ranked among all GERP++ RS scores in dbNSFP. The rankscore is the ratio of the rank of the score over the total number of GERP++ RS scores in dbNSFP

hg18_pos(1-coor)

Physical position on the chromosome in hg18 (1-based coordinate)

Interpro_domain

Domain or conserved site in which the variant is located

LR_pred

Prediction of our LR based ensemble prediction score, "T(olerated)" or "D(amaging)". The score cutoff between "D" and "T" is 0.5. The rankscore cutoff between "D" and "T" is 0.82268

LR_rankscore

LR scores were ranked among all LR scores in dbNSFP. The rankscore is the ratio of the rank of the score over the total number of LR scores in dbNSFP. The scores range from 0 to 1

LR_score

Our logistic regression (LR) based ensemble prediction score, which incorporated 10 scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, PhyloP) and the maximum frequency observed in the 1000 genomes populations. Larger value means the SNV is more likely to be damaging. Scores range from 0 to 1

LRT_Omega

Estimated nonsynonymous-to-synonymous-rate ratio (Omega, reported by LRT)

LRT_converted_rankscore

LRTori scores were first converted as LRTnew=1-LRTori*0.5 if Omega<1, or LRTnew=LRTori*0.5 if Omega>=1. Then LRTnew scores were ranked among all LRTnew scores in dbNSFP. The rankscore is the ratio of the rank over the total number of the scores in dbNSFP. The scores range from 0.00166 to 0.85682

LRT_pred

LRT prediction, D(eleterious), N(eutral) or U(nknown), which is not solely determined by the score

LRT_score

The original LRT two-sided p-value (LRTori), ranges from 0 to 1

MutationAssessor_pred

MutationAssessor's functional impact of a variant

MutationAssessor_rankscore

MAori scores were ranked among all MAori scores in dbNSFP. The rankscore is the ratio of the rank of the score over the total number of MAori scores in dbNSFP. The scores range from 0 to 1

MutationAssessor_score

MutationAssessor functional impact combined score (MAori)

MutationTaster_converted_rankscore

The MTori scores were first converted: if the prediction is "A" or "D" MTnew=MTori; if the prediction is "N" or "P", MTnew=1-MTori. Then MTnew scores were ranked among all MTnew scores in dbNSFP. The rankscore is the ratio of the rank of the score over the total number of MTnew scores in dbNSFP. The scores range from 0.0931 to 0.80722

MutationTaster_pred

MutationTaster prediction

MutationTaster_score

MutationTaster p-value (MTori), ranges from 0 to 1

phastCons46way_placental

phastCons conservation score based on the multiple alignments of 33 placental mammal genomes (including human). The larger the score, the more conserved the site

phastCons46way_placental_rankscore

phastCons46way_placental scores were ranked among all phastCons46way_placental scores in dbNSFP. The rankscore is the ratio of the rank of the score over the total number of phastCons46way_placental scores in dbNSFP

phastCons46way_primate

phastCons conservation score based on the multiple alignments of 10 primate genomes (including human). The larger the score, the more conserved the site

phastCons46way_primate_rankscore

phastCons46way_primate scores were ranked among all phastCons46way_primate scores in dbNSFP. The rankscore is the ratio of the rank of the score over the total number of phastCons46way_primate scores in dbNSFP

phastCons100way_vertebrate

phastCons conservation score based on the multiple alignments of 100 vertebrate genomes (including human). The larger the score, the more conserved the site

phastCons100way_vertebrate_rankscore

phastCons100way_vertebrate scores were ranked among all phastCons100way_vertebrate scores in dbNSFP. The rankscore is the ratio of the rank of the score over the total number of phastCons100way_vertebrate scores in dbNSFP

phyloP46way_placental

phyloP (phylogenetic p-values) conservation score based on the multiple alignments of 33 placental mammal genomes (including human). The larger the score, the more conserved the site

phyloP46way_placental_rankscore

phyloP46way_placental scores were ranked among all phyloP46way_placental scores in dbNSFP. The rankscore is the ratio of the rank of the score over the total number of phyloP46way_placental scores in dbNSFP

phyloP46way_primate

phyloP (phylogenetic p-values) conservation score based on the multiple alignments of 10 primate genomes (including human). The larger the score, the more conserved the site

phyloP46way_primate_rankscore

phyloP46way_primate scores were ranked among all phyloP46way_primate scores in dbNSFP. The rankscore is the ratio of the rank of the score over the total number of phyloP46way_primate scores in dbNSFP

phyloP100way_vertebrate

phyloP (phylogenetic p-values) conservation score based on the multiple alignments of 100 vertebrate genomes (including human). The larger the score, the more conserved the site

phyloP100way_vertebrate_rankscore

phyloP100way_vertebrate scores were ranked among all phyloP100way_vertebrate scores in dbNSFP. The rankscore is the ratio of the rank of the score over the total number of phyloP100way_vertebrate scores in dbNSFP

Polyphen2_HDIV_pred

Polyphen2 prediction based on HumDiv

Polyphen2_HDIV_rankscore

Polyphen2 HDIV scores were ranked among all HDIV scores in dbNSFP. The rankscore is the ratio of the rank of the score over the total number of the scores in dbNSFP. If there are multiple scores, only the most damaging (largest) rankscore is presented. The scores range from 0.02656 to 0.89917

Polyphen2_HDIV_score

Polyphen2 score based on HumDiv, i.e. hdiv_prob. The score ranges from 0 to 1. Multiple entries separated by ";"

Polyphen2_HVAR_pred

Polyphen2 prediction based on HumVar

Polyphen2_HVAR_rankscore

Polyphen2 HVAR scores were ranked among all HVAR scores in dbNSFP. The rankscore is the ratio of the rank of the score over the total number of the scores in dbNSFP. If there are multiple scores, only the most damaging (largest) rankscore is presented. The scores range from 0.01281 to 0.9711

Polyphen2_HVAR_score

Polyphen2 score based on HumVar, i.e. hvar_prob. The score ranges from 0 to 1. Multiple entries separated by ";"

pos(1-coor)

Physical position on the chromosome in hg19 (1-based coordinate)

RadialSVM_pred

Prediction of our SVM based ensemble prediction score, "T(olerated)" or "D(amaging)". The score cutoff between "D" and "T" is 0. The rankscore cutoff between "D" and "T" is 0.83357

RadialSVM_rankscore

RadialSVM scores were ranked among all RadialSVM scores in dbNSFP. The rankscore is the ratio of the rank of the score over the total number of RadialSVM scores in dbNSFP. The scores range from 0 to 1

RadialSVM_score

Our support vector machine (SVM) based ensemble prediction score, which incorporated 10 scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, PhyloP) and the maximum frequency observed in the 1000 genomes populations. Larger value means the SNV is more likely to be damaging. Scores range from -2 to 3 in dbNSFP

ref

Reference nucleotide allele (as on the + strand)

refcodon

Reference codon

Reliability_index

Number of observed component scores (except the maximum frequency in the 1000 genomes populations) for RadialSVM and LR. Ranges from 1 to 10. As RadialSVM and LR scores are calculated based on imputed data, the fewer missing component scores, the higher the reliability of the scores and predictions

SIFT_converted_rankscore

SIFTori scores were first converted to SIFTnew=1-SIFTori, then ranked among all SIFTnew scores in dbNSFP. The rankscore is the ratio of the rank of the SIFTnew score over the total number of SIFTnew scores in dbNSFP. If there are multiple scores, only the most damaging (largest) rankscore is presented. The rankscores range from 0.02654 to 0.87932

SIFT_pred

If SIFTori is smaller than 0.05 (rankscore>0.55) the corresponding non-synonymous SNP is predicted as "D(amaging)"; otherwise it is predicted as "T(olerated)". Multiple predictions separated by ";"

SIFT_score

SIFT score (SIFTori). Scores range from 0 to 1. The smaller the score, the more likely the SNP has a damaging effect. Multiple scores separated by ";"

SiPhy_29way_logOdds

SiPhy score based on 29 mammalian genomes. The larger the score, the more conserved the site

SiPhy_29way_pi

The estimated stationary distribution of A, C, G and T at the site, using the SiPhy algorithm based on 29 mammalian genomes

SLR_test_statistic

SLR test statistic for testing natural selection on codons. A negative value indicates negative selection, and a positive value indicates positive selection. Larger magnitude of the value suggests stronger evidence

Uniprot_aapos

Amino acid position in Uniprot. Multiple entries separated by ";"

Uniprot_acc

Uniprot accession number. Multiple entries separated by ";"

Uniprot_id

Uniprot ID number. Multiple entries separated by ";"

UniSNP_ids

rs numbers from UniSNP, which is a cleaned version of dbSNP build 129, in format: rs number1;rs number2;...
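The score conversions and cutoffs described in the annotation list above can be sketched in Python. This is a minimal illustration under stated assumptions, not the code dbNSFP itself uses: all function names are hypothetical, the tie handling in `rankscores` (tied scores share the highest rank) is an assumption because the descriptions above do not specify it, and treating "." as a missing score in `sift_pred` is likewise an assumption.

```python
from bisect import bisect_right


def sift_converted(sift_ori: float) -> float:
    """SIFTnew = 1 - SIFTori, so that a larger value means more damaging."""
    return 1.0 - sift_ori


def lrt_converted(lrt_ori: float, omega: float) -> float:
    """LRTnew = 1 - LRTori*0.5 if Omega < 1, otherwise LRTori*0.5."""
    return 1.0 - lrt_ori * 0.5 if omega < 1 else lrt_ori * 0.5


def mutationtaster_converted(mt_ori: float, pred: str) -> float:
    """MTnew = MTori for predictions "A" or "D"; 1 - MTori for "N" or "P"."""
    if pred in ("A", "D"):
        return mt_ori
    if pred in ("N", "P"):
        return 1.0 - mt_ori
    raise ValueError(f"unexpected MutationTaster prediction: {pred!r}")


def rankscores(scores: list[float]) -> list[float]:
    """Rank of each score divided by the total number of scores.

    Assumption: tied scores share the highest rank among the ties.
    """
    ordered = sorted(scores)
    total = len(ordered)
    return [bisect_right(ordered, s) / total for s in scores]


def sift_pred(sift_score_field: str) -> str:
    """Apply the SIFTori < 0.05 cutoff to a ";"-separated SIFT_score field."""
    preds = []
    for value in sift_score_field.split(";"):
        if value == ".":  # assumption: "." marks a missing score
            preds.append(".")
        else:
            preds.append("D" if float(value) < 0.05 else "T")
    return ";".join(preds)
```

For example, `sift_pred("0;0.15;0")` applies the cutoff to each of the three scores independently and joins the results with ";", mirroring how multi-valued fields are stored.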

The procedure for preparing the dbNSFP data for use in SnpSift dbnsfp and a couple of prebuilt dbNSFP databases are available at: http://snpeff.sourceforge.net/SnpSift.html#dbNSFP

Uploading Your Own Annotations for any Genome

The website for dbNSFP databases releases is: https://sites.google.com/site/jpopgen/dbNSFP

It provides annotations only for the human hg18, hg19, and hg38 genome builds.

However, any dbNSFP-like tabular file can be used with SnpSift dbnsfp if it meets these requirements:

The first line of the file must be column headers that name the annotations.

The first 4 columns are required and must be:

#chr - chromosome

pos(1-coor) - position in chromosome

ref - reference base

alt - alternate base

For example:

#chr    pos(1-coor)     ref     alt     aaref   aaalt   genename        SIFT_score
  4      100239319       T       A        H       L      ADH1B             0
  4      100239319       T       C        H       R      ADH1B             0.15
  4      100239319       T       G        H       P      ADH1B             0
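Before uploading a custom file, it can help to check that it meets the header requirements listed above. The following Python sketch does that check and reads the rows. It assumes the file is tab-delimited (the example above is shown with padding for readability); the function name `read_dbnsfp_like` and the constant `REQUIRED_COLUMNS` are hypothetical and are not part of SnpSift or Galaxy.

```python
import csv
import io

# The first four column headers required by the format described above.
REQUIRED_COLUMNS = ["#chr", "pos(1-coor)", "ref", "alt"]


def read_dbnsfp_like(handle):
    """Yield one dict per data row after checking the required header columns.

    Assumption: the file is tab-delimited with a single header line.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = [name.strip() for name in next(reader)]
    if header[:4] != REQUIRED_COLUMNS:
        raise ValueError(
            f"first 4 columns must be {REQUIRED_COLUMNS}, got {header[:4]}"
        )
    for row in reader:
        if not row:  # skip blank lines
            continue
        yield dict(zip(header, (value.strip() for value in row)))


# The example table from above, written with tab separators.
example = (
    "#chr\tpos(1-coor)\tref\talt\taaref\taaalt\tgenename\tSIFT_score\n"
    "4\t100239319\tT\tA\tH\tL\tADH1B\t0\n"
    "4\t100239319\tT\tC\tH\tR\tADH1B\t0.15\n"
    "4\t100239319\tT\tG\tH\tP\tADH1B\t0\n"
)
rows = list(read_dbnsfp_like(io.StringIO(example)))
```

Because `read_dbnsfp_like` is a generator, the header check runs when iteration starts, so wrapping the call in `list(...)` surfaces a malformed header immediately.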

The custom Galaxy datatypes for dbNSFP can automatically convert the specially formatted tabular file for use by SnpSift dbnsfp:

Upload the tabular file, set the datatype as: "dbnsfp.tabular"
Edit the history dataset attributes (pencil icon): Use "Convert Format" to convert the "dbnsfp.tabular" to the correct format for SnpSift dbnsfp: "snpsiftdbnsfp".

The procedure for preparing the dbNSFP data for use in SnpSift dbnsfp is in the SnpSift documentation.

For details about this tool, please go to:

http://snpeff.sourceforge.net/SnpEff_manual.html

http://snpeff.sourceforge.net/SnpSift.html#dbNSFP