Phosphoproteomic Enrichment Pipeline Preprocessing Steps

Overview

Before statistical analysis, the MaxQuant output for phosphoproteome-enriched samples must be transformed in three steps: applying a localization-probability cut-off, mapping upstream kinases, and merging and filtering the results.

Workflow position

Upstream tool

The input dataset for this tool is the Phospho (STY)Sites.txt file that is produced:

by the Galaxy "MaxQuant" (maxquant) tool

or by the Galaxy "MaxQuant (using mqpar.xml)" (maxquant_mqpar) tool

or by the desktop version of MaxQuant.

Downstream tool

The "MaxQuant Phosphopeptide ANOVA" tool (mqppep_anova) consumes the "preprocessed" output file preproc_tab that this tool produces.

Phosphoproteomic Enrichment Pipeline Localization-Probability Cut-Off

This step applies a "localization-probability cut-off" to each phosphopeptide. Higher values may reduce the number of peptides in the output. The default value of 0.75 reflects the text of [Cheng 2018]:

"For phosphopeptide identification, a localization probability cutoff is applied. This filter is performed to select for phosphopeptides with a high confidence (i.e., greater than 0.75) in phosphoresidue identification [Hogrebe 2018; Olsen 2006]. In other words, the summed probability of all other residues that could potentially contain the phospho-group is less than 0.25. This cutoff could be raised to increase the stringency of the phosphopeptide selection. In regard to the number of identifications, the expected number of pY peptides is in the hundreds, while the expected number of pST peptides is in the high thousands. These values reflect previously observed phosphoproteome distribution where about 2%, 12%, and 86% of the phosphosites are pY, pT, and pS, respectively [Olsen 2006]."

This tool wraps an R script, written by Larry Cheng, that performs the following (in order):

Removes contaminant and reverse-sequence rows
Filters rows based on localization probability
Extracts the quantitative data
Inserts a "p" before the phosphorylated residue(s) in each peptide sequence
Merges (aggregating by "sum" or "average") multiply-phosphorylated peptides
Filters output phosphopeptides based on enrichment
Produces an output file (in tabular format) that contains the phosphopeptide (first column) and its (possibly merged) mass spectral intensity for each sample.
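
The "p"-insertion step in the list above can be sketched in Python as follows. This is a minimal illustration, not the wrapped R script: it assumes the input is a sequence annotated with per-residue probabilities in parentheses (for example "AAS(0.02)FS(0.98)PLK"), and the function name mark_phosphosites is hypothetical.

```python
import re

# Matches one residue optionally followed by a parenthesised probability,
# e.g. "S(0.98)". The annotated-sequence format is an assumption of this sketch.
TOKEN = re.compile(r"([A-Z])(?:\(([0-9.]+)\))?")

def mark_phosphosites(annotated, cutoff=0.75):
    """Return the plain sequence with "p" before residues scoring above cutoff."""
    out = []
    for residue, prob in TOKEN.findall(annotated):
        # prob is an empty string for residues that carry no annotation.
        if prob and float(prob) > cutoff:
            out.append("p")
        out.append(residue)
    return "".join(out)

print(mark_phosphosites("AAS(0.02)FS(0.98)PLK"))  # AASFpSPLK
```

Only the second serine exceeds the default cutoff here, so it alone receives the "p" prefix.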

Note that the "ProTeomiX Quality Control Report" [Bielow 2016] (available at https://github.com/cbielow/PTXQC/) is run by the Galaxy wrappers for MaxQuant, so it is omitted here even though it was included in Larry Cheng's original script.

Input dataset

Phospho (STY)Sites.txt: This is the Phospho (STY)Sites.txt file produced by MaxQuant. If you use the desktop version of MaxQuant, you will find this file in the txt folder.

Input parameters

Localization probability cutoff: Minimum localization probability; see above.
Intensity merge-function: Specifies how intensities for identical phosphosites should be merged. Choosing "sum" means that relative intensities reflect number of phospho-residues; choosing "average" means that relative intensities reflect number of phospho-peptides.
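
The difference between the two merge functions can be illustrated with a short Python sketch. It is not the wrapped R script; the function name merge_intensities, the row layout (a phosphopeptide plus one intensity per sample), and the example values are assumptions made for the illustration.

```python
from collections import defaultdict
from statistics import mean

def merge_intensities(rows, how="sum"):
    """Merge rows sharing a phosphopeptide; rows are (peptide, [intensity per sample])."""
    combine = {"sum": sum, "average": mean}[how]
    grouped = defaultdict(list)
    for peptide, intensities in rows:
        grouped[peptide].append(intensities)
    # zip(*group) regroups the values sample by sample before combining them.
    return {pep: [combine(col) for col in zip(*group)] for pep, group in grouped.items()}

rows = [
    ("pSAApTK", [100.0, 200.0]),  # same peptide reported once per phosphosite
    ("pSAApTK", [300.0, 400.0]),
    ("AApSK", [50.0, 60.0]),
]
print(merge_intensities(rows, "sum"))
print(merge_intensities(rows, "average"))
```

With "sum", the doubly phosphorylated peptide accumulates one contribution per phospho-residue; with "average", it counts once, on the same footing as the singly phosphorylated peptide.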

Output datasets

ppep_intensities: Data table (in tabular format) presenting, for each sample, the mass-spectral intensity of each phosphopeptide having localization probability greater than the cutoff.
enrichment.pdf: Graph (in PDF format) presenting non-zero proportions of pS, pT, and pY among the phosphosites; note that a phosphopeptide may have multiple phosphosites.
locProbCutoff.pdf: Graph (in PDF format) contrasting proportion of phosphopeptides above the localization probability cutoff with the proportion below.
enrichment.svg: Enrichment graph (in downloadable "scalable vector graphics" format) for incorporation into documents.
locProbCutoff.svg: Localization probability cutoff graph (in downloadable "scalable vector graphics" format) for incorporation into documents.
filteredData: Data table (in tabular format) comprising rows of the Phospho (STY)Sites.txt input file that are not flagged as contaminants or reversed sequences.
quantData: Data table (in tabular format) comprising rows of the filteredData file whose localization probability exceeds the Localization Probability Cutoff parameter.
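
The relationship between the input file, filteredData, and quantData can be sketched in Python as follows. This is an illustration only, not the wrapped R script: the column names ("Localization prob", "Reverse", "Potential contaminant"), the "+" flag convention, and the sample rows are assumptions. The comparison is strict because the text speaks of probabilities that exceed the cutoff.

```python
import csv
import io

# Toy stand-in for a few columns of Phospho (STY)Sites.txt (tab-separated).
# Column names and the "+" flag convention are assumptions of this sketch.
SAMPLE = (
    "Protein\tLocalization prob\tReverse\tPotential contaminant\n"
    "P1\t0.99\t\t\n"
    "P2\t0.75\t\t\n"
    "P3\t0.98\t+\t\n"
    "P4\t0.97\t\t+\n"
    "P5\t0.40\t\t\n"
)

def filtered_data(rows):
    """Drop rows flagged as reversed sequences or potential contaminants."""
    return [r for r in rows if r["Reverse"] != "+" and r["Potential contaminant"] != "+"]

def quant_data(rows, cutoff=0.75):
    """Keep rows whose localization probability strictly exceeds the cutoff."""
    return [r for r in rows if float(r["Localization prob"]) > cutoff]

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
filtered = filtered_data(rows)
quant = quant_data(filtered)
print([r["Protein"] for r in filtered])  # ['P1', 'P2', 'P5']
print([r["Protein"] for r in quant])     # ['P1']
```

Raising the cutoff argument makes the second filter more stringent, which is why higher values may reduce the number of peptides in the output.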

Authors

Nicholas A. Graham: (ORCiD 0000-0002-6811-1941) initiated the original script.
Larry C. Cheng: (ORCiD 0000-0002-6922-6433) updated the original script.
Arthur C. Eschenlauer: (ORCiD 0000-0002-2882-0508) adapted the script to run in Galaxy.
James E. Johnson: (University of Minnesota Supercomputing Institute) adapted the script to run in Galaxy.

Phosphoproteomic Enrichment Pipeline Upstream Kinase Mapping

This step searches phosphopeptides against several databases for known or predicted sites.

Input databases

networkin: This table is the result of filtering the NetworKIN database [Linding 2007; Horn 2014] for cutoff score > 2.0. The ENSEMBL data used to generate the file were from Ensembl, ensembl.org [Howe 2021].

To generate this file:

Download the "precomputed data for all available kinase predictors against ENSEMBL" (available at the NetworKIN predictions link on the downloads page at https://web.archive.org/web/20200208000403/http://networkin.info/download/networkin_human_predictions_3.1.tsv.xz; N.B.: "Commercial users are requested to contact the authors before using the data on the networkin.info website");

Decompress the .tsv.xz file with "unxz" (from XZ Utils https://tukaani.org/xz/);

Filter out the rows having "networkin_score" less than 2.0.

The result should be a tab-separated file with the following columns:

#substrate

position

id

networkin_score

tree

netphorest_group

netphorest_score

string_identifier

string_score

substrate_name

sequence

string_path
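
The filtering step above can be sketched in Python as follows. This is only an illustration, not the procedure used to build the distributed file: the sample rows are invented, only four of the listed columns are shown, the function name filter_networkin is hypothetical, and the sketch follows the wording of the step by dropping rows whose networkin_score is below 2.0.

```python
import csv
import io

# Toy rows using a subset of the columns listed above; all values are invented.
SAMPLE = (
    "#substrate\tposition\tid\tnetworkin_score\n"
    "SUB1\t10\tKIN_A\t3.5\n"
    "SUB2\t25\tKIN_B\t1.2\n"
    "SUB3\t7\tKIN_C\t2.0\n"
)

def filter_networkin(src, dst, min_score=2.0):
    """Copy the header and every row whose networkin_score is at least min_score."""
    reader = csv.DictReader(src, delimiter="\t")
    writer = csv.DictWriter(
        dst, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        if float(row["networkin_score"]) >= min_score:
            writer.writerow(row)
            kept += 1
    return kept

out = io.StringIO()
print(filter_networkin(io.StringIO(SAMPLE), out))  # 2
print(out.getvalue())
```

For the real download, Python's standard lzma module can open the .tsv.xz file directly in text mode (lzma.open(path, "rt")), which would make the separate decompression step optional.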

p_sty_motifs

This database merges motif patterns from [Amanchy 2007] and Phosida [Gnad 2011].

The Amanchy data are adapted from https://web.archive.org/web/*/http://hprd.org/serine_motifs and https://web.archive.org/web/*/http://hprd.org/tyrosine_motifs (both links cite the reference where each motif was published), and the patterns are translated into Perl regular expression format (https://perldoc.perl.org/perlre).

The Phosida data are adapted (translated to Perl-formatted regular expressions) from http://pegasus.biochem.mpg.de/phosida/help/motifs.aspx (this link cites the reference where each motif was published).

This file has three tab-separated columns (and no header):

column 1 is an (ignored) identifier

column 2 is a Perl regular expression

column 3 is a descriptor.

Two examples:

2<TAB>R.R..(pS|pT)<TAB>Akt kinase substrate motif (HPRD)

10<TAB>R..(pS|pT)V<TAB>CAMK2_Phosida
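
To show how such a file might be used, the following Python sketch parses the two example lines and searches a phosphopeptide sequence for each pattern. It is an illustration under assumptions: the tool itself uses Perl regular expressions, the test sequences are invented, the function names are hypothetical, and whether the tool searches the peptide itself or a longer flanking window is not stated here.

```python
import re

# The two example lines from the text: tab-separated, no header.
MOTIF_LINES = [
    "2\tR.R..(pS|pT)\tAkt kinase substrate motif (HPRD)",
    "10\tR..(pS|pT)V\tCAMK2_Phosida",
]

def load_motifs(lines):
    """Parse (compiled pattern, descriptor) pairs; column 1 is ignored."""
    motifs = []
    for line in lines:
        _ident, pattern, descriptor = line.rstrip("\n").split("\t")
        motifs.append((re.compile(pattern), descriptor))
    return motifs

def matching_motifs(sequence, motifs):
    """Return the descriptor of every motif found anywhere in the sequence."""
    return [desc for regex, desc in motifs if regex.search(sequence)]

motifs = load_motifs(MOTIF_LINES)
print(matching_motifs("GRARAApSVLK", motifs))  # both motifs match
print(matching_motifs("AAApSPEK", motifs))     # []
```

Simple patterns like these two behave the same way in Perl and in Python's re module, but that should not be assumed for every Perl construct; each pattern would need checking before reuse outside Perl.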

psp_kinase_substrate

'Kinase-substrate dataset: experimentally determined substrates, sequences, cognate kinases, and metadata curated from the literature' [Hornbeck 2011]. This tabular-formatted file may be downloaded for non-commercial purposes as 'Kinase_Substrate_Dataset.gz' from https://www.phosphosite.org/staticDownloads.action.

Data extracted from PhosphoSitePlus(R), created by Cell Signaling Technology Inc. PhosphoSitePlus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License (https://creativecommons.org/licenses/by-nc-sa/3.0/). Attribution must be given in written, oral and digital presentations to PhosphoSitePlus, www.phosphosite.org. Written documents should additionally cite:

Hornbeck PV, Kornhauser JM, Tkachev S, Zhang B, Skrzypek E, Murray B, Latham V, Sullivan M (2012) PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic Acids Res. 40, D261-D270.; www.phosphosite.org.

psp_regulatory_sites

'Regulatory sites: information curated from the literature about modification sites shown to regulate molecular functions, biological processes, and molecular interactions including protein-protein interactions' [Hornbeck 2011]. This tabular-formatted file may be downloaded for non-commercial purposes as 'Regulatory_sites.gz' from https://www.phosphosite.org/staticDownloads.action.

Terms of use and citation are as for the psp_kinase_substrate file.

Output datasets

ppep_map: Data table (in tabular format, consumed by the merge/filter step) presenting, for each phosphopeptide, the kinase mappings, the mass-spectral intensities for each sample, and the metadata from UniProtKB/SwissProt, phospho-sites, phospho-motifs, and regulatory sites. Data in the columns marked "Domain", "ON_...", or "..._PhosphoSite" are available subject to the following terms:

"PhosphoSitePlus® (PSP) was created by Cell Signaling Technology Inc. It is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License(https://creativecommons.org/licenses/by-nc-sa/3.0/). When using PSP data or analyses in printed publications or in online resources, the following acknowledgements must be included: (a) the words 'PhosphoSitePlus(R), www.phosphosite.org' must be included at appropriate places in the text or webpage, and (b) citation of [Hornbeck 2011 (PMID: 25514926)] must be included in the bibliography."
melted: Data table (in tabular format) presenting, for each phosphopeptide, the gene and one of the phospho-motifs or kinase-substrate sites.
ppep_mapping_sqlite: SQLite database (consumed by the merge/filter step).
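
The SQLite output can be inspected with any SQLite client. As a minimal Python sketch, the snippet below lists the tables and views in a database; because the schema of ppep_mapping_sqlite is not described here, the demonstration uses an in-memory database with a hypothetical table, and the function name list_tables is likewise an assumption.

```python
import sqlite3

def list_tables(conn):
    """Return the names of all tables and views in an SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type IN ('table', 'view') ORDER BY name"
    )
    return [name for (name,) in rows]

# Demonstration on an in-memory database with a hypothetical table; for a real
# file, pass the path of the downloaded dataset to sqlite3.connect() instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example_ppep (id INTEGER PRIMARY KEY, seq TEXT)")
print(list_tables(conn))  # ['example_ppep']
conn.close()
```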

Authors

Nicholas A. Graham: (ORCiD 0000-0002-6811-1941) wrote the original script.
Arthur C. Eschenlauer: (ORCiD 0000-0002-2882-0508) adapted the script to run in Galaxy.

Phosphoproteomic Enrichment Pipeline Merge and Filter

This step merges mapped metadata into metadata for phosphopeptides, filtering by species.

Input parameters

species: Limit PhosphoSitePlus data to the indicated species. Default: human

Output datasets

preproc_tab: Phosphopeptides annotated with SwissProt and phosphosite metadata, in tabular format. This file is designed to be consumed by the downstream ANOVA tool. Some data in the columns marked "PSP" are available subject to the following terms:

"PhosphoSitePlus® (PSP) was created by Cell Signaling Technology Inc. It is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License(https://creativecommons.org/licenses/by-nc-sa/3.0/). When using PSP data or analyses in printed publications or in online resources, the following acknowledgements must be included: (a) the words 'PhosphoSitePlus(R), www.phosphosite.org' must be included at appropriate places in the text or webpage, and (b) citation of [Hornbeck 2011 (PMID: 25514926)] must be included in the bibliography."
preproc_csv: Phosphopeptides annotated with SwissProt and phosphosite metadata, in CSV format.
preproc_sqlite: ppep_mapping_sqlite updated with annotations, in SQLite format.

Authors

Nicholas A. Graham: (ORCiD 0000-0002-6811-1941) initiated the original script.
Larry C. Cheng: (ORCiD 0000-0002-6922-6433) updated the original script.
Arthur C. Eschenlauer: (ORCiD 0000-0002-2882-0508) adapted the script to run in Galaxy.