MaxQuant Phosphopeptide Preprocessing (version 0.1.19+galaxy0)
Tabular 'Phospho (STY)Sites.txt' produced by MaxQuant
PERL-compatible regular expression matching header of column having number of 'Phospho (STY)'
PERL-compatible regular expression matching column header having first sample intensity
E.g., 1 if subsequent column is next sample; 2 if next sample is two columns away, etc.
Were samples enriched for pS and pT, or were they enriched for pY instead?
When a peptide is multiply phosphorylated, how should intensities be merged? [default: sum]
See help below for an explanation.
Sequence database; supply the same FASTA file that you supplied to MaxQuant
NetworKIN file; see help section below
pS/pT/pY phosphorylation site motifs; see help section below
'Kinase-substrate dataset'; see help section below
'Regulatory sites'; see help section below
(field may be empty) [default: human]. If you supply this parameter, use the species identifier seen as a suffix in UniProtKB

Phosphoproteomic Enrichment Pipeline Preprocessing Steps

Overview

Prior to statistical analysis, it is necessary to perform three steps to transform the MaxQuant output for phosphoproteome-enriched samples.

Workflow position

Upstream tool

The input dataset for this tool is the Phospho (STY)Sites.txt file that is produced:

  • by the Galaxy "MaxQuant" (maxquant) tool
  • or by the Galaxy "Maxquant (using mqpar.xml)" (maxquant_mqpar) tool
  • or by the desktop version of MaxQuant.

Downstream tool

The "MaxQuant Phosphopeptide ANOVA" tool (mqppep_anova) consumes the "preprocessed" output file preproc_tab that this tool produces.

Phosphoproteomic Enrichment Pipeline Localization-Probability Cut-Off

This step applies a "localization-probability cut-off" to each phosphopeptide. Higher values may reduce the number of peptides in the output. The default value of 0.75 reflects the text of [Cheng 2018]:

"For phosphopeptide identification, a localization probability cutoff is applied. This filter is performed to select for phosphopeptides with a high confidence (i.e., greater than 0.75) in phosphoresidue identification [Hogrebe 2018; Olsen 2006]. In other words, the summed probability of all other residues that could potentially contain the phospho-group is less than 0.25. This cutoff could be raised to increase the stringency of the phosphopeptide selection. In regard to the number of identifications, the expected number of pY peptides is in the hundreds, while the expected number of pST peptides is in the high thousands. These values reflect previously observed phosphoproteome distribution where about 2%, 12%, and 86% of the phosphosites are pY, pT, and pS, respectively [Olsen 2006]."

This tool wraps an R script, written by Larry Cheng, that performs the following (in order):

  1. Removes contaminant and reverse-sequence rows
  2. Filters rows based on localization probability
  3. Extracts the quantitative data
  4. Inserts a "p" before the phosphorylated residue(s) in each peptide sequence
  5. Merges (aggregating by "sum" or "average") multiply-phosphorylated peptides
  6. Filters output phosphopeptides based on enrichment
  7. Produces an output file (in tabular format) that contains the phosphopeptide (first column) and its (possibly merged) mass spectral intensity for each sample.
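
As an illustration of step 4, here is a minimal Python sketch of inserting a "p" before confidently localized residues. It is not the wrapped R script. It assumes that per-residue probabilities are available as an annotated string of the form AAS(0.9)PT(0.1)K; that input format and the function name mark_phosphosites are assumptions made for this example.

```python
import re

# One residue letter followed by a parenthesized probability, e.g. "S(0.9)".
# The annotated-string format is an assumption made for this sketch.
_SITE = re.compile(r"([A-Z])\((\d*\.?\d+)\)")


def mark_phosphosites(annotated, cutoff=0.75):
    """Return the bare sequence with 'p' before residues above the cutoff."""

    def replace(match):
        residue = match.group(1)
        probability = float(match.group(2))
        # Residues at or below the cutoff keep their letter but get no "p".
        return "p" + residue if probability > cutoff else residue

    return _SITE.sub(replace, annotated)


single = mark_phosphosites("AAS(0.9)PT(0.1)K")
double = mark_phosphosites("S(0.99)LT(0.8)R", cutoff=0.75)
print(single, double)
```

Raising the cutoff in the second call to 0.9 would leave only the first residue marked, which is how a stricter cutoff reduces the number of reported sites.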

Note that the "ProTeomiX Quality Control Report" [Bielow 2016] (available at https://github.com/cbielow/PTXQC/) is run by the Galaxy wrappers for MaxQuant, so it is omitted here even though it was included in Larry Cheng's original script.

Input dataset

Phospho (STY)Sites.txt
This is the Phospho (STY)Sites.txt file produced by MaxQuant. If you use the desktop version of MaxQuant, you will find this file in the txt folder.

Input parameters

Localization probability cutoff
Minimum localization probability; see above.
Intensity merge-function
Specifies how intensities for identical phosphosites should be merged. Choosing "sum" means that relative intensities reflect number of phospho-residues; choosing "average" means that relative intensities reflect number of phospho-peptides.
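
The difference between the two merge functions can be seen in a small Python sketch. It is illustrative only (the tool itself wraps an R script), and the function name merge_intensities and the row layout of (phosphopeptide, list of per-sample intensities) are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean


def merge_intensities(rows, how="sum"):
    """Merge rows that share a phosphopeptide, sample by sample.

    rows: iterable of (phosphopeptide, [intensity for each sample]).
    how:  "sum" or "average".
    """
    if how not in ("sum", "average"):
        raise ValueError("how must be 'sum' or 'average'")
    combine = sum if how == "sum" else mean

    grouped = defaultdict(list)
    for peptide, intensities in rows:
        grouped[peptide].append(intensities)

    # zip(*values) regroups the intensities by sample (column-wise).
    return {
        peptide: [combine(column) for column in zip(*values)]
        for peptide, values in grouped.items()
    }


rows = [
    ("ApSK", [10.0, 20.0]),
    ("ApSK", [30.0, 40.0]),
    ("pTK", [5.0, 5.0]),
]
summed = merge_intensities(rows, how="sum")
averaged = merge_intensities(rows, how="average")
print(summed)
print(averaged)
```

With "sum", the duplicated peptide's intensity grows with the number of contributing rows; with "average", it stays on the scale of a single row.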

Output datasets

ppep_intensities
Data table (in tabular format) presenting, for each sample, the mass-spectral intensity of each phosphopeptide having localization probability greater than the cutoff.
enrichment.pdf
Graph (in PDF format) presenting non-zero proportions of pS, pT, and pY among the phosphosites; note that a phosphopeptide may have multiple phosphosites.
locProbCutoff.pdf
Graph (in PDF format) contrasting proportion of phosphopeptides above the localization probability cutoff with the proportion below.
enrichment.svg
Enrichment graph (in downloadable "scalable vector graphics" format) for incorporation into documents.
locProbCutoff.svg
Localization probability cutoff graph (in downloadable "scalable vector graphics" format) for incorporation into documents.
filteredData
Data table (in tabular format) comprising rows of the Phospho (STY)Sites.txt input file that are not flagged as contaminants or reversed sequences.
quantData
Data table (in tabular format) comprising rows of the filteredData file whose localization probability exceeds the Localization Probability Cutoff parameter.
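
The relationship between filteredData and quantData can be sketched in Python as two successive filters. This is a hedged illustration, not the tool's code: the column names "Reverse", "Potential contaminant", and "Localization prob", and the use of "+" as the flag value, are assumptions about the MaxQuant table layout, and the helper names are hypothetical.

```python
def is_flagged(row, flag_columns=("Reverse", "Potential contaminant")):
    """True when any flag column carries a '+' mark (assumed convention)."""
    return any(row.get(column, "").strip() == "+" for column in flag_columns)


def split_outputs(rows, cutoff=0.75, prob_column="Localization prob"):
    """Return (filtered, quant) lists mirroring filteredData and quantData."""
    # First filter: drop rows flagged as reversed sequences or contaminants.
    filtered = [row for row in rows if not is_flagged(row)]
    # Second filter: keep rows whose probability exceeds the cutoff.
    # A strict ">" is used here; whether the boundary is inclusive in the
    # real tool is not stated in this help text.
    quant = [row for row in filtered if float(row[prob_column]) > cutoff]
    return filtered, quant


rows = [
    {"id": "1", "Reverse": "+", "Potential contaminant": "", "Localization prob": "0.99"},
    {"id": "2", "Reverse": "", "Potential contaminant": "+", "Localization prob": "0.95"},
    {"id": "3", "Reverse": "", "Potential contaminant": "", "Localization prob": "0.60"},
    {"id": "4", "Reverse": "", "Potential contaminant": "", "Localization prob": "0.98"},
]
filtered, quant = split_outputs(rows)
print([row["id"] for row in filtered], [row["id"] for row in quant])
```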

Authors

Nicholas A. Graham
(ORCiD 0000-0002-6811-1941) initiated the original script.
Larry C. Cheng
(ORCiD 0000-0002-6922-6433) updated the original script.
Arthur C. Eschenlauer
(ORCiD 0000-0002-2882-0508) adapted the script to run in Galaxy.
James E. Johnson
(University of Minnesota Supercomputing Institute) adapted the script to run in Galaxy.

Phosphoproteomic Enrichment Pipeline Upstream Kinase Mapping

This step searches phosphopeptides against several databases for known or predicted sites.

Input databases

networkin
This table is the result of filtering the NetworKIN database [Linding 2007; Horn 2014] for cutoff score > 2.0. The data used to generate the file were from Ensembl, ensembl.org [Howe 2021].

To generate this file:

  1. Download the "precomputed data for all available kinase predictors against ENSEMBL" (available at the NetworKIN predictions link on the downloads page at https://web.archive.org/web/20200208000403/http://networkin.info/download/networkin_human_predictions_3.1.tsv.xz; N.B.: "Commercial users are requested to contact the authors before using the data on the networkin.info website");
  2. Decompress the .tsv.xz file with "unxz" (from XZ Utils https://tukaani.org/xz/);
  3. Filter out the rows having "networkin_score" less than 2.0.
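
Steps 2 and 3 can also be combined in a short Python sketch that reads the compressed file directly. This is one possible approach under stated assumptions, not a prescribed procedure: the function names and the output path are hypothetical, and rows scoring exactly 2.0 are kept here because step 3 removes only rows below 2.0.

```python
import csv
import io
import lzma


def filter_networkin(src, dst, min_score=2.0, score_column="networkin_score"):
    """Copy the header and every row scoring at least min_score; return the count."""
    reader = csv.DictReader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
    writer = csv.DictWriter(
        dst,
        fieldnames=reader.fieldnames,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
        lineterminator="\n",
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        if float(row[score_column]) >= min_score:
            writer.writerow(row)
            kept += 1
    return kept


def filter_networkin_file(xz_path, out_path, min_score=2.0):
    """Decompress xz_path on the fly and write the filtered table to out_path."""
    with lzma.open(xz_path, "rt", newline="") as src:
        with open(out_path, "w", newline="") as dst:
            return filter_networkin(src, dst, min_score)


# In-memory demonstration with a reduced, made-up set of columns.
demo = (
    "#substrate\tposition\tid\tnetworkin_score\n"
    "P1\t10\tAKT1\t3.5\n"
    "P2\t20\tCDK1\t1.2\n"
)
out = io.StringIO()
kept = filter_networkin(io.StringIO(demo), out)
print(kept)
print(out.getvalue())
```

For the real download, a call such as filter_networkin_file("networkin_human_predictions_3.1.tsv.xz", "networkin_cutoff_2.tsv") would apply the same filter; the output file name is only an example.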

The result should be a tab-separated file with the following columns:

  • #substrate
  • position
  • id
  • networkin_score
  • tree
  • netphorest_group
  • netphorest_score
  • string_identifier
  • string_score
  • substrate_name
  • sequence
  • string_path

p_sty_motifs

This database merges motif patterns from [Amanchy 2007] and Phosida [Gnad 2011].

The Amanchy data are adapted from https://web.archive.org/web/*/http://hprd.org/serine_motifs and https://web.archive.org/web/*/http://hprd.org/tyrosine_motifs (both links cite the reference where each motif was published), and the patterns are translated into Perl regular expression format (https://perldoc.perl.org/perlre).

The Phosida data are adapted (translated to Perl-formatted regular expressions) from http://pegasus.biochem.mpg.de/phosida/help/motifs.aspx (this link cites the reference where each motif was published).

This file has three tab-separated columns (and no header):

  • column 1 is an (ignored) identifier
  • column 2 is a Perl regular expression
  • column 3 is a descriptor.

Two examples:

2<TAB>R.R..(pS|pT)<TAB>Akt kinase substrate motif (HPRD)

10<TAB>R..(pS|pT)V<TAB>CAMK2_Phosida
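
To show how such a file might be applied, the following Python sketch loads the two example rows and searches "p"-annotated peptides for matching motifs. It is illustrative only: the tool's own matching logic is not described here, the function names and the test peptides are made up, and Python's re module is used as a stand-in because these simple patterns behave the same way in Perl and Python.

```python
import csv
import io
import re


def load_motifs(handle):
    """Read (identifier, regex, descriptor) rows; the identifier is ignored."""
    motifs = []
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        _identifier, pattern, descriptor = row[:3]
        motifs.append((re.compile(pattern), descriptor))
    return motifs


def matching_motifs(peptide, motifs):
    """Return descriptors of every motif found anywhere in the peptide."""
    # Note: "." in a pattern also matches the "p" marker itself, so patterns
    # are applied to the annotated string exactly as written.
    return [descriptor for regex, descriptor in motifs if regex.search(peptide)]


motif_file = io.StringIO(
    "2\tR.R..(pS|pT)\tAkt kinase substrate motif (HPRD)\n"
    "10\tR..(pS|pT)V\tCAMK2_Phosida\n"
)
motifs = load_motifs(motif_file)

akt_hits = matching_motifs("GRPRTSpSFAEG", motifs)   # hypothetical peptide
camk2_hits = matching_motifs("ARAApTVL", motifs)     # hypothetical peptide
print(akt_hits, camk2_hits)
```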

psp_kinase_substrate

'Kinase-substrate dataset: experimentally determined substrates, sequences, cognate kinases, and metadata curated from the literature' [Hornbeck 2011]. This tabular-formatted file may be downloaded for non-commercial purposes as 'Kinase_Substrate_Dataset.gz' from https://www.phosphosite.org/staticDownloads.action.

Data extracted from PhosphoSitePlus(R), created by Cell Signaling Technology Inc. PhosphoSitePlus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License (https://creativecommons.org/licenses/by-nc-sa/3.0/). Attribution must be given in written, oral and digital presentations to PhosphoSitePlus, www.phosphosite.org. Written documents should additionally cite:

Hornbeck PV, Kornhauser JM, Tkachev S, Zhang B, Skrzypek E, Murray B, Latham V, Sullivan M (2012) PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic Acids Res. 40, D261-D270.; www.phosphosite.org.

psp_regulatory_sites

'Regulatory sites: information curated from the literature about modification sites shown to regulate molecular functions, biological processes, and molecular interactions including protein-protein interactions' [Hornbeck 2011]. This tabular-formatted file may be downloaded for non-commercial purposes as 'Regulatory_sites.gz' from https://www.phosphosite.org/staticDownloads.action.

Terms of use and citation are as for the psp_kinase_substrate file.

Output datasets

ppep_map

Data table (in tabular format, consumed by the merge/filter step) presenting, for each phosphopeptide, the kinase mappings, the mass-spectral intensities for each sample, and the metadata from UniProtKB/SwissProt, phospho-sites, phospho-motifs, and regulatory sites. Data in the columns marked "Domain", "ON_...", or "..._PhosphoSite" are available subject to the following terms:

"PhosphoSitePlus® (PSP) was created by Cell Signaling Technology Inc. It is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License(https://creativecommons.org/licenses/by-nc-sa/3.0/). When using PSP data or analyses in printed publications or in online resources, the following acknowledgements must be included: (a) the words 'PhosphoSitePlus(R), www.phosphosite.org' must be included at appropriate places in the text or webpage, and (b) citation of [Hornbeck 2011 (PMID: 25514926)] must be included in the bibliography."
melted
Data table (in tabular format) presenting, for each phosphopeptide, the gene and one of the phospho-motifs or kinase-substrate sites.
ppep_mapping_sqlite
SQLite database (consumed by the merge/filter step).

Authors

Nicholas A. Graham
(ORCiD 0000-0002-6811-1941) wrote the original script.
Arthur C. Eschenlauer
(ORCiD 0000-0002-2882-0508) adapted the script to run in Galaxy.

Phosphoproteomic Enrichment Pipeline Merge and Filter

This step merges mapped metadata into metadata for phosphopeptides, filtering by species.

Input parameters

species
Limit PhosphoSitePlus data to the indicated species. Default: human

Output datasets

preproc_tab

Phosphopeptides annotated with SwissProt and phosphosite metadata, in tabular format. This file is designed to be consumed by the downstream ANOVA tool. Some data in the columns marked "PSP" are available subject to the following terms:

"PhosphoSitePlus® (PSP) was created by Cell Signaling Technology Inc. It is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License(https://creativecommons.org/licenses/by-nc-sa/3.0/). When using PSP data or analyses in printed publications or in online resources, the following acknowledgements must be included: (a) the words 'PhosphoSitePlus(R), www.phosphosite.org' must be included at appropriate places in the text or webpage, and (b) citation of [Hornbeck 2011 (PMID: 25514926)] must be included in the bibliography."
preproc_csv
Phosphopeptides annotated with SwissProt and phosphosite metadata, in CSV format.
preproc_sqlite
ppep_mapping_sqlite updated with annotations, in SQLite format.

Authors

Nicholas A. Graham
(ORCiD 0000-0002-6811-1941) initiated the original script.
Larry C. Cheng
(ORCiD 0000-0002-6922-6433) updated the original script.
Arthur C. Eschenlauer
(ORCiD 0000-0002-2882-0508) adapted the script to run in Galaxy.