MaxQuant Phosphopeptide ANOVA (version 0.1.19+galaxy0)

Filtered phosphopeptide intensities (tabular):

'preproc_tab' dataset produced by 'MaxQuant Phosphopeptide Preprocessing' tool

ANOVA alpha cutoff level (tabular):

ANOVA alpha cutoff values for significance testing: tabular data having one column and no header

Database from mqppep_preproc (sqlite):

'preproc_sqlite' dataset produced by 'MaxQuant Phosphopeptide Preprocessing' tool

Intensity-column pattern:

Pattern matching columns that have peptide intensity data (PERL-compatible regular expression matching column label)

Imputation method:

Impute missing values by (1) using median for each sample-group; (2) using median across all samples; (3) using mean across all samples; or (4) using randomly generated values having same SD as across all samples (with mean specified by 'Mean percentile for random values')

Mean percentile for random values:

Percentile center of random values; range [1,99]

Percentile SD for random values:

Standard deviation adjustment-factor for random values; real number. (1.0 means SD of random values equal to the SD for the entire data set.)

Sample-name extraction pattern:

Pattern extracting sample-names from names of columns of 'Filtered phosphopeptide intensities' that have peptide intensity data (PERL-compatible regular expression)

Sample-group extraction pattern:

Pattern extracting sample-group from the extracted sample-names (PERL-compatible regular expression)

Minimum number of values per sample-group:

Only consider as comparable those intensities having at least this number of values in each sample-group (range [0,∞])

Filter sample-groups:

What filter should be applied to sample-group names? (1) 'none', no filter; (2) 'include', match is required; (3) 'exclude', match is forbidden.

Minimum number of kinase-substrates for KSEA:

Minimum number of substrates to consider any kinase for KSEA (range [1,∞])

KSEA threshold level:

Maximum FDR to be used to score a kinase enrichment as significant; see warning against setting this too low in help text below.

Use abs(log2(fold-change)) for KSEA:

Should abs(log2(fold-change)) be used for KSEA? (Checking this may alter, and possibly reduce, the number of hits.)

Minimum quality of substrates for KSEA:

Minimum 'quality' of substrates to be considered for KSEA (range [0,∞]); higher numbers reduce the number of substrates considered - see help text below.

Phosphoproteomic Enrichment Pipeline ANOVA and KSEA

Overview

Perform statistical analysis of preprocessed MaxQuant output data collected as described in [Cheng, 2018].

Extracts sample-group IDs from sample names.

Imputes missing values.

Performs ANOVA analysis for each phosphopeptide.

Performs Kinase-Substrate Enrichment Analysis (KSEA) using the method described by Casado et al. (2013); see "Algorithm" section below.

Workflow position

Upstream tool: The "MaxQuant Phosphopeptide Preprocessing" tool (mqppep_preproc) that transforms MaxQuant output for phosphoproteome-enriched samples into a form suitable for statistical analysis.

Input datasets

Filtered phosphopeptide intensities (tabular)

Phosphopeptides annotated with SwissProt and phosphosite metadata (in tabular format). This is the output from the "MaxQuant Phosphopeptide Preprocessing" (mqppep_preproc) tool.

First column label 'Phosphopeptide'.

Sample intensities must begin in the first column matching 'Intensity-column pattern', and their column labels must match 'Sample-name extraction pattern'.

ANOVA alpha cutoff level (tabular)

List of alpha cutoff values for significance testing; text file having one column and no header. For example:

0.2
0.1
0.05
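
As an illustration of the expected layout, the following minimal Python sketch parses such a one-column, headerless file. The function name read_alpha_cutoffs is hypothetical and is not part of the tool.

```python
def read_alpha_cutoffs(text):
    """Parse one alpha cutoff per line; blank lines are ignored."""
    return [float(line) for line in text.splitlines() if line.strip()]


# The example file shown above, as a string.
cutoffs = read_alpha_cutoffs("0.2\n0.1\n0.05\n")
```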

Database from mqppep_preproc (sqlite): SQLite database produced by the "MaxQuant Phosphopeptide Preprocessing" (mqppep_preproc) tool.

Input parameters

Intensity-column pattern

First column of Filtered phosphopeptide intensities having intensity values (integer or PERL-compatible regular expression matching column label). Default:

^Intensity[^_]
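
To show what the default pattern selects, here is a small sketch in Python. The column labels are made up for illustration, and Python's re module stands in for the PERL-compatible engine that the tool uses; for this simple pattern the two should behave the same.

```python
import re

# Hypothetical column labels; only the pattern is the tool's default.
columns = [
    "Phosphopeptide",
    "Sequence7",
    "Intensity.splunge.10A",
    "Intensity.splunge.10B",
    "Intensity_sum",
]

pattern = re.compile(r"^Intensity[^_]")

# Columns whose label starts with "Intensity" followed by anything but "_".
intensity_columns = [c for c in columns if pattern.search(c)]

# Position of the first matching column, where the intensity data begin.
first_intensity_index = columns.index(intensity_columns[0])
```

Here "Intensity_sum" is rejected because the character after "Intensity" is an underscore.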

Imputation method

Impute missing values by:

group-median - use median for each sample-group;

median - use median across all samples;

mean - use mean across all samples; or

random - use randomly generated values where:

Mean percentile for random values specifies the percentile among non-missing values to be used as mean of random values, and

Percentile SD for random values specifies the factor to be multiplied by the standard deviation among the non-missing values (across all samples) to determine the standard deviation of random values.
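
The four methods can be sketched roughly as follows in Python. This is an illustration under stated assumptions, not the tool's code: it assumes imputation is done per phosphopeptide (row-wise), that missing values are represented as None, that the random values are normally distributed, and that the percentile is taken by nearest rank. The names impute_row and random_fill are hypothetical.

```python
import random
import statistics


def impute_row(values, groups, method):
    """Fill None entries in one phosphopeptide's intensities (row-wise: an assumption)."""
    observed = [v for v in values if v is not None]
    if method == "group-median":
        fills = {}
        for g in set(groups):
            in_group = [v for v, vg in zip(values, groups) if vg == g and v is not None]
            # A group with no observed values cannot be filled this way.
            fills[g] = statistics.median(in_group) if in_group else None
        return [fills[g] if v is None else v for v, g in zip(values, groups)]
    if method == "median":
        fill = statistics.median(observed)
    elif method == "mean":
        fill = statistics.mean(observed)
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]


def random_fill(all_observed, mean_percentile, sd_factor, rng):
    """Draw one value centered on a percentile of the non-missing values."""
    ordered = sorted(all_observed)
    # Nearest-rank percentile; the tool may interpolate differently.
    rank = round(mean_percentile / 100 * (len(ordered) - 1))
    center = ordered[rank]
    # SD of the random values = factor x SD of the non-missing values.
    spread = sd_factor * statistics.stdev(ordered)
    return rng.gauss(center, spread)


values = [10.0, None, 14.0, 20.0, None, 30.0]
groups = ["10", "10", "10", "20", "20", "20"]
by_group = impute_row(values, groups, "group-median")
by_median = impute_row(values, groups, "median")
by_mean = impute_row(values, groups, "mean")
```

With an SD factor of 1.0, random_fill draws values as widely spread as the observed data; smaller factors concentrate the imputed values around the chosen percentile.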

Sample-name extraction pattern

PERL-compatible regular expression extracting the sample-name from the name of a column of intensities (from Filtered phosphopeptide intensities) for one sample.

For example, "\.\d+[A-Z]$" applied to "Intensity.splunge.10A" would produce ".10A".

Note that this is case sensitive by default.

Sample-group extraction pattern

PERL-compatible regular expression extracting the sample-grouping from the sample-name (that was in turn extracted with Sample-name extraction pattern from a column of intensities from Filtered phosphopeptide intensities).

For example, "\d+" applied to ".10A" would produce "10".

Note that this is case sensitive by default.
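
The two-stage extraction can be tried out with a short Python sketch. Python's re module is used here as a stand-in for the PERL-compatible engine; the column label is the one from the example above.

```python
import re

column = "Intensity.splunge.10A"

# Stage 1: extract the sample-name from the column label.
sample_name = re.search(r"\.\d+[A-Z]$", column).group(0)

# Stage 2: extract the sample-group from the sample-name.
# A pattern anchored with a trailing "$" (such as "\d+$") would not match
# here, because the sample-name ends in a letter rather than a digit.
sample_group = re.search(r"\d+", sample_name).group(0)

anchored = re.search(r"\d+$", sample_name)  # no match for ".10A"
```

Samples "10A", "10B", and "10C" would all fall into sample-group "10" under these patterns.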

Minimum number of values per sample-group

Sometimes you may wish to filter out the intensities that are poorly represented among some sample groups because they complicate the comparison process. You can use this parameter to specify the minimum number of values required in each sample-group (range [0,∞]).
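
A minimal Python sketch of such a filter follows. It assumes that the count is taken over non-missing values for one phosphopeptide and that missing values are represented as None; the function name has_enough_values is hypothetical.

```python
def has_enough_values(values, groups, minimum):
    """True if every sample-group has at least `minimum` non-missing values."""
    counts = {g: 0 for g in groups}
    for value, group in zip(values, groups):
        if value is not None:
            counts[group] += 1
    return all(n >= minimum for n in counts.values())


values = [10.0, None, 14.0, 20.0, None, None]
groups = ["10", "10", "10", "20", "20", "20"]

# Group "10" has two observed values, group "20" has only one.
keep_with_min_2 = has_enough_values(values, groups, 2)
keep_with_min_1 = has_enough_values(values, groups, 1)
```

A minimum of zero keeps every phosphopeptide, since no group can have fewer than zero values.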

Filter sample-groups

Sometimes you may have spectra that are for treatments that you are not considering for your comparison. You can specify a filter (or not) for sample-group names; if you do, you can specify whether groups that match your criteria should be excluded from the analysis ("forbidden") or included in the analysis ("required").

Sample-group matching mode

The R base::grep function that is used here for pattern matching is exhaustively documented at https://rdrr.io/r/base/grep.html. You make two choices here. The first is whether to differentiate lowercase and uppercase characters. The second is whether to require exact matches ("fixed" pattern-matching mode), to use PERL-compatible regular expressions ("perl"), or to use extended regular expressions ("grep").

Sample-group matching pattern

This is a comma-separated list of patterns to match to group-names, according to the Sample-group matching mode that you have chosen.
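
The interaction of filter, matching mode, and pattern list can be sketched in Python as below. This is only an approximation of the behavior described above: Python's re module replaces R's grep, "fixed" is modeled as a literal substring match (an assumption), and the function name filter_groups and its parameters are hypothetical.

```python
import re


def filter_groups(groups, patterns_csv, mode="none", fixed=False, ignore_case=False):
    """Apply an 'include' or 'exclude' filter to sample-group names."""
    if mode == "none":
        return list(groups)
    flags = re.IGNORECASE if ignore_case else 0
    # Split the comma-separated list, dropping empty entries.
    patterns = [p.strip() for p in patterns_csv.split(",") if p.strip()]
    # In "fixed" mode the pattern is matched literally, not as a regex.
    compiled = [re.compile(re.escape(p) if fixed else p, flags) for p in patterns]

    def matches(group):
        return any(rx.search(group) for rx in compiled)

    if mode == "include":
        return [g for g in groups if matches(g)]
    if mode == "exclude":
        return [g for g in groups if not matches(g)]
    raise ValueError(f"unknown mode: {mode}")


groups = ["10", "20", "30"]
included = filter_groups(groups, "^1,^3", mode="include")
excluded = filter_groups(groups, "^1,^3", mode="exclude")
literal = filter_groups(groups, "^1", mode="include", fixed=True)
```

In the last call the caret is treated as an ordinary character, so no group name matches.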

Minimum number of kinase-substrates for KSEA

For KSEA, you may decide that you wish to ignore kinases having fewer substrates than some minimum; specify that minimum here (range [1,∞]).

KSEA threshold level

Specifies the maximum FDR at which a kinase will be considered to be enriched; the default choice of 0.05 is arbitrary and may exclude kinases that are interesting. The KSEA FDR perhaps should not be treated as conservatively as would be appropriate for hypothesis testing. For example, at an FDR of 0.05, for every 20 kinases that one discards, 19 are likely truly enriched.

Use abs(log2(fold-change)) for KSEA

When TRUE, consider only the magnitude of the differences across the contrast for all of the substrates when aggregating them to assess the enrichment of a given kinase's substrates. When FALSE, also consider the direction. Surprisingly, setting this to TRUE may decrease the number of enriched kinases.
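
One way to see why taking absolute values can reduce enrichment is the toy Python example below. It assumes a KSEAapp-style z-score (mean of the kinase's substrates minus the mean of all phosphosites, scaled by the square root of the substrate count and divided by the overall standard deviation); the tool's adapted code may differ, and the fold-change values are invented.

```python
import math
import statistics


def ksea_z(substrate_values, all_values):
    """KSEAapp-style z-score for one kinase (an assumed form of the score)."""
    m = len(substrate_values)
    shift = statistics.mean(substrate_values) - statistics.mean(all_values)
    return shift * math.sqrt(m) / statistics.stdev(all_values)


# Invented log2(fold-change) values for all phosphosites in one contrast.
all_log2fc = [-2.0, -2.0, -1.5, 1.5, 2.0, 2.0, 0.1, -0.1]
# Substrates of one hypothetical kinase, all increased.
substrates = [1.5, 2.0, 2.0]

z_signed = ksea_z(substrates, all_log2fc)
z_abs = ksea_z([abs(v) for v in substrates], [abs(v) for v in all_log2fc])
```

With signed values the background mean is near zero, so consistently increased substrates stand out. With absolute values the background mean rises because decreases now count as large values too, and in this example the kinase's score drops.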

Minimum quality of substrates for KSEA

An arbitrary "quality score" is assigned to each substrate, as described in the PDF report produced by the tool. This score takes into account both FDR-adjusted p-value and the number of missing values for each substrate. Setting the minimum to zero retains all substrates, which may be a large number.

Outputs

Report dataset

[input file].[imputation method]-imputed_report

Summary report for normalization, imputation, and ANOVA, in PDF format.

Imputed intensities

[input file].[imputation method]-imputed_intensities

Phosphopeptide MS intensities where missing values have been imputed by the chosen method, in tabular format.

Imputed quantile-normalized log-transformed intensities

[input file].[imputation method]-imputed_QN_LT_intensities

Phosphopeptide MS intensities where missing values have been imputed by the chosen method, quantile-normalized (QN), and log10-transformed (LT), in tabular format.
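
For readers unfamiliar with the two transformations, here is a minimal Python sketch. It is a textbook form of quantile normalization with naive tie handling, not the tool's implementation, and the order in which the tool applies normalization and log transformation is not assumed here. The function names are hypothetical.

```python
import math


def quantile_normalize(columns):
    """Give every sample column the same distribution of values.

    `columns` is a list of equal-length sample columns with no missing values.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)]
    normalized = []
    for col in columns:
        # Indices of this column ordered from smallest to largest value.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


def log10_transform(column):
    """Log10-transform strictly positive intensities."""
    return [math.log10(v) for v in column]


samples = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(samples)
```

After normalization each sample keeps the rank order of its phosphopeptides, but all samples share the same set of values.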

ANOVA KSEA metadata

[input file].[imputation method]-imputed_anova_ksea_metadata

Phosphopeptide metadata including ANOVA significance and KSEA enrichments, in tabular format.

KSEA SQLite database

[input file].[imputation method]-imputed_ksea_sqlite

An SQLite database that is usable for ad hoc report creation.
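
As a starting point for ad hoc reporting, the Python sketch below lists the tables and views in an SQLite database using the standard sqlite_master catalog. The schema of the tool's database is not documented here, so no table names are assumed; the function name list_tables and the file name in the comment are hypothetical.

```python
import sqlite3


def list_tables(connection):
    """Return the names of all tables and views in an SQLite database."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type IN ('table', 'view') ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# For a downloaded dataset one would connect to the file instead, e.g.
# sqlite3.connect("ksea.sqlite") (hypothetical file name).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE demo_table (x INTEGER)")
tables = list_tables(demo)
```

Once the table names are known, ordinary SELECT statements can be used to build custom reports.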

Algorithm

The KSEA algorithm used here is as in the KSEAapp package as reported in [Wiredja 2017]. The code is adapted from "Danica D. Wiredja (2017). KSEAapp: Kinase-Substrate Enrichment Analysis. R package version 0.99.0." to work with output from the "MaxQuant Phosphopeptide Preprocessing" Galaxy tool and the multiple kinase-substrate databases that the latter tool searches.
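
For orientation, the scoring used by the KSEAapp package can be summarized as follows; this is a summary of that package's approach, and the adapted code in this tool may differ in detail. For kinase k with m_k substrates, let \bar{s}_k be the mean log2(fold-change) of its substrates, \bar{p} the mean log2(fold-change) of all phosphosites in the data set, and \delta the standard deviation of log2(fold-change) across all phosphosites. The symbols are chosen here for illustration.

```latex
z_k = \frac{(\bar{s}_k - \bar{p})\,\sqrt{m_k}}{\delta},
\qquad
p_k = \Phi\!\left(-\lvert z_k \rvert\right)
```

Here \Phi is the standard normal cumulative distribution function. The p-values are then adjusted for multiple testing to give the FDR that is compared against the KSEA threshold level.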

Authors

Larry C. Cheng: (ORCiD 0000-0002-6922-6433) wrote the original script.
Arthur C. Eschenlauer: (ORCiD 0000-0002-2882-0508) adapted the script to run in Galaxy.

PERL-compatible regular expressions

Note that the PERL-compatible regular expressions accepted by this tool are documented at http://rdrr.io/r/base/regex.html