Author - Arthur Eschenlauer (University of Minnesota, esch0041@umn.edu)
Release Notes - https://github.com/HegemanLab/w4mcorcov_galaxy_wrapper#release-notes
OPLS-DA and the SIMCA S-PLOT (Wiklund et al., 2008) may be employed to draw attention to metabolomic features that are potential biomarkers, i.e. features that are potentially useful when assigning a sample to one of two classes (e.g. Sun et al., 2016). Workflow4Metabolomics (W4M, Giacomoni et al., 2014, Guitton et al., 2017) provides a suite of tools for preprocessing and statistical analysis of LC-MS, GC-MS, and NMR metabolomics data; however, it does not (as of release 3.2) include a tool for making the equivalent of an S-PLOT.
The S-PLOT is computed from mean-centered, pareto-scaled data. This plot presents the correlation of the first score vector from an OPLS-DA model with the sample-variables used to produce that model versus the covariance of the scores with the sample-variables. For OPLS-DA, the first score vector represents the variation among the sample-variables that is related to the predictor (i.e., the contrasting factor); the second score vector, variation that is orthogonal to the predictor.
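As a minimal sketch of the computation described above (not the tool's own implementation), the following Python code pareto-scales each feature and then computes its covariance and correlation with a score vector. The score vector would normally come from a fitted OPLS-DA model; here the scores, the feature names `M1` and `M2`, and all intensities are hypothetical values chosen only for illustration.

```python
import math
import statistics


def pareto_scale(column):
    """Mean-center a feature, then divide by the square root of its standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)
    return [(x - mean) / math.sqrt(sd) for x in column]


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    return covariance(a, b) / (statistics.stdev(a) * statistics.stdev(b))


def s_plot_coordinates(scores, features):
    """Return {feature name: (covariance, correlation)} against the score vector."""
    coords = {}
    for name, column in features.items():
        scaled = pareto_scale(column)
        coords[name] = (covariance(scores, scaled), correlation(scores, scaled))
    return coords


# Hypothetical first score vector (one value per sample) and feature intensities.
scores = [-2.0, -1.0, 1.0, 2.0]
features = {
    "M1": [10.0, 12.0, 18.0, 20.0],  # tracks the scores closely
    "M2": [5.0, 5.5, 4.5, 5.0],      # weakly and inversely related
}
coords = s_plot_coordinates(scores, features)
for name, (cov, cor) in coords.items():
    print(name, round(cov, 3), round(cor, 3))
```

Each feature then becomes one point on the plot, with covariance on one axis and correlation on the other.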
The primary aims of this tool are:
Note: This tool only supports categorical factors with non-numeric level-names.
The purpose of the 'OPLS-DA Contrasts' tool is to visualize GC-MS or LC-MS features that are possible biomarkers.
The W4M 'Univariate' tool (Thévenot et al., 2015) adds the results of family-wise corrected pairwise significance-tests as columns of the variableMetadata dataset. For instance, if Kruskal-Wallis testing were performed on a column named 'cluster' in sampleMetadata that has values 'k1' and 'k2' and at least one other value:
The 'OPLS-DA Contrasts' tool produces graphics and data for OPLS-DA contrasts of feature-intensities between significantly different pairs of factor-levels. For each factor-level, the tool performs a contrast with all other factor-levels combined and then separately with each other factor-level.
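The set of contrasts described above can be sketched in Python as follows. This is an illustrative enumeration only: the label `"other"` for the combined remaining levels is a hypothetical placeholder, and the sketch does not apply any significance filtering. It does reflect the behavior noted in Example 4, where a two-level factor suppresses the "one level vs. all others" contrasts.

```python
from itertools import combinations


def enumerate_contrasts(levels):
    """List (level, comparison) pairs for the contrasts of a factor.

    For three or more levels, each level is first contrasted with all other
    levels combined (labelled "other" here, a hypothetical placeholder), and
    then every pair of levels is contrasted. For a two-level factor only the
    single pairwise contrast is produced.
    """
    contrasts = []
    if len(levels) > 2:
        contrasts.extend((level, "other") for level in levels)
    contrasts.extend(combinations(levels, 2))
    return contrasts


print(enumerate_contrasts(["k1", "k2", "k3", "k4"]))
print(enumerate_contrasts(["low", "high"]))
```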
Along the left-to-right axis, the plots show the supervised projection of the variation explained by the predictor (i.e., the factor specified when invoking the tool); the top-to-bottom axis displays the variation that is orthogonal to the predictor level (i.e., independent of it).
This tool can be used in a purely exploratory manner by supplying the variableMetadata file without the columns added by the W4M 'Univariate' tool. A preferable workflow, however, may be to use univariate testing to exclude features that are not significantly different and then to use OPLS-DA to visualize the differences identified in univariate testing (Thévenot et al., 2015); an appropriate exception would be to visualize contrasts of a specific list of metabolites. If you do exclude features, make sure that you set the advanced parameter "How many features for p-value calculation?" accordingly.
It must be stressed that there may be no single definitive computational approach to select features that are reliable biomarkers, especially from a small number of samples or experiments. A few possible choices are:
In this spirit, this tool reports the S-PLOT covariance and correlation (Wiklund op. cit.) and VIP metrics, and it introduces an informal "salience" metric to flag features that may merit attention without dimensional reduction; future versions may add selectivity ratio.
For a more systematic approach to biomarker identification, please consider the W4M 'biosigner' tool (Rinaudo et al., 2016), which applies three different identification metrics to the selection process. Regardless of how any potential biomarker is identified, further validation analysis (e.g., independent confirmatory experiments) is needed before it can be recommended for general application.
File                 Format
Data matrix          tabular
Sample metadata      tabular
Variable metadata    tabular
File                                         Format
Contrast detail
Contrast "correlation and covariance" data   tabular
Feature "salience" data                      tabular
(first row, left) correlation-versus-covariance plot of OPLS-DA results
- This is a work-alike for the S-PLOT described in Wiklund (op. cit.), ignoring samples with missing values;
- point-color becomes saturated as the "variable importance in projection to the predictive components" (VIP4,p from Galindo-Prieto op. cit.) ranges from 0.83 to 1.21 (Mehmood et al. 2012), to help identify features for consideration as biomarkers;
- plot symbols are diamonds when the p-value of the correlation, adjusted for family-wise error rate (Yekutieli et al., 2001), is greater than 0.05, circles when it is less than 0.01, and triangles when between 0.01 and 0.05.
(second row, left) model-overview plot for the two projections; grey bars are the correlation coefficient for the fitted data; black bars indicate performance in cross-validation tests (Thévenot, 2017)
(first row, right) OPLS-DA scores-plot for the two projections (Thévenot et al., 2015)
(second row, right) correlation-versus-covariance plot of OPLS-DA results orthogonal to the predictor (see section "S-Plot of Orthogonal Component" in Wiklund, op. cit., pp. 120-121; this characterizes variation of features that is independent of the predictor).
(third row, left, when "predictor C-plot" is chosen under "Advanced") plot of the correlation (or covariance) vs. the VIP4,p (Galindo-Prieto op. cit.), to assist in identifying features for consideration as biomarkers.
(third row, right, when "orthogonal C-plot" is chosen under "Advanced") plot of the correlation (or covariance) vs. the VIP4,o (ibid.), to assist in identifying features varying considerably without regard to the predictor.
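The point styling described for the first-row, left plot can be sketched as below. This is an assumption-laden illustration rather than the tool's code: the linear ramp between the two VIP bounds and the handling of p-values exactly equal to 0.01 or 0.05 are guesses, and the function names are hypothetical.

```python
VIP_LOW, VIP_HIGH = 0.83, 1.21  # bounds quoted in the plot description


def color_saturation(vip):
    """Map VIP4,p to a saturation in [0, 1]; the linear ramp is an assumption."""
    fraction = (vip - VIP_LOW) / (VIP_HIGH - VIP_LOW)
    return min(1.0, max(0.0, fraction))


def plot_symbol(adjusted_p):
    """Choose a plot symbol from the adjusted p-value of the correlation."""
    if adjusted_p < 0.01:
        return "circle"
    if adjusted_p <= 0.05:  # boundary handling is an assumption
        return "triangle"
    return "diamond"


# Hypothetical (VIP, adjusted p-value) pairs for three features.
for vip, p in [(0.70, 0.20), (1.02, 0.03), (1.50, 0.001)]:
    print(vip, round(color_saturation(vip), 2), plot_symbol(p))
```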
"wild card" patterns may be used to select level-names.
For example
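The following Python sketch illustrates wild-card matching in general. It assumes that the tool's wild cards behave like ordinary shell-style globs ('*' for any run of characters, '?' for exactly one character), which may not hold in every detail. The level-names and the helper `select_by_wildcard` are illustrative, not part of the tool.

```python
from fnmatch import fnmatchcase


def select_by_wildcard(level_names, pattern):
    """Return the level-names that match a shell-style wild-card pattern."""
    return [name for name in level_names if fnmatchcase(name, pattern)]


# Illustrative level-names.
levels = ["marq3", "marq6", "marq9", "marq12",
          "front3", "front6", "front9", "front12"]

print(select_by_wildcard(levels, "marq*"))   # names beginning with 'marq'
print(select_by_wildcard(levels, "front?"))  # 'front' plus exactly one character
print(select_by_wildcard(levels, "*12"))     # names ending in '12'
```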
"regular expression" patterns may be used to select level-names.
POSIX 1003.2 standard regular expressions allow precise pattern-matching and are exhaustively defined at: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
However, only a few basic building blocks of regular expressions need to be mastered for most cases:
Caveat: The tool wrapper uses the comma (',') to split a list of sample-level names, so commas may not be used within regular expressions for this tool.
First Example: Consider a field of level-names consisting of 'marq3,marq6,marq9,marq12,front3,front6,front9,front12'
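A minimal Python sketch of selecting from these level-names follows. It rests on several assumptions: Python's `re` module stands in for POSIX regular expressions (the basic building blocks used here behave the same way), a level-name is selected when any pattern is found within it, and the patterns themselves are illustrative. The comma split mirrors the caveat above, which is why none of the patterns contains a comma.

```python
import re


def select_by_regex(level_names, pattern_field):
    """Split a comma-separated field of patterns and keep names matching any of them."""
    patterns = [re.compile(p) for p in pattern_field.split(",") if p]
    return [name for name in level_names
            if any(p.search(name) for p in patterns)]


level_field = "marq3,marq6,marq9,marq12,front3,front6,front9,front12"
levels = level_field.split(",")

# Two patterns: 'marq' plus one digit, or 'front1' plus one digit.
print(select_by_regex(levels, "^marq[0-9]$,^front1[0-9]$"))
# Every name beginning with 'front'.
print(select_by_regex(levels, "^front"))
# Every name ending in '12'.
print(select_by_regex(levels, "12$"))
```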
Second Example: Consider these regular expression patterns as possible matches to a sample-level name 'AB0123':
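As a hedged illustration, the Python sketch below tests a handful of hypothetical patterns against 'AB0123'. The patterns are examples chosen here, not the tool's documented list, and the sketch assumes that a pattern matches when it is found anywhere in the name unless it is anchored with '^' or '$'.

```python
import re

name = "AB0123"

# Hypothetical patterns, each with a note on what it requires.
patterns = [
    "^AB",             # name begins with 'AB'
    "^B",              # name begins with 'B'
    "23$",             # name ends with '23'
    "^[A-Z]+[0-9]+$",  # one or more capitals followed only by digits
    "^AB[0-9]{3}$",    # 'AB' followed by exactly three digits
    "AB.1",            # 'AB', any single character, then '1'
]

results = {pattern: re.search(pattern, name) is not None for pattern in patterns}
for pattern, matched in results.items():
    print(pattern, "matches" if matched else "does not match")
```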
Input files
Example 1: Include in the analysis only features identified as pair-wise significant in the Univariate test.
Input Parameter or Result                   Value
Factor of interest                          k10
Univariate Significance-Test                kruskal
Retain only pairwise-significant features   Yes
Levels of interest                          k[12],k[3-4]
Level-name matching                         use regular expressions for matching level-names
Number of features having extreme loadings  ALL
How many features for p-value calculation?  250
Output primary table                        https://raw.githubusercontent.com/HegemanLab/w4mcorcov_galaxy_wrapper/master/tools/w4mcorcov/test-data/expected_contrast_corcov.tsv
Output salience table                       https://raw.githubusercontent.com/HegemanLab/w4mcorcov_galaxy_wrapper/master/tools/w4mcorcov/test-data/expected_contrast_salience.tsv
Output figures PDF                          https://raw.githubusercontent.com/HegemanLab/w4mcorcov_galaxy_wrapper/master/tools/w4mcorcov/test-data/expected_contrast_detail.pdf
Example 2: Include in the analysis only features identified as overall-significant in the Univariate test. Note that these features are included even in contrasts where they were not determined to be pair-wise significant in the Univariate test. Thus, more features are included than in Example 1.
Input Parameter or Result                   Value
Factor of interest                          k10
Univariate Significance-Test                kruskal
Retain only pairwise-significant features   No
Levels of interest                          *
Level-name matching                         use wild cards for matching level-names
Number of features having extreme loadings  5
How many features for p-value calculation?  ALL
Output primary table                        https://raw.githubusercontent.com/HegemanLab/w4mcorcov_galaxy_wrapper/master/tools/w4mcorcov/test-data/expected_contrast_corcov_all.tsv
Output salience table                       https://raw.githubusercontent.com/HegemanLab/w4mcorcov_galaxy_wrapper/master/tools/w4mcorcov/test-data/expected_contrast_salience_all.tsv
Output figures PDF                          https://raw.githubusercontent.com/HegemanLab/w4mcorcov_galaxy_wrapper/master/tools/w4mcorcov/test-data/expected_contrast_detail_all.pdf
Example 3: Include all features in the analysis without regard to Univariate testing. When 'none' is selected for the test, Univariate testing is not even a prerequisite to using the tool. Thus, more features are included than in Example 2.
Input Parameter or Result                   Value
Factor of interest                          k10
Univariate Significance-Test                none
Levels of interest                          k[12],k[3-4]
Level-name matching                         use regular expressions for matching level-names
Number of features having extreme loadings  0
How many features for p-value calculation?  ALL
Output primary table                        https://raw.githubusercontent.com/HegemanLab/w4mcorcov_galaxy_wrapper/master/tools/w4mcorcov/test-data/expected_contrast_corcov_global.tsv
Output salience table                       https://raw.githubusercontent.com/HegemanLab/w4mcorcov_galaxy_wrapper/master/tools/w4mcorcov/test-data/expected_contrast_salience_global.tsv
Output figures PDF                          https://raw.githubusercontent.com/HegemanLab/w4mcorcov_galaxy_wrapper/master/tools/w4mcorcov/test-data/expected_contrast_detail_global.pdf
Example 4: Analysis of a two-level factor (including all features). This suppresses the contrasts of "each factor vs. the aggregate of all the others".
Input Parameter or Result                   Value
Factor of interest                          lohi
Univariate Significance-Test                none
Levels of interest                          low,high
Level-name matching                         use regular expressions for matching level-names
Number of features having extreme loadings  3
How many features for p-value calculation?  ALL
Output primary table                        https://raw.githubusercontent.com/HegemanLab/w4mcorcov_galaxy_wrapper/master/tools/w4mcorcov/test-data/expected_contrast_corcov_lohi.tsv
Output salience table                       https://raw.githubusercontent.com/HegemanLab/w4mcorcov_galaxy_wrapper/master/tools/w4mcorcov/test-data/expected_contrast_salience_lohi.tsv
Output figures PDF                          https://raw.githubusercontent.com/HegemanLab/w4mcorcov_galaxy_wrapper/master/tools/w4mcorcov/test-data/expected_contrast_detail_lohi.pdf
OPLS-DA, SIMCA, and S-PLOT are registered trademarks of the Umetrics company. http://umetrics.com/about-us/trademarks