W4m Data Subset (version 0.98.19)
Choose data-matrix file (tab-separated values with sample names in first row and feature names in first column).
Choose sample-metadata file (tab-separated values with one row per sample, sample name in first column).
Choose variable-metadata file (tab-separated values with one row per feature, feature name in first column).
Name the column in 'Sample metadata' that has the values to be referenced by 'Sample-class names' and 'Compute centers for classes'. [default: 'class']
List of names (or patterns to match names) of sample classes to include or exclude. List should be comma-separated with no stray space characters. (Leave this empty to match no names.) [default: empty]
Indicate meaning of preceding list: either to identify classes to exclude from output or to identify classes to include in output. [default: 'filter-out']
Indicate whether the preceding list contains wild-card patterns or regular-expression patterns. See 'Wild-card patterns to match class names' and 'Regular-expression patterns to match class names' sections below. [default: 'wild-card patterns']
List of filters, each specifying the range of permitted values in a column of 'Variable metadata' (specified as 'column:min:max'), as described in 'Variable-range filters' section below. List should be comma-separated with no stray space characters. (Leave this empty for no filtering.) [default: empty]
Choose transformation. In all cases, negative intensities become missing values. See 'Data transformation and imputation' section below. [default: 'none']
Choose imputation for missing values. See 'Data transformation and imputation' section below. [default: 'zero']
List of sample-metadata column names for sorting samples. List should be comma-separated with no stray space characters. (This is ignored when 'Compute centers for classes' is set to either 'centroid' or 'median'.) [default: 'sampleMetadata']
List of feature-metadata column names for sorting features. List should be comma-separated with no stray space characters. [default: 'variableMetadata']
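Compute centers for classes (e.g., treatments). See 'Optional Computation of Treatment Centers' section below. [default: 'none']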

Author Arthur Eschenlauer (University of Minnesota, esch0041@umn.edu)


R package

The w4mclassfilter package (which is used by the W4M Data Subset tool) is available from the Hegeman lab GitHub repository (https://github.com/HegemanLab/w4mclassfilter/releases).


Tool updates

See https://github.com/HegemanLab/w4mclassfilter_galaxy_wrapper#news


"W4M Data Subset" - Filter Workflow4Metabolomics data

Motivation

LC-MS metabolomics experiments seek to resolve "features", i.e., species that have distinct chromatographic retention time ("rt") and (after ionization) mass-to-charge ratio ("m/z" or "mz"). (If a chemical is fragmented or may have a variety of adducts, several features will result.) Data for a sample are collected as mass-spectral intensities, each of which is associated with a position on a 2D plane with dimensions of rt and m/z. Ideally, features would be sufficiently reproducible among sample-runs to distinguish features that are similar among samples from those that differ.

For liquid chromatography, the retention time for a species can vary considerably from one chromatography run to the next. The Workflow4Metabolomics suite of Galaxy tools (W4M, [Giacomoni et al., 2014; Guitton et al., 2017]) uses the XCMS preprocessing tools [Smith et al., 2006] for "retention-time correction" to align features among samples. Features may be better aligned if pooled samples and blanks are included.

Multivariate statistical tools may be used to discover clusters of similar samples [Thévenot et al., 2015]. However, once retention-time alignment of features has been achieved among samples in LC-MS datasets:

  • The presence of pools and blanks may confound identification and separation of sample clusters.
  • Multivariate statistical algorithms may be impacted by missing values or dimensions that have zero variance.

Description

The W4M Data Subset tool selects subsets of samples, features, or data values and conditions the data for further analysis.

  • The tool takes as input the dataMatrix, sampleMetadata, and variableMetadata datasets produced by W4M's XCMS and CAMERA [Kuhl et al., 2012] tools.
  • The tool produces the same trio of output datasets, modified as described below.

This tool can perform several operations to reduce the number of samples or features to be analyzed (although this should be done only in a statistically sound manner consistent with the nature of the experiment):

  • Sample filtering: Samples may be selected by designating a "sample class" column in sampleMetadata and specifying criteria to include or exclude samples based on the contents of this column.
  • Feature filtering: Features may be selected by specifying minimum or maximum value (or both) allowable in columns of variableMetadata.
  • Intensity filtering: To exclude low-intensity features from consideration, a lower bound may be specified for the maximum intensity of a feature across all samples (i.e., for a row in dataMatrix).

This tool also conditions data for statistical analysis:

  • Samples that are missing from either sampleMetadata or dataMatrix are eliminated.
  • Features that are missing from either variableMetadata or dataMatrix are eliminated.
  • Features and samples that have zero variance are eliminated.
  • Samples and features are ordered consistently in variableMetadata, sampleMetadata, and dataMatrix. (The columns for sorting variableMetadata or sampleMetadata may be specified.)
  • The names of the first columns of variableMetadata and sampleMetadata are set respectively to "variableMetadata" and "sampleMetadata".
  • If desired, the values in dataMatrix may be log-transformed.
  • Negative intensities become missing values (before missing-value replacement is performed).
  • If desired, each missing value in dataMatrix may be replaced with zero or the median value observed for the corresponding feature.
  • If desired, a "center" for each treatment can be computed in lieu of the samples for that treatment.
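The name-matching and zero-variance steps in the list above can be sketched as follows. This is a minimal illustration in Python, not the w4mclassfilter implementation (which is an R package); the function name `condition` and the dictionary layout of the data matrix are assumptions made for this example.

```python
from statistics import pvariance

def condition(data_matrix, sample_names, feature_names):
    """data_matrix: {feature: {sample: intensity}} (hypothetical layout).

    Keep only features and samples present in both the matrix and the
    metadata name lists, then drop features and samples with zero variance.
    """
    features = [f for f in feature_names if f in data_matrix]
    samples = [s for s in sample_names
               if all(s in data_matrix[f] for f in features)]
    # Drop features whose intensities do not vary across the kept samples.
    features = [f for f in features
                if pvariance([data_matrix[f][s] for s in samples]) > 0]
    # Drop samples whose intensities do not vary across the kept features.
    samples = [s for s in samples
               if pvariance([data_matrix[f][s] for f in features]) > 0]
    return features, samples

dm = {
    "M1": {"A": 1.0, "B": 2.0, "C": 5.0},
    "M2": {"A": 3.0, "B": 3.0, "C": 3.0},   # zero variance across samples
    "M3": {"A": 4.0, "B": 9.0, "C": 5.0},   # sample C ends up with zero variance
}
# Sample "D" and feature "M4" appear only in the metadata, so they are dropped.
kept_features, kept_samples = condition(
    dm, ["A", "B", "C", "D"], ["M1", "M2", "M3", "M4"])
print(kept_features, kept_samples)
```

Whether the tool repeats the variance check after each removal is not stated here; the sketch makes a single pass for features and then a single pass for samples.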

This tool may be applied several times sequentially, which may be useful for:

  • analyzing subsets of samples for progressively smaller sets of treatment levels, or
  • choosing subsets of samples or features based on criteria in columns of sampleMetadata or variableMetadata, respectively.

Workflow Position

This tool can be used at any point downstream of Preprocessing.

  • Possible upstream tool categories: Preprocessing, Quality Control, Statistical Analysis, Filter and Sort
  • Possible downstream tool categories: Normalisation, Statistical Analysis, Quality Control, Filter and Sort

Input files

File               Contents                             Format
Data matrix        per-feature, per-sample intensities  tabular
Sample metadata    metadata for samples                 tabular
Variable metadata  metadata for features                tabular

Parameters

Data matrix
feature x sample dataMatrix (tab-separated values) file of the numeric data matrix, with period-character ('.') as decimal, and 'NA' for missing values.
The file must not contain metadata apart from the required row and column names.

Sample metadata
sample x metadata sampleMetadata (tab-separated values) file of the numeric and/or character sample metadata, with period-character ('.') as decimal, and 'NA' for missing values.

Variable metadata
variable x metadata variableMetadata (tab-separated values) file of the numeric and/or character variable metadata, with period-character ('.') as decimal, and 'NA' for missing values.

Column containing the sample-class names (default = 'class')
name of the column in sampleMetadata that has the values to be tested against the 'Sample-class names' input parameter or to be referenced by the 'Compute centers for classes' input parameter.
Only letters, digits, periods, and underscores are permitted.

Sample-class names (default = no names)
names (or wild-card or regular-expression patterns to match names) of sample-classes to include or exclude
(Separate names with commas, without any extra space characters.)

Exclude/include named (or matched) classes (default = 'filter-out')
'filter-in' - include only the named sample-classes
'filter-out' - exclude only the named sample-classes

Use 'wild card patterns' or 'regular expression patterns' (default = 'wild-card patterns')
'wild-card patterns' - use wild cards to match names of sample-classes (see the 'Wild-card patterns to match class names' section below.)
'regular-expression patterns' - use regular expressions to match the named sample-classes (see the 'Regular-expression patterns to match class names' section below.)

Variable-range filters (default = no filters)
variable-range filters (see the 'Variable-range filters' section below)
(Separate filter expressions with commas, without any extra space characters.)

Data transformation (default = 'none')
'none' - Do not transform data matrix values.
'log2' - Take the log base 2 of the values in the data matrix.
'log10' - Take the log base 10 of the values in the data matrix.

Note that negative intensities become missing values regardless of the choice made here.

Imputation of missing values (default = 'zero')
'none' - Do not impute data matrix values.
'zero' - Negative and missing values are imputed to zero.
'center' - For each feature, negative and missing values are imputed to the median of other values.

Note well: For 'none' option, 'Compute centers for classes' cannot be set to 'medoid'.

Columns that specify order for samples (default = 'sampleMetadata')
names of the columns in sampleMetadata that are used to sort samples; only letters, digits, periods, and underscores are permitted.
(Separate column names with commas, without any extra space characters.)

Columns that specify order for features (default = 'variableMetadata')
names of the columns in variableMetadata that are used to sort features; only letters, digits, periods, and underscores are permitted.
(Separate column names with commas, without any extra space characters.)

Compute centers for classes, e.g., treatments (default = 'none')
'none' - Return all samples; do not compute centers for classes/treatments.
'centroid' - For each treatment, return only the centroid (the treatment-center computed as the mean intensity for each feature).
'median' - For each treatment, return only the treatment-center computed as the median intensity for each feature.
'medoid' - For each treatment, return only the medoid (the sample most similar to the other samples for that treatment).

Note well: For 'medoid' option, 'Imputation of missing values' cannot be set to 'none'.

Output files

sampleMetadata
(tab-separated values) file.
If centering is 'none' or 'medoid', this will be identical to the sampleMetadata file given as an input argument, except that it lacks rows for samples that have been filtered out (by the sample-class filter, because of zero variance, or because they were missing from the input data matrix).
If centering is 'centroid' or 'median', most columns will be replaced with the treatment name and the number of samples for that treatment.

variableMetadata
(tab-separated values) file identical to the variableMetadata file given as an input argument, except that it lacks rows for variables (LC-MS features) that have been filtered out (by the variable-range filter, because of zero variance, or because they were missing from the input data matrix).

dataMatrix
(tab-separated values) file identical to the dataMatrix file given as an input argument, except that it lacks the rows and columns for variables and samples, respectively, that have been filtered out.

Wild-card patterns to match class names

W4M Data Subset supports use of "wild card" patterns to select class-names.

  • use '?' to match a single character
  • use '*' to match zero or more characters
  • the pattern must match the entire class name, not just part of it

For example

  • '??.samp*' matches 'my.sample' but not 'my.own.sample'
  • '*.sample' matches 'my.sample' and 'my.own.sample'
  • '*.sampl' matches neither 'my.sample' nor 'my.own.sample'
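The wild-card rules above can be illustrated by translating a pattern into an anchored regular expression. The following Python sketch is for illustration only and is not the tool's own code (the tool is built on an R package); the helper names `wildcard_to_regex` and `wildcard_match` are hypothetical.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a wild-card pattern ('?' and '*') into an anchored regex."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")            # exactly one character
        elif ch == "*":
            parts.append(".*")           # zero or more characters
        else:
            parts.append(re.escape(ch))  # any other character is literal
    # Anchoring at both ends forces the pattern to cover the whole name.
    return "^" + "".join(parts) + "$"

def wildcard_match(pattern, name):
    return re.search(wildcard_to_regex(pattern), name) is not None

for pattern in ("??.samp*", "*.sample", "*.sampl"):
    for name in ("my.sample", "my.own.sample"):
        print(pattern, name, wildcard_match(pattern, name))
```

Running the loop reproduces the three examples listed above: only the first pattern distinguishes the two names, and the last pattern matches neither.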

Regular-expression patterns to match class names

W4M Data Subset supports use of R "extended regular expression" patterns to select class-names.

R extended regular expressions allow precise pattern-matching and are exhaustively defined at https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html

However, only a few basic building blocks of regular expressions need to be mastered for most cases:

  • '^' matches the beginning of a class-name
  • '$' matches the end of a class-name
  • '.' outside of square brackets matches a single character
  • '*' matches the character specified immediately before it zero or more times
  • square brackets specify a set of characters to be matched.

Within square brackets

  • '^' as the first character specifies that the list of characters are those that should not be matched.
  • '-' is used to specify ranges of characters

Caveat: The tool wrapper uses the comma (',') to split a list of sample-class names, so commas cannot be used within regular expressions for this tool.

First Example: Consider a field of class-names consisting of 'marq3,marq6,marq9,marq12,front3,front6,front9,front12'

  • The regular expression '^front[0-9][0-9]*$' will match the same sample-classes as 'front3,front6,front9,front12'
  • The regular expression '^[a-z][a-z]*3$' will match the same sample-classes as 'front3,marq3'
  • The regular expression '^[a-z][a-z]*12$' will match the same sample-classes as 'front12,marq12'
  • The regular expression '^[a-z][a-z]*[0-9]$' will match the same sample-classes as 'front3,front6,front9,marq3,marq6,marq9'
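How a comma-separated pattern list combines with the 'filter-in' and 'filter-out' choices can be sketched as follows. This is an illustrative Python sketch, not the tool's R code; the function name `select_classes` is hypothetical, and Python's `re` module is assumed to behave like R extended regular expressions for the simple pattern used here.

```python
import re

def select_classes(class_names, pattern_list, include):
    """Return the class names kept by a comma-separated list of patterns.

    include=True mimics 'filter-in'; include=False mimics 'filter-out'.
    """
    # The list is split on commas, which is why patterns cannot contain commas.
    patterns = [p for p in pattern_list.split(",") if p]

    def matched(name):
        return any(re.search(p, name) for p in patterns)

    return [n for n in class_names if matched(n) == include]

classes = "marq3,marq6,marq9,marq12,front3,front6,front9,front12".split(",")
kept_in = select_classes(classes, "^front[0-9][0-9]*$", include=True)
kept_out = select_classes(classes, "^front[0-9][0-9]*$", include=False)
kept_all = select_classes(classes, "", include=False)  # empty list matches no names
print(kept_in)
print(kept_out)
print(kept_all)
```

With the default 'filter-out' and an empty list, no class is matched, so every class is retained.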

Second Example: Consider these regular expression patterns as possible matches to a sample-class name 'AB0123':

  • '^[A-Z][A-Z][0-9][0-9]*$' MATCHES the entire name 'AB0123'.
  • '^[A-Z][A-Z]*[0-9][0-9]*$' MATCHES the entire name 'AB0123'.
  • '^[A-Z][0-9]*' MATCHES the leading 'A' of 'AB0123' - the first character is a letter, '*' can specify zero characters, and the end of the name did not need to be matched.
  • '^[A-Z][A-Z][0-9]' MATCHES the leading 'AB0' of 'AB0123' - the first two characters are letters and the third is a digit.
  • '^[A-Z][A-Z]*[0-9][0-9]$' DOES NOT MATCH - the name does not end with the pattern '[A-Z][0-9][0-9]$', i.e., it ends with four digits, not two.
  • '^[A-Z][0-9]*$' DOES NOT MATCH - the pattern specifies that the second character and all those that follow, if present, must be digits.
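The six patterns above can be checked with a short script. The sketch below uses Python's `re` module for illustration; it is assumed (not guaranteed in general) to agree with R extended regular expressions for patterns as simple as these.

```python
import re

name = "AB0123"
patterns = [
    "^[A-Z][A-Z][0-9][0-9]*$",
    "^[A-Z][A-Z]*[0-9][0-9]*$",
    "^[A-Z][0-9]*",
    "^[A-Z][A-Z][0-9]",
    "^[A-Z][A-Z]*[0-9][0-9]$",
    "^[A-Z][0-9]*$",
]

# Record the matched portion of the name, or None when there is no match.
results = {}
for p in patterns:
    m = re.search(p, name)
    results[p] = m.group(0) if m else None
    print(p, "->", results[p])
```

The first two patterns report the whole name, the next two report only a leading portion, and the last two report no match.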

Variable-range filters

An array of range-specification strings may be supplied in the 'Variable-range filters' argument. If supplied, only features having numerical values in the specified column of variableMetadata that fall within the specified ranges will be retained in the output. Each range is a string of three colon-separated values (e.g., 'mz:200:800') in the following order:

  • the name of a column of variableMetadata which must have numerical data (only letters, digits, periods, and underscores are permitted in the name itself), e.g., 'mz';
  • the minimum allowed value in that column for the feature to be retained, e.g., '200';
  • the maximum allowed value, e.g., '800'.

Note for the range specification strings:

  • If the "maximum" is less than the "minimum", then the range is exclusive (e.g., 'mz:800:200' means retain only features whose mz is NOT in the range 200-800)
  • If the name supplied in the first field is 'FEATMAX', then the string is defining the threshold for the maximum intensity for each feature in the dataMatrix.
    • For example, 'FEATMAX:1e6:' would specify that any feature would be excluded if no sample had an intensity for that feature greater than 1,000,000.
    • Although a maximum may be specified, it seems unlikely that this would be useful. Note that when the "maximum" is less than the "minimum" for the FEATMAX range specification, then the specification is ignored.
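The range-specification rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's R implementation: the function names and dictionary layouts are hypothetical, a blank bound is treated as "no limit", and the bounds themselves are assumed to be inclusive (the tool's handling of values exactly at a bound is not specified here).

```python
def parse_range_filter(spec):
    """Split 'column:min:max' into (column, min, max); blank bounds become None."""
    column, lo, hi = spec.split(":")
    return column, (float(lo) if lo else None), (float(hi) if hi else None)

def passes(value, lo, hi):
    """Assumed semantics: inclusive bounds; max < min inverts the range."""
    if lo is not None and hi is not None and hi < lo:
        return not (hi <= value <= lo)   # e.g. 'mz:800:200' keeps mz outside 200-800
    return (lo is None or value >= lo) and (hi is None or value <= hi)

def filter_features(variable_metadata, feature_max, filter_list):
    """variable_metadata: {feature: {column: value}};
    feature_max: {feature: maximum intensity across samples}."""
    kept = list(variable_metadata)
    for spec in filter(None, filter_list.split(",")):
        column, lo, hi = parse_range_filter(spec)
        if column == "FEATMAX":
            if lo is not None and hi is not None and hi < lo:
                continue                 # inverted FEATMAX ranges are ignored
            kept = [f for f in kept if passes(feature_max[f], lo, hi)]
        else:
            kept = [f for f in kept if passes(variable_metadata[f][column], lo, hi)]
    return kept

vm = {"F1": {"mz": 150.0, "rt": 300.0},
      "F2": {"mz": 450.0, "rt": 600.0},
      "F3": {"mz": 900.0, "rt": 900.0}}
fmax = {"F1": 5e6, "F2": 3e6, "F3": 5e5}

inside = filter_features(vm, fmax, "mz:200:800")
outside = filter_features(vm, fmax, "mz:800:200")
combined = filter_features(vm, fmax, "FEATMAX:1e6:,mz:200:,rt::800")
print(inside, outside, combined)
```

Filters in the list are applied one after another, so a feature is retained only if it satisfies every filter.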

Data transformation and imputation

Data may optionally be log2- or log10-transformed.

Negative intensities are always substituted with missing values before imputation, even when no transformation is chosen.

Missing intensity data values may optionally be imputed. Missing values may be substituted:

  • with zeros (as may be appropriate for univariate analysis)
  • with the median for the feature (as may be appropriate for multivariate analysis).
    • Note that the median feature-intensity is computed for the samples before variable-range filters are applied.
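The handling of one feature (one row of dataMatrix) can be sketched as follows. This Python sketch is illustrative only; the function name is hypothetical, `None` stands in for 'NA', and applying the transformation before imputation is an assumption of the sketch rather than a documented property of the tool. Zero intensities are not handled (`math.log2(0)` raises an error).

```python
import math
from statistics import median

def transform_and_impute(row, transformation="none", imputation="zero"):
    """Process one feature's intensities; None marks a missing value."""
    # Negative intensities always become missing values.
    values = [None if (v is None or v < 0) else v for v in row]
    # Optional log transformation of the observed values.
    log = {"none": None, "log2": math.log2, "log10": math.log10}[transformation]
    if log is not None:
        values = [None if v is None else log(v) for v in values]
    # Optional imputation of the missing values.
    if imputation == "zero":
        values = [0.0 if v is None else v for v in values]
    elif imputation == "center":
        observed = [v for v in values if v is not None]
        if observed:
            center = median(observed)   # median of the feature's observed values
            values = [center if v is None else v for v in values]
    return values

row = [8.0, -1.0, None, 2.0, 32.0]
print(transform_and_impute(row, "none", "zero"))
print(transform_and_impute(row, "log2", "center"))
print(transform_and_impute(row, "none", "none"))
```

In the second call the observed log2 values are 3, 1, and 5, so both the negative and the missing entry are replaced by their median, 3.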

Optional Computation of Treatment Centers

A "center" for each treatment may be computed in lieu of all the samples for each treatment.

  • 'none' - Return all samples; do not compute centers.
  • 'centroid' - For each treatment, return only the centroid (the treatment-center computed as the mean intensity for each feature).
  • 'median' - For each treatment, return only the treatment-center computed as the median intensity for each feature.
  • 'medoid' - For each treatment, return only the medoid (the sample most similar to the other samples for that treatment). This choice requires that the 'Imputation of missing values' argument must not be set to 'none'.
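The 'centroid' and 'median' options can be sketched as a per-treatment, per-feature summary. The Python below is an illustration, not the tool's R code; the function name `treatment_centers` and the dictionary layouts are assumptions made for this example.

```python
from statistics import mean, median

def treatment_centers(intensities, sample_class, center="centroid"):
    """intensities: {sample: [intensity per feature]};
    sample_class: {sample: treatment}."""
    summarize = {"centroid": mean, "median": median}[center]
    groups = {}
    for sample, values in intensities.items():
        groups.setdefault(sample_class[sample], []).append(values)
    # zip(*rows) regroups the values feature by feature within a treatment.
    return {t: [summarize(col) for col in zip(*rows)]
            for t, rows in groups.items()}

intensities = {"s1": [1.0, 10.0], "s2": [3.0, 20.0], "s3": [8.0, 60.0],
               "s4": [5.0, 5.0], "s5": [7.0, 9.0]}
classes = {"s1": "F", "s2": "F", "s3": "F", "s4": "M", "s5": "M"}

centroids = treatment_centers(intensities, classes, "centroid")
medians = treatment_centers(intensities, classes, "median")
print(centroids)
print(medians)
```

Each treatment is reduced to a single synthetic "sample", which is why most sampleMetadata columns no longer apply when one of these options is chosen.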

The medoid is the sample having the smallest sum of its distances from other samples in the treatment:

  • Because principal components are uncorrelated, distances are computed in the space defined by the principal-component scores to minimize the distortion of computed distances by correlated features.
  • Because principal components are used to compute distances, no missing values are permitted, which is why the 'Imputation of missing values' argument must not be set to 'none'.
  • The distances are used to identify the medoid using code adapted from https://web.archive.org/web/20191231012914/https://www.biostars.org/p/11987/#11989
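The "smallest sum of distances" criterion can be sketched as follows. This Python sketch is not the tool's code: it uses plain Euclidean distance on whatever coordinates are supplied, and it does not reproduce the principal-component step described above. The function name `medoid` and the example coordinates are hypothetical.

```python
import math

def medoid(samples):
    """samples: {name: coordinates}. Return the name of the sample whose
    summed distance to all samples in the treatment is smallest."""
    def total_distance(name):
        return sum(math.dist(samples[name], other)
                   for other in samples.values())
    return min(samples, key=total_distance)

# Summed distances: s1 = 0 + 1 + 5, s2 = 1 + 0 + 4, s3 = 5 + 4 + 0.
treatment = {"s1": [0.0, 0.0], "s2": [1.0, 0.0], "s3": [5.0, 0.0]}
chosen = medoid(treatment)
print(chosen)
```

Unlike a centroid or median, the medoid is always one of the actual samples, so its sampleMetadata row remains meaningful.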

WORKING EXAMPLES

Example without Range-Filtering

This example retains only samples whose 'gender' attribute is 'M'.

Input parameters

Input Parameter                            Value
Column that names the sample class         gender
Sample-class names                         M
Exclude/include named classes              filter-in
Use 'wild-cards' or 'regular expressions'  wild-cards
Variable range-filters                     (Leave this field empty.)
Data transformation                        none
Missing-value imputation                   center
Sample-sort column                         sampleMetadata
Feature-sort column                        variableMetadata
Compute centers for classes                none

Expected outputs

Expected Output    Download from URL
Data matrix        https://raw.githubusercontent.com/HegemanLab/w4mclassfilter_galaxy_wrapper/master/tools/w4mclassfilter/test-data/expected_dataMatrix.tsv
Sample metadata    https://raw.githubusercontent.com/HegemanLab/w4mclassfilter_galaxy_wrapper/master/tools/w4mclassfilter/test-data/expected_sampleMetadata.tsv
Variable metadata  https://raw.githubusercontent.com/HegemanLab/w4mclassfilter_galaxy_wrapper/master/tools/w4mclassfilter/test-data/expected_variableMetadata.tsv

Example with Range-Filtering

This example retains only features whose mz is greater than 200, whose rt is less than 800, and whose maximum intensity across all samples is at least 2,000,000 (the FEATMAX threshold of 20.93157 is the log2 of 2,000,000). This example retains all samples (except those having zero variance across all features), although it would be possible to filter on samples as well.

Input parameters

Input Parameter                            Value
Column that names the sample class         sampleMetadata
Sample-class names                         HU_13[48]
Exclude/include named classes              filter-out
Use 'wild-cards' or 'regular expressions'  regular-expressions
Variable range-filters                     FEATMAX:20.93157:,mz:200:,rt::800
Data transformation                        log2
Missing-value imputation                   zero
Sample-sort column                         sampleMetadata
Feature-sort column                        variableMetadata
Compute centers for classes                none

Expected outputs

Expected Output    Download from URL
Data matrix        https://raw.githubusercontent.com/HegemanLab/w4mclassfilter_galaxy_wrapper/master/tools/w4mclassfilter/test-data/rangefilter_dataMatrix.tsv
Sample metadata    https://raw.githubusercontent.com/HegemanLab/w4mclassfilter_galaxy_wrapper/master/tools/w4mclassfilter/test-data/rangefilter_sampleMetadata.tsv
Variable metadata  https://raw.githubusercontent.com/HegemanLab/w4mclassfilter_galaxy_wrapper/master/tools/w4mclassfilter/test-data/rangefilter_variableMetadata.tsv

Example with Treatment-Centering

This example retains only the samples that are medoids for their gender.

Input parameters

Input Parameter                            Value
Column that names the sample class         gender
Sample-class names                         (Leave this field empty.)
Exclude/include named classes              filter-out
Use 'wild-cards' or 'regular expressions'  wild-cards
Variable range-filters                     (Leave this field empty.)
Data transformation                        none
Missing-value imputation                   zero
Sample-sort column                         gender
Feature-sort column                        rt
Compute centers for classes                medoid

Expected outputs

Expected Output    Download from URL
Data matrix        https://raw.githubusercontent.com/HegemanLab/w4mclassfilter/master/tests/testthat/exp_cent_medoid_dm.tsv
Sample metadata    https://raw.githubusercontent.com/HegemanLab/w4mclassfilter/master/tests/testthat/exp_cent_medoid_sm.tsv
Variable metadata  https://raw.githubusercontent.com/HegemanLab/w4mclassfilter/master/tests/testthat/exp_cent_medoid_vm.tsv