
What it does

LDA Effect Size (LEfSe) is a biomarker discovery and explanation tool for high-dimensional data. It couples statistical significance with biological consistency and effect size estimation. For an overview of LEfSe, please refer to the "Introduction" module or to (Segata et al. 2011).

The schematic and the description below illustrate how the algorithm works:

https://bytebucket.org/biobakery/galaxy_lefse/wiki/lefse_met.png

Input data consist of a collection of m samples (columns) each made up of n numerical features (rows, typically normalized per-sample, red representing high values and green low). These samples are labeled with a class (taking two or more possible values) that represents the main biological hypothesis under investigation; they may also have one or more subclass labels reflecting within-class groupings.
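The layout described above can be sketched as a small Python data structure. This is only an illustration: the class name AbundanceTable, its field names, and the choice to normalize each sample to a fixed total are assumptions made for this example and are not part of LEfSe itself.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AbundanceTable:
    """Hypothetical container: features are rows, samples are columns."""
    classes: List[str]                # one class label per sample (column)
    subclasses: Optional[List[str]]   # optional within-class grouping per sample
    features: Dict[str, List[float]]  # feature name -> one value per sample

    def normalized(self, total: float = 1.0) -> "AbundanceTable":
        """Rescale every sample (column) so that its values sum to `total`."""
        n_samples = len(self.classes)
        col_sums = [
            sum(row[j] for row in self.features.values())
            for j in range(n_samples)
        ]
        scaled = {
            name: [
                total * value / col_sums[j] if col_sums[j] else 0.0
                for j, value in enumerate(row)
            ]
            for name, row in self.features.items()
        }
        return AbundanceTable(self.classes, self.subclasses, scaled)


# Made-up example: two features measured in two samples.
table = AbundanceTable(
    classes=["control", "treated"],
    subclasses=None,
    features={"taxon_A": [1.0, 3.0], "taxon_B": [3.0, 1.0]},
)
normalized_table = table.normalized()
```

After normalization each column sums to the chosen total, so values are comparable across samples regardless of how much was measured in each one.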

Step 1: the Kruskal-Wallis test analyzes all features, testing whether the values in different classes are differentially distributed. Features violating the null hypothesis are further analyzed in Step 2.
Step 2: the pairwise Wilcoxon test checks whether all pairwise comparisons between subclasses of different classes agree significantly with the class-level trend.
Step 3: the resulting subset of vectors is used to build a Linear Discriminant Analysis model from which the relative difference among classes is used to rank the features. The final output thus consists of a list of features that are discriminative with respect to the classes, consistent with the subclass grouping within classes, and ranked according to the effect size with which they differentiate classes.
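To make Step 1 concrete, the sketch below implements a Kruskal-Wallis test with the Python standard library and uses it to select the features that would move on to Step 2. It is a minimal illustration and not the code LEfSe runs: the function names, the chi-square approximation for the p-value, and the toy data are all assumptions made for this example. Steps 2 and 3 are not covered.

```python
import math
from collections import defaultdict


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def chi2_sf(x, df):
    """Upper tail of the chi-square distribution for integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)
    for i in range(1, (df - 1) // 2 + 1):
        if i > 1:
            term *= half / (i - 0.5)
        total += math.exp(-half) * term
    return total


def kruskal_wallis(values, labels):
    """Return (H, p) for one feature, with the usual correction for ties."""
    n = len(values)
    rank_sums, counts, ties = defaultdict(float), defaultdict(int), defaultdict(int)
    for rank, label, value in zip(average_ranks(values), labels, values):
        rank_sums[label] += rank
        counts[label] += 1
        ties[value] += 1
    h = 12.0 / (n * (n + 1)) * sum(
        rank_sums[c] ** 2 / counts[c] for c in counts
    ) - 3 * (n + 1)
    correction = 1 - sum(t ** 3 - t for t in ties.values()) / (n ** 3 - n)
    if correction == 0:  # every value identical: no evidence of a difference
        return 0.0, 1.0
    h = max(h / correction, 0.0)
    return h, chi2_sf(h, len(counts) - 1)


def step1_filter(features, labels, alpha):
    """Keep the names of features whose Kruskal-Wallis p-value is below alpha."""
    return [
        name for name, values in features.items()
        if kruskal_wallis(values, labels)[1] < alpha
    ]


# Made-up data: six samples, three per class.
labels = ["a", "a", "a", "b", "b", "b"]
features = {
    "taxon_A": [1, 2, 3, 4, 5, 6],  # classes fully separated
    "taxon_B": [1, 6, 2, 5, 3, 4],  # classes interleaved
}
selected = step1_filter(features, labels, alpha=0.05)
```

The chi-square approximation is only rough for very small groups such as the ones in this toy data; it is used here to keep the example free of external dependencies.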

Input format

The input for this module must be generated with the "Format Input for LEfSe" tool.

Output format

The output is a tabular file listing every feature, the logarithm of its highest mean among all the classes and, if the feature is discriminative, the class with the highest mean and the logarithmic LDA score.
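A file of this kind could be read with a few lines of Python. The sketch below assumes tab-separated columns in the order given above (feature, logarithm of the highest mean, class, LDA score) and assumes that the last two fields are left empty for non-discriminative features; both points should be checked against a real output file, which may contain additional columns. The names LefseRow and parse_results and the sample rows are invented for this example.

```python
import csv
import io
from typing import List, NamedTuple, Optional


class LefseRow(NamedTuple):
    feature: str
    log_max_mean: float
    enriched_class: Optional[str]  # None when the feature is not discriminative
    lda_score: Optional[float]     # None when the feature is not discriminative


def parse_results(handle) -> List[LefseRow]:
    """Parse the first four tab-separated columns of each non-empty line."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or not fields[0]:
            continue
        # Pad short lines so that missing trailing fields read as empty.
        fields = fields + [""] * (4 - len(fields))
        enriched_class = fields[2].strip() or None
        lda_score = float(fields[3]) if fields[3].strip() else None
        rows.append(
            LefseRow(fields[0], float(fields[1]), enriched_class, lda_score)
        )
    return rows


# Made-up content standing in for a real output file.
sample = io.StringIO(
    "Bacteria.Firmicutes\t5.2\tcontrol\t3.8\n"
    "Bacteria.Proteobacteria\t4.1\t\t\n"
)
results = parse_results(sample)
discriminative = [row.feature for row in results if row.lda_score is not None]
```

With a real file, the io.StringIO object would be replaced by an open file handle.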

The output file can be conveniently visualized with the "Plot LEfSe Results" module and, if feature names define a hierarchy, with the "Plot Cladogram" module. The output can also be used for generating the histograms of the raw data of the discriminative features using the "Plot Differential Features" module.

Parameters

The input parameters are the alpha values for the factorial Kruskal-Wallis test and for the pairwise Wilcoxon test among subclasses (Steps 1 and 2 in the schematic picture above), and the non-negative threshold on the logarithmic LDA score. In addition, the user can restrict the pairwise Wilcoxon test to subclasses that have the same name in different classes (less stringent), and can set the multi-class strategy to either all-against-all (more stringent) or one-against-all (less stringent).
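The parameters can be pictured as a small settings object together with the final decision they drive. The sketch below is hypothetical: the class LefseParams, its field names, the default values shown, and the exact comparison operators are assumptions for illustration and do not describe the tool's real interface. The outcome of the Step 2 subclass check is passed in as a ready-made boolean rather than computed.

```python
from dataclasses import dataclass

STRATEGIES = ("all-against-all", "one-against-all")


@dataclass(frozen=True)
class LefseParams:
    """Hypothetical container for the settings described above."""
    kw_alpha: float = 0.05        # alpha for the Kruskal-Wallis test (Step 1)
    wilcoxon_alpha: float = 0.05  # alpha for the pairwise Wilcoxon test (Step 2)
    lda_threshold: float = 2.0    # minimum logarithmic LDA score
    same_name_subclasses_only: bool = False  # True is the less stringent choice
    strategy: str = "all-against-all"        # multi-class strategy

    def __post_init__(self):
        for name in ("kw_alpha", "wilcoxon_alpha"):
            value = getattr(self, name)
            if not 0.0 < value <= 1.0:
                raise ValueError(f"{name} must be in (0, 1], got {value}")
        if self.lda_threshold < 0:
            raise ValueError("lda_threshold must be non-negative")
        if self.strategy not in STRATEGIES:
            raise ValueError(f"strategy must be one of {STRATEGIES}")


def is_reported(params, kw_p, subclasses_consistent, lda_score):
    """Decide whether a feature would be listed as discriminative.

    kw_p is the Step 1 p-value, subclasses_consistent is the outcome of the
    Step 2 check, and lda_score is the logarithmic LDA score from Step 3.
    """
    return (
        kw_p < params.kw_alpha
        and subclasses_consistent
        and lda_score >= params.lda_threshold
    )


# A stricter Step 1 alpha rejects a feature that the illustrative defaults keep.
default_decision = is_reported(LefseParams(), 0.03, True, 3.1)
strict_decision = is_reported(LefseParams(kw_alpha=0.01), 0.03, True, 3.1)
```

Lowering either alpha or raising the LDA threshold can only shrink the list of reported features, which is why stricter settings are the safer choice when few samples are available.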

Example

For the mouse model dataset (see the "Introduction" module) it is suggested to use alpha=0.01 as the sample size is not very large.