Galaxy | Tool Preview

A) Format Data for LEfSe (version 1.0)

What it does

LDA Effect Size (LEfSe) (Segata et. al 2010) is an algorithm for high-dimensional biomarker discovery and explanation that identifies genomic features (genes, pathways, or taxa) characterizing the differences between two or more biological conditions (or classes, see figure below). It emphasizes both statistical significance and biological relevance, allowing researchers to identify differentially abundant features that are also consistent with biologically meaningful categories (subclasses). LEfSe first robustly identifies features that are statistically different among biological classes. It then performs additional tests to assess whether these differences are consistent with respect to expected biological behavior.

Specifically, we first use the non-parametric factorial Kruskal-Wallis (KW) sum-rank test to detect features with significant differential abundance with respect to the class of interest; biological significance is subsequently investigated using a set of pairwise tests among subclasses using the (unpaired) Wilcoxon rank-sum test. As a last step, LEfSe uses Linear Discriminant Analysis to estimate the effect size of each differentially abundant feature and, if desired by the investigator, to perform dimension reduction.

LEfSe consists of six modules performing the following steps (see the figure below).

The first step consists of uploading your file by using Galaxy's "Get-Data / Upload-file"

The next steps are:

https://bytebucket.org/biobakery/galaxy_lefse/wiki/lefse_ove.png

Input file format

The text tab-delimited input file consists of a list of numerical features, the class vector and optionally the subclass and subject vectors. The features can be read counts directly or abundance floating-point values more generally, and the first field is the name of the feature. Class, subclass and subject vectors have a name (the first field) and a list of non-numerical strings.

Although both column and row feature organization is accepted, given the high-dimensional nature of metagenomic data, the listing of the features in rows is preferred. A partial example of an input file follows (all values are separated by single-tab):

bodysite                                mucosal         mucosal         mucosal         mucosal         mucosal         non_mucosal     non_mucosal     non_mucosal     non_mucosal     non_mucosal
subsite                                 oral            gut             oral            oral            gut             skin            nasal           skin            ear             nasal
id                                      1023            1023            1672            1876            1672            159005010       1023            1023            1023            1672
Bacteria                                0.99999         0.99999         0.999993        0.999989        0.999997        0.999927        0.999977        0.999987        0.999997        0.999993
Bacteria|Actinobacteria                 0.311037        0.000864363     0.00446132      0.0312045       0.000773642     0.359354        0.761108        0.603002        0.95913         0.753688
Bacteria|Bacteroidetes                  0.0689602       0.804293        0.00983343      0.0303561       0.859838        0.0195298       0.0212741       0.145729        0.0115617       0.0114511
Bacteria|Firmicutes                     0.494223        0.173411        0.715345        0.813046        0.124552        0.177961        0.189178        0.188964        0.0226835       0.192665
Bacteria|Proteobacteria                 0.0914284       0.0180378       0.265664        0.109549        0.00941215      0.430869        0.0225884       0.0532684       0.00512034      0.0365453
Bacteria|Firmicutes|Clostridia          0.090041        0.170246        0.00483188      0.0465328       0.122702        0.0402301       0.0460614       0.135201        0.0115835       0.0537381

In this case one may want to use bodysite as class, subsite as subclass and id as subject. Notice that the features have a hierarchical structure specified using the character |.

Input file sample

You can try the LEfSe modules using the dataset available here. You can upload the dataset using Galaxy's Get-Data / Upload File

This is a 16S dataset from (Garrett et. al 2010) and (Veiga et. al 2010) for studying the characteristics of the fecal microbiota in a mouse model of spontaneous colitis. The dataset contains 30 abundance profiles (obtained processing the 16S reads with RDP) belonging to 10 rag2 (control) and 20 truc (case) mice. The metadata consists in class information only, as we don't have subject or subclass information. The same dataset is used to show the graphical results in the module descriptions.


STEP A:

What STEP A does

Preprocessing module for the biomarker discovery tool called LEfSe:

This module of LEfSe preprocesses metagenomic abundance data for the analyses to be carried out with the "Run LEfSe" module. This module is separated from the "Run LEfSe" because one may want to preprocess the data only once but run multiple analyses.

For an overview of LEfSe please refer to the "Introduction" module or to (Segata et. al 2011).


Input format

The module accepts tabular data with the feature list in rows or columns.


Output format

The module generates data readable by the "Run LEfSe" module only.


Parameters

The class vector represents the labels of the main condition under investigation. The (optional) subclass vector denotes the internal groupings with biological meaning inside each class (subclasses of different classes with the same name are processed as different subclasses). The subject vector (optional) reports a third dimension denoting meta-data (subject id, sample type, ... ) which is independent from the class and subclass definition.

The labels can have a hierarchical organization (see example below) reflecting taxonomies (like NCBI or RDB microbial taxonomy, SEED subsystems or GO terms). The taxonomic levels are specified using the character |.

The per-sample normalization is usually applied for metagenomic data in which the relative abundances are taken into account.


Example

Although both column and row feature organization is accepted, given the high-dimensional nature of metagenomic data, the listing of the features in rows is preferred. A partial example of an input file follows (all values are separated by single-tab):

bodysite                                mucosal         mucosal         mucosal         mucosal         mucosal         non_mucosal     non_mucosal     non_mucosal     non_mucosal     non_mucosal
subsite                                 oral            gut             oral            oral            gut             skin            nasal           skin            ear             nasal
id                                      1023            1023            1672            1876            1672            159005010       1023            1023            1023            1672
Bacteria                                0.99999         0.99999         0.999993        0.999989        0.999997        0.999927        0.999977        0.999987        0.999997        0.999993
Bacteria|Actinobacteria                 0.311037        0.000864363     0.00446132      0.0312045       0.000773642     0.359354        0.761108        0.603002        0.95913         0.753688
Bacteria|Bacteroidetes                  0.0689602       0.804293        0.00983343      0.0303561       0.859838        0.0195298       0.0212741       0.145729        0.0115617       0.0114511
Bacteria|Firmicutes                     0.494223        0.173411        0.715345        0.813046        0.124552        0.177961        0.189178        0.188964        0.0226835       0.192665
Bacteria|Proteobacteria                 0.0914284       0.0180378       0.265664        0.109549        0.00941215      0.430869        0.0225884       0.0532684       0.00512034      0.0365453
Bacteria|Firmicutes|Clostridia          0.090041        0.170246        0.00483188      0.0465328       0.122702        0.0402301       0.0460614       0.135201        0.0115835       0.0537381

In this case one may want to use bodysite as class, subsite as subclass and id as subject. Notice that the features have a hierarchical structure specified using the character |.

Example with the "mouse model dataset"

You can try the LEfSe modules using the dataset available here. This is a 16S dataset from (Garrett et. al 2010) and (Veiga et. al 2010) for studying the characteristics of the fecal microbiota in a mouse model of spontaneous colitis. The dataset contains 30 abundance profiles (obtained processing the 16S reads with RDP) belonging to 10 rag2 (control) and 20 truc (case) mice. The metadata consists of class information only, as we don't have subject or subclass information. The dataset contains the features organized in rows; you need to select the first row as class, whereas you have to select "no subclass" and "no subject" options.

How to Cite LEfSe

If you find LEfSe usefull in your research please city our paper (Segata et. al 2010):

Nicola Segata, Jacques Izard, Levi Walron, Dirk Gevers, Larisa Miropolsky, Wendy Garrett, Curtis Huttenhower.
Genome Biology, 2011 Jun 24;12(6):R60

Please do not hesitate to contact us for any questions of comments.