MaAsLin (version 1.0.1)

Feedback? Not working? Please contact us at Maaslin_google_group.

MaAsLin: Multivariate Analysis by Linear Models

MaAsLin is a multivariate statistical framework that finds associations between clinical metadata and microbial community abundance or function. The clinical metadata can be of any type: continuous (for example, age and weight), boolean (sex, stool/biopsy), or discrete/factor (cohort groupings and phenotypes). MaAsLin is best used when you are associating many metadata with microbial measurements, and in that case each metadatum can be a different type. For example, you could include age, weight, sex, cohort, and phenotype in the same input file to be analyzed in the same MaAsLin run. The microbial measurements are expected to be normalized before using MaAsLin, so they should be proportional data ranging from 0 to 1.0.
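As an illustration of the normalization expected before a run, the following minimal sketch converts raw counts for one sample into proportions that sum to 1.0. The function name and the example counts are hypothetical and are not part of MaAsLin; other normalization schemes may be equally appropriate for your data.

```python
def to_relative_abundance(counts):
    """Scale one sample's raw counts so that they sum to 1.0."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no counts to normalize")
    return {feature: count / total for feature, count in counts.items()}


# Hypothetical raw counts for a single sample.
sample = {"Bacteroides": 60, "Prevotella": 30, "Roseburia": 10}
proportions = to_relative_abundance(sample)

for feature, value in proportions.items():
    print(feature, value)
```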

The results of a MaAsLin run are associations of specific microbial community members with metadata. Each association is reported without the influence of the other metadata in the study. Several factors are known to influence the microbiome (for example diet, age, geography, and fecal or biopsy sample origin). MaAsLin allows one to detect the effect of one metadatum, possibly a phenotype, while deconfounding the effects of diet, age, sample origin, or any other metadata captured in the study.

https://bytebucket.org/biobakery/galaxy_maaslin/wiki/Figure1-Overview.png

Maaslin Analysis Overview

MaAsLin performs boosted, additive general linear models between one group of data (metadata, the predictors) and another group (in our case microbial abundance, the response). Because metagenomic data is sparse, boosting is used to select metadata that show some potential to be associated with microbial abundances. Boosting of metadata and selection of a model occur per OTU. The metadata selected by boosting are then used in a general linear model with the metadata as predictors and the arcsin-square root transformed OTU abundance as the response.
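To make the response transform concrete, here is a minimal sketch of the arcsin-square root transform followed by an ordinary least-squares fit with a single continuous predictor. It is only an illustration: it omits the boosting step and the multi-predictor model described above, and the function names and example values are hypothetical rather than taken from the MaAsLin code.

```python
import math


def arcsin_sqrt(proportion):
    """Arcsin-square root transform of a proportion in [0, 1]."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("expected a proportion between 0 and 1")
    return math.asin(math.sqrt(proportion))


def fit_line(x, y):
    """Least-squares slope and intercept for one predictor (illustration only)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: age as the predictor, one feature's relative abundance.
age = [25, 35, 45, 55]
abundance = [0.04, 0.09, 0.16, 0.25]

response = [arcsin_sqrt(p) for p in abundance]
slope, intercept = fit_line(age, response)
print(slope, intercept)
```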

For more information on the technical aspects of this algorithm, please see the methodological evaluation of MaAsLin that compared it to multivariate and univariate analyses. Please check back for the paper citation.

Process:

The first step consists of uploading your data using Galaxy's Get Data - Upload File.

A sample file is located at: https://bytebucket.org/biobakery/maaslin/wiki/maaslin_demo_pcl.txt

Important

Please make sure to choose File Format: maaslin

Required inputs

MaAsLin requires an input PCL file of metadata and microbial community measurements. A PCL file is a delimited text file, similar to an Excel spreadsheet, with the following characteristics:

  1. Rows represent metadata and features (bugs); columns represent samples.
  2. The first row by default should be the sample ids.
  3. Metadata rows should be next.
  4. Lastly, rows containing feature (bug) measurements (like abundance) should come after the metadata rows.
  5. The first column should contain the ID describing the row. For metadata this may be, for example, "Age" for a row containing the age of the patients donating the samples. For measurements, this should be the feature name (bug name).
  6. The file is expected to be TAB delimited.
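The layout above can be sketched with a tiny, made-up PCL file and a reader that splits it at the last metadata row. This is a minimal sketch, not MaAsLin's own reader: the sample ids, row names, values, and the `split_pcl` helper are all hypothetical, and it assumes the last metadata row is identified by its row ID.

```python
import csv
import io

# Hypothetical TAB-delimited PCL content: sample ids first,
# then metadata rows, then feature (bug) rows.
PCL_TEXT = (
    "sample\tS1\tS2\tS3\n"
    "Age\t34\t51\t28\n"
    "Sex\tM\tF\tF\n"
    "Bacteroides\t0.60\t0.25\t0.40\n"
    "Prevotella\t0.40\t0.75\t0.60\n"
)


def split_pcl(handle, last_metadata_row):
    """Return (sample_ids, metadata, features) from a PCL-style file handle."""
    rows = list(csv.reader(handle, delimiter="\t"))
    sample_ids = rows[0][1:]
    body = rows[1:]
    row_ids = [row[0] for row in body]
    if last_metadata_row not in row_ids:
        raise ValueError("last metadata row not found: " + last_metadata_row)
    boundary = row_ids.index(last_metadata_row) + 1
    # Metadata values stay as text because they may be of any type.
    metadata = {row[0]: row[1:] for row in body[:boundary]}
    # Feature rows hold numeric measurements.
    features = {row[0]: [float(v) for v in row[1:]] for row in body[boundary:]}
    return sample_ids, metadata, features


sample_ids, metadata, features = split_pcl(io.StringIO(PCL_TEXT), "Sex")
print(sample_ids)
print(list(metadata))
print(list(features))
```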

Description of parameters

Input file: Select a loaded data file to use in analysis.

Last metadata row: Metadata and microbial measurements should be rows of the PCL file. Metadata should all come before microbial measurements. This row is the last metadata row, which is followed only by rows of microbial measurements.

Maximum false discovery rate (Significance threshold): Associations are found significant if their q-value is equal to or less than this threshold.

Minimum for feature relative abundance filtering: The minimum relative abundance allowed in the data. Values below this are removed and imputed as the median of the sample data.

Minimum for feature prevalence filtering: The minimum percentage of samples in which a feature must have abundance. Features present in fewer samples are removed.

Type of Output: Select one of the two options for output (summary or detailed results).
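As a rough illustration of prevalence filtering, the sketch below keeps a feature only if it reaches a minimum abundance in a minimum fraction of samples. This is one plausible reading of how the two filtering thresholds interact, not a description of MaAsLin's exact procedure. The function name, the thresholds, and the data are hypothetical, and the prevalence is expressed as a fraction rather than a percentage.

```python
def passes_prevalence(values, min_abundance, min_prevalence):
    """True if enough samples reach min_abundance (assumed filtering rule)."""
    present = sum(1 for value in values if value >= min_abundance)
    return present / len(values) >= min_prevalence


# Hypothetical relative abundances across four samples.
features = {
    "Bacteroides": [0.60, 0.25, 0.40, 0.30],
    "RareBug": [0.0, 0.0, 0.0001, 0.0],
}

kept = {
    name: values
    for name, values in features.items()
    if passes_prevalence(values, min_abundance=0.0001, min_prevalence=0.5)
}
print(list(kept))
```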

Outputs

The Run MaAsLin module will create either A) a summary text file of plotted significant associations or B) a compressed directory of associations (significant and not significant).

  1. Any association that had a q-value less than or equal to the significance threshold will be included in a tab-delimited file.
  2. The following files will be generated per MaAsLin run. In the following listing, the term projectname refers to the name of your PCL file without the extension.

Analysis (These files are useful for analysis):

projectname-metadata.txt: Each metadatum has its own file of associations. Any association selected for testing after the initial boosting is recorded here. Included are the results from the final general linear model (performed after the boosting) and the FDR-corrected p-value (q-value). Can be opened as a text file or spreadsheet.

projectname-metadata.pdf: Any association with a q-value less than or equal to the significance threshold is plotted here. If this file does not exist, projectname-metadata.txt should not have an entry with a q-value less than or equal to the threshold. Factor and boolean data are plotted as notched box plots; continuous data are plotted as a scatter plot with a line of best fit.

https://bytebucket.org/biobakery/galaxy_maaslin/wiki/Maaslin_Output.png

Example of the projectname-metadata.pdf file

Significant associations are combined in files of associations per metadata. Factor and boolean data are plotted as notched box plots; continuous data are plotted as a scatter plot with a line of best fit. Plots show raw data, header data show information from the reduced

projectname_Summary.txt: All entries plotted in the projectname-metadata.pdf files are collected together here. Can be opened as a text file or spreadsheet.
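The q-values in these files are FDR-corrected p-values. This page does not say which correction is applied, so the sketch below assumes the widely used Benjamini-Hochberg procedure, purely to illustrate how p-values become q-values and how the significance threshold is then applied. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the same order as p_values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values


# Hypothetical p-values for four associations.
p_values = [0.01, 0.04, 0.03, 0.20]
q_values = benjamini_hochberg(p_values)

threshold = 0.05  # example significance threshold (maximum false discovery rate)
significant = [i for i, q in enumerate(q_values) if q <= threshold]
print(q_values)
print(significant)
```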

Troubleshooting (These files are typically not used for analysis but are there for documenting the process and troubleshooting):

projectname.txt: Contains the details from the statistical engine. Useful for detailed troubleshooting.

data.tsv: The data matrix that was read in (transposed). Useful for making sure the correct data was read in.

data.read.config: Can be used to read in data.tsv.

metadata.tsv: The metadata that was read in (transposed). Useful for making sure the correct metadata was read in.

metadata.read.config: Can be used to read in metadata.tsv.

read_merged.tsv: The data and metadata merged (transposed). Useful for making sure the merging occurred correctly.

read_merged.read.config: Can be used to read in read_merged.tsv.

read_cleaned.tsv: The data read in, merged, and then cleaned. After this process the data is written to this file for reference if needed.

read_cleaned.read.config: Can be used to read in read_cleaned.tsv.

ProcessQC.txt: Contains quality control information for the MaAsLin analysis. This includes information on the magnitude of outlier removal.

Contacts

Please feel free to contact us at subraman@broadinstitute.org for any questions or comments!