LCMS matching (version 4.0.2)
Decimal: '.', missing: NA, mode: character and numerical, sep: tabular. Retention time values must be in seconds.
The column names of your in-house database file, as a comma-separated list of key/value pairs.
Values used for the MS modes in the database file, as a comma-separated list of key/value pairs.
Decimal: '.', missing: NA, mode: character and numerical, sep: tabular. RT values must be in seconds.
Input file column names, as a comma-separated list of key/value pairs.
The retention time tolerance X parameter (in seconds).
The retention time tolerance Y parameter (no unit).
The retention time tolerance used when precursor matching is enabled.

LC/MS matching

This tool performs LC/MS matching on an input list of MZ/RT values, using either a single in-house database file or a connection to a Peakforest database.

Database

When selecting the database, you have the choice between a Peakforest database or an in-house file.

For the Peakforest database, a default REST web base address is already provided, but you can change it to use a custom database. A field is also available for setting a token key in case access to the Peakforest database you want to use is restricted. This is the case for the default database URL.

For the in-house file, please refer to the paragraph "Single file database" below.

Input files

Always provide UTF-8 encoded files, unless they contain no special characters at all. For instance, Greek letters in molecule names cause errors if the file is encoded in latin1 (ISO 8859-1) or Windows 1252 (which cannot be distinguished from latin1).
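A quick local check before uploading can save a failed run. The following is a minimal sketch, not part of the tool: the helper names is_utf8 and latin1_to_utf8 are hypothetical, and the conversion assumes the source bytes really are latin1.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode bytes assumed to be latin1 (ISO 8859-1) as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


# The micro sign (U+00B5) is a single byte in latin1 and two bytes in UTF-8.
latin1_bytes = "\u00b5-compound".encode("latin-1")
print(is_utf8(latin1_bytes))                  # False
print(is_utf8(latin1_to_utf8(latin1_bytes)))  # True
```

To check a real file, read it in binary mode (open(path, "rb").read()) and pass the bytes to is_utf8.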

Single file database

In this case, the user provides the database as a single file in tabular format, through the Database file field. This file must contain a list of MS peaks, optionally with retention times. Peaks are "duplicated" as many times as necessary. For instance, if 3 retention times are available for a compound with 10 peaks in positive mode, then there will be 30 lines for this compound in positive mode.
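The duplication rule is a plain cross product of peaks and retention times. As an illustrative sketch only (the function expand_peaks and the row keys are hypothetical, and the values are made up), the 10 peaks by 3 retention times case can be generated like this:

```python
from itertools import product


def expand_peaks(peaks, column_rts):
    """Return one row per (peak, retention time) pair."""
    return [
        {"mz": mz, "col": col, "rt": rt}
        for mz, (col, rt) in product(peaks, column_rts)
    ]


# 10 fake peaks and 3 fake (column, retention time) pairs for one compound.
peaks = [100.0 + i for i in range(10)]
column_rts = [("colA", 5.69), ("colB", 0.8), ("colC", 8.97)]

rows = expand_peaks(peaks, column_rts)
print(len(rows))  # 30
```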

The file must contain a header with the column names. The names are free, but must be provided through the Column names field as a comma-separated list of key/value pairs. See the default value for an example. It is of course much easier if your database file uses the column names from the default value of the Column names field. The column names shown in the default value are only the ones used by the algorithm. You can include any additional columns in your database file; they will be copied to the output.

Then you must provide the values used to identify the MS modes (positive and negative), using field MS modes.

One last piece of information about the single file database is the unit of the retention times, either seconds or minutes. Use the field "Retention time unit" to provide this information.

Example of database file (totally fake, no meaning):

molid mode mz composition attribution col rt molcomp molmass molnames
A10 "POS" 112.07569 "P9Z6W410 O" "[(M+H)-(H2O)-(NH3)]+" "colzz" 5.69 "J114L6M62O2" 146.10553 "Blablaine"
A10 "POS" 112.07569 "P9Z6W410 O" "[(M+H)-(H2O)-(NH3)]+" "col12" 0.8 "J114L6M62O2" 146.10553 "Blablaine"
A10 "POS" 112.07569 "P9Z6W410 O" "[(M+H)-(H2O)-(NH3)]+" "somecol" 8.97 "J114L6M62O2" 146.10553 "Blablaine"
A10 "POS" 191.076694 "P92Z6W413 Na2 O2" "[(M-H+2Na)]+" "colAA" 1.58 "J114L6M62O2" 146.10553 "Blablaine"
A10 "POS" 191.076694 "P92Z6W413 Na2 O2" "[(M-H+2Na)]+" "colzz2" 4.08 "J114L6M62O2" 146.10553 "Blablaine"
A10 "POS" 294.221687 "U1113P94ZW429 O4" "[(2M+H)]+ (13C)" "somecol" 8.97 "J114L6M62O2" 146.10553 "Blablaine"
A10 "POS" 72.080775 "P9Z4W410 O0" "[(M+H)-(J15L2M6O2)]+" "hcoltt" 0.8 "J114L6M62O2" 146.10553 "Blablaine"
A10 "POS" 112.07569 "P9Z6W410 O" "[(M+H)-(H2O)-(NH3)]+" "colzz3" 4.54 "J114L6M62O2" 146.10553 "Blablaine"
A10 "POS" 72.080775 "P9Z4W410 O0" "[(M+H)-(J15L2M6O2)]+" "colzz3" 4.54 "J114L6M62O2" 146.10553 "Blablaine"
A10 "POS" 72.080775 "P9Z4W410 O0" "[(M+H)-(J15L2M6O2)]+" "colpp" 0.89 "J114L6M62O2" 146.10553 "Blablaine"
A10 "POS" 145.097154 "P92Z6W413 O2" "[(M+H)-(H2)]+" "hcoltt" 0.8 "J114L6M62O2" 146.10553 "Blablaine"

The corresponding value of the Column names field for this database file would be: mztheo=mz,chromcolrt=rt,compoundid=molid,chromcol=col,msmode=mode,peakattr=attribution.

And the value of the MS modes field would be: pos=POS,neg=NEG.
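To make the key/value syntax concrete, here is a small sketch of how such a list could be split into a mapping. It is not the tool's own parser; the function name parse_key_values is hypothetical.

```python
def parse_key_values(spec: str) -> dict:
    """Split 'key1=value1,key2=value2' into a dictionary."""
    pairs = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in {item!r}")
        pairs[key.strip()] = value.strip()
    return pairs


column_names = parse_key_values(
    "mztheo=mz,chromcolrt=rt,compoundid=molid,"
    "chromcol=col,msmode=mode,peakattr=attribution"
)
ms_modes = parse_key_values("pos=POS,neg=NEG")

print(column_names["mztheo"])  # mz
print(ms_modes["neg"])         # NEG
```

Each key is the name the algorithm uses internally, and each value is the name found in the header of your file.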

MZ/RT input file

The input to provide is a dataset in tabular format (TSV: Tab Separated Values), containing the list of M/Z values, optionally with RT values. The dataset is chosen through the field Input file - MZ(/RT) values.

The column names for the M/Z and RT values must be provided through the field Input column names, as a comma separated list of key/value pairs. The file/dataset must contain a header line with the same names specified in the field Input column names.

The unit of the retention time has to be provided with the field Retention time unit.

Example of file input:

mz rt
75.02080998 49.38210915
75.05547146 0.658528069
75.08059797 1743.94267
76.03942694 51.23158899
76.07584477 50.51249853
76.07593168 0.149308136
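The sketch below shows one way to read such a file and normalise retention times to seconds. It is an illustration only: read_mzrt and its parameters are hypothetical names, the default column names mz and rt follow the example above, and NA is treated as a missing value.

```python
import csv
import io


def read_mzrt(text, mz_col="mz", rt_col="rt", rt_unit="sec"):
    """Parse tab-separated MZ/RT text into a list of dicts, RT in seconds."""
    factor = 60.0 if rt_unit == "min" else 1.0
    rows = []
    for record in csv.DictReader(io.StringIO(text), delimiter="\t"):
        raw_rt = record.get(rt_col)
        missing = raw_rt in (None, "", "NA")
        rows.append({
            "mz": float(record[mz_col]),
            "rt": None if missing else float(raw_rt) * factor,
        })
    return rows


sample = "mz\trt\n75.02080998\t49.38210915\n75.05547146\tNA\n"
for row in read_mzrt(sample):
    print(row)
```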

M/Z matching

In the simplest form of the algorithm only the M/Z values are matched against the database peaks. This happens if both Retention time match and Precursor match are off.

The first parameter is the MS mode, specified through the MS mode parameter.

The parameters M/Z precision and M/Z shift are used by the algorithm in the following formula in order to match an M/Z value:

mz - shift - precision < mzref < mz - shift + precision

Here mzref is the reference M/Z of the database peak being tested. If this double inequality is true, then the M/Z value is matched with this peak.

The parameters shift and precision can be input in either PPM values of M/Z or in plain values. Use the field M/Z tolerance unit to set the unit.
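The double inequality can be written directly as a function. This is a minimal sketch, not the tool's code: the names mz_window and mz_matches are hypothetical, and the PPM conversion assumes that PPM values are taken relative to the input M/Z value.

```python
def mz_window(mz, shift, precision, unit="ppm"):
    """Return the (low, high) bounds that mzref must fall strictly between."""
    if unit == "ppm":
        # Assumption: PPM values are relative to the input M/Z value.
        shift = mz * shift * 1e-6
        precision = mz * precision * 1e-6
    return mz - shift - precision, mz - shift + precision


def mz_matches(mz, mzref, shift, precision, unit="ppm"):
    """True if mz - shift - precision < mzref < mz - shift + precision."""
    low, high = mz_window(mz, shift, precision, unit)
    return low < mzref < high


# 5 ppm at m/z 112.0757 is a half-width of about 0.00056.
print(mz_matches(112.0757, 112.07569, shift=0.0, precision=5.0))  # True
print(mz_matches(112.0757, 112.08, shift=0.0, precision=5.0))     # False
```

With unit="plain", shift and precision are used as absolute M/Z differences.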

Retention time match

If at least one column is selected inside the Chromatographic columns parameter section, then retention time is also matched, in addition to the M/Z value, according to the following formula:

rt - x - rt^y < colrt < rt + x + rt^y

Where x is the value of the parameter RTX and y the value of the parameter RTY.

If, for a reference compound, the database does not contain a retention time for at least one of the specified columns, then only the M/Z value is matched against the peaks of the reference compound. This means that in the results you can find compounds that do not match the provided retention time value.
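The retention time test can be sketched as follows. This is only an illustration of the formula above: rt_matches is a hypothetical name, and a missing database retention time is represented here by None, in which case the test passes so that only M/Z decides.

```python
def rt_matches(rt, colrt, x, y):
    """True if rt - x - rt^y < colrt < rt + x + rt^y.

    colrt is None when the database has no retention time for the
    selected columns; the peak is then accepted on M/Z alone.
    """
    if colrt is None:
        return True
    tolerance = x + rt ** y
    return rt - tolerance < colrt < rt + tolerance


# With x = 5 and y = 0.8, the tolerance at rt = 49.38 s is about 27.6 s.
print(rt_matches(49.38, 60.0, x=5.0, y=0.8))   # True
print(rt_matches(49.38, 100.0, x=5.0, y=0.8))  # False
print(rt_matches(49.38, None, x=5.0, y=0.8))   # True
```

Because the rt^y term grows with rt, the window widens for later-eluting peaks.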

The RTZ parameter is used in the Precursor match algorithm (see below).

Precursor match

If the "Precursor match" option is enabled inside the parameters section, then a more sophisticated version of the algorithm, which is executed in two steps, is used.

This algorithm takes two more parameters, one for each MS mode. These are the lists of precursors. Since the matching is run for one MS mode only, only one of the two parameters is used. Inside the single file database, all the peaks whose peakattr column value is equal to one of the precursors listed in List of negative precursors or List of positive precursors, depending on the mode, are considered precursor peaks.

M/Z matching using precursor matching

  1. Using the normal M/Z matching algorithm described above, we first look only for precursor peaks ([(M+H)]+, [(M+Na)]+, [(M+Cl)]-, ...).
  2. From step 1, we construct a list of matched molecules.
  3. We look at all peaks inside the molecule list obtained in step 2, using the normal M/Z matching algorithm described above.
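The three steps above can be sketched on a toy database. This is an illustrative sketch under simplifying assumptions, not the tool's implementation: function names are hypothetical, the shift is 0, the precision is a plain M/Z value, and all peak values are fake.

```python
def within(mz, mzref, precision):
    """Plain-unit M/Z test with shift = 0."""
    return mz - precision < mzref < mz + precision


def precursor_mz_match(input_mzs, db_peaks, precursors, precision):
    # Step 1: match input values against precursor peaks only.
    precursor_peaks = [p for p in db_peaks if p["peakattr"] in precursors]
    # Step 2: collect the compounds whose precursor was matched.
    matched = {
        p["compoundid"]
        for p in precursor_peaks
        for mz in input_mzs
        if within(mz, p["mztheo"], precision)
    }
    # Step 3: match every input value against all peaks of those compounds.
    return [
        (mz, [p for p in db_peaks
              if p["compoundid"] in matched
              and within(mz, p["mztheo"], precision)])
        for mz in input_mzs
    ]


db = [
    {"compoundid": "A", "mztheo": 147.1128, "peakattr": "[(M+H)]+"},
    {"compoundid": "A", "mztheo": 112.0757, "peakattr": "[(M+H)-(H2O)-(NH3)]+"},
    {"compoundid": "B", "mztheo": 200.0000, "peakattr": "[(M+H)]+"},
    {"compoundid": "B", "mztheo": 112.0757, "peakattr": "[(M+H)-(CO2)]+"},
]
results = precursor_mz_match([147.1128, 112.0757], db, {"[(M+H)]+"}, 0.01)
for mz, hits in results:
    print(mz, [h["compoundid"] for h in hits])
```

Compound B also has a peak at 112.0757, but it is not reported because its precursor was not found in the input.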

MZ/RT matching using precursor matching

  1. Using the normal MZ/RT matching algorithm described above, we first look only for precursor peaks ([(M+H)]+, [(M+Na)]+, [(M+Cl)]-, ...).
  2. From step 1, we construct a list of matched molecules, retaining the matched retention time of each molecule.
  3. For each input couple (m/z,rt), we look at all peaks inside the molecules taken from step 2 whose matched retention time is between rt - z and rt + z, where z is the value of parameter RTZ.
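The MZ/RT variant can be sketched in the same way. Again this is a hedged illustration rather than the tool's code: the names are hypothetical, shift is 0 with a plain precision, and the sketch assumes that the "matched retention time" of a molecule is the input retention time at which its precursor was matched.

```python
def mz_ok(mz, mzref, precision):
    return mz - precision < mzref < mz + precision


def rt_ok(rt, colrt, x, y):
    tol = x + rt ** y
    return rt - tol < colrt < rt + tol


def precursor_mzrt_match(inputs, db_peaks, precursors, precision, x, y, z):
    # Steps 1 and 2: match precursor peaks on M/Z and RT, and remember
    # the input RT of each match (assumption, see the text above).
    matched_rts = {}
    for mz, rt in inputs:
        for p in db_peaks:
            if (p["peakattr"] in precursors
                    and mz_ok(mz, p["mztheo"], precision)
                    and rt_ok(rt, p["chromcolrt"], x, y)):
                matched_rts.setdefault(p["compoundid"], []).append(rt)
    # Step 3: for each input couple, keep peaks of matched molecules whose
    # matched RT lies strictly between rt - z and rt + z.
    results = []
    for mz, rt in inputs:
        hits = [
            p for p in db_peaks
            if mz_ok(mz, p["mztheo"], precision)
            and any(rt - z < m < rt + z
                    for m in matched_rts.get(p["compoundid"], []))
        ]
        results.append(((mz, rt), hits))
    return results


db = [
    {"compoundid": "A", "mztheo": 147.1128, "chromcolrt": 50.0,
     "peakattr": "[(M+H)]+"},
    {"compoundid": "A", "mztheo": 112.0757, "chromcolrt": 50.0,
     "peakattr": "[(M+H)-(H2O)-(NH3)]+"},
]
inputs = [(147.1128, 51.0), (112.0757, 51.5), (112.0757, 300.0)]
out = precursor_mzrt_match(inputs, db, {"[(M+H)]+"}, 0.01, 5.0, 0.8, 2.0)
print([len(hits) for _, hits in out])  # [1, 1, 0]
```

The fragment observed at 300 s is rejected because it is more than z seconds away from the retention time at which the precursor was found.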

Output settings

The Multiple matches separator character sets the character used to separate multiple values inside each row of the main output dataset. The main output contains as many rows as the MZ/RT input dataset; thus, when the algorithm finds more than one match for an MZ/RT value, it concatenates the matches using this separator character.
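As a sketch of what this concatenation looks like, the hypothetical helper below joins the values of each field across several matches. The field names and the "|" separator are chosen for illustration only.

```python
def join_matches(matches, fields, sep="|"):
    """Concatenate, field by field, the values of all matches for one row."""
    return {f: sep.join(str(m[f]) for m in matches) for f in fields}


matches = [
    {"compoundid": "A10", "mztheo": 112.07569},
    {"compoundid": "B22", "mztheo": 112.0757},
]
print(join_matches(matches, ["compoundid", "mztheo"]))
# {'compoundid': 'A10|B22', 'mztheo': '112.07569|112.0757'}
```

Choose a separator that cannot appear in your molecule names or other database values, otherwise the concatenated cells become ambiguous.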

Output files

Three files are output by the tool.

Outputs | File name | Description
Main output | lcmsmatching_{input_file_name} | Contains the same data as the input dataset, with match result included on each row. If more than one match is found for a row, the different values of the match are concatenated using the provided separator character.
Peak list | lcmsmatching_{input_file_name}_peaks | Contains the same data as the input dataset, with match result included on each row. If more than one match is found for a row, then the row is duplicated. Hence there is either no match for a row, or one single match.
HTML output | lcmsmatching_{input_file_name}.html | Contains the same table as Peak list but in HTML format and with links to external databases if columns for PubChem Compound, ChEBI, HMDB Metabolites or KEGG Compounds are provided.

The match results are output as new columns appended to the columns provided inside the MZ/RT input dataset, and prefixed with "lcmsmatching.".
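The row duplication used by the Peak list, together with the column prefix, can be sketched as follows. This is an illustration only: peak_list_rows is a hypothetical name, the match field names are assumptions, and the real tool may fill unmatched rows with NA rather than leaving the match columns out.

```python
PREFIX = "lcmsmatching."


def peak_list_rows(input_row, matches):
    """Return one output row per match, or the input row alone if none."""
    if not matches:
        return [dict(input_row)]
    return [
        {**input_row, **{PREFIX + key: value for key, value in m.items()}}
        for m in matches
    ]


row = {"mz": 112.0757, "rt": 49.38}
matches = [
    {"compoundid": "A10", "mztheo": 112.07569},
    {"compoundid": "B22", "mztheo": 112.0757},
]
for out_row in peak_list_rows(row, matches):
    print(out_row)
```

Each output row keeps the original input columns and adds the prefixed match columns, so the input columns can never be overwritten by a database column of the same name.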

About

Author
Pierrick Roger (pierrick.roger@cea.fr) wrote this MS matching method. MetaboHUB: The French National Infrastructure for Metabolomics and Fluxomics (http://www.metabohub.fr/en).
Acknowledgement
Data and algorithms have been kindly provided by Christophe Junot at DSV/IBITEC-S/SPI (CEA/Saclay), from a former application developed by Cyrille Petat and Arnaud Martel at DSV/IBITEC-S/DIR (CEA/Saclay).
Please cite
R Core Team (2013). R: A Language and Environment for Statistical Computing. http://www.r-project.org.

Changelog/News

Version 4.0.0 - 02/01/2019

  • NEW: Use of R biodb library. Connection to databases and matching have been moved to biodb library, which is maintained separately at http://github.com/pkrog/biodb.