Galaxy | Tool Preview

Blank Filter (version 2.0.0+galaxy0)
The percentage of samples (with an intensity value) that must beat the blank samples (peak by peak basis)
Select the function that should be used to calculate the peak intensity threshold (blanks only)
Minimum fold change
Show options for additional output (*.tsv files)

Blank Filter


Description

Standard DIMS processing workflow: Process Scans -> [Replicate Filter] -> Align Samples -> Blank Filter -> Sample Filter -> [Missing values sample filter] -> Pre-processing -> Statistics


This tool is typically used to remove peaks from the input peak intensity matrix that are believed to originate from non-biological sources, e.g. leachables/extractables from pipette tips, plates, tubes and other consumables, as well as electrical noise peaks and peaks originating from solvents, infusion instrumentation, etc.

In a routine DIMS analytical workflow, a set of extraction blank samples is prepared and analysed alongside the other study samples. These “reference” samples are equivalent to the other study samples, except that they do not contain the biological material to be analysed in the study. Peaks detected in these samples are therefore likely to be of non-biological origin. A peak is removed from the peak intensity matrix if fewer than the user-defined 'Minimum fraction' of non-reference study samples have an intensity ratio (relative to the blank class) greater than the user-specified 'Minimum fold change'. For example, a minimum fraction of 0.5 and a minimum fold change of 10 require that, for any given peak, at least 50% of the non-reference study samples have an intensity value at least 10 times greater than the average intensity value calculated from the “reference” samples.
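The removal rule described above can be illustrated for a single peak. This is a minimal sketch and not the DIMSpy implementation: the function name `keep_peak` and its arguments are hypothetical, missing intensities are represented as `None`, and the fraction is assumed to be computed over the non-reference samples that have an intensity value.

```python
from statistics import mean


def keep_peak(sample_intensities, blank_intensities,
              min_fraction=0.5, min_fold_change=10.0):
    """Return True if the peak survives blank filtering (illustrative only)."""
    # Ignore missing values (None) in both groups.
    blanks = [v for v in blank_intensities if v is not None]
    samples = [v for v in sample_intensities if v is not None]

    # Assumption: a peak never detected in the blanks is kept.
    if not blanks:
        return True
    # Assumption: a peak with no non-reference intensities cannot pass.
    if not samples:
        return False

    # Threshold = average "reference" intensity x minimum fold change.
    threshold = mean(blanks) * min_fold_change
    passing = sum(1 for v in samples if v > threshold)
    return passing / len(samples) >= min_fraction


# Observed blank intensities average 110, so the threshold is 1100.
blanks = [100.0, 120.0, None]
print(keep_peak([5000.0, 900.0, 1500.0, 800.0], blanks))  # 2 of 4 samples pass
print(keep_peak([5000.0, 900.0, 1000.0, 800.0], blanks))  # 1 of 4 samples passes
```

With a minimum fraction of 0.5, the first peak is kept because half of the samples exceed the threshold, while the second is removed.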

Note - while extraction blank samples are typically used to filter peaks from the peak intensity matrix, this tool can in principle be used to filter peaks originating from any class defined in the Process Scans or Replicate Filter metadata file. The class used for filtering the peak intensity matrix is defined using the Label for the blank samples parameter.


Parameters


Peak Intensity Matrix (HDF5 file) (REQUIRED) - a peak intensity matrix (in .hdf5 format), typically returned from the 'Align Samples' tool.

Label for the blank samples (REQUIRED) - a string indicating the name of the class to be used for filtering (e.g. blank), i.e. the “reference” class. This string must have been included in the “classLabel” column of the metadata file associated with the Process Scans or Replicate Filter tool(s).

Minimum fraction (percentage) (REQUIRED; default = 1) - a numeric value ranging from 0 to 1. Setting this value to 0 skips this filtering step. A value greater than 0 requires that, for each peak in the peak intensity matrix, at least this proportion of non-reference samples have an intensity value that exceeds the product of (A) the average intensity of the “reference” class and (B) the user-defined “Minimum fold change”. If this condition is not met, the peak is removed from the peak intensity matrix.

Function (REQUIRED; default = mean) - toggle, where selection of:

  • mean - corresponds to using the non-weighted average of “reference” sample peak intensities (NA values are ignored) in calculating the “reference” to “non-reference” peak intensity ratio.
  • median - corresponds to using the median of “reference” sample peak intensities (NA values are ignored) in calculating the “reference” to “non-reference” peak intensity ratio.
  • max - corresponds to using the maximum of “reference” sample peak intensities (NA values are ignored) in calculating the “reference” to “non-reference” peak intensity ratio.
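The effect of the three options can be sketched as follows. This is an illustration only: the lookup table `SUMMARY_FUNCTIONS` and the helper `reference_intensity` are hypothetical names, and NA values are represented as `None`.

```python
from statistics import mean, median

# Hypothetical lookup table mirroring the three options of the "Function" parameter.
SUMMARY_FUNCTIONS = {"mean": mean, "median": median, "max": max}


def reference_intensity(blank_intensities, function="mean"):
    """Summarise the "reference" intensities of one peak, ignoring NA (None)."""
    observed = [v for v in blank_intensities if v is not None]
    if not observed:
        return None  # assumption: no reference signal was recorded for this peak
    return SUMMARY_FUNCTIONS[function](observed)


blanks = [100.0, None, 400.0, 130.0]
for name in ("mean", "median", "max"):
    print(name, reference_intensity(blanks, name))
```

Because the maximum is never smaller than the mean or the median, selecting max gives the highest threshold and therefore the strictest filter.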

Minimum fold change (REQUIRED; default = 10) - numeric value from 0 upwards. When minimum fraction filtering is enabled, this value defines the minimum required ratio between the intensity of a peak in a “non-reference” sample and the average intensity of the “reference” sample(s). Peaks with ratios exceeding this threshold are considered to have been reliably detected in a “non-reference” sample.


Remove blank samples (rows) (REQUIRED; default = Yes) - toggle, where selection of:

  • Yes - removes samples belonging to the user-defined “reference” class from the output peak matrix.
  • No - retains samples belonging to the user-defined “reference” class in the output peak matrix.
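As a rough sketch of what the "Yes" setting does, the rows whose class label matches the “reference” label are dropped. The tuple layout and the helper name `drop_reference_rows` are assumptions made for this example, not part of the tool.

```python
def drop_reference_rows(rows, blank_label="Blank"):
    """Remove rows whose class label equals the "reference" label.

    rows: list of (sample_name, class_label, intensities) tuples
    (a hypothetical layout chosen for this illustration).
    """
    return [row for row in rows if row[1] != blank_label]


rows = [
    ("QC_1", "QC", [0.0, 0.0]),
    ("Blank_1", "Blank", [3342.626, 0.0]),
    ("Control_10", "Control", [0.0, 45432.2]),
]
for name, label, _ in drop_reference_rows(rows):
    print(name, label)
```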


Show options for additional output(s) (OPTIONAL):

  • Standard output (default = No) - boolean toggle where selection of:

    • No - prevents the export of a .txt formatted peak matrix to the active Galaxy history.
    • Yes - exports a .txt formatted peak matrix to the active Galaxy history that includes only those peaks from the input peak intensity matrix that passed the filtering procedure.
  • Comprehensive output (default = No) - boolean toggle where selection of:

    • No - prevents export of a .txt formatted comprehensive peak matrix.
    • Yes - exports a .txt formatted comprehensive peak matrix to the active Galaxy history that contains the m/z, missing values and other metrics associated with all peaks included in the input peak intensity matrix, including the metric defined by the "The peak matrix should contain intensity | m/z | SNR values" parameter.
  • Should rows or columns represent the samples? (default = rows) - binary toggle where selection of:

    • rows - sample information is presented in the rows and m/z values (for aligned mass spectral peaks) in the columns of any output peak matrix.
    • columns - sample information is presented in the columns and m/z values (for aligned mass spectral peaks) in the rows of any output peak matrix.
  • The peak matrix should contain intensity | m/z | SNR values - use this option to define which peak metric is written to the cells of any optionally output peak matrix:

    • Intensity - writes the absolute peak intensity to the cells of the peak matrix.
    • m/z - writes the mass-to-charge ratio to the cells of the peak matrix.
    • signal-to-noise ratio (SNR) - writes the signal-to-noise ratio to the cells of the peak matrix.
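The two orientations offered by the "Should rows or columns represent the samples?" option are transposes of one another. A minimal sketch, using a hypothetical `transpose` helper on a small excerpt of a peak matrix held as a list of lists:

```python
def transpose(matrix):
    """Swap rows and columns of a rectangular list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


# Samples in rows, m/z values in columns (the default orientation).
samples_in_rows = [
    ["mz", "96.04216", "99.08062"],
    ["QC_1", "0", "0"],
    ["Blank_1", "3342.626", "0"],
]

# Samples in columns, m/z values in rows.
samples_in_columns = transpose(samples_in_rows)
for row in samples_in_columns:
    print("\t".join(row))
```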

Output file(s)

IMPORTANT - in all outputs except for the (optional) comprehensive output, a peak is removed from the output peak matrix if fewer than the user-defined “Minimum fraction” of “non-reference” samples had an intensity value that, when divided by the average “reference” class peak intensity value, exceeded the user-defined “Minimum fold change” parameter.

Default output - an HDF5 file containing the aligned peak intensity matrix.


Optional outputs - the metric recorded in any optionally output peak matrix/matrices is defined using the parameter "The peak matrix should contain intensity | m/z | SNR values". By default, study samples are listed row-wise, while mass-to-charge ratios of the aligned mass spectral peaks are presented in columns (to change this, set the "Should rows or columns represent the samples?" toggle to “columns”).

  • Standard output - an aligned peak matrix in tab-delimited format (“.” as decimal and NA for missing values).

    Example of a standard peak intensity matrix:

    mz          96.04216  99.08062  100.0759  100.8672  ...
    QC_1        0         0         0         0         ...
    Blank_1     3342.626  0         0         0         ...
    Control_10  0         0         45432.2   0         ...
    Sample_2    0         3423.3    0         0         ...
    Control_5   0         0         49759     0         ...
    Control_10  0                   39890.5   0         ...
    Sample_20   0         14563.7   0         0         ...
    Sample_2    0         34676.4   0         0         ...
    Sample_14   0         13134.9   0         521.4     ...
    ...         ...       ...       ...       ...       ...

  • Comprehensive output - an aligned peak matrix, as described for the "standard output" (above), including all metadata from the "Process Scans" Filelist/samplelist and the following additional mass spectral peak metrics:

    • present - a positive integer value (0 < value < total number of study samples in the filelist / samplelist) that indicates the total number of study samples in which a peak was detected with the specified mass-to-charge ratio, plus or minus the user-defined ppm error tolerance.
    • occurrence - a positive integer value indicating the number of peaks that were grouped together during the alignment procedure and that were therefore used to calculate the average mass-to-charge ratio indicated for the aligned peak. A value greater than that of the “present” metric indicates that one or more peaklists contained more than one mass spectral peak with the specified mass-to-charge ratio, plus or minus the user-defined ppm error tolerance.
    • purity - a proportion ranging from 0 to 1 that indicates the number of scans in which only a single peak was detected during the peaklist alignment process. If the value of the “occurrence” metric is greater than that of the “present” metric, purity will be < 1. A purity < 1 means that in at least one peaklist there was more than one mass spectral peak with the specified mass-to-charge ratio, plus or minus the user-defined ppm error tolerance.
    • rsd_all - a numeric value indicating the percent relative standard deviation (otherwise termed the percent coefficient of variation) of peak intensities for peaks aligned together using the Align Samples tool. If fewer than 2 peaks were aligned across samples, then the rsd_all column will be filled in with ‘nan’.
    • blank_flag (may be absent if the “Blank Filter” tool was not applied) - a boolean value where 0 = reject peak, 1 = accept peak. A peak is accepted during blank filtering if a user-defined minimum proportion of study samples had peak intensity values greater than the product of the average of “reference” sample peak intensities and the “min_fold_change” parameter.
    • fraction_flag (may be absent if the “Sample Filter” tool was not applied) - a boolean value where 0 = reject peak, 1 = accept peak. If more than a user-defined minimum fraction of samples (whether checked across ALL experimental classes, or within ANY of the individual experimental classes) had recorded intensity values for a given peak, then this peak is accepted, i.e. it is considered in downstream processing procedures, while rejected peaks are not.
    • flags - a boolean value indicating whether a peak should be included (“1”) or excluded (“0”) from downstream processing procedures. Exclusion of a peak occurs if the thresholds for “relative standard deviation” and/or “minimum number of technical replicates a peak has to be present in” were not met.

    Example of a comprehensive peak intensity matrix:

    mz           missing values  tags_batch  tags_replicates  tags_replicate  tags_injectionOrder  tags_classLabel  tags_untyped  96.04216  99.08062  100.0759  100.8672  ...
    present*                                                                                                                      1         4         3         1         ...
    occurrence*                                                                                                                   1         4         4         1         ...
    purity*                                                                                                                       1         1         1         1         ...
    rsd_all*                                                                                                                      nan       nan       10.98     nan       ...
    flags*                                                                                                                        1         1         1         1         ...
    QC_1         2901            1           2_3_4            2               2                    QC                             0         0         0         0         ...
    Blank_1      2948            1           1_2_4            1               5                    Blank                          3342.626  0         0         0         ...
    Control_10   2921            1           2_3_4            2               10                   Control                        0         0         45432.2   0         ...
    Sample_2     2819            1           1_2_4            1               13                   Exposed                        0         3423.3    0         0         ...
    Control_5    2877            1           2_3_4            2               18                   Control                        0         0         49759     0         ...
    Control_10   2856            1           1_2_3            1               21                   Control                        0                   39890.5   0         ...
    Sample_20    2855            1           1_2_4            1               25                   Exposed                        0         14563.7   0         0         ...
    Sample_2     2814            1           1_2_4            1               29                   Exposed                        0         34676.4   0         0         ...
    Sample_14    2870            1           1_2_3            1               33                   Exposed                        0         13134.9   0         521.4     ...
    ...          ...             ...         ...              ...             ...                  ...              ...           ...       ...       ...       ...       ...
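The rsd_all metric can be reproduced approximately by hand. The sketch below is an illustration under stated assumptions rather than the tool's code: the helper name `percent_rsd` is hypothetical, the sample standard deviation (n - 1 denominator) is assumed, and intensities of 0 or `None` are treated as "peak not detected".

```python
import math
from statistics import mean, stdev


def percent_rsd(intensities):
    """Percent relative standard deviation of the detected intensities."""
    # Assumption: 0 and None both mean that no peak was detected.
    observed = [v for v in intensities if v is not None and v > 0]
    if len(observed) < 2:
        return math.nan  # fewer than 2 peaks: reported as 'nan'
    # stdev() is the sample standard deviation (n - 1 denominator).
    return 100.0 * stdev(observed) / mean(observed)


# Non-zero intensities from the 100.0759 column of the example matrix.
print(percent_rsd([0, 0, 45432.2, 0, 49759.0, 39890.5, 0, 0, 0]))
# A single detected peak, as in the 100.8672 column.
print(percent_rsd([0, 0, 0, 521.4]))
```

Under these assumptions the first call gives a value close to the 10.98 listed for that column in the example, and the second gives nan.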

Developers and contributors

License

DIMSpy is released under the GNU General Public License v3.0 (see LICENSE file)