view data_normalization_and_rescaling.xml @ 1:2e7d47c0b027 draft

"planemo upload for repository https://malex@toolshed.g2.bx.psu.edu/repos/malex/secimtools"
author malex
date Mon, 08 Mar 2021 22:04:06 +0000
parents
children
line wrap: on
line source

<tool id="secimtools_data_normalization_and_rescaling" name="Normalization and Re-Scaling" version="@WRAPPER_VERSION@">
    <description>of data.</description>
    <macros>
        <import>macros.xml</import>
    </macros>
    <expand macro="requirements" />
    <command><![CDATA[
data_normalization_and_rescaling.py
--input $input
--design $design
--uniqID $uniqID
--method $method
--out $out
    ]]></command>
    <inputs>
        <param name="input" type="data" format="tabular" label="Wide Dataset" help="Input tab-separated wide format dataset. If not tab separated see TIP below."/>
        <param name="design" type="data" format="tabular" label="Design File" help="Input your design file (tab-separated). Note you need a 'sampleID' column. If not tab separated see TIP below."/>
        <param name="uniqID" type="text" size="30" value="" label="Unique Feature ID" help="Name of the column in your Wide Dataset that has unique identifiers."/>
        <param name="method" size="30" type="select" value="" display="radio" label="Normalization Method" help="Method to be used for normalization and re-scaling of the data.">
            <option value="mean" selected="true">Mean (samples)</option>
            <option value="sum" selected="true">Sum (samples)</option>
            <option value="median" selected="true">Median (samples)</option>
            <option value="centering" selected="true">Centering (features)</option>
            <option value="auto" selected="true">Autoscaling (features)</option>
            <option value="pareto" selected="true">Pareto (features)</option>
            <option value="range" selected="true">Range (features)</option>
            <option value="level" selected="true">Level (features)</option>
            <option value="vast" selected="true">VAST (features)</option>
        </param>
    </inputs>
    <outputs>
        <data format="tabular" name="out" label="${tool.name} on ${on_string}: Normalized data"/>
    </outputs>
    <tests>
        <test>
            <param name="input"  value="ST000006_data.tsv"/>
            <param name="design" value="ST000006_design.tsv"/>
            <param name="uniqID" value="Retention_Index" />
            <param name="method" value="mean" />
            <output name="out"   file="ST000006_data_normalization_and_rescaling_mean_output.tsv" />
        </test>
        <test>
            <param name="input"  value="ST000006_data.tsv"/>
            <param name="design" value="ST000006_design.tsv"/>
            <param name="uniqID" value="Retention_Index" />
            <param name="method" value="vast" />
            <output name="out"   file="ST000006_data_normalization_and_rescaling_vast_output.tsv" />
        </test>
    </tests>
    <help><![CDATA[

@TIP_AND_WARNING@

**Tool Description**

The first three normalization methods (Mean, Sum and Median) perform re-scaling of the data by sample.
Each individual sample (column) in the wide dataset is re-scaled by dividing all feature values within that column by the mean, median or sum of those feature values.
Each sample (column) is re-scaled independently from other samples (columns).

The last six normalization methods (Centering, Pareto, Autoscaling, Range, Level, and Variable Stability (VAST)) perform scaling of the data by features.
Each feature (row) is re-scaled independently from other features.
Each individual feature (row) in the wide dataset is centered by subtraction of the mean of that feature and is re-scaled by dividing all the feature values within that row by the scaling factor.
The scaling factor is computed from the feature values in the current row and depends on the selected method.
Centering does not have a scaling factor and does not perform division, Autoscaling uses standard deviation, Pareto scaling uses the square root of the standard deviation, Range uses the difference between the max and min values, and Level uses the mean.
VAST scaling is performed in two steps. The first step is Autoscaling, followed by division of the resulting feature values in each row by the coefficient of variation for that feature.

More information on the scaling methods are available from the literature:

Keun, Hector C., Timothy MD Ebbels, Henrik Antti, Mary E. Bollard, Olaf Beckonert, Elaine Holmes, John C. Lindon, and Jeremy K. Nicholson. "Improved analysis of multivariate data by variable stability scaling: application to NMR-based metabolic profiling." Analytica chimica acta 490, no. 1 (2003): 265-276.
van den Berg, Robert A., Huub CJ Hoefsloot, Johan A. Westerhuis, Age K. Smilde, and Mariƫt J. van der Werf. "Centering, scaling, and transformations: improving the biological information content of metabolomics data." BMC genomics 7, no. 1 (2006): 142


------------------------------------------------------------------------------------------

**Input Files**

    - Two input datasets are required.

@WIDE@

**NOTE:** The sample IDs must match the sample IDs in the Design File
(below). Extra columns will automatically be ignored.

@METADATA@

@UNIQID@

**Normalization Method**

    - Method to be used for normalization and re-scaling of the data. The parenthesis indicates whether the method will be applied to samples or features.

--------------------------------------------------------------------------------

**Output**

TSV file containing the same column names as in the original Wide Dataset where the values in each cell correspond to the values after normalization/re-scaling.

    ]]></help>
    <expand macro="citations"/>
</tool>