Galaxy | Tool Preview

Support Vector Machine (SVM) Classifier (version 21.6.3)
Dataset missing? See TIP below.
Name of the column in your Training and Target design files that contains group classifications.
Name of the column in your Training and Target wide datasets that contains unique identifiers.
Only used for the polynomial kernel.
See references in tool description for setting this parameter. Value must be positive (C > 0). Used only if cross-validation is not selected. Default = 1.
Defines the lower bound for regularization parameter C when cross-validation is used. Must have a positive value (C > 0). Default = 0.1.
Defines the upper bound for regularization parameter C when cross-validation is used. Must have a positive value that is larger than the lower bound. Default = 10.
Used in the kernel functions above. Must be greater than zero, except for the default value of 0, which translates to a = 1/n_features, where n_features is the number of features. Default = 0.
Independent term in kernel function. It is only significant in polynomial and sigmoid kernels. Default = 0.

TIP: If your data is not TAB delimited, use Text Manipulation->Convert.

WARNINGS:
    1. This script automatically removes spaces and special characters from strings.
    2. If a feature name starts with a number, the script prepends an '_' to it.
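
The two warnings above can be illustrated with a minimal sketch. The function name `clean_name` is hypothetical, and the sketch assumes that "special characters" means anything other than ASCII letters, digits, and underscores; the tool's actual rules may differ.

```python
import re


def clean_name(name):
    """Remove spaces and special characters; prefix '_' if the name starts with a digit."""
    # Assumption: keep only ASCII letters, digits, and underscores.
    cleaned = re.sub(r"[^0-9A-Za-z_]", "", name)
    # A leading digit gets an underscore prepended, as described in warning 2.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


print(clean_name("2-hydroxy butyrate"))  # _2hydroxybutyrate
print(clean_name("sample 1 (rep A)"))    # sample1repA
```

Checking names this way before upload can help explain why identifiers in the output files differ from those in the input files.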

Tool Description

NOTE: The tool requires a minimum of 100 samples for single or double cross-validation.

Given a set of supervised samples in a Training Dataset, the SVM training algorithm builds a model based on these samples that can be used for predicting the categories of new, unclassified samples in a Target Dataset. The Target Dataset is not used for model training or evaluation, only for prediction based on the finalized model. SVM classification is performed on the target data and accuracy is estimated for both Target and Training Datasets.

SVM uses different kernel functions to carry out different types of classification, such as radial basis (Gaussian), linear, polynomial, and sigmoid. The classification model can be trained with or without cross-validation (single or double).

Single and double cross-validation differ in how the training dataset is split when the model fit is performed.

In single cross-validation, the same data are used to both fit and evaluate the model.

In double cross-validation, the training dataset is split into pieces, and the model is fit on one of the pieces and evaluated on the other pieces.

Without cross-validation, the user specifies Regularization Parameter C directly. Under cross-validation, the user instead specifies the Lower and Upper bounds of Regularization Parameter C. For more information about Regularization Parameter C, see the references below:

Cortes, C. and Vapnik, V. 1995. Support-vector networks. Machine Learning. 20(3): 273-297.

Steinwart, I. and Christmann, A. 2008. Support vector machines. Springer Science and Business Media.

To use the SVM tool, users need the following information:

  1. a Training Dataset with known categories in the training design file,
  2. a Target Dataset with suspected categories in the target design file,
  3. the same name for the Group/Treatment classification column in both design files,
  4. the same Unique Feature IDs in both wide datasets, and
  5. the same number of Unique Feature IDs in both wide datasets.
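
The feature ID requirements in the list above can be checked before running the tool. The following is a minimal sketch, not part of the tool itself: the helper names `feature_ids` and `compare_feature_ids` are hypothetical, and the inline example data are invented for illustration.

```python
import csv
import io


def feature_ids(wide_text, id_column):
    """Read the unique feature ID column from a TAB-delimited wide dataset."""
    reader = csv.DictReader(io.StringIO(wide_text), delimiter="\t")
    return [row[id_column] for row in reader]


def compare_feature_ids(train_ids, target_ids):
    """Return (IDs only in training, IDs only in target); both empty means they match."""
    train, target = set(train_ids), set(target_ids)
    return sorted(train - target), sorted(target - train)


# Illustrative data only: the two wide datasets disagree on one feature each.
training_wide = "Feature\tsample1\tsample2\none\t10\t20\ntwo\t5\t22\n"
target_wide = "Feature\tsample7\tsample8\none\t12\t18\nthree\t30\t27\n"

only_train, only_target = compare_feature_ids(
    feature_ids(training_wide, "Feature"),
    feature_ids(target_wide, "Feature"),
)
print(only_train, only_target)  # ['two'] ['three']
```

Two empty lists indicate that both wide datasets contain the same Unique Feature IDs, which also implies the same number of them.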

Input

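  • Four input datasets are required: a wide dataset and a design file for the Training data, and a wide dataset and a design file for the Target data.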

Wide Formatted Dataset

A wide formatted dataset that contains measurements for each sample:

Feature  sample1  sample2  sample3  ...
one      10       20       10       ...
two      5        22       30       ...
three    30       27       2        ...
four     32       17       8        ...
...      ...      ...      ...      ...

NOTE: The 'Feature' column defines the rows within a wide formatted dataset.

NOTE: The sample IDs must match the sample IDs in the Design File (below). Extra columns will automatically be ignored.

Design File

A Design file relating samples to various groups/treatment:

sampleID  group
sample1   g1
sample2   g1
sample3   g1
sample4   g2
sample5   g2
sample6   g2
...       ...

NOTE: You must have a column named sampleID and the values in this column must match the column names in the wide dataset.
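
The relationship between the two file types can be verified with a short script. This is a sketch under stated assumptions: the function name `check_sample_ids` is hypothetical, both inputs are assumed to be TAB delimited, and the example data (including the `notes` column) are invented for illustration.

```python
import csv
import io


def check_sample_ids(wide_text, design_text, feature_column="Feature"):
    """Return (design samples missing from the wide dataset, extra wide columns)."""
    wide_header = next(csv.reader(io.StringIO(wide_text), delimiter="\t"))
    wide_columns = [c for c in wide_header if c != feature_column]

    design = csv.DictReader(io.StringIO(design_text), delimiter="\t")
    if "sampleID" not in (design.fieldnames or []):
        raise ValueError("The design file must have a column named sampleID")
    design_samples = [row["sampleID"] for row in design]

    # Samples listed in the design file must appear as wide dataset columns.
    missing = [s for s in design_samples if s not in wide_columns]
    # Wide columns absent from the design file are the ones that would be ignored.
    extra = [c for c in wide_columns if c not in design_samples]
    return missing, extra


wide = "Feature\tsample1\tsample2\tsample3\tnotes\none\t10\t20\t10\tok\n"
design = "sampleID\tgroup\nsample1\tg1\nsample2\tg1\nsample3\tg1\nsample4\tg2\n"

missing, extra = check_sample_ids(wide, design)
print(missing, extra)  # ['sample4'] ['notes']
```

A non-empty first list points to sample IDs that need correcting before the files are uploaded.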

Group/Treatment

  • Name of the column in your Design File that contains group classifications.

Unique Feature ID

  • Name of the column in your Wide Dataset that has unique Feature IDs.

SVM Kernel Function

  • Kernel functions available for the SVM algorithm.

Polynomial Degree

  • Only used for the polynomial kernel.

Cross-Validation Choice

  • Cross-validation options available for the user. 'None' corresponds to no cross-validation; in that case the user specifies regularization parameter C manually.

Regularization Parameter C

  • Penalizes potential overfitting. Value must be positive (C > 0). Used only if cross-validation is not selected.

Regularization Parameter C (Lower Bound)

  • Lower bound for regularization parameter C. Value must be greater than 0. Used only if cross-validation is selected.

Regularization Parameter C (Upper Bound)

  • Upper bound for regularization parameter C. Value must be greater than the Lower Bound.
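
The constraints on the two bounds can be expressed as a small validation step. The tool description does not say how candidate values of C are chosen between the bounds, so the log-spaced grid below is purely an assumption for illustration, and the function name `c_candidates` is hypothetical.

```python
import math


def c_candidates(lower=0.1, upper=10.0, n=5):
    """Validate the bounds for C and return n log-spaced candidate values."""
    if lower <= 0:
        raise ValueError("Lower bound must be greater than 0")
    if upper <= lower:
        raise ValueError("Upper bound must be greater than the lower bound")
    if n < 2:
        raise ValueError("Need at least two candidate values")
    # Assumption: candidates are evenly spaced on a log10 scale.
    lo, hi = math.log10(lower), math.log10(upper)
    step = (hi - lo) / (n - 1)
    return [10 ** (lo + i * step) for i in range(n)]


# Uses the documented defaults (0.1 and 10) as bounds.
print(c_candidates())
```

Under cross-validation, each candidate value would be scored and the best one kept; the scoring itself is outside the scope of this sketch.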

Coefficient A

  • Used in the kernel functions above. Must be greater than zero, except for the default value of 0, which translates to a = 1/n_features, where n_features is the number of features.

Coefficient B

  • Independent term in the kernel function. It is only significant in polynomial and sigmoid kernels.
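
To show where Polynomial Degree, Coefficient A, and Coefficient B enter the kernel functions, here is a minimal pure-Python sketch. It assumes the commonly used kernel definitions (linear, polynomial, radial basis, sigmoid); the tool description does not give the exact formulas, so the forms and the function names below are assumptions.

```python
import math


def dot(x, y):
    """Inner product of two equal-length feature vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def resolve_a(a, n_features):
    """Coefficient A: the default of 0 translates to 1/n_features."""
    if a < 0:
        raise ValueError("Coefficient A must not be negative")
    return 1.0 / n_features if a == 0 else a


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, a, b, degree):
    # Assumed form: (a * <x, y> + b) ** degree
    return (a * dot(x, y) + b) ** degree


def rbf_kernel(x, y, a):
    # Assumed form: exp(-a * ||x - y||^2); Coefficient B is not used.
    return math.exp(-a * sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def sigmoid_kernel(x, y, a, b):
    # Assumed form: tanh(a * <x, y> + b)
    return math.tanh(a * dot(x, y) + b)


x, y = [1.0, 2.0], [3.0, 4.0]
a = resolve_a(0, n_features=len(x))  # default 0 becomes 1/2
print(linear_kernel(x, y))                          # 11.0
print(polynomial_kernel(x, y, a, b=1.0, degree=2))  # 42.25
print(rbf_kernel(x, y, a))
print(sigmoid_kernel(x, y, a, b=0.0))
```

In these assumed forms, Coefficient B appears only in the polynomial and sigmoid kernels, which is consistent with the note that it is only significant for those two.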

Output

This tool will output two files for the Training dataset and two for the Target dataset:

Training:

  1. a TSV file containing the observed and predicted grouping classifications for each sample and
  2. a TSV file containing the accuracy (percentage) of the classification.

Target:

  1. a TSV file containing suspected and predicted grouping classifications for each sample and
  2. a TSV file containing the accuracy (percentage) of the prediction in comparison to the suspected grouping specified in the design file.
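
The accuracy files can be cross-checked against the classification files. The sketch below assumes that accuracy is the percentage of samples whose predicted group equals the observed (or suspected) group; the function name `accuracy_percent` and the example labels are hypothetical.

```python
def accuracy_percent(observed, predicted):
    """Percentage of samples whose predicted group matches the observed group."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("Need two non-empty label lists of equal length")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * matches / len(observed)


# Illustrative labels only: one of six samples is misclassified.
observed = ["g1", "g1", "g1", "g2", "g2", "g2"]
predicted = ["g1", "g1", "g2", "g2", "g2", "g2"]
print(round(accuracy_percent(observed, predicted), 2))  # 83.33
```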

NOTE: Some knowledge about the SVM classifier algorithm and different kernel types is recommended for users who plan to use the tool frequently with different settings.