Support Vector Machine (SVM) Classifier (version 21.6.3)

Training wide dataset:

Dataset missing? See TIP below.

Training design file:

Dataset missing? See TIP below.

Target wide dataset:

Dataset missing? See TIP below.

Target design file:

Dataset missing? See TIP below.

Group/Treatment:

Name of the column in your Training and Target design files that contains group classifications.

Unique Feature ID:

Name of the column in your Training and Target wide datasets that contains unique identifiers.

Select an SVM Kernel Function:

Polynomial Degree:

Only used for the polynomial kernel.

Select Cross-Validation:

Regularization Parameter C:

See references in tool description for setting this parameter. Value must be positive (C > 0). Used only if cross-validation is not selected. Default = 1.

Regularization Parameter C (Lower Bound):

Defines the lower bound for regularization parameter C when cross-validation is used. Must have a positive value (C > 0). Default = 0.1.

Regularization Parameter C (Upper Bound):

Defines the upper bound for regularization parameter C when cross-validation is used. Must have a positive value that is larger than the lower bound. Default = 10.

Coefficient A:

Coefficient used in the kernel function. Must be greater than zero, except for the default value of 0, which is treated as a = 1/n_features, where n_features is the number of features. Default = 0.

Coefficient B:

Independent term in kernel function. It is only significant in polynomial and sigmoid kernels. Default = 0.

TIP: If your data is not TAB delimited, use Text Manipulation->Convert.

WARNINGS:

1. This script automatically removes spaces and special characters from strings.
2. If a feature name starts with a number, the script prepends an '_' to it.
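
The two warnings above can be illustrated with a minimal Python sketch. The exact character rules the tool applies are not documented here, so the function name `clean_feature_name` and the choice to keep only ASCII letters, digits, and underscores are assumptions made for illustration.

```python
import re

def clean_feature_name(name: str) -> str:
    """Hypothetical cleaner; the tool's exact rules may differ."""
    # Assumption: "special characters" means anything other than
    # ASCII letters, digits, and underscores; spaces are dropped too.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "", name)
    # Prepend an underscore when the name starts with a digit.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned

print(clean_feature_name("2-hydroxy glutarate"))  # _2hydroxyglutarate
print(clean_feature_name("citric acid"))          # citricacid
```

Because names can change in this way, it may be worth checking that cleaned feature names are still unique before running the tool.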

Tool Description

NOTE: The tool requires a minimum of 100 samples for single or double cross-validation.

Given a set of supervised samples in a Training Dataset, the SVM training algorithm builds a model based on these samples that can be used for predicting the categories of new, unclassified samples in a Target Dataset. The Target Dataset is not used for model training or evaluation, only for prediction based on the finalized model. SVM classification is performed on the target data and accuracy is estimated for both Target and Training Datasets.

SVM uses different kernel functions to carry out different types of classification: radial basis (Gaussian), linear, polynomial, and sigmoid. The classification model can be trained with or without cross-validation (single or double).

Single and double cross-validation differ in how the training dataset is used when the model is fit.

In single cross-validation, the same data are used to both fit and evaluate the model.

In double cross-validation, the training dataset is split into pieces, and the model fit is performed on one of the pieces and evaluated on the other pieces.

Without cross-validation, the user specifies Regularization Parameter C directly. Under cross-validation, the user instead specifies the Lower and Upper Bounds of Regularization Parameter C. For more information about Regularization Parameter C, see the references below:

Cortes, C. and Vapnik, V. 1995. Support-vector networks. Machine Learning. 20(3) 273-297.

Steinwart, I and Christmann, A. 2008. Support vector machines. Springer Science and Business Media.

To use the SVM tool, users need the following information:

A Training Dataset with known categories in the training design file.
A Target Dataset with suspected categories in the target design file.
The name of the Group/Treatment classification column must be the same in both design files.
The Unique Feature IDs must be the same in both wide datasets.
The number of Unique Feature IDs must be the same in both wide datasets.
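
The last two requirements can be checked before running the tool. The following is a minimal sketch, not part of the tool itself; the function name `compare_feature_ids` and the example IDs are hypothetical.

```python
def compare_feature_ids(training_ids, target_ids):
    """Return the IDs found in only one of the two wide datasets."""
    training_set, target_set = set(training_ids), set(target_ids)
    only_training = sorted(training_set - target_set)
    only_target = sorted(target_set - training_set)
    return only_training, only_target

only_training, only_target = compare_feature_ids(
    ["one", "two", "three", "four"],
    ["one", "two", "three", "five"],
)
print(only_training)  # ['four']
print(only_target)    # ['five']
```

Two empty lists mean that both wide datasets contain the same set of Unique Feature IDs.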

Input

Four input datasets are required.

Wide Formatted Dataset

A wide formatted dataset that contains measurements for each sample:

Feature	sample1	sample2	sample3	...
one	10	20	10	...
two	5	22	30	...
three	30	27	2	...
four	32	17	8	...
...	...	...	...	...

NOTE: The 'Feature' column defines the rows within a wide formatted dataset.

NOTE: The sample IDs must match the sample IDs in the Design File (below). Extra columns will automatically be ignored.
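
Because the wide format stores one row per feature, a classifier that expects one vector per sample needs the table transposed. How the tool reads the file internally is not described here, so the following is only a sketch using Python's standard `csv` module; the function name `wide_to_samples` and the inline data are illustrative assumptions.

```python
import csv
import io

# Small wide dataset in the layout shown above (tab-delimited).
WIDE_TSV = (
    "Feature\tsample1\tsample2\tsample3\n"
    "one\t10\t20\t10\n"
    "two\t5\t22\t30\n"
    "three\t30\t27\t2\n"
    "four\t32\t17\t8\n"
)

def wide_to_samples(text, feature_col="Feature"):
    """Return (feature_ids, {sample_id: [values in feature order]})."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    sample_ids = [c for c in reader.fieldnames if c != feature_col]
    feature_ids = []
    samples = {s: [] for s in sample_ids}
    for row in reader:
        feature_ids.append(row[feature_col])
        for s in sample_ids:
            samples[s].append(float(row[s]))
    return feature_ids, samples

feature_ids, samples = wide_to_samples(WIDE_TSV)
print(feature_ids)         # ['one', 'two', 'three', 'four']
print(samples["sample2"])  # [20.0, 22.0, 27.0, 17.0]
```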

Design File

A Design file relating samples to various groups/treatment:

sampleID	group
sample1	g1
sample2	g1
sample3	g1
sample4	g2
sample5	g2
sample6	g2
...	...

NOTE: You must have a column named sampleID and the values in this column must match the column names in the wide dataset.
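
The matching rule in this note can be verified ahead of time. The sketch below assumes tab-delimited input and uses hypothetical names (`match_samples`, `DESIGN_TSV`, `WIDE_HEADER`); it reports design-file samples that are missing from the wide dataset, as well as wide-dataset columns that the design file does not mention.

```python
import csv
import io

DESIGN_TSV = (
    "sampleID\tgroup\n"
    "sample1\tg1\n"
    "sample2\tg1\n"
    "sample4\tg2\n"
)
WIDE_HEADER = ["Feature", "sample1", "sample2", "sample3", "sample4"]

def match_samples(design_text, wide_header,
                  group_col="group", feature_col="Feature"):
    """Map each design sample to its group; report unmatched IDs."""
    reader = csv.DictReader(io.StringIO(design_text), delimiter="\t")
    groups = {row["sampleID"]: row[group_col] for row in reader}
    wide_columns = set(wide_header)
    # Design-file samples with no matching column in the wide dataset.
    missing = sorted(s for s in groups if s not in wide_columns)
    # Wide-dataset columns (other than the feature ID column) that the
    # design file does not list.
    extras = [c for c in wide_header
              if c != feature_col and c not in groups]
    return groups, missing, extras

groups, missing, extras = match_samples(DESIGN_TSV, WIDE_HEADER)
print(missing)  # []
print(extras)   # ['sample3']
```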

Group/Treatment

Name of the column in your Design File that contains group classifications.

Unique Feature ID

Name of the column in your Wide Dataset that has unique Feature IDs.

SVM Kernel Function

Kernel functions available for the SVM algorithm.

Polynomial Degree

Only used for the polynomial kernel.

Cross-Validation Choice

Cross-validation options available to the user. 'None' corresponds to no cross-validation; in that case the user specifies regularization parameter C manually.

Regularization Parameter C

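Penalty on misclassified training samples; smaller values give stronger regularization and so limit overfitting. Value must be positive (C > 0). Used only if cross-validation is not selected.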

Regularization Parameter C (Lower Bound)

Lower bound for regularization parameter C. Value must be greater than 0. Used only if cross-validation is selected.

Regularization Parameter C (Upper Bound)

Upper bound for regularization parameter C. Value must be greater than the Lower Bound. Used only if cross-validation is selected.
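
The constraints on the two bounds can be expressed as a short check. How the tool searches the interval between them is not described here, so the log-spaced grid in this sketch is purely an assumption (a common convention for C), and `candidate_c_values` and `n_values` are hypothetical names. The defaults of 0.1 and 10 are the ones stated for the tool.

```python
import math

def candidate_c_values(lower=0.1, upper=10.0, n_values=5):
    """Validate the bounds, then return log-spaced candidates for C."""
    if lower <= 0:
        raise ValueError("Lower bound must be positive (C > 0).")
    if upper <= lower:
        raise ValueError("Upper bound must exceed the lower bound.")
    if n_values < 2:
        raise ValueError("Need at least two candidate values.")
    log_lo, log_hi = math.log10(lower), math.log10(upper)
    step = (log_hi - log_lo) / (n_values - 1)
    # Assumption: candidates are evenly spaced on a log10 scale.
    return [10 ** (log_lo + i * step) for i in range(n_values)]

print([round(c, 3) for c in candidate_c_values()])
# [0.1, 0.316, 1.0, 3.162, 10.0]
```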

Coefficient A

Coefficient used in the kernel function. Must be greater than zero, except for the default value of 0, which translates to a = 1/n_features, where n_features is the number of features.

Coefficient B

Independent term in the kernel function. It is only significant in polynomial and sigmoid kernels.
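
To show where the polynomial degree, Coefficient A, and Coefficient B enter, here is a sketch of the four kernels in their conventional textbook forms. The tool description does not spell out the formulas, so mapping Coefficient A to `a` and Coefficient B to `b` in this way is an assumption, and all function names are illustrative.

```python
import math

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def resolve_a(a, n_features):
    """A value of 0 is read as 1 / n_features (the stated default)."""
    return 1.0 / n_features if a == 0 else a

def linear_kernel(x, y):
    return dot(x, y)

def polynomial_kernel(x, y, degree, a=0, b=0):
    # (a * <x, y> + b) ** degree
    a = resolve_a(a, len(x))
    return (a * dot(x, y) + b) ** degree

def rbf_kernel(x, y, a=0):
    # exp(-a * ||x - y||^2); b is not used here
    a = resolve_a(a, len(x))
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-a * sq_dist)

def sigmoid_kernel(x, y, a=0, b=0):
    # tanh(a * <x, y> + b)
    a = resolve_a(a, len(x))
    return math.tanh(a * dot(x, y) + b)

x, y = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, y))                     # 11.0
print(polynomial_kernel(x, y, degree=2, b=1))  # 42.25
```

In these forms `b` appears only in the polynomial and sigmoid kernels, which matches the statement that Coefficient B is significant only for those two.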

Output

This tool outputs two files for the Training dataset and two for the Target dataset:

Training:

a TSV file containing the observed and predicted grouping classifications for each sample and
a TSV file containing the accuracy (percentage) of the classification.

Target:

a TSV file containing suspected and predicted grouping classifications for each sample and
a TSV file containing the accuracy (percentage) of the prediction in comparison to the suspected grouping specified in the design file.
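
The accuracy reported in these files is a percentage of matching classifications. A minimal sketch of that calculation, assuming accuracy is the share of samples whose predicted group equals the observed (or suspected) group; the function name and example labels are hypothetical.

```python
def accuracy_percent(observed, predicted):
    """Percentage of samples whose predicted group matches."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("Need two non-empty lists of equal length.")
    hits = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * hits / len(observed)

observed = ["g1", "g1", "g1", "g2", "g2", "g2"]
predicted = ["g1", "g1", "g2", "g2", "g2", "g2"]
print(round(accuracy_percent(observed, predicted), 1))  # 83.3
```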

NOTE: Some knowledge about the SVM classifier algorithm and different kernel types is recommended for users who plan to use the tool frequently with different settings.