TIP: If your data is not TAB delimited, use Text Manipulation->Convert.
Tool Description
NOTE: The tool requires a minimum of 100 samples for single or double cross-validation.
Given a set of supervised samples in a Training Dataset, the SVM training algorithm builds a model based on these samples that can be used for predicting the categories of new, unclassified samples in a Target Dataset. The Target Dataset is not used for model training or evaluation, only for prediction based on the finalized model. SVM classification is performed on the target data and accuracy is estimated for both Target and Training Datasets.
SVM can use different kernel functions, such as radial basis (Gaussian), linear, polynomial, and sigmoid, to carry out different types of classification. The classification model can be trained with or without cross-validation (single or double).
Single and double cross-validation differ in how the training dataset is split when the model fit is performed:
In single cross-validation, the same data are used to both fit and evaluate the model.
In double cross-validation, the training dataset is split into pieces; the model fit is performed on one of the pieces and evaluated on the other pieces.
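The splitting step described for double cross-validation can be sketched as follows. This is a minimal illustration only, not the tool's implementation: the helper names `split_into_pieces` and `fit_and_eval_splits`, the number of pieces, the shuffling seed, and the rotation over pieces are all assumptions made for the example.

```python
import random


def split_into_pieces(sample_ids, n_pieces=3, seed=0):
    """Shuffle the sample IDs and deal them into n_pieces disjoint pieces.

    n_pieces and seed are hypothetical defaults, not values used by the tool.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_pieces] for i in range(n_pieces)]


def fit_and_eval_splits(pieces):
    """Yield (fit_ids, eval_ids): fit on one piece, evaluate on the others."""
    for i, fit_ids in enumerate(pieces):
        eval_ids = [s for j, piece in enumerate(pieces) if j != i for s in piece]
        yield fit_ids, eval_ids


sample_ids = ["sample%d" % i for i in range(1, 11)]
pieces = split_into_pieces(sample_ids, n_pieces=3)
for fit_ids, eval_ids in fit_and_eval_splits(pieces):
    # No sample is ever used for both fitting and evaluation in one split.
    print(len(fit_ids), "fit samples,", len(eval_ids), "evaluation samples")
```

In single cross-validation, by contrast, the description above implies that `fit_ids` and `eval_ids` would be the same set of samples.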
Under cross-validation, the user specifies Regularization Parameter C and the Upper and Lower bounds of Regularization Parameter C. For more information about Regularization Parameter C, see references below:
Cortes, C. and Vapnik, V. 1995. Support-vector networks. Machine Learning 20(3): 273-297.
Steinwart, I. and Christmann, A. 2008. Support Vector Machines. Springer Science and Business Media.
To use the SVM tool, users need the following information:
Input
- Four input datasets are required.
Wide Formatted Dataset
A wide formatted dataset that contains measurements for each sample:
Feature    sample1    sample2    sample3    ...
one        10         20         10         ...
two        5          22         30         ...
three      30         27         2          ...
four       32         17         8          ...
...        ...        ...        ...        ...

NOTE: The 'Feature' column defines the rows within a wide formatted dataset.
NOTE: The sample IDs must match the sample IDs in the Design File (below). Extra columns will automatically be ignored.
Design File
A Design file relating samples to various groups/treatment:
sampleID    group
sample1     g1
sample2     g1
sample3     g1
sample4     g2
sample5     g2
sample6     g2
...         ...

NOTE: You must have a column named sampleID and the values in this column must match the column names in the wide dataset.
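Before uploading, the requirement that sample IDs in the Design File match the column names of the wide dataset can be checked with a short script. This is a sketch under assumptions: the function name `check_sample_ids` and the inline example data are hypothetical, both inputs are assumed to be TAB delimited, and the unique feature ID column is assumed to be named 'Feature'.

```python
import csv
import io

# Small hypothetical inputs, TAB delimited as the tool expects.
WIDE = "Feature\tsample1\tsample2\tsample3\none\t10\t20\t10\ntwo\t5\t22\t30\n"
DESIGN = "sampleID\tgroup\nsample1\tg1\nsample2\tg1\nsample3\tg2\nsample4\tg2\n"


def check_sample_ids(wide_text, design_text, feature_col="Feature"):
    """Return (missing, extra) sample IDs.

    missing: IDs listed in the design file with no matching wide column.
    extra:   wide columns (other than feature_col) absent from the design file.
    """
    wide_header = next(csv.reader(io.StringIO(wide_text), delimiter="\t"))
    wide_samples = [col for col in wide_header if col != feature_col]

    design = csv.DictReader(io.StringIO(design_text), delimiter="\t")
    if "sampleID" not in (design.fieldnames or []):
        raise ValueError("Design File must have a column named sampleID")
    design_ids = [row["sampleID"] for row in design]

    missing = [s for s in design_ids if s not in wide_samples]
    extra = [c for c in wide_samples if c not in design_ids]
    return missing, extra


missing, extra = check_sample_ids(WIDE, DESIGN)
print("In design but not in wide dataset:", missing)
print("In wide dataset but not in design:", extra)
```

With the example data, `sample4` is reported as missing because it appears in the design file but has no column in the wide dataset.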
Group/Treatment
- Name of the column in your Design File that contains group classifications.
Unique Feature ID
- Name of the column in your Wide Dataset that has unique Feature IDs.
SVM Kernel Function
- Kernel functions available for the SVM algorithm.
Polynomial Degree
- Only used for the polynomial kernel.
Cross-Validation Choice
- Cross-validation options available for the user. 'None' corresponds to no cross-validation; in that case the user specifies regularization parameter C manually.
Regularization Parameter C
- Penalizes potential overfitting. Must be positive.
Regularization Parameter C (Lower Bound)
- Lower bound for regularization parameter C. Value must be greater than 0. Only used if cross-validation is selected.
Regularization Parameter C (Upper Bound)
- Upper bound for regularization parameter C. Value must be greater than the Lower Bound.
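The constraints on the two bounds can be illustrated with a small helper that validates them and generates candidate values of C in between. How the tool actually searches between the bounds is not stated here; the log-spaced grid, the function name `candidate_c_values`, and the `n_values` parameter are assumptions for illustration only.

```python
import math


def candidate_c_values(lower, upper, n_values=5):
    """Return n_values log-spaced candidates for C between lower and upper.

    The log spacing is an assumption, not the tool's documented behavior.
    """
    if lower <= 0:
        raise ValueError("Lower Bound must be greater than 0")
    if upper <= lower:
        raise ValueError("Upper Bound must be greater than the Lower Bound")
    if n_values < 2:
        raise ValueError("n_values must be at least 2")

    log_lo, log_hi = math.log10(lower), math.log10(upper)
    step = (log_hi - log_lo) / (n_values - 1)
    return [10 ** (log_lo + i * step) for i in range(n_values)]


for c in candidate_c_values(0.01, 100, n_values=5):
    print("candidate C = %g" % c)
```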
Coefficient A
- Used in the kernel functions above. Default = 0, which the tool translates to a = 1/n_features, where n_features is the number of features. Any other value must be greater than zero.
Coefficient B
- Independent term in the kernel function. It is only significant in polynomial and sigmoid kernels.
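To make the roles of Polynomial Degree, Coefficient A, and Coefficient B concrete, the four kernels can be written out in plain Python. The formulas below follow the conventions commonly used by SVM libraries such as scikit-learn (where A corresponds to `gamma` and B to `coef0`); that this tool uses exactly these forms is an assumption, and the function names are hypothetical.

```python
import math


def dot(x, y):
    """Inner product of two equal-length feature vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def resolve_a(a, n_features):
    """A value of 0 is translated to 1/n_features, as described above."""
    return 1.0 / n_features if a == 0 else a


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, a, b, degree):
    # Coefficient B is the independent term; degree is the Polynomial Degree.
    return (a * dot(x, y) + b) ** degree


def rbf_kernel(x, y, a):
    # Radial basis (Gaussian) kernel: depends on squared Euclidean distance.
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-a * sq_dist)


def sigmoid_kernel(x, y, a, b):
    return math.tanh(a * dot(x, y) + b)


x, y = [1.0, 2.0], [3.0, 4.0]
a = resolve_a(0, n_features=len(x))  # default 0 becomes 1/2
print("linear:    ", linear_kernel(x, y))
print("polynomial:", polynomial_kernel(x, y, a, b=1.0, degree=2))
print("rbf:       ", rbf_kernel(x, y, a))
print("sigmoid:   ", sigmoid_kernel(x, y, a, b=1.0))
```

Note that Coefficient B appears only in the polynomial and sigmoid forms, which is why it has no effect on the linear and radial basis kernels.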
Output
This tool will output two files for the Training Dataset and two for the Target Dataset:
Training:
Target:
NOTE: Some knowledge about the SVM classifier algorithm and different kernel types is recommended for users who plan to use the tool frequently with different settings.