view accuracy.xml @ 2:6169ba9ed42a draft

author testtool
date Fri, 13 Oct 2017 10:10:32 -0400

<tool id="accuracy" name="accuracy" version="1.0.0">
    <description>model creation and accuracy estimation</description>
    <requirements>
        <requirement type="package" version="6.0_76">r-caret</requirement>
    </requirements>
    <command detect_errors="aggressive">
        Rscript '$__tool_directory__/accuracy.R' '$input' '$p' '$output1' '$output2'
    </command>
    <inputs>
        <param format="csv" type="data" name="input" value="" label="Input dataset" help="e.g. iris species table"/>
        <param name="p" type="float" value="0.80" label="Proportion of the data assigned to training; the rest is held out for testing"/>
    </inputs>
    <outputs>
        <data format="csv" name="output1" label="dataset_summary.csv"/>
        <data format="csv" name="output2" label="accuracy_summary.csv"/>
    </outputs>
    <tests>
        <test>
            <!-- Test files are resolved relative to the tool's test-data/ directory -->
            <param name="input" value="input.csv"/>
            <output name="output1" file="dataset_summary.csv"/>
            <output name="output2" file="accuracy_summary.csv"/>
        </test>
    </tests>
    <help>
This tool builds five different models to predict a class label, e.g. species from flower measurements.
The best-performing model can then be selected for further analysis.

The tool evaluates five algorithms:

- **Linear Discriminant Analysis (LDA)**
- **Classification and Regression Trees (CART)**
- **k-Nearest Neighbors (kNN)**
- **Support Vector Machines (SVM) with a linear kernel**
- **Random Forest (RF)**

This is a good mixture of simple linear (LDA), nonlinear (CART, kNN) and complex nonlinear methods (SVM, RF).
The random number seed is reset before each run so that every algorithm is evaluated
on exactly the same data splits, which makes the results directly comparable.
    </help>
</tool>