This is the old vignette - you can find the new vignette on CRAN: OneR vignette

Introduction

The following story is one of the most often told in the Data Science community: Some time ago the military built a system whose aim was to distinguish military vehicles from civilian ones. They chose a neural network approach and trained the system with pictures of tanks, humvees and missile launchers on the one hand and normal cars, pickups and trucks on the other. After having reached a satisfactory accuracy they brought the system into the field (quite literally). It failed completely, performing no better than a coin toss. What had happened? No one knew, so they re-engineered the black box (no small feat in itself) and found that most of the military pictures were taken at dusk or dawn, while most of the civilian pictures were taken under brighter weather conditions. The neural net had learned the difference between light and dark!

Although this might be an urban legend, the fact that it is told so often tells us something:

  1. Many of our Machine Learning models are so complex that we cannot understand them ourselves.
  2. Because of 1. we cannot differentiate between the simpler aspects of a problem which can be tackled by simple models and the more sophisticated ones which need specialized treatment.

The above is not only true for neural networks (and especially deep neural networks) but also for most of the methods used today, especially Support Vector Machines, Random Forests and, in general, all kinds of ensemble-based methods.

In short: we need a good baseline that builds “the best simple model”, i.e. one that strikes a balance between the best accuracy possible and a model that is still simple to grasp. We have developed the OneR package to find this sweet spot and thereby establish a new baseline for classification models in Machine Learning.

This package fills a longstanding gap because so far only a Java-based implementation was available (the RWeka package as an interface to the OneR Java class). Additionally, several enhancements have been made (see below).

Design principles for the OneR package

The following design principles were followed when programming the package:

The package is based on the – as the name might reveal – one rule classification algorithm [Holte93]. Although the underlying method is simple enough (basically 1-level decision trees; you can find out more here: OneR), several enhancements have been made.
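
To make the underlying idea concrete, here is a minimal base-R sketch of the 1R principle: discretize every attribute, predict the majority class for each attribute value, and keep the attribute whose rules classify the training data best. This is only an illustration with naive equal-width binning via cut(), not the package's actual implementation (which, among other things, offers smarter binning via optbin, see below).

# Minimal sketch of the 1R idea (illustration only, not the package's code)
data(iris)
binned <- as.data.frame(lapply(iris[ , 1:4], cut, breaks = 3))  # naive equal-width bins
target <- iris$Species
rule_accuracy <- sapply(binned, function(feature) {
  tab <- table(feature, target)               # attribute value vs. class
  sum(apply(tab, 1, max)) / length(target)    # majority class per value -> training accuracy
})
sort(rule_accuracy, decreasing = TRUE)        # the best attribute supplies the "one rule"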

Getting started with a simple example

You can also watch this video, which goes through the following example step by step:

Quick Start Guide for the OneR package (Video)

Install from CRAN and load package

install.packages("OneR")
library(OneR)

Use the famous Iris dataset and determine optimal bins for numeric data

data <- optbin(iris)
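
As a quick check with base R you can inspect what optbin did: every numeric column has been replaced by a factor whose levels are the chosen intervals.

str(data)                  # numeric columns are now factors of intervals
levels(data$Petal.Width)   # the cut points that the rules below will use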

Build model with best predictor

model <- OneR(data, verbose = TRUE)
        
    Attribute    Accuracy
1 * Petal.Width  96%     
2   Petal.Length 95.33%  
3   Sepal.Length 74.67%  
4   Sepal.Width  55.33%  
---
Chosen attribute due to accuracy
and ties method (if applicable): '*'
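
As a side note, the same model can also be built via a formula interface; the data-frame call above simply takes the last column (here Species) as the target. The following call is meant as an equivalent alternative (assuming the package's formula method):

model <- OneR(Species ~ ., data = data, verbose = TRUE)   # same model, target named explicitly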

Show learned rules and model diagnostics

summary(model)

Rules:
If Petal.Width = (0.0976,0.791] then Species = setosa
If Petal.Width = (0.791,1.63]   then Species = versicolor
If Petal.Width = (1.63,2.5]     then Species = virginica

Accuracy:
144 of 150 instances classified correctly (96%)

Contingency table:
            Petal.Width
Species      (0.0976,0.791] (0.791,1.63] (1.63,2.5] Sum
  setosa               * 50            0          0  50
  versicolor              0         * 48          2  50
  virginica               0            4       * 46  50
  Sum                    50           52         48 150
---
Maximum in each column: '*'

Pearson's Chi-squared test:
X-squared = 266.35, df = 4, p-value < 2.2e-16
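
For reference, this is just Pearson's chi-squared test on the contingency table above; you should get the same statistic directly with base R:

chisq.test(table(data$Petal.Width, data$Species))   # X-squared = 266.35, df = 4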

Plot model diagnostics

plot(model)
(Figure: OneR with Iris data set)

Use model to predict data

prediction <- predict(model, data)

Evaluate prediction statistics

eval_model(prediction, data)

           actual            
prediction   setosa versicolor virginica Sum
  setosa         50          0         0  50
  versicolor      0         48         4  52
  virginica       0          2        46  48
  Sum            50         50        50 150

           actual            
prediction   setosa versicolor virginica  Sum
  setosa       0.33       0.00      0.00 0.33
  versicolor   0.00       0.32      0.03 0.35
  virginica    0.00       0.01      0.31 0.32
  Sum          0.33       0.33      0.33 1.00

Accuracy:
0.96 (144/150)

Error rate:
0.04 (6/150)
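
For reference, the two tables printed by eval_model are just the absolute and the relative confusion matrix; with base R alone you get the same numbers like this:

conf <- table(prediction, data$Species)   # absolute confusion matrix
round(prop.table(conf), 2)                # relative frequencies (second table above, without the margins)
sum(diag(conf)) / sum(conf)               # overall accuracy: 0.96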

Please note that the very good accuracy of 96% is reached effortlessly.

Petal.Width is identified as the attribute with the highest predictive value. The cut points of the intervals are found automatically (via the included optbin function). The results are three very simple, yet accurate, rules to predict the respective species:

If Petal.Width = (0.0976,0.791] then Species = setosa
If Petal.Width = (0.791,1.63]   then Species = versicolor
If Petal.Width = (1.63,2.5]     then Species = virginica
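
Just to show how little machinery is behind these rules, they can be applied by hand with a single cut() call in base R; the breaks are the cut points from the rules above and the result should match the confusion matrix produced by eval_model above.

manual <- cut(iris$Petal.Width, breaks = c(0.0976, 0.791, 1.63, 2.5),
              labels = c("setosa", "versicolor", "virginica"))
table(manual, iris$Species)   # same counts as the eval_model confusion matrix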

The nearly perfect separation of the colors in the diagnostic plot gives a good indication of the model’s ability to separate the different species.

A more sophisticated real-world example

The next example tries to find a model for the identification of breast cancer. The example can be run without further preparation:

library(OneR)

# http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names
# According to this source the best out-of-sample performance was 95.9%

# #  Attribute                     Domain
# -- -----------------------------------------
# 1. Sample code number           id number
# 2. Clump Thickness              1 - 10
# 3. Uniformity of Cell Size      1 - 10
# 4. Uniformity of Cell Shape     1 - 10
# 5. Marginal Adhesion            1 - 10
# 6. Single Epithelial Cell Size  1 - 10
# 7. Bare Nuclei                  1 - 10
# 8. Bland Chromatin              1 - 10
# 9. Normal Nucleoli              1 - 10
# 10. Mitoses                     1 - 10
# 11. Class:                      (2 for benign, 4 for malignant)

Load and name data

data <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", na.strings = "?")
data <- data[-1] #remove sample code number
class <- factor(data[ , ncol(data)], levels = c(2, 4), labels = c("benign", "malignant"))
data <- data[-ncol(data)]; data <- cbind(data, class)
names(data) <- c("Clump Thickness", "Uniformity of Cell Size", "Uniformity of Cell Shape", "Marginal Adhesion", "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin", "Normal Nucleoli", "Mitoses", "Class")
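
The raw file encodes missing values as "?", which na.strings turns into NA; in this dataset only the "Bare Nuclei" column should be affected. A quick check with base R:

colSums(is.na(data))   # count NAs per column; they should all be in "Bare Nuclei"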

Divide data into training (80%) and test set (20%)

set.seed(123)
random <- sample(1:nrow(data), 0.8 * nrow(data))
data.train <- optbin(data[random, ])
Warning message:
In optbin(data[random, ]) :
  at least one instance was removed due to missing values
data.test <- data[-random, ]
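
Note that optbin is applied to the training set only; the test set keeps its raw numeric values, which predict will later match against the learned intervals (as the prediction step below demonstrates). A quick sanity check of the split with base R:

nrow(data.train)   # about 80% of the rows, minus those removed for missing values
nrow(data.test)    # the remaining rows, still unbinned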

Train OneR model on training set

model.train <- OneR(data.train, verbose = TRUE)

    Attribute                   Accuracy
1 * Uniformity of Cell Size     92.83%  
2   Uniformity of Cell Shape    92.1%   
3   Bland Chromatin             90.99%  
4   Bare Nuclei                 90.44%  
5   Single Epithelial Cell Size 88.6%   
6   Normal Nucleoli             87.13%  
7   Marginal Adhesion           86.95%  
8   Clump Thickness             86.03%  
9   Mitoses                     78.68%  
---
Chosen attribute due to accuracy
and ties method (if applicable): '*'

Show model and diagnostics

summary(model.train)

Rules:
If Uniformity of Cell Size = (0.991,3.27] then Class = benign
If Uniformity of Cell Size = (3.27,10]    then Class = malignant

Accuracy:
505 of 544 instances classified correctly (92.83%)

Contingency table:
           Uniformity of Cell Size
Class       (0.991,3.27] (3.27,10] Sum
  benign           * 344        10 354
  malignant           29     * 161 190
  Sum                373       171 544
---
Maximum in each column: '*'

Pearson's Chi-squared test:
X-squared = 381.11, df = 1, p-value < 2.2e-16

Plot model diagnostics

plot(model.train)
(Figure: OneR with breast cancer data set)

Use trained model to predict test set

prediction <- predict(model.train, data.test)

Evaluate model performance on test set

eval_model(prediction, data.test)

           actual           
prediction  benign malignant Sum
  benign        90         8  98
  malignant      1        41  42
  Sum           91        49 140

           actual           
prediction  benign malignant  Sum
  benign      0.64      0.06 0.70
  malignant   0.01      0.29 0.30
  Sum         0.65      0.35 1.00

Accuracy:
0.9357 (131/140)

Error rate:
0.0643 (9/140)

The best reported accuracy on this dataset was 97.5%, and it was reached with considerable effort. The accuracy reached here on the test set is 93.6%! This is achieved with the following two simple rules:

If Uniformity of Cell Size = (0.991,3.27] then Class = benign
If Uniformity of Cell Size = (3.27,10]    then Class = malignant
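
As in the Iris example, these two rules boil down to a single threshold (here 3.27 on "Uniformity of Cell Size"), which you can verify by hand with base R; the resulting table should match the confusion matrix from eval_model above.

manual <- ifelse(data.test$`Uniformity of Cell Size` > 3.27, "malignant", "benign")
table(manual, data.test$Class)   # should reproduce the test set confusion matrix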

Uniformity of Cell Size is identified as the attribute with the highest predictive value. The cut points of the intervals are again found automatically (via the included optbin function). The very good separation of the colors in the diagnostic plot gives a good indication of the model’s ability to differentiate between benign and malignant tissue.

Included functions

Help overview

help(package = OneR)

...or as a pdf here: OneR.pdf

Sources

[Holte93] R. Holte: Very Simple Classification Rules Perform Well on Most Commonly Used Datasets, 1993. Available: http://www.mlpack.org/papers/ds.pdf.

Contact

I would love to hear about your experiences with the OneR package. Please drop me a note - you can reach me at my university account: Holger K. von Jouanne-Diedrich

License

This package is licensed under the MIT License.