I gave a short demonstration of the Waffles machine learning tool (waffles.sourceforge.net) and I thought I’d summarize things here so that people can refer to it easily. I also made waffles, but I can’t put those online so you’ll just need to make your own. Waffles is both a command line tool and a library. I’m just going to write about the command line tool, but the things I do here you can easily do in the library as well.

First let’s generate some random data to test out waffles’ classification on. I wrote a short Ruby script to generate data in three distributions:

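#!/usr/bin/ruby

# Random data generator: prints COUNT rows per group as "number, color, label".

# Usage: ruby gencolors.rb COUNT

# Group "first": numbers in [10, 20), colors red or green

colors = ['red', 'green']

min = 10

range = 10

(1..ARGV[0].to_i).each {|x| puts "#{min + rand()*range}, #{colors.sample}, first"}

# Group "second": numbers in [15, 20), colors red or blue

colors = ['red', 'blue']

min = 15

range = 5

(1..ARGV[0].to_i).each {|x| puts "#{min + rand()*range}, #{colors.sample}, second"}

# Group "third": numbers in [23, 27), colors blue or red

colors = ['blue', 'red']

min = 23

range = 4

(1..ARGV[0].to_i).each {|x| puts "#{min + rand()*range}, #{colors.sample}, third"}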

Now we can just use this to generate a CSV file with semi-random data. (I’m also using the Linux ‘shuf’ command to randomize the order of the output.)

ruby gencolors.rb 1000 | shuf > color_number.csv

This creates a file with three columns and 3000 rows (1000 per group). The first column is a random number in the ranges specified in the Ruby code, the second column is a nominal value (red, green, or blue) and the third column is a label (first, second, or third). Now, we would like to predict the label just by looking at the first two columns. This is called classification.

First, let’s let waffles convert this csv file into something more formally defined:

waffles_transform import color_number.csv > color_number.arff

This converts the file to arff format, which defines what kind of data is in each column. We can ask Waffles to tell us a bit about this data:

bash-3.1$ waffles_plot stats color_number.arff

Filename: color_number.arff

Patterns: 3000

Attributes: 3 (Continuous:1, Nominal:2)

0) attr0, Type: Continuous, Mean:19.172761, Dev:4.6513833, Median:18.263193, Min:10.025213, Max:26.992427, Missing:0

1) attr1, Type: Nominal, Values:3, Entropy: 1.4638843, Missing:0

16.8% (504) green

49.4% (1482) red

33.8% (1014) blue

2) attr2, Type: Nominal, Values:3, Entropy: 1.5849625, Missing:0

33.333333% (1000) first

33.333333% (1000) third

33.333333% (1000) second

For the continuous variable in the first column, Waffles provides statistics like mean and median, while for the nominal columns it just provides the entropy and the count of each value.
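If you're wondering where the entropy numbers come from, they look like plain Shannon entropy in bits, computed from the value counts. Here is a minimal sketch in Ruby that recomputes them from the counts printed above. The function name `entropy_bits` is mine, and I'm inferring the formula from the numbers rather than from the Waffles source.

```ruby
# Shannon entropy (in bits) of a nominal column, given the count of each value.
# Assumption: this is the formula behind the "Entropy" figure in the stats output.
def entropy_bits(counts)
  total = counts.sum.to_f
  counts.reject(&:zero?).sum do |count|
    prob = count / total
    -prob * Math.log2(prob)
  end
end

# Counts copied from the waffles_plot stats output.
color_counts = { 'green' => 504, 'red' => 1482, 'blue' => 1014 }
label_counts = { 'first' => 1000, 'third' => 1000, 'second' => 1000 }

puts "color entropy: #{entropy_bits(color_counts.values)}"
puts "label entropy: #{entropy_bits(label_counts.values)}"
```

The label column comes out at log2(3), about 1.585, which is the largest value three equally likely labels can give. The color column is lower (about 1.464) because red shows up about half the time.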

We can also plot the histogram of the continuous variable with Waffles:

waffles_plot histogram color_number.arff -attr 0 -out waffles_hist.png

The figure is a little ugly in my opinion, but it gives you a quick idea of what’s going on. The third group doesn’t overlap the first two so we see a gap. The first two groups overlap though, which hints that it might be difficult to tell them apart. We can also get a quick breakdown of the relationships between all variables like this:

waffles_plot overview color_number.arff -out waffles_overview.png

This produces a figure that shows the relationships between each pair of attributes. Remember that attribute 0 is the randomly generated number, attribute 1 is the color, and attribute 2 is the label. If you look at the lower left box you can clearly see that the numeric ranges from the first and second group are going to overlap. Looking at the left-most tile of the middle row we can also see that the color labels are fairly well mixed with the numeric attribute 0 so we probably won’t be able to get 100% classification accuracy.

Now, let’s try some machine learning. If we just want to know what kind of performance we can get from a machine learning tool we would use “cross-validation” to test its performance.

waffles_learn crossvalidate color_number.arff decisiontree

spits out numbers like this:

Rep: 0, Fold: 0, Accuracy: 0.88

Rep: 0, Fold: 1, Accuracy: 0.89533333333333

Rep: 1, Fold: 0, Accuracy: 0.87533333333333

Rep: 1, Fold: 1, Accuracy: 0.882

Rep: 2, Fold: 0, Accuracy: 0.87733333333333

Rep: 2, Fold: 1, Accuracy: 0.88466666666667

Rep: 3, Fold: 0, Accuracy: 0.87066666666667

Rep: 3, Fold: 1, Accuracy: 0.892

Rep: 4, Fold: 0, Accuracy: 0.888

Rep: 4, Fold: 1, Accuracy: 0.89133333333333

-----

Attr: 2, Mean predictive accuracy: 0.88366666666667, Deviation: 0.0080200366367195
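The summary line is just the average of the ten fold accuracies above it. As a quick check, here is a small Ruby sketch that recomputes both numbers from the listing. The helper names are mine, and the use of the sample standard deviation (dividing by n - 1) is an inference from the fact that it reproduces the printed deviation, not something I looked up in the Waffles documentation.

```ruby
# Fold accuracies copied from the crossvalidate output (5 reps x 2 folds).
accuracies = [
  0.88, 0.89533333333333, 0.87533333333333, 0.882, 0.87733333333333,
  0.88466666666667, 0.87066666666667, 0.892, 0.888, 0.89133333333333
]

# Arithmetic mean of a list of numbers.
def mean(values)
  values.sum / values.size
end

# Sample standard deviation (divides by n - 1).
def sample_deviation(values)
  m = mean(values)
  Math.sqrt(values.sum { |v| (v - m)**2 } / (values.size - 1))
end

puts "mean:      #{mean(accuracies)}"
puts "deviation: #{sample_deviation(accuracies)}"
```

The small deviation (under one percentage point) is what tells you the 88 percent figure is fairly stable across different splits of the data.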

So the predictive accuracy on a newly generated set of data would probably be around 88 percent. Try some other algorithms, like knn, naivebayes, and neuralnet. These gave me, respectively, accuracies of 0.8866, 0.917, and 0.8557. Now, why did the stupid statistics algorithm (naivebayes) give the best results? Well, one reason is that we didn’t tune any of the parameters of these algorithms. You can have Waffles try to figure out the right parameters for these algorithms with the command:

waffles_learn autotune color_number.arff decisiontree

but for some algorithms like neuralnet this could actually take hours and hours. This is where the “art” in machine learning comes in. Knowing the right tuning parameters for the more powerful algorithms can be difficult, but you’ll only get the very best results by fiddling with their parameters. Anyway, running autotune on the decision tree tells me to use “-leafthresh 42” and running it with the neural net tells me to try “-addlayer 64” to add a hidden layer to the perceptron model. You can actually give multiple addlayer options to make several layers, but again, understanding exactly how this would improve performance would be tough. Anyway, crossvalidating like this:

waffles_learn crossvalidate color_number.arff decisiontree -leafthresh 42

improves the decision tree’s performance from 0.8845 to 0.9068 and

waffles_learn crossvalidate color_number.arff neuralnet -addlayer 64

improves performance from 0.8557 to 0.914. Notice, though, that everything is giving about the same results. I’m guessing that there simply isn’t enough information to do better than this. Remember, there was quite an overlap between the first and second groups. If we want to see exactly where mistakes are being made we can first save the model of a classifier:

waffles_learn train color_number.arff knn > knn.json
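Before looking at the mistakes, a quick aside on my guess that there isn't enough information to do much better. Since we wrote the generator ourselves, we can work out the best accuracy any classifier could expect on this kind of data. The sketch below is my own back-of-the-envelope calculation, not something Waffles reports. It assumes the three groups are the same size, that `rand()` is uniform, and that `colors.sample` picks each color with equal probability. For every (number range, color) combination it takes the probability of the most likely group and adds those up.

```ruby
# Generator parameters copied from the Ruby script at the top of the post.
groups = {
  'first'  => { min: 10, range: 10, colors: %w[red green] },
  'second' => { min: 15, range: 5,  colors: %w[red blue] },
  'third'  => { min: 23, range: 4,  colors: %w[blue red] }
}

# Cut the number line at every group boundary, so each group has a
# constant density inside each segment.
edges = groups.values.flat_map { |g| [g[:min], g[:min] + g[:range]] }.uniq.sort
colors = groups.values.flat_map { |g| g[:colors] }.uniq

best = Rational(0)
edges.each_cons(2) do |lo, hi|
  colors.each do |color|
    # Joint probability of (segment, color, group) for each group,
    # assuming equal group sizes.
    joint = groups.values.map do |g|
      inside = lo >= g[:min] && hi <= g[:min] + g[:range]
      next Rational(0) unless inside && g[:colors].include?(color)
      Rational(1, groups.size) * Rational(hi - lo, g[:range]) * Rational(1, g[:colors].size)
    end
    # The best possible guess for this combination is the most likely group.
    best += joint.max
  end
end

puts "best achievable accuracy: #{best} (#{best.to_f.round(4)})"
```

Under those assumptions the ceiling is 11/12, or about 0.917. The only unavoidable mistakes are red points between 15 and 20 that came from the first group, because the second group is twice as dense there. That is right where naivebayes and the tuned models ended up, so tuning harder is unlikely to buy anything more on this data.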

Now we can generate a confusion matrix which shows where incorrect results are coming from:

waffles_learn test -confusion knn.json color_number.arff

The results look like this:

(Rows=expected values, Cols=predicted values, Elements=number of occurrences)

Confusion matrix for attr2

          first   third   second

first      1000       0        0

third         0    1000        0

second        0       0     1000

What’s that? It was perfect? Don’t run your test on your training data!

Let’s generate a new, larger file to test on and convert it to ARFF as before:

ruby gencolors.rb 10000 | shuf > color_number2.csv

waffles_transform import color_number2.csv > color_number2.arff

Now, make sure that the labels are listed in the same order in this new ARFF file as they were in the previous one. For some reason a different order throws off the model. Anyway, running the test again gives this:

(Rows=expected values, Cols=predicted values, Elements=number of occurrences)

Confusion matrix for attr2

          first   third   second

first      5392       0     4608

third         0   10000        0

second     6792       0     3208

And now knn has an accuracy of 0.62. **Lesson: Use different data for training and testing.**
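In case it isn't obvious where the 0.62 comes from: it is the sum of the diagonal of the confusion matrix divided by the total number of test rows. Here is a minimal Ruby sketch of that arithmetic, using the numbers copied from the matrix above. The variable names are mine, and the per-label breakdown is an extra I added to show which groups get confused.

```ruby
# Rows are expected labels, columns are predicted labels, in this order.
labels = %w[first third second]
confusion = [
  [5392,     0, 4608],  # expected first
  [   0, 10000,    0],  # expected third
  [6792,     0, 3208]   # expected second
]

# Correct predictions sit on the diagonal.
correct = (0...confusion.size).sum { |i| confusion[i][i] }
total = confusion.flatten.sum
accuracy = correct.to_f / total
puts "overall accuracy: #{accuracy}"

# Fraction of each expected label that was predicted correctly.
labels.each_with_index do |label, i|
  puts "#{label}: #{confusion[i][i].to_f / confusion[i].sum}"
end
```

The third group is classified perfectly, so all of the damage comes from the first and second groups being mixed up with each other.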

This has gotten a little long so I’ll stop here. If you download waffles it comes with a wizard that can help you make commands and there is good documentation on the website (waffles.sourceforge.net).