Thursday, 26 November 2015



Italian Wines and k-means

I had a couple of hours to kill, and some high-class friends of mine requested a piece on wine rather than single malts (ok, some people will complain it’s Italian, but still…). I want to show how easily a simple, old-fashioned algorithm like k-means (it is about 50 years old) can do a decent job of classifying data, and also that it is important to know what you are trying to achieve and to choose the right algorithm and the right way to apply it.



The starting point is the wine dataset, a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars (you’ll have to taste and decide which is which; if you do, please let me know). There are 13 different dimensions: alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavonoids, non-flavonoid phenols, proanthocyanins, colour intensity, hue, OD280/OD315 of diluted wines and proline. The dataset contains 178 data points classified into the 3 cultivars, which I split into a training and a validation set.
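
The post doesn’t include its code, so here is a minimal sketch in Python of how the setup might look with scikit-learn. The split itself is my assumption; I have sized the validation set at 62 wines because that is what the confusion matrices below sum to.

    from sklearn.datasets import load_wine
    from sklearn.model_selection import train_test_split

    wine = load_wine()              # 178 wines, 13 chemical measurements
    X, y = wine.data, wine.target   # y holds the cultivar labels 0, 1, 2

    # Hold out a validation set; 62 points matches the confusion matrices
    # below, but the exact split and seed used in the post are not stated.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=62, random_state=0, stratify=y)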

Basically I am trying to get k-means to learn what characterises each cultivar from the training dataset (k-means works on centroids, so a centroid can be thought of as the ‘typical’ chemical composition of a cultivar), and then apply it to the validation dataset. The metric I will use for accuracy is simply the percentage of correctly classified wines.

First, I blindly applied k-means. We already know that the actual number of clusters in the data is 3, so I set K to 3 and used all 13 variables.
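
A sketch of that step, continuing the code above (the constructor arguments are illustrative):

    from sklearn.cluster import KMeans

    # Fit k-means with K = 3 on the raw, unscaled training data.
    km_raw = KMeans(n_clusters=3, n_init=10, random_state=0)
    km_raw.fit(X_train)

    # Assign each validation wine to the nearest raw-data centroid.
    raw_clusters = km_raw.predict(X_val)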

This is what a simple visualisation of the actual distribution of the validation dataset looks like. Each point represents a wine, colour-coded by the cultivar it belongs to: red for cultivar 1, green for cultivar 2 and blue for cultivar 3.
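
The post doesn’t say how the 13-dimensional data was drawn; projecting onto the first two principal components is one way to get a comparable 2-D picture, so this sketch is an assumption about how the figure was made:

    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    # Project the validation wines onto two principal components purely
    # for plotting; the original post may have visualised it differently.
    coords = PCA(n_components=2).fit_transform(X_val)
    for cultivar, colour in zip(range(3), ("red", "green", "blue")):
        mask = y_val == cultivar
        plt.scatter(coords[mask, 0], coords[mask, 1], c=colour,
                    label="cultivar %d" % (cultivar + 1))
    plt.legend()
    plt.show()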



Keeping the original colour scheme, we can see that k-means doesn’t do a very good job of assigning the wines of each cultivar to the right cluster.




              Predicted
               A    B    C
Actual   1    12    0   12
         2     0   17    4
         3     0   12    5

The accuracy of k-means is only 47%.
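
Because the cluster labels k-means returns are arbitrary, each cluster has to be matched to a cultivar before it can be scored. The post doesn’t show that step; a reasonable sketch is to map each cluster to the cultivar that dominates it in the training data:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    def score_clusters(km, X_tr, y_tr, X_v, y_v):
        # Map each cluster to the cultivar most common among its training
        # members (assumes every cluster captured at least one training wine).
        tr_clusters = km.predict(X_tr)
        mapping = {c: np.bincount(y_tr[tr_clusters == c]).argmax()
                   for c in range(km.n_clusters)}
        pred = np.array([mapping[c] for c in km.predict(X_v)])
        print(confusion_matrix(y_v, pred))
        print("accuracy: %.0f%%" % (100 * (pred == y_v).mean()))

    score_clusters(km_raw, X_train, y_train, X_val, y_val)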

But if we decide to work off normalised values, the picture becomes much clearer:



Even visually, this looks like a picture with clearer segments. Applying k-means to this normalised data:
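
A sketch of that step, assuming “normalised” means z-score standardisation (zero mean, unit variance per variable), with the scaler fitted on the training data only:

    from sklearn.preprocessing import StandardScaler

    # Standardise each of the 13 variables; fit on the training data only.
    scaler = StandardScaler().fit(X_train)
    X_train_std = scaler.transform(X_train)
    X_val_std = scaler.transform(X_val)

    # Re-fit k-means on the standardised data and score it as before.
    km_std = KMeans(n_clusters=3, n_init=10, random_state=0)
    km_std.fit(X_train_std)
    score_clusters(km_std, X_train_std, y_train, X_val_std, y_val)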





K-means does a much better job at distinguishing the cultivars based on the normalised data:


              Predicted
               A    B    C
Actual   1    24    0    0
         2     4    1   16
         3     0   17    0

The accuracy of k-means jumps to 92%. Remember that the cluster labels are arbitrary: reading cluster A as cultivar 1, cluster B as cultivar 3 and cluster C as cultivar 2, 57 of the 62 validation wines are classified correctly.

In conclusion, it pays to have a clear idea of what the objective is and to use the right approach. In this example, it should have been obvious from the outset that standardisation was necessary; without it, the results are skewed by the fact that the scales of the various measures are very different, and that difference distracts from the aim of the analysis. We can also see that there is sometimes no need for very sophisticated techniques: with some forethought, a quick piece of analysis can deliver reasonable results.
 
The original source of the data is: https://archive.ics.uci.edu/ml/datasets/Wine