Italian Wines and k-means
I had a couple of hours to kill, and some high-class friends of mine had requested a piece on wine rather than single malts (OK, some people will complain it's Italian, but still…), so here it is. I am trying to show how easily a simple, old-fashioned algorithm like k-means (it is about 50 years old) can do a decent job of classifying data. I am also trying to show that it is important to know what you are trying to achieve and to choose the right algorithm and the right way to apply it.
The starting point is the wine dataset, a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars (you'll have to taste and decide which is which; if you do, please let me know). There are 13 different dimensions: alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavonoids, non-flavonoid phenols, proanthocyanins, colour intensity, hue, OD280/OD315 of diluted wines and proline. The dataset contains 178 data points classed into the 3 cultivars, which I split into a training and a validation dataset.
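For anyone following along, here is a minimal sketch of the setup in Python. I am assuming scikit-learn, whose bundled `load_wine` is a copy of the same UCI dataset; the validation size of 62 is inferred from the confusion matrices below (their cells add up to 62 wines), while the seed and stratification are my own choices, not from the original analysis.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# 178 wines, 13 chemical measurements, 3 cultivars (labelled 0-2 here).
wine = load_wine()
X, y = wine.data, wine.target

# Hold out a validation set. The size (62) is inferred from the tables
# below; the seed and stratification are assumptions.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=62, random_state=42, stratify=y
)
```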
Basically, I am trying to get k-means to learn what characterises each cultivar from the training dataset (k-means works on centroids, so each centroid can be thought of as the 'typical' chemical composition of its cultivar), and then apply it to the validation dataset. The metric I will use for accuracy is simply the percentage of correctly classified wines.
First, I blindly applied k-means. We already know that the actual number of clusters in the distribution is 3, so I set k to 3 and used all 13 variables.
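Continuing from the snippet above, a sketch of this first, blind attempt; the seed and `n_init` are my choices, not from the original run.

```python
from sklearn.cluster import KMeans

# Three cultivars, so three clusters, fitted on the raw, unscaled data.
km_raw = KMeans(n_clusters=3, n_init=10, random_state=42)
km_raw.fit(X_train)

# Assign each validation wine to its nearest learned centroid.
pred_raw = km_raw.predict(X_val)
```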
This is what a simple visualisation of the actual distribution of the validation dataset looks like; each point represents a wine, colour-coded by the cultivar it belongs to: red for cultivar 1, green for cultivar 2 and blue for cultivar 3.
Keeping the original colour scheme, we can see that k-means does not do a very good job of predicting the right cluster for each wine.
|          | Predicted A | Predicted B | Predicted C |
|----------|-------------|-------------|-------------|
| Actual 1 | 12          | 0           | 12          |
| Actual 2 | 0           | 17          | 4           |
| Actual 3 | 0           | 12          | 5           |
The accuracy of k-means here is only 47%.
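One thing worth spelling out: the cluster labels k-means produces are arbitrary, so cluster A has no reason to coincide with cultivar 1, and before scoring we have to match clusters to cultivars. Below is a sketch of the confusion matrix and a simple majority-vote accuracy, continuing from the snippets above; the original analysis does not say how clusters were matched, so this will not necessarily reproduce its exact figures.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def cluster_accuracy(y_true, clusters):
    # Map each cluster to its majority cultivar, then count the hits.
    # A simplification: two clusters may map to the same cultivar.
    hits = 0
    for c in np.unique(clusters):
        members = y_true[clusters == c]
        hits += np.bincount(members).max()
    return hits / len(y_true)

# Rows are actual cultivars, columns are predicted clusters (their order
# may differ from the A/B/C labels used in the tables here).
print(confusion_matrix(y_val, pred_raw))
print(f"accuracy: {cluster_accuracy(y_val, pred_raw):.0%}")
```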
But if we decide to work off normalised values, the picture becomes much clearer.
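The analysis says 'normalised' without specifying the method, so as an assumption here is a sketch using z-score standardisation via scikit-learn's StandardScaler, continuing from the earlier snippets; the scaler is fitted on the training set only, so nothing leaks from the validation wines.

```python
from sklearn.preprocessing import StandardScaler

# Learn the per-variable mean and standard deviation from the
# training set only, then apply the same transform to both sets.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)

# The same k-means recipe as before, now on comparable scales.
km_scaled = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_train_s)
pred_scaled = km_scaled.predict(X_val_s)
```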
Even visually, the normalised data makes for a picture with clearer segments, and applying k-means to it does a much better job of distinguishing the cultivars:
|          | Predicted A | Predicted B | Predicted C |
|----------|-------------|-------------|-------------|
| Actual 1 | 24          |             |             |
| Actual 2 | 4           | 1           | 16          |
| Actual 3 |             | 17          |             |
The accuracy of k-means jumps to
92%.
In conclusion, it pays to have a clear idea of what the objective is and to use the right approach. In this example, it should have been obvious right from the outset that standardisation was necessary; without it, the results are skewed by the fact that the scales of the various measures are very different, and this difference distracts from the aim of the analysis. We can also see that there is sometimes no need for very sophisticated techniques: with some forethought, a quick piece of analysis can deliver reasonable results.
The original source of the data is: https://archive.ics.uci.edu/ml/datasets/Wine