A Beginner’s Guide to Unsupervised Learning, Part I: K-Means

In a perfect world, every dataset we obtain would come to us complete with a clear, easily definable set of categories for each data point. Of course, in a perfect world headphones would never get tangled and every episode of Doctor Who would be available for free streaming. Alas, headphones get tangled, Doctor Who episodes are pay-walled, and more often than not, data comes to us uncategorized.

At least, that's how I viewed the process of clustering data when I first started. In truth, clustering uncategorized data is one of the most useful techniques in data science, simply because it allows us to discover unforeseen patterns in data. The ability to find these patterns has widespread practical applications, from using customer demographics and purchase histories to categorize customer types, to identifying objects in images, to detecting anomalies in credit card purchases.

In this article, we will explore the most common clustering technique: K-Means. Subsequent articles will cover additional clustering techniques.

Creating a dataset

Before we start, let's use Scikit-Learn's make_blobs function to create a randomly generated dataset. This function randomly selects a number of category 'centroids' according to the number of categories you specify, and then distributes your chosen number of data points randomly around each category center.

First, we're going to create and graph a dataset of 200 data points, each belonging to one of four different categories. To make sure that the x1 and x2 values are comparable, we'll also standardize the data with a standard scaler before moving forward.

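The code and the resulting plot were embedded as images in the original post. A minimal sketch of what that setup might look like, assuming a fixed random seed for reproducibility:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Generate 200 points scattered around 4 randomly placed category centers
X, y = make_blobs(n_samples=200, centers=4, random_state=42)

# Standardize so the x1 and x2 features are on comparable scales
X = StandardScaler().fit_transform(X)

# Plot the points, colored by their true (generated) category
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis')
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('make_blobs dataset with its true labels')
plt.show()
```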

From this, we can see that each of our four data categories is fairly distinct. They're each distributed in a 'blob' shape of roughly the same size (i.e., in each group, both the mean distance from each data point to the group center and the standard deviation around that mean are roughly the same). However, groups 'c' and 'd' overlap significantly, indicating that the differences aren't entirely clear-cut.

Now, in order to give us the true unsupervised dataset experience, let’s remove the labels.

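Again, the original snippet was an image; plotting the same points without any color coding might look like this:

```python
# Same points, no labels: this is all an unsupervised algorithm gets to see
plt.scatter(X[:, 0], X[:, 1], color='gray')
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('The same dataset, unlabeled')
plt.show()
```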

Without the labels, it’s hard to be sure how many categories this dataset even contains, much less where each category begins and ends. That’s where K-Means Clustering comes in.

How Does K-Means Clustering Work?

How does K-Means Clustering work? Here’s a quick explanation:

[Illustration of the K-Means clustering process]


Wait, What?

Here’s a longer, step-by-step explanation of what you just saw.

Say that you have an unclassified dataset, but you're sure that somewhere in all that noise, four different categories are just waiting to be found using K-Means clustering.

The first thing that the K-Means algorithm will do is pick out four random points somewhere in your data range. It calls these points ‘centroids’.

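The plots showing the initial centroids were images. A rough sketch of that initialization step, assuming we simply draw four uniformly random points inside the data's bounding box (the variable names are mine, not the author's):

```python
import numpy as np

k = 4
rng = np.random.default_rng(0)

# Drop k centroids at random locations inside the bounding box of the data
mins, maxs = X.min(axis=0), X.max(axis=0)
centroids = rng.uniform(mins, maxs, size=(k, X.shape[1]))

plt.scatter(X[:, 0], X[:, 1], color='gray')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200)
plt.title('Randomly initialized centroids')
plt.show()
```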

Next, K-Means cycles through each data point and finds which centroid is closest. Every data point that shares the same nearest centroid is assigned to the same group.

Depending on your data and your specifications, there are several different ways to measure this distance. The most common is the Euclidean distance, which defines the distance between two points as:

d(q, p) = √( Σᵢ (qᵢ - pᵢ)² )

where q is the first point, p is the second point, and i is each dimension contained in the data point coordinates. For our purposes, this will look like:

d(q, p) = √( (q₁ - p₁)² + (q₂ - p₂)² )

Let’s see how our dataset compares to the classified version of the dataset when we label our data points according to these centroids.

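The assignment code and the comparison plots were also images. Here's a sketch of the assignment step using a hypothetical helper, assign_to_centroids, that labels each point by its nearest centroid:

```python
def assign_to_centroids(X, centroids):
    """Label each point with the index of its nearest centroid (Euclidean distance)."""
    # distances has shape (n_points, n_centroids)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

labels = assign_to_centroids(X, centroids)

# Side-by-side comparison: K-Means labels vs. the original categories
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
ax1.set_title('Labels from the initial centroids')
ax2.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis')
ax2.set_title('Original categories')
plt.show()
```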

So far, the K-Means labels and the original categories don’t look anything like each other. Out of curiosity, what would the accuracy rating be if we used the K-Means labels as prediction values?

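The accuracy check was an image as well. Because cluster indices are arbitrary, the clusters have to be matched to the original categories before scoring; one simple (and entirely hypothetical) approach is to map each cluster to the true label that occurs most often inside it. The exact number depends on the random initialization, so don't expect to reproduce 0.52 exactly:

```python
from sklearn.metrics import accuracy_score

def cluster_accuracy(y_true, cluster_labels):
    """Map each cluster to its most common true label, then score the match."""
    mapped = np.zeros_like(cluster_labels)
    for cluster in np.unique(cluster_labels):
        mask = cluster_labels == cluster
        # Most frequent true label inside this cluster
        mapped[mask] = np.bincount(y_true[mask]).argmax()
    return accuracy_score(y_true, mapped)

print('Accuracy:', cluster_accuracy(y, labels))
```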

Accuracy: 0.52

Yikes, that's not much better than guessing. It's a good thing that K-Means doesn't stop here.

Once More, From The Top


Now that K-Means has used its initial centroids to categorize the unlabelled dataset, it can use these categories to recalculate where each centroid should be. It does this by taking each feature and finding the mean value of that feature within each group. You can imagine each cluster of data points grabbing its centroid and pulling it to the cluster's exact center.

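The recentering code was an image; the update step itself is just a per-cluster mean, sketched here with a hypothetical update_centroids helper:

```python
def update_centroids(X, labels, k):
    """Move each centroid to the mean position of the points assigned to it."""
    # (This sketch assumes no cluster ends up empty)
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])

new_centroids = update_centroids(X, labels, k=4)
```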

Each centroid is now in the exact mathematical center of its respective group. However, now that the centroids have moved, not all of the categories hold up. Some of the data points from the 'black' and 'purple' groups, for example, are now closer to the 'blue' group's centroid. What would it look like if we repeated the classification step using this new set of centroids? How would it compare to our original dataset?

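The reassignment code and plots were images too; repeating the classification step with the recentered centroids just reuses the helpers sketched above (your accuracy will differ from the 0.575 shown below):

```python
# Re-run the assignment step with the recentered centroids and score it again
labels = assign_to_centroids(X, new_centroids)
print('Accuracy:', cluster_accuracy(y, labels))
```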

Accuracy: 0.575

Now, they’re… still pretty different. However, the K-Means process doesn’t stop here. Now that we’ve reclassified each group, the centroids aren’t in the center of their respective groups anymore. Of course, once we recenter the centroids, we could then reclassify each data point based on their closeness to each centroid, after which point we could recenter each centroid again, and on and on.

So far, we've only created the centroids and reclassified the dataset twice. What would happen if we did it eight more times?

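The screenshots walked through the remaining iterations. Putting the two steps in a loop gives a bare-bones version of the whole algorithm, again assuming the helper functions sketched earlier:

```python
def kmeans_from_scratch(X, k=4, n_iterations=10, seed=0):
    """A bare-bones K-Means loop: assign points, recenter centroids, repeat."""
    rng = np.random.default_rng(seed)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    centroids = rng.uniform(mins, maxs, size=(k, X.shape[1]))
    for _ in range(n_iterations):
        labels = assign_to_centroids(X, centroids)    # classify
        centroids = update_centroids(X, labels, k)    # recenter
    return labels, centroids

# A real implementation would also guard against empty clusters and stop
# early once the centroids stop moving; this sketch keeps it simple.
labels, centroids = kmeans_from_scratch(X, k=4, n_iterations=10)
```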

After 10 iterations of the classifying/recentering process, our K-Means labels and our original labels are almost identical. This is because, over repeated iterations, the centroids tend to find and stick to a 'best fit' set of cluster groups, i.e. a set of groups in which each data point is closer to its own group's centroid than to any other group's centroid. If we used the K-Means labels as predictions, what would our accuracy look like?

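Scoring the final labels reuses the same hypothetical matching helper as before (the 0.93 below is from the author's run):

```python
print('Accuracy:', cluster_accuracy(y, labels))
```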

Accuracy: 0.93

Without any supervised machine learning processes, we've been able to separate the uncategorized dataset into four different groups that match their categorized counterparts 93% of the time. This shows that, even if we don't know exactly what we're looking for, the K-Means process is excellent at defining groups that maximize within-group similarity while minimizing between-group similarity.

That seems like a lot of work…

Fine, here’s the easy way to do it.

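The 'easy way' screenshot presumably used Scikit-Learn's built-in KMeans, which handles initialization, iteration, and convergence for you; something along these lines:

```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='X', s=200)
plt.title('Scikit-Learn KMeans with 4 clusters')
plt.show()
```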

How will I know the correct number of clusters to use?

Technically, since unclassified datasets don’t have any groups to begin with, there’s no ‘correct’ number of clusters. However, some make more sense than others. Let’s see what our dataset looks like divided into 3 clusters instead of 4.

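The corresponding code was an image; re-running the model with three clusters is a one-line change:

```python
labels_3 = KMeans(n_clusters=3, random_state=42).fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels_3, cmap='viridis')
plt.title('KMeans with 3 clusters')
plt.show()
```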

If we didn't know that this came from a dataset with 4 categories, we might say that this is a pretty good clustering just by looking at it. So how can you tell the correct number of categories to use in your K-Means algorithm? In fact, if no categories exist to begin with, what even makes one clustering better than another?

Fortunately, it's easy to find the optimal number of categories for your K-Means algorithm. You only need to do two things.

First, you need to choose an evaluation metric based on your needs. For K-Means, I like to use the Silhouette Score.

s = (b - a) / max(a, b)

To find the Silhouette Score of your model, you take a data point, find the closest outside category to it, and compute the average distance between your data point and every data point in that category (b). Next, you compute the average distance between your data point and every other data point in its own category (a), and subtract a from b. After that, you convert the result to a [-1, 1] scale by dividing it by either a or b, whichever is greater. Finally, you do the same for every other data point and take the average. Or you can just import the function that Scikit-Learn provides, whichever is easier.

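The screenshot presumably called Scikit-Learn's silhouette_score function; something like:

```python
from sklearn.metrics import silhouette_score

labels = KMeans(n_clusters=4, random_state=42).fit_predict(X)
print(round(silhouette_score(X, labels), 4))
```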

Output: 0.5371

This basically means that the closer each data point is to data points in its own group, and the further each data point is from data points in other groups, the higher the Silhouette Score is (with 1 being the best possible score).

Second, you need to find the Silhouette Score for each potential number of clusters, and graph them using an Elbow Plot. Why is it called an Elbow Plot? You can probably figure that out.

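The elbow-plot code was an image. A sketch that computes the Silhouette Score for a range of cluster counts and plots them:

```python
# The Silhouette Score needs at least 2 clusters, so start the range at 2
cluster_range = range(2, 11)
scores = []
for k in cluster_range:
    labels_k = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    scores.append(silhouette_score(X, labels_k))

plt.plot(cluster_range, scores, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.title('Elbow Plot')
plt.show()
```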

Based on this Elbow Plot, we can see that 4 clusters produces the highest Silhouette Score, i.e. the most in-group similarity with the least between-group similarity.

So, classifying unsupervised data is just when you K-Means a thing?

Ha ha, no.


Like every other machine learning technique, K-Means makes a number of assumptions about its data, and if the data doesn't meet those assumptions, the clustering won't work well. The primary assumption K-Means makes is that each group can more or less be described by a spheroid shape, where data points are characterized by how close they are to a hypothetical center. However, not all groups can be described this way.

[Scatter plot: three elongated, oval-shaped clusters that K-Means splits incorrectly]

In this instance, while we can see three obvious groups taking on elongated oval shapes, K-Means misses these groups because it’s looking for circular group shapes in which data points are closer to each other than they are to data points from other groups.
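The original plot can't be recovered here, but a similar failure mode is easy to reproduce: stretching a blob dataset with a linear transformation produces elongated, oval-shaped clusters that K-Means tends to split incorrectly (the specific data and seeds below are my own, not the author's):

```python
# Stretch a blob dataset with a linear transformation to get elongated clusters
X_aniso, y_aniso = make_blobs(n_samples=300, centers=3, random_state=170)
transformation = np.array([[0.6, -0.6], [-0.4, 0.8]])
X_aniso = X_aniso @ transformation

labels_aniso = KMeans(n_clusters=3, random_state=42).fit_predict(X_aniso)

plt.scatter(X_aniso[:, 0], X_aniso[:, 1], c=labels_aniso, cmap='viridis')
plt.title('K-Means on elongated clusters')
plt.show()
```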

Another potential problem comes when clusters have different variances. K-Means assumes that each cluster’s data points will be spread out at roughly the same level as the data points of each other cluster. However, what happens when some clusters are densely packed together, and others are spread out?

[Scatter plot: one widely spread cluster next to tightly packed ones, with K-Means labels bleeding across the boundary]

When that happens, K-Means has a tendency to bleed the categories together.
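To see the 'bleeding' effect for yourself, make_blobs lets you give each cluster its own spread via cluster_std; the specific values below are illustrative, not the author's:

```python
# Three clusters with very different spreads
X_var, y_var = make_blobs(n_samples=300, centers=3,
                          cluster_std=[0.5, 2.5, 0.8], random_state=42)

labels_var = KMeans(n_clusters=3, random_state=42).fit_predict(X_var)

plt.scatter(X_var[:, 0], X_var[:, 1], c=labels_var, cmap='viridis')
plt.title('K-Means on clusters with unequal spread')
plt.show()
```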

Conclusion

K-Means is an intuitive clustering method that allows us to find and classify patterns in datasets that don’t contain dependent variables. However, it makes a number of assumptions that don’t work for every dataset. In my upcoming blogs, I’ll be moving on to other common forms of unsupervised classification techniques, starting out with Hierarchical Clustering. See you then!

Originally published on Medium
