In machine learning, and data analysis in general, clustering algorithms are one of the more powerful tools to discover and learn inherent structure or grouping that exists within a dataset. More often than not, the size of a dataset prohibits a data scientist from directly observing these patterns. Due to the massive growth of computational capabilities in the last half century, computers can now perform this pattern recognition for us. In fact, many algorithms used within machine learning were postulated well before we had the computational power to execute them. One such algorithm, known as k-means clustering, was first proposed in 1957. The algorithm is founded in cluster analysis, and seeks to group observational data into clusters based on the similarity of their features.
The clustering algorithm follows this general procedure:
- Place k points (or centroids) into the space defined by the features of the dataset. The dimension of the space will equal the number of features being used. k is user-defined, and equal to the number of clusters.
- Assign each observation to the closest centroid (defined by Euclidean distance). This creates clusters of observations, each with a defining centroid.
- After all observations have been assigned to a centroid, recalculate the positions of the k centroids.
- Repeat 2 and 3 until the centroid positions stabilize.
Personally, I find it easier to understand k-means by seeing each iteration of the algorithm in action. To do this, we’ll download some data, and plot the clustering results after each loop.
The <>a href=https://archive.ics.uci.edu/ml/index.php>UCI Machine Learning Repository has a myriad of datasets ready to use. The wine dataset is what we will be using today. It contains 178 observations of wine grown in the same region in Italy. Each observation is from one of three cultivars (the ‘Class’ feature), with 13 constituent features that are the result of a chemical analysis.
import pandas as pd from sklearn.cluster import KMeans import matplotlib.pyplot as plt wine_names = ['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols', \ 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue', 'OD280/OD315',\ 'Proline'] wine_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', names = wine_names) wine_df = pd.DataFrame(wine_data) wine_df.Class = wine_df.Class - 1
In this first chunk, we’ll import the pandas library and the wine dataset. We’ll then convert the dataset to a pandas DataFrame.
Since we have access to the true Class, our algorithm is known as a supervised learning algorithm. In other situations where k-means is applicable, we do not have access to any true class (because it may not actually exist). This is known as an unsupervised learning algorithm. Supervised and unsupervised algorithms rely on different metrics to evaluate how well the model fits the data. To evaluate how well k-means classifies our observations, we can use a fairly simple metric: the number of observations where the Class feature is correctly predicted out of the total number of observations. The code below produces a plot of the true Classes.
wine_df.plot.scatter(x = 'Alcohol', y = 'OD280/OD315', c= 'Class', figsize=(12,8), colormap='jet')
In our case, we already know that there are three classes of observations, so we can set k = 3. There are different methods of determining the initial position of the centroids (which determine how fast the algorithm converges), but the most basic is randomly. For visualization purposes, I chose to include only two of the features. However, the algorithm will work for any number of dimensions (more on this later). Here is the result of steps 1 and 2 of the first iteration (centroids are the bigger):
kmeans = KMeans(n_clusters=3, init = 'random', max_iter = 1, random_state = 5).fit(wine_df.iloc[:,[12,1]]) centroids_df = pd.DataFrame(kmeans.cluster_centers_, columns = list(wine_df.iloc[:,[12,1]].columns.values)) fig, ax = plt.subplots(1, 1) wine_df.plot.scatter(x = 'Alcohol', y = 'OD280/OD315', c= kmeans.labels_, figsize=(12,8), colormap='jet', ax=ax, mark_right=False) centroids_df.plot.scatter(x = 'Alcohol', y = 'OD280/OD315', ax = ax, s = 80, mark_right=False)
As you can see, the centroid positions are already pretty representative of the cluster centers, but perhaps we can do better. So, we recalculate each centroid’s position by calculating the mean of each cluster (equivalent to the geometric center). We then re-assign each observation to the closest centroid and repeat. Note some of the observations that are between the clusters change color (cluster assignment).
kmeans = KMeans(n_clusters=3, init = 'random', max_iter = 100, random_state = 5).fit(wine_df.iloc[:,[12,1]]) centroids_df = pd.DataFrame(kmeans.cluster_centers_, columns = list(wine_df.iloc[:,[12,1]].columns.values)) fig, ax = plt.subplots(1, 1) wine_df.plot.scatter(x = 'Alcohol', y = 'OD280/OD315', c= kmeans.labels_, figsize=(12,8), colormap='jet', ax=ax, mark_right=False) centroids_df.plot.scatter(x = 'Alcohol', y = 'OD280/OD315', ax = ax, s = 80, mark_right=False)
After several iterations, the results end up being fairly accurate, with k-means correctly predicting around 90% of the correct Class. The plot below marks each incorrectly assigned observation with a yellow star.
The k-means algorithm offers several advantages. It is relatively easy to understand and implement, requiring only a few lines of code in Python. It also works great for uniformly shaped clusters with various degrees of density. However, it doesn’t always work well. It is subject to the curse of dimensionality: As the number of dimensions (features) rises, definitive clusters become increasingly challenging. It also has difficulty with clusters that are non-uniform in shape—Consider, for example, clusters that are concentric in nature (non-spherical) or different sizes. In cases where k-means is not the most ideal algorithm to use, there are many others that are better suited for the data at hand.