Outline
- Section 1: K-means Clustering
- Section 2: Hierarchical Clustering
Introduction
The purpose of this notebook is to understand and implement two popular unsupervised learning methods to cluster data points.
# imports
import numpy as np
import numpy.testing as npt # for testing.
import logging # debugging.
from tqdm.notebook import tqdm # progress
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
# Initial global configuration for matplotlib
SMALL_SIZE = 12
MEDIUM_SIZE = 16
BIGGER_SIZE = 20
plt.rc('font', size=SMALL_SIZE) # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE) # fontsize of the axes title
plt.rc('axes', labelsize=MEDIUM_SIZE) # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE) # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE) # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE) # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE) # fontsize of the figure title
We generate data with the make_blobs function. Although clustering methods are commonly used in unsupervised learning settings where no label y (cluster assignments) is available, we generate synthetic data with ground-truth clusters for comparison later.
The number of dimensions can be specified by n_features in the function below. To simplify our setting, we will only consider data points in 3-dimensional space, but these methods can also be used in higher dimensions and you are encouraged to try out these scenarios.
# generate data
X, cluster_assignment = make_blobs(n_samples=400, n_features=3, centers=5,
cluster_std=1.2,
random_state=5, center_box=(0, 20))
cluster_assignment is the ground truth that represents the true membership of each point to one of the five clusters. In practice we do not have it, and we aim to discover these assignments without access to a training split with given labels (as we do in supervised learning).
# Visualize the 3D data
fig = plt.figure(figsize=(12, 8))
ax1, ax2 = fig.add_subplot(121, projection='3d'), fig.add_subplot(122, projection='3d')
ax1.scatter(X[:, 0], X[:, 1], X[:, 2], alpha=0.7)
ax1.set_title('Input Data')
ax1.set_xticklabels([]); ax1.set_yticklabels([]); ax1.set_zticklabels([]);
ax1.set_xlabel('X1'); ax1.set_ylabel('X2'); ax1.set_zlabel('X3')
ax2.scatter(X[:, 0], X[:, 1], X[:, 2], alpha=0.7,
            # We use cluster_assignment only to visualize the ground-truth
            # (the ideal clustering); in practice we don't have it and aim to estimate y.
c=cluster_assignment)
ax2.set_title('Ground-Truth Assignments (Unknown)')
ax2.set_xticklabels([]); ax2.set_yticklabels([]); ax2.set_zticklabels([]);
ax2.set_xlabel('X1'); ax2.set_ylabel('X2'); ax2.set_zlabel('X3');
# Show the plot
plt.show()
Section 1: $k$-means Clustering (index)
$k$-means is a simple clustering algorithm that follows these steps:
1. Given a number of clusters $k$, assign every sample to one of the $k$ clusters at random.
2. Compute the centroid of each of the $k$ clusters:
$$\boldsymbol{m}_l = \frac{1}{|c_l|} \sum_{i \in c_l} \boldsymbol{x}_i, \quad l = 1, \ldots, k$$
3. Reassign each $\boldsymbol{x}_i$ to the closest centroid.
4. Repeat steps (2) and (3) until:
   - assignments/labels do not change, or
   - the within-distance $W(C)$ converges, which is defined as
     $$W(C) = \frac{1}{2} \sum_{l=1}^k \frac{1}{|c_l|} \sum_{i, j \in c_l} ||\boldsymbol{x}^{(i)} - \boldsymbol{x}^{(j)}||^2$$
   - or, optionally, the number of iterations exceeds a predefined number max_iters.
By applying this algorithm to the dataset generated above, we expect the following clustering trajectory:
To implement the 4-step k-means algorithm we develop some useful subroutines:
- compute_centroids: from samples and their assignments, return the corresponding centroids (for step 2).
- compute_within_distance: from centroids, samples, and their assignments, compute the within-distance (used in the convergence check of step 4).
- kmeans_assignments: from centroids and samples, return the new assignments (for step 3).
In the kmeans_assignments and compute_within_distance functions, we need to consider an edge case. It is possible that the new cluster assignments miss one of the centroids, which means we now have an empty cluster, or equivalently, that two clusters have accidentally collapsed into one. The centroid computed for such a cluster in the next iteration will be NaN, a special numerical value that encodes missingness or a problematic result. We need to handle NaN centroids carefully in kmeans_assignments and compute_within_distance.
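To see this edge case concretely, the short snippet below (an illustrative addition, not part of the required implementation) shows how an empty cluster produces a NaN centroid and how np.nanargmin ignores a NaN row when looking for the closest centroid.
# Illustration of the NaN edge case (assumes numpy imported as np above).
# The mean over an empty selection is NaN (NumPy also emits a RuntimeWarning):
empty_cluster = np.empty((0, 3))
print(empty_cluster.mean(axis=0))          # -> [nan nan nan]
# If one row of the distance matrix belongs to a NaN centroid, np.nanargmin
# skips it, whereas np.argmin would return the index of the NaN entries:
distances = np.array([[np.nan, np.nan, np.nan],   # collapsed/empty cluster
                      [1.0,    2.0,    0.5]])     # a valid centroid
print(np.nanargmin(distances, axis=0))     # -> [1 1 1]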
Now we implement compute_within_distance. An equivalent formulation of $W(C)$ (page 103 in the notes) can be written as:
$$W(C) = \sum_{l=1}^k \sum_{i \in c_l} ||\boldsymbol{x}^{(i)} - \boldsymbol{m}_l||^2$$
where $\boldsymbol{m}_l$ denotes the $l$-th cluster centroid.
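As a quick numerical sanity check (an illustrative addition), the snippet below verifies that the pairwise formulation and the centroid-based formulation of $W(C)$ agree on a single randomly generated cluster.
# Sanity check: the two formulations of W(C) coincide for one cluster.
rng = np.random.default_rng(0)
X_c = rng.normal(size=(6, 3))    # a single cluster of 6 points in 3-D
m_c = X_c.mean(axis=0)           # its centroid
# Pairwise form: (1/2) * (1/|c|) * sum_{i,j in c} ||x_i - x_j||^2
pairwise_form = 0.5 / len(X_c) * sum(
    np.sum((x_i - x_j) ** 2) for x_i in X_c for x_j in X_c)
# Centroid form: sum_{i in c} ||x_i - m_c||^2
centroid_form = np.sum((X_c - m_c) ** 2)
npt.assert_allclose(pairwise_form, centroid_form)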
# EDIT THIS FUNCTION
def compute_within_distance(centroids, X, labels):
"""
Compute the within-cluster distance.
Args:
centroids (np.ndarray): the centroids array, with shape (k, p).
X (np.ndarray): the samples array, with shape (N, p).
labels (np.ndarray): the cluster index of each sample, with shape (N,).
    Returns:
(float): the within-cluster distance.
"""
within_distance = 0.0
k, p = centroids.shape
for l in range(len(centroids)):
centroid = centroids[l]
# Applying aggregate computations on `NaN` values
# can propagate the `NaN` to the results.
# In this case we skip the `NaN` centroid,
        # which effectively corresponds to an empty cluster.
if np.isnan(centroid).any():
continue
# Select samples belonging to label=l.
X_cluster = X[labels == l]
# EDIT THE NEXT LINES
# You need to add the `X_cluster` contribution to `within_distance`
# 1. Compute the cluster contribution.
cluster_se = (X_cluster - centroid)**2 # <-- SOLUTION
assert cluster_se.shape == (len(X_cluster), p) # <-- SOLUTION
# 2. Accumulate
within_distance += np.sum(cluster_se) # <-- SOLUTION
return within_distance
Now we implement compute_centroids, which we use in step 2.
# EDIT THIS FUNCTION
def compute_centroids(k, X, labels):
"""
Compute the centroids of the clustered points X.
Args:
k (int): total number of clusters.
X (np.ndarray): data points, with shape (N, p)
labels (np.ndarray): cluster assignments for each sample in X, with shape (N,).
Returns:
(np.ndarray): the centroids of the k clusters, with shape (k, p).
"""
N, p = X.shape
centroids = np.zeros((k, p))
# EDIT THE NEXT LINES
    for label in range(k):
        cluster_X_l = X[labels == label] # <-- SOLUTION
        # Note: if a cluster is empty, the mean below is NaN (NumPy also emits
        # a RuntimeWarning); this case is handled downstream as described above.
        centroids[label] = cluster_X_l.mean(axis=0) # <-- SOLUTION
return centroids
Make sure the following simple test case runs successfully.
# Test case.
X_test = np.array([[1, 1, 0],
[2, 2, 1],
[5, 3, 4],
[8, 3, 2]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.5, 1.5, 0.5],
[6.5, 3, 3]])
# Test compute_centroids
npt.assert_allclose(compute_centroids(2, X_test, labels), centroids)
# Test compute_within_distance.
npt.assert_allclose(compute_within_distance(centroids, X_test, labels), 8.0)
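For reference, the expected value of 8.0 decomposes as 0.75 + 0.75 for cluster 0 (each of its two points is at squared distance 0.75 from the centroid [1.5, 1.5, 0.5]) plus 3.25 + 3.25 for cluster 1.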
Now, on to the subroutine kmeans_assignments.
# EDIT THIS FUNCTION
def kmeans_assignments(centroids, X):
"""
Assign every example to the index of the closest centroid.
Args:
centroids (np.ndarray): The centroids of the k clusters, shape: (k, p).
X (np.ndarray): The samples array, shape (N, p).
Returns:
        (np.ndarray): the index of the closest centroid for each sample, with shape (N,).
"""
k, p = centroids.shape
N, _ = X.shape
# Compute distances between data points and centroids. Assumed shape: (k, N).
distances = np.vstack([np.linalg.norm(X - c, axis=1) for c in centroids]) # <-- SOLUTION
    # Note: if any centroid contains NaN, the NaN values propagate into the
    # corresponding row of `distances`, so we need to skip that row when we
    # search for the closest centroid.
assert distances.shape == (k, N), f"Unexpected shape {distances.shape} != {(k, N)}"
# Assignments are computed by finding the centroid with the minimum distance
# for each sample. The np.nanargmin returns the index of the minimum values
# in `distances` scanning the rows (axis=0) for each column,
# while skipping any nan value found.
return np.nanargmin(distances, axis=0)
Some test cases to pass.
# Test case.
X_test = np.array([[1, 1, 0],
[2, 2, 1],
[5, 3, 4],
[8, 3, 2],
[11,4, -1]])
labels = np.array([0, 0, 1, 1, 1])
centroids = np.array([[1.5, 1.5, 0.5],
[6.5, 3, 3]])
# Test kmeans_assignments
npt.assert_equal(kmeans_assignments(centroids, X_test), labels)
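As a side note, the per-centroid loop inside kmeans_assignments can be replaced by a fully vectorized distance computation using broadcasting; the optional sketch below (not required by the notebook) reproduces the same assignments on the test case above.
# Optional alternative: all sample-to-centroid distances via broadcasting.
# Shapes: centroids is (k, p), X_test is (N, p) -> diff is (k, N, p).
diff = centroids[:, None, :] - X_test[None, :, :]
distances_all = np.linalg.norm(diff, axis=2)   # shape (k, N)
npt.assert_equal(np.nanargmin(distances_all, axis=0), labels)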
We are ready to implement kmeans_clustering, which builds on the previous subroutines.
# EDIT THIS FUNCTION
def kmeans_clustering(X, k,
max_iters=1000,
epsilon=0.0,
callback=None):
"""
Apply k-means clustering algorithm on the samples in `X` to generate
k clusters.
Notes:
The main steps followed here are described previously:
        1. randomly assign the points to $k$ clusters.
        2. compute the centroid of each of the $k$ clusters.
        3. reassign each point to the closest centroid.
        4. repeat steps (2) and (3) until:
            - assignments/labels do not change, or
            - the within-distance $W(C)$ converges with `epsilon` tolerance,
            - or the number of iterations exceeds `max_iters`.
Args:
X (np.ndarray): The samples array, shape: (N, p).
k (int): The number of clusters.
max_iters (int): Maximum number of iterations.
epsilon (float): The minimum change in the within-distance to continue.
callback (Callable): a function to be called on the assignments,
the centroids, and within-distance after each iteration, default is None.
Returns:
        Tuple[np.ndarray, np.ndarray]: the assignments array to k clusters,
            with shape (N,), and the centroids array, with shape (k, p).
"""
# Step 1: randomly initialise the cluster assignments.
labels = np.random.choice(k, size=len(X), replace=True) # <-- SOLUTION
within_distance = np.inf
for _ in range(max_iters):
        # Step 2: compute the centroids of the current assignments.
        centroids = compute_centroids(k, X, labels)  # <-- SOLUTION
        # Step 3: reassign each point to the closest centroid.
        new_labels = kmeans_assignments(centroids, X)  # <-- SOLUTION
        new_within_distance = compute_within_distance(centroids, X, new_labels)
        if callback is not None:
            callback(new_labels, centroids, new_within_distance)
        # Step 4: stop when the assignments no longer change or when the
        # change in the within-distance W(C) is within `epsilon`.
        if np.array_equal(new_labels, labels):  # <-- SOLUTION
            break  # <-- SOLUTION
        labels = new_labels
        if abs(within_distance - new_within_distance) <= epsilon:  # <-- SOLUTION
            break  # <-- SOLUTION
        within_distance = new_within_distance
    return labels, centroids
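Once the function body is completed, a minimal usage sketch could look like the following; the logging callback and the choice k=5 (matching the number of centers used when generating the blobs) are illustrative rather than required.
# Illustrative run of kmeans_clustering on the blobs generated earlier.
logging.basicConfig(level=logging.INFO)
def log_progress(labels, centroids, within_distance):
    # Hypothetical callback: report W(C) after each iteration.
    logging.info("within-distance: %.3f", within_distance)
pred_labels, pred_centroids = kmeans_clustering(X, k=5, callback=log_progress)
print(np.unique(pred_labels, return_counts=True))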