11. Learning Machine Learning Together: Kmeans + Hierarchical Clustering

Outline

  • Section 1: K-means Clustering
  • Section 2: Hierarchical Clustering

Introduction

The purpose of this notebook is to understand and implement two popular unsupervised learning methods to cluster data points.

# imports
import numpy as np
import numpy.testing as npt # for testing.
import logging # debugging.
from tqdm.notebook import tqdm # progress

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Initial global configuration for matplotlib
SMALL_SIZE = 12
MEDIUM_SIZE = 16
BIGGER_SIZE = 20

plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=MEDIUM_SIZE)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title

We generate data with the make_blobs function. Although clustering methods are commonly used in unsupervised learning settings where no label y (cluster assignments) is available, we generate synthetic data with ground truth clusters for comparison later.

The number of dimensions can be specified by n_features in the function below. To simplify our setting, we will only consider data points in 3-dimensional space, but these methods can also be used in higher dimensions, and you are encouraged to try out those scenarios.

# generate data
X, cluster_assignment = make_blobs(n_samples=400, n_features=3, centers=5,
                                   cluster_std=1.2,
                                   random_state=5, center_box=(0, 20))
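
The same call also generates higher-dimensional data if you want to experiment with the scenarios mentioned above; for example (a hypothetical variation, not used in the rest of this notebook):

# Hypothetical variation: the same generator in 10-dimensional space.
X_highdim, y_highdim = make_blobs(n_samples=400, n_features=10, centers=5,
                                  cluster_std=1.2,
                                  random_state=5, center_box=(0, 20))
print(X_highdim.shape)  # (400, 10)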

cluster_assignment is the ground-truth that represents the true membership of each point to one of the five clusters. In practice we do not have it, and we aim to discover these assignments without access to a training split with given labels (as we have in supervised learning).
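
As a quick, illustrative sanity check (for this tutorial only), we can look at the shape of the data and at how the ground-truth labels are distributed:

# Inspect the generated data and the ground-truth labels (for illustration only).
print(X.shape)  # (400, 3): 400 samples in 3 dimensions
print(np.unique(cluster_assignment, return_counts=True))  # the 5 clusters and their sizes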

# Visualize the 3D data
fig = plt.figure(figsize=(12, 8))
ax1, ax2 = fig.add_subplot(121, projection='3d'), fig.add_subplot(122, projection='3d')

ax1.scatter(X[:, 0], X[:, 1], X[:, 2],  alpha=0.7)
ax1.set_title('Input Data')
ax1.set_xticklabels([]); ax1.set_yticklabels([]); ax1.set_zticklabels([]);
ax1.set_xlabel('X1'); ax1.set_ylabel('X2'); ax1.set_zlabel('X3')

ax2.scatter(X[:, 0], X[:, 1], X[:, 2], alpha=0.7,
           # We use cluster_assignment only to visualize the ground-truth
           # (the ideal clustering); in practice we don't have it and aim to estimate y.
           c=cluster_assignment)
ax2.set_title('Ground-Truth Assignments (Unknown)')
ax2.set_xticklabels([]); ax2.set_yticklabels([]); ax2.set_zticklabels([]);
ax2.set_xlabel('X1'); ax2.set_ylabel('X2'); ax2.set_zlabel('X3');

# Show the plot
plt.show()



Section 1: $k$-means Clustering

$k$-means is a simple clustering algorithm that follows these steps:

  1. Given a number of clusters $k$, assign every sample to one of the $k$ clusters at random.
  2. Compute the centroid of each of the $k$ clusters:
     $$\boldsymbol{m}_l = \frac{1}{|c_l|} \sum_{i \in c_l} \boldsymbol{x}_i, \quad l=1, \ldots, k$$
  3. Reassign each $\boldsymbol{x}_i$ to the closest centroid.
  4. Repeat steps (2) and (3) until:
    • assignments/labels do not change, or
    • the within-distance $W(C)$ converges, which is defined as:
      $$W(C) = \frac{1}{2} \sum_{l=1}^k \frac{1}{|c_l|}\sum_{i, j \in c_l}||\boldsymbol{x}^{(i)} - \boldsymbol{x}^{(j)}||^2$$
    • or, optionally, the number of iterations exceeds a predefined number max_iters.

By applying this algorithm to the dataset generated above, we expect the following clustering trajectory:

[Figure: animation of the k-means clustering trajectory on the data generated above]

To implement the 4-step k-means algorithm we develop some useful subroutines:

  • compute_centroids: from samples and their assignments, return the corresponding centroids (for step 2).
  • compute_within_distance: from centroids, samples, and their assignments, compute the within-distance (used for step 4 convergence check).
  • kmeans_assignments: from centroids and samples, return the new assignments (for step 3)

In the kmeans_assignments and compute_within_distance functions, we need to consider an edge case. It is possible that the new cluster assignments leave one of the centroids with no samples, which means that we now have an empty cluster, or equivalently, that two clusters have accidentally collapsed into one. The centroid computed for that cluster in the next iteration will be NaN, a special numerical value that encodes missing or problematic results. We need to handle the NaN centroid carefully in kmeans_assignments and compute_within_distance.
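
To see where the NaN comes from, consider a tiny, hypothetical example in which cluster 1 receives no samples: taking the mean over an empty selection yields NaN (and NumPy emits a RuntimeWarning).

# Hypothetical illustration: the centroid of an empty cluster is NaN.
X_demo = np.array([[0.0, 0.0],
                   [2.0, 2.0]])
labels_demo = np.array([0, 0])             # no sample is assigned to cluster 1
empty_cluster = X_demo[labels_demo == 1]   # shape (0, 2)
print(empty_cluster.mean(axis=0))          # [nan nan], with a RuntimeWarning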

Now we implement compute_within_distance. An equivalent formulation of $W(C)$ (page 103 in the notes) can be written as:

$$W(C) = \sum_{l=1}^k \sum_{i \in c_l} ||\boldsymbol{x}^{(i)} - \boldsymbol{m}_l||^2$$

where $\boldsymbol{m}_l$ denotes the $l$-th cluster centroid.
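
As a quick sanity check on this equivalence, the two formulations can be compared numerically on a small, hypothetical single-cluster example (not part of the original notebook):

# Hypothetical check: the pairwise and the centroid-based formulations of W(C) agree.
rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 3))    # one cluster with 6 points in 3D
m = Z.mean(axis=0)             # its centroid

pairwise = 0.5 / len(Z) * sum(np.sum((zi - zj)**2) for zi in Z for zj in Z)
centroid_based = np.sum((Z - m)**2)

npt.assert_allclose(pairwise, centroid_based)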

# EDIT THIS FUNCTION
def compute_within_distance(centroids, X, labels):
    """
    Compute the within-cluster distance.

    Args:
        centroids (np.ndarray): the centroids array, with shape (k, p).
        X (np.ndarray): the samples array, with shape (N, p).
        labels (np.ndarray): the cluster index of each sample, with shape (N,).

    Returns:
        (float): the within-cluster distance.
    """
    within_distance = 0.0
    k, p = centroids.shape

    for l in range(len(centroids)):
        centroid = centroids[l]
        # Applying aggregate computations on `NaN` values
        # can propagate the `NaN` to the results.
        # In this case we skip the `NaN` centroid,
        # which effectively corresponds to an empty cluster.
        if np.isnan(centroid).any():
            continue

        # Select samples belonging to label=l.
        X_cluster = X[labels == l]

        # EDIT THE NEXT LINES
        # You need to add the `X_cluster` contribution to `within_distance`

        # 1. Compute the cluster contribution.
        cluster_se = (X_cluster - centroid)**2 # <-- SOLUTION
        assert cluster_se.shape == (len(X_cluster), p) # <-- SOLUTION
        # 2. Accumulate
        within_distance += np.sum(cluster_se) # <-- SOLUTION

    return within_distance

Now implement compute_centroids that we use in step 2.

# EDIT THIS FUNCTION
def compute_centroids(k, X, labels):
    """
    Compute the centroids of the clustered points X.

    Args:
        k (int): total number of clusters.
        X (np.ndarray): data points, with shape (N, p)
        labels (np.ndarray): cluster assignments for each sample in X, with shape (N,).

    Returns:
        (np.ndarray): the centroids of the k clusters, with shape (k, p).
    """
    N, p = X.shape

    centroids = np.zeros((k, p))

    # EDIT THE NEXT LINES
    for label in range(k):
        cluster_X_l = X[labels == label] # <-- SOLUTION
        centroids[label] = cluster_X_l.mean(axis=0) # <-- SOLUTION

    return centroids

Make sure the following simple test case runs successfully.

# Test case.
X_test = np.array([[1, 1, 0],
                    [2, 2, 1],
                    [5, 3, 4],
                    [8, 3, 2]])
labels = np.array([0, 0, 1, 1])

centroids = np.array([[1.5, 1.5, 0.5],
                      [6.5, 3, 3]])

# Test compute_centroids
npt.assert_allclose(compute_centroids(2, X_test, labels),  centroids)


# Test compute_within_distance.
npt.assert_allclose(compute_within_distance(centroids, X_test, labels), 8.0)

Now, to the subroutine kmeans_assignments.

# EDIT THIS FUNCTION
def kmeans_assignments(centroids, X):
    """
    Assign every example to the index of the closest centroid.

    Args:
        centroids (np.ndarray): The centroids of the k clusters, shape: (k, p).
        X (np.ndarray): The samples array, shape (N, p).

    Returns:
        (np.ndarray): the cluster index assigned to each sample, with shape (N,).
    """
    k, p = centroids.shape
    N, _ = X.shape

    # Compute distances between data points and centroids. Assumed shape: (k, N).
    distances = np.vstack([np.linalg.norm(X - c, axis=1) for c in centroids]) # <-- SOLUTION
    # Note: If any centroid contains NaN, the NaN will propagate into the
    # corresponding row of `distances`; we need to skip that row below when we
    # search for the closest centroid.

    assert distances.shape == (k, N), f"Unexpected shape {distances.shape} != {(k, N)}"
    # Assignments are computed by finding the centroid with the minimum distance
    # for each sample. The np.nanargmin returns the index of the minimum values
    # in `distances` scanning the rows (axis=0) for each column,
    # while skipping any nan value found.
    return np.nanargmin(distances, axis=0)
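
To illustrate the NaN handling described in the comments above, here is a small, hypothetical distance matrix in which the second centroid belongs to an empty cluster (its row is all NaN); np.nanargmin simply ignores that row when picking the closest centroid:

# Hypothetical illustration: np.nanargmin skips the all-NaN row of an empty cluster.
demo_distances = np.array([[1.0,    4.0,    2.0],
                           [np.nan, np.nan, np.nan],  # the empty cluster's centroid
                           [3.0,    0.5,    5.0]])
print(np.nanargmin(demo_distances, axis=0))  # [0 2 0]: row 1 is never selected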


Some test cases to pass.

# Test case.
X_test = np.array([[1, 1, 0],
                    [2, 2, 1],
                    [5, 3, 4],
                    [8, 3, 2],
                    [11,4, -1]])

labels = np.array([0, 0, 1, 1, 1])

centroids = np.array([[1.5, 1.5, 0.5],
                      [6.5, 3, 3]])

# Test kmeans_assignments
npt.assert_equal(kmeans_assignments(centroids, X_test),  labels)

We are ready to implement kmeans_clustering that builds on the previous subroutines.

# EDIT THIS FUNCTION
def kmeans_clustering(X, k,
                      max_iters=1000,
                      epsilon=0.0,
                      callback=None):
    """
    Apply k-means clustering algorithm on the samples in `X` to generate
    k clusters.

    Notes:
      The main steps followed here are described previously:
        1. randomly assign the points to $k$ clusters.
        2. compute the centroid of each of the $k$ clusters.
        3. reassign each point to the closest centroid.
        4. repeat steps (2) and (3) until:
            - assignments/labels do not change, or
            - the within-distance $W(C)$ converges with `epsilon` tolerance,
            - or the number of iterations exceeds `max_iters`.


    Args:
        X (np.ndarray): The samples array, shape: (N, p).
        k (int): The number of clusters.
        max_iters (int): Maximum number of iterations.
        epsilon (float): The minimum change in the within-distance to continue.
        callback (Callable): a function to be called on the assignments,
            the centroids, and within-distance after each iteration, default is None.

    Returns:
        Tuple[np.ndarray, np.ndarray]: the assignments array to k clusters with
            shape (N,), and the centroids array with shape (k, p).
    """
    # Step 1: randomly initialise the cluster assignments.
    labels = np.random.choice(k, size=len(X), replace=True) # <-- SOLUTION

    within_distance = np.inf

    for _ in range(max_iters):
        # Step 2: compute the centroids
        centroids = compute_centroids(k, X, labels) # <-- SOLUTION

        # Step 3: reassign each point to the closest centroid.
        new_labels = kmeans_assignments(centroids, X) # <-- SOLUTION

        # Step 4: check the stopping criteria described above.
        new_within_distance = compute_within_distance(centroids, X, new_labels)

        if callback is not None:
            callback(new_labels, centroids, new_within_distance)

        # Stop if the assignments did not change, or if the improvement in the
        # within-distance is no larger than the `epsilon` tolerance.
        if np.array_equal(new_labels, labels):
            break
        if abs(within_distance - new_within_distance) <= epsilon:
            labels = new_labels
            break

        labels = new_labels
        within_distance = new_within_distance

    return labels, centroids
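
With all the pieces in place, a quick run on the generated data might look like the sketch below (illustrative usage only; the exact labels depend on the random initial assignments):

# Run k-means on the synthetic data with the true number of clusters.
np.random.seed(0)  # for reproducibility of the random initial assignments
pred_labels, pred_centroids = kmeans_clustering(X, k=5)

print(pred_centroids.shape)    # (5, 3)
print(np.unique(pred_labels))  # the discovered cluster indices
print(compute_within_distance(pred_centroids, X, pred_labels))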