Machine Learning Reading Notes (11): Clustering Analysis

Clustering Analysis

K-means

Some basic terms

  1. centroid: the center (mean point) of a cluster
  2. medoid: the most representative or most frequently occurring point of a cluster

Steps

  1. Randomly pick $k$ centroids from the sample points as initial cluster centers.
  2. Assign each sample to the nearest centroid $\mu^{(j)},\ j \in \{1, \dots, k\}$.
  3. Move each centroid to the center of the samples that were assigned to it.
  4. Repeat steps 2 and 3 until the cluster assignments do not change, or a user-defined tolerance or a maximum number of iterations is reached (a minimal sketch of this loop follows the list).
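A minimal NumPy sketch of this loop (not scikit-learn's implementation, which is used later; the names X, k, and n_iters are assumptions for illustration):

import numpy as np

def simple_kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick k random samples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # step 2: assign every sample to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: move each centroid to the mean of its assigned samples
        # (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids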

SSE

Based on the Euclidean distance metric, we can describe the k-means algorithm as a simple optimization problem: an iterative approach for minimizing the within-cluster sum of squared errors (SSE), which is sometimes also called cluster inertia.
$d(x, y)^2 = \sum_{j=1}^{m} (x_j - y_j)^2 = \|x - y\|_2^2$

$SSE = \sum_{i=1}^{n} \sum_{j=1}^{k} w^{(i,j)} \, \|x^{(i)} - \mu^{(j)}\|_2^2$

Here, $\mu^{(j)}$ is the representative point (centroid) for cluster $j$, and $w^{(i,j)} = 1$ if the sample $x^{(i)}$ is in cluster $j$, $w^{(i,j)} = 0$ otherwise.
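As a quick sanity check of the formula, the SSE can be computed directly from the cluster labels and centroids. This is only a sketch: labels and centroids are assumed to come from a fitted clustering (e.g. the simple_kmeans sketch above); scikit-learn exposes the same quantity as km.inertia_.

import numpy as np

def within_cluster_sse(X, labels, centroids):
    # sum of squared Euclidean distances from each sample to its assigned centroid
    return np.sum((X - centroids[labels]) ** 2)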

K-means++

The k-means++ algorithm places the initial centroids far away from each other, which leads to better and more consistent results than random initialization.

Steps

  1. Initialize an empty set $M$ to store the $k$ centroids being selected.
  2. Randomly choose the first centroid $\mu^{(j)}$ from the input samples and assign it to $M$.
  3. For each sample $x^{(i)}$ that is not in $M$, find the minimum squared distance $d(x^{(i)}, M)^2$ to any of the centroids in $M$.
  4. To randomly select the next centroid $\mu^{(p)}$, use a weighted probability distribution equal to $\frac{d(\mu^{(p)}, M)^2}{\sum_i d(x^{(i)}, M)^2}$.
  5. Repeat steps 3 and 4 until $k$ centroids are chosen (a minimal sketch of this procedure follows the list).
  6. Proceed with the classic k-means algorithm.
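A minimal NumPy sketch of this selection procedure (the names X and k are assumptions; scikit-learn's init='k-means++', used below, performs this initialization internally with some extra refinements):

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # step 2: choose the first centroid uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # step 3: squared distance of every sample to its nearest chosen centroid
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # step 4: draw the next centroid with probability proportional to d2
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)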

Example

from sklearn.datasets import make_blobs
# y is not actually used here, since we want an unsupervised model; it can still be used to validate the result afterwards
X, y = make_blobs(n_samples=150, 
                  n_features=2, 
                  centers=3, 
                  cluster_std=0.5, 
                  shuffle=True, 
                  random_state=0)
import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], 
            c='white', marker='o', edgecolor='black', s=50)
plt.grid()
plt.tight_layout()
#plt.savefig('images/11_01.png', dpi=300)
plt.show()

[Figure: scatter plot of the 150 generated samples]

from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, 
            init='k-means++', 
            n_init=10, 
            max_iter=300,
            tol=1e-04,
            random_state=0)

y_km = km.fit_predict(X)
plt.scatter(X[y_km == 0, 0],
            X[y_km == 0, 1],
            s=50, c='lightgreen',
            marker='s', edgecolor='black',
            label='cluster 1')
plt.scatter(X[y_km == 1, 0],
            X[y_km == 1, 1],
            s=50, c='orange',
            marker='o', edgecolor='black',
            label='cluster 2')
plt.scatter(X[y_km == 2, 0],
            X[y_km == 2, 1],
            s=50, c='lightblue',
            marker='v', edgecolor='black',
            label='cluster 3')
plt.scatter(km.cluster_centers_[:, 0],
            km.cluster_centers_[:, 1],
            s=250, marker='*',
            c='red', edgecolor='black',
            label='centroids')
plt.legend(scatterpoints=1)
plt.grid()
plt.tight_layout()
#plt.savefig('images/11_02.png', dpi=300)
plt.show()

[Figure: the three clusters found by k-means and their centroids]
The above algorithms are called hard clustering, where each sample is assigned to exactly one cluster.

Soft clustering (FCM)

Soft clustering is also called fuzzy clustering, e.g. fuzzy C-means (FCM).
It replaces the hard cluster assignment with probabilities of each point belonging to each cluster.

Steps

  1. Specify the number of centroids, $k$, and randomly assign the cluster memberships for each point.
  2. Compute the cluster centroids $\mu^{(j)},\ j \in \{1, \dots, k\}$.
  3. Update the cluster memberships for each point.
  4. Repeat steps 2 and 3 until the membership coefficients do not change, or a user-defined tolerance or a maximum number of iterations is reached (a minimal NumPy sketch of this loop follows the list).
    The objective to minimize is
    $J_m = \sum_{i=1}^{n} \sum_{j=1}^{k} w^{m(i,j)} \, \|x^{(i)} - \mu^{(j)}\|_2^2, \quad m \in [1, \infty)$
    Here, the membership indicator $w^{(i,j)}$ is not a binary value as in k-means but a real value that denotes the cluster membership probability. The exponent $m$ is the so-called fuzziness coefficient (or simply the fuzzifier) that controls the degree of fuzziness; the larger the value of $m$, the fuzzier the clusters become.
    $w^{(i,j)} = \left[ \sum_{p=1}^{k} \left( \frac{\|x^{(i)} - \mu^{(j)}\|_2}{\|x^{(i)} - \mu^{(p)}\|_2} \right)^{\frac{2}{m-1}} \right]^{-1}$
    The center $\mu^{(j)}$ of a cluster is calculated as the mean of all samples, weighted by each sample's degree of membership in that cluster:
    $\mu^{(j)} = \frac{\sum_{i=1}^{n} w^{m(i,j)} x^{(i)}}{\sum_{i=1}^{n} w^{m(i,j)}}$
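A minimal NumPy sketch of the FCM update loop based on the formulas above (the names X, k, m, and n_iters are assumptions; scikit-learn does not ship an FCM estimator, so this is only an illustration):

import numpy as np

def fuzzy_cmeans(X, k, m=2.0, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: random initial memberships, each row normalized to sum to 1
    w = rng.random((len(X), k))
    w /= w.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        # step 2: centroids as membership-weighted means (w raised to the fuzzifier m)
        wm = w ** m
        centroids = (wm.T @ X) / wm.sum(axis=0)[:, None]
        # step 3: update memberships from the distances to every centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        w_new = 1.0 / np.sum((dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1)), axis=2)
        # step 4: stop once the membership coefficients barely change
        if np.allclose(w_new, w, atol=1e-6):
            w = w_new
            break
        w = w_new
    return w, centroids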

Elbow method

The elbow method plots the within-cluster SSE (called distortion here, and exposed as km.inertia_ in scikit-learn) for increasing values of k, and picks the k at which the distortion stops decreasing sharply, i.e. the "elbow" of the curve.
distortions = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, 
                init='k-means++', 
                n_init=10, 
                max_iter=300, 
                random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)
plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.tight_layout()
#plt.savefig('images/11_03.png', dpi=300)
plt.show()

[Figure: distortion (SSE) versus number of clusters]

Silhouette plots

Silhouette analysis can be used as a graphical tool to plot a measure of how tightly grouped the samples in the clusters are.

Steps

  1. Calculate the cluster cohesion $a^{(i)}$ as the average distance between a sample $x^{(i)}$ and all other points in the same cluster.
  2. Calculate the cluster separation $b^{(i)}$ from the next closest cluster as the average distance between the sample $x^{(i)}$ and all samples in the nearest cluster.
  3. Calculate the silhouette $s^{(i)}$ as the difference between cluster cohesion and separation divided by the greater of the two, as shown here (a short manual sketch follows this list):
    $s^{(i)} = \frac{b^{(i)} - a^{(i)}}{\max\{b^{(i)}, a^{(i)}\}}$
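Before calling scikit-learn's silhouette_samples in the block below, here is a sketch of the manual computation for a single sample (the function name and the index argument are assumptions for illustration; clusters with a single sample are not handled):

import numpy as np

def silhouette_one(X, labels, i):
    own = labels[i]
    others_same = np.flatnonzero((labels == own) & (np.arange(len(X)) != i))
    # cohesion a(i): average distance to all other points in the same cluster
    a_i = np.linalg.norm(X[others_same] - X[i], axis=1).mean()
    # separation b(i): smallest average distance to the points of any other cluster
    b_i = min(np.linalg.norm(X[labels == c] - X[i], axis=1).mean()
              for c in np.unique(labels) if c != own)
    return (b_i - a_i) / max(a_i, b_i)

For example, silhouette_one(X, y_km, 0) should match silhouette_samples(X, y_km, metric='euclidean')[0] computed in the block below.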
import numpy as np
from matplotlib import cm
from sklearn.metrics import silhouette_samples

km = KMeans(n_clusters=3, 
            init='k-means++', 
            n_init=10, 
            max_iter=300,
            tol=1e-04,
            random_state=0)
y_km = km.fit_predict(X)

cluster_labels = np.unique(y_km)
n_clusters = cluster_labels.shape[0]
silhouette_vals = silhouette_samples(X, y_km, metric='euclidean')
y_ax_lower, y_ax_upper = 0, 0
yticks = []
for i, c in enumerate(cluster_labels):
    # select the silhouette values of cluster c and sort them
    c_silhouette_vals = silhouette_vals[y_km == c]
    c_silhouette_vals.sort()
    # draw the horizontal bars for this cluster
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(float(i) / n_clusters)
    plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0, 
             edgecolor='none', color=color)
    # record the y-tick position (middle of this cluster's bars)
    yticks.append((y_ax_lower + y_ax_upper) / 2.)
    y_ax_lower += len(c_silhouette_vals)
# draw a vertical line at the average silhouette coefficient
silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg, color="red", linestyle="--") 

plt.yticks(yticks, cluster_labels + 1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')

plt.tight_layout()
#plt.savefig('images/11_04.png', dpi=300)
plt.show()

[Figure: silhouette plot for the three clusters, with the average silhouette coefficient as a dashed vertical line]
The higher $s^{(i)}$ is, the better separated the clusters are; the silhouette coefficient lies in the range [-1, 1]. It is a useful follow-up evaluation once the clustering model has been built.

hierarchical clustering

Advantages:

  1. It allows us to plot dendrograms (visualizations of a binary hierarchical clustering).
  2. We do not need to specify the number of clusters upfront.

Approaches:

  1. agglomerative hierarchical clustering (bottom-up)
  2. divisive hierarchical clustering (top-down)

Two algorithms for agglomerative hierarchical clustering:

  1. single linkage: merge the two clusters whose most similar (closest) members are nearest to each other
  2. complete linkage: merge the two clusters whose most dissimilar (most distant) members are nearest to each other

[Figure: comparison of single linkage and complete linkage]

Steps

  1. Compute the distance matrix of all samples
  2. Represent each data point as a singleton cluster
  3. Merge the two closest clusters based on the distance between their most dissimilar members (the complete linkage criterion)
  4. Update the distance matrix
  5. Repeat steps 2 to 4 until one single cluster remains
import pandas as pd
import numpy as np

np.random.seed(123)

variables = ['X', 'Y', 'Z']
labels = ['ID_0', 'ID_1', 'ID_2', 'ID_3', 'ID_4']

X = np.random.random_sample([5, 3])*10
df = pd.DataFrame(X, columns=variables, index=labels)
df

[Output: the DataFrame of five random 3-D samples]

from scipy.spatial.distance import pdist, squareform

row_dist = pd.DataFrame(squareform(pdist(df, metric='euclidean')),
                        columns=labels,
                        index=labels)
row_dist

[Output: the 5x5 matrix of pairwise Euclidean distances]

from scipy.cluster.hierarchy import linkage

# correct approach: pass a condensed distance matrix (the output of pdist) to linkage
row_clusters = linkage(pdist(df, metric='euclidean'), method='complete')
pd.DataFrame(row_clusters,
             columns=['row label 1', 'row label 2',
                      'distance', 'no. of items in clust.'],
             index=['cluster %d' % (i + 1) 
                    for i in range(row_clusters.shape[0])])
from scipy.cluster.hierarchy import dendrogram

# make dendrogram black (part 1/2)
# from scipy.cluster.hierarchy import set_link_color_palette
# set_link_color_palette(['black'])

row_dendr = dendrogram(row_clusters, 
                       labels=labels,
                       # make dendrogram black (part 2/2)
                       # color_threshold=np.inf
                       )
plt.tight_layout()
plt.ylabel('Euclidean distance')
#plt.savefig('images/11_11.png', dpi=300, 
#            bbox_inches='tight')
plt.show()

[Figure: dendrogram of the five samples]

dendrogram with heatmap

# plot row dendrogram
fig = plt.figure(figsize=(8, 8), facecolor='white')
axd = fig.add_axes([0.09, 0.1, 0.2, 0.6])

# note: for matplotlib < v1.5.1, please use orientation='right'
row_dendr = dendrogram(row_clusters, orientation='left')

# reorder data with respect to clustering
df_rowclust = df.iloc[row_dendr['leaves'][::-1]]

axd.set_xticks([])
axd.set_yticks([])

# remove axes spines from dendrogram
for i in axd.spines.values():
    i.set_visible(False)

# plot heatmap
axm = fig.add_axes([0.23, 0.1, 0.6, 0.6])  # x-pos, y-pos, width, height
cax = axm.matshow(df_rowclust, interpolation='nearest', cmap='hot_r')
fig.colorbar(cax)
axm.set_xticklabels([''] + list(df_rowclust.columns))
axm.set_yticklabels([''] + list(df_rowclust.index))

#plt.savefig('images/11_12.png', dpi=300)
plt.show()

[Figure: row dendrogram combined with a heatmap of the reordered DataFrame]

sklearn

from sklearn.cluster import AgglomerativeClustering

ac = AgglomerativeClustering(n_clusters=3, 
                             affinity='euclidean', 
                             linkage='complete')
labels = ac.fit_predict(X)
print('Cluster labels: %s' % labels)
-------------------------------------------------
Cluster labels: [1 0 0 2 1]

DBSCAN

DBSCAN: Density-based Spatial Clustering of Applications with Noise
Criteria:

1. Core point: a point is considered a core point if at least a specified number (MinPts) of neighboring points fall within the specified radius $\varepsilon$.
2. Border point: a point that has fewer neighbors than MinPts within $\varepsilon$, but lies within the $\varepsilon$ radius of a core point.
3. Noise point: all other points that are neither core nor border points are considered noise points.

[Figure: core, border, and noise points for given MinPts and $\varepsilon$]

Steps

  1. Form a separate cluster for each core point or connected group of core points (core points are connected if they are no farther apart than $\varepsilon$).
  2. Assign each border point to the cluster of its corresponding core point.
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
plt.scatter(X[:, 0], X[:, 1])
plt.tight_layout()
#plt.savefig('images/11_14.png', dpi=600)
plt.show()

[Figure: the half-moon dataset generated by make_moons]

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

km = KMeans(n_clusters=2, random_state=0)
y_km = km.fit_predict(X)
ax1.scatter(X[y_km == 0, 0], X[y_km == 0, 1],
            edgecolor='black',
            c='lightblue', marker='o', s=40, label='cluster 1')
ax1.scatter(X[y_km == 1, 0], X[y_km == 1, 1],
            edgecolor='black',
            c='red', marker='s', s=40, label='cluster 2')
ax1.set_title('K-means clustering')

ac = AgglomerativeClustering(n_clusters=2,
                             affinity='euclidean',
                             linkage='complete')
y_ac = ac.fit_predict(X)
ax2.scatter(X[y_ac == 0, 0], X[y_ac == 0, 1], c='lightblue',
            edgecolor='black',
            marker='o', s=40, label='cluster 1')
ax2.scatter(X[y_ac == 1, 0], X[y_ac == 1, 1], c='red',
            edgecolor='black',
            marker='s', s=40, label='cluster 2')
ax2.set_title('Agglomerative clustering')

plt.legend()
plt.tight_layout()
# plt.savefig('images/11_15.png', dpi=300)
plt.show()

[Figure: k-means and agglomerative clustering results on the half-moon data]

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.2, min_samples=5, metric='euclidean')
y_db = db.fit_predict(X)
plt.scatter(X[y_db == 0, 0], X[y_db == 0, 1],
            c='lightblue', marker='o', s=40,
            edgecolor='black', 
            label='cluster 1')
plt.scatter(X[y_db == 1, 0], X[y_db == 1, 1],
            c='red', marker='s', s=40,
            edgecolor='black', 
            label='cluster 2')
plt.legend()
plt.tight_layout()
#plt.savefig('images/11_16.png', dpi=300)
plt.show()

[Figure: DBSCAN clustering result on the half-moon data]
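One detail the plot above does not show: scikit-learn's DBSCAN marks noise points with the label -1 in its output, so they can be counted (and plotted) separately, for example:

import numpy as np

noise_mask = (y_db == -1)
print('Number of noise points: %d' % noise_mask.sum())
# noise points could then be plotted separately, e.g.
# plt.scatter(X[noise_mask, 0], X[noise_mask, 1], c='gray', marker='x', label='noise')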
Thank you for your precious time and patience!
