Clustering Analysis
- K-means
- K-means++
- Example
- Soft clustering (FCM)
- Elbow method
- Silhouette plots
- Hierarchical clustering
- Dendrogram with heatmap
- sklearn
- DBSCAN
K-means
Some basic terms
- centroid: the center of a cluster (the mean of the samples assigned to it)
- medoid: the most representative or most frequently occurring point of a cluster
Steps
- Randomly pick k centroids from the sample points as initial cluster centers
- Assign each sample to the nearest centroid $\mu^{(j)}, j \in \{1, \dots, k\}$
- Move each centroid to the center of the samples that were assigned to it.
- Repeat steps 2 and 3 until the cluster assignments no longer change, a user-defined tolerance is reached, or a maximum number of iterations is reached.
SSE
Based on the Euclidean distance metric, we can describe the k-means algorithm as a simple optimization problem: an iterative approach for minimizing the within-cluster sum of squared errors (SSE), which is sometimes also called cluster inertia.
$$d(x, y)^2 = \sum_{j=1}^{m} (x_j - y_j)^2 = \lVert x - y \rVert_2^2$$
$$SSE = \sum_{i=1}^{n} \sum_{j=1}^{k} w^{(i,j)} \lVert x^{(i)} - \mu^{(j)} \rVert_2^2$$
Here, $\mu^{(j)}$ is the representative point (centroid) for cluster $j$, and $w^{(i,j)} = 1$ if the sample $x^{(i)}$ is in cluster $j$, $w^{(i,j)} = 0$ otherwise.
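As a minimal sketch (not scikit-learn's implementation), the assignment step and the SSE can be computed with NumPy as follows; assign_and_sse, X and centroids are placeholder names for illustration only:
import numpy as np

def assign_and_sse(X, centroids):
    # squared Euclidean distance of every sample to every centroid, shape (n, k)
    dists = ((X[:, np.newaxis, :] - centroids[np.newaxis, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)                       # index of the nearest centroid per sample
    sse = dists[np.arange(X.shape[0]), labels].sum()    # within-cluster SSE (cluster inertia)
    return labels, sse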
K-means++
Place the initial centroids far away from each other via the k-means++ algorithm.
Steps
- Initialize an empty set M to store the k centroids being selected.
- Randomly choose the first centroid $\mu^{(j)}$ from the input samples and assign it to M.
- For each sample $x^{(i)}$ that is not in M, find the minimum squared distance $d(x^{(i)}, M)^2$ to any of the centroids in M.
- To randomly select the next centroid $\mu^{(p)}$, use a weighted probability distribution equal to $\frac{d(\mu^{(p)}, M)^2}{\sum_i d(x^{(i)}, M)^2}$ (sketched in code below).
- Repeat the previous two steps until k centroids have been chosen.
- Proceed with the classic k-means algorithm.
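Below is a rough NumPy sketch of the seeding procedure above; kmeans_pp_init and its arguments are illustrative names only, and scikit-learn runs its own implementation when init='k-means++':
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]    # first centroid chosen uniformly at random
    for _ in range(1, k):
        # minimum squared distance of each sample to the centroids chosen so far
        diffs = X[:, np.newaxis, :] - np.array(centroids)[np.newaxis, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()                # weighted probability distribution from the steps above
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)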
Example
from sklearn.datasets import make_blobs
# y is not actually used here, since we want an unsupervised model; it can still be used to verify the results
X, y = make_blobs(n_samples=150,
                  n_features=2,
                  centers=3,
                  cluster_std=0.5,
                  shuffle=True,
                  random_state=0)
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1],
c='white', marker='o', edgecolor='black', s=50)
plt.grid()
plt.tight_layout()
#plt.savefig('images/11_01.png', dpi=300)
plt.show()
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3,
init='k-means++',
n_init=10,
max_iter=300,
tol=1e-04,
random_state=0)
y_km = km.fit_predict(X)
plt.scatter(X[y_km == 0, 0],
X[y_km == 0, 1],
s=50, c='lightgreen',
marker='s', edgecolor='black',
label='cluster 1')
plt.scatter(X[y_km == 1, 0],
X[y_km == 1, 1],
s=50, c='orange',
marker='o', edgecolor='black',
label='cluster 2')
plt.scatter(X[y_km == 2, 0],
X[y_km == 2, 1],
s=50, c='lightblue',
marker='v', edgecolor='black',
label='cluster 3')
plt.scatter(km.cluster_centers_[:, 0],
km.cluster_centers_[:, 1],
s=250, marker='*',
c='red', edgecolor='black',
label='centroids')
plt.legend(scatterpoints=1)
plt.grid()
plt.tight_layout()
#plt.savefig('images/11_02.png', dpi=300)
plt.show()
The algorithm above is an example of hard clustering, where each sample is assigned to exactly one cluster.
Soft clustering (FCM)
Also called fuzzy clustering, e.g. fuzzy C-means (FCM).
We replace the hard cluster assignment with probabilities of each point belonging to each cluster.
Steps
- Specify the number of centroids, k, and randomly assign the cluster memberships for each point.
- Compute the cluster centroids $\mu^{(j)}, j \in \{1, \dots, k\}$
- Update the cluster memberships for each point.
- Repeat steps 2 and 3 until the membership coefficients no longer change, a user-defined tolerance is reached, or a maximum number of iterations is reached.
$$J_m = \sum_{i=1}^{n} \sum_{j=1}^{k} \left(w^{(i,j)}\right)^{m} \lVert x^{(i)} - \mu^{(j)} \rVert_2^2, \quad m \in [1, \infty)$$
Here, the membership indicator $w^{(i,j)}$ is not a binary value as in k-means but a real value that denotes the cluster membership probability.
The exponent $m$ is the so-called fuzziness coefficient (or simply fuzzifier) that controls the degree of fuzziness: the larger the value of $m$, the fuzzier the clusters become.
$$w^{(i,j)} = \left[ \sum_{p=1}^{k} \left( \frac{\lVert x^{(i)} - \mu^{(j)} \rVert_2}{\lVert x^{(i)} - \mu^{(p)} \rVert_2} \right)^{\frac{2}{m-1}} \right]^{-1}$$
The center $\mu^{(j)}$ of a cluster is calculated as the mean of all samples, weighted by each sample's degree of membership in that cluster:
$$\mu^{(j)} = \frac{\sum_{i=1}^{n} \left(w^{(i,j)}\right)^{m} x^{(i)}}{\sum_{i=1}^{n} \left(w^{(i,j)}\right)^{m}}$$
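scikit-learn does not ship an FCM implementation, so here is a rough sketch of a single iteration (membership update followed by centroid update) written directly from the two formulas above; fcm_step and its arguments are illustrative names, not a library API:
import numpy as np

def fcm_step(X, centroids, m=2.0, eps=1e-12):
    # distance of every sample to every centroid, shape (n, k); eps avoids division by zero
    dist = np.linalg.norm(X[:, np.newaxis, :] - centroids[np.newaxis, :, :], axis=2) + eps
    # membership update: w(i,j) = 1 / sum_p (d_ij / d_ip)^(2/(m-1))
    ratio = dist[:, :, np.newaxis] / dist[:, np.newaxis, :]
    w = 1.0 / (ratio ** (2.0 / (m - 1))).sum(axis=2)
    # centroid update: membership-weighted mean of all samples
    wm = w ** m
    new_centroids = (wm.T @ X) / wm.sum(axis=0)[:, np.newaxis]
    return w, new_centroids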
Elbow method
Plot the distortion (the within-cluster SSE, exposed by scikit-learn's KMeans as the inertia_ attribute) against the number of clusters and pick the value of k at the "elbow", where the distortion stops decreasing rapidly.
distortions = []
for i in range(1, 11):
    km = KMeans(n_clusters=i,
                init='k-means++',
                n_init=10,
                max_iter=300,
                random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)
plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.tight_layout()
#plt.savefig('images/11_03.png', dpi=300)
plt.show()
Silhouette plots
Silhouette analysis can be used as a graphical tool to plot a measure of how tightly grouped the samples in the clusters are.
Steps
- Calculate the cluster cohesion $a^{(i)}$ as the average distance between a sample $x^{(i)}$ and all other points in the same cluster.
- Calculate the cluster separation $b^{(i)}$ from the next closest cluster as the average distance between the sample $x^{(i)}$ and all samples in the nearest cluster.
- Calculate the silhouette $s^{(i)}$ as the difference between cluster cohesion and separation divided by the greater of the two, as shown here:
$$s^{(i)} = \frac{b^{(i)} - a^{(i)}}{\max\{a^{(i)}, b^{(i)}\}}$$
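A by-hand version of these three steps for a single sample might look like the sketch below (silhouette_samples in the next code block computes the same quantity for all samples at once; silhouette_of_point is an illustrative name):
import numpy as np

def silhouette_of_point(i, X, labels):
    own = labels[i]
    dists = np.linalg.norm(X - X[i], axis=1)
    same = (labels == own) & (np.arange(len(X)) != i)
    a = dists[same].mean()                        # cohesion a(i)
    b = min(dists[labels == c].mean()             # separation b(i): nearest foreign cluster
            for c in np.unique(labels) if c != own)
    return (b - a) / max(a, b)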
import numpy as np
from matplotlib import cm
from sklearn.metrics import silhouette_samples
km = KMeans(n_clusters=3,
init='k-means++',
n_init=10,
max_iter=300,
tol=1e-04,
random_state=0)
y_km = km.fit_predict(X)
cluster_labels = np.unique(y_km)
n_clusters = cluster_labels.shape[0]
silhouette_vals = silhouette_samples(X, y_km, metric='euclidean')
y_ax_lower, y_ax_upper = 0, 0
yticks = []
for i, c in enumerate(cluster_labels):
    # take the silhouette values of one cluster and sort them
    c_silhouette_vals = silhouette_vals[y_km == c]
    c_silhouette_vals.sort()
    # draw the horizontal bars for this cluster
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(float(i) / n_clusters)
    plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0,
             edgecolor='none', color=color)
    # remember the y-tick position for this cluster
    yticks.append((y_ax_lower + y_ax_upper) / 2.)
    y_ax_lower += len(c_silhouette_vals)
# draw a vertical line at the average silhouette coefficient
silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg, color="red", linestyle="--")
plt.yticks(yticks, cluster_labels + 1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')
plt.tight_layout()
#plt.savefig('images/11_04.png', dpi=300)
plt.show()
The higher the silhouette coefficient s, the better the separation between clusters; s lies in the range [-1, 1]. Silhouette analysis is a useful finishing check after the clustering model has been built.
Hierarchical clustering
Advantages:
- It allows us to plot dendrograms (visualizations of a binary hierarchical clustering).
- We do not need to specify the number of clusters upfront.
Approaches:
- agglomerative hierarchical clustering (bottom-up)
- divisive hierarchical clustering (top-down)
Two algorithms for agglomerative hierarchical clustering (compared in the sketch below):
- single linkage: merge the two clusters whose most similar (closest) members have the smallest distance
- complete linkage: merge the two clusters whose most dissimilar (farthest) members have the smallest distance
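As a quick way to see the difference, SciPy's linkage function accepts both criteria through its method argument; pts below is just a small toy matrix standing in for any data set:
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

pts = np.random.default_rng(1).random((5, 3))      # toy data
d = pdist(pts, metric='euclidean')                 # condensed distance matrix
merges_single = linkage(d, method='single')        # merge on the closest members
merges_complete = linkage(d, method='complete')    # merge on the most dissimilar members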
Steps
- Compute the distance matrix of all samples
- Represent each data point as a singleton cluster
- Merge the two closest clusters based on the distance of the most dissimilar members
- Update the similarity matrix
- Repeat steps 2 to 4 until one single cluster remains
import pandas as pd
import numpy as np
np.random.seed(123)
variables = ['X', 'Y', 'Z']
labels = ['ID_0', 'ID_1', 'ID_2', 'ID_3', 'ID_4']
X = np.random.random_sample([5, 3])*10
df = pd.DataFrame(X, columns=variables, index=labels)
df
from scipy.spatial.distance import pdist, squareform
row_dist = pd.DataFrame(squareform(pdist(df, metric='euclidean')),
columns=labels,
index=labels)
row_dist
# correct approach: pass the condensed distance matrix returned by pdist to linkage
from scipy.cluster.hierarchy import linkage
row_clusters = linkage(pdist(df, metric='euclidean'), method='complete')
pd.DataFrame(row_clusters,
columns=['row label 1', 'row label 2',
'distance', 'no. of items in clust.'],
index=['cluster %d' % (i + 1)
for i in range(row_clusters.shape[0])])
from scipy.cluster.hierarchy import dendrogram
# make dendrogram black (part 1/2)
# from scipy.cluster.hierarchy import set_link_color_palette
# set_link_color_palette(['black'])
row_dendr = dendrogram(row_clusters,
labels=labels,
# make dendrogram black (part 2/2)
# color_threshold=np.inf
)
plt.tight_layout()
plt.ylabel('Euclidean distance')
#plt.savefig('images/11_11.png', dpi=300,
# bbox_inches='tight')
plt.show()
Dendrogram with heatmap
# plot row dendrogram
fig = plt.figure(figsize=(8, 8), facecolor='white')
axd = fig.add_axes([0.09, 0.1, 0.2, 0.6])
# note: for matplotlib < v1.5.1, please use orientation='right'
row_dendr = dendrogram(row_clusters, orientation='left')
# reorder data with respect to clustering
df_rowclust = df.iloc[row_dendr['leaves'][::-1]]
axd.set_xticks([])
axd.set_yticks([])
# remove axes spines from dendrogram
for i in axd.spines.values():
    i.set_visible(False)
# plot heatmap
axm = fig.add_axes([0.23, 0.1, 0.6, 0.6]) # x-pos, y-pos, width, height
cax = axm.matshow(df_rowclust, interpolation='nearest', cmap='hot_r')
fig.colorbar(cax)
axm.set_xticklabels([''] + list(df_rowclust.columns))
axm.set_yticklabels([''] + list(df_rowclust.index))
#plt.savefig('images/11_12.png', dpi=300)
plt.show()
sklearn
from sklearn.cluster import AgglomerativeClustering
ac = AgglomerativeClustering(n_clusters=3,
affinity='euclidean',
linkage='complete')
labels = ac.fit_predict(X)
print('Cluster labels: %s' % labels)
-------------------------------------------------
Cluster labels: [1 0 0 2 1]
DBSCAN
DBSCAN: Density-based Spatial Clustering of Applications with Noise
Criteria:
1. core point: a point is considered a core point if at least a specified number (MinPts) of neighboring points fall within the specified radius ε.
2. border point: a point that has fewer neighbors than MinPts within ε but lies within the ε-radius of a core point.
3. noise point: all other points that are neither core nor border points are considered noise points.
Steps
- Form a separate cluster for each core point or connected group of core points (core points are connected if they are no farther away than ε).
- Assign each border point to the cluster of its corresponding core point.
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
plt.scatter(X[:, 0], X[:, 1])
plt.tight_layout()
#plt.savefig('images/11_14.png', dpi=600)
plt.show()
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
km = KMeans(n_clusters=2, random_state=0)
y_km = km.fit_predict(X)
ax1.scatter(X[y_km == 0, 0], X[y_km == 0, 1],
edgecolor='black',
c='lightblue', marker='o', s=40, label='cluster 1')
ax1.scatter(X[y_km == 1, 0], X[y_km == 1, 1],
edgecolor='black',
c='red', marker='s', s=40, label='cluster 2')
ax1.set_title('K-means clustering')
ac = AgglomerativeClustering(n_clusters=2,
affinity='euclidean',
linkage='complete')
y_ac = ac.fit_predict(X)
ax2.scatter(X[y_ac == 0, 0], X[y_ac == 0, 1], c='lightblue',
edgecolor='black',
marker='o', s=40, label='cluster 1')
ax2.scatter(X[y_ac == 1, 0], X[y_ac == 1, 1], c='red',
edgecolor='black',
marker='s', s=40, label='cluster 2')
ax2.set_title('Agglomerative clustering')
plt.legend()
plt.tight_layout()
# plt.savefig('images/11_15.png', dpi=300)
plt.show()
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.2, min_samples=5, metric='euclidean')
y_db = db.fit_predict(X)
plt.scatter(X[y_db == 0, 0], X[y_db == 0, 1],
c='lightblue', marker='o', s=40,
edgecolor='black',
label='cluster 1')
plt.scatter(X[y_db == 1, 0], X[y_db == 1, 1],
c='red', marker='s', s=40,
edgecolor='black',
label='cluster 2')
plt.legend()
plt.tight_layout()
#plt.savefig('images/11_16.png', dpi=300)
plt.show()
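To relate the result back to the core/border/noise criteria above: DBSCAN labels noise points with -1, and the fitted estimator lists the indices of its core points in core_sample_indices_. A small check on the model fitted above:
import numpy as np

n_noise = np.sum(y_db == -1)                       # noise points get the label -1
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True          # True for core points
n_border = np.sum(~core_mask & (y_db != -1))       # assigned to a cluster but not core
print('core: %d, border: %d, noise: %d' % (core_mask.sum(), n_border, n_noise))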
Thank you for your time and patience!