Clustering -- THU Machine Learning 2020

What can we do with unlabeled data? 

  • Data clustering
    • Partition examples into groups when no pre-defined categories/classes are available
  • Dimensionality reduction
    • Reduce the number of variables under consideration
  • Outlier detection
    • Identify new or unknown data or signals that the machine learning system did not see during training
  • Modeling the data density

What is clustering?

  • “Birds of a feather flock together.”
  • Small intra-cluster distance
  • Large inter-cluster distance
  • Soft clustering vs. hard clustering
    • Soft: the same object can belong to more than one cluster
    • Hard: each object belongs to exactly one cluster
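
The two distance criteria and the soft/hard distinction can be made concrete with a small sketch. The toy points, the function names, and the use of Euclidean distance are all assumptions chosen for illustration; the slides do not fix a distance measure.

```python
from itertools import combinations
from math import dist


def mean_intra_cluster_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    pairs = [dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(pairs) / len(pairs)


def mean_inter_cluster_distance(clusters):
    """Average distance between pairs of points from different clusters."""
    pairs = [dist(p, q)
             for a, b in combinations(clusters, 2)
             for p in a for q in b]
    return sum(pairs) / len(pairs)


# Hypothetical 2-D data: two tight groups that lie far apart.
good = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 5.0), (5.0, 6.0)]]
# The same four points, grouped badly (each cluster mixes both groups).
bad = [[(0.0, 0.0), (5.0, 5.0)], [(0.0, 1.0), (5.0, 6.0)]]

# Hard clustering: every object carries exactly one cluster label.
hard_labels = {"x1": 0, "x2": 0, "x3": 1}

# Soft clustering: every object carries a membership weight per cluster.
soft_memberships = {"x1": [0.9, 0.1], "x2": [0.5, 0.5], "x3": [0.2, 0.8]}
```

Under these assumptions, the `good` partition has a smaller intra-cluster distance and a larger inter-cluster distance than the `bad` one, which is the sense in which it is the better clustering.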

Hierarchical clustering

Agglomerative hierarchical clustering

Cluster similarity:
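
The slide content for this heading did not survive, so the following is only a sketch of three commonly used cluster-similarity criteria (single, complete, and average linkage) and of the generic bottom-up merge loop. Whether the lecture used exactly these criteria is an assumption, as are the function names and the Euclidean distance.

```python
from itertools import combinations
from math import dist


def single_link(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(dist(p, q) for p in a for q in b)


def complete_link(a, b):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(dist(p, q) for p in a for q in b)


def average_link(a, b):
    """Mean distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))


def agglomerative(points, n_clusters, linkage=single_link):
    """Start with one cluster per point; repeatedly merge the closest pair."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest linkage distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # i < j, so index i is unaffected by the deletion
    return clusters


# Hypothetical toy data.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (9.0, 9.0)]
result = agglomerative(points, 2)
```

Recording the sequence of merges instead of stopping at `n_clusters` would give the full hierarchy (dendrogram); this sketch keeps only the final partition for brevity.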

Divisive hierarchical clustering

Discussion on hierarchical clustering

K-means is guaranteed to converge, but not necessarily to the optimal solution: it may stop at a local optimum.
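
A minimal sketch of the standard K-means iteration (alternating an assignment step and a center-update step) illustrates this claim. Neither step can increase the sum of squared errors (SSE), which is why the iteration converges, but the result depends on the initial centers. The function names, the toy data, and the two initializations are assumptions made for illustration.

```python
from math import dist


def assign(points, centers):
    """Assignment step: index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            for p in points]


def kmeans(points, centers, max_iter=100):
    """Run K-means from the given initial centers; return centers, labels, SSE."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:  # nothing moved: converged
            break
        centers = new_centers
    labels = assign(points, centers)
    sse = sum(dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return centers, labels, sse


# Hypothetical data: the corners of a wide 4 x 1 rectangle, K = 2.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Initial centers on the left and right sides: splits left/right, SSE 1.0.
_, _, sse_good = kmeans(points, [(0.0, 0.0), (4.0, 0.0)])

# Initial centers in the middle of the long edges: stuck at a top/bottom
# split, SSE 16.0, even though the iteration has converged.
_, _, sse_bad = kmeans(points, [(2.0, 0.0), (2.0, 1.0)])
```

Both runs converge, but only the first reaches the better partition. A common remedy is to run K-means from several initializations and keep the result with the lowest SSE.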

How can we decide K?

Discussion on K-means:

K-medoid clustering


The basic strategy:

  • First, arbitrarily choose a representative object (medoid) for each cluster
  • Iterate:
    • Assign each remaining object to the medoid to which it is most similar
    • Replace one of the medoids with a non-medoid whenever the swap improves the quality of the resulting clustering. Quality is measured by a cost function: the average dissimilarity between each object and its medoid.
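
The strategy above can be sketched as a greedy swap search. This is an illustrative simplification, not a definitive implementation: the function names, the choice of the first `k` points as the arbitrary initial medoids, Euclidean distance as the dissimilarity, and the assumption that all points are distinct tuples are made for the example only.

```python
from math import dist


def total_cost(points, medoids):
    """Average dissimilarity between each object and its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points) / len(points)


def k_medoids(points, k):
    """Greedy swap search: accept any medoid/non-medoid swap that lowers cost."""
    medoids = list(points[:k])  # arbitrary initial representatives
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                # Try replacing medoid i with a non-medoid object.
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    # Final assignment: each object joins its most similar medoid.
    clusters = {m: [] for m in medoids}
    for p in points:
        clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
    return medoids, clusters


# Hypothetical toy data: two L-shaped groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
medoids, clusters = k_medoids(points, 2)
```

Because every accepted swap strictly lowers the cost and there are finitely many medoid sets, the loop terminates. Unlike a K-means center, each medoid is always one of the original objects, so the method only needs pairwise dissimilarities.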
