Clustering -- THU Machine Learning 2020

What can we do with unlabeled data? 

  • Data clustering
    • Partition examples into groups when no pre-defined categories/classes are available
  • Dimensionality reduction
    • Reduce the number of variables under consideration
  • Outlier detection
    • Identification of new or unknown data or signal that a machine learning system is not aware of during training
  • Modeling the data density

What is clustering?

  • “Birds of a feather flock together.”
  • Small intra-cluster distance
  • Large inter-cluster distance
  • Soft clustering vs. hard clustering
    • Soft: the same object can belong to more than one cluster
    • Hard: each object belongs to exactly one cluster
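
To make the two distance criteria concrete, the sketch below measures them on a toy data set. The function names, the choice of Euclidean distance, and the sample points are assumptions made for illustration, not part of the lecture.

```python
import math
from itertools import combinations


def mean_intra_distance(cluster):
    """Average pairwise distance between points of the same cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:  # a single-point cluster has no pairs
        return 0.0
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def mean_inter_distance(cluster_a, cluster_b):
    """Average distance between points taken from two different clusters."""
    total = sum(math.dist(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


# Hypothetical hard clustering of six 2-D points into two groups.
cluster_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
cluster_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

intra_a = mean_intra_distance(cluster_a)
intra_b = mean_intra_distance(cluster_b)
inter_ab = mean_inter_distance(cluster_a, cluster_b)
```

A good clustering of this data keeps `intra_a` and `intra_b` small relative to `inter_ab`.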

Hierarchical clustering

Agglomerative (bottom-up hierarchical clustering)

Agglomerative hierarchical clustering algorithm:
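
A minimal sketch of the usual bottom-up procedure: start with every point in its own cluster, then repeatedly merge the two closest clusters until the desired number remains. The function names, the stopping rule (`num_clusters`), and the default single-link distance are assumptions for illustration and may differ from the version presented in the lecture.

```python
import math


def single_link(cluster_a, cluster_b):
    """Distance between two clusters = distance of their closest pair of points."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)


def agglomerative(points, num_clusters, linkage=single_link):
    """Merge the two closest clusters until only `num_clusters` remain."""
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > num_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
result = agglomerative(points, num_clusters=3)
```

This naive version rescans all cluster pairs on every merge, so it is only suitable for small inputs.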

Cluster similarity:
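
Common ways to measure how close two clusters are include single, complete, and average linkage. The sketch below expresses them as distances (smaller means more similar) and assumes Euclidean distance between points. These are the standard textbook definitions and are not necessarily the exact set covered in the lecture.

```python
import math


def single_linkage(cluster_a, cluster_b):
    """Distance of the closest pair of points across the two clusters."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)


def complete_linkage(cluster_a, cluster_b):
    """Distance of the farthest pair of points across the two clusters."""
    return max(math.dist(p, q) for p in cluster_a for q in cluster_b)


def average_linkage(cluster_a, cluster_b):
    """Mean distance over all cross-cluster pairs of points."""
    total = sum(math.dist(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


# Hypothetical clusters on a line; cross-cluster distances are 3, 5, 2, 4.
a = [(0, 0), (1, 0)]
b = [(3, 0), (5, 0)]
```

On this toy example the three measures give 2.0, 5.0, and 3.5 respectively, which shows how strongly the choice of linkage can change which clusters get merged first.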

Divisive (top-down hierarchical clustering)

Discussion on hierarchical clustering

K-means 

Steps:

(step 1)

(step 2)

K-means is guaranteed to converge, but not necessarily to the optimal solution.
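
A minimal sketch of the iteration, assuming the two steps are the usual assignment step and centroid-update step of Lloyd's algorithm. The function name, the random initialisation from the data points, and the `max_iter` and `seed` parameters are illustrative assumptions.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Alternate assignment and centroid update until assignments stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: initialise from k distinct data points
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # nothing changed, so the algorithm has converged
            break
        assignment = new_assignment
        # Step 2: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment = kmeans(points, k=2)
```

Because the result depends on the initial centroids, a common remedy for poor local optima is to run the algorithm from several random starts and keep the solution with the lowest within-cluster sum of squared distances.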

How can we decide K?
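
One common heuristic, which may or may not be the one used in the lecture, is the elbow method: compute the within-cluster sum of squared errors (SSE) for several values of K and pick the K after which the SSE stops dropping sharply. The sketch below evaluates the SSE on hand-written partitions of a toy 1-D data set; the partitions stand in for the output of a clustering run and are assumptions for illustration.

```python
def within_cluster_sse(clusters):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for cluster in clusters:
        centroid = tuple(sum(col) / len(cluster) for col in zip(*cluster))
        for point in cluster:
            total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total


low = [(1.0,), (2.0,), (3.0,)]
high = [(11.0,), (12.0,), (13.0,)]

# Hypothetical partitions for K = 1, 2, 3 (as a clustering algorithm might return them).
partitions = {
    1: [low + high],
    2: [low, high],
    3: [low[:2], low[2:], high],
}

sse = {k: within_cluster_sse(parts) for k, parts in partitions.items()}
# sse == {1: 154.0, 2: 4.0, 3: 2.5}
```

The SSE falls from 154.0 to 4.0 when going from one cluster to two, but only to 2.5 with a third cluster, so the elbow suggests K = 2 for this data.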

Discussion on K-means:

K-medoid clustering

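Unlike k-means, the "center" of a cluster in k-medoid clustering must be an actual data point rather than a virtual "center". This data point should be the one in the cluster whose sum of distances to all the other points in the cluster is smallest.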

The basic strategy:

  • First, arbitrarily choose a representative object (medoid) for each cluster
  • Iteration:
    • Assign each remaining object to the medoid to which it is most similar
    • Replace one of the medoids with one of the non-medoids whenever this improves the quality of the resulting clustering (quality is estimated by a cost function: the average dissimilarity between an object and its medoid)
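
The strategy above can be sketched as a simple greedy swap search. The function names, the use of Euclidean distance as the dissimilarity, and the choice of the first `k` points as initial medoids are assumptions for illustration; this is a sketch in the spirit of the description, not a tuned implementation.

```python
import math


def average_cost(points, medoids):
    """Average dissimilarity between each object and its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points) / len(points)


def k_medoids(points, k):
    """Greedy swap search: accept any medoid/non-medoid swap that lowers the cost."""
    medoids = list(points[:k])  # assumption: arbitrary initial medoids
    best = average_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = average_cost(points, trial)
                if cost < best:  # keep the swap only if clustering quality improves
                    medoids, best, improved = trial, cost, True
    # Final assignment: each object joins the medoid it is most similar to.
    clusters = {m: [] for m in medoids}
    for p in points:
        nearest = min(medoids, key=lambda m: math.dist(p, m))
        clusters[nearest].append(p)
    return medoids, clusters


blob_a = [(0, 0), (0, 1), (1, 0), (0, -1), (-1, 0)]
blob_b = [(10, 10), (10, 11), (11, 10), (10, 9), (9, 10)]
medoids, clusters = k_medoids(blob_a + blob_b, k=2)
```

Because every medoid is an actual data point, the method only needs pairwise dissimilarities, so it also works when averaging objects is not meaningful.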
