What can we do with unlabeled data?
- Data clustering
- Partition examples into groups when no pre-defined categories/classes are available
- Dimensionality reduction
- Reduce the number of variables under consideration
- Outlier detection
- Identify new or unknown data or signals that the machine learning system did not encounter during training
- Modeling the data density
What is clustering?
- "Birds of a feather flock together."
- small intra-cluster distance
- large inter-cluster distance
- Soft clustering vs. hard clustering
- Soft: the same object can belong to several clusters
- Hard: the same object can belong to only one cluster
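The two criteria above (small intra-cluster distance, large inter-cluster distance) can be checked numerically. The following is a minimal sketch, assuming Euclidean distance and a hard clustering given as a list of point lists; the function names and the toy data are illustrative and not part of the original notes.

```python
from itertools import combinations
from math import dist


def mean_intra_cluster_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    pairs = [dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(pairs) / len(pairs)


def mean_inter_cluster_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    pairs = [dist(p, q)
             for a, b in combinations(clusters, 2)
             for p in a for q in b]
    return sum(pairs) / len(pairs)


# Toy data (assumed): two well-separated groups of three points each.
clusters = [[(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
            [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]]
intra = mean_intra_cluster_distance(clusters)
inter = mean_inter_cluster_distance(clusters)
print(f"intra={intra:.2f}, inter={inter:.2f}")
```

A good clustering of this data should give an intra-cluster average well below the inter-cluster average.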
Hierarchical clustering
Agglomerative (bottom-up hierarchical clustering)
Agglomerative hierarchical clustering algorithm:
Cluster similarity:
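A minimal sketch of agglomerative clustering, assuming Euclidean distance between points and three common ways of measuring how close two clusters are (single, complete, and average linkage). Similarity is expressed here as a distance, so a smaller value means more similar. The function names, the `linkage` parameter, and the toy data are illustrative assumptions, not taken from the original notes.

```python
from math import dist


def linkage_distance(a, b, linkage="single"):
    """Distance between two clusters a and b (lists of points)."""
    pairwise = [dist(p, q) for p in a for q in b]
    if linkage == "single":      # closest pair of points
        return min(pairwise)
    if linkage == "complete":    # farthest pair of points
        return max(pairwise)
    if linkage == "average":     # mean over all cross-cluster pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerative(points, num_clusters, linkage="single"):
    """Start from singleton clusters and repeatedly merge the two closest."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage_distance(clusters[i], clusters[j], linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters


# Toy data (assumed): two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerative(points, num_clusters=3)
print(result)
```

This brute-force version recomputes every cluster-to-cluster distance at each merge, which is fine for illustration but slow for large data sets.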
Divisive (top-down hierarchical clustering)
Discussion on hierarchical clustering
K-means
Steps:
(step 1)
(step 2)
K-means is guaranteed to converge, but not necessarily to the optimal solution.
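A minimal sketch of K-means, assuming the standard two alternating steps (assign each point to its nearest center, then move each center to the mean of its assigned points) and Euclidean distance. The function names, the `max_iters` parameter, and the toy data are illustrative assumptions. The example also shows how a poor initialization converges to a worse solution.

```python
from math import dist


def kmeans(points, initial_centers, max_iters=100):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = list(initial_centers)
    clusters = []
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k, members in enumerate(clusters):
            if members:
                new_centers.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                new_centers.append(centers[k])  # keep the center of an empty cluster
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, clusters


def sse(centers, clusters):
    """Sum of squared distances from each point to its cluster center."""
    return sum((a - b) ** 2
               for c, members in zip(centers, clusters)
               for p in members
               for a, b in zip(p, c))


# Toy data (assumed): corners of a wide, flat rectangle.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = kmeans(points, [(0.0, 0.0), (10.0, 0.0)])  # splits left vs. right
bad = kmeans(points, [(5.0, 0.0), (5.0, 2.0)])    # stuck splitting bottom vs. top
print(sse(*good), sse(*bad))
```

Both runs converge, but the second stops at a configuration with a much larger squared error, which is why K-means is often restarted from several initializations.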
How can we decide K?
Discussion on K-means:
K-medoid clustering
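Unlike K-means, the "center" of a cluster in K-medoid clustering must be an actual data point rather than a virtual center. This point should be the one in the cluster whose total distance to the other points in that cluster is smallest.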
The basic strategy:
- First, arbitrarily choose a representative object (medoid) for each cluster
- Iteration:
- Assign each remaining object to the medoid to which it is most similar
- Replace one of the medoids with one of the non-medoids as long as the quality of the resulting clustering improves (quality is estimated by a cost function: the average dissimilarity between each object and its medoid)
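The strategy above can be sketched as a greedy swap search. This is a minimal illustration, assuming Euclidean distance as the dissimilarity and initial medoids supplied by the caller. The function names and the toy data are hypothetical, and practical implementations use more efficient swap evaluation.

```python
from math import dist


def cost(points, medoids):
    """Average dissimilarity between each object and its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points) / len(points)


def assign(points, medoids):
    """Group each object with the medoid to which it is most similar."""
    clusters = {m: [] for m in medoids}
    for p in points:
        clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
    return clusters


def k_medoids(points, initial_medoids):
    """Swap a medoid for a non-medoid whenever the swap lowers the cost."""
    medoids = list(initial_medoids)
    best = cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(len(medoids)):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                trial_cost = cost(points, trial)
                if trial_cost < best:  # keep only improving swaps
                    medoids, best, improved = trial, trial_cost, True
    return medoids, assign(points, medoids)


# Toy data (assumed): two groups of three points on a line.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
          (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
medoids, clusters = k_medoids(points, initial_medoids=[(0.0, 0.0), (1.0, 0.0)])
print(sorted(medoids))
```

Because every medoid is an actual data point, only pairwise dissimilarities are needed, so the same sketch works with any dissimilarity function in place of `dist`.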