统计了14天的气象数据D(指标包括outlook,temperature,humidity,windy),并已知这些天气是否打球(play)。如果给出新一天的气象指标数据:sunny,cool,high,TRUE,判断一下会不会去打球。
outlook | temperature | humidity | windy | play |
sunny | hot | high | FALSE | no |
sunny | hot | high | TRUE | no |
overcast | hot | high | FALSE | yes |
rainy | mild | high | FALSE | yes |
rainy | cool | normal | FALSE | yes |
rainy | cool | normal | TRUE | no |
overcast | cool | normal | TRUE | yes |
sunny | mild | high | FALSE | no |
sunny | cool | normal | FALSE | yes |
rainy | mild | normal | FALSE | yes |
sunny | mild | normal | TRUE | yes |
overcast | mild | high | TRUE | yes |
overcast | hot | normal | FALSE | yes |
rainy | mild | high | TRUE | no |
预备知识:信息熵
熵是无序性(或不确定性)的度量指标。假如事件A的全概率划分是(A1,A2,...,An),每部分发生的概率是(p1,p2,...,pn),那信息熵定义为:
通常以2为底数,所以信息熵的单位是bit。
C4.5算法
构造树的基本想法是随着树深度的增加,节点的熵迅速地降低。熵降低的速度越快越好,这样我们有望得到一棵高度最矮的决策树。
在没有给定任何天气信息时,根据历史数据,我们只知道新的一天打球的概率是9/14,不打的概率是5/14。此时的熵为:
Info(D) = -9/14 * log2(9/14) - 5/14 * log2(5/14) = 0.940
属性有4个:outlook,temperature,humidity,windy。我们首先要决定哪个属性作树的根节点。
对每项指标分别统计:在不同的取值下打球和不打球的次数。
outlook | temperature | humidity | windy | play | |||||||||
yes | no | yes | no | yes | no | yes | no | yes | no | ||||
sunny | 2 | 3 | hot | 2 | 2 | high | 3 | 4 | FALSE | 6 | 2 | 9 | 5 |
overcast | 4 | 0 | mild | 4 | 2 | normal | 6 | 1 | TRUR | 3 | 3 | ||
rainy | 3 | 2 | cool | 3 | 1 |
下面对属性集中每个属性分别计算信息熵,如下所示:
Info(outlook) = 5/14 * [- 2/5 * log2(2/5) – 3/5 * log2(3/5)] + 4/14 * [ - 4/4 * log2(4/4) - 0/4 * log2(0/4)] + 5/14 * [ - 3/5 * log2(3/5) – 2/5 * log2(2/5)] = 0.694
Info(temperature) = 4/14 * [- 2/4 * log2(2/4) – 2/4 * log2(2/4)] + 6/14 * [ - 4/6 * log2(4/6) - 2/6 * log2(2/6)] + 4/14 * [ - 3/4 * log2(3/4) – 1/4 * log2(1/4)] = 0.911
Info(huminity) = 7/14 * [- 3/7 * log2(3/7) – 4/7 * log2(4/7)] + 7/14 * [ - 6/7 * log2(6/7) - 1/7 * log2(1/7)] = 0.789
Info(windy) = 6/14 * [- 3/6 * log2(3/6) – 3/6 * log2(3/6)] + 8/14 * [ - 6/8 * log2(6/8) - 2/8 * log2(2/8)] = 0.892
根据上面的数据,我们可以计算选择第一个根结点所依赖的信息增益值,计算如下所示:
gain(outlook) = Info(D) - Info(outlook) = 0.940 - 0.694 = 0.246
gain(temperature) = Info(D) - Info(temperature) = 0.940 - 0.911 = 0.029
gain(huminity) = Info(D) - Info(huminity) = 0.940 - 0.789 = 0.151
gain(windy) = Info(D) - Info(windy) = 0.940 - 0.892 = 0.048
接下来,我们计算分裂信息度量H(V):
- outlook属性
属性outlook有3个取值,其中sunny有5个样本、rainy有5个样本、overcast有4个样本,则
H(outlook) = - 5/14 * log2(5/14) - 5/14 * log2(5/14) - 4/14 * log2(4/14) = 1.577406282852345
- temperature属性
属性temperature有3个取值,其中Hot有4个样本、Mild有6个样本、Cool有4个样本,则
H(temperature) = - 4/14 * log2(4/14) - 6/14 * log2(6/14) - 4/14 * log2(4/14) = 1.5566567074628228
- huminity属性
属性huminity有2个取值,其中Normal有7个样本、High有7个样本,则
H(huminity) = - 7/14 * log2(7/14) - 7/14 * log2(7/14) = 1.0
- windy属性
属性windy有2个取值,其中True有6个样本、False有8个样本,则
H(windy) = - 6/14 * log2(6/14) - 8/14 * log2(8/14) = 0.9852281360342516
根据上面计算结果,我们可以计算信息增益率,如下所示:
IGR(outlook) = Info(outlook) / H(outlook) = 0.246/1.577406282852345 = 0.15595221261270145
IGR(temperature) = Info(temperature) / H(temperature) = 0.029 / 1.5566567074628228 = 0.018629669509642094
IGR(huminity) = Info(huminity) / H(huminity) = 0.151/1.0 = 0.151
IGR(windy) = Info(windy) / H(windy) = 0.048/0.9852281360342516 = 0.048719680492692784
所以我们可以选出第一个根节点是outlook
最后得到的决策树为:
参考文献:
[1]http://blog.csdn.net/xuxurui007/article/details/18045943
[2]http://www.cnblogs.com/zhangchaoyang/articles/2842490.html