The concept of a decision tree is actually not hard to grasp. The figure below shows the decision tree a young woman uses when deciding whether to meet a blind date:
The basic idea: you have a pile of records, each described by several attributes and ending with a class label (meet or don't meet). Each attribute can be used to partition the records (for example, age > 30 versus age <= 30), and the tree built out of such splits is what we call a decision tree. The decision rules sit right on the nodes, so the model is easy to understand and classifies well.
So why does the root node test age rather than looks? The implementation here uses the ID3 algorithm: when choosing which attribute to place at a node, it applies an information-theoretic criterion, the information gain. Information gain is the entropy of the original dataset minus the entropy of the dataset after splitting on a given attribute. The larger the gain, the purer the subsets produced by that attribute, so we pick the attribute with the largest information gain as the label of the current node and then build the subtrees recursively.
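As a quick hand-worked illustration (the counts here are made up for the example): suppose 5 records carry 2 'yes' and 3 'no' labels, and some attribute splits them into one subset of {2 'yes', 1 'no'} and another of {2 'no'}:
from math import log
baseEntropy = -(2/5.0)*log(2/5.0,2) - (3/5.0)*log(3/5.0,2)                        # ~0.971
afterSplit  = (3/5.0)*(-(2/3.0)*log(2/3.0,2) - (1/3.0)*log(1/3.0,2)) + (2/5.0)*0  # ~0.551
infoGain    = baseEntropy - afterSplit                                            # ~0.420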
First, build the dataset:
import operator           # used by classify() below to sort vote counts
from math import log      # used by calcShannonEnt() below

def createDataSet():
    dataSet = [[1,1,'yes'],[1,1,'yes'],[1,0,'no'],[0,1,'no'],[0,1,'no']]
    features = ['no surfacing','flippers']
    return dataSet,features
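Each row is the two feature values ('no surfacing', 'flippers') followed by the class label, for example:
dataSet, features = createDataSet()
print dataSet[0]   # [1, 1, 'yes']: 'no surfacing' = 1, 'flippers' = 1, class 'yes'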
Build the decision tree (recursively, using Python dicts; the code is fairly self-explanatory):
def treeGrowth(dataSet,features):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0])==len(classList):
        return classList[0]          # all samples share one class: make a leaf
    if len(dataSet[0])==1:           # no more features
        return classify(classList)   # fall back to majority vote
    bestFeat = findBestSplit(dataSet)    # bestFeat is the index of the best feature
    bestFeatLabel = features[bestFeat]
    myTree = {bestFeatLabel:{}}
    featValues = [example[bestFeat] for example in dataSet]
    uniqueFeatValues = set(featValues)
    del (features[bestFeat])
    for values in uniqueFeatValues:
        subDataSet = splitDataSet(dataSet,bestFeat,values)
        # pass a copy of features so deletions in one branch do not affect its siblings
        myTree[bestFeatLabel][values] = treeGrowth(subDataSet,features[:])
    return myTree
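The tree that comes back is a nested dict: a top-level key is a feature name, the keys one level down are that feature's values, and the values are either deeper subtrees or class-label leaves. A small subtree, written here just for illustration, is read by indexing:
subtree = {'flippers': {0: 'no', 1: 'yes'}}
print subtree['flippers'][1]   # 'yes'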
When no features are left but the remaining samples do not all share the same class, fall back to the class that occurs most often:
def classify(classList):
    '''
    find the most common class in the list
    '''
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.iteritems(),key = operator.itemgetter(1),reverse = True)
    return sortedClassCount[0][0]
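A quick check (like the rest of the code here, this assumes Python 2 because of iteritems):
print classify(['yes','no','no'])   # 'no', the majority class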
Find the best attribute to split on (iterate over all attributes and compute the information gain of each):
def findBestSplit(dataset):
    numFeatures = len(dataset[0])-1
    baseEntropy = calcShannonEnt(dataset)
    bestInfoGain = 0.0
    bestFeat = -1
    for i in range(numFeatures):
        featValues = [example[i] for example in dataset]
        uniqueFeatValues = set(featValues)
        newEntropy = 0.0
        for val in uniqueFeatValues:
            subDataSet = splitDataSet(dataset,i,val)
            prob = len(subDataSet)/float(len(dataset))
            newEntropy += prob*calcShannonEnt(subDataSet)
        if (baseEntropy - newEntropy) > bestInfoGain:
            bestInfoGain = baseEntropy - newEntropy
            bestFeat = i
    return bestFeat
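On the toy dataset from createDataSet() the numbers work out roughly as follows (rounded; splitDataSet and calcShannonEnt are defined further down):
dataset, features = createDataSet()
# base entropy of 5 samples (2 'yes', 3 'no'): ~0.971
# split on 'no surfacing' (index 0): weighted entropy ~0.551, gain ~0.420
# split on 'flippers'     (index 1): weighted entropy ~0.800, gain ~0.171
print findBestSplit(dataset)   # 0, i.e. 'no surfacing'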
Once the split attribute is chosen, split the dataset on it:
def splitDataSet(dataset,feat,values):
    retDataSet = []
    for featVec in dataset:
        if featVec[feat] == values:
            # keep the matching row but drop the column we split on
            reducedFeatVec = featVec[:feat]
            reducedFeatVec.extend(featVec[feat+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
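For example, keeping only the rows where 'no surfacing' (column 0) equals 1, with that column removed:
dataset, features = createDataSet()
print splitDataSet(dataset, 0, 1)   # [[1, 'yes'], [1, 'yes'], [0, 'no']]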
Compute the entropy of a dataset:
def calcShannonEnt(dataset):
    numEntries = len(dataset)
    labelCounts = {}
    for featVec in dataset:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        if prob != 0:
            shannonEnt -= prob*log(prob,2)
    return shannonEnt
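For the full toy dataset (2 'yes' vs 3 'no') this gives:
dataset, features = createDataSet()
print calcShannonEnt(dataset)   # ~0.971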
Finally, classify new data with the tree built above:
def predict(tree,newObject):
    # walk down the nested dict until we reach a leaf (a plain class label)
    while isinstance(tree,dict):
        key = tree.keys()[0]
        tree = tree[key][newObject[key]]
    return tree

if __name__ == '__main__':
    dataset,features = createDataSet()
    tree = treeGrowth(dataset,features)
    print tree
    print predict(tree,{'no surfacing':1,'flippers':1})
    print predict(tree,{'no surfacing':1,'flippers':0})
    print predict(tree,{'no surfacing':0,'flippers':1})
    print predict(tree,{'no surfacing':0,'flippers':0})
The output is as follows.
The decision tree looks like this:
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
The four predictions:
yes
no
no
no
These match the class labels in the given dataset (the prediction inputs were drawn from that same dataset; of course, with more data you would normally use part of it for training and hold out the rest for testing).
To sum up the pros and cons of ID3:
Pros: simple to implement; the rules it produces, when drawn as a tree, are clear and easy to interpret; classification performance is good.
Cons: it only handles discrete attribute values (use C4.5 for continuous ones), and when choosing the best split attribute it tends to favor attributes with many distinct values.
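To see that second weakness concretely, here is an illustrative (made-up) variant of the toy dataset with an extra 'id' column whose value is unique per sample. Splitting on it yields perfectly pure one-sample subsets, so its information gain equals the full entropy of the set, the maximum possible, and findBestSplit above would pick it even though it tells us nothing about new samples:
datasetWithId = [[0,1,1,'yes'],[1,1,1,'yes'],[2,1,0,'no'],[3,0,1,'no'],[4,0,1,'no']]
print findBestSplit(datasetWithId)   # 0: the unique-valued id column wins, gain ~0.971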