The C4.5 algorithm is an improvement on the ID3 algorithm: during decision-tree generation it uses the information gain ratio as the criterion for attribute selection. The concrete steps are as follows:
Input: training dataset D, feature set A, threshold e
Output: decision tree T
1. If all instances in D belong to the same class C, set T to a single-node tree, take C as the class of that node, and return T.
2. If A = ∅, set T to a single-node tree, take the class C with the most instances in D as the class of that node, and return T.
3. Otherwise, compute the information gain ratio of each feature in A with respect to D, and select the feature Ak with the largest information gain ratio.
4. If the information gain ratio of Ak is less than the threshold e, set T to a single-node tree, take the class C with the most instances in D as the class of that node, and return T.
5. Otherwise, for each possible value ai of Ak, partition D by Ak = ai into non-empty subsets Di; make Ak a node with one branch per value ai, build a child node for each branch, let T be the tree formed by this node and its children, and return T.
6. For each child node i, take Di as the training set and A − {Ak} as the feature set, and recursively apply steps (1)–(5) to obtain the subtree Ti; return Ti.
Comparing these steps with ID3 shows that the two algorithms proceed almost identically: the only change is that the information gain used by ID3 in steps 3 and 4 is replaced with the information gain ratio in C4.5. All other steps are exactly the same.
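Concretely, the information gain ratio normalizes the information gain by the split information (intrinsic value) of the attribute:

$$\mathrm{GainRatio}(D, A_k) = \frac{\mathrm{Gain}(D, A_k)}{\mathrm{SplitInfo}_{A_k}(D)}, \qquad \mathrm{SplitInfo}_{A_k}(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$

where D1, …, Dn are the subsets induced by the n distinct values of Ak. The denominator grows with the number of values, which counteracts ID3's tendency to favor many-valued attributes.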
Code Implementation
import math

# Load the dataset: one instance per line, fields separated by single spaces
def loadDataSet(dataPath):
    dataset = []
    with open(dataPath) as file:
        for line in file:
            values = line.strip().split(' ')
            dataset.append(values)
    return dataset
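As a usage sketch (data.txt is a hypothetical file; each line holds the attribute values followed by the class label, separated by single spaces):

# data.txt (hypothetical):
#   sunny hot no
#   rainy cool yes
dataset = loadDataSet('data.txt')
# -> [['sunny', 'hot', 'no'], ['rainy', 'cool', 'yes']]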
# Return the subset of rows whose attribute at `axis` equals `value`,
# with that attribute column removed
def splitDataSet(dataset, axis, value):
    retDataSet = []
    for featVec in dataset:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
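For example, keeping the rows where attribute 0 equals 'sunny' and dropping that column:

rows = [['sunny', 'hot', 'no'], ['sunny', 'cool', 'yes'], ['rainy', 'mild', 'yes']]
splitDataSet(rows, 0, 'sunny')
# -> [['hot', 'no'], ['cool', 'yes']]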
# Shannon entropy of the dataset's class labels (last column of each row)
def calShannonEnt(dataset):
    numEntries = float(len(dataset))
    labelCounts = {}
    for featVec in dataset:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = labelCounts[key] / numEntries
        shannonEnt -= prob * math.log(prob, 2)
    return shannonEnt
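A quick sanity check: a dataset whose class labels split evenly between two classes has an entropy of exactly 1 bit:

calShannonEnt([['a', 'yes'], ['b', 'yes'], ['c', 'no'], ['d', 'no']])
# -> 1.0, since -(1/2)*log2(1/2) - (1/2)*log2(1/2) = 1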
# Information gain from splitting the dataset on the given attribute,
# relative to the entropy of the unsplit dataset
def InfoGain(dataset, axis, baseShannonEnt):
    featList = [example[axis] for example in dataset]
    uniqueVals = set(featList)
    newShannonEnt = 0.0
    numEntries = float(len(dataset))
    for value in uniqueVals:
        subDataSet = splitDataSet(dataset, axis, value)
        prob = len(subDataSet) / numEntries
        newShannonEnt += prob * calShannonEnt(subDataSet)
    return baseShannonEnt - newShannonEnt
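A small worked example: with classes evenly split, splitting on an attribute that separates them perfectly recovers the full 1 bit of entropy:

toy = [['sunny', 'no'], ['sunny', 'no'], ['rainy', 'yes'], ['rainy', 'yes']]
base = calShannonEnt(toy)    # 1.0: classes are evenly split
InfoGain(toy, 0, base)       # 1.0: both subsets after the split are pure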
# Split information (intrinsic value) of an attribute: the entropy of the
# attribute's value distribution
def SplitInfo(dataset, axis):
    numEntries = float(len(dataset))
    valueCounts = {}
    for featVec in dataset:
        value = featVec[axis]
        if value not in valueCounts:
            valueCounts[value] = 0
        valueCounts[value] += 1
    ent = 0.0
    for key in valueCounts:
        prob = valueCounts[key] / numEntries
        ent -= prob * math.log(prob, 2)
    return ent
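On the same toy data, the two equally frequent attribute values give a split information of 1 bit; more generally, an identifier-like attribute with a distinct value per instance has SplitInfo = log2(N), which is the penalty that keeps the gain ratio from favoring many-valued attributes:

SplitInfo(toy, 0)
# -> 1.0: two values, each covering half the instances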
# Information gain ratio of an attribute
def GainRate(dataset, axis, baseShannonEnt):
    infoGain = InfoGain(dataset, axis, baseShannonEnt)
    # SplitInfo is computed on the current dataset, so attribute indices
    # stay aligned after splitDataSet has removed columns
    splitInfo = SplitInfo(dataset, axis)
    if splitInfo == 0.0:
        # single-valued attribute: its gain is 0 as well, avoid dividing by zero
        return 0.0
    return infoGain / splitInfo
# Select the attribute with the largest information gain ratio
def ChooseBestFeatureByGainRate(dataset):
    numFeature = len(dataset[0]) - 1
    baseShannonEnt = calShannonEnt(dataset)
    bestGainRate = 0.0
    bestFeature = -1
    for i in range(numFeature):
        gainRate = GainRate(dataset, i, baseShannonEnt)
        if gainRate > bestGainRate:
            bestGainRate = gainRate
            bestFeature = i
    return bestFeature
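createTree below falls back to a majority vote through majorityCnt, which the original listing never defines. A minimal sketch using collections.Counter:

from collections import Counter

# Return the most frequent class label in the list (majority vote)
def majorityCnt(classList):
    return Counter(classList).most_common(1)[0][0]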
# Build the decision tree recursively as nested dictionaries
def createTree(dataset, labels):
    classList = [example[-1] for example in dataset]
    # All instances belong to one class: return a leaf (step 1)
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # Only the class column remains, no attributes: majority vote (step 2)
    if len(dataset[0]) == 1:
        return majorityCnt(classList)
    bestFeature = ChooseBestFeatureByGainRate(dataset)
    # No attribute achieves a positive gain ratio: majority vote (step 4)
    if bestFeature == -1:
        return majorityCnt(classList)
    bestFeatureLabel = labels[bestFeature]
    myTree = {bestFeatureLabel: {}}
    del labels[bestFeature]
    featValues = [example[bestFeature] for example in dataset]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy so each branch works on its own label list
        myTree[bestFeatureLabel][value] = \
            createTree(splitDataSet(dataset, bestFeature, value), subLabels)
    return myTree
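As an end-to-end sketch on a tiny hypothetical dataset (the labels list names the attribute columns; note that createTree mutates it):

dataset = [
    ['sunny', 'hot',  'no'],
    ['sunny', 'cool', 'no'],
    ['rainy', 'hot',  'yes'],
    ['rainy', 'cool', 'yes'],
]
labels = ['outlook', 'temperature']
print(createTree(dataset, labels))
# -> {'outlook': {'sunny': 'no', 'rainy': 'yes'}}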