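Study notes on 《统计学习方法》 (Statistical Learning Methods, 李航)
【1】Chapter 1: Introduction to Statistical Learning Methods
Section 1.3.2, item 2, "Empirical risk minimization and structural risk minimization",
states: "When the model is a conditional probability distribution and the loss function is the log loss, empirical risk minimization is equivalent to maximum likelihood estimation." (P9)
Personal note: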
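\[\text{Hypothesis space: } \mathcal{F}=\{P \mid P(Y|X;\theta)\} \] \[\text{Likelihood: } L(\theta)=\prod_{i=1}^{N}P(y_i|x_i;\theta) \] \[\text{Log-likelihood: } \log L(\theta)=\sum_{i=1}^{N}\log P(y_i|x_i;\theta) \]With the log loss, the loss on one sample is the negative log of the probability the model assigns to it, so
\[\text{Empirical risk: } -\frac{1}{N}\sum_{i=1}^{N}\log P(y_i|x_i;\theta)=-\frac{1}{N}\log L(\theta) \]Since N is a constant, minimizing the empirical risk is equivalent to maximizing the log-likelihood.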
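The equivalence can be checked numerically. The sketch below is only an illustration under assumptions that are not in the book: the toy labels `y` are made up, and the "model" is a single Bernoulli parameter `theta = P(y = 1)` searched over a grid. It evaluates the log-likelihood and the log-loss empirical risk on the same grid and compares the maximizer of one with the minimizer of the other.

```python
import numpy as np

# Hypothetical toy data (an assumption for illustration): 7 ones and 3 zeros.
y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
N = len(y)

# Candidate parameter values theta = P(y = 1), step 0.01.
thetas = np.linspace(0.01, 0.99, 99)

def log_likelihood(theta):
    """Sum of log P(y_i; theta) under a Bernoulli model."""
    p = np.where(y == 1, theta, 1.0 - theta)
    return np.log(p).sum()

def empirical_risk(theta):
    """Average log loss, i.e. -(1/N) * log-likelihood."""
    p = np.where(y == 1, theta, 1.0 - theta)
    return -np.log(p).mean()

ll = np.array([log_likelihood(t) for t in thetas])
risk = np.array([empirical_risk(t) for t in thetas])

theta_mle = thetas[ll.argmax()]    # maximizes the log-likelihood
theta_erm = thetas[risk.argmin()]  # minimizes the empirical risk
print(theta_mle, theta_erm)        # both land on the grid point next to 7/10
```

Both searches pick the same grid point, because the two curves differ only by the constant factor -1/N.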
【2】Section 1.4.2, Example 1.1: partial derivative with respect to w_j (P12)
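As a numerical companion to this example, here is a minimal sketch of least-squares polynomial fitting. Setting the gradient of the squared-error loss to zero gives a linear system in all the coefficients w_0..w_M, with A[j, k] = sum_i x_i^(j+k) and b[j] = sum_i x_i^j y_i. The training data, the seed and the degree `M = 3` are assumptions chosen for illustration, and `np.polyfit` is used only as an independent cross-check.

```python
import numpy as np

# Hypothetical training data (an assumption): noisy samples of sin(2*pi*x).
rng = np.random.default_rng(0)
N, M = 10, 3
x = np.linspace(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=N)

# Design matrix with X[i, j] = x_i^j for j = 0..M.
X = x[:, None] ** np.arange(M + 1)[None, :]

# Zero-gradient conditions: A w = b with
#   A[j, k] = sum_i x_i^(j+k),  b[j] = sum_i x_i^j * y_i
A = X.T @ X
b = X.T @ y
w = np.linalg.solve(A, b)

# Gradient of L(w) = 1/2 * sum_i (sum_j w_j x_i^j - y_i)^2 at the solution.
grad = X.T @ (X @ w - y)

# Independent cross-check; polyfit returns the highest degree first.
w_ref = np.polyfit(x, y, M)[::-1]

print(w)
print(np.abs(grad).max())
```

The gradient is zero up to rounding error, and the coefficients agree with `np.polyfit`, which is consistent with the coefficients being determined jointly rather than one at a time.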
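\[L(w)=\frac{1}{2}\sum_{i=1}^{N}\Big(\sum_{j=0}^{M}w_jx_i^j-y_i\Big)^2 \]Setting the partial derivative with respect to w_j to zero:
\[\sum_{i=1}^{N}\Big(\sum_{k=0}^{M}w_kx_i^k-y_i\Big)x_i^j=0 \\ \text{i.e. } \sum_{k=0}^{M}w_k\sum_{i=1}^{N}x_i^{j+k}=\sum_{i=1}^{N}x_i^jy_i,\quad j=0,1,\dots,M \]These M+1 linear equations couple all the coefficients, so the w_j have to be solved for jointly; only for M=0 does the system reduce to a single quotient, where w_0 is the mean of the y_i.
《机器学习》 (Machine Learning, the "watermelon book")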
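【1】Chapter 4: the three conditions that end the recursion in the decision-tree ID3 algorithm
(1) All samples in the child node belong to the same class.
(2) The child node has no features left to split on.
(3) The child node contains no samples.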
https://www.jianshu.com/p/d153130b813f
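The three stopping conditions can be sketched as a small recursive function. This is a minimal illustration for categorical features, not the book's pseudocode: the names `id3`, `majority` and `domains`, the toy rows, and the choice to label an empty child with the parent's majority class are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def majority(labels):
    """Most frequent class label."""
    return Counter(labels).most_common(1)[0][0]

def id3(rows, labels, features, domains):
    """rows: list of dicts; features: names still available;
    domains: every possible value of each feature."""
    # (1) All samples belong to the same class: return that class.
    if len(set(labels)) == 1:
        return labels[0]
    # (2) No features left: return the majority class.
    if not features:
        return majority(labels)

    def gain(f):
        remainder = 0.0
        for v in domains[f]:
            sub = [l for r, l in zip(rows, labels) if r[f] == v]
            if sub:
                remainder += len(sub) / len(labels) * entropy(sub)
        return entropy(labels) - remainder

    best = max(features, key=gain)
    rest = [f for f in features if f != best]
    children = {}
    for v in domains[best]:
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        if not idx:
            # (3) The child has no samples: leaf with the parent's majority.
            children[v] = majority(labels)
        else:
            children[v] = id3([rows[i] for i in idx],
                              [labels[i] for i in idx], rest, domains)
    return {best: children}

# Hypothetical toy data; "white" never occurs, so it triggers condition (3).
rows = [
    {"color": "green", "sound": "dull"},
    {"color": "green", "sound": "crisp"},
    {"color": "black", "sound": "dull"},
    {"color": "black", "sound": "dull"},
    {"color": "black", "sound": "dull"},
]
labels = ["y", "n", "y", "n", "y"]
domains = {"color": ["green", "black", "white"], "sound": ["dull", "crisp"]}
tree = id3(rows, labels, ["color", "sound"], domains)
print(tree)
```

In this toy run the "crisp" and "green" branches stop by condition (1), the "black" branch by condition (2) because its three samples disagree after both features are used, and the "white" branch by condition (3).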
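【2】Chapter 4: handling continuous values in decision trees
Watermelon book, handling continuous-valued attributes in decision trees (P83~85)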
import numpy as np
import pandas as pd
The watermelon densities and their labels are as follows:
density=[(0.697,'y'),(0.774,'y'),(0.643,'y'),(0.608,'y'),(0.556,'y'),
(0.403,'y'),(0.481,'y'),(0.437,'y'),(0.666,'n'),(0.243,'n'),(0.245,'n'),
(0.343,'n'),(0.639,'n'),(0.657,'n'),(0.36,'n'),(0.593,'n'),(0.719,'n')]
df=pd.DataFrame(density,columns=['density','good_or_bad'])
df
| | density | good_or_bad |
---|---|---|
0 | 0.697 | y |
1 | 0.774 | y |
2 | 0.643 | y |
3 | 0.608 | y |
4 | 0.556 | y |
5 | 0.403 | y |
6 | 0.481 | y |
7 | 0.437 | y |
8 | 0.666 | n |
9 | 0.243 | n |
10 | 0.245 | n |
11 | 0.343 | n |
12 | 0.639 | n |
13 | 0.657 | n |
14 | 0.360 | n |
15 | 0.593 | n |
16 | 0.719 | n |
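Information entropy of data set D:
def calEnt(df):
    """
    Compute the information entropy of the labels ('y'/'n') in the second column.
    """
    good=(df.iloc[:,1]=='y').sum()
    bad=(df.iloc[:,1]=='n').sum()
    good_ratio=good/(good+bad)
    bad_ratio=1-good_ratio
    # A pure subset has zero entropy; returning early also avoids log2(0)
    if 1 in [good_ratio,bad_ratio]:
        return 0
    ent=-(np.log2(good_ratio)*good_ratio+np.log2(bad_ratio)*bad_ratio)
    return ent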
ent_D=calEnt(df);ent_D
0.9975025463691153
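As a hand check of the value above, the entropy definition can be applied directly to the class counts read off the table (8 samples labelled 'y' and 9 labelled 'n' out of 17):
\[Ent(D)=-\Big(\frac{8}{17}\log_2\frac{8}{17}+\frac{9}{17}\log_2\frac{9}{17}\Big)\approx 0.9975 \]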
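Take the candidate split points by bi-partition:
x1=df.iloc[:,0].values.copy();x1 # Use copy() here! Otherwise x1 may be a view, and the in-place sort below would reorder the values stored in df.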
array([0.697, 0.774, 0.643, 0.608, 0.556, 0.403, 0.481, 0.437, 0.666,
0.243, 0.245, 0.343, 0.639, 0.657, 0.36 , 0.593, 0.719])
x1.sort()
t_ready=(x1[:-1]+x1[1:])/2;t_ready
array([0.244 , 0.294 , 0.3515, 0.3815, 0.42 , 0.459 , 0.5185, 0.5745,
0.6005, 0.6235, 0.641 , 0.65 , 0.6615, 0.6815, 0.708 , 0.7465])
a,b=np.array([1,2]),np.array([3,4])
Compute Gain(D,a):
def get_gain(df,t_ready):
    """
    Compute the information gain of a continuous attribute at every candidate split point.
    df: data set with the continuous attribute in the first column and the label in the second
    t_ready: candidate split points obtained by bi-partition
    """
    result=[]
    ent_D=calEnt(df)
    df_count=len(df)
    for i in t_ready:
        # Split the samples at threshold i
        small_part=df[df.iloc[:,0]<=i]
        large_part=df[df.iloc[:,0]>i]
        small_part_ent=calEnt(small_part)
        large_part_ent=calEnt(large_part)
        small_count=len(small_part)
        large_count=len(large_part)
        # Weight each subset's entropy by its share of the samples
        ratio_group=np.array([small_count/df_count,large_count/df_count])
        ent_group=np.array([small_part_ent,large_part_ent])
        gain=ent_D-(ratio_group*ent_group).sum()
        result.append((i,gain))
    return result
results=get_gain(df,t_ready)
results.sort(key=lambda x:x[1])
results
[(0.708, 0.00033345932649475607),
(0.6615, 0.0007697888924075302),
(0.641, 0.0013653507075551685),
(0.5745, 0.002226985278291793),
(0.6005, 0.002226985278291793),
(0.5185, 0.003585078590305879),
(0.6234999999999999, 0.003585078590305879),
(0.65, 0.006046489176565584),
(0.6815, 0.024085993037174735),
(0.45899999999999996, 0.03020211515891169),
(0.244, 0.05632607578088),
(0.7464999999999999, 0.06696192680347068),
(0.42000000000000004, 0.0934986902367243),
(0.29400000000000004, 0.1179805181500242),
(0.35150000000000003, 0.18613819904679052),
(0.3815, 0.2624392604045632)]
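Therefore the information gain of the attribute "density" is 0.262, at split point 0.381 (0.3815 in the output above), which matches the watermelon book!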
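The winning split can also be checked by hand. The four samples with density at most 0.3815 (0.243, 0.245, 0.343, 0.360) are all labelled 'n', so that subset has entropy 0; the remaining 13 samples contain 8 'y' and 5 'n'. Writing D_t^- and D_t^+ for the two subsets (notation assumed here for the lower and upper part):
\[Ent(D_t^+)=-\Big(\frac{8}{13}\log_2\frac{8}{13}+\frac{5}{13}\log_2\frac{5}{13}\Big)\approx 0.9612 \\ Gain(D,a,t)=Ent(D)-\frac{4}{17}Ent(D_t^-)-\frac{13}{17}Ent(D_t^+)\approx 0.9975-0-0.7351\approx 0.262 \]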
Now verify the sugar content:
suger=[(0.46,'y'),(0.376,'y'),(0.264,'y'),(0.318,'y'),(0.215,'y'),(0.237,'y'),(0.149,'y'),(0.211,'y'),
(0.091,'n'),(0.267,'n'),(0.057,'n'),(0.099,'n'),(0.161,'n'),(0.198,'n'),(0.37,'n'),(0.042,'n'),(0.103,'n')]
df_suger=pd.DataFrame(suger,columns=['suger','good_or_bad'])
xx=df_suger.iloc[:,0].values.copy()
xx.sort()
suger_ready=(xx[1:]+xx[:-1])/2
results=get_gain(df_suger,suger_ready)
results.sort(key=lambda x:x[1]);results
[(0.2655, 0.02025677859928121),
(0.344, 0.024085993037174735),
(0.0495, 0.05632607578088),
(0.2505, 0.06150029019652836),
(0.41800000000000004, 0.06696192680347068),
(0.2925, 0.0715502899435908),
(0.074, 0.1179805181500242),
(0.22599999999999998, 0.12369354800009502),
(0.373, 0.14078143361499595),
(0.155, 0.15618502398692902),
(0.095, 0.18613819904679052),
(0.213, 0.21114574906025385),
(0.1795, 0.2354661674053965),
(0.101, 0.2624392604045632),
(0.20450000000000002, 0.33712865788827096),
(0.126, 0.34929372233065203)]
Therefore the information gain of the attribute "sugar content" is 0.349, at split point 0.126, which matches the watermelon book!
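Both results can be cross-checked with a more compact routine that works on plain arrays. The sketch below is an alternative formulation, not the code used above: the names `entropy` and `best_split` are made up for this example, the entropy handles any number of classes, and sorting once lets each candidate split be taken as a prefix and a suffix of the sorted labels.

```python
import numpy as np

def entropy(labels):
    """Information entropy of an array of class labels (any number of classes)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(values, labels):
    """Return (threshold, gain) of the bi-partition with the largest information gain."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(values)
    v, l = values[order], labels[order]
    base = entropy(l)
    best_t, best_gain = None, -np.inf
    for k in range(1, len(v)):
        if v[k] == v[k - 1]:
            continue  # equal neighbours give no usable midpoint
        t = (v[k - 1] + v[k]) / 2
        left, right = l[:k], l[k:]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(l)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Same density and sugar-content data as above, in the same sample order.
labels = ['y'] * 8 + ['n'] * 9
density = [0.697, 0.774, 0.643, 0.608, 0.556, 0.403, 0.481, 0.437, 0.666,
           0.243, 0.245, 0.343, 0.639, 0.657, 0.36, 0.593, 0.719]
sugar = [0.46, 0.376, 0.264, 0.318, 0.215, 0.237, 0.149, 0.211, 0.091,
         0.267, 0.057, 0.099, 0.161, 0.198, 0.37, 0.042, 0.103]

t_density, gain_density = best_split(density, labels)
t_sugar, gain_sugar = best_split(sugar, labels)
print(round(t_density, 4), round(gain_density, 3))  # about 0.3815 and 0.262
print(round(t_sugar, 4), round(gain_sugar, 3))      # about 0.126 and 0.349
```

The thresholds and gains agree with the sorted outputs above for both attributes.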