Feature Importance Computation with LOFO and FLOFO

1. Introduction

Feature importance can be computed directly with some of the models that ship with sklearn.
For example, getting feature_importances_ from a RandomForest looks like this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer, load_iris
data = load_iris()
x_data = data.data
y_data = data.target
print(x_data.shape,y_data.shape)# (150, 4) (150,)
model = RandomForestClassifier()
model.fit(x_data,y_data)
model.feature_importances_ # array([0.09366231, 0.02290373, 0.44489138, 0.43854258])

After training, the importance of each feature is available through the model's feature_importances_ attribute; the larger the value, the more important the feature.
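Since the array is ordered by column index, it is often convenient to pair it with the feature names and sort. A minimal follow-up sketch, using only the objects created above plus pandas:

import pandas as pd

# pair each importance value with its feature name and sort descending
imp = pd.Series(model.feature_importances_, index=data.feature_names)
print(imp.sort_values(ascending=False))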

But some models do not come with a built-in feature_importances_ mechanism. How can we obtain feature importances for them?

2. LOFO and FLOFO

  1. LOFO

LOFO stands for Leave One Feature Out. The idea behind it is: iterate over the features, remove one at a time, train a model on the remaining features, and evaluate it on a validation set; the resulting change in performance measures how important the removed feature is.

The validation is done with KFold cross-validation: K rounds of training and prediction yield K evaluation scores, so LOFO can report both the mean and the standard deviation of each feature's importance.

If no model is passed in, the LOFO implementation in reference 1 defaults to LightGBM for the evaluation.
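To make the procedure concrete, here is a minimal hand-rolled sketch of the LOFO idea using only sklearn. It is purely illustrative; the lofo-importance package from reference 1 is what the examples below actually use:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

data = load_breast_cancer()
X, y = data.data, data.target
cv = KFold(n_splits=5, shuffle=True, random_state=666)

# baseline: cross-validated score with all features
base = cross_val_score(RandomForestClassifier(), X, y, cv=cv, scoring="f1")

# leave one feature out at a time and measure the per-fold drop
for i, name in enumerate(data.feature_names):
    X_drop = np.delete(X, i, axis=1)
    scores = cross_val_score(RandomForestClassifier(), X_drop, y, cv=cv, scoring="f1")
    diff = base - scores  # importance per fold = baseline minus score without this feature
    print(name, diff.mean(), diff.std())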

  2. FLOFO

FLOFO stands for Fast LOFO.

LOFO has to loop over the full "remove one feature, retrain and evaluate with KFold" cycle, which is time-consuming. FLOFO speeds up (simplifies) this process.

FLOFO first trains a single model on all features, then loops over the features, randomly perturbing one feature's values at a time and re-scoring with the already-trained model. No retraining is needed, so this is fast. A feature's FLOFO importance is the score before perturbation minus the score after perturbation (the score can be a metric such as AUC or ACC; the larger the drop, the more important the feature).
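This is essentially the permutation-importance idea applied to a held-out set. A minimal sketch of the plain permutation version (the actual FLOFO implementation may perturb values more carefully, but the principle described above is the same):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_val, y_train, y_val = train_test_split(
    data.data, data.target, test_size=0.3, random_state=666)

# train one model on all features
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
base = accuracy_score(y_val, model.predict(X_val))

rng = np.random.default_rng(0)
for i, name in enumerate(data.feature_names):
    X_perm = X_val.copy()
    X_perm[:, i] = rng.permutation(X_perm[:, i])  # perturb this feature only
    score = accuracy_score(y_val, model.predict(X_perm))
    print(name, base - score)  # importance = score before minus score after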

3. LOFO Example Code

Below we apply LOFO to sklearn's built-in breast_cancer dataset.

  1. Import dependencies
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.model_selection import KFold
from lofo import LOFOImportance, Dataset, plot_importance

Here we import the model, KFold, and LOFO. Note that LOFO works heavily with DataFrames, so pandas needs to be imported as well.

  2. Load the dataset

Load the breast_cancer dataset; note that the data needs to be in DataFrame format.

data = load_breast_cancer(as_frame=True)# load as dataframe
df = data.data
df['target']=data.target.values
print(df.shape)# (569, 31)

breast_cancer is a binary classification dataset, i.e. target contains only the values 0 and 1.

  3. Wrap the dataset in LOFO's Dataset format

The DataFrame must be wrapped in a Dataset object before the LOFO interfaces can be called.

dataset = Dataset(df=df, target="target", features=[col for col in df.columns if col != 'target'])

The target argument is the name of the column in df that holds the y values, and features is the list of feature column names in df.

  4. Get feature importance for an arbitrary model

Here we take RandomForestClassifier as an example:

model = RandomForestClassifier()
cv = KFold(n_splits=5, shuffle=True, random_state=666)
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="f1",model=model)
importance_df = lofo_imp.get_importance()

Using 5-fold validation, each feature ends up with 5 importance values (a quick consistency check follows the table below). The contents of importance_df are as follows:

feature importance_mean importance_std val_imp_0 val_imp_1 val_imp_2 val_imp_3 val_imp_4
26 area error 0.0104953 0.0151496 0.0263158 0.0175439 0.0175439 0.00877193 -0.0176991
23 worst perimeter 0.00878746 0.0124054 0.0175439 0 -0.00877193 0.0263158 0.00884956
29 mean smoothness 0.00704859 0.00863287 -0.00877193 0.00877193 0.00877193 0.00877193 0.0176991
24 mean texture 0.00704859 0.0170287 -0.00877193 0.0350877 0 -0.00877193 0.0176991
1 mean radius 0.00527868 0.00702537 0 0.0175439 0 0 0.00884956
16 mean compactness 0.00355535 0.0143273 0 0 0.00877193 -0.0175439 0.0265487
4 perimeter error 0.0035243 0.0105341 -0.00877193 0.0175439 -0.00877193 0.00877193 0.00884956
9 worst area 0.00175439 0.00656431 0.00877193 0.00877193 0 -0.00877193 0
11 mean symmetry 0.00175439 0.00656431 0 0.00877193 0.00877193 -0.00877193 0
3 worst fractal dimension 0.00175439 0.00350877 0 0 0 0.00877193 0
22 worst radius 0.00175439 0.0085947 -0.00877193 0.0175439 0 0 0
8 radius error 0.00173886 0.00861375 -0.00877193 0.00877193 0.00877193 0.00877193 -0.00884956
17 texture error 3.10511e-05 0.0124494 -0.00877193 0 0.00877193 -0.0175439 0.0176991
19 mean concavity 1.55255e-05 0.00962338 0 0.00877193 0 -0.0175439 0.00884956
14 fractal dimension error 0 0 0 0 0 0 0
2 mean concave points -1.55255e-05 0.00962338 0.0175439 0 -0.00877193 0 -0.00884956
7 worst concavity -4.65766e-05 0.0147618 0 0.0175439 0.00877193 0 -0.0265487
10 concavity error -0.00173886 0.00658923 0 0 -0.00877193 -0.00877193 0.00884956
0 mean fractal dimension -0.00175439 0.00350877 0 0 0 -0.00877193 0
13 worst compactness -0.00175439 0.00350877 -0.00877193 0 0 0 0
28 mean perimeter -0.00350877 0.00701754 0 0 0 -0.0175439 0
12 smoothness error -0.0035243 0.00431644 0 0 -0.00877193 0 -0.00884956
27 mean area -0.00703307 0.00656853 -0.00877193 0 0 -0.0175439 -0.00884956
18 compactness error -0.00880298 0.00788074 -0.00877193 0 -0.0175439 0 -0.0176991
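As a sanity check, importance_mean and importance_std are simply the mean and standard deviation of the five val_imp_* columns; the numbers in the table above are consistent with numpy's default population std (ddof=0). A small sketch, reusing importance_df from the step above:

import numpy as np

val_cols = [c for c in importance_df.columns if c.startswith("val_imp")]
vals = importance_df[val_cols].values
print(np.allclose(vals.mean(axis=1), importance_df["importance_mean"]))  # expect True
print(np.allclose(vals.std(axis=1), importance_df["importance_std"]))    # expect True, assuming ddof=0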

  5. Plot the feature importance chart

LOFO ships with a plotting helper that can visualize importance_df directly:

plot_importance(importance_df, figsize=(12, 20))

This produces the sorted feature importance chart:

[Figure: feature importance ranking produced by plot_importance]
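To save the chart to a file instead of only displaying it, the usual matplotlib savefig call should work; this is a small sketch that assumes plot_importance draws on the current matplotlib figure, which is an assumption about the library rather than documented behavior quoted here:

import matplotlib.pyplot as plt

plot_importance(importance_df, figsize=(12, 20))
plt.savefig("lofo_importance.png", bbox_inches="tight")  # assumption: plot_importance uses the current matplotlib figure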

  6. Testing on a multi-class dataset

We tested the iris dataset and found that the program ran without errors, but every value in importance_df was NaN. After changing scoring="f1" to scoring="f1_macro", the results came out correctly.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.model_selection import KFold
from lofo import LOFOImportance, Dataset, plot_importance

data = load_iris(as_frame=True)# load as dataframe
df = data.data
df['target']=data.target.values

# model
model = RandomForestClassifier()
# dataset
dataset = Dataset(df=df, target="target", features=[col for col in df.columns if col != 'target'])
# get feature importance
cv = KFold(n_splits=5, shuffle=True, random_state=666)
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="f1_macro",model=model)
importance_df = lofo_imp.get_importance()
print(importance_df)

  7. Full runnable code

Putting the steps above together, the complete runnable code is:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.model_selection import KFold
from lofo import LOFOImportance, Dataset, plot_importance

data = load_breast_cancer(as_frame=True)# load as dataframe
df = data.data
df['target']=data.target.values

# model
model = RandomForestClassifier()
# dataset
dataset = Dataset(df=df, target="target", features=[col for col in df.columns if col != 'target'])
# get feature importance
cv = KFold(n_splits=5, shuffle=True, random_state=666)
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="f1",model=model)
importance_df = lofo_imp.get_importance()
print(importance_df)

4. FLOFO Example Code

For Fast LOFO, simply call FLOFOImportance; reference code below:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from lofo import FLOFOImportance

# step-01: prepare data
data = load_breast_cancer(as_frame=True)  # load as dataframe
x_data = data.data.to_numpy()
y_data = data.target.values
df = data.data
df['target'] = data.target.values
# duplicate the rows, since FLOFO requires a validation_df with more than 1000 samples
df = pd.DataFrame(np.repeat(df.values, 2, axis=0), columns=df.columns)
# step-02: train a model on all features
model = RandomForestClassifier()
model.fit(x_data, y_data)
# step-03: fast-lofo with the already-trained model
lofo_imp = FLOFOImportance(validation_df=df, target="target",
                           features=[col for col in df.columns if col != 'target'],
                           scoring="f1", trained_model=model)
importance_df = lofo_imp.get_importance()
print(importance_df)

A few differences between FLOFOImportance and LOFOImportance:

  1. FLOFOImportance no longer requires wrapping the data in a Dataset structure
  2. FLOFOImportance requires a model that has already been trained before FLOFO is called
  3. The FLOFOImportance call signature differs slightly from LOFO's

Conclusion

  1. RandomForest can compute feature_importances_ directly on multi-class data
  2. LOFO uses LightGBM by default to compute feature importance
  3. The LOFO library in reference 1 supports multi-class datasets, but scoring="f1" must be changed to a multi-class metric such as scoring="f1_macro"
  4. FLOFO (Fast LOFO) runs faster than LOFO

References

  1. https://github.com/aerdem4/lofo-importance
  2. https://juejin.cn/post/7020237735516438564