1. Introduction
Feature importance can be computed directly by several of the models that ship with sklearn.
For example, RandomForest exposes it through feature_importances_:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
data = load_iris()
x_data = data.data
y_data = data.target
print(x_data.shape, y_data.shape)  # (150, 4) (150,)
model = RandomForestClassifier()
model.fit(x_data, y_data)
model.feature_importances_  # array([0.09366231, 0.02290373, 0.44489138, 0.43854258])
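After training, the model's feature_importances_ attribute holds an importance value for each feature; the larger the value, the more important the feature.
Some models, however, have no built-in mechanism for computing feature_importances_. How can feature importance be obtained for them?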
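As a quick check of this point, the sketch below fits a KNeighborsClassifier and confirms that it exposes no feature_importances_ attribute, unlike RandomForestClassifier. KNeighborsClassifier is only an illustrative choice here; it does not appear in the original text.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

x_data, y_data = load_iris(return_X_y=True)

# Tree ensembles expose feature_importances_ once they are fitted
rf_model = RandomForestClassifier(random_state=0).fit(x_data, y_data)
rf_has_importances = hasattr(rf_model, "feature_importances_")

# k-nearest neighbours (illustrative choice) has no such attribute
knn_model = KNeighborsClassifier().fit(x_data, y_data)
knn_has_importances = hasattr(knn_model, "feature_importances_")

print(rf_has_importances, knn_has_importances)
```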
2. LOFO and FLOFO
- LOFO
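LOFO stands for Leave One Feature Out. It computes feature importance by removing one feature at a time, training the model on the remaining features, and evaluating it on a validation set; the resulting change in score measures the importance of the removed feature.
Validation uses KFold: K rounds of training and prediction yield K scores, so LOFO can report both the mean and the standard deviation of each feature's importance.
If no model is passed in, the LOFO implementation in reference 1 uses LightGBM for the evaluation by default.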
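The idea can be sketched by hand with plain sklearn. This is a minimal illustration of the procedure described above, not the code of the lofo library: the helper name lofo_importance, the logistic-regression pipeline, the roc_auc metric, and the restriction to the first 6 features (for speed) are all assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def lofo_importance(model, X, y, cv, scoring):
    """Hypothetical helper: per-fold importances, shape (n_features, n_splits)."""
    # Baseline: K scores with every feature present
    base_scores = cross_val_score(model, X, y, cv=cv, scoring=scoring)
    importances = []
    for j in range(X.shape[1]):
        # Drop feature j, retrain and re-evaluate on the same folds
        X_without_j = np.delete(X, j, axis=1)
        scores = cross_val_score(model, X_without_j, y, cv=cv, scoring=scoring)
        # A positive value means the score fell when the feature was removed
        importances.append(base_scores - scores)
    return np.array(importances)


X, y = load_breast_cancer(return_X_y=True)
X = X[:, :6]  # keep only the first 6 features so the sketch runs quickly
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = KFold(n_splits=5, shuffle=True, random_state=666)

fold_importances = lofo_importance(model, X, y, cv, scoring="roc_auc")
importance_mean = fold_importances.mean(axis=1)
importance_std = fold_importances.std(axis=1)
print(importance_mean)
print(importance_std)
```

Because the KFold splitter has a fixed random_state, every call evaluates on the same folds, so the per-fold differences are comparable.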
- FLOFO
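FLOFO stands for Fast LOFO.
LOFO has to repeat the whole "remove one feature, then train and evaluate with KFold" cycle for every feature, which is slow. FLOFO is designed to speed up (simplify) this process.
FLOFO first trains a single model on all features, then iterates over the features, each time randomly perturbing the values of one feature and evaluating the already trained model on the perturbed data. No retraining is needed, so it is much faster. The FLOFO importance is the score before the perturbation minus the score after it. (The score can be a metric such as AUC or accuracy; the larger the drop, the more important the feature.)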
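A minimal sketch of this "train once, perturb, re-score" loop is shown below. It uses a plain random shuffle of each column as the perturbation, which is an assumption made for simplicity; the perturbation scheme inside the lofo library's FLOFOImportance may differ. The train/validation split, the AUC metric, and all variable names are likewise illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Train a single model on all features; it is never retrained below
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
base_score = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

rng = np.random.default_rng(0)
importances = np.zeros(X.shape[1])
for j in range(X.shape[1]):
    # Perturb one feature by shuffling its values across validation rows
    X_perturbed = X_val.copy()
    X_perturbed[:, j] = rng.permutation(X_perturbed[:, j])
    perturbed_score = roc_auc_score(y_val, model.predict_proba(X_perturbed)[:, 1])
    # Importance = score before perturbation minus score after
    importances[j] = base_score - perturbed_score

print(importances)
```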
3. LOFO example code
The following applies LOFO to the breast_cancer dataset that ships with sklearn:
- Import dependencies
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.model_selection import KFold
from lofo import LOFOImportance, Dataset, plot_importance
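This imports the model, KFold, and LOFO. LOFO works heavily with DataFrames, so pandas is imported as well.
- Load the dataset
Load the breast_cancer dataset; note that the data must be converted to a DataFrame.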
data = load_breast_cancer(as_frame=True)# load as dataframe
df = data.data
df['target']=data.target.values
print(df.shape)# (569, 31)
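breast_cancer is a binary classification dataset, so target contains only the values 0 and 1.
- Convert the dataset to LOFO's Dataset format
The DataFrame must be wrapped in a Dataset before any of the LOFO interfaces can be called.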
dataset = Dataset(df=df, target="target", features=[col for col in df.columns if col != 'target'])
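The target parameter is the name of the column in df that holds the y values, and features is the list of feature column names in df.
- Get the feature importance for any model
RandomForestClassifier is used as the example here.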
model = RandomForestClassifier()
cv = KFold(n_splits=5, shuffle=True, random_state=666)
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="f1",model=model)
importance_df = lofo_imp.get_importance()
The model is validated with 5 folds, so each feature ends up with 5 importance values. importance_df looks like this:
index | feature | importance_mean | importance_std | val_imp_0 | val_imp_1 | val_imp_2 | val_imp_3 | val_imp_4 |
---|---|---|---|---|---|---|---|---|
26 | area error | 0.0104953 | 0.0151496 | 0.0263158 | 0.0175439 | 0.0175439 | 0.00877193 | -0.0176991 |
23 | worst perimeter | 0.00878746 | 0.0124054 | 0.0175439 | 0 | -0.00877193 | 0.0263158 | 0.00884956 |
29 | mean smoothness | 0.00704859 | 0.00863287 | -0.00877193 | 0.00877193 | 0.00877193 | 0.00877193 | 0.0176991 |
24 | mean texture | 0.00704859 | 0.0170287 | -0.00877193 | 0.0350877 | 0 | -0.00877193 | 0.0176991 |
1 | mean radius | 0.00527868 | 0.00702537 | 0 | 0.0175439 | 0 | 0 | 0.00884956 |
16 | mean compactness | 0.00355535 | 0.0143273 | 0 | 0 | 0.00877193 | -0.0175439 | 0.0265487 |
4 | perimeter error | 0.0035243 | 0.0105341 | -0.00877193 | 0.0175439 | -0.00877193 | 0.00877193 | 0.00884956 |
9 | worst area | 0.00175439 | 0.00656431 | 0.00877193 | 0.00877193 | 0 | -0.00877193 | 0 |
11 | mean symmetry | 0.00175439 | 0.00656431 | 0 | 0.00877193 | 0.00877193 | -0.00877193 | 0 |
3 | worst fractal dimension | 0.00175439 | 0.00350877 | 0 | 0 | 0 | 0.00877193 | 0 |
22 | worst radius | 0.00175439 | 0.0085947 | -0.00877193 | 0.0175439 | 0 | 0 | 0 |
8 | radius error | 0.00173886 | 0.00861375 | -0.00877193 | 0.00877193 | 0.00877193 | 0.00877193 | -0.00884956 |
17 | texture error | 3.10511e-05 | 0.0124494 | -0.00877193 | 0 | 0.00877193 | -0.0175439 | 0.0176991 |
19 | mean concavity | 1.55255e-05 | 0.00962338 | 0 | 0.00877193 | 0 | -0.0175439 | 0.00884956 |
14 | fractal dimension error | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | mean concave points | -1.55255e-05 | 0.00962338 | 0.0175439 | 0 | -0.00877193 | 0 | -0.00884956 |
7 | worst concavity | -4.65766e-05 | 0.0147618 | 0 | 0.0175439 | 0.00877193 | 0 | -0.0265487 |
10 | concavity error | -0.00173886 | 0.00658923 | 0 | 0 | -0.00877193 | -0.00877193 | 0.00884956 |
0 | mean fractal dimension | -0.00175439 | 0.00350877 | 0 | 0 | 0 | -0.00877193 | 0 |
13 | worst compactness | -0.00175439 | 0.00350877 | -0.00877193 | 0 | 0 | 0 | 0 |
28 | mean perimeter | -0.00350877 | 0.00701754 | 0 | 0 | 0 | -0.0175439 | 0 |
12 | smoothness error | -0.0035243 | 0.00431644 | 0 | 0 | -0.00877193 | 0 | -0.00884956 |
27 | mean area | -0.00703307 | 0.00656853 | -0.00877193 | 0 | 0 | -0.0175439 | -0.00884956 |
18 | compactness error | -0.00880298 | 0.00788074 | -0.00877193 | 0 | -0.0175439 | 0 | -0.0176991 |
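As a sanity check on how the summary columns relate to the per-fold columns, the sketch below recomputes the first row ("area error") from its five val_imp values. The numbers suggest that importance_mean is the plain mean and importance_std the population standard deviation (ddof=0) of the folds; this is inferred from the table, not taken from the library's documentation.

```python
import numpy as np

# val_imp_0 .. val_imp_4 for "area error", copied from the table above
fold_importances = np.array(
    [0.0263158, 0.0175439, 0.0175439, 0.00877193, -0.0176991]
)

importance_mean = fold_importances.mean()
importance_std = fold_importances.std()  # NumPy default: population std (ddof=0)

# Compare with the tabulated 0.0104953 and 0.0151496
print(importance_mean, importance_std)
```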
- Plot the feature importance
LOFO provides a plotting function that visualizes importance_df directly:
plot_importance(importance_df, figsize=(12, 20))
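This produces a chart of the features ranked by importance.
- Testing on a multiclass dataset
On the iris dataset the program runs without error, but every value in importance_df is NaN. Changing scoring="f1" to scoring="f1_macro" fixes this, because the plain "f1" scorer uses binary averaging, which is not defined for a target with more than two classes.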
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.model_selection import KFold
from lofo import LOFOImportance, Dataset, plot_importance
data = load_iris(as_frame=True)# load as dataframe
df = data.data
df['target']=data.target.values
# model
model = RandomForestClassifier()
# dataset
dataset = Dataset(df=df, target="target", features=[col for col in df.columns if col != 'target'])
# get feature importance
cv = KFold(n_splits=5, shuffle=True, random_state=666)
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="f1_macro",model=model)
importance_df = lofo_imp.get_importance()
print(importance_df)
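The difference between the two scorers can be reproduced with sklearn.metrics.f1_score alone. The sketch below uses a small hand-made multiclass example (the labels are invented for illustration): the default binary averaging raises a ValueError, while macro averaging returns a score. A likely explanation for the NaN values is that such scorer errors are turned into NaN during cross-validation, though that is an assumption about the library's internals.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy 3-class labels, invented for this illustration
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 2, 2, 2, 1, 0])

try:
    f1_score(y_true, y_pred)  # default average="binary"
    binary_f1_failed = False
except ValueError:
    # Binary averaging is rejected for a multiclass target
    binary_f1_failed = True

# Macro averaging: unweighted mean of the per-class F1 scores
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(binary_f1_failed, macro_f1)
```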
- Final code
Putting the steps above together gives the following directly runnable code:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.model_selection import KFold
from lofo import LOFOImportance, Dataset, plot_importance
data = load_breast_cancer(as_frame=True)# load as dataframe
df = data.data
df['target']=data.target.values
# model
model = RandomForestClassifier()
# dataset
dataset = Dataset(df=df, target="target", features=[col for col in df.columns if col != 'target'])
# get feature importance
cv = KFold(n_splits=5, shuffle=True, random_state=666)
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="f1",model=model)
importance_df = lofo_imp.get_importance()
print(importance_df)
4. FLOFO example code
Fast LOFO is used by calling FLOFOImportance directly. Reference code:
from lofo import FLOFOImportance
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.model_selection import KFold
from lofo import LOFOImportance, Dataset, plot_importance
# step-01: prepare data
data = load_breast_cancer(as_frame=True)# load as dataframe
x_data = data.data.to_numpy()
y_data = data.target.values
df = data.data
df['target']=data.target.values
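# FLOFO needs more than 1000 rows, so repeat each row twice (569 -> 1138)
# (pd.np was removed in pandas 2.0; use the ndarray's own repeat method instead)
df = pd.DataFrame(df.values.repeat(2, axis=0), columns=df.columns)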
# step-02: train model
model = RandomForestClassifier()
model.fit(x_data,y_data)
# step-03: fast-lofo
lofo_imp = FLOFOImportance(validation_df=df, target="target", features=[col for col in df.columns if col != 'target'],scoring="f1",trained_model=model)
importance_df = lofo_imp.get_importance()
print(importance_df)
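FLOFOImportance differs from LOFOImportance in a few ways:
- FLOFOImportance no longer requires the data to be wrapped in a Dataset
- FLOFOImportance requires the model to be trained first, before FLOFO is called
- The FLOFOImportance interface is slightly different from LOFO's
Conclusion
- RandomForest can compute feature_importances_ directly on multiclass data
- LOFO uses LightGBM by default as the evaluation model when none is supplied
- The LOFO library in reference 1 supports multiclass datasets, but scoring="f1" must be changed to a multiclass-capable metric such as scoring="f1_macro"
- FLOFO (Fast LOFO) runs faster than LOFO
References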
- https://github.com/aerdem4/lofo-importance
- https://juejin.cn/post/7020237735516438564