Datawhale Introduction to Data Mining for Beginners - Task 3: Feature Engineering
3. Feature Engineering Goals
Competition: Introduction to Data Mining for Beginners - Used Car Transaction Price Prediction
3.1 Feature Engineering Goals
- Analyze the features further and process the data accordingly.
- Complete the analysis required for feature engineering.
3.2 Overview
Common feature engineering steps include:
- Outlier handling:
  - Removing outliers identified with a box plot (or the 3-sigma rule);
  - Box-Cox transformation (to handle skewed distributions);
  - Truncating long tails;
- Feature normalization/standardization:
  - Standardization (transform to a standard normal distribution);
  - Normalization (scale to the [0, 1] interval);
  - For power-law distributions, the transform $\log(\frac{1+x}{1+\text{median}})$ can be used (a short illustrative sketch follows this list);
- Data binning:
  - Equal-frequency binning;
  - Equal-width binning;
  - Best-KS binning (similar to using the Gini index for a binary split);
  - Chi-square binning;
- Missing value handling:
  - Leave as-is (for tree models such as XGBoost);
  - Drop (when too much data is missing);
  - Impute: mean / median / mode / model-based prediction / multiple imputation / compressed-sensing completion / matrix completion, etc.;
  - Binning, with missing values placed in their own bin;
- Feature construction:
  - Statistical features such as counts, sums, ratios and standard deviations;
  - Time features, including relative and absolute time, holidays, weekends, etc.;
  - Geographic information, including binning, distribution encoding and similar methods;
  - Nonlinear transforms, such as log / square / square root;
  - Feature combinations and feature crosses;
  - Largely a matter of judgment and experience.
- Feature selection:
  - Filter: select features first, then train the learner; common methods include Relief, variance thresholding, correlation coefficients, the chi-square test and mutual information;
  - Wrapper: use the performance of the learner that will actually be deployed as the criterion for evaluating feature subsets; a common method is LVW (Las Vegas Wrapper);
  - Embedded: a combination of the filter and wrapper ideas, where feature selection happens automatically while the learner is trained; LASSO regression is a common example;
- Dimensionality reduction:
  - PCA / LDA / ICA;
  - Feature selection is itself a form of dimensionality reduction.
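To make the normalization and binning items above concrete, here is a minimal, self-contained sketch on toy data (the column name x and the generated values are made up; this is not part of the competition pipeline):
# Minimal sketch: scaling, the power-law log transform, and equal-width vs. equal-frequency binning on toy data
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
demo = pd.DataFrame({'x': np.random.lognormal(mean=1.0, sigma=1.0, size=1000)})
# Standardization: zero mean, unit variance
demo['x_std'] = StandardScaler().fit_transform(demo[['x']]).ravel()
# Normalization: scale to [0, 1]
demo['x_minmax'] = MinMaxScaler().fit_transform(demo[['x']]).ravel()
# Log transform for power-law-like data: log((1 + x) / (1 + median))
demo['x_log'] = np.log((1 + demo['x']) / (1 + demo['x'].median()))
# Equal-width binning vs. equal-frequency binning
demo['x_cut'] = pd.cut(demo['x'], bins=10, labels=False)
demo['x_qcut'] = pd.qcut(demo['x'], q=10, labels=False)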
3.3 Code Examples
3.3.0 Loading the data
# Import the libraries we need
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import lightgbm as lgb
from catboost import CatBoostRegressor
from sklearn.model_selection import KFold, RepeatedKFold
from sklearn.metrics import mean_absolute_error
from sklearn import linear_model
import warnings
warnings.filterwarnings('ignore')
# Note: reduce_mem_usage is defined in section 3.3.1.3 below; run that cell first.
Test_data = reduce_mem_usage(pd.read_csv('data/car_testA_0110.csv', sep=' '))
Train_data = reduce_mem_usage(pd.read_csv('data/car_train_0110.csv', sep=' '))
Train_data.shape
In the previous section we already sketched a basic approach to feature engineering, so the basic information about the data is not repeated here.
3.3.1 Removing outliers
3.3.1.1 Data cleaning
Note: this method is not meant for every column. We call this step data cleaning, but after cleaning, this competition's data is left with only 185,138 rows, i.e. roughly a quarter of the data is deleted. In my view that damages the completeness of the original data, so I did not apply this step in the end.
# Here the outlier handling is wrapped in a function that can be called wherever needed.
def outliers_proc(data, col_name, scale=3):
"""
用于清洗异常值,默认用 box_plot(scale=3)进行清洗
:param data: 接收 pandas 数据格式
:param col_name: pandas 列名
:param scale: 尺度
:return:
"""
def box_plot_outliers(data_ser, box_scale):
"""
利用箱线图去除异常值
:param data_ser: 接收 pandas.Series 数据格式
:param box_scale: 箱线图尺度,
:return:
"""
iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25))
val_low = data_ser.quantile(0.25) - iqr
val_up = data_ser.quantile(0.75) + iqr
rule_low = (data_ser < val_low)
rule_up = (data_ser > val_up)
return (rule_low, rule_up), (val_low, val_up)
data_n = data.copy()
data_series = data_n[col_name]
rule, value = box_plot_outliers(data_series, box_scale=scale)
index = np.arange(data_series.shape[0])[rule[0] | rule[1]]
print("Delete number is: {}".format(len(index)))
data_n = data_n.drop(index)
data_n.reset_index(drop=True, inplace=True)
print("Now column number is: {}".format(data_n.shape[0]))
index_low = np.arange(data_series.shape[0])[rule[0]]
outliers = data_series.iloc[index_low]
print("Description of data less than the lower bound is:")
print(pd.Series(outliers).describe())
index_up = np.arange(data_series.shape[0])[rule[1]]
outliers = data_series.iloc[index_up]
print("Description of data larger than the upper bound is:")
print(pd.Series(outliers).describe())
fig, ax = plt.subplots(1, 2, figsize=(10, 7))
sns.boxplot(y=data[col_name], data=data, palette="Set1", ax=ax[0])
sns.boxplot(y=data_n[col_name], data=data_n, palette="Set1", ax=ax[1])
return data_n
# matplotlib / seaborn are used inside outliers_proc; importing them before the first call is enough
import matplotlib.pyplot as plt
import seaborn as sns
# Data cleaning
for i in [ 'v_8', 'v_23']:
print(i)
Train_data=outliers_proc(Train_data, i, scale=3)
v_8
Delete number is: 48536
Now column number is: 201464
Description of data less than the lower bound is:
count 4.853600e+04
mean 6.556511e-07
std 0.000000e+00
min 0.000000e+00
25% 0.000000e+00
50% 0.000000e+00
75% 0.000000e+00
max 6.532669e-04
Name: v_8, dtype: float64
Description of data larger than the upper bound is:
count 0.0
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
Name: v_8, dtype: float64
v_23
Delete number is: 16326
Now column number is: 185138
Description of data less than the lower bound is:
count 0.0
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
Name: v_23, dtype: float64
Description of data larger than the upper bound is:
count 1.632600e+04
mean inf
std 5.332031e-01
min 4.511719e+00
25% 4.730469e+00
50% 4.988281e+00
75% 5.351562e+00
max 8.578125e+00
Name: v_23, dtype: float64
3.3.1.2 Handling other outliers
Note: in the previous section's analysis we found some extreme values in v_14 and price; those extremes are treated as outliers and removed here.
Train_data = Train_data.drop(Train_data[Train_data['v_14']>8].index)
Train_data = Train_data.drop(Train_data[Train_data['price'] < 3].index)
3.3.1.3 Reducing the memory footprint of the data
def reduce_mem_usage(df):
""" iterate through all the columns of a dataframe and modify the data type
to reduce memory usage.
"""
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
for col in df.columns:
col_type = df[col].dtype
if col_type != object:
c_min = df[col].min()
c_max = df[col].max()
if str(col_type)[:3] == 'int':
if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
df[col] = df[col].astype(np.int8)
elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
df[col] = df[col].astype(np.int16)
elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
df[col] = df[col].astype(np.int32)
elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
df[col] = df[col].astype(np.int64)
else:
if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
df[col] = df[col].astype(np.float16)
elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
df[col] = df[col].astype(np.float32)
else:
df[col] = df[col].astype(np.float64)
else:
df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
return df
3.3.2 Feature construction
# Put the training and test sets together to make feature construction easier
Train_data['price'] = np.log1p(Train_data['price'])
# Concatenate so the later operations only have to be done once
df = pd.concat([Train_data, Test_data], ignore_index=True)
# One-hot encode the features with few categories
one_hot_list = ['fuelType','gearbox','notRepairedDamage','bodyType']
for col in one_hot_list:
one_hot = pd.get_dummies(df[col])
one_hot.columns = [col+'_'+str(i) for i in range(len(one_hot.columns))]
df = pd.concat([df,one_hot],axis=1)
- One-hot encoding makes categorical data easier to use: many machine learning algorithms cannot work with categories directly, so category values have to be turned into numbers, for inputs and outputs alike.
- We could simply use integer encoding and rescale when needed. That can work when there is a natural order between the categories, e.g. temperature labels "cold" (0) and "hot" (1).
- When there is no such order, e.g. the labels "dog" and "cat", integer encoding can cause problems. In those cases we want the model to be more expressive and to give each possible label value a probability-like number, which helps the model fit the problem. When the output variable is one-hot encoded, the model can produce a richer set of predictions than a single label. A tiny illustration follows.
An introduction to one-hot encoding: [What is one-hot encoding and why use it? - Zhihu](https://zhuanlan.zhihu.com/p/37471802)
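As a small illustration of the difference described above (toy data, unrelated to the competition):
# Toy example: integer (label) encoding vs. one-hot encoding for a nominal feature
import pandas as pd
from sklearn.preprocessing import LabelEncoder
pets = pd.DataFrame({'animal': ['dog', 'cat', 'dog', 'cat']})
# Integer encoding imposes an artificial order: cat -> 0, dog -> 1
pets['animal_le'] = LabelEncoder().fit_transform(pets['animal'])
# One-hot encoding gives each category its own 0/1 column and implies no order
pets = pd.concat([pets, pd.get_dummies(pets['animal'], prefix='animal')], axis=1)
print(pets)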
## 1. Step one: handle useless and nearly constant features
# SaleID itself is surely useless, but it can be used to count group sizes for other features
# name usually has little worth mining, but there seem to be quite a few duplicated names, so that is worth exploiting
df['name_count'] = df.groupby(['name'])['SaleID'].transform('count')
# del df['name']
# seller has one special value that exists only in the training set and not in the test set; drop that row
df.drop(df[df['seller'] == 0].index, inplace=True)
del df['offerType']
del df['seller']
## 2. Step two: handle missing values
# Fill all of the following features with 0
df['fuelType'] = df['fuelType'].fillna(0)
df['bodyType'] = df['bodyType'].fillna(0)
df['gearbox']=df['gearbox'].fillna(0)
df['notRepairedDamage']=df['notRepairedDamage'].fillna(0)
df['model'] = df['model'].fillna(0)
# 3. Step three: handle outliers
# On a first look, the only problematic values are in notRepairedDamage, plus power, whose valid range is specified by the competition. Handle both.
df['power'] = df['power'].map(lambda x: 600 if x > 600 else x)  # equivalently: df['power'] = df['power'].clip(upper=600)
df['notRepairedDamage'] = df['notRepairedDamage'].astype('str').apply(lambda x: x if x != '-' else None).astype('float32')
Note: the rest of this section is my full feature engineering pipeline; discussion and suggestions are very welcome.
## 1. Time and region features
# Time
from datetime import datetime
def date_process(x):
year = int(str(x)[:4])
month = int(str(x)[4:6])
day = int(str(x)[6:8])
    if month < 1:
        # some regDate values have an invalid month of 00; clamp to January
        month = 1
date = datetime(year, month, day)
return date
df['regDate'] = df['regDate'].apply(date_process)
df['creatDate'] = df['creatDate'].apply(date_process)
df['regDate_year'] = df['regDate'].dt.year
df['regDate_month'] = df['regDate'].dt.month
df['regDate_day'] = df['regDate'].dt.day
df['creatDate_year'] = df['creatDate'].dt.year
df['creatDate_month'] = df['creatDate'].dt.month
df['creatDate_day'] = df['creatDate'].dt.day
df['car_age_day'] = (df['creatDate'] - df['regDate']).dt.days
df['car_age_year'] = round(df['car_age_day'] / 365, 1)
df['year_kilometer'] = df['kilometer'] / df['car_age_year']  # may produce inf when car_age_year is 0; inf values are replaced with NaN later on
# Region
df['regionCode_count'] = df.groupby(['regionCode'])['SaleID'].transform('count')
df['city'] = df['regionCode'].apply(lambda x : str(x)[:2])
## 2. Categorical features
# Bin the continuous features that can be treated as categories; kilometer is effectively binned already
bin = [i*10 for i in range(31)]
df['power_bin'] = pd.cut(df['power'], bin, labels=False)
tong = df[['power_bin', 'power']].head()
bin = [i*10 for i in range(24)]
df['model_bin'] = pd.cut(df['model'], bin, labels=False)
tong = df[['model_bin', 'model']].head()
# Combine the categorical features that have somewhat more distinct values with price statistics. Many groups were built; when actually used, each group was tested separately and only the features that really work were kept.
Train_gb = Train_data.groupby("regionCode")
all_info = {}
for kind, kind_data in Train_gb:
info = {}
kind_data = kind_data[kind_data['price'] > 0]
info['regionCode_amount'] = len(kind_data)
info['regionCode_price_max'] = kind_data.price.max()
info['regionCode_price_median'] = kind_data.price.median()
info['regionCode_price_min'] = kind_data.price.min()
info['regionCode_price_sum'] = kind_data.price.sum()
info['regionCode_price_std'] = kind_data.price.std()
info['regionCode_price_mean'] = kind_data.price.mean()
info['regionCode_price_skew'] = kind_data.price.skew()
info['regionCode_price_kurt'] = kind_data.price.kurt()
info['regionCode_mad'] = kind_data.price.mad()
all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "regionCode"})
df = df.merge(brand_fe, how='left', on='regionCode')
Train_gb = Train_data.groupby("brand")
all_info = {}
for kind, kind_data in Train_gb:
info = {}
kind_data = kind_data[kind_data['price'] > 0]
info['brand_amount'] = len(kind_data)
info['brand_price_max'] = kind_data.price.max()
info['brand_price_median'] = kind_data.price.median()
info['brand_price_min'] = kind_data.price.min()
info['brand_price_sum'] = kind_data.price.sum()
info['brand_price_std'] = kind_data.price.std()
info['brand_price_mean'] = kind_data.price.mean()
info['brand_price_skew'] = kind_data.price.skew()
info['brand_price_kurt'] = kind_data.price.kurt()
info['brand_price_mad'] = kind_data.price.mad()
all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "brand"})
df = df.merge(brand_fe, how='left', on='brand')
Train_gb = df.groupby("model_bin")
all_info = {}
for kind, kind_data in Train_gb:
info = {}
kind_data = kind_data[kind_data['price'] > 0]
info['model_amount'] = len(kind_data)
info['model_price_max'] = kind_data.price.max()
info['model_price_median'] = kind_data.price.median()
info['model_price_min'] = kind_data.price.min()
info['model_price_sum'] = kind_data.price.sum()
info['model_price_std'] = kind_data.price.std()
info['model_price_mean'] = kind_data.price.mean()
info['model_price_skew'] = kind_data.price.skew()
info['model_price_kurt'] = kind_data.price.kurt()
info['model_price_mad'] = kind_data.price.mad()
all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "model"})
df = df.merge(brand_fe, how='left', on='model')
Train_gb = Train_data.groupby("kilometer")
all_info = {}
for kind, kind_data in Train_gb:
info = {}
kind_data = kind_data[kind_data['price'] > 0]
info['kilometer_amount'] = len(kind_data)
info['kilometer_price_max'] = kind_data.price.max()
info['kilometer_price_median'] = kind_data.price.median()
info['kilometer_price_min'] = kind_data.price.min()
info['kilometer_price_sum'] = kind_data.price.sum()
info['kilometer_price_std'] = kind_data.price.std()
info['kilometer_price_mean'] = kind_data.price.mean()
info['kilometer_price_skew'] = kind_data.price.skew()
info['kilometer_price_kurt'] = kind_data.price.kurt()
info['kilometer_price_mad'] = kind_data.price.mad()
all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "kilometer"})
df = df.merge(brand_fe, how='left', on='kilometer')
Train_gb = Train_data.groupby("bodyType")
all_info = {}
for kind, kind_data in Train_gb:
info = {}
kind_data = kind_data[kind_data['price'] > 0]
info['bodyType_amount'] = len(kind_data)
info['bodyType_price_max'] = kind_data.price.max()
info['bodyType_price_median'] = kind_data.price.median()
info['bodyType_price_min'] = kind_data.price.min()
info['bodyType_price_sum'] = kind_data.price.sum()
info['bodyType_price_std'] = kind_data.price.std()
info['bodyType_price_mean'] = kind_data.price.mean()
info['bodyType_price_skew'] = kind_data.price.skew()
info['bodyType_price_kurt'] = kind_data.price.kurt()
info['bodyType_price_mad'] = kind_data.price.mad()
all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "bodyType"})
df = df.merge(brand_fe, how='left', on='bodyType')
Train_gb = Train_data.groupby("fuelType")
all_info = {}
for kind, kind_data in Train_gb:
info = {}
kind_data = kind_data[kind_data['price'] > 0]
info['fuelType_amount'] = len(kind_data)
info['fuelType_price_max'] = kind_data.price.max()
info['fuelType_price_median'] = kind_data.price.median()
info['fuelType_price_min'] = kind_data.price.min()
info['fuelType_price_sum'] = kind_data.price.sum()
info['fuelType_price_std'] = kind_data.price.std()
info['fuelType_price_mean'] = kind_data.price.mean()
info['fuelType_price_skew'] = kind_data.price.skew()
info['fuelType_price_kurt'] = kind_data.price.kurt()
info['fuelType_price_mad'] = kind_data.price.mad()
all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "fuelType"})
df = df.merge(brand_fe, how='left', on='fuelType')
Train_gb = Train_data.groupby("v_8")
all_info = {}
for kind, kind_data in Train_gb:
info = {}
kind_data = kind_data[kind_data['price'] > 0]
info['v_8_amount'] = len(kind_data)
info['v_8_price_max'] = kind_data.price.max()
info['v_8_price_median'] = kind_data.price.median()
info['v_8_price_min'] = kind_data.price.min()
info['v_8_price_sum'] = kind_data.price.sum()
info['v_8_price_std'] = kind_data.price.std()
info['v_8_price_mean'] = kind_data.price.mean()
info['v_8_price_skew'] = kind_data.price.skew()
info['v_8_price_kurt'] = kind_data.price.kurt()
info['v_8_price_mad'] = kind_data.price.mad()
all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "v_8"})
df = df.merge(brand_fe, how='left', on='v_8')
Train_gb = df.groupby('car_age_year')
all_info = {}
for kind, kind_data in Train_gb:
info = {}
kind_data = kind_data[kind_data['price'] > 0]
info['car_age_year_amount'] = len(kind_data)
info['car_age_year_price_max'] = kind_data.price.max()
info['car_age_year_price_median'] = kind_data.price.median()
info['car_age_year_price_min'] = kind_data.price.min()
info['car_age_year_price_sum'] = kind_data.price.sum()
info['car_age_year_price_std'] = kind_data.price.std()
info['car_age_year_price_mean'] = kind_data.price.mean()
info['car_age_year_price_skew'] = kind_data.price.skew()
info['car_age_year_price_kurt'] = kind_data.price.kurt()
info['car_age_year_price_mad'] = kind_data.price.mad()
all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "car_age_year"})
df = df.merge(brand_fe, how='left', on='car_age_year')
# Testing the categorical features against price showed some benefit, so model (and the other keys below) get the same treatment right away with other value columns
for kk in [ "regionCode","brand","model","bodyType","fuelType"]:
Train_gb = df.groupby(kk)
all_info = {}
for kind, kind_data in Train_gb:
info = {}
kind_data = kind_data[kind_data['car_age_day'] > 0]
info[kk+'_days_max'] = kind_data.car_age_day.max()
info[kk+'_days_min'] = kind_data.car_age_day.min()
info[kk+'_days_std'] = kind_data.car_age_day.std()
info[kk+'_days_mean'] = kind_data.car_age_day.mean()
info[kk+'_days_median'] = kind_data.car_age_day.median()
info[kk+'_days_sum'] = kind_data.car_age_day.sum()
all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": kk})
df = df.merge(brand_fe, how='left', on=kk)
Train_gb = df.groupby(kk)
all_info = {}
for kind, kind_data in Train_gb:
info = {}
kind_data = kind_data[kind_data['power'] > 0]
info[kk+'_power_max'] = kind_data.power.max()
info[kk+'_power_min'] = kind_data.power.min()
info[kk+'_power_std'] = kind_data.power.std()
info[kk+'_power_mean'] = kind_data.power.mean()
info[kk+'_power_median'] = kind_data.power.median()
info[kk+'_power_sum'] = kind_data.power.sum()
all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": kk})
df = df.merge(brand_fe, how='left', on=kk)
Train_gb = df.groupby(kk)
all_info = {}
for kind, kind_data in Train_gb:
info = {}
kind_data = kind_data[kind_data['v_0'] > 0]
info[kk+'_v_0_max'] = kind_data.v_0.max()
info[kk+'_v_0_min'] = kind_data.v_0.min()
info[kk+'_v_0_std'] = kind_data.v_0.std()
info[kk+'_v_0_mean'] = kind_data.v_0.mean()
info[kk+'_v_0_median'] = kind_data.v_0.median()
info[kk+'_v_0_sum'] = kind_data.v_0.sum()
all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": kk})
df = df.merge(brand_fe, how='left', on=kk)
Train_gb = df.groupby(kk)
all_info = {}
for kind, kind_data in Train_gb:
info = {}
kind_data = kind_data[kind_data['v_3'] > 0]
info[kk+'_v_3_max'] = kind_data.v_3.max()
info[kk+'_v_3_min'] = kind_data.v_3.min()
info[kk+'_v_3_std'] = kind_data.v_3.std()
info[kk+'_v_3_mean'] = kind_data.v_3.mean()
info[kk+'_v_3_median'] = kind_data.v_3.median()
info[kk+'_v_3_sum'] = kind_data.v_3.sum()
all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": kk})
df = df.merge(brand_fe, how='left', on=kk)
Train_gb = df.groupby(kk)
all_info = {}
for kind, kind_data in Train_gb:
info = {}
kind_data = kind_data[kind_data['v_16'] > 0]
info[kk+'_v_16_max'] = kind_data.v_16.max()
info[kk+'_v_16_min'] = kind_data.v_16.min()
info[kk+'_v_16_std'] = kind_data.v_16.std()
info[kk+'_v_16_mean'] = kind_data.v_16.mean()
info[kk+'_v_16_median'] = kind_data.v_16.median()
info[kk+'_v_16_sum'] = kind_data.v_16.sum()
all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": kk})
df = df.merge(brand_fe, how='left', on=kk)
Train_gb = df.groupby(kk)
all_info = {}
for kind, kind_data in Train_gb:
info = {}
kind_data = kind_data[kind_data['v_18'] > 0]
        info[kk+'_v_18_max'] = kind_data.v_18.max()
        info[kk+'_v_18_min'] = kind_data.v_18.min()
        info[kk+'_v_18_std'] = kind_data.v_18.std()
        info[kk+'_v_18_mean'] = kind_data.v_18.mean()
        info[kk+'_v_18_median'] = kind_data.v_18.median()
        info[kk+'_v_18_sum'] = kind_data.v_18.sum()
all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": kk})
df = df.merge(brand_fe, how='left', on=kk)
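All of the blocks above repeat the same groupby → aggregate → merge pattern. As a rough sketch of how the same statistics could be produced more compactly (the helper name add_group_stats is my own and not part of the original pipeline; the original code also filters each group, e.g. to rows with price > 0, which is omitted here for brevity):
# Sketch of the repeated pattern: per-group statistics of `value` keyed by `key`, merged back onto the frame
def add_group_stats(frame, key, value, stats=('max', 'min', 'median', 'mean', 'std', 'sum')):
    grouped = frame.groupby(key)[value].agg(list(stats))
    grouped.columns = ['{}_{}_{}'.format(key, value, s) for s in stats]
    return frame.merge(grouped.reset_index(), how='left', on=key)
# e.g. df = add_group_stats(df, 'brand', 'power')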
## 3. Continuous numeric features
# These are all anonymous features. Comparing the train and test distributions showed essentially no problems, so keep them all for now.
# Later it may be necessary to drop the ones that are too similar to one another.
# Characterize price with the few continuous numeric features that a simple lgb model ranks as most important.
# kk="regionCode"
# # dd = 'v_3'[0, 3, 6, 11, 16, 17, 18]
# for dd in ['v_0','v_1','v_3','v_16','v_17','v_18','v_22','v_23']:
# Train_gb = df.groupby(kk)
# all_info = {}
# for kind, kind_data in Train_gb:
# info = {}
# kind_data = kind_data[kind_data[dd] > -10000000]
# info[kk+'_'+dd+'_max'] = kind_data[dd].max()
# info[kk+'_'+dd+'_min'] = kind_data[dd].min()
# info[kk+'_'+dd+'_std'] = kind_data[dd].std()
# info[kk+'_'+dd+'_mean'] = kind_data[dd].mean()
# info[kk+'_'+dd+'_median'] = kind_data[dd].median()
# info[kk+'_'+dd+'_sum'] = kind_data[dd].sum()
# all_info[kind] = info
# brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": kk})
# df = df.merge(brand_fe, how='left', on=kk)
# dd = 'v_0'
# Train_gb = df.groupby(kk)
# all_info = {}
# for kind, kind_data in Train_gb:
# info = {}
# kind_data = kind_data[kind_data[dd]> -10000000]
# info[kk+'_'+dd+'_max'] = kind_data.v_0.max()
# info[kk+'_'+dd+'_min'] = kind_data.v_0.min()
# info[kk+'_'+dd+'_std'] = kind_data.v_0.std()
# info[kk+'_'+dd+'_mean'] = kind_data.v_0.mean()
# info[kk+'_'+dd+'_median'] = kind_data.v_0.median()
# info[kk+'_'+dd+'_sum'] = kind_data.v_0.sum()
# all_info[kind] = info
# all_info[kind] = info
# brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": kk})
# df = df.merge(brand_fe, how='left', on=kk)
# Note: some v_* features are negative, so log1p below can yield NaN or -inf; inf values are replaced with NaN further on
for i in ['v_{}'.format(k) for k in range(24)]:
df[i+'**2']=df[i]**2
df[i+'**3']=df[i]**3
df[i+'log']=np.log1p(df[i])
## Build polynomial features
df['v_0_2']=(df['v_0'])**2
df['v_3_6']=(df['v_3'])**6
df['v_3_9']=(df['v_3'])**9
df['v_6_8']=(df['v_6'])**8
df['v_7_2']=(df['v_7'])**2
df['v_7_8']=(df['v_7'])**8
df['v_7_12']=(df['v_7'])**12
df['v_10_6']=(df['v_10'])**6
df['v_11_2']=(df['v_11'])**2
df['v_11_3']=(df['v_11'])**3
df['v_11_4']=(df['v_11'])**4
df['v_11_6']=(df['v_11'])**6
df['v_11_8']=(df['v_11'])**8
df['v_11_9']=(df['v_11'])**9
for i in [2,3,4,6,8]:
df['v_15_'+str(i)]=(df['v_15'])**i
df['v_16_6']=(df['v_16'])**6
df['v_18_9']=(df['v_18'])**9
df['v_21_8']=(df['v_21'])**8
df['v_22_2']=(df['v_22'])**2
df['v_22_8']=(df['v_22'])**8
df['v_22_12']=(df['v_22'])**12
df['v_23_9']=(df['v_23'])**9
df['v_23_18']=(df['v_23'])**18
for i in [2,3,4,27,18]:
df['kilometer_'+str(i)]=(df['kilometer'])**i
for i in [8,9,27,18]:
df['bodyType_'+str(i)]=(df['bodyType'])**i
for i in [2,3,4,6,8,9,12,28]:
df['gearbox_'+str(i)]=(df['gearbox'])**i
## Cross the anonymous features with each other and with a few high-importance columns
# First batch of features
for i in range(24):
for j in range(24):
df['new'+str(i)+'*'+str(j)]=df['v_'+str(i)]*df['v_'+str(j)]
# Second batch of features
for i in range(24):
for j in range(24):
df['new'+str(i)+'+'+str(j)]=df['v_'+str(i)]+df['v_'+str(j)]
# Third batch of features
for i in range(24):
df['new' + str(i) + '*power'] = df['v_' + str(i)] * df['power']
for i in range(24):
df['new' + str(i) + '*day'] = df['v_' + str(i)] * df['car_age_day']
for i in range(24):
df['new' + str(i) + '*year'] = df['v_' + str(i)] * df['car_age_year']
# Fourth batch of features (the division below produces inf when the denominator is 0; inf values are replaced with NaN later)
for i in range(24):
for j in range(24):
df['new'+str(i)+'-'+str(j)]=df['v_'+str(i)]-df['v_'+str(j)]
df['new'+str(i)+'/'+str(j)]=df['v_'+str(i)]/df['v_'+str(j)]
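Note that the first two batches above compute every symmetric pair twice (for example new3*5 and new5*3 are identical columns), while the difference and quotient in the fourth batch genuinely need both orders. A sketch of an alternative that builds each product and sum only once, roughly halving the number of added columns (the i == j cases are skipped, since the squares are already built earlier):
# Sketch: build each symmetric product/sum only once
from itertools import combinations
for i, j in combinations(range(24), 2):
    df['new{}*{}'.format(i, j)] = df['v_{}'.format(i)] * df['v_{}'.format(j)]
    df['new{}+{}'.format(i, j)] = df['v_{}'.format(i)] + df['v_{}'.format(j)]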
'''
Polynomial features; degree 3 tested best
'''
from sklearn import preprocessing
feature_cols = [ 'v_0', 'v_3','v_18', 'v_16']
poly_data = df[feature_cols]
poly = preprocessing.PolynomialFeatures(3,interaction_only=True)
poly_data_ndarray = poly.fit_transform(poly_data)
# (in newer scikit-learn versions use poly.get_feature_names_out instead)
poly_data_final = pd.DataFrame(poly_data_ndarray, columns=poly.get_feature_names(poly_data.columns))
poly_data_final.drop(columns=['v_0', 'v_3', 'v_18', 'v_16'], inplace=True)
# Align the index so the transformed rows line up with the original rows, then join back onto the dataset
poly_data_final.index = poly_data.index
df = pd.merge(df, poly_data_final, how='left', right_index=True, left_index=True)
df.drop(columns=['1'], inplace=True)  # drop the constant bias column produced by PolynomialFeatures
# Replace inf values with NaN
df.replace([np.inf,-np.inf],np.nan,inplace=True)
# df=df.fillna(method='ffill')
feature_aggs = {}
# for i in sparse_feature:
for i in ['name', 'model', 'regionCode']:
feature_aggs[i] = ['count', 'nunique']
for j in ['power', 'kilometer', 'car_age_day']:#,'v_4','v_8','v_10','v_12','v_13'
feature_aggs[j] = ['mean','max','min','std','median','count']
def create_new_feature(df):
result = df.copy()
# for feature in sparse_feature:
for feature in ['name', 'model', 'regionCode']:
aggs = feature_aggs.copy()
aggs.pop(feature)
grouped = result.groupby(feature).agg(aggs)
grouped.columns = ['{}_{}_{}'.format(feature, i[0], i[1]) for i in grouped.columns]
grouped = grouped.reset_index().rename(columns={0: feature})
result = pd.merge(result, grouped, how='left', on=feature)
return result
df = create_new_feature(df)
from tqdm import *
from scipy.stats import entropy
feat_cols = []
### Count encoding
for f in tqdm(['car_age_year','model', 'brand', 'regionCode']):
df[f + '_count'] = df[f].map(df[f].value_counts())
feat_cols.append(f + '_count')
# ### Profile the categorical features with numeric statistics; a few of the anonymous features most correlated with price were picked
# for f1 in tqdm(['model', 'brand', 'regionCode']):
# group = data.groupby(f1, as_index=False)
# for f2 in tqdm(['v_0', 'v_3', 'v_8', 'v_12']):
# feat = group[f2].agg({
# '{}_{}_max'.format(f1, f2): 'max', '{}_{}_min'.format(f1, f2): 'min',
# '{}_{}_median'.format(f1, f2): 'median', '{}_{}_mean'.format(f1, f2): 'mean',
# '{}_{}_std'.format(f1, f2): 'std', '{}_{}_mad'.format(f1, f2): 'mad'
# })
# data = data.merge(feat, on=f1, how='left')
# feat_list = list(feat)
# feat_list.remove(f1)
# feat_cols.extend(feat_list)
### Second-order crosses of categorical features
for f_pair in tqdm([['model', 'brand'], ['model', 'regionCode'], ['brand', 'regionCode']]):
    ### Co-occurrence count
df['_'.join(f_pair) + '_count'] = df.groupby(f_pair)['SaleID'].transform('count')
    ### nunique and entropy
df = df.merge(df.groupby(f_pair[0], as_index=False)[f_pair[1]].agg({
'{}_{}_nunique'.format(f_pair[0], f_pair[1]): 'nunique',
'{}_{}_ent'.format(f_pair[0], f_pair[1]): lambda x: entropy(x.value_counts() / x.shape[0])
}), on=f_pair[0], how='left')
df = df.merge(df.groupby(f_pair[1], as_index=False)[f_pair[0]].agg({
'{}_{}_nunique'.format(f_pair[1], f_pair[0]): 'nunique',
'{}_{}_ent'.format(f_pair[1], f_pair[0]): lambda x: entropy(x.value_counts() / x.shape[0])
}), on=f_pair[1], how='left')
    ### Proportion preference
df['{}_in_{}_prop'.format(f_pair[0], f_pair[1])] = df['_'.join(f_pair) + '_count'] / df[f_pair[1] + '_count']
df['{}_in_{}_prop'.format(f_pair[1], f_pair[0])] = df['_'.join(f_pair) + '_count'] / df[f_pair[0] + '_count']
feat_cols.extend([
'_'.join(f_pair) + '_count',
'{}_{}_nunique'.format(f_pair[0], f_pair[1]), '{}_{}_ent'.format(f_pair[0], f_pair[1]),
'{}_{}_nunique'.format(f_pair[1], f_pair[0]), '{}_{}_ent'.format(f_pair[1], f_pair[0]),
'{}_in_{}_prop'.format(f_pair[0], f_pair[1]), '{}_in_{}_prop'.format(f_pair[1], f_pair[0])
])
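To give intuition for the entropy features just built: for each value of the first key, we look at how the second key is distributed within it. A concentrated distribution gives a low entropy, a uniform one gives a high entropy. A tiny check with made-up values:
# Toy illustration of the entropy statistic used above
from scipy.stats import entropy
concentrated = pd.Series(['a', 'a', 'a', 'a', 'b'])
uniform = pd.Series(['a', 'b', 'c', 'd', 'e'])
print(entropy(concentrated.value_counts() / len(concentrated)))  # lower: values are concentrated
print(entropy(uniform.value_counts() / len(uniform)))            # higher: values are uniform (ln(5) ≈ 1.61)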
That completes the feature construction: well over a thousand features. Many of them will certainly hurt the predictions, and some will be highly correlated with one another, i.e. redundant, so next we turn to feature selection.
3.3.3 Feature selection
1) Filter methods
# Correlation analysis
f=[]
numerical_cols = df.select_dtypes(exclude='object').columns
feature_cols = [col for col in numerical_cols if
col not in['name','regDate','creatDate','model','brand','regionCode','seller','regDates','creatDates']]
for i in feature_cols:
print(i,df[i].corr(df['price'], method='spearman'))
f.append([i,df[i].corr(df['price'], method='spearman')])
f.sort(key=lambda x:x[1])
f.sort(key=lambda x:abs(x[1]),reverse=True)
new_f=[]
for i ,j in f:
if abs(j)>0.8:
new_f.append(i)
print(i,j)
Here we only keep the features whose Spearman correlation with price exceeds 0.8 in absolute value; the remaining features contribute little to predicting price and are not considered further.
# Of course, we can also just look at a plot
data_numeric = df[['power', 'kilometer', 'brand_amount', 'brand_price_mean',
                   'brand_price_max', 'brand_price_median']]
correlation = data_numeric.corr()
f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation of Numeric Features with Price',y=1,size=16)
sns.heatmap(correlation,square = True, vmax=0.8)
2) Wrapper methods
!pip install mlxtend
# With a large k_features this is very slow to run; without a server I interrupted it early
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
sfs = SFS(LinearRegression(),
k_features=10,
forward=True,
floating=False,
scoring = 'r2',
cv = 0)
# Use the training rows (those with a known price) for the wrapper search
x = df.loc[df['price'].notna(), feature_cols].drop(columns=['price'], errors='ignore').fillna(0)
y = df.loc[df['price'].notna(), 'price']
sfs.fit(x, y)
sfs.k_feature_names_
The code above takes far too long to run; I don't recommend trying it.
# Plot the result to see the marginal gain from each added feature
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
fig1 = plot_sfs(sfs.get_metric_dict(), kind='std_dev')
plt.grid()
plt.show()
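Since the wrapper search above is too slow here, a faster embedded-style alternative is to fit a quick LightGBM model on the training rows and rank features by importance. This is only a sketch of the idea, not part of the original pipeline: it reuses the feature_cols list from the filter step above and leaves the hyperparameters at rough defaults.
# Sketch: embedded-style feature selection via LightGBM feature importances
train_mask = df['price'].notna()
X = df.loc[train_mask, feature_cols].drop(columns=['price'], errors='ignore')
y = df.loc[train_mask, 'price']
model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.1)
model.fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance.head(30))  # e.g. keep only the top-ranked features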
3.4 Lessons learned
Feature engineering is the most critical part of a competition; how well it is done often determines the final ranking and score.
Feature construction is part of feature engineering, and its purpose is to strengthen the expressiveness of the data.
- Some competitions provide anonymous features, so we do not know how the features relate to one another. In that case we can only work with the features themselves: binning, groupby statistics and similar aggregations, plus further transforms such as log or exp, arithmetic combinations of several features (like the usage duration computed above) and polynomial combinations, followed by selection. The anonymity really limits what can be done with the features, though sometimes extracting features with a neural network gives surprisingly good results.
- When the meaning of the features is known (non-anonymous), especially in industrial competitions, more practically meaningful features are built from signal processing, frequency-domain extraction, abundance, skewness and so on, i.e. feature construction grounded in the problem background. The same applies to recommender systems: all kinds of click-through-rate statistics, per-time-slot statistics, statistics combined with user attributes, and so on. This kind of feature construction usually requires digging into the underlying business logic or physical principles in order to find the real magic.
Of course, feature engineering goes hand in hand with the model. That is exactly why binning and feature normalization are needed for LR and NN models, and why the effect of feature processing and feature importance usually have to be validated through a model.
In short, feature engineering is easy to get started with but very hard to truly master.