Datawhale Introduction to Data Mining for Beginners - Task 3: Feature Engineering

3. Feature Engineering Goals

Competition: Introduction to Data Mining for Beginners - Used Car Transaction Price Prediction

3.1 Feature Engineering Goals

  • Analyze the features further and process the data accordingly

  • Complete the analysis required for feature engineering

3.2 Overview

Common feature engineering steps include:

  1. Outlier handling:
    • Remove outliers based on box plots (or the 3-sigma rule);
    • Box-Cox transform (to handle skewed distributions);
    • Long-tail truncation;
  2. Feature standardization/normalization (a short sketch follows this list):
    • Standardization (convert to a standard normal distribution);
    • Normalization (scale to the [0, 1] interval);
    • For power-law distributions, the transform $\log\frac{1+x}{1+\text{median}}$ can be applied;
  3. Data binning:
    • Equal-frequency binning;
    • Equal-width binning;
    • Best-KS binning (similar to using the Gini index for binary splits);
    • Chi-square binning;
  4. Missing-value handling:
    • Leave as-is (for tree models such as XGBoost);
    • Drop (when too much data is missing);
    • Impute: mean/median/mode, model-based prediction, multiple imputation, compressed-sensing completion, matrix completion, etc.;
    • Binning, with missing values placed in their own bin;
  5. Feature construction:
    • Construct statistical features, including counts, sums, proportions, standard deviations, etc.;
    • Time features, including relative and absolute time, holidays, weekends, etc.;
    • Geographic information, including binning, distribution encoding, and similar methods;
    • Nonlinear transforms, including log, square, square root, etc.;
    • Feature combination and feature crossing;
    • Beyond that, it is a matter of experience and creativity.
  6. Feature selection:
    • Filter: select features first, then train the learner; common methods include Relief, variance selection, correlation coefficients, the chi-square test, and mutual information;
    • Wrapper: use the performance of the learner that will ultimately be deployed as the criterion for evaluating feature subsets; a common method is LVW (Las Vegas Wrapper);
    • Embedded: a middle ground between filter and wrapper, where feature selection happens automatically during training; lasso regression is a common example;
  7. Dimensionality reduction:
    • PCA / LDA / ICA;
    • Feature selection is itself a form of dimensionality reduction.
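
As a quick illustration of item 2, here is a minimal sketch (on a toy pandas Series, not the competition data) of standardization, min-max normalization, and the log-median transform for power-law features:

import numpy as np
import pandas as pd

x = pd.Series([1, 2, 3, 5, 8, 13, 100], dtype=float)  # toy, long-tailed data

standardized = (x - x.mean()) / x.std()           # zero mean, unit variance
normalized = (x - x.min()) / (x.max() - x.min())  # scaled into [0, 1]
log_median = np.log((1 + x) / (1 + x.median()))   # log((1+x)/(1+median))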

3.3 Code Examples

3.3.0 Importing the Data

# Import the libraries we need
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import lightgbm as lgb
from catboost import CatBoostRegressor
from sklearn.model_selection import KFold, RepeatedKFold
from sklearn.metrics import mean_absolute_error
from sklearn import linear_model
import warnings

warnings.filterwarnings('ignore')
# Note: reduce_mem_usage is defined in section 3.3.1.3 below; run that cell first
Test_data = reduce_mem_usage(pd.read_csv('data/car_testA_0110.csv', sep=' '))
Train_data = reduce_mem_usage(pd.read_csv('data/car_train_0110.csv', sep=' '))
Train_data.shape

We already sketched a basic feature engineering workflow in the previous section, so the basic description of the data is not repeated here.

3.3.1 Removing Outliers

3.3.1.1 Data Cleaning

Note: this method is not suitable for every column. We call this step data cleaning, but in practice cleaning this competition's data leaves only 185,138 rows, i.e. about 1/4 of the data is deleted. In my view this damages the completeness of the original data, so I did not apply this step in the end.

# A reusable wrapper for outlier removal; call it on any column.
import matplotlib.pyplot as plt
import seaborn as sns

def outliers_proc(data, col_name, scale=3):
    """
    Clean outliers; by default uses a box-plot rule with scale=3.
    :param data: pandas DataFrame
    :param col_name: column name
    :param scale: whisker scale (multiplier on the IQR)
    :return: cleaned DataFrame
    """

    def box_plot_outliers(data_ser, box_scale):
        """
        Flag outliers using the box-plot (IQR) rule.
        :param data_ser: pandas Series
        :param box_scale: whisker scale
        :return: boolean rules and bound values
        """
        iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25))
        val_low = data_ser.quantile(0.25) - iqr
        val_up = data_ser.quantile(0.75) + iqr
        rule_low = (data_ser < val_low)
        rule_up = (data_ser > val_up)
        return (rule_low, rule_up), (val_low, val_up)

    data_n = data.copy()
    data_series = data_n[col_name]
    rule, value = box_plot_outliers(data_series, box_scale=scale)
    index = np.arange(data_series.shape[0])[rule[0] | rule[1]]
    print("Delete number is: {}".format(len(index)))
    data_n = data_n.drop(index)
    data_n.reset_index(drop=True, inplace=True)
    print("Now column number is: {}".format(data_n.shape[0]))
    index_low = np.arange(data_series.shape[0])[rule[0]]
    outliers = data_series.iloc[index_low]
    print("Description of data less than the lower bound is:")
    print(pd.Series(outliers).describe())
    index_up = np.arange(data_series.shape[0])[rule[1]]
    outliers = data_series.iloc[index_up]
    print("Description of data larger than the upper bound is:")
    print(pd.Series(outliers).describe())

    fig, ax = plt.subplots(1, 2, figsize=(10, 7))
    sns.boxplot(y=data[col_name], data=data, palette="Set1", ax=ax[0])
    sns.boxplot(y=data_n[col_name], data=data_n, palette="Set1", ax=ax[1])
    return data_n

# Data cleaning
for i in ['v_8', 'v_23']:
    print(i)
    Train_data = outliers_proc(Train_data, i, scale=3)
v_8
Delete number is: 48536
Now column number is: 201464
Description of data less than the lower bound is:
count    4.853600e+04
mean     6.556511e-07
std      0.000000e+00
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      6.532669e-04
Name: v_8, dtype: float64
Description of data larger than the upper bound is:
count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: v_8, dtype: float64
v_23
Delete number is: 16326
Now column number is: 185138
Description of data less than the lower bound is:
count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: v_23, dtype: float64
Description of data larger than the upper bound is:
count    1.632600e+04
mean              inf
std      5.332031e-01
min      4.511719e+00
25%      4.730469e+00
50%      4.988281e+00
75%      5.351562e+00
max      8.578125e+00
Name: v_23, dtype: float64

3.3.1.2 Handling Other Outliers

Note: when analyzing the data in the previous section we found some extreme values in v_14 and price; here those extremes are treated as outliers and deleted.

Train_data = Train_data.drop(Train_data[Train_data['v_14']>8].index)
Train_data = Train_data.drop(Train_data[Train_data['price'] < 3].index)

3.3.1.3 Reducing Memory Usage

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

3.3.2 Feature Construction

# Log-transform the target so its distribution is closer to normal
Train_data['price'] = np.log1p(Train_data['price'])

# Concatenate train and test so features can be constructed on both at once
df = pd.concat([Train_data, Test_data], ignore_index=True)
# One-hot encode the low-cardinality categorical features
one_hot_list = ['fuelType', 'gearbox', 'notRepairedDamage', 'bodyType']
for col in one_hot_list:
    one_hot = pd.get_dummies(df[col])
    one_hot.columns = [col + '_' + str(i) for i in range(len(one_hot.columns))]
    df = pd.concat([df, one_hot], axis=1)
  • One-hot encoding makes categorical data usable by models: many machine learning algorithms cannot operate on categories directly, so category values must be converted to numbers, for both input and output variables.

  • We could use plain integer encoding and rescale as needed. That can work when there is a natural order between categories, for example temperature labels "cold" (0) and "hot" (1).

  • When no such order exists, integer encoding can cause problems; "dog" vs. "cat" labels are one example. In these cases we want the model to be more expressive and assign a probability-like number to each possible label value, which helps it fit the problem. When the output variable is one-hot encoded, it can also yield a richer set of predictions than a single label. (A small contrast example follows below.)

    Introduction to one-hot encoding: [什么是one hot编码?为什么要使用one hot编码? - 知乎 (zhihu.com)](https://zhuanlan.zhihu.com/p/37471802)
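
To make the contrast concrete, here is a minimal sketch (on toy labels, not the competition columns) of integer encoding with the LabelEncoder imported above versus one-hot encoding with pd.get_dummies:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

pets = pd.Series(['dog', 'cat', 'dog', 'bird'])

# Integer encoding imposes an artificial order: bird=0 < cat=1 < dog=2
print(LabelEncoder().fit_transform(pets))  # [2 1 2 0]

# One-hot encoding gives one independent indicator column per category
print(pd.get_dummies(pets))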


## 1. Step one: drop useless and near-constant columns
# SaleID is certainly useless by itself, but it can be used to count group sizes for other features
# name usually has little to mine, but there are quite a few duplicate names, so count them
df['name_count'] = df.groupby(['name'])['SaleID'].transform('count')
# del df['name']

# seller has one special value that appears only in the training set; drop that row
df.drop(df[df['seller'] == 0].index, inplace=True)
del df['offerType']
del df['seller']

## 2. Step two: handle missing values
# Fill all of the following features with 0
df['fuelType'] = df['fuelType'].fillna(0)
df['bodyType'] = df['bodyType'].fillna(0)
df['gearbox'] = df['gearbox'].fillna(0)
df['notRepairedDamage'] = df['notRepairedDamage'].fillna(0)
df['model'] = df['model'].fillna(0)

## 3. Step three: handle outliers
# So far the only clearly problematic values are in notRepairedDamage, plus power,
# whose range the competition statement constrains; cap power at 600 and map the
# '-' placeholder to missing (note: '-' entries become NaN here, after the fillna above)
df['power'] = df['power'].map(lambda x: 600 if x > 600 else x)
df['notRepairedDamage'] = df['notRepairedDamage'].astype('str').apply(lambda x: x if x != '-' else None).astype('float32')

Note: the following is my complete feature engineering pipeline; discussion with more experienced players is welcome.

## 1. Time and region features
# Time
from datetime import datetime
def date_process(x):
    year = int(str(x)[:4])
    month = int(str(x)[4:6])
    day = int(str(x)[6:8])

    # some raw dates have month 00; clamp them to January
    if month < 1:
        month = 1

    date = datetime(year, month, day)
    return date

df['regDate'] = df['regDate'].apply(date_process)
df['creatDate'] = df['creatDate'].apply(date_process)
df['regDate_year'] = df['regDate'].dt.year
df['regDate_month'] = df['regDate'].dt.month
df['regDate_day'] = df['regDate'].dt.day
df['creatDate_year'] = df['creatDate'].dt.year
df['creatDate_month'] = df['creatDate'].dt.month
df['creatDate_day'] = df['creatDate'].dt.day
df['car_age_day'] = (df['creatDate'] - df['regDate']).dt.days
df['car_age_year'] = round(df['car_age_day'] / 365, 1)

# average kilometers per year (any inf from zero-age cars is replaced with NaN later)
df['year_kilometer'] = df['kilometer'] / df['car_age_year']

# Region
df['regionCode_count'] = df.groupby(['regionCode'])['SaleID'].transform('count')
df['city'] = df['regionCode'].apply(lambda x: str(x)[:2])


## 2. Categorical features
# Bin the continuous features that can be treated as categories; kilometer is already binned
bins = [i * 10 for i in range(31)]
df['power_bin'] = pd.cut(df['power'], bins, labels=False)
tong = df[['power_bin', 'power']].head()


bins = [i * 10 for i in range(24)]
df['model_bin'] = pd.cut(df['model'], bins, labels=False)
tong = df[['model_bin', 'model']].head()
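
The overview in 3.2 also lists equal-frequency binning. As a hedged alternative to the equal-width pd.cut used above (not part of the original pipeline), pd.qcut would give bins holding roughly equal numbers of rows:

# Sketch: equal-frequency binning of power into 10 quantile buckets;
# duplicates='drop' merges repeated bin edges on heavily tied data.
df['power_qbin'] = pd.qcut(df['power'], q=10, labels=False, duplicates='drop')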

# Combine the moderately high-cardinality categorical features with price.
# Very many such groups were built; each group was tested separately and only
# the features that actually helped were kept in the end.
# Every group uses the same per-key price statistics, so we wrap the repeated
# block in a helper instead of copy-pasting it for each key.
def price_stats(source, key):
    """Per-group price statistics (count, max, median, min, sum, std, mean, skew, kurt, mad)."""
    all_info = {}
    for kind, kind_data in source.groupby(key):
        kind_data = kind_data[kind_data['price'] > 0]
        all_info[kind] = {
            key + '_amount': len(kind_data),
            key + '_price_max': kind_data.price.max(),
            key + '_price_median': kind_data.price.median(),
            key + '_price_min': kind_data.price.min(),
            key + '_price_sum': kind_data.price.sum(),
            key + '_price_std': kind_data.price.std(),
            key + '_price_mean': kind_data.price.mean(),
            key + '_price_skew': kind_data.price.skew(),
            key + '_price_kurt': kind_data.price.kurt(),
            key + '_price_mad': kind_data.price.mad(),
        }
    return pd.DataFrame(all_info).T.reset_index().rename(columns={"index": key})

# These statistics are computed on the training set only, so no information
# from the test rows leaks into the aggregates.
for key in ["regionCode", "brand", "kilometer", "bodyType", "fuelType", "v_8"]:
    df = df.merge(price_stats(Train_data, key), how='left', on=key)

# model_bin and car_age_year only exist on df (they were constructed above);
# test rows have no price and are filtered out by the price > 0 rule.
for key in ["model_bin", "car_age_year"]:
    df = df.merge(price_stats(df, key), how='left', on=key)





# Testing the categorical features against price showed real gains, so model and
# the other key categories immediately got the same treatment: per-group
# statistics over car age, power, and a few high-importance anonymous features.
cont_cols = {'car_age_day': 'days', 'power': 'power', 'v_0': 'v_0',
             'v_3': 'v_3', 'v_16': 'v_16', 'v_18': 'v_18'}
for kk in ["regionCode", "brand", "model", "bodyType", "fuelType"]:
    for col, label in cont_cols.items():
        all_info = {}
        for kind, kind_data in df.groupby(kk):
            kind_data = kind_data[kind_data[col] > 0]
            all_info[kind] = {
                kk + '_' + label + '_max': kind_data[col].max(),
                kk + '_' + label + '_min': kind_data[col].min(),
                kk + '_' + label + '_std': kind_data[col].std(),
                kk + '_' + label + '_mean': kind_data[col].mean(),
                kk + '_' + label + '_median': kind_data[col].median(),
                kk + '_' + label + '_sum': kind_data[col].sum(),
            }
        brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": kk})
        df = df.merge(brand_fe, how='left', on=kk)

## 3. Continuous numeric features
# These are all anonymous features. Comparing train and test distributions
# showed essentially no problems, so for now we keep all of them;
# later we may drop features that are too similar to one another.
# Use the continuous features ranked most important by a quick LightGBM model
# to characterize price per group.
# kk = "regionCode"
# # dd = 'v_3'  [0, 3, 6, 11, 16, 17, 18]
# for dd in ['v_0','v_1','v_3','v_16','v_17','v_18','v_22','v_23']:
#     Train_gb = df.groupby(kk)
#     all_info = {}
#     for kind, kind_data in Train_gb:
#         info = {}
#         kind_data = kind_data[kind_data[dd] > -10000000]
#         info[kk+'_'+dd+'_max'] = kind_data[dd].max()
#         info[kk+'_'+dd+'_min'] = kind_data[dd].min()
#         info[kk+'_'+dd+'_std'] = kind_data[dd].std()
#         info[kk+'_'+dd+'_mean'] = kind_data[dd].mean()
#         info[kk+'_'+dd+'_median'] = kind_data[dd].median()
#         info[kk+'_'+dd+'_sum'] = kind_data[dd].sum()
#         all_info[kind] = info
#     brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": kk})
#     df = df.merge(brand_fe, how='left', on=kk)



# Squares, cubes, and logs of all the anonymous features
for i in ['v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14', 'v_15', 'v_16', 'v_17', 'v_18', 'v_19', 'v_20', 'v_21', 'v_22', 'v_23']:
    df[i + '**2'] = df[i] ** 2
    df[i + '**3'] = df[i] ** 3
    df[i + 'log'] = np.log1p(df[i])  # undefined for v <= -1; resulting inf values are replaced with NaN below
## Construct polynomial features (specific powers chosen by experiment)



df['v_0_2']=(df['v_0'])**2

df['v_3_6']=(df['v_3'])**6
df['v_3_9']=(df['v_3'])**9

df['v_6_8']=(df['v_6'])**8

df['v_7_2']=(df['v_7'])**2
df['v_7_8']=(df['v_7'])**8
df['v_7_12']=(df['v_7'])**12

df['v_10_6']=(df['v_10'])**6

df['v_11_2']=(df['v_11'])**2
df['v_11_3']=(df['v_11'])**3
df['v_11_4']=(df['v_11'])**4
df['v_11_6']=(df['v_11'])**6
df['v_11_8']=(df['v_11'])**8
df['v_11_9']=(df['v_11'])**9

for i in [2,3,4,6,8]:
    df['v_15_'+str(i)]=(df['v_15'])**i

df['v_16_6']=(df['v_16'])**6

df['v_18_9']=(df['v_18'])**9

df['v_21_8']=(df['v_21'])**8

df['v_22_2']=(df['v_22'])**2
df['v_22_8']=(df['v_22'])**8
df['v_22_12']=(df['v_22'])**12

df['v_23_9']=(df['v_23'])**9
df['v_23_18']=(df['v_23'])**18
for i in [2,3,4,27,18]:
    df['kilometer_'+str(i)]=(df['kilometer'])**i
for i in [8,9,27,18]:
    df['bodyType_'+str(i)]=(df['bodyType'])**i
for i in [2,3,4,6,8,9,12,28]:
    df['gearbox_'+str(i)]=(df['gearbox'])**i
## Feature crossing, mainly among the anonymous features and a few
## high-importance categorical features
# Batch 1: pairwise products
for i in range(24):  # range(23)
    for j in range(24):
        df['new' + str(i) + '*' + str(j)] = df['v_' + str(i)] * df['v_' + str(j)]


# Batch 2: pairwise sums
for i in range(24):
    for j in range(24):
        df['new' + str(i) + '+' + str(j)] = df['v_' + str(i)] + df['v_' + str(j)]

# Batch 3: products with power and car age
for i in range(24):
    df['new' + str(i) + '*power'] = df['v_' + str(i)] * df['power']

for i in range(24):
    df['new' + str(i) + '*day'] = df['v_' + str(i)] * df['car_age_day']

for i in range(24):
    df['new' + str(i) + '*year'] = df['v_' + str(i)] * df['car_age_year']


# Batch 4: pairwise differences and ratios (division by zero yields inf,
# replaced with NaN below)
for i in range(24):
    for j in range(24):
        df['new' + str(i) + '-' + str(j)] = df['v_' + str(i)] - df['v_' + str(j)]
        df['new' + str(i) + '/' + str(j)] = df['v_' + str(i)] / df['v_' + str(j)]
'''
Polynomial interaction features; degree 3 tested best.
'''
from sklearn import preprocessing
feature_cols = ['v_0', 'v_3', 'v_18', 'v_16']
poly_data = df[feature_cols]
poly = preprocessing.PolynomialFeatures(3, interaction_only=True)
poly_data_ndarray = poly.fit_transform(poly_data)
# Carry over df's index so the index-based merge below aligns rows correctly
poly_data_final = pd.DataFrame(poly_data_ndarray,
                               columns=poly.get_feature_names(poly_data.columns),
                               index=poly_data.index)
poly_data_final.drop(columns=['v_0', 'v_3', 'v_18', 'v_16'], inplace=True)
# Join the transformed features back onto the original frame
df = pd.merge(df, poly_data_final, how='left', right_index=True, left_index=True)
df.drop(columns=['1'], inplace=True)

# Replace inf values with NaN
df.replace([np.inf, -np.inf], np.nan, inplace=True)
# df = df.fillna(method='ffill')
feature_aggs = {}
# for i in sparse_feature:
for i in ['name', 'model', 'regionCode']:
    feature_aggs[i] = ['count', 'nunique']
for j in ['power', 'kilometer', 'car_age_day']:  # ,'v_4','v_8','v_10','v_12','v_13'
    feature_aggs[j] = ['mean', 'max', 'min', 'std', 'median', 'count']
def create_new_feature(df):
    result = df.copy()
#     for feature in sparse_feature:
    for feature in ['name', 'model', 'regionCode']:
        aggs = feature_aggs.copy()
        aggs.pop(feature)  # don't aggregate the grouping key itself
        grouped = result.groupby(feature).agg(aggs)
        grouped.columns = ['{}_{}_{}'.format(feature, i[0], i[1]) for i in grouped.columns]
        grouped = grouped.reset_index().rename(columns={0: feature})
        result = pd.merge(result, grouped, how='left', on=feature)
    return result
df = create_new_feature(df)
from tqdm import tqdm
from scipy.stats import entropy

feat_cols = []

### Count encoding
for f in tqdm(['car_age_year', 'model', 'brand', 'regionCode']):
    df[f + '_count'] = df[f].map(df[f].value_counts())
    feat_cols.append(f + '_count')

# ### Use numeric features to build statistical descriptions of the categorical
# ### features; a few anonymous features most correlated with price were picked
# for f1 in tqdm(['model', 'brand', 'regionCode']):
#     group = data.groupby(f1, as_index=False)
#     for f2 in tqdm(['v_0', 'v_3', 'v_8', 'v_12']):
#         feat = group[f2].agg({
#             '{}_{}_max'.format(f1, f2): 'max', '{}_{}_min'.format(f1, f2): 'min',
#             '{}_{}_median'.format(f1, f2): 'median', '{}_{}_mean'.format(f1, f2): 'mean',
#             '{}_{}_std'.format(f1, f2): 'std', '{}_{}_mad'.format(f1, f2): 'mad'
#         })
#         data = data.merge(feat, on=f1, how='left')
#         feat_list = list(feat)
#         feat_list.remove(f1)
#         feat_cols.extend(feat_list)


### Second-order crosses of categorical features
for f_pair in tqdm([['model', 'brand'], ['model', 'regionCode'], ['brand', 'regionCode']]):
    ### Co-occurrence count
    df['_'.join(f_pair) + '_count'] = df.groupby(f_pair)['SaleID'].transform('count')
    ### nunique and entropy
    df = df.merge(df.groupby(f_pair[0], as_index=False)[f_pair[1]].agg({
        '{}_{}_nunique'.format(f_pair[0], f_pair[1]): 'nunique',
        '{}_{}_ent'.format(f_pair[0], f_pair[1]): lambda x: entropy(x.value_counts() / x.shape[0])
    }), on=f_pair[0], how='left')
    df = df.merge(df.groupby(f_pair[1], as_index=False)[f_pair[0]].agg({
        '{}_{}_nunique'.format(f_pair[1], f_pair[0]): 'nunique',
        '{}_{}_ent'.format(f_pair[1], f_pair[0]): lambda x: entropy(x.value_counts() / x.shape[0])
    }), on=f_pair[1], how='left')
    ### Proportional preference: each pair's count relative to its parent categories
    df['{}_in_{}_prop'.format(f_pair[0], f_pair[1])] = df['_'.join(f_pair) + '_count'] / df[f_pair[1] + '_count']
    df['{}_in_{}_prop'.format(f_pair[1], f_pair[0])] = df['_'.join(f_pair) + '_count'] / df[f_pair[0] + '_count']
    
    feat_cols.extend([
        '_'.join(f_pair) + '_count',
        '{}_{}_nunique'.format(f_pair[0], f_pair[1]), '{}_{}_ent'.format(f_pair[0], f_pair[1]),
        '{}_{}_nunique'.format(f_pair[1], f_pair[0]), '{}_{}_ent'.format(f_pair[1], f_pair[0]),
        '{}_in_{}_prop'.format(f_pair[0], f_pair[1]), '{}_in_{}_prop'.format(f_pair[1], f_pair[0])
    ])

That completes feature construction: over a thousand features. Many of them will inevitably hurt the prediction, and some will be highly correlated with each other (feature redundancy), so next we perform feature selection.

3.3.3 Feature Selection

1) Filter

# Correlation analysis
f = []
numerical_cols = df.select_dtypes(exclude='object').columns
feature_cols = [col for col in numerical_cols if col not in
                ['name', 'regDate', 'creatDate', 'model', 'brand', 'regionCode', 'seller', 'regDates', 'creatDates']]
for i in feature_cols:
    print(i, df[i].corr(df['price'], method='spearman'))
    f.append([i, df[i].corr(df['price'], method='spearman')])
f.sort(key=lambda x: x[1])

f.sort(key=lambda x: abs(x[1]), reverse=True)
new_f = []
for i, j in f:
    if abs(j) > 0.8:
        new_f.append(i)
    print(i, j)

Here we keep only the features whose absolute Spearman correlation with price exceeds 0.8; the remaining features contribute little to predicting price and are not considered further.
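
Correlation is only one of the filter criteria listed in 3.2. As a hedged sketch (not part of the original pipeline), scikit-learn's mutual information can also rank the pre-filtered features by their nonlinear dependence on price:

from sklearn.feature_selection import mutual_info_regression

feats = [c for c in new_f if c != 'price']  # price itself passes the 0.8 filter; drop it
train_mask = df['price'].notnull()          # test rows have no price
X = df.loc[train_mask, feats].fillna(0)
y = df.loc[train_mask, 'price']
mi = mutual_info_regression(X, y, random_state=0)
print(pd.Series(mi, index=feats).sort_values(ascending=False).head(10))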

# Alternatively, inspect the correlations visually
data_numeric = df[['power', 'kilometer', 'brand_amount', 'brand_price_mean',
                   'brand_price_max', 'brand_price_median']]
correlation = data_numeric.corr()

f, ax = plt.subplots(figsize=(7, 7))
plt.title('Correlation of Numeric Features with Price', y=1, size=16)
sns.heatmap(correlation, square=True, vmax=0.8)

[Figure: correlation heatmap of the selected numeric features]

2) Wrapper

!pip install mlxtend
# With a large k_features this is very slow; without a server it was interrupted early
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
sfs = SFS(LinearRegression(),
          k_features=10,
          forward=True,
          floating=False,
          scoring='r2',
          cv=0)
# Fit on the training rows only (test rows have no price)
train_mask = df['price'].notnull()
x = df.loc[train_mask, feature_cols].drop(columns=['price']).fillna(0)
y = df.loc[train_mask, 'price']
sfs.fit(x, y)
sfs.k_feature_names_

The code above takes far too long to run; trying it is not recommended.

# Plot the results to see the marginal gain of each added feature
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
fig1 = plot_sfs(sfs.get_metric_dict(), kind='std_dev')
plt.grid()
plt.show()

[Figure: sequential feature selection score vs. number of selected features]
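
3) Embedded

The overview in 3.2 also lists embedded selection (lasso). Here is a minimal sketch reusing the x and y prepared in the wrapper section, under the same assumptions and not part of the original pipeline; features whose coefficients are driven exactly to zero are dropped:

from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X_num = x.select_dtypes('number')                 # lasso needs numeric input
X_scaled = StandardScaler().fit_transform(X_num)  # and is scale-sensitive
lasso = Lasso(alpha=0.1).fit(X_scaled, y)
kept = [c for c, w in zip(X_num.columns, lasso.coef_) if w != 0]
print('kept {} of {} features'.format(len(kept), X_num.shape[1]))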

3.4 Lessons Learned

Feature engineering is the most critical part of a competition; how well it is done often determines the final ranking and score.

Feature construction is one part of feature engineering, and its purpose is to strengthen the expressive power of the data.

  • Some competitions provide anonymous features, so we do not know how the features relate to each other. In that case we can only work from the features themselves: binning, groupby-style statistics, further log/exp transforms, arithmetic combinations of several features (such as the usage time computed above), polynomial combinations, and then selection. Feature anonymity really does limit what can be done, although sometimes using an NN to extract features achieves unexpectedly good results.
  • When the meaning of the features is known (non-anonymous), especially in industrial competitions, more genuinely meaningful features can be built from signal processing, frequency-domain extraction, abundance, skewness, and the like; this is feature construction grounded in domain context. The same applies in recommender systems: click-through-rate statistics of various kinds, per-time-slot statistics, statistics conditioned on user attributes, and so on. This kind of construction requires digging into the business logic, or the underlying physics, to find the real magic.

Of course, feature engineering goes hand in hand with the model: this is exactly why binning and feature normalization matter for LR and NN models, and why the effect of a given feature treatment, and feature importance in general, usually has to be validated through the model.

All in all, feature engineering is easy to get started with but very hard to truly master.

Task 3: Feature Engineering END.
