文章目录
1.数据集获取
1.1 数据集介绍
1.2 导入必要的工具包
1.3 数据读取
2.数据探索
2.1 查看标签的分布情况
2.2 查看缺失值
2.3 查看特征数据类型
2.4 异常点检测
2.5 相关性检验
2.6 年龄对还款的影响(探索)
2.7 外部数据源
文章正文
1. 数据集获取
1.1 数据集介绍
本次使用的数据集来源于Kaggle平台,由home credict提供,该公司的服务致力于向无银行账户的人群提供信贷。
我们只用到以下数据集:
- application_train/application_test:
– 关于Home credit贷款申请信息的训练和测试数据集,客户由SK-ID-CURR标识。
– TARGET为0表示贷款已还清,TARGET为1表示贷款未还清。
1.2导入必要的工具包
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns',None)
plt.style.use('fivethirtyeight')
1.3数据读取
1.3.1 读取训练集
df_train=pd.read_csv('./application_train.csv')
df_train.head()
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
5 rows × 122 columns
1.3.2 训练集数据整体概览
df_train.info()
# 输出信息
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
1.3.3 读取测试集
df_test=pd.read_csv('./application_test.CSV')
df_test.info()
# 输出信息
SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100001 Cash loans F N Y 0 135000.0 568800.0 20560.5 450000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
1 100005 Cash loans M N Y 0 99000.0 222768.0 17370.0 180000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0
2 100013 Cash loans M Y Y 0 202500.0 663264.0 69777.0 630000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 1.0 4.0
3 100028 Cash loans F N Y 2 315000.0 1575000.0 49018.5 1575000.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0
4 100038 Cash loans M Y N 1 180000.0 625500.0 32067.0 625500.0 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
5 rows × 121 columns
1.3.4 测试集数据整体概览
df_test.info()
# 输出信息
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB
2. 数据探索
- 数据探索(EDA)通过统计和图表发现数据趋势、规律、模式、关系和异常的开放式过程,目的是从数据中发现有用的信息。
2.1 查看标签的分布情况
- 标签就是我们要预测的目标,0表示及时还款,1表示违约,首先来看两者数量的对比。
- 通过value_counts方法可以统计每一类的计数。
df_train['TARGET'].value_counts()
# 输出信息
0 282686
1 24825
Name: TARGET, dtype: int64
- 画图查看标签分布情况。
df_train['TARGET'].plot.hist()
- 结论:不管是通过value_counts方法还是绘图方式可以看出数据存在类别不平衡问题即多数类样本是少数类样本数量的10倍以上。
2.2 查看缺失值
- 机器学习和数据挖掘所使用的数据永远不可能是完美的,因为实际收集的数据常常会缺失重要的字段,因此我们要经常用到缺失值处理。
- 我们要查看每列特征的缺失值数量和占比,构建一个缺失值统计函数。
def missing_values(df):
missing_val=df.isnull().sum()
missing_val_percent=100*df.isnull().sum()/len(df)
missing_val_data=pd.concat([missing_val,missing_val_percent],axis=1)
missing_val_data=missing_val_data.rename(columns={0:'Missing Values',1:'% of Total Values'})
missing_val_data=missing_val_data[missing_val_data.iloc[:,1]!=0].sort_values('% of Total Values',ascending=False).round(1)
print('数据集有:'+str(df.shape[1])+'列,''其中'+str(missing_value_data.shape[0])+'列存在缺失值')
return missing_val_data
# 也可以实时编写一段:
missing_val=df_train.isnull().sum()
missing_val_percent=100*missing_val/len(df_train)
missing_val_data=pd.concat([missing_val,missing_val_percent],axis=1)
missing_val_data=missing_val_data.rename(columns={0:'Missing Values',1:'% of Total Values'})
missing_val_data=missing_val_data[missing_val_data.iloc[:,1]!=0].sort_values('% of Total Values',ascending=False).round(2)
# 输出结果:
Missing Values % of Total Values
COMMONAREA_MEDI 214865 69.87
COMMONAREA_AVG 214865 69.87
COMMONAREA_MODE 214865 69.87
NONLIVINGAPARTMENTS_MEDI 213514 69.43
NONLIVINGAPARTMENTS_MODE 213514 69.43
... ... ...
OBS_60_CNT_SOCIAL_CIRCLE 1021 0.33
DEF_60_CNT_SOCIAL_CIRCLE 1021 0.33
OBS_30_CNT_SOCIAL_CIRCLE 1021 0.33
EXT_SOURCE_2 660 0.21
AMT_GOODS_PRICE 278 0.09
64 rows × 2 columns
# 缺失值有多少列
missing_val_data.shape[0]
64
- 查看缺失值占比大于20%的特征有哪些
missing_val_data[missing_val_data['% of Total Values']>20]
# 输出结果
Missing Values % of Total Values
COMMONAREA_MEDI 214865 69.87
COMMONAREA_AVG 214865 69.87
COMMONAREA_MODE 214865 69.87
NONLIVINGAPARTMENTS_MEDI 213514 69.43
NONLIVINGAPARTMENTS_MODE 213514 69.43
NONLIVINGAPARTMENTS_AVG 213514 69.43
FONDKAPREMONT_MODE 210295 68.39
LIVINGAPARTMENTS_MODE 210199 68.35
LIVINGAPARTMENTS_MEDI 210199 68.35
LIVINGAPARTMENTS_AVG 210199 68.35
FLOORSMIN_MODE 208642 67.85
FLOORSMIN_MEDI 208642 67.85
FLOORSMIN_AVG 208642 67.85
YEARS_BUILD_MODE 204488 66.50
YEARS_BUILD_MEDI 204488 66.50
YEARS_BUILD_AVG 204488 66.50
OWN_CAR_AGE 202929 65.99
LANDAREA_AVG 182590 59.38
LANDAREA_MEDI 182590 59.38
LANDAREA_MODE 182590 59.38
BASEMENTAREA_MEDI 179943 58.52
BASEMENTAREA_AVG 179943 58.52
BASEMENTAREA_MODE 179943 58.52
EXT_SOURCE_1 173378 56.38
NONLIVINGAREA_MEDI 169682 55.18
NONLIVINGAREA_MODE 169682 55.18
NONLIVINGAREA_AVG 169682 55.18
ELEVATORS_MEDI 163891 53.30
ELEVATORS_MODE 163891 53.30
ELEVATORS_AVG 163891 53.30
WALLSMATERIAL_MODE 156341 50.84
APARTMENTS_MODE 156061 50.75
APARTMENTS_MEDI 156061 50.75
APARTMENTS_AVG 156061 50.75
ENTRANCES_MODE 154828 50.35
ENTRANCES_AVG 154828 50.35
ENTRANCES_MEDI 154828 50.35
LIVINGAREA_MEDI 154350 50.19
LIVINGAREA_MODE 154350 50.19
LIVINGAREA_AVG 154350 50.19
HOUSETYPE_MODE 154297 50.18
FLOORSMAX_MEDI 153020 49.76
FLOORSMAX_AVG 153020 49.76
FLOORSMAX_MODE 153020 49.76
YEARS_BEGINEXPLUATATION_AVG 150007 48.78
YEARS_BEGINEXPLUATATION_MEDI 150007 48.78
YEARS_BEGINEXPLUATATION_MODE 150007 48.78
TOTALAREA_MODE 148431 48.27
EMERGENCYSTATE_MODE 145755 47.40
OCCUPATION_TYPE 96391 31.35
- 统计缺失值大于20%的特征数量
missing_val_data[missing_val_data['% of Total Values']>20].count()
Missing Values 50 % of Total Values 50 dtype: int64
2.3 查看特征数据类型
- 统计特征数据类型数量,int64和float64都是数值型特征,objects是类别型特征。
df_train.dtypes
df_train.dtypes
# 输出结果:
SK_ID_CURR int64
TARGET int64
NAME_CONTRACT_TYPE object
CODE_GENDER object
FLAG_OWN_CAR object
...
AMT_REQ_CREDIT_BUREAU_DAY float64
AMT_REQ_CREDIT_BUREAU_WEEK float64
AMT_REQ_CREDIT_BUREAU_MON float64
AMT_REQ_CREDIT_BUREAU_QRT float64
AMT_REQ_CREDIT_BUREAU_YEAR float64
Length: 122, dtype: object
df_train.dtypes.value_counts()
df_train.dtypes.values()
# 输出结果:
float64 65
int64 41
object 16
dtype: int64
- 查看object(类别型)特征有多少种取值。
df_train.select_dtypes('object')
df_train.select_dtypes('object')
# 输出结果:
NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE OCCUPATION_TYPE WEEKDAY_APPR_PROCESS_START ORGANIZATION_TYPE FONDKAPREMONT_MODE HOUSETYPE_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE
0 Cash loans M N Y Unaccompanied Working Secondary / secondary special Single / not married House / apartment Laborers WEDNESDAY Business Entity Type 3 reg oper account block of flats Stone, brick No
1 Cash loans F N N Family State servant Higher education Married House / apartment Core staff MONDAY School reg oper account block of flats Block No
2 Revolving loans M Y Y Unaccompanied Working Secondary / secondary special Single / not married House / apartment Laborers MONDAY Government NaN NaN NaN NaN
3 Cash loans F N Y Unaccompanied Working Secondary / secondary special Civil marriage House / apartment Laborers WEDNESDAY Business Entity Type 3 NaN NaN NaN NaN
4 Cash loans M N Y Unaccompanied Working Secondary / secondary special
- 每一种object的特征都多少类型的取值。
df_train.select_dtypes('object').apply(pd.Series.nunique,axis=0)
df_train.select_dtypes('object').apply(pd.Series.nunique,axis=0)
# 输出结果:
NAME_CONTRACT_TYPE 2
CODE_GENDER 3
FLAG_OWN_CAR 2
FLAG_OWN_REALTY 2
NAME_TYPE_SUITE 7
NAME_INCOME_TYPE 8
NAME_EDUCATION_TYPE 5
NAME_FAMILY_STATUS 6
NAME_HOUSING_TYPE 6
OCCUPATION_TYPE 18
WEEKDAY_APPR_PROCESS_START 7
ORGANIZATION_TYPE 58
FONDKAPREMONT_MODE 4
HOUSETYPE_MODE 3
WALLSMATERIAL_MODE 7
EMERGENCYSTATE_MODE 2
dtype: int64
- 结论:大部分类别型(object)特征只有少量的取值,后续要对它们做编码操作,即字符特征转换为数值特征。
2.4 异常点检测
- 数据探索一般会对特征做异常点检测,异常点又称为离群点,它的定义是‘样本中的一个或几个观测值离其他观测值较远,暗示它们可能来自不同的总体’。
- 异常点的产生有可能来源于误输入,设备测量错误,数据极端情况,所以在实际业务中,需要跟数据采集人员确认产生原因。