(Embarking on a Meaningful Journey) Data Exploration -- Finding Patterns and Revealing Insights

Table of Contents

1. Dataset Acquisition

1.1 Dataset Introduction
1.2 Import Required Packages
1.3 Load the Data

2. Data Exploration

2.1 Examine the label distribution
2.2 Examine missing values
2.3 Examine feature data types
2.4 Outlier detection
2.5 Correlation analysis
2.6 The effect of age on repayment (exploration)
2.7 External data sources

Main Text

1. Dataset Acquisition

1.1 Dataset Introduction

The dataset comes from the Kaggle platform and was provided by Home Credit, a company whose services focus on extending credit to people without bank accounts.
We will only use the following datasets:

  • application_train/application_test:
    – The training and test sets of Home Credit loan application records; each client is identified by SK_ID_CURR.
    – TARGET = 0 means the loan was repaid, TARGET = 1 means the loan was not repaid (a quick sanity check of these two fields follows the data loading in section 1.3).

1.2 Import Required Packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')            # silence warnings to keep the output clean
pd.set_option('display.max_columns',None)    # show all columns when printing DataFrames
plt.style.use('fivethirtyeight')             # use a cleaner default plotting style

1.3 Load the Data

1.3.1 Read the training set
df_train=pd.read_csv('./application_train.csv')
df_train.head()
SK_ID_CURR	TARGET	NAME_CONTRACT_TYPE	CODE_GENDER	FLAG_OWN_CAR	FLAG_OWN_REALTY	CNT_CHILDREN	AMT_INCOME_TOTAL	AMT_CREDIT	AMT_ANNUITY	...	FLAG_DOCUMENT_18	FLAG_DOCUMENT_19	FLAG_DOCUMENT_20	FLAG_DOCUMENT_21	AMT_REQ_CREDIT_BUREAU_HOUR	AMT_REQ_CREDIT_BUREAU_DAY	AMT_REQ_CREDIT_BUREAU_WEEK	AMT_REQ_CREDIT_BUREAU_MON	AMT_REQ_CREDIT_BUREAU_QRT	AMT_REQ_CREDIT_BUREAU_YEAR
0	100002	1	Cash loans	M	N	Y	0	202500.0	406597.5	24700.5	...	0	0	0	0	0.0	0.0	0.0	0.0	0.0	1.0
1	100003	0	Cash loans	F	N	N	0	270000.0	1293502.5	35698.5	...	0	0	0	0	0.0	0.0	0.0	0.0	0.0	0.0
2	100004	0	Revolving loans	M	Y	Y	0	67500.0	135000.0	6750.0	...	0	0	0	0	0.0	0.0	0.0	0.0	0.0	0.0
3	100006	0	Cash loans	F	N	Y	0	135000.0	312682.5	29686.5	...	0	0	0	0	NaN	NaN	NaN	NaN	NaN	NaN
4	100007	0	Cash loans	M	N	Y	0	121500.0	513000.0	21865.5	...	0	0	0	0	0.0	0.0	0.0	0.0	0.0	0.0
5 rows × 122 columns
1.3.2 Training set overview
df_train.info()
# Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
1.3.3 Read the test set
df_test=pd.read_csv('./application_test.csv')
df_test.head()
# Output
SK_ID_CURR	NAME_CONTRACT_TYPE	CODE_GENDER	FLAG_OWN_CAR	FLAG_OWN_REALTY	CNT_CHILDREN	AMT_INCOME_TOTAL	AMT_CREDIT	AMT_ANNUITY	AMT_GOODS_PRICE	...	FLAG_DOCUMENT_18	FLAG_DOCUMENT_19	FLAG_DOCUMENT_20	FLAG_DOCUMENT_21	AMT_REQ_CREDIT_BUREAU_HOUR	AMT_REQ_CREDIT_BUREAU_DAY	AMT_REQ_CREDIT_BUREAU_WEEK	AMT_REQ_CREDIT_BUREAU_MON	AMT_REQ_CREDIT_BUREAU_QRT	AMT_REQ_CREDIT_BUREAU_YEAR
0	100001	Cash loans	F	N	Y	0	135000.0	568800.0	20560.5	450000.0	...	0	0	0	0	0.0	0.0	0.0	0.0	0.0	0.0
1	100005	Cash loans	M	N	Y	0	99000.0	222768.0	17370.0	180000.0	...	0	0	0	0	0.0	0.0	0.0	0.0	0.0	3.0
2	100013	Cash loans	M	Y	Y	0	202500.0	663264.0	69777.0	630000.0	...	0	0	0	0	0.0	0.0	0.0	0.0	1.0	4.0
3	100028	Cash loans	F	N	Y	2	315000.0	1575000.0	49018.5	1575000.0	...	0	0	0	0	0.0	0.0	0.0	0.0	0.0	3.0
4	100038	Cash loans	M	Y	N	1	180000.0	625500.0	32067.0	625500.0	...	0	0	0	0	NaN	NaN	NaN	NaN	NaN	NaN
5 rows × 121 columns
1.3.4 Test set overview
df_test.info()
# Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB
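  • As a quick sanity check of the fields described in section 1.1 (an illustrative snippet added here, not part of the original workflow): SK_ID_CURR should uniquely identify each application, and the TARGET label should exist only in the training set.
print(df_train['SK_ID_CURR'].is_unique)   # expected: True -- one row per application
print(df_test['SK_ID_CURR'].is_unique)    # expected: True
print(df_train['TARGET'].unique())        # expected: [1 0], i.e. a binary label
print('TARGET' in df_test.columns)        # expected: False -- the test set has no label (121 vs 122 columns)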

2. Data Exploration

  • Exploratory data analysis (EDA) is an open-ended process that uses statistics and visualizations to discover trends, patterns, relationships, and anomalies in the data, with the goal of extracting useful information from it.

2.1 Examine the label distribution

  • The label is the target we want to predict: 0 means the loan was repaid on time, and 1 means the client defaulted. First, compare the counts of the two classes.
  • The value_counts method gives the count of each class.
df_train['TARGET'].value_counts()
# Output
0    282686
1     24825
Name: TARGET, dtype: int64
  • Plot the label distribution.
df_train['TARGET'].plot.hist()

[Figure: histogram of the TARGET label distribution]

  • Conclusion: both the value_counts output and the plot show a class-imbalance problem: the majority class has more than 10 times as many samples as the minority class (the exact ratio is computed in the sketch below).
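  • To put a number on the imbalance (a small illustrative addition based only on the value_counts output above):
counts = df_train['TARGET'].value_counts()
print(counts / counts.sum())              # class proportions: roughly 92% repaid vs 8% not repaid
print(counts.loc[0] / counts.loc[1])      # imbalance ratio: roughly 11.4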

2.2 Examine missing values

  • The data used in machine learning and data mining are never perfect: real-world data collection often leaves important fields empty, so handling missing values is a routine step.
  • To see the number and percentage of missing values in each column, we build a small missing-value summary function.
def missing_values(df):
	missing_val=df.isnull().sum()
	missing_val_percent=100*df.isnull().sum()/len(df)
	missing_val_data=pd.concat([missing_val,missing_val_percent],axis=1)
	missing_val_data=missing_val_data.rename(columns={0:'Missing Values',1:'% of Total Values'})
	missing_val_data=missing_val_data[missing_val_data.iloc[:,1]!=0].sort_values('% of Total Values',ascending=False).round(1)
	print('The dataset has '+str(df.shape[1])+' columns, of which '+str(missing_val_data.shape[0])+' contain missing values')
	return missing_val_data
# Alternatively, write it inline:
missing_val=df_train.isnull().sum()
missing_val_percent=100*missing_val/len(df_train)
missing_val_data=pd.concat([missing_val,missing_val_percent],axis=1)
missing_val_data=missing_val_data.rename(columns={0:'Missing Values',1:'% of Total Values'})
missing_val_data=missing_val_data[missing_val_data.iloc[:,1]!=0].sort_values('% of Total Values',ascending=False).round(2)
# Output:
	Missing Values	% of Total Values
COMMONAREA_MEDI	214865	69.87
COMMONAREA_AVG	214865	69.87
COMMONAREA_MODE	214865	69.87
NONLIVINGAPARTMENTS_MEDI	213514	69.43
NONLIVINGAPARTMENTS_MODE	213514	69.43
...	...	...
OBS_60_CNT_SOCIAL_CIRCLE	1021	0.33
DEF_60_CNT_SOCIAL_CIRCLE	1021	0.33
OBS_30_CNT_SOCIAL_CIRCLE	1021	0.33
EXT_SOURCE_2	660	0.21
AMT_GOODS_PRICE	278	0.09
64 rows × 2 columns
# how many columns have missing values
missing_val_data.shape[0]
64
  • Check which features have more than 20% of their values missing.
missing_val_data[missing_val_data['% of Total Values']>20]
# Output
Missing Values	% of Total Values
COMMONAREA_MEDI	214865	69.87
COMMONAREA_AVG	214865	69.87
COMMONAREA_MODE	214865	69.87
NONLIVINGAPARTMENTS_MEDI	213514	69.43
NONLIVINGAPARTMENTS_MODE	213514	69.43
NONLIVINGAPARTMENTS_AVG	213514	69.43
FONDKAPREMONT_MODE	210295	68.39
LIVINGAPARTMENTS_MODE	210199	68.35
LIVINGAPARTMENTS_MEDI	210199	68.35
LIVINGAPARTMENTS_AVG	210199	68.35
FLOORSMIN_MODE	208642	67.85
FLOORSMIN_MEDI	208642	67.85
FLOORSMIN_AVG	208642	67.85
YEARS_BUILD_MODE	204488	66.50
YEARS_BUILD_MEDI	204488	66.50
YEARS_BUILD_AVG	204488	66.50
OWN_CAR_AGE	202929	65.99
LANDAREA_AVG	182590	59.38
LANDAREA_MEDI	182590	59.38
LANDAREA_MODE	182590	59.38
BASEMENTAREA_MEDI	179943	58.52
BASEMENTAREA_AVG	179943	58.52
BASEMENTAREA_MODE	179943	58.52
EXT_SOURCE_1	173378	56.38
NONLIVINGAREA_MEDI	169682	55.18
NONLIVINGAREA_MODE	169682	55.18
NONLIVINGAREA_AVG	169682	55.18
ELEVATORS_MEDI	163891	53.30
ELEVATORS_MODE	163891	53.30
ELEVATORS_AVG	163891	53.30
WALLSMATERIAL_MODE	156341	50.84
APARTMENTS_MODE	156061	50.75
APARTMENTS_MEDI	156061	50.75
APARTMENTS_AVG	156061	50.75
ENTRANCES_MODE	154828	50.35
ENTRANCES_AVG	154828	50.35
ENTRANCES_MEDI	154828	50.35
LIVINGAREA_MEDI	154350	50.19
LIVINGAREA_MODE	154350	50.19
LIVINGAREA_AVG	154350	50.19
HOUSETYPE_MODE	154297	50.18
FLOORSMAX_MEDI	153020	49.76
FLOORSMAX_AVG	153020	49.76
FLOORSMAX_MODE	153020	49.76
YEARS_BEGINEXPLUATATION_AVG	150007	48.78
YEARS_BEGINEXPLUATATION_MEDI	150007	48.78
YEARS_BEGINEXPLUATATION_MODE	150007	48.78
TOTALAREA_MODE	148431	48.27
EMERGENCYSTATE_MODE	145755	47.40
OCCUPATION_TYPE	96391	31.35
  • Count the number of features with more than 20% missing values.
missing_val_data[missing_val_data['% of Total Values']>20].count()
# Output
Missing Values       50
% of Total Values    50
dtype: int64

2.3 Examine feature data types

  • Count the features by data type: int64 and float64 columns are numeric features, while object columns are categorical.
df_train.dtypes
# Output:
SK_ID_CURR                      int64
TARGET                          int64
NAME_CONTRACT_TYPE             object
CODE_GENDER                    object
FLAG_OWN_CAR                   object
                               ...   
AMT_REQ_CREDIT_BUREAU_DAY     float64
AMT_REQ_CREDIT_BUREAU_WEEK    float64
AMT_REQ_CREDIT_BUREAU_MON     float64
AMT_REQ_CREDIT_BUREAU_QRT     float64
AMT_REQ_CREDIT_BUREAU_YEAR    float64
Length: 122, dtype: object
df_train.dtypes.value_counts()
# Output:
float64    65
int64      41
object     16
dtype: int64
  • First, inspect the object (categorical) features.
df_train.select_dtypes('object')
# Output:
	NAME_CONTRACT_TYPE	CODE_GENDER	FLAG_OWN_CAR	FLAG_OWN_REALTY	NAME_TYPE_SUITE	NAME_INCOME_TYPE	NAME_EDUCATION_TYPE	NAME_FAMILY_STATUS	NAME_HOUSING_TYPE	OCCUPATION_TYPE	WEEKDAY_APPR_PROCESS_START	ORGANIZATION_TYPE	FONDKAPREMONT_MODE	HOUSETYPE_MODE	WALLSMATERIAL_MODE	EMERGENCYSTATE_MODE
0	Cash loans	M	N	Y	Unaccompanied	Working	Secondary / secondary special	Single / not married	House / apartment	Laborers	WEDNESDAY	Business Entity Type 3	reg oper account	block of flats	Stone, brick	No
1	Cash loans	F	N	N	Family	State servant	Higher education	Married	House / apartment	Core staff	MONDAY	School	reg oper account	block of flats	Block	No
2	Revolving loans	M	Y	Y	Unaccompanied	Working	Secondary / secondary special	Single / not married	House / apartment	Laborers	MONDAY	Government	NaN	NaN	NaN	NaN
3	Cash loans	F	N	Y	Unaccompanied	Working	Secondary / secondary special	Civil marriage	House / apartment	Laborers	WEDNESDAY	Business Entity Type 3	NaN	NaN	NaN	NaN
4	Cash loans	M	N	Y	Unaccompanied	Working	Secondary / secondary special
  • Count how many distinct values each object feature takes.
df_train.select_dtypes('object').apply(pd.Series.nunique,axis=0)
# Output:
NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64
  • Conclusion: most categorical (object) features take only a small number of distinct values. Later we will need to encode them, i.e., convert string features into numeric ones (one common scheme is sketched below).
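  • One common way to perform that encoding is to one-hot encode the object columns and then align the train and test columns; the snippet below is only a sketch of that idea (the names df_train_enc/df_test_enc are illustrative), not necessarily the scheme used in later sections:
# one-hot encode all object (categorical) columns
df_train_enc = pd.get_dummies(df_train)
df_test_enc = pd.get_dummies(df_test)

# the test set has no TARGET, so keep the label aside and align the remaining columns
train_labels = df_train_enc['TARGET']
df_train_enc, df_test_enc = df_train_enc.align(df_test_enc, join='inner', axis=1)
df_train_enc['TARGET'] = train_labels
print(df_train_enc.shape, df_test_enc.shape)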

2.4 Outlier detection

  • EDA usually includes outlier detection on the features. An outlier (also called an anomalous point) is defined as "one or a few observations in a sample that lie far away from the other observations, suggesting that they may come from a different population".
  • Outliers can arise from data-entry mistakes, faulty measurement equipment, or genuinely extreme cases, so in real business settings their cause should be confirmed with the people who collected the data (a simple statistical check is sketched below).
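  • A rough first pass for spotting candidate outliers is describe() plus the 1.5*IQR rule; the sketch below uses AMT_INCOME_TOTAL purely as an illustration of the idea:
col = df_train['AMT_INCOME_TOTAL']
print(col.describe())                     # a max far above the quartiles hints at outliers

q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
print('values above the 1.5*IQR upper fence:', (col > upper_fence).sum())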