Take Alipay's Zhima Credit (Sesame Credit) as an example: its scores range from 350 to 950. The higher the score, the better the person's credit standing is considered to be, and the lower the default rate on personal business. The tool behind it is a personal credit scoring model similar to the FICO score.
The main idea of FICO-style scoring is to collect, analyze, and transform data on a large number of users, each described by many attributes; to select, reshape, and combine those attributes using statistical measures (such as correlation coefficients, chi-square tests, and variance inflation factors); and finally to produce a single quantitative, comparable score. A higher score reflects a better historical credit record on the one hand, and implies a lower probability of future default on the other.
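As a small illustration of the statistical screening tools just mentioned, the sketch below (not part of the original workflow) computes pairwise correlation coefficients and variance inflation factors for a few numeric columns; the column names are only examples and should be adjusted to the attributes actually loaded.
#Hedged sketch: correlation and VIF screening of numeric attributes (illustrative only)
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
def screen_features(df, cols):
    sub = df[cols].dropna()
    print(sub.corr())                                  #pairwise correlation coefficients
    exog = sub.values
    for i, c in enumerate(cols):                       #a VIF well above ~10 hints at multicollinearity
        print(c, variance_inflation_factor(exog, i))
#screen_features(data, ['age', 'DebtRatio', 'MonthlyIncome'])   #call once the data below has been loaded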
The raw data used here consists of each customer's personal information (gender, age, occupation, marital status, education, etc.), account information (number of accounts, deposit and loan balances, etc.), and a class label indicating whether the customer has defaulted.
After loading, the dataset goes through data binning, attribute (feature) selection, and the conversion from discrete class labels to a continuous credit score.
1. Importing the data
#Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#Inspect the raw data
data = pd.read_csv('credit.csv')
print('Raw data overview')
data.info()
The dataset is about 12.6 MB with 11 attributes; two of them, MonthlyIncome and NumberOfDependents, contain missing values.
2. Data cleaning
Because MonthlyIncome and NumberOfDependents contain missing values, they need to be handled. Two common approaches are (1) filling blanks with the neighbouring (previous/next) values and (2) filling blanks with the attribute mean, but each has its drawbacks. Here a set_missing function is defined instead, which uses a random forest regressor to predict and fill the missing values. (A small sketch of the simpler fill strategies follows below.)
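For comparison only, a minimal sketch of those two simpler strategies applied to a copy of the data; this is not the approach used in the rest of this section.
#Hedged sketch: neighbouring-value fill and mean fill on a copy of the data (illustration only)
simple = data.copy()
simple['MonthlyIncome_ffill'] = simple['MonthlyIncome'].ffill()                                   #previous-value fill
simple['MonthlyIncome_mean'] = simple['MonthlyIncome'].fillna(simple['MonthlyIncome'].mean())     #mean fill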
#Data cleaning
#Use random forest regression to predict and fill the missing MonthlyIncome values; the set_missing function is defined here
from sklearn.ensemble import RandomForestRegressor
def set_missing(df):
    print('Filling 0-valued MonthlyIncome with random forest regression:')
    process_df = df.iloc[:,[5,0,1,2,3,4,6,7,8,9]]
    #Move MonthlyIncome (column 5) to column 0 so it can serve as the target when splitting the data
    #Split into rows with a known value and rows with a missing value
    known = process_df.loc[process_df['MonthlyIncome']!=0].values
    unknown = process_df.loc[process_df['MonthlyIncome']==0].values
    X = known[:,1:]
    y = known[:,0]
    #Train the random forest regressor on X and y
    rfr = RandomForestRegressor(random_state=0,n_estimators=200,max_depth=3,n_jobs=-1)
    rfr.fit(X,y)
    #Predict the missing values with the fitted model
    predicted = rfr.predict(unknown[:,1:]).round(0)
    #Fill the original missing entries with the predictions
    df.loc[df['MonthlyIncome'] == 0,'MonthlyIncome'] = predicted
    return df
An outlier_processing function is defined next to delete outlying data points from an attribute. It applies the box-plot (IQR) rule: values more than 1.5 times the interquartile range beyond the quartiles are treated as outliers and dropped. Data binning, in which an attribute's values are divided into several segments (bins) and every value falling into the same bin is represented by one number, is used later during feature selection; a small sketch of binning follows below.
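A minimal sketch of what binning means, using a small made-up series rather than this dataset:
#Hedged sketch: bin a toy series into 3 equal-width buckets and replace each value by its bucket mean
ages = pd.Series([22, 35, 47, 51, 63, 78])
buckets = pd.cut(ages, 3)                              #3 equal-width buckets
print(ages.groupby(buckets).transform('mean'))         #every value in a bucket becomes the bucket mean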
#Delete outliers from an attribute: first compute the minimum and maximum thresholds that serve as the deletion criterion
#minimum threshold = first quartile - 1.5*(third quartile - first quartile)
#maximum threshold = third quartile + 1.5*(third quartile - first quartile)
#rows below the minimum threshold or above the maximum threshold are deleted
#the outlier_processing function defined below handles the outlying data points
def outlier_processing(df,cname):
    s = df[cname]
    onequater = s.quantile(0.25)
    threequater = s.quantile(0.75)
    iqr = threequater - onequater
    lower = onequater - 1.5*iqr
    upper = threequater + 1.5*iqr
    df = df[df[cname]<=upper]
    df = df[df[cname]>=lower]
    return df
Box plots of MonthlyIncome before and after processing
#Clean up the MonthlyIncome column
print('Original distribution of MonthlyIncome outliers:')
data[['MonthlyIncome']].boxplot()
plt.savefig('MonthlyIncome1.png',dpi = 300,bbox_inches = 'tight')
plt.show()
print('Removing outliers and filling missing data:')
#Assumption: missing MonthlyIncome is recorded as NaN in the raw file; mark it as 0 first so these rows are not dropped as outliers and can be filled by set_missing
data['MonthlyIncome'] = data['MonthlyIncome'].fillna(0)
data = outlier_processing(data, 'MonthlyIncome')#remove outliers
data = set_missing(data)#fill missing data
print('Data overview after processing MonthlyIncome:')
data.info()#inspect the cleaned data
#Plot the result
data[['MonthlyIncome']].boxplot()#box plot
plt.savefig('MonthlyIncome2.png',dpi = 300,bbox_inches = 'tight')
plt.show()
After removing outliers and filling the missing data, the dataset is about 2 MB smaller.
The same outlier treatment is now applied to the other attributes.
#Apply the same outlier treatment to the other attributes
data = outlier_processing(data, 'age')
data = outlier_processing(data, 'RevolvingUtilizationOfUnsecuredLines')
data = outlier_processing(data, 'DebtRatio')
data = outlier_processing(data, 'NumberOfOpenCreditLinesAndLoans')
data = outlier_processing(data, 'NumberRealEstateLoansOrLines')
data = outlier_processing(data, 'NumberOfDependents')
Three attributes with heavily concentrated values are handled manually, as described below.
#For these three attributes all three quartiles are equal, so applying outlier_processing directly would delete every row
#They are therefore handled manually
features = ['NumberOfTime30-59DaysPastDueNotWorse','NumberOfTime60-89DaysPastDueNotWorse','NumberOfTimes90DaysLate']
features_labels = ['30-59days','60-89days','90+days']
print('Original distribution of the three attributes:')
data[features].boxplot()
plt.xticks([1,2,3],features_labels)
plt.savefig('three_attributes_original.png', dpi = 300 ,bbox_inches = 'tight')
plt.show()
print('After removing outliers:')
data = data[data['NumberOfTime30-59DaysPastDueNotWorse']<90]
data = data[data['NumberOfTime60-89DaysPastDueNotWorse']<90]
data = data[data['NumberOfTimes90DaysLate']<90]
data[features].boxplot()
plt.xticks([1,2,3],features_labels)
plt.savefig('three_attributes_cleaned.png', dpi = 300 ,bbox_inches = 'tight')
plt.show()
print('Data overview after outlier processing:')
data.info()
#Create the training and test sets
from sklearn.model_selection import train_test_split
#In the raw data 0 means normal and 1 means default. Since a higher credit score conventionally implies a lower default probability, the 0 and 1 labels are swapped.
data['SeriousDlqin2yrs'] = 1-data['SeriousDlqin2yrs']
Y = data['SeriousDlqin2yrs']
X = data.iloc[:,1:]
#Split into training and test sets
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.3,random_state = 0)
train = pd.concat([Y_train,X_train],axis = 1)
test = pd.concat([Y_test,X_test],axis = 1)
clasTest = test.groupby('SeriousDlqin2yrs')['SeriousDlqin2yrs'].count()
print('Training set shape')
print(train.shape)
print('Test set shape')
print(test.shape)
3. Feature selection
Besides using correlation analysis to discard attributes whose correlation with the target is small in absolute value, two measures, WoE (Weight of Evidence) and IV (Information Value), can be used to judge how much an attribute contributes to predicting the target variable and therefore whether to keep it. A toy computation is sketched right after the reference scale below.
The two measures are computed as:
WoE = ln(pctGood/pctBad)
MIV = WoE*(pctGood-pctBad)
IV = ∑ MIV
The relationship between an attribute's IV and the strength of its association with the target variable is:
0 < IV < 0.02: extremely weak
0.02 <= IV < 0.1: weak
0.1 <= IV < 0.3: moderate
0.3 <= IV < 0.5: strong
0.5 <= IV < 1.0: very strong
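As a concrete toy computation (with made-up counts, not taken from this dataset), the following computes WoE and IV for a single attribute already split into three buckets:
#Hedged toy example: WoE and IV for one attribute split into three buckets (made-up counts)
good_cnt = np.array([400, 300, 100])            #non-default counts per bucket
bad_cnt = np.array([20, 40, 60])                #default counts per bucket
pct_good = good_cnt/good_cnt.sum()
pct_bad = bad_cnt/bad_cnt.sum()
toy_woe = np.log(pct_good/pct_bad)              #WoE per bucket
toy_iv = ((pct_good - pct_bad)*toy_woe).sum()   #IV = sum of MIV over the buckets
print(toy_woe.round(3), round(toy_iv,3))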
#Bin the attributes and compute their WoE and IV values
def mono_bin(res,feat,n = 10):
    good = res.sum()
    bad = res.count()-good
    #Split the attribute into n equal-width buckets
    d1 = pd.DataFrame({'feat':feat,'res':res,'Bucket':pd.cut(feat,n)})
    d2 = d1.groupby('Bucket',as_index = True)
    d3 = pd.DataFrame(d2.feat.min(),columns = ['min'])
    d3['min'] = d2.min().feat
    d3['max'] = d2.max().feat
    d3['sum'] = d2.sum().res
    d3['total'] = d2.count().res
    d3['rate'] = d2.mean().res
    #WoE of each bucket: log of the bucket's good/bad odds relative to the overall odds
    d3['woe'] = np.log((d3['rate']/(1-d3['rate']))/(good/bad))
    d3['goodattribute'] = d3['sum']/good
    d3['badattribute'] = (d3['total']-d3['sum'])/bad
    #IV is the sum of (pctGood - pctBad) * WoE over all buckets
    iv = ((d3['goodattribute']-d3['badattribute'])*d3['woe']).sum()
    d4 = (d3.sort_values(by = 'min'))
    #Cut points at the i/n quantiles of the attribute, returned for later use by get_woe/compute_score
    cut = []
    cut.append(float('-inf'))
    for i in range(1,n):
        qua = feat.quantile(i/(n))
        cut.append(round(qua,4))
    cut.append(float('inf'))
    woe = list(d4['woe'].round(3))
    return d4,iv,cut,woe
def self_bin(res,feat,cat):
    #Same as mono_bin, except that the bucket edges are given explicitly in cat
    good = res.sum()
    bad = res.count()-good
    d1 = pd.DataFrame({'feat':feat,'res':res,'Bucket':pd.cut(feat,cat)})
    d2 = d1.groupby('Bucket',as_index = True)
    d3 = pd.DataFrame(d2.feat.min(),columns = ['min'])
    d3['min'] = d2.min().feat
    d3['max'] = d2.max().feat
    d3['sum'] = d2.sum().res
    d3['total'] = d2.count().res
    d3['rate'] = d2.mean().res
    d3['woe'] = np.log((d3['rate']/(1-d3['rate']))/(good/bad))
    d3['goodattribute'] = d3['sum']/good
    d3['badattribute'] = (d3['total']-d3['sum'])/bad
    iv = ((d3['goodattribute']-d3['badattribute'])*d3['woe']).sum()
    d4 = (d3.sort_values(by = 'min'))
    woe = list(d4['woe'].round(3))
    return d4,iv,woe
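The only difference between the two helpers is how pd.cut splits the attribute: mono_bin passes an integer n, giving n equal-width buckets, while self_bin passes an explicit list of cut points. A minimal sketch of the two calls:
#Hedged sketch: pd.cut with an integer versus an explicit list of cut points
demo = pd.Series([0, 1, 2, 3, 5, 8, 13])
print(pd.cut(demo, 3).value_counts())                                      #3 equal-width buckets
print(pd.cut(demo, [float('-inf'), 1, 5, float('inf')]).value_counts())    #custom bucket edges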
Each attribute is now binned: the continuous attributes are binned with mono_bin, while for the remaining attributes the cut points cutx3/cutx6/cutx7/cutx8/cutx9/cutx10 are specified manually below.
pinf = float('inf')
ninf = float('-inf')
dfx1,ivx1,cutx1,woex1 = mono_bin(train['SeriousDlqin2yrs'],train['RevolvingUtilizationOfUnsecuredLines'],n = 10)
#Show the buckets and WoE information for RevolvingUtilizationOfUnsecuredLines
print('='*60)
print('Buckets and WoE information for RevolvingUtilizationOfUnsecuredLines:')
print(dfx1)
dfx2,ivx2,cutx2,woex2 = mono_bin(train['SeriousDlqin2yrs'],train['age'],n = 10)
dfx4,ivx4,cutx4,woex4 = mono_bin(train['SeriousDlqin2yrs'],train['DebtRatio'],n = 10)
dfx5,ivx5,cutx5,woex5 = mono_bin(train['SeriousDlqin2yrs'],train['MonthlyIncome'],n = 10)
#Bin columns 3, 6, 7, 8, 9 and 10 with manually specified intervals
cutx3 = [ninf,0,1,3,5,pinf]
cutx6 = [ninf,1,2,3,5,pinf]
cutx7 = [ninf,0,1,3,5,pinf]
cutx8 = [ninf,0,1,2,3,pinf]
cutx9 = [ninf,0,1,3,pinf]
cutx10 = [ninf,0,1,2,3,5,pinf]
#Split the NumberOfTime30-59DaysPastDueNotWorse attribute into 5 segments using the intervals given by cutx3
dfx3,ivx3,woex3 = self_bin(train['SeriousDlqin2yrs'],train['NumberOfTime30-59DaysPastDueNotWorse'],cutx3)
#Show the buckets and WoE information for NumberOfTime30-59DaysPastDueNotWorse
print('='*60)
print('Buckets and WoE information for NumberOfTime30-59DaysPastDueNotWorse:')
print(dfx3)
dfx6,ivx6,woex6 = self_bin(train['SeriousDlqin2yrs'],train['NumberOfOpenCreditLinesAndLoans'],cutx6)
dfx7,ivx7,woex7 = self_bin(train['SeriousDlqin2yrs'],train['NumberOfTimes90DaysLate'],cutx7)
dfx8,ivx8,woex8 = self_bin(train['SeriousDlqin2yrs'],train['NumberRealEstateLoansOrLines'],cutx8)
dfx9,ivx9,woex9 = self_bin(train['SeriousDlqin2yrs'],train['NumberOfTime60-89DaysPastDueNotWorse'],cutx9)
dfx10,ivx10,woex10 = self_bin(train['SeriousDlqin2yrs'],train['NumberOfDependents'],cutx10)
Plot the IV value obtained for each attribute.
#Select attributes according to their IV
ivlist = [ivx1,ivx2,ivx3,ivx4,ivx5,ivx6,ivx7,ivx8,ivx9,ivx10]
index = ['x1','x2','x3','x4','x5','x6','x7','x8','x9','x10']
fig1 = plt.figure(1)
ax1 = fig1.add_subplot(1,1,1)
x = np.arange(len(index))+1
ax1.bar(x,ivlist,width = 0.48,color = 'yellow',alpha = 0.5)
ax1.set_xticks(x)
ax1.set_xticklabels(index,rotation = 0,fontsize = 12)
ax1.set_ylabel('IV(information value)',fontsize = 14)
for a,b in zip(x,ivlist):
plt.text(a,b+0.01,'%.4f'%b,ha = 'center',va = 'bottom',fontsize = 10)
plt.savefig('iv_values.png', dpi = 300,bbox_inches = 'tight')
plt.show()
4. Model training
A get_woe function is defined to convert the raw attribute values into their WoE values, which improves the quality of the fitted model. get_woe is then called to convert the training-set and test-set attributes into WoE values.
#Model training stage
#Map each attribute value to its corresponding WoE value
def get_woe(feat,cut,woe):
    res = []
    for row in feat.items():
        value = row[1]
        #Walk the cut points from the top down to find the bucket this value falls into
        j = len(cut)-2
        m = len(cut)-2
        while j>=0:
            if value>=cut[j]:
                j=-1
            else:
                j-=1
                m-=1
        res.append(woe[m])
    return res
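A tiny illustration of what get_woe does, with made-up cut points and WoE values that are not taken from this dataset:
#Hedged toy example: each value is mapped to the WoE of the bucket it falls into
demo_cut = [float('-inf'), 30, 50, float('inf')]             #three buckets
demo_woe = [0.8, 0.1, -0.6]                                   #one WoE value per bucket
print(get_woe(pd.Series([25, 42, 67]), demo_cut, demo_woe))   #-> [0.8, 0.1, -0.6]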
#Call get_woe to convert the training-set and test-set attribute values into their WoE values
woe_train = pd.DataFrame()
woe_train['SeriousDlqin2yrs'] = train['SeriousDlqin2yrs']
woe_train['RevolvingUtilizationOfUnsecuredLines'] = get_woe(train['RevolvingUtilizationOfUnsecuredLines'], cutx1, woex1)
woe_train['age'] = get_woe(train['age'], cutx2, woex2)
woe_train['NumberOfTime30-59DaysPastDueNotWorse'] = get_woe(train['NumberOfTime30-59DaysPastDueNotWorse'], cutx3, woex3)
woe_train['DebtRatio'] = get_woe(train['DebtRatio'], cutx4, woex4)
woe_train['MonthlyIncome'] = get_woe(train['MonthlyIncome'], cutx5, woex5)
woe_train['NumberOfOpenCreditLinesAndLoans'] = get_woe(train['NumberOfOpenCreditLinesAndLoans'], cutx6, woex6)
woe_train['NumberOfTimes90DaysLate'] = get_woe(train['NumberOfTimes90DaysLate'], cutx7, woex7)
woe_train['NumberRealEstateLoansOrLines'] = get_woe(train['NumberRealEstateLoansOrLines'], cutx8, woex8)
woe_train['NumberOfTime60-89DaysPastDueNotWorse'] = get_woe(train['NumberOfTime60-89DaysPastDueNotWorse'], cutx9, woex9)
woe_train['NumberOfDependents'] = get_woe(train['NumberOfDependents'], cutx10, woex10)
#Convert the test-set attributes to their WoE values
woe_test = pd.DataFrame()
woe_test['SeriousDlqin2yrs'] = test['SeriousDlqin2yrs']
woe_test['RevolvingUtilizationOfUnsecuredLines'] = get_woe(test['RevolvingUtilizationOfUnsecuredLines'], cutx1, woex1)
woe_test['age'] = get_woe(test['age'], cutx2, woex2)
woe_test['NumberOfTime30-59DaysPastDueNotWorse'] = get_woe(test['NumberOfTime30-59DaysPastDueNotWorse'], cutx3, woex3)
woe_test['DebtRatio'] = get_woe(test['DebtRatio'], cutx4, woex4)
woe_test['MonthlyIncome'] = get_woe(test['MonthlyIncome'], cutx5, woex5)
woe_test['NumberOfOpenCreditLinesAndLoans'] = get_woe(test['NumberOfOpenCreditLinesAndLoans'], cutx6, woex6)
woe_test['NumberOfTimes90DaysLate'] = get_woe(test['NumberOfTimes90DaysLate'], cutx7, woex7)
woe_test['NumberRealEstateLoansOrLines'] = get_woe(test['NumberRealEstateLoansOrLines'], cutx8, woex8)
woe_test['NumberOfTime60-89DaysPastDueNotWorse'] = get_woe(test['NumberOfTime60-89DaysPastDueNotWorse'], cutx9, woex9)
woe_test['NumberOfDependents'] = get_woe(test['NumberOfDependents'], cutx10, woex10)
import statsmodels.api as sm
from sklearn.metrics import roc_curve,auc
Y = woe_train['SeriousDlqin2yrs']
X = woe_train.drop(['SeriousDlqin2yrs','DebtRatio','MonthlyIncome','NumberOfOpenCreditLinesAndLoans','NumberRealEstateLoansOrLines','NumberOfDependents'],axis = 1)
X1 = sm.add_constant(X)
logit = sm.Logit(Y,X1)
Logit_model = logit.fit()
print('Fitted model coefficients:')
print(Logit_model.params)
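If significance tests for the individual coefficients are also of interest, statsmodels can print a full regression report; a minimal, optional addition:
#Optional: show coefficients, standard errors and p-values of the fitted logistic regression
print(Logit_model.summary())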
Plot the model's ROC curve and AUC.
Y_test = woe_test['SeriousDlqin2yrs']
X_test = woe_test.drop(['SeriousDlqin2yrs','DebtRatio','MonthlyIncome',
'NumberOfOpenCreditLinesAndLoans',
'NumberRealEstateLoansOrLines','NumberOfDependents'],axis=1)
X3 = sm.add_constant(X_test)
resu = Logit_model.predict(X3)
fpr,tpr,threshold = roc_curve(Y_test,resu)
rocauc = auc(fpr,tpr)
plt.plot(fpr,tpr,'y',label='AUC=%0.2f' % rocauc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([0,1])
plt.ylim([0,1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.savefig('roc_auc.png',dpi=300,bbox_inches='tight')
print('Model ROC curve:')
plt.show()
#Define get_score, which computes the base score contributed by each bucket
def get_score(coe,woe,factor):
    scores = []
    for w in woe:
        score = round(coe*w*factor,0)
        scores.append(score)
    return scores
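Each bucket's points are simply the regression coefficient times the bucket's WoE times the scaling factor, rounded to an integer. With illustrative numbers (not taken from this model):
#Hedged example: coefficient 0.6, bucket WoE values 1.2 and -0.4, scaling factor 28.85
print(get_score(0.6, [1.2, -0.4], 28.85))   #-> [21.0, -7.0]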
#Define compute_score, which looks up the base score corresponding to a specific attribute value
def compute_score(feat,cut,score):
    res = []
    for row in feat.items():
        value = row[1]
        #Same bucket lookup as in get_woe, but returning the bucket's score instead of its WoE
        j = len(cut)-2
        m = len(cut)-2
        while j>=0:
            if value>=cut[j]:
                j=-1
            else:
                j-=1
                m-=1
        res.append(score[m])
    return res
import math
coe = Logit_model.params
p = 20/math.log(2)
q = 600-20*math.log(20)/math.log(2)
baseScore = round(q+p*coe.iloc[0],0)
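The two constants follow the usual scorecard scaling convention Score = q + p*ln(odds). Assuming, as these numbers imply, a base score of 600 at good-to-bad odds of 20:1 and 20 points to double the odds (PDO), p and q can be reproduced as follows:
#Assumed scaling behind the constants above: base score 600 at 20:1 odds, PDO = 20
PDO, base_points, base_odds = 20, 600, 20
p_check = PDO/math.log(2)                                #points per unit of log-odds
q_check = base_points - p_check*math.log(base_odds)      #offset so that odds of 20:1 map to a score of 600
print(p_check, q_check)                                  #matches p and q above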
x1 = get_score(coe.iloc[1],woex1,p)
print('Scores for each bucket of the first attribute (RevolvingUtilizationOfUnsecuredLines):')
print(x1)
x2 = get_score(coe.iloc[2],woex2,p)
x3 = get_score(coe.iloc[3],woex3,p)
x7 = get_score(coe.iloc[4],woex7,p)
x9 = get_score(coe.iloc[5],woex9,p)
#print(x2)
#print(x3)
#Compute the scores
test['BaseScore'] = np.zeros(len(test))+baseScore
test['x1'] = compute_score(test['RevolvingUtilizationOfUnsecuredLines'],cutx1,x1)
test['x2'] = compute_score(test['age'],cutx2,x2)
test['x3'] = compute_score(test['NumberOfTime30-59DaysPastDueNotWorse'],cutx3,x3)
test['x7'] = compute_score(test['NumberOfTimes90DaysLate'],cutx7,x7)
test['x9'] = compute_score(test['NumberOfTime60-89DaysPastDueNotWorse'],cutx9,x9)
test['Score'] = test['x1']+test['x2']+test['x3']+test['x7']+test['x9']+baseScore
Scores for each bucket of the first attribute (RevolvingUtilizationOfUnsecuredLines):
[20.0, 10.0, 4.0, -2.0, -7.0, -13.0, -19.0, -21.0, -41.0, -38.0]
Normal = test.loc[test['SeriousDlqin2yrs']==1]
Charged = test.loc[test['SeriousDlqin2yrs']==0]
print('Summary statistics of credit scores for normal (non-defaulting) customers in the test set')
print(Normal['Score'].describe())
print('Summary statistics of credit scores for defaulting customers in the test set')
print(Charged['Score'].describe())
import seaborn as sns
plt.figure(figsize = (10,4))
sns.kdeplot(Normal['Score'],label = 'normal',linewidth = 2,linestyle = '--')
sns.kdeplot(Charged['Score'],label = 'charged',linewidth = 2,linestyle = '-')
plt.xlabel('Score',fontdict = {'size':10})
plt.ylabel('probability',fontdict = {'size':10})
plt.title('normal/charged',fontdict={'size':18})
plt.savefig('score_distribution_normal_vs_default.png',dpi = 300,bbox_inches = 'tight')
plt.show()
Distribution of credit scores for defaulting versus normal customers.
Finally, the trained model is applied to score an individual customer.
#Apply the trained model to score an individual customer
cusInfo = {'RevolvingUtilizationOfUnsecuredLines':0.248537,'age':48,'NumberOfTime30-59DaysPastDueNotWorse':0,
           'NumberOfTime60-89DaysPastDueNotWorse':0,'DebtRatio':0.177586,'MonthlyIncome':4166,
           'NumberOfOpenCreditLinesAndLoans':11,'NumberOfTimes90DaysLate':0,'NumberRealEstateLoansOrLines':1,
           'NumberOfDependents':0}
custData = pd.DataFrame(cusInfo,index = [0])
custData = custData.drop(['DebtRatio','MonthlyIncome','NumberOfOpenCreditLinesAndLoans','NumberRealEstateLoansOrLines','NumberOfDependents'],axis = 1)
custData['x1'] = compute_score(custData['RevolvingUtilizationOfUnsecuredLines'], cutx1,x1)
custData['x2'] = compute_score(custData['age'], cutx2,x2)
custData['x3'] = compute_score(custData['NumberOfTime30-59DaysPastDueNotWorse'], cutx3,x3)
custData['x7'] = compute_score(custData['NumberOfTimes90DaysLate'], cutx7,x7)
custData['x9'] = compute_score(custData['NumberOfTime60-89DaysPastDueNotWorse'], cutx9,x9)
custData['Score'] = custData['x1']+custData['x2']+custData['x3']+custData['x7']+custData['x9']+baseScore
print('Credit score for this customer:')
print(custData.loc[0,'Score'])
Credit score for this customer:
613.0