Machine Learning Model Evaluation and Hyperparameter Tuning in Detail
Abstract: In task3 and task4 we gave a brief introduction to model evaluation and hyperparameter tuning for regression problems. In this chapter we study model evaluation and hyperparameter tuning in more depth, and finish with a hands-on classification exercise.
1. Simplifying the Workflow with Pipelines
In many machine learning tasks we need a sequence of basic operations before we can fit a model. For example, before fitting a logistic regression we may first standardize the data, then reduce its dimensionality with PCA, and only then fit the logistic regression model and make predictions. Is there a way to chain these operations together so that they form a single workflow? See the code below:
# Load basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")
import warnings
warnings.filterwarnings("ignore")
# Load the data (Breast Cancer Wisconsin Diagnostic dataset)
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",header=None)
# Basic data preprocessing
from sklearn.preprocessing import LabelEncoder
X = df.iloc[:,2:].values
y = df.iloc[:,1].values
le = LabelEncoder()  # encode the string labels B/M as integers 0/1
y = le.fit_transform(y)
le.transform(['M','B'])
# Split the data 8:2
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,stratify=y,random_state=1)
Wrap all of these operations in a single pipeline to form one workflow: standardization + PCA + logistic regression.
Approach 1: make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
pipe_lr1 = make_pipeline(StandardScaler(),PCA(n_components=2),LogisticRegression(random_state=1))
pipe_lr1.fit(X_train,y_train)
y_pred1 = pipe_lr1.predict(X_test)
print("Test Accuracy: %.3f"% pipe_lr1.score(X_test,y_test))
Test Accuracy: 0.956
Approach 2: Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
pipe_lr2 = Pipeline([('std', StandardScaler()), ('pca', PCA(n_components=2)), ('lr', LogisticRegression(random_state=1))])
pipe_lr2.fit(X_train,y_train)
y_pred2 = pipe_lr2.predict(X_test)
print("Test Accuracy: %.3f"% pipe_lr2.score(X_test,y_test))
Test Accuracy: 0.956
Differences between the two approaches
With Pipeline:
- the step names are explicit; you do not have to work out what they are when you need them;
- the names do not change if you swap the estimator/transformer used in a step, e.g. if you replace LogisticRegression() with LinearSVC() you can still refer to the parameter as clf__C (assuming you named that step 'clf').
With make_pipeline:
- the notation is shorter and more readable;
- the step names are generated automatically by a simple rule (the lowercased class name of the estimator).
In short, the difference comes down to whether the step names are generated automatically or chosen by you; the sketch below makes this concrete.
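A minimal sketch (an addition, not from the original post; the step name 'clf' is chosen purely for illustration) showing how the two naming schemes surface in get_params():
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Explicit names: the parameter key stays 'clf__C' even if the estimator in that step changes.
pipe_a = Pipeline([('scale', StandardScaler()), ('clf', LogisticRegression())])
print(pipe_a.get_params()['clf__C'])                   # 1.0, accessed via the name we chose

# Auto-generated names: the key is built from the lowercased class name.
pipe_b = make_pipeline(StandardScaler(), LogisticRegression())
print([name for name, _ in pipe_b.steps])              # ['standardscaler', 'logisticregression']
print(pipe_b.get_params()['logisticregression__C'])    # 1.0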
2. Evaluating Model Performance with k-Fold Cross-Validation
k-fold cross-validation was already mentioned in task3: we split the training samples into K equal folds, use K-1 folds as the training set, and use the remaining fold as a validation set to estimate the accuracy of the model fitted on the other K-1 folds. Repeating this process K times and averaging the results gives an estimate of the test error:
$$CV_{(K)} = \frac{1}{K}\sum\limits_{i=1}^{K}MSE_i$$
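As a small illustration of this formula (a sketch added here, using a toy regression problem rather than the article's data), the same average can be computed with cross_val_score and the 'neg_mean_squared_error' scorer, which returns one negated MSE per fold:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# cross_val_score returns negated MSEs, one per fold; CV_(K) is their (positive) mean.
neg_mse = cross_val_score(LinearRegression(), X_reg, y_reg,
                          cv=10, scoring='neg_mean_squared_error')
cv_k = -neg_mse.mean()
print("CV_(K) estimate of the test MSE: %.3f" % cv_k)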
# Evaluation approach 1: k-fold cross-validation
from sklearn.model_selection import cross_val_score
scores1 = cross_val_score(estimator=pipe_lr1,X = X_train,y = y_train,cv=10,n_jobs=1)
print("CV accuracy scores:%s" % scores1)
print("CV accuracy:%.3f +/-%.3f"%(np.mean(scores1),np.std(scores1)))
CV accuracy scores:[0.93478261 0.93478261 0.95652174 0.95652174 0.93478261 0.95555556
0.97777778 0.93333333 0.95555556 0.95555556]
CV accuracy:0.950 +/-0.014
# Evaluation approach 2: stratified k-fold cross-validation
from sklearn.model_selection import StratifiedKFold
kfold = StratifiedKFold(n_splits=10).split(X_train,y_train)  # note: random_state only takes effect when shuffle=True
scores2 = []
for k,(train,test) in enumerate(kfold):
    pipe_lr1.fit(X_train[train],y_train[train])
    score = pipe_lr1.score(X_train[test],y_train[test])
    scores2.append(score)
    print('Fold:%2d,Class dist.:%s,Acc:%.3f'%(k+1,np.bincount(y_train[train]),score))
print('\nCV accuracy :%.3f +/-%.3f'%(np.mean(scores2),np.std(scores2)))
Fold: 1,Class dist.:[256 153],Acc:0.935
Fold: 2,Class dist.:[256 153],Acc:0.935
Fold: 3,Class dist.:[256 153],Acc:0.957
Fold: 4,Class dist.:[256 153],Acc:0.957
Fold: 5,Class dist.:[256 153],Acc:0.935
Fold: 6,Class dist.:[257 153],Acc:0.956
Fold: 7,Class dist.:[257 153],Acc:0.978
Fold: 8,Class dist.:[257 153],Acc:0.933
Fold: 9,Class dist.:[257 153],Acc:0.956
Fold:10,Class dist.:[257 153],Acc:0.956
CV accuracy :0.950 +/-0.014
3. Debugging Algorithms with Learning and Validation Curves
If a model is too complex, i.e. it has too many degrees of freedom or parameters, it runs the risk of overfitting (high variance); if it is too simple, it risks underfitting (high bias).
Below we use these curves to diagnose and address variance and bias problems:
# Diagnosing bias and variance with a learning curve
from sklearn.model_selection import learning_curve
pipe_lr3 = make_pipeline(StandardScaler(),LogisticRegression(random_state=1,penalty='l2'))
train_sizes,train_scores,test_scores = learning_curve(
    estimator=pipe_lr3,
    X=X_train,
    y=y_train,
    train_sizes=np.linspace(0.1,1,10),
    cv=10,
    n_jobs=1)
train_mean = np.mean(train_scores,axis=1)
train_std = np.std(train_scores,axis=1)
test_mean = np.mean(test_scores,axis=1)
test_std = np.std(test_scores,axis=1)
plt.plot(train_sizes,train_mean,color='blue',marker='o',markersize=5,label='training accuracy')
plt.fill_between(train_sizes,train_mean+train_std,train_mean-train_std,alpha=0.15,color='blue')
plt.plot(train_sizes,test_mean,color='red',marker='s',markersize=5,label='validation accuracy')
plt.fill_between(train_sizes,test_mean+test_std,test_mean-test_std,alpha=0.15,color='red')
plt.xlabel("Number of training samples")
plt.ylabel("Accuracy")
plt.legend(loc='lower right')
plt.ylim([0.8,1.02])
plt.show()
The train_sizes argument of learning_curve controls the absolute or relative number of samples used to generate the learning curve.
Setting train_sizes=np.linspace(0.1, 1.0, 10) uses ten evenly spaced fractions of the training set.
The cv argument sets the number of folds k.
The fill_between calls add the standard deviation around the mean accuracy, indicating the variance of the estimates; the short snippet below turns this into a numeric check.
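A tiny follow-up sketch (assuming train_mean and test_mean from the learning-curve code above are still in scope): the final gap between training and validation accuracy gives a quick numeric read on variance, while a low validation accuracy overall points to bias.
# Compare the last point of the two curves: a large gap suggests high variance,
# while a low validation accuracy suggests high bias.
gap = train_mean[-1] - test_mean[-1]
print("Final training accuracy:   %.3f" % train_mean[-1])
print("Final validation accuracy: %.3f" % test_mean[-1])
print("Train/validation gap:      %.3f" % gap)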
# Addressing underfitting and overfitting with a validation curve
from sklearn.model_selection import validation_curve
pipe_lr3 = make_pipeline(StandardScaler(),LogisticRegression(random_state=1,penalty='l2'))
param_range = [0.001,0.01,0.1,1.0,10.0,100.0]
train_scores,test_scores = validation_curve(
    estimator=pipe_lr3,
    X=X_train,
    y=y_train,
    param_name='logisticregression__C',
    param_range=param_range,
    cv=10,
    n_jobs=1)
train_mean = np.mean(train_scores,axis=1)
train_std = np.std(train_scores,axis=1)
test_mean = np.mean(test_scores,axis=1)
test_std = np.std(test_scores,axis=1)
plt.plot(param_range, train_mean,
         color='blue', marker='o',
         markersize=5, label='training accuracy')
plt.fill_between(param_range, train_mean+train_std,
                 train_mean-train_std, alpha=0.15,
                 color='blue')
plt.plot(param_range, test_mean,
         color='red', marker='s',
         markersize=5, label='validation accuracy')
plt.fill_between(param_range, test_mean+test_std,
                 test_mean-test_std, alpha=0.15, color='red')
plt.xscale('log')
plt.xlabel("Parameter C")
plt.ylabel("Accuracy")
plt.legend(loc='lower right')
plt.ylim([0.8,1.02])
plt.show()
The parameter being validated is C, the regularization parameter of the logistic regression (in scikit-learn it is the inverse regularization strength). Because the pipeline was built with make_pipeline, it is addressed as logisticregression__C; with an explicitly named step 'clf' it would be clf__C.
The param_range argument sets the range of values to try.
From the plot, the optimal value of C lies around 0.1; the snippet below extracts it programmatically.
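A small follow-up sketch (assuming param_range and test_mean from the validation-curve code above are still defined) that picks the C with the highest mean validation accuracy instead of reading it off the plot:
import numpy as np

# Index of the highest mean validation accuracy over the tested C values.
best_idx = int(np.argmax(test_mean))
print("Best C: %g (mean validation accuracy %.3f)"
      % (param_range[best_idx], test_mean[best_idx]))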
4. Hyperparameter Tuning with Grid Search
If only one hyperparameter needs tuning, adjusting it by hand with a validation curve works well. But as the number of hyperparameters grows, can the tuning be automated? Pay attention to the running time of each search method below.
Hyperparameter search was covered in task4, mainly via two approaches: grid search and random search. There the example used SVR; in this article we use an SVC example instead.
The difference between SVM, SVC and SVR
The relationship between the three can be stated simply:
- SVM = Support Vector Machine, the general method;
- SVC = Support Vector Classification, a support vector machine used for classification;
- SVR = Support Vector Regression, a support vector machine used for regression.
The SVM estimators in scikit-learn (a minimal usage sketch follows this list):
- svm.LinearSVC: Linear Support Vector Classification.
- svm.LinearSVR: Linear Support Vector Regression.
- svm.NuSVC: Nu-Support Vector Classification.
- svm.NuSVR: Nu-Support Vector Regression.
- svm.OneClassSVM: Unsupervised Outlier Detection.
- svm.SVC: C-Support Vector Classification.
- svm.SVR: Epsilon-Support Vector Regression.
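A minimal usage sketch (toy data generated on the fly, not part of the original article) showing the classification/regression split in this API: SVC predicts discrete class labels, SVR predicts continuous values.
from sklearn.datasets import make_classification, make_regression
from sklearn.svm import SVC, SVR

# Classification: SVC learns and predicts discrete labels.
Xc, yc = make_classification(n_samples=100, n_features=4, random_state=0)
print(SVC(kernel='rbf').fit(Xc, yc).predict(Xc[:3]))  # three predicted class labels

# Regression: SVR learns and predicts a continuous target.
Xr, yr = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
print(SVR(kernel='rbf').fit(Xr, yr).predict(Xr[:3]))  # three real-valued predictions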
# Approach 1: grid search with GridSearchCV()
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import time
start_time = time.time()
pipe_svc = make_pipeline(StandardScaler(),SVC(random_state=1))
param_range = [0.0001,0.001,0.01,0.1,1.0,10.0,100.0,1000.0]
param_grid = [{'svc__C':param_range,'svc__kernel':['linear']},{'svc__C':param_range,'svc__gamma':param_range,'svc__kernel':['rbf']}]
gs = GridSearchCV(estimator=pipe_svc,param_grid=param_grid,scoring='accuracy',cv=10,n_jobs=-1)
gs = gs.fit(X_train,y_train)
end_time = time.time()
print("Grid search elapsed time: %.3f s" % float(end_time-start_time))
print(gs.best_score_)
print(gs.best_params_)
Grid search elapsed time: 2.735 s
0.9846859903381642
{'svc__C': 100.0, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}
# Approach 2: randomized search with RandomizedSearchCV()
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
import time
start_time = time.time()
pipe_svc = make_pipeline(StandardScaler(),SVC(random_state=1))
param_range = [0.0001,0.001,0.01,0.1,1.0,10.0,100.0,1000.0]
#param_grid = [{'svc__C':param_range,'svc__kernel':['linear']},{'svc__C':param_range,'svc__gamma':param_range,'svc__kernel':['rbf']}]
param_grid = [{'svc__C':param_range,'svc__kernel':['linear','rbf'],'svc__gamma':param_range}]
gs = RandomizedSearchCV(estimator=pipe_svc, param_distributions=param_grid,scoring='accuracy',cv=10,n_jobs=-1)
gs = gs.fit(X_train,y_train)
end_time = time.time()
print("Randomized search elapsed time: %.3f s" % float(end_time-start_time))
print(gs.best_score_)
print(gs.best_params_)
Randomized search elapsed time: 0.221 s
0.9758937198067633
{'svc__kernel': 'linear', 'svc__gamma': 0.1, 'svc__C': 1.0}
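The randomized search above still samples from the same discrete lists used by the grid search. A common variant (sketched here as an addition; it assumes scipy >= 1.4 for scipy.stats.loguniform) is to draw C and gamma from continuous log-uniform distributions and control the budget with n_iter:
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV

param_dist = {'svc__C': loguniform(1e-4, 1e3),
              'svc__gamma': loguniform(1e-4, 1e3),
              'svc__kernel': ['linear', 'rbf']}

# n_iter controls how many random parameter combinations are tried.
rs = RandomizedSearchCV(estimator=pipe_svc, param_distributions=param_dist,
                        n_iter=20, scoring='accuracy', cv=10,
                        random_state=1, n_jobs=-1)
rs = rs.fit(X_train, y_train)
print(rs.best_score_)
print(rs.best_params_)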
5. Comparing Different Performance Metrics
Sometimes accuracy is not the only metric we care about, because different kinds of prediction errors can carry very different costs. For example, when screening a patient for a tumor, predicting "healthy" for patient A who actually has a tumor has very different consequences from predicting "tumor" for patient A who is actually healthy (think about why). We therefore need a broader set of metrics (a worked numeric check follows the list below):
- 1. Error rate: $ERR = \frac{FP+FN}{FP+FN+TP+TN}$
- 2. Accuracy: $ACC = \frac{TP+TN}{FP+FN+TP+TN}$
- 3. False positive rate: $FPR = \frac{FP}{FP+TN}$
- 4. True positive rate: $TPR = \frac{TP}{FN+TP}$
- 5. Precision: $PRE = \frac{TP}{TP+FP}$
- 6. Recall: $REC = \frac{TP}{TP+FN}$
- 7. F1 score: $F1 = 2\frac{PRE\times REC}{PRE+REC}$
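A small numeric sketch (toy labels, an addition to the article) that checks these formulas against scikit-learn: TP, FP, FN and TN are read off confusion_matrix, and the hand-computed precision, recall and F1 are compared with the library functions.
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_hat  = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# For binary labels {0, 1}, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

pre = tp / (tp + fp)
rec = tp / (tp + fn)
f1 = 2 * pre * rec / (pre + rec)
print(np.isclose(pre, precision_score(y_true, y_hat)))  # True
print(np.isclose(rec, recall_score(y_true, y_hat)))     # True
print(np.isclose(f1, f1_score(y_true, y_hat)))          # True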
Plotting the confusion matrix
# Plot the confusion matrix
from sklearn.metrics import confusion_matrix
pipe_svc.fit(X_train,y_train)
y_pred = pipe_svc.predict(X_test)
confmat = confusion_matrix(y_true=y_test,y_pred=y_pred)
fig,ax = plt.subplots(figsize=(2.5,2.5))
ax.matshow(confmat, cmap=plt.cm.Blues,alpha=0.3)
for i in range(confmat.shape[0]):
    for j in range(confmat.shape[1]):
        ax.text(x=j,y=i,s=confmat[i,j],va='center',ha='center')
plt.xlabel('predicted label')
plt.ylabel('true label')
plt.show()
Computing the individual metrics
# Compute precision, recall and F1
from sklearn.metrics import precision_score,recall_score,f1_score
print('Precision:%.3f'%precision_score(y_true=y_test,y_pred=y_pred))
print('recall_score:%.3f'%recall_score(y_true=y_test,y_pred=y_pred))
print('f1_score:%.3f'%f1_score(y_true=y_test,y_pred=y_pred))
Precision:0.976
recall_score:0.952
f1_score:0.964
# Combining a different metric with GridSearchCV
from sklearn.metrics import make_scorer,f1_score
scorer = make_scorer(f1_score,pos_label=0)
gs = GridSearchCV(estimator=pipe_svc,param_grid=param_grid,scoring=scorer,cv=10)
gs = gs.fit(X_train,y_train)
print(gs.best_score_)
print(gs.best_params_)
0.9880771478667446
{'svc__C': 100.0, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}
# Plot the ROC curve
from sklearn.metrics import roc_curve,auc
from sklearn.metrics import make_scorer,f1_score
scorer = make_scorer(f1_score,pos_label=0)
gs = GridSearchCV(estimator=pipe_svc,param_grid=param_grid,scoring=scorer,cv=10)
y_pred = gs.fit(X_train,y_train).decision_function(X_test)
#y_pred = gs.predict(X_test)
fpr,tpr,threshold = roc_curve(y_test, y_pred)  # compute the false positive and true positive rates
roc_auc = auc(fpr,tpr)  # compute the AUC
lw = 2
plt.figure(figsize=(7,5))
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)  # FPR on the x-axis, TPR on the y-axis
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([-0.05, 1.0])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic ')
plt.legend(loc="lower right")
plt.show()
Hands-On Exercise: Face Recognition with SVM
Dataset overview
LFW (Labeled Faces in the Wild) is a database used for face recognition research.
Every image in LFW is labeled with the name of the person it shows, and each person may appear in multiple images taken under different conditions. George W Bush, for instance, has 530 images, while some people have only one or a few. We keep only the most frequent names as recognition classes: in this experiment we select people who appear more than 70 times, which gives 1288 images covering 7 people: Ariel Sharon, Colin Powell, Donald Rumsfeld, George W Bush, Gerhard Schroeder, Hugo Chavez and Tony Blair.
Problem statement
Given features and labels extracted for these 7 people, we want to label new photos with the correct name. This is a multi-class classification problem with 7 classes. Solving it is useful not only for recognizing a small group of people (e.g. company attendance), but also for labeling new data: further annotating the corpus enlarges the training set and in turn improves recognition accuracy. Correctly labeling names on images, i.e. studying and applying this multi-class problem, therefore has practical value.
Data processing
The training and test data contain 1288 samples in total; after downsampling, each image has 1850 features, and there are 7 labels.
First, the data is split into a training set and a test set, with the test set taking 25% (10% or 20% would be more typical). During training, the training data is further split into training and validation folds: the validation folds are used to select the best model, and the held-out test set is used to measure its performance.
Second, PCA is applied to the training set to extract eigenfaces, which speeds up training and helps prevent overfitting.
Walkthrough of the official example code
# Import the required packages
from time import time
import logging
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import fetch_lfw_people
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA
from sklearn.svm import SVC
# Load the data
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
n_samples, h, w = lfw_people.images.shape
X = lfw_people.data
n_features = X.shape[1]
y = lfw_people.target # the labels to predict are the ids of the people
target_names = lfw_people.target_names
n_classes = target_names.shape[0]
print("Total dataset size:")
print("n_samples: %d" % n_samples)
print("n_features: %d" % n_features)
print("n_classes: %d" % n_classes)
Total dataset size:
n_samples: 1288
n_features: 1850
n_classes: 7
# Split into training and test sets, then extract eigenfaces with PCA
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)  # hold out 25% of the data as the test set
n_components = 80
print("Extracting the top %d eigenfaces from %d faces"
      % (n_components, X_train.shape[0]))
t0 = time()
pca = PCA(n_components=n_components, svd_solver='randomized',
          whiten=True).fit(X_train)  # PCA dimensionality reduction
print("done in %0.3fs" % (time() - t0))
eigenfaces = pca.components_.reshape((n_components, h, w))
print("Projecting the input data on the eigenfaces orthonormal basis")
t0 = time()
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
print("done in %0.3fs" % (time() - t0))
Extracting the top 80 eigenfaces from 966 faces
done in 0.064s
Projecting the input data on the eigenfaces orthonormal basis
done in 0.009s
# Use grid search with cross-validation to find the best parameter combination
print("Fitting the classifier to the training set")
t0 = time()
param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1], }  # C penalizes misclassified samples; gamma is the RBF kernel coefficient
clf = GridSearchCV(
    SVC(kernel='rbf', class_weight='balanced'), param_grid  # the RBF kernel works well on images; GridSearchCV tries every C/gamma combination and keeps the best one
)
clf = clf.fit(X_train_pca, y_train)
print("done in %0.3fs" % (time() - t0))
print("Best estimator found by grid search:")
print(clf.best_estimator_) # print the best model found by the search
Fitting the classifier to the training set
done in 13.795s
Best estimator found by grid search:
SVC(C=1000.0, class_weight='balanced', gamma=0.01)
# Evaluate on the test set
print("Predicting people's names on the test set")
t0 = time()
y_pred = clf.predict(X_test_pca)
print("done in %0.3fs" % (time() - t0))
print(classification_report(y_test, y_pred, target_names=target_names))
print(confusion_matrix(y_test, y_pred, labels=range(n_classes)))
Predicting people's names on the test set
done in 0.025s
                   precision    recall  f1-score   support

     Ariel Sharon       1.00      0.62      0.76        13
     Colin Powell       0.84      0.88      0.86        60
  Donald Rumsfeld       1.00      0.63      0.77        27
    George W Bush       0.83      0.99      0.90       146
Gerhard Schroeder       0.82      0.72      0.77        25
      Hugo Chavez       1.00      0.60      0.75        15
       Tony Blair       0.90      0.75      0.82        36

         accuracy                           0.86       322
        macro avg       0.91      0.74      0.80       322
     weighted avg       0.87      0.86      0.85       322
[[ 8 2 0 3 0 0 0]
[ 0 53 0 6 1 0 0]
[ 0 2 17 8 0 0 0]
[ 0 2 0 144 0 0 0]
[ 0 0 0 5 18 0 2]
[ 0 2 0 2 1 9 1]
[ 0 2 0 5 2 0 27]]
The accuracy is quite high. This is partly because we only kept people with more than 70 labeled images, while many people in LFW appear only once. Taking that data sparsity into account would lower the accuracy considerably, yet in real applications data sparsity is a problem that has to be faced.
def plot_gallery(images, titles, h, w, n_row=3, n_col=4):  # visualize the predictions with matplotlib
    """Helper function to plot a gallery of portraits"""
    plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())
# Plot the results on part of the test set
def title(y_pred, y_test, target_names, i):
    pred_name = target_names[y_pred[i]].rsplit(' ', 1)[-1]
    true_name = target_names[y_test[i]].rsplit(' ', 1)[-1]
    return 'predicted: %s\ntrue: %s' % (pred_name, true_name)

prediction_titles = [title(y_pred, y_test, target_names, i)
                     for i in range(y_pred.shape[0])]
plot_gallery(X_test, prediction_titles, h, w)
plt.show()
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)
plt.show()
The eigenfaces plotted above span the reduced PCA feature space; the smaller the reduced dimensionality (the number of components), the more information is lost. A rough way to quantify this is sketched below.
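A short hedged addition (not in the original walkthrough, and assuming the pca object fitted earlier is still in scope): the fitted PCA exposes explained_variance_ratio_, so the fraction of variance retained by the 80 components can be checked directly.
import numpy as np

# Cumulative share of the training-set variance captured by the retained components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("Variance retained by %d components: %.1f%%"
      % (pca.n_components_, 100 * cumulative[-1]))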
Summary
The accuracy in this case study is quite high, but real-world face recognition photos are rarely cropped this neatly (even when the pixel dimensions match). The key difference for a practical face classification system lies in feature selection: you need a more sophisticated algorithm to find the faces and then extract face features that do not depend on the raw pixels. A good solution here is to use OpenCV together with other tools, including state-of-the-art general-purpose image feature extractors, to obtain the face feature data.