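(5) Hyperparameter tuning
Parameters vs. hyperparameters
- Model parameters are configuration variables internal to the model; their values are estimated from the data.
- Model hyperparameters are configuration settings external to the model; their values cannot be estimated from the data and must be chosen before training.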
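To make the distinction concrete, here is a minimal sketch using a toy one-dimensional ridge regression without an intercept. The helper `fit_ridge_1d` and the data are hypothetical and not part of the original notes: the slope `w` is a parameter estimated from the data, while `alpha` is a hyperparameter that has to be supplied from outside.

```python
# Toy 1-D ridge regression without intercept:
# minimize sum((y - w * x)^2) + alpha * w^2, whose closed form is
# w = sum(x * y) / (sum(x * x) + alpha).
def fit_ridge_1d(xs, ys, alpha):
    """Return the slope w estimated from the data for a fixed alpha."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # exactly y = 2x

for alpha in (0.0, 1.0, 10.0):       # hyperparameter: chosen by us
    w = fit_ridge_1d(xs, ys, alpha)  # parameter: estimated from the data
    print(f"alpha={alpha:5.1f} -> w={w:.3f}")
```

The data never tells the fitting routine which `alpha` to use; that choice is what tuning is about.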
Two tuning methods are commonly used:
Grid search
Idea: list the candidate values of every hyperparameter and try every combination of them.
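The "every combination" idea is a Cartesian product. A minimal sketch in plain Python follows; `expand_grid` and the example grid are illustrative names and values, not scikit-learn API.

```python
from itertools import product

def expand_grid(param_grid):
    """Expand {name: [values...]} into a list of {name: value} candidates."""
    keys = list(param_grid)
    value_lists = [param_grid[k] for k in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]

grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
candidates = expand_grid(grid)

print(len(candidates))  # 3 values of C times 2 kernels
for candidate in candidates:
    print(candidate)
```

The number of candidates is the product of the list lengths, so each extra hyperparameter multiplies the cost of the search.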
class sklearn.model_selection.GridSearchCV(estimator, param_grid, *, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)
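- estimator: the model (estimator) to tune.
- param_grid: a dict, or a list of dicts, mapping each parameter name to the list of values to try.
- scoring: the evaluation metric. The default is None, in which case the estimator's own score method is used.
- cv: an integer, a cross-validation generator, or an iterable. 1) None (the default): 5-fold cross-validation (3-fold in scikit-learn versions before 0.22); 2) an integer k: k-fold cross-validation; 3) a custom cross-validation generator; 4) a custom iterable that yields training and test splits.
- refit: True by default, i.e. once the search has finished, the estimator is refit on the whole dataset with the best parameters found.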
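What an integer `cv=k` means can be sketched as follows. This is a simplified illustration that splits indices into k contiguous, unshuffled folds; `k_fold_indices` is a hypothetical helper, and the real scikit-learn splitters offer more options (shuffling, stratification, groups).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    # The first n_samples % k folds get one extra sample each.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every candidate in the grid is trained and scored once per fold, and the mean of the k scores is what the search compares.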
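Random search
In a random search, each parameter value is sampled from a distribution over its possible values. Because only a fixed number of candidates is evaluated, this helps reduce the computational cost.
Search strategy of RandomizedSearchCV:
- For a hyperparameter whose search range is a distribution, values are drawn at random from that distribution;
- For a hyperparameter whose search range is a list, values are drawn uniformly from the list (without replacement when every hyperparameter is given as a list; with replacement as soon as any of them is a distribution);
- The n_iter candidates sampled this way are then evaluated one by one.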
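The sampling strategy above can be sketched in plain Python. This is an illustration under assumptions, not scikit-learn's implementation: `sample_candidates` is a hypothetical helper, a distribution is represented as a function of a random generator, and list values are drawn independently on each iteration (that is, with replacement).

```python
import random

def sample_candidates(space, n_iter, seed=0):
    """Draw n_iter candidates from a mixed space of distributions and lists."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_iter):
        candidate = {}
        for name, spec in space.items():
            if callable(spec):
                candidate[name] = spec(rng)         # distribution: draw a value
            else:
                candidate[name] = rng.choice(spec)  # list: uniform pick
        candidates.append(candidate)
    return candidates

space = {
    "C": lambda rng: rng.uniform(1.0, 5.0),  # continuous range [1, 5]
    "kernel": ["linear", "rbf"],             # discrete choices
}
candidates = sample_candidates(space, n_iter=5)
for candidate in candidates:
    print(candidate)
```

Unlike a grid, the cost here is fixed by `n_iter` and does not grow with the number of hyperparameters.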
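RandomSearch vs. GridSearch
- The objective function is f(x,y)=g(x)+h(y), with g(x) drawn in green and h(y) in yellow; the goal is to find the maximum of f;
- g(x) is much larger in magnitude than h(y), so f(x,y)≈g(x);
- Both methods run 9 trials, but the grid search (left panel) effectively explores only three distinct values of each variable, whereas the random search (right panel) explores nine;
- The random search (right panel) is therefore more likely to find the maximum of the objective function.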
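The argument can be checked numerically with a small sketch. The particular g and h below are made-up stand-ins for the curves in the figure (a sharp peak in x, a nearly flat term in y), and the seed is arbitrary.

```python
import math
import random

def g(x):
    return math.exp(-((x - 0.37) ** 2) / 0.01)  # dominant term, sharp peak

def h(y):
    return 0.01 * y                              # minor term, nearly flat

def f(x, y):
    return g(x) + h(y)

# Grid search: 3 x 3 = 9 trials, but only 3 distinct x values.
axis = [0.25, 0.5, 0.75]
grid_points = [(x, y) for x in axis for y in axis]

# Random search: 9 trials, each with its own x value.
rng = random.Random(0)
random_points = [(rng.random(), rng.random()) for _ in range(9)]

def distinct_x(points):
    return len({x for x, _ in points})

print("distinct x, grid:  ", distinct_x(grid_points))
print("distinct x, random:", distinct_x(random_points))
print("best f, grid:  ", max(f(x, y) for x, y in grid_points))
print("best f, random:", max(f(x, y) for x, y in random_points))
```

With the same budget, random search probes the important variable at nine places instead of three. Whether it actually wins on a given run still depends on the draw.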
class sklearn.model_selection.RandomizedSearchCV(estimator, param_distributions, *, n_iter=10, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', random_state=None, error_score=nan, return_train_score=False)
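- estimator: the model to tune
- param_distributions: a dict mapping each parameter name to a distribution or to a list of candidate values
- cv: the cross-validation strategy (for an integer, the number of folds)
- n_iter: the number of parameter settings that are sampled (10 by default)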
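Example
Loading the Boston housing data
# Note: load_boston was removed in scikit-learn 1.2, so this snippet needs an older release.
from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target
features = boston.feature_names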
Hyperparameters set by hand
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler  # standardizes the features
from sklearn.pipeline import make_pipeline  # chains preprocessing and the model into one pipeline
reg_svr = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.2))
reg_svr.fit(X, y)
reg_svr.score(X, y)  # R^2 measured on the training data itself
0.7024525421955277
No tuning, with 10-fold cross-validation
from sklearn.model_selection import cross_val_score
import numpy as np
pipe_SVR = make_pipeline(StandardScaler(), SVR())
score1 = cross_val_score(estimator=pipe_SVR,
                         X=X,
                         y=y,
                         scoring='r2',
                         cv=10)
print(score1)
# The reported score is R^2 (scoring='r2'), not a classification accuracy.
print("CV accuracy: %.3f +/- %.3f" % ((np.mean(score1)), np.std(score1)))
[ 0.74943112 0.72244189 0.18237941 0.04934372 0.56317173 0.05674098
0.59932148 0.20889779 -1.61288394 0.35310766]
CV accuracy: 0.187 +/- 0.649
Grid search
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
pipe_svr = Pipeline([("StandardScaler", StandardScaler()),
                     ("svr", SVR())])
param_range = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
# "svr__C" addresses parameter C of the pipeline step named "svr".
param_grid = [{'svr__C': param_range, "svr__kernel": ["linear", "rbf"]}]
gs = GridSearchCV(estimator=pipe_svr,
                  param_grid=param_grid,
                  scoring='r2',
                  cv=10)
gs = gs.fit(X, y)
print(gs.best_score_)
print(gs.best_params_)
0.48352398136697194
{'svr__C': 10, 'svr__kernel': 'rbf'}
Tuning gamma as well
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
pipe_svr = Pipeline([("StandardScaler", StandardScaler()),
                     ("svr", SVR())])
param_range = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
param_grid = [{'svr__C': param_range, 'svr__gamma': param_range, "svr__kernel": ["linear", "rbf"]}]
gs = GridSearchCV(estimator=pipe_svr,
                  param_grid=param_grid,
                  scoring='r2',
                  cv=10)
gs = gs.fit(X, y)
print(gs.best_score_)
print(gs.best_params_)
0.6081303070817127
{'svr__C': 1000, 'svr__gamma': 0.001, 'svr__kernel': 'rbf'}
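Adding gamma to the grid makes the search considerably more expensive. The arithmetic can be sketched as below; `n_fits` is a hypothetical helper that counts cross-validation fits only (the final refit is not included), and `grid_split` is a suggested alternative, not something from the original notes. It relies on the fact that the linear kernel ignores gamma, so pairing gamma with "linear" repeats equivalent fits.

```python
from math import prod

def n_fits(param_grid, cv):
    """Count cross-validation fits for a list-of-dicts grid."""
    combos = sum(prod(len(values) for values in grid.values()) for grid in param_grid)
    return combos * cv

param_range = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]

grid_c_only = [{"svr__C": param_range, "svr__kernel": ["linear", "rbf"]}]
grid_with_gamma = [{"svr__C": param_range, "svr__gamma": param_range,
                    "svr__kernel": ["linear", "rbf"]}]
# Assumed alternative: only combine gamma with the kernel that uses it.
grid_split = [
    {"svr__kernel": ["linear"], "svr__C": param_range},
    {"svr__kernel": ["rbf"], "svr__C": param_range, "svr__gamma": param_range},
]

print(n_fits(grid_c_only, cv=10))      # 8 * 2 combinations, 10 folds each
print(n_fits(grid_with_gamma, cv=10))  # 8 * 8 * 2 combinations, 10 folds each
print(n_fits(grid_split, cv=10))       # (8 + 8 * 8) combinations, 10 folds each
```

Passing a list of dicts as param_grid is supported by GridSearchCV, so a split grid like this should cover the same distinct models with fewer fits.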
Random search
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
pipe_svr = Pipeline([("StandardScaler", StandardScaler()),
                     ("svr", SVR())])
# uniform(loc, scale) samples from [loc, loc + scale]: C in [1, 5], gamma in [0, 4].
distributions = dict(svr__C=uniform(loc=1.0, scale=4),
                     svr__kernel=['linear', 'rbf'],
                     svr__gamma=uniform(loc=0, scale=4))
# n_iter is left at its default of 10; without random_state the result varies between runs.
rs = RandomizedSearchCV(estimator=pipe_svr,
                        param_distributions=distributions,
                        scoring='r2',
                        cv=10)
rs = rs.fit(X, y)
print(rs.best_score_)
print(rs.best_params_)
0.3032819067297142
{'svr__C': 1.1091543360933844, 'svr__gamma': 0.5979948471677052, 'svr__kernel': 'linear'}
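The random search scores lower than the grid search here, and one plausible reason is the sampling range. The grid search selected svr__C = 1000 and svr__gamma = 0.001, while uniform(loc=1.0, scale=4) never produces C above 5 and uniform(loc=0, scale=4) rarely produces gamma near 0.001. For scale-type hyperparameters, sampling uniformly in log space is a common alternative. The sketch below illustrates the difference in plain Python; `log_uniform` and the range [1e-4, 1e3] are assumptions chosen to match the grid's param_range.

```python
import math
import random

def log_uniform(rng, low, high):
    """Draw a value whose base-10 logarithm is uniform on [log10(low), log10(high)]."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

rng = random.Random(0)
n = 1000
uniform_gamma = [rng.uniform(0.0, 4.0) for _ in range(n)]    # like uniform(loc=0, scale=4)
log_gamma = [log_uniform(rng, 1e-4, 1e3) for _ in range(n)]  # assumed log-scale range

# How many draws land in the small-gamma region that the grid search favoured?
small_uniform = sum(v < 0.01 for v in uniform_gamma)
small_log = sum(v < 0.01 for v in log_gamma)
print("draws below 0.01, uniform:    ", small_uniform)
print("draws below 0.01, log-uniform:", small_log)
```

In a scikit-learn search, scipy.stats.loguniform could play this role in param_distributions. Whether it would actually improve the score on this dataset is not something the original results show.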
References:
DataWhale open-source materials
随机搜索RandomizedSearchCV原理 (How RandomizedSearchCV random search works)