DataWhale Ensemble Learning (Part 1): Task04

(5) Tuning hyperparameters

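Parameters vs. hyperparameters

  • Model parameters are configuration variables internal to the model; their values are estimated from the data.
  • Model hyperparameters are configuration external to the model; their values cannot be estimated from the data and must be set before training.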
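
To make the distinction concrete, here is a minimal sketch using Ridge regression on synthetic data. The estimator, the data, and the names X_demo and y_demo are illustrative assumptions, not part of the original material. alpha is a hyperparameter chosen before fitting, while coef_ and intercept_ are parameters estimated from the data.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (assumption for illustration): y = 3*x0 - 2*x1 + noise
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# alpha is a hyperparameter: it is set by hand before calling fit()
ridge = Ridge(alpha=1.0)
ridge.fit(X_demo, y_demo)

# coef_ and intercept_ are parameters: they are estimated from the data
print("hyperparameter alpha:", ridge.alpha)
print("learned coefficients:", ridge.coef_)
print("learned intercept:", ridge.intercept_)
```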

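Two tuning methods are commonly used:

Grid search

Idea: list the candidate values for every hyperparameter and evaluate every combination of them.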
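
As a small illustration of "every combination", scikit-learn's ParameterGrid helper enumerates the Cartesian product of the candidate values. The values below are made up for the example.

```python
from sklearn.model_selection import ParameterGrid

# Candidate values for two SVR hyperparameters (illustrative values)
grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# ParameterGrid enumerates the Cartesian product of all candidate values
combos = list(ParameterGrid(grid))
print(len(combos))   # 3 * 2 = 6 combinations
for params in combos:
    print(params)
```

The number of combinations is the product of the list lengths, so it grows quickly as more hyperparameters are added.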

class sklearn.model_selection.GridSearchCV(estimator, param_grid, *, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)
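  • estimator: the model to tune.
  • param_grid: a dict or a list of dicts giving the candidate values of the parameters to optimize.
  • scoring: the evaluation metric. The default, None, uses the estimator's own score method.
  • cv: an integer, a cross-validation generator, or an iterable. 1) None (default): 5-fold cross-validation (the default was 3-fold before scikit-learn 0.22); 2) an integer k: k-fold cross-validation; 3) a custom cross-validation generator; 4) a custom iterable that yields train/test splits.
  • refit: defaults to True, meaning that once the search finishes, the estimator is refit on the full dataset using the best parameters found.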
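
A minimal sketch of how cv and refit can be used together, on a small synthetic regression problem. The data, the parameter values, and the name cv_splitter are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR

# Small synthetic regression problem (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=120, n_features=5,
                                 noise=10.0, random_state=0)

# cv may be an integer or a splitter object; here a shuffled 5-fold splitter
cv_splitter = KFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    estimator=SVR(),
    param_grid={"C": [1, 10, 100], "kernel": ["linear", "rbf"]},
    scoring="r2",
    cv=cv_splitter,
    refit=True,   # refit the best configuration on all of X_demo, y_demo
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
# Because refit=True, the search object itself can predict with the best model
print(search.predict(X_demo[:3]))
```

With refit=False the search still reports best_params_ and best_score_, but best_estimator_ and predict are not available.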

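Random search

In a random search, each parameter value is sampled from a distribution over its possible values. Because only a fixed number of settings is evaluated, this helps reduce the computational cost.

Search strategy of RandomizedSearchCV:

  1. For a hyperparameter whose search range is a distribution, sample randomly from that distribution;
  2. For a hyperparameter whose search range is a list, sample uniformly from the list (scikit-learn samples without replacement only when every hyperparameter is given as a list; if at least one is a distribution, it samples with replacement);
  3. Evaluate each of the resulting n_iter sampled settings in turn.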
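
The sampling step can be sketched with scikit-learn's ParameterSampler, which draws candidate settings from a mix of distributions and lists. The search space below is an illustrative assumption.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# One distribution-valued and one list-valued hyperparameter (illustrative)
space = {
    "C": uniform(loc=1.0, scale=4.0),   # continuous uniform on [1, 5]
    "kernel": ["linear", "rbf"],        # each entry is equally likely
}

# Draw n_iter = 5 candidate settings; random_state makes the draw repeatable
samples = list(ParameterSampler(space, n_iter=5, random_state=0))
for params in samples:
    print(params)
```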

Random search vs. grid search
[Figure: nine trials laid out by grid search (left panel) and by random search (right panel)]

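  1. The objective is f(x,y)=g(x)+h(y), with g(x) drawn in green and h(y) in yellow; the goal is to find the maximum of f;
  2. g(x) is much larger in magnitude than h(y), so f(x,y)≈g(x);
  3. Both methods run 9 trials, but the left panel actually explores only 3 distinct values of each variable, whereas the right panel explores 9 distinct values;
  4. The right panel is therefore more likely to find the maximum of the objective.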
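
The argument above can be checked numerically with a small sketch. The functions g and h below are hypothetical stand-ins chosen so that x matters far more than y; they are not the functions from the figure.

```python
import numpy as np

# Hypothetical objective where x matters far more than y
def g(x):
    return np.exp(-((x - 0.37) ** 2) / 0.01)   # sharp peak in x

def h(y):
    return 0.01 * y                             # nearly flat in y

def f(x, y):
    return g(x) + h(y)

rng = np.random.default_rng(0)

# Grid search: 3 x 3 = 9 trials, but only 3 distinct x values
axis = np.linspace(0.0, 1.0, 3)
grid_x, grid_y = [a.ravel() for a in np.meshgrid(axis, axis)]

# Random search: 9 trials, each with its own x value
rand_x = rng.uniform(0.0, 1.0, size=9)
rand_y = rng.uniform(0.0, 1.0, size=9)

print("distinct x (grid):  ", len(np.unique(grid_x)))
print("distinct x (random):", len(np.unique(rand_x)))
print("best f (grid):  ", f(grid_x, grid_y).max())
print("best f (random):", f(rand_x, rand_y).max())
```

With the same budget of 9 trials, the random layout probes three times as many positions along the axis that dominates the objective. Which method wins in a single run still depends on the random draw.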
class sklearn.model_selection.RandomizedSearchCV(estimator, param_distributions, *, n_iter=10, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', random_state=None, error_score=nan, return_train_score=False)
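  • estimator: the model to tune.
  • param_distributions: a dict mapping parameter names to distributions or to lists of candidate values.
  • cv: the cross-validation strategy, for example the number of folds.
  • n_iter: the number of parameter settings to sample (default 10).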

Example

Load the Boston housing data

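# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this example requires scikit-learn < 1.2.
from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target
features = boston.feature_names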

Manually specified hyperparameters

from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler     # standardize the features
from sklearn.pipeline import make_pipeline   # pipeline: chain preprocessing and the model into one workflow

reg_svr = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.2))
reg_svr.fit(X, y)
reg_svr.score(X,y)

0.7024525421955277
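
Note that the score above is computed on the same data used for fitting, so it tends to be optimistic. As a minimal sketch of a held-out evaluation, here is the same pipeline on synthetic data with a train/test split. The data and the 75/25 split are assumptions for illustration, not part of the original example.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption; not the Boston data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

model = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.2))
model.fit(X_train, y_train)

# R^2 on the data the model was fit on vs. on data it has not seen
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print("train R^2:", train_r2)
print("test R^2: ", test_r2)
```

Cross-validation, used in the next step, repeats this idea over several splits and averages the results.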

No tuning, with 10-fold cross-validation

from sklearn.model_selection import cross_val_score
import numpy as np

pipe_SVR = make_pipeline(StandardScaler(),SVR())

score1 = cross_val_score(estimator=pipe_SVR,
                         X = X,
                         y = y,
                         scoring = 'r2',
                         cv = 10)
print(score1)
print("CV R^2: %.3f +/- %.3f" % ((np.mean(score1)),np.std(score1)))

[ 0.74943112 0.72244189 0.18237941 0.04934372 0.56317173 0.05674098
0.59932148 0.20889779 -1.61288394 0.35310766]
CV R^2: 0.187 +/- 0.649

Grid search

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipe_svr = Pipeline([("StandardScaler",StandardScaler()),
                     ("svr",SVR())])

param_range = [0.0001,0.001,0.01,0.1,1,10,100,1000]
param_grid = [{'svr__C':param_range,"svr__kernel":["linear","rbf"]}]
gs = GridSearchCV(estimator=pipe_svr,
                  param_grid = param_grid,
                  scoring = 'r2',
                  cv = 10)
gs = gs.fit(X,y)
print(gs.best_score_)
print(gs.best_params_)

0.48352398136697194
{'svr__C': 10, 'svr__kernel': 'rbf'}

Tuning gamma as well

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipe_svr = Pipeline([("StandardScaler",StandardScaler()),
                     ("svr",SVR())])

param_range = [0.0001,0.001,0.01,0.1,1,10,100,1000]
param_grid = [{'svr__C':param_range,'svr__gamma':param_range,"svr__kernel":["linear","rbf"]}]
gs = GridSearchCV(estimator=pipe_svr,
                  param_grid = param_grid,
                  scoring = 'r2',
                  cv = 10)
gs = gs.fit(X,y)
print(gs.best_score_)
print(gs.best_params_)

0.6081303070817127
{'svr__C': 1000, 'svr__gamma': 0.001, 'svr__kernel': 'rbf'}
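
One observation about this grid: gamma is a kernel coefficient that the linear kernel does not use, so crossing gamma with kernel="linear" repeats equivalent fits. A possible refinement, sketched here as a suggestion rather than as part of the original example, is to pass param_grid as a list of dicts so that gamma is only varied for the rbf kernel. ParameterGrid is used below just to count the candidates; the names full_grid and split_grid are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_range = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]

# Single dict: gamma is crossed with both kernels
full_grid = [{"svr__C": param_range,
              "svr__gamma": param_range,
              "svr__kernel": ["linear", "rbf"]}]

# List of dicts: gamma is only varied for the rbf kernel
split_grid = [{"svr__C": param_range, "svr__kernel": ["linear"]},
              {"svr__C": param_range, "svr__gamma": param_range,
               "svr__kernel": ["rbf"]}]

print(len(ParameterGrid(full_grid)))    # 8 * 8 * 2 = 128 candidates
print(len(ParameterGrid(split_grid)))   # 8 + 8 * 8 = 72 candidates
```

Either list can be passed to GridSearchCV as param_grid; the second covers the same distinct models with fewer fits.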

Random search

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
pipe_svr = Pipeline([("StandardScaler",StandardScaler()),
                     ("svr",SVR())])

distributions = dict(svr__C = uniform(loc=1.0,scale=4),      # C sampled uniformly from [1, 5]
                     svr__kernel=['linear','rbf'],
                     svr__gamma = uniform(loc=0,scale=4))    # gamma sampled uniformly from [0, 4]
rs = RandomizedSearchCV(estimator=pipe_svr,
                        param_distributions = distributions,
                        scoring = 'r2',
                        cv = 10)
rs = rs.fit(X,y)
print(rs.best_score_)
print(rs.best_params_)

0.3032819067297142
{'svr__C': 1.1091543360933844, 'svr__gamma': 0.5979948471677052, 'svr__kernel': 'linear'}
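
The uniform ranges used here ([1, 5] for C and [0, 4] for gamma) do not reach the region the grid search preferred (C = 1000, gamma = 0.001), which may partly explain the lower score. One possible alternative is to sample C and gamma on a log scale, which is not part of the original example. A minimal sketch on synthetic data, using scipy.stats.loguniform (available in SciPy 1.4 and later); the names log_space and rs_log are illustrative.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption; not the Boston data used above)
X_demo, y_demo = make_regression(n_samples=100, n_features=5,
                                 noise=10.0, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("svr", SVR())])

# Log-uniform sampling spreads trials evenly across orders of magnitude
log_space = {
    "svr__C": loguniform(1e-4, 1e3),
    "svr__gamma": loguniform(1e-4, 1e3),
    "svr__kernel": ["linear", "rbf"],
}

rs_log = RandomizedSearchCV(pipe, param_distributions=log_space, n_iter=20,
                            scoring="r2", cv=3, random_state=0)
rs_log.fit(X_demo, y_demo)
print(rs_log.best_score_)
print(rs_log.best_params_)
```

Setting random_state makes the sampled candidates repeatable, which the original run above does not do, so its exact numbers will vary between runs.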

