As an example, let's use KNN to classify the digits dataset:
Finding the best hyperparameters with plain Python
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
digits = datasets.load_digits()
X = digits.data
y = digits.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
from sklearn.neighbors import KNeighborsClassifier
Searching over k as the hyperparameter:
best_score = 0.0
best_k = -1
for k in range(1, 11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_score = score
        best_k = k
print(best_k, best_score)
# 1 0.9861111111111112
# If the best value lands on the boundary of the search range (e.g. 10 here), it is worth extending the search beyond that boundary, as sketched below.
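For example, a follow-up search could look like this minimal sketch (the new range is an assumption; re-center it around whatever boundary value you hit):
best_score = 0.0
best_k = -1
for k in range(8, 21):  # extends past the old upper bound of 10, with a little overlap
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_score = score
        best_k = k
print(best_k, best_score)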
Adding weights (distance weighting) as a hyperparameter
best_score = 0.0
best_k = -1
best_method = ""
for method in ['uniform', 'distance']:
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_score = score
            best_k = k
            best_method = method
print(best_k, best_method, best_score)
# 1 uniform 0.9861111111111112
Adding the distance norm p as a hyperparameter
p defaults to 2, i.e., the Euclidean distance.
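As a quick illustration of what p changes (a minimal sketch; the two sample points are made up), the Minkowski distance sum(|x - y|^p)^(1/p) reduces to the Manhattan distance at p=1 and the Euclidean distance at p=2:
import numpy as np
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
for p in [1, 2, 3]:
    print(p, np.sum(np.abs(a - b) ** p) ** (1 / p))
# p=1 -> 7.0 (Manhattan), p=2 -> 5.0 (Euclidean), p=3 -> ~4.50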
%%time
# Computing distances involves taking p-th roots, which is relatively slow, so we time this cell.
best_score = 0.0
best_k = -1
best_p = -1
for p in range(1, 6):
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights='distance', p=p)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_score = score
            best_k = k
            best_p = p
print(best_k, best_p, best_score)
'''
1 2 0.9861111111111112
CPU times: user 14.8 s, sys: 46.7 ms, total: 14.9 s
Wall time: 14.9 s
'''
The search procedure above is also known as grid search.
Grid search with sklearn
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
digits = datasets.load_digits()
X = digits.data
y = digits.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
from sklearn.neighbors import KNeighborsClassifier
# Define the parameter grid to search over
param_grid = [
    {'weights': ['uniform'],
     'n_neighbors': [i for i in range(1, 11)]},
    {'weights': ['distance'],
     'n_neighbors': [i for i in range(1, 11)],
     'p': [i for i in range(1, 6)]},
]
knn_clf = KNeighborsClassifier()
from sklearn.model_selection import GridSearchCV
# CV stands for Cross Validation.
grid_search = GridSearchCV(knn_clf, param_grid)
%%time
# This takes a while.
grid_search.fit(X_train, y_train)
# CPU times: user 43.3 s, sys: 93.2 ms, total: 43.4 s
# Wall time: 43.5 s
'''
GridSearchCV(cv='warn', error_score='raise-deprecating',
estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
metric='minkowski',
metric_params=None, n_jobs=None,
n_neighbors=5, p=2,
weights='uniform'),
iid='warn', n_jobs=None,
param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'weights': ['uniform']},
{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'p': [1, 2, 3, 4, 5], 'weights': ['distance']}],
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=0)
'''
grid_search.best_estimator_  # the best classifier found, with its parameters
# KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=1, p=2, weights='uniform')
# best (cross-validated) accuracy
grid_search.best_score_
# 0.9846903270702854
# best parameters
grid_search.best_params_
# {'n_neighbors': 1, 'weights': 'uniform'}
# Note the trailing underscore on these attributes. It marks a sklearn naming convention: values that are not passed in by the user but computed by the estimator itself are named with a trailing underscore.
# Use the best model found as our classifier
knn_clf = grid_search.best_estimator_
knn_clf.predict(X_test)
'''
array([4, 0, 9, 1, 8, 7, 1, 5, 1, 6, 6, 7, 6, 1, 5, 5, 7, 6, 2, 7, 4, 6, 1, 5, 2, 9, 5, 4, 6, 5, 6, 3, 4, 0, 9, 9, 8, 4, 6, 8, 8, 5, 7, ... 5, 7, 8, 0, 4, 1, 4, 5])
'''
knn_clf.score(X_test, y_test)
# 0.9861111111111112
Improving efficiency
# The search above is easily parallelized; n_jobs sets how many CPU cores to use. It defaults to 1 (a single core); pass -1 to use all available cores.
# verbose makes the search print progress as it runs, which helps you monitor long searches. Pass an integer: the larger it is, the more detailed the output.
grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
'''
Fitting 3 folds for each of 60 candidates, totalling 180 fits
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
warnings.warn(CV_WARNING, FutureWarning)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 25 tasks | elapsed: 2.3s
[Parallel(n_jobs=-1)]: Done 146 tasks | elapsed: 8.9s
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed: 11.1s finished
GridSearchCV(cv='warn', error_score='raise-deprecating',
estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
metric='minkowski',
metric_params=None, n_jobs=None,
n_neighbors=1, p=2,
weights='uniform'),
iid='warn', n_jobs=-1,
param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'weights': ['uniform']},
{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'p': [1, 2, 3, 4, 5], 'weights': ['distance']}],
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=2)
'''
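The log above also warns that the default number of cross-validation folds changes from 3 to 5 in sklearn 0.22. Passing cv explicitly silences the warning; the 5 folds in this minimal sketch are just one reasonable choice:
grid_search = GridSearchCV(knn_clf, param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)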
About distances
KNeighborsClassifier uses the Minkowski distance by default, with p=2 (the Euclidean distance); the metric parameter switches it to a different distance.
The sklearn documentation lists the available distance metrics:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html
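For instance, a minimal sketch of switching the metric (reusing the digits train/test split from above; the choice of 'manhattan' and n_neighbors=3 is just for illustration):
knn_clf = KNeighborsClassifier(n_neighbors=3, metric='manhattan')
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test, y_test)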
Metrics intended for real-valued vector spaces:
identifier | class name | args | distance function |
---|---|---|---|
“euclidean” | EuclideanDistance | | sqrt(sum((x - y)^2)) |
“manhattan” | ManhattanDistance | | sum(|x - y|) |
“chebyshev” | ChebyshevDistance | | max(|x - y|) |
“minkowski” | MinkowskiDistance | p | sum(|x - y|^p)^(1/p) |
“wminkowski” | WMinkowskiDistance | p, w | sum(|w * (x - y)|^p)^(1/p) |
“seuclidean” | SEuclideanDistance | V | sqrt(sum((x - y)^2 / V)) |
“mahalanobis” | MahalanobisDistance | V or VI | sqrt((x - y)' V^-1 (x - y)) |
Metrics intended for two-dimensional vector spaces: Note that the haversine distance metric requires data in the form of [latitude, longitude] and both inputs and outputs are in units of radians.
identifier | class name | distance function |
---|---|---|
“haversine” | HaversineDistance | 2 arcsin(sqrt(sin^2(0.5*dx) + cos(x1)cos(x2)sin^2(0.5*dy))) |
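A minimal sketch of the haversine metric (the coordinates are made up; note the conversion from degrees to radians, and that the raw result is a distance on the unit sphere):
import numpy as np
from sklearn.neighbors import DistanceMetric
# Hypothetical (latitude, longitude) pairs in degrees, converted to radians
coords = np.radians([[39.9, 116.4],   # roughly Beijing
                     [31.2, 121.5]])  # roughly Shanghai
haversine = DistanceMetric.get_metric('haversine')
print(haversine.pairwise(coords) * 6371)  # scale by Earth's radius (~6371 km) to get km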
Metrics intended for integer-valued vector spaces: Though intended for integer-valued vectors, these are also valid metrics in the case of real-valued vectors.
identifier | class name | distance function |
---|---|---|
“hamming” | HammingDistance | N_unequal(x, y) / N_tot |
“canberra” | CanberraDistance | sum(|x - y| / (|x| + |y|)) |
“braycurtis” | BrayCurtisDistance | sum(|x - y|) / (sum(|x|) + sum(|y|)) |
Metrics intended for boolean-valued vector spaces: Any nonzero entry is evaluated to “True”. In the listings below, the following abbreviations are used:
- N : number of dimensions
- NTT : number of dims in which both values are True
- NTF : number of dims in which the first value is True, second is False
- NFT : number of dims in which the first value is False, second is True
- NFF : number of dims in which both values are False
- NNEQ : number of non-equal dimensions, NNEQ = NTF + NFT
- NNZ : number of nonzero dimensions, NNZ = NTF + NFT + NTT
identifier | class name | distance function |
---|---|---|
“jaccard” | JaccardDistance | NNEQ / NNZ |
“matching” | MatchingDistance | NNEQ / N |
“dice” | DiceDistance | NNEQ / (NTT + NNZ) |
“kulsinski” | KulsinskiDistance | (NNEQ + N - NTT) / (NNEQ + N) |
“rogerstanimoto” | RogersTanimotoDistance | 2 * NNEQ / (N + NNEQ) |
“russellrao” | RussellRaoDistance | NNZ / N |
“sokalmichener” | SokalMichenerDistance | 2 * NNEQ / (N + NNEQ) |
“sokalsneath” | SokalSneathDistance | NNEQ / (NNEQ + 0.5 * NTT) |
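As a quick check of these formulas (a minimal sketch with two made-up vectors; any nonzero entry counts as True):
import numpy as np
from sklearn.neighbors import DistanceMetric
x = np.array([[1.0, 1.0, 0.0, 0.0]])
y = np.array([[1.0, 0.0, 1.0, 0.0]])
# Here NTT=1, NTF=1, NFT=1, NFF=1, so NNEQ=2 and NNZ=3
jaccard = DistanceMetric.get_metric('jaccard')
print(jaccard.pairwise(x, y))  # NNEQ / NNZ = 2/3 ≈ 0.667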
User-defined distance:
identifier | class name | args |
---|---|---|
“pyfunc” | PyFuncDistance | func |
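Finally, a minimal sketch of a user-defined distance (the function below is just a hand-written Euclidean distance for illustration; a Python callable passed as metric runs much slower than the built-in metrics):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def my_dist(a, b):
    # hypothetical custom metric: plain Euclidean distance
    return np.sqrt(np.sum((a - b) ** 2))

knn_clf = KNeighborsClassifier(n_neighbors=3, metric=my_dist)
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test, y_test)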