Dataset Splitting Methods
k-Fold Cross-Validation
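1. Partition the full training set S into k disjoint subsets. If S contains m training examples, each subset holds m/k of them (approximately, when k does not divide m). Call the subsets {s1, s2, …, sk}.
2. In each round, hold out one subset as the test set and use the other k-1 subsets as the training set.
3. Train the learner on the k-1 training subsets.
4. Evaluate the resulting model on the test set to get its classification accuracy.
5. Average the k accuracies and use the mean as the estimate of the true accuracy of the model (or hypothesis function).
This method makes full use of every sample, but it is computationally expensive: the model has to be trained k times and tested k times.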
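The five steps above can be sketched without any library. This is a minimal illustration, not scikit-learn's implementation: `k_fold_indices`, `majority_label`, and `k_fold_accuracy` are hypothetical helper names, and the "learner" is a toy majority-class predictor standing in for a real model.

```python
def k_fold_indices(m, k):
    """Split indices 0..m-1 into k contiguous, disjoint folds of near-equal size."""
    base, extra = divmod(m, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # the first `extra` folds get one more
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def majority_label(labels):
    """Toy learner (assumption for illustration): predict the most common label."""
    return max(set(labels), key=labels.count)


def k_fold_accuracy(y, k):
    """Run k rounds of train/test and return (mean accuracy, per-fold accuracies)."""
    scores = []
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        prediction = majority_label([y[i] for i in train_idx])
        correct = sum(1 for i in test_idx if y[i] == prediction)
        scores.append(correct / len(test_idx))
    return sum(scores) / k, scores


y = [0, 0, 1, 0, 1, 0, 0, 1, 0]
mean_acc, per_fold = k_fold_accuracy(y, k=3)
print(per_fold, mean_acc)
```

Every index lands in exactly one test fold, which is what "makes full use of every sample" means in practice.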
Usage
# KFold
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
kf = KFold(n_splits=2)  # 2 folds; use n_splits=3 for 3-fold
kf.get_n_splits(X)
print(kf)
for train_index, test_index in kf.split(X):  # training-set indices, test-set indices
    print("Train Index:", train_index, ", Test Index:", test_index)
    X_train, X_test = X[train_index], X[test_index]  # slice the original data by index
    y_train, y_test = y[train_index], y[test_index]
    # print(X_train, X_test, y_train, y_test)

# GroupKFold: a variant of the K-fold iterator that takes group labels
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 2, 3, 4, 5, 6])  # here each sample is its own group
group_kfold = GroupKFold(n_splits=2)
group_kfold.get_n_splits(X, y, groups)
print(group_kfold)
for train_index, test_index in group_kfold.split(X, y, groups):  # yields (train, test) index tuples
    print("Train Index:", train_index, ", Test Index:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # print(X_train, X_test, y_train, y_test)
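
The GroupKFold example above gives every sample its own group, so the effect of grouping is hard to see. The sketch below illustrates the idea with repeated group labels: all samples that share a group end up in the same fold. It is a simplified illustration under assumptions, not the library's code; `group_folds` is a hypothetical helper, and the "largest group into the lightest fold" rule is one simple balancing strategy.

```python
from collections import Counter


def group_folds(groups, n_splits):
    """Assign whole groups to folds so that no group is split across folds."""
    sizes = Counter(groups)
    fold_of_group = {}
    fold_load = [0] * n_splits
    # Largest groups first, each placed into the currently lightest fold.
    for group, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        lightest = fold_load.index(min(fold_load))
        fold_of_group[group] = lightest
        fold_load[lightest] += size
    folds = [[] for _ in range(n_splits)]
    for index, group in enumerate(groups):
        folds[fold_of_group[group]].append(index)
    return folds


groups = ["a", "a", "b", "b", "b", "c"]
folds = group_folds(groups, n_splits=2)
for test_idx in folds:
    train_idx = [i for i in range(len(groups)) if i not in test_idx]
    print("Train Index:", train_idx, ", Test Index:", test_idx)
```

Keeping a group intact matters when samples within a group are correlated (for example, several measurements from one subject), because otherwise the test score is measured on data very similar to the training data.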

# sklearn.model_selection.StratifiedKFold: stratified cross-validation
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Stratified: each test fold takes one sample from the first three (class 1)
# and one from the last three (class 2).
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 1, 1, 2, 2, 2])
skf = StratifiedKFold(n_splits=3)
skf.get_n_splits(X, y)
print(skf)
for train_index, test_index in skf.split(X, y):
    print("Train Index:", train_index, ", Test Index:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
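
The stratification described in the StratifiedKFold comment can be sketched by dealing each class's samples out to the folds in turn. This is a simplified illustration, not necessarily how the library assigns samples; `stratified_folds` is a hypothetical helper name.

```python
from collections import defaultdict


def stratified_folds(y, n_splits):
    """Deal the indices of each class round-robin across the folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    for members in by_class.values():
        for position, index in enumerate(members):
            folds[position % n_splits].append(index)
    return [sorted(fold) for fold in folds]


y = [1, 1, 1, 2, 2, 2]
folds = stratified_folds(y, n_splits=3)
print(folds)  # [[0, 3], [1, 4], [2, 5]]
```

Each test fold holds one sample of class 1 and one of class 2, so the 1:1 class ratio of the full dataset is preserved in every fold.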
Leave-One-Out
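Leave-one-out (LOO) validation:
Given N samples, each sample in turn serves as the test sample while the other N-1 serve as training samples. This yields N classifiers and N test results, and the mean of those N results measures the model's performance.
Compared with k-fold CV, LOO builds N models from N samples rather than k. Moreover, each of those N models is trained on N-1 samples rather than (k-1)N/k. Assuming k is not very large and k << N, LOO is therefore more time-consuming than k-fold CV.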
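The cost comparison can be made concrete with a little arithmetic. The sketch below treats LOO as k-fold with k = N; `cv_cost` is a hypothetical helper, and N = 1000 with k = 5 are arbitrary illustrative values.

```python
def cv_cost(n_samples, n_splits):
    """Return (number of models trained, training-set size per model)."""
    train_size = n_samples * (n_splits - 1) // n_splits  # (k-1)N/k
    return n_splits, train_size


N = 1000
kfold_models, kfold_train = cv_cost(N, 5)  # 5-fold CV
loo_models, loo_train = cv_cost(N, N)      # LOO behaves like k-fold with k = N

print("5-fold:", kfold_models, "models,", kfold_train, "training samples each")
print("LOO:   ", loo_models, "models,", loo_train, "training samples each")
```

With these numbers, 5-fold CV trains 5 models on 800 samples each, while LOO trains 1000 models on 999 samples each.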
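Leave-P-out validation:
Given N samples, every possible subset of P samples serves once as the test set, with the other N-P samples as the training set. This yields C(N, P) = N! / (P! (N-P)!) train-test pairs. Unlike LeaveOneOut and KFold, the test sets overlap when P > 1. When P = 1, this reduces to leave-one-out.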
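A minimal sketch of leave-P-out using only the standard library shows both the C(N, P) count and the overlap between test sets. `leave_p_out` is a hypothetical helper written for illustration, not the library's implementation.

```python
from itertools import combinations
from math import comb


def leave_p_out(n_samples, p):
    """Yield (train_indices, test_indices) for every size-p test subset."""
    indices = range(n_samples)
    for test_idx in combinations(indices, p):
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, list(test_idx)


splits = list(leave_p_out(4, 2))
print(len(splits), comb(4, 2))  # both are 6
for train_idx, test_idx in splits:
    print("TRAIN:", train_idx, "TEST:", test_idx)

# With p=1 every test set is a single sample, i.e. leave-one-out.
loo_splits = list(leave_p_out(4, 1))
```

In the p=2 output, index 0 appears in three different test sets, which is the overlap mentioned above. The number of splits grows combinatorially with N, so this is only practical for small datasets.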
Usage
# sklearn.model_selection.LeaveOneOut: leave-one-out; trains and tests once per sample
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
loo = LeaveOneOut()
loo.get_n_splits(X)
print(loo)
for train_index, test_index in loo.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # print(X_train, X_test, y_train, y_test)

# sklearn.model_selection.LeavePOut: leave-P-out; does not guarantee balanced class proportions
import numpy as np
from sklearn.model_selection import LeavePOut

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
lpo = LeavePOut(p=3)  # each split holds out 3 samples for testing and trains on the rest
lpo.get_n_splits(X)
print(lpo)
for train_index, test_index in lpo.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # print(X_train, X_test, y_train, y_test)
Random Splitting
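The ShuffleSplit iterator generates a user-specified number of independent train/test splits. It first shuffles all the samples randomly and then divides them into a train/test pair. The random_state seed controls the random number generator so that results are reproducible.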
ShuffleSplit is a good alternative to KFold cross-validation because it gives finer control over the number of iterations and the train/test proportion.
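The shuffle-then-split idea can be sketched with the standard library's `random` module. This is an illustration under assumptions, not the library's code: `shuffle_split` is a hypothetical helper, and rounding the test count up and the train count down is a choice made for this sketch.

```python
import math
import random


def shuffle_split(n_samples, n_splits, test_size, train_size=None, seed=0):
    """Yield n_splits independent (train, test) index lists from fresh shuffles."""
    rng = random.Random(seed)  # fixed seed makes the splits reproducible
    n_test = math.ceil(test_size * n_samples)
    if train_size is None:
        n_train = n_samples - n_test  # use everything that is not in the test set
    else:
        n_train = math.floor(train_size * n_samples)
    for _ in range(n_splits):
        perm = list(range(n_samples))
        rng.shuffle(perm)
        test_idx = perm[:n_test]
        train_idx = perm[n_test:n_test + n_train]
        yield train_idx, test_idx


splits = list(shuffle_split(6, n_splits=3, test_size=0.25, train_size=0.5, seed=0))
for train_idx, test_idx in splits:
    print("TRAIN:", train_idx, "TEST:", test_idx)
```

Because each split comes from a new shuffle, a sample may appear in several test sets or in none, unlike KFold, where every sample is tested exactly once. With 6 samples, train_size=0.5 and test_size=0.25 give 3 training and 2 test samples, leaving one sample unused in each split.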
StratifiedShuffleSplit is a variant of ShuffleSplit that returns stratified splits: when each split is created, the class proportions in it are kept consistent with those of the full dataset.
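One simple way to get a stratified random split is to shuffle and cut each class separately, then merge the pieces. The sketch below does that; it is an illustration under assumptions, not the library's algorithm, and `stratified_shuffle_split` is a hypothetical helper name.

```python
import random
from collections import defaultdict


def stratified_shuffle_split(y, n_splits, test_size, seed=0):
    """Yield (train, test) index lists that keep per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)
    for _ in range(n_splits):
        train_idx, test_idx = [], []
        for members in by_class.values():
            shuffled = members[:]
            rng.shuffle(shuffled)
            n_test = round(test_size * len(shuffled))  # per-class test count
            test_idx += shuffled[:n_test]
            train_idx += shuffled[n_test:]
        yield sorted(train_idx), sorted(test_idx)


y = [1, 2, 1, 2, 1, 2, 1, 2]
splits = list(stratified_shuffle_split(y, n_splits=3, test_size=0.5, seed=0))
for train_idx, test_idx in splits:
    print("TRAIN:", train_idx, "TEST:", test_idx)
```

Every test set here contains two samples of class 1 and two of class 2, matching the 1:1 ratio in `y`, while the particular samples chosen vary from split to split.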
Usage
# sklearn.model_selection.ShuffleSplit: random splitting
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])  # sample points
y = np.array([1, 2, 4, 2, 1, 2])  # class labels
rs = ShuffleSplit(n_splits=3, test_size=.25, random_state=0)  # test_size=.25: fraction of samples used for testing
rs.get_n_splits(X)
print(rs)
for train_index, test_index in rs.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
print('=======================================================')
# train_size and test_size need not sum to 1, although they usually do.
# With train_size=0.5, the output shows 3 training samples per split.
rs = ShuffleSplit(n_splits=3, train_size=0.5, test_size=.25, random_state=0)
for train_index, test_index in rs.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

# sklearn.model_selection.StratifiedShuffleSplit: stratified random splitting
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 1, 2, 1, 2])
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)  # n_splits=3: number of splits to generate
sss.get_n_splits(X, y)
print(sss)
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]