Machine Learning Basics: Choosing Parameters with Cross-Validation. train_test_split(under_x, under_y, test_size, random_state) for data splitting, KFold for cross-validation splitting, and recall_score for recall

2024-03-08 09:08:15

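1. train_test_split(under_x, under_y, test_size=0.3, random_state=0) # under_x and under_y are the input features and labels; test_size is the fraction of the data held out as the test set; random_state is the random seed

2. KFold(len(train_x), 5, shuffle=False) # the first argument, len(train_x), is the number of samples; 5 is the number of folds, i.e. the number of loop iterations; shuffle controls whether the data is shuffled before splitting. This is the signature of the legacy sklearn.cross_validation module, which was removed in scikit-learn 0.20. With sklearn.model_selection the equivalent is KFold(n_splits=5, shuffle=False), and the indices come from its split() method.

3. recall_score computes recall: the number of samples of a class that were predicted correctly, divided by the number of samples that actually belong to that class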
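To make the first and third calls concrete, here is a minimal sketch on made-up toy arrays (the data and variable names are assumptions for illustration, not the dataset used in this post). It shows how test_size=0.3 divides ten rows and how recall is computed from true and predicted labels.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# test_size=0.3 holds out 30% of the rows; random_state makes the shuffle repeatable.
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(train_x.shape, test_x.shape)

# Recall = correctly predicted positives / all actual positives.
y_true = np.array([1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1])
recall = recall_score(y_true, y_pred)
print(recall)  # 3 of the 4 actual positives were found, so 0.75
```

Note that the false positive in the last position does not affect recall; it would lower precision instead.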

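We split the data into a training set and a test set. To choose good parameters, we split the training set once more, into a training part and a validation part, and use the validation part to pick the parameters.

Here we validate the regularization parameter C. In LogisticRegression, C is the inverse of the regularization strength, so a smaller C means stronger regularization.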
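As a rough illustration of what C controls, the sketch below fits an L1-penalized logistic regression for a few values of C and counts the non-zero coefficients. The synthetic data from make_classification is an assumption made purely for this example; the exact counts depend on the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, invented for illustration only.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

nonzero_counts = {}
for c in [0.01, 1, 100]:
    # liblinear is one of the solvers that supports the L1 penalty.
    lr = LogisticRegression(C=c, penalty='l1', solver='liblinear')
    lr.fit(X, y)
    # Stronger regularization (small C) pushes more coefficients to exactly zero.
    nonzero_counts[c] = int(np.count_nonzero(lr.coef_))
    print(c, nonzero_counts[c])
```

Because the best trade-off is not known in advance, C is exactly the kind of parameter that is usually chosen by validation.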

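Cross-validation works like this: KFold(len(train_x), 5, shuffle=False) splits the row indices into 5 parts. In each round, four parts form the training set and the remaining part forms the validation set. This guards against a score that comes out too low or too high just because one particular subset of the data happens to behave unusually.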

Training set    Validation set
1 2 3 4         5
2 3 4 5         1
3 4 5 1         2
4 5 1 2         3
5 1 2 3         4

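There are 5 rounds in total, and the scores from all rounds are averaged to give the final validation score.

We use recall_score as the validation metric and KFold to split the data indices, and return the best parameter.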
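The rotation in the table can be reproduced with a short sketch. It assumes the current sklearn.model_selection.KFold API and ten made-up samples; with shuffle=False the validation folds are taken in order from the start of the data, so the order of the rounds differs from the table, but every part still serves as the validation set exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, invented for illustration.
data = np.arange(1, 11)

kfold = KFold(n_splits=5, shuffle=False)
# split() yields one (train indices, validation indices) pair per round.
folds = list(kfold.split(data))

for round_no, (train_idx, val_idx) in enumerate(folds, start=1):
    print(round_no, 'train:', data[train_idx], 'validation:', data[val_idx])
```

The yielded values are positional indices rather than the data itself, which is why the training code selects rows with iloc.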

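from sklearn.model_selection import train_test_split

# X, y, under_x and under_y are assumed to have been prepared beforehand

# Split the full dataset
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=0)

# Split the undersampled dataset
under_train_x, under_test_x, under_train_y, under_test_y = train_test_split(under_x, under_y, test_size=0.3, random_state=0)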
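This post does not show how under_x and under_y were built. One common way to obtain such data is random undersampling of the majority class, sketched below. The toy DataFrame, the column names, and the 1:1 class ratio are all assumptions for illustration and may differ from what was actually used here.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)

# Hypothetical imbalanced data: 95 negative rows and 5 positive rows.
X = pd.DataFrame(rng.randn(100, 3), columns=['f1', 'f2', 'f3'])
y = pd.Series([0] * 95 + [1] * 5, name='Class')

pos_idx = y[y == 1].index.to_numpy()
neg_idx = y[y == 0].index.to_numpy()

# Keep every positive row and draw an equal number of negative rows at random.
sampled_neg_idx = rng.choice(neg_idx, size=len(pos_idx), replace=False)
under_idx = np.concatenate([pos_idx, sampled_neg_idx])

under_x = X.loc[under_idx]
under_y = y.loc[under_idx]
print(under_y.value_counts())
```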

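import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
# KFold now lives in sklearn.model_selection; sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.model_selection import KFold

# Use cross-validation to choose the parameter
def printing_KFold_score(train_x, train_y):
    """
    Run cross-validation on the training data.
    :param train_x: input features (DataFrame)
    :param train_y: input labels (DataFrame or Series)
    :return: the best value of C
    """
    # Split the row indices into 5 folds
    fold = KFold(n_splits=5, shuffle=False)
    # Candidate regularization parameters
    c_parameter = [0.01, 0.1, 1, 10, 100]
    # DataFrame that stores each C and its mean recall (the column keeps its original name)
    train_score = pd.DataFrame(index=range(len(c_parameter)), columns=['c_parameter', 'F_score_mean'])
    train_score['c_parameter'] = c_parameter
    for i, c in enumerate(c_parameter):
        scores = []
        for n_fold, (train_idx, val_idx) in enumerate(fold.split(train_x), start=1):
            # The L1 penalty needs a solver that supports it, such as liblinear
            lr = LogisticRegression(C=c, penalty='l1', solver='liblinear')
            lr.fit(train_x.iloc[train_idx, :], np.ravel(train_y.iloc[train_idx]))
            pred_y = lr.predict(train_x.iloc[val_idx, :])
            # Recall on the validation fold
            score = recall_score(np.ravel(train_y.iloc[val_idx]), pred_y)
            print('{} {}'.format(n_fold, score))
            scores.append(score)
        # Store the mean recall in the row that belongs to this C
        train_score.loc[i, 'F_score_mean'] = np.mean(scores)
    print(train_score)
    # idxmax() returns the index label of the largest value; use it to look up the best C
    train_score['F_score_mean'] = train_score['F_score_mean'].astype(float)
    best_parameter = train_score.loc[train_score['F_score_mean'].idxmax(), 'c_parameter']
    print('the best_parameter is {}'.format(best_parameter))
    return best_parameter

best_c = printing_KFold_score(under_train_x, under_train_y)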
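For comparison, the same search over C can be written with scikit-learn's GridSearchCV, which runs the fold loop and the averaging internally. This is a sketch of an alternative, not the method used above, and it runs on synthetic data from make_classification (an assumption for illustration) rather than on the undersampled dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic, mildly imbalanced data, invented for illustration only.
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.7, 0.3], random_state=0
)

search = GridSearchCV(
    estimator=LogisticRegression(penalty='l1', solver='liblinear'),
    param_grid={'C': [0.01, 0.1, 1, 10, 100]},
    scoring='recall',                     # same metric as recall_score
    cv=KFold(n_splits=5, shuffle=False),  # same 5-fold split, no shuffling
)
search.fit(X, y)

# Mean validation recall for each candidate C, in param_grid order.
print(search.cv_results_['mean_test_score'])
print('best C:', search.best_params_['C'], 'mean recall:', search.best_score_)
```

The hand-written loop is still useful when the per-fold scores need to be printed or inspected as they are computed.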
