GBDT+LR

LR is a linear model: it is easy to parallelize and can comfortably handle hundreds of millions of samples, but its learning capacity is very limited, so it relies on extensive feature engineering to improve. That feature engineering is time-consuming and labor-intensive, and does not necessarily improve results. How to automatically discover effective features and feature combinations, compensate for the limits of human experience, and shorten the LR feature-experiment cycle is therefore a pressing problem.
The FM model discovers pairwise feature interactions through latent factors, but these combinations are limited to pairs of features; later, deep neural networks were developed to mine higher-order feature interactions. Before neural networks became common, however, GBDT was already an effective and frequently used way to discover feature combinations.

The LR model therefore needs heavy feature work. Put simply, LR is a neural network with no hidden layer and a sigmoid activation function.

Why build the trees with GBDT rather than RF: RF is also an ensemble of trees, but in practice it has been shown to perform worse than GBDT here. In GBDT, the earlier trees split mainly on features that discriminate well for the majority of samples, while the later trees focus on the minority of samples whose residuals are still large after the first N trees. Preferring features that are discriminative overall, and then features that are discriminative for the remaining hard samples, is a more reasonable ordering, which is presumably why GBDT is used.

Evaluation Metrics

Evaluation metrics: Since we are most concerned with the impact of the factors on the machine learning model, we use the accuracy of prediction instead of metrics directly related to profit and revenue. In this work, we use Normalized Entropy (NE) and calibration as our major evaluation metrics.
There are two main evaluation metrics:

NE

Normalized Entropy (more precisely, Normalized Cross-Entropy)

Normalized Entropy, or more accurately Normalized Cross-Entropy, is equivalent to the average log loss per impression divided by what the average log loss per impression would be if a model predicted the background click-through rate (CTR) for every impression. In other words, it is the predictive log loss normalized by the entropy of the background CTR. The background CTR is the average empirical CTR of the training data set. It would perhaps be more descriptive to refer to the metric as the Normalized Logarithmic Loss. The lower the value, the better the prediction made by the model. The reason for this normalization is that the closer the background CTR is to either 0 or 1, the easier it is to achieve a better log loss; dividing by the entropy of the background CTR makes the NE insensitive to the background CTR. Assume a given training data set has N examples with labels y_i ∈ {−1, +1} and estimated probability of click p_i, where i = 1, 2, ..., N. We denote the average empirical CTR as p.

\(NE=\frac{-\frac{1}{N} \sum_{i=1}^{N}\left(\frac{1+y_{i}}{2} \log \left(p_{i}\right)+\frac{1-y_{i}}{2} \log \left(1-p_{i}\right)\right)}{-\left(p \log (p)+(1-p) \log (1-p)\right)}\)

  • Normalized Cross-Entropy measures prediction accuracy more precisely than raw log loss
  • NE = the model's predictive log loss divided by the entropy of the background CTR (a short computation sketch follows this list)
  • The lower the NE, the better the model's predictions
  • Dividing by the entropy of the background CTR makes NE insensitive to the background CTR
  • p denotes the average empirical CTR
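
A minimal sketch of computing NE from labels and predicted probabilities, assuming labels in {0, 1} (mapped from the {−1, +1} convention above); the array and function names are illustrative:

import numpy as np

def normalized_entropy(y_true, p_pred, eps=1e-15):
    """NE: average predictive log loss per impression divided by the
    entropy of the background CTR (the average empirical CTR of the data)."""
    y_true = np.asarray(y_true, dtype=float)                         # labels in {0, 1}
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)  # avoid log(0)
    # average predictive log loss per impression
    log_loss = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    # entropy of the background CTR
    p = y_true.mean()
    background_entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return log_loss / background_entropy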

Calibration

Calibration is the ratio of the average estimated CTR and the empirical CTR. In other words, it is the ratio of the number of expected clicks to the number of actually observed clicks. Calibration is a very important metric since accurate and well-calibrated prediction of CTR is essential to the success of online bidding and auction. The less the calibration differs from 1, the better the model is. We only report calibration in the experiments where it is non-trivial.

Calibration is the ratio of expected (predicted) clicks to actually observed clicks; it is a simple ratio.
The closer Calibration is to 1, the better the model.
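
Under the same assumptions as above, a one-function sketch of the calibration ratio (array names are illustrative):

import numpy as np

def calibration(y_true, p_pred):
    # ratio of expected clicks (sum of predicted CTRs) to actually observed clicks
    return np.sum(p_pred) / np.sum(y_true)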

Area-Under-ROC (AUC) is also a pretty good metric for measuring ranking quality without considering calibration. In a realistic environment, we expect the prediction to be accurate instead of merely getting the optimal ranking order, to avoid potential under-delivery or over-delivery. NE measures the goodness of predictions and implicitly reflects calibration. For example, if a model overpredicts by 2x and we apply a global multiplier of 0.5 to fix the calibration, the corresponding NE will also be improved even though AUC remains the same. See [12] for an in-depth study of these metrics.
AUC is also a good evaluation metric, but it has a problem. For example, if our model's predicted CTRs are all too high by a factor of 2, we can fix the calibration with a global multiplier of 0.5; after this correction the NE improves while the AUC stays the same.
In practice we want to predict each ad's click probability as accurately as possible, not merely to obtain the correct relative ranking, so AUC is less suitable here than NE and Calibration.
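
To make this concrete, a small sketch with hypothetical data showing that a global multiplier leaves AUC unchanged (the ranking order is preserved) while NE and calibration shift; it reuses the normalized_entropy helper sketched earlier:

import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([0, 1, 0, 1, 1, 0, 0, 0])
p = np.array([0.6, 0.9, 0.75, 0.8, 0.7, 0.4, 0.5, 0.6])   # a model whose predictions run high
p_fixed = 0.5 * p                                           # global multiplier to fix calibration

print(roc_auc_score(y, p), roc_auc_score(y, p_fixed))       # identical: AUC only sees the ordering
print(normalized_entropy(y, p), normalized_entropy(y, p_fixed))  # NE shifts with the rescaling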

Steps of GBDT+LR

As its name suggests, GBDT+LR consists of two parts: GBDT extracts features from the training set to form a new training input, and LR serves as the classifier over this new input.

[Figure: GBDT+LR]

Concretely, there are the following steps:

3.1 GBDT is first trained on the original training data to obtain a binary classifier; grid search can of course be used here to find the best parameter combination.

3.2 Unlike the usual practice, when the trained GBDT makes a prediction, the output is not the final binary-classification probability. Instead, for each tree in the model, the leaf node that the sample falls into is marked as 1; this is how the new training data is constructed.

For example, the figure below shows a GBDT+LR model structure. Suppose the GBDT has two weak learners, shown in blue and red, where the blue learner has 3 leaf nodes and the red learner has 2. A given sample falls into the second leaf of the blue learner and also into the second leaf of the red learner. We then record the blue learner's output as [0 1 0] and the red learner's output as [0 1]; combined, the GBDT output for this sample is the concatenation [0 1 0 0 1], i.e. a sparse vector.
[Figure: GBDT+LR example structure]

The idea here is the same as One-hot encoding; in fact, One-hot encoding is exactly what is used when constructing the new training data from the GBDT. Since each weak learner outputs its prediction from exactly one leaf node, in a GBDT with n weak learners and m leaf nodes in total, every training sample is converted into a 1×m sparse vector with n elements equal to 1 and the remaining m−n elements equal to 0.
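
A tiny sketch of this encoding for the toy example above (two trees with 3 and 2 leaves, and the sample landing in the second leaf of each); variable names are illustrative:

import numpy as np

leaves_per_tree = [3, 2]                           # blue tree: 3 leaves, red tree: 2 leaves
hit_leaf = np.array([1, 1])                        # the sample falls into leaf index 1 of each tree

offsets = np.cumsum([0] + leaves_per_tree[:-1])    # starting column of each tree: [0, 3]
one_hot = np.zeros(sum(leaves_per_tree), dtype=int)
one_hot[offsets + hit_leaf] = 1
print(one_hot)                                     # [0 1 0 0 1]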

3.3 Once the new training data is constructed, it is fed, together with the original labels, into a Logistic Regression classifier to train the final model. Note that after the GBDT transformation the data is not only sparse; depending on the number of weak learners and leaf nodes, the feature dimension of the new training data may also become very large. Regularization can therefore be used in the Logistic Regression layer to reduce the risk of overfitting; the Facebook paper uses L1 regularization.

Implementation

from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np

# Train the GBDT on the original features
gbm1 = GradientBoostingClassifier(n_estimators=50, random_state=10, subsample=0.6, max_depth=7,
                                  min_samples_split=900)
gbm1.fit(X_train, Y_train)

# apply() returns, for each sample, the index of the leaf it falls into in every tree
train_new_feature = gbm1.apply(X_train)
train_new_feature = train_new_feature.reshape(-1, 50)   # one column per tree (n_estimators=50)

# One-hot encode the leaf indices to obtain the new sparse training features
enc = OneHotEncoder()
enc.fit(train_new_feature)

# # maximum number of values per feature (deprecated attribute in newer sklearn versions)
# print('max number of values per feature:', enc.n_values_)
# print('total number of values over all features:', enc.n_values_.sum())

train_new_feature2 = np.array(enc.transform(train_new_feature).toarray())
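
The block above stops at the transformed one-hot features. A minimal sketch of the final LR step, reusing train_new_feature2 and Y_train from above; the solver and C value are illustrative, with the L1 penalty mentioned in the Facebook paper:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
lr.fit(train_new_feature2, Y_train)
# at prediction time, new samples go through the same apply + one-hot transform before lr.predict_proba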

Another implementation (using LightGBM)

import lightgbm as lgb
import numpy as np

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': {'binary_logloss'},
    'num_leaves': 64,
    'num_trees': 100,
    'learning_rate': 0.01,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}


# number of leaves, will be used in the feature transformation
num_leaf = 64

print('Start training...')
# build the LightGBM dataset from the raw training features/labels (x_train, y_train assumed defined)
lgb_train = lgb.Dataset(x_train, y_train)
gbm = lgb.train(params=params,
                train_set=lgb_train,
                valid_sets=lgb_train)


print('Start predicting...')
# pred_leaf=True: for each sample, the index of the leaf it falls into in each of the 100 trees
y_pred = gbm.predict(x_train, pred_leaf=True)
# plain predicted click probabilities
y_pred_prob = gbm.predict(x_train)


result = []
threshold = 0.5
for pred in y_pred_prob:
    result.append(1 if pred > threshold else 0)
print('result:', result)


print('Writing transformed training data')
# shape: N samples x (num_trees * num_leaf) one-hot columns
transformed_training_matrix = np.zeros([len(y_pred), len(y_pred[0]) * num_leaf],
                                       dtype=np.int64)
for i in range(0, len(y_pred)):
    # temp holds, for each tree, the global column index of the leaf this sample falls into
    # (tree j occupies columns j*num_leaf ... j*num_leaf + num_leaf - 1, i.e. offsets 0, 64, 128, ..., 6336 for 100 trees)
    temp = np.arange(len(y_pred[0])) * num_leaf + np.array(y_pred[i])
    # build the one-hot encoded training matrix
    transformed_training_matrix[i][temp] += 1

y_pred = gbm.predict(x_test, pred_leaf=True)
print('Writing transformed testing data')
transformed_testing_matrix = np.zeros([len(y_pred), len(y_pred[0]) * num_leaf], dtype=np.int64)
for i in range(0, len(y_pred)):
    temp = np.arange(len(y_pred[0])) * num_leaf + np.array(y_pred[i])
    # build the one-hot encoded testing matrix
    transformed_testing_matrix[i][temp] += 1
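
As with the first implementation, a hedged sketch of the final LR over the transformed matrices, assuming y_train exists alongside x_train and x_test; hyperparameters are illustrative:

from sklearn.linear_model import LogisticRegression

lm = LogisticRegression(penalty='l1', solver='liblinear', C=0.05)
lm.fit(transformed_training_matrix, y_train)
y_pred_lr = lm.predict_proba(transformed_testing_matrix)[:, 1]   # predicted CTR on the test set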


Advantages

The main gain is prediction accuracy: GBDT automatically discovers effective features and feature combinations, so the downstream LR predicts more accurately than LR on the raw features alone.
