python – How to implement incremental training for xgboost?

The problem is that my training data cannot fit into RAM due to its size. So I need a method that first builds one tree on the whole training data set, computes the residuals, builds another tree, and so on (as gradient boosted trees do). Obviously, if I call model = xgb.train(param, batch_dtrain, 2) in some loop, it won't help, because in that case it simply rebuilds the whole model for each batch.

Solution:

Disclaimer: I'm new to xgboost as well, but I think I've figured it out.

Try saving your model after training on the first batch. Then, on successive runs, provide the xgb.train method with the file path of the saved model.

Here is a small experiment I ran to convince myself that it works:

First, split the Boston dataset into training and testing sets.
Then split the training set in half.
Fit a model on the first half and get a score that will serve as a benchmark.
Then fit two models on the second half; one of them will have the additional parameter xgb_model. If passing the extra parameter made no difference, we would expect their scores to be similar.
But, fortunately, the new model seems to perform much better than the first.

import xgboost as xgb
# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split as ttsplit
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse

X = load_boston()['data']
y = load_boston()['target']

# split data into training and testing sets
# then split training set in half
X_train, X_test, y_train, y_test = ttsplit(X, y, test_size=0.1, random_state=0)
X_train_1, X_train_2, y_train_1, y_train_2 = ttsplit(X_train, 
                                                     y_train, 
                                                     test_size=0.5,
                                                     random_state=0)

xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)

# 'reg:linear' was renamed to 'reg:squarederror' in newer xgboost versions
params = {'objective': 'reg:squarederror'}
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')

# ================= train two versions of the model =====================#
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model='model_1.model')

print(mse(model_1.predict(xg_test), y_test))     # benchmark
print(mse(model_2_v1.predict(xg_test), y_test))  # "before"
print(mse(model_2_v2.predict(xg_test), y_test))  # "after"

# 23.0475232194
# 39.6776876084
# 27.2053239482

Let me know if anything is unclear!

Reference: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/training.py
