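ML: XGBoost: a translation of an excellent foreign-language article on XGBoost parameter tuning, "Complete Guide to Parameter Tuning in XGBoost (with codes in Python)" (Part 1)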

2023-03-14 16:38:12

Original title: "Complete Guide to Parameter Tuning in XGBoost with codes in Python"
Original URL: https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
All rights belong to the original article; this post only provides the translation.

Overview

- XGBoost is a powerful machine learning algorithm, especially where speed and accuracy are concerned.
- When implementing an XGBoost model, we need to decide which parameters to set and what values to give them.
- An XGBoost model requires parameter tuning to fully leverage its advantages over other algorithms.

Introduction

If things don't go your way in predictive modeling, use XGBoost. The XGBoost algorithm has become the ultimate weapon of many data scientists. It's a highly sophisticated algorithm, powerful enough to deal with all sorts of irregularities in data.

Building a model using XGBoost is easy. But improving that model is difficult (at least I struggled a lot). The algorithm uses multiple parameters, and to improve the model, parameter tuning is a must. It is very difficult to get answers to practical questions such as: Which set of parameters should you tune? What are the ideal values of these parameters to obtain optimal output?

This article is best suited to people who are new to XGBoost. In it, we'll learn the art of parameter tuning along with some useful information about XGBoost. We'll also practice this algorithm on a data set in Python.

What should you know?

XGBoost (eXtreme Gradient Boosting) is an advanced implementation of the gradient boosting algorithm. Since I covered Gradient Boosting Machine in detail in my previous article, "Complete Guide to Parameter Tuning in Gradient Boosting (GBM) in Python", I highly recommend going through that before reading further. It will help you bolster your understanding of boosting in general and of parameter tuning for GBM.

Special Thanks: Personally, I would like to acknowledge the timeless support provided by Mr. Su* Rajkumar (aka SRK), currently AV Rank 2. This article wouldn't have been possible without his help. He is helping us guide thousands of data scientists. A big thanks to SRK!

Table of Contents

1. The XGBoost Advantage
2. Understanding XGBoost Parameters
3. Tuning Parameters (with Example)

1. The XGBoost Advantage

I've always admired the boosting capabilities that this algorithm infuses in a predictive model. When I explored its performance and the science behind its high accuracy, I discovered many advantages:

Regularization:
- Standard GBM implementations have no regularization, whereas XGBoost does, which helps it reduce overfitting.
- In fact, XGBoost is also known as a 'regularized boosting' technique.
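
To make the regularization point concrete, here is a minimal sketch of the leaf weight and split gain formulas as described in the XGBoost documentation's "Introduction to Boosted Trees". The function names and the toy gradient statistics are made up for illustration; `reg_lambda` and `gamma` stand in for the L2 penalty on leaf weights and the per-leaf complexity penalty.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -grad_sum / (hess_sum + reg_lambda)


def leaf_score(grad_sum, hess_sum, reg_lambda):
    """Structure score G^2 / (H + lambda) contributed by one leaf."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma):
    """Score improvement of a split, minus the penalty gamma for the extra leaf."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    children = (leaf_score(g_left, h_left, reg_lambda)
                + leaf_score(g_right, h_right, reg_lambda))
    return 0.5 * (children - parent) - gamma


# Toy gradient/Hessian sums for the two children (illustrative values only).
unregularized = split_gain(-4.0, 3.0, 2.0, 2.0, reg_lambda=0.0, gamma=0.0)
regularized = split_gain(-4.0, 3.0, 2.0, 2.0, reg_lambda=1.0, gamma=1.0)
print(unregularized, regularized)
```

With the penalties switched on, the same split looks less attractive and the leaf weights shrink toward zero, which is how regularization discourages overly complex trees.
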
Parallel Processing:
- XGBoost implements parallel processing and is blazingly fast compared to GBM.
- But hang on, we know that boosting is a sequential process, so how can it be parallelized? Each tree can be built only after the previous one, but what stops us from building a single tree using all cores? I hope you get where I'm coming from. The original article links to a page that explores this further.
- XGBoost also supports implementation on Hadoop.
High Flexibility
- XGBoost allows users to define custom optimization objectives and evaluation criteria.
- This adds a whole new dimension to the model, and there is no limit to what we can do.
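
As a rough sketch of what a custom objective and evaluation criterion can look like: XGBoost's Python API accepts an objective as a function of `(preds, dtrain)` that returns the per-row gradient and Hessian, and a metric as a function returning a `(name, value)` pair. `FakeDMatrix` below is a hypothetical stand-in so the example runs without the library; with real data one would pass `logistic_obj` via the `obj` argument of `xgb.train`.

```python
import numpy as np


def logistic_obj(preds, dtrain):
    """Binary logistic loss: gradient and Hessian with respect to the raw margin."""
    labels = dtrain.get_label()
    probs = 1.0 / (1.0 + np.exp(-preds))  # preds are raw margins, not probabilities
    grad = probs - labels
    hess = probs * (1.0 - probs)
    return grad, hess


def error_rate(preds, dtrain):
    """Custom metric: fraction of rows whose predicted class is wrong."""
    labels = dtrain.get_label()
    predicted = (preds > 0.0).astype(float)  # margin > 0 means probability > 0.5
    return "error", float(np.mean(predicted != labels))


class FakeDMatrix:
    """Hypothetical stand-in for xgboost.DMatrix; only get_label() is provided."""

    def __init__(self, labels):
        self._labels = np.asarray(labels, dtype=float)

    def get_label(self):
        return self._labels


margins = np.array([0.0, 2.0, -1.0])
data = FakeDMatrix([1, 1, 1])
grad, hess = logistic_obj(margins, data)
metric_name, metric_value = error_rate(margins, data)

# With the real library (assumed usage):
#   booster = xgb.train(params, dtrain, num_boost_round=10, obj=logistic_obj)
```

The argument that accepts the metric function has changed name across XGBoost releases, so check the documentation for the installed version before wiring in `error_rate`.
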
Handling Missing Values
- XGBoost has a built-in routine to handle missing values.
- The user is required to supply a value that differs from all other observations and pass it as a parameter. At each node, XGBoost tries different paths when it encounters a missing value and learns which path to take for missing values in the future.
Tree Pruning:
- A GBM would stop splitting a node when it encounters a negative loss in the split. Thus it is more of a greedy algorithm.
- XGBoost, on the other hand, makes splits up to the specified max_depth and then prunes the tree backwards, removing splits beyond which there is no positive gain.
- Another advantage is that sometimes a split with a negative loss, say -2, may be followed by a split with a positive loss of +10. GBM would stop as soon as it encounters the -2. But XGBoost will go deeper, see the combined effect of +8 from the two splits, and keep both.
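
The -2 / +10 arithmetic can be sketched with a toy model. This is an illustration of the description above, not XGBoost's internal pruner: it assumes a single chain of nested splits, each with a known gain, and both function names are hypothetical.

```python
def greedy_gain(split_gains):
    """Stop at the first split whose gain is not positive (the GBM-style behaviour described above)."""
    total = 0.0
    for gain in split_gains:
        if gain <= 0:
            break
        total += gain
    return total


def grow_then_prune_gain(split_gains):
    """Grow the whole chain, then keep the depth with the best cumulative gain."""
    best_total, best_depth = 0.0, 0
    running = 0.0
    for depth, gain in enumerate(split_gains, start=1):
        running += gain
        if running > best_total:
            best_total, best_depth = running, depth
    return best_total, best_depth


chain = [-2.0, 10.0]
print(greedy_gain(chain))           # stops immediately at the -2 split
print(grow_then_prune_gain(chain))  # keeps both splits for a combined gain
```
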
Built-in Cross-Validation
- XGBoost allows the user to run a cross-validation at each iteration of the boosting process, so it is easy to get the exact optimum number of boosting iterations in a single run.
- This is unlike GBM, where we have to run a grid search and only a limited number of values can be tested.
Continue on Existing Model
- The user can start training an XGBoost model from the last iteration of a previous run. This can be a significant advantage in certain applications.
- The GBM implementation in sklearn also has this feature, so the two are even on this point.

I hope you now understand the sheer power of the XGBoost algorithm. Note that these are the points I could muster. Do you know a few more? Feel free to drop a comment below and I will update the list.

Did I whet your appetite? Good. You can refer to the following web pages for a deeper understanding:

- XGBoost Guide – Introduction to Boosted Trees
- Words from the Author of XGBoost [Video]
