1.什么是随机森林?
如果读者接触过决策树(Decision Tree) 的话,那么会很容易理解什么是随机森林。随机森林就是通过集成学习的思想将多棵树集成的一种算法,它的基本单元是决策树,而它的本质属于机器学习的一大分支——集成学习(Ensemble Learning)方法。随机森林的名称中有两个关键词,一个是“随机”,一个就是“森林”。“森林”我们很好理解,一棵叫做树,那么成百上千棵就可以叫做森林了,这样的比喻还是很贴切的,其实这也是随机森林的主要思想–集成思想的体现。“随机”的含义我们会在下边部分讲到。
其实从直观角度来解释,每棵决策树都是一个分类器(假设现在针对的是分类问题),那么对于一个输入样本,N棵树会有N个分类结果。而随机森林集成了所有的分类投票结果,将投票次数最多的类别指定为最终的输出,这就是一种最简单的 Bagging 思想。
Each tree is grown as follows:
- If the number of cases in the training set is N, sample N cases at random - but with replacement, from the original data. This sample will be the training set for growing the tree.
- If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
- Each tree is grown to the largest extent possible. There is no pruning.
In the original paper on random forests, it was shown that the forest error rate depends on two things:
- The correlation between any two trees in the forest. Increasing the correlation increases the forest error rate.
- The strength of each individual tree in the forest. A tree with a low error rate is a strong classifier. Increasing the strength of the individual trees decreases the forest error rate.
2 随机森林的特点
我们前边提到,随机森林是一种很灵活实用的方法,它有如下几个特点:
- 在当前所有算法中,具有极好的准确率/It is unexcelled in accuracy among current algorithms;
- 能够有效地运行在大数据集上/It runs efficiently on large data bases;
- 能够处理具有高维特征的输入样本,而且不需要降维/It can handle thousands of input variables without variable deletion;
- 能够评估各个特征在分类问题上的重要性/It gives estimates of what variables are important in the classification;
- 在生成过程中,能够获取到内部生成误差的一种无偏估计/It generates an internal unbiased estimate of the generalization error as the forest building progresses;
- 对于缺省值问题也能够获得很好得结果/It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing
- It has methods for balancing error in class population unbalanced data sets.
- Generated forests can be saved for future use on other data.
- Prototypes are computed that give information about the relation between the variables and the classification.
- It computes proximities between pairs of cases that can be used in clustering, locating outliers, or (by scaling) give interesting views of the data.
实际上,随机森林的特点不只有这六点,它就相当于机器学习领域的Leatherman(多面手),你几乎可以把任何东西扔进去,它基本上都是可供使用的。在估计推断映射方面特别好用,以致都不需要像SVM那样做很多参数的调试。
3 随机森林的相关基础知识
1) 信息、熵以及信息增益的概念
这三个基本概念是决策树的根本,是决策树利用特征来分类时,确定特征选取顺序的依据。理解了它们,决策树你也就了解了大概。
-
信息 的概念:信息是对一个抽象事物的命名,无论用不用‘信息’来命名这种抽象事物,或者用其他名称来命名这种抽象事物,这种抽象事物是客观存在的。
引用香农的话来说,信息是用来消除随机不确定性的东西。 当然这句话虽然经典,但是还是很难去搞明白这种东西到底是个什么样,可能在不同的地方来说,指的东西又不一样。
-
熵的概念
熵是约翰.冯.诺依曼建议使用的命名(当然是英文),在信息论和概率论中熵是对随机变量不确定性的度量,与上边联系起来,熵便是信息的期望值。
熵只依赖X的分布,和X的取值没有关系,熵是用来度量不确定性,当熵越大,概率为X=xi的不确定性越大,反之越小,在机器学习分类中,熵越大即这个类别的不确定性更大,反之越小。
当p=0或p=1时,H§=0,随机变量完全没有不确定性,当p=0.5时,H§=1,此时随机变量的不确定性最大条件熵
条件熵是用来解释信息增益而引入的概念,概率定义:随机变量X在给定条件下随机变量Y的条件熵,对定义描述为:X给定条件下Y的条件概率分布的熵对X的数学期望,在机器学习中为选定某个特征后的熵。
-
信息增益
信息增益在决策树算法中是用来选择特征的指标,信息增益越大,则这个特征的选择性越好,在概率中定义为:待分类的集合的熵和选定某个特征的条件熵之差(这里指的是经验熵或经验条件熵,由于真正的熵并不知道,是根据样本计算出来的)
注意:这里不要理解偏差,因为上边说了熵是类别的,但是在这里又说是集合的熵,没区别,因为在计算熵的时候是根据各个类别对应的值求期望来等到熵
2)决策树
决策树是一种树形结构,其中每个内部节点表示一个属性上的测试,每个分支代表一个测试输出,每个叶节点代表一种类别。常见的决策树算法有C4.5、ID3和CART。
3)集成学习
集成学习通过建立几个模型组合的来解决单一预测问题。它的工作原理是生成多个分类器/模型,各自独立地学习和作出预测。这些预测最后结合成单预测,因此优于任何一个单分类的做出预测。
随机森林是集成学习的一个子类,它依靠于决策树的投票选择来决定最后的分类结果。你可以在这找到用python实现集成学习的文档:Scikit 学习文档。https://scikit-learn.org/dev/modules/ensemble.html
4. python实现
class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=’auto’, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.
Parameters:
-
n_estimators:int, default=100
The number of trees in the forest. -
criterion:{“gini”, “entropy”}, default=”gini”
The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. Note: this parameter is tree-specific. -
max_depth:int, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. -
min_samples_split:int or float, default=2
The minimum number of samples required to split an internal node:- If int, then consider min_samples_split as the minimum number.
- If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
-
min_samples_leaf:int or float, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
"""
RandomForestClassifier
skleran 的随机森林回归模型,应用流程。
2.源数据随机的切分:75%作为训练数据 25%最为测试数据
1.训练数据中的因变量(分类变量)处理成数字形式
3.设定参数,训练/fit
4.对测试数据,预测/predict结果y_pre
5.对预测数据y列,y_pre列,生成混淆矩阵,显示分类/预测效果
"""
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
#读取鸢尾花数据集
iris = load_iris()
#DataFrame(数据,columns列名 index行名)
df = pd.DataFrame(iris.data, columns=iris.feature_names)
#numpy.random.uniform(low,high,size)
#功能:从一个均匀分布[low,high)中随机采样,注意定义域是左闭右开,即包含low,不包含high.
#判断:小于0.75的判为true
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 #相当于随机抽取了75%作为训练数据
#将数字类别转为文字类别
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.head()
train, test = df[df['is_train']==True], df[df['is_train']==False]#拆分训练集和测试集
features = df.columns[:4] # 前4个指标 为自变量
clf = RandomForestClassifier(n_jobs=-1) # n_jobs=2是线程数 n_jobs调为-1表示 cpu有多少核,就有多少并行数
y, _ = pd.factorize(train['species'])# 将文字类别 转为数字类别。一种序列化方法。
# 第一参数是序列化后结果,第二个是参考
clf.fit(train[features], y) # 训练过程
print(clf.decision_path(train[features]))
preds = iris.target_names[clf.predict(test[features])] # 获取测试数据预测结果
pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])#生成混淆矩阵
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
#from sklearn.lda import LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
h = .02 # step size in the mesh
names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Decision Tree",
"Random Forest", "AdaBoost", "Naive Bayes", "LDA", "QDA"]
classifiers = [
KNeighborsClassifier(3),
SVC(kernel="linear", C=0.025),
SVC(gamma=2, C=1),
DecisionTreeClassifier(max_depth=5),
RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
AdaBoostClassifier(),
GaussianNB(),
LDA(),
QDA()]
X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
random_state=1, n_clusters_per_class=1)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)
datasets = [make_moons(noise=0.3, random_state=0),
make_circles(noise=0.2, factor=0.5, random_state=1),
linearly_separable
]
figure = plt.figure(figsize=(27, 9))
i = 1
# iterate over datasets
for ds in datasets:
# preprocess dataset, split into training and test part
X, y = ds
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# just plot the dataset first
cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF'])
ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
# Plot the training points
ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)
# and testing points
ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6)
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xticks(())
ax.set_yticks(())
i += 1
# iterate over classifiers
for name, clf in zip(names, classifiers):
ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, m_max]x[y_min, y_max].
if hasattr(clf, "decision_function"):
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
else:
Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
# Put the result into a color plot
Z = Z.reshape(xx.shape)
ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)
# Plot also the training points
ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)
# and testing points
ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
alpha=0.6)
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(name)
ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
size=15, horizontalalignment='right')
i += 1
figure.subplots_adjust(left=.02, right=.98)
plt.show()