Python Machine Learning
(4) Regression Prediction
1. Linear Regressors
$$\underset{\boldsymbol{\omega},b}{\arg\min}\; L(\boldsymbol{\omega},b) = \underset{\boldsymbol{\omega},b}{\arg\min} \sum_{k=1}^{m} \left(f(\boldsymbol{\omega},\boldsymbol{x}^k,b) - y^k\right)^2$$
Training learns the parameters that determine the model, namely $\boldsymbol{\omega}$ and $b$.
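As a rough illustration of the least-squares objective, the sketch below evaluates the squared loss for a single-feature linear model f(x) = w * x + b and solves for the minimizing parameters in closed form. The function names and the toy data are assumptions made for this example; they are not part of the original tutorial or of scikit-learn.

```python
def squared_loss(w, b, xs, ys):
    """Sum of squared errors L(w, b) for the linear model f(x) = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys))


def fit_simple_linear(xs, ys):
    """Closed-form least-squares solution for a single feature."""
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / sxx
    b = y_mean - w * x_mean
    return w, b


# Hypothetical toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_simple_linear(xs, ys)
print(w, b, squared_loss(w, b, xs, ys))
```

With several features the same idea applies, but the solution is obtained with linear algebra rather than the scalar formulas used here.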
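step1: Description of the Boston housing price data
# Import the Boston housing data loader from sklearn.datasets.
# Note: load_boston was removed in scikit-learn 1.2, so this code needs an older release.
from sklearn.datasets import load_boston
# Load the housing data into the variable boston.
boston = load_boston()
# Print the dataset description.
print(boston.DESCR)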
Boston House Prices dataset
Notes
------
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive
:Median Value (attribute 14) is usually the target
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
**References**
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity',
- Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth
- International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)
step2: Splitting the Boston housing price data
# Import the data splitter from sklearn.model_selection
# (the older sklearn.cross_validation module was removed in scikit-learn 0.20).
from sklearn.model_selection import train_test_split
# Import numpy under the alias np.
import numpy as np
X = boston.data
y = boston.target
# Randomly sample 25% of the data as the test set; the rest forms the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33, test_size=0.25)
# Inspect the spread of the regression target values.
print("The max target value is", np.max(boston.target))
print("The min target value is", np.min(boston.target))
print("The average target value is", np.mean(boston.target))
The max target value is 50.0
The min target value is 5.0
The average target value is 22.5328063241
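To make the splitting step concrete, here is a minimal standard-library sketch of a seeded random hold-out split. The helper `simple_train_test_split` is hypothetical; it imitates the idea behind `train_test_split` but will not reproduce scikit-learn's exact shuffling for the same seed.

```python
import math
import random


def simple_train_test_split(samples, test_size=0.25, seed=33):
    """Shuffle the indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    # Round the test-set size up, so 25% of 506 samples gives 127 test samples.
    n_test = math.ceil(test_size * len(samples))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    train = [samples[i] for i in train_idx]
    test = [samples[i] for i in test_idx]
    return train, test


# Integers stand in for the 506 Boston instances.
samples = list(range(506))
train, test = simple_train_test_split(samples)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, which is what `random_state=33` achieves in the tutorial's code.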
step3: Standardizing the training and test data
# Import the standardization module from sklearn.preprocessing.
from sklearn.preprocessing import StandardScaler
# Initialize separate scalers for the features and the target values.
ss_X = StandardScaler()
ss_y = StandardScaler()
# Standardize the features and targets of the training and test data.
# The scalers are fitted on the training data only and then reused on the test data.
X_train = ss_X.fit_transform(X_train)
X_test = ss_X.transform(X_test)
# StandardScaler expects 2D input: passing 1D arrays was deprecated in scikit-learn 0.17
# and raises ValueError from 0.19 on. Reshape the targets to a single column for scaling,
# then flatten them back to 1D for the regressors.
y_train = ss_y.fit_transform(y_train.reshape(-1, 1)).ravel()
y_test = ss_y.transform(y_test.reshape(-1, 1)).ravel()
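The standardization step can be sketched without any library: compute the mean and population standard deviation on the training values, apply them to both training and test values, and invert the mapping when original units are needed. The helper names and numbers below are assumptions for illustration only.

```python
import math


def fit_scaler(values):
    """Return the mean and population standard deviation of a 1D list."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)


def transform(values, mean, std):
    """Map values to zero mean and unit variance using the given statistics."""
    return [(v - mean) / std for v in values]


def inverse_transform(values, mean, std):
    """Map standardized values back to the original scale."""
    return [v * std + mean for v in values]


train = [10.0, 20.0, 30.0, 40.0]   # hypothetical training targets
test = [25.0, 50.0]                # hypothetical test targets

mean, std = fit_scaler(train)      # statistics come from the training data only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)
restored = inverse_transform(test_scaled, mean, std)
print(mean, std)
print(test_scaled, restored)
```

Reusing the training statistics on the test data keeps information about the test set out of the preprocessing step.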
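step4: Training and predicting with the linear regression models LinearRegression and SGDRegressor
LinearRegression
# Import LinearRegression from sklearn.linear_model.
from sklearn.linear_model import LinearRegression
# Initialize the linear regressor LinearRegression with its default configuration.
lr = LinearRegression()
# Estimate the parameters from the training data.
lr.fit(X_train, y_train)
# Predict the regression targets for the test data.
lr_y_predict = lr.predict(X_test)
SGDRegressor
# Import SGDRegressor from sklearn.linear_model.
from sklearn.linear_model import SGDRegressor
# Initialize the linear regressor SGDRegressor with its default configuration.
sgdr = SGDRegressor()
# Estimate the parameters from the training data.
sgdr.fit(X_train, y_train)
# Predict the regression targets for the test data.
sgdr_y_predict = sgdr.predict(X_test)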
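For intuition about what a stochastic-gradient regressor does, the following sketch fits a one-feature linear model by updating the parameters after every single sample. It is a bare-bones illustration under assumed settings (fixed learning rate, no regularization, noise-free toy data); scikit-learn's SGDRegressor adds options such as regularization and a learning-rate schedule that are left out here.

```python
import random


def sgd_linear_regression(xs, ys, learning_rate=0.05, epochs=200, seed=0):
    """Fit f(x) = w * x + b by stochastic gradient descent on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= learning_rate * error * xs[i]
            b -= learning_rate * error
    return w, b


# Hypothetical noise-free data from y = 2x + 1, with x already centered.
xs = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
ys = [2.0 * x + 1.0 for x in xs]

w, b = sgd_linear_regression(xs, ys)
print(round(w, 4), round(b, 4))
```

Because each update touches only one sample, the cost per step does not grow with the dataset size, which is why this family of methods suits large datasets.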
MAE (Mean Absolute Error)
$$SS_{abs} = \sum_{i=1}^{m} \left|y^i - f(\boldsymbol{x}^i)\right|, \quad MAE = \frac{SS_{abs}}{m}$$
MSE (Mean Squared Error)
$$SS_{res} = \sum_{i=1}^{m} \left(y^i - f(\boldsymbol{x}^i)\right)^2, \quad MSE = \frac{SS_{res}}{m}$$
R-squared
$$SS_{res} = \sum_{i=1}^{m} \left(y^i - f(\boldsymbol{x}^i)\right)^2, \quad SS_{tot} = \sum_{i=1}^{m} \left(y^i - \bar{y}\right)^2, \quad R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
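$f(\boldsymbol{x}^i)$ is the regression model's prediction for the feature vector $\boldsymbol{x}^i$.
$SS_{tot}$ is the total sum of squared deviations of the true test values from their mean $\bar{y}$, that is, $m$ times their variance (the variation within the data itself).
$SS_{res}$ is the sum of squared differences between the predicted and the true values (the regression error).
R-squared measures the proportion of the variation in the true values that the model's predictions account for, and so indicates how well the model performs at numerical regression.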
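The three metrics can be written out directly, measuring errors between the true values and the model's predictions as scikit-learn's metric functions do. This is a minimal sketch; the function names `mae`, `mse` and `r_squared` and the sample values are chosen for this example.

```python
def mae(y_true, y_pred):
    """Mean absolute error between true values and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error between true values and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical true values and predictions.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(mae(y_true, y_pred))        # 0.5
print(mse(y_true, y_pred))        # 0.375
print(r_squared(y_true, y_pred))
```

A model that always predicts the mean of the true values gets an R-squared of 0, and a perfect model gets 1.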
step5: Evaluating the models' regression performance with three metrics
LinearRegression
# Use the scoring method built into LinearRegression and print the result.
print('The value of default measurement of LinearRegression is', lr.score(X_test, y_test))
# Import r2_score, mean_squared_error and mean_absolute_error from sklearn.metrics to evaluate regression performance.
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
# Evaluate with r2_score and print the result.
print('The value of R-squared of LinearRegression is', r2_score(y_test, lr_y_predict))
# Evaluate with mean_squared_error on the original price scale and print the result.
# StandardScaler works on 2D arrays, so 1D values are reshaped to a column first.
print('The mean squared error of LinearRegression is',
      mean_squared_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                         ss_y.inverse_transform(lr_y_predict.reshape(-1, 1))))
# Evaluate with mean_absolute_error on the original price scale and print the result.
print('The mean absolute error of LinearRegression is',
      mean_absolute_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                          ss_y.inverse_transform(lr_y_predict.reshape(-1, 1))))
The value of default measurement of LinearRegression is 0.6763403831
The value of R-squared of LinearRegression is 0.6763403831
The mean squared error of LinearRegression is 25.0969856921
The mean absolute error of LinearRegression is 3.5261239964
SGDRegressor
# Use the scoring method built into SGDRegressor and print the result.
print('The value of default measurement of SGDRegressor is', sgdr.score(X_test, y_test))
# Evaluate with r2_score and print the result.
print('The value of R-squared of SGDRegressor is', r2_score(y_test, sgdr_y_predict))
# Evaluate with mean_squared_error and print the result.
# inverse_transform() converts standardized values back to the original scale.
print('The mean squared error of SGDRegressor is',
      mean_squared_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                         ss_y.inverse_transform(sgdr_y_predict.reshape(-1, 1))))
# Evaluate with mean_absolute_error and print the result.
print('The mean absolute error of SGDRegressor is',
      mean_absolute_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                          ss_y.inverse_transform(sgdr_y_predict.reshape(-1, 1))))
The value of default measurement of SGDRegressor is 0.659853975749
The value of R-squared of SGDRegressor is 0.659853975749
The mean squared error of SGDRegressor is 26.3753630607
The mean absolute error of SGDRegressor is 3.55075990424
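The evaluation code computes R-squared on standardized values but converts back to the original scale before computing MSE and MAE. The sketch below, using made-up scaler statistics and values, suggests why: R-squared is unchanged by the linear rescaling, whereas MSE scales with the squared standard deviation and MAE with the standard deviation. All names and numbers here are assumptions for illustration.

```python
import math


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def unscale(values, mean, std):
    """Undo standardization: the same mapping inverse_transform applies."""
    return [v * std + mean for v in values]


mean, std = 22.5, 9.0                       # hypothetical scaler statistics
y_true_scaled = [-1.0, 0.0, 0.5, 2.0]       # hypothetical standardized targets
y_pred_scaled = [-0.8, 0.1, 0.2, 1.5]       # hypothetical standardized predictions

y_true = unscale(y_true_scaled, mean, std)
y_pred = unscale(y_pred_scaled, mean, std)

print(mse(y_true, y_pred), mse(y_true_scaled, y_pred_scaled) * std ** 2)
print(mae(y_true, y_pred), mae(y_true_scaled, y_pred_scaled) * std)
print(r_squared(y_true, y_pred), r_squared(y_true_scaled, y_pred_scaled))
```

Reporting MSE and MAE in the original units keeps them interpretable as errors in thousands of dollars.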
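Following the guidance in the scikit-learn documentation, when the dataset has more than 100,000 samples, models whose parameters are estimated by stochastic gradient descent (SGDClassifier / SGDRegressor) are recommended.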
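2. Support Vector Machines (Regression)
step1: Training support vector machine models configured with three different kernel functions and making predictions
# Import the support vector machine (regression) model from sklearn.svm.
from sklearn.svm import SVR
# Train a support vector regressor with a linear kernel and predict the test samples.
linear_svr = SVR(kernel='linear')
linear_svr.fit(X_train, y_train)
linear_svr_y_predict = linear_svr.predict(X_test)
# Train a support vector regressor with a polynomial kernel and predict the test samples.
poly_svr = SVR(kernel='poly')
poly_svr.fit(X_train, y_train)
poly_svr_y_predict = poly_svr.predict(X_test)
# Train a support vector regressor with a radial basis function (RBF) kernel and predict the test samples.
rbf_svr = SVR(kernel='rbf')
rbf_svr.fit(X_train, y_train)
rbf_svr_y_predict = rbf_svr.predict(X_test)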
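The three kernel names correspond to different similarity functions between two feature vectors. The sketch below writes them out in plain Python. The parameter names `gamma`, `coef0` and `degree` mirror SVR's parameters, but the `gamma=1.0` default used here is an assumption for illustration and not scikit-learn's default.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


u = [1.0, 2.0]
v = [3.0, 4.0]

print(linear_kernel(u, v))       # 11.0
print(polynomial_kernel(u, v))   # 1331.0
print(rbf_kernel(u, v))          # exp(-8), close to zero for distant points
```

The RBF kernel depends only on the distance between the two points, which lets the model fit non-linear relationships between features and target.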
step2: Evaluating the support vector regression models with the three kernel configurations on the same test set
Linear SVR
# Evaluate the three support vector machine (regression) configurations on the same test set using R-squared, MSE and MAE.
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
print('R-squared value of linear SVR is', linear_svr.score(X_test, y_test))
print('The mean squared error of linear SVR is',
      mean_squared_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                         ss_y.inverse_transform(linear_svr_y_predict.reshape(-1, 1))))
print('The mean absolute error of linear SVR is',
      mean_absolute_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                          ss_y.inverse_transform(linear_svr_y_predict.reshape(-1, 1))))
R-squared value of linear SVR is 0.65171709743
The mean squared error of linear SVR is 26.6433462972
The mean absolute error of linear SVR is 3.53398125112
Poly SVR
print('R-squared value of Poly SVR is', poly_svr.score(X_test, y_test))
print('The mean squared error of Poly SVR is',
      mean_squared_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                         ss_y.inverse_transform(poly_svr_y_predict.reshape(-1, 1))))
print('The mean absolute error of Poly SVR is',
      mean_absolute_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                          ss_y.inverse_transform(poly_svr_y_predict.reshape(-1, 1))))
R-squared value of Poly SVR is 0.404454058003
The mean squared error of Poly SVR is 46.179403314
The mean absolute error of Poly SVR is 3.75205926674
RBF SVR
print('R-squared value of RBF SVR is', rbf_svr.score(X_test, y_test))
print('The mean squared error of RBF SVR is',
      mean_squared_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                         ss_y.inverse_transform(rbf_svr_y_predict.reshape(-1, 1))))
print('The mean absolute error of RBF SVR is',
      mean_absolute_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                          ss_y.inverse_transform(rbf_svr_y_predict.reshape(-1, 1))))
R-squared value of RBF SVR is 0.756406891227
The mean squared error of RBF SVR is 18.8885250008
The mean absolute error of RBF SVR is 2.60756329798
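3. k-Nearest Neighbors (Regression)
step1: Predicting Boston housing prices with two differently configured k-nearest-neighbor regression models
# Import KNeighborsRegressor (the k-nearest-neighbor regressor) from sklearn.neighbors.
from sklearn.neighbors import KNeighborsRegressor
# Initialize the k-nearest-neighbor regressor so that predictions are a plain average of the neighbors: weights='uniform'.
uni_knr = KNeighborsRegressor(weights='uniform')
uni_knr.fit(X_train, y_train)
uni_knr_y_predict = uni_knr.predict(X_test)
# Initialize the k-nearest-neighbor regressor so that predictions are a distance-weighted average: weights='distance'.
dis_knr = KNeighborsRegressor(weights='distance')
dis_knr.fit(X_train, y_train)
dis_knr_y_predict = dis_knr.predict(X_test)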
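The difference between the two weighting schemes can be seen in a small from-scratch sketch. The function `knn_predict` is a hypothetical helper written for this example (it uses Euclidean distance and inverse-distance weights, and averages exact matches when a distance is zero); it is a simplified stand-in, not scikit-learn's implementation.

```python
import math


def knn_predict(train_X, train_y, query, k=5, weights="uniform"):
    """Predict a target as the (optionally distance-weighted) mean of the k nearest neighbors."""
    # Pair each training target with its Euclidean distance to the query, nearest first.
    neighbors = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]
    if weights == "uniform":
        return sum(y for _, y in neighbors) / len(neighbors)
    # Distance weighting: an exact match would have infinite weight, so average the matches.
    exact = [y for d, y in neighbors if d == 0.0]
    if exact:
        return sum(exact) / len(exact)
    total_weight = sum(1.0 / d for d, _ in neighbors)
    return sum(y / d for d, y in neighbors) / total_weight


# Hypothetical one-feature training data.
train_X = [[0.0], [1.0], [2.0], [3.0]]
train_y = [0.0, 1.0, 2.0, 3.0]

print(knn_predict(train_X, train_y, [1.2], k=2, weights="uniform"))
print(knn_predict(train_X, train_y, [1.2], k=2, weights="distance"))
```

With uniform weights both neighbors count equally; with distance weights the closer neighbor pulls the prediction toward its own target.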
step2: Evaluating the two k-nearest-neighbor regression configurations on the Boston housing price data
uniform-weighted
# Evaluate the uniformly averaged k-nearest-neighbor model on the test set using R-squared, MSE and MAE.
print('R-squared value of uniform-weighted KNeighborsRegressor:', uni_knr.score(X_test, y_test))
print('The mean squared error of uniform-weighted KNeighborsRegressor:',
      mean_squared_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                         ss_y.inverse_transform(uni_knr_y_predict.reshape(-1, 1))))
print('The mean absolute error of uniform-weighted KNeighborsRegressor:',
      mean_absolute_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                          ss_y.inverse_transform(uni_knr_y_predict.reshape(-1, 1))))
R-squared value of uniform-weighted KNeighborsRegressor: 0.690345456461
The mean squared error of uniform-weighted KNeighborsRegressor: 24.0110141732
The mean absolute error of uniform-weighted KNeighborsRegressor: 2.96803149606
distance-weighted
# Evaluate the distance-weighted k-nearest-neighbor model on the test set using R-squared, MSE and MAE.
print('R-squared value of distance-weighted KNeighborsRegressor:', dis_knr.score(X_test, y_test))
print('The mean squared error of distance-weighted KNeighborsRegressor:',
      mean_squared_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                         ss_y.inverse_transform(dis_knr_y_predict.reshape(-1, 1))))
print('The mean absolute error of distance-weighted KNeighborsRegressor:',
      mean_absolute_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                          ss_y.inverse_transform(dis_knr_y_predict.reshape(-1, 1))))
R-squared value of distance-weighted KNeighborsRegressor: 0.719758997016
The mean squared error of distance-weighted KNeighborsRegressor: 21.7302501609
The mean absolute error of distance-weighted KNeighborsRegressor: 2.80505687851
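4. Regression Trees
step1: Fitting a regression tree to the Boston housing training data and predicting the test data
DecisionTreeRegressor
# Import DecisionTreeRegressor from sklearn.tree.
from sklearn.tree import DecisionTreeRegressor
# Initialize DecisionTreeRegressor with its default configuration.
dtr = DecisionTreeRegressor()
# Build the regression tree from the Boston housing training data.
dtr.fit(X_train, y_train)
# Predict the test data with the default single regression tree and store the predictions in dtr_y_predict.
dtr_y_predict = dtr.predict(X_test)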
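A regression tree is built from repeated binary splits. The sketch below shows one such split on a single feature: it tries every midpoint between neighboring feature values and keeps the one that minimizes the total squared error when each side predicts its own mean. The helpers `sse` and `best_split` and the toy data are assumptions for illustration, and the code assumes the feature has at least two distinct values.

```python
def sse(values):
    """Sum of squared deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimizing the total squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    return best[1:]


# Hypothetical data with a clear jump between x = 3 and x = 10.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]

print(best_split(xs, ys))  # (6.5, 1.0, 5.0)
```

A full tree applies the same search recursively to each side and across all features, and each leaf predicts the mean target of its training samples.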
step2: Evaluating the single regression tree on the Boston housing test data
# Evaluate the default-configured regression tree on the test set using R-squared, MSE and MAE.
print('R-squared value of DecisionTreeRegressor:', dtr.score(X_test, y_test))
print('The mean squared error of DecisionTreeRegressor:',
      mean_squared_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                         ss_y.inverse_transform(dtr_y_predict.reshape(-1, 1))))
print('The mean absolute error of DecisionTreeRegressor:',
      mean_absolute_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                          ss_y.inverse_transform(dtr_y_predict.reshape(-1, 1))))
R-squared value of DecisionTreeRegressor: 0.694084261863
The mean squared error of DecisionTreeRegressor: 23.7211023622
The mean absolute error of DecisionTreeRegressor: 3.14173228346
5. Ensemble Models (Regression)
step1: Fitting three ensemble regression models to the Boston housing training data and predicting the test data
RandomForestRegressor, ExtraTreesRegressor (extremely randomized trees) and GradientBoostingRegressor
# Import RandomForestRegressor, ExtraTreesRegressor and GradientBoostingRegressor from sklearn.ensemble.
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
# Train a RandomForestRegressor and predict the test data; the predictions are stored in rfr_y_predict.
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)
rfr_y_predict = rfr.predict(X_test)
# Train an ExtraTreesRegressor and predict the test data; the predictions are stored in etr_y_predict.
etr = ExtraTreesRegressor()
etr.fit(X_train, y_train)
etr_y_predict = etr.predict(X_test)
# Train a GradientBoostingRegressor and predict the test data; the predictions are stored in gbr_y_predict.
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)
gbr_y_predict = gbr.predict(X_test)
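To give a feel for how gradient boosting combines weak learners, here is a small sketch for squared loss in which every stage fits a one-split regression stump to the current residuals and adds a damped copy of it to the ensemble. All function names, the learning rate and the toy data are assumptions for this example; GradientBoostingRegressor itself uses deeper trees and many more options.

```python
def fit_stump(xs, targets):
    """Least-squares single-threshold split on one feature."""
    pairs = sorted(zip(xs, targets))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        cost = (sum((t - left_mean) ** 2 for t in left)
                + sum((t - right_mean) ** 2 for t in right))
        if best is None or cost < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (cost, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_stages=20, learning_rate=0.5):
    """Start from the mean target, then repeatedly fit stumps to the residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for x, p in zip(xs, predictions)]
    return base, learning_rate, stumps


def boosted_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def training_mse(model, xs, ys):
    return sum((y - boosted_predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical non-linear data: y = x squared.
xs = [float(x) for x in range(10)]
ys = [x ** 2 for x in xs]

for stages in (0, 1, 20):
    model = gradient_boost(xs, ys, n_stages=stages)
    print(stages, training_mse(model, xs, ys))
```

Random forests and extra trees take a different route: they train many trees independently on randomized views of the data and average their predictions.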
step2: Evaluating the three ensemble models on the Boston housing test data
# Evaluate the default-configured random forest regressor on the test set using R-squared, MSE and MAE.
print('R-squared value of RandomForestRegressor:', rfr.score(X_test, y_test))
print('The mean squared error of RandomForestRegressor:',
      mean_squared_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                         ss_y.inverse_transform(rfr_y_predict.reshape(-1, 1))))
print('The mean absolute error of RandomForestRegressor:',
      mean_absolute_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                          ss_y.inverse_transform(rfr_y_predict.reshape(-1, 1))))
R-squared value of RandomForestRegressor: 0.802399786277
The mean squared error of RandomForestRegressor: 15.322176378
The mean absolute error of RandomForestRegressor: 2.37417322835
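# Evaluate the default-configured extremely randomized trees regressor on the test set using R-squared, MSE and MAE.
print('R-squared value of ExtraTreesRegressor:', etr.score(X_test, y_test))
print('The mean squared error of ExtraTreesRegressor:',
      mean_squared_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                         ss_y.inverse_transform(etr_y_predict.reshape(-1, 1))))
print('The mean absolute error of ExtraTreesRegressor:',
      mean_absolute_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                          ss_y.inverse_transform(etr_y_predict.reshape(-1, 1))))
# Use the trained extra-trees model to print each feature's contribution to the prediction target.
# sorted() orders the (importance, name) pairs by importance and keeps each pair intact.
print(sorted(zip(etr.feature_importances_, boston.feature_names)))
R-squared value of ExtraTreesRegressor: 0.81953245067
The mean squared error of ExtraTreesRegressor: 13.9936874016
The mean absolute error of ExtraTreesRegressor: 2.35881889764
Note: the listing below was produced by the original call np.sort(zip(etr.feature_importances_, boston.feature_names), axis=0), which sorts the two columns independently of each other (the names come out in alphabetical order), so a value and the name on the same row do not necessarily belong together.
[['0.00197153649824' 'AGE']
['0.0121265798375' 'B']
['0.0166147338152' 'CHAS']
['0.0181685042979' 'CRIM']
['0.0216752406979' 'DIS']
['0.0230936940337' 'INDUS']
['0.0244030043403' 'LSTAT']
['0.0281224515813' 'NOX']
['0.0315825286843' 'PTRATIO']
['0.0455441477115' 'RAD']
['0.0509648681724' 'RM']
['0.355492216395' 'TAX']
['0.370240493935' 'ZN']]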
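When ranking features by importance, each score has to stay attached to its own feature name. The sketch below, with made-up scores and placeholder names, contrasts sorting the (importance, name) pairs as units with sorting each column on its own, which silently scrambles the pairing.

```python
# Hypothetical importance scores and placeholder feature names.
importances = [0.6, 0.3, 0.1]
names = ["alpha", "beta", "gamma"]

# Sorting the pairs as units keeps every score with its own name.
ranked = sorted(zip(importances, names))
print(ranked)      # [(0.1, 'gamma'), (0.3, 'beta'), (0.6, 'alpha')]

# Sorting each column separately pairs scores with the wrong names.
columnwise = list(zip(sorted(importances), sorted(names)))
print(columnwise)  # [(0.1, 'alpha'), (0.3, 'beta'), (0.6, 'gamma')]
```

In the second listing "gamma" appears to be the most important feature although its actual score is the smallest.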
# Evaluate the default-configured gradient boosting regressor on the test set using R-squared, MSE and MAE.
print('R-squared value of GradientBoostingRegressor:', gbr.score(X_test, y_test))
print('The mean squared error of GradientBoostingRegressor:',
      mean_squared_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                         ss_y.inverse_transform(gbr_y_predict.reshape(-1, 1))))
print('The mean absolute error of GradientBoostingRegressor:',
      mean_absolute_error(ss_y.inverse_transform(y_test.reshape(-1, 1)),
                          ss_y.inverse_transform(gbr_y_predict.reshape(-1, 1))))
R-squared value of GradientBoostingRegressor: 0.842602871434
The mean squared error of GradientBoostingRegressor: 12.2047771094
The mean absolute error of GradientBoostingRegressor: 2.28597618665
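The R-squared values printed throughout this post can be gathered into one table for a quick side-by-side comparison. The snippet below only copies the numbers reported in the outputs above and sorts them; the dictionary keys are labels chosen for this example, and results for the tree ensembles will vary between runs because no random seed is fixed.

```python
# R-squared values copied from the outputs reported in this post.
reported_r2 = {
    "LinearRegression": 0.6763403831,
    "SGDRegressor": 0.659853975749,
    "SVR (linear)": 0.65171709743,
    "SVR (poly)": 0.404454058003,
    "SVR (rbf)": 0.756406891227,
    "KNN (uniform)": 0.690345456461,
    "KNN (distance)": 0.719758997016,
    "DecisionTreeRegressor": 0.694084261863,
    "RandomForestRegressor": 0.802399786277,
    "ExtraTreesRegressor": 0.81953245067,
    "GradientBoostingRegressor": 0.842602871434,
}

# Print the models from highest to lowest R-squared.
for name, score in sorted(reported_r2.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:28s}{score:.4f}")

best = max(reported_r2, key=reported_r2.get)
worst = min(reported_r2, key=reported_r2.get)
print("best:", best, "| worst:", worst)
```

On this particular split the three ensemble models come out ahead of the single-model regressors.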