1 Importing the Data
import pandas
Read the data, then use the DataFrame's head() method to look at the first five rows:
housing_data = pandas.read_csv(r'C:\Users\Administrator\Desktop\PHD\Machine learning\housing.csv')
housing_data.head()
 | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity
---|---|---|---|---|---|---|---|---|---|---
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY
housing_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
Two things to note here:
- The dataset contains 20,640 instances. Also note that total_bedrooms has only 20,433 non-null values, meaning 207 districts are missing this feature (important); we will have to deal with it later.
- All attributes are numerical except ocean_proximity, whose dtype is object, so it could hold any kind of Python object; but since we looked at the file in Excel, we know it is a text attribute. Several values are repeated, which suggests it is probably a categorical attribute.
This gives us two to-do items:
- Handle the missing values in the total_bedrooms attribute
- Convert the categorical attribute to numbers
housing_data["ocean_proximity"].value_counts()
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: ocean_proximity, dtype: int64
The describe() method displays a summary of the numerical attributes.
housing_data.describe()
 | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value
---|---|---|---|---|---|---|---|---|---
count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20433.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000
mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.539680 | 3.870671 | 206855.816909
std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.385070 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874
min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000
25% | -121.800000 | 33.930000 | 18.000000 | 1447.750000 | 296.000000 | 787.000000 | 280.000000 | 2.563400 | 119600.000000
50% | -118.490000 | 34.260000 | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 | 409.000000 | 3.534800 | 179700.000000
75% | -118.010000 | 37.710000 | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 | 605.000000 | 4.743250 | 264725.000000
max | -114.310000 | 41.950000 | 52.000000 | 39320.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000
Another quick way to get a feel for the data is to plot a histogram of each numerical attribute; you can call hist() on the whole dataset and it will plot one histogram per attribute.
%matplotlib inline
#use Jupyter's own plotting backend
import matplotlib.pyplot
housing_data.hist(bins = 50, figsize = (20,15)) #bins sets the number of histogram bins
matplotlib.pyplot.show()
A few things to note from the histograms:
- median_income does not look like it is expressed in USD; after checking, the values have been scaled and capped at 15 for the upper bound and 0.5 for the lower bound, with units of roughly ten thousand dollars
- housing_median_age and median_house_value were also capped; since median_house_value is the target attribute for this exercise, that could be a serious problem
- The attributes have very different scales
- Many of the distributions are skewed: they extend much farther to the right of the median than to the left
Next we need to create a test set.
Scikit-Learn provides several functions to split a dataset into subsets in various ways. The simplest is train_test_split; its random_state parameter lets you set the random generator seed, and if you pass it multiple datasets with the same number of rows, it splits them on the same indices.
import numpy
def split_train_test(data, test_ratio):
    #shuffle the indices, then slice according to the test ratio
    shuffled_indices = numpy.random.permutation(len(data))
    test_setting_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_setting_size]
    train_indices = shuffled_indices[test_setting_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
#split the data into a training set and a test set, random seed = 42
from sklearn.model_selection import train_test_split #function provided by Scikit-Learn
train_set, test_set = train_test_split(housing_data, test_size = 0.2, random_state = 42)
print(len(train_set), len(test_set))
16512 4128
Everything above used purely random sampling. Suppose, however, we are told that median income is a very important attribute for predicting median house prices. We then want to make sure the test set is representative of the various income levels in the whole dataset. Since median_income is a continuous numerical attribute, we first create an income category attribute.
Looking at the histogram, most median income values cluster around 1.5 to 6, but some go well beyond 6. The key to stratification is that each stratum must contain enough instances, otherwise the importance of an under-represented stratum may be misestimated. In other words, do not use too many strata, and each stratum should be large enough.
Below we create an income category attribute with five categories (labeled 1 to 5): category 1 ranges from 0 to 1.5, category 2 from 1.5 to 3, and so on.
#stratify by income: create 5 income categories, labeled 1 to 5
housing_data["income_cat"] = pandas.cut(housing_data["median_income"],
bins = [0., 1.5, 3.0, 4.5, 6., numpy.inf], labels = [1, 2, 3, 4 ,5])
housing_data["income_cat"].hist()
Now we can do stratified sampling based on the income categories, using Scikit-Learn's StratifiedShuffleSplit class.
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 42)
for train_index, test_index in split.split(housing_data, housing_data["income_cat"]):
    strat_train_set = housing_data.loc[train_index]
    strat_test_set = housing_data.loc[test_index]
Let's check whether the stratified sampling worked as intended:
strat_test_set["income_cat"].value_counts(ascending=True, normalize=True)
#without normalize=True this would be a simple count instead of proportions
1 0.039729
5 0.114583
4 0.176357
2 0.318798
3 0.350533
Name: income_cat, dtype: float64
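As a quick check, we can also compare these proportions with the income-category proportions in the full dataset; with stratified sampling the two should match almost exactly:
#compare income-category proportions: full dataset vs. stratified test set
print(housing_data["income_cat"].value_counts(normalize = True).sort_index())
print(strat_test_set["income_cat"].value_counts(normalize = True).sort_index())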
After the stratified sampling is done, drop the income_cat attribute to restore the data to its original state.
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis = 1, inplace = True)
2 Gaining Insights from Data Exploration and Visualization
So far we have only taken a quick look at the data to get a general sense of what we are working with. Now it is time to go a little deeper.
First, put the test set aside and explore only the training set.
housing_data = strat_train_set.copy()
housing_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16512 entries, 17606 to 15775
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 16512 non-null float64
1 latitude 16512 non-null float64
2 housing_median_age 16512 non-null float64
3 total_rooms 16512 non-null float64
4 total_bedrooms 16354 non-null float64
5 population 16512 non-null float64
6 households 16512 non-null float64
7 median_income 16512 non-null float64
8 median_house_value 16512 non-null float64
9 ocean_proximity 16512 non-null object
dtypes: float64(9), object(1)
memory usage: 1.4+ MB
2.1 Visualizing Geographical Data
Using the latitude and longitude information, create a scatterplot of all districts to visualize the data.
housing_data.plot(kind = "scatter", x = "longitude", y = "latitude", title = "Test", alpha = 1)
housing_data.plot(kind = "scatter", x = "longitude", y = "latitude", title = "Test", alpha = 0.5)
housing_data.plot(kind = "scatter", x = "longitude", y = "latitude", title = "Test", alpha = 0.1)
The alpha value controls the transparency of the points; a low alpha makes high-density areas stand out much better.
A long, high-density strip is clearly visible.
Next, let the radius of each circle represent the district's population (option s) and let the color represent the price (option c), using a predefined color map called jet, which ranges from blue (low values) to red (high values).
housing_data.plot(kind = "scatter", x = "longitude", y = "latitude", title = "Test", alpha = 0.4,
s = housing_data["population"]/100, label = "population", figsize = (10, 7))
housing_data.plot(kind = "scatter", x = "longitude", y = "latitude", title = "Test", alpha = 0.4,
s = housing_data["population"]/100, label = "population", figsize = (10, 7),
c = "median_house_value", cmap = matplotlib.pyplot.get_cmap("jet"), colorbar = True)
matplotlib.pyplot.legend()
The plot clearly shows that housing prices are closely related to location and population density.
2.2 Looking for Correlations
The corr() method makes it easy to compute the standard correlation coefficient (also called Pearson's r) between every pair of attributes.
corr_matrix = housing_data.corr() #compute the standard correlation coefficient between every pair of attributes
corr_matrix["median_house_value"].sort_values(ascending = False)
Now let's see how much each attribute correlates with median_house_value:
median_house_value 1.000000
median_income 0.687160
total_rooms 0.135097
housing_median_age 0.114110
households 0.064506
total_bedrooms 0.047689
population -0.026920
longitude -0.047432
latitude -0.142724
Name: median_house_value, dtype: float64
The second approach is to use pandas' scatter_matrix function, which plots every numerical attribute against every other numerical attribute. With 9 numerical attributes that would be 9 * 9 = 81 plots, so here we focus only on the most promising, highly correlated attributes: "median_house_value", "median_income", "total_rooms", "housing_median_age".
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing_data[attributes], figsize = (12, 8))
The most promising attribute is clearly median_income, so let's zoom in on its scatterplot.
housing_data.plot(kind = "scatter", x = "median_income", y = "median_house_value", alpha = 0.1)
A few observations:
- The correlation is indeed strong: the upward trend is clear and the points are not too dispersed
- The $500,000 price cap noted earlier shows up as a clear horizontal line; there is another around $450,000, one around $350,000, and possibly more below. To keep the algorithm from learning to reproduce these data quirks, we may want to try removing the affected districts (a possible clean-up is sketched right below)
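A minimal sketch of what that clean-up could look like (housing_data_uncapped is a hypothetical name; only the $500,001 cap is certain from describe(), the other lines would need to be confirmed before dropping them too):
#drop the districts whose target value sits on the $500,001 cap
capped = housing_data["median_house_value"] >= 500001.0
housing_data_uncapped = housing_data[~capped].copy()
print(len(housing_data), len(housing_data_uncapped))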
2.3 Experimenting with Attribute Combinations
The last thing to do before preparing the data for machine learning algorithms is to try out various attribute combinations, such as "bedrooms per room" or "population per household", to gain deeper insight into the data.
housing_data["rooms_per_household"] = housing_data["total_rooms"]/housing_data["households"]
housing_data["bedrooms_per_room"] = housing_data["total_bedrooms"]/housing_data["total_rooms"]
housing_data["population_per_household"] = housing_data["population"]/housing_data["households"]
corr_matrix = housing_data.corr()
corr_matrix["median_house_value"].sort_values(ascending = False)
median_house_value 1.000000
median_income 0.687160
rooms_per_household 0.146285
total_rooms 0.135097
housing_median_age 0.114110
households 0.064506
total_bedrooms 0.047689
population_per_household -0.021985
population -0.026920
longitude -0.047432
latitude -0.142724
bedrooms_per_room -0.259984
Name: median_house_value, dtype: float64
3 Preparing the Data for Machine Learning Algorithms
Time to prepare the data for the machine learning algorithms!
First, separate the predictors from the labels.
#separate predictors and labels
housing_data = strat_train_set.drop("median_house_value", axis = 1) #create the predictors
housing_labels = strat_train_set["median_house_value"].copy() #create the labels
3.1 Data Cleaning
Recall the two to-do items from earlier:
- Handle the missing values in the total_bedrooms attribute
- Convert the categorical attribute to numbers
Machine learning algorithms cannot work with missing features, so we need to deal with them. There are three options:
#three ways to handle the missing values noticed in total_bedrooms
#housing_data.dropna(subset = ["total_bedrooms"]) #1. drop these districts
#housing_data.drop("total_bedrooms", axis = 1) #2. drop the whole attribute
#3. fill the missing values with some value
#median = housing_data["total_bedrooms"].median() #median value
#housing_data["total_bedrooms"].fillna(median, inplace = True)
For option 3, Scikit-Learn provides a very convenient class to handle missing values: SimpleImputer.
First create a SimpleImputer instance, specifying that you want to replace each attribute's missing values with the median of that attribute.
Note that the median can only be computed on numerical attributes, so we need a copy of the data without the text attribute ocean_proximity.
#SimpleImputer: the class Scikit-Learn provides to handle missing values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = "median")
housing_only_number = housing_data.drop("ocean_proximity", axis = 1) #drop the only object-dtype column, leaving only numerical attributes
imputer.fit(housing_only_number)
#imputer.statistics_
#use the imputer to replace missing values with the medians, completing the training set transformation
X = imputer.transform(housing_only_number)
housing_transimit = pandas.DataFrame(X, columns=housing_only_number.columns, index = housing_only_number.index)
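As a quick sanity check, the medians the imputer learned (stored in its statistics_ attribute) should match the medians computed directly on the DataFrame:
#the imputer stores the median of each numerical column in statistics_
print(imputer.statistics_)
print(housing_only_number.median().values)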
housing_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16512 entries, 17606 to 15775
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 16512 non-null float64
1 latitude 16512 non-null float64
2 housing_median_age 16512 non-null float64
3 total_rooms 16512 non-null float64
4 total_bedrooms 16354 non-null float64
5 population 16512 non-null float64
6 households 16512 non-null float64
7 median_income 16512 non-null float64
8 ocean_proximity 16512 non-null object
dtypes: float64(8), object(1)
memory usage: 1.3+ MB
housing_transimit.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16512 entries, 17606 to 15775
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 16512 non-null float64
1 latitude 16512 non-null float64
2 housing_median_age 16512 non-null float64
3 total_rooms 16512 non-null float64
4 total_bedrooms 16512 non-null float64
5 population 16512 non-null float64
6 households 16512 non-null float64
7 median_income 16512 non-null float64
dtypes: float64(8)
memory usage: 1.1 MB
3.2 Handling Text and Categorical Attributes
#handle text and categorical attributes
housing_only_text = housing_data[["ocean_proximity"]]
housing_only_text.head(10)
 | ocean_proximity
---|---
17606 | <1H OCEAN
18632 | <1H OCEAN
14650 | NEAR OCEAN
3230 | INLAND
3555 | <1H OCEAN
19480 | INLAND
8879 | <1H OCEAN
13685 | INLAND
4937 | <1H OCEAN
4861 | <1H OCEAN
##Scikit-Learn's OrdinalEncoder class converts a categorical attribute from text to numbers
#from sklearn.preprocessing import OrdinalEncoder
#ordinal_encoder = OrdinalEncoder()
#housing_only_text_encoded = ordinal_encoder.fit_transform(housing_only_text)
#housing_only_text_encoded[:10]
#one drawback: the model will assume that two nearby values are more similar than two distant ones, which is clearly not the case here
Another approach is to encode the categories as one-hot vectors.
For example, when the category is <1H OCEAN, one attribute is 1 (hot) and all the others are 0 (cold), and so on.
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_data_cat_1hot = cat_encoder.fit_transform(housing_only_text)
housing_data_cat_1hot
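Note that fit_transform here returns a SciPy sparse matrix by default; to inspect it as a regular NumPy array you can convert it, and the learned categories are stored on the encoder:
#convert the sparse one-hot matrix to a dense array for inspection
housing_data_cat_1hot.toarray()[:5]
#the list of categories learned by the encoder
cat_encoder.categories_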
3.3 Custom Transformers
Scikit-Learn does not provide a transformer for every transformation we might need (for example, adding the combined attributes from section 2.3), but writing one is easy: implement fit() and transform() methods; inheriting from TransformerMixin gives you fit_transform() for free, and BaseEstimator provides get_params() and set_params().
from sklearn.base import BaseEstimator, TransformerMixin
# column index
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return numpy.c_[X, rooms_per_household, population_per_household,
                            bedrooms_per_room]
        else:
            return numpy.c_[X, rooms_per_household, population_per_household]
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing_data.values)
3.4 Feature Scaling
Machine learning algorithms usually perform poorly when the input numerical attributes have very different scales; for example, total_rooms ranges from 2 to 39,320 while median income only ranges from 0 to 15.
The two common ways to get all attributes onto the same scale are min-max scaling and standardization (a short sketch of both follows this list):
- Min-max scaling (also called normalization) rescales values so that they end up ranging from 0 to 1, by subtracting the minimum value and dividing by the difference between the maximum and the minimum. Scikit-Learn provides a transformer called MinMaxScaler for this; its feature_range hyperparameter lets you change the target range.
- Standardization first subtracts the mean and then divides by the standard deviation, so that the resulting distribution has unit variance. Compared with min-max scaling, standardization is much less affected by outliers. Scikit-Learn provides a transformer called StandardScaler.
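A minimal sketch of both scalers, applied here to the imputed numerical data from section 3.1 purely for illustration (in the next section the scaling step is folded into a pipeline instead):
from sklearn.preprocessing import MinMaxScaler, StandardScaler
#min-max scaling: rescale each column to the range [0, 1]
min_max_scaler = MinMaxScaler(feature_range = (0, 1))
housing_minmax = min_max_scaler.fit_transform(housing_transimit)
#standardization: zero mean and unit variance per column
std_scaler = StandardScaler()
housing_standardized = std_scaler.fit_transform(housing_transimit)
print(housing_minmax.min(axis = 0), housing_minmax.max(axis = 0))
print(housing_standardized.mean(axis = 0).round(3), housing_standardized.std(axis = 0).round(3))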
3.5 Transformation Pipelines
Many data transformation steps need to be executed in the right order, and Scikit-Learn provides the Pipeline class to handle exactly such sequences of transformations.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([
('imputer', SimpleImputer(strategy = "median")),
('attribs_adder', CombinedAttributesAdder()),
('standard_scaler', StandardScaler()),
])
housing_num_transmit = num_pipeline.fit_transform(housing_only_number)
OK, now let's apply all of these transformations to the housing data in one go.
First import the ColumnTransformer class, then get the list of numerical column names and the list of categorical column names, and construct a ColumnTransformer. Its constructor takes a list of tuples, each containing a name, a transformer, and a list of names (or indices) of the columns the transformer should be applied to.
The numerical columns are transformed with the num_pipeline defined above and the categorical column with a OneHotEncoder. Finally we apply the ColumnTransformer to the housing data: it applies each transformer to the appropriate columns and concatenates the outputs along the second axis.
from sklearn.compose import ColumnTransformer
num_attribs = list(housing_only_number)
cat_attribs = ["ocean_proximity"]
full_pipeline = ColumnTransformer([
("number", num_pipeline, num_attribs),
("cat", OneHotEncoder(), cat_attribs),
])
housing_data_prepared = full_pipeline.fit_transform(housing_data)
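A quick check of the result: the 8 numerical columns plus the 3 combined attributes plus the 5 one-hot category columns should give 16 columns in total:
#expect (16512, 16): 8 numeric + 3 engineered + 5 one-hot columns
housing_data_prepared.shape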
4 Selecting and Training a Model
At this point we have framed the problem, obtained and explored the data, sampled the training and test sets with stratified sampling, and written a transformation pipeline that automatically cleans and prepares the data for machine learning algorithms. Time to select a model and train it!
4.1 Training and Evaluating on the Training Set
First, train a simple linear regression model:
from sklearn.linear_model import LinearRegression
linear_reg = LinearRegression()
linear_reg.fit(housing_data_prepared, housing_labels)
We now have a working linear regression model! Let's try it out on a few instances from the training set:
some_data = housing_data.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", linear_reg.predict(some_data_prepared))
print("Labels:", list(some_labels))
Predictions: [210644.60459286 317768.80697211 210956.43331178 59218.98886849
189747.55849879]
Labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]
The results are underwhelming: some predictions are off by close to 40%. Let's use Scikit-Learn's mean_squared_error function to measure the regression model's RMSE on the whole training set:
from sklearn.metrics import mean_squared_error
housing_predictions = linear_reg.predict(housing_data_prepared)
linear_mean_squared_error = mean_squared_error(housing_labels, housing_predictions)
linear_root_mean_squared_error = numpy.sqrt(linear_mean_squared_error)
linear_root_mean_squared_error
68628.19819848922
Given that most districts' median_house_value lies between $120,000 and $265,000, a typical prediction error of $68,628 is not very satisfying.
Let's train a DecisionTreeRegressor instead; it is capable of finding complex nonlinear relationships in the data:
from sklearn.tree import DecisionTreeRegressor
tree_regressor = DecisionTreeRegressor()
tree_regressor.fit(housing_data_prepared, housing_labels)
housing_predictions = tree_regressor.predict(housing_data_prepared)
tree_mean_squared_error = mean_squared_error(housing_labels, housing_predictions)
tree_root_mean_squared_error = numpy.sqrt(tree_mean_squared_error)
tree_root_mean_squared_error
0.0
No error at all!!! Could that really be right? Far more likely, the model has badly overfit the data!
So we should use part of the training set for training and another part for model validation.
4.2 Better Evaluation Using Cross-Validation
We can use Scikit-Learn's K-fold cross-validation feature: it randomly splits the training set into 10 distinct subsets called folds, then trains and evaluates the decision tree model 10 times (each time training on 9 folds and evaluating on the remaining fold), producing an array of 10 evaluation scores.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_regressor, housing_data_prepared, housing_labels, scoring = "neg_mean_squared_error", cv = 10)
tree_rmse_scores = numpy.sqrt(-scores)
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("STD deviation:", scores.std())
display_scores(tree_rmse_scores)
Scores: [67996.21325211 66542.53238088 69721.41972986 68817.16498204
72531.25381808 75053.01734777 70324.82552897 69758.72477194
78183.09683024 70924.64331243]
Mean: 70985.28919543003
STD deviation: 3281.1582693484047
The decision tree scores about 70,985, give or take 3,281. That looks even worse than the linear regression model; let's verify by computing the linear model's cross-validation scores:
linear_scores = cross_val_score(linear_reg, housing_data_prepared, housing_labels, scoring = "neg_mean_squared_error", cv = 10)
linear_rmse_scores = numpy.sqrt(-linear_scores)
display_scores(linear_rmse_scores)
Scores: [66782.73843989 66960.118071 70347.95244419 74739.57052552
68031.13388938 71193.84183426 64969.63056405 68281.61137997
71552.91566558 67665.10082067]
Mean: 69052.46136345083
STD deviation: 2731.6740017983434
It scores about 69,052, give or take 2,731. Sure enough, the decision tree model is overfitting so badly that it performs worse than the linear regression model.
Finally, let's try Scikit-Learn's RandomForestRegressor. A random forest trains many decision trees on random subsets of the features and then averages their predictions; building a model on top of many other models like this is called ensemble learning.
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_data_prepared, housing_labels)
housing_predictions = forest_reg.predict(housing_data_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = numpy.sqrt(forest_mse)
forest_rmse
18768.693177609803
forest_scores = cross_val_score(forest_reg, housing_data_prepared, housing_labels, scoring = "neg_mean_squared_error", cv = 10)
forest_rmse_scores = numpy.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
Scores: [49600.73299171 47871.30268765 49415.54568672 52156.84656598
49657.58232171 53518.34810305 48741.57393079 47896.55352242
52857.7953758 50327.94218747]
Mean: 50204.42233732922
STD deviation: 1898.5558303507255
This looks much better. But note that the score on the training set (RMSE ≈ 18,769) is still much lower than on the validation folds (≈ 50,204), which means the model is still overfitting the training set!
Possible fixes include simplifying the model, constraining (regularizing) it, or getting more training data; a sketch of the "constrain the model" option follows below.
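As a minimal, hypothetical sketch of the "constrain the model" option (the hyperparameter values below are illustrative, not tuned):
#limit tree depth, leaf size and features per split so the trees cannot simply memorize the training set
constrained_forest = RandomForestRegressor(n_estimators = 100, max_depth = 16,
                                           min_samples_leaf = 4, max_features = "sqrt", random_state = 42)
constrained_forest.fit(housing_data_prepared, housing_labels)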
5 Fine-Tuning the Model
5.1 Grid Search
We now have a shortlist of promising models. Let's use Scikit-Learn's GridSearchCV to find a good combination of hyperparameters: we just tell it which hyperparameters to experiment with and which values to try, and it evaluates every possible combination using cross-validation.
from sklearn.model_selection import GridSearchCV
param_grid = [
{'n_estimators':[3, 10, 30], 'max_features':[2, 4, 6, 8]},
{'bootstrap':[False], 'n_estimators':[3, 10], 'max_features':[2, 3, 4]},
]
#forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv = 5, scoring = "neg_mean_squared_error", return_train_score = True)
grid_search.fit(housing_data_prepared, housing_labels)
This param_grid tells Scikit-Learn to first evaluate all 3 * 4 = 12 combinations of the 'n_estimators' and 'max_features' values in the first dict, and then all 2 * 3 = 6 combinations in the second dict (this time with bootstrap set to False).
All in all, the grid search explores 12 + 6 = 18 combinations of RandomForestRegressor hyperparameter values and trains each model 5 times (since we are using 5-fold cross-validation): 90 training runs in total!
Once it is done, we can retrieve the best parameter combination:
grid_search.best_params_
{'max_features': 6, 'n_estimators': 30}
grid_search.best_estimator_
RandomForestRegressor(max_features=6, n_estimators=30)
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(numpy.sqrt(-mean_score), params)
63621.71870457367 {'max_features': 2, 'n_estimators': 3}
55156.10441096297 {'max_features': 2, 'n_estimators': 10}
52972.36386345024 {'max_features': 2, 'n_estimators': 30}
60857.57003471426 {'max_features': 4, 'n_estimators': 3}
52537.59253309114 {'max_features': 4, 'n_estimators': 10}
50699.208732249266 {'max_features': 4, 'n_estimators': 30}
59159.42854302436 {'max_features': 6, 'n_estimators': 3}
52360.130902224366 {'max_features': 6, 'n_estimators': 10}
49724.297318772 {'max_features': 6, 'n_estimators': 30}
58988.37228078155 {'max_features': 8, 'n_estimators': 3}
52285.6205720589 {'max_features': 8, 'n_estimators': 10}
49768.90668989496 {'max_features': 8, 'n_estimators': 30}
62205.753688508434 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54413.178463038115 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
60649.35867939349 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52552.278389531006 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
58942.44912351133 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51574.70549976989 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}
OK!!! The best solution found is {'max_features': 6, 'n_estimators': 30}, and this combination's RMSE score is 49,724.
5.2 Randomized Search
To be added.
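In the meantime, here is a minimal sketch of what randomized search could look like, using Scikit-Learn's RandomizedSearchCV with scipy.stats.randint distributions (the parameter ranges below are illustrative assumptions, not tuned values):
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
#sample n_iter random hyperparameter combinations instead of trying every one
param_distribs = {
    'n_estimators': randint(low = 1, high = 200),
    'max_features': randint(low = 1, high = 8),
}
rnd_search = RandomizedSearchCV(forest_reg, param_distributions = param_distribs, n_iter = 10,
                                cv = 5, scoring = "neg_mean_squared_error", random_state = 42)
rnd_search.fit(housing_data_prepared, housing_labels)
rnd_search.best_params_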
5.3 Ensemble Methods
5.4 Analyzing the Best Models and Their Errors
5.5 Evaluating the System on the Test Set
We finally have a system that performs well enough; time to evaluate the final model on the test set!
Get the predictors and labels from the test set, run full_pipeline to transform the data, and evaluate the final model on the test set:
final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("median_house_value", axis = 1)
Y_test = strat_test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(Y_test, final_predictions)
final_rmse = numpy.sqrt(final_mse)
final_rmse
48023.24668138537