What Python training is available in Yanta District

2021-10-11 07:50:51

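1. Predictive data analysis: regression, classification, and clustering

3.1 Regression: predicting a numeric variable

Examples: predicting stock prices, house prices, air quality

Regression analyzes the relationship between two sets of variables

x: independent variable (feature)

y: dependent variable

Predict y from x: f(x) = y

x: house size; y: house price

The classic regression method is linear regression

Supervised learning: a set of training samples (the training set) is already available, for which both x and y are known

OLS (Ordinary Least Squares): chooses the line that minimizes the sum of squared errors between the predicted y and the true y on the training set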
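
For a single feature, the OLS line has a simple closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the means. The sketch below illustrates this with NumPy; the house sizes and prices are made-up numbers chosen only for illustration, not data from this article.

```python
import numpy as np

# Hypothetical training set (made-up values): house size -> house price
x = np.array([50.0, 80.0, 100.0, 120.0, 150.0])
y = np.array([112.0, 168.0, 213.0, 248.0, 309.0])

# Closed-form OLS for one feature:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Sum of squared errors on the training set, the quantity OLS minimizes
sse = np.sum((y - (intercept + slope * x)) ** 2)
print(slope, intercept, sse)
```

Any other line through the same points would give a larger sum of squared errors than the one printed here.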

Using the iris dataset as an example

# Load the dataset

import pandas as pd

url = 'https://www.gairuo.com/file/data/dataset/iris.data'

df = pd.read_csv(url)

# Inspect the dataset; judging which variables are related takes experience and domain knowledge

import seaborn as sns

%matplotlib inline

sns.regplot(x="petal_width",y="petal_length",data=df)

# Train the model and obtain the intercept and the regression coefficient (slope)

from sklearn import linear_model

lm=linear_model.LinearRegression()

features=["petal_width"]

X=df[features]

y=df["petal_length"]

model=lm.fit(X,y)

print(model.intercept_,model.coef_)

#y=1.090572145877378+2.22588531*x

# Compute a predicted value

import numpy as np

new_x = 3

new_x = np.array(new_x).reshape(1, -1)

pre_y = model.predict(new_x)

print(pre_y)
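
The prediction can also be checked by hand by plugging the new x into the fitted equation. The sketch below reuses the intercept and coefficient quoted in the comment above; it assumes those printed values are what the model produced on the author's copy of the data.

```python
# Intercept and slope copied from the fitted equation quoted in the text
intercept = 1.090572145877378
slope = 2.22588531

new_x = 3  # petal_width to predict for
pre_y = intercept + slope * new_x
print(pre_y)  # about 7.768
```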

# With multiple independent variables (features)

from sklearn import linear_model

lm=linear_model.LinearRegression()

features=["petal_width","sepal_length"]

X=df[features]

y=df["petal_length"]

model=lm.fit(X,y)

print(model.intercept_,model.coef_)

#y=-1.5023745801152821+1.74439298*x1+0.54251492*x2

import numpy as np

new_x = [2,6.5]

new_x = np.array(new_x).reshape(1, -1)

pre_y = model.predict(new_x)

print(pre_y)
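
With more than one feature, the prediction is the intercept plus the dot product of the coefficient vector and the feature vector. A minimal sketch, again assuming the coefficients quoted in the comment above:

```python
import numpy as np

# Intercept and coefficients copied from the fitted equation quoted in the text
intercept = -1.5023745801152821
coef = np.array([1.74439298, 0.54251492])  # petal_width, sepal_length

new_x = np.array([2, 6.5])
pre_y = intercept + np.dot(coef, new_x)
print(pre_y)  # about 5.513
```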

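# Evaluating predictive performance: train/test split

Split all samples with known X and y into a training set and a test set

Common split ratios are 8:2 or 9:1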
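
An 8:2 split might be done with scikit-learn's train_test_split, as in the sketch below. The data here is synthetic (roughly y = 2x + 1 plus noise) and is an assumption made so the example runs without downloading anything; with the iris DataFrame the same call would take X and y as defined earlier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: y is roughly 2*x + 1 plus small noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.1, size=100)

# Hold out 20% of the samples as the test set (the 8:2 ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training set only, then measure the error on unseen samples
model = LinearRegression().fit(X_train, y_train)
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print(len(X_train), len(X_test), test_mae)
```

The error on the held-out test set is the number to report, since the model never saw those samples during fitting.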

## Cross-validation

Ensures that every sample is used for testing exactly once
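
To see why each sample is tested exactly once, the folds can be built by hand. The sketch below uses plain NumPy with 10 samples and 5 folds (both numbers chosen arbitrarily for illustration): each fold takes its turn as the test set while the remaining samples form the training set.

```python
import numpy as np

n_samples, n_folds = 10, 5
indices = np.arange(n_samples)

# Split the sample indices into 5 disjoint folds
folds = np.array_split(indices, n_folds)

# Count how many times each sample lands in a test set
test_counts = np.zeros(n_samples, dtype=int)
for k, test_idx in enumerate(folds):
    train_idx = np.setdiff1d(indices, test_idx)  # everything not in this fold
    test_counts[test_idx] += 1
    print(f"fold {k}: train={train_idx.tolist()} test={test_idx.tolist()}")

print(test_counts)
```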

Cross-validation in scikit-learn

from sklearn.model_selection import cross_val_score

score=-cross_val_score(lm,X,y,cv=5,scoring="neg_mean_absolute_error")

score

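# The usual practice is to take the mean of the per-fold errors as the final metric for comparison; the smaller the better

import numpy as np

print(np.mean(score))

# cv=5: 5-fold cross-validation (the data is split into 5 folds, and the model is trained and tested 5 times)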
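
The leading minus sign in front of cross_val_score is there because scikit-learn scorers follow a "higher is better" convention, so error metrics are returned negated. The sketch below shows this on synthetic data (an assumption for illustration, not the iris data used above).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration only: y is roughly 2*x + 1 plus small noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.1, size=100)

# Raw scores are negated errors, one per fold, so they are all <= 0
raw = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_absolute_error"
)

# Flip the sign to get the ordinary (non-negative) MAE for each fold
mae_per_fold = -raw
print(raw)
print(mae_per_fold, mae_per_fold.mean())
```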

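Common scoring functions for regression:

"neg_mean_absolute_error"

Mean absolute error (returned negated)

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

"neg_mean_squared_error"

Mean squared error (returned negated)

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Here y_i is the true value, \hat{y}_i is the predicted value, and n is the number of samples.

For example: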
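
Both metrics can be computed directly from their definitions. The true and predicted values below are made-up numbers used only to illustrate the arithmetic.

```python
import numpy as np

# Made-up true and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred          # [0.5, 0.0, -1.5, -1.0]

mae = np.mean(np.abs(errors))     # (0.5 + 0 + 1.5 + 1) / 4
mse = np.mean(errors ** 2)        # (0.25 + 0 + 2.25 + 1) / 4
print(mae, mse)  # 0.75 0.875
```

MSE penalizes large errors more heavily than MAE because each error is squared before averaging.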
