Linear Regression
Linear regression is a regression analysis that models the relationship between one or more independent variables and a dependent variable. Its defining characteristic is that the model is a linear combination of parameters called regression coefficients.
General formula

h_w(x) = w^T x

where w and x are matrices (column vectors): w holds the regression coefficients and x the feature values.
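As a minimal sketch (the numbers here are made up for illustration), the prediction is just a dot product between the weight vector and the feature vector:

import numpy as np
w = np.array([0.5, 2.0, -1.0])  # weights, with w0 first
x = np.array([1.0, 3.0, 4.0])   # features, with a leading 1 to absorb the intercept
w.dot(x)                        # h_w(x) = w^T x
# 0.5*1 + 2.0*3 + (-1.0)*4 = 2.5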
Loss function (size of the error)

J(w) = sum_{i=1}^{m} (h_w(x_i) - y_i)^2

where y_i is the true value and h_w(x_i) is the prediction of the fitted model; the goal is to find the w that minimizes the loss J(w).
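A minimal sketch of evaluating this loss for one candidate w, on made-up data:

import numpy as np
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])  # leading 1-column absorbs the intercept
y = np.array([2.0, 4.0, 6.0])
w = np.array([0.0, 2.0])      # a candidate weight vector
residuals = X.dot(w) - y      # h_w(x_i) - y_i for each sample
(residuals ** 2).sum()        # J(w), the sum of squared errors
# 0.0 -- this particular w fits the data exactly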
Solving with the normal equation (least squares)

The closed-form solution is:

w = (X^T X)^(-1) X^T y

How to compute this is shown in the hands-on section below. The normal equation obtains the error-minimizing w directly through algebra, without iterative optimization.
The normal-equation API for linear regression
sklearn.linear_model.LinearRegression()
coef_: the regression coefficients, i.e. the slope(s) of the fitted line
Linear regression hands-on
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from sklearn import datasets
X = np.linspace(0,10,50).reshape(-1,1)
y = np.random.randint(2,8,size = 1)*X  # y = slope * X, with a random integer slope in [2, 8)
(y/X)[1:]  # each entry recovers the slope; index 0 is skipped because X[0] == 0 gives 0/0 = nan
# 6.0 everywhere -- the slope drawn in this run
lr = LinearRegression()
lr.fit(X,y)
# coef_ is short for coefficient, i.e. the slope
# w ----> weight
lr.coef_
# array([[6.]])
# the same solution via the normal equation from linear algebra: w = (X^T X)^(-1) X^T y
np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
# array([[6.]])
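Explicitly inverting X^T X is fine for a demo, but it becomes numerically fragile when features are nearly collinear; NumPy's np.linalg.lstsq solves the same least-squares problem via SVD, which is more stable:

np.linalg.lstsq(X, y, rcond=None)[0]  # the least-squares solution, without forming an explicit inverse
# array([[6.]])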
How the least-squares solution is derived
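A sketch of the standard derivation: write the loss in matrix form, J(w) = (Xw - y)^T (Xw - y). Setting its gradient with respect to w to zero gives

grad J = 2 X^T (Xw - y) = 0  =>  X^T X w = X^T y  =>  w = (X^T X)^(-1) X^T y,

assuming X^T X is invertible.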
# Boston housing prices (see the note after this block if your scikit-learn is newer)
boston = datasets.load_boston()
X = boston['data']
y = boston['target']
# X.shape
# (506, 13)
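Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2. On newer versions, one workaround, assuming OpenML still hosts this dataset under the name 'boston', version 1, is:

from sklearn.datasets import fetch_openml
boston = fetch_openml(name='boston', version=1, as_frame=False)  # same 506 x 13 data, fetched from OpenML
X = boston['data']
y = boston['target'].astype(float)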
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2)
lr = LinearRegression(fit_intercept=False)
lr.fit(X_train,y_train)
# the number of slopes equals the number of features
display(lr.coef_,lr.intercept_)
# array([-9.26976824e-02, 5.00885901e-02, -1.86824366e-02, 1.50119273e+00,
# -1.95417595e+00, 5.90585604e+00, 1.67920393e-03, -9.67567832e-01,
# 1.61699715e-01, -1.02736498e-02, -3.57977720e-01, 1.30524117e-02,
# -4.43193079e-01])
# 0.0
Parameters of the LinearRegression algorithm
LinearRegression(*, fit_intercept=True, normalize=False, copy_X=True, n_jobs=None)
fit_intercept: whether to fit an intercept term; if True the model learns a (generally nonzero) intercept, if False the intercept is fixed at 0
normalize: whether to normalize X before fitting; this flag was deprecated in scikit-learn 1.0 and removed in 1.2, so scale the data yourself instead (see the sketch after this list)
n_jobs: the number of parallel jobs used for the computation, which can speed up fitting on large problems
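Since normalize is gone in current scikit-learn, a sketch of the officially suggested replacement, scaling the features as an explicit pipeline step:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
model = make_pipeline(StandardScaler(), LinearRegression())  # scaling now happens explicitly before the fit
model.fit(X_train, y_train)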
# predictions made by the fitted model
lr.predict(X_test).round(2)[:25]
# array([25.4 , 23.33, 28.61, 10.94, 21.17, 35.01, 23.26, 32.6 , 14.12,
# 13.82, 23.71, 18.12, 13.42, 16.32, 22.99, 22.35, 16.06, 33.19,
# 40.34, 20.09, 17.57, 11.57, 15.45, 26.96, 26.34])
w = lr.coef_
X_test.dot(w).round(2)[:25]
# array([25.4 , 23.33, 28.61, 10.94, 21.17, 35.01, 23.26, 32.6 , 14.12,
# 13.82, 23.71, 18.12, 13.42, 16.32, 22.99, 22.35, 16.06, 33.19,
# 40.34, 20.09, 17.57, 11.57, 15.45, 26.96, 26.34])
The two results are identical, which shows that internally the estimator computes predictions in exactly this way: X · w (plus the intercept when one is fitted).
# the 'true' house prices, for comparison
y_test[:25]
# array([26.2, 21.4, 30.1, 11.7, 20.1, 33.3, 50. , 31.6, 15.7, 12.8, 20.6,
# 18.6, 10.9, 20.8, 24.1, 25. , 17.2, 33.1, 41.7, 21.8, 19.5, 6.3,
# 14.3, 25. , 33. ])
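To put a number on how close the predictions are to the true prices, scikit-learn's built-in R^2 score and mean squared error can be used (a sketch; the exact values depend on the random train/test split):

from sklearn.metrics import mean_squared_error
lr.score(X_test, y_test)                        # R^2 on the held-out split
mean_squared_error(y_test, lr.predict(X_test))  # mean squared error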
lr = LinearRegression(fit_intercept=True)
lr.fit(X_train,y_train)
display(lr.coef_,lr.intercept_)
# array([-1.14751190e-01, 4.97264829e-02, -1.15787678e-02, 1.25839311e+00,
# -1.91452645e+01, 3.46645666e+00, 7.31591100e-03, -1.68944530e+00,
# 3.22336415e-01, -1.37462643e-02, -9.97868763e-01, 6.65425800e-03,
# -5.54376021e-01])
# 42.65246251234132
lr.predict(X_test).round(2)[:15]
# array([23.24, 25.26, 30.23, 15.51, 19.66, 35.96, 24.89, 33.05, 14.66,
# 12.7 , 22.67, 15.85, 14.55, 18.78, 19.52])
# rebuild the prediction by hand from the slopes and the intercept
(X_test.dot(lr.coef_) + lr.intercept_).round(2)[:15]
# array([23.24, 25.26, 30.23, 15.51, 19.66, 35.96, 24.89, 33.05, 14.66,
# 12.7 , 22.67, 15.85, 14.55, 18.78, 19.52])
Again identical.
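The same check can be written with np.allclose instead of eyeballing rounded arrays:

np.allclose(lr.predict(X_test), X_test.dot(lr.coef_) + lr.intercept_)
# True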