线性回归

2024-03-25 15:58:16

说到回归一般都指线性回归，回归的目的是预测数值型的目标值。线性回归模型的优点在于结果易于理解，计算简单，缺点在于对非线性的数据拟合不好。

原理

线性回归就是对输入特征加权求和，在加上偏置项进行预测。公式如下所示：

$$
widehat{h}=theta_{0}+theta_{1}x_{1}+theta_{2}x_{2}+…+theta_{n}x_{n}
tag{1}
$$

其中，$widehat{h}$代表预测值，$n$是特征的个数，${x_i}$代表第i个特征值，$theta_{i}$ 代表第i个特征的权重，将上式向量化有

$$
widehat{h}=h_{theta}(X)=Xtheta
tag{2}
$$

其中$theta$是模型的参数向量，X是实例的特征向量，$h_{theta}$是使用模型参数$theta$做参数的假设函数。

* 注：这里为什么写成$Xtheta$的形式，而不是$theta X$呢？在实际使用中，实例的特征是行向量，由多个实例构成输入矩阵，而权重是列向量，其具体形式如下
$$
Xtheta=
begin{bmatrix}
x_0^1 & x_0^2 & … &x_0^n \
x_1^1 & x_1^2 & … &x_1^n \
… & … & … &… \
x_m^1 & x_m^1 & … &x_m^1
end{bmatrix}
begin{bmatrix}
theta_0 \
theta_1 \
… \
theta_n
end{bmatrix}
$$

为了衡量模型效果，一般使用MSE(均方误差，Mean Square Error)描述线性回归的预测值和真实值之间的偏差。假设输入的样本特征为$x_1,x_2,…,x_n$，对应的样本值为$y_1,y_2,y_3,…,y_n$，则MSE的定义为

$$
J(theta)=frac{1}{2}(widehat{y}_i-y_i)^2=frac{1}{2}sum^n_{i=1}(h_{theta}(x_i)-y_i)^2
tag{3}
$$
写成向量形式有
$$
J(theta)=frac{1}{2}(Xtheta-Y)^2
tag{4}
$$
以上就是线性回归的基本公式和损失函数。

求解

我们的目标是找到合适的一组权重$theta$让$J(theta)$最小，这样就说明我们的模型越符合实际效果越好。一般有两种方式进行求解，直接求解和梯度下降。

解析求解

我们已经拿到了损失函数的表达形式，可以从公式$(4)$中直接推到出$theta$，过程如下。

$$
begin{split}
J(theta)=frac{1}{2}(Xtheta-Y)^2&=frac{1}{2}(Xtheta-Y)^T(Xtheta-Y)\
&=frac{1}{2}(theta^TX^T-Y^T)(Xtheta-Y)\
&=frac{1}{2}(theta^TX^TXtheta-theta^TX^TY-Y^TXtheta-Y^TY)
end{split}
tag{5}
$$
接下来对$J(theta)$求偏导

$$
begin{split}
frac{partial^{2}J(theta)}{partialtheta^{2}}&=frac{1}{2}(2 cdot X^TXtheta-2 cdot X^TY)\
&=X^TXtheta-X^TY
end{split}
tag{6}
$$

* 注：这里用到了几个矩阵求导公式
$$frac{dAB}{dB}=A^T$$
$$frac{dA^TB}{dA}=B$$
$$frac{dX^TAX}{dX}=(A+A^T)X$$

令导数为0，有

$$frac{partial^{2}J(theta)}{partialtheta^{2}}=0$$

$$X^TXtheta-X^TY=0$$

$$X^TXtheta=X^TY$$

$$theta=(X^TX)^{-1}X^TY tag{7}$$

最终推到出$(7)$，直接求解需要用到实例数据$X$及对应的预测值$Y$，在输入不大的情况下可以快速求得权重矩阵。但$(7)$中含有矩阵求逆运算，其算法复杂度通常在$O(n^2.4)$到$O(n^3)$之间，在高维大数据量的情况下直接求解的计算量很大；而且对于某些损失函数是凸函数的模型($X^TX$不可逆)来说，无法求得损失函数的偏导，该方法失效，只能使用梯度下降算法。

梯度下降

梯度下降是一种通用的优化算法，梯度下降的中心思想就是迭代地调整参数从而使成本函数最小化。

* 注：梯度下降的形象解释：
假设你迷失在山上的迷雾中，你能感受到的只有脚下地面的坡度。快速到达山脚的一个策略就是沿着最陡的方向下坡。这就是梯度下降的做法：通过测量参数向量$theta$相关的误差函数的局部梯度，并不断沿着梯度的方向调整，直到梯度降为0，到达最小值。

具体来说，首先使用一个随机的$theta$值，即随机初始化，然后逐步改进，每次踏出一步，每一步都尝试降低成本函数，直到算法收敛到一个最小值。梯度下降中一个重要的参数是每一步的步长，该参数被称为学习率。如果学习率太低，算法需要经过大量迭代才能收敛，如果太高，有可能越过最小值甚至比上一步迭代的结果还大，导致模型无法收敛。

对于线性回归，梯度下降的数学表达形式如下。假设第$i$次迭代的权重为$theta_i$，学习率为$eta$，第$i+1$次迭代的权重为
$$
theta_{i+1}=theta_i-eta cdot frac{partial^2J(theta)}{partialtheta^2}
tag{8}
$$

实现

下面使用tensorflow实现两种求解方式，以加州房价预测数据为例。

tensorflow实现解析求解


import numpy as np
import tensorflow as tf
from sklearn.datasets import fetch_california_housing

n_epochs = 10000  # 把样本集数据学习1000次
learning_rate = 0.01  # 步长 学习率 不能太大 太大容易来回震荡 太小 耗时间，跳不出局部最优解
# 可以写learn_rate动态变化，随着迭代次数越来越大 ，学习率越来越小 learning_rate/n_epoches
housing = fetch_california_housing()
m, n = housing.data.shape
housing_data_plus_bias = np.c_[np.ones((m, 1)), housing.data]
housing_data_target = housing.target.reshape(-1, 1)< 大专栏  线性回归br/>X = tf.constant(housing_data_plus_bias, tf.float32)
y = tf.constant(housing_data_target, tf.float32)
XT = tf.transpose(X)
theta = tf.matmul(tf.matmul(tf.matrix_inverse(tf.matmul(XT, X)), XT), y)
with tf.Session() as sess:
    theta_value = theta.eval()
    print theta_value

tensorflow实现梯度下降


import tensorflow as tf
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler

n_epochs = 10000  # 把样本集数据学习1000次
learning_rate = 0.01  # 步长 学习率 不能太大 太大容易来回震荡 太小 耗时间，跳不出局部最优解
# 可以写learn_rate动态变化，随着迭代次数越来越大 ，学习率越来越小 learning_rate/n_epoches
housing = fetch_california_housing()
m, n = housing.data.shape
housing_data_plus_bias = np.c_[np.ones((m, 1)), housing.data]
# 可以使用TensorFlow或者Numpy或者sklearn的StandardScaler去进行归一化
# 归一化可以最快的找到最优解
# 常用的归一化方式：
# 最大最小值归一化 (x-min)/(max-min)
# 方差归一化 x/方差
# 均值归一化 x-均值 结果有正有负 可以使调整时的速度越来越快。
scaler = StandardScaler().fit(housing_data_plus_bias)  # 创建一个归一化对象
scaled_housing_data_plus_bias = scaler.transform(housing_data_plus_bias)  # 真正执行 因为来源于sklearn所以会直接执行，不会延迟。
housing_data_target = housing.target.reshape(-1, 1)

X = tf.placeholder(tf.float32, name='X')
y = tf.placeholder(tf.float32, name='y')

# random_uniform函数创建图里一个节点包含随机数值，给定它的形状和取值范围，就像numpy里面rand()函数
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0), name='theta')  #参数theta，列向量，按照-1.0到1.0随机给
y_pred = tf.matmul(X, theta, name="predictions")  # 相乘 m行一列
error = y_pred - y  # 列向量和列向量相减 是一组数
mse = tf.reduce_mean(tf.square(error), name="mse")  # 误差平方加和，公式（3）的手动实现
gradients = 2.0 / m * tf.matmul(tf.transpose(X), error)  # 梯度公式
training_op = tf.assign(theta, theta - learning_rate * gradients)  # 即公式(8)，需对该式进行迭代

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)  # 初始化

    for epoch in range(n_epochs):  # 迭代1000次
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE = ",
                  sess.run(mse, feed_dict={X: scaled_housing_data_plus_bias, y: housing_data_target}))  # 每运行100次的时候输出
        sess.run(training_op, feed_dict={X: scaled_housing_data_plus_bias, y: housing_data_target})

    best_theta = theta.eval()  # 最后的w参数值
    print(best_theta)

以上代码详见线性回归tensorflow实现

#参考

【TensorFlow篇】–Tensorflow框架初始，实现机器学习中多元线性回归
机器学习实战-基于Scikit-Learn和Tensorflow，机械工业出版社
机器学习实战，人民邮电出版社

码农公寓

原理

求解

解析求解

梯度下降

实现

tensorflow实现解析求解

tensorflow实现梯度下降

相关文章