Advanced Machine Learning Study Notes — Regression Algorithms

Linear Regression

Maximum Likelihood Estimation (MLE)

Assume each target is a linear function of the features plus i.i.d. Gaussian noise $\varepsilon^{(i)} \sim N(0, \sigma^2)$:

$$y^{(i)}=\theta^{T} x^{(i)}+\varepsilon^{(i)}$$

$$p\left(\varepsilon^{(i)}\right)=\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(\varepsilon^{(i)}\right)^{2}}{2\sigma^{2}}\right)$$

$$p\left(y^{(i)} \mid x^{(i)};\theta\right)=\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^{(i)}-\theta^{T} x^{(i)}\right)^{2}}{2\sigma^{2}}\right)$$

$$L(\theta)=\prod_{i=1}^{m} p\left(y^{(i)} \mid x^{(i)};\theta\right)=\prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^{(i)}-\theta^{T} x^{(i)}\right)^{2}}{2\sigma^{2}}\right)$$

The Gaussian Log-Likelihood and Least Squares

$$
\begin{aligned}
\ell(\theta) &=\log L(\theta) \\
&=\log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^{(i)}-\theta^{T} x^{(i)}\right)^{2}}{2\sigma^{2}}\right) \\
&=\sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^{(i)}-\theta^{T} x^{(i)}\right)^{2}}{2\sigma^{2}}\right) \\
&=m \log \frac{1}{\sqrt{2\pi}\,\sigma}-\frac{1}{\sigma^{2}} \cdot \frac{1}{2} \sum_{i=1}^{m}\left(y^{(i)}-\theta^{T} x^{(i)}\right)^{2}
\end{aligned}
$$

The first term does not depend on $\theta$, so maximizing $\ell(\theta)$ is equivalent to minimizing the least-squares objective

$$J(\theta)=\frac{1}{2} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}$$

Deriving the Closed-Form Solution for θ

Stack the $M$ samples, each $N$-dimensional, into a matrix $X$:

• each row of $X$ is one sample, $M$ samples in total (measurements);
• each column of $X$ is one feature dimension, $N$ dimensions in total (regressors);
• plus one extra constant dimension of all ones (the intercept term).

Objective function:

$$J(\theta)=\frac{1}{2} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}=\frac{1}{2}(X\theta-y)^{T}(X\theta-y)$$

Gradient:

$$
\begin{aligned}
\nabla_{\theta} J(\theta) &=\nabla_{\theta}\left(\frac{1}{2}(X\theta-y)^{T}(X\theta-y)\right)=\nabla_{\theta}\left(\frac{1}{2}\left(\theta^{T} X^{T}-y^{T}\right)(X\theta-y)\right) \\
&=\nabla_{\theta}\left(\frac{1}{2}\left(\theta^{T} X^{T} X\theta-\theta^{T} X^{T} y-y^{T} X\theta+y^{T} y\right)\right) \\
&=\frac{1}{2}\left(2 X^{T} X\theta-X^{T} y-\left(y^{T} X\right)^{T}\right)=X^{T} X\theta-X^{T} y \longrightarrow 0 \quad \text{(set to zero for the stationary point)}
\end{aligned}
$$

The Optimal Parameters in the Least-Squares Sense

Closed-form solution:

$$\theta=\left(X^{T} X\right)^{-1} X^{T} y$$

If $X^{T}X$ is not invertible, or to guard against overfitting, add a $\lambda$-perturbation (ridge regression):

$$\theta=\left(X^{T} X+\lambda I\right)^{-1} X^{T} y$$

A "quick" way to remember the result:

$$X\theta=y \;\Rightarrow\; X^{T} X\theta=X^{T} y \;\Rightarrow\; \theta=\left(X^{T} X\right)^{-1} X^{T} y$$
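As a sanity check, both closed forms are only a few lines of NumPy. A minimal sketch on synthetic data (the dataset and the λ value are illustrative choices, not from the notes); solving the normal equations is preferred over forming an explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3                                  # m samples, n features
X = np.hstack([np.ones((m, 1)),                # constant column of ones (intercept)
               rng.normal(size=(m, n))])
true_theta = np.array([2.0, 1.0, -3.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=m)  # add Gaussian noise

# OLS: theta = (X^T X)^{-1} X^T y -- solve the normal equations, don't invert
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: theta = (X^T X + lambda*I)^{-1} X^T y
lam = 0.1
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(n + 1), X.T @ y)

print(theta_ols)    # close to true_theta
print(theta_ridge)  # slightly shrunk toward zero
```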

Regularization and Preventing Overfitting

Regularization terms for preventing overfitting, with $\lambda>0$ and $\rho\in[0,1]$:

• L2-norm (Ridge): $J(\theta)=\frac{1}{2} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}+\lambda \sum_{j=1}^{n} \theta_{j}^{2}$
• L1-norm (LASSO): $J(\theta)=\frac{1}{2} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}+\lambda \sum_{j=1}^{n}\left|\theta_{j}\right|$
• Elastic Net: $J(\theta)=\frac{1}{2} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}+\lambda\left(\rho \cdot \sum_{j=1}^{n}\left|\theta_{j}\right|+(1-\rho) \cdot \sum_{j=1}^{n} \theta_{j}^{2}\right)$
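These three penalized objectives correspond to scikit-learn's Ridge, Lasso, and ElasticNet estimators. A minimal sketch, assuming sklearn's alpha plays the role of λ and l1_ratio the role of ρ (the library scales the loss slightly differently, so the fitted values are not numerically identical to the formulas above):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)                    # L1 penalty (sparse)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2

print(ridge.coef_)
print(lasso.coef_)  # some coefficients driven exactly to zero
print(enet.coef_)
```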

The Moore-Penrose Generalized Inverse (Pseudoinverse)

• If $A$ is nonsingular, the solution of the linear system $Ax=b$ can be written $x=\left(A^{T} A\right)^{-1} A^{T} b$; motivated by the form of this solution, define
$$A^{+}=\left(A^{T} A\right)^{-1} A^{T}$$
• If $A$ is an invertible square matrix, $A^{+}=\left(A^{T} A\right)^{-1} A^{T}$ is exactly $A^{-1}$:
$$\left(A^{T} A\right)^{-1} A^{T}=A^{-1}\left(A^{T}\right)^{-1} A^{T}=A^{-1}$$
• When $A$ is a general (non-square) matrix, $A^{+}$ is called the generalized inverse (pseudoinverse) of $A$.

Computing the Generalized Inverse via SVD

• For an $m \times n$ matrix $A$ with SVD decomposition
$$A=U \cdot \Sigma \cdot V^{T}$$
• the generalized inverse of $A$ is
$$A^{+}=V \cdot \Sigma^{+} \cdot U^{T}$$
where $\Sigma^{+}$ replaces each nonzero singular value on the diagonal with its reciprocal and transposes the result ($\Sigma$ is rectangular in general, so $\Sigma^{-1}$ need not exist).
• One can verify that if $A$ is an invertible $n \times n$ matrix, then $A \cdot A^{+}=I$.
• If $A$ is not square ($m \neq n$) but $A^{T}A$ is invertible (i.e. $A$ has full column rank), the SVD formula agrees with $A^{+}=\left(A^{T} A\right)^{-1} A^{T}$.
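A quick numerical check of the SVD construction against NumPy's built-in pinv; a sketch on a random full-column-rank matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))     # non-square matrix, full column rank

# Pseudoinverse via SVD: A = U @ diag(s) @ Vt  =>  A+ = V @ diag(1/s) @ Ut
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

# Matches both the library routine and the normal-equations formula
assert np.allclose(A_pinv, np.linalg.pinv(A))
assert np.allclose(A_pinv, np.linalg.inv(A.T @ A) @ A.T)
print(np.round(A_pinv @ A, 6))  # identity: A+ A = I for full column rank
```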

Gradient Descent

$$J(\theta)=\frac{1}{2} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}$$

  • Initialize $\theta$ (e.g. randomly).
  • Iterate along the negative gradient direction, so that each update of $\theta$ decreases $J(\theta)$:
    $$\theta := \theta-\alpha \cdot \frac{\partial J(\theta)}{\partial \theta}$$
    where $\alpha$ is the learning rate (step size).

Gradient Direction

For a single sample, the gradient with respect to one component is:

$$
\begin{aligned}
\frac{\partial}{\partial \theta_{j}} J(\theta) &=\frac{\partial}{\partial \theta_{j}} \frac{1}{2}\left(h_{\theta}(x)-y\right)^{2} \\
&=2 \cdot \frac{1}{2}\left(h_{\theta}(x)-y\right) \cdot \frac{\partial}{\partial \theta_{j}}\left(h_{\theta}(x)-y\right) \\
&=\left(h_{\theta}(x)-y\right) \cdot \frac{\partial}{\partial \theta_{j}}\left(\sum_{i=0}^{n} \theta_{i} x_{i}-y\right) \\
&=\left(h_{\theta}(x)-y\right) x_{j}
\end{aligned}
$$
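A quick finite-difference check of this single-sample gradient (a sketch; the data and θ here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)
y = 1.5
theta = rng.normal(size=4)

J = lambda t: 0.5 * (t @ x - y) ** 2        # single-sample squared loss
analytic = (theta @ x - y) * x              # (h_theta(x) - y) * x_j

h = 1e-6
numeric = np.array([(J(theta + h * e) - J(theta - h * e)) / (2 * h)
                    for e in np.eye(4)])    # central differences
print(np.max(np.abs(numeric - analytic)))   # ~1e-10
```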

  • Batch gradient descent (all $m$ samples per update):
    Repeat until convergence {
    $$\theta_{j}:=\theta_{j}+\alpha \sum_{i=1}^{m}\left(y^{(i)}-h_{\theta}\left(x^{(i)}\right)\right) x_{j}^{(i)}$$
    }

  • Stochastic gradient descent (one sample per update):
    Loop {
    for i = 1 to m {
    $$\theta_{j}:=\theta_{j}+\alpha\left(y^{(i)}-h_{\theta}\left(x^{(i)}\right)\right) x_{j}^{(i)}$$
    }
    }

  • Mini-batch gradient descent (a compromise between the two: each update sums over a small batch $B$ of $b$ samples; a NumPy sketch of all three variants follows this list):
    Repeat until convergence {
    for each mini-batch $B$ {
    $$\theta_{j}:=\theta_{j}+\alpha \sum_{i \in B}\left(y^{(i)}-h_{\theta}\left(x^{(i)}\right)\right) x_{j}^{(i)}$$
    }
    }
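A minimal NumPy sketch of all three update schemes on a synthetic linear-regression problem (the learning rate, epoch count, and batch size are illustrative choices, not from the original notes):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
y = X @ np.array([2.0, 1.0, -3.0, 0.5]) + 0.1 * rng.normal(size=m)

def gradient_descent(X, y, alpha=0.001, epochs=300, batch_size=None):
    """batch_size=None -> batch GD; 1 -> SGD; otherwise mini-batch."""
    m = len(y)
    theta = np.zeros(X.shape[1])               # initialize theta
    b = m if batch_size is None else batch_size
    for _ in range(epochs):
        idx = rng.permutation(m)               # reshuffle samples each epoch
        for start in range(0, m, b):
            batch = idx[start:start + b]
            err = y[batch] - X[batch] @ theta
            theta += alpha * X[batch].T @ err  # theta_j += alpha * sum(err * x_j)
    return theta

print(gradient_descent(X, y))                  # batch gradient descent
print(gradient_descent(X, y, batch_size=1))    # stochastic gradient descent
print(gradient_descent(X, y, batch_size=32))   # mini-batch
```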

Logistic Regression

$$h_{\theta}(x)=g\left(\theta^{T} x\right)=\frac{1}{1+e^{-\theta^{T} x}}$$

The derivative of the sigmoid $g$ has a convenient form:

$$
\begin{aligned}
g^{\prime}(x) &=\left(\frac{1}{1+e^{-x}}\right)^{\prime}=\frac{e^{-x}}{\left(1+e^{-x}\right)^{2}} \\
&=\frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}}=\frac{1}{1+e^{-x}} \cdot\left(1-\frac{1}{1+e^{-x}}\right) \\
&=g(x) \cdot(1-g(x))
\end{aligned}
$$
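A short numerical check of the identity $g'(x)=g(x)(1-g(x))$, using a central finite difference (a sketch):

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
h = 1e-6
numeric = (g(x + h) - g(x - h)) / (2 * h)   # finite-difference derivative
analytic = g(x) * (1 - g(x))                # g'(x) = g(x)(1 - g(x))
print(np.max(np.abs(numeric - analytic)))   # ~1e-11
```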

Logistic Regression Parameter Estimation

Assume:

$$
\begin{aligned}
P(y=1 \mid x;\theta) &=h_{\theta}(x) \\
P(y=0 \mid x;\theta) &=1-h_{\theta}(x)
\end{aligned}
$$

which can be written compactly as $p(y \mid x;\theta)=\left(h_{\theta}(x)\right)^{y}\left(1-h_{\theta}(x)\right)^{1-y}$. The likelihood is

$$
\begin{aligned}
L(\theta) &=p(\vec{y} \mid X;\theta)=\prod_{i=1}^{m} p\left(y^{(i)} \mid x^{(i)};\theta\right) \\
&=\prod_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)\right)^{y^{(i)}}\left(1-h_{\theta}\left(x^{(i)}\right)\right)^{1-y^{(i)}}
\end{aligned}
$$

and the log-likelihood is

$$\ell(\theta)=\log L(\theta)=\sum_{i=1}^{m}\left[y^{(i)} \log h\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h\left(x^{(i)}\right)\right)\right]$$

$$
\begin{aligned}
\frac{\partial \ell(\theta)}{\partial \theta_{j}} &=\sum_{i=1}^{m}\left(\frac{y^{(i)}}{h\left(x^{(i)}\right)}-\frac{1-y^{(i)}}{1-h\left(x^{(i)}\right)}\right) \cdot \frac{\partial h\left(x^{(i)}\right)}{\partial \theta_{j}} \\
&=\sum_{i=1}^{m}\left(\frac{y^{(i)}}{g\left(\theta^{T} x^{(i)}\right)}-\frac{1-y^{(i)}}{1-g\left(\theta^{T} x^{(i)}\right)}\right) \cdot \frac{\partial g\left(\theta^{T} x^{(i)}\right)}{\partial \theta_{j}} \\
&=\sum_{i=1}^{m}\left(\frac{y^{(i)}}{g\left(\theta^{T} x^{(i)}\right)}-\frac{1-y^{(i)}}{1-g\left(\theta^{T} x^{(i)}\right)}\right) \cdot g\left(\theta^{T} x^{(i)}\right)\left(1-g\left(\theta^{T} x^{(i)}\right)\right) \cdot \frac{\partial \theta^{T} x^{(i)}}{\partial \theta_{j}} \\
&=\sum_{i=1}^{m}\left(y^{(i)}\left(1-g\left(\theta^{T} x^{(i)}\right)\right)-\left(1-y^{(i)}\right) g\left(\theta^{T} x^{(i)}\right)\right) \cdot x_{j}^{(i)} \\
&=\sum_{i=1}^{m}\left(y^{(i)}-g\left(\theta^{T} x^{(i)}\right)\right) \cdot x_{j}^{(i)}
\end{aligned}
$$

The learning rule for logistic regression parameters (gradient ascent on $\ell(\theta)$):

$$\theta_{j}:=\theta_{j}+\alpha\left(y^{(i)}-h_{\theta}\left(x^{(i)}\right)\right) x_{j}^{(i)}$$

Compare this with the result for linear regression: the update rules have exactly the same form!

Repeat until convergence {
$$\theta_{j}:=\theta_{j}+\alpha \sum_{i=1}^{m}\left(y^{(i)}-h_{\theta}\left(x^{(i)}\right)\right) x_{j}^{(i)}$$
}

Loop {
for i = 1 to m {
$$\theta_{j}:=\theta_{j}+\alpha\left(y^{(i)}-h_{\theta}\left(x^{(i)}\right)\right) x_{j}^{(i)}$$
}
}
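Putting the pieces together, a minimal gradient-ascent sketch for logistic regression on synthetic Bernoulli data (the learning rate and iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 500
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 2))])
true_theta = np.array([-0.5, 2.0, -1.0])
p_true = 1 / (1 + np.exp(-X @ true_theta))
y = (rng.uniform(size=m) < p_true).astype(float)  # Bernoulli labels

theta = np.zeros(3)
alpha = 0.001
for _ in range(5000):
    h = 1 / (1 + np.exp(-X @ theta))  # h_theta(x) for every sample
    theta += alpha * X.T @ (y - h)    # theta_j += alpha * sum((y - h) * x_j)

print(theta)  # approaches true_theta as m grows
```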

The reason: linear regression assumes the samples follow a Gaussian distribution, while logistic regression assumes they follow a binomial (Bernoulli) distribution, and both the binomial and the Gaussian belong to the exponential family, which is why maximum likelihood yields the same update form for both.
