Statistical Learning (1): Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) can be understood simply as follows: we have a set of data (independent and identically distributed, i.i.d.), and we posit a model that could have generated it. Maximum likelihood estimation finds the parameter values under which the model is most likely to have produced exactly this data. This is a statistics problem.

The contrast with probability: in a probability problem the parameter $\theta$ is known and we predict outcomes. For example, for the standard Gaussian $X \sim N(0, 1)$ we know the exact expression, so we can roughly anticipate what the model will produce. In a statistics problem the outcomes are known in advance: given, say, 10000 samples (assumed to follow some distribution, here a Gaussian), the goal is to estimate $\mu$ and $\sigma$ so that the assumed model generates the observed samples with maximum probability.

1. Definition of the Likelihood Function

The likelihood function is a function of the parameters of a statistical model, expressing how plausible the parameters are; it is denoted $L$. Given an observed outcome $x$, the likelihood of the parameter $\theta$, written $L(\theta|x)$, is numerically equal to the probability of $X = x$ given $\theta$:
$$L(\theta|x) = P(X=x|\theta)$$
In statistical learning we have $N$ samples $x_1, x_2, x_3, \ldots, x_N$. Assuming they are mutually independent, the likelihood function is
$$L(\theta) = P(X_1=x_1, X_2=x_2, \ldots, X_N=x_N) = \prod_{i=1}^{N}P(X_i=x_i) = \prod_{i=1}^{N}p(x_i;\theta)$$
Maximum likelihood estimation seeks the $\theta$ that maximizes $L(\theta)$.
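As a quick numerical sketch (the data and grid below are made up: NumPy, synthetic Gaussian samples with true $\mu = 2$ and known $\sigma = 1$), the log-likelihood can be evaluated on a grid of candidate parameters and maximized directly:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1000)  # i.i.d. samples, true mu = 2

# Log-likelihood of N(mu, 1) evaluated on a grid of candidate mu values
mus = np.linspace(0.0, 4.0, 401)
loglik = np.array([np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (data - m) ** 2)
                   for m in mus])

mu_hat = mus[np.argmax(loglik)]  # grid-search MLE
print(mu_hat, data.mean())       # both close to the true mu = 2
```

The grid maximizer lands on the grid point nearest the sample mean, which (as derived in the next section) is the exact MLE of $\mu$.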

2. Unbiasedness of the Maximum Likelihood Estimators

Here the one-dimensional Gaussian is used to examine whether the MLEs of $\mu$ and $\sigma^2$ are unbiased. The Gaussian density is
$$f(x|\theta)=f(x|\mu,\sigma)=\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
and the maximum likelihood estimator is
$$\mathrm{MLE}:\ \hat\theta = \underset{\theta}{\operatorname{arg\,max}}\ \ln L(X|\mu,\sigma)$$

Three cases arise.

(1) $\sigma^2$ known, $\mu$ unknown: find the MLE $\hat\mu$ of $\mu$

Likelihood:
$$L(X|\mu)=\prod_{i=1}^{N}p(x_i|\mu)=\prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$$

Taking logarithms:
$$\ln L(X|\mu)=\ln\prod_{i=1}^{N}p(x_i|\mu)=-\frac{N}{2}\ln(2\pi)-N\ln\sigma-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i-\mu)^2$$

Differentiating with respect to $\mu$ and setting the derivative to zero:
$$\frac{d\ln L(X|\mu)}{d\mu}=\frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i-\mu)=0 \;\Rightarrow\; \sum_{i=1}^{N}x_i-N\mu=0 \;\Rightarrow\; \hat\mu=\frac{1}{N}\sum_{i=1}^{N}x_i=\overline X$$

When $\sigma^2$ is known, the MLE of $\mu$ depends only on the samples, and $\hat\mu$ is an unbiased estimator of $\mu$:

$$E[\hat\mu]=E\Big[\frac{1}{N}\sum_{i=1}^{N}x_i\Big]=\frac{1}{N}\sum_{i=1}^{N}E[x_i]=\frac{1}{N}\,N\mu=\mu$$
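This unbiasedness can be checked empirically with a small Monte Carlo experiment (a sketch with made-up constants, using NumPy): draw many datasets from the same Gaussian, compute $\hat\mu = \overline X$ for each, and average the estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, N, trials = 5.0, 2.0, 50, 20000

# Draw `trials` independent datasets of N samples each
samples = rng.normal(mu, sigma, size=(trials, N))

# MLE of mu for each dataset is its sample mean
mu_hats = samples.mean(axis=1)

print(mu_hats.mean())  # ≈ 5.0, i.e. E[mu_hat] = mu: the estimator is unbiased
```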

(2) $\mu$ known, $\sigma^2$ unknown: find the MLE $\hat\sigma^2$ of $\sigma^2$

Likelihood:
$$L(X|\sigma^2)=\prod_{i=1}^{N}p(x_i|\sigma^2)=\prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$$

Taking logarithms:
$$\ln L(X|\sigma^2)=\ln\prod_{i=1}^{N}p(x_i|\sigma^2)=-\frac{N}{2}\ln(2\pi)-N\ln\sigma-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i-\mu)^2$$

Differentiating with respect to $\sigma^2$ and setting the derivative to zero:
$$\frac{d\ln L(X|\sigma^2)}{d\sigma^2}=-\frac{N}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{i=1}^{N}(x_i-\mu)^2=0 \;\Rightarrow\; \hat\sigma^2=\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2$$

When $\mu$ is known, the MLE of $\sigma^2$ depends on the samples and the known mean $\mu$, and $\hat\sigma^2$ is an unbiased estimator of $\sigma^2$:

$$E[\hat\sigma^2]=E\Big[\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2\Big]=E\Big[\frac{1}{N}\sum_{i=1}^{N}x_i^2\Big]-2\mu\,E[\overline X]+\mu^2=E\Big[\frac{1}{N}\sum_{i=1}^{N}x_i^2\Big]-\mu^2=\frac{1}{N}\sum_{i=1}^{N}\big(E[x_i^2]-E^2[x_i]\big)=D(x_i)=\sigma^2$$
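The same Monte Carlo check applies here (again a sketch with made-up constants): when the *true* $\mu$ is plugged into $\hat\sigma^2$, the estimates average out to $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, N, trials = 0.0, 4.0, 10, 50000

samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))

# MLE of sigma^2 with the TRUE mu plugged in (mu is known in this case)
sigma2_hats = ((samples - mu) ** 2).mean(axis=1)

print(sigma2_hats.mean())  # ≈ 4.0: unbiased when mu is known
```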

(3) Both $\mu$ and $\sigma^2$ unknown: find the MLEs $\hat\mu$ and $\hat\sigma^2$

Likelihood:
$$L(X|\mu,\sigma^2)=\prod_{i=1}^{N}p(x_i|\mu,\sigma^2)=\prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$$

Taking logarithms:
$$\ln L(X|\mu,\sigma^2)=\ln\prod_{i=1}^{N}p(x_i|\mu,\sigma^2)=-\frac{N}{2}\ln(2\pi)-N\ln\sigma-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i-\mu)^2$$

  • Differentiating with respect to $\mu$:

$$\frac{\partial\ln L(X|\mu,\sigma^2)}{\partial\mu}=\frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i-\mu)=0 \;\Rightarrow\; \sum_{i=1}^{N}x_i-N\mu=0 \;\Rightarrow\; \hat\mu=\frac{1}{N}\sum_{i=1}^{N}x_i=\overline X$$

  • Differentiating with respect to $\sigma^2$:

$$\frac{\partial\ln L(X|\mu,\sigma^2)}{\partial\sigma^2}=-\frac{N}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{i=1}^{N}(x_i-\mu)^2=0 \;\Rightarrow\; \hat\sigma^2=\frac{1}{N}\sum_{i=1}^{N}(x_i-\hat\mu)^2=\frac{1}{N}\sum_{i=1}^{N}(x_i-\overline X)^2$$

The MLE $\hat\mu$ again depends only on the samples ($\sigma^2$ cancels in the computation), so $\hat\mu$ is an unbiased estimator of $\mu$:

$$E[\hat\mu]=E[\overline X]=E\Big[\frac{1}{N}\sum_{i=1}^{N}x_i\Big]=\frac{1}{N}\sum_{i=1}^{N}E[x_i]=\frac{1}{N}\,N\mu=\mu$$

However, the MLE $\hat\sigma^2$ depends not only on the samples but also on $\mu$, which is now unknown and must be replaced by the computed $\hat\mu$. The calculation below shows that **$\hat\sigma^2$ is a biased estimator of $\sigma^2$**:

$$\begin{aligned}
E[\hat\sigma^2] &= E\Big[\frac{1}{N}\sum_{i=1}^{N}(x_i-\overline X)^2\Big] = E\Big[\frac{1}{N}\sum_{i=1}^{N}x_i^2-\frac{2\overline X}{N}\sum_{i=1}^{N}x_i+\overline X^2\Big] \\
&= E\Big[\frac{1}{N}\sum_{i=1}^{N}x_i^2-2\overline X^2+\overline X^2\Big] = E\Big[\frac{1}{N}\sum_{i=1}^{N}x_i^2\Big]-E\big[\overline X^2\big] \\
&= \frac{1}{N}\sum_{i=1}^{N}\big(E[x_i^2]-E^2[x_i]\big)-\big(E[\overline X^2]-E^2[\overline X]\big) \\
&= D(x_i)-D(\overline X) = \sigma^2-\frac{\sigma^2}{N} = \frac{N-1}{N}\sigma^2
\end{aligned}$$

where the third line uses $E[x_i]=E[\overline X]=\mu$, so that $\frac{1}{N}\sum_{i=1}^{N}E^2[x_i]=\mu^2=E^2[\overline X]$.

Therefore, when computing the sample variance $S^2$, a correction factor must be applied: $S^2=\frac{N}{N-1}\hat\sigma^2=\frac{1}{N-1}\sum_{i=1}^{N}(x_i-\overline X)^2$, which is unbiased (Bessel's correction).
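NumPy exposes exactly this distinction through the `ddof` parameter of `var`: `ddof=0` divides by $N$ (the biased MLE) and `ddof=1` divides by $N-1$ (the corrected sample variance). A small simulation with made-up constants makes the bias visible:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2, N, trials = 0.0, 4.0, 10, 50000
samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))

mle = samples.var(axis=1, ddof=0)        # (1/N) * sum (x_i - xbar)^2, biased
corrected = samples.var(axis=1, ddof=1)  # (1/(N-1)) * sum (...),      unbiased

print(mle.mean())        # ≈ (N-1)/N * sigma^2 = 3.6
print(corrected.mean())  # ≈ sigma^2 = 4.0
```

With $N=10$ the bias factor $(N-1)/N = 0.9$ is clearly visible; it vanishes as $N$ grows, which is why the MLE is still consistent.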

3. The Relationship Between Maximum Likelihood and Least Squares

When the data follow a Gaussian distribution, maximum likelihood and least squares coincide.

Assume a linear regression model with Gaussian noise.

Given $f_{\theta}(\mathbf{x}) = f(y|x,w) = \mathbf{x}\mathbf{w}^{T}+\epsilon$, i.e. $y_i = x_i w^T + \epsilon_i$ with $\epsilon_i \sim N(0, \sigma^2)$, each output is itself Gaussian: $y_i \sim N(x_i w^T, \sigma^2)$.

By the maximum likelihood derivation above:
$$\underset{w}{\operatorname{arg\,max}}\ \ln L(w)=\ln\prod_{i=1}^{N}p(y_i|x_i,w)=-\frac{N}{2}\ln(2\pi)-N\ln\sigma-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i-x_i w^T)^2$$

Since the first two terms do not depend on $w$, maximizing the log-likelihood simplifies to
$$\underset{w}{\operatorname{arg\,max}}\ \Big(-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i-x_i w^T)^2\Big)=\underset{w}{\operatorname{arg\,min}}\ \sum_{i=1}^{N}(y_i-x_i w^T)^2$$

which is exactly the least-squares objective:
$$\underset{w}{\operatorname{arg\,min}}\ f(w)=\sum_{i=1}^{N}(y_i-x_i w^T)^2 = \lVert Y-Xw^{T}\rVert_{2}^{2}$$
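The equivalence can be verified numerically (a sketch with a made-up 1-D regression problem): the closed-form least-squares solution and a direct ascent on the Gaussian log-likelihood land on the same $w$, because the log-likelihood gradient $\frac{1}{\sigma^2}X^T(y-Xw)$ points in the same direction as the negative squared-error gradient.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 200
x = rng.uniform(-1, 1, size=N)
X = np.column_stack([np.ones(N), x])           # design matrix with intercept
w_true = np.array([1.0, 3.0])
y = X @ w_true + rng.normal(0, 0.5, size=N)    # Gaussian noise

# Closed-form least-squares solution
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient ascent on the Gaussian log-likelihood in w (sigma^2 absorbed
# into the step size); the fixed point is the least-squares solution
w = np.zeros(2)
for _ in range(5000):
    w += 0.001 * X.T @ (y - X @ w)

print(np.allclose(w, w_ls, atol=1e-6))  # True: MLE and least squares agree
```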
