Maximum likelihood estimation can be understood as follows: we have a set of data (independent and identically distributed, i.i.d.) and a model assumed to have generated it; maximum likelihood estimation finds the parameters under which the model is most likely to have produced that data. This is a statistics problem.
The difference from probability: in a probability problem we already know the parameters $\theta$ and predict outcomes. For example, for the standard Gaussian $X \sim N(0, 1)$ we know the exact expression, so we can roughly anticipate the results the model produces. In a statistics problem the outcomes are known in advance: given, say, 10000 samples (assumed to follow some distribution, here a Gaussian), the goal is to estimate $\mu$ and $\sigma$ so that the assumed model generates the observed samples with maximum probability.
1. Definition of the likelihood function
The likelihood function is a function of the parameters of a statistical model that expresses the likelihood of those parameters; it is denoted by $L$. Given an observed outcome $x$, the likelihood of the parameter $\theta$, written $L(\theta|x)$, is numerically equal to the probability that $X=x$ given $\theta$:
$$L(\theta|x) = P(X=x|\theta)$$
In statistical learning we have $N$ samples $x_{1}, x_{2}, x_{3}, \ldots, x_{N}$. Assuming they are mutually independent, the likelihood function is
$$L(\theta) = P(X_{1}=x_{1}, X_{2}=x_{2}, \ldots, X_{N}=x_{N}) = \prod_{i=1}^{N}p(X_{i}=x_{i}) = \prod_{i=1}^{N}p(x_{i},\theta)$$
The goal of maximum likelihood estimation is to find a $\theta$ that maximizes $L(\theta)$. In practice one maximizes the log-likelihood $\ln L(\theta)$, which turns the product into a sum and has the same maximizer.
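The idea above can be sketched numerically. The snippet below (a minimal illustration, not part of the original; data and parameter values are made up) evaluates the Gaussian log-likelihood over a grid of candidate means and confirms that the maximizer lands on the sample mean:

```python
import math
import random

def log_likelihood(data, mu, sigma):
    """Gaussian log-likelihood: -N/2 ln(2*pi) - N ln(sigma) - sum (x-mu)^2 / (2 sigma^2)."""
    n = len(data)
    return (-n / 2 * math.log(2 * math.pi)
            - n * math.log(sigma)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2))

random.seed(0)
data = [random.gauss(3.0, 1.0) for _ in range(10_000)]  # assumed true mu = 3, sigma = 1

# Evaluate L(theta) on a grid of candidate means; the maximizer should be
# the grid point closest to the sample mean (the closed-form MLE derived below).
candidates = [i / 100 for i in range(200, 401)]  # mu in [2.00, 4.00]
best_mu = max(candidates, key=lambda mu: log_likelihood(data, mu, 1.0))
sample_mean = sum(data) / len(data)
print(best_mu, sample_mean)
```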
2. Unbiasedness of the maximum likelihood estimates
Here we use the one-dimensional Gaussian distribution to examine whether the estimates of $\mu$ and $\sigma^{2}$ are unbiased or biased. The one-dimensional Gaussian density is
$$f(x|\theta)=f(x|\mu, \sigma)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
The maximum likelihood estimate is
$$\mathrm{MLE}:\ \hat\theta = \underset{\theta}{\operatorname{arg\,max}}~\ln L(X|\mu, \sigma)$$
There are three cases.
(1) $\sigma^{2}$ known, $\mu$ unknown: find the maximum likelihood estimator $\hat\mu$ of $\mu$.
Likelihood function: $L(X|\mu)=\prod_{i=1}^{N}p(x_{i}|\mu)=\prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x_{i}-\mu)^2}{2\sigma^2}}$
Taking the logarithm: $\ln L(X|\mu)=\ln\prod_{i=1}^{N}p(x_{i}|\mu)=-\frac{N}{2}\ln(2\pi)-N\ln\sigma-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_{i}-\mu)^2$
Differentiating with respect to $\mu$ and setting the derivative to zero:
$$\frac{d\ln L(X|\mu)}{d\mu}=\sum_{i=1}^{N}\frac{1}{\sigma^2}(x_{i}-\mu)=0 \\ \sum_{i=1}^{N}(x_{i}-\mu)=0 \rightarrow \sum_{i=1}^{N}x_{i}-N\mu=0 \\ \hat\mu = \frac{1}{N}\sum_{i=1}^{N}x_{i}= \overline{X}$$
Observe that when $\sigma^{2}$ is known, the maximum likelihood estimator of $\mu$ depends only on the samples, and $\hat\mu$ is an unbiased estimator of $\mu$:
$$E[\hat\mu]=E\left[\frac{1}{N}\sum_{i=1}^{N}x_{i}\right]=\frac{1}{N}\sum_{i=1}^{N}E[x_{i}]=\frac{1}{N}N\mu=\mu$$
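Unbiasedness of $\hat\mu$ can also be checked empirically: averaged over many repeated datasets, the sample mean should converge to the true $\mu$. A minimal simulation (the values of $\mu$, $\sigma$, and the sample size are hypothetical):

```python
import random

random.seed(1)
MU, SIGMA, N, TRIALS = 5.0, 2.0, 50, 20_000  # assumed true parameters

# Repeatedly draw a dataset and compute mu_hat = sample mean;
# unbiasedness means the estimates average out to the true MU.
estimates = []
for _ in range(TRIALS):
    data = [random.gauss(MU, SIGMA) for _ in range(N)]
    estimates.append(sum(data) / N)

mean_of_estimates = sum(estimates) / TRIALS
print(mean_of_estimates)  # close to 5.0
```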
(2) $\mu$ known, $\sigma^{2}$ unknown: find the maximum likelihood estimator $\hat\sigma^{2}$ of $\sigma^{2}$.
Likelihood function: $L(X|\sigma^{2})=\prod_{i=1}^{N}p(x_{i}|\sigma^{2})=\prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x_{i}-\mu)^2}{2\sigma^2}}$
Taking the logarithm: $\ln L(X|\sigma^{2})=\ln\prod_{i=1}^{N}p(x_{i}|\sigma^{2})=-\frac{N}{2}\ln(2\pi)-N\ln\sigma-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_{i}-\mu)^2$
Differentiating with respect to $\sigma^{2}$ and setting the derivative to zero:
$$\frac{d\ln L(X|\sigma^{2})}{d\sigma^{2}}=-\frac{N}{2\sigma^{2}}+\frac{1}{2\sigma^{4}}\sum_{i=1}^{N}(x_{i}-\mu)^{2}=0 \\ \hat\sigma^{2} = \frac{1}{N}\sum_{i=1}^{N}(x_{i}-\mu)^2$$
Observe that when $\mu$ is known, the maximum likelihood estimator $\hat\sigma^{2}$ depends on the samples and the known mean $\mu$, and $\hat\sigma^{2}$ is an unbiased estimator of $\sigma^{2}$ (here $D(\cdot)$ denotes the variance):
$$\begin{aligned} E[\hat\sigma^{2}] &= E\left[\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\mu)^{2}\right]=E\left[\frac{1}{N}\sum_{i=1}^{N}x_{i}^{2}-\frac{1}{N}\sum_{i=1}^{N}2x_{i}\mu+\frac{1}{N}\sum_{i=1}^{N}\mu^{2}\right] \\ &= E\left[\frac{1}{N}\sum_{i=1}^{N}x_{i}^{2}-2\mu^{2}+\mu^{2}\right] = E\left[\frac{1}{N}\sum_{i=1}^{N}x_{i}^2-\mu^{2}\right] \\ &= \frac{1}{N}\sum_{i=1}^{N}\left(E(x_{i}^2)-E^{2}(x_{i})\right) = D(x_{i}) = \sigma^{2} \end{aligned}$$
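This case can likewise be checked by simulation: when the true $\mu$ is plugged in, dividing by $N$ already gives an unbiased estimate. A small sketch with assumed parameter values:

```python
import random

random.seed(4)
MU, SIGMA2, N, TRIALS = 1.0, 9.0, 5, 50_000  # assumed true mean and variance

total = 0.0
for _ in range(TRIALS):
    data = [random.gauss(MU, SIGMA2 ** 0.5) for _ in range(N)]
    # True mean plugged in: dividing by N (not N-1) is already unbiased.
    total += sum((x - MU) ** 2 for x in data) / N

print(total / TRIALS)  # close to the true variance 9.0
```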
(3) Both $\mu$ and $\sigma^{2}$ unknown: find the maximum likelihood estimators $\hat\mu$ and $\hat\sigma^{2}$.
Likelihood function: $L(X|\mu, \sigma^{2})=\prod_{i=1}^{N}p(x_{i}|\mu, \sigma^{2})=\prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x_{i}-\mu)^2}{2\sigma^2}}$
Taking the logarithm: $\ln L(X|\mu, \sigma^{2})=\ln\prod_{i=1}^{N}p(x_{i}|\mu, \sigma^{2})=-\frac{N}{2}\ln(2\pi)-N\ln\sigma-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_{i}-\mu)^2$
- Setting the partial derivative with respect to $\mu$ to zero:
$$\frac{\partial \ln L(X|\mu,\sigma^{2})}{\partial\mu}=\sum_{i=1}^{N}\frac{1}{\sigma^2}(x_{i}-\mu)=0 \\ \sum_{i=1}^{N}(x_{i}-\mu)=0 \rightarrow \sum_{i=1}^{N}x_{i}-N\mu=0 \\ \hat\mu = \frac{1}{N}\sum_{i=1}^{N}x_{i}= \overline{X}$$
- Setting the partial derivative with respect to $\sigma^{2}$ to zero:
$$\frac{\partial \ln L(X|\mu,\sigma^{2})}{\partial\sigma^{2}}=-\frac{N}{2\sigma^{2}}+\frac{1}{2\sigma^{4}}\sum_{i=1}^{N}(x_{i}-\mu)^{2}=0 \\ \hat\sigma^{2} = \frac{1}{N}\sum_{i=1}^{N}(x_{i}-\hat\mu)^2 = \frac{1}{N}\sum_{i=1}^{N}(x_{i}-\overline{X})^2$$
Observe that the maximum likelihood estimator $\hat\mu$ depends only on the samples ($\sigma^{2}$ cancels during the computation), so $\hat\mu$ is an unbiased estimator of $\mu$:
$$E[\hat\mu]=E[\overline{X}]=E\left[\frac{1}{N}\sum_{i=1}^{N}x_{i}\right]=\frac{1}{N}\sum_{i=1}^{N}E[x_{i}]=\frac{1}{N}N\mu=\mu$$
However, the estimator of $\sigma^{2}$ depends not only on the samples but also on $\mu$, which is unknown and must be replaced by the computed estimate $\hat\mu$. The computation below (using $E^{2}(x_{i})=E^{2}(\overline{X})=\mu^{2}$) shows that **$\hat\sigma^{2}$ is a biased estimator of $\sigma^{2}$**:
$$\begin{aligned} E[\hat\sigma^{2}] &= E\left[\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\overline{X})^{2}\right] = E\left[\frac{1}{N}\sum_{i=1}^{N}x_{i}^{2}-\frac{1}{N}\sum_{i=1}^{N}2x_{i}\overline{X}+\frac{1}{N}\sum_{i=1}^{N}\overline{X}^{2}\right] \\ &= E\left[\frac{1}{N}\sum_{i=1}^{N}x_{i}^{2}-2\overline{X}^{2}+\overline{X}^{2}\right] = E\left[\frac{1}{N}\sum_{i=1}^{N}x_{i}^2-\overline{X}^{2}\right] \\ &= \left(\frac{1}{N}\sum_{i=1}^{N}E(x_{i}^2)-\mu^{2}\right)-\left(E(\overline{X}^{2})-\mu^{2}\right) \\ &= \frac{1}{N}\sum_{i=1}^{N}\left[E(x_{i}^2)-E^{2}(x_{i})\right]-\left[E(\overline{X}^{2})-E^{2}(\overline{X})\right] \\ &= D(x_{i})-D(\overline{X}) = \sigma^{2}-\frac{\sigma^{2}}{N} =\frac{N-1}{N}\sigma^{2} \end{aligned}$$
Therefore, to obtain an unbiased sample variance $S^{2}$, a correction factor must be applied: $S^{2}=\frac{N}{N-1}\hat\sigma^{2}=\frac{1}{N-1}\sum_{i=1}^{N}(x_{i}-\overline{X})^{2}$, so that $E[S^{2}]=\sigma^{2}$.
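The bias and its correction can be verified numerically: averaging both estimators over many small datasets, the $1/N$ version should converge to $\frac{N-1}{N}\sigma^{2}$ while the $1/(N-1)$ version converges to $\sigma^{2}$. A minimal sketch with assumed parameter values:

```python
import random

random.seed(2)
MU, SIGMA2, N, TRIALS = 0.0, 4.0, 5, 50_000  # assumed; small N makes the bias visible

biased_avg, corrected_avg = 0.0, 0.0
for _ in range(TRIALS):
    data = [random.gauss(MU, SIGMA2 ** 0.5) for _ in range(N)]
    xbar = sum(data) / N
    ss = sum((x - xbar) ** 2 for x in data)
    biased_avg += ss / N            # MLE estimator: divides by N
    corrected_avg += ss / (N - 1)   # Bessel-corrected sample variance S^2

biased_avg /= TRIALS
corrected_avg /= TRIALS
print(biased_avg, corrected_avg)  # expect ~ (N-1)/N * 4 = 3.2 and ~ 4.0
```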
3. Relationship between maximum likelihood and least squares
When the noise is Gaussian, maximum likelihood and least squares yield the same solution. Assume a linear regression model with Gaussian noise: the model is $f_{\theta}(\mathbf{x}) = f(y|x,w) = \mathbf{x}\mathbf{w}^{T}+\epsilon$, with $\epsilon_{i} \sim N(0, \sigma^{2})$, so each response is distributed as $y_{i} \sim N(x_{i}w^{T}, \sigma^{2})$.
Solving with the log-likelihood derived above:
$$\underset{w}{\operatorname{arg\,max}}~\ln L(w)=\ln\prod_{i=1}^{N}p(y_{i}|x_{i},w)=-\frac{N}{2}\ln(2\pi)-N\ln\sigma-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_{i}-x_{i}w^{T})^2$$
Since the first two terms do not depend on $w$, maximizing the log-likelihood simplifies to
$$\underset{w}{\operatorname{arg\,max}}~-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_{i}-x_{i}w^{T})^2 = \underset{w}{\operatorname{arg\,min}}~\sum_{i=1}^{N}(y_{i}-x_{i}w^{T})^2$$
which is exactly the least squares objective:
$$\underset{w}{\operatorname{arg\,min}}~f(w)=\sum_{i=1}^{N}(y_{i}-x_{i}w^{T})^2 = \Vert Y-XW^{T}\Vert_{2}^{2}$$
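The equivalence can be illustrated with a toy one-dimensional regression (the data, noise level, and true weight 2.5 are all made up for the example): the closed-form least squares solution and a grid-search maximizer of the Gaussian log-likelihood agree.

```python
import random

random.seed(3)
# Toy data from y = 2.5 * x + Gaussian noise (hypothetical true weight).
xs = [i / 10 for i in range(1, 101)]
ys = [2.5 * x + random.gauss(0.0, 0.5) for x in xs]

# Least squares, closed form for a one-feature model without intercept:
# w = sum(x*y) / sum(x*x).
w_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# The Gaussian log-likelihood equals -sum (y - w*x)^2 up to constants,
# so maximizing it is the same as minimizing the squared error.
def log_lik_up_to_constants(w):
    return -sum((y - w * x) ** 2 for x, y in zip(xs, ys))

candidates = [i / 1000 for i in range(2000, 3001)]  # w in [2.000, 3.000]
w_mle = max(candidates, key=log_lik_up_to_constants)
print(w_ls, w_mle)  # the two estimates agree to grid resolution
```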