Statistical Inference (II): Estimation Problem

1. Bayesian parameter estimation

  • Formulation

    • Prior distribution $p_{\mathsf{x}}(\cdot)$
    • Observation model $p_{\mathsf{y|x}}(\cdot|\cdot)$
    • Cost function $C(a, \hat a)$
  • Solution

    • $\hat x(\cdot) = \arg\min_{f(\cdot)} \mathbb{E}\left[C(\mathsf{x}, f(\mathsf{y}))\right]$
    • $\hat x(y) = \arg\min_{a} \int_{\mathcal{X}} C(x, a)\, p_{\mathsf{x|y}}(x|y)\, \mathrm{d}x$
  • Specific cases

    • MAE (minimum absolute-error)

      • $C(a, \hat a) = |a - \hat a|$
      • $\hat x$ is the median of the belief $p_{\mathsf{x|y}}(x|y)$
    • MAP (maximum a posteriori)

      • $C(a, \hat a) = \begin{cases} 1, & |a - \hat a| > \varepsilon \\ 0, & \text{otherwise} \end{cases}$
      • $\hat x_{MAP}(y) = \arg\max_a\, p_{\mathsf{x|y}}(a|y)$
    • BLS (Bayes' least-squares)

      • $C(a, \hat a) = \|a - \hat a\|^2$

      • $\hat x_{BLS}(y) = \mathbb{E}[\mathsf{x}|\mathsf{y}]$

      • Propositions

        • Unbiased: $b = \mathbb{E}[e(\mathsf{x}, \mathsf{y})] = \mathbb{E}[\hat x(\mathsf{y}) - \mathsf{x}] = 0$

        • The error covariance matrix is the expected covariance of the belief (the posterior):
          $$\Lambda_{BLS} = \mathbb{E}\left[\Lambda_{\mathsf{x|y}}(\mathsf{y})\right]$$

  • Orthogonality
    $$\hat x(\cdot) \text{ is BLS} \iff \mathbb{E}\left[(\hat x(\mathsf{y}) - \mathsf{x})\, g^T(\mathsf{y})\right] = 0 \ \text{ for every function } g(\cdot)$$

    Proof: omitted
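To make the three cost functions concrete, here is a minimal numerical sketch in Python. The exponential-prior, Gaussian-noise model and the observed value are assumed for illustration (they are not from the notes); the point is only that the three estimators are three different summaries of the same posterior: mean (BLS), median (MAE), and mode (MAP).

```python
import numpy as np

# Hypothetical model: exponential prior p_x(x) = e^{-x} (x >= 0) and a
# Gaussian observation y = x + w, w ~ N(0, 1); we observe y = 1.5.
y_obs = 1.5
x = np.linspace(0.0, 10.0, 100_001)
dx = x[1] - x[0]
unnorm = np.exp(-x) * np.exp(-0.5 * (y_obs - x) ** 2)  # p_x(x) * p_{y|x}(y|x)
post = unnorm / (unnorm.sum() * dx)                    # gridded posterior p_{x|y}

x_bls = (x * post).sum() * dx                          # BLS: posterior mean
cdf = np.cumsum(post) * dx
x_mae = x[np.searchsorted(cdf, 0.5)]                   # MAE: posterior median
x_map = x[np.argmax(post)]                             # MAP: posterior mode

print(f"BLS (mean)  : {x_bls:.4f}")
print(f"MAE (median): {x_mae:.4f}")
print(f"MAP (mode)  : {x_map:.4f}")
```

Because this posterior is asymmetric, the three estimates differ; for a symmetric unimodal posterior they would coincide.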

2. Linear least-squares estimation

  • Drawbacks of BLS $\hat x_{BLS}(y) = \mathbb{E}[x|y]$

    • requires the posterior $p(x|y)$, which in turn needs $p(x)$ and $p(y|x)$
    • computing the posterior can be complicated
    • the estimator is in general nonlinear
  • Definition of LLS

    • $\hat x_{LLS}(y) = \arg\min_{f(\cdot) \in \mathcal{B}} \mathbb{E}\left[\|\mathsf{x} - f(\mathsf{y})\|^2\right]$, where $\mathcal{B} = \{f(\cdot) : f(y) = Ay + d\}$
    • Note that $\hat x(\mathsf{y})$ is a random variable: it is a function of $\mathsf{y}$
    • LLS and BLS both treat x as a random variable with a prior; the difference is that LLS constrains the estimator to be an affine function of the observation y, so LLS needs only second-order moments, whereas BLS needs the posterior mean
  • Properties

    • Orthogonality:
      $$\hat x(\cdot) \text{ is LLS} \iff \mathbb{E}[\hat x(\mathsf{y}) - \mathsf{x}] = 0 \ \text{ and } \ \mathbb{E}\left[(\hat x(\mathsf{y}) - \mathsf{x})\,\mathsf{y}^T\right] = 0$$

    • Corollaries of orthogonality (checked numerically in the sketch after the proof):

      • $\hat x_{LLS}(y) = \mu_x + \Lambda_{xy}\Lambda_y^{-1}(y - \mu_y)$
      • $\Lambda_{LLS} \triangleq \mathbb{E}\left[(\mathsf{x} - \hat x_{LLS}(\mathsf{y}))(\mathsf{x} - \hat x_{LLS}(\mathsf{y}))^T\right] = \Lambda_x - \Lambda_{xy}\Lambda_y^{-1}\Lambda_{xy}^T$

    Proof (x may be a vector):

    $\Longrightarrow$: by contradiction.

    1. Suppose $\mathbb{E}[\hat x_{LLS}(y) - x] = b \ne 0$, and take $\hat x' = \hat x_{LLS} - b$.
      Then $\mathbb{E}\left[\|\hat x' - x\|^2\right] = \mathbb{E}\left[\|\hat x_{LLS} - x\|^2\right] - \|b\|^2 < \mathbb{E}\left[\|\hat x_{LLS} - x\|^2\right]$,
      which contradicts the definition of LLS.
    2. Let $e = \hat x_{LLS}(y) - x$ and take $\hat x' = \hat x_{LLS} - \Lambda_{ey}\Lambda_y^{-1}(y - \mu_y)$. Then

    $$\begin{aligned} M &= \mathbb{E}\left[(\hat x' - x)(\hat x' - x)^T\right] \\ &= \mathbb{E}\left[\left(e - \Lambda_{ey}\Lambda_y^{-1}(y - \mu_y)\right)\left(e - \Lambda_{ey}\Lambda_y^{-1}(y - \mu_y)\right)^T\right] \\ &= \Lambda_e - \Lambda_{ey}\Lambda_y^{-1}\Lambda_{ey}^T \end{aligned}$$

    Since $\mathbb{E}\left[\|x - f(y)\|^2\right] = \operatorname{tr}\{M\}$ and $\Lambda_y$ is positive definite, $\operatorname{tr}\{\Lambda_{ey}\Lambda_y^{-1}\Lambda_{ey}^T\} > 0$ whenever $\Lambda_{ey} \ne 0$, so $\hat x'$ would have strictly smaller MSE than $\hat x_{LLS}$, a contradiction. Hence $\Lambda_{ey} = \mathbb{E}\left[e\,(y - \mu_y)^T\right] = 0$, which together with $\mathbb{E}[e] = 0$ from step 1 gives
    $$\mathbb{E}\left[(\hat x(\mathsf{y}) - \mathsf{x})\,\mathsf{y}^T\right] = 0$$

    $\Longleftarrow$: suppose $\hat x$ satisfies both orthogonality conditions and let $\hat x'$ be any other linear estimator. Then
    $$\begin{aligned} \mathbb{E}\left[(\hat x' - x)(\hat x' - x)^T\right] &= \mathbb{E}\left[(\hat x' - \hat x + \hat x - x)(\hat x' - \hat x + \hat x - x)^T\right] \\ &= \mathbb{E}\left[(\hat x' - \hat x)(\hat x' - \hat x)^T\right] + \mathbb{E}\left[(\hat x - x)(\hat x - x)^T\right] \\ &\succeq \mathbb{E}\left[(\hat x - x)(\hat x - x)^T\right] \end{aligned}$$
    The cross terms vanish in the second equality because $\hat x' - \hat x = A'y + d'$ is itself affine in y, so the orthogonality conditions give $\mathbb{E}\left[(\hat x - x)(A'y + d')^T\right] = 0$.

    As before, $\mathrm{MSE} = \operatorname{tr}\{M\}$, so $\hat x$ achieves the smallest MSE among linear estimators.
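A Monte Carlo sketch of the two corollaries, using an assumed two-dimensional toy model (the matrices below are arbitrary choices for illustration): plugging sample moments into $\mu_x + \Lambda_{xy}\Lambda_y^{-1}(y - \mu_y)$ yields an estimator that satisfies both orthogonality conditions, with error covariance matching $\Lambda_x - \Lambda_{xy}\Lambda_y^{-1}\Lambda_{xy}^T$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.standard_normal((n, 2)) @ np.array([[1.0, 0.5], [0.0, 1.0]])  # correlated x
w = 0.3 * rng.standard_normal((n, 2))
y = x @ np.array([[2.0, 1.0], [0.0, 1.0]]) + w                        # linear observation

mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
L_xy = (x - mu_x).T @ (y - mu_y) / n       # Lambda_xy (cross-covariance)
L_y = (y - mu_y).T @ (y - mu_y) / n        # Lambda_y
L_x = (x - mu_x).T @ (x - mu_x) / n        # Lambda_x

A = L_xy @ np.linalg.inv(L_y)              # gain of the affine estimator
x_hat = mu_x + (y - mu_y) @ A.T            # x_LLS(y) = mu_x + L_xy L_y^{-1}(y - mu_y)

e = x_hat - x
print("E[e]     ≈", e.mean(axis=0))        # ≈ 0  (unbiasedness)
print("E[e y^T] ≈\n", e.T @ y / n)         # ≈ 0  (orthogonality)
print("empirical error covariance ≈\n", e.T @ e / n)
print("L_x - L_xy L_y^{-1} L_xy^T =\n", L_x - A @ L_xy.T)
```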

  • The jointly Gaussian case

    • Theorem: if x and y are jointly Gaussian, then (a numerical check follows below)
      $$\hat x_{BLS}(y) = \hat x_{LLS}(y)$$

    Proof: $e_{LLS} = \hat x_{LLS} - x$ is jointly Gaussian with y.

    Since $\mathbb{E}[e_{LLS}\, y^T] = 0$, $e_{LLS}$ and y are uncorrelated, and hence, by joint Gaussianity, independent.

    $$\mathbb{E}[e_{LLS}|y] = \mathbb{E}[e_{LLS}] = 0 \ \Rightarrow\ \mathbb{E}[\hat x_{LLS}|y] = \hat x_{LLS} = \mathbb{E}[x|y]$$

    • More generally, when only joint second-order moment information is available, LLS is the minimax estimator
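A quick empirical illustration of the theorem, under an assumed scalar Gaussian model: binning samples by the observed value approximates the posterior mean $\mathbb{E}[x|y]$ (the BLS estimate), which lines up with the LLS prediction.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x = 2.0 + 1.5 * rng.standard_normal(n)    # x ~ N(2, 1.5^2)
y = x + rng.standard_normal(n)            # y = x + w, w ~ N(0, 1): jointly Gaussian

a = np.cov(x, y)[0, 1] / y.var()          # Lambda_xy / Lambda_y
for c in [-1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]:
    sel = np.abs(y - c) < 0.05                  # samples with y ≈ c
    emp = x[sel].mean()                         # empirical E[x | y ≈ c]  (BLS)
    lin = x.mean() + a * (c - y.mean())         # LLS prediction
    print(f"y ≈ {c:4.1f}:  E[x|y] ≈ {emp:6.3f}   LLS = {lin:6.3f}")
```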

3. Non-Bayesian formulation

  • Formulation

    • Observation: a distribution of y parameterized by x, $p_\mathsf{y}(\mathbf{y}; x)$,
      not a conditional $p_\mathsf{y|x}(\mathbf{y}|x)$;
      here x is no longer a random variable but an unknown parameter
    • Bias: $b(x) = \mathbb{E}[\hat x(\mathsf{y}) - x]$
    • Error covariance matrix: $\Lambda_e(x) = \mathbb{E}\left[(e(x, \mathsf{y}) - b(x))(e(x, \mathsf{y}) - b(x))^T\right]$
  • A **valid** estimator must not depend explicitly on x

  • MVU: minimum-variance unbiased estimator

    • The MSE decomposes into a bias term and a variance term, so among unbiased estimators minimizing the MSE means minimizing the variance, and the optimum is the MVU estimator (a Monte Carlo check of this decomposition follows the list):
      $$\mathrm{MSE} = \mathbb{E}\left[\|e\|^2\right] = \|b(x)\|^2 + \operatorname{tr}\{\Lambda_e(x)\}$$
  • The MVU estimator may not exist:

    • there may be no unbiased estimator at all, i.e. $\mathcal{A} = \varnothing$
    • unbiased estimators may exist ($\mathcal{A} \ne \varnothing$), yet no single one attains the minimum variance for every x
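A Monte Carlo check of the decomposition $\mathrm{MSE} = b^2 + \lambda_e$, using an assumed example (estimating the variance of Gaussian samples). It also shows why restricting to unbiased estimators matters: the biased ML variance estimator here has a smaller MSE than the unbiased one.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, n, trials = 4.0, 10, 200_000
data = 3.0 + np.sqrt(sigma2) * rng.standard_normal((trials, n))

for ddof, name in [(1, "unbiased s^2   "), (0, "biased (ML) var")]:
    est = data.var(axis=1, ddof=ddof)  # one variance estimate per trial
    err = est - sigma2
    b, lam = err.mean(), err.var()     # bias b(x) and error variance lambda_e(x)
    print(f"{name}: bias={b:+.4f}  var={lam:.4f}  "
          f"b^2+var={b*b + lam:.4f}  MSE={np.mean(err**2):.4f}")
```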

4. CRB (Cramér–Rao bound)

Theorem: when the regularity condition
$$\mathbb{E}\left[\frac{\partial}{\partial x} \ln p_{\mathsf{y}}(\mathbf{y}; x)\right] = 0 \quad \text{for all } x$$
holds, every unbiased estimator satisfies
$$\lambda_{\hat x}(x) \ge \frac{1}{J_{\mathsf{y}}(x)}$$
where the Fisher information is
$$J_{\mathsf{y}}(x) = \mathbb{E}\left[\left(\frac{\partial}{\partial x} \ln p_{\mathsf{y}}(\mathbf{y}; x)\right)^2\right] = -\mathbb{E}\left[\frac{\partial^2}{\partial x^2} \ln p_{\mathsf{y}}(\mathbf{y}; x)\right]$$
Proof: take $f(y) = \frac{\partial}{\partial x} \ln p_{\mathsf{y}}(\mathbf{y}; x)$, so that $\mathbb{E}[f(\mathsf{y})] = 0$. For an unbiased estimator,
$$\operatorname{cov}(e(\mathsf{y}), f(\mathsf{y})) = \int (\hat x(y) - x)\,\frac{\partial}{\partial x} p_{\mathsf{y}}(\mathbf{y}; x)\, \mathrm{d}y = 1$$

By the Cauchy–Schwarz inequality,
$$1 = \operatorname{cov}(e, f)^2 \le \operatorname{Var}(e)\operatorname{Var}(f) = \lambda_{\hat x}(x)\, J_{\mathsf{y}}(x)$$

Remarks

  • when the regularity condition does not hold, the CRB does not exist
  • the Fisher information can be viewed as the expected curvature of $\ln p_{\mathsf{y}}(\mathbf{y}; x)$; the sketch below checks both the bound and the regularity condition numerically
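A numerical sketch of the bound for an assumed model, $y_1, \dots, y_n$ i.i.d. $N(x, \sigma^2)$ with known $\sigma^2$: the Fisher information of the whole sample is $J_{\mathsf{y}}(x) = n/\sigma^2$, the sample mean is unbiased and attains the CRB, and the score has zero mean (regularity) and variance $J_{\mathsf{y}}(x)$.

```python
import numpy as np

rng = np.random.default_rng(3)
x_true, sigma2, n, trials = 1.0, 2.0, 20, 200_000
y = x_true + np.sqrt(sigma2) * rng.standard_normal((trials, n))

x_hat = y.mean(axis=1)          # sample mean, unbiased for x
J = n / sigma2                  # Fisher information of the whole sample
print(f"Var(x_hat) = {x_hat.var():.5f}   CRB = 1/J = {1.0 / J:.5f}")

# Regularity check: the score has zero mean and variance J
score = ((y - x_true) / sigma2).sum(axis=1)
print(f"E[score] ≈ {score.mean():+.4f}   Var(score) ≈ {score.var():.3f}   J = {J:.3f}")
```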

5. Efficient estimators

  • Definition: an unbiased estimator that achieves the CRB

  • An efficient estimator is necessarily the MVU estimator

  • The MVU estimator is not necessarily efficient; that is, the CRB is not always tight, and sometimes no estimator attains the CRB for all x

  • Property (such an estimator is unique, unbiased, and achieves the CRB; a worked example follows the proof):
    $$\hat x \text{ is efficient} \iff \hat x(y) = x + \frac{1}{J_{\mathsf{y}}(x)}\frac{\partial}{\partial x} \ln p_{\mathsf{y}}(\mathbf{y}; x)$$

Proof: $\hat x$ is efficient $\iff$ it achieves the CRB $\iff$ equality holds in Cauchy–Schwarz, $\operatorname{cov}(e, f)^2 = \operatorname{Var}(e)\operatorname{Var}(f)$ $\iff$ $e(y) = k(x) f(y)$ $\iff$ $\hat x(y) = x + k(x) f(y)$, and then
$$\frac{1}{J_{\mathsf{y}}(x)} = \mathbb{E}[e^2(\mathsf{y})] = k(x)\,\mathbb{E}[e(\mathsf{y}) f(\mathsf{y})] = k(x)$$
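Worked example (a standard textbook case, added here for illustration): for a single observation $y = x + w$ with $w \sim N(0, \sigma^2)$, the score is $\frac{\partial}{\partial x}\ln p_{\mathsf{y}}(y; x) = (y - x)/\sigma^2$ and $J_{\mathsf{y}}(x) = 1/\sigma^2$, so
$$\hat x(y) = x + \frac{1}{J_{\mathsf{y}}(x)} \cdot \frac{y - x}{\sigma^2} = x + (y - x) = y$$
The dependence on x cancels, so $\hat x(y) = y$ is a valid estimator, and it is efficient.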

6. ML estimation

  • Definition
    $$\hat x_{ML}(y) = \arg\max_{a}\, p_{\mathsf{y}}(y; a)$$

Proposition: if an efficient estimator exists, it is the ML estimator:
$$\hat x_{eff}(\cdot) = \hat x_{ML}(\cdot)$$
Proof:
$$\hat x_{eff}(y) = x + \frac{1}{J_{\mathsf{y}}(x)}\frac{\partial}{\partial x}\ln p(y; x)$$
Since a valid estimator must not depend on x, the right-hand side takes the same value for every choice of x; in particular we may substitute $x = \hat x_{ML}(y)$, at which the score vanishes:
$$\hat x_{eff}(y) = \hat x_{ML}(y) + \frac{1}{J_{\mathsf{y}}(x)}\frac{\partial \ln p(y; x)}{\partial x}\bigg|_{x = \hat x_{ML}(y)} = \hat x_{ML}(y)$$
Remark: the converse does not hold, i.e. the ML estimator is not necessarily efficient. For instance, a globally efficient estimator sometimes does not exist: the $\hat x_{eff}(y)$ computed from the formula above then actually depends on x, so there is no globally optimal estimator, and in that case the ML estimator carries no particular optimality guarantee. A numerical ML example follows.
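A minimal numerical sketch (assumed Gaussian-mean model with known variance, not from the notes): maximizing the log-likelihood numerically recovers the sample mean, which is exactly the efficient estimator of the worked example above, illustrating $\hat x_{eff} = \hat x_{ML}$ when an efficient estimator exists.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
y = 1.0 + np.sqrt(2.0) * rng.standard_normal(30)   # y_i ~ N(1, 2), mean x unknown

# Negative log-likelihood of N(a, 2), dropping a-independent constants
neg_ll = lambda a: np.sum((y - a) ** 2) / (2 * 2.0)

a_ml = minimize_scalar(neg_ll).x
print(f"numerical ML : {a_ml:.6f}")
print(f"sample mean  : {y.mean():.6f}")            # the efficient estimator
```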
