统计推断(七) Typical Sequence

1. 一些定理

Markov inequality: r.v.  x0r.v. \ \ \mathsf{x}\ge0r.v.  x≥0
P(xμ)E[x]μ \mathbb{P}(x\ge\mu)\le \frac{\mathbb{E}[x]}{\mu} P(x≥μ)≤μE[x]​
Proof: omit…

Weak law of large numbers(WLLN): y=[y1,y2,...,yN]T,    yip   i.i.d\vec{y}=[y_1,y_2,...,y_N]^T, \ \ \ \ y_i \sim p \ \ \ i.i.dy​=[y1​,y2​,...,yN​]T,    yi​∼p   i.i.d
limNP(Lp(y)+H(p)>ε)=0,  ε>0 \lim_{N\to\infty}\mathbb{P}(|L_p(\vec{y})+H(p)|>\varepsilon)=0, \ \ \forall \varepsilon>0 N→∞lim​P(∣Lp​(y​)+H(p)∣>ε)=0,  ∀ε>0
Proof: omit…

2. Typical set

  • Definition: Tε(p;N)={yYN:Lp(y)+H(p)<ε}\mathcal{T}_\varepsilon(p;N)=\{\vec{y}\in\mathcal{Y}^N:|L_p(\vec{y})+H(p)|<\varepsilon\}Tε​(p;N)={y​∈YN:∣Lp​(y​)+H(p)∣<ε}

  • Properties

    • WLLN P(yTε(p;N))1\Longrightarrow P\left(\vec{y}\in\mathcal{T}_\varepsilon(p;N)\right)\simeq1⟹P(y​∈Tε​(p;N))≃1, NNN large
    • Lp(y)H(p)py(y)2NH(p)L_p(\vec{y})\simeq H(p) \Longrightarrow p_y(\vec{y})\simeq 2^{-NH(p)}Lp​(y​)≃H(p)⟹py​(y​)≃2−NH(p)
    • Tε(p;N)2NH(p)\Longrightarrow |\mathcal{T}_\varepsilon(p;N)|\simeq 2^{NH(p)}⟹∣Tε​(p;N)∣≃2NH(p)
    • 当 p 不是均匀分布的时候,Tε(p;N)YN0\frac{|\mathcal{T}_\varepsilon(p;N)|}{|\mathcal{Y}^N|}\to0∣YN∣∣Tε​(p;N)∣​→0,也就是说典型集中元素(序列)个数在所有可能的元素(序列)中所占比例趋于 0,但是典型集中元素概率的和却趋近于 1
  • Theorem

    Asymptotic Equipartition Property(AEP)

    limNP(Tε(p;N))=1 \lim_{N\to\infty}P(\mathcal{T}_\varepsilon(p;N))=1 \\ N→∞lim​P(Tε​(p;N))=1

    2N(H(p)+ϵ)py(y)2N(H(p)ϵ),yTϵ(p;N) 2^{-N(H(p)+\epsilon)} \leq p_{\mathrm{y}}(\boldsymbol{y}) \leq 2^{-N(H(p)-\epsilon)}, \forall \boldsymbol{y} \in \mathcal{T}_{\epsilon}(p ; N) 2−N(H(p)+ϵ)≤py​(y)≤2−N(H(p)−ϵ),∀y∈Tϵ​(p;N)

    • for a sufficient large NNN

    (1ϵ)2N(H(p)ϵ)Tϵ(p;N)2N(H(p)+ϵ) (1-\epsilon) 2^{N(H(p)-\epsilon)} \leq\left|\mathcal{T}_{\epsilon}(p ; N)\right| \leq 2^{N(H(p)+\epsilon)} (1−ϵ)2N(H(p)−ϵ)≤∣Tϵ​(p;N)∣≤2N(H(p)+ϵ)

    Proof:
    Tϵ(p;N)=yTϵ(p;N)1=2N(H(p)+ϵ)yTϵ(p;N)2N(H(p)+ϵ)2N(H(p)+ϵ)yTϵ(p;N)py(y)=2N(H(p)+ϵ)P{Tϵ(p;N)}2N(H(p)+ϵ) \begin{aligned}\left|\mathcal{T}_{\epsilon}(p ; N)\right| &=\sum_{\boldsymbol{y} \in \mathcal{T}_{\epsilon}(p ; N)} 1 \\ &=2^{N(H(p)+\epsilon)} \sum_{\boldsymbol{y} \in \mathcal{T}_{\epsilon}(p ; N)} 2^{-N(H(p)+\epsilon)} \\ & \leq 2^{N(H(p)+\epsilon)} \sum_{\boldsymbol{y} \in \mathcal{T}_{\epsilon}(p ; N)} p_{\mathbf{y}}(\boldsymbol{y}) \\ &=2^{N(H(p)+\epsilon)} P\left\{\mathcal{T}_{\epsilon}(p ; N)\right\} \\ & \leq 2^{N(H(p)+\epsilon)} \end{aligned} ∣Tϵ​(p;N)∣​=y∈Tϵ​(p;N)∑​1=2N(H(p)+ϵ)y∈Tϵ​(p;N)∑​2−N(H(p)+ϵ)≤2N(H(p)+ϵ)y∈Tϵ​(p;N)∑​py​(y)=2N(H(p)+ϵ)P{Tϵ​(p;N)}≤2N(H(p)+ϵ)​
    统计推断(七)  Typical Sequence

3. Divergence ε\varepsilonε-typical set

  • WLLN: y=[y1,y2,...,yN]T,    yip   i.i.d\vec{y}=[y_1,y_2,...,y_N]^T, \ \ \ \ y_i \sim p \ \ \ i.i.dy​=[y1​,y2​,...,yN​]T,    yi​∼p   i.i.d
    $$
    L_{p | q}(\boldsymbol{y})=\frac{1}{N} \log \frac{p_{\mathbf{y}}(\boldsymbol{y})}{q_{\mathbf{y}}(\boldsymbol{y})}=\frac{1}{N} \sum_{n=1}^{N} \log \frac{p\left(y_{n}\right)}{q\left(y_{n}\right)} \

    \lim {N \rightarrow \infty} \mathbb{P}\left(\left|L{p | q}(\boldsymbol{y})-D(p | q)\right|>\epsilon\right)=0
    $$
    Remarks: 前面只考虑的均值,这里还考虑了另一个分布

  • Definition: y=[y1,y2,...,yN]T,    yip   i.i.d\vec{\boldsymbol{y}}=[y_1,y_2,...,y_N]^T, \ \ \ \ y_i \sim p \ \ \ i.i.dy​=[y1​,y2​,...,yN​]T,    yi​∼p   i.i.d
    Tϵ(pq;N)={yYN:Lpq(y)D(pq)ϵ} \mathcal{T}_{\epsilon}(p | q ; N)=\left\{\boldsymbol{y} \in \mathcal{Y}^{N}:\left|L_{p | q}(\boldsymbol{y})-D(p \| q)\right| \leq \epsilon\right\} Tϵ​(p∣q;N)={y∈YN:∣∣​Lp∣q​(y)−D(p∥q)∣∣​≤ϵ}

  • Properties

    • WLLN qy(y)py(y)2ND(pq)\Longrightarrow q_{\mathbf{y}}(\boldsymbol{y}) \approx p_{\mathbf{y}}(\boldsymbol{y}) 2^{-N D(p \| q)}⟹qy​(y)≈py​(y)2−ND(p∥q)
    • Q{Tϵ(pq;N)}2ND(pq)0Q\left\{\mathcal{T}_{\epsilon}(p | q ; N)\right\} \approx 2^{-N D(p \| q)} \to0Q{Tϵ​(p∣q;N)}≈2−ND(p∥q)→0
    • Remarks: p 的典型集可能是 q 的非典型集,在 NNN 很大的时候,不同分布的 typical set 是正交的
  • Theorem
    (1ϵ)2N(D(pq)+ϵ)Q{Tϵ(pq;N)}2N(D(pq)ϵ) (1-\epsilon) 2^{-N(D(p \| q)+\epsilon)} \leq Q\left\{\mathcal{T}_{\epsilon}(p \| q ; N)\right\} \leq 2^{-N(D(p \| q)-\epsilon)} (1−ϵ)2−N(D(p∥q)+ϵ)≤Q{Tϵ​(p∥q;N)}≤2−N(D(p∥q)−ϵ)

4. Large deviation of sample averages

Theorem (Cram´er’s Theorem): y=[y1,y2,...,yN]T,   yiq   i.i.d\vec{\boldsymbol{y}}=[y_1,y_2,...,y_N]^T, \ \ \ y_i \sim q \ \ \ i.i.dy​=[y1​,y2​,...,yN​]T,   yi​∼q   i.i.d with mean μ<\mu<\inftyμ<∞ and γ>μ\gamma>\muγ>μ
limN1NlogP(1Nn=1Nynγ)=EC(γ) \lim _{N \rightarrow \infty}-\frac{1}{N} \log \mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right)=E_{C}(\gamma) N→∞lim​−N1​logP(N1​n=1∑N​yn​≥γ)=EC​(γ)
where EC(γ)E_C(\gamma)EC​(γ) is referred as Chernoff exponent
EC(γ)D(p(;x)q),   p(;x)=q(y)exyα(x) E_{C}(\gamma) \triangleq D(p(\cdot ; x) \| q),\ \ \ p(\cdot ; x)=q(y) e^{x y-\alpha(x)} EC​(γ)≜D(p(⋅;x)∥q),   p(⋅;x)=q(y)exy−α(x)
and with x>0x>0x>0 chosen such that
Ep(;x)[y]=γ \mathbb{E}_{p(\cdot;x)}[y]=\gamma Ep(⋅;x)​[y]=γ
Proof:

  1. P(1Nn=1Nynγ)=P(exn=1NyneNxγ)eNxγE[exn=1Nyn]=eNxγ(E[exy])NeN(xγα(x))\begin{aligned} \mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right) &=\mathbb{P}\left(e^{x \sum_{n=1}^{N} y_{n}} \geq e^{N x \gamma}\right) \\ & \leq e^{-N x \gamma} \mathbb{E}\left[e^{x \sum_{n=1}^{N} y_{n}}\right] \\ &=e^{-N x \gamma}\left(\mathbb{E}\left[e^{x y}\right]\right)^{N} \\ & \leq e^{-N\left(x_{*} \gamma-\alpha\left(x_{*}\right)\right)} \end{aligned}P(N1​n=1∑N​yn​≥γ)​=P(ex∑n=1N​yn​≥eNxγ)≤e−NxγE[ex∑n=1N​yn​]=e−Nxγ(E[exy])N≤e−N(x∗​γ−α(x∗​))​
  2. φ(x)=xγα(x)\varphi(x)=x\gamma-\alpha(x)φ(x)=xγ−α(x) 是凸的,最大值取在 Ep(;x)[y]=α˙(x)=γ\mathbb{E}_{p\left(\cdot ; x_{*}\right)}[y]=\dot{\alpha}\left(x_{*}\right)=\gammaEp(⋅;x∗​)​[y]=α˙(x∗​)=γ
  3. 可以证明 xγα(x)=xα˙(x)α(x)=D(p(;x)q)x_{*} \gamma-\alpha\left(x_{*}\right)=x_{*} \dot{\alpha}\left(x_{*}\right)-\alpha\left(x_{*}\right)=D\left(p\left(\cdot ; x_{*}\right) \| q\right)x∗​γ−α(x∗​)=x∗​α˙(x∗​)−α(x∗​)=D(p(⋅;x∗​)∥q)
  4. 于是有 P(1Nn=1Nynγ)eNEC(γ)\mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right) \leq e^{-N E_{C}(\gamma)}P(N1​∑n=1N​yn​≥γ)≤e−NEC​(γ)
  5. 下界的证明,暂时略…

用到的两个事实:p(y;x)=q(y)exp(xyα(x))p(y;x)=q(y)\exp(xy-\alpha(x))p(y;x)=q(y)exp(xy−α(x))

  1. D(p(y;x)q(y))D(p(y;x)||q(y))D(p(y;x)∣∣q(y)) 随着 x 单调增加
  2. Ep(;x)[y]\mathbb{E}_{p(;x)}[y]Ep(;x)​[y] 随着 x 单调增加

Remarks:

  1. 这个定理也相当于表达了 P(1Nn=1Nynγ)2NEC(γ)\mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right) \cong 2^{-N E_{\mathrm{C}}(\gamma)}P(N1​∑n=1N​yn​≥γ)≅2−NEC​(γ)
  2. 相当于是分布 q 向由 E[y]=n=1Nynγ\mathbb{E}[y]=\sum_{n=1}^{N} y_{n} \geq \gammaE[y]=∑n=1N​yn​≥γ 所定义的一个凸集中投影,恰好投影到边界(线性分布族) E[y]=γ\mathbb{E}[y]=\gammaE[y]=γ 上,而 qqq 向线性分布族的投影恰好就是 (10) 中的指数族表达式

统计推断(七)  Typical Sequence

5. Types and type classes

  • Definition: y=[y1,y2,...,yN]T\vec{\boldsymbol{y}}=[y_1,y_2,...,y_N]^Ty​=[y1​,y2​,...,yN​]T (不关心真实服从的是哪个分布)

    • type(实质上就是一个经验分布)定义为

    p^(b;y)=1Nn=1N1b(yn)=Nb(y)N \hat{p}(b ; \mathbf{y})=\frac{1}{N} \sum_{n=1}^{N} \mathbb{1}_{b}\left(y_{n}\right)=\frac{N_{b}(\mathbf{y})}{N} p^​(b;y)=N1​n=1∑N​1b​(yn​)=NNb​(y)​

    • PNy\mathcal{P}_{N}^{y}PNy​ 表示长度为 NNN 的序列所有可能的 types
    • type class: TNy(p)={yyN:p^(;y)p()},   pPNy\mathcal{T}_{N}^{y}(p)=\left\{\mathbf{y} \in y^{N}: \hat{p}(\cdot ; \mathbf{y}) \equiv p(\cdot)\right\},\ \ \ p\in\mathcal{P}_{N}^{y}TNy​(p)={y∈yN:p^​(⋅;y)≡p(⋅)},   p∈PNy​
  • Exponential Rate Notation: f(N)2Nαf(N) \doteq 2^{N \alpha}f(N)≐2Nα
    limNlogf(N)N=α \lim _{N \rightarrow \infty} \frac{\log f(N)}{N}=\alpha N→∞lim​Nlogf(N)​=α
    Remarks: α\alphaα 表示了指数上面关于 NNN 的阶数(log、线性、二次 …)

  • Properties

    • PNy(N+1)y\left|\mathcal{P}_{N}^{y}\right| \leq(N+1)^{|y|}∣PNy​∣≤(N+1)∣y∣
    • qN(y)=2N(D(p^(y)q)+H(p^(;y)))q^{N}(\mathbf{y})=2^{-N(D(\hat{p}(\cdot \mathbf{y}) \| q)+H(\hat{p}(\cdot ; \mathbf{y})))}qN(y)=2−N(D(p^​(⋅y)∥q)+H(p^​(⋅;y)))
      pN(y)=2NH(p) for yTNy(p)p^{N}(\mathbf{y})=2^{-N H(p)} \quad \text { for } \mathbf{y} \in \mathcal{T}_{N}^{y}(p)pN(y)=2−NH(p) for y∈TNy​(p)
    • cNy2NH(p)TNy(p)2NH(p)c N^{-|y|} 2^{N H(p)} \leq\left|\mathcal{T}_{N}^{y}(p)\right| \leq 2^{N H(p)}cN−∣y∣2NH(p)≤∣TNy​(p)∣≤2NH(p)
  • Theorem
    cNy2ND(pq)Q{TNy(p)}2ND(pq)Q{TNy(p)}2ND(pq) c N^{-|y|} 2^{-N D(p \| q)} \leq Q\left\{\mathcal{T}_{N}^{y}(p)\right\} \leq 2^{-N D(p \| q)} \\ Q\left\{\mathcal{T}_{N}^{y}(p)\right\} \doteq 2^{-N D(p \| q)} cN−∣y∣2−ND(p∥q)≤Q{TNy​(p)}≤2−ND(p∥q)Q{TNy​(p)}≐2−ND(p∥q)

6. Large Deviation Analysis via Types

  • Definition: R={yyN:p^(;y)SPNy}\mathcal{R}=\left\{\mathbf{y} \in y^{N}: \hat{p}(\cdot ; \mathbf{y}) \in \mathcal{S} \cap \mathcal{P}_{N}^{y}\right\}R={y∈yN:p^​(⋅;y)∈S∩PNy​}

Sanov’s Theorem:
Q{SPNy}(N+1)y2ND(pq)Q{SPNy}˙2ND(pq)p=argminpSD(pq) Q\left\{\mathrm{S} \cap \mathcal{P}_{N}^{y}\right\} \leq(N+1)^{|y|} 2^{-N D\left(p_{*} \| q\right)} \\ Q\left\{\mathrm{S} \cap \mathcal{P}_{N}^{y}\right\} \dot\leq 2^{-N D\left(p_{*} \| q\right)} \\ p_{*}=\underset{p \in \mathcal{S}}{\arg \min } D(p \| q) Q{S∩PNy​}≤(N+1)∣y∣2−ND(p∗​∥q)Q{S∩PNy​}≤˙​2−ND(p∗​∥q)p∗​=p∈Sargmin​D(p∥q)

7. Asymptotics of hypothesis testing

  • LRT: L(y)=1Nlogp1N(y)p0N(y)=1Nn=1Nlogp1(yn)p0(yn)><γL(\boldsymbol{y})=\frac{1}{N} \log \frac{p_{1}^{N}(\boldsymbol{y})}{p_{0}^{N}(\boldsymbol{y})}=\frac{1}{N} \sum_{n=1}^{N} \log \frac{p_{1}\left(y_{n}\right)}{p_{0}\left(y_{n}\right)} \frac{>}{<} \gammaL(y)=N1​logp0N​(y)p1N​(y)​=N1​∑n=1N​logp0​(yn​)p1​(yn​)​<>​γ
  • PF=P0{1Nn=1Ntnγ}2ND(pp0)P_{F}=\mathbb{P}_{0}\left\{\frac{1}{N} \sum_{n=1}^{N} t_{n} \geq \gamma\right\} \approx 2^{-N D\left(p^{*} \| p_{0}^{\prime}\right)}PF​=P0​{N1​∑n=1N​tn​≥γ}≈2−ND(p∗∥p0′​)
  • PM=1PD2ND(pp1)P_{M}=1-P_{D} \approx 2^{-N D\left(p^{*} \| p_{1}^{\prime}\right)}PM​=1−PD​≈2−ND(p∗∥p1′​)
  • D(pp0)D(pp1)=p(t)logp1(t)p0(t)dt=p(t)tdt=Ep[t]=γD\left(p^{*} \| p_{0}^{\prime}\right)-D\left(p^{*} \| p_{1}^{\prime}\right)=\int p^{*}(t) \log \frac{p_{1}^{\prime}(t)}{p_{0}^{\prime}(t)} \mathrm{d} t=\int p^{*}(t) t \mathrm{d} t=\mathbb{E}_{p^{*}}[\mathrm{t}]=\gammaD(p∗∥p0′​)−D(p∗∥p1′​)=∫p∗(t)logp0′​(t)p1′​(t)​dt=∫p∗(t)tdt=Ep∗​[t]=γ

统计推断(七)  Typical Sequence

8.Asymptotics of parameter estimation

Strong Law of Large Numbers(SLLN):
P(limN1Nn=1Nwn=μ)=1 \mathbb{P}\left(\lim _{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^{N} w_{n}=\mu\right)=1 P(N→∞lim​N1​n=1∑N​wn​=μ)=1
Central Limit Theorem(CLT):
limNP(1Nn=1N(wnμσ)b)=Φ(b) \lim _{N \rightarrow \infty} \mathbb{P}\left(\frac{1}{\sqrt{N}} \sum_{n=1}^{N}\left(\frac{w_{n}-\mu}{\sigma}\right) \leq b\right)=\Phi(b) N→∞lim​P(N​1​n=1∑N​(σwn​−μ​)≤b)=Φ(b)
以下三个强度依次递减

  1. 依概率 1 收敛(SLLN):xNw.p.1a\mathsf{x}_{N} \stackrel{w . p .1}{\longrightarrow} axN​⟶w.p.1​a
  2. 概率趋于 0(WLLN):
  3. 依分布收敛: xNdp\mathsf{x}_{N} \stackrel{d}{\longrightarrow} pxN​⟶d​p
  • Asymptotics of ML Estimation

    Theorem:
    x^N=argmaxxLN(x;y) \hat{x}_{N}=\arg \max _{x} L_{N}(x ; \mathbf{y}) x^N​=argxmax​LN​(x;y)
    在满足某些条件下(mild conditions),有
    x^Nwp1x0N(x^Nx0)dN(0,Jy(x0)1) \begin{array}{c}{\hat{x}_{N} \stackrel{w \cdot p \cdot 1}{\longrightarrow} x_{0}} \\ {\sqrt{N}\left(\hat{x}_{N}-x_{0}\right) \stackrel{d}{\longrightarrow} \mathcal{N}\left(0, J_{y}\left(x_{0}\right)^{-1}\right)}\end{array} x^N​⟶w⋅p⋅1​x0​N​(x^N​−x0​)⟶d​N(0,Jy​(x0​)−1)​

统计推断(七)  Typical Sequence统计推断(七)  Typical Sequence Bonennult 发布了37 篇原创文章 · 获赞 27 · 访问量 2万+ 私信 关注
上一篇:推荐一些国内的Jquery CDN免费服务


下一篇:统计推断(五) EM algorithm