1. Some theorems
Markov inequality: for a r.v. $\mathsf{x}\ge 0$,
$$
\mathbb{P}(\mathsf{x}\ge\mu)\le \frac{\mathbb{E}[\mathsf{x}]}{\mu}
$$
Proof: omitted.
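As a quick sanity check (not from the original notes; the Exponential(1) distribution below is an arbitrary nonnegative example), the bound can be compared with empirical tail probabilities:

```python
# Minimal numerical check of Markov's inequality for a nonnegative r.v.
# The Exponential(1) choice is an arbitrary illustration (E[x] = 1).
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)
for mu in (0.5, 1.0, 2.0, 4.0):
    tail = np.mean(x >= mu)        # empirical P(x >= mu)
    bound = x.mean() / mu          # Markov bound E[x]/mu
    print(f"mu={mu}: P(x>=mu)≈{tail:.4f} <= E[x]/mu≈{bound:.4f}")
```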
Weak law of large numbers (WLLN): let $\vec{y}=[y_1,y_2,\dots,y_N]^T$ with $y_i \sim p$ i.i.d. Then
$$
\lim_{N\to\infty}\mathbb{P}\left(\left|L_p(\vec{y})+H(p)\right|>\varepsilon\right)=0, \quad \forall \varepsilon>0
$$
where $L_p(\vec{y})=\frac{1}{N}\log p_{\mathbf{y}}(\vec{y})$.
Proof: omitted.
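A small simulation sketch of this statement (the 4-letter pmf $p$ below is my own toy choice, not from the notes): the normalized log-probability $L_p(\vec{y})$ concentrates around $-H(p)$ as $N$ grows.

```python
# Sketch: L_p(y) = (1/N) sum_n log2 p(y_n) concentrates around -H(p) (WLLN / AEP).
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.25, 0.125, 0.125])
H = -np.sum(p * np.log2(p))                 # entropy in bits

for N in (100, 1_000, 10_000):
    y = rng.choice(len(p), size=N, p=p)     # y_n ~ p, i.i.d.
    L = np.mean(np.log2(p[y]))              # L_p(y)
    print(f"N={N}: L_p(y)={L:.4f}, -H(p)={-H:.4f}")
```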
2. Typical set
3. Divergence $\varepsilon$-typical set
WLLN: let $\vec{y}=[y_1,y_2,\dots,y_N]^T$ with $y_i \sim p$ i.i.d., and define
$$
L_{p\|q}(\boldsymbol{y})=\frac{1}{N} \log \frac{p_{\mathbf{y}}(\boldsymbol{y})}{q_{\mathbf{y}}(\boldsymbol{y})}=\frac{1}{N} \sum_{n=1}^{N} \log \frac{p\left(y_{n}\right)}{q\left(y_{n}\right)}
$$
Then
$$
\lim_{N \rightarrow \infty} \mathbb{P}\left(\left|L_{p\|q}(\boldsymbol{y})-D(p \| q)\right|>\epsilon\right)=0
$$
Remarks: the previous statement involved only the source distribution itself (through its entropy); here a second distribution $q$ enters as well.
Definition: let $\vec{\boldsymbol{y}}=[y_1,y_2,\dots,y_N]^T$ with $y_i \sim p$ i.i.d. The divergence $\epsilon$-typical set is
$$
\mathcal{T}_{\epsilon}(p \| q ; N)=\left\{\boldsymbol{y} \in \mathcal{Y}^{N}:\left|L_{p\|q}(\boldsymbol{y})-D(p \| q)\right| \leq \epsilon\right\}
$$
Properties
- WLLN $\Longrightarrow q_{\mathbf{y}}(\boldsymbol{y}) \approx p_{\mathbf{y}}(\boldsymbol{y})\, 2^{-N D(p \| q)}$ for typical $\boldsymbol{y}$
- $Q\left\{\mathcal{T}_{\epsilon}(p \| q ; N)\right\} \approx 2^{-N D(p \| q)} \to 0$
- Remarks: a sequence that is typical for $p$ is generally atypical for $q$; as $N$ grows, the typical sets of different distributions become essentially disjoint ("orthogonal").
Theorem:
$$
(1-\epsilon)\, 2^{-N(D(p \| q)+\epsilon)} \leq Q\left\{\mathcal{T}_{\epsilon}(p \| q ; N)\right\} \leq 2^{-N(D(p \| q)-\epsilon)}
$$
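A rough Monte-Carlo illustration of this theorem (my own sketch; the pmfs $p$ and $q$ below are arbitrary). It uses the change-of-measure identity $q^N(\boldsymbol{y})=p^N(\boldsymbol{y})\,2^{-N L_{p\|q}(\boldsymbol{y})}$, so $Q\{\mathcal{T}_\epsilon\}$ can be estimated by sampling under $p$ and averaging the weight $2^{-N L_{p\|q}}$ over typical sequences:

```python
# Estimate Q{T_eps(p||q;N)} by importance sampling under p and check that its
# exponential rate is close to D(p||q). Toy pmfs; log base 2 throughout.
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
D = np.sum(p * np.log2(p / q))                       # D(p||q) in bits

N, eps, trials = 200, 0.05, 20_000
y = rng.choice(len(p), size=(trials, N), p=p)        # rows drawn i.i.d. from p
L = np.mean(np.log2(p[y] / q[y]), axis=1)            # L_{p||q}(y) per row
typical = np.abs(L - D) <= eps
Q_T = np.mean(np.where(typical, 2.0 ** (-N * L), 0.0))   # E_p[2^{-N L} 1_{T_eps}] = Q{T_eps}

print(f"D(p||q) = {D:.4f} bits")
print(f"-(1/N) log2 Q{{T_eps}} ≈ {-np.log2(Q_T)/N:.4f}  (close to D, as the theorem predicts)")
```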
4. Large deviation of sample averages
Theorem (Cramér's Theorem): let $\vec{\boldsymbol{y}}=[y_1,y_2,\dots,y_N]^T$ with $y_i \sim q$ i.i.d., mean $\mu<\infty$, and let $\gamma>\mu$. Then
$$
\lim _{N \rightarrow \infty}-\frac{1}{N} \log \mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right)=E_{C}(\gamma)
$$
where $E_C(\gamma)$ is referred to as the Chernoff exponent,
$$
E_{C}(\gamma) \triangleq D(p(\cdot ; x) \| q), \qquad p(y ; x)=q(y)\, e^{x y-\alpha(x)},
$$
$\alpha(x)\triangleq\log\mathbb{E}_{q}\left[e^{x y}\right]$ is the log moment generating function of $q$, and $x>0$ is chosen such that
$$
\mathbb{E}_{p(\cdot;x)}[y]=\gamma
$$
Proof:
$$
\begin{aligned} \mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right) &=\mathbb{P}\left(e^{x \sum_{n=1}^{N} y_{n}} \geq e^{N x \gamma}\right) \\ & \leq e^{-N x \gamma}\, \mathbb{E}\left[e^{x \sum_{n=1}^{N} y_{n}}\right] \\ &=e^{-N x \gamma}\left(\mathbb{E}\left[e^{x y}\right]\right)^{N} \\ & \leq e^{-N\left(x_{*} \gamma-\alpha\left(x_{*}\right)\right)} \end{aligned}
$$
Here $\varphi(x)=x\gamma-\alpha(x)$ is concave (since $\alpha$ is convex), so its maximum is attained where $\mathbb{E}_{p\left(\cdot ; x_{*}\right)}[y]=\dot{\alpha}\left(x_{*}\right)=\gamma$.
One can show that $x_{*} \gamma-\alpha\left(x_{*}\right)=x_{*} \dot{\alpha}\left(x_{*}\right)-\alpha\left(x_{*}\right)=D\left(p\left(\cdot ; x_{*}\right) \| q\right)$.
It follows that $\mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right) \leq e^{-N E_{C}(\gamma)}$.
The proof of the matching lower bound is omitted for now. It relies on two facts about the tilted family $p(y;x)=q(y)\exp(xy-\alpha(x))$:
- $D(p(\cdot;x)\,\|\,q)$ is monotonically increasing in $x$;
- $\mathbb{E}_{p(\cdot;x)}[y]$ is monotonically increasing in $x$.
Remarks:
- The theorem can equivalently be written as $\mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right) \cong 2^{-N E_{\mathrm{C}}(\gamma)}$.
- Geometrically, this is the information projection of $q$ onto the convex set of distributions $\{p: \mathbb{E}_{p}[y] \geq \gamma\}$; the projection lands exactly on the boundary, i.e. the linear family $\mathbb{E}_{p}[y]=\gamma$, and the projection of $q$ onto that linear family is exactly the exponential-family distribution $p(\cdot;x_*)$ above.
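A hedged numerical sketch of the theorem for a Bernoulli source (my own toy example; $q_0$ and $\gamma$ below are assumptions). For $y\in\{0,1\}$, tilting a Bernoulli gives another Bernoulli, so the tilt with mean $\gamma$ is simply $\mathrm{Bern}(\gamma)$ and $E_C(\gamma)=D(\mathrm{Bern}(\gamma)\,\|\,\mathrm{Bern}(q_0))$; the exact binomial tail then shows the predicted exponential decay (convergence in $N$ is slow because of polynomial prefactors).

```python
# Cramér/Chernoff exponent for a Bernoulli(q0) source vs. the exact binomial tail.
# Rates are in bits (log base 2), matching the 2^{-N E_C(gamma)} form of the remark above.
import math

def kl_bern(a, b):
    """D(Bern(a) || Bern(b)) in bits."""
    return a * math.log2(a / b) + (1 - a) * math.log2((1 - a) / (1 - b))

q0, gamma = 0.3, 0.5                    # source mean 0.3, threshold gamma = 0.5 > 0.3
E_C = kl_bern(gamma, q0)                # Chernoff exponent: tilted distribution is Bern(gamma)

for N in (50, 200, 800):
    k0 = math.ceil(N * gamma)
    tail = sum(math.comb(N, k) * q0**k * (1 - q0)**(N - k) for k in range(k0, N + 1))
    print(f"N={N}: -(1/N) log2 P(mean >= gamma) = {-math.log2(tail)/N:.4f}   E_C = {E_C:.4f}")
```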
5. Types and type classes
Definition: let $\vec{\boldsymbol{y}}=[y_1,y_2,\dots,y_N]^T$ (no assumption is made about the true underlying distribution). The type (empirical distribution) of $\mathbf{y}$ is
$$
\hat{p}(b ; \mathbf{y})=\frac{1}{N} \sum_{n=1}^{N} \mathbb{1}_{b}\left(y_{n}\right)=\frac{N_{b}(\mathbf{y})}{N}
$$
$\mathcal{P}_{N}^{\mathcal{Y}}$ denotes the set of all possible types of length-$N$ sequences.
Type class: $\mathcal{T}_{N}^{\mathcal{Y}}(p)=\left\{\mathbf{y} \in \mathcal{Y}^{N}: \hat{p}(\cdot ; \mathbf{y}) \equiv p(\cdot)\right\}$, for $p\in\mathcal{P}_{N}^{\mathcal{Y}}$.
Exponential Rate Notation: $f(N) \doteq 2^{N \alpha}$ means
$$
\lim _{N \rightarrow \infty} \frac{\log f(N)}{N}=\alpha
$$
Remarks: $\alpha$ captures only the leading exponential rate in $N$; sub-exponential factors (logarithmic, polynomial, etc.) are ignored by the $\doteq$ notation.
Properties
- $\left|\mathcal{P}_{N}^{\mathcal{Y}}\right| \leq(N+1)^{|\mathcal{Y}|}$
- $q^{N}(\mathbf{y})=2^{-N\left(D(\hat{p}(\cdot ; \mathbf{y}) \| q)+H(\hat{p}(\cdot ; \mathbf{y}))\right)}$; in particular, $p^{N}(\mathbf{y})=2^{-N H(p)}$ for $\mathbf{y} \in \mathcal{T}_{N}^{\mathcal{Y}}(p)$
- $c\, N^{-|\mathcal{Y}|}\, 2^{N H(p)} \leq\left|\mathcal{T}_{N}^{\mathcal{Y}}(p)\right| \leq 2^{N H(p)}$
Theorem:
$$
c\, N^{-|\mathcal{Y}|}\, 2^{-N D(p \| q)} \leq Q\left\{\mathcal{T}_{N}^{\mathcal{Y}}(p)\right\} \leq 2^{-N D(p \| q)}, \qquad
Q\left\{\mathcal{T}_{N}^{\mathcal{Y}}(p)\right\} \doteq 2^{-N D(p \| q)}
$$
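An exact check of this theorem on a 3-letter alphabet (my own toy numbers): the type class probability is the multinomial coefficient times $\prod_b q(b)^{N\hat{p}(b)}$, and its rate stays within a polynomial factor of $2^{-ND(p\|q)}$.

```python
# Exact Q{T_N(p)} for a ternary type p under sampling distribution q, compared with
# the 2^{-N D(p||q)} rate. Toy numbers; log base 2.
import math

q = (0.5, 0.3, 0.2)                      # sampling distribution Q
p = (0.2, 0.3, 0.5)                      # the type of interest (integer counts below)
D = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

for N in (30, 300):
    counts = [round(pi * N) for pi in p]
    size = math.factorial(N)             # |T_N(p)| = N! / prod_b counts[b]!
    for c in counts:
        size //= math.factorial(c)
    prob = size * math.prod(qi**c for qi, c in zip(q, counts))
    print(f"N={N}: -(1/N) log2 Q{{T_N(p)}} = {-math.log2(prob)/N:.4f},  D(p||q) = {D:.4f}")
```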
6. Large Deviation Analysis via Types
Definition: $\mathcal{R}=\left\{\mathbf{y} \in \mathcal{Y}^{N}: \hat{p}(\cdot ; \mathbf{y}) \in \mathcal{S} \cap \mathcal{P}_{N}^{\mathcal{Y}}\right\}$, i.e. the set of sequences whose type lies in a given set of distributions $\mathcal{S}$.
Sanov's Theorem:
$$
Q\{\mathcal{R}\} \leq(N+1)^{|\mathcal{Y}|}\, 2^{-N D\left(p_{*} \| q\right)}, \qquad
Q\{\mathcal{R}\} \,\dot{\leq}\, 2^{-N D\left(p_{*} \| q\right)}, \qquad
p_{*}=\underset{p \in \mathcal{S}}{\arg \min }\, D(p \| q)
$$
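A brute-force illustration of Sanov's theorem (my own sketch; the alphabet, $q$, the value function $v$, and $\gamma$ are all toy assumptions). It enumerates every ternary type of length $N$, keeps those in $\mathcal{S}=\{p:\mathbb{E}_p[v]\ge\gamma\}$, sums their exact $Q$-probabilities, and compares the decay rate with the smallest $D(p\|q)$ among those types:

```python
# Sanov's theorem by exhaustive enumeration of types on a 3-letter alphabet.
import math

q = (0.5, 0.3, 0.2)
v = (0.0, 1.0, 2.0)          # value v(b) of each symbol; E_q[v] = 0.7
gamma = 1.0                  # S = {p : E_p[v] >= 1.0} is a rare event under q

def log2_type_prob(counts, N):
    """log2 Q{T_N(p)} for the type with the given counts."""
    log2_mult = (math.lgamma(N + 1) - sum(math.lgamma(c + 1) for c in counts)) / math.log(2)
    return log2_mult + sum(c * math.log2(qi) for c, qi in zip(counts, q))

for N in (50, 200, 400):
    total, min_D = 0.0, float("inf")
    for c0 in range(N + 1):
        for c1 in range(N + 1 - c0):
            counts = (c0, c1, N - c0 - c1)
            if sum(c * vi for c, vi in zip(counts, v)) / N < gamma:
                continue                                # type not in S
            total += 2.0 ** log2_type_prob(counts, N)   # exact Q{T_N(p)}
            D = sum(c / N * math.log2(c / (N * qi)) for c, qi in zip(counts, q) if c > 0)
            min_D = min(min_D, D)
    print(f"N={N}: -(1/N) log2 Q{{R}} = {-math.log2(total)/N:.4f},  min D(p||q) over S = {min_D:.4f}")
```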
7. Asymptotics of hypothesis testing
LRT: $L(\boldsymbol{y})=\frac{1}{N} \log \frac{p_{1}^{N}(\boldsymbol{y})}{p_{0}^{N}(\boldsymbol{y})}=\frac{1}{N} \sum_{n=1}^{N} \log \frac{p_{1}\left(y_{n}\right)}{p_{0}\left(y_{n}\right)} \gtrless \gamma$
Writing $t_{n} \triangleq \log \frac{p_{1}(y_{n})}{p_{0}(y_{n})}$ and letting $p_{0}^{\prime}$, $p_{1}^{\prime}$ denote the distributions of $t_n$ under $H_0$ and $H_1$, the error exponents are
$$
P_{F}=\mathbb{P}_{0}\left\{\frac{1}{N} \sum_{n=1}^{N} t_{n} \geq \gamma\right\} \approx 2^{-N D\left(p^{*} \| p_{0}^{\prime}\right)}, \qquad
P_{M}=1-P_{D} \approx 2^{-N D\left(p^{*} \| p_{1}^{\prime}\right)}
$$
where $p^{*}$ is the dominant (tilted) distribution satisfying $\mathbb{E}_{p^{*}}[t]=\gamma$. Moreover,
$$
D\left(p^{*} \| p_{0}^{\prime}\right)-D\left(p^{*} \| p_{1}^{\prime}\right)=\int p^{*}(t) \log \frac{p_{1}^{\prime}(t)}{p_{0}^{\prime}(t)} \,\mathrm{d} t=\int p^{*}(t)\, t \,\mathrm{d} t=\mathbb{E}_{p^{*}}[t]=\gamma
$$
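A sketch of how these exponents can be computed for a concrete binary test (my own example; $p_0$, $p_1$, and $\gamma$ are assumptions, and the calculation works directly on the observation alphabet rather than on the distribution of $t$). The dominant distribution is the tilted mixture $p_s\propto p_0^{1-s}p_1^{s}$ with $\mathbb{E}_{p_s}[t]=\gamma$, found here by bisection; the last line checks the identity $D(p^*\|p_0)-D(p^*\|p_1)=\gamma$ derived above.

```python
# Error exponents of the LRT between two pmfs on {0,1} via the tilted family
# p_s ∝ p0^{1-s} p1^s. Log base 2 throughout.
import math

p0, p1 = (0.8, 0.2), (0.4, 0.6)                      # H0 and H1 pmfs (toy choice)
t = [math.log2(b / a) for a, b in zip(p0, p1)]       # per-sample LLR values t(y)
gamma = 0.0                                          # LRT threshold

def tilt(s):
    w = [a ** (1 - s) * b ** s for a, b in zip(p0, p1)]
    z = sum(w)
    return [wi / z for wi in w]

def mean_t(s):                                       # E_{p_s}[t], increasing in s
    return sum(pi * ti for pi, ti in zip(tilt(s), t))

lo, hi = 0.0, 1.0                                    # bisection for E_{p_s}[t] = gamma
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if mean_t(mid) < gamma else (lo, mid)
p_star = tilt((lo + hi) / 2)

def kl(a, b):
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

EF, EM = kl(p_star, p0), kl(p_star, p1)
print(f"P_F exponent D(p*||p0) = {EF:.4f},  P_M exponent D(p*||p1) = {EM:.4f}")
print(f"D(p*||p0) - D(p*||p1) = {EF - EM:.4f}  (= gamma = {gamma})")
```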
8. Asymptotics of parameter estimation
Strong Law of Large Numbers (SLLN):
$$
\mathbb{P}\left(\lim _{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^{N} w_{n}=\mu\right)=1
$$
Central Limit Theorem (CLT):
$$
\lim _{N \rightarrow \infty} \mathbb{P}\left(\frac{1}{\sqrt{N}} \sum_{n=1}^{N}\frac{w_{n}-\mu}{\sigma} \leq b\right)=\Phi(b)
$$
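A quick simulation sketch of the CLT statement (Uniform(0,1) samples are an arbitrary assumption): the empirical CDF of the standardized sum approaches $\Phi$.

```python
# Empirical CDF of the standardized sum vs. the standard normal CDF Phi.
import math
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, math.sqrt(1 / 12)        # mean and std of Uniform(0,1)
N, trials = 500, 20_000
w = rng.random((trials, N))
z = (w - mu).sum(axis=1) / (sigma * math.sqrt(N))     # standardized sums

Phi = lambda b: 0.5 * (1 + math.erf(b / math.sqrt(2)))
for b in (-1.0, 0.0, 1.0, 2.0):
    print(f"b={b:+.1f}: empirical {np.mean(z <= b):.4f}  vs  Phi(b) = {Phi(b):.4f}")
```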
The following three modes of convergence are listed in decreasing order of strength:
- Convergence with probability 1 (SLLN): $\mathsf{x}_{N} \stackrel{w.p.1}{\longrightarrow} a$
- Convergence in probability (WLLN): $\mathsf{x}_{N} \stackrel{p}{\longrightarrow} a$
- Convergence in distribution: $\mathsf{x}_{N} \stackrel{d}{\longrightarrow} p$