1.最大熵分布
函数:从数域到数域的映射
y
=
f
(
x
)
y=f(x)
y=f(x)
泛函:从函数(即向量空间)到数域的映射
z
=
h
(
f
)
z=h(f)
z=h(f)
算子:函数到函数的映射
g
(
x
)
=
∇
x
f
(
x
)
g(x)=\nabla_xf(x)
g(x)=∇xf(x)
凸集:对于集合内的每一对点,连接该对点的直线段上的每个点也在该集合内
凸函数:定义域为凸集
f
(
x
)
f(x)
f(x),
θ
=
z
−
x
y
−
x
,
θ
∈
[
0
,
1
]
\theta=\frac{z-x}{y-x},\theta\in [0,1]
θ=y−xz−x,θ∈[0,1],有
f
(
z
)
≤
f
(
x
)
+
θ
(
f
(
y
)
−
f
(
x
)
)
f(z)\le f(x)+\theta (f(y)-f(x))
f(z)≤f(x)+θ(f(y)−f(x))
对于概率密度函数
f
(
x
)
f(x)
f(x),支撑集
S
S
S(使得
f
(
x
)
>
0
f(x)>0
f(x)>0的所有的集合x)
熵
h
(
f
)
=
−
∫
f
ln
f
h(f)=-\int f\ln\ f
h(f)=−∫fln f是定义在凸集上的一个凹函数
熵最大时的概率密度表达式:
f
=
a
r
g
m
a
x
f
h
(
f
)
s
.
t
.
f
(
x
)
≥
0
∫
S
f
(
x
)
d
x
=
1
∫
S
f
(
x
)
r
i
(
x
)
d
x
=
α
i
f
o
r
a
l
l
1
≤
i
≤
m
(1)
\begin{aligned} &f=\underset{f}{argmax}\ h(f)\\ &s.t.\ f(x)\ge 0\\ &\quad\int_Sf(x)dx=1\\ &\quad\int_Sf(x)r_i(x)dx=\alpha_i\ for\ all\ 1\le i\le m \end{aligned}\tag{1}
f=fargmax h(f)s.t. f(x)≥0∫Sf(x)dx=1∫Sf(x)ri(x)dx=αi for all 1≤i≤m(1)
我们构造以下泛函
J
(
f
)
=
∫
−
f
ln
f
+
λ
0
f
+
∑
i
=
1
m
f
r
i
(2)
J(f)=\int -f\ln f+\lambda_0 f+\sum_{i=1}^m fr_i \tag{2}
J(f)=∫−flnf+λ0f+i=1∑mfri(2)
由Euler-Lagrange Equation:
泛函极值存在的必要条件为:证明过程
于是
J
(
f
)
J(f)
J(f)取极值的条件为
∂
(
−
f
ln
f
+
λ
0
f
+
∑
i
=
1
m
f
r
i
)
∂
(
f
)
=
−
ln
f
−
1
+
λ
0
+
∑
i
=
1
m
r
i
=
0
\frac{\partial{(-f\ln f+\lambda_0 f+\sum_{i=1}^m fr_i)}}{\partial(f)}=-\ln f-1+\lambda_0+\sum_{i=1}^m r_i=0
∂(f)∂(−flnf+λ0f+∑i=1mfri)=−lnf−1+λ0+i=1∑mri=0
f
∗
(
x
)
=
e
λ
0
−
1
+
∑
i
=
1
m
r
i
(3)
f^*(x)=e^{\lambda_0-1+\sum_{i=1}^m r_i}\tag{3}
f∗(x)=eλ0−1+∑i=1mri(3)
证明
f
(
x
)
f(x)
f(x)是极大值(即
∀
g
\forall g
∀g满足约束条件,对于
f
∗
(
x
)
=
e
λ
0
+
∑
i
=
1
m
r
i
f^*(x)=e^{\lambda_0+\sum_{i=1}^m r_i}
f∗(x)=eλ0+∑i=1mri,有
h
(
g
)
≤
h
(
f
∗
)
h(g)\le h(f^*)
h(g)≤h(f∗)):
1.泛函
F
(
f
)
=
∫
x
1
x
2
L
(
x
,
f
(
x
)
,
f
′
(
x
)
)
d
x
\mathcal{F}(f)=\int_{x_1}^{x_2}\mathcal{L}(x,f(x),f'(x))dx
F(f)=∫x1x2L(x,f(x),f′(x))dx的二阶变分
>
0
>0
>0则取极小值,反之取极大值
2.使用信息不等式
D
(
g
∣
∣
f
)
≥
0
D(g||f)\ge 0
D(g∣∣f)≥0
h
(
g
)
=
−
∫
S
g
ln
g
=
−
∫
S
g
ln
g
f
∗
f
∗
=
−
∫
S
g
ln
g
f
∗
−
∫
S
g
ln
f
∗
=
−
D
(
g
∣
∣
f
∗
)
−
∫
S
g
ln
f
∗
≤
−
∫
S
g
ln
f
∗
=
−
∫
S
g
(
λ
0
+
∑
λ
i
r
i
)
=
−
∫
S
f
∗
(
λ
0
+
∑
λ
i
r
i
)
=
−
∫
S
f
∗
ln
f
∗
(4)
\begin{aligned} &h(g)=-\int_Sg\ln g=-\int_Sg\ln \frac{g}{f^*}f^*=-\int_Sg\ln \frac{g}{f^*}-\int_Sg\ln f^*=-D(g||f^*)-\int_Sg\ln f^*\\ &\qquad \le -\int_Sg\ln f^*= -\int_Sg(\lambda_0+\sum\lambda_ir_i)= -\int_Sf^*(\lambda_0+\sum\lambda_ir_i)\\ &\qquad = -\int_Sf^*\ln f^*\end{aligned}\tag{4}
h(g)=−∫Sglng=−∫Sglnf∗gf∗=−∫Sglnf∗g−∫Sglnf∗=−D(g∣∣f∗)−∫Sglnf∗≤−∫Sglnf∗=−∫Sg(λ0+∑λiri)=−∫Sf∗(λ0+∑λiri)=−∫Sf∗lnf∗(4)
(4)中
∀
g
\forall g
∀g满足约束条件,因此
g
g
g换成
f
∗
f^*
f∗不影响结果。
2.应用
2.1.例1
对于半径为
r
r
r的二维球面上的一些点,已知这些点的均值为
[
μ
x
,
μ
y
]
T
[\mu_x,\mu_y]^T
[μx,μy]T,求这些点的最大熵分布。
已知点的约束为
L
:
x
2
+
y
2
=
r
2
L: x^2+y^2=r^2
L:x2+y2=r2
设最大熵概率分布概率密度函数为
f
(
x
,
y
)
f(x,y)
f(x,y)
我们有
∮
L
f
(
x
,
y
)
d
s
=
1
,
∮
L
x
f
(
x
,
y
)
d
s
=
μ
x
,
∮
L
y
f
(
x
,
y
)
d
s
=
μ
y
\oint_L f(x,y)ds=1,\oint_L xf(x,y)ds=\mu_x,\oint_L yf(x,y)ds=\mu_y
∮Lf(x,y)ds=1,∮Lxf(x,y)ds=μx,∮Lyf(x,y)ds=μy
令
x
=
cos
θ
,
y
=
sin
θ
x=\cos\theta,y=\sin\theta
x=cosθ,y=sinθ
则有
∫
f
(
θ
)
d
θ
=
1
,
∫
cos
θ
f
(
θ
)
d
θ
=
μ
x
,
∫
sin
θ
f
(
θ
)
d
θ
=
μ
y
\int f(\theta)d\theta=1,\int \cos\theta f(\theta)d\theta =\mu_x,\int \sin\theta f(\theta)d\theta=\mu_y
∫f(θ)dθ=1,∫cosθf(θ)dθ=μx,∫sinθf(θ)dθ=μy
因此,
r
1
=
cos
θ
,
r
2
=
sin
θ
r_1=\cos\theta,r_2=\sin\theta
r1=cosθ,r2=sinθ
因此,
f
(
θ
)
=
e
λ
0
−
1
+
λ
1
cos
θ
+
λ
2
sin
θ
=
C
e
λ
1
cos
θ
+
λ
2
sin
θ
f(\theta)=e^{\lambda_0-1+\lambda_1\cos\theta+\lambda_2\sin\theta}=Ce^{\lambda_1\cos\theta+\lambda_2\sin\theta}
f(θ)=eλ0−1+λ1cosθ+λ2sinθ=Ceλ1cosθ+λ2sinθ
因此,令
x
=
[
λ
1
λ
2
C
]
x=\begin{bmatrix}\lambda_1\\\lambda_2\\ C\end{bmatrix}
x=⎣⎡λ1λ2C⎦⎤
我们有
F
(
x
)
=
[
f
1
(
x
)
f
2
(
x
)
f
3
(
x
)
]
=
[
C
∫
e
λ
1
cos
θ
+
λ
2
sin
θ
d
θ
−
1
C
∫
cos
θ
e
λ
1
cos
θ
+
λ
2
sin
θ
d
θ
−
μ
x
C
∫
sin
θ
e
λ
1
cos
θ
+
λ
2
sin
θ
d
θ
−
μ
y
]
(5)
\begin{aligned} F(x)=\begin{bmatrix}f_1(x)\\f_2(x)\\f_3(x)\end{bmatrix}=\begin{bmatrix}C\int e^{\lambda_1\cos\theta+\lambda_2\sin\theta}d\theta-1 \\C\int \cos\theta e^{\lambda_1\cos\theta+\lambda_2\sin\theta}d\theta-\mu_x \\C\int \sin\theta e^{\lambda_1\cos\theta+\lambda_2\sin\theta}d\theta-\mu_y \end{bmatrix} \end{aligned}\tag{5}
F(x)=⎣⎡f1(x)f2(x)f3(x)⎦⎤=⎣⎡C∫eλ1cosθ+λ2sinθdθ−1C∫cosθeλ1cosθ+λ2sinθdθ−μxC∫sinθeλ1cosθ+λ2sinθdθ−μy⎦⎤(5)
要解
F
(
x
)
=
0
F(x)=0
F(x)=0采用最小二乘法,令
L
(
x
)
=
∣
∣
F
(
x
)
∣
∣
2
2
+
λ
∣
∣
x
∣
∣
2
2
L(x)=||F(x)||_2^2+\lambda ||x||_2^2
L(x)=∣∣F(x)∣∣22+λ∣∣x∣∣22
则
∇
x
L
=
∂
L
∂
x
=
J
T
(
F
(
x
)
)
F
(
x
)
+
λ
x
\nabla_x L=\frac{\partial L}{\partial x}=J^T(F(x))F(x)+\lambda x
∇xL=∂x∂L=JT(F(x))F(x)+λx
当
μ
x
=
0.17
,
μ
y
=
0.07
\mu_x=0.17,\mu_y=0.07
μx=0.17,μy=0.07时:
采用梯度下降法求得近似解为:
x
=
[
0.367692464472683
1.19936458109945
0.145459766239255
]
x=\begin{bmatrix}0.367692464472683\\1.19936458109945\\0.145459766239255\end{bmatrix}
x=⎣⎡0.3676924644726831.199364581099450.145459766239255⎦⎤
得到的概率密度函数用三维图像表示为(底部为x,y轴,竖轴表示概率密度)
2.2.例2:求约束条件为 E ( X ) = 0 , E ( X 2 ) = σ 2 E(X)=0,E(X^2)=\sigma^2 E(X)=0,E(X2)=σ2的最大熵分布
E
(
X
)
=
0
=
>
r
1
=
x
E
(
X
2
)
=
σ
2
=
>
r
2
=
x
2
(6)
\begin{aligned}E(X)=0=>r_1=x\\E(X^2)=\sigma^2=>r_2=x^2\end{aligned}\tag{6}
E(X)=0=>r1=xE(X2)=σ2=>r2=x2(6)
(3)(6)联立可得概率密度函数
f
(
x
)
=
c
e
λ
1
x
+
λ
2
x
2
(7)
\begin{aligned}&f(x)=ce^{\lambda_1x+\lambda_2x^2}\end{aligned}\tag{7}
f(x)=ceλ1x+λ2x2(7)
根据KKT条件
1.
∫
−
∞
+
∞
f
(
x
)
=
1
2.
∫
−
∞
+
∞
x
f
(
x
)
=
0
3.
∫
−
∞
+
∞
x
2
f
(
x
)
=
σ
2
(8)
\begin{aligned} &1.\int_{-\infty}^{+\infty}f(x)=1\\ &2.\int_{-\infty}^{+\infty}xf(x)=0\\ &3.\int_{-\infty}^{+\infty}x^2f(x)=\sigma^2\end{aligned}\tag{8}
1.∫−∞+∞f(x)=12.∫−∞+∞xf(x)=03.∫−∞+∞x2f(x)=σ2(8)
由条件2.得
f
(
x
)
f(x)
f(x)是偶函数因此
λ
1
=
0
\lambda_1=0
λ1=0
因此KKT条件变成
1.
∫
−
∞
+
∞
e
λ
2
x
2
d
x
=
1
c
2.
∫
−
∞
+
∞
x
2
e
λ
2
x
2
d
x
=
σ
2
c
(9)
\begin{aligned} &1.\int_{-\infty}^{+\infty}e^{\lambda_2x^2}dx=\frac{1}{c} \\ &2.\int_{-\infty}^{+\infty}x^2e^{\lambda_2x^2}dx=\frac{\sigma^2}{c}\end{aligned}\tag{9}
1.∫−∞+∞eλ2x2dx=c12.∫−∞+∞x2eλ2x2dx=cσ2(9)
已知
∫
−
∞
+
∞
e
−
x
2
d
x
=
π
\int_{-\infty}^{+\infty}e^{-x^2}dx=\sqrt{\pi}
∫−∞+∞e−x2dx=π
,则
∫
−
∞
+
∞
e
−
(
−
λ
2
x
)
2
d
x
=
∫
−
∞
+
∞
1
−
λ
2
e
−
t
2
d
t
=
1
c
=
>
∫
−
∞
+
∞
e
−
t
2
d
t
=
−
λ
2
c
=
π
(10)
\int_{-\infty}^{+\infty}e^{-(\sqrt{-\lambda_2}x)^2}dx=\int_{-\infty}^{+\infty}\frac{1}{\sqrt{-\lambda_2}}e^{-t^2}dt=\frac{1}{c}=>\int_{-\infty}^{+\infty}e^{-t^2}dt=\frac{\sqrt{-\lambda_2}}{c}=\sqrt{\pi}\tag{10}
∫−∞+∞e−(−λ2
x)2dx=∫−∞+∞−λ2
1e−t2dt=c1=>∫−∞+∞e−t2dt=c−λ2
=π
(10)
∫
−
∞
+
∞
x
2
e
λ
2
x
2
d
x
=
∫
−
∞
+
∞
x
2
e
−
(
−
λ
2
x
)
2
d
x
=
∫
−
∞
+
∞
−
t
2
λ
2
e
−
t
2
1
−
λ
2
d
t
=
σ
2
c
=
>
∫
−
∞
+
∞
t
2
e
−
t
2
d
t
=
−
σ
2
λ
2
−
λ
2
c
(11)
\int_{-\infty}^{+\infty}x^2e^{\lambda_2x^2}dx=\int_{-\infty}^{+\infty}x^2e^{-(\sqrt{-\lambda_2}x)^2}dx=\int_{-\infty}^{+\infty}-\frac{t^2}{\lambda_2}e^{-t^2}\frac{1}{\sqrt{-\lambda_2}}dt= \frac{\sigma^2}{c}=>\int_{-\infty}^{+\infty}t^2e^{-t^2}dt=- \frac{\sigma^2\lambda_2\sqrt{-\lambda_2}}{c}\tag{11}
∫−∞+∞x2eλ2x2dx=∫−∞+∞x2e−(−λ2
x)2dx=∫−∞+∞−λ2t2e−t2−λ2
1dt=cσ2=>∫−∞+∞t2e−t2dt=−cσ2λ2−λ2
(11)
对(11)分部积分得:
∫
−
∞
+
∞
t
2
e
−
t
2
d
t
=
−
t
2
e
−
t
2
∣
−
∞
+
∞
+
1
2
∫
−
∞
+
∞
e
−
t
2
d
t
=
π
2
(12)
\int_{-\infty}^{+\infty}t^2e^{-t^2}dt=-\frac{t}{2}e^{-t^2}|_{-\infty}^{+\infty}+\frac{1}{2}\int_{-\infty}^{+\infty}e^{-t^2}dt=\frac{\sqrt{\pi}}{2}\tag{12}
∫−∞+∞t2e−t2dt=−2te−t2∣−∞+∞+21∫−∞+∞e−t2dt=2π
(12)
∴
−
σ
2
λ
2
−
λ
2
c
=
π
2
,
−
λ
2
c
=
π
∴
λ
2
=
−
1
2
σ
2
,
c
=
1
2
π
σ
\therefore - \frac{\sigma^2\lambda_2\sqrt{-\lambda_2}}{c} =\frac{\sqrt{\pi}}{2},\frac{\sqrt{-\lambda_2}}{c}=\sqrt{\pi}\therefore \lambda_2=-\frac{1}{2\sigma^2},c=\frac{1}{\sqrt{2\pi}\sigma}
∴−cσ2λ2−λ2
=2π
,c−λ2
=π
∴λ2=−2σ21,c=2π
σ1
∴
f
(
x
)
=
1
2
π
σ
e
−
x
2
2
σ
2
(13)
\therefore f(x)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{x^2}{2\sigma^2}}\tag{13}
∴f(x)=2π
σ1e−2σ2x2(13)
2.3.例3:求约束条件为 X > 0 , E ( X ) = 1 μ X>0,E(X)=\frac{1}{\mu} X>0,E(X)=μ1的最大熵分布
X
>
0
,
E
(
X
)
=
1
μ
=
>
r
1
=
x
=
>
f
(
x
)
=
c
e
λ
1
x
(14)
X>0,E(X)=\frac{1}{\mu}=>r_1=x=>f(x)=ce^{\lambda_1x}\tag{14}
X>0,E(X)=μ1=>r1=x=>f(x)=ceλ1x(14)
再根据KKT条件:
1.
∫
0
+
∞
e
λ
1
x
d
x
=
1
c
2.
∫
0
+
∞
x
e
λ
1
x
d
x
=
1
c
μ
(15)
1.\int_0^{+\infty}e^{\lambda_1x}dx=\frac{1}{c} \\2.\int_0^{+\infty}xe^{\lambda_1x}dx=\frac{1}{c\mu}\tag{15}
1.∫0+∞eλ1xdx=c12.∫0+∞xeλ1xdx=cμ1(15)要想KTT条件成立则
λ
1
≤
0
\lambda_1\le 0
λ1≤0
∫
0
+
∞
e
λ
1
x
d
x
=
1
λ
1
e
λ
1
x
∣
0
+
∞
=
−
1
λ
1
=
1
c
=
>
λ
1
=
−
c
(16)
\int_0^{+\infty}e^{\lambda_1x}dx=\frac{1}{\lambda_1}e^{\lambda_1x}|_0^{+\infty}=-\frac{1}{\lambda_1}=\frac{1}{c}=>\lambda_1=-c\tag{16}
∫0+∞eλ1xdx=λ11eλ1x∣0+∞=−λ11=c1=>λ1=−c(16)
∫
0
+
∞
x
e
λ
1
x
d
x
=
x
λ
1
e
λ
1
x
∣
0
+
∞
−
∫
0
+
∞
1
λ
1
e
λ
1
x
d
x
=
−
1
λ
1
2
e
λ
1
x
∣
0
+
∞
=
1
λ
1
2
=
1
c
μ
=
>
c
=
μ
,
λ
1
=
−
μ
(17)
\int_0^{+\infty}xe^{\lambda_1x}dx=\frac{x}{\lambda_1}e^{\lambda_1x}|_0^{+\infty}-\int_0^{+\infty}\frac{1}{\lambda_1}e^{\lambda_1x}dx\\=-\frac{1}{\lambda_1^2}e^{\lambda_1x}|_0^{+\infty} \\=\frac{1}{\lambda_1^2}=\frac{1}{c\mu}\\=>c=\mu,\lambda_1=-\mu\tag{17}
∫0+∞xeλ1xdx=λ1xeλ1x∣0+∞−∫0+∞λ11eλ1xdx=−λ121eλ1x∣0+∞=λ121=cμ1=>c=μ,λ1=−μ(17)
f
(
x
)
=
μ
e
−
μ
x
(18)
f(x)=\mu e^{-\mu x}\tag{18}
f(x)=μe−μx(18)
2.4.例4:求约束条件为 X ∈ [ a , b ] , a ≠ b X\in[a,b],a\neq b X∈[a,b],a=b的最大熵分布
f
(
x
)
=
c
e
−
0
x
=
c
(19)
f(x)=ce^{-0x}=c\tag{19}
f(x)=ce−0x=c(19)
再根据KKT条件
∫
a
b
f
(
x
)
d
x
=
∫
a
b
c
d
x
=
1
=
>
c
=
1
b
−
a
(20)
\int_{a}^{b}f(x)dx=\int_{a}^{b}c\ dx=1=>c=\frac{1}{b-a}\tag{20}
∫abf(x)dx=∫abc dx=1=>c=b−a1(20)
因此
f
(
x
)
=
1
b
−
a
(21)
f(x)=\frac{1}{b-a}\tag{21}
f(x)=b−a1(21)