【Inference】变分推断以及VIEM

在包含隐变量(latent variables)的推断问题中,针对连续性随机变量的情况,隐变量的高维以及被积函数(intergrand)的复杂度使积分(intergration)无法进行。而针对离散型随机变量,隐变量呈指数(exponentially)增长的隐状态使得精确计算的花费过高(prohibitively)。因此人们有了近似推断的想法。主要包括随机近似(stochastic)和确定性近似(deterministic)。变分推断就是确定性近似的一种。

文章目录

1 变分推断的基本思想

假设模型是联合概率分布 p ( x , z ) p(x,z) p(x,z),其中 x x x是观测变量, z z z是隐变量,包括参数。目标是学习模型的后验概率分布 p ( z ∣ x ) p(z|x) p(z∣x),用模型进行概率推理。但这是一个复杂的分布,直接估计分布的参数很困难。

所以考虑用概率分布 q ( z ) q(z) q(z)近似条件概率分布 p ( z ∣ x ) p(z|x) p(z∣x),用 K L KL KL散度 D ( q ( z ) ∣ ∣ p ( z ∣ x ) ) D(q(z)||p(z|x)) D(q(z)∣∣p(z∣x))计算两者的相似度, q ( z ) q(z) q(z)称为变分分布(variational distribution) 。如果能找到与 p ( z ∣ x ) p(z|x) p(z∣x)在 K L KL KL散度意义下最近的分布 q ∗ ( z ) q^{*}(z) q∗(z),则可以用这个分布近似 p ( z ∣ x ) p(z|x) p(z∣x)。

2 KL 散度以及问题转化

K L KL KL散度是用于比较两个分布之间的相似度的方法,分布越相似, K L KL KL散度越小。所以在变分推断中,我们目标就是让变分分布与目标分布的 K L KL KL散度最小。

K L KL KL d i v e r g e n c e divergence divergence可以写作:
D ( q ( z ) ∣ ∣ p ( z ∣ x ) ) = E q ( z ) [ l o g   q ( z ) ] − E q ( z ) [ l o g   p ( z ∣ x ) ] = E q ( z ) [ l o g   q ( z ) ] − E q ( z ) [ l o g   p ( x , z ) p ( x ) ] = E q ( z ) [ l o g   q ( z ) ] − E q ( z ) [ l o g   p ( x , z ) ] + l o g   p ( x ) = l o g   p ( x ) − ( E q ( z ) [ l o g   p ( x , z ) ] − E q ( z ) [ l o g   q ( z ) ] ) ( 1.1 ) \begin{aligned} D\left ( q\left ( z \right )||p\left ( z|x \right ) \right ) &= E_{q(z)}\left [ log\, q\left ( z\right ) \right ] - E_{q(z)}\left [ log\, p\left ( z|x \right ) \right ] \\ &= E_{q(z)}\left [ log\, q\left ( z\right ) \right ] - E_{q(z)}\left [ log\, \frac{p\left ( x,z \right )}{p\left ( x \right )} \right ]\\ &= E_{q(z)}\left [ log\, q\left ( z\right ) \right ] - E_{q(z)}\left [ log\, p\left ( x,z\right ) \right ] + log \, p(x) \\ &= log \, p(x) - \left ( E_{q(z)}\left [ log\, p\left ( x,z\right ) \right ]- E_{q(z)}\left [ log\, q\left ( z\right ) \right ] \right ) (1.1) \end{aligned} D(q(z)∣∣p(z∣x))​=Eq(z)​[logq(z)]−Eq(z)​[logp(z∣x)]=Eq(z)​[logq(z)]−Eq(z)​[logp(x)p(x,z)​]=Eq(z)​[logq(z)]−Eq(z)​[logp(x,z)]+logp(x)=logp(x)−(Eq(z)​[logp(x,z)]−Eq(z)​[logq(z)])(1.1)​
其中 E q ( z ) [ l o g   p ( x , z ) p ( x ) ] = E q ( z ) [ l o g   p ( x , z ) − l o g   p ( x ) ] E_{q(z)}\left [ log\, \frac{p\left ( x,z \right )}{p\left ( x \right )} \right ]=E_{q(z)}\left [ log\, p\left ( x,z \right )-log \,p\left ( x \right ) \right ] Eq(z)​[logp(x)p(x,z)​]=Eq(z)​[logp(x,z)−logp(x)],因为 l o g   p ( x ) log \,p\left ( x \right ) logp(x)与 q ( z ) q(z) q(z)无关,所以直接期望等于自身。

因为 K L KL KL散度大于等于零,当且仅当两个分布一-致时为零,所以任意情况下 l o g   p ( x ) ≥ ( E q ( z ) [ l o g   p ( x , z ) ] − E q ( z ) [ l o g   q ( z ) ] ) = L ( q ) ( 1.2 ) log \, p(x) \ge \left ( E_{q(z)}\left [ log\, p\left ( x,z\right ) \right ]- E_{q(z)}\left [ log\, q\left ( z\right ) \right ] \right )=L(q)(1.2) logp(x)≥(Eq(z)​[logp(x,z)]−Eq(z)​[logq(z)])=L(q)(1.2)

我们可以理解为不等式右端是左端的下界,左端是右端的上界。只要让右端无限增大接近于左端,那么 K L KL KL散度就越接近于0。所以我们的目标就从找到 K L KL KL散度的最小值转化为求 L ( q ) L(q) L(q)的最大值。

补充:公式1.2中,左端一般称为证据(evidence),右端称为证据下界(evidence lower bound),简写为ELBO。根据公式1.1和公式1.2,易得
l o g   p ( x ) = E L B O + K L ( q ( z ) ∣ ∣ l o g   p ( z ∣ x ) ) = ( E q ( z ) [ l o g   p ( x , z ) ] − E q ( z ) [ l o g   q ( z ) ] ) + E q ( z ) [ l o g   q ( z ) ] − E q ( z ) [ l o g   p ( z ∣ x ) ] = E q ( z ) [ l o g   p ( x , z ) l o g   q ( z ) ] − E q ( z ) [ l o g   p ( z ∣ x ) l o g   q ( z ) ] = ∫ z q ( z ) ∗ l o g   p ( x , z ) q ( z ) d z + ∫ z q ( z ) ∗ l o g   p ( z ∣ x ) q ( z ) d z ( 1.3 ) \begin{aligned} log\, p(x) &= ELBO+KL(q(z)||log\,p(z|x) )\\ &= \left ( E_{q(z)}\left [ log\, p\left ( x,z\right ) \right ]- E_{q(z)}\left [ log\, q\left ( z\right ) \right ] \right )+E_{q(z)}\left [ log\, q\left ( z\right ) \right ] - E_{q(z)}\left [ log\, p\left ( z|x \right ) \right ] \\ &=E_{q(z)}\left [ \frac{log\, p\left ( x,z\right ) }{log\, q\left ( z\right ) } \right ]-E_{q(z)}\left [ \frac{log\, p\left ( z|x\right ) }{log\, q\left ( z\right ) } \right ]\\ &=\int _{z}q(z)*log\, \frac{p(x,z)}{q(z)} dz+\int _{z}q(z)*log\, \frac{p(z|x)}{q(z)} dz(1.3) \end{aligned} logp(x)​=ELBO+KL(q(z)∣∣logp(z∣x))=(Eq(z)​[logp(x,z)]−Eq(z)​[logq(z)])+Eq(z)​[logq(z)]−Eq(z)​[logp(z∣x)]=Eq(z)​[logq(z)logp(x,z)​]−Eq(z)​[logq(z)logp(z∣x)​]=∫z​q(z)∗logq(z)p(x,z)​dz+∫z​q(z)∗logq(z)p(z∣x)​dz(1.3)​
第一项就是 E L B O ELBO ELBO。

3 平均场理论下变分推断的假设

在求解证据下界时,我们会引入平均场理论(mean field theroy)。

因为变分分布是我们自己找的一个简单的,易计算的分布。所以通常假设q(z)对z的所有分量都是互相独立的(实际是条件独立于参数),即满足
q ( z ) = q ( z 1 ) q ( z 2 ) . . . q ( z n ) = ∏ n = 1 M q ( z M ) ( 1.4 ) q(z) = q(z_{1})q(z_{2})...q(z_{n})=\prod_{n=1}^{M} q(z_{M})(1.4) q(z)=q(z1​)q(z2​)...q(zn​)=n=1∏M​q(zM​)(1.4)
这时的变分分布称为平均场。 K L KL KL散度的最小化或证据下界最大化实际是在平均场的集合进行的。

接下来推导平均场理论下公式1.3依旧成立。

平均场理论下 L ( q ) = E L B O = ∫ z q ( z ) ∗ l o g   p ( x , z ) q ( z ) d z = ∫ z q ( z ) ∗ l o g   p ( x , z ) d z − ∫ z q ( z ) ∗ l o g   q ( z ) d z = ∫ z 1 , z 2 , . . . , z M ∏ i = 1 M q i ( z i ) ∗ l o g   p ( x , z ) d z 1   d z 2 . . . d z M ( 第 一 项 ① ) − ∫ z 1 , z 2 , . . . , z M ∏ i = 1 M q i ( z i ) ∗ ∑ i = 1 M l o g   q i ( z i ) d z 1   d z 2 . . . d z M ( 第 二 项 ② ) \begin{aligned} L(q) &= ELBO =\int _{z}q(z)*log\, \frac{p(x,z)}{q(z)} dz\\ &=\int _{z}q(z)*log\, p(x,z) dz-\int _{z}q(z)*log\, q(z) dz\\ &=\int _{z_{1},z_{2},...,z_{M}}\prod_{i=1}^{M} q_{i}(z_{i})*log\, p(x,z) dz_{1}\,dz_{2}...dz_{M}(第一项①)\\ &- \int _{z_{1},z_{2},...,z_{M}}\prod_{i=1}^{M} q_{i}(z_{i})*\sum_{i=1}^{M} log\, q_{i}(z_{i}) dz_{1}\,dz_{2}...dz_{M}(第二项②)\\ \end{aligned} L(q)​=ELBO=∫z​q(z)∗logq(z)p(x,z)​dz=∫z​q(z)∗logp(x,z)dz−∫z​q(z)∗logq(z)dz=∫z1​,z2​,...,zM​​i=1∏M​qi​(zi​)∗logp(x,z)dz1​dz2​...dzM​(第一项①)−∫z1​,z2​,...,zM​​i=1∏M​qi​(zi​)∗i=1∑M​logqi​(zi​)dz1​dz2​...dzM​(第二项②)​
下面推导中,我们假设 q j q_{j} qj​以外的平均场分量( q 1 , q 2 , . . . , q M q_{1},q_{2},...,q_{M} q1​,q2​,...,qM​)均固定(其实 q j q_{j} qj​也确定了)。
第一项
∫ z 1 , z 2 , . . . , z M ∏ i = 1 M q i ( z i ) ∗ l o g   p ( x , z ) d z 1   d z 2 . . . d z M = ∫ z j q i ( z j ) d z j ∗ ∫ z i ≠ j ∏ i ≠ j M q ( z i ) l o g   p ( x , z ) d z i ≠ j = ∫ z j q i ( z j ) ∗ E ∏ i ≠ j M q ( z i ) [ l o g   p ( x , z ) d z i ≠ j ] d z j = ∫ z j q i ( z j ) ∗ l o g   p ^ ( x , z j ) ( 1.5 ) \begin{aligned} &\int _{z_{1},z_{2},...,z_{M}}\prod_{i=1}^{M} q_{i}(z_{i})*log\, p(x,z) dz_{1}\,dz_{2}...dz_{M}\\ &=\int _{z_{j}}q_{i}(z_{j})dz_{j}*\int _{z_{i\ne j}} \prod_{i\ne j}^{M}q(z_{i})log\, p(x,z)dz_{i\ne j}\\ &=\int _{z_{j}}q_{i}(z_{j})*E_{\prod_{i\ne j}^{M}q(z_{i})}[log\, p(x,z)dz_{i\ne j}]dz_{j}\\ &=\int _{z_{j}}q_{i}(z_{j})*log\, \widehat{p}(x,z_{j})(1.5) \end{aligned} ​∫z1​,z2​,...,zM​​i=1∏M​qi​(zi​)∗logp(x,z)dz1​dz2​...dzM​=∫zj​​qi​(zj​)dzj​∗∫zi​=j​​i​=j∏M​q(zi​)logp(x,z)dzi​=j​=∫zj​​qi​(zj​)∗E∏i​=jM​q(zi​)​[logp(x,z)dzi​=j​]dzj​=∫zj​​qi​(zj​)∗logp ​(x,zj​)(1.5)​
第二项
∫ z 1 , z 2 , . . . , z M ∏ i = 1 M q i ( z i ) ∗ ∑ i = 1 M l o g   q i ( z i ) d z 1   d z 2 . . . d z M = ∫ z 1 , z 2 , . . . , z M ∏ i = 1 M q i ( z i ) [ l o g   q 1 ( z 1 ) + l o g   q 2 ( z 2 ) + . . . + l o g   q M ( z M ) ] d z M   d z 2 . . . d z M \begin{aligned} &\int _{z_{1},z_{2},...,z_{M}}\prod_{i=1}^{M} q_{i}(z_{i})*\sum_{i=1}^{M} log\, q_{i}(z_{i}) dz_{1}\,dz_{2}...dz_{M}\\ &=\int _{z_{1},z_{2},...,z_{M}}\prod_{i=1}^{M} q_{i}(z_{i})[log\, q_{1}(z_{1})+log\, q_{2}(z_{2})+...+log\, q_{M}(z_{M})]dz_{M}\,dz_{2}...dz_{M}\\ \end{aligned} ​∫z1​,z2​,...,zM​​i=1∏M​qi​(zi​)∗i=1∑M​logqi​(zi​)dz1​dz2​...dzM​=∫z1​,z2​,...,zM​​i=1∏M​qi​(zi​)[logq1​(z1​)+logq2​(z2​)+...+logqM​(zM​)]dzM​dz2​...dzM​​
求和项展开第一项:
∫ z 1 , z 2 , . . . , z M ∏ i = 1 M q i ( z i )   l o g   q 1 ( z 1 ) d z 1 , z 2 , . . . , z M = ∫ z 1 , z 2 , . . . , z M q 1 ( z 1 ) l o g   q 1 ( z 1 )   q 2 ( z 2 )   q 3 ( z 3 ) . . .   q M ( z M ) d z 1 , z 2 , . . . , z M = ∫ z 1 q 1 ( z 1 ) l o g   q 1 ( z 1 ) d z 1 ∗ ∫ z 2 q 2 ( z 2 ) d z 2 ∗ ∫ z 3 q 3 ( z 3 ) d z 3 ∗ . . . ∗ ∗ ∫ z M q M ( z M ) d z M = ∫ z 1 q 1 ( z 1 ) l o g   q 1 ( z 1 ) d z 1 \begin{aligned} & \int _{z_{1},z_{2},...,z_{M}}\prod_{i=1}^{M} q_{i}(z_{i})\ log\, q_{1}(z_{1})dz_{1},z_{2},...,z_{M}\\ & = \int _{z_{1},z_{2},...,z_{M}}q_{1}(z_{1})log\, q_{1}(z_{1})\ q_{2}(z_{2})\ q_{3}(z_{3})...\ q_{M}(z_{M})dz_{1},z_{2},...,z_{M}\\ & = \int _{z_{1}}q_{1}(z_{1})log\, q_{1}(z_{1})dz_{1}* \int _{z_{2}}q_{2}(z_{2})dz_{2}* \int _{z_{3}}q_{3}(z_{3})dz_{3}*...** \int _{z_{M}}q_{M}(z_{M})dz_{M}\\ & = \int _{z_{1}}q_{1}(z_{1})log\, q_{1}(z_{1})dz_{1} \end{aligned} ​∫z1​,z2​,...,zM​​i=1∏M​qi​(zi​) logq1​(z1​)dz1​,z2​,...,zM​=∫z1​,z2​,...,zM​​q1​(z1​)logq1​(z1​) q2​(z2​) q3​(z3​)... qM​(zM​)dz1​,z2​,...,zM​=∫z1​​q1​(z1​)logq1​(z1​)dz1​∗∫z2​​q2​(z2​)dz2​∗∫z3​​q3​(z3​)dz3​∗...∗∗∫zM​​qM​(zM​)dzM​=∫z1​​q1​(z1​)logq1​(z1​)dz1​​
所以求和项展开的第m项: ∫ z m q m ( z m ) l o g   q m ( z m ) d z m \int _{z_{m}}q_{m}(z_{m})log\, q_{m}(z_{m})dz_{m} ∫zm​​qm​(zm​)logqm​(zm​)dzm​
因为除了 q j q_{j} qj​都确定了,所以最终第二项等于 ② = ∑ i = 1 M ∫ z i q i ( z i ) l o g   q i ( z i ) d z i = ∫ z j q j ( z j ) l o g   q j ( z j ) d z j + C ( 常 数 ) ( 1.6 ) \begin{aligned} ②=\sum_{i=1}^{M}\int _{z_{i}}q_{i}(z_{i})log\, q_{i}(z_{i})dz_{i}=\int _{z_{j}}q_{j}(z_{j})log\, q_{j}(z_{j})dz_{j}+C(常数)(1.6) \end{aligned} ②=i=1∑M​∫zi​​qi​(zi​)logqi​(zi​)dzi​=∫zj​​qj​(zj​)logqj​(zj​)dzj​+C(常数)(1.6)​
结合等式1.5和等式1.6, L ( q ) = ∫ z j q i ( z j ) ∗ l o g   p ^ ( x , z j ) − ∫ z j q j ( z j ) l o g   q j ( z j ) d z j + C = ∫ z j q i ( z j ) ∗ l o g   p ^ ( x , z j ) q i ( z j ) d z j + C = − K L ( q i ( z j ) ∣ ∣ p ^ ( x , z j ) ) + C \begin{aligned} L(q)& = \int _{z_{j}}q_{i}(z_{j})*log\, \widehat{p}(x,z_{j})-\int _{z_{j}}q_{j}(z_{j})log\, q_{j}(z_{j})dz_{j}+C\\ & = \int _{z_{j}}q_{i}(z_{j})*log\, \frac{\widehat{p}(x,z_{j})}{q_{i}(z_{j})}dz_{j}+C \\ & = -KL(q_{i}(z_{j})||\widehat{p}(x,z_{j}))+C \end{aligned} L(q)​=∫zj​​qi​(zj​)∗logp ​(x,zj​)−∫zj​​qj​(zj​)logqj​(zj​)dzj​+C=∫zj​​qi​(zj​)∗logqi​(zj​)p ​(x,zj​)​dzj​+C=−KL(qi​(zj​)∣∣p ​(x,zj​))+C​
因为 K L KL KL散度恒大于0,所以 L ( q ) L(q) L(q)恒小于 C C C。且当 L ( q ) L(q) L(q)接近于 C C C时, K L KL KL散度无限小,q(z)与

4 变分推断的步骤

根据以上的假设和铺垫,变分推断共有以下几个步骤:

步骤1:定义变分分布 q ( z ) q(z) q(z);

步骤2:推导其证据下界表达式;

步骤3:用最优化方法对证据下界进行优化,如坐标上升,得到最优分布 q ∗ ( z ) q^{*}(z) q∗(z),作为后验分布 p ( z ∣ x ) p(z|x) p(z∣x)的近似。

这里最困难的就是证据下界最大化的问题。而EM算法(Expectation Maximization Algorithm),本身也是利用了下界最大化直至收敛的想法来进行优化的,所以我们可以应用EM算法处理第三步。

5 EM在变分推断中的应用(VIEM)

假设模型是联合概率分布 p ( x , z ∣ 0 ) p(x, z|0) p(x,z∣0),其中 x x x是观测变量, z z z是隐变量, θ \theta θ是参数。
目标是通过观测数据的概率(证据) l o g   p ( x ∣ θ ) log\, p(x|\theta) logp(x∣θ)的最大化,估计模型的参数 θ \theta θ。使用变分推理,导入平均场 q ( z ) = ∏ n = 1 M q ( z M ) q(z) =\prod_{n=1}^{M} q(z_{M}) q(z)=∏n=1M​q(zM​), 定义证据下界
L ( q , θ ) = E q [ l o g   p ( x , z ∣ θ ) ] − E q [ l o g   q ( z ) ] \begin{aligned} L(q,\theta)=E_{q}[log\,p(x,z|\theta)]-E_{q}[log\,q(z)] \end{aligned} L(q,θ)=Eq​[logp(x,z∣θ)]−Eq​[logq(z)]​
通过迭代,分别以 q q q和 θ \theta θ为变量对证据下界进行最大化,就得到变分 E M EM EM算法。
算法的目标函数 q ^ = a r g m i n q   K L ( q ( z ) ∣ ∣ p ( z ∣ x ) ) = a r g m a x q   L ( q , θ ) \widehat q= argmin_{q}\ KL(q(z)||p(z|x))=argmax_{q}\ L(q,\theta) q ​=argminq​ KL(q(z)∣∣p(z∣x))=argmaxq​ L(q,θ)

在第t次迭代中:

(1) E E E步:固定 θ ( t − 1 ) \theta^{(t-1)} θ(t−1),求 L ( q ( t ) , θ ( t − 1 ) ) L(q^{(t)},\theta^{(t-1)}) L(q(t),θ(t−1))对 q q q的最大化。

(2) M M M步:固定 q ( t ) q^{(t)} q(t),求 L ( q ( t ) , θ ( t ) ) L(q^{(t)},\theta^{(t)}) L(q(t),θ(t))对 θ θ θ的最大化。
变分EM推断中,以下关系成立:
l o g   p ( x ∣ θ t − 1 ) = L ( q ( t ) , θ ( t − 1 ) ) ≤ L ( q ( t ) , θ ( t ) ) = l o g   p ( x ∣ θ t ) \begin{aligned} log\, p(x|\theta^{t-1})= L(q^{(t)},\theta^{(t-1)})\le L(q^{(t)},\theta^{(t)})=log\, p(x|\theta^{t}) \end{aligned} logp(x∣θt−1)=L(q(t),θ(t−1))≤L(q(t),θ(t))=logp(x∣θt)​
根据公式1.3
L ( q , θ ) = E L B O = E q ( z ) [ l o g   p θ ( x , z ) l o g   q ( z ) ] = E q ( z ) [ l o g   p θ ( x , z ) ] + H [ q ( z ) ] ( 常 数 ) ) L(q,\theta)=ELBO=E_{q(z)}\left [ \frac{log\, p_{\theta}\left ( x,z\right ) }{log\, q\left ( z\right ) } \right ]=E_{q(z)}\left [ log\, p_{\theta}\left ( x,z\right ) \right ]+H[q(z)](常数)) L(q,θ)=ELBO=Eq(z)​[logq(z)logpθ​(x,z)​]=Eq(z)​[logpθ​(x,z)]+H[q(z)](常数))
根据之前的假设,对每一个 q j q_{j} qj​,都是 q i ≠ j q_{i\ne j} qi​=j​固定其余的 。所以
l o g   q j ( z j ) = E ∏ i ≠ j q i ( z i ) [ l o g   p θ ( x , z ) ] + C = ∫ q 1 ∫ q 2 . . . ∫ q j − 1 ∫ q j + 1 . . . ∫ q M q 1 q 2 . . . q j − 1 q j + 1 . . . q M [ l o g   p ( x , z ) ] d q 1 d q 2 . . . d q j − 1 d q j + 1 . . . d q M \begin{aligned} log\, q_{j}(z_{j}) &=E_{\prod_{i\ne j}q_{i}(z_{i})}\left [ log\, p_{\theta}\left ( x,z\right ) \right ]+C \\ &=\int_{q_{1}}\int_{q_{2}}...\int_{q_{j-1}}\int_{q_{j+1}}...\int_{q_{M}}q_{1}q_{2}...q_{j-1}q_{j+1}...q_{M}[log\, p(x,z)]dq_{1}dq_{2}...dq_{j-1}dq_{j+1}...dq_{M} \end{aligned} logqj​(zj​)​=E∏i​=j​qi​(zi​)​[logpθ​(x,z)]+C=∫q1​​∫q2​​...∫qj−1​​∫qj+1​​...∫qM​​q1​q2​...qj−1​qj+1​...qM​[logp(x,z)]dq1​dq2​...dqj−1​dqj+1​...dqM​​
可以使用坐标上升的方法进行迭代求解。即在每一轮迭代的M步中:
{ q 1 ^ ( z 1 ) = ∫ q 2 ∫ q 3 . . . ∫ q M   q 2   q 3 . . . q M [ l o g   p θ ( x , z ) ] d q 2   d q 3 . . . d q M q 2 ^ ( z 2 ) = ∫ q 1 ^ ∫ q 3 . . . ∫ q M   q 1 ^   q 3 . . . q M [ l o g   p θ ( x , z ) ] d q 1 ^   d q 3 . . . d q M q 3 ^ ( z 3 ) = ∫ q 1 ^ ∫ q 2 ^ . . . ∫ q M   q 1 ^   q 2 ^ . . . q M [ l o g   p θ ( x , z ) ] d q 1 ^   d q 2 ^ . . . d q M . . . . . . . . . . . . q M ^ ( z M ) = ∫ q 1 ^ ∫ q 2 ^ . . . ∫ q M − 1 ^   q 1 ^   q 2 ^ . . . q M − 1 ^ [ l o g   p θ ( x , z ) ] d q 1 ^   d q 2 ^ . . . d q M − 1 ^ \begin{aligned} \left\{\begin{matrix} \widehat{q_{1}}(z_{1})=\int _{q_{2}}\int _{q_{3}}...\int _{q_{M}}\, q_{2}\, q_{3}...q_{M}[log\, p_{\theta}(x,z)]dq_{2}\, dq_{3}...dq_{M} \\ \widehat{q_{2}}(z_{2})=\int _{\widehat{q_{1}}}\int _{q_{3}}...\int _{q_{M}}\, \widehat{q_{1}}\, q_{3}...q_{M}[log\, p_{\theta}(x,z)]d\widehat{q_{1}}\, dq_{3}...dq_{M} \\ \widehat{q_{3}}(z_{3})=\int _{\widehat{q_{1}}}\int _{\widehat{q_{2}}}...\int _{q_{M}}\, \widehat{q_{1}}\, \widehat{q_{2}}...q_{M}[log\, p_{\theta}(x,z)]d\widehat{q_{1}}\, d\widehat{q_{2}}...dq_{M}\\ ...... \\ ...... \\ \widehat{q_{M}}(z_{M})=\int _{\widehat{q_{1}}}\int _{\widehat{q_{2}}}...\int _{\widehat{q_{M-1}}}\, \widehat{q_{1}}\, \widehat{q_{2}}...\widehat{q_{M-1}}[log\, p_{\theta}(x,z)]d\widehat{q_{1}}\, d\widehat{q_{2}}...d\widehat{q_{M-1}} \end{matrix}\right. \end{aligned} ⎩⎪⎪⎪⎪⎪⎪⎪⎨⎪⎪⎪⎪⎪⎪⎪⎧​q1​ ​(z1​)=∫q2​​∫q3​​...∫qM​​q2​q3​...qM​[logpθ​(x,z)]dq2​dq3​...dqM​q2​ ​(z2​)=∫q1​ ​​∫q3​​...∫qM​​q1​ ​q3​...qM​[logpθ​(x,z)]dq1​ ​dq3​...dqM​q3​ ​(z3​)=∫q1​ ​​∫q2​ ​​...∫qM​​q1​ ​q2​ ​...qM​[logpθ​(x,z)]dq1​ ​dq2​ ​...dqM​............qM​ ​(zM​)=∫q1​ ​​∫q2​ ​​...∫qM−1​ ​​q1​ ​q2​ ​...qM−1​ ​[logpθ​(x,z)]dq1​ ​dq2​ ​...dqM−1​ ​​​

当然,平均场理论是一个非常强的假设,像神经网络就不适合平均场理论。因此SGVI(Stochastic Gradient Variational Inference)随机梯度变分推断就出现了。
(后续更新)

参考 reference

[1]David Bellot. Learning Probabilistic Graphical Models in R. Packt Publishing, 2016
[2]李航.《统计学习方法》(第二版).清华大学出版社, 2019
[3]参考视频: 【机器学习】【白板推导系列】【合集 1~23】_哔哩哔哩_bilibili.
PS:视频里的推导是反着推的,从 l o g   p ( x ) log\, p(x) logp(x)推出等于 E L B O ELBO ELBO和 K L   d i v e g e n c e KL\, divegence KLdivegence。再从 E L B O ELBO ELBO推出等于 − K L   d i v e r g e n c e + C o n s t a n t -KL\, divergence +Constant −KLdivergence+Constant(其实就是 l o g   p ( x ) log\, p(x) logp(x),和 q ( z ) q(z) q(z)无关所以是常数)。但是反推的过程较难,比较锻炼读者的推导能力,推荐观看。

上一篇:Levenshtein距离


下一篇:通过js的方法实现屏幕的适配