Probabilistic Graphical Model Basics
A probabilistic graphical model uses a graph structure to represent the joint probability distribution of several random variables.
The figure above shows a directed graphical model, where the arrows indicate dependencies between the variables. The joint distribution of a directed graphical model factorizes into a product of conditionals of each node given its parents, $p(\mathbf{x})=\prod_{k=1}^{K} p\left(x_{k} \mid \mathrm{pa}_{k}\right)$:
$$p\left(x_{1}\right) p\left(x_{2}\right) p\left(x_{3}\right) p\left(x_{4} \mid x_{1}, x_{2}, x_{3}\right) p\left(x_{5} \mid x_{1}, x_{3}\right) p\left(x_{6} \mid x_{4}\right) p\left(x_{7} \mid x_{4}, x_{5}\right)$$
Given such a graphical model, drawing a sample is straightforward. Ancestral sampling generates samples from the joint distribution represented by the model by sampling every node after its parents. For the graph above, sampling proceeds as follows (a code sketch is given after the list):
- Sample $x_{1} \sim p\left(x_{1}\right)$, $x_{2} \sim p\left(x_{2}\right)$, $x_{3} \sim p\left(x_{3}\right)$;
- Sample $x_{4} \sim p\left(x_{4} \mid x_{1}, x_{2}, x_{3}\right)$, $x_{5} \sim p\left(x_{5} \mid x_{1}, x_{3}\right)$;
- Sample $x_{6} \sim p\left(x_{6} \mid x_{4}\right)$, $x_{7} \sim p\left(x_{7} \mid x_{4}, x_{5}\right)$.
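A minimal Python sketch of ancestral sampling on this seven-node graph. The Bernoulli conditional probability tables `p1`–`p7` are made up for illustration; only the parents-before-children ordering matters.
```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical CPTs for binary variables x1..x7; each returns P(x_k = 1 | parents).
p1 = lambda: 0.6
p2 = lambda: 0.3
p3 = lambda: 0.5
p4 = lambda x1, x2, x3: 0.1 + 0.25 * (x1 + x2 + x3)
p5 = lambda x1, x3: 0.2 + 0.3 * (x1 + x3)
p6 = lambda x4: 0.9 if x4 else 0.2
p7 = lambda x4, x5: 0.1 + 0.4 * (x4 + x5)

def bern(p):
    return int(rng.random() < p)

def ancestral_sample():
    # Sample parents before children, following a topological order of the DAG.
    x1, x2, x3 = bern(p1()), bern(p2()), bern(p3())
    x4 = bern(p4(x1, x2, x3))
    x5 = bern(p5(x1, x3))
    x6 = bern(p6(x4))
    x7 = bern(p7(x4, x5))
    return x1, x2, x3, x4, x5, x6, x7

print(ancestral_sample())
```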
D-separation
One more important concept here is conditional independence: if $p(a \mid b, c)=p(a \mid c)$, we say that $a$ and $b$ are conditionally independent given $c$, written $a \perp b \mid c$.
Given a graphical model, how do we test which variables are conditionally independent? Three examples illustrate the idea:
- Example 1: tail-to-tail
The joint distribution of the directed graph above is:
$$p(a, b, c)=p(a \mid c) p(b \mid c) p(c)$$
When $c$ is not observed, the joint distribution of $a$ and $b$ is obtained by marginalizing over $c$:
$$p(a, b)=\sum_{c} p(a \mid c) p(b \mid c) p(c)$$
In general this does not factorize as $p(a)p(b)$, so $a$ and $b$ are not independent. Once $c$ is observed, the graphical model becomes the one shown below (observed variables are shaded):
The conditional joint distribution of $a$ and $b$ is now:
$$\begin{aligned} p(a, b \mid c) &=\frac{p(a, b, c)}{p(c)} \\ &=p(a \mid c) p(b \mid c) \end{aligned}$$
So $a$ and $b$ are conditionally independent given $c$.
- Example 2: head-to-tail
Next consider a chain structure:
The joint distribution of this graph is:
$$p(a, b, c)=p(a) p(c \mid a) p(b \mid c)$$
Marginalizing over $c$, the joint distribution of $a$ and $b$ is:
$$p(a, b)=p(a) \sum_{c} p(c \mid a) p(b \mid c)=p(a) p(b \mid a)$$
In general $a$ and $b$ are therefore not independent. When $c$ is observed, the graph becomes:
The conditional joint distribution is then:
$$\begin{aligned} p(a, b \mid c) &=\frac{p(a, b, c)}{p(c)} \\ &=\frac{p(a) p(c \mid a) p(b \mid c)}{p(c)} \\ &=p(a \mid c) p(b \mid c) \end{aligned}$$
so $a$ and $b$ are conditionally independent given $c$.
- Example 3: head-to-head
The joint distribution of this graph is:
$$p(a, b, c)=p(a) p(b) p(c \mid a, b)$$
Marginalizing over $c$ gives:
$$p(a, b)=p(a) p(b)$$
So $a$ and $b$ are independent when $c$ is not observed. Once $c$ is observed, the graphical model becomes:
The conditional joint distribution is now:
$$\begin{aligned} p(a, b \mid c) &=\frac{p(a, b, c)}{p(c)} \\ &=\frac{p(a) p(b) p(c \mid a, b)}{p(c)} \end{aligned}$$
which in general no longer factorizes, so $a$ and $b$ are not conditionally independent given $c$ (the "explaining away" effect).
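A quick numerical check of this head-to-head case, with a hypothetical binary table for $p(c \mid a, b)$: marginally $a$ and $b$ factorize, but after conditioning on $c$ they do not.
```python
import itertools
import numpy as np

# Hypothetical binary head-to-head model: a and b independent, c depends on both.
p_a = np.array([0.7, 0.3])                # P(a)
p_b = np.array([0.4, 0.6])                # P(b)
p_c_ab = np.zeros((2, 2, 2))              # P(c | a, b), a noisy-OR style table
for a, b in itertools.product([0, 1], repeat=2):
    p1 = 0.05 + 0.9 * max(a, b)           # P(c = 1 | a, b)
    p_c_ab[:, a, b] = [1 - p1, p1]

# Full joint p(a, b, c) = p(a) p(b) p(c | a, b)
joint = np.einsum('a,b,cab->abc', p_a, p_b, p_c_ab)

# Marginally, a and b are independent: p(a, b) == p(a) p(b)
p_ab = joint.sum(axis=2)
print(np.allclose(p_ab, np.outer(p_a, p_b)))        # True

# Conditioning on c = 1 breaks the independence ("explaining away")
p_ab_c1 = joint[:, :, 1] / joint[:, :, 1].sum()
print(np.allclose(p_ab_c1, np.outer(p_ab_c1.sum(1), p_ab_c1.sum(0))))  # False
```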
Summarizing the patterns above gives the D-separation criterion:
Consider two sets of nodes $A$ and $B$ and a conditioning set $C$. A path from $A$ to $B$ is said to be blocked by $C$ if it contains a node at which either
(a) the arrows on the path meet head-to-tail or tail-to-tail, and the node is in $C$; or
(b) the arrows meet head-to-head, and neither the node nor any of its descendants is in $C$.
If every path from $A$ to $B$ is blocked, we say that $A$ and $B$ are d-separated by $C$ (and they are then conditionally independent given $C$).
Applications of D-separation
When we do maximum likelihood estimation, the likelihood can be written as:
$$p(\mathcal{D} \mid \mu)=\prod_{n=1}^{N} p\left(x_{n} \mid \mu\right)$$
Here the observed data points are $x_{1},\cdots,x_{N}$ and $\mu$ is the mean of the Gaussian. This factorization relies on the graphical model shown below, in which the observations are conditionally independent given $\mu$:
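As a small illustration (my own, not from the original post) of what this factorized likelihood buys us in the Gaussian-mean case: the log-likelihood becomes a sum over i.i.d. points and is maximized by the sample mean.
```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic i.i.d. data from a Gaussian with unknown mean (variance fixed to 1).
true_mu = 2.0
x = rng.normal(true_mu, 1.0, size=500)

# Because p(D | mu) = prod_n p(x_n | mu), the log-likelihood is a sum over points.
def log_likelihood(mu, data):
    return -0.5 * np.sum((data - mu) ** 2) - 0.5 * len(data) * np.log(2 * np.pi)

mu_mle = x.mean()          # the maximizer of the summed log-likelihood
print(mu_mle, log_likelihood(mu_mle, x) >= log_likelihood(mu_mle + 0.1, x))
```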
Bayesian Inference
Bayesian inference uses observed data to update our hypotheses:
$$P(\text {hypothesis} \mid \text {data})=\frac{P(\text {data} \mid \text {hypothesis}) P(\text {hypothesis})}{P(\text {data})}$$
$P(\text{hypothesis} \mid \text{data})$ is called the posterior: the inference we make after observing the data. $P(\text{data} \mid \text{hypothesis})$ is called the likelihood, and $P(\text{hypothesis})$ is called the prior.
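A tiny discrete example of this update rule (illustrative numbers, not from the post): inferring the bias of a coin over a grid of hypotheses.
```python
import numpy as np

hypotheses = np.linspace(0.01, 0.99, 99)            # candidate values of P(heads)
prior = np.ones_like(hypotheses) / len(hypotheses)  # uniform prior

data = [1, 1, 0, 1, 1, 1, 0, 1]                     # observed flips, 1 = heads

likelihood = np.ones_like(hypotheses)
for flip in data:
    likelihood *= np.where(flip == 1, hypotheses, 1 - hypotheses)

posterior = likelihood * prior
posterior /= posterior.sum()                        # normalize by P(data)

print(hypotheses[np.argmax(posterior)])             # posterior mode, close to 6/8
```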
In practice we often need to evaluate the posterior $p(Z|X)$ or expectations of the form $\mathbb{E}_{p(Z \mid X)}[f(Z)]$. The latent variable $Z$ is usually high-dimensional, which makes exact computation intractable, so approximate inference is used instead. Two families of techniques are common:
- Deterministic techniques: the Laplace approximation, which fits a Gaussian to $p(Z|X)$, and variational inference (see 经典机器学习系列(十)【变分推断】).
- Stochastic techniques: Markov Chain Monte Carlo (MCMC), which draws a large number of samples from $p(Z|X)$ and estimates the quantities of interest from them; a minimal sampler sketch follows below.
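For the stochastic route, here is a minimal random-walk Metropolis sketch on a hypothetical one-dimensional unnormalized target; it only illustrates the sample-then-estimate idea, not a production MCMC implementation.
```python
import numpy as np

rng = np.random.default_rng(2)

# Unnormalized target density, here a made-up posterior proportional to exp(-z^4 / 4).
def log_target(z):
    return -z ** 4 / 4.0

def metropolis(n_samples, step=1.0, z0=0.0):
    z, samples = z0, []
    for _ in range(n_samples):
        z_prop = z + step * rng.normal()                  # symmetric proposal
        log_alpha = log_target(z_prop) - log_target(z)    # log acceptance ratio
        if np.log(rng.random()) < log_alpha:
            z = z_prop
        samples.append(z)
    return np.array(samples)

samples = metropolis(20000)
print(samples.mean(), samples.var())    # Monte Carlo estimates of E[z] and Var[z]
```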
Variational Inference
The idea of variational inference is to approximate the posterior with a parameterized distribution:
$$q(z \mid \phi) \approx p(z \mid x)$$
This turns an inference problem into an optimization problem. A detailed treatment can be found in 经典机器学习系列(十)【变分推断】. Here we directly state the decomposition of the log marginal probability:
$$\ln p(\mathbf{X})=\mathcal{L}(q)+\operatorname{KL}(q \| p)$$
where $\mathcal{L}(q)=\int q(\mathbf{Z}) \ln \left\{\frac{p(\mathbf{X}, \mathbf{Z})}{q(\mathbf{Z})}\right\} \mathrm{d} \mathbf{Z}$ and $\mathrm{KL}(q \| p)=-\int q(\mathbf{Z}) \ln \left\{\frac{p(\mathbf{Z} \mid \mathbf{X})}{q(\mathbf{Z})}\right\} \mathrm{d} \mathbf{Z}$. You can substitute these in and check that the two sides are indeed equal. Since the KL divergence is non-negative, $\mathcal{L}(q)$ is a lower bound on the log evidence and is therefore called the evidence lower bound (ELBO).
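The text suggests plugging the two terms back in to verify the identity; here is a small numerical check with a discrete latent variable and made-up numbers.
```python
import numpy as np

rng = np.random.default_rng(3)

# Check ln p(X) = L(q) + KL(q || p) for a discrete latent Z (hypothetical numbers).
p_z = np.array([0.2, 0.5, 0.3])           # prior p(Z)
p_x_given_z = np.array([0.7, 0.1, 0.4])   # likelihood p(X = x_obs | Z)

p_xz = p_z * p_x_given_z                  # joint p(X = x_obs, Z)
p_x = p_xz.sum()                          # evidence p(X = x_obs)
p_z_given_x = p_xz / p_x                  # exact posterior

q = rng.dirichlet(np.ones(3))             # an arbitrary variational distribution q(Z)

elbo = np.sum(q * np.log(p_xz / q))                 # L(q)
kl = -np.sum(q * np.log(p_z_given_x / q))           # KL(q || p(Z|X))

print(np.log(p_x), elbo + kl)             # the two numbers coincide
```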
Solving Reinforcement Learning from the Probabilistic-Graphical-Model Perspective
Policy Search in the Graphical-Model View
Maximum-entropy RL can be viewed as a form of inference. In maximum-entropy RL everything is "soft", i.e. carries a probability, and a natural benefit of this is increased exploration. The theoretical framework of probabilistic graphical models is already fairly mature, so bringing it into reinforcement learning lets us reuse that machinery to address many RL problems.
To recap, the RL objective can be posed as a policy search problem: search over policy parameters to maximize the expected reward:
$$\theta^{\star}=\arg \max _{\theta} \sum_{t=1}^{T} E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim p\left(\mathbf{s}_{t}, \mathbf{a}_{t} \mid \theta\right)}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right]$$
The trajectory distribution can be written as:
$$p(\tau)=p\left(\mathbf{s}_{1}, \mathbf{a}_{1}, \ldots, \mathbf{s}_{T}, \mathbf{a}_{T} \mid \theta\right)=p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} p\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}, \theta\right) p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)$$
From this joint trajectory distribution we can read off the corresponding graphical model:
The problem with this graph is that it contains no reward, and hence no optimization objective, so we introduce a binary optimality variable $\mathcal{O}$. $\mathcal{O}_{t}=1$ means that time step $t$ is optimal and $\mathcal{O}_{t}=0$ means it is not. We choose the following distribution for this variable:
$$p\left(\mathcal{O}_{t}=1 \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)=\exp \left(r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)$$
The graphical model then becomes the one shown below.
Conditioning on $\mathcal{O}_{t}=1$ for every $t\in\{1,\cdots,T\}$, the posterior distribution over trajectories can be written as:
$$\begin{aligned} p\left(\tau \mid \mathbf{o}_{1: T}\right) \propto p\left(\tau, \mathbf{o}_{1: T}\right) &=p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} p\left(\mathcal{O}_{t}=1 \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right) p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right) \\ &=p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} \exp \left(r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right) \\ &=\left[p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)\right] \exp \left(\sum_{t=1}^{T} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) \end{aligned}$$
Defined this way, the model is easiest to interpret under deterministic dynamics: the highest-reward trajectories have the highest probability of occurring, while low-reward trajectories have correspondingly low probability.
Of course, what we really care about is the optimal policy, which can be written as $p\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}, \mathcal{O}_{t: T}=1\right)$, or in parameterized form $p\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}, \theta^{\star}\right)$.
Define two backward messages, $\beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)$ and $\beta_{t}\left(\mathbf{s}_{t}\right)$:
$$\beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)=p\left(\mathcal{O}_{t: T} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)$$
Here $p\left(\mathcal{O}_{t: T} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)$ is the probability that the trajectory from time $t$ onward is optimal, given $\mathbf{s}_{t}$ and $\mathbf{a}_{t}$. Note that $\beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)$ is not a probability density over $\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)$; it is the probability of $\mathcal{O}_{t: T}=1$.
$$\beta_{t}\left(\mathbf{s}_{t}\right)=p\left(\mathcal{O}_{t: T} \mid \mathbf{s}_{t}\right)$$
Similarly, $p\left(\mathcal{O}_{t: T} \mid \mathbf{s}_{t}\right)$ is that same probability conditioned only on $\mathbf{s}_{t}$.
The state-only message can be obtained from the state-action message:
$$\beta_{t}\left(\mathbf{s}_{t}\right)=p\left(\mathcal{O}_{t: T} \mid \mathbf{s}_{t}\right)=\int_{\mathcal{A}} p\left(\mathcal{O}_{t: T} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right) p\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right) d \mathbf{a}_{t}=\int_{\mathcal{A}} \beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) p\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right) d \mathbf{a}_{t}$$
Here $p\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)$ is an action prior; for simplicity it can be taken to be uniform, $p\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)=\frac{1}{|\mathcal{A}|}$. The relation between $\beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)$ and $\beta_{t+1}\left(\mathbf{s}_{t+1}\right)$ is:
$$\beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)=p\left(\mathcal{O}_{t: T} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)=\int_{\mathcal{S}} \beta_{t+1}\left(\mathbf{s}_{t+1}\right) p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right) p\left(\mathcal{O}_{t} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right) d \mathbf{s}_{t+1}$$
This step uses the conditional independence encoded in the graphical model: given $\mathbf{s}_{t}$ and $\mathbf{a}_{t}$, $\mathbf{s}_{t+1}$ and $\mathcal{O}_{t}$ are conditionally independent.
From the backward messages we can obtain the optimal policy $p\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}, \mathcal{O}_{1: T}\right)$. Since $\mathcal{O}_{1:(t-1)}$ is conditionally independent of $\mathbf{a}_{t}$ given $\mathbf{s}_{t}$, we have $p\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}, \mathcal{O}_{1: T}\right)=p\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}, \mathcal{O}_{t: T}\right)$: when reasoning about the current action distribution we can ignore the earlier optimality variables, which matches the Markov property. The optimal action distribution can then be written as:
$$p\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}, \mathcal{O}_{t: T}\right)=\frac{p\left(\mathbf{s}_{t}, \mathbf{a}_{t} \mid \mathcal{O}_{t: T}\right)}{p\left(\mathbf{s}_{t} \mid \mathcal{O}_{t: T}\right)}=\frac{p\left(\mathcal{O}_{t: T} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right) p\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right) p\left(\mathbf{s}_{t}\right)}{p\left(\mathcal{O}_{t: T} \mid \mathbf{s}_{t}\right) p\left(\mathbf{s}_{t}\right)} \propto \frac{p\left(\mathcal{O}_{t: T} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)}{p\left(\mathcal{O}_{t: T} \mid \mathbf{s}_{t}\right)}=\frac{\beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}{\beta_{t}\left(\mathbf{s}_{t}\right)}$$
The term $p\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)$ is dropped because we assumed it to be a uniform distribution. Inference for the optimal policy then proceeds by recursive message passing: for $t=T-1$ down to $t=1$ we have:
$$\beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)=p\left(\mathcal{O}_{t} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right) E_{\mathbf{s}_{t+1} \sim p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[\beta_{t+1}\left(\mathbf{s}_{t+1}\right)\right]$$
$$\beta_{t}\left(\mathbf{s}_{t}\right)=E_{\mathbf{a}_{t} \sim p\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)}\left[\beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right]$$
Here $\beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)$ can be viewed as something analogous to a $Q$-value function, and $\beta_{t}\left(\mathbf{s}_{t}\right)$ as analogous to a $V$-value function, so we define the messages in log space:
$$\begin{aligned} Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) &=\log \beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \\ V\left(\mathbf{s}_{t}\right) &=\log \beta_{t}\left(\mathbf{s}_{t}\right) \end{aligned}$$
These are called soft value functions. The policy is then:
$$p\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}, \mathcal{O}_{t: T}\right)=\frac{\beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}{\beta_{t}\left(\mathbf{s}_{t}\right)}=\pi\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)=\exp \left(Q_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-V_{t}\left(\mathbf{s}_{t}\right)\right)=\exp \left(A_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)$$
- The relation between $Q$ and $V$:
Having obtained the policy, we now look at how $Q$ and $V$ are related. Marginalizing over the action space gives:
$$V\left(\mathbf{s}_{t}\right)=\log \int_{\mathcal{A}} \exp \left(Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) d \mathbf{a}_{t}$$
When the largest values of $Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)$ dominate, this log-sum-exp behaves like a hard maximum:
$$V\left(\mathbf{s}_{t}\right)=\log \int_{\mathcal{A}} \exp \left(Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) d \mathbf{a}_{t} \approx \max _{\mathbf{a}_{t}} Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)$$
As for $Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)$ itself: given $\mathbf{a}_{t}$ and deterministic dynamics, we have:
$$Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)=r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+V\left(\mathbf{s}_{t+1}\right)$$
But the environment is usually stochastic, in which case:
$$Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)=r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\log E_{\mathbf{s}_{t+1} \sim p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[\exp \left(V\left(\mathbf{s}_{t+1}\right)\right)\right]$$
This backward value recursion is quite special. Unlike the usual $Q$-function update, it applies a softmax-style (log-sum-exp) operation to the future state values instead of a hard max, so it does not focus exclusively on the single largest value and therefore allows more exploration; this tends to give good results in stochastic environments. A tabular sketch of this backward pass is given below.
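A minimal tabular sketch of this backward pass in log space, with a made-up finite MDP (sizes, rewards, and transitions are all hypothetical): the action backup is a log-sum-exp "soft max", and the resulting policy is $\exp(Q-V)$.
```python
import numpy as np

rng = np.random.default_rng(4)
S, A, T = 5, 3, 10                           # states, actions, horizon
r = rng.normal(size=(S, A))                  # reward r(s, a)
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition p(s' | s, a)

def logsumexp(x, axis):
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

V = np.zeros(S)                              # log beta_{T+1} = 0, i.e. beta = 1
for t in reversed(range(T)):
    # Q(s,a) = r(s,a) + log E_{s'}[exp(V(s'))]  -- soft backup over next states
    Q = r + logsumexp(np.log(P) + V[None, None, :], axis=2)
    # V(s) = log sum_a exp(Q(s,a))              -- soft max over actions
    V = logsumexp(Q, axis=1)

pi = np.exp(Q - V[:, None])                  # pi(a|s) = exp(Q - V)
print(pi.sum(axis=1))                        # each row sums to 1
```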
Maximum-Entropy RL and Its Implicit Objective
The derivation above gives the action distribution conditioned on all of the optimality variables $\mathcal{O}_{1: T}$, but what objective is actually being optimized? Recall that, given the optimality variables $\mathcal{O}_{1: T}=1$, the trajectory distribution of the whole graphical model is:
$$p(\tau)=\left[p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)\right] \exp \left(\sum_{t=1}^{T} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)$$
Under a given policy $\pi\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)$, the trajectory distribution is:
$$\hat{p}(\tau)=p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right) \pi\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)$$
The optimization objective is the KL divergence between these two distributions:
$$D_{\mathrm{KL}}(\hat{p}(\tau) \| p(\tau))=-E_{\tau \sim \hat{p}(\tau)}[\log p(\tau)-\log \hat{p}(\tau)]$$
Negating both sides and expanding gives:
$$\begin{aligned} -D_{\mathrm{KL}}(\hat{p}(\tau) \| p(\tau))=&E_{\tau \sim \hat{p}(\tau)}\left[\log p\left(\mathbf{s}_{1}\right)+\sum_{t=1}^{T}\left(\log p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)+r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)-\right.\\ &\left.\log p\left(\mathbf{s}_{1}\right)-\sum_{t=1}^{T}\left(\log p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)+\log \pi\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right)\right] \\ =& E_{\tau \sim \hat{p}(\tau)}\left[\sum_{t=1}^{T} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\log \pi\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right] \\ =& \sum_{t=1}^{T} E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \hat{p}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\log \pi\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right] \\ =& \sum_{t=1}^{T} E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \hat{p}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right]+E_{\mathbf{s}_{t} \sim \hat{p}\left(\mathbf{s}_{t}\right)}\left[\mathcal{H}\left(\pi\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right)\right] \end{aligned}$$
So minimizing the KL divergence becomes maximizing the expected reward plus the entropy of the policy. For the final step $T$, expanding the last term and folding the policy into the function inside the expectation gives:
$$E_{\left(\mathbf{s}_{T}, \mathbf{a}_{T}\right) \sim \hat{p}\left(\mathbf{s}_{T}, \mathbf{a}_{T}\right)}\left[r\left(\mathbf{s}_{T}, \mathbf{a}_{T}\right)-\log \pi\left(\mathbf{a}_{T} \mid \mathbf{s}_{T}\right)\right] =E_{\mathbf{s}_{T} \sim \hat{p}\left(\mathbf{s}_{T}\right)}\left[-D_{\mathrm{KL}}\left(\pi\left(\mathbf{a}_{T} \mid \mathbf{s}_{T}\right) \| \frac{1}{\exp \left(V\left(\mathbf{s}_{T}\right)\right)} \exp \left(r\left(\mathbf{s}_{T}, \mathbf{a}_{T}\right)\right)\right)+V\left(\mathbf{s}_{T}\right)\right]$$
where $V\left(\mathbf{s}_{T}\right)=\log \int_{\mathcal{A}} \exp \left(r\left(\mathbf{s}_{T}, \mathbf{a}_{T}\right)\right) d \mathbf{a}_{T}$. The expression above is maximized when the KL divergence is minimized, and the KL divergence is minimized when the two distributions are identical, which gives the policy:
$$\pi\left(\mathbf{a}_{T} \mid \mathbf{s}_{T}\right)=\exp \left(r\left(\mathbf{s}_{T}, \mathbf{a}_{T}\right)-V\left(\mathbf{s}_{T}\right)\right)$$
The derivation above handles the final step. For an intermediate step, the objective splits into two terms, one for the current step and one for the future state $\mathbf{s}_{t+1}$:
$$E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \hat{p}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\log \pi\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right]+E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \hat{p}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[E_{\mathbf{s}_{t+1} \sim p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[V\left(\mathbf{s}_{t+1}\right)\right]\right]$$
Rearranging gives:
$$E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \hat{p}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\log \pi\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right]+E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \hat{p}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[E_{\mathbf{s}_{t+1} \sim p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[V\left(\mathbf{s}_{t+1}\right)\right]\right]=\\ E_{\mathbf{s}_{t} \sim \hat{p}\left(\mathbf{s}_{t}\right)}\left[-D_{\mathrm{KL}}\left(\pi\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right) \| \frac{1}{\exp \left(V\left(\mathbf{s}_{t}\right)\right)} \exp \left(Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)\right)+V\left(\mathbf{s}_{t}\right)\right]$$
where $Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)=r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+E_{\mathbf{s}_{t+1} \sim p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[V\left(\mathbf{s}_{t+1}\right)\right]$ and $V\left(\mathbf{s}_{t}\right)=\log \int_{\mathcal{A}} \exp \left(Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) d \mathbf{a}_{t}$.
Again the optimum is attained when the KL divergence is smallest, giving $\pi\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)=\exp \left(Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-V\left(\mathbf{s}_{t}\right)\right)$.
We can see that policy search in the graphical-model view and the policy obtained from this KL-divergence argument give the same result, which confirms that the policy inference carried out earlier is exactly solving the optimization problem defined by this implicit objective.
Optimization with Stochastic Dynamics
The discussion above assumes a deterministic system. When the dynamics are stochastic, the equality $p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}, \mathcal{O}_{1: T}\right)=p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)$ no longer holds, so the trajectory distribution should instead be written as:
$$\hat{p}(\tau)=p\left(\mathbf{s}_{1} \mid \mathcal{O}_{1: T}\right) \prod_{t=1}^{T} p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}, \mathcal{O}_{1: T}\right) p\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}, \mathcal{O}_{1: T}\right)$$
The KL-divergence objective can then be written as:
$$-D_{\mathrm{KL}}(\hat{p}(\tau) \| p(\tau))=E_{\tau \sim \hat{p}(\tau)}\left[\log p\left(\mathbf{s}_{1}\right)+\sum_{t=1}^{T} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\log p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)\right]+\mathcal{H}(\hat{p}(\tau))$$
Because of the $\log p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)$ term, this objective is hard to optimize in a model-free setting.
Maximum-Entropy RL and Variational Inference
Variational inference approximates the posterior with a simple variational distribution. In maximum-entropy RL the trajectory distribution is:
$$p(\tau)=\left[p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)\right] \exp \left(\sum_{t=1}^{T} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)$$
We take the distribution that approximates it to be:
$$q(\tau)=q\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} q\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right) q\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)$$
When $q\left(\mathbf{s}_{1}\right)=p\left(\mathbf{s}_{1}\right)$ and $q\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)=p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)$, $q(\tau)$ is exactly $\hat{p}(\tau)$, with the policy $\pi\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)$ renamed to $q\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)$. Recall that our evidence is $\mathcal{O}_{t}=1$ for every $t \in\{1, \ldots, T\}$. The variational lower bound is then:
$$\begin{aligned} \log p\left(\mathcal{O}_{1: T}\right) &=\log \iint p\left(\mathcal{O}_{1: T}, \mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right) d \mathbf{s}_{1: T} d \mathbf{a}_{1: T} \\ &=\log \iint p\left(\mathcal{O}_{1: T}, \mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right) \frac{q\left(\mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right)}{q\left(\mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right)} d \mathbf{s}_{1: T} d \mathbf{a}_{1: T} \\ &=\log E_{\left(\mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right) \sim q\left(\mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right)}\left[\frac{p\left(\mathcal{O}_{1: T}, \mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right)}{q\left(\mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right)}\right] \\ & \geq E_{\left(\mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right) \sim q\left(\mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right)}\left[\log p\left(\mathcal{O}_{1: T}, \mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right)-\log q\left(\mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right)\right] \end{aligned}$$
The last inequality follows from Jensen's inequality. Substituting the definitions of $p(\tau)$ and $q(\tau)$ into this expression, we obtain the following bound:
$$\log p\left(\mathcal{O}_{1: T}\right) \geq E_{\left(\mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right) \sim q\left(\mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right)}\left[\sum_{t=1}^{T} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\log q\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right]$$
Note that this holds only under the constraints $q\left(\mathbf{s}_{1}\right)=p\left(\mathbf{s}_{1}\right)$ and $q\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)=p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)$.
So when reinforcement learning is viewed through this graphical model, the objective naturally carries an entropy term.
Soft Q-Learning
For Q-learning we only need to parameterize the $Q$-function as $Q_{\phi}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)$; the objective is:
$$\mathcal{E}(\phi)=E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[\left(r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+E_{q\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[V\left(\mathbf{s}_{t+1}\right)\right]-Q_{\phi}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)^{2}\right]$$
where $V\left(\mathbf{s}_{t}\right)=\log \int_{\mathcal{A}} \exp \left(Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) d \mathbf{a}_{t}$. The soft Q-learning update is:
$$\phi \leftarrow \phi-\alpha E\left[\frac{d Q_{\phi}}{d \phi}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\left(Q_{\phi}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\left(r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\log \int_{\mathcal{A}} \exp \left(Q\left(\mathbf{s}_{t+1}, \mathbf{a}_{t+1}\right)\right) d \mathbf{a}_{t+1}\right)\right)\right]$$
For comparison, the standard Q-learning update is:
$$\phi \leftarrow \phi-\alpha E\left[\frac{d Q_{\phi}}{d \phi}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\left(Q_{\phi}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\left(r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\max _{\mathbf{a}_{t+1}} Q_{\phi}\left(\mathbf{s}_{t+1}, \mathbf{a}_{t+1}\right)\right)\right)\right]$$
The soft version uses a softmax (log-sum-exp) of the $Q$-values in its target, whereas standard Q-learning takes the max $Q$-value. For continuous actions, the target value $V\left(\mathbf{s}^{\prime}\right)=\operatorname{soft} \max _{\mathbf{a}^{\prime}} Q_{\phi}\left(\mathbf{s}^{\prime}, \mathbf{a}^{\prime}\right)=\log \int \exp \left(Q_{\phi}\left(\mathbf{s}^{\prime}, \mathbf{a}^{\prime}\right)\right) d \mathbf{a}^{\prime}$ cannot be computed in closed form and has to be approximated by some other method.
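A minimal discrete-action sketch of this update on a made-up random MDP (my own toy setup; a discount factor is added so the infinite-horizon simulation stays bounded):
```python
import numpy as np

rng = np.random.default_rng(5)
S, A, gamma, alpha = 6, 3, 0.9, 0.1
R = rng.normal(size=(S, A))                    # reward table r(s, a)
P = rng.dirichlet(np.ones(S), size=(S, A))     # transitions p(s' | s, a)
Q = np.zeros((S, A))

def soft_v(q_row):
    # V(s) = log sum_a exp(Q(s, a)) -- the soft-max target
    m = q_row.max()
    return m + np.log(np.exp(q_row - m).sum())

s = 0
for step in range(20000):
    pi = np.exp(Q[s] - soft_v(Q[s]))           # current soft policy pi(a|s) = exp(Q - V)
    a = rng.choice(A, p=pi)
    s_next = rng.choice(S, p=P[s, a])
    target = R[s, a] + gamma * soft_v(Q[s_next])   # soft Bellman target
    Q[s, a] += alpha * (target - Q[s, a])
    s = s_next

V = np.array([soft_v(Q[i]) for i in range(S)])
print(np.exp(Q - V[:, None]).round(2))         # learned soft policy
```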
- Liu, Q. and Wang, D. (2016). Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm. In Advances in Neural Information Processing Systems (NIPS).
- Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement Learning with Deep Energy-Based Policies. In International Conference on Machine Learning (ICML).
Maximum-Entropy Policy Gradient
Parameterize the policy as $q_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)$. The objective can be defined as:
$$J(\theta)=\sum_{t=1}^{T} E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) + \mathcal{H}\left(q_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right)\right]$$
Taking its gradient:
$$\begin{aligned} \nabla_{\theta} J(\theta) &=\sum_{t=1}^{T} \nabla_{\theta} E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\mathcal{H}\left(q_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right)\right] \\ &=\sum_{t=1}^{T} E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[\nabla_{\theta} \log q_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\left(\sum_{t^{\prime}=t}^{T} r\left(\mathbf{s}_{t^{\prime}}, \mathbf{a}_{t^{\prime}}\right)-\log q_{\theta}\left(\mathbf{a}_{t^{\prime}} \mid \mathbf{s}_{t^{\prime}}\right)-1\right)\right] \\ &=\sum_{t=1}^{T} E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[\nabla_{\theta} \log q_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\left(\sum_{t^{\prime}=t}^{T} r\left(\mathbf{s}_{t^{\prime}}, \mathbf{a}_{t^{\prime}}\right)-\log q_{\theta}\left(\mathbf{a}_{t^{\prime}} \mid \mathbf{s}_{t^{\prime}}\right)-b\left(\mathbf{s}_{t^{\prime}}\right)\right)\right] \end{aligned}$$
The second line uses the likelihood-ratio trick, applied to the sampling distribution $\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)$ under the expectation. The extra term $-\log q_{\theta}\left(\mathbf{a}_{t^{\prime}} \mid \mathbf{s}_{t^{\prime}}\right)-1$ that appears after differentiation is exactly what encourages higher-entropy policies. The trailing $1$ can also be absorbed into a baseline, handled just as in standard policy-gradient algorithms, or the whole bracket can be written as some advantage estimate:
$$\nabla_{\theta} J(\theta)=\sum_{t=1}^{T} E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[\nabla_{\theta} \log q_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right) \hat{A}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right]$$
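A one-state sketch of this estimator with made-up rewards: with the entropy bonus the policy converges towards a softmax over the rewards rather than a deterministic argmax.
```python
import numpy as np

rng = np.random.default_rng(6)
A = 4
r = np.array([1.0, 2.0, 0.5, 1.5])          # hypothetical reward per action
theta = np.zeros(A)                          # softmax policy parameters
lr, batch = 0.1, 256

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for it in range(2000):
    pi = softmax(theta)
    a = rng.choice(A, p=pi, size=batch)
    # Per-sample weight r(a) - log pi(a) - 1, matching the estimator derived above.
    weight = r[a] - np.log(pi[a]) - 1.0
    grad_log_pi = np.eye(A)[a] - pi[None, :]   # score function of a softmax policy
    theta += lr * (grad_log_pi * weight[:, None]).mean(axis=0)

print(softmax(theta).round(3))   # close to exp(r)/sum(exp(r)), not a hard argmax
```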
Soft Actor-Critic
SAC additionally parameterizes the value functions alongside the policy, turning this into an off-policy algorithm. The state-value objective is:
$$J_{V}(\psi)=\mathbb{E}_{\mathbf{s}_{t} \sim \mathcal{D}}\left[\frac{1}{2}\left(V_{\psi}\left(\mathbf{s}_{t}\right)-\mathbb{E}_{\mathbf{a}_{t} \sim \pi_{\phi}}\left[Q_{\theta}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\log \pi_{\phi}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right]\right)^{2}\right]$$
Here $\mathbf{s}_{t} \sim \mathcal{D}$ means that states are sampled from the replay buffer. Taking the gradient:
$$\hat{\nabla}_{\psi} J_{V}(\psi)=\nabla_{\psi} V_{\psi}\left(\mathbf{s}_{t}\right)\left(V_{\psi}\left(\mathbf{s}_{t}\right)-Q_{\theta}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\log \pi_{\phi}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right)$$
Likewise for the $Q$-function:
$$J_{Q}(\theta)=\mathbb{E}_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \mathcal{D}}\left[\frac{1}{2}\left(Q_{\theta}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\hat{Q}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)^{2}\right]$$
with the target
$$\hat{Q}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)=r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p}\left[V_{\bar{\psi}}\left(\mathbf{s}_{t+1}\right)\right]$$
Taking its gradient:
$$\hat{\nabla}_{\theta} J_{Q}(\theta)=\nabla_{\theta} Q_{\theta}\left(\mathbf{a}_{t}, \mathbf{s}_{t}\right)\left(Q_{\theta}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\gamma V_{\bar{\psi}}\left(\mathbf{s}_{t+1}\right)\right)$$
The policy objective is:
$$J_{\pi}(\phi)=\mathbb{E}_{\mathbf{s}_{t} \sim \mathcal{D}}\left[D_{K L}\left(\pi_{\phi}\left(\cdot \mid \mathbf{s}_{t}\right) \| \frac{\exp \left(Q_{\theta}\left(\mathbf{s}_{t}, \cdot\right)\right)}{Z_{\theta}\left(\mathbf{s}_{t}\right)}\right)\right]$$
Using the reparameterization $\mathbf{a}_{t}=f_{\phi}\left(\epsilon_{t} ; \mathbf{s}_{t}\right)$, this can be rewritten as:
$$J_{\pi}(\phi)=\mathbb{E}_{\mathbf{s}_{t} \sim \mathcal{D}, \epsilon_{t} \sim \mathcal{N}}\left[\log \pi_{\phi}\left(f_{\phi}\left(\epsilon_{t} ; \mathbf{s}_{t}\right) \mid \mathbf{s}_{t}\right)-Q_{\theta}\left(\mathbf{s}_{t}, f_{\phi}\left(\epsilon_{t} ; \mathbf{s}_{t}\right)\right)\right]$$
with gradient
$$\hat{\nabla}_{\phi} J_{\pi}(\phi)=\nabla_{\phi} \log \pi_{\phi}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)+\left(\nabla_{\mathbf{a}_{t}} \log \pi_{\phi}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)-\nabla_{\mathbf{a}_{t}} Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) \nabla_{\phi} f_{\phi}\left(\epsilon_{t} ; \mathbf{s}_{t}\right)$$
Pseudocode: see Haarnoja et al. (2018) in the references for the full algorithm listing.
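As a rough complement to the pseudocode, here is a minimal PyTorch sketch of just the reparameterized policy update $J_{\pi}(\phi)$, with my own simplifications (plain Gaussian policy without tanh squashing, a single Q network, random placeholder data):
```python
import torch
import torch.nn as nn

obs_dim, act_dim, batch = 8, 2, 64

q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 2 * act_dim))
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

s = torch.randn(batch, obs_dim)              # stand-in for states from the replay buffer

mean, log_std = policy(s).chunk(2, dim=-1)
std = log_std.clamp(-5, 2).exp()
eps = torch.randn_like(mean)                 # epsilon_t ~ N(0, I)
a = mean + std * eps                         # a_t = f_phi(eps_t; s_t), reparameterized
log_pi = torch.distributions.Normal(mean, std).log_prob(a).sum(-1)

q = q_net(torch.cat([s, a], dim=-1)).squeeze(-1)

# J_pi(phi) = E[ log pi_phi(a|s) - Q_theta(s, a) ]; gradients flow through a via eps.
policy_loss = (log_pi - q).mean()

pi_opt.zero_grad()
policy_loss.backward()
pi_opt.step()
print(float(policy_loss))
```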
References
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
- Levine, S. (2018). Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. arXiv preprint arXiv:1805.00909.
- Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement Learning with Deep Energy-Based Policies. In International Conference on Machine Learning (ICML).
- Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., and Levine, S. (2018). Soft Actor-Critic Algorithms and Applications. arXiv preprint.
- Kappen, H. J. (2009). Optimal Control as a Graphical Model Inference Problem (frames control as an inference problem in a graphical model).
- Ziebart, B. (2010). Modeling Interaction via the Principle of Maximum Causal Entropy (connects soft optimality with maximum-entropy modeling).