Published: 2020 (ICLR 2020)
Main points: This paper proposes an algorithm called Simulated Policy Learning (SimPLe), which uses a model-based approach to improve sample efficiency. Under a budget of 100K interactions with the environment, it outperforms the model-free algorithms it is compared against.
Concretely, the idea is to learn a world model that captures all components of the environment: the dynamics and the reward function. The authors design a rather complex network architecture for it, shown in the figure below.
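The architecture figure isn't reproduced in these notes, so as a rough idea of what "dynamics + reward" means in code, here is a minimal PyTorch sketch of a frame-prediction world model. Every class name, layer size, and the multiplicative action-conditioning trick below are illustrative assumptions, not the paper's actual (much larger) network:

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Minimal sketch: predict the next frame and the reward from a stack of
    past frames plus an action. Assumes 84x84 RGB frames, 4-frame stacking."""

    def __init__(self, n_actions, frame_channels=4 * 3):
        super().__init__()
        # Convolutional encoder over the stacked input frames.
        self.encoder = nn.Sequential(
            nn.Conv2d(frame_channels, 64, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2), nn.ReLU(),
        )
        # The action is embedded and mixed into the image features.
        self.action_emb = nn.Embedding(n_actions, 128)
        # Deconvolutional decoder reconstructs the predicted next frame.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, kernel_size=8, stride=4),
        )
        self.reward_head = nn.Linear(128, 1)

    def forward(self, frames, action):
        h = self.encoder(frames)                       # (B, 128, 9, 9) for 84x84 input
        a = self.action_emb(action)[:, :, None, None]  # broadcast over spatial dims
        h = h * a                                      # multiplicative action conditioning
        next_frame = self.decoder(h)                   # back to (B, 3, 84, 84)
        reward = self.reward_head(h.mean(dim=(2, 3)))  # pool features, predict reward
        return next_frame, reward
```

Training such a model is just supervised learning on (frame stack, action, next frame, reward) tuples collected from the real environment.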
The world model comes in a deterministic variant and a stochastic variant; the difference is that the stochastic one adds a variational autoencoder, which I won't go into here.
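To make the deterministic/stochastic distinction concrete, the sketch below bolts a variational latent onto the model above. For brevity it uses a Gaussian latent with the standard reparameterization trick; the paper itself uses discrete latent bits with a learned prior, so treat this only as an analogy:

```python
class StochasticWorldModel(WorldModel):
    """Sketch of the stochastic variant: at training time a posterior network
    infers a latent z from the *true* next frame; at rollout time z is drawn
    from the prior. Gaussian latent here is a simplifying assumption."""

    def __init__(self, n_actions, latent_dim=32, **kw):
        super().__init__(n_actions, **kw)
        self.posterior = nn.Sequential(           # encodes the true next frame
            nn.Conv2d(3, 32, kernel_size=8, stride=8), nn.Flatten(),
            nn.LazyLinear(2 * latent_dim),        # -> mean and log-variance of z
        )
        self.z_proj = nn.Linear(latent_dim, 128)  # map z into feature channels
        self.latent_dim = latent_dim

    def forward(self, frames, action, next_frame=None):
        h = self.encoder(frames) * self.action_emb(action)[:, :, None, None]
        if next_frame is not None:                # training: z ~ posterior(next frame)
            mu, logvar = self.posterior(next_frame).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
            kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(-1).mean()
        else:                                     # rollout: z ~ unit-Gaussian prior
            z = torch.randn(frames.size(0), self.latent_dim, device=frames.device)
            kl = torch.zeros((), device=frames.device)
        h = h + self.z_proj(z)[:, :, None, None]  # inject the latent into the features
        next_pred = self.decoder(h)
        reward = self.reward_head(h.mean(dim=(2, 3)))
        return next_pred, reward, kl              # kl joins the reconstruction loss
```

The point of the latent is to let a single deterministic decoder represent multiple plausible futures, which matters in stochastic or partially observed games.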
Once you have this model, you just train a PPO agent inside it. That's the whole loop, and it really is quite "simple". The pseudocode for the full algorithm is shown below.
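Since the pseudocode figure isn't reproduced here, the following is a rough Python-flavored sketch of the SimPLe outer loop as the paper describes it: alternate between collecting a little real experience, refitting the world model, and training PPO entirely inside the learned model. The helper names (RandomPolicy, collect_real_experience, train_world_model, SimulatedEnv, train_ppo) are hypothetical placeholders:

```python
def simple(env, n_iterations=15, interaction_budget=100_000):
    policy = RandomPolicy(env.action_space)      # start from a random policy
    replay_buffer = []
    world_model = WorldModel(env.action_space.n)
    for _ in range(n_iterations):
        # 1. Act in the *real* environment with the current policy; all
        #    iterations together must stay within the ~100K-step budget.
        replay_buffer += collect_real_experience(
            env, policy, steps=interaction_budget // n_iterations)
        # 2. Fit the world model to everything collected so far
        #    (next-frame reconstruction + reward prediction loss).
        train_world_model(world_model, replay_buffer)
        # 3. Train PPO purely inside the learned model; short rollouts
        #    branched from real states limit compounding model error.
        sim_env = SimulatedEnv(world_model, start_states=replay_buffer)
        policy = train_ppo(policy, sim_env)
    return policy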
Summary: Before this, no model-based algorithm had achieved better results than model-free ones, and the authors regard this as the first to do so ("no prior work has successfully demonstrated model-based control via predictive models that achieve competitive results with model-free RL"). However, the authors also note that if the interaction count is not capped at 100K, model-free methods eventually overtake this algorithm as training continues, which is actually a bit puzzling ("This demonstrates that SimPLe excels in a low data regime, but its advantage disappears with a bigger amount of data.").
Question: The world model architecture here is very intricate; presumably one would have to actually implement it to understand why it ended up designed this way.