MIT 6.S191 Lecture 6: Deep Reinforcement Learning
CS 294: Deep Reinforcement Learning
Jan 18: Introduction and course overview (Levine, Finn, Schulman)
Why deep reinforcement learning?
• Deep = can process complex sensory input
…and also compute really complex functions
• Reinforcement learning = can choose complex actions
On June 21, 2016, OpenAI announced its main goals, including building a “general-purpose” robot and a chatbot that uses natural language.
V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, et al. “Playing Atari with Deep Reinforcement Learning”. (2013).
policy gradients
J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. “Trust Region Policy Optimization”. (2015);
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. “Asynchronous methods for deep reinforcement learning”. (2016).
X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang. “Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning”. NIPS. 2014.
guided policy search
S. Levine, C. Finn, T. Darrell, and P. Abbeel. “End-to-end training of deep visuomotor policies”. (2015).
policy gradients
J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. “High-dimensional continuous control using generalized advantage estimation”. (2015).
Finally, the four key techniques behind AlphaGo:
supervised learning + policy gradients + value functions + Monte-Carlo tree search
Deep Q Network
Answer: Google’s DeepMind published its famous paper Playing Atari with Deep Reinforcement Learning in 2013, in which it introduced a new algorithm called Deep Q Network (DQN for short). It demonstrated how an AI agent can learn to play games by just observing the screen, without any prior information about those games (an uninformative prior?). The results turned out to be pretty impressive.
This paper opened the era of what is called ‘deep reinforcement learning’, a mix of deep learning and reinforcement learning.
Then, get to know this impressive network through practice: Deep Q Learning with Keras and Gym
Plus a conscientious Chinese blog post: Playing Flappy Bird with Deep Q Learning (DQN) in TensorFlow (extra reading)
Introduction to the CartPole Game
CartPole is one of the simplest environments in OpenAI gym (a game simulator).
As you can see in the animation at the top, the goal of CartPole is to balance a pole that is attached by a single joint to the top of a moving cart.
Instead of pixel information, the state provides 4 values: the position and velocity of the cart, and the angle and angular velocity of the pole.
An agent moves the cart by performing a series of actions, each either 0 or 1, which push the cart left or right.
Gym makes interacting with the game environment really simple.
next_state, reward, done, info = env.step(action)
As we discussed above, action can be either 0 or 1.
If we pass those numbers, env, which represents the game environment, will emit the results. done is a boolean value telling whether the game ended or not. The old state information paired with action, next_state and reward is the information we need for training the agent.
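To make the data flow concrete, here is a minimal sketch of collecting those transitions. It deliberately does not use Gym: ToyEnv is a hypothetical stand-in that only mimics the four-value return of env.step shown above, and its dynamics, the 2000-entry memory size and the random action choice are assumptions made up for illustration.

```python
import random
from collections import deque


class ToyEnv:
    """Hypothetical stand-in for a Gym environment (not CartPole physics).

    It only mimics the interface used above: reset() returns a state and
    step(action) returns (next_state, reward, done, info).
    """

    def __init__(self, max_steps=20):
        self.max_steps = max_steps
        self.position = 0.0
        self.steps = 0

    def reset(self):
        self.position = 0.0
        self.steps = 0
        return (self.position,)

    def step(self, action):
        # Action 0 pushes left, action 1 pushes right.
        self.position += 1.0 if action == 1 else -1.0
        self.steps += 1
        # The episode ends when the cart drifts too far or time runs out.
        done = abs(self.position) > 3 or self.steps >= self.max_steps
        reward = 1.0  # one point for every step survived
        return (self.position,), reward, done, {}


def collect_episode(env, memory):
    """Play one episode with random actions and store every transition."""
    state = env.reset()
    done = False
    while not done:
        action = random.choice([0, 1])
        next_state, reward, done, info = env.step(action)
        memory.append((state, action, reward, next_state, done))
        state = next_state
    return memory


memory = collect_episode(ToyEnv(), deque(maxlen=2000))
```

Each stored tuple is exactly the (state, action, reward, next_state, done) record described above; a learning agent would later sample from this memory instead of acting at random.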
Implementing a Simple Neural Network using Keras
This post is not about deep learning or neural nets, so we will treat the neural net as a black-box algorithm:
an algorithm that learns from pairs of example input and output data, detects some kind of pattern, and predicts the output for unseen input data.
But we should understand which part of the DQN algorithm is the neural net.
Where the neural network appears in the DQN algorithm
Note that the neural net we are going to use is similar to the diagram above.
We will have one input layer, which receives the 4 state values, and 3 hidden layers.
But we are going to have 2 nodes in the output layer since there are two buttons (0 and 1) for the game.
Keras makes it really simple to implement a basic neural network.
The code below creates an empty neural net model.
loss and optimizer are among the parameters that define the characteristics of the neural network, but we are not going to discuss them here.
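Independent of Keras, the shape of the network described here (4 inputs, 3 hidden layers, 2 outputs) can be sketched in plain Python. This is not the Keras listing the text refers to: the hidden width of 24, the ReLU activation, the random initialization and all function names are assumptions, and no training step is shown.

```python
import random


def make_layer(n_in, n_out, rng):
    """Create one dense layer as (weights, biases) with small random weights."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, layer, relu):
    """Apply one dense layer, optionally followed by ReLU."""
    weights, biases = layer
    outputs = []
    for row, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(max(0.0, total) if relu else total)
    return outputs


def build_network(sizes, seed=0):
    """Build one layer for each consecutive pair of sizes."""
    rng = random.Random(seed)
    return [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def predict(network, state):
    """Forward pass: ReLU on hidden layers, linear output (one value per action)."""
    values = list(state)
    for i, layer in enumerate(network):
        values = dense(values, layer, relu=(i < len(network) - 1))
    return values


# 4 state values in, 3 hidden layers (width 24 is an assumption), 2 actions out.
network = build_network([4, 24, 24, 24, 2])
q_values = predict(network, [0.0, 0.1, -0.02, 0.3])
action = q_values.index(max(q_values))  # pick the action with the larger estimate
```

The two outputs play the role of one estimated value per button, so choosing an action amounts to taking the index of the larger output.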
Introduction to the Mario AI Implementation
Ref: http://www.cnblogs.com/Leo_wl/p/5852010.html
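The NEAT algorithm evolves neural networks by augmenting their topology (Evolving Neural Networks through Augmenting Topologies). It differs from the conventional neural networks we discussed earlier:
- it not only trains and modifies the network weights,
- but also modifies the network topology, including adding and removing nodes.
- Gene: a connection in the network
- Genome: a collection of genes
- Species: a group of genomes with similar genes
- Fitness: somewhat like the reward function in reinforcement learning
- Generation: the set of genomes trained together in one round; after each generation finishes training, genomes are eliminated according to fitness, and new genomes are added through asexual and sexual reproduction
- Mutation: happens while new genomes are being generated; it may change network weights, add synaptic connections or neurons, or disable or enable synapses
Ref: NeuroEvolution with MarI/O.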
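The vocabulary above can be illustrated with a heavily simplified sketch. It is not the NEAT algorithm itself and not the MarI/O code: speciation, crossover (sexual reproduction), node insertion and innovation numbers are all left out, and the Gene fields, the mutation choices and the toy fitness function are assumptions for illustration only.

```python
import random
from dataclasses import dataclass


@dataclass
class Gene:
    """One connection between two nodes (field names are illustrative)."""
    source: int
    target: int
    weight: float
    enabled: bool = True


def mutate(genome, n_nodes, rng, weight_scale=0.1):
    """Return a mutated copy of a genome (a list of Gene objects).

    One of three simplified mutations is applied: perturb a weight,
    add a new connection, or toggle a connection on/off.
    """
    child = [Gene(g.source, g.target, g.weight, g.enabled) for g in genome]
    kind = rng.choice(["weight", "add_connection", "toggle"])
    if kind == "weight":
        gene = rng.choice(child)
        gene.weight += rng.gauss(0.0, weight_scale)
    elif kind == "add_connection":
        child.append(Gene(rng.randrange(n_nodes), rng.randrange(n_nodes),
                          rng.uniform(-1.0, 1.0)))
    else:
        gene = rng.choice(child)
        gene.enabled = not gene.enabled
    return child


def next_generation(population, fitness, n_nodes, rng):
    """Keep the fitter half, then refill by mutating survivors (asexual only)."""
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[: max(1, len(ranked) // 2)]
    children = [mutate(rng.choice(survivors), n_nodes, rng)
                for _ in range(len(population) - len(survivors))]
    return survivors + children


rng = random.Random(0)
population = [[Gene(0, 2, rng.uniform(-1, 1)), Gene(1, 2, rng.uniform(-1, 1))]
              for _ in range(6)]
# Toy fitness (an assumption): prefer genomes whose enabled weights sum highest.
toy_fitness = lambda genome: sum(g.weight for g in genome if g.enabled)
new_population = next_generation(population, toy_fitness, n_nodes=3, rng=rng)
```

In a game setting the fitness would instead come from how far the network gets in a level, which is where the analogy with a reward function comes from.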
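Mario AI implementation based on Deep Q-learning
do {
- use a CNN to recognize Mario's state in the game,
- use a reinforcement learning algorithm to choose an action,
- then compute the reward from the newly returned state and the previous states, and feed it back to the Q function for the next iteration,
} while (keep training until the game can be cleared)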
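The "feed it back to the Q function" step can be sketched as computing a training target for each stored transition. This is a minimal sketch under assumptions: the discount factor 0.95 is an arbitrary choice, and predict_q is a hypothetical callable standing in for the CNN's forward pass, replaced here by a stub so the example runs without any deep learning library.

```python
GAMMA = 0.95  # discount factor (an assumed value)


def q_target(reward, next_q_values, done, gamma=GAMMA):
    """Target for the action that was taken.

    If the episode ended there is no future, so the target is the reward;
    otherwise add the discounted best estimate for the next state.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def training_targets(batch, predict_q, gamma=GAMMA):
    """Build (state, target_q_values) pairs for a batch of transitions.

    predict_q is a hypothetical callable mapping a state to a list of
    Q-values, one per action (e.g. the CNN's forward pass).
    """
    pairs = []
    for state, action, reward, next_state, done in batch:
        targets = list(predict_q(state))
        targets[action] = q_target(reward, predict_q(next_state), done, gamma)
        pairs.append((state, targets))
    return pairs


# Tiny demonstration with a stub in place of the network.
stub_predict = lambda state: [0.0, 2.0]
batch = [("s0", 0, 1.0, "s1", False), ("s1", 1, -1.0, "s2", True)]
pairs = training_targets(batch, stub_predict)
```

Only the entry for the action actually taken is overwritten; the other entries keep the network's own prediction, so they contribute no error when the network is fitted to these targets.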
Q-learning basics (this probably involves the Bayes loss function)
Link: to be covered in a separate post
From: https://arxiv.org/pdf/1701.07274.pdf
First understand traditional Q-learning, then combine it with neural networks to go deeper into Deep Q-learning.
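As a starting point, traditional (tabular) Q-learning can be sketched on a toy problem. The corridor environment, the hyperparameters (alpha, gamma, epsilon) and the number of episodes are all assumptions chosen to keep the example small; the update line is the standard Q-learning rule.

```python
import random


def q_learning(n_states=5, episodes=500, alpha=0.1, gamma=0.9,
               epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy corridor (the environment is an assumption).

    States are 0..n_states-1, the agent starts at 0, action 0 moves left and
    action 1 moves right. Reaching the last state gives reward 1 and ends
    the episode; every other step gives reward 0.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        while state != goal:
            # Epsilon-greedy action selection, breaking ties at random.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.choice([0, 1])
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == goal else 0.0
            # The terminal state has no future value.
            future = 0.0 if next_state == goal else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * future - q[state][action])
            state = next_state
    return q


q_table = q_learning()
greedy_policy = [row.index(max(row)) for row in q_table[:-1]]
```

Deep Q-learning keeps the same update target but replaces the table q with a neural network, which is what makes large state spaces such as raw screens tractable.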
Concise outline I prefer.
An RL agent may include one or more of these components:
Policy: agent’s behaviour function
Value function: how good is each state and/or action
Model: agent’s representation of the environment
Policy: shows the optimal path.
Value function: shows the value of each position.
Model: some issues to consider when building the model.
Most Markov reward and decision processes are discounted. Why are they designed with a discount?
- Mathematically convenient to discount rewards
- Avoids infinite returns in cyclic Markov processes
- Uncertainty about the future may not be fully represented
- If the reward is financial, immediate rewards may earn more interest than delayed rewards
- Animal/human behaviour shows preference for immediate reward
- It is sometimes possible to use undiscounted Markov reward processes (i.e. γ = 1), e.g. if all sequences terminate.
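A short worked example may help with the points above. It is a minimal sketch: the reward sequences and the values of γ are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the k-th future reward is weighted by gamma**k."""
    total = 0.0
    # Work backwards so each step is G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
short_return = discounted_return([1, 1, 1], 0.5)

# A long run of constant reward 1 approaches 1 / (1 - gamma) instead of growing
# without bound, which is the "avoids infinite returns" point above.
long_return = discounted_return([1] * 1000, 0.9)

# With gamma = 1 the return is the plain sum, which is fine when episodes end.
undiscounted = discounted_return([1, 1, 1], 1.0)
```

With γ = 0.9 the long sequence stays close to 10 however many steps are added, whereas with γ = 1 the same sequence would grow linearly with its length.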
This uses the recursive property of the Markov process, as follows. Note: this closely resembles the forward-backward algorithm for HMMs.
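The recursion referred to here is presumably the Bellman equation for a Markov reward process. The original formula is not reproduced in this text, so the notation below (v for the value function, G_t for the return, R for the reward, P for the transition probabilities, γ for the discount factor) is an assumption that follows common usage.

```latex
\begin{aligned}
v(s) &= \mathbb{E}\left[ G_t \mid S_t = s \right] \\
     &= \mathbb{E}\left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s \right] \\
     &= \mathbb{E}\left[ R_{t+1} + \gamma\, v(S_{t+1}) \mid S_t = s \right] \\
     &= \mathcal{R}_s + \gamma \sum_{s'} \mathcal{P}_{ss'}\, v(s')
\end{aligned}
```

The value of a state is thus the immediate reward plus the discounted, probability-weighted value of its successors.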
Using the result above, a value function is updated as follows (this assumes no discounting, i.e. γ = 1):
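A minimal sketch of such an update, assuming a small Markov reward process in which every sequence terminates so that γ = 1 is safe. The three states, their rewards and their transition probabilities are made up for illustration, and the reward is taken to be received on leaving a state.

```python
def evaluate(transitions, rewards, gamma=1.0, sweeps=100):
    """Repeatedly apply v(s) <- R(s) + gamma * sum_s' P(s, s') * v(s').

    transitions[s] maps each successor state to its probability; a state
    with no successors is terminal and keeps value R(s).
    """
    values = {state: 0.0 for state in rewards}
    for _ in range(sweeps):
        values = {
            state: rewards[state] + gamma * sum(
                prob * values[nxt] for nxt, prob in transitions[state].items())
            for state in rewards
        }
    return values


# A made-up three-state process that always terminates, so gamma = 1 is safe.
transitions = {
    "A": {"B": 0.5, "end": 0.5},
    "B": {"end": 1.0},
    "end": {},
}
rewards = {"A": -1.0, "B": -2.0, "end": 0.0}
values = evaluate(transitions, rewards)
```

Here B is worth its own reward of -2, and A is worth -1 plus half the value of B, which is also -2.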