Introduction to Reinforcement Learning
The RL Problem
State
- Environment state $S_t^e$
- Agent state $S_t^a$
- Information state (a.k.a. Markov state)
Definition: a state $S_t$ is Markov if and only if
$$\mathbb{P}\left[ S_{t+1} \mid S_t \right] = \mathbb{P}\left[ S_{t+1} \mid S_1, \dots, S_t \right]$$
Fully Observable Environments: $O_t = S_t^a = S_t^e$
Partially Observable Environments: $S_t^a \neq S_t^e$
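Under partial observability the agent cannot read $S_t^e$ directly and has to construct its own state $S_t^a$ from the history of observations. A minimal sketch, assuming a fixed-window construction (just one possible choice; the observation values are made up):

```python
from collections import deque

def make_agent_state(observations, k=4):
    """Build an agent state S_t^a as a function of the observation history.

    Here S_t^a is simply the last k observations; a recurrent network or a
    full belief state over S_t^e would be alternative constructions.
    """
    window = deque(maxlen=k)
    for obs in observations:
        window.append(obs)
    return tuple(window)

# Fully observable: O_t = S_t^e, so keeping only the latest observation suffices.
print(make_agent_state([0, 3, 7, 7, 2], k=1))   # (2,)
# Partially observable: stack recent observations to form S_t^a.
print(make_agent_state([0, 3, 7, 7, 2], k=4))   # (3, 7, 7, 2)
```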
Inside An RL Agent
- Policy: the agent's behaviour function, usually denoted $\pi$
- Value Function: evaluates how good or bad a state or action is
- Model: the agent's representation of the environment
Policy
A map from states to actions.
- Deterministic policy: $a = \pi(s)$
- Stochastic policy: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
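A minimal Python sketch of both kinds of policy (the state and action names are made up for illustration):

```python
import random

# Deterministic policy: a = pi(s), here just a lookup table.
pi_det = {"s1": "left", "s2": "right"}

def act_deterministic(s):
    return pi_det[s]

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s], one distribution per state.
pi_sto = {
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"left": 0.1, "right": 0.9},
}

def act_stochastic(s):
    actions, probs = zip(*pi_sto[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act_deterministic("s1"))   # always 'left'
print(act_stochastic("s1"))      # 'left' with prob 0.8, 'right' with prob 0.2
```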
Value Function
A prediction of future reward, used to evaluate the goodness/badness of states.
$$v_\pi(s) = \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s \right]$$
Note: $R_{t+1}$ here denotes the reward received after taking an action in state $S_t$; this differs from the other common convention of writing it as $R_t$.
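To make the expectation concrete, here is a small sketch that computes the discounted return $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$ from a sampled reward sequence and estimates $v_\pi(s)$ by averaging returns over sampled episodes (a Monte Carlo-style estimate; the reward lists and $\gamma = 0.9$ are made up):

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...

    `rewards` is the list [R_{t+1}, R_{t+2}, ...] observed after state S_t = s.
    """
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def monte_carlo_value(sampled_reward_sequences, gamma=0.9):
    """Estimate v_pi(s) as the average of sampled discounted returns from s."""
    returns = [discounted_return(seq, gamma) for seq in sampled_reward_sequences]
    return sum(returns) / len(returns)

# Two made-up reward sequences obtained by following pi from the same state s.
episodes = [[-1, -1, 10], [-1, 5, 0]]
print(monte_carlo_value(episodes))   # estimate of v_pi(s)
```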
Model
- A model predicts what the environment will do next
- $\mathcal{P}$ predicts the next state
- $\mathcal{R}$ predicts the next (immediate) reward, e.g.
$$\mathcal{P}_{ss'}^a = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s, A_t = a \right]$$
$$\mathcal{R}_s^a = \mathbb{E}\left[R_{t+1} \mid S_t = s, A_t = a \right]$$
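A hedged sketch of a tabular model that estimates $\mathcal{P}_{ss'}^a$ from transition counts and $\mathcal{R}_s^a$ from the mean observed reward (the state/action names are illustrative):

```python
from collections import defaultdict

class TabularModel:
    """Empirical model: P[s'|s,a] from transition counts, R(s,a) from mean reward."""

    def __init__(self):
        self.next_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.reward_sum = defaultdict(float)                      # (s, a) -> total reward
        self.visits = defaultdict(int)                            # (s, a) -> visit count

    def update(self, s, a, r, s_next):
        self.next_counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def P(self, s, a, s_next):
        """Estimate of P_{ss'}^a."""
        n = self.visits[(s, a)]
        return self.next_counts[(s, a)][s_next] / n if n else 0.0

    def R(self, s, a):
        """Estimate of R_s^a."""
        n = self.visits[(s, a)]
        return self.reward_sum[(s, a)] / n if n else 0.0

model = TabularModel()
model.update("s1", "right", -1.0, "s2")
model.update("s1", "right", -1.0, "s3")
print(model.P("s1", "right", "s2"), model.R("s1", "right"))   # 0.5 -1.0
```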
Problems within RL
Learning and Planning
Exploration and Exploitation
When a reasonably good solution already exists, should the agent keep exploring to gather more information about the environment, or exploit the information it already has to maximize reward? This is the trade-off between the two.
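One standard way to handle this trade-off is $\epsilon$-greedy action selection: exploit the best-known action most of the time, and explore a random one with small probability $\epsilon$. A minimal sketch with a made-up Q table:

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps explore uniformly, otherwise exploit argmax_a Q(s, a)."""
    if random.random() < eps:
        return random.choice(actions)                            # exploration
    return max(actions, key=lambda a: Q.get((s, a), 0.0))        # exploitation

Q = {("s1", "left"): 1.2, ("s1", "right"): 0.4}
print(epsilon_greedy(Q, "s1", ["left", "right"], eps=0.1))       # usually 'left'
```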
Prediction and Control
- Prediction: given a policy, evaluate how much reward the agent can obtain, i.e., estimate the future
- Control: among all possible policies, find the one that obtains the most reward, i.e., the optimal policy
The two are actually related in a progressive way: in RL, the control problem is solved by solving the prediction problem.
The following example illustrates this:
Prediction:
- Except for the jumps from A to A' and from B to B', every step yields a reward of -1
- The policy takes each of the four actions (up, down, left, right) with 25% probability
The value function computed under these rules is shown in figure (b) above.
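A sketch of how that value function can be computed: iterative policy evaluation on a gridworld under the rules above. The grid size, the coordinates of A, A', B, B', and the jump rewards (+10, +5) are assumptions borrowed from the classic lecture example; only the -1 step reward and the 25% random policy come from the notes above.

```python
import numpy as np

# 5x5 gridworld; A, A', B, B' positions and +10/+5 jump rewards are assumptions.
N, GAMMA = 5, 0.9
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(s, a):
    """One step of the environment dynamics: returns (next_state, reward)."""
    if s == A:
        return A_PRIME, 10.0       # any action in A jumps to A'
    if s == B:
        return B_PRIME, 5.0        # any action in B jumps to B'
    ns = (s[0] + a[0], s[1] + a[1])
    if not (0 <= ns[0] < N and 0 <= ns[1] < N):
        ns = s                     # bumping the wall keeps you in place
    return ns, -1.0                # every other step costs -1, as in the notes

def policy_evaluation(tol=1e-4):
    """Iterative policy evaluation for the uniform-random (25% each) policy."""
    V = np.zeros((N, N))
    while True:
        V_new = np.zeros_like(V)
        for i in range(N):
            for j in range(N):
                for a in ACTIONS:
                    ns, r = step((i, j), a)
                    V_new[i, j] += 0.25 * (r + GAMMA * V[ns])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

print(np.round(policy_evaluation(), 1))
```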
Control:
The setup is the same as above, but now the policy is not given. We need to solve the control problem by finding the optimal value function, from which the optimal policy follows naturally.
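For the control side, one way to obtain the optimal value function (and from it a greedy optimal policy) is value iteration, which replaces the 25% average with a max over actions. A sketch that reuses `N`, `GAMMA`, `ACTIONS` and `step()` from the prediction sketch above:

```python
import numpy as np

def value_iteration(tol=1e-4):
    """Bellman optimality backup: V(s) <- max_a [ R(s,a) + gamma * V(s') ]."""
    V = np.zeros((N, N))
    while True:
        V_new = np.zeros_like(V)
        for i in range(N):
            for j in range(N):
                backups = []
                for a in ACTIONS:
                    ns, r = step((i, j), a)
                    backups.append(r + GAMMA * V[ns])
                V_new[i, j] = max(backups)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    # Acting greedily with respect to the optimal V gives an optimal policy.
    policy = {}
    for i in range(N):
        for j in range(N):
            policy[(i, j)] = max(
                ACTIONS,
                key=lambda a: step((i, j), a)[1] + GAMMA * V[step((i, j), a)[0]],
            )
    return V, policy

V_star, pi_star = value_iteration()
print(np.round(V_star, 1))
```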