Lecture 1: Intro to RL


Introduction to Reinforcement Learning

The RL Problem

State

  1. Environment state $S_t^e$

  2. Agent state $S_t^a$

  3. Information state (a.k.a. Markov state)
    Definition: a state $S_t$ is Markov if and only if
    $$\mathbb{P}\left[ S_{t+1} \mid S_t \right] = \mathbb{P}\left[ S_{t+1} \mid S_1, \dots, S_t \right]$$

Fully Observable Environments: $O_t = S_t^a = S_t^e$

Partially Observable Environments: $S_t^a \neq S_t^e$

Inside An RL Agent

  1. Policy: the agent's behaviour function, usually denoted $\pi$
  2. Value Function: evaluates how good a state or an action is
  3. Model: the agent's representation of the environment

Policy

a map from state to action

  • Deterministic policy: $a = \pi(s)$
  • Stochastic policy: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
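
A minimal Python sketch of the two kinds of policy (my own illustration, not from the lecture; the toy states and actions are made up):

```python
import random

# Deterministic policy: a = pi(s), a plain mapping from state to action.
deterministic_pi = {"s0": "left", "s1": "right"}

# Stochastic policy: pi(a | s) = P[A_t = a | S_t = s], one distribution per state.
stochastic_pi = {
    "s0": {"left": 0.9, "right": 0.1},
    "s1": {"left": 0.3, "right": 0.7},
}

def act(s, stochastic=True):
    if not stochastic:
        return deterministic_pi[s]
    dist = stochastic_pi[s]
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

print(act("s0", stochastic=False))  # always "left"
print(act("s0"))                    # "left" about 90% of the time
```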

Value Function

a prediction of future reward, used to evaluate the goodness/badness of states
$$v_\pi(s) = \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s \right]$$
Note that $R_{t+1}$ here denotes the reward received after taking an action in state $S_t$, which differs from the convention of writing it as $R_t$.
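
To make the expectation concrete, here is a small sketch (my own illustration, not the lecture's code) that estimates $v_\pi(s)$ by averaging sampled discounted returns; `sample_rewards` stands in for any hypothetical function that rolls out an episode from $s$ under $\pi$ and returns the rewards $R_{t+1}, R_{t+2}, \dots$:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g

def estimate_value(sample_rewards, s, gamma=0.9, n_episodes=1000):
    """Monte Carlo estimate of v_pi(s): average the discounted return
    over many sampled episodes that start in state s and follow pi."""
    returns = [discounted_return(sample_rewards(s), gamma) for _ in range(n_episodes)]
    return sum(returns) / len(returns)
```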

Model

  • A model predicts what the environment will do next
  • $\mathcal{P}$ predicts the next state
  • $\mathcal{R}$ predicts the next (immediate) reward, e.g.

$$\mathcal{P}_{ss'}^a = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s, A_t = a \right], \qquad \mathcal{R}_s^a = \mathbb{E}\left[R_{t+1} \mid S_t = s, A_t = a \right]$$
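
A tabular sketch of these two model components (the state names and numbers are assumptions for illustration, not from the lecture):

```python
# P[(s, a)][s'] = P[S_{t+1} = s' | S_t = s, A_t = a]
# R[(s, a)]     = E[R_{t+1}      | S_t = s, A_t = a]
P = {
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
    ("s0", "left"):  {"s0": 1.0},
}
R = {
    ("s0", "right"): -1.0,
    ("s0", "left"):  -1.0,
}

def one_step_lookahead(s, a, v, gamma=0.9):
    """Use the model to predict the value of taking action a in state s:
    R_s^a + gamma * sum over s' of P_ss'^a * v(s')."""
    return R[(s, a)] + gamma * sum(p * v[s2] for s2, p in P[(s, a)].items())
```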

Problems within RL

Learning and Planning

Exploration and Exploitation

When a reasonably good solution is already known, should the agent explore to gather more information about the environment, or exploit the information it already has to maximize reward? This is the trade-off between the two.
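
A common way to manage this trade-off is an ε-greedy rule; a minimal sketch (my illustration, the lecture does not prescribe a specific method here):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """q_values maps each action to its current estimated value.
    With probability epsilon, explore (pick a random action);
    otherwise, exploit (pick the action with the highest estimate)."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # exploration
    return max(q_values, key=q_values.get)     # exploitation
```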

Prediction and Control

  • Prediction: given a policy, compute how much reward the agent will obtain, i.e. evaluate the future
  • Control: among all possible policies, determine which one obtains the most reward, i.e. find the optimal policy

In fact the two build on one another: in RL, the control problem is solved by way of solving the prediction problem.

The following example illustrates this:

Prediction:

[Figure: gridworld prediction example; panel (b) shows the resulting value function]

  • Except for the transitions from A to A' and from B to B', every step yields a reward of -1
  • The policy takes each of the four moves (up, down, left, right) with probability 25%

Based on these rules, the value function can be computed as shown in panel (b) of the figure above.
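
One standard way to compute such a value function is iterative policy evaluation: repeatedly apply the Bellman expectation backup under the fixed policy until the values stop changing. A sketch using the tabular `P`/`R` layout from the Model section above (my illustration of the prediction computation, not code from the lecture):

```python
def policy_evaluation(states, actions, P, R, pi, gamma=0.9, theta=1e-6):
    """Prediction: given a fixed policy pi (pi[s][a] = probability of taking a in s,
    e.g. 0.25 for each of the four moves), compute v_pi by sweeping over all states
    until the largest update falls below the threshold theta."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                pi[s][a] * (R[(s, a)] + gamma * sum(p * v[s2] for s2, p in P[(s, a)].items()))
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v
```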

Control:

[Figure: the same gridworld, now posed as a control problem]
The setting is the same as above, but now the policy is not given. We need to solve the control problem and find the optimal value function; the optimal policy then follows naturally from it.
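
For the control side, one option (again an assumption about method; the lecture only states the goal) is value iteration: replace the expectation over the policy with a max over actions, and read the optimal policy off the converged values:

```python
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    """Control: compute v* by backing up with a max over actions, then extract
    the greedy (optimal) policy with one more lookahead."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(
                R[(s, a)] + gamma * sum(p * v[s2] for s2, p in P[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            break
    # The optimal policy "follows naturally": act greedily with respect to v*.
    pi_star = {
        s: max(actions, key=lambda a: R[(s, a)] + gamma * sum(p * v[s2] for s2, p in P[(s, a)].items()))
        for s in states
    }
    return v, pi_star
```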
