【Book Reading 0】Reinforcement Learning: An Introduction, 2nd Edition

Preface: this kicks off Zhang Congming's reinforcement-learning book-reading series. I realize the blog already has plenty of half-finished series (like the last literature review)... but here is a new one anyway; think of these as practice notes.
Each heading gives the page number in the PDF (an "LPage" number is the page number printed at the top-left of the book, because later on I insert blank pages between pages to do the exercises lol). Book download link: CSDN resource download; the BDYP (Baidu Netdisk) link has not been uploaded yet, extraction code: 3a2p

Updated: 12/14

[Elements] Page:27/548 Date:12/3

A reinforcement learning system should have four elements:
1. policy (a mapping from perceived states of the environment to actions)
that is: environment -> states -> action
The policy can be stochastic, specifying only the probability of each action.【How are those probabilities computed? See the sketch after this list.】
2. reward signal (defines the goal of reinforcement learning; short-term = every single step)
Roughly: the reward after each action is a single number.【Build a reward function.】
The aim is to maximize the total reward.
3. a value function (long-term: what is good in the long run)
4. model of the environment (optional)
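To make the four elements concrete, here is a minimal Python sketch (my own illustration, not code from the book; every name is hypothetical). The softmax over action preferences is one common way to turn numbers into the action probabilities asked about in item 1.

```python
import numpy as np

def policy(preferences, state):
    """1. Policy: maps a perceived state to action probabilities (softmax over preferences)."""
    prefs = preferences[state]
    probs = np.exp(prefs - prefs.max())   # subtract the max for numerical stability
    return probs / probs.sum()

def reward_signal(state, action):
    """2. Reward signal: a single number handed back by the environment after each step (stub)."""
    return 0.0

def value_function(values, state):
    """3. Value function: estimated long-run return starting from this state."""
    return values[state]

def model(state, action):
    """4. (Optional) model of the environment: predicts the next state (placeholder dynamics)."""
    return state

prefs = np.zeros((3, 2))          # 3 states, 2 actions, all preferences equal
print(policy(prefs, state=0))     # -> [0.5 0.5]
```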

[Multi-armed Bandits] Page:47&48/548 Date:12/14

First, "multi-armed" here means multiple arms (one-armed → multi-armed: several one-armed bandits side by side), so each action corresponds to which arm (lever) you pull.
The chapter's first point: what sets RL apart from other learning methods is that it only evaluates the actions taken, rather than instructing which action is correct.
In the k-armed problem, each of the k actions has an expected reward; this is the value of the action.
The action selected at time step $t$ is denoted $A_t$, and the corresponding reward is $R_t$. For any action $a$, its value $q_*(a)$ is the expected reward given that $a$ is selected:
$$q_*(a) = \mathbb{E}[R_t \mid A_t = a]$$
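Since $q_*(a)$ is unknown, the chapter estimates it from observed rewards. Below is a minimal sketch of sample-average action-value estimation with ε-greedy selection (my own code, following the chapter's incremental-update idea; the Gaussian reward model and all names are assumptions):

```python
import numpy as np

def epsilon_greedy(Q, eps, rng):
    """With probability eps pick uniformly at random, otherwise a greedy action (ties broken randomly)."""
    if rng.random() < eps:
        return rng.integers(len(Q))
    return rng.choice(np.flatnonzero(Q == Q.max()))

def run_bandit(true_values, eps=0.1, steps=1000, seed=0):
    """Sample-average action-value estimates on a k-armed bandit with Gaussian rewards."""
    rng = np.random.default_rng(seed)
    k = len(true_values)
    Q = np.zeros(k)   # estimate of q_*(a)
    N = np.zeros(k)   # number of times each action has been taken
    for _ in range(steps):
        a = epsilon_greedy(Q, eps, rng)
        r = rng.normal(true_values[a], 1.0)   # reward R_t drawn around q_*(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]             # incremental sample-average update
    return Q

print(run_bandit(np.array([0.2, -0.5, 1.0, 0.0])))   # estimates should approach the true values
```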

Chapter 2 Exercise

Exercise 2.1

In ε-greedy action selection, for the case of two actions and ε = 0.5, what is the probability that the greedy action is selected?
0.75 (of which 50% are deliberate greedy choices)
Reasoning: with probability ε = 0.5 we explore, and with probability 1 − ε = 0.5 we take the current best action; since a uniform exploration step over two actions lands on the greedy one half the time as well, the greedy action is selected with total probability 0.5 + 0.5 × 0.5 = 0.75 (see the check below).
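A quick Monte Carlo check of that 0.75 figure (my own sketch; it assumes the exploration step samples uniformly over both actions, greedy one included):

```python
import numpy as np

rng = np.random.default_rng(0)
eps, n_actions, trials = 0.5, 2, 100_000
greedy_action = 0                                      # which action is greedy is irrelevant by symmetry

explore = rng.random(trials) < eps                     # True on exploration steps
random_picks = rng.integers(n_actions, size=trials)    # uniform choice used when exploring
picks = np.where(explore, random_picks, greedy_action)
print((picks == greedy_action).mean())                 # ~0.75 = (1 - eps) + eps / n_actions
```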

Exercise 2.2 & 2.3

Exercise 2.2: Bandit example. Consider a k-armed bandit problem with k = 4 actions, denoted 1, 2, 3, and 4. Consider applying to this problem a bandit algorithm using ε-greedy action selection, sample-average action-value estimates, and initial estimates of $Q_1(a) = 0$, for all a. Suppose the initial sequence of actions and rewards is $A_1 = 1, R_1 = -1, A_2 = 2, R_2 = 1, A_3 = 2, R_3 = -2, A_4 = 2, R_4 = 2, A_5 = 3, R_5 = 0$. On some of these time steps the ε case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur? On which time steps could this possibly have occurred?
Definite: t4, t5; Possible: any of t1–t5 (a random draw can also land on the greedy action)
Reason:
2.1.1
Exploration definitely occurred at time step 4: by then the sample averages are Q(1) = −1, Q(2) = (1 − 2)/2 = −0.5, Q(3) = Q(4) = 0, so the greedy set is {3, 4} and choosing action 2 cannot have been the greedy choice. It definitely occurred at time step 5 as well, where Q(2) = (1 − 2 + 2)/3 = 1/3 is the unique maximum yet action 3 was chosen.

2.1.2
At timestep 4 this definitely occurred: the average reward associated with action 2 is then $(1 - 2)/2 = -0.5$, so $Q_4(2) < Q_4(3) = Q_4(4) = 0$, and choosing $A_4 = 2$ must have been the result of an exploration step. Similarly, it must also have occurred at timestep 5, where $Q_5(2) = 1/3 > 0$ yet $A_5 = 3$ was chosen.
It might have occurred at timestep 1, depending on how the algorithm breaks ties among actions with equal $Q$ values (all are 0 there); the same holds at timestep 2, where the chosen action 2 is one of the three tied greedy actions {2, 3, 4}.
If, when one picks a random action, one chooses among all actions rather than just the ones currently considered suboptimal, then a random action could have been selected at any of the timesteps, including timestep 3 where the chosen action happened to be the greedy one.
【Question: does this hinge only on whether the chosen action's estimate is below the current maximum? When the choice looks greedy (as at t = 1–3), can exploration ever be ruled out? The check below replays the sequence to confirm which steps were forced.】
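To settle which steps were forced explorations, here is a small check (my own script, not from any of the cited solutions) that replays the exercise's action–reward sequence and prints the sample-average estimates and the greedy set before each step:

```python
import numpy as np

actions = [1, 2, 2, 2, 3]      # A_1 .. A_5 from the exercise
rewards = [-1, 1, -2, 2, 0]    # R_1 .. R_5

Q = np.zeros(4)                # Q_t(a) for actions 1..4, initialised to 0
N = np.zeros(4)                # visit counts
for t, (a, r) in enumerate(zip(actions, rewards), start=1):
    greedy = set(int(i) + 1 for i in np.flatnonzero(Q == Q.max()))
    verdict = "definitely exploratory" if a not in greedy else "greedy or exploratory"
    print(f"t={t}: Q={Q.round(2)}, greedy set={greedy}, chose {a} -> {verdict}")
    N[a - 1] += 1
    Q[a - 1] += (r - Q[a - 1]) / N[a - 1]   # sample-average update
```

The trace shows the greedy set before t = 4 is {3, 4} and before t = 5 is {2}, so exploration was forced at t = 4 and t = 5; every other step is consistent with either a greedy or a random choice.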

Exercise 2.3 In the comparison shown in Figure 2.2, which method will perform best in the long run in terms of cumulative reward and probability of selecting the best action? How much better will it be? Express your answer quantitatively.
ε = 0.01 will perform better in the long run, as it will end up choosing the optimal action about 99.1% of the time, versus 91% of the time for ε = 0.1, a difference of 8.1 percentage points.
【Question: the 91% is on page 30 (the book's top-left page number); where does the 99.1% come from?】
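Where the 99.1% comes from: in the long run, once the sample averages have converged, the greedy choice (probability 1 − ε) always hits the best action, and the exploration step (probability ε) hits it 1/k of the time. A tiny sketch of that arithmetic (my own; k = 10 as in the 10-armed testbed of Figure 2.2):

```python
def asymptotic_best_action_prob(eps, k=10):
    """Long-run probability of selecting the optimal action with eps-greedy,
    assuming the value estimates have converged to q_*(a)."""
    return (1 - eps) + eps / k

for eps in (0.1, 0.01):
    print(eps, asymptotic_best_action_prob(eps))   # 0.91 and 0.991

print(asymptotic_best_action_prob(0.01) - asymptotic_best_action_prob(0.1))  # 0.081
```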

Ref:
2.1.1 Multi-Armed Bandits
2.1.2 Sutton & Barto - Reinforcement Learning: Some Notes and Exercises
2.2 rlai-exercises
