Reinforcement Learning: An Introduction (second edition) - Chapter 9,10
Contents
Chapter 9
9.1
Show that tabular methods such as presented in Part I of this book are a special case of linear function approximation. What would the feature vectors be?
- tabular的方法就和state aggregation有点像,只不过一个状态就是一个group。即,对当前更新状态的梯度是1,其他group是0。
9.2
Why does (9.17) define \((n + 1)^k\) distinct features for dimension \(k\)?
- \(s_i\)一共有\(k\)个 (\(s_1,s_2,...,s_k\)),然后每个有\(n+1\)个选择 (\(0,1,...,n\)),所以是 \((n + 1)^k\)个。
9.3
What \(n\) and \(c_{i,j}\) produce the feature vectors \(\textbf{x}(s) = (1, s_1, s_2, s_1s_2, s^2_1,s^2_2,s_1s^2_2,s^2_1s_2,s^2_1s^2_2)^\top\)?
- \(n=2, c_{i,j} \in \{0,1,2\}\)。
9.4
Suppose we believe that one of two state dimensions is more likely to have an effect on the value function than is the other, that generalization should be primarily across this dimension rather than along it. What kind of tilings could be used to take advantage of this prior knowledge?
- 题中已经说了只有一个维度重要,那就在一个维度上有泛化性就好了。就如Figure 9.12 (middle)就行。
9.5
Suppose you are using tile coding to transform a seven-dimensional continuous state space into binary feature vectors to estimate a state value function \(\hat v(s,\textbf w) \approx v(s)\). You believe that the dimensions do not interact strongly, so you decide to use eight tilings of each dimension separately (stripe tilings), for \(7\times8 = 56\) tilings. In addition, in case there are some pairwise interactions between the dimensions, you also take all \(\binom{7}{2}=21\) pairs of dimensions and tile each pair conjunctively with rectangular tiles. You make two tilings for each pair of dimensions, making a grand total of \(21\times2 + 56 = 98\) tilings. Given these feature vectors, you suspect that you still have to average out some noise, so you decide that you want learning to be gradual, taking about 10 presentations with the same feature vector before learning nears its asymptote. What step-size parameter \(\alpha\) should you use? Why?
- 这个题有点看不懂啥意思,只能猜一下了。之前说tabular case的时候,说一个状态对应一个数,所以收集到一个该状态的experience,要只用一次就\(\alpha=1\),用十次就\(\alpha=\frac{1}{10}\)。现在就说有一个experience来了,但是这个其实会影响很多个状态,就想说如果10个experience更新一个状态的值的话,\(\alpha\)应该是多少。首先应该有个\(\frac{1}{10}\),其次每个维度有8个tilings,并且这8个tilings在7个维度上相互独立,所以一个experience对于这8个tilings就被泛化了8次。同时还有两两之间有影响的状态,一共是42个,每一个影响两个维度,所以是\(42\times 2 \div 7=12\)。所以应该是\(\alpha=\frac{1}{10}\times\frac{1}{8+12}=\frac{1}{200}\)。
Chapter 10
10.1
We have not explicitly considered or given pseudocode for any Monte Carlo methods in this chapter. What would they be like? Why is it reasonable not to give pseudocode for them? How would they perform on the Mountain Car task?
- 也可以写成蒙特卡洛方法吧,就是在更新的时候把\(G_{t:t+1}\)换成\(G_t\)。不给的原因是这个方法不好用?如果游戏一直不结束,蒙特卡洛方法一直不会更新,那策略永远不会变,游戏就永远不会结束,就死循环了?可能有这个问题吧。除非增加探索,然后限制每个episode的step数。
10.2
Give pseudocode for semi-gradient one-step Expected Sarsa for control.
- 把semi-gradient one-step Sarsa里的更新步换成Expected Sarsa的就行。即
\(\textbf{w}\leftarrow \textbf{w}+\alpha[R+\gamma\hat q(S^\prime,A^\prime,\textbf{w})-\hat q(S, A,\textbf{w})]\nabla \hat q(S^\prime,A^\prime,\textbf{w})\)
换成
\(\textbf{w}\leftarrow \textbf{w}+\alpha[R+\gamma \sum_a\pi(a|S^\prime) \hat q(S^\prime,A^\prime,\textbf{w})-\hat q(S, A,\textbf{w})]\nabla \hat q(S^\prime,A^\prime,\textbf{w})\)
10.3
Why do the results shown in Figure 10.4 have higher standard errors at large n than at small n?
- TD方法方差小,蒙特卡洛方法方差大,所以n越大,方差越大,即标准差越大。
10.4
Give pseudocode for a differential version of semi-gradient Q-learning.
- 参考dferential semi-gradient Sarsa,把
改为
\[\delta\leftarrow R-\bar R+\max_a\hat q(S^\prime,a,\textbf{w})-\hat q(S,A,\textbf{w}) \]10.5
What equations are needed (beyond 10.10) to specify the differential version of TD(0)?
- 参考dferential semi-gradient Sarsa,只需式子(10.12)
10.6
Suppose there is an MDP that under any policy produces the deterministic sequence of rewards +1, 0, +1, 0, +1, 0, . . . going on forever. Technically, this is not allowed because it violates ergodicity; there is no stationary limiting distribution \(\mu_\pi\) and the limit (10.7) does not exist. Nevertheless, the average reward (10.6) is well defined; What is it? Now consider two states in this MDP. From A, the reward sequence is exactly as described above, starting with a +1, whereas, from B, the reward sequence starts with a 0 and then continues with +1, 0, +1, 0, . . .. The di↵erential return (10.9) is not well defined for this case as the limit does not exist. To repair this, one could alternately define the value of a state as
Under this definition, what are the values of states A and B?
- 先看式子(10.7),\(\lim_{t\rightarrow\infty}E[R_t|S_0,A_{0:t-1}\sim\pi]\),因为\(R_t\)要么确定是1要么确定是0,所以\(R_t\)期望的极限不存在,这没问题。再看式子(10.6),\(\lim_{t\rightarrow\infty}\frac{1}{h}\sum_{t=1}^h E[R_t|S_0,A_{0:t-1}\sim\pi]\),因为多了一个求和再平均的操作,这个极限就有意义了,显然为\(\frac{1}{2}\)。接下来解\(v_\pi(A),v_\pi(B)\),推导简写符号。当\(h=2n+1\)为奇数,由定义有
当\(h=2n\)为偶数,由定义有
\[\begin{array}{l} v_\pi(A) =\lim_{\gamma \rightarrow 1}\lim_{n \rightarrow \infty}\sum_{t=0}^{2n}\gamma^t(E_\pi[R_{t+1}|S_0=s]-r(\pi)) \\ \\ \quad \quad \ \ \ = \lim_{\gamma \rightarrow 1}\lim_{n \rightarrow \infty}[\gamma^0(E[R_{1}]-r)+\gamma^1(E[R_{2}]-r)+\cdots + \gamma^{2n}(E[R_{2n+1}]-r)] \\ \\ \quad \quad \ \ \ = \lim_{\gamma \rightarrow 1}\lim_{n \rightarrow \infty}[\gamma^0(1-r)+\gamma^1(0-r)+\cdots + \gamma^{2n}(1-r)] \\ \\ \quad \quad \ \ \ = \lim_{\gamma \rightarrow 1}\lim_{n \rightarrow \infty}[\gamma^0+\gamma^2+\cdots \ \gamma^{2n} - (\gamma^0+\gamma^1+\cdots +\gamma^{2n})r] \\ \\ \quad \quad \ \ \ = \lim_{\gamma \rightarrow 1}\lim_{n \rightarrow \infty}[\frac{1-\gamma^{2(n+1)}}{1-\gamma^2}-\frac{1-\gamma^{2n+1}}{1-\gamma}\frac{1}{2}] \\ \\ \quad \quad \ \ \ = \lim_{\gamma \rightarrow 1}\lim_{n \rightarrow \infty}[\frac{1-\gamma-\gamma^{2n+2}+\gamma^{2n+1}}{2(1+\gamma)(1-\gamma)}] \\ \\ \quad \quad \ \ \ = \lim_{\gamma \rightarrow 1}[\frac{1}{2(1+\gamma)}] \\ \\ \quad \quad \ \ \ = \frac{1}{4} \end{array} \]即有\(v_\pi(A) = \frac{1}{4}\)。
同理有
当\(h=2n+1\)为奇数,由定义有
当\(h=2n\)为偶数,由定义有
\[\begin{array}{l} v_\pi(B) =\lim_{\gamma \rightarrow 1}\lim_{n \rightarrow \infty}\sum_{t=0}^{2n}\gamma^t(E_\pi[R_{t+1}|S_0=s]-r(\pi)) \\ \\ \quad \quad \ \ \ = \lim_{\gamma \rightarrow 1}\lim_{n \rightarrow \infty}[\gamma^0(E[R_{1}]-r)+\gamma^1(E[R_{2}]-r)+\cdots + \gamma^{2n}(E[R_{2n+1}]-r)] \\ \\ \quad \quad \ \ \ = \lim_{\gamma \rightarrow 1}\lim_{n \rightarrow \infty}[\gamma^0(0-r)+\gamma^1(1-r)+\cdots + \gamma^{2n}(0-r)] \\ \\ \quad \quad \ \ \ = \lim_{\gamma \rightarrow 1}\lim_{n \rightarrow \infty}[\gamma^1+\gamma^3+\cdots \ \gamma^{2n} - (\gamma^0+\gamma^1+\cdots +\gamma^{2n})r] \\ \\ \quad \quad \ \ \ = \lim_{\gamma \rightarrow 1}\lim_{n \rightarrow \infty}[\frac{\gamma(1-\gamma^{2n+1})}{1-\gamma^2}-\frac{1-\gamma^{2n+1}}{1-\gamma}\frac{1}{2}] \\ \\ \quad \quad \ \ \ = \lim_{\gamma \rightarrow 1}\lim_{n \rightarrow \infty}[\frac{\gamma-1+\gamma^{2n+1}-\gamma^{2n+2}}{2(1+\gamma)(1-\gamma)}] \\ \\ \quad \quad \ \ \ = \lim_{\gamma \rightarrow 1}[-\frac{1}{2(1+\gamma)}] \\ \\ \quad \quad \ \ \ = -\frac{1}{4} \end{array} \]即有\(v_\pi(B) = -\frac{1}{4}\)。
10.7
Consider a Markov reward process consisting of a ring of three states A, B, and C, with state transitions going deterministically around the ring. A reward of +1 is received upon arrival in A and otherwise the reward is 0. What are the differential values of the three states using (10.13)?
- 这里就不纠结奇偶之类的了,只写一个,h写到期望回报刚好为1的时候。并且显然有\(r=\frac{1}{3}\)。仿照Excercise 10.6,有
10.8
The pseudocode in the box on page 251 updates \(\bar R_t\) using \(\delta_t\) as an error rather than simply \(R_{t+1}-\bar R_t\). Both errors work, but using \(\delta_t\) is better. To see why, consider the ring MRP of three states from Exercise 10.7. The estimate of the average reward should tend towards its true value of \(\frac{1}{3}\). Suppose it was already there and was held stuck there. What would the sequence of \(R_{t+1}-\bar R_t\) errors be? What would the sequence of \(\delta_t\) errors be (using (10.10))? Which error sequence would produce a more stable estimate of the average reward if the estimate were allowed to change in response to the errors? Why?
- 由Excercise 10.7 有\(r=\frac{1}{3}\),则\(R_{t+1}-\bar R_t\)为\(-\frac{1}{3},-\frac{1}{3},\frac{2}{3},-\frac{1}{3},-\frac{1}{3},\frac{2}{3},\cdots\)。三个\(\delta\)分别是\(0-\frac{1}{3}+v_\pi(B)-v_\pi(A),0-\frac{1}{3}+v_\pi(C)-v_\pi(B),1-\frac{1}{3}+v_\pi(A)-v_\pi(C)=0,0,0\)。这两个序列相比较,显然\(\delta\)的序列更稳定,只要收敛了,就不会再震荡了。而\(R_{t+1}-\bar R_t\)会永远震荡,只能通过step-size逐渐变小来收敛。
10.9
In the differential semi-gradient n-step Sarsa algorithm, the step-size parameter on the average reward, \(\beta\), needs to be quite small so that \(\bar R\) becomes a good long-term estimate of the average reward. Unfortunately, \(\bar R\) will then be biased by its initial value for many steps, which may make learning inefficient. Alternatively, one could use a sample average of the observed rewards for \(\bar R\). That would initially adapt rapidly but in the long run would also adapt slowly. As the policy slowly changed, \(\bar R\) would also change; the potential for such long-term nonstationarity makes sample-average methods ill-suited. In fact, the step-size parameter on the average reward is a perfect place to use the unbiased constant-step-size trick from Exercise 2.7. Describe the specific changes needed to the boxed algorithm for differential semi-gradient n-step Sarsa to use this trick.
- Exercise 2.7 的式子为\(\beta_n = \alpha/\bar o_n, \bar o_n = \bar o_{n-1}+\alpha (1-\bar o_{n-1}), \bar o_0 =0\)。所以在differential semi-gradient n-step Sarsa algorithm中先初始化\(o =0\),在最里面的循环 \(\text{If} \ \tau \geq 0:\)里面,先\(o\leftarrow o+\beta(1-o)\)。再把\(\bar R \leftarrow R+\beta \delta\)改成\(\bar R \leftarrow R+(\beta/o) \delta\)