Contents
- Chapter 1
- Chapter 2
- Chapter 3 Finite MDPs
- Chapter 4 DP
- Chapter 5 Monte Carlo method
- Chapter 6 TD learning
- Chapter 7 n-step bootstrapping
- Chapter 8 Planning and learning
- Chapter 9 approximate solution method
- SGD and semi-gradient
- linear model
- Feature Construction for Linear Methods
- Selecting Step-Size Parameters Manually
- Nonlinear Function Approximation: Artificial Neural Networks
- Least-Squares TD
- Memory-based Function Approximation
- Kernel-based Function Approximation
- Looking Deeper at On-policy Learning: Interest and Emphasis
- Chapter 10 on-policy control with approximation
- Chapter 11 off policy methods with approximation
- Chapter 12 Eligibility Traces
- Chapter 13 Policy Gradient Methods
Chapter 1
RL is learning what to do.
Two defining features of RL: trial-and-error search and delayed reward.
MDPs capture three aspects: sensation, action, and goal.
RL considers itself distinct from both supervised learning and unsupervised learning.
A problem unique to RL: the trade-off between exploration and exploitation. (Other fields face analogous tensions, e.g. diversity vs. fidelity in image generation.)
One key feature: RL considers the whole problem, not an isolated subproblem.
Besides the agent and the environment, the main elements are:
policy: lookup table, function, or search process
reward signal: the immediate feedback that defines the goal
value function: the value of a state is the sum of all rewards to be collected over the rest of the agent's life (possibly discounted, since near-term rewards may count for more than distant ones).
Chapter 2
Learning: evaluative feedback vs instructive feedback
Supervised learning gives instructive feedback: the correct action, regardless of the action actually taken.
RL gives evaluative feedback: it tells you whether the action you took was good or bad.
Multi-armed bandits
The multi-armed bandit has only a single situation; each action constitutes an episode.
For each lever the reward is drawn from a probability distribution. \(A_t\) denotes the action taken at time t; the value of an action a is its expected reward:
\(q_*(a) = E[R_t|A_t=a]\)
Unfortunately we do not know \(q_*(a)\); we keep an estimate \(Q_t(a)\).
\(Q_t(a)\) is based on past experience, so it may be inaccurate; hence the conflict between exploration and exploitation.
Many factors affect the explore/exploit balance: the number of remaining steps, the current estimates, their uncertainty, and so on. Many methods exist to balance them, but most rely on assumptions that typically do not hold in the full RL setting.
action-value methods
estimating:
\(Q_t(a) = \frac{\sum_{i=1}^{t-1}R_i \cdot I_{A_i=a}}{\sum_{i=1}^{t-1} I_{A_i=a}}\)
\(I_{A_i=a}\) is the indicator function.
greedy action:
\(A_t= \underset{a}{argmax}\,Q_t(a)\)
\(\epsilon\)-greedy: with probability \(\epsilon\) choose an action at random; with the remaining \(1-\epsilon\) act greedily.
Incremental implementation
Computing the sample average incrementally:
\(Q_{n+1}(a) = Q_{n}(a) + \frac{1}{n}(R_{n}-Q_{n}(a))\)
\(NewEstimate=OldEstimate+StepSize[Target-OldEstimate]\)
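A minimal sketch combining \(\epsilon\)-greedy selection with the incremental sample-average update above; the Gaussian reward model, `true_values`, and all parameter values are illustrative assumptions, not from the book:

```python
import numpy as np

def epsilon_greedy_bandit(true_values, steps=1000, epsilon=0.1, seed=0):
    """Sample-average action-value estimates with epsilon-greedy selection."""
    rng = np.random.default_rng(seed)
    k = len(true_values)
    Q = np.zeros(k)          # action-value estimates Q(a)
    N = np.zeros(k)          # visit counts N(a)
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))         # explore
        else:
            a = int(np.argmax(Q))            # exploit (greedy)
        r = rng.normal(true_values[a], 1.0)  # assumed noisy reward for arm a
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]            # NewEstimate = Old + (1/n)(Target - Old)
    return Q

print(epsilon_greedy_bandit([0.2, 0.8, 0.5]))
```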
Nonstationary Problem
In a nonstationary setting, recent rewards matter more; use a constant step size:
\(Q_{n+1}(a) = Q_{n}(a) + \alpha(R_{n}-Q_{n}), \alpha\in(0,1]\)
Expanding: \(Q_{n+1} = (1-\alpha)^nQ_1+\sum_{i=1}^{n}\alpha(1-\alpha)^{n-i}R_i\)
vary step size \(\alpha_n(a)\).
Convergence conditions: \(\sum_{n=1}^{\infty}\alpha_n(a)=\infty\) and \(\sum_{n=1}^{\infty}\alpha_n(a)^2<\infty\)
A constant step size \(\alpha_n(a)=\alpha\) violates these conditions, but in a nonstationary environment convergence is not what we want anyway; the estimates should keep tracking change. Step sizes that do satisfy the conditions converge slowly and are mostly of theoretical interest.
optimistic initial values
Set the initial values \(Q_1\) optimistically high: whichever action is tried first, its reward "disappoints", and the agent moves on to actions it has not tried yet.
This is only an initial burst of exploration; it cannot cope with problems that change over time.
UCB (Upper Confidence Bound)
Spread the selections around: actions that have been chosen less often get a compensating bonus.
\(A_t = \underset{a}{argmax}[Q_t(a)+c\sqrt{\frac{\ln t}{N_t(a)}}]\)
UCB does not extend well to large state spaces or nonstationary problems.
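A small sketch of the UCB selection rule above; `Q`, `N`, the time step `t`, and `c` are assumed to be maintained by the caller:

```python
import numpy as np

def ucb_select(Q, N, t, c=2.0):
    """Pick argmax_a [Q(a) + c*sqrt(ln t / N(a))]; untried actions go first (t >= 1)."""
    untried = np.where(N == 0)[0]
    if untried.size > 0:
        return int(untried[0])
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
```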
Gradient bandit algorithms
\(H_t(a)\) preference for each action a
\(Pr\{A_t=a\} = \frac{e^{H_t(a)}}{\sum_{b=1}^{k}e^{H_t(b)}}= \pi_t{(a)}\)
After selecting \(A_t\) and receiving \(R_t\):
\(H_{t+1}(A_t) = H_t(A_t) +\alpha(R_t-\bar{R}_t)(1-\pi_t(A_t))\)
\(H_{t+1}(a) = H_t(a) -\alpha(R_t-\bar{R}_t)\pi_t(a)\) for all \(a \neq A_t\)
\(\bar{R}_t\) is the average of the rewards up to but not including \(R_t\); it serves as the baseline.
The two update rules above can be derived by stochastic gradient ascent:
each preference is moved in proportion to its effect on the expected reward,
\(H_{t+1}(a) = H_{t}(a) + \alpha\frac{\partial E[R_t]}{\partial H_t(a)}\)
For the bandit, the expected reward is the action values weighted by their selection probabilities:
\(E[R_t] = \sum_x \pi_t(x)q_*(x)\)
\(\frac{\partial E[R_t]}{\partial H_t(a)}=\frac{\partial }{\partial H_t(a)}{[\sum_x \pi_t(x)q_*(x)]}=\sum_xq_*(x)\frac{\partial \pi_t(x)}{\partial H_t(a)}=\sum_x(q_*(x)-B_t)\frac{\partial \pi_t(x)}{\partial H_t(a)}\)
\(B_t\) is a baseline that does not depend on x. Since \(\sum_x \frac{\partial \pi_t(x)}{\partial H_t(a)}=0\), adding the baseline changes nothing.
To turn the sum into an expectation, multiply and divide by \(\pi_t(x)\) and replace x with the random variable \(A_t\):
\(\frac{\partial E[R_t]}{\partial H_t(a)}=\sum_x \pi_t(x)(q_*(x)-B_t)\frac{\partial \pi_t(x)}{\partial H_t(a)}/\pi_t(x)=E[(q_*(A_t)-B_t)\frac{\partial \pi_t(A_t)}{\partial H_t(a)}/\pi_t(A_t)]\)
\(=E[(R_t-\bar{R}_t)\frac{\partial \pi_t(A_t)}{\partial H_t(a)}/\pi_t(A_t)]\)
since \(E[R_t|A_t]=q_*(A_t)\), with \(\bar{R}_t\) as the baseline
\(=E[(R_t-\bar{R}_t)\pi_t(A_t)(I_{a=A_t}-\pi_t(a))/\pi_t(A_t)]=E[(R_t-\bar{R}_t)(I_{a=A_t}-\pi_t(a))]\)
The last step uses the soft-max derivative \(\frac{\partial \pi_t(A_t)}{\partial H_t(a)}=\pi_t(A_t)(I_{a=A_t}-\pi_t(a))\). Replacing the expectation by a per-step sample update gives:
\(H_{t+1}(a) = H_t(a) +\alpha (R_t-\bar{R}_t)(I_{a=A_t}-\pi_t(a))\) for all a
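A sketch of one gradient-bandit preference update as derived above; the soft-max is computed inline and `avg_reward` is assumed to be the running baseline \(\bar{R}_t\):

```python
import numpy as np

def gradient_bandit_step(H, avg_reward, action, reward, alpha=0.1):
    """One soft-max preference update: H += alpha*(R - baseline)*(indicator - pi)."""
    pi = np.exp(H - H.max())
    pi /= pi.sum()                     # soft-max action probabilities pi_t(.)
    onehot = np.zeros_like(H)
    onehot[action] = 1.0
    H = H + alpha * (reward - avg_reward) * (onehot - pi)
    return H, pi
```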
contextual bandit
In the multi-armed bandit there is no notion of state; at each time step we simply pick an action.
In a contextual bandit, each round presents a bandit together with some information about it; that information is the state, and the action is chosen according to the state. It sits between multi-armed bandits and the full reinforcement learning problem.
Chapter 3 Finite MDPs
The Agent–Environment Interface
MDPs are a framework for learning from interaction to achieve a goal; they cover most of RL, but not all of it.
agent :learner, decision maker
environment: everything outside the agent
dynamics:
\(p(s',r|s,a)=Pr\{S_t=s',R_t=r|S_{t-1}=s,A_{t-1}=a\}\)
p defines the dynamics of the MDP: a conditional probability distribution \(p: S\times R\times S\times A\to[0,1]\).
From the four-argument function, other forms can be derived:
\(p(s'|s,a),r(s,a),r(s,a,s')\)
In general, actions can be any decisions we want to learn how to make, and the states can be anything we can know that might be useful in making them.
Goals and rewards
Goals and purposes are formalized as maximizing the expected cumulative reward.
The reward can act as a punishment as well as a reward.
The reward signal is your way of communicating to the robot what you want it to achieve, not how you want it achieved.
The reward should not encode prior knowledge about how to achieve the goal.
Returns and Episodes
return: \(G_t= R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T\), where T is the final time step
A subsequence of the interaction with the environment is an episode.
After the terminal state, the next episode begins in a way that does not depend on how the previous one ended.
Such tasks are episodic tasks.
\(S\) denotes the set of nonterminal states; \(S^+\) includes the terminal state.
continuing tasks: e.g. an on-going process-control task
These require discounting:
\(G_t= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ··· =\sum_{k=0}^{\infty }\gamma^k R_{t+k+1},\gamma\in [0,1]\)
\(\gamma\) determines whether the agent is myopic or farsighted.
\(G_t = R_{t+1}+\gamma G_{t+1}\)
\(G_t\) converges for bounded rewards, since \(\sum_{k=0}^{\infty}\gamma^k=\frac{1}{1-\gamma}\)
unified notation for episode and continuing tasks
Both kinds of task can be written in a single form.
For episodic tasks it is usually unnecessary to distinguish between episodes, so the episode index is dropped.
An episodic task can also be viewed as a continuing task that ends in an absorbing state which transitions only to itself with reward 0.
\(G_t =\sum_{k=t+1}^{T}\gamma^{k-t-1}R_k\)
\(T\) may be \(\infty\) or \(\gamma\) may be 1, but not both at once.
Policies and value functions
value function: estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state)
"how good" is defined in terms of the future rewards that can be expected
policy:\(\pi(a|s)\)
value function under a policy \(\pi\), \(v_\pi(s)=E_\pi[G_t|S_t=s]\)
action-value function:\(q_\pi(s,a)=E_\pi[G_t|S_t=s,A_t=a]\)
\(v_\pi(s)= E_\pi[R_{t+1}+\gamma G_{t+1}|S_t=s]=\sum_{a}\pi(a|s)\sum_{s'}\sum_{r}p(s',r|s,a)[r+\gamma E_\pi[G_{t+1}|S_{t+1}=s']]\)
\(=\sum_{a}\pi(a|s)\sum_{s'}\sum_{r}p(s',r|s,a)[r+\gamma v_\pi(s')]\)
\(q_\pi(s,a) = \sum_{s'}\sum_{r}p(s',r|s,a)[r+\gamma \sum_{a'}\pi(a'|s')q_\pi(s',a')]\)
optimal policies and optimal value functions
\(v_*(s) = \underset{\pi}{max} \,v_\pi(s)\)
\(q_*(s,a)=\underset{\pi}{max}q_\pi(s,a)\)
\(q_*(s,a) = E_\pi[R_{t+1}+\gamma v_*(S_{t+1})|S_t=s,A_t=a]\)
\(v_*(s)=\underset{a\in A(s)}{max}q_{\pi_*}(s,a)=\underset{a}{max} E_\pi[R_{t+1}+\gamma v_*(S_{t+1})|S_t=s,A_t=a]\)
\(=\underset{a}{max}\sum_{s',r}p(s',r|s,a)(r+\gamma v_*(s'))\)
\(q_*(s,a)=E[R_{t+1}+\gamma \underset{a'}{max}q_*(S_{t+1},a')|S_t=s,A_t=a]\)
\(=\sum_{s',r}p(s',r|s,a)(r+\gamma \underset{a'}{max}q_*(s',a'))\)
optimality and approximation
The Bellman optimality equation is solvable in principle when the model is known, but limits on state-space size and memory mean we usually have to settle for approximate solutions.
Chapter 4 DP
Finite MDPs: the sets of states, actions, and rewards are all finite.
\(v_*(s)=\underset{a}{max}\sum_{s',r}p(s',r|s,a)(r+\gamma v_*(s'))\)
\(q_*(s,a)=\sum_{s',r}p(s',r|s,a)(r+\gamma \underset{a'}{max}q_*(s',a'))\)
policy evaluation (prediction)
Given a policy \(\pi\), compute \(v_\pi\).
\(v_\pi(s)= E_\pi[R_{t+1}+\gamma G_{t+1}|S_t=s]=\sum_{a}\pi(a|s)\sum_{s'}\sum_{r}p(s',r|s,a)[r+\gamma E_\pi[G_{t+1}|S_{t+1}=s']]\)
\(=\sum_{a}\pi(a|s)\sum_{s'}\sum_{r}p(s',r|s,a)[r+\gamma v_\pi(s')]\)
If the dynamics are known, this is just a system of \(|S|\) linear equations, though it is rarely solved directly that way.
Instead, iterate: \(v_0,v_1,\cdots,v_k\)
The initial values can be arbitrary, except that the terminal state must have value 0.
\(v_{k+1}(s)=\sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma v_k(s')]\)
\(v_\pi\) is a fixed point of this update; by the contraction mapping principle the iteration converges: \(v_{k}\to v_\pi\) as \(k\to \infty\).
This is iterative policy evaluation.
A single application of the update is an expected update.
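A sketch of iterative policy evaluation as described above, assuming a hypothetical `p(s, a)` interface that enumerates `(prob, next_state, reward, done)` outcomes:

```python
def iterative_policy_evaluation(states, actions, p, pi, gamma=0.9, theta=1e-8):
    """Sweep v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a)[r + gamma*v(s')] until stable.
    pi[s][a] is the action probability; terminal successors contribute no bootstrap term."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(pi[s][a] * sum(prob * (r + gamma * V[s2] * (not done))
                                       for prob, s2, r, done in p(s, a))
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new          # in-place update: new values used within the same sweep
        if delta < theta:
            return V
```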
policy improvement
Policy improvement theorem: for two deterministic policies \(\pi\) and \(\pi'\), if
\(q_\pi(s,\pi'(s))\geq v_\pi(s)\)
for all states \(s\in S\), then
\(v_{\pi'}(s)\geq v_\pi(s)\) (the proof just expands the inequality repeatedly),
so \(\pi'\) is at least as good as \(\pi\).
How do we find such a \(\pi'\)? Act greedily with respect to \(v_\pi\):
\(\pi'(s)=\underset{a}{argmax}\sum_{s',r}p(s',r|s,a)[r+\gamma v_\pi(s')]\)
Unless \(\pi=\pi_*\), the greedy policy is a strict improvement (strict inequality somewhere).
This holds for deterministic policies and extends to stochastic policies.
policy iteration
Evaluation and improvement alternate; in practice evaluation need not run to convergence before improvement.
value iteration
Perform only one sweep of policy evaluation before each improvement:
\(v_{k+1}(s)=\underset{a}{max}\sum_{s',r}p(s',r|s,a)(r+\gamma v_k(s'))\)
This can also be read as turning the Bellman optimality equation into an update rule.
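A sketch of value iteration under the same assumed `p(s, a)` interface as the policy-evaluation sketch:

```python
def value_iteration(states, actions, p, gamma=0.9, theta=1e-8):
    """v(s) <- max_a sum_{s',r} p(s',r|s,a)[r + gamma*v(s')], then read off the greedy policy."""
    V = {s: 0.0 for s in states}

    def q(s, a):
        return sum(prob * (r + gamma * V[s2] * (not done)) for prob, s2, r, done in p(s, a))

    while True:
        delta = 0.0
        for s in states:
            best = max(q(s, a) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    policy = {s: max(actions, key=lambda a: q(s, a)) for s in states}
    return V, policy
```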
GPI (generalized policy iteration)
Evaluation and improvement can be interleaved in almost any order. Evaluation makes the value function consistent with the current policy; improvement changes the policy, making the value function inconsistent with it again.
The evaluation and improvement processes in GPI can be viewed as both competing and cooperating. They compete in the sense that they pull in opposing directions. Making the policy greedy with respect to the value function typically makes the value function incorrect for the changed policy, and making the value function consistent with the policy typically causes that policy no longer to be greedy. In the long run, however, these two processes interact to find a single joint solution: the optimal value function and an optimal policy.
bootstrapping: estimating the value of the current state from estimates of successor states
Chapter 5 Monte Carlo method
No model (no dynamics p) is needed, only experience.
actual experience: striking
simulated experience: powerful
based on sample returns, only for episodic task
MC prediction
each occurrence of state s is called a visit to s
first-visit-MC
every-visit-MC
Their theoretical properties differ,
but both converge to \(v_\pi(s)\).
MC does not bootstrap.
MC can evaluate just the states we care about.
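A sketch of first-visit MC prediction; the episode format `[(S_t, R_{t+1}), ...]` is an assumption made for illustration:

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Average the return following the first visit to each state."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    V = {}
    for episode in episodes:                   # episode: list of (S_t, R_{t+1}) pairs
        first_visit = {}
        for i, (s, _) in enumerate(episode):
            first_visit.setdefault(s, i)       # earliest index of each state
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = r + gamma * G                  # G_t = R_{t+1} + gamma * G_{t+1}
            if first_visit[s] == t:            # only update on the first visit
                returns_sum[s] += G
                returns_cnt[s] += 1
                V[s] = returns_sum[s] / returns_cnt[s]
    return V
```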
MC estimation of action values
Without a model, \(v_\pi\) alone does not tell us how to choose actions, so we must learn \(q_\pi(s,a)\).
MC needs sufficient exploration; a deterministic policy will not provide it.
One option is exploring starts: begin each episode from a randomly chosen state-action pair (not very practical; the alternative is stochastic policies).
MC control
MC is simply one way to carry out the evaluation step of GPI.
\(\pi(s) =\underset{a}{argmax}\,q(s,a)\)
\(q_{\pi_k}(s, \pi_{k+1}(s)) = q_{\pi_k}(s,\underset{a}{argmax}\,q_{\pi_k}(s,a))= \underset{a}{max}q_{\pi_k}(s,a)\geq q_{\pi_k}(s,\pi_k{(s)}) \geq v_{\pi_k}(s)\)
There is no formal proof that this converges to the optimal policy.
Monte Carlo Control without Exploring Starts
Exploring starts are unrealistic.
on policy: attempt to evaluate or improve the policy that is used to generate the data.
off policy: attempt to evaluate or improve a policy different from that used to generate the data
on-policy: the policy is generally soft: \(\pi(a|s)>0,\ \forall s\in S, a\in A(s)\)
The policy improvement theorem still goes through for soft policies.
off-policy via IS
Evaluation needs enough data to learn from, but improvement keeps making the policy greedier. If the data is generated by the policy being learned, that is on-policy learning, and to keep exploring the policy can only be approximately optimal.
The other approach uses two policies: a behavior policy used only to generate data, and a target policy that is evaluated and improved from that data. In DP, evaluation is a bootstrapped expectation; in MC, it is an average justified by the law of large numbers.
MC evaluation assumes the data were generated by the policy being evaluated.
Since the data-generating distribution is not the target distribution,
we correct for the mismatch with importance sampling.
On-policy learning can then be seen as the special case where behavior and target policies coincide.
This section considers only the prediction problem.
behavior policy \(b\)
target policy \(\pi\)
coverage: \(\pi(a|s)>0\) implies \(b(a|s)>0\)
\(Pr\{A_t,S_{t+1},A_{t+1},\cdots ,S_T \mid S_t,A_{t:T-1}\sim \pi\}\)
\(=\pi(A_t|S_t)p(S_{t+1}|S_t,A_t)\pi(A_{t+1}|S_{t+1})\cdots p(S_T|S_{T-1},A_{T-1})\)
\(=\prod_{k=t}^{T-1}\pi(A_k|S_k)p(S_{k+1}|S_k,A_k)\)
\(\rho_{t:T-1}= \frac{\prod_{k=t}^{T-1}\pi(A_k|S_k)p(S_{k+1}|S_k,A_k)}{\prod_{k=t}^{T-1}b(A_k|S_k)p(S_{k+1}|S_k,A_k)}=\prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}\)
\(E[\rho_{t:T -1}G_t | S_t=s] = v_\pi(s)\)
\(J(s)\) is the set of time steps at which state s is visited, with all episodes concatenated into one long sequence.
ordinary importance sampling: O
\(V(s)=\frac{\sum_{t\in J(s)}\rho_{t:T-1}G_t}{|J(s)|}\)
weighted importance sampling: W
\(V(s)=\frac{\sum_{t\in J(s)}\rho_{t:T-1}G_t}{\sum_{t\in J(s)}\rho_{t:T-1}}\)
first-visit:
- O: unbiased, but high (even unbounded) variance
- W: biased (the bias converges to zero), much lower variance
every-visit:
- both are biased
incremental implementation
Ordinary IS can reuse the incremental averaging of Chapter 2.
weighted IS
\(V_n=\frac{\sum_{k=1}^{n-1}W_{k}G_k}{\sum_{k=1}^{n-1}W_k},\ n\geq2\)
\(V_{n+1} = V_{n} + \frac{W_n}{C_n}(G_n-V_n),\ n\geq 1,\quad C_{n+1} = C_n+W_{n+1},\ C_0=0\)
Learning proceeds from the tail of each episode backwards, which is slow.
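A sketch of the incremental weighted-importance-sampling update above; `W` is assumed to be the importance weight attached to the current return `G`:

```python
def weighted_is_update(V, C, G, W):
    """Incremental weighted IS: C <- C + W;  V <- V + (W / C) * (G - V)."""
    C = C + W
    if C > 0:                       # nothing to do until some weight has accumulated
        V = V + (W / C) * (G - V)
    return V, C

# usage: V, C = 0.0, 0.0; then per visited (G, W): V, C = weighted_is_update(V, C, G, W)
```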
Per-decision Importance Sampling
\(\rho_{t:T-1}G_t=\rho_{t:T-1}(R_{t+1}+\gamma R_{t+2}+\cdots )\)
\(=\rho_{t:T-1}R_{t+1}+\gamma \rho_{t:T-1} R_{t+2}+\cdots\)
\(\rho_{t:T-1}R_{t+1}=\frac{\pi(A_t|S_t)\pi(A_{t+1}|S_{t+1})\cdots}{b(A_t|S_t)b(A_{t+1}|S_{t+1})\cdots}R_{t+1}\)
\(R_{t+1}\) depends only on the first ratio; the later ratios have expectation 1 and are independent of it, so
\(E[\rho_{t:T-1}R_{t+1}]=E[\rho_{t:t}R_{t+1}]\)
\(E[\rho_{t:T-1}G_t]=E[\tilde{G}_t]\)
\(\tilde{G}_t = \rho_{t:t}R_{t+1} + \gamma \rho_{t:t+1}R_{t+2} + \gamma^2 \rho_{t:t+2}R_{t+3} + \cdots\)
This is per-decision importance sampling; it reduces variance.
\(V(s) =\frac{\sum_{t\in J(s)}\tilde{G}_t}{|J(s)|}\)
Is there a per-decision version of weighted importance sampling? It is not clear.
Chapter 6 TD learning
TD sits between DP and MC: like MC it needs no model; like DP it bootstraps.
MC :\(V(s_t) = V(s_t)+ \alpha (G_t-V(s_t))\)
TD:\(V(s_t) = V(s_t)+ \alpha (R_{t+1} +\gamma V(s_{t+1})-V(s_t))\)
TD(0), one-step TD
\(v_\pi(s) = E_\pi[G_t|S_t=s] = E_\pi[R_{t+1} +\gamma G_{t+1}|S_t=s] = E_\pi[R_{t+1}+ \gamma v_\pi(S_{t+1})|S_t=s]\)
TD combines the sampling of MC with the bootstrapping of DP.
TD-error: \(\delta_t = R_{t+1}+\gamma V(S_{t+1})-V(S_t)\)
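A tabular TD(0) prediction sketch; `env_step(s, a) -> (s_next, reward, done)` and `policy(s) -> a` are assumed interfaces:

```python
from collections import defaultdict

def td0_prediction(env_step, policy, start_state, episodes=100, alpha=0.1, gamma=1.0):
    """Tabular TD(0): V(s) += alpha * (r + gamma*V(s') - V(s))."""
    V = defaultdict(float)
    for _ in range(episodes):
        s = start_state
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env_step(s, a)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])      # step size times the TD error
            s = s_next
    return V
```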
Advantages of TD(0)
Batch updating: given a fixed batch of experience, TD(0) repeatedly applies its updates over the whole batch, or MC repeatedly averages the returns, until convergence.
Under batch updating, TD(0) learns faster than MC.
Batch MC minimizes the mean-squared error on the training set,
while batch TD(0) finds the estimate that would be exactly correct for the maximum-likelihood model of the MDP: the certainty-equivalence estimate.
Sarsa (on-policy TD control) and Q-learning (off-policy TD control)
\(Q(S_t,A_t) = Q(S_t,A_t)+ \alpha [R_{t+1}+\gamma Q(S_{t+1},A_{t+1})-Q(S_t,A_t)]\)
\(Q(S_t,A_t) = Q(S_t,A_t)+ \alpha [R_{t+1}+\gamma \, \underset{a}{max} \,Q(S_{t+1},a)-Q(S_t,A_t)]\)
Expected Sarsa (lower variance; if the target policy is greedy it becomes Q-learning):
\(Q(S_t,A_t) = Q(S_t,A_t)+ \alpha [R_{t+1}+\gamma \, \underset{a}{\sum}\pi(a|S_{t+1}) \,Q(S_{t+1},a)-Q(S_t,A_t)]\)
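The three one-step control updates side by side, assuming `Q` is a `[n_states, n_actions]` NumPy array and `pi_s2` is the target policy's probability vector at `s2` (illustrative interfaces, not from the book):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    # on-policy: bootstrap from the action actually taken next
    Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])

def q_learning_update(Q, s, a, r, s2, alpha, gamma):
    # off-policy: bootstrap from the greedy action in s2
    Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])

def expected_sarsa_update(Q, s, a, r, s2, pi_s2, alpha, gamma):
    # bootstrap from the expectation over the target policy pi(.|s2)
    Q[s, a] += alpha * (r + gamma * np.dot(pi_s2, Q[s2]) - Q[s, a])
```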
Maximization bias: using a maximum over estimated values as an estimate of the maximum value introduces a positive bias.
Roughly: the return of a state-action pair is a distribution with good and bad outcomes, and its mean may be unremarkable; by always picking whichever estimate happens to look best, we are over-optimistic and overestimate its value.
The root of the problem is that the same estimates are used both to decide which action is best and to evaluate it. We should use two sets of estimates: one to select, one to evaluate.
double q-learning
\(Q_1(S_t,A_t) = Q_1(S_t,A_t)+ \alpha [R_{t+1}+\gamma \, Q_2(S_{t+1},\underset{a}{argmax} \,Q_1(S_{t+1},a))-Q_1(S_t,A_t)]\)
Most tasks are handled with action-value functions.
There are special cases such as board games, where we know exactly what the state becomes after our move, and different state-action pairs can lead to the same resulting position.
An afterstate value function evaluates the state reached after the action has been taken.
Chapter 7 n-step bootstrapping
n-step TD and n-step Sarsa
\(G_t= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ··· + \gamma^{T-t-1}R_T\)
\(G_{t:t+1} = R_{t+1} +\gamma V_t(S_{t+1})\)
\(G_{t:t+2} = R_{t+1} +\gamma R_{t+2}+\gamma^2 V_{t+1}(S_{t+2})\)
\(G_{t:t+n} = R_{t+1} +\gamma R_{t+2}+\cdots +\gamma^{n-1}R_{t+n} +\gamma^{n} V_{t+n-1}(S_{t+n})\)
\(V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha [G_{t:t+n}-V_{t+n-1}(S_t)]\)
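A sketch of tabular n-step TD prediction following the book's indexing scheme, under the same assumed `env_step`/`policy` interfaces as the TD(0) sketch:

```python
def n_step_td(env_step, policy, start_state, V, n=4, alpha=0.1, gamma=1.0, episodes=100):
    """Tabular n-step TD prediction; V is a dict updated in place."""
    for _ in range(episodes):
        S, R = [start_state], [0.0]           # R[t] holds R_t (R[0] unused)
        T, t = float('inf'), 0
        while True:
            if t < T:
                s_next, r, done = env_step(S[t], policy(S[t]))
                S.append(s_next)
                R.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1                   # time whose estimate is updated now
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * R[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V.get(S[tau + n], 0.0)
                v_old = V.get(S[tau], 0.0)
                V[S[tau]] = v_old + alpha * (G - v_old)
            if tau == T - 1:
                break
            t += 1
    return V
```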
Off-policy versions use importance sampling:
\(V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha \rho_{t:t+n-1}[G_{t:t+n} -V_{t+n-1}(S_t)]\)
For action values, the action in the first state has already been selected, so the ratio starts one step later:
\(Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1:t+n}[G_{t:t+n} - Q_{t+n-1}(S_t, A_t)]\)
Note that the importance sampling ratio here starts and ends one step later than for n-step TD (7.9). This is because here we are updating a state-action pair. We do not have to care how likely we were to select the action; now that we have selected it we want to learn fully from what happens, with importance sampling only for subsequent actions.
\(\rho_{t:h} =\prod_{k=t}^{min(h,T-1)} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}\)
When \(\rho_{t:h}\) is 0 the whole target becomes 0 (a single zero factor zeroes the product); in general the product of ratios causes large variance.
For the state-value function,
\(G_{t:h} = R_{t+1}+\gamma G_{t+1:h},\quad G_{h:h} = V_{h-1}(S_h)\)
becomes, with a control variate,
\(G_{t:h} = \rho_t(R_{t+1}+\gamma G_{t+1:h})+(1-\rho_t)V_{h-1}(S_t),\quad G_{h:h} = V_{h-1}(S_h)\)
\(\rho_t\) has expectation 1, so the extra term does not change the expected target. When \(\rho_t=0\) the target collapses to \(V_{h-1}(S_t)\) instead of to 0, so the update is zero. The term \((1-\rho_t)V_{h-1}(S_t)\) is the control variate.
For the action-value function,
\(G_{t:h} =R_{t+1}+ \gamma (\rho_{t+1} G_{t+1:h}+\bar{V}_{h-1}(S_{t+1})-\rho_{t+1}Q_{h-1}(S_{t+1},A_{t+1}))\)
\(=R_{t+1}+ \gamma \rho_{t+1} (G_{t+1:h}-Q_{h-1}(S_{t+1},A_{t+1}))+\gamma\bar{V}_{h-1}(S_{t+1})\)
\(\bar{V}_{h-1}(S_{t+1})-\rho_{t+1}Q_{h-1}(S_{t+1},A_{t+1})\) is the control variate.
Off-policy learning without importance sampling:
the tree-backup algorithm.
It resembles Expected Sarsa.
The behavior policy b can be arbitrary (even purely random).
\(G_{t:h}= R_{t+1} + \gamma \sum_a \pi(a|S_{t+1})Q_{h-1}(S_{t+1}, a)\)
\(= R_{t+1} + \gamma \sum_{a\neq A_{t+1}} \pi(a|S_{t+1})Q_{h-1}(S_{t+1}, a) + \gamma \pi(A_{t+1}|S_{t+1})G_{t+1:h}\)
Summary
n-step Sarsa
n-step Expected Sarsa
tree-backup algorithm
\(\rho\) marks where off-policy learning needs importance sampling.
Tree-backup formula:
\(G_{t:h}= R_{t+1} + \gamma \sum_{a\neq A_{t+1}} \pi(a|S_{t+1})Q_{h-1}(S_{t+1}, a) + \gamma \pi(A_{t+1}|S_{t+1})G_{t+1:h}\)
\(=R_{t+1} + \gamma \bar{V}_{h-1}(S_{t+1})-\gamma\pi(A_{t+1}|S_{t+1})Q_{h-1}(S_{t+1}, A_{t+1}) + \gamma \pi(A_{t+1}|S_{t+1})G_{t+1:h}\)
\(=R_{t+1} + \gamma \pi(A_{t+1}|S_{t+1})(G_{t+1:h}-Q_{h-1}(S_{t+1}, A_{t+1}))+ \gamma \bar{V}_{h-1}(S_{t+1})\)
n-step Sarsa formula (with control variate):
\(G_{t:h}=R_{t+1}+ \gamma \rho_{t+1} (G_{t+1:h}-Q_{h-1}(S_{t+1},A_{t+1}))+\gamma\bar{V}_{h-1}(S_{t+1})\)
\(Q(\sigma)\) formula:
\(=R_{t+1}+ \gamma (\sigma_{t+1}\rho_{t+1}+(1-\sigma_{t+1})\pi(A_{t+1}|S_{t+1})) (G_{t+1:h}-Q_{h-1}(S_{t+1},A_{t+1}))+\gamma\bar{V}_{h-1}(S_{t+1})\)
\(\sigma\) interpolates between sampling (\(\rho\)) and the expectation (\(\pi\)).
for \(t<h \leq T\). The recursion ends with \(G_{h:h}=Q_{h-1}(S_h, A_h)\) if h < T, or with \(G_{T-1:T}=R_T\) if h = T. Then we use the general (off-policy) update for n-step Sarsa (7.11). A complete algorithm is given in the box.
Chapter 8 Planning and learning
model: given a state and an action, a model produces a prediction of the resultant next state and next reward
sample model: returns one next state and reward, drawn according to their probabilities
distribution model: returns the probabilities of all next states and rewards, \(p(s',r|s,a)\)
model is used to simulate the environment and produce simulated experience.
State-space planning: search through the state space for an optimal policy.
Plan-space planning: search through the space of plans (not considered further here).
model\(\to\) simulated experience \(\to\) values \(\to\) policy
basic idea:
- improve the policy by computing value functions
- compute values from (simulated) experience via updates or backups
when model is wrong
When the model is correct, model-based methods are faster than model-free ones.
When the model is wrong, it must be corrected. Errors come in two kinds:
one, the model disagrees with the environment in the states the agent currently visits; this gets fixed once the agent experiences those states in the real environment;
two, the model is locally correct in the region it believes is optimal, but the environment has changed elsewhere so that a better policy exists; the model never finds out, and the agent is stuck in a local optimum.
prioritized sweeping
sample updates vs expected updates
Three binary dimensions give \(2^3=8\) kinds of value-function updates (one of which is not useful).
expected update
\(Q(s,a) =\sum_{s',r}p(s',r|s,a)[r + \gamma \underset{a'}{max} Q(s',a')]\)
sample updates
\(Q(s,a) = Q(s,a) + \alpha [r + \gamma \underset{a'}{max} Q(s',a') - Q(s,a)]\)
Sample updates are cheaper than expected updates and come out ahead when computation is the bottleneck.
Moreover, with sample updates the successor values Q(s',a') keep being updated along the way, so the estimates at successor states are more accurate by the time they are used.
trajectory sampling
Classical DP sweeps update every state (or state-action pair).
Alternatively, updates can be distributed by sampling from some distribution.
Sampling from the uniform distribution is essentially the same as exhaustive sweeps.
Better: the on-policy distribution, that is, the distribution observed when following the current policy.
This is trajectory sampling: simulate trajectories and update the states (or state-action pairs) encountered along them.
Real-time DP
\(v_{k+1}(s) = \underset{a}{max}\sum_{s',r}p(s',r|s,a)[r+ \gamma v_{k}(s')]\)
RTDP is an on-policy trajectory-sampling version of asynchronous value-iteration DP.
Not every state needs to be updated infinitely often; attention is restricted to a subset of states relevant to the policy.
It attends only to states actually encountered, and when the value function stabilizes, the greedy policy is optimal, unlike ordinary DP stopped early.
planning at decision time
Planning comes in two flavors.
Background planning: use data from the model to improve the value function or policy for all states, then, given the current \(S_t\), select the best action.
Decision-time planning: given \(S_t\), plan specifically to produce \(A_t\).
Heuristic Search
Focus on the current state and search deeper from it.
rollout algorithm
Monte Carlo control run from the current state on simulated trajectories.
MCTS
MCTS has four steps:
Step 1, selection: follow the tree policy (which trades off exploration and exploitation, e.g. with UCB) from the root down to a leaf node.
Step 2, expansion: expand the leaf by adding one (usually randomly chosen) child node.
Step 3, simulation: from there, use the rollout policy to simulate a trajectory to the end.
Step 4, backup: use the simulated return to update the leaf and its ancestors; the simulated trajectory beyond the tree is not stored.
When the environment moves to the next state, the relevant subtree of the current tree can be reused.
(A third dimension along which methods vary is on-policy vs off-policy.)
Chapter 9 approximate solution method
With function approximation, nudging one parameter changes the values of all states to varying degrees.
So we need a measure of error over all states:
\(\overline{VE}(w) = \sum_{s}\mu(s)[v_\pi(s)-\hat{v}(s,w)]^2\)
\(\mu(s)\) is the on-policy distribution; it weights how much we care about each state.
\(\eta(s) = h(s) + \sum_{\overline{s}} \eta(\overline{s}) \sum_{a}\pi(a|\overline{s})p(s|a,\overline{s})\)
\(\mu(s) = \frac{\eta{(s)}}{\sum_{s'}\eta{(s')}},\ \forall s\in S\)
\(\eta(s)\) is the expected time spent in s: the probability of starting in s, plus the expected number of arrivals from predecessor states.
The ideal goal is a \(w^*\) with \(\overline{VE}(w^*)\leq \overline{VE}(w)\) for all \(w\),
or, failing that, a local optimum.
SGD and semi-gradient methods
\(w_{t+1} = w_{t} -\frac{1}{2}\alpha \nabla [v_\pi(S_t)-\hat{v}(S_t,w_t)]^2\)
\(=w_t+\alpha [v_\pi(S_t)-\hat{v}(S_t,w_t)]\nabla \hat v{(S_t,w_t)}\)
(the sign flips when the squared error is differentiated; this is how the book writes it)
Gradient MC uses the return \(G_t\) as the target, which is unbiased.
Bootstrapping targets give semi-gradient methods,
because the target contains \(w_t\) but is not differentiated:
\(w_{t+1}=w_t+\alpha [R_{t+1}+\gamma \hat{v}(S_{t+1},w_t)-\hat{v}(S_t,w_t)]\nabla \hat v{(S_t,w_t)}\)
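A sketch of linear semi-gradient TD(0); `features(s)` returning a length-`d` NumPy vector, plus the `env_step`/`policy` interfaces, are assumptions made for illustration:

```python
import numpy as np

def semi_gradient_td0(env_step, policy, features, start_state, d,
                      alpha=0.01, gamma=0.99, episodes=100):
    """Linear semi-gradient TD(0): w += alpha * (r + gamma*w.x' - w.x) * x."""
    w = np.zeros(d)
    for _ in range(episodes):
        s = start_state
        done = False
        while not done:
            s2, r, done = env_step(s, policy(s))
            x, x2 = features(s), features(s2)
            target = r + (0.0 if done else gamma * np.dot(w, x2))
            w += alpha * (target - np.dot(w, x)) * x   # gradient of v_hat(s,w) is x(s)
            s = s2
    return w
```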
linear model
\(\hat{v}(s,w)=w^Tx(s)= \sum_{i=1}^{d}w_ix_i(s)\)
x(s)是feature vector
\(\nabla \hat{v}(s,w) = x(s)\)
Gradient MC converges to the global optimum of \(\overline{VE}\) in the linear case,
but TD(0) does not; it converges to the TD fixed point instead:
\(w_{t+1}=w_t + \alpha [R_{t+1}+\gamma w_{t}^Tx_{t+1}-w_t^Tx_{t} ] x_t\)
\(=w_t + \alpha [R_{t+1}x_t -x_t(x_t-\gamma x_{t+1})^Tw_t]\)
\(E(w_{t+1}|w_t) = w_t+\alpha(b-Aw_t)\)
\(b = E[R_{t+1}x_t]\)
\(A=E[x_t(x_t-\gamma x_{t+1})^T]\)
If it converges, then
\(b-Aw_{TD}=0\)
\(w_{TD}=A^{-1}b\)
This is the TD fixed point; the existence of \(A^{-1}\) can be proven.
\(\overline{VE}(w_{TD})\leq \frac{1}{1-\gamma} \underset{w}{min}\, \overline{VE}(w)\)
Since \(\gamma\) is usually close to 1, this bound can be very loose.
Feature Construction for Linear Methods
polynomials
A state can be written as \((s_1,s_2,\cdots,s_k)^T\).
x(s) can be a polynomial in the state components, which lets different dimensions of the state interact.
With order-n polynomials,
\(x_i(s)= \prod_{j=1}^{k}s_j^{c_{i,j}}\)
\(c_{i,j}\in \{0,1,\dots,n\}\)
giving \((n+1)^k\) features in total.
Infeasible when the state dimension or the order is high.
Fourier series
\(s =(s_1,s_2,\cdots,s_k)^T\)
\(x_i(s) = cos(\pi s^T c_i)\)
\(c_i=(c_1^i,c_2^i,\cdots,c_k^i)\)
\(c_j^i\in\{0,1,...,n\}\)
\(i=1,\cdots,(n+1)^k\)
Discontinuities are hard to represent: ringing appears, and very high frequencies would be needed.
Coarse Coding
Tile Coding
Suited to multi-dimensional continuous spaces; it generalizes well and is computationally cheap.
Different tiling offsets give different generalization patterns; asymmetric offsets work a bit better.
Radial Basis Functions
\(x_i(s)=\exp\big(-\frac{\|s-c_i\|^2}{2\sigma_i^2}\big)\)
An extension of coarse coding: instead of binary 0/1 features, each feature varies smoothly with the distance to one of the chosen centers \(c_i\). In practice it brings little benefit.
Selecting Step-Size Parameters Manually
Suppose you want to learn in about \(\tau\) experiences with substantially the same feature vector; a rule of thumb is
\(\alpha = (\tau\, E[x^\top x])^{-1}\)
Nonlinear Function Approximation: Artificial Neural Networks
Least-Squares TD
\(E[w_{t+1}|w_t]=w_t+\alpha (b-Aw_t)\)
\(w_{TD}=A^{-1}b\)
\(A = E[x_t(x_t-\gamma x_{t+1})^T]\)
\(b= E[R_{t+1}x_t]\)
\(\hat{A}_t = \sum_{k=0}^{t-1}x_k(x_k-\gamma x_{k+1})^T + \epsilon I\)
\(\hat{b}_t = \sum_{k=0}^{t-1}R_{k+1}x_k\)
There is no need to divide by t: the factor cancels in
\(w_t = \hat{A}_t^{-1}\hat{b}_t\)
\(\hat{A}_t^{-1} = (\hat{A}_{t-1} + x_{t-1}(x_{t-1}-\gamma x_t)^{T})^{-1}\)
\(=\hat{A}_{t-1}^{-1}- \frac{\hat{A}_{t-1}^{-1} x_{t-1} (x_{t-1}-\gamma x_t )^T \hat{A}_{t-1}^{-1}}{1+(x_{t-1}-\gamma x_t)^T \hat{A}_{t-1}^{-1}x_{t-1}}\)
Sherman-Morrison formula
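A sketch of LSTD maintaining \(\hat{A}_t^{-1}\) incrementally with the Sherman-Morrison formula, so each step is \(O(d^2)\); the class name, `epsilon` initialization (matching \(\epsilon I\) above), and interface are illustrative assumptions:

```python
import numpy as np

class LSTD:
    def __init__(self, d, gamma=0.99, epsilon=1e-3):
        self.gamma = gamma
        self.A_inv = np.eye(d) / epsilon       # inverse of the initial epsilon*I
        self.b = np.zeros(d)

    def update(self, x, reward, x_next):
        """Rank-one update of A_inv for A += x (x - gamma*x_next)^T, plus b += r*x."""
        u = x - self.gamma * x_next
        Av = self.A_inv @ x
        self.A_inv -= np.outer(Av, u @ self.A_inv) / (1.0 + u @ Av)
        self.b += reward * x

    @property
    def w(self):
        return self.A_inv @ self.b             # current weight estimate A_inv b
```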
Memory-based Function Approximation
lazy learning
Store the training examples; when a query state arrives, look up similar ones (like k-nearest neighbors).
local learning
The distance between states can be defined in many ways.
nearest neighbor: use the value of the closest stored example
weighted average: weight stored values by distance
locally weighted regression: fit a locally weighted surface around the query state
Memory-based methods can help avoid the curse of dimensionality.
Kernel-based Function Approximation
A kernel function replaces the raw distance; it measures similarity, or equivalently how generalization spreads from one state to another.
Kernel regression \(\hat{v}(s,D)=\sum_{s'\in D} k(s,s')g(s')\)
D is the stored data, and \(g(s')\) is the target value stored for \(s'\).
With \(k(s,s')=x(s)^Tx(s')\), kernel regression gives the same results as linear regression on the features x(s).
Looking Deeper at On-policy Learning: Interest and Emphasis
\(w_{t+n} = w_{t+n-1} +\alpha M_t[G_{t:t+n}-\hat{v}(S_t,w_{t+n-1})] \nabla \hat{v}(S_t,w_{t+n-1})\)
\(M_t\) is the emphasis: how much we care about the update made at time t.
\(M_t=I_t+\gamma^nM_{t-n}\)
\(I_t\) is the interest: how much we care about accurately valuing the state encountered at time t.
Chapter 10 on-policy control with approximation
First, go from TD prediction to Sarsa:
\(w_{t+1}=w_t + \alpha [U_t -\hat{q}(S_t,A_t,w_t)]\nabla \hat{q}(S_t,A_t,w_t)\)
\(w_{t+1}=w_t + \alpha [R_{t+1} +\gamma \hat{q}(S_{t+1},A_{t+1},w_t) -\hat{q}(S_t,A_t,w_t)]\nabla \hat{q}(S_t,A_t,w_t)\)
episodic semi-gradient one step sarsa
It converges in the same way as TD(0).
Policy improvement is then done (in the on-policy case treated in this chapter) by changing the estimation policy to a soft approximation of the greedy policy, such as the \(\epsilon\)-greedy policy.
\(G_{t:t+n} = R_{t+1} + \gamma R_{t+2}+ \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n},A_{t+n},w_{t+n-1})\)
\(w_{t+n}=w_{t+n-1} + \alpha [G_{t:t+n} -\hat{q}(S_t,A_t,w_{t+n-1})]\nabla \hat{q}(S_t,A_t,w_{t+n-1})\)
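A sketch of episodic semi-gradient one-step Sarsa with a linear \(\hat{q}\) and \(\epsilon\)-greedy action selection; `env_reset`, `env_step`, and `features(s, a)` are assumed interfaces:

```python
import numpy as np

def semi_gradient_sarsa(env_reset, env_step, features, n_actions, d,
                        alpha=0.01, gamma=1.0, epsilon=0.1, episodes=200, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(d)

    def q(s, a):
        return np.dot(w, features(s, a))       # linear q_hat(s, a, w)

    def eps_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(episodes):
        s = env_reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, done = env_step(s, a)
            if done:
                target = r
            else:
                a2 = eps_greedy(s2)
                target = r + gamma * q(s2, a2)
            w += alpha * (target - q(s, a)) * features(s, a)   # semi-gradient step
            if not done:
                s, a = s2, a2
    return w
```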
continuing tasks :average reward
Discounting is not well suited to tasks that never terminate.
\(r(\pi)=\underset{h\to \infty}{lim}\frac{1}{h}\sum_{t=1}^{h}E[R_t|S_0,A_{0:t-1} \sim \pi]\)
\(=\underset{t\to \infty}{lim}E[R_t|S_0,A_{0:t-1} \sim \pi]\)
\(=\sum_s \mu_\pi(s)\sum_a \pi{(a|s)} \sum_{s',r}p(s',r|s,a)r\)
\(\mu_\pi(s)=lim_{t\to\infty}Pr(S_t=s|A_{0:t-1} \sim \pi)\)
Eventually the MDP reaches a steady-state distribution, independent of the start state and of the early (burn-in) decisions.
At the steady state:
\(\sum_s\mu_\pi(s)\sum_a \pi{(a|s)} p(s'|s,a) = \mu_\pi(s')\)
In the average-reward setting, returns are defined in terms of differences between rewards and the average reward:
differential return \(G_t = R_{t+1}-r(\pi) + R_{t+2}-r(\pi)+\cdots\)
\(v_\pi, q_\pi, v_*, q_*\) defined this way still satisfy Bellman equations:
\(v_\pi(s) = \sum_{a} \pi(a|s)\sum_{s',r}p(s',r|s,a)[r-r(\pi)+v_\pi(s')]\)
\(q_\pi(s,a) = \sum_{s',r}p(s',r|s,a)[r-r(\pi)+\sum_{a'}\pi(a'|s')q_\pi(s',a')]\)
\(v_*(s) = \max_{a} \sum_{s',r}p(s',r|s,a)[r-\max_\pi r(\pi)+v_*(s')]\)
\(q_*(s,a) = \sum_{s',r}p(s',r|s,a)[r-max_\pi r(\pi)+max_{a'}q_*(s',a')]\)
TD errors:
\(\delta_t = R_{t+1}-\overline{R}_t + \hat{v}(S_{t+1},w_t)-\hat{v}(S_t,w_t)\)
\(\delta_t = R_{t+1}-\overline{R}_t + \hat{q}(S_{t+1},A_{t+1},w_t)-\hat{q}(S_t,A_t,w_t)\)
\(w_{t+1} = w_{t} + \alpha \delta_t \nabla \hat{q}(S_t,A_t,w_t)\)
In the tabular steady-state analysis, adding a discount merely scales every value to \(\frac{1}{1-\gamma}r(\pi)\) (since \(1+\gamma +\gamma^2+\cdots = \frac{1}{1-\gamma}\)), so it appears not to matter whether we discount or not. But this argument does not carry over to function approximation.
With function approximation we no longer improve the policy state by state; once the parameters change, all values change, and there is no way to prove an overall improvement.
As the book puts it:
The root cause of the difficulties with the discounted control setting is that with function approximation we have lost the policy improvement theorem (Section 4.2). It is no longer true that if we change the policy to improve the discounted value of one state then we are guaranteed to have improved the overall policy in any useful sense.
We are missing the theoretical support of policy improvement.
Chapter 11 off policy methods with approximation
The difficulties of off-policy learning:
1. the target of the update:
the data used to form the target policy's update targets is generated by the behavior policy;
2. the distribution of the updates:
the updates are distributed according to the behavior policy's state distribution.
Something more is needed for the second part of the challenge of off-policy learning with function approximation, because the distribution of updates in the off-policy case is not according to the on-policy distribution. The on-policy distribution is important to the stability of semi-gradient methods.
Semi-gradient Methods
\(\rho_t = \rho_{t:t}=\frac{\pi(A_t|S_t)}{b(A_t|S_t)}\)
\(w_{t+1} = w_t + \alpha \rho_t \delta_t \nabla \hat{v}(S_t,w_t)\)
\(\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1},w_t)-\hat{v}(S_t,w_t)\) episodic, discounted
\(\delta_t = R_{t+1} -\overline{R}+ \hat{v}(S_{t+1},w_t)-\hat{v}(S_t,w_t)\),continuing, undiscounted
action-value
semi-gradient Expected Sarsa:
It only uses \((S_t,A_t,R_{t+1})\) and an expectation over the target policy's actions, so there should be no action-distribution issue in the one-step case (no importance-sampling ratio is needed):
\(w_{t+1} = w_t + \alpha \delta_t \nabla \hat{q}(S_t,A_t,w_t)\)
\(\delta_t = R_{t+1} + \gamma \sum_a \pi(a|S_{t+1})\hat{q}(S_{t+1},a,w_t)-\hat{q}(S_t,A_t,w_t)\)
\(\delta_t = R_{t+1} -\overline{R} + \sum_a \pi(a|S_{t+1})\hat{q}(S_{t+1},a,w_t)-\hat{q}(S_t,A_t,w_t)\)
Multi-step versions involve importance sampling:
\(w_{t+n} =w_{t+n-1}+ \alpha \rho_{t+1} \cdots \rho_{t+n-1}[G_{t:t+n}-\hat{q}(S_t,A_t,w_{t+n-1})]\nabla \hat{q}(S_t,A_t,w_{t+n-1})\)
\(G_{t:t+n} = R_{t+1}+\cdots +\gamma^{n-1}R_{t+n}+ \gamma^n \hat{q}(S_{t+n},A_{t+n},w_{t+n-1})\)
\(G_{t:t+n} = R_{t+1}-\overline{R}_t+\cdots +R_{t+n}-\overline{R}_{t+n-1}+ \hat{q}(S_{t+n},A_{t+n},w_{t+n-1})\)
n-step tree-backup
\(w_{t+n} =w_{t+n-1}+ \alpha [G_{t:t+n}-\hat{q}(S_t,A_t,w_{t+n-1})]\nabla \hat{q}(S_t,A_t,w_{t+n-1})\)
\(G_{t:t+n}=\hat{q}(S_t,A_t,w_{t-1})+\sum_{k=t}^{t+n-1}\delta_{k}\prod_{i=t+1}^{k}\gamma \pi(A_i|S_i)\)
TODO
The second difficulty (instability) of off-policy function approximation: under on-policy training, mistaken predictions are revisited and corrected later; off-policy, because the behavior distribution differs from the target policy's distribution (e.g. \(\rho=0\) on many transitions), a wrongly estimated target may never be corrected.
the deadly triad
Function approximation
Bootstrapping
Off-policy training
Put the three together and training can become unstable.
Linear value-function geometry
Suppose the state space is \(\{s_0,s_1,s_2\}\), three states in total; then every value function is a point in three-dimensional space.
Suppose we use a linear model with two parameters \(w_1,w_2\); such a model can represent only a plane in that space. For a policy \(\pi\), its \(v_\pi\) most likely lies off the plane, and the best representable solution is the projection \(\Pi v_\pi\) of \(v_\pi\) onto the plane.
Define the norm \(\|v\|^2_\mu =\sum_{s\in S}\mu(s)v(s)^2\), where \(\mu(s)\) weights how much we care about each state.
\(\overline{VE}(w) = \|{v_w}-v_\pi\|^2_\mu\)
\(\Pi v =v_w\), where \(w=\underset{w\in \mathbb{R}^d}{argmin}\|v-v_w\|^2_\mu\)
\(\Pi = X(X^TDX)^{-1}X^TD\), where \(D\) is the \(|S|\times|S|\) diagonal matrix with \(\mu(s)\) on the diagonal, and \(X\) is the \(|S|\times d\) feature matrix.
\(|v|_\mu^2=v^TDv\)
\(v_w=Xw\)
\(v_\pi(s) = \sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)(r+\gamma v_\pi(s'))\)
Bellman error at state s
\(\overline{\delta}_w(s) = \Big[\sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)(r+\gamma v_w(s'))\Big]-v_w(s)\)
\(=E_\pi[R_{t+1}+\gamma v_w(S_{t+1})-v_w(S_t)\,|\,S_t=s]\)
The Mean Squared Bellman Error measures this error over all states:
\(\overline{BE}(w) = |\overline{\delta}_w|^2_\mu\)
Bellman operator\(B_\pi\)
\((B_\pi v)(s) = \sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)(r+\gamma v(s'))\)
\(\overline{\delta}_w = B_\pi v_w-v_w\)
In DP (without function approximation), repeated application of \(B_\pi\) converges to the fixed point \(v_\pi = B_\pi v_\pi\).
In the book's figure this is the gray line of repeated applications arriving at \(v_\pi\).
With function approximation, anything off the representable plane cannot be expressed: \(B_\pi v_w\) generally leaves the plane, and only its projection is representable.
\(\overline{PBE}(w) = \|\Pi \overline{\delta}_w\|^2_\mu\)
In the linear case \(\overline{PBE}\) can be made exactly 0; the solution is the TD fixed point \(w_{TD}\).
Gradient Descent in the bellman error
The TD error (it does not have to be defined exactly this way):
\(\delta_\theta(s,a,s')=R(s,a,s')+\gamma v_\theta(s')-v_\theta(s)\)
The advantage:
\(A_\theta(s,a)=E_{s'\sim P}[\delta_\theta(s,a,s')]=E_{s'\sim P}[R(s,a,s')+\gamma v_\theta(s')]-v_\theta(s)\)
The Bellman error:
\(\epsilon_\theta(s)=E_{a\sim\pi}[A_\theta(s,a)]=E_{a\sim\pi,s'\sim P}[R(s,a,s')+\gamma v_\theta(s')]-v_\theta(s)\)
SGD just needs an objective to minimize.
First consider the TD error:
\(\delta_t=R_{t+1} + \gamma \hat{v}(S_{t+1},w_t)- \hat{v}(S_t,w_t)\)
\(\overline{TDE}(w) =\sum_{s}\mu(s)E[\delta^2_t|S_t=s,A_t\sim \pi]\)
\(\overline{TDE}(w) =\sum_{s}\mu(s)E[\rho_t \delta^2_t|S_t=s,A_t\sim b]\)
\(=E_b[\rho_t\delta_t^2]\), assuming \(\mu\) is the state distribution under b (then no extra weighting is needed).
\(w_{t+1} = w_t -\frac{1}{2}\alpha \nabla(\rho_t\delta_t^2)\)
\(=w_t-\alpha\rho_t \delta_t \nabla \delta_t\)
\(=w_t +\alpha \rho_t \delta_t (\nabla\hat{v}(S_t,w_t)-\gamma \nabla\hat{v}(S_{t+1},w_t))\)
This is the naive residual-gradient algorithm: it converges, but not to where we want.
The true value function does not minimize the TDE; other value functions can have a smaller TDE.
The Bellman error for a state is the expected TD error in that state
\(w_{t+1} = w_t -\frac{1}{2}\alpha \nabla(E_\pi[\delta_t]^2)\)
\(= w_t -\frac{1}{2}\alpha \nabla(E_b[\rho_t\delta_t]^2)\)
\(= w_t -\alpha\, E_b[\rho_t\delta_t]\,\nabla E_b[\rho_t\delta_t]\)
\(= w_t -\alpha\, E_b[\rho_t(R_{t+1}+\gamma \hat{v}(S_{t+1},w) -\hat{v}(S_t,w))]\,E_b[\rho_t\nabla \delta_t]\)
\(= w_t +\alpha\, [E_b[\rho_t(R_{t+1}+\gamma \hat{v}(S_{t+1},w))]-\hat{v}(S_t,w)]\,[\nabla \hat{v}(S_t,w)- \gamma E_b[\rho_t\nabla \hat{v}(S_{t+1},w)]]\)
This is the residual-gradient algorithm.
\(S_{t+1}\) appears in both expectations; for an unbiased gradient the two samples of \(S_{t+1}\) must be independent, which requires a deterministic environment or a simulator that can regenerate the next state.
This algorithm also converges,
but it is unsatisfying in three ways:
1. it is too slow;
2. it still converges to the wrong values: the BE is just the expectation of the TD error, and in deterministic environments the result is the same as the naive version;
3. the BE is not learnable: "learnable" traditionally means solvable in polynomial time, but here it means learnable from experience data at all; some quantities cannot be determined from data even though they could be computed given the environment's internal structure.
Chapter 12 Eligibility Traces
The λ-return
\(G_{t:t+n}=R_{t+1}+ \gamma R_{t+2} + \cdots +\gamma^{n-1}R_{t+n} +\gamma ^n \hat{v}(S_{t+n},w_{t+n-1}),\ 0\leq t\leq T-n\)
\(G_t^\lambda = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_{t:t+n}\)
\(G_t^\lambda = (1-\lambda)\sum_{n=1}^{T-t-1}\lambda^{n-1}G_{t:t+n} +\lambda^{T-t-1}G_t\)
The offline λ-return algorithm:
\(w_{t+1} = w_{t} + \alpha [G_t^\lambda -\hat{v}(S_t,w_t)]\nabla\hat{v}(S_t,w_t)\)
TD(\(\lambda\))
The forward view can only be computed after the episode ends.
The backward view can be computed online and works for continuing tasks.
The vector z is the eligibility trace: it accumulates a fading record of how much each weight component contributed to recent state valuations, so each update also credits what came before.
TD(λ) is another way of unifying TD and MC.
With λ = 0 it is exactly TD(0).
With λ = 1 it is TD(1), a form of MC with decaying credit assignment.
TD(1) is more general than MC.
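A sketch of semi-gradient TD(λ) with accumulating traces, reusing the assumed `env_step`/`policy`/`features` interfaces from the earlier sketches:

```python
import numpy as np

def semi_gradient_td_lambda(env_step, policy, features, start_state, d,
                            lam=0.9, alpha=0.01, gamma=0.99, episodes=100):
    """z <- gamma*lambda*z + grad v_hat;  w <- w + alpha*delta*z."""
    w = np.zeros(d)
    for _ in range(episodes):
        s = start_state
        z = np.zeros(d)                        # eligibility trace, reset each episode
        done = False
        while not done:
            s2, r, done = env_step(s, policy(s))
            x, x2 = features(s), features(s2)
            z = gamma * lam * z + x            # accumulate the gradient x(s)
            v_next = 0.0 if done else np.dot(w, x2)
            delta = r + gamma * v_next - np.dot(w, x)
            w += alpha * delta * z
            s = s2
    return w
```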
n-step Truncated \(\lambda\)-return Methods
\(G_{t:h}^\lambda = (1-\lambda)\sum_{n=1}^{h-t-1}\lambda^{n-1}G_{t:t+n} +\lambda^{h-t-1}G_{t:h},0\leq t\leq h\leq T\)
\(w_{t+n} = w_{t+n-1} + \alpha [G^\lambda_{t:t+n} -\hat{v}(S_t,w_{t+n-1})]\nabla\hat{v}(S_t,w_{t+n-1})\)
The k-step λ-return can be computed iteratively:
\(G_{t:t+k}^\lambda =\hat{v}(S_t,w_{t-1})+ \sum_{i=t}^{t+k-1}(\gamma\lambda )^{i-t}\delta_i'\)
\(\delta'_t=R_{t+1}+\gamma \hat{v}(S_{t+1},w_t)-\hat{v}(S_t,w_{t-1})\)
The remaining sections are not covered in these notes.
Chapter 13 Policy Gradient Methods
\(\pi(a|s,\theta)\)
REINFORCE with baseline is not an actor-critic algorithm, because its value function is not used for bootstrapping.
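A sketch of REINFORCE with a learned state-value baseline, using a linear soft-max policy; `run_episode` and `features` are assumed interfaces, and the \(\gamma^t\) factor from the book's pseudocode is omitted (it equals 1 when \(\gamma=1\)):

```python
import numpy as np

def reinforce_with_baseline(run_episode, features, n_actions, d,
                            alpha_theta=1e-3, alpha_w=1e-2, gamma=1.0, episodes=500, seed=0):
    """Monte Carlo policy gradient; the baseline is learned but never bootstrapped."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_actions, d))          # policy parameters (one row per action)
    w = np.zeros(d)                           # baseline (state-value) weights

    def pi(s):
        prefs = theta @ features(s)
        prefs -= prefs.max()                  # numerical stability
        p = np.exp(prefs)
        return p / p.sum()

    def select_action(s):
        return int(rng.choice(n_actions, p=pi(s)))

    for _ in range(episodes):
        episode = run_episode(select_action)  # list of (s, a, r) under the current policy
        G = 0.0
        for s, a, r in reversed(episode):     # returns accumulated backwards
            G = r + gamma * G
            x = features(s)
            delta = G - np.dot(w, x)          # return minus baseline (no bootstrapping)
            w += alpha_w * delta * x
            probs = pi(s)
            grad_log = -np.outer(probs, x)    # d log pi(a|s) / d theta for all rows
            grad_log[a] += x
            theta += alpha_theta * delta * grad_log
    return theta, w
```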