Reinforcement Learning

1. Model Free

1.1 Monte Carlo

1.1.1 Value Iteration

Monte Carlo control (first-visit, ε-greedy)

1. Current Q -> ε-greedy policy
2. Sample trajectories $(s_1, a_1, r_1, s_2, a_2, r_2, \dots)$, first-visit MC
3. Update $Q(s,a) = \frac{1}{N(s,a)} \sum_{i} G_i(s,a)$, where $G_i(s,a)$ is the return following the first visit to $(s,a)$ in episode $i$
4. Improve the policy based on the updated Q values (see the sketch below)
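A minimal tabular sketch of these four steps. It assumes a Gym-style environment interface (`env.reset()` returns a state, `env.step(a)` returns `(next_state, reward, done)`); `n_actions`, `episodes`, `gamma`, and `eps` are illustrative parameters, not part of the notes above.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, n_actions, eps=0.1):
    """Pick a greedy action w.p. 1 - eps, otherwise a random action."""
    if random.random() < eps:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

def mc_control(env, n_actions, episodes=5000, gamma=0.99, eps=0.1):
    """First-visit Monte Carlo control with an epsilon-greedy policy.

    Assumed interface: env.reset() -> state,
    env.step(a) -> (next_state, reward, done).
    """
    Q = defaultdict(float)   # running mean of returns per (s, a)
    N = defaultdict(int)     # visit counts N(s, a)

    for _ in range(episodes):
        # Steps 1 + 2: act epsilon-greedily w.r.t. the current Q, record the episode.
        episode, state, done = [], env.reset(), False
        while not done:
            action = epsilon_greedy(Q, state, n_actions, eps)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Step 3: first-visit update, Q(s,a) = (1/N(s,a)) * sum_i G_i(s,a).
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = r + gamma * G                              # return following time t
            if first_visit[(s, a)] == t:                   # first visit only
                N[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]   # incremental mean
        # Step 4: policy improvement is implicit; the next episode acts
        # epsilon-greedily w.r.t. the updated Q.
    return Q
```

Running `mc_control` on any small discrete environment with that interface yields a Q table, and the ε-greedy policy it induces is the learned controller.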

1.1.2 Policy Iteration

1.1.3 Policy Gradient
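The Monte Carlo form of policy gradient is REINFORCE, which scales $\nabla \log \pi(a_t \mid s_t)$ by the full episode return $G_t$. A PyTorch sketch under the same assumed Gym-style env interface as above; the categorical policy, network shape, and learning rate below are illustrative choices, not from the notes.

```python
import torch
import torch.nn as nn

def reinforce_episode(env, policy, optimizer, gamma=0.99):
    """One REINFORCE update: run an episode, then ascend sum_t G_t * grad log pi(a_t|s_t).

    Assumed interface: env.reset() -> state (list of floats),
    env.step(a) -> (next_state, reward, done); `policy` maps a state tensor to action logits.
    """
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        logits = policy(torch.tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done = env.step(action.item())
        rewards.append(reward)

    # Monte Carlo returns G_t, computed backwards from the end of the episode.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))

    # Gradient ascent on expected return == descent on the negated objective.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)
```

For example, with a 4-dimensional state and 2 actions, `policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))` and `optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)` would be a plausible setup.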

1.2 TD

1.2.1 Value Iteration

(can be done in a non-episodic environment)

SARSA (on-policy)

  1. Non-episodic setting; needs the tuple $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$
  2. Update $Q(s_t, a_t) = Q(s_t, a_t) + \alpha\left(r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right)$
  3. Improve the policy based on the updated Q value (see the sketch below)
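A tabular SARSA sketch under the same assumed Gym-style env interface; `alpha`, `gamma`, and `eps` are illustrative hyper-parameters.

```python
import random
from collections import defaultdict

def sarsa(env, n_actions, episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular SARSA (on-policy TD control).

    Assumed interface: env.reset() -> state,
    env.step(a) -> (next_state, reward, done).
    """
    Q = defaultdict(float)

    def act(state):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state, done = env.reset(), False
        action = act(state)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = act(next_state)
            # TD target uses the action actually taken next -> on-policy.
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```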

Q-learning (off-policy)
1. Non-episodic setting; needs the tuple $(s_t, a_t, r_t, s_{t+1})$
2. Update $Q(s_t, a_t) = Q(s_t, a_t) + \alpha\left(r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right)$ (see the sketch below)
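The same loop with a max over $a'$ in the target gives tabular Q-learning; the target is independent of the action the behaviour policy actually takes next, which is what makes it off-policy. Same assumed env interface and illustrative hyper-parameters as the SARSA sketch.

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning (off-policy TD control)."""
    Q = defaultdict(float)

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Behaviour policy: epsilon-greedy.
            if random.random() < eps:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Target policy: greedy (max over a'), regardless of the next action taken.
            best_next = max(Q[(next_state, a)] for a in range(n_actions))
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```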

DQN
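DQN replaces the Q table with a neural network trained on minibatches sampled from an experience replay buffer, with a periodically re-synced target network providing the bootstrap target. A minimal PyTorch sketch of one update step; the layer sizes, state/action dimensions, buffer size, and learning rate are illustrative assumptions.

```python
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99   # hypothetical sizes for illustration

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())   # re-sync periodically during training
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                    # replay buffer of (s, a, r, s', done) tuples

def dqn_update(batch_size=32):
    """One gradient step on a minibatch sampled from the replay buffer."""
    if len(replay) < batch_size:
        return
    s, a, r, s2, done = zip(*random.sample(replay, batch_size))
    s = torch.tensor(s, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64)
    r = torch.tensor(r, dtype=torch.float32)
    s2 = torch.tensor(s2, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)

    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s_t, a_t)
    with torch.no_grad():                                    # frozen target network
        target = r + GAMMA * (1.0 - done) * target_net(s2).max(1).values
    loss = nn.functional.mse_loss(q, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```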

1.2.2 Policy Gradient

link

2. Model Based
