1. Model-Free
1.1 Monte Carlo
1.1.1 Value Iteration
1. from the current Q, derive an ε-greedy policy
2. sample trajectories $(s_1, a_1, r_1, s_2, a_2, r_2, \dots)$ under that policy, using first-visit MC
3. update $Q(s,a) = \frac{1}{N(s,a)}\sum_{i} G_i^t(s,a)$
4. Improve the policy based on the updated Q value (make it ε-greedy with respect to the new Q); a sketch of the full loop follows this list.
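A minimal sketch of first-visit Monte Carlo control with an ε-greedy policy. The `env`, `actions`, `episodes`, `gamma`, and `epsilon` names are illustrative assumptions (a gym-style `reset()` / `step()` interface), not something specified in these notes.

```python
import random
from collections import defaultdict

def mc_control(env, actions, episodes=5000, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)   # Q(s, a) estimates
    N = defaultdict(int)     # first-visit counts N(s, a)

    def policy(s):
        # epsilon-greedy with respect to the current Q (step 1 / step 4)
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        # step 2: sample a full trajectory with the current policy
        s, done, traj = env.reset(), False, []
        while not done:
            a = policy(s)
            s_next, r, done, _ = env.step(a)
            traj.append((s, a, r))
            s = s_next

        # step 3: compute first-visit returns and average them into Q
        G, first_return = 0.0, {}
        for s, a, r in reversed(traj):
            G = r + gamma * G
            first_return[(s, a)] = G   # overwritten until only the first visit's return remains
        for sa, g in first_return.items():
            N[sa] += 1
            Q[sa] += (g - Q[sa]) / N[sa]   # incremental mean = (1/N) * sum of returns
        # step 4: improvement is implicit, since policy() always reads the latest Q
    return Q
```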
1.1.2 Policy Iteration
1.1.3 Policy Gradient
1.2 TD
1.2.1 Value Iteration
TD can be used in non-episodic (continuing) environments.
SARSA (on-policy)
1. non-episodic setting; needs the tuple $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$
2. update $Q(s_t,a_t) = Q(s_t,a_t) + \alpha\big(r_t + \gamma\, Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\big)$
3. Improve the policy based on the updated Q value (see the sketch below).
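A minimal sketch of tabular SARSA under the same assumed gym-style `env` / `actions` interface as above (these names are illustrative, not from the notes). The update bootstraps from the action actually taken next, which is what makes it on-policy.

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)

    def policy(s):
        # epsilon-greedy behaviour policy derived from the current Q
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        a = policy(s)
        while not done:
            s_next, r, done, _ = env.step(a)
            a_next = policy(s_next)  # the action that will actually be taken -> on-policy
            # TD target: r_t + gamma * Q(s_{t+1}, a_{t+1}) (no bootstrap on terminal states)
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```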
Q-learning (off-policy)
1. non-episodic setting; needs the tuple $(s_t, a_t, r_t, s_{t+1})$
2. update $Q(s_t,a_t) = Q(s_t,a_t) + \alpha\big(r_t + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t)\big)$, as in the sketch below.
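A minimal sketch of tabular Q-learning under the same assumed gym-style `env` / `actions` interface (illustrative names). The behaviour policy is ε-greedy, but the update bootstraps from $\max_{a'} Q(s_{t+1}, a')$ regardless of which action is taken next, which is what makes it off-policy.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # behaviour policy: epsilon-greedy with respect to the current Q
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done, _ = env.step(a)
            # TD target uses the greedy value max_a' Q(s_{t+1}, a'), not the next action taken
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```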
DQN