[Reinforcement Learning] Notes on the CUHK RL Course Assignments 01_2
Course resources
- Course homepage: https://cuhkrlcourse.github.io/
- Lecture videos (Bilibili): https://space.bilibili.com/511221970/channel/seriesdetail?sid=764099
- Reference material (EasyRL): https://datawhalechina.github.io/easy-rl/#/
- Reinforcement Learning: An Introduction:https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf
- GitHub (assignment repo): https://github.com/cuhkrlcourse/ierg5350-assignment-2021
- Gitee (my solutions): https://gitee.com/cstern-liao/cuhk_rl_assignment
2 Model-Based Tabular Methods
Model-based vs. Model-free
Agents can be divided into model-based and model-free according to whether they build a model of the real world. A model-based agent models the real world as a virtual one: with the state-transition function $P(s_{t+1} \mid s_t, a_t)$ and the reward function $R(s_t, a_t)$ it can predict which state it will move to and what reward it will receive after taking a given action in a given state, so it can learn a policy or a value function to maximize reward directly from this model. For most real-world problems, however, we cannot observe every element of the environment; the transition and reward functions are unknown to us, **and we have to fall back on model-free learning.** A model-free agent does not model the environment at all: it can only execute actions in the real environment under some policy, wait for the resulting rewards and state transitions, and use this feedback to update its behavior policy, iterating until it learns the optimal policy.
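As a minimal sketch of the difference (assuming the pre-0.26 gym API that the 2021 assignment uses, where reset() returns an observation and step() returns a 4-tuple):

```python
import gym

env = gym.make("FrozenLake8x8-v1")

# Model-based: the dynamics are exposed to the agent, so it can plan without
# acting at all. env.env.P[s][a] lists every (prob, next_state, reward, done).
transitions = env.env.P[0][1]

# Model-free: no access to P or R; the agent can only act and observe samples.
obs = env.reset()
next_obs, reward, done, info = env.step(env.action_space.sample())
```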
This part of the assignment uses the model-based tabular approach.
In Section 2, the notebook provides the definition of the parent class TabularRLTrainerAbstract:
```python
# Run this cell without modification

class TabularRLTrainerAbstract:
    """This is the abstract class for tabular RL trainers. We will inherit the specific
    algorithm's trainer from this abstract class, so that we can reuse code like
    getting the dynamics of the environment (self._get_transitions()) or rendering the
    learned policy (self.render())."""

    def __init__(self, env_name='FrozenLake8x8-v1', model_based=True):
        self.env_name = env_name
        self.env = gym.make(self.env_name)
        self.action_dim = self.env.action_space.n
        self.obs_dim = self.env.observation_space.n

        self.model_based = model_based

    def _get_transitions(self, state, act):
        """Query the environment to get the transition probability,
        reward, the next state, and done given a pair of state and action.
        We implement this function for you. But you need to know the
        return format of this function.
        """
        self._check_env_name()
        assert self.model_based, "You should not use _get_transitions in " \
                                 "model-free algorithm!"

        # call the internal attribute of the environment.
        # `transitions` is a list containing all possible next states and the
        # probability, reward, and termination indicator corresponding to them
        transitions = self.env.env.P[state][act]

        # Given a certain state and action pair, it is possible
        # to find there exist multiple transitions, since the
        # environment is not deterministic.
        # You need to know the return format of this function: a list of dicts
        ret = []
        for prob, next_state, reward, done in transitions:
            ret.append({
                "prob": prob,
                "next_state": next_state,
                "reward": reward,
                "done": done
            })
        return ret

    def _check_env_name(self):
        assert self.env_name.startswith('FrozenLake')

    def print_table(self):
        """print beautiful table, only works for the FrozenLake8X8-v0 env. We
        write this function for you."""
        self._check_env_name()
        print_table(self.table)

    def train(self):
        """Conduct one iteration of learning."""
        raise NotImplementedError("You need to override the "
                                  "Trainer.train() function.")

    def evaluate(self):
        """Use the function you write to evaluate the current policy.
        Return the mean episode reward of 1000 episodes when seed=0."""
        result = evaluate(self.policy, 1000, env_name=self.env_name)
        return result

    def render(self):
        """Reuse your evaluate function, render the current policy
        for one episode when seed=0"""
        evaluate(self.policy, 1, render=True, env_name=self.env_name)
```
This is an abstract class: its train() method simply raises NotImplementedError and must be overridden. In Sections 2.1 and 2.2 we will subclass it and override train() to implement policy iteration and value iteration, respectively. The method worth looking at closely is _get_transitions(self, state, act), whose return format is illustrated below.
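On the slippery FrozenLake a chosen move only succeeds with some probability, so a single (state, action) pair usually has several possible outcomes. A call such as `trainer._get_transitions(state=0, act=DOWN)` returns a list of dicts shaped like the following (the numbers below are illustrative, not copied from the environment):

```python
# Illustrative return value of _get_transitions on the slippery FrozenLake8x8:
# one dict per possible outcome of taking DOWN in the top-left corner state.
[
    {"prob": 1 / 3, "next_state": 0, "reward": 0.0, "done": False},  # slipped left into the wall
    {"prob": 1 / 3, "next_state": 8, "reward": 0.0, "done": False},  # moved down as intended
    {"prob": 1 / 3, "next_state": 1, "reward": 0.0, "done": False},  # slipped right
]
```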
2.1 Policy Iteration
First, a quick review of the policy iteration algorithm:
1. **Policy evaluation.** With the environment dynamics given, update the value function under the current policy until it converges. This first step is an inner loop that exits when the value function differs only slightly from the previous sweep (i.e., it has converged):

   $$v_{k+1}=\mathbb{E}_{s'}\left[R(s,a)+\gamma v_k(s')\right]$$

   where $a$ is the action given by the current policy, $s'$ is the next state, $R$ is the reward function, and $v_k(s')$ is the next state's value from the previous sweep.

2. **Policy improvement.** Find the policy that maximizes the value function of this iteration:

   $$a=\arg\max_{a}\mathbb{E}_{s'}\left[R(s,a)+\gamma v_k(s')\right]$$

3. If the improved policy is identical to the previous one, stop; otherwise go back to step 1 and keep iterating.
In summary, policy iteration consists of an outer loop (policy improvement) and an inner loop (the policy evaluation of step 1, run until the value function converges). A tiny self-contained example of this loop structure is sketched below.
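To make the structure concrete, here is a tiny self-contained sketch (my own illustration, not the assignment code) that runs policy iteration on a hand-written two-state MDP stored in the same `P[state][action] = [(prob, next_state, reward, done), ...]` format that FrozenLake exposes:

```python
# A toy run of policy iteration (my own illustration, not the assignment code)
# on a hand-written 2-state MDP stored in the same format as env.P.
import numpy as np

P = {
    0: {0: [(1.0, 0, 0.0, False)],                          # stay put, no reward
        1: [(0.8, 1, 1.0, True), (0.2, 0, 0.0, False)]},    # reach the goal with prob 0.8
    1: {0: [(1.0, 1, 0.0, True)],                           # state 1 is terminal
        1: [(1.0, 1, 0.0, True)]},
}
n_states, n_actions, gamma, eps = 2, 2, 0.9, 1e-8

policy = np.zeros(n_states, dtype=int)            # start from an arbitrary policy
while True:
    # Step 1: policy evaluation -- inner loop, run until the values converge.
    V = np.zeros(n_states)
    while True:
        old_V = V.copy()
        for s in range(n_states):
            V[s] = sum(p * (r + gamma * old_V[ns]) for p, ns, r, done in P[s][policy[s]])
        if np.sum(np.abs(V - old_V)) < eps:
            break
    # Step 2: policy improvement -- act greedily with respect to V.
    new_policy = np.array([
        np.argmax([sum(p * (r + gamma * V[ns]) for p, ns, r, done in P[s][a])
                   for a in range(n_actions)])
        for s in range(n_states)
    ])
    # Step 3: stop once the policy no longer changes.
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print(policy, V)  # expect action 1 in state 0, V[0] close to 0.8 / (1 - 0.2 * 0.9)
```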
Next we create a policy-iteration subclass, ***PolicyItertaionTrainer***, that inherits from the abstract parent class above.
```python
class PolicyItertaionTrainer(TabularRLTrainerAbstract):
    def __init__(self, gamma=1.0, eps=1e-10, env_name='FrozenLake8x8-v1'):
        # ...

    def train(self):
        # ...

    def update_value_function(self):
        # ...

    def update_policy(self):
        # ...
```
The notebook cell defining this class contains four **[TODO]** items; let's go through them one by one:
- The constructor asks us to create an initial policy. Any policy works, since it will be improved by the iterations that follow, but we can use a bit of prior knowledge to bias the agent toward actually finishing an episode: instead of a plain random.choice(), we give it a tendency to move down and to the right (toward the goal), which slightly reduces the number of iterations needed. (A fixed random policy, closer to the notebook's hint, is sketched right after the code.)
```python
def __init__(self, gamma=1.0, eps=1e-10, env_name='FrozenLake8x8-v1'):
    super(PolicyItertaionTrainer, self).__init__(env_name)

    # discount factor
    self.gamma = gamma

    # value function convergence criterion
    self.eps = eps

    # build the value table for each possible observation
    self.table = np.zeros((self.obs_dim,))

    # [TODO] you need to implement a random policy at the beginning.
    # It is a function that takes an integer (state or say observation)
    # as input and returns an integer (action).
    # remember, you can use self.action_dim to get the dimension (range)
    # of the action, which is an integer in range
    # [0, ..., self.action_dim - 1]
    # hint: generating a random action at each call of the policy may lead
    # to failure of convergence; try generating random actions at
    # initialization and keeping them fixed during training.

    # DOWN and RIGHT are the FrozenLake action constants (LEFT=0, DOWN=1,
    # RIGHT=2, UP=3), assumed to be defined earlier in the notebook.
    # Move down unless we are on the bottom row (and not in the last column),
    # in which case move right toward the goal.
    self.policy = lambda obs: DOWN if (obs + 1) % 8 == 0 or (obs + 8) < 64 else RIGHT

    # test your random policy
    test_random_policy(self.policy, self.env)
```
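For comparison, a minimal sketch that follows the notebook's hint more literally: draw one random action per state at initialization and keep it fixed during training (my own variant, not the notebook's reference solution; it would replace the lambda above inside __init__):

```python
# Draw a random action for every state once, then look it up on each call,
# so the policy is random but stays fixed while the values are being updated.
random_actions = np.random.randint(0, self.action_dim, size=self.obs_dim)
self.policy = lambda obs: int(random_actions[obs])
```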
- Next, in the train() method, we are asked to reset the value function at the start of each outer iteration if we think it is necessary. In my tests, resetting or not has little effect on the final state values and the mean reward in this small assignment; it only changes how many inner sweeps each iteration takes. But as I understand policy iteration, each outer iteration should compute the value function of the new policy from scratch, so I reset the table to zeros here.
```python
def train(self):
    """Conduct one iteration of learning."""
    # [TODO] value function may need to be reset to zeros.
    # if you think it should, then do it. If not, then move on.
    # hint: the value function is equivalent to self.table,
    # a numpy array with length 64.
    self.table = np.zeros((self.obs_dim,))

    self.update_value_function()
    self.update_policy()
```
Moving on, train() calls ***update_value_function()*** and ***update_policy()*** in turn; we look at each of them next.
- update_value_function(), line by line:
```python
def update_value_function(self):
    count = 0  # count the steps of value updates
    while True:
        old_table = self.table.copy()  # keep a copy of the old value table

        for state in range(self.obs_dim):
            act = self.policy(state)  # action chosen by the current policy in this state
            # all possible transitions for this (state, action) pair
            transition_list = self._get_transitions(state, act)

            state_value = 0
            for transition in transition_list:
                prob = transition['prob']              # probability of this transition
                reward = transition['reward']          # immediate reward for this transition
                next_state = transition['next_state']  # successor state
                done = transition['done']              # whether the episode terminates

                # [TODO] what is the right state value?
                # hint: you should use reward, self.gamma, old_table, prob,
                # and next_state to compute the state value
                state_value += prob * (reward + self.gamma * old_table[next_state])  # key step

            # update the state value
            self.table[state] = state_value

        # [TODO] Compare the old_table and current table to
        # decide whether to break the value update process.
        # hint: you should use self.eps, old_table and self.table
        should_break = np.sum(np.abs(old_table - self.table)) < self.eps  # convergence check
        if should_break:
            break
```
The key step inside the loop over possible transitions, `state_value += prob * (reward + self.gamma * old_table[next_state])`, is exactly the update $v_{k+1}=\mathbb{E}_{s'}\left[R(s,a)+\gamma v_k(s')\right]$, with the expectation over $s'$ computed as a probability-weighted sum over the transition list.
After each sweep we check whether the value table has converged to decide whether to stop the loop.
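Concretely, the stopping test used above is the L1 distance between consecutive value tables:

$$\sum_{s}\bigl|v_{k+1}(s)-v_{k}(s)\bigr|<\epsilon,\qquad \epsilon=10^{-10}$$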
- update_policy(), line by line:
```python
def update_policy(self):
    """You need to define a new policy function, given the current
    value function. The best action for a given state is the one
    that has the greatest expected return.

    To optimize computing efficiency, we introduce a policy table,
    which takes the state as index and returns the action for that state.
    """
    policy_table = np.zeros([self.obs_dim, ], dtype=np.int64)

    for state in range(self.obs_dim):
        state_action_values = [0] * self.action_dim

        # [TODO] assign the action with greatest "value"
        # to policy_table[state]
        # hint: what is the proper "value" here?
        # you should use table, gamma, reward, prob,
        # next_state and the self._get_transitions() function
        # as we did in self.update_value_function()
        # Bellman equation may help.
        best_action = None
        # loop over every action the agent could take in this state
        for action in range(self.action_dim):
            transition_list = self._get_transitions(state, action)
            for transition in transition_list:
                prob = transition['prob']
                reward = transition['reward']
                next_state = transition['next_state']
                done = transition['done']
                # expected return of taking this action in this state
                state_action_values[action] += prob * (reward + self.gamma * self.table[next_state])

        best_action = np.argmax(np.array(state_action_values))  # action with the largest value
        policy_table[state] = best_action  # record the best action for this state

    self.policy = lambda obs: policy_table[obs]
```
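In equation form, update_policy() assigns to every state the greedy action with respect to the current value table, with the expectation over $s'$ again unrolled as a probability-weighted sum over the transition list:

$$\pi_{\text{new}}(s)=\arg\max_{a}\sum_{s'}P(s'\mid s,a)\bigl[R(s,a)+\gamma\,V(s')\bigr]$$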
Next, the driver code that runs policy iteration:
```python
# Managing configurations of your experiments is important for your research.
default_pi_config = dict(
    max_iteration=1000,
    evaluate_interval=1,
    gamma=1.0,
    eps=1e-10
)


def policy_iteration(train_config=None):
    config = default_pi_config.copy()
    if train_config is not None:
        config.update(train_config)
    trainer = PolicyItertaionTrainer(gamma=config['gamma'], eps=config['eps'])

    old_policy_result = {
        obs: -1 for obs in range(trainer.obs_dim)
    }

    for i in range(config['max_iteration']):
        # train the agent
        trainer.train()  # [TODO] please uncomment this line

        # [TODO] compare the new policy with old policy to check whether
        # should we stop. If new and old policy have same output given any
        # observation, then we consider the algorithm is converged and
        # should be stopped.
        new_policy_result = {
            state: trainer.policy(state) for state in range(trainer.obs_dim)
        }
        should_stop = (new_policy_result == old_policy_result)
        if should_stop:
            print("We found policy is not changed anymore at "
                  "itertaion {}. Current mean episode reward "
                  "is {}. Stop training.".format(i, trainer.evaluate()))
            break
        old_policy_result = new_policy_result

    return trainer


pi_agent = policy_iteration()
pi_agent.print_table()
```
Output:

```
[INFO] In 0 iteration, current mean episode reward is 0.822.
[DEBUG] Updated values for 200 steps. Difference between new and old table is: 0.041664161897299004
[DEBUG] Updated values for 400 steps. Difference between new and old table is: 0.0022292041480653085
[DEBUG] Updated values for 600 steps. Difference between new and old table is: 0.0001184338151329345
[DEBUG] Updated values for 800 steps. Difference between new and old table is: 6.291939822350434e-06
[DEBUG] Updated values for 1000 steps. Difference between new and old table is: 3.3426684917237104e-07
[DEBUG] Updated values for 1200 steps. Difference between new and old table is: 1.7758327711114852e-08
[DEBUG] Updated values for 1400 steps. Difference between new and old table is: 9.434331232904825e-10
[INFO] In 1 iteration, current mean episode reward is 0.804.
[DEBUG] Updated values for 200 steps. Difference between new and old table is: 0.0005348034918033623
[DEBUG] Updated values for 400 steps. Difference between new and old table is: 4.20104129661425e-06
[DEBUG] Updated values for 600 steps. Difference between new and old table is: 2.8070459692774996e-08
[DEBUG] Updated values for 800 steps. Difference between new and old table is: 1.7462087331665543e-10
[INFO] In 2 iteration, current mean episode reward is 0.77.
[DEBUG] Updated values for 200 steps. Difference between new and old table is: 0.0004257477615745714
[DEBUG] Updated values for 400 steps. Difference between new and old table is: 1.4125290733302265e-05
[DEBUG] Updated values for 600 steps. Difference between new and old table is: 3.971402302571647e-07
[DEBUG] Updated values for 800 steps. Difference between new and old table is: 1.0301546324309463e-08
[DEBUG] Updated values for 1000 steps. Difference between new and old table is: 2.548876110175513e-10
[INFO] In 3 iteration, current mean episode reward is 0.688.
[DEBUG] Updated values for 200 steps. Difference between new and old table is: 0.00019132880435357436
[DEBUG] Updated values for 400 steps. Difference between new and old table is: 1.8012092146968417e-05
[DEBUG] Updated values for 600 steps. Difference between new and old table is: 1.3738229058951612e-06
[DEBUG] Updated values for 800 steps. Difference between new and old table is: 9.513477411404736e-08
[DEBUG] Updated values for 1000 steps. Difference between new and old table is: 6.230829824316331e-09
[DEBUG] Updated values for 1200 steps. Difference between new and old table is: 3.9353281744425317e-10
[INFO] In 4 iteration, current mean episode reward is 0.829.
[INFO] In 5 iteration, current mean episode reward is 0.867.
We found policy is not changed anymore at itertaion 6. Current mean episode reward is 0.867. Stop training.
```

```
+-----+-----+-----State Value Mapping-----+-----+-----+
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|-----+-----+-----+-----+-----+-----+-----+-----+-----|
| 0 |1.000|1.000|1.000|1.000|1.000|1.000|1.000|1.000|
| | | | | | | | | |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 1 |1.000|1.000|1.000|1.000|1.000|1.000|1.000|1.000|
| | | | | | | | | |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 2 |1.000|0.978|0.926|0.000|0.857|0.946|0.982|1.000|
| | | | | | | | | |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 3 |1.000|0.935|0.801|0.475|0.624|0.000|0.945|1.000|
| | | | | | | | | |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 4 |1.000|0.826|0.542|0.000|0.539|0.611|0.852|1.000|
| | | | | | | | | |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 5 |1.000|0.000|0.000|0.168|0.383|0.442|0.000|1.000|
| | | | | | | | | |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 6 |1.000|0.000|0.195|0.121|0.000|0.332|0.000|1.000|
| | | | | | | | | |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 7 |1.000|0.732|0.463|0.000|0.277|0.555|0.777|0.000|
| | | | | | | | | |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
```

As the output shows, the outer loop repeatedly calls the trainer's train() method and, after each policy update, decides whether to stop by checking whether the policy is identical to the one from the previous iteration (or whether the maximum number of iterations has been reached).
When the policy becomes optimal, the mean episode reward is 0.867, and the corresponding value function is shown in the table above.
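If you want to watch the resulting policy play an episode, the render() helper inherited from TabularRLTrainerAbstract can be reused, for example:

```python
# Render one episode (seed=0) of the learned policy; render() simply calls
# the evaluate helper with render=True.
pi_agent.render()
```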
2.2 Value Iteration
The difference between value iteration and policy iteration is this: policy iteration evaluates the value function to full convergence under the current policy before improving the policy, whereas value iteration performs only a single value-update sweep per iteration, taking the max over actions directly, and keeps doing so until the values converge; no explicit policy is maintained during training (the greedy policy is extracted from the value table only when it is needed, e.g. for evaluation). Accordingly, the former stops when the policy no longer changes, and the latter stops when the value function converges.
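Written as an update rule, value iteration applies the Bellman optimality backup directly, i.e. the policy evaluation update from Section 2.1 with a max over actions folded in:

$$v_{k+1}(s)=\max_{a}\,\mathbb{E}_{s'}\bigl[R(s,a)+\gamma\,v_{k}(s')\bigr]$$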
```python
# Methods of the value-iteration trainer that differ from the policy-iteration trainer.
def train(self):
    """Conduct one iteration of learning."""
    # [TODO] value function may need to be reset to zeros.
    # if you think it should, then do it. If not, then move on.
    # self.table = np.zeros((self.obs_dim,))

    # In value iteration, we do not explicitly require a
    # policy instance to run. We update the value function
    # directly based on the transitions. Therefore, we
    # don't need to run self.update_policy() in each step.
    self.update_value_function()


def update_value_function(self):
    old_table = self.table.copy()

    for state in range(self.obs_dim):
        state_value = 0

        # [TODO] what should be the right state value?
        # hint: try to compute the state_action_values first
        action_state_value = np.zeros(self.action_dim)
        for action in range(self.action_dim):
            transition_list = self._get_transitions(state, action)
            for transition in transition_list:
                prob = transition['prob']
                reward = transition['reward']
                next_state = transition['next_state']
                done = transition['done']
                action_state_value[action] += prob * (reward + self.gamma * old_table[next_state])

        self.table[state] = max(action_state_value)


def evaluate(self):
    """Since in value iteration we do not maintain a policy function,
    we need to retrieve it when we need it."""
    self.update_policy()
    return super().evaluate()
```
So these are the only two places that differ. Note that inside this train() method the table should not be reset to zeros: the goal is to let the value function converge, and each call performs only a single sweep, so resetting every iteration would prevent the table from ever converging and the outer stopping condition would never be satisfied (the loop would only end at max_iteration).
```python
# Main training loop inside value_iteration(); config and trainer are set up
# earlier in that function, as in policy_iteration() above.
for i in range(config['max_iteration']):
    old_state_value_table = trainer.table.copy()

    # train the agent
    trainer.train()  # [TODO] please uncomment this line

    # evaluate the result
    if i % config['evaluate_interval'] == 0:
        print("[INFO]\tIn {} iteration, current "
              "mean episode reward is {}.".format(
            i, trainer.evaluate()
        ))

    # [TODO] compare the new policy with old policy to check should
    # we stop.
    # [HINT] If new and old policy have same output given any
    # observation, then we consider the algorithm is converged and
    # should be stopped.
    should_stop = (np.sum(np.abs(old_state_value_table - trainer.table)) < default_vi_config['eps'])
    if should_stop:
        print("We found policy is not changed anymore at "
              "itertaion {}. Current mean episode reward "
              "is {}. Stop training.".format(i, trainer.evaluate()))
        break


vi_agent = value_iteration()
vi_agent.render()
vi_agent.print_table()
```
Output:

```
[INFO] In 0 iteration, current mean episode reward is 0.0.
[INFO] In 100 iteration, current mean episode reward is 0.892.
[INFO] In 200 iteration, current mean episode reward is 0.867.
[INFO] In 300 iteration, current mean episode reward is 0.867.
[INFO] In 400 iteration, current mean episode reward is 0.867.
[INFO] In 500 iteration, current mean episode reward is 0.867.
[INFO] In 600 iteration, current mean episode reward is 0.867.
[INFO] In 700 iteration, current mean episode reward is 0.867.
[INFO] In 800 iteration, current mean episode reward is 0.867.
[INFO] In 900 iteration, current mean episode reward is 0.867.
[INFO] In 1000 iteration, current mean episode reward is 0.867.
[INFO] In 1100 iteration, current mean episode reward is 0.867.
[INFO] In 1200 iteration, current mean episode reward is 0.867.
[INFO] In 1300 iteration, current mean episode reward is 0.867.
[INFO] In 1400 iteration, current mean episode reward is 0.867.
[INFO] In 1500 iteration, current mean episode reward is 0.867.
[INFO] In 1600 iteration, current mean episode reward is 0.867.
We found policy is not changed anymore at itertaion 1600. Current mean episode reward is 0.867. Stop training.
```

```
+-----+-----+-----State Value Mapping-----+-----+-----+
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|-----+-----+-----+-----+-----+-----+-----+-----+-----|
| 0 |1.000|1.000|1.000|1.000|1.000|1.000|1.000|1.000|
| | | | | | | | | |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 1 |1.000|1.000|1.000|1.000|1.000|1.000|1.000|1.000|
| | | | | | | | | |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 2 |1.000|0.978|0.926|0.000|0.857|0.946|0.982|1.000|
| | | | | | | | | |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 3 |1.000|0.935|0.801|0.475|0.624|0.000|0.945|1.000|
| | | | | | | | | |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 4 |1.000|0.826|0.542|0.000|0.539|0.611|0.852|1.000|
| | | | | | | | | |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 5 |1.000|0.000|0.000|0.168|0.383|0.442|0.000|1.000|
| | | | | | | | | |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 6 |1.000|0.000|0.195|0.121|0.000|0.332|0.000|1.000|
| | | | | | | | | |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 7 |1.000|0.732|0.463|0.000|0.277|0.555|0.777|0.000|
| | | | | | | | | |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
```

That is how the value-iteration trainer is invoked. Note that trainer.evaluate() first calls update_policy() to rebuild the greedy policy from the current value table and then reuses the parent class's evaluate().
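As a final sanity check (my own addition, not part of the assignment; `pi_agent` and `vi_agent` are the trainers produced above), the two methods should end up with essentially the same optimal value table and the same mean episode reward of about 0.867:

```python
# Policy iteration and value iteration converge to (numerically) the same
# optimal value table, so their greedy policies should perform identically.
print(np.max(np.abs(pi_agent.table - vi_agent.table)))  # expect a very small number
print(pi_agent.evaluate(), vi_agent.evaluate())         # both roughly 0.867
```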