Learning PPO Algorithm Programming from Scratch (PyTorch Version) (Part 3)


We continue where the previous article left off.
As the pseudocode shows, Steps 6 and 7 run multiple epochs in each iteration, so we first need to add the number of epochs to the initialization function we defined earlier. Since $\theta$ and $\phi$ both carry the subscript $k$ in the formulas, the parameters of the k-th iteration are distinct from the current parameters, which also means every iteration runs its own set of update epochs.
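
To preview where this fits, here is a rough outline only; learn and rollout are the methods built up in the previous parts of this series, and the exact loop condition and return values may differ in your code:

    # Outline of learn(): one rollout per iteration, then several update epochs
    while t_so_far < total_timesteps:                    # iteration k
        batch_obs, batch_acts, batch_log_probs, batch_rtgs, batch_lens = self.rollout()
        ...                                              # Step 5: advantages, computed once
        for _ in range(self.n_updates_per_iteration):    # epochs within iteration k
            ...                                          # Steps 6 and 7 below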
Step 5: Compute the advantage estimates. In reinforcement learning, the advantage function of a policy describes how much better it is to take a specific action in a given state than to pick an action at random under that policy. Following the OpenAI documentation, the advantage function is defined as:
$$A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$$
Here we modify the advantage function to:
$$A^{\pi}(s,a) = Q^{\pi}(s,a) - V_{\phi_k}(s)$$
Here $Q^{\pi}$ is the Q-value of the state-action pair $(s,a)$, and $V_{\phi_k}$ is the value of state $s$ predicted by our critic network using the parameters $\phi$ from the k-th iteration. Specifying that the predicted value follows the k-th iteration's $\phi$ matters, because in Step 7 below we will recompute $V(s)$ with the updated $\phi$ at every epoch. The Q-values, however, are fixed once the rollout finishes, and $V_{\phi_k}$ must be fixed before we perform the multiple updates, so both $V_{\phi_k}$ and the advantages have to be computed before the epoch loop. The Q-values are already computed in compute_rtgs, so here we focus on $V_{\phi_k}$.

We define a function evaluate to compute $V_{\phi_k}$:

    def evaluate(self, batch_obs, batch_acts):
        # Query the critic network for a value V for each observation in batch_obs.
        # The shape of V should match batch_rtgs.
        V = self.critic(batch_obs).squeeze()

        return V
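
As a standalone illustration of why .squeeze() is needed (hypothetical shapes, not part of the PPO class): a critic with a single output unit returns shape (T, 1), and squeezing collapses it to (T,), matching batch_rtgs:

    import torch
    import torch.nn as nn

    T, obs_dim = 4, 3                      # hypothetical batch of 4 timesteps
    critic = nn.Linear(obs_dim, 1)         # stand-in for self.critic
    batch_obs = torch.randn(T, obs_dim)

    V = critic(batch_obs)                  # shape: (4, 1)
    print(V.shape, V.squeeze().shape)      # torch.Size([4, 1]) torch.Size([4])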

Then compute the advantages. We detach V so that the advantage is treated as a constant when the actor loss is backpropagated later:

    # Calculate advantage at k-th iteration
    V, _ = self.evaluate(batch_obs, batch_acts)
    A_k = batch_rtgs - V.detach()                                  # ALG STEP 5

Following Coding PPO from Scratch with PyTorch, we can apply a small trick that in practice stabilizes learning: normalize the advantages.

    A_k = (A_k - A_k.mean()) / (A_k.std() + 1e-10)

Note that we add 1e-10 to the standard deviation of the advantages only to avoid any possibility of dividing by zero.

Step 6: Update the actor parameters by maximizing the PPO-Clip objective. The update rule is:
$$\theta_{k+1} = \arg\max_{\theta} \frac{1}{|\mathscr{D}_{k}| T} \sum_{\tau \in \mathscr{D}_{k}} \sum_{t=0}^{T} \min\!\left( \frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{k}}(a_{t} \mid s_{t})} \, A^{\pi_{\theta_{k}}}(s_{t},a_{t}),\; g\!\left(\epsilon, A^{\pi_{\theta_{k}}}(s_{t},a_{t})\right) \right)$$
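For reference, the clip function $g$ in this pseudocode (as written in the OpenAI Spinning Up notes) is:

$$g(\epsilon, A) = \begin{cases} (1+\epsilon)\,A & A \ge 0 \\ (1-\epsilon)\,A & A < 0 \end{cases}$$

Spinning Up also points out that the whole min term can be written equivalently as $\min\!\big(r_t(\theta)\,A,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,A\big)$ with $r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)}$, which is exactly the form the surrogate losses below implement.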
First, compute the probability ratio $\frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{k}}(a_{t} \mid s_{t})}$.
To do that, modify the evaluate function so it also returns the log probabilities of the actions.

    def evaluate(self, batch_obs, batch_acts):
        """
            Estimate the values of each observation, and the log probs of
            each action in the most recent batch with the most recent
            iteration of the actor network. Should be called from learn.
            Parameters:
                batch_obs - the observations from the most recently collected batch as a tensor.
                            Shape: (number of timesteps in batch, dimension of observation)
                batch_acts - the actions from the most recently collected batch as a tensor.
                            Shape: (number of timesteps in batch, dimension of action)
            Return:
                V - the predicted values of batch_obs
                log_probs - the log probabilities of the actions taken in batch_acts given batch_obs
        """
        # Query critic network for a value V for each batch_obs. Shape of V should be same as batch_rtgs
        V = self.critic(batch_obs).squeeze()

        # Calculate the log probabilities of batch actions using most recent actor network.
        # This segment of code is similar to that in get_action()
        mean = self.actor(batch_obs)
        dist = MultivariateNormal(mean, self.cov_mat)
        log_probs = dist.log_prob(batch_acts)

        # Return the value vector V of each observation in the batch
        # and log probabilities log_probs of each action in the batch
        return V, log_probs
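
As a quick usage note (shapes follow the conventions from the earlier parts, where batch_acts has shape (number of timesteps, act_dim) and self.cov_mat is an (act_dim, act_dim) covariance matrix), calling it from learn looks like:

    # Inside the epoch loop in learn():
    V, curr_log_probs = self.evaluate(batch_obs, batch_acts)
    # V              -> shape (number of timesteps in batch,)
    # curr_log_probs -> shape (number of timesteps in batch,)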

Now start the epoch loop that performs multiple actor/critic parameter updates per iteration. First, add the number of updates per iteration, n_updates_per_iteration, to the hyperparameter initialization:

    def _init_hyperparameters(self, hyperparameters):
        self.n_updates_per_iteration = 5  # Number of times to update actor/critic per iteration

Because batch_log_probs and curr_log_probs are both log probabilities, we can subtract them and then exponentiate with e to recover the ratio:

    # Calculate ratios
    ratios = torch.exp(curr_log_probs - batch_log_probs)
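
As a standalone sanity check of the identity used here, with hypothetical probabilities unrelated to the PPO class:

    import torch

    p_new = torch.tensor([0.2, 0.5])   # hypothetical pi_theta(a|s)
    p_old = torch.tensor([0.1, 0.5])   # hypothetical pi_theta_k(a|s)

    # exp(log(a) - log(b)) == a / b
    ratios = torch.exp(torch.log(p_new) - torch.log(p_old))
    print(ratios)                      # tensor([2.0000, 1.0000]), i.e. p_new / p_old

Also note that on the first epoch of each iteration curr_log_probs is computed with the same parameters that produced batch_log_probs, so all ratios start out at 1.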

Next, compute the surrogate losses:

    # Calculate surrogate losses
    surr1 = ratios * A_k
    surr2 = torch.clamp(ratios, 1 - self.clip, 1 + self.clip) * A_k

    ...

    def _init_hyperparameters(self):
        ...
        self.clip = 0.2  # As recommended by the paper

Finally, compute actor_loss. We take the mean of the negated minimum because the optimizer performs gradient descent, while the PPO-Clip objective is something we want to maximize:

    actor_loss = (-torch.min(surr1, surr2)).mean()

Now run backpropagation for the actor network. First, define an Adam optimizer for it:

    from torch.optim import Adam

    class PPO:
        def __init__(self, env):
            ...
            self.actor_optim = Adam(self.actor.parameters(), lr=self.lr)

        def _init_hyperparameters(self):
            ...
            self.lr = 0.005

Then backpropagate:

    # Calculate gradients and perform backward propagation for the actor network
    self.actor_optim.zero_grad()
    actor_loss.backward()
    self.actor_optim.step()

Step 7: Update the critic network parameters. The update rule is:
$$\phi_{k+1} = \arg\min_{\phi} \frac{1}{|\mathscr{D}_{k}| T} \sum_{\tau \in \mathscr{D}_{k}} \sum_{t=0}^{T} \left( V_{\phi}(s_{t}) - \hat{R}_{t} \right)^{2}$$
We update the critic parameters with the mean squared error between the current epoch's $V_{\phi}(s_{t})$ and $\hat{R}_{t}$; PyTorch's torch.nn.MSELoss computes the MSE for us directly.
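
As a standalone reminder of what torch.nn.MSELoss computes, with hypothetical tensors:

    import torch
    import torch.nn as nn

    pred   = torch.tensor([1.0, 2.0, 3.0])
    target = torch.tensor([1.0, 0.0, 3.0])

    loss = nn.MSELoss()(pred, target)  # mean of the squared errors
    print(loss)                        # tensor(1.3333) == ((pred - target) ** 2).mean()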
First, define an Adam optimizer for the critic network:

    self.critic_optim = Adam(self.critic.parameters(), lr=self.lr)

$V_{\phi}(s_{t})$ (the V returned by evaluate) and $\hat{R}_{t}$ (batch_rtgs) are already available, so compute the loss:

    critic_loss = nn.MSELoss()(V, batch_rtgs)

Then backpropagate:

    # Calculate gradients and perform backward propagation for the critic network
    self.critic_optim.zero_grad()
    critic_loss.backward()
    self.critic_optim.step()
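
Putting Steps 5 through 7 together, the update portion of learn for one iteration ends up looking roughly like the sketch below. It only assembles the snippets shown above (variable names are the same); treat it as an outline rather than a verified drop-in implementation:

    # ALG STEP 5: advantages are computed once per iteration, before the epoch loop
    V, _ = self.evaluate(batch_obs, batch_acts)
    A_k = batch_rtgs - V.detach()
    A_k = (A_k - A_k.mean()) / (A_k.std() + 1e-10)

    for _ in range(self.n_updates_per_iteration):
        # Re-evaluate V and the log probs with the CURRENT network parameters
        V, curr_log_probs = self.evaluate(batch_obs, batch_acts)

        # ALG STEP 6: clipped surrogate objective for the actor
        ratios = torch.exp(curr_log_probs - batch_log_probs)
        surr1 = ratios * A_k
        surr2 = torch.clamp(ratios, 1 - self.clip, 1 + self.clip) * A_k
        actor_loss = (-torch.min(surr1, surr2)).mean()

        self.actor_optim.zero_grad()
        actor_loss.backward()
        self.actor_optim.step()

        # ALG STEP 7: regress the critic onto the rewards-to-go
        critic_loss = nn.MSELoss()(V, batch_rtgs)

        self.critic_optim.zero_grad()
        critic_loss.backward()
        self.critic_optim.step()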

Step 8: End for

Reference:
Coding PPO from Scratch with PyTorch (Part 3/4)
