Open-Source Code for (Meta-)Reinforcement Learning

Local code: https://github.com/lucifer2859/meta-RL

Introduction to meta-reinforcement learning: https://www.cnblogs.com/lucifer1997/p/13603979.html

I. Meta-RL

1. Learning to Reinforcement Learn: CogSci 2017

https://github.com/awjuliani/Meta-RL
Environment: TensorFlow, CPU;
Tasks: Dependent (Easy, Medium, Hard, Uniform)/Independent/Restless Bandit, Contextual Bandit, GridWorld;
A3C-Meta-Bandit - Set of bandit tasks described in paper. Including: Independent, Dependent, and Restless bandits.
A3C-Meta-Context - Rainbow bandit task using randomized colors to indicate reward-giving arm in each episode.
A3C-Meta-Grid - Rainbow Gridworld task; a variation of gridworld in which goal colors are randomized each episode and must be learned "on the fly."
Model: one-layer LSTM A3C [Figure 1(a), without the encoder layer];
Experiments: runs successfully with no bugs; training converges; results roughly match the paper; performance does not reach the paper's reported numbers (with the current hyperparameters); the local code modifies it slightly, see https://github.com/lucifer2859/meta-RL/tree/master/Meta-RL;
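For reference, the recurrent core behind Figure 1(a) is small: the LSTM receives the one-hot previous action, the previous reward, and the timestep, and its hidden state carries the within-episode adaptation. The snippet below is a minimal PyTorch sketch of that idea; the class name, hidden size, and exact input layout are my assumptions, not the repo's code.

```python
import torch
import torch.nn as nn

class MetaRLCore(nn.Module):
    """Sketch of the Figure 1(a) recurrent core (no encoder layer):
    input = [one-hot previous action, previous reward, timestep]."""

    def __init__(self, n_actions, hidden_size=48):
        super().__init__()
        self.lstm = nn.LSTMCell(n_actions + 2, hidden_size)
        self.policy_head = nn.Linear(hidden_size, n_actions)  # actor logits
        self.value_head = nn.Linear(hidden_size, 1)            # critic value

    def forward(self, prev_action_onehot, prev_reward, timestep, hidden):
        # prev_reward and timestep are (batch, 1) tensors
        x = torch.cat([prev_action_onehot, prev_reward, timestep], dim=-1)
        h, c = self.lstm(x, hidden)
        return self.policy_head(h), self.value_head(h), (h, c)
```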

https://github.com/achao2013/Learning-To-Reinforcement-Learn
Environment: MXNet, CPU;
Tasks: Dependent (Easy, Medium, Hard, Uniform)/Independent/Restless Bandit;
Model: multi-layer LSTM A3C [without the encoder layer];
Experiments: not run;
https://github.com/lucifer2859/meta-RL/tree/master/L2RL-pytorch
Environment: PyTorch, CPU;
Tasks: Dependent (Easy, Medium, Hard, Uniform)/Independent/Restless Bandit;
Model: one-layer LSTM A3C [Figure 1(a), with GAE, without the encoder layer];
Experiments: runs successfully with no bugs; training converges; results roughly match the paper; performance does not reach the paper's reported numbers (with the current hyperparameters);

2. RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning (RL²): ICLR 2017

https://github.com/mwufi/meta-rl-bandits
Environment: PyTorch, CPU;
Task: Independent Bandit;
Model: two-layer LSTM REINFORCE;
Experiments: runs successfully with no bugs; the model does not match the paper, whose RNN is a GRU; training does not converge (with the current hyperparameters);
https://github.com/VashishtMadhavan/rl2
Environment: TensorFlow, CPU;
Task: Dependent Bandit;
Model: one-layer LSTM A3C [without the encoder layer];
Experiments: fails to run with gym.error.UnregisteredEnv: No registered env with id: MediumBandit-v0;
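That UnregisteredEnv error typically means the repo's custom bandit environment is never registered with gym before gym.make('MediumBandit-v0') is called. A workaround along the following lines usually helps; the entry_point module path and episode length here are hypothetical placeholders, not the repo's actual values.

```python
from gym.envs.registration import register

# Register the custom env before calling gym.make('MediumBandit-v0').
# The module:class path and episode length below are assumptions;
# point them at wherever the repo actually defines its bandit env.
register(
    id='MediumBandit-v0',
    entry_point='bandit_envs:MediumBandit',
    max_episode_steps=100,
)
```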

3. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML): ICML 2017

https://github.com/tristandeleu/pytorch-maml-rl
Environment: PyTorch, GPU;
Tasks: Multi-armed Bandit, Tabular MDP, Continuous Control with MuJoCo, 2D Navigation Task;
Model: MAML TRPO (a simplified meta-update is sketched after this subsection);
Experiments: initially fails to run with terminate called after throwing an instance of 'c10::Error'; following https://github.com/tristandeleu/pytorch-maml-rl/issues/40#issuecomment-632598191 resolves it; a new problem then appears (AttributeError: Can't pickle local object 'make_env.<locals>._make_env'), which is resolved by following https://github.com/tristandeleu/pytorch-maml-rl/issues/51; train.py then runs successfully, but test.py fails; bandit-k5-n10 does not converge (with the current hyperparameters);
https://github.com/cbfinn/maml_rl
Environment: TensorFlow (rllab version), CPU;
Task: MuJoCo;
Model: MAML TRPO;
Experiments: not run;
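Both repos implement MAML's two-level loop: adapt a copy of the policy with a few inner gradient steps on each task, then meta-update the initial parameters through that adaptation. The sketch below swaps the TRPO outer step for plain gradient descent and assumes hypothetical helpers (policy_loss, task.sample_trajectories), so it illustrates the structure rather than either repo's implementation.

```python
import torch

def maml_meta_step(policy, tasks, meta_opt, inner_lr=0.1):
    """One MAML meta-update (illustrative). `policy_loss(policy, batch, params=None)`
    and `task.sample_trajectories(policy, params=None)` are assumed helpers: the
    former returns a differentiable policy-gradient loss, the latter collects
    rollouts, optionally using an adapted parameter list."""
    meta_loss = 0.0
    for task in tasks:
        # Inner loop: one gradient step on pre-adaptation (support) data,
        # keeping the graph so the outer update can differentiate through it.
        support = task.sample_trajectories(policy)
        loss = policy_loss(policy, support)
        grads = torch.autograd.grad(loss, policy.parameters(), create_graph=True)
        adapted = [p - inner_lr * g for p, g in zip(policy.parameters(), grads)]

        # Outer objective: performance of the adapted parameters on fresh (query) data.
        query = task.sample_trajectories(policy, params=adapted)
        meta_loss = meta_loss + policy_loss(policy, query, params=adapted)

    meta_opt.zero_grad()
    meta_loss.backward()  # backprop through the inner-loop adaptation
    meta_opt.step()
    return float(meta_loss)
```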

4. Evolved Policy Gradients (EPG): NeurIPS 2018

https://github.com/openai/EPG
Environment: Chainer, CPU;
Task: MuJoCo;
Model: EPG PPO;
Experiments: not run;

5. A Simple Neural Attentive Meta-Learner (SNAIL): ICLR 2018

https://github.com/chanb/metalearning_RL
Environment: PyTorch, GPU;
Tasks: Multi-armed Bandit, Tabular MDP;
Models: SNAIL, RL² (GRU) + PPO;
Experiments: runs successfully with no bugs;

6. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables (PEARL): ICML 2019

https://github.com/katerakelly/oyster
Environment: PyTorch, GPU;
Task: MuJoCo;
Model: PEARL (SAC-based);
Experiments: docker build . -t pearl fails during the Docker setup; after abandoning Docker and installing the required packages locally, it runs successfully; when installing locally, first run conda config --set restore_free_channel true, otherwise most of the pinned package versions cannot be found and environment creation fails; related questions can be directed to the blog "Chains朱朱的主页 - 博客园 (cnblogs.com)";

7. Improving Generalization in Meta Reinforcement Learning using Learned Objectives (MetaGenRL): ICLR 2020

http://louiskirsch.com/code/metagenrl
Environment: TensorFlow, GPU;
Task: MuJoCo;
Model: MetaGenRL;
Experiments: running python ray_experiments.py train hits bugs under both tensorflow-gpu 1.14.0 and tensorflow 1.13.2;

II. RL-Adventure

1. Deep Q-Learning:

See the earlier blog post and repository:
https://www.cnblogs.com/lucifer1997/p/13458563.html;
https://github.com/lucifer2859/DQN;
https://github.com/Kaixhin/Rainbow
Environment: PyTorch, GPU;
Task: Atari;
Model: Rainbow;
Experiments: runs successfully;
https://github.com/TianhongDai/hindsight-experience-replay
Environment: PyTorch, GPU (not recommended; CPU works better);
Task: MuJoCo;
Model: HER;
Experiments: not run;
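HER's core mechanism is goal relabeling: transitions from failed episodes are stored again with a goal the agent actually achieved later in the same episode, so sparse-reward tasks still produce learning signal. Below is a minimal sketch of the "future" relabeling strategy; the transition layout and reward_fn are assumptions, not this repo's data structures.

```python
import numpy as np

def her_relabel(episode, reward_fn, k=4):
    """Hindsight relabeling ('future' strategy): for every transition, store k
    extra copies whose goal is replaced by an achieved goal from a later step
    of the same episode. `episode` is assumed to be a list of dicts with keys
    'obs', 'action', 'achieved_goal', 'goal'."""
    relabeled = []
    T = len(episode)
    for t, tr in enumerate(episode):
        relabeled.append(tr)  # keep the original transition
        for _ in range(k):
            future = episode[np.random.randint(t, T)]   # pick a later achieved goal
            new_goal = future['achieved_goal']
            relabeled.append(dict(tr, goal=new_goal,
                                  reward=reward_fn(tr['achieved_goal'], new_goal)))
    return relabeled
```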

2. Policy Gradients:

https://github.com/higgsfield/RL-Adventure-2
Environment: PyTorch, GPU;
Task: Gym;
Models: A2C, GAE, PPO, ACER, DDPG, TD3, SAC, GAIL, HER;
Experiments: runs successfully; the local code modifies it to fix bugs, address issues, and improve performance, see https://github.com/lucifer2859/Policy-Gradients; in the local code every model (except HER) converges and reaches reasonable performance; for the HER problem see https://github.com/higgsfield/RL-Adventure-2/issues/14; the SAC implementation seems to deviate from the original paper (see https://github.com/higgsfield/RL-Adventure-2/issues/11); the A2C experiments only converge on CartPole-v0;
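Since GAE appears both here and in the L2RL-pytorch entry above, a compact reminder of the estimator may be useful: the advantage is an exponentially weighted sum of one-step TD errors, computed backwards over a rollout. The function below is a generic sketch, not code taken from either repo.

```python
import torch

def gae_advantages(rewards, values, next_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout. `rewards`, `values`,
    and `dones` are 1-D tensors of length T; `next_value` is the bootstrap
    value of the state after the last step. Returns (advantages, returns)."""
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]                                   # cut bootstrap at episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages, advantages + values
```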

https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail;
Environment: PyTorch/TensorFlow, GPU;
Tasks: Atari, MuJoCo, PyBullet (including Racecar, Minitaur and Kuka), DeepMind Control Suite;
Models: A2C, PPO, ACKTR, GAIL;
Experiments: not run;

https://github.com/ikostrikov/pytorch-a3c
Environment: PyTorch, CPU;
Task: Atari;
Model: A3C;
Experiments: initially fails with NotImplementedError; modifying envs.py as described in https://github.com/ikostrikov/pytorch-a3c/issues/66#issuecomment-559785590 resolves it; afterwards it runs successfully;

https://github.com/haarnoja/sac
Environment: TensorFlow, GPU;
Task: Continuous Control Tasks (MuJoCo);
Model: Soft Actor-Critic (SAC, first version, with a separate state-value function V);
Experiments: not run;

https://github.com/denisyarats/pytorch_sac
Environment: PyTorch, GPU;
Task: Continuous Control Tasks (MuJoCo);
Model: Soft Actor-Critic (SAC, first version, with a separate state-value function V);
Experiments: not run;

http://github.com/rail-berkeley/softlearning/
Environment: TensorFlow, GPU;
Task: Continuous Control Tasks (MuJoCo);
Model: Soft Actor-Critic (SAC, second version, which drops the state-value function V);
Experiments: not run;
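The "with V"/"without V" distinction between the three SAC entries above comes down to how the critic targets are formed: the first version maintains a separate state-value network, while the second folds the entropy term directly into the Q target via target Q networks. The sketch below shows just that difference; the batch layout and network interfaces are assumptions, not any repo's API.

```python
import torch

def sac_v1_q_target(v_target_net, batch, gamma):
    """SAC v1: the Q target bootstraps through a separate state-value network V
    (V itself is regressed toward min(Q1, Q2) - alpha * log pi elsewhere)."""
    with torch.no_grad():
        return batch.reward + gamma * (1 - batch.done) * v_target_net(batch.next_obs)

def sac_v2_q_target(policy, q1_target, q2_target, batch, alpha, gamma):
    """SAC v2: no V network; the entropy bonus enters the Q target directly
    through the target Q networks and a fresh policy sample."""
    with torch.no_grad():
        next_action, log_prob = policy.sample(batch.next_obs)
        next_q = torch.min(q1_target(batch.next_obs, next_action),
                           q2_target(batch.next_obs, next_action))
        return batch.reward + gamma * (1 - batch.done) * (next_q - alpha * log_prob)
```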

https://github.com/ku2482/sac-discrete.pytorch
Environment: PyTorch, GPU;
Task: Atari;
Model: SAC-Discrete (a discrete-action variant adapted from the newer continuous-control SAC);
Experiments: runs successfully; the local code modifies it slightly, see https://github.com/lucifer2859/sac-discrete-pytorch; training converges, but performance differs from what the paper reports;
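What makes the discrete variant different is that the expectation over actions in the policy and entropy terms can be computed exactly from the action probabilities, with no reparameterized sampling. The loss below is a generic sketch of that idea rather than this repo's code; the tensor shapes are my assumption.

```python
import torch
import torch.nn.functional as F

def sac_discrete_policy_loss(q1, q2, policy_logits, alpha):
    """Discrete-action SAC actor loss. `q1`/`q2` are per-action Q-value tensors
    of shape (batch, n_actions); `policy_logits` has the same shape."""
    probs = F.softmax(policy_logits, dim=-1)
    log_probs = F.log_softmax(policy_logits, dim=-1)
    min_q = torch.min(q1, q2)
    # E_{a ~ pi}[alpha * log pi(a|s) - Q(s, a)], computed exactly per state
    return (probs * (alpha * log_probs - min_q)).sum(dim=-1).mean()
```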

3. Both (Deep Q-Learning and Policy Gradients):

https://github.com/ShangtongZhang/DeepRL
Environment: PyTorch, GPU;
Tasks: Atari, MuJoCo;
Models: (Double/Dueling/Prioritized) DQN, C51, QR-DQN, (Continuous/Discrete) Synchronous Advantage A2C, N-Step DQN, DDPG, PPO, OC, TD3, COF-PAC, GradientDICE, Bi-Res-DDPG, DAC, Geoff-PAC, QUOTA, ACE;
Experiments: not run;

https://github.com/astooke/rlpyt
Environment: PyTorch, GPU;
Task: Atari;
Models: modular, optimized implementations of common deep RL algorithms in PyTorch, with unified infrastructure supporting all three major families of model-free algorithms: policy gradient, deep Q-learning, and Q-function policy gradient.
Policy Gradient: A2C, PPO.
Replay Buffers (supporting both DQN + QPG): non-sequence and sequence (for recurrent) replay, n-step returns, uniform or prioritized replay, full-observation or frame-based buffer (e.g. for Atari, stores only unique frames to save memory, reconstructs multi-frame observations).
Deep Q-Learning: DQN + variants: Double, Dueling, Categorical (up to Rainbow minus Noisy Nets), Recurrent (R2D2-style).
Q-Function Policy Gradient: DDPG, TD3, SAC.
Experiments: runs successfully with no bugs;

https://github.com/vitchyr/rlkit
Environment: PyTorch, GPU;
Tasks: gym[all];
Models: Skew-Fit, RIG, TDM, HER, DQN, SAC (newer version), TD3, AWAC;
Experiments: not run;

https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch
Environment: PyTorch;
Tasks: CartPole, MountainCar, Bit Flipping, Four Rooms, Long Corridor, Ant-[Maze, Push, Fall];
Models: DQN, DQN with Fixed Q Target, DDQN, DDQN with Prioritised Experience Replay, Dueling DDQN, REINFORCE, DDPG, TD3, SAC, SAC-Discrete, A3C, A2C, PPO, DQN-HER, DDPG-HER, h-DQN, Stochastic NN-HRL, DIAYN;
Experiments: some models run successfully on some tasks (e.g., SAC-Discrete fails to run on Atari);

https://github.com/hill-a/stable-baselines
Environment: TensorFlow;

https://github.com/openai/baselines
Environment: TensorFlow;

https://github.com/openai/spinningup
Environment: TensorFlow/PyTorch;
Description: This is an educational resource produced by OpenAI that makes it easier to learn about deep reinforcement learning (deep RL). For the unfamiliar: reinforcement learning (RL) is a machine learning approach for teaching agents how to solve tasks by trial and error. Deep RL refers to the combination of RL with deep learning. This module contains a variety of helpful resources, including:
a short introduction to RL terminology, kinds of algorithms, and basic theory,
an essay about how to grow into an RL research role,
a curated list of important papers organized by topic,
a well-documented code repo of short, standalone implementations of key algorithms,
and a few exercises to serve as warm-ups.

III. Meta Learning (Learn to Learn)

1. Platform:

https://github.com/learnables/learn2learn
