Contents
1. Setting up the environment
2. The epsilon-greedy algorithm
3. The softmax exploration algorithm
4. The upper confidence bound algorithm
5. The Thompson sampling algorithm
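Problem description:
The multi-armed bandit (MAB) problem is a classic problem in reinforcement learning. A bandit is a slot machine, a gambling game played in casinos: you pull an arm (lever) and receive a payout (reward) drawn from a randomly generated probability distribution. A multi-armed bandit is a set of such machines, each with its own reward distribution.
Our goal is to find, over a sequence of pulls, which machine yields the greatest reward, that is, to maximize the cumulative reward.
Implementation steps:
1. Setting up the environment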
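pip3 install gym_bandits

import gym
import gym_bandits
import numpy as np

env = gym.make("BanditTenArmedGaussian-v0")
# reset once before stepping; some gym versions refuse env.step() before env.reset()
env.reset()
# number of arms (10 for this environment)
print(env.action_space.n)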
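For readers who cannot install gym_bandits, the following is a minimal stand-in sketch of a ten-armed Gaussian bandit in plain NumPy. It assumes that each arm's true mean payout is drawn once from a standard normal distribution and that each pull returns that mean plus unit-variance Gaussian noise, which is how the ten-armed testbed is usually described. The class name GaussianBandit and its step and optimal_arm methods are hypothetical and are not part of gym or gym_bandits.

```python
import numpy as np


class GaussianBandit:
    """Hypothetical stand-in for a k-armed Gaussian bandit (not the gym_bandits API)."""

    def __init__(self, n_arms=10, seed=None):
        self.rng = np.random.default_rng(seed)
        self.n_arms = n_arms
        # Assumption: the true mean of each arm is drawn once from N(0, 1)
        self.means = self.rng.normal(0.0, 1.0, size=n_arms)

    def step(self, arm):
        # Assumption: the reward is the arm's true mean plus N(0, 1) noise
        return self.rng.normal(self.means[arm], 1.0)

    def optimal_arm(self):
        # Index of the arm with the highest true mean
        return int(np.argmax(self.means))


bandit = GaussianBandit(n_arms=10, seed=0)
rewards = [bandit.step(3) for _ in range(5000)]
# The sample mean of many pulls should be close to the arm's true mean
print(abs(np.mean(rewards) - bandit.means[3]) < 0.1)
```

Unlike the gym environment, this sketch returns only the reward from step, so the loops in the following sections would need a one-line change to use it.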
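2. The epsilon-greedy algorithm
With the epsilon-greedy policy, we either select the best-performing arm so far (with probability 1 - epsilon) or select an arm at random (with probability epsilon).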
'''initialize all variables'''
# number of rounds
num_rounds = 20000
# number of times each arm was pulled
count = np.zeros(10)
# sum of rewards of each arm
sum_rewards = np.zeros(10)
# Q value is the average reward of each arm
Q = np.zeros(10)

# define the epsilon_greedy function
def epsilon_greedy(epsilon):
    rand = np.random.random()
    if rand < epsilon:
        action = env.action_space.sample()
    else:
        action = np.argmax(Q)
    return action

# start pulling arms
for i in range(num_rounds):
    # select the arm using epsilon-greedy
    arm = epsilon_greedy(0.5)
    # get the reward
    observation, reward, done, info = env.step(arm)
    # update the count of that arm
    count[arm] += 1
    # sum the reward
    sum_rewards[arm] += reward
    # calculate the Q value, which is the average reward of the arm
    Q[arm] = sum_rewards[arm] / count[arm]
print('the optimal arm is {}'.format(np.argmax(Q)))
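The loop above recomputes each Q value as a running sum divided by a pull count. The same average can be maintained without storing the sum, using the incremental update Q_n = Q_{n-1} + (r_n - Q_{n-1}) / n. The sketch below checks that the two forms agree on simulated rewards for a single arm; the reward distribution and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = rng.normal(1.0, 1.0, size=1000)  # simulated rewards for a single arm

# Batch form used in the loop above: running sum divided by pull count
batch_q = rewards.sum() / len(rewards)

# Incremental form: Q_n = Q_{n-1} + (r_n - Q_{n-1}) / n
q = 0.0
for n, r in enumerate(rewards, start=1):
    q += (r - q) / n

print(np.isclose(q, batch_q))
```

The incremental form needs only the current estimate and the pull count per arm, which is convenient when the number of rounds is large.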
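3. The softmax exploration algorithm
In softmax (Boltzmann) exploration, we select an arm with a probability given by the Boltzmann distribution over the current Q values. The temperature tau controls the amount of exploration: a high tau makes the arms nearly equally likely, while a low tau favors the arm with the highest Q value.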
import math
import random

''' in softmax exploration, we select an arm based on a probability from
the Boltzmann distribution'''

# start from fresh statistics so results from the previous algorithm do not leak in
count = np.zeros(10)
sum_rewards = np.zeros(10)
Q = np.zeros(10)

# define the softmax function
def softmax(tau):
    total = sum(math.exp(val / tau) for val in Q)
    probs = [math.exp(val / tau) / total for val in Q]
    # sample an arm by walking the cumulative distribution
    threshold = random.random()
    cumulative_prob = 0.0
    for i in range(len(probs)):
        cumulative_prob += probs[i]
        if cumulative_prob > threshold:
            return i
    # fallback in case rounding keeps the cumulative sum below the threshold
    return np.argmax(probs)

# start pulling arms
for i in range(num_rounds):
    # select the arm using softmax
    arm = softmax(0.5)
    # get the reward
    observation, reward, done, info = env.step(arm)
    # update the count of that arm
    count[arm] += 1
    # sum the rewards
    sum_rewards[arm] += reward
    # calculate the Q value
    Q[arm] = sum_rewards[arm] / count[arm]
print("the optimal arm is {}".format(np.argmax(Q)))
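The softmax function above exponentiates Q / tau directly, which can overflow when the Q values are large or tau is small. A common variant, sketched here as an alternative rather than as the method used above, subtracts the maximum before exponentiating and samples with NumPy. The function names softmax_probs and softmax_select and the example values are hypothetical.

```python
import numpy as np


def softmax_probs(q_values, tau):
    """Boltzmann probabilities for the given value estimates and temperature tau."""
    scaled = np.asarray(q_values, dtype=float) / tau
    # Subtracting the maximum leaves the probabilities unchanged but avoids overflow in exp
    scaled -= scaled.max()
    weights = np.exp(scaled)
    return weights / weights.sum()


def softmax_select(q_values, tau, rng):
    # Draw one arm index according to the Boltzmann probabilities
    probs = softmax_probs(q_values, tau)
    return int(rng.choice(len(probs), p=probs))


rng = np.random.default_rng(0)
q_example = [0.1, 0.5, 2.0]
print(softmax_probs(q_example, tau=0.5))    # moderate temperature
print(softmax_probs(q_example, tau=100.0))  # high temperature: close to uniform
print(softmax_select(q_example, tau=0.5, rng=rng))
```

Printing the probabilities for two temperatures shows how tau trades exploration against exploitation: the larger tau is, the flatter the distribution becomes.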
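4. The upper confidence bound algorithm
This algorithm gives a chance to arms that perform poorly in the early rounds but turn out to perform well in later rounds: an arm that has been pulled less often receives a larger exploration bonus. The upper confidence bound (UCB) algorithm is also described as optimism in the face of uncertainty.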
'''
1. Select the action (arm) that has a high sum of average reward and upper
confidence bound
2. Pull the arm and receive a reward
3. Update the arm's reward and confidence bound
'''

# start from fresh statistics so results from the previous algorithm do not leak in
count = np.zeros(10)
sum_rewards = np.zeros(10)
Q = np.zeros(10)

# define the upper confidence bound function
def UCB(iters):
    ucb = np.zeros(10)
    # pull every arm once first, so that no count is zero in the formula below
    if iters < 10:
        return iters
    else:
        for arm in range(10):
            # calculate the upper bound
            upper_bound = math.sqrt((2 * math.log(sum(count))) / count[arm])
            # add the upper bound to the Q value
            ucb[arm] = Q[arm] + upper_bound
        # return the arm which has the maximum value
        return np.argmax(ucb)

# start pulling arms
for i in range(num_rounds):
    # select the arm using UCB
    arm = UCB(i)
    # get the reward
    observation, reward, done, info = env.step(arm)
    # update the count
    count[arm] += 1
    # sum the rewards
    sum_rewards[arm] += reward
    # calculate the Q value
    Q[arm] = sum_rewards[arm] / count[arm]
print("the optimal arm is {}".format(np.argmax(Q)))
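To see how the bound sqrt(2 * ln(total pulls) / pulls of the arm) steers selection, the sketch below computes the scores for two arms with made-up statistics. The helper name ucb_scores and the numbers are illustrative, and the helper assumes every arm has been pulled at least once.

```python
import math
import numpy as np


def ucb_scores(q_values, counts):
    """Q plus the exploration bonus sqrt(2 * ln(total pulls) / pulls of the arm)."""
    q_values = np.asarray(q_values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    bonus = np.sqrt(2.0 * math.log(total) / counts)
    return q_values + bonus


# Illustrative numbers: arm 1 has the higher average but arm 0 has barely been tried
q_example = [0.9, 1.0]
count_example = [2, 200]
scores = ucb_scores(q_example, count_example)
print(scores)
print(int(np.argmax(scores)))
```

Arm 0 wins here despite its lower average, because its bonus is much larger after only two pulls. As an arm accumulates pulls its bonus shrinks and the score approaches the plain average.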
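5. The Thompson sampling algorithm
Thompson sampling is a probabilistic algorithm based on a prior distribution: it keeps a distribution over each arm's value, samples from those distributions to choose an arm, and updates the chosen arm's distribution with the observed reward.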
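'''
1. Sample a value from each of the k distributions and use this value as a prior
mean.
2. Select the arm that has the highest prior mean and observe the reward.
3. Use the observed reward to modify the prior distribution.
'''

# start from fresh statistics so results from the previous algorithm do not leak in
count = np.zeros(10)
sum_rewards = np.zeros(10)
Q = np.zeros(10)

# initialize the alpha and beta values of each arm's Beta distribution
alpha = np.ones(10)
beta = np.ones(10)

# define the thompson_sampling function
def thompson_sampling(alpha, beta):
    # draw one sample per arm and pick the arm with the largest sample
    samples = [np.random.beta(alpha[i] + 1, beta[i] + 1) for i in range(10)]
    return np.argmax(samples)

# start pulling arms
for i in range(num_rounds):
    arm = thompson_sampling(alpha, beta)
    observation, reward, done, info = env.step(arm)
    count[arm] += 1
    sum_rewards[arm] += reward
    Q[arm] = sum_rewards[arm] / count[arm]
    # a positive reward counts as a success, anything else as a failure
    # (a simplification: the Beta update ignores the size of the reward)
    if reward > 0:
        alpha[arm] += 1
    else:
        beta[arm] += 1
print('the optimal arm is {}'.format(np.argmax(Q)))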
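The Beta update above fits most naturally when rewards are 0 or 1. The following self-contained sketch runs Thompson sampling on a three-armed Bernoulli bandit so the posterior update can be followed without gym. The success probabilities, the round count, and the uniform Beta(1, 1) prior are illustrative assumptions and differ slightly from the Beta(alpha + 1, beta + 1) sampling used above.

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.2, 0.5, 0.8])  # illustrative success probabilities
n_arms = len(true_probs)

# Beta(1, 1) is the uniform prior over each arm's success probability
successes = np.ones(n_arms)
failures = np.ones(n_arms)
pulls = np.zeros(n_arms, dtype=int)

for _ in range(3000):
    # Sample one plausible success probability per arm from its posterior
    samples = rng.beta(successes, failures)
    arm = int(np.argmax(samples))
    # Bernoulli reward: True with probability true_probs[arm], else False
    reward = rng.random() < true_probs[arm]
    if reward:
        successes[arm] += 1
    else:
        failures[arm] += 1
    pulls[arm] += 1

print(pulls)
print(successes / (successes + failures))  # posterior mean per arm
```

The pull counts show how sampling concentrates on the best arm as its posterior narrows, while the weaker arms are tried only occasionally.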
Reference:
Hands-On Reinforcement Learning with Python: Master Reinforcement and Deep Reinforcement Learning Using OpenAI Gym and TensorFlow