Characteristics of Reinforcement Learning
Compared with other machine learning methods, reinforcement learning:
- There is no supervisor, only a reward signal. Unlike supervised and unsupervised learning, RL has no predefined "correct action"; the process resembles a child learning by trial and error.
- Feedback is delayed, not instantaneous. E.g., only after several steps can we tell whether the initial choice was right or wrong.
- Time really matters for an RL system: the data is sequential, not i.i.d.
- The agent's actions affect the subsequent data it receives (the environment it finds itself in).
Concepts
- Reward: a reward $R_t$ is a scalar feedback signal. Unlike the positive/negative objective function in a game, it indicates how good the agent's choice at step $t$ was. The agent's goal is to maximise cumulative reward.
- Reward Hypothesis: All goals can be described by the maximisation of expected cumulative reward.
At every time step the agent receives two inputs (an observation and a reward) and produces one output (an action); a minimal sketch of this loop is given below.
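The following is a rough sketch of this interaction loop, added here for illustration (it is not part of the original notes): the `ToyEnv` class and its reward scheme are invented, and the point is only the shape of the interface — an observation and a reward come in, an action goes out, and reward is accumulated.

```python
import random

class ToyEnv:
    """A made-up 1-D environment: the agent walks left/right and is rewarded for reaching position 3."""
    def __init__(self):
        self.pos = 0

    def step(self, action):                    # action is -1 or +1
        self.pos += action
        observation = self.pos                 # the agent observes its position
        reward = 1.0 if self.pos == 3 else 0.0
        done = abs(self.pos) >= 3              # episode ends at either boundary
        return observation, reward, done

def random_policy(observation):
    """Placeholder policy: ignores the observation and acts at random."""
    return random.choice([-1, +1])

env = ToyEnv()
obs, cumulative_reward, done = 0, 0.0, False
while not done:
    action = random_policy(obs)                # one output: the action
    obs, reward, done = env.step(action)       # two inputs: observation and reward
    cumulative_reward += reward                # the quantity the agent tries to maximise

print("cumulative reward:", cumulative_reward)
```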
- History: the history is the sequence of observations, actions, and rewards,
  $H_t = A_1, O_1, R_1, \cdots, A_t, O_t, R_t.$
  The goal of the algorithm is to build a mapping from history to action.
- State: the state is the information used to determine what happens next; it contains everything needed to make the next decision,
  $S_t = f(H_t).$
  A couple of concrete choices of $f$ are sketched below.
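As a loose illustration (my own addition, with an invented tuple layout for the history entries), here are two common choices of $f$: keeping only the most recent observation, or stacking the last $k$ observations.

```python
# The history H_t as a list of (action, observation, reward) tuples.
history = [(+1, 0.2, 0.0), (-1, 0.1, 0.0), (+1, 0.5, 1.0)]

def last_observation_state(history):
    """S_t = f(H_t): keep only the most recent observation."""
    _, obs, _ = history[-1]
    return obs

def stacked_state(history, k=2):
    """S_t = f(H_t): keep the last k observations (padding with None when the history is short)."""
    obs = [o for _, o, _ in history[-k:]]
    return tuple([None] * (k - len(obs)) + obs)

print(last_observation_state(history))   # 0.5
print(stacked_state(history, k=2))       # (0.1, 0.5)
```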
- Environment state: the environment state $S^e_t$ is the environment's private representation, i.e. all the information the environment uses to produce its next output. It is usually partly invisible to the agent, and even when visible it may contain information that is useless for the agent's decision.
- Agent state: the agent state $S^a_t$ is the agent's internal representation; this is the information actually used by the reinforcement learning algorithm,
  $S^a_t = f(H_t).$
- Information state: an information state (a.k.a. Markov state) contains all useful information from the history.
  A state $S_t$ is Markov if and only if
  $P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \cdots, S_t],$
  i.e. the future is independent of the past given the present:
  $H_{1:t} \rightarrow S_t \rightarrow H_{t+1:\infty}.$
  Once the state $S_t$ has been extracted, the Markov property says it already contains all the useful information, so the history can be discarded and only the state needs to be kept.
  E.g., the environment state is Markov, and the entire history is also a Markov state. A toy contrast between a Markov and a non-Markov state is sketched below.
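To make the Markov property concrete, here is a small invented example (not from the original notes): an object moving at constant velocity. The position alone is not a Markov state, since predicting the next position also needs the previous one, whereas the pair (position, velocity) summarises everything relevant from the history.

```python
# An object moving in 1-D with constant velocity: x_{t+1} = x_t + v.
v = 2.0
positions = [0.0]
for _ in range(5):
    positions.append(positions[-1] + v)            # 0, 2, 4, 6, 8, 10

# Non-Markov state: the position alone. Knowing only x_t = positions[2] does not
# determine x_{t+1}; the previous position is also needed to recover the velocity.

# Markov state: the pair (position, velocity) summarises the history,
# so the next state follows from it alone.
state = (positions[2], positions[2] - positions[1])
next_position = state[0] + state[1]
print(next_position == positions[3])               # True
```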
- Full observability: the agent directly observes the environment state,
  $O_t = S^a_t = S^e_t.$
  This is the best case: the agent has access to the full information of the environment, so
  agent state = environment state = information state.
  Formally, this is a Markov decision process (MDP).
- Partial observability: the agent observes the environment only indirectly.
  E.g., a robot with camera vision isn't told its absolute location.
  Here agent state ≠ environment state.
  Formally, this is a partially observable Markov decision process (POMDP).
  The agent must construct its own state representation $S^a_t$, e.g. the complete history,
  $S^a_t = H_t,$
  or a Bayesian belief over the environment state,
  $S^a_t = (P[S^e_t = s^1], \cdots, P[S^e_t = s^n]).$
  A sketch of such a belief update follows.
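As an illustration of this belief-state idea (a sketch with a made-up two-state POMDP, not from the original notes): the agent keeps a probability distribution over possible environment states and updates it after each observation with Bayes' rule.

```python
import numpy as np

# Hypothetical POMDP with two hidden environment states and two observations.
# transition[s, s'] = P[S'_e = s' | S_e = s]   (action dependence omitted for brevity)
transition = np.array([[0.9, 0.1],
                       [0.2, 0.8]])
# observe[s, o] = P[O = o | S_e = s]
observe = np.array([[0.7, 0.3],
                    [0.1, 0.9]])

def belief_update(belief, observation):
    """One Bayes-filter step: predict with the transition model, then correct with the observation."""
    predicted = belief @ transition                  # sum_s P[s'|s] * b(s)
    corrected = predicted * observe[:, observation]  # multiply by the observation likelihood
    return corrected / corrected.sum()               # renormalise

# The agent state is the belief vector (P[S_e = s^1], P[S_e = s^2]).
belief = np.array([0.5, 0.5])
for o in [0, 0, 1]:                                  # a made-up observation sequence
    belief = belief_update(belief, o)
print(belief)
```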
- Exploration vs. exploitation:
  Exploration searches over uncertain strategies across the whole space, deliberately giving up some reward in order to gather more information about the environment and to avoid getting stuck in a local optimum.
  Exploration finds more information about the environment.
  Exploitation searches near the current best solution, using already-known information to maximise reward and find a better solution.
  Exploitation exploits known information to maximise reward. A standard way to trade the two off is sketched below.
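ε-greedy action selection is one standard way to balance the two; the sketch below uses an invented three-armed bandit. With probability ε the agent explores a random arm, otherwise it exploits the arm with the highest estimated value.

```python
import random

# A made-up 3-armed bandit: each arm pays out 1 with the given probability.
true_payout = [0.2, 0.5, 0.8]
q_estimate = [0.0, 0.0, 0.0]                       # estimated value of each arm
counts = [0, 0, 0]
epsilon = 0.1

for t in range(10_000):
    if random.random() < epsilon:
        arm = random.randrange(3)                  # explore: try a random arm
    else:
        arm = q_estimate.index(max(q_estimate))    # exploit: pick the best-looking arm
    reward = 1.0 if random.random() < true_payout[arm] else 0.0
    counts[arm] += 1
    q_estimate[arm] += (reward - q_estimate[arm]) / counts[arm]  # incremental mean

print("estimates:", [round(q, 2) for q in q_estimate])
print("pulls per arm:", counts)
```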
- Prediction vs. control:
  Prediction: evaluate the future, given a policy.
  Control: optimise the future, i.e. find the best policy.
  Both are illustrated on a tiny example below.
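To make the distinction concrete, here is a self-contained sketch on an invented two-state MDP: prediction evaluates a fixed uniformly random policy by iterating the Bellman expectation backup, while control runs value iteration to find the optimal values.

```python
import numpy as np

# A made-up MDP with 2 states and 2 actions.
# P[a][s, s'] = transition probability, R[a][s] = expected immediate reward.
P = {0: np.array([[0.9, 0.1], [0.3, 0.7]]),
     1: np.array([[0.2, 0.8], [0.6, 0.4]])}
R = {0: np.array([0.0, 1.0]),
     1: np.array([0.5, 0.0])}
gamma = 0.9
actions = [0, 1]

# Prediction: evaluate a fixed (uniformly random) policy.
v = np.zeros(2)
for _ in range(500):
    v = sum(0.5 * (R[a] + gamma * P[a] @ v) for a in actions)
print("value of the random policy:", v.round(2))

# Control: value iteration finds the optimal value function.
v_star = np.zeros(2)
for _ in range(500):
    v_star = np.max([R[a] + gamma * P[a] @ v_star for a in actions], axis=0)
print("optimal values:", v_star.round(2))
```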
Main Components of an RL Agent (not all are required)
- Policy: the agent's behaviour function; it is a map from state to action.
  Deterministic policy: $a = \pi(s)$, where $s$ is a state.
  Stochastic policy: $\pi(a \mid s) = P[A = a \mid S = s].$
- Value function: how good each state and/or action is; a prediction of future reward,
  $v_\pi(s) = \mathbb{E}[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots \mid S_t = s].$
  $\gamma$ is the discount factor: future rewards are taken into account, but current rewards are weighted more heavily.
- Model: the agent's representation of the environment (not necessarily the real environment, but the agent's view of it). A model predicts what the environment will do next.
  - Transitions: $P$ predicts the next state (the dynamics),
    $P^a_{ss'} = P[S' = s' \mid S = s, A = a].$
  - Rewards: $R$ predicts the next (immediate) reward,
    $R^a_s = \mathbb{E}[R \mid S = s, A = a].$
  Note the difference between the value function and the reward: the value function is a far-sighted prediction, whereas the reward is immediate feedback from the current environment; the value function is composed of many immediate rewards. A minimal sketch of these three components follows this list.
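As a loose sketch of how these three components might look in code (the states, tables, and numbers are all invented): a policy maps states to actions, a value function maps states to expected discounted return, and a model stores transition probabilities and expected immediate rewards. The `one_step_lookahead` helper is hypothetical; it only shows how a model can be combined with a value function to predict the consequence of an action.

```python
# Hypothetical agent with two states "A", "B" and actions "left", "right".

# Policy: a map from state to action (deterministic here; a stochastic policy
# would map each state to a distribution over actions).
policy = {"A": "right", "B": "left"}

# Value function: a prediction of future (discounted) reward for each state.
value = {"A": 4.2, "B": 7.9}

# Model: the agent's view of the environment.
# transitions[(s, a)] = {s': P[S'=s' | S=s, A=a]},  rewards[(s, a)] = E[R | S=s, A=a]
transitions = {("A", "right"): {"A": 0.1, "B": 0.9},
               ("A", "left"):  {"A": 1.0},
               ("B", "right"): {"B": 1.0},
               ("B", "left"):  {"A": 0.8, "B": 0.2}}
rewards = {("A", "right"): 0.0, ("A", "left"): 0.0,
           ("B", "right"): 1.0, ("B", "left"): 0.0}

def one_step_lookahead(state, action, gamma=0.9):
    """Use the model to predict the expected immediate reward plus the discounted next-state value."""
    expected_next_value = sum(p * value[s_next]
                              for s_next, p in transitions[(state, action)].items())
    return rewards[(state, action)] + gamma * expected_next_value

print(policy["A"], one_step_lookahead("A", policy["A"]))
```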
Fundamental Problems in RL
Two fundamental classes of sequential decision problems:
- Reinforcement learning:
  The environment is initially unknown; the agent interacts with the environment; the agent improves its policy.
- Planning:
  The environment (i.e. a model of it) is known; the agent performs computations with the model instead of interacting with the real environment; the agent improves its policy.
  The sketch below contrasts the two settings.
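A small invented sketch of this contrast: the learning agent can only sample transitions by interacting (tabular Q-learning is used here as one example), while the planner queries the known dynamics directly and never interacts with the real environment; both should end up with roughly the same values.

```python
import random

# A made-up 4-state chain: states 0..3, actions -1/+1, reward 1 for reaching state 3.
def true_dynamics(s, a):
    s_next = min(max(s + a, 0), 3)
    return s_next, (1.0 if s_next == 3 else 0.0)

gamma, states, actions = 0.9, range(4), (-1, +1)

# Reinforcement learning: the dynamics are unknown to the agent, which learns
# action values purely from sampled interaction (tabular Q-learning).
Q = {(s, a): 0.0 for s in states for a in actions}
for _ in range(20_000):
    s = random.choice(states)
    a = random.choice(actions)
    s_next, r = true_dynamics(s, a)              # real interaction with the environment
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += 0.1 * (target - Q[(s, a)])

# Planning: the model is known, so the agent computes values by pure lookahead
# (value iteration) with no interaction at all.
V = {s: 0.0 for s in states}
for _ in range(200):
    V = {s: max(r + gamma * V[s_next]
                for s_next, r in (true_dynamics(s, a) for a in actions))
         for s in states}

print("learned:", {s: round(max(Q[(s, a)] for a in actions), 2) for s in states})
print("planned:", {s: round(V[s], 2) for s in states})
```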