【RL-Notes】Constraint Decoupling by Constraint Relaxation

Example

Constraint relaxation is a decoupling technique whereby the constraint set is replaced by another constraint set that does not involve coupling.

The multiarmed bandit problem involves $n$ projects, of which only one can be worked on at any time period. Each project $i$ is characterized at time $k$ by its state $x_k^i$. If project $i$ is worked on at time $k$, one receives an expected reward $R^i(x_k^i)$, and the state $x_k^i$ evolves according to the equation:
$$x_{k+1}^i = f^i(x_k^i, w_k^i)$$
where $w_k^i$ is a random disturbance with probability distribution depending on $x_k^i$ but not on prior disturbances. If project $i$ is not worked on, one receives an expected reward $\bar{R}^i(x_k^i)$, and its state changes according to
$$x_{k+1}^i = \bar{f}^i(x_k^i, \bar{w}_k^i)$$
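
To make the setup concrete, here is a minimal Python sketch of the problem data. The specific dynamics and rewards below (a noisy drift for the active project, decay toward zero for idle ones) are illustrative assumptions of mine, not taken from the source:

```python
import numpy as np

# Minimal sketch of the multiarmed bandit setup: n projects, each with a
# scalar state, active dynamics f^i, passive dynamics f̄^i, and rewards
# R^i (worked on) and R̄^i (idle). All concrete choices are illustrative.
rng = np.random.default_rng(0)
n_projects = 3

def f(i, x):
    # Active dynamics f^i: the worked-on project takes a noisy upward step.
    return x + rng.normal(loc=0.1 * (i + 1), scale=0.5)

def f_bar(i, x):
    # Passive dynamics f̄^i: an idle project decays slowly toward zero.
    return 0.9 * x + rng.normal(scale=0.1)

def R(i, x):
    # Expected reward R^i(x) for working on project i in state x.
    return (i + 1) * np.tanh(x)

def R_bar(i, x):
    # Expected reward R̄^i(x) collected while project i is idle.
    return 0.0

# One transition: work on project a; every other project evolves passively.
x = rng.normal(size=n_projects)   # states x_k^1, ..., x_k^n
a = 1                             # project worked on at time k
reward = R(a, x[a]) + sum(R_bar(j, x[j]) for j in range(n_projects) if j != a)
x = np.array([f(j, x[j]) if j == a else f_bar(j, x[j]) for j in range(n_projects)])
```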

In particular, suppose that the optimal reward function $J_k^*(x^1, \dots, x^n)$ is approximated by a separable function of the form $\sum_{i=1}^n \tilde{J}_k^i(x^i)$, where each $\tilde{J}_k^i$ is a function that quantifies the contribution of the $i$th project to the total reward. The corresponding one-step lookahead policy selects at time $k$ the project $i$ that maximizes:
$$R^i(x^i) + \sum_{j\neq i}\bar{R}^j(x^j) + \mathbb{E}\big\{\tilde{J}_{k+1}^i(f^i(x^i, w^i))\big\} + \sum_{j\neq i}\mathbb{E}\big\{\tilde{J}_{k+1}^j(\bar{f}^j(x^j, \bar{w}^j))\big\}$$
which can also be written as:
$$R^i(x^i) - \bar{R}^i(x^i) + \mathbb{E}\big\{\tilde{J}_{k+1}^i(f^i(x^i, w^i)) - \tilde{J}_{k+1}^i(\bar{f}^i(x^i, \bar{w}^i))\big\} + \sum_{j=1}^n\Big(\bar{R}^j(x^j) + \mathbb{E}\big\{\tilde{J}^j_{k+1}(\bar{f}^j(x^j, \bar{w}^j))\big\}\Big)$$
Note that the last term in the above expression does not depend on $i$, so the lookahead policy simply works on the project $i$ that maximizes the index $R^i(x^i) - \bar{R}^i(x^i) + \mathbb{E}\{\tilde{J}_{k+1}^i(f^i(x^i, w^i)) - \tilde{J}_{k+1}^i(\bar{f}^i(x^i, \bar{w}^i))\}$.
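
As a sketch of this selection rule, the function below estimates each project's index by Monte Carlo and works on the maximizer. The rewards, dynamics, and separable approximations `J_tilde` passed in are hypothetical stand-ins, not the source's data:

```python
import numpy as np

rng = np.random.default_rng(1)

def lookahead_project(x, R, R_bar, f, f_bar, J_tilde, n_samples=1000):
    # Pick the project maximizing the decoupled one-step lookahead index
    # R^i(x^i) - R̄^i(x^i) + E{J̃(f^i(x,w))} - E{J̃(f̄^i(x,w̄))}.
    scores = []
    for i in range(len(x)):
        # Monte Carlo estimates of the two expected next-state values.
        active = np.mean([J_tilde(i, f(i, x[i])) for _ in range(n_samples)])
        idle = np.mean([J_tilde(i, f_bar(i, x[i])) for _ in range(n_samples)])
        scores.append(R(i, x[i]) - R_bar(i, x[i]) + active - idle)
    return int(np.argmax(scores))

# Illustrative stand-ins for the rewards, dynamics, and approximations J̃^i.
R = lambda i, x: (i + 1) * np.tanh(x)
R_bar = lambda i, x: 0.0
f = lambda i, x: x + rng.normal(0.1 * (i + 1), 0.5)
f_bar = lambda i, x: 0.9 * x + rng.normal(0.0, 0.1)
J_tilde = lambda i, x: (i + 1) * x        # a crude linear per-project value

x = np.array([0.2, -0.5, 1.0])            # current states x_k^1, ..., x_k^3
print(lookahead_project(x, R, R_bar, f, f_bar, J_tilde))
```

The common term $\sum_{j=1}^n\big(\bar{R}^j(x^j) + \mathbb{E}\{\tilde{J}^j_{k+1}(\bar{f}^j(x^j,\bar{w}^j))\}\big)$ is deliberately omitted from the score, precisely because it does not affect the argmax.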

An alternative possibility is to use a separable parametric approximation of the form
$$\sum_{i=1}^n \tilde{J}_{k+1}^i(x_{k+1}^i, r_{k+1}^i)$$
where $r_{k+1}^i$ are vectors of tunable parameters. The values of $r_{k+1}^i$ can be obtained by some training algorithm.
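
For instance, one common choice (an assumption here, since the text leaves the training algorithm unspecified) is a linear architecture $\tilde{J}^i(x, r^i) = \phi(x)^\top r^i$ with a fixed feature map $\phi$, where $r^i$ is fit by least squares on sampled state-value pairs:

```python
import numpy as np

rng = np.random.default_rng(2)

def phi(x):
    # Hypothetical feature vector for a scalar project state.
    return np.array([1.0, x, x**2])

def fit_parameters(states, targets):
    # Least-squares fit of r^i so that phi(x) @ r^i ≈ the target values.
    Phi = np.stack([phi(x) for x in states])
    r, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    return r

# Fit r^i for one project from sampled states and stand-in value targets
# (in practice the targets would come from simulation or value iteration).
states = rng.normal(size=200)
targets = np.tanh(states) + 0.1 * rng.normal(size=200)
r_i = fit_parameters(states, targets)
J_tilde_i = lambda x: phi(x) @ r_i   # the parametric approximation J̃^i(x, r^i)
```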

Reference

D. P. Bertsekas, Reinforcement Learning and Optimal Control, Athena Scientific, 2019.
