Example
Constraint Relaxation, whereby the constraint set is replaced by another constraint set that does not involve coupling.
The multiarmed bandit problem involves $n$ projects, of which only one can be worked on at any time period. Each project $i$ is characterized at time $k$ by its state $x_k^i$. If project $i$ is worked on at time $k$, one receives an expected reward $R^i(x_k^i)$, and the state $x_k^i$ evolves according to the equation

$$x_{k+1}^i = f^i(x_k^i, w_k^i)$$
where $w_k^i$ is a random disturbance with probability distribution depending on $x_k^i$ but not on prior disturbances. If project $i$ is not worked on, one receives an expected reward $\bar{R}^i(x_k^i)$, and its state changes according to

$$x_{k+1}^i = \bar{f}^i(x_k^i, \bar{w}_k^i)$$
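As a concrete illustration of the two transition modes, here is a minimal Python sketch for a single project; the dynamics and the state-dependent disturbance law are hypothetical stand-ins, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, w):
    """Active transition f^i: applied when project i is worked on."""
    return 0.9 * x + w

def f_bar(x, w_bar):
    """Passive transition (f-bar)^i: applied when project i is idle."""
    return x + w_bar

def step(x, worked_on):
    # The disturbance distribution may depend on the current state x
    # (but not on earlier disturbances); here its scale grows with |x|.
    w = rng.normal(scale=0.1 * (1.0 + abs(x)))
    return f(x, w) if worked_on else f_bar(x, w)

x = 1.0
for k in range(5):
    x = step(x, worked_on=(k % 2 == 0))  # alternate work/idle for illustration
    print(f"k={k + 1}, x={x:.3f}")
```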
In particular, suppose that the optimal reward function $J_k^*(x^1, \dots, x^n)$ is approximated by a separable function of the form $\sum_{i=1}^n \tilde{J}_k^i(x^i)$, where each $\tilde{J}_k^i$ is a function that quantifies the contribution of the $i$th project to the total reward. The corresponding one-step lookahead policy selects at time $k$ the project $i$ that maximizes
$$R^i(x^i) + \sum_{j\neq i}\bar{R}^j(x^j) + \mathbb{E}\big\{\tilde{J}_{k+1}^i\big(f^i(x^i, w^i)\big)\big\} + \sum_{j\neq i}\mathbb{E}\big\{\tilde{J}_{k+1}^j\big(\bar{f}^j(x^j, \bar{w}^j)\big)\big\}$$
which can also be written as:
$$R^i(x^i) - \bar{R}^i(x^i) + \mathbb{E}\big\{\tilde{J}_{k+1}^i\big(f^i(x^i, w^i)\big) - \tilde{J}_{k+1}^i\big(\bar{f}^i(x^i, \bar{w}^i)\big)\big\} + \sum_{j=1}^n\Big(\bar{R}^j(x^j) + \mathbb{E}\big\{\tilde{J}_{k+1}^j\big(\bar{f}^j(x^j, \bar{w}^j)\big)\big\}\Big)$$
Note that the last term (the sum over all $j$) in the above expression does not depend on $i$, so it can be dropped from the maximization: the one-step lookahead policy simply works on the project $i$ with the largest value of $R^i(x^i) - \bar{R}^i(x^i) + \mathbb{E}\big\{\tilde{J}_{k+1}^i(f^i(x^i, w^i)) - \tilde{J}_{k+1}^i(\bar{f}^i(x^i, \bar{w}^i))\big\}$.
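A hedged Python sketch of this selection rule, assuming the problem data ($R^i$, $\bar{R}^i$, $f^i$, $\bar{f}^i$, $\tilde{J}_{k+1}^i$, and the disturbance samplers) is supplied as callables; all names are illustrative, and the two expectations are estimated by Monte Carlo:

```python
import numpy as np

def lookahead_project(x, R, R_bar, f, f_bar, J_tilde,
                      sample_w, sample_w_bar, n_samples=1000):
    """Pick the project maximizing the index
    R^i - Rbar^i + E{Jtilde(f^i) - Jtilde(fbar^i)}.

    x: list of per-project states; R, R_bar, f, f_bar, J_tilde: lists of
    callables indexed by project; sample_w(x), sample_w_bar(x): draw the
    state-dependent disturbances w^i and wbar^i.
    """
    indices = np.empty(len(x))
    for i in range(len(x)):
        # Monte Carlo estimate of the expected value difference between
        # working on project i and leaving it idle for one step.
        diff = np.mean([
            J_tilde[i](f[i](x[i], sample_w(x[i])))
            - J_tilde[i](f_bar[i](x[i], sample_w_bar(x[i])))
            for _ in range(n_samples)
        ])
        indices[i] = R[i](x[i]) - R_bar[i](x[i]) + diff
    return int(np.argmax(indices))
```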
An alternative possibility is to use a separable parametric approximation of the form

$$\sum_{i=1}^n \tilde{J}_{k+1}^i(x_{k+1}^i, r_{k+1}^i)$$
where the $r_{k+1}^i$ are vectors of 'tunable' parameters. The values of $r_{k+1}^i$ can be obtained by some training algorithm.
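For instance, if each $\tilde{J}_{k+1}^i(\cdot, r_{k+1}^i)$ is linear in its parameters, one simple training choice (an assumption here, not prescribed by the text) is least-squares regression of sampled target values onto a feature map:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    """Illustrative feature vector for a scalar project state."""
    return np.array([1.0, x, x * x])

# Hypothetical training set for project i: sampled states paired with
# target values (whatever the chosen training scheme provides, e.g.
# rollout estimates of the project's contribution to the reward).
states = rng.uniform(-1.0, 1.0, size=200)
targets = np.sin(states) + rng.normal(scale=0.05, size=200)

# Least-squares fit of r_{k+1}^i for Jtilde(x, r) = r . phi(x).
Phi = np.stack([phi(x) for x in states])
r, *_ = np.linalg.lstsq(Phi, targets, rcond=None)

J_tilde = lambda x: r @ phi(x)   # the trained approximation Jtilde_{k+1}^i
print("fitted parameter vector r:", r)
```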
Reference
Bertsekas, D. P., Reinforcement Learning and Optimal Control, Athena Scientific, 2019.