Introduction
In the reinforcement learning paradigm, an agent receives from its environment a scalar reward value called \emph{reinforcement}. This feedback is rather poor: it can be boolean (true, false) or fuzzy (bad, fair, very good, ...), and, moreover, it may be delayed. A sequence of control actions is often executed before receiving any information on the quality of the whole sequence. It is therefore difficult to evaluate the contribution of an individual action.
Q-learning
Q-learning is a form of competitive learning which provides agents with the capability of learning to act optimally by evaluating the consequences of actions. Q-learning keeps a Q-function which attempts to estimate the discounted future reinforcement for taking actions from given states. A Q-function is a mapping from state-action pairs to predicted reinforcement. In order to explain the method, we adopt the implementation proposed by Bersini.
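Before turning to the cell-based implementation, the sketch below illustrates the standard tabular Q-learning update, \(Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]\). It is a minimal illustration, not Bersini's scheme: the discrete state/action encoding, the learning rate \(\alpha\), the discount factor \(\gamma\) and the \(\epsilon\)-greedy exploration are assumptions made here for the example.

```python
import random
from collections import defaultdict


class TabularQLearner:
    """Minimal tabular Q-learning sketch (illustrative only)."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)   # Q-function: (state, action) -> predicted reinforcement
        self.actions = actions        # finite, discrete action set (assumption)
        self.alpha = alpha            # learning rate
        self.gamma = gamma            # discount factor for future reinforcement
        self.epsilon = epsilon        # exploration probability

    def choose_action(self, state):
        """Epsilon-greedy selection over the current Q estimates."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reinforcement, next_state):
        """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reinforcement + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```

In the implementation described next, the states would correspond to the cells of the partitioned state space, and the scalar reinforcement would take the quality values attached to the target, viability and failure zones.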
- The state space, \(U\subset R^{n}\), is partitioned into hypercubes or cells. Among these cells we can distinguish: (a) one particular cell, called the target cell, to which the quality value +1 is assigned; (b) a subset of cells, called the viability zone, that the process must not leave, with the quality value 0 (this notion of viability zone comes from Aubin and eliminates strong constraints on a reference trajectory for the process); (c) the remaining cells, called the failure zone, with the quality value -1.
- In each cell, a set of \(J\) agents competes to control the process. With \(M\) cells, the agent \(j\), \(j \in \{1,\ldots, J\}\)