Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization

Feb. 23, 2021

$\underline{\text{Aim}}$

In this paper, an efficient algorithm for bandit online linear optimization is proposed that achieves the optimal $O^*(T^{\frac{1}{2}})$ regret. Whether such an efficient algorithm exists had been posed as an open question in several earlier papers. This paper exploits a self-concordant potential function to overcome the difficulties encountered in previous studies.

$\underline{\text{Background}}$

A sequential decision-making problem known as “the multi-armed bandit problem” derives from a model in which, on each round of a sequence, a gambler must pull the arm of one of several slot machines (“one-armed bandits”), each of which returns a reward drawn stochastically from a fixed distribution. The gambler does not know the best arm a priori; his goal is to maximize the reward of his strategy relative to the reward he would have received had he known the optimal arm.

Several authors have proposed a very natural generalization of the multi-armed bandit problem to the field of convex optimization, called “bandit linear optimization”. In this setting we imagine that, on each round $t$, an adversary chooses some linear function $f_t(\cdot)$, which is not revealed to the player. The player then chooses a point $\mathbf{x}_t$ within some given convex set $\mathcal{K} \subseteq \mathbb{R}^n$. The player then suffers $f_t(\mathbf{x}_t)$, and this quantity is revealed to him. This process continues for $T$ rounds, and at the end the learner’s payoff is his regret:
$$R_{T}=\sum_{t=1}^{T} f_{t}\left(\mathbf{x}_{t}\right)-\min _{\mathbf{x}^{*} \in \mathcal{K}} \sum_{t=1}^{T} f_{t}\left(\mathbf{x}^{*}\right)$$

In the full-information model, it has been known for some time that the optimal regret bound is $O(T^{\frac{1}{2}})$. It had been conjectured that this $O(T^{\frac{1}{2}})$ bound also holds for the bandit version. However, several of the algorithms proposed achieve only $O(T^{\frac{3}{4}})$ or $O(T^{\frac{2}{3}})$. The one that does achieve $O(\mathrm{poly}(n)T^{\frac{1}{2}})$ is, unfortunately, not efficient.

This paper proposes an algorithm that is efficient and achieves an $O(\mathrm{poly}(n)T^{\frac{1}{2}})$ regret bound. Moreover, the paper discovers a link between Bregman divergences and self-concordant barriers: divergence functions provide the right perspective for the problem of managing uncertainty given limited feedback.

$\underline{\text{Brief Project Description}}$

The terms “full-information version” and “bandit version” were mentioned above. They are explained here, after the definition of an online linear optimization problem. This problem is defined as the following repeated game between the learner (player) and the environment (adversary).

At each time step $t=1$ to $T$,

$\bullet$ Player chooses $\mathbf{x}_t\in\mathcal{K}$
$\bullet$ Adversary independently chooses $\mathbf{f}_t\in\mathbb{R}^n$
$\bullet$ Player suffers loss $\mathbf{f}_t^\top\mathbf{x}_t$ and observes feedback $\Im$.

In this game, the Player’s goal is to minimize his regret $R_T$, defined as

$$R_{T}:=\sum_{t=1}^{T} \mathbf{f}_{t}^{\top} \mathbf{x}_{t}-\min _{\mathbf{x}^{*} \in \mathcal{K}} \sum_{t=1}^{T} \mathbf{f}_{t}^{\top} \mathbf{x}^{*}$$
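The protocol and the regret can be traced in a few lines. The following is a minimal sketch, assuming for illustration that $\mathcal{K}$ is the probability simplex (so the best fixed point is a vertex) and that the loss vectors are fixed in advance; the names `play_game` and `uniform_player` are hypothetical.

```python
def play_game(player, losses):
    """Run the repeated game and return the regret R_T.

    player(history) -> point x_t in the simplex (list of floats).
    losses: list of loss vectors f_t, fixed in advance (assumption).
    """
    n = len(losses[0])
    cumulative = [0.0] * n      # sum_t f_t, needed only for the comparator
    suffered = 0.0              # sum_t f_t^T x_t
    history = []                # (x_t, scalar loss) pairs
    for f in losses:
        x = player(history)
        loss = sum(fi * xi for fi, xi in zip(f, x))
        suffered += loss
        history.append((x, loss))   # the player never sees f itself
        cumulative = [c + fi for c, fi in zip(cumulative, f)]
    # A linear function on the simplex is minimized at a vertex,
    # so the comparator term is the smallest cumulative coordinate.
    best_fixed = min(cumulative)
    return suffered - best_fixed


def uniform_player(history):
    """A player that ignores all feedback (illustration only)."""
    return [0.5, 0.5]


losses = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
regret = play_game(uniform_player, losses)
print(regret)
```

Here the history passed to the player holds only scalar losses; a full-information variant would append the whole vector `f` instead.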

In the full-information version, the Player may observe the entire function $\mathbf{f}_t$ as his feedback $\Im$ and can exploit this in making his decisions. In the bandit version, by comparison, the Player observes only the scalar feedback $\mathbf{f}_t^\top\mathbf{x}_t$ after he has made the decision $\mathbf{x}_t$ in that round.

Though the algorithm proposed in this paper can deal with the bandit version of the problem, it is still reasonable to use a reduction to the full-information setting, as any algorithm that aims for low regret in the bandit setting would necessarily have to achieve low regret given full information. For example, the well-known Follow The Leader (FTL) strategy uses the rule “select the best choice so far”:
$$\mathbf{x}_{t+1}:=\arg \min _{\mathbf{x} \in \mathcal{K}} \sum_{s=1}^{t} \mathbf{f}_{s}^{\top} \mathbf{x}. \tag{1}$$
And the Follow The Regularized Leader (FTRL):
$$\mathbf{x}_{t+1}:=\arg \min _{\mathbf{x} \in \mathcal{K}}\left[\sum_{s=1}^{t} \mathbf{f}_{s}^{\top} \mathbf{x}+\lambda \mathcal{R}(\mathbf{x})\right]. \tag{2}$$
Given that $\mathcal{R}$ is convex and differentiable, the general form of the FTRL update is as follows:
$$\overline{\mathbf{x}}_{t+1}=\nabla \mathcal{R}^{*}\left(\nabla \mathcal{R}\left(\overline{\mathbf{x}}_{t}\right)-\eta \mathbf{f}_{t}\right), \tag{3}$$
followed by a projection onto $\mathcal{K}$ with respect to the divergence $D_\mathcal{R}$:
$$\mathbf{x}_{t+1}=\arg \min _{\mathbf{u} \in \mathcal{K}} D_{\mathcal{R}}\left(\mathbf{u}, \overline{\mathbf{x}}_{t+1}\right).$$
Here $\mathcal{R}^*$ is the Fenchel dual function and $\eta$ is a parameter. This procedure is known as mirror descent.
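To make update (3) concrete, here is a minimal sketch for one particular regularizer that is not discussed in the text: the entropy $\mathcal{R}(\mathbf{x})=\sum_i (x_i\log x_i - x_i)$ on the probability simplex. For this choice $\nabla\mathcal{R}(\mathbf{x})=\log\mathbf{x}$ and $\nabla\mathcal{R}^*(\mathbf{z})=\exp(\mathbf{z})$ coordinate-wise, and the divergence projection onto the simplex is a normalization. The function names are hypothetical. The second function evaluates (2) in closed form with $\lambda = 1/\eta$, so the two can be compared.

```python
import math


def mirror_descent_entropy(losses, eta):
    """Iterate update (3) for R(x) = sum_i (x_i log x_i - x_i)."""
    n = len(losses[0])
    x_bar = [1.0] * n                  # unprojected iterate
    iterates = []
    for f in losses:
        total = sum(x_bar)
        iterates.append([v / total for v in x_bar])   # projection onto the simplex
        # x_bar <- grad R*(grad R(x_bar) - eta * f)
        x_bar = [math.exp(math.log(v) - eta * fi) for v, fi in zip(x_bar, f)]
    total = sum(x_bar)
    iterates.append([v / total for v in x_bar])
    return iterates


def ftrl_entropy(losses, eta):
    """Closed form of (2) with lambda = 1/eta: x proportional to exp(-eta * sum_s f_s)."""
    n = len(losses[0])
    cumulative = [sum(f[i] for f in losses) for i in range(n)]
    weights = [math.exp(-eta * c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


losses = [[1.0, 0.0, 0.5], [0.2, 0.9, 0.5], [1.0, 0.1, 0.5]]
last = mirror_descent_entropy(losses, eta=0.3)[-1]
print(last)
print(ftrl_entropy(losses, eta=0.3))
```

For this regularizer the iterated update and the regularized-leader formula give the same point, which is the exponential-weights rule.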

For an online learning algorithm, “explore or exploit” is a serious problem. The player first chooses some full-information online learning algorithm $\mathcal{A}$. $\mathcal{A}$ receives input vectors $\mathbf{f}_1,\cdots,\mathbf{f}_t$ corresponding to previously observed functions and returns some point $\mathbf{x}_{t+1}\in\mathcal{K}$ to predict. In the bandit setting the true vectors are not observed, so the inputs given to $\mathcal{A}$ are realizations of random estimates $\tilde{\mathbf{f}}_{t}$, and the predictions become more accurate as more informative estimates are supplied. Here comes the dilemma of “explore or exploit”: whether to follow the advice of $\mathcal{A}$ and predict $\mathbf{x}_t$, or to try to estimate $\mathbf{f}_t$ by sampling in a wide region around $\mathcal{K}$, possibly hurting performance on the given round. This exploration-exploitation trade-off is the primary source of difficulty in obtaining $O(T^{\frac{1}{2}})$ guarantees on the regret.

Roughly two categories of approaches, namely Alternating Explore/Exploit and Simultaneous Explore/Exploit, perform both exploration and exploitation. The first category fails to obtain the desired $O(\mathrm{poly}(n)T^{\frac{1}{2}})$ regret, so the second is the focus here. Two Simultaneous Explore/Exploit algorithms, proposed by Auer et al. [1] and Flaxman et al. [2] respectively, are reviewed. Both follow the same schedule: query $\mathcal{A}$ for $\mathbf{x}_t$ and construct a random vector $\bm{X}_t$ such that $\mathbb{E}(\bm{X}_t) = \mathbf{x}_t$; then construct $\tilde{\mathbf{f}}_t$ randomly based on the outcome of $\bm{X}_t$ and the learned value $\mathbf{f}_t^\top\bm{X}_t$.

It is pointed out in the paper that the estimates $\tilde{\mathbf{f}}_t$ in both methods are inversely proportional to the distance of $\mathbf{x}_t$ to the boundary, which implies high variance of the estimated functions. This matters because the regret of most full-information algorithms scales linearly with the magnitude of the functions played by the environment. Fortunately, if we restrict our search to a regularization algorithm of type (2), the expected regret can be proved to be equal to an expression involving $\mathbb{E} D_{\mathcal{R}}\left(\mathbf{x}_{t}, \mathbf{x}_{t+1}\right)$ terms. For $\mathcal{R}(\mathbf{x}) \propto\|\mathbf{x}\|^{2}$, the paper recovers the method of Flaxman et al. with its insurmountable hurdle of $\mathbb{E}\left\|\tilde{\mathbf{f}}_{t}\right\|^{2}$.
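The variance problem can be seen with a small computation. The sketch below assumes, for illustration only, that $\mathcal{K}$ is the probability simplex and uses an importance-weighted one-point estimate in the style of [1]: arm $i$ is drawn with probability $x_i$ (so $\bm{X}_t=\mathbf{e}_i$ and $\mathbb{E}(\bm{X}_t)=\mathbf{x}_t$), and $\tilde{\mathbf{f}}_t=(f_i/x_i)\,\mathbf{e}_i$. The helper name `estimator_moments` is hypothetical. The mean and the second moment $\mathbb{E}\|\tilde{\mathbf{f}}_t\|^2$ are computed exactly by enumerating the arms.

```python
def estimator_moments(x, f):
    """Exact mean and second moment of the one-point estimate.

    Arm i is drawn with probability x[i]; the player observes only f[i]
    and forms f_tilde = (f[i] / x[i]) * e_i.
    """
    n = len(x)
    mean = [0.0] * n
    second_moment = 0.0
    for i in range(n):
        estimate_i = f[i] / x[i]              # the only non-zero coordinate
        mean[i] += x[i] * estimate_i          # probability times value
        second_moment += x[i] * estimate_i ** 2
    return mean, second_moment


f = [1.0, 1.0]
for p in (0.5, 0.1, 0.01):
    mean, m2 = estimator_moments([p, 1.0 - p], f)
    print(p, mean, m2)
```

The estimate stays unbiased for every interior $\mathbf{x}_t$, but its second moment equals $\sum_i f_i^2/x_i$, which grows without bound as a coordinate of $\mathbf{x}_t$ approaches zero, that is, as $\mathbf{x}_t$ approaches the boundary of the simplex.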

The main result of this paper is an algorithm for online linear optimization in the bandit setting for an arbitrary compact convex set $\mathcal{K}$, which is as follows:
[Figure: Algorithm 1]
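As a rough illustration of how such a procedure can be organized, the following sketch specializes to the hypercube $\mathcal{K}=[-1,1]^n$ with the logarithmic barrier $\mathcal{R}(\mathbf{x})=-\sum_i[\log(1-x_i)+\log(1+x_i)]$. Both choices are assumptions made here so that every step has a closed form. The sampling rule (step along a random eigen-direction of $\nabla^2\mathcal{R}(\mathbf{x}_t)$, scaled by the inverse square root of the eigenvalue) and the one-point estimate follow one reading of the published algorithm and should be checked against [3]; the function names are hypothetical.

```python
import math
import random


def barrier_hessian_diag(x):
    """Diagonal of the Hessian of R(x) = -sum_i [log(1 - x_i) + log(1 + x_i)]."""
    return [1.0 / (1.0 - xi) ** 2 + 1.0 / (1.0 + xi) ** 2 for xi in x]


def sample_point(x, i, eps):
    """Move from x along coordinate i by eps / sqrt(lambda_i); stays inside the box."""
    lam = barrier_hessian_diag(x)[i]
    y = list(x)
    y[i] += eps / math.sqrt(lam)
    return y, lam


def estimate_loss(n, observed, i, eps, lam):
    """One-point estimate f_tilde = n * (f^T y) * eps * sqrt(lambda_i) * e_i."""
    f_tilde = [0.0] * n
    f_tilde[i] = n * observed * eps * math.sqrt(lam)
    return f_tilde


def ftrl_step(cum_estimate, eta):
    """argmin_x [eta * L^T x + R(x)]; separable, closed form per coordinate."""
    out = []
    for total in cum_estimate:
        c = eta * total
        out.append(-c / (1.0 + math.sqrt(1.0 + c * c)))
    return out


def run(losses, eta, seed=0):
    """Play the bandit game; only the scalar f^T y is used for learning."""
    rng = random.Random(seed)
    n = len(losses[0])
    x = [0.0] * n                  # minimizer of the barrier
    cum = [0.0] * n
    total_loss = 0.0
    for f in losses:
        i = rng.randrange(n)
        eps = rng.choice((-1.0, 1.0))
        y, lam = sample_point(x, i, eps)
        observed = sum(fj * yj for fj, yj in zip(f, y))   # bandit feedback
        total_loss += observed
        f_tilde = estimate_loss(n, observed, i, eps, lam)
        cum = [c + g for c, g in zip(cum, f_tilde)]
        x = ftrl_step(cum, eta)
    return total_loss, x


n, T = 2, 1000
theta = 2 * n                      # assumed barrier parameter: 2n log terms
eta = math.sqrt(theta * math.log(T)) / (4 * n * math.sqrt(T))
losses = [[0.6, -0.4]] * T         # |f^T x| <= 1 on the box
total_loss, x_final = run(losses, eta)
print(total_loss, x_final)
```

Because the Hessian of this barrier is diagonal, its eigenvectors are the coordinate axes, so no eigendecomposition is needed; for a general $\mathcal{K}$ and barrier, both the eigendecomposition and the minimization step would have to be computed numerically.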
The rest of the paper is organized as follows. In Section 4 the regularization framework is discussed in detail, and it is shown how the regret can be computed in terms of Bregman divergences. The theory and main properties of self-concordant functions are presented in Section 5. In Section 6, several key elements of the proof of the regret bound of the proposed algorithm are given. In Section 7 the paper shows how this algorithm can be used for one interesting case, namely the bandit version of the Online Shortest Path problem. The precise analysis of the algorithm is given in Section 8. Finally, Section 9 covers the implementation of the algorithm.

The main result of the paper is as follows:

Theorem 1 Let $\mathcal{K}$ be a convex set and $\mathcal{R}$ be a $\vartheta$-self-concordant barrier on $\mathcal{K}$. Let $\mathbf{u}$ be any vector in $\mathcal{K}' = \mathcal{K}_{T^{-1/2}}$. Suppose we have the property that $\left|\mathbf{f}_{t}^{\top} \mathbf{x}\right| \leq 1$ for any $\mathbf{x}\in\mathcal{K}$. Setting $\eta=\frac{\sqrt{\vartheta \log T}}{4 n \sqrt{T}}$, the regret of Algorithm 1 is bounded as
$$\mathbb{E} \sum_{t=1}^{T} \mathbf{f}_{t}^{\top} \mathbf{y}_{t} \leq \min _{\mathbf{u} \in \mathcal{K}^{\prime}} \mathbb{E}\left(\sum_{t=1}^{T} \mathbf{f}_{t}^{\top} \mathbf{u}\right)+16 n \sqrt{\vartheta T \log T}$$
whenever $T>8 \vartheta \log T$.

Here the definitions of the scaled version of $\mathcal{K}$ and of a $\vartheta$-self-concordant barrier are used. With $\pi_{\mathbf{x}_1}$ denoting the Minkowski function of $\mathcal{K}$ with pole at $\mathbf{x}_1$, the scaled version of $\mathcal{K}$ is defined as:
$$\mathcal{K}_{\delta}=\left\{\mathbf{u}: \pi_{\mathbf{x}_{1}}(\mathbf{u}) \leq(1+\delta)^{-1}\right\}$$
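For intuition about the scaled set, recall the standard definition of the Minkowski function with pole $\mathbf{x}$: $\pi_{\mathbf{x}}(\mathbf{u})=\inf\{t\geq 0: \mathbf{x}+t^{-1}(\mathbf{u}-\mathbf{x})\in\mathcal{K}\}$. It is $0$ at the pole and $1$ on the boundary, so $\mathcal{K}_\delta$ is $\mathcal{K}$ shrunk toward $\mathbf{x}_1$ by the factor $(1+\delta)^{-1}$. The sketch below evaluates it for an axis-aligned box, a set chosen here only because the infimum has a closed form; the function names are hypothetical.

```python
def minkowski(pole, u, lo, hi):
    """Minkowski function pi_pole(u) of the box prod_i [lo_i, hi_i].

    For a box the infimum is the largest ratio, over coordinates, of the
    displacement from the pole to the room available in that direction.
    """
    t = 0.0
    for p, v, l, h in zip(pole, u, lo, hi):
        if v > p:
            t = max(t, (v - p) / (h - p))
        elif v < p:
            t = max(t, (p - v) / (p - l))
    return t


def in_scaled_set(pole, u, lo, hi, delta):
    """Membership test for K_delta = {u : pi_pole(u) <= 1 / (1 + delta)}."""
    return minkowski(pole, u, lo, hi) <= 1.0 / (1.0 + delta)


pole, lo, hi = [0.0, 0.0], [-1.0, -1.0], [1.0, 1.0]
print(minkowski(pole, [0.5, 0.0], lo, hi))
print(in_scaled_set(pole, [0.9, 0.0], lo, hi, delta=0.25))
```

With the pole at the center of $[-1,1]^2$ and $\delta=0.25$, the scaled set is the smaller box $[-0.8,0.8]^2$.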
To define a $\vartheta$-self-concordant barrier, we first give the definition of a self-concordant function:

Definition (self-concordant function) A self-concordant function $\mathcal{R}: \operatorname{int} \mathcal{K} \rightarrow\mathbb{R}$ is a $C^3$ convex function such that
$$\left|D^{3} \mathcal{R}(\mathbf{x})[\mathbf{h}, \mathbf{h}, \mathbf{h}]\right| \leq 2\left(D^{2} \mathcal{R}(\mathbf{x})[\mathbf{h}, \mathbf{h}]\right)^{3 / 2}$$
Here, the third-order differential is defined as

$$D^{3} \mathcal{R}(\mathbf{x})\left[\mathbf{h}_{1}, \mathbf{h}_{2}, \mathbf{h}_{3}\right] := \left.\frac{\partial^{3}}{\partial t_{1} \partial t_{2} \partial t_{3}}\right|_{t_{1}=t_{2}=t_{3}=0} \mathcal{R}\left(\mathbf{x}+t_{1} \mathbf{h}_{1}+t_{2} \mathbf{h}_{2}+t_{3} \mathbf{h}_{3}\right)$$

Now we can define the $\vartheta$-self-concordant barrier as follows:

Definition ($\vartheta$-self-concordant barrier) A $\vartheta$-self-concordant barrier $\mathcal{R}$ is a self-concordant function with
$$|D \mathcal{R}(\mathbf{x})[\mathbf{h}]| \leq \vartheta^{1 / 2}\left[D^{2} \mathcal{R}(\mathbf{x})[\mathbf{h}, \mathbf{h}]\right]^{1 / 2}.$$
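As a quick check of both definitions, consider a standard one-dimensional example that is not taken from the text above: $\mathcal{K}=[0,\infty)$ with $\mathcal{R}(x)=-\log x$ on $\operatorname{int}\mathcal{K}$. Its differentials in a direction $h$ are

$$D \mathcal{R}(x)[h]=-\frac{h}{x}, \qquad D^{2} \mathcal{R}(x)[h, h]=\frac{h^{2}}{x^{2}}, \qquad D^{3} \mathcal{R}(x)[h, h, h]=-\frac{2 h^{3}}{x^{3}}.$$

The self-concordance inequality then holds with equality,

$$\left|D^{3} \mathcal{R}(x)[h, h, h]\right|=\frac{2|h|^{3}}{x^{3}}=2\left(\frac{h^{2}}{x^{2}}\right)^{3 / 2},$$

and the barrier inequality holds with parameter $1$,

$$|D \mathcal{R}(x)[h]|=\frac{|h|}{x}=1^{1 / 2}\left(\frac{h^{2}}{x^{2}}\right)^{1 / 2},$$

so $-\log x$ satisfies both definitions with parameter $1$.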

$\underline{\text{Significance of Paper}}$

This is the first paper to achieve both efficiency and an $O(\mathrm{poly}(n) \sqrt{T})$ regret bound. The $O(\sqrt{T})$ rate was already known as the regret bound for the full-information model, and it now holds for the bandit setting as well. This is surely a breakthrough, since what a player can observe at the end of each round in the bandit setting is far less than in the full-information setting. Also, as the paper reviews, quite a few previous papers obtained only bounds such as $O(T^{3/4})$ and $O(T^{2/3})$. Now this “goal” bound is finally achieved efficiently.

$\text{\Large References}$

[1] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2003.

[2] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA ’05: Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 385–394, Philadelphia, PA, USA, 2005. Society for Industrial and Applied Mathematics.

[3] Jacob D. Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. 2009.
