What Doubling Tricks Can and Can't Do for Multi-Armed Bandits

(1)

HAL Id: hal-01736357

https://hal.inria.fr/hal-01736357

Preprint submitted on 19 Mar 2018

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Distributed under a Creative Commons Attribution - NonCommercial - ShareAlike| 4.0

Lilian Besson, Emilie Kaufmann

To cite this version:

Lilian Besson, Emilie Kaufmann. What Doubling Tricks Can and Can’t Do for Multi-Armed Bandits. 2018. �hal-01736357�

(2)

Lilian Besson† [email protected] CentraleSup´elec (campus of Rennes), IETR, SCEE Team,

Avenue de la Boulaie – CS47601, F-35576 Cesson-S´evign´e, France

Emilie Kaufmann [email protected]

CNRS & Universit´e de Lille, Inria SequeL team

UMR9189 – CRIStAL, F-59000 Lille, France

Abstract

An online reinforcement learning algorithm is anytime if it does not need to know in advance the horizon T of the experiment. A well-known technique to obtain an anytime algorithm from any non-anytime algorithm is the “Doubling Trick”. In the context of adversarial or stochastic multi-armed bandits, the performance of an algorithm is measured by its regret, and we study two families of sequences of growing horizons (geometric and exponential) to generalize previously known results that certain doubling tricks can be used to conserve certain regret bounds. In a broad setting, we

prove that a geometric doubling trick can be used to conserve (minimax) bounds in RT = O(

√ T )

but cannot conserve (distribution-dependent) bounds in RT = O(log T ). We give insights as to

why exponential doubling tricks may be better, as they conserve bounds in RT = O(log T ), and are

close to conserving bounds in RT = O(

√ T ).

Keywords: Multi-Armed Bandits; Anytime Algorithms; Sequential Learning; Doubling Trick.

1. Introduction

Multi-Armed Bandit (MAB) problems are well-studied sequential decision making problems in which an agent repeatedly chooses an action (the “arm” of a one-armed bandit) in order to maximize some total reward (Robbins,1952;Lai and Robbins,1985). Initial motivation for their study came from the modeling of clinical trials, as early as 1933 with the seminal work ofThompson(1933). In this example, arms correspond to different treatments with unknown, random effect. Since then, MAB models have been proved useful for many more applications, that range from cognitive radio (Jouini et al.,2009) to online content optimization (e.g., news article recommendation (Li et al., 2010), online advertising (Chapelle and Li,2011), A/B Testing (Kaufmann et al.,2014;Yang et al., 2017)), or portfolio optimization (Sani et al.,2012).

While the number of patients involved in a clinical study (and thus the number of treatments to select) is often decided in advance, in other contexts the total number of decisions to make (the horizon T ) is unknown. It may correspond to the total number of visitors of a website optimizing its displays for a certain period of time, or to the number of attempted communications in a smart radio device. In such cases, it is thus crucial to devise anytime algorithms, that is algorithms that do no rely on the knowledge of this horizon T to sequentially select arms. A general way to turn any base algorithm into an anytime algorithm is the use of the so-called Doubling Trick, first proposed byAuer et al.(1995), that consists in repeatedly running the base algorithm with increasing horizons. Motivated by the frequent use of this technique and the absence of a generic study of its effect

(3)

on the algorithm’s efficiency, this paper investigates in details two families of doubling sequences (geometric and exponential), and shows that the former should be avoided for stochastic problems.

More formally, a MAB model is a set of K arms, each arm k being associated to a (unknown) reward stream(Yk,t)t∈N. Fix T a finite (possibly unknown) horizon. At each time step t ∈ {1, . . . , T } an agent selects an arm A(t) ∈ {1, . . . , K} and receives as a reward the current value of the associated reward stream, r(t) := YA(t),t. The agent’s decision strategy (or bandit algorithm) AT := (A(t), t ∈ {1, . . . , T }) is such that A(t) can only rely on the past observations A(1), r(1), . . . , A(t−1), r(t−1), on external randomness and (possibly) on the knowledge of the horizon T . The objective of the agent is to find an algorithm A that maximizes the expected cumulated rewards, where the expectation is taken over the possible randomness used by the algorithm and the possible randomness in the generation of the rewards stream. In the oblivious case, in which the reward streams are independent of the algorithm’s choice, this is equivalent to minimizing the regret, defined as

RT(AT) := max k∈{1,...,K}E " _T X t=1 Yk,t− YA(t),t # . (1)

This quantity, referred to as pseudo-regret inBubeck et al. (2012), quantifies the difference between the expected cumulated reward of the best fixed action, and that of the strategy AT. For the general adversarial bandit problem (Auer et al.,2002b), in which the rewards streams are arbitrary (picked by an adversary), a worst-case lower bound has been given. It says that for every algorithm, there exists (stochastic) reward streams such that the regret is larger than (1/20)√KT (Auer et al., 2002b). Besides, the EXP3 algorithm has been shown to have a regret of orderpKT log(K).

Much smaller regret may be obtained in stochastic MAB models, in which the reward stream from each arm k is assumed to be i.i.d., from some (unknown) distribution νk, with mean µk. In that case, various algorithms have been proposed with problem-dependent regret upper bounds of the form C(ν) log(T ), where C(ν) is a constant that only depend on the arms distributions. Different assumptions on the arms distributions lead to different problem-dependent constants. In particular, under some parametric assumptions (e.g., Gaussian distributions, exponential families), asymptotically optimalalgorithms have been proposed and analyzed (e.g., kl-UCB (Capp´e et al., 2013) or Thompson sampling (Agrawal and Goyal,2012;Kaufmann et al.,2012)), for which the constant C(ν) obtained in the regret upper bound matches exactly that of the lower bound given by Lai and Robbins (1985). Under the non-parametric assumption that the νk are bounded in [0, 1], the regret of the UCB1 algorithm (Auer et al., 2002a) is of the above form with C(ν) = 8 ×P

k:µk>µ∗(µ

∗_{− µ}

k)−1, where µ∗ = maxkµk is the mean of the best arm. Like in this last example, all the available constants C(ν) become very large on “hard” instances, in which some arms are very close to the best arm. On such instances, C(ν) log(T ) may be much larger than the worst-case (1/20)√KT , and distribution-independent guarantees may actually be preferred.

The MOSS algorithm, proposed byAudibert and Bubeck(2009), is the first stochastic bandit algorithm to enjoy a problem-dependent logarithmic regret and to be optimal in a minimax sense, as its regret is proved to be upper bounded by√KT , for bandit models with rewards in [0, 1]. However the corresponding constant C(ν) is proportional to K/∆min, where ∆min= mink(µ∗− µk) is the minimal gap, which worsen the constant of UCB1. Another drawback of MOSS is that it is not anytime. These two shortcoming have been overcame recently in two different works. On the one hand, the MOSS-anytime algorithm (Degenne and Perchet,2016) is minimax optimal and anytime, but its problem-dependent regret does not improve that of MOSS. On the other hand, the kl-UCB++

(4)

algorithm (M´enard and Garivier, 2017) is simultaneously minimax optimal and asymptotically optimal (i.e., it has the best problem-dependent constant C(ν)), but it is not anytime. A natural question is thus to know whether a Doubling Trick could overcome this limitation.

This question is the starting point of our comprehensive study of the Doubling Trick: can a single Doubling Trick be used to preserve both problem-dependent (logarithmic) regret and minimax (square-root) regret? We answer this question in the negative, by showing that two different types of Doubling Trick may actually be needed. In this paper, we investigate how algorithms enjoying regret guarantees of the generic form

∀T ≥ 1, RT(AT) ≤ c Tγ(log(T ))δ+ o(Tγ

log(T ))δ

(2) may be turned into an anytime algorithm enjoying similar regret guarantees with an appropriate Doubling Trick. This does not come for free, and we exhibit a “price of Doubling Trick”, that is a constant factor larger than 1, referred to as a constant manipulative loss.

The rest of the paper is organized as follows. The Doubling Trick is formally defined in Section2, along with a generic tool for its analysis. In Section3, we present upper and lower bounds on the regret of algorithms to which a geometric Doubling Trick is applied. Section4investigates regret guarantees that can be obtained for a “faster” exponential Doubling Trick. Experimental results are then reported in Section5. Complementary elements of proofs are deferred to the appendix.

2. Doubling Tricks

The Doubling Trick, denoted by DT , is a general procedure to convert a (possibly non-anytime) algorithm into an anytime algorithm. It is formally stated below as Algorithm1and depends on a non-decreasing diverging doubling sequence (Ti)i∈N(i.e., Ti→ ∞ for i → ∞). DT fully restarts the underlying algorithm A at the beginning of each new sequence (at t = Ti+ 1), and run this algorithm on a sequence of length (Ti− Ti−1).

Input: Bandit algorithm A, and doubling sequence (Ti)i∈N. 1 Let i = 0, and initialize algorithm A(0) = AT0.

2 for t = 1, . . . , T − 1 do

3 if t > T_ithen // Next horizon Ti+1 from the sequence 4 Let i = i + 1.

5 Initialize algorithm A(i) = ATi−Ti−1. // Full restart

6 end

7 Play with A(i): play arm A0(t) := A(i)(t − T_i), observe reward r(t) = Y_A0_(t),t.

8 end

Algorithm 1: The Generic Doubling Trick Algorithm, A0 = DT (A, (Ti)i∈N).

Related work. The Doubling Trick is a well known idea in online learning, that can be traced back toAuer et al.(1995). In the literature, the term Doubling Trick usually refers to the geometric sequence Ti= 2i, in which the horizon is actually doubling, that was popularized byCesa-Bianchi

and Lugosi(2006) in the context of adversarial bandits. Specific doubling tricks have also been used for stochastic bandits, for example in the work ofAuer and Ortner(2010), which uses the doubling sequence Ti= 22

i

(5)

Elements of regret analysis. For a sequence (Ti)i∈N, with Ti ∈ N for all i, we denote T−1 = 0, and T0 is always taken non-zero, T0 > 0 (i.e., T0 ∈ N∗). We only consider non-decreasing and divergingsequences (that is, ∀i, Ti+1≥ Ti, and Ti → ∞ for i → ∞).

Definition 1 (Last Term LT)

For a non-decreasing diverging sequence(Ti)i∈NandT ∈ N, we can define LT (Ti)i∈N by ∀T ≥ 1, LT (Ti)i∈N := min {i ∈ N : Ti > T } . (3) It is simply denotedLT when there is no ambiguity (e.g., if the doubling sequence is chosen). DT (A) reinitializes its underlying algorithm A at each time Ti, and in generality the total regret is upper bounded by the regret on each sequence {Ti, . . . , Ti+1− 1}. By considering the last partial sequence {TLT−1, . . . , T − 1}, this splitting can be used to get a generic upper bound (UB) by

taking into account a larger last sequence (up to TLT − 1). And for stochastic bandit models, the

i.i.d. hypothesis on the rewards streams makes the splitting on each sequence an equality, so we can also get the lower bound (LB) by excluding the last partial sequence. Lemma2 is proved in AppendixA.1.

Lemma 2 (Regret Lower and Upper Bounds for DT )

Forany bandit model and algorithm A and horizon T , one has the generic upper bound RT(DT (A, (Ti)i∈N)) ≤

LT

X

i=0

RTi−Ti−1(ATi−Ti−1). (LB)

Under astochastic bandit model, one has furthermore the lower bound RT(DT (A, (Ti)i∈N)) ≥

LT −1

X

i=0

RTi−Ti−1(ATi−Ti−1). (UB)

As expected, the key to obtain regret guarantees for a Doubling Trick algorithm is to carefully choose the doubling sequence (Ti)i∈N. Empirically, one can verify that sequences with slow growth gives terrible results, and for example using an arithmetic progression typically gives a linear regret. Building on this result, we will prove that if A satisfies a certain regret bound (RT = O(Tγ), O((log T )δ_{), or O(T}γ_{(log T )}δ_{)) then an appropriate anytime version of A with a certain doubling} trick can conserve the regret bound, with an explicit constant multiplicative loss ` > 1. In this paper, we study in depth two families of sequences: first geometric and then exponential growths.

3. What the Geometric Doubling Trick Can and Can’t Do

We define geometric doubling sequences, and prove that they can be used to conserve bounds in O Tγ_{(log T )}δ_{with γ > 0 but cannot be used to conserve bounds in O (log T )}δ_.

Definition 3 (Geometric Growth) For b ∈ R, b > 1 and T0 ∈ N∗, the sequence defined by Ti= bT0bic is non-decreasing and diverging, and satisfies

∀T < T₀, LT = 0, and ∀T ≥ T0, LT = log_b T T0 , (4)

∀i > 0, T₀(b − 1)bi−1− 1 ≤ T_i− T_i−1≤ 1 + T₀(b − 1)bi−1. (5) Asymptotically fori and T → ∞, Ti = O bi and LT ∼ logb(T ) = O(log T ).

(6)

3.1. Conserving a Regret Upper Bound with Geometric Horizons

A geometric doubling sequence allows to conserve a minimax bound (i.e., RT = O( √

T )). It was suggested, for instance, in (Cesa-Bianchi and Lugosi,2006, Ex.2.9). We generalize this result in the following theorem, proved in AppendixA.2, by extending it from√T bounds to bounds of the form Tγ(log T )δfor any 0 < γ < 1 and δ ≥ 0. Note that no distinction is done on the case δ = 0 neither in the expression of the constant loss, nor in the proof.

Theorem 4 If an algorithm A satisfies RT(AT) ≤ c Tγ(log T )δ+ f (T ), for 0 < γ < 1, δ ≥ 0 and forc > 0, and an increasing function f (t) = o tγ(log t)δ (at t → ∞), then the anytime version A0 := DT (A, (Ti)i∈N) with the geometric sequence (Ti)i∈N of parametersT0 ∈ N∗, b > 1 (i.e., Ti= bT0bic) with the condition T0(b − 1) > 1 if δ > 0, satisfies,

RT(A0) ≤ `(γ, δ, T0, b) c Tγ(log T )δ+ g(T ), (6) with a increasing functiong(t) = o tγ_{(log t)}δ_{, and a constant loss `(γ, δ, T}

0, b) > 1, `(γ, δ, T0, b) := log(T0(b − 1) + 1) log(T0(b − 1)) δ × b γ_{(b − 1)}γ bγ_{− 1} . (7)

For a fixed γ and δ, minimizing `(γ, δ, T0, b) does not always give a unique solution. On the one hand, if γ & 0.384, there is a unique solution b∗(γ) > 1 minimizing the bγ_b(b−1)γ₋₁γ term, solution of

bγ+1− 2b + 1 = 0, but without a closed form if γ is unknown. On the other hand, for any γ, the term depending on δ tends quickly to 1 when T0increases.

Practical considerations. Empirically, when γ and δ are fixed and known, there is no need to minimize ` jointly. It can be minimized separately by first minimizing bγ_b(b−1)γ₋₁γ, that is by solving

bγ+1− 2b + 1 = 0 numerically (e.g., with Newton’s method), and then by taking T0large enough so that the other term is close enough to 1.

For the usual case of γ = 1/2 and δ = 0 (i.e., bounds in√T ), the optimal choice of b is 3+ √

5 2 leading to ` ' 3.33, and the usual choice of b = 2 gives ` ' 3.41 (see Corollary10in appendix). Any large enough T0 gives similar performance, and empirically T0 K is preferred, as most algorithms explore each arm once in their first steps (e.g., T0 = 200 for K = 9 for our experiments). 3.2. A Regret Lower Bound with Geometric Horizons

We observe that the constant loss in Eq. (7) from the previous Theorem4blows up when γ goes to zero, giving the intuition that no geometric doubling trick could be used to preserve a logarithmic bound (i.e., with γ = 0, δ > 0). This is confirmed by the lower bound given below.

Theorem 5 For stochastic models, if A satisfies RT(AT) ≥ c (log T )δ, for c > 0 and δ > 0, then the anytime versionA0 := DT (A, (Ti)i∈N) with the geometric sequence (Ti)i∈N of parameters T0∈ N∗, b > 1 (i.e., Ti = bT0bic) satisfies this lower bound for a certain constant c0 > 0,

∀T ≥ 1, LT ≥ 2 =⇒ RT(A0) ≥ c0 (log T )δ+1. (8) This implies thatRT(A0) = Ω((log T )δ+1), which proves that a geometric sequence cannot be used to conserve a logarithmic regret bound.

(7)

Theorem 5 implies that a geometric sequence cannot be used to conserve a finite-horizon lower bound like RT(AT) ≥ c log(T ). A complementary lower bound, stated as Theorem11in AppendixB, shows that if the regret is lower bounded at finite horizon by RT(AT) ≥ c

√

T , then a comparable lower bound for the Doubling Trick algorithm DT (A), possibly with a larger constant. This special case (δ = 1) is indeed the most interesting, as in the stochastic case the regret of any uniformly efficient algorithm is at least logarithmic (Lai and Robbins(1985)), and efficient algorithm with logarithmic regret have been exhibited. If RT(AT)/log T is bounded, then using a geometric sequence in the doubling trick algorithm is a bad idea as it guarantees a blow up in the regret, that is RT(DT (A, (Ti)i∈N)) = Ω((log T )2). This result is the reason we need to consider successive horizons growing faster than a geometric sequence (i.e., such that log(Ti) i), like the exponential sequence, which is studied in Section4.

3.3. Proof of Theorem5

Let A0 := DT (A, (Ti)i∈N) and consider a fixed stochastic bandit problem. The lower bound (LB) from Lemma2gives

RT(A0) ≥

LT−1

X

i=0

RTi−Ti−1(AT =Ti−Ti−1)

We bound Ti− Ti−1≥ T0(b − 1)bi−1− 1 for any i > 0, thanks to Definition3, and we can use the hypothesis on A for each regret term.

≥ LT−1 X i=0 c(log(Ti− Ti−1))δ≥ c LT−1 X i=1 log T0(b − 1)bi−1− 1 δ = c LT−2 X i=0 log T0(b − 1)bi− 1 δ (with i := i − 1)

Let xi := T0(b − 1)bi > 0. If we have T0(b − 1) > 1 (see below (♣) in Page7for a discussion on the other case), then Lemma15(Eq. (26)) gives log(xi− 1) ≥ log(T_log(T0(b−1)−1)₀_(b−1)) log(xi) as xi > 1. For lower bounds, there is no need to handle the constants tightly, and we have xi≥ biby hypothesis, so let call this constant c0= clog(T0(b−1)−1)

log(T0(b−1))

δ

> 0, and thus it simplifies to

≥ c0

LT−2

X

i=0

log(bi)δ

A sum-integral minoration for the increasing function t 7→ tδ(as δ > 0) givesPLT−2

i=0 log(bi) δ = (log b)δPLT−2 i=1 iδ ≥ (log b)δ RL_T−2 0 t δ_{dt =} (log b)δ δ+1 (LT − 2) δ+1 (if LT ≥ 2), and so RT(A0) ≥ c0 (log b)δ δ + 1 (LT − 2) δ+1

(8)

For the geometric sequence, we know that LT ≥ logb

T T0

≥ log_b(T ), and log_b(T ) − 2 ∼ log_b(T ) at T → ∞ so there exists a constant 0 < c00< 1 such that LT − 2 ≥ c00logb(T ) for T large enough (≥ _b−12 ), and such that LT ≥ 2. And thus we just proved that there is a constant c000> 0 such that

RT(A0) ≥ c000(log T )δ+1 =: g(T ).

So this proves that for T large enough, RT(A0) ≥ g(T ) with g(T ) = O (log T )δ+1, and so RT(A0) = Ω((log T )δ+1), which also implies that RT(A0) cannot be a O (log T )δ.

(♣) If we do not have the hypothesis T0(b − 1) > 1, the same proof could be done, by observing that from i ≥ i0 large enough, we have xi ≥ bi−i0 (as soon as bi0 ≥ _T₀_(b−1)1 > 0, i.e., i0 ≥ d− log_b(T0(b − 1))e ≥ 1), and so the same arguments can be used, to obtain a sum from i = i0+ 1 instead of from i = 1. For a fixed i0, we also have LT − 2 − i0 ≥ c00log(T ) for a (small enough)

constant c00, and thus we obtain the same result.

4. What Can the Exponential Doubling Trick Do?

We define exponential doubling sequences, and prove that they can be used to conserve bounds in O (log T )δ_{, unlike the previously studied geometric sequences. Furthermore, we provide elements} showing that they may also conserve bounds in O(Tγ) or O Tγ(log T )δ.

Definition 6 (Exponential Growth) For a, b ∈ R, a, b > 1 and T0 ∈ N∗, ifτ := T_a0 ∈ R, ≥ 0, then the sequence defined byTi:= bτ ab

i

c is non-decreasing and diverging, and satisfies ∀T < T0, LT = 0, and ∀T ≥ T0, LT = log_b log_a T τ . (9)

Asymptotically fori and T → ∞, Ti = O

abi

andLT ∼ logb(loga(Tτ)) = O(log log T ). 4.1. Conserving a Regret Upper Bound with Exponential Horizons

An exponential doubling sequence allows to conserve a problem-dependent bound on regret (i.e., RT = O(log T )). This was already used in particular cases by Auer and Ortner(2010) and more recently byLiau et al.(2018). We generalize this result in the following theorem.

Theorem 7 If an algorithm A satisfies RT(AT) ≤ c Tγ(log T )δ+ f (T ), for 0 ≤ γ < 1, δ ≥ 0, and forc > 0, and an increasing function f (t) = o tγ_{(log t)}δ_{(at t → ∞), then the anytime version} A0 := DT (A, (Ti)i∈N) with the exponential sequence (Ti)i∈Nof parametersT0 ∈ N∗, a, b > 1 (i.e., Ti= bT_a0ab

i

c), satisfies the following inequality, RT(A0) ≤ `(γ, δ, T0, a, b) c

Tb

γ

(log T )δ+ g(T ). (10)

with an increasing functiong(t) = o (tb)γ(log t)δ, and a constant loss `(γ, δ, T0, a, b) > 0,

`(γ, δ, T0, a, b) :=    a T0 (b−1)γ b2δ bδ₋₁ > 0 ifδ > 0 1 +_{(log(a))(log(b}1 γ₎₎ > 1 ifδ = 0 (11)

(9)

This result first shows that an exponential doubling trick can preserve a logarithmic regret bound (O(log(T )), which corresponds to γ = 0 and δ = 1), with a multiplicative constant loss ` ≥ 4. It can further be applied to bounds of the generic form RT = O Tγ(log T )δ, but with a significant loss as Tγbecomes Tbγ, additionally to the constant multiplicative loss ` > 0. However, it is important to notice that for γ > 0, the constant ` can be made arbitrarily small (with a large enough first step T0). This observation is encouraging, and let the authors think that a tighter upper bound could be proved. Remark 8 An interesting particular case of Theorem7is the following (γ = 0, δ = 1 and f (t) = 0).

RT(AT) ≤ c log(T ) =⇒ RT(DT (A, (bTb i

0 c)i∈N)) ≤

b2

b − 1 c log(T ). (12) In this upper bound, the optimal choice ofb is b = 2, which yields a constant multiplicative loss of `(γ = 0, b) = 4. It can be observed that this loss is twice smaller as the loss of 8.0625 obtained by (Auer and Ortner,2010, Sec.4)1.

4.2. A Regret Lower Bound with Exponential Horizons

Assuming the upper bound of Theorem 7 obtained for γ > 0 are tight would lead to think that exponential doubling tricks cannot preserve minimax regret bounds of the form O(√T )). If true, such a conjecture would need to be supported by a lower bound (a counterpart of Theorem 5). Theorem9provides such a lower bound, but as discussed below, combined with Theorem7, this result rather advocates the use of exponential doubling tricks. Theorem9is proved in AppendixA.3. Theorem 9 For stochastic models, if A satisfies RT(AT) ≥ c Tγ, forc > 0 and 0 < γ ≤ 1, then the anytime versionA0 _{:= DT (A, (T}

i)i∈N) with the exponential sequence (Ti)i∈N of parameters T0∈ N∗,a > 1, b > 1 (i.e., Ti = bT_a0ab

i

c), satisfies this lower bound for a certain constant c0 > 0, ∀T ≥ 1, RT(A0) ≥ c0

T1b

γ

. (13)

We already saw that any exponential doubling trick can conserve logarithmic problem-dependent regret bounds. If we could take b → 1 in the two previous Theorems7and9, both results would match and prove that there exists an exponential doubling trick that can also be used to conserve minimax regret bounds. This argument is not so easy to formulate, as b cannot depend on T , but it supports our belief that exponential doubling tricks are good candidates for (asymptotically) preserving both problem-dependent and minimax regret bounds.

4.3. Proof of Theorem7

Let A0 := DT (A, (Ti)i∈N), and consider a fixed bandit problem. We first consider the harder case of δ > 0, see below in Page10in (♠) for the other case. The lower bound (LB) from Lemma2gives

RT(A0) ≤

LT

X

i=0

1. InAuer and Ortner(2010), the authors obtained a loss of 258/32 = 8.0625 ≥ 8, as the ratio between the constants for the log(T ) terms, respectively 258 in Th.4.1 and 32 in Th.3.1.

(10)

We bound naively2Ti− Ti−1≤ Ti≤ T_a0ab i

, and we can use the hypothesis on A for each regret term, as f and t 7→ ctγ(log t)δare non-decreasing for t ≥ 1 (by hypothesis for f and by Lemma16).

≤ LT X i=0 f (Ti) + c LT X i=0 (Ti)γ(log (Ti))δ≤ g1(T ) + c LT X i=0 T0 aa biγ log T0 aa biδ

The first part is denoted g1(T ) := PLi=0T f (Ti) and is dealt with Lemma17: the sum of f (Ti) is a o PLT i=0T γ i (log(Ti))δ

, as f (t) = o tγ(log t)δ by hypothesis, and this sum of T_iγ(log(Ti))δis proved below to be bounded by Tbγ(log(T ))δ. So g1(T ) = o Tbγ(log T )δ. The second part is c T0 a γPLT i=0 abiγ_logT0 aab iδ

. Define log+(x) := max(log(x), 0) ≥ 0, so whether T0

a ≤ 1

or > 1, we always have log T0 aab i ≤ log+ T0 a+log

abi_{. Then we can use Lemma}₁₄_{(Eq. (}₂₃₎₎ to distribute the power on δ (as it is < 1). SologT0

aa biδ ≤ log+ T0 a δ + (log(a))δ biδ with the convention that 0δ = 0 (even if δ = 0), and so this gives

≤ g₁(T ) + c T0 a γ" log+ T0 a δ LT X i=0 (abi)γ+ (log(a))δ LT X i=0 (abi)γ biδ #

If γ = 0 then the first sum is just LT + 1 = O(log(log(T ))) which can be included in g1(T ) = o (log T )δ (still increasing), and so only the second sum has to be bounded, and a geometric sum gives PLT

i=0 bi

δ

≤ bδ

bδ₋₁(bLT)δ. But if γ > 0, we can naively bound the first sum by

PLT

i=0(ab

i

)γ ≤ (LT + 1)(abLT)γ Observe that abLT = (abLT−1)b ≤ (a_TT₀)b. So abLT = O Tb

and LT + 1 = O(log(log(T ))), thus the first sum is a O Tbγlog(log(T )) = o Tbγ(log T )δ (as δ > 0). In both cases, the first sum can be included in g2(T ) which is still a o Tbγ(log T )δ Another geometric sum bounds the second sum byPLT

i=0(ab i )γ biδ ≤ (abLT )γPLT i=0 bi δ ≤ bδ bδ₋₁(bLT)δ. ≤ g₁(T ) + c1(ab LT )γ(bLT₎δ

We identify a constant multiplicative loss c1 := c T_a0 γ _bδ

bδ₋₁(log a) δ

> 0. The only term left which depends on LT is (abLT)γ(bLT)δ, and it can be bounded by using bLT = bbLT−1≤ b loga(aTT0) =

b + b log+_a(_TT

0) ≤ b + b loga(T ) (as T ≥ 1), and again with a

bLT _{≤ (a}T

T0)

b_{. The constant part of} bLT _{also gives a O T}bγ term, that can be included in g(T ) := g

2(T ) + (a_TT₀)bγ which is still a o Tbγ(log T )δ, and is still increasing as sum of increasing functions. So we can focus on the last term, and we obtain

≤ g(T ) + c₁ b log(a) δ" a T0 b Tb #γ (log T )δ

=⇒ RT(A0) ≤ g(T ) + `(γ, δ, T0, a, b) cTbγ(log T )δ with an increasing g(t) = o

tbγ(log t)δ

.

2. Here, using the more subtle bound Ti− Ti≤T_a0ab

i−1

(αbi−1) + 1, with α = ab−1, from Definition6, does not seem to help as it becomes too complicated to handle clearly in the log terms.

(11)

So the constant multiplicative loss ` depends on γ and δ as well as on T0, a and b and is `(γ, δ, T0, a, b) := a T0 (b−1)γ × b 2δ bδ_{− 1}> 0, if δ > 0. (14) If T0 = a, the loss `(γ, δ, T0, a, b) is minimal at b∗(δ) = 21/δ > 1 and for a minimal value of

minb>1`(γ, δ, T0, a, b) = 4 (for any δ and γ). Finally, the a/T0part tends to 0 if T0 → ∞ so the

loss can be made as small as we want, simply by choosing a T0large enough (but constant w.r.t. T ). (♠) Now for the other case of δ = 0, we can start similarly, but instead of bounding naively PLT

i=0(ab

i

)γ by (LT + 1)(abLT)γ we use Lemma13to get a more subtle bound: PLi=0T (ab i

)γ ≤ aγ+ (1 +_{(log(a))(log(b}1 γ₎₎)(ab

LT

)γ. The constant term gets included in g(T ), and for the non-constant part, (abLT)γis handled similarly. Finally we obtain the loss

`(γ, 0, T0, a, b) := 1 +

1

(log(a))(log(bγ₎₎ > 1. (15)

5. Numerical Experiments

We illustrate here the practical cost of using Doubling Trick, for two interesting non-anytime algorithms that have recently been proposed in the literature: Approximated Finite-Horizon Gittins indexes, that we refer to as AFHG, byLattimore(2016) (for Gaussian bandits with known variance) and kl-UCB++byM´enard and Garivier(2017) (for Bernoulli bandits).

We first provide some details on these two algorithms, and then illustrate the behavior of Doubling Tricks applied to these algorithms with different doubling sequences.

5.1. Two Index-Based Algorithms

We denote by Xk(t) := Ps<tYA(s),s1(A(s) = k) the accumulated rewards from arm k, and Nk(t) :=P_s<t1(A(s) = k) the number of times arm k was sampled. Both algorithms A assume to know the horizon T . They compute an index I_kA(t) ∈ R for each arm k ∈ {1, . . . , K} at each time step t ∈ {1, . . . , T }, and use the indexes to choose the arm with highest index, i.e., A(t) := arg maxk∈{1,...,K}Ik(t) (ties are broken uniformly at random).

• The algorithm AFHG can be applied for Gaussian bandits with variance V (= 1 for our experiments). Let m(T, t) = T − t + 1 ≥ 1, and let

I_kAFHG(t) := Xk(t) Nk(t) + v u u u t V Nk(t) log   m(T, t) Nk(t) log1/2 _m(T,t) Nk(t)  . (16)

• The algorithm kl-UCB++_{can be applied for bounded rewards in [0, 1], and in particular for} Bernoulli bandits. The binary Kullback-Leibler divergence is kl(x, y) := x log(x/y) + (1 − x) log((1 − x)/(1 − y)) (for 0 < x, y < 1), and let log+(x) := max(0, log(x)) ≥ 0. Let the function g(n, T ) := log+ T Kn(1 + log +₍ T Kn) 2

)for n ≤ T , and finally let

I_kkl-UCB++(t) := sup q∈[0,1] q : kl Xk(t) Nk(t) , q ≤ g(Nk(t), T ) Nk(t) . (17)

(12)

5.2. Experiments

We present some results from numerical experiments on Bernoulli and Gaussian bandits. More results are presented in AppendixE. We present in pages12and13results for K = 9 arms and horizon T = 45678 (to ensure that no choice of sequence were lucky and had TLT−1 = T or too close to it).

We ran n = 1000 repetitions of the random experiment, either on the same “easy” bandit problem µ (with evenly spaced means), or on n different random instances µ sampled uniformly in [0, 1]K, and we plot the average regret on n simulations. The black line without markers is the (asymptotic) lower bound inP

k6=k∗(kl(µ_k, µ∗))−1log T , fromLai and Robbins(1985). We consider kl-UCB++

for Bernoulli bandits (Figures2,3) or AFHG for Gaussian bandits (Figures4,5),

Each doubling trick algorithm uses the same T0= 200 as a first guess for the horizon. We include both the non-anytime version that knows the horizon T , and different anytime versions to compare the choice of doubling trick. To compare against an algorithm that does not need the horizon, we also include kl-UCB (Capp´e et al.,2013) as a baseline for Bernoulli bandits and for Gaussian bandits (in the Gaussian version, the divergence used is kl(x, y) = (x − y)2/2, and the algorithm is referred to as UCB). We consider are a geometric doubling sequence with b = 2, and two different exponential doubling sequences: the “classical” b = 2 and a “slower” one with b = 1.1. Both use a = T0= 200, and the last one is using a = 2, b = 2. Despite what was proved theoretically in Theorem7, using a = T0and a large enough T0improves regarding to using a = 2 and a leading (T0/a) factor.

Another version of the Doubling Trick with “no restart”, denoted DTno-restart, is presented in AppendixC, but it is only an heuristic and cannot be applied to any algorithm A. Algorithm2can be applied to kl-UCB++or AFHG for instance, as they use T just as a numerical parameter (see Eqs.16 and17), but its first limitation is that it cannot be applied to DMED+ (Honda and Takemura,2010) or EXP3++(Seldin and Lugosi,2017), or any algorithms based on arm eliminations, for example. A second limitation is the difficulty to analyze this “no restart” variant, due to the unpredictable effect on regret of giving non-uniform prior information to the underlying algorithm A on each successive sequence. An interesting future work would be to analyze it, either in general or for a specific algorithm like kl-UCB++. Despite its limitations, this heuristic exhibits as expected better empirical performance than DT , as can be observed in AppendixE.

6. Conclusion

We formalized and studied the well-known “Doubling Trick” for generic multi-armed bandit prob-lems, that is used to automatically obtain an anytime algorithm from any non-anytime algorithm. Our results are summarized in Table1. We show that a geometric doubling can be used to conserve minimax regret bounds (in√T ), with a constant loss (typically ≥ 3.33), but cannot be used to conserve problem-dependent bounds (in log T ), for which a faster doubling sequence is needed. An exponential doubling sequence can conserve logarithmic regret bounds also with a constant loss, but it is still an open question to know if minimax bounds can be obtained for this faster growing sequence. Partial results of both a lower and an upper bound, for bounds of the generic form Tγ(log T )δ, let use believe in a positive answer.

It is still an open problem to know if an anytime algorithm can be both asymptotically optimal for the problem-dependent regret (i.e., with the exact constant) and optimal in a minimax regret (i.e., have a√KT regret), but we believe that using a doubling trick on non-anytime algorithms like kl-UCB++cannot be the solution. We showed that it cannot work with a geometric doubling sequence, and conjecture that exponential doubling trick would never bring the right constant either.

(13)

Bound \ Doubling Geometric, Ti= bT0bic Exponential, Ti = bT_a0ab i c (log T )δ × Known to fail RT(DT ) ≥ c0(log T )1+δif RT(AT) ≥ c(log T )δ. (Theorem5)

X Known to work, with loss `(δ, b) = _bbδ2δ₋₁ > 1.

(Theorem7)

Tγ

X Known to work, with loss `(γ, b) = bγ_b(b−1)γ₋₁γ > 1.

(Theorem4)

? Partial, best known bound is c0₀(T1b)γ ≤ R_T(DT ≤ `c(Tb)γwith a loss ` > 1, if c0Tγ≤ RT(AT) ≤ cTγ. (Theorems7,9) Tγ(log T )δ for both γ > 0, δ > 0

X Known to work, with loss `(γ, δ, T0, b) = _log(T 0(b−1)+1) log(T0(b−1)) δ_bγ_(b−1)γ bγ₋₁ > 1. (Theorem4)

? Partial, best known bound is RT(DT ) ≤ `c(Tb)γif

RT(AT) ≤ cTγ, with a loss ` → 0 for T0 → ∞.

(Theorem7) Figure 1: Summary of knownpositive_Xandnegative×andpartial results ?.

0 10000 20000 30000 40000

Time steps t = 0. . . T−1, horizon T = 45678

0 100 200 300 400 500 Cu m ula te d re gr et Rt = tµ ∗− t − 1 X s= 0 9 X k= 1 µk 10 00 [Tk (s )]

Cumulated regrets for different bandit algorithms, averaged 1000 times 9 arms: Bayesian MAB, Bernoulli with means on [0, 1]

KLUCB KLUCB+ +(T_{= 45678}) DT(Ti= 200×2i)[KLUCB+ +] DT(Ti= 2002i)[KLUCB+ +] DT(Ti= 2001.1i_)[KLUCB+ +_] DT(Ti= (200/2)22i)[KLUCB+ +]

Lai & Robbins lower bound = 3.74 log(t)

Figure 2: Regret for DT , for K = 9 Bernoulli arms, horizon T = 45678, n = 1000 repetitions and µ taken uniformly in [0, 1]K. Geometric doubling (b = 2) and slow exponential doubling (b = 1.1) are too slow, and short first sequences make the regret blow up in the beginning of the experiment. At t = 40000 we see clearly the effect of a new sequence for the best doubling trick (Ti= 200 × 2i). As expected, kl-UCB++outperforms kl-UCB, and if the doubling sequence is growing fast enough then DT (kl-UCB++) can perform as well as kl-UCB++(see for t < 40000).

(14)

0 10000 20000 30000 40000 Time steps t = 0. . . T−1, horizon T = 45678

0 50 100 150 200 250 300 350 400 Cu m ula te d re gr et Rt = tµ ∗− t − 1 X s= 0 9 X k= 1 µk 10 00 [Tk (s )]

Cumulated regrets for different bandit algorithms, averaged 1000 times

9 arms: [B(0.1), B(0.2), B(0.3), B(0.4), B(0.5), B(0.6), B(0.7), B(0.8), B(0.9)∗] KLUCB KLUCB+ +₍_{T = 45678}₎ DT(Ti= 200×2i)[KLUCB+ +] DT(Ti= 2002i)[KLUCB+ +] DT(Ti= 2001.1i)[KLUCB+ +] DT(Ti= (200/2)22i)[KLUCB+ +]

Figure 3: Similarly but for µ evenly spaced in [0, 1]K({0.1, . . . , 0.9}). Both kl-UCB and kl-UCB++ are very efficient on “easy” problems like this one, and we can check visually that they match the lower bound fromLai and Robbins(1985). As before we check that slow doubling are too slow to give reasonable performance.

0 10000 20000 30000 40000

0 500 1000 1500 2000 2500 Cu m ula te d re gr et Rt = tµ ∗− t − 1 X s= 0 9 X k= 1 µk 10 00 [Tk (s )]

9 arms: Bayesian MAB, Gaussian with uniform means on [−5, 5]

UCB ApprFHG(T = 45678) DT(Ti= 200×2i_)[ApprFHG] DT(Ti= 2002i_)[ApprFHG] DT(Ti= 2001.1i)[ApprFHG] DT(Ti= (200/2)22i)[ApprFHG]

Figure 4: Regret for K = 9 Gaussian arms N (µ, 1), horizon T = 45678, n = 1000 repetitions and µ taken uniformly in [−5, 5]Kand variance V = 1. On “hard” problems like this one, both UCB and AFHG perform similarly and poorly w.r.t. to the lower bound fromLai and Robbins(1985). As before we check that geometric doubling (b = 2) and slow exponential doubling (b = 1.1) are too slow, but a fast enough doubling sequence does give reasonable performance for the anytime AFHG obtained by Doubling Trick.

(15)

Acknowledgments

This work is supported by the French National Research Agency (ANR), under the project BADASS (grant coded: N ANR-16-CE40-0002), by the French Ministry of Higher Education and Research (MENESR) and ENS Paris-Saclay.

References

S. Agrawal and N. Goyal. Analysis of Thompson sampling for the Multi-Armed Bandit problem. In Conference On Learning Theory. PMLR, 2012.

J-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In Conference on Learning Theory, pages 217–226. PMLR, 2009.

P. Auer and R. Ortner. UCB Revisited: Improved Regret Bounds For The Stochastic Multi-Armed Bandit Problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. Gambling in a Rigged Casino: The Adversarial Multi-Armed Bandit Problem. In Annual Symposium on Foundations of Computer Science, pages 322–331. IEEE, 1995.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time Analysis of the Multi-armed Bandit Problem. Machine Learning, 47(2):235–256, 2002a.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The Nonstochastic Multiarmed Bandit Problem. SIAM journal on computing, 32(1):48–77, 2002b.

S. Bubeck, N. Cesa-Bianchi, et al. Regret Analysis of Stochastic and Non-Stochastic Multi-Armed Bandit Problems. Foundations and Trends in Machine Learning, 5(1), 2012.R

O. Capp´e, A. Garivier, O-A. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 41(3):1516–1541, 2013.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

O. Chapelle and L. Li. An Empirical Evaluation of Thompson Sampling. In Advances in Neural Information Processing Systems, pages 2249–2257. Curran Associates, Inc., 2011.

R. Degenne and V. Perchet. Anytime Optimal Algorithms In Stochastic Multi Armed Bandits. In International Conference on Machine Learning, pages 1587–1595, 2016.

A. Garivier, E. Kaufmann, and T. Lattimore. On Explore-Then-Commit Strategies. volume 29 of Advances in Neural Information Processing Systems (NIPS), Barcelona, Spain, 2016.

J. Honda and A. Takemura. An Asymptotically Optimal Bandit Algorithm for Bounded Support Models. In Conference on Learning Theory, pages 67–79. PMLR, 2010.

W. Jouini, D. Ernst, C. Moy, and J. Palicot. Multi-Armed Bandit Based Policies for Cognitive Radio’s Decision Making Issues. In International Conference Signals, Circuits and Systems. IEEE, 2009.

(16)

E. Kaufmann, N. Korda, and R. Munos. Thompson Sampling: an Asymptotically Optimal Finite-Time Analysis, pages 199–213. PMLR, 2012.

E. Kaufmann, O. Capp´e, and A. Garivier. On the Complexity of A/B Testing. In Conference on Learning Theory, pages 461–481. PMLR, 2014.

T. L. Lai and H. Robbins. Asymptotically Efficient Adaptive Allocation Rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

T. Lattimore. Regret Analysis Of The Finite Horizon Gittins Index Strategy For Multi Armed Bandits. In Conference on Learning Theory, pages 1214–1245. PMLR, 2016.

L. Li, W. Chu, J. Langford, and R. E. Schapire. A Contextual-Bandit Approach to Personalized News Article Recommendation. In International Conference on World Wide Web, pages 661–670. ACM, 2010.

D. Liau, E. Price, Z. Song, and G. Yang. Stochastic Multi-Armed Bandits in Constant Space. In International Conference on Artificial Intelligence and Statistics, 2018.

P. M´enard and A. Garivier. A Minimax and Asymptotically Optimal Algorithm for Stochastic Bandits. In Algorithmic Learning Theory, volume 76, pages 223–237. PMLR, 2017.

H. Robbins. Some Aspects of the Sequential Design of Experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

A. Sani, A. Lazaric, and R. Munos. Risk-Aversion In Multi-Armed Bandits. In Advances in Neural Information Processing Systems, pages 3275–3283, 2012.

Y. Seldin and G. Lugosi. An Improved Parametrization and Analysis of the EXP3++ Algorithm for Stochastic and Adversarial Bandits. In Conference on Learning Theory, volume 65, pages 1–17. PMLR, 2017.

W. R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 25, 1933.

F. Yang, A. Ramdas, K. Jamieson, and M. Wainwright. A framework for Multi-A(rmed)/B(andit) Testing with Online FDR Control. In Advances in Neural Information Processing Systems, pages 5957–5966. Curran Associates, Inc., 2017.

Note: the simulation code used for the experiments is using Python 3. It is open-sourced at https://GitHub.com/SMPyBandits/SMPyBanditsand fully documented at

(17)

Appendix A. Omitted Proofs

We include here the proofs omitted in the main document.

A.1. Proof of Lemma2, “Regret Lower and Upper Bounds for DT ” Let A0denote DT (A, (Ti)i∈N). For every k ∈ {1, . . . , K},

E " _T X t=1 (Xk,t− XA(t),t) # = LT−1 X i=0 E   Ti X t=Ti−1 (Xk,t− XA(t),t)  + E   T X t=T_{LT −1} (Xk,t− XA(t),t)   ≤ LT−1 X i=0 max k∈{1,...,K}E   Ti X t=Ti−1 (Xk,t− XA(t),t)  + max k∈{1,...,K}E   T X t=T_{LT −1} (Xk,t− XA(t),t)   ≤ LT−1 X i=0 RTi−Ti−1 A 0_{+ R} T −T_{LT −1} A0 .

Thus, by definition of the regret

RT(A0) ≤ LT−1 X i=0 RTi−Ti−1(A 0_{) + R} T −T_{LT −1}(A0) = LT−1 X i=0 RTi−Ti−1 ATi−Ti−1 + RT − T_L T−1 | {z } ≤TLT−TLT −1 A_T LT−TLT −1 ≤ LT X i=0 RTi−Ti−1 ATi−Ti−1 .

In the stochastic case, it is well known that the regret can be rewritten in the following way, introducing µkthe mean of arm k and µ∗the mean of the best arm:

RT(A0) = E " _T X t=1 (µ∗− µ_A(t)) # = LT −1 X i=0 E   Ti X t=Ti−1 (µ∗− µ_A(t))  + E   T X t=T_{LT −1} (µ∗− µ_A(t))   = LT−1 X i=0 RTi−Ti−1 ATi−Ti−1) + RT −T_{LT −1}(A 0 ) | {z } ≥0 .

(18)

A.2. Proof of Theorem4, “Conserving a Regret Upper Bound with Geometric Horizons” It is interesting to note that the proof is valid for both the easiest case when δ = 0 (as it was known in Cesa-Bianchi and Lugosi(2006) for γ = 1/2) and the generic case when δ ≥ 0, with no distinction.

As far as the authors know, this result in its generality with δ ≥ 0 is new.

Proof Let A0 := DT (A, (Ti)i∈N), and consider a fixed bandit problem. The upper bound (UB) from Lemma2gives RT(A0) ≤ LT X i=0 RTi−Ti−1(AT =Ti−Ti−1)

We can use the hypothesis on A for each regret term, as f and t 7→ ctγ(log t)δare non-decreasing for t ≥ 1 (by hypothesis for f and by Lemma16).

≤ LT X i=0 f (Ti− Ti−1) + cT0γ(log T0)δ+ c LT X i=1

(Ti− Ti−1)γ(log (Ti− Ti−1))δ

The first part is denoted g1(T ) :=PLi=0T f (Ti− Ti−1) + cT0γ(log T0)δ, it is an increasing function as a sum of increasing functions, and it is dealt with by using Lemma17: the sum of f (Ti− Ti−1) is a o

PLT

i=0(Ti− Ti−1)γ(log(Ti− Ti−1))δ

, as f (t) = o tγ(log t)δ by hypothesis, and this sum of (Ti− Ti−1)γ(log(Ti − Ti−1))δ is proved below to be bounded by c0Tγ(log(T ))δ for a certain constant c0> 0, which gives g1(T ) = o Tγ(log T )δ. For the second part, we bound Ti− Ti−1≤ T0(b − 1)bi−1+ 1 thanks to Definition3. Moreover, as γ < 1 we can use Lemma14(Eq. (23)) to distribute the power on γ, so (T0(b − 1)bi−1+ 1)γ ≤ (T0(b − 1)bi−1)γ+1(γ 6= 0) (indeed if γ = 0 both sides are equal to 1). This gives

≤ g1(T ) + c(T0(b − 1))γ

LT

X

i=1

(bi−1)γ(log(Ti− Ti−1))δ+ c1(γ 6= 0)

LT

X

i=1

(log(Ti− Ti−1))δ

If γ 6= 0, the last sum is bounded byPLT

i (log Ti)δ≤ (log T0)δ(LT + 1) + (log b)δ PLT

i iδwhich is a OLδ+1_T = O (log T )δ+1 = o Tγ(log T )δ (as γ > 0, thanks to a geometric sum), and so it can be included in g2(T ) = o Tγ(log T )δ. If γ = 0, there is only the first sum. We bound again Ti − Ti−1 ≤ T0(b − 1)bi−1+ 1 and use Lemma 15to bound log(T0(b − 1)bi−1+ 1) by

log(T0(b−1)+1)

log(T0(b−1)) log(T0(b − 1)b

i−1_{) term (as T}

0(b − 1) > 1 by hypothesis). ≤ g2(T ) + c(T0(b − 1))γ LT X i=1 (bi−1)γ log(T0(b − 1) + 1) log(T0(b − 1)) log(T0(b − 1)bi−1) δ

We split the log(T0(b − 1)bi−1) term in two, and once again, the term with log(T0(b − 1)) gives a O bLT−1 (by a geometric sum), which gets included in g

3(T ) = o Tγ(log T )δ. We focus on the fastest term, and we can now rewrite the sum from i = 0 to LT − 1,

≤ g2(T ) + c(T0(b − 1))γ log(b)log(T0(b − 1) + 1) log(T0(b − 1)) δ LT−1 X i=0 (bi)γiδ

(19)

We naively bound iδby (LT − 1)δ, and use a geometric sum to have ≤ g2(T ) + c(T0(b − 1))γ log(b)log(T0(b − 1) + 1) log(T0(b − 1)) δ (LT − 1)δ bγ bγ_{− 1}(b LT−1₎γ

Finally, observe that LT − 1 ≤ logb(TT0) ≤ logb(T ), so the (log b)

δ_{term simplifies, and observe} that bLT−1 _≤ T

T0 so the T γ

0 term also simplifies. Thus we get

≤ g2(T ) + c log(T0(b − 1) + 1) log(T0(b − 1)) δ bγ(b − 1)γ bγ_{− 1} T γ_{(log T )}δ_.

The constant multiplicative loss ` depends on γ and δ as well as on T0and b, and is `(γ, δ, T0, b) :=

_log(T

0(b−1)+1)

log(T0(b−1))

δ_bγ_(b−1)γ

bγ₋₁ > 1.

Minimizing the constant loss? This constant loss has two distinct part, `(γ, δ, T0, b) = `1(δ, T0, b) `2(γ, b), with `1depending on δ, T0and b (equal to 1 if δ = 0), and `2depending on γ and b.

• Minimizing this constant loss `1(δ, T0, b) :=

log(T0(b−1)+1)

log(T0(b−1))

δ

≥ 1 is independent of δ (even if it is 0). If we assume b to be fixed, `1(δ, T0, b) → 1 when T0 → ∞. Moreover, for any δ and b > 1, `1(δ, T0, b) goes to 1 very quickly when T0 is large enough. For instance, for γ = δ = 1₂ and b = 3+

√ 5

2 (see Corollary10), then `1(δ, T0, b) ' 1.109 for T0 = 2, ' 1.01 for T0= 10 and ' 1.0004 for T0 = 100.

• To minimize this constant loss `2(γ, b) := b

γ_(b−1)γ

bγ₋₁ > 1, we fix γ and study h : b 7→

`2(γ, b). The function h is of class C1 on (1, ∞) and h(b) → +∞ for b → 1+ and b → ∞, so h has a (possibly non-unique) global minimum and attains it. Moreover h0(b) =

γbγ−1(b−1)γ−1

(bγ₋₁₎2 bγ+1− 2b + 1 has the sign of bγ+1− 2b + 1, which does not have a constant

sign and does not have explicit root(s) for a generic γ. However, it is easy to minimizing `2(γ, b) for b numerically when γ is known and fixed (with, e.g., Newton’s method).

The result from Theorem4of course implies the result from (Cesa-Bianchi and Lugosi,2006, Ex.2.9), in the special case of δ = 0 and γ = 1₂ (for minimax bounds), as stated numerically in the following Corollary10.

Corollary 10 If γ = 1₂ andδ = 0, the multiplicative loss `(1₂, 0, T0, b) does not depend on T0. It is then minimal forb∗(1₂) = 3+

√ 5

2 ' 2.62 and its minimum is q

11+5√5

2 ' 3.33. Usually b = 2 is used, which gives a loss of

√ 2 √

2−1 ' 3.41, close to the optimal value.

In particular, order-optimal and optimal algorithms for the minimax bound haveγ = 1₂ and f (t) = 0, for which Theorem7gives a simpler bound

RT(AT) ≤ c √ T =⇒ RT(DT (A, (T02i)i∈N)) ≤ √ 2 √ 2 − 1c √ T . (18)

(20)

A.3. Proof of Theorem9, “Minimax Regret Lower Bound with Exponential Horizons” Proof Let A0:= DT (A, (Ti)i∈N), and consider a fixed stochastic bandit problem. The lower bound (LB) from Lemma2gives

RT(A0) ≥

LT−1

X

i=0

We can use the hypothesis on A for each regret term, and as 0 < γ ≤ 1, we can use Lemma14 (Eq. (24)) to distribute the power on γ to ease the proof and obtain

≥ cT₀γ+ c LT−1 X i=1 (Ti− Ti−1)γ ≥ cT₀γ+ c LT−1 X i=1 T_iγ− T_i−1γ

(it is a telescopic sum and simplifies) ≥ cT_Lγ

T−1

Observe that TLT−1 ≥ (TLT) 1

b by definition of the exponential sequence (Def.6), and T_L

T ≥ T

(Def.1). For the log(TLT−1) term, we simply have log(TLT−1) ≥ 1

blog(T ) so if c

0 _{= c/b, then we} obtain what we want

RT(A0) ≥ c0 T γ b.

This lower bound goes from RT(AT) = ΩTγto RT(DT (A)) = ΩT γ

b, and it looks very similar to

the upper bound from Theorem7where RT(DT (A)) = O Tbγ was obtained from RT(AT) = O(Tγ_).

Remark It does seem sub-optimal to lower bound TLT−1 like this (TLT−1 ≥ (TLT) 1

b), but we

remind that T can be located anywhere in the discrete interval {TLT−1, . . . , TLT − 1}, so in the

worst case when T is very close to TLT (and for large enough T ), we indeed have T b

LT−1 ∼ TLT and

TLT ∼ T , so with this approach, the lower bound TLT−1 ≥ T 1

b cannot be improved.

Appendix B. Minimax Regret Lower Bound with Geometric Horizons

We include here a last result that partly replies to Theorem4. It is more subtle that the lower bound in Theorem5but still provides an interesting insight: if b is not chosen carefully (i.e., if `0(γ, b) > 1), then the anytime version of AT using a geometric Doubling Trick suffers a non-improvable constant multiplicative loss compared to AT.

Theorem 11 For stochastic models, if A satisfies RT(AT) ≥ c Tγ, for0 < γ < 1 and c > 0, then the anytime versionA0 _{:= DT (A, (T}

i)i∈N) with the geometric sequence (Ti)i∈N of parameters T0∈ N∗, b > 1 (i.e., Ti = bT0bic) satisfies

(21)

withg0(t) = O(log t) = o(tγ), and a constant loss `0(γ, b) depending only on γ and b, `0(γ, b) =

(b − 1)γ

bγ_(bγ_{− 1)} > 0. (20)

`0(γ, b) is always > 0 and tends to 0 for b → ∞, and some choice of b gives `0(γ, b) > 1.

Proof Let A0:= DT (A, (Ti)i∈N), and consider a fixed stochastic bandit problem. Assume LT ≥ 2. The lower bound (LB) from Lemma2gives

RT(A0) ≥

LT−1

X

i=0

We bound Ti− Ti−1 ≥ T0(b − 1)bi−1− 1 for i > 0, thanks to Definition3, and we can use the hypothesis on A for each regret term. Additionally, we have (Ti− Ti−1)γ≥ (T0(b − 1)bi−1− 1)γ≥ (T0(b − 1))γ(bi−1)γ− 1 by Lemma14(Eq. (24), as b > 1 and 0 < γ < 1), thus

≥ LT−1 X i=0 c(Ti− Ti−1)γ ≥ cT₀γ+ cT₀γ(b − 1)γ LT−1 X i=1 (bi−1)γ− c LT−1 X i=1 1 ≥ cT₀γ+ cT₀γ(b − 1)γ LT−2 X i=0 (bi)γ− c(LT − 1) We havePLT−1 i=0 (bi)γ = (bLT −1)γ−1

bγ₋₁ thanks to a geometric sum (with γ > 0) and thus

≥ cT₀γ+ cT₀γ(b − 1)γ(b

LT−1₎γ_{− 1}

bγ_{− 1} + c(1 − LT) Thanks to Definition3, bLT−1_{satisfies b}LT−1_≥ 1

b T T0. Let g0(T ) := cT γ 0 (b−1)γ bγ₋₁ − 1 + c(LT − 1) = O(1) + O log_b(_TT 0)

= O(log T ) = o(Tγ) and g0(T ) > 0, then we have

≥ c (b − 1) γ bγ_(bγ_{− 1)}T γ₋ cT₀γ (b − 1) γ bγ_{− 1} − 1 + c(LT − 1) .

We obtain as announced, RT(A0) ≥ `0(b) cTγ+ g0(T ) with g0(T ) = O(log T ) = o(Tγ).

Maximizing the constant loss? To maximize3 `0(γ, b) := (b−1) γ

bγ_(bγ₋₁₎ > 0, we fix γ and study

the function h : b 7→ `0(γ, b). The function h is of class C1 on (1, ∞) and h(b) → +∞ for b → 1+and h(b) → 0 for b → ∞. Moreover h0(b) = −γ (b−1)γ−1

bγ+1_(bγ₋₁₎2 −2bγ+ bγ+1+ 1 has the

same sign as f (b) := − −2bγ+ bγ+1+ 1. The function f is of class C1, with f (1) = 0 and f0(b) = −(γ + 1)(b − _γ+12γ )bγ−1, and as 0 < γ < 1, _γ+12γ < 1 so f0(b) < 0 for all b > 1. Thus f is decreasing, and ∀b > 1, f (b) < f (1) = 0. So h0has a negative sign, and this allows to conclude that h is decreasing, and so b 7→ `0(γ, b) has no global maximum at fixed γ, and `0 → ∞ if b → 1+.

(22)

Relationship with the upper bound. For any b > 1, we compare `0(γ, b) with `(γ, b) and we see that, interestingly, `0(γ, b) = `2(γ, b)/b2γ, with `2(γ, b) from Theorem4. For the particular case of γ = 1₂, this lower bound also leads to another interesting remark: if b is chosen to minimize the loss in the upper bound (Theorem4, b∗(1₂) = 3+

√ 5

2 ), then this lower bound gives `0( 1 2, b ∗₍1 2)) = 1 + √ 2

2 ' 1.71 > 1, which proves that this choice of geometric doubling trick cannot be used to conserve an optimal algorithm, i.e., the constant loss cannot be made as close to 1 as we want.

Appendix C. An Efficient Heuristic, the Doubling Trick with “No Restart”

Let A0 := DTno-restart(A, (Ti)i∈N) denotes the following Algorithm2. The only difference with DT (Algorithm1) is that the history from all the steps from t = 1 to t = Tiis used to reinitialize the new algorithm A(i). To be more precise, this means that a fresh algorithm A(i)is created, and then fed with successive observations (A0(s), YA0_(s),s) for all 1 ≤ s < t, like if it was playing

from the beginning. Note that A(i)could have chosen a different path of actions, but we give it the observations from the previous plays of A0.

This obviously cannot be applied to any kind of algorithm A, and for instance any algorithm based on arm elimination (e.g., Explore-Then-Commit approaches like inGarivier et al.(2016)) will most surely fail with this approach. This second doubling trick algorithm DTno-restartcan be applied in practice if A is index-based and uses the horizon T as a simple numerical parameter in its indexes, like it is the case for instance for AFHG (cf. Eq. (16)). or kl-UCB++(cf. Eq. (17)).

Input: Bandit algorithm A, and sequence (Ti)i∈N 1 Let i = 0, and initialize algorithm A(0) = AT0. 2 for t = 1, . . . , T do

3 if t > T_ithen // Next horizon Ti+1 from the sequence 4 Let i = i + 1.

5 Initialize algorithm A(i) = ATi−Ti−1.

6 Update internal memory of A(i)with the history of plays and rewards from Ai−1.

7 end

8 Play with A(i): play arm A0(t) := A(i)(t − Ti), observe reward r(t) = YA0_(t),t.

9 end

Algorithm 2: The Non-Restarting Doubling Trick Algorithm, A0= DTno-restart(A, (Ti)i∈N). However, it is much harder to state any theoretical result on this heuristic DTno-restart. We conjecture that a regret upper bound similar to (UB) from Lemma2 could still be obtained, but it is still an open problem that the authors do not know how to tackle for a generic algorithm. The intuition is that starting A(i)with some previous observations from the (same) i.i.d. process

(Yk,s)k∈{1,...,K},s∈Ncan only improve the performance of A(i), and lead to a smaller regret on the

(23)

Appendix D. Basic but Useful Results

All the logarithm log are taken in base e = exp(1) (i.e., natural logarithm ln, but log is preferred for readability). Logarithms in a basis b > 1 are denoted log_b(x) := log x_{log b}, for any x ∈ R, x > 0.

We remind that bxc denotes the integer part of x ∈ R, and for x > 0, that is the unique integer i such that i ≤ x < i + 1. The only property we use is its definition and the fact that bxc ≤ x. We also define dxe := 1 + bxc for x ≥ 0, which is the unique integer j such that j − 1 ≤ x < j. D.1. Weighted Geometric Inequality

Lemma 12 (Weighted Geometric Inequality) For any n ∈ N∗, b > 1 and δ > 0, and if f is a function of classC1_{, non-decreasing and non-negative on}_{[0, ∞), we have}

n X i=0 f (i)(bi)δ≤ b δ bδ_{− 1}f (n)(b n₎δ_. ₍₂₁₎

Proof By hypothesis, f is non-decreasing, so ∀i ∈ {0, . . . , n}, f (i) ≤ f (n), and so by using the sum of a geometric sequence, we have

n X i=0 f (i)(bi)δ≤ f (n) n X i=0 (bi)δ ! ≤ f (n) 1 bδ_{− 1}(b n+1₎δ₌ bδ bδ_{− 1} f (n)(bn)δ.

f (i) = 1 gives the geometric inequality. Note that if we make δ → 0, the left sum converges to Pn−1

i=0 f (i) and the right term diverges to +∞, making this inequality completely uninformative.

D.2. Another Sum Inequality

This second result is similar to the previous one but for a “doubly exponential” sequence, i.e., abi, as it also bounds a sum of increasing terms by a constant times its last term.

Lemma 13 For any n ∈ N∗,a > 1, b > 1 and γ > 0, we have n X i=0 (abi)γ ≤ aγ+ 1 + 1 (log(a))(log(bγ₎₎ (abn)γ = O(abn)γ. (22) Proof We first isolate both the first and last term in the sum and focus on the from i = 1 sum up to i = n − 1. As the function t 7→ (abt)γis increasing for t ≥ 1, we use a sum-integral inequality, and then the change of variable u := γbt, of Jacobian dt = _{log b}1 du_u , gives

n−1 X i=1 (abi)γ≤ Z n 1 aγbt dt ≤ 1 log(bγ₎ Z γbn γb au u du Now for u ≥ 1, observe that a_uu ≤ au_{, and as γb > 1, we have}

≤ 1 log(bγ₎ Z γbn γb au du ≤ 1 log(bγ₎ 1 log(a)a γbn = 1 (log(a))(log(bγ₎₎(a bn )γ.

Finally, we obtain as desired, n P i=0 (abi)γ≤ aγ_{+ (a}bn )γ+_{(log(a))(log(b}1 γ₎₎(ab n )γ.

(24)

D.3. Basic Functional Inequalities

These functional inequalities are used in the proof of the main theorems.

Lemma 14 (Generalized Square Root Inequalities) For any x, y ≥ 0 and 0 < δ < 1,

(x + y)δ≤ xδ_{+ y}δ_. ₍₂₃₎

And conversely for any0 < δ < 1, and x, y ≥ 0, if x ≥ y then

(x − y)δ≥ xδ− yδ. (24)

Proof Fix y ≥ 0. Let f (x) := (x+y)δ−(xδ_+yδ_{) for x ≥ 0. First, f (0) = y}δ_−yδ_{= 0, and as δ > 0,} f is differentiable on [0, ∞), with f0(x) = (log δ)(x + y)δ_{− (log δ)x}δ _{= (log δ)((x + y)}δ_{− x}δ_{), and} as δ < 1, log δ < 0, and (x + y)δ ≥ xδ_{, so f}0_{(x) ≤ 0 for any x ≥ 0. Therefore, f is non-increasing,} and so ∀x ≥ 0, f (x) ≤ f (0) = 0, so f is non-positive, giving the desired inequality (for any y ≥ 0 and any x ≥ 0).

The second inequality is a direct application of the first one. Assume x ≥ y, and let x0 = x − y ≥ 0, then (x0+ y)δ≤ (x0₎δ_{+ y}δ_{. This gives (x − y)}δ_{= (x}0₎δ_{≤ (x}0_{+ y)}δ_{− y}δ_{= x}δ_{− y}δ_.

Lemma 15 (Bounding log(x − ∆)) Let x0 > 1 and 0 < ∆ < x0 (e.g., ∆ ≤ 1), then ∀x ≥ x0,

log(x0− ∆) log(x0)

log(x) ≤ log(x − ∆) ≤ log(x). (25) With∆ = 1, it implies that if T0 > 1, b > 1 satisfy T0(b − 1) > 1, then for any i ∈ N, we have

log(T0(b − 1)bi− 1) ≥

log(T0(b − 1) − 1) log(T0(b − 1))

log(T0(b − 1)bi). (26)

Proof Let f (x) := log(x−∆)_log(x) , defined for x ≥ x0. It is of class C1, and by differentiating, we have f0(x) = log(x)−log(x−∆)_{x(log x)}2 > 0 as log is increasing. So f is increasing, and its minimum is attained at

x = x0, i.e., ∀x ≥ x0, f (x) ≥ f (x0) = log(x_log(x0−∆)₀₎ > 0, which gives Eq. (25). The corollary is immediate but stated explicitly for clarity when used in page6.

Lemma 16 For any γ > 0 and δ > 0, the function f : x 7→ xγ(log x)δis increasing on[1, ∞). Proof f is of class C1on [1, ∞). First, if γ > 0, we have f0(x) = xγ−1(log x)δ−1(δ + γ log(x)). So f0(x) > 0 if and only if δ + γ log(x) ≥ 0, that it x ≥ exp(−δ_γ). But x ≥ 1 and < 0 so f0(x) is always positive, and thus f is increasing. Then, if γ = 0, we have f0(x) = δ1_x(log x)δ−1 > 0 as x > 1 gives log x > 0 and so (log x)δ−1 > 0.

(25)

D.4. Controlling an Unbounded Sum of Dominated Terms

This Lemma is used in the proofs of our upper bounds (Theorems4and7), to handle the sum of f (Ti) terms. In particular, it can be applied to (Ti) the geometric sequence and g(t) = h(t) = tγor g(t) = h(t) = tγ(log t)δ(for Theorem4) for γ > 0 and δ ≥ 0; or (Ti) the exponential sequence, g(t) = tγ(log t)δand h(t) = tbγ(log t)δ(for Theorem7) for γ ≥ 0 and δ ≥ 0. Note that it would be obvious if LT was bounded for T → ∞, but a more careful analysis has to be given as LT → ∞. Lemma 17 Let f , g and h be three positive, diverging and non-decreasing functions on [1, ∞), such thatf (t) = o(g(t)) for t → ∞. Let a non-decreasing diverging sequence (Ti)i∈N, and a diverging sequence(LT)T ∈N(i.e., Ti → ∞ for i → ∞ and LT → ∞ if T → ∞), such that there exists a constantc ≥ 0 satisfying ∀T ≥ 1,PLT

i=0g(Ti) ≤ c × h(T ). Then the (unbounded) sum of dominated termsf (Ti) is still dominated by h(T ), i.e.,

f (t) = T →∞o(g(t)) and ∃c ≥ 0, LT X i=0 g(Ti) ≤ ∀T ≥1 c × h(T ) =⇒ LT X i=0 f (Ti) = T →∞o(h(T )) . (27) Proof By hypothesis, if f is dominated by g, formally f (t) = o(g(t)), then there exists a positive function ε(t) such that f (t) = g(t)ε(t) and ε(t) → 0 for t → 0. Fix η > 0, as small as we want, then there exists Tη ≥ 1 such that ∀t ≥ Tη, ε(t) ≤ η. Let iη such that Tiη−1< Tη ≤ Tiη. Now for

any T ≥ Tηand large enough so that LT ≥ Tη, we can start to split the sum

LT X i=0 f (Ti) = iη−1 X i=0 f (Ti) + LT X i=iη f (Ti)

The first sum is naively bounded by iη× f (Tiη−1) as f is increasing, and for the second sum, for

any i ≥ iη, Ti≥ Tη and so f (Ti) = ε(Ti)g(Ti) ≤ ηg(Ti), thus

≤ iηf (Tiη−1) + η ×   LT X i=iη g(Ti)  

The sum is smaller than a sum on a larger interval, as g(Ti) ≥ 0 for any i, and f is increasing so ≤ iηf (Tη) + η LT X i=0 g(Ti) !

But now, f (Tη) ≤ ηg(Tη) by hypothesis, and this sum is smaller than c × h(T ) also by hypothesis ≤ i_ηη × g(Tη) + ηc × h(T ) = η (iηg(Tη) + c × h(T ))

Finally, we use the hypothesis that h(T ) is diverging and as η and Tηare both fixed, there exists a f

Tη ≥ Tη large enough so that iηg(Tη) ≤ h(T ) for all T ≥ fTη. And so we have finally proved that ∀η > 0, ∃fTη ≥ 1, ∀T ≥ fTη,

LT

X

i=0

f (Ti) ≤ η(c + 1) × h(T ).

This concludes the proof and shows that

LT

P

i=0

(26)

Appendix E. Additional Experiments

We presents additional experiments, for Gaussian bandits and for the heuristic DTno-restart. E.1. Experiments with Gaussian Bandits (with Known Variance)

We include here another figure for experiments on Gaussian bandits, see Fig.5.

0 10000 20000 30000 40000

Time steps t= 0. . . T−1, horizon T= 45678

0 50 100 150 200 250 300 350 400 Cu m ula te d re gr et Rt = tµ ∗− t − 1 X s= 0 9 X k= 1 µk 10 00 [ Tk ( s )]

Cumulated regrets for different bandit algorithms, averaged 1000 times 9 arms: [N(−4,1), N(−3,1), N(−2,1), N(−1,1), N(0,1), N(1,1), N(2,1), N(3,1), N(4,1)∗_] UCB ApprFHG(T = 45678) DT(Ti= 200×2i)[ApprFHG] DT(Ti= 2002i)[ApprFHG] DT(Ti= 2001.1i)[ApprFHG] DT(Ti= (200/2)22i)[ApprFHG]

Figure 5: Regret for DT , for K = 9 Gaussian arms N (µ, 1), horizon T = 45678, n = 1000 repetitions and µ uniformly spaced in [−5, 5]K. On “easy” problems like this one, both UCB and AFHG perform similarly and attain near constant regret (identifying the best Gaussian arm is very easy here as they are sufficiently distinct). Each doubling trick also appear to attain near constant regret, but geometric doubling (b = 2) and slow exponential doubling (b = 1.1) are slower to converge and thus less efficient.

E.2. Experiments with DTno-restart

As mentioned previously, the DTno-restartalgorithm (Algorithm2) is only an heuristic so far, as no theoretical guarantee was proved for it. For the sake of completeness, we also include results from numerical experiments with it, to compare its performance against the “with restart” version DT .

As expected, DTno-restartenjoys much better empirical performance, and in Figs.6and7we see that a geometric or a slow exponential doubling trick with no restart with kl-UCB++can outperform kl-UCB and perform similarly to the non-anytime kl-UCB++. But as observed before, the regret blows up after the beginning of each new sequence if the doubling sequence increase too fast (e.g., exponential doubling). Despite what is proved theoretically in Theorem5, here we observe that the geometric doubling is the only one to be slow enough to be efficient.

(27)

0 10000 20000 30000 40000 Time steps t = 0. . . T−1, horizon T = 45678

0 100 200 300 400 500 Cu m ula te d re gr et Rt = tµ ∗− t − 1 X s= 0 9_X µ _k=1 k 10 00 [Tk (s )]

9 arms: Bayesian MAB, Bernoulli with means on [0,1]

KLUCB KLUCB+ +(_{T = 45678}) DTnr(Ti= 200×2i_)[KLUCB+ +_] DTnr(Ti= 2002i)[KLUCB+ +] DTnr(Ti= 2001.1i)[KLUCB+ +] DTnr(Ti= (200/2)22i)[KLUCB+ +]

Figure 6: Regret for K = 9 Bernoulli arms, horizon T = 45678, n = 1000 repetitions and µ taken uniformly in [0, 1]K, for DTno-restart. Geometric doubling (e.g., b = 2) and slow exponential doubling (e.g., b = 1.1) are too slow, and short first sequences make the regret blow up in the beginning of the experiment. At t = 40000 we see clearly the effect of a new sequence for the best doubling trick (Ti = 200 × 2i). As expected, kl-UCB++outperforms kl-UCB, and if the doubling sequence is growing fast enough then DTno-restart(kl-UCB++) can perform as well as kl-UCB++.

0 10000 20000 30000 40000

0 20 40 60 80 100 120 140 Cu m ula te d re gr et Rt = tµ ∗− t − 1 X s= 0 9_X µ _k=1 k 10 00 [Tk (s )]

9 arms: [B(0.1), B(0.2), B(0.3), B(0.4), B(0.5), B(0.6), B(0.7), B(0.8), B(0.9)∗] KLUCB KLUCB+ +(_{T = 45678}) DTnr(Ti= 200×2i_)[KLUCB+ +_] DTnr(Ti= 2002i)[KLUCB+ +] DTnr(Ti= 2001.1i)[KLUCB+ +] DTnr(Ti= (200/2)22i)[KLUCB+ +]

Figure 7: K = 9 Bernoulli arms with µ evenly spaced in [0, 1]K. On easy problems like this one, both kl-UCB and kl-UCB++are very efficient, and here the geometric allows the DTno-restart anytime version of kl-UCB++to outperform both kl-UCB and kl-UCB++.