
Regret Bounds and Minimax Policies under Partial Monitoring



HAL Id: hal-00654356

https://hal-enpc.archives-ouvertes.fr/hal-00654356

Submitted on 21 Dec 2011


Jean-Yves Audibert, Sébastien Bubeck

To cite this version:

Jean-Yves Audibert, Sébastien Bubeck. Regret Bounds and Minimax Policies under Partial Monitoring. Journal of Machine Learning Research, Microtome Publishing, 2010, 11, pp. 2785-2836. ⟨hal-00654356⟩


Regret Bounds and Minimax Policies under Partial Monitoring

Jean-Yves Audibert audibert@imagine.enpc.fr

Imagine, Université Paris Est, 6 avenue Blaise Pascal

77455 Champs-sur-Marne, France

Sébastien Bubeck sebastien.bubeck@inria.fr

SequeL Project, INRIA Lille 40 avenue Halley

59650 Villeneuve d’Ascq, France

Editor: Nicolò Cesa-Bianchi

Abstract

This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient, as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster), based on an arbitrary function ψ, for which we propose a unified analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x) = exp(ηx) + γ/K, INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results, while for the expected regret we slightly tighten the bounds. On the other hand, with ψ(x) = (η/(−x))^q + γ/K, which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandit games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action.

Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays.

Keywords: Bandits (adversarial and stochastic), regret bound, minimax rate, label efficient, upper confidence bound (UCB) policy, online learning, prediction with limited feedback.

1. Introduction

This section starts by defining the prediction tasks, the different regret notions that we will consider, and the different adversaries of the forecaster. We will then recap existing lower and upper regret bounds for the different settings, and give an overview of our contributions.

Also at Willow, CNRS/ENS/INRIA, UMR 8548.


Parameters: the number of arms (or actions) K and the number of rounds n, with n ≥ K ≥ 2.

For each round t = 1, 2, . . . , n:

(1) The forecaster chooses an arm I_t ∈ {1, . . . , K}, possibly with the help of an external randomization.

(2) Simultaneously the adversary chooses a gain vector g_t = (g_{1,t}, . . . , g_{K,t}) ∈ [0,1]^K (see Section 8 for loss games or signed games).

(3) The forecaster receives the gain g_{I_t,t} (without systematically observing it). He observes

- the reward vector (g_{1,t}, . . . , g_{K,t}) in the full information game,
- the reward vector (g_{1,t}, . . . , g_{K,t}) if he asks for it, with the global constraint that he is not allowed to ask it more than m times for some fixed integer 1 ≤ m ≤ n; this prediction game is the label efficient game,
- only g_{I_t,t} in the bandit game,
- only his obtained reward g_{I_t,t} if he asks for it, with the global constraint that he is not allowed to ask it more than m times for some fixed integer 1 ≤ m ≤ n; this prediction game is the label efficient bandit game.

Goal: The forecaster tries to maximize his cumulative gain ∑_{t=1}^n g_{I_t,t}.

Figure 1: The four prediction tasks considered in this work.

1.1 The Four Prediction Tasks

We consider a general prediction game where, at each stage, a forecaster (or decision maker) chooses one action (or arm) and receives a reward from it. Then the forecaster receives a feedback about the rewards which he can use to make his choice at the next stage. His goal is to maximize his cumulative gain. In the simplest version, after choosing an arm the forecaster observes the rewards of all arms; this is the so-called full information game. In the label efficient game, originally proposed by Helmbold and Panizza (1997), after choosing his action at a stage, the forecaster decides whether to ask for the rewards of the different actions at this stage, knowing that he is allowed to do so only a limited number of times. Another classical setting is the bandit game, where the forecaster only observes the reward of the arm he has chosen. In its original version (Robbins, 1952), this game was considered in a stochastic setting, that is, nature draws the rewards from a fixed product-distribution.

Later it was considered in an adversarial framework (Auer et al., 1995), where there is an adversary choosing the rewards on the arms. A combination of the two previous settings is the label efficient bandit game (György and Ottucsák, 2006), in which the only observed rewards are the ones obtained and asked for by the forecaster, with again a limitation on the number of possible queries. These four games are described more precisely in Figure 1.

Their Hannan consistency has been considered in Allenberg et al. (2006) in the case of unbounded losses. Here we will focus on regret upper bounds and minimax policies for bounded losses.

1.2 Regret and Pseudo-regret

A natural way to assess the performance of a forecaster is to compute his regret with respect to the best action in hindsight (see Section 7 for a more general regret in which we compare to the best switching strategy having a fixed number of action-switches):

    R_n = max_{1≤i≤K} ∑_{t=1}^n (g_{i,t} − g_{I_t,t}).

A lot of attention has been drawn by the characterization of the minimax expected regret in the different games we have described. More precisely, for a given game, let us write sup for the supremum over all allowed adversaries and inf for the infimum over all forecaster strategies for this game. We are interested in the quantity

    inf sup E R_n,

where the expectation is with respect to the possible randomization of the forecaster and the adversary. Another related quantity which can be easier to handle is the pseudo-regret:

    R̄_n = max_{1≤i≤K} E ∑_{t=1}^n (g_{i,t} − g_{I_t,t}).

Note that, by Jensen’s inequality, the pseudo-regret is always smaller than the expected regret. In Appendix D we discuss cases where the converse inequality holds (up to an additional term).

1.3 The Different Adversaries

The simplest adversary is the deterministic one. It is characterized by a fixed matrix of nK rewards corresponding to (g_{i,t})_{1≤i≤K, 1≤t≤n}. Another adversary is the “stochastic” one, in which the reward vectors are independent and have the same distribution.1 This adversary is characterized by a distribution on [0,1]^K, corresponding to the common distribution of g_t, t = 1, . . . , n. A more general adversary is the fully oblivious one, in which the reward vectors are independent. Here the adversary is characterized by n distributions on [0,1]^K corresponding to the distributions of g_1, . . . , g_n. Deterministic and stochastic adversaries are fully oblivious adversaries.

An even more general adversary is the oblivious one, in which the only constraint on the adversary is that the reward vectors are independent of the past decisions of the forecaster.

The most general adversary is the one who may choose the reward vector g_t as a function of the past decisions I_1, . . . , I_{t−1} (non-oblivious adversary).

1. The term “stochastic” can be a bit misleading since the assumption is not just stochasticity but rather an i.i.d. assumption.


                              inf sup R̄_n                            inf sup E R_n
                              Lower bound      Upper bound           Lower bound      Upper bound
Full information game         √(n log K)       √(n log K)            √(n log K)       √(n log K)
Label efficient game          n√(log(K)/m)     n√(log(K)/m)          n√(log(K)/m)     n√(log(n)/m)
Bandit game                   √(nK)            √(nK log K)           √(nK)            √(nK log n)
Bandit label efficient game   n√(K/m)          n√(K log(K)/m)        n√(K/m)          n√(K log(n)/m)

Table 1: Existing bounds (apart from the lower bounds in the last line, which are proved in this paper) on the pseudo-regret and the expected regret. Except for the full information game, there are logarithmic gaps between lower and upper bounds.

1.4 Known Regret Bounds

Table 1 recaps existing lower and upper bounds on the minimax pseudo-regret and the minimax expected regret for general adversaries (i.e., possibly non-oblivious ones). For the first three lines, we refer the reader to the book of Cesa-Bianchi and Lugosi (2006) and references therein, particularly Cesa-Bianchi et al. (1997) and Cesa-Bianchi (1999) for the full information game, Cesa-Bianchi et al. (2005) for the label efficient game, Auer et al. (2002b) for the bandit game and György and Ottucsák (2006) for the label efficient bandit game. The lower bounds in the last line do not appear in the existing literature, but we prove them in this paper. Apart from the full information game, the upper bounds are usually proved on the pseudo-regret. The upper bounds on the expected regret are obtained by using high probability bounds on the regret. The parameters of the algorithm in the latter bounds usually depend on the confidence level δ that we want to obtain. Thus, to derive bounds on the expected regret, we cannot integrate the deviations but rather have to take δ of order 1/n, which leads to the gaps involving log(n). Table 1 exhibits several logarithmic gaps between upper and lower bounds on the minimax rate, namely:

- a log(K) gap for the minimax pseudo-regret in the bandit game as well as in the label efficient bandit game,

- a log(n) gap for the minimax expected regret in the bandit game as well as in the label efficient bandit game,

- a log(n)/log(K) gap for the minimax expected regret in the label efficient game.

1.5 Contributions of This Work

We reduce the above gaps by improving the upper bounds, as shown by Table 2. Different proof techniques are used and new forecasting strategies are proposed. The most original contribution is the introduction of a new forecaster, INF (Implicitly Normalized Forecaster), for which we propose a unified analysis of its regret in the four games we consider. The analysis is original (it avoids the traditional but scope-limiting argument based on the simplification of a sum of logarithms of ratios), and allows us to fill in the long open gap in the bandit problems with oblivious adversaries (and with general adversaries for the pseudo-regret notion). The analysis also applies to exponentially weighted average forecasters. It


allows us to prove a regret bound of order √(nKS log(nK/S)) when the forecaster's strategy is compared to a strategy allowed to switch S times between arms, while the best previously known bound was √(nKS log(nK)) (Auer, 2002), achieved by a different policy.

An “orthogonal” contribution is to propose a tuning of the parameters of the forecasting policies such that the high probability regret bounds hold for any confidence level (instead of holding just for a single confidence level as in previous works). Bounds on the expected regret that are deduced from these PAC (“probably approximately correct”) regret bounds are better than previous bounds by a logarithmic factor in the games with limited information (see the columns on inf sup E R_n in Tables 1 and 2). The arguments to obtain these bounds are not fundamentally new and rely essentially on a careful use of deviation inequalities for supermartingales. They can be used either in the standard analysis of exponentially weighted average forecasters or in the more general context of INF.

Another “orthogonal” contribution is the proposal of a new biased estimate of the rewards in bandit games, which allows us to achieve high probability regret bounds depending on the performance of the optimal arm: in this new bound, the factor n is replaced by G_max = max_{1≤i≤K} ∑_{t=1}^n g_{i,t}. If the forecaster draws I_t according to the distribution p_t = (p_{1,t}, . . . , p_{K,t}), then the new biased estimate of g_{i,t} is v_{i,t} = (1I_{I_t=i}/β) log( 1 / (1 − β g_{i,t}/p_{i,t}) ). This estimate should be compared to v_{i,t} = g_{i,t} 1I_{I_t=i} / p_{i,t}, for which bounds in terms of G_max exist in expectation, as shown in (Auer et al., 2002b, Section 3), and to v_{i,t} = g_{i,t} 1I_{I_t=i} / p_{i,t} + β/p_{i,t} for some β > 0, for which high probability bounds exist but are expressed with the factor n, and not G_max (see Section 6 of Auer et al., 2002b, and Section 6.8 of Cesa-Bianchi and Lugosi, 2006).
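For concreteness, the three estimates discussed above can be written side by side as follows. This is only an illustrative sketch (the function and variable names are ours, not the paper's): `arm`, `gain`, `probs` and `beta` stand for the drawn arm I_t, the observed gain g_{I_t,t}, the distribution p_t and the bias parameter β, and the log-based estimate is written as reconstructed in the paragraph above.

    import numpy as np

    def bandit_estimates(arm, gain, probs, beta):
        """Illustrative comparison of three gain estimates for one bandit round."""
        K = len(probs)
        unbiased = np.zeros(K)
        unbiased[arm] = gain / probs[arm]        # g_{i,t} 1{I_t=i} / p_{i,t}
        exp3p = unbiased + beta / probs          # adds beta / p_{i,t} to every arm
        log_based = np.zeros(K)                  # new estimate (requires beta*gain < probs[arm])
        log_based[arm] = (1.0 / beta) * np.log(1.0 / (1.0 - beta * gain / probs[arm]))
        return unbiased, exp3p, log_based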

We also propose a unified proof to obtain the lower bounds in Table 1. The contribution of this proof is two-fold. First, it gives the first lower bound for the label efficient bandit game. Second, in the case of the label efficient (full information) game, it is simpler than the proof proposed in Cesa-Bianchi et al. (2005). Indeed, in the latter proof, the authors use Birgé's version of Fano's lemma to prove the lower bound for deterministic forecasters. Then the extension to non-deterministic forecasters is done by a generalization of this information lemma and a decomposition of general forecasters into a convex combination of deterministic forecasters. The benefit of this proof technique is to be able to deal with the cases K = 2 and K = 3, while the basic version of Fano's lemma does not give any information in this case. Here we propose to use Pinsker's inequality for the cases K = 2 and K = 3. This allows us to use the basic version of Fano's lemma and to extend the result to non-deterministic forecasters with a simple application of Fubini's theorem.

The last contribution of this work is also independent of the previous ones and concerns the stochastic bandit game (that is, the bandit game with “stochastic” adversary). We prove that a modification of UCB1 (Auer et al., 2002a) attains the optimal distribution-free rate √(nK) as well as the logarithmic distribution-dependent rate. The key idea, compared to previous works, is to reduce exploration of sufficiently drawn arms.
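Although the modified policy (called MOSS in Section 9) is only analyzed later in the paper, the key idea of reducing the exploration of sufficiently drawn arms can be sketched as follows. This is our own illustrative rendering of an index of the form "empirical mean + √(max(log(n/(K·draws)), 0)/draws)", whose exploration term vanishes once an arm has been drawn more than n/K times; the exact index and constants used in Section 9 may differ.

    import math

    def moss_style_index(mean, draws, n, K):
        """Upper confidence index with exploration shrinking for often-drawn arms."""
        return mean + math.sqrt(max(math.log(n / (K * draws)), 0.0) / draws)

    def run_bandit(pull, n, K):
        """Play n rounds; pull(i) is a hypothetical call returning a reward in [0,1]."""
        sums, draws = [0.0] * K, [0] * K
        for i in range(K):                      # draw each arm once
            sums[i] += pull(i); draws[i] += 1
        for _ in range(n - K):
            idx = max(range(K),
                      key=lambda i: moss_style_index(sums[i] / draws[i], draws[i], n, K))
            sums[idx] += pull(idx); draws[idx] += 1
        return draws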

1.6 Outline

In Section 2, we describe a new class of forecasters, called INF, for prediction games. Then we present a new forecaster inside this class, called Poly INF, for which we propose a general theorem bounding its regret.


                                            inf sup R̄_n      inf sup E R_n     High probability bound on R_n
Label efficient game                        n√(log(K)/m)     n√(log(K)/m)      n√(log(Kδ^{-1})/m)
Bandit game with fully oblivious adversary  √(nK)            √(nK)             √(nK log(δ^{-1}))
Bandit game with oblivious adversary        √(nK)            √(nK)             √(nK/log K) log(Kδ^{-1})
Bandit game with general adversary          √(nK)            √(nK log K)       √(nK/log K) log(Kδ^{-1})
L.E. bandit with deterministic adversary    n√(K/m)          n√(K/m)           n√(K log(δ^{-1})/m)
L.E. bandit with oblivious adversary        n√(K/m)          n√(K/m)           n√(K/(m log K)) log(Kδ^{-1})
L.E. bandit with general adversary          n√(K/m)          n√(K log(K)/m)    n√(K/(m log K)) log(Kδ^{-1})

Table 2: New regret upper bounds proposed in this work. The high probability bounds are for a policy of the forecaster that does not depend on the confidence level δ (unlike previously known high probability bounds).

A more general statement on the regret of any INF can be found in Appendix A. Exponentially weighted average forecasters are a special case of INF, as shown in Section 3. In Section 4, we prove that our forecasters and analysis recover the known results for the full information game.

Section 5 contains the core contributions of the paper, namely all the regret bounds for the limited feedback games. The interest of Poly INF appears in the bandit games, where it satisfies a regret bound without a logarithmic factor, unlike exponentially weighted average forecasters. Section 6 provides high probability bounds in the bandit games that depend on the cumulative reward of the optimal arm: the factor n is replaced by max_{1≤i≤K} ∑_{t=1}^n g_{i,t}. In Section 7, we consider a stronger notion of regret, where we compare ourselves to a strategy allowed to switch between arms a fixed number of times. Section 8 shows how to generalize our results when one considers losses rather than gains, or signed games.

Section 9 considers a framework fundamentally different from the previous sections, namely the stochastic multi-armed bandit problem. There we propose a new forecaster, MOSS, for which we prove an optimal distribution-free rate as well as a logarithmic distribution-dependent rate.

Appendix A contains a general regret upper bound for INF and two useful technical lemmas. Appendix B contains the unified proof of the lower bounds. Appendix C contains the proofs that have not been detailed in the main body of the paper. Finally, Appendix D gathers the different results we have obtained regarding the relation between the expected regret and the pseudo-regret.

2. The Implicitly Normalized Forecaster

In this section, we define a new class of randomized policies for the general prediction game.

Let us consider a continuously differentiable function ψ: R_−^* → R_+^* satisfying

    ψ′ > 0,   lim_{x→−∞} ψ(x) < 1/K,   lim_{x→0} ψ(x) ≥ 1.   (1)


Lemma 1 There exists a continuously differentiable function C: R_+^K → R satisfying, for any x = (x_1, . . . , x_K) ∈ R_+^K,

    max_{1≤i≤K} x_i < C(x) ≤ max_{1≤i≤K} x_i − ψ^{−1}(1/K),   (2)

and

    ∑_{i=1}^K ψ(x_i − C(x)) = 1.   (3)

Proof Consider a fixed x = (x_1, . . . , x_K). The decreasing function φ: c ↦ ∑_{i=1}^K ψ(x_i − c) satisfies

    lim_{c→max_{1≤i≤K} x_i} φ(c) > 1   and   lim_{c→+∞} φ(c) < 1.

From the intermediate value theorem, there is a unique C(x) satisfying φ(C(x)) = 1. From the implicit function theorem, the mapping x ↦ C(x) is continuously differentiable.

INF (Implicitly Normalized Forecaster):

Parameters:

- the continuously differentiable function ψ: R_−^* → R_+^* satisfying (1),
- the estimates v_{i,t} of g_{i,t} based on the (drawn arms and) observed rewards at time t (and before time t).

Let p_1 be the uniform distribution over {1, . . . , K}. For each round t = 1, 2, . . . ,

(1) Draw an arm I_t from the probability distribution p_t.

(2) Use the observed reward(s) to build the estimate v_t = (v_{1,t}, . . . , v_{K,t}) of (g_{1,t}, . . . , g_{K,t}), and let V_t = ∑_{s=1}^t v_s = (V_{1,t}, . . . , V_{K,t}).

(3) Compute the normalization constant C_t = C(V_t).

(4) Compute the new probability distribution p_{t+1} = (p_{1,t+1}, . . . , p_{K,t+1}), where p_{i,t+1} = ψ(V_{i,t} − C_t).

Figure 2: The proposed policy for the general prediction game.

The implicitly normalized forecaster (INF) is defined in Figure 2. Equality (3) makes the fourth step in Figure 2 legitimate. From (2), C(V_t) is roughly equal to max_{1≤i≤K} V_{i,t}. Recall that V_{i,t} is an estimate of the cumulative gain at time t for arm i. This means that INF chooses the probability assigned to arm i as a function of the (estimated) regret. Note that, in spirit, it is similar to the traditional weighted average forecaster, see for example Section 2.1 of Cesa-Bianchi and Lugosi (2006), where the probabilities are proportional to a function of the difference between the (estimated) cumulative reward of arm i and the cumulative reward of the policy, which should be, for a well-performing policy, of order C(V_t).
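To make the procedure of Figure 2 concrete, here is a minimal Python sketch of INF (our illustration, not the authors' code). The environment call `get_gain` and the estimate rule `estimate` are placeholders depending on the game; C(V_t) is computed by bisection on the decreasing function φ(c) = ∑_i ψ(V_i − c), as justified by Lemma 1.

    import numpy as np

    def normalizer(psi, V, tol=1e-10):
        """Return C(V): the unique c > max_i V_i with sum_i psi(V_i - c) = 1 (Lemma 1)."""
        lo, hi = V.max() + 1e-12, V.max() + 1.0
        while psi(V - hi).sum() > 1.0:      # enlarge the bracket until phi(hi) < 1
            hi += hi - V.max() + 1.0
        while hi - lo > tol:                # bisection: phi is decreasing in c
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if psi(V - mid).sum() > 1.0 else (lo, mid)
        return hi

    def inf_forecaster(psi, estimate, get_gain, K, n, rng=np.random.default_rng()):
        """Sketch of the INF policy of Figure 2."""
        V = np.zeros(K)                     # cumulative estimates V_{i,t}
        p = np.full(K, 1.0 / K)             # p_1 is uniform
        total_gain = 0.0
        for t in range(1, n + 1):
            arm = rng.choice(K, p=p)        # (1) draw I_t from p_t
            gain = get_gain(arm, t)         # observed reward g_{I_t,t}
            total_gain += gain
            V += estimate(arm, gain, p, t)  # (2) V_t = V_{t-1} + v_t
            C = normalizer(psi, V)          # (3) C_t = C(V_t)
            p = psi(V - C)                  # (4) p_{i,t+1} = psi(V_{i,t} - C_t)
        return total_gain, p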

The interesting feature of the implicit normalization is the following argument, which allows us to recover the results concerning the exponentially weighted average forecasters, and more interestingly to propose a policy having a regret of order √(nK) in the bandit game with oblivious adversary. First note that ∑_{t=1}^n ∑_{i=1}^K p_{i,t} v_{i,t} roughly evaluates the cumulative reward ∑_{t=1}^n g_{I_t,t} of the policy. In fact, it is exactly the cumulative gain in the bandit game when v_{i,t} = g_{i,t} 1I_{I_t=i} / p_{i,t}, and its expectation is exactly the expected cumulative reward in the full information game when v_{i,t} = g_{i,t}. The argument starts with an Abel transformation and consequently is “orthogonal” to the usual argument given in the beginning of Section C.2. Letting V_0 = 0 ∈ R^K, we have

    ∑_{t=1}^n g_{I_t,t} ≈ ∑_{t=1}^n ∑_{i=1}^K p_{i,t} v_{i,t}
      = ∑_{t=1}^n ∑_{i=1}^K p_{i,t} (V_{i,t} − V_{i,t−1})
      = ∑_{i=1}^K p_{i,n+1} V_{i,n} + ∑_{i=1}^K ∑_{t=1}^n V_{i,t} (p_{i,t} − p_{i,t+1})
      = ∑_{i=1}^K p_{i,n+1} (ψ^{−1}(p_{i,n+1}) + C_n) + ∑_{i=1}^K ∑_{t=1}^n (ψ^{−1}(p_{i,t+1}) + C_t)(p_{i,t} − p_{i,t+1})
      = C_n + ∑_{i=1}^K p_{i,n+1} ψ^{−1}(p_{i,n+1}) + ∑_{i=1}^K ∑_{t=1}^n ψ^{−1}(p_{i,t+1})(p_{i,t} − p_{i,t+1}),

where the remarkable simplification in the last step is closely linked to our specific class of randomized algorithms. The equality is interesting since, from (2), C_n approximates the maximum estimated cumulative reward max_{1≤i≤K} V_{i,n}, which should be close to the cumulative reward of the optimal arm max_{1≤i≤K} G_{i,n}, where G_{i,n} = ∑_{t=1}^n g_{i,t}. Since the last term in the right-hand side satisfies

    ∑_{i=1}^K ∑_{t=1}^n ψ^{−1}(p_{i,t+1})(p_{i,t} − p_{i,t+1}) ≈ −∑_{i=1}^K ∑_{t=1}^n ∫_{p_{i,t}}^{p_{i,t+1}} ψ^{−1}(u) du = −∑_{i=1}^K ∫_{1/K}^{p_{i,n+1}} ψ^{−1}(u) du,   (4)

we obtain

    max_{1≤i≤K} G_{i,n} − ∑_{t=1}^n g_{I_t,t} ≲ −∑_{i=1}^K p_{i,n+1} ψ^{−1}(p_{i,n+1}) + ∑_{i=1}^K ∫_{1/K}^{p_{i,n+1}} ψ^{−1}(u) du.   (5)

The right-hand side is easy to study: it depends only on the final probability vector and has simple upper bounds for adequate choices of ψ. For instance, for ψ(x) = exp(ηx) + γ/K with η > 0 and γ ∈ [0,1), which corresponds to exponentially weighted average forecasters as we will explain in Section 3, the right-hand side is smaller than ((1−γ)/η) log(K/(1−γ)) + γ C_n. For ψ(x) = (η/(−x))^q + γ/K with η > 0, q > 1 and γ ∈ [0,1), which will appear to be a fruitful choice, it is smaller than (q/(q−1)) η K^{1/q} + γ C_n. For the sake of simplicity, we have been hiding the residual terms of (4) coming from the Taylor expansions of the primitive function of ψ^{−1}. However, these terms, when added together (nK terms!), are not that small, and in fact constrain the choice of the parameters γ and η if one wishes to get the tightest bound.
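As a quick sanity check of (5) (our own computation, not part of the original text): for γ = 0 and ψ(x) = exp(ηx), we have ψ^{−1}(u) = log(u)/η, and writing p_i for p_{i,n+1}, the right-hand side of (5) evaluates exactly to the familiar exponential-weights constant:

    −(1/η) ∑_i p_i log p_i + (1/η) ∑_i [u log u − u]_{1/K}^{p_i}
      = −(1/η) ∑_i p_i log p_i + (1/η) ( ∑_i p_i log p_i − 1 + log K + 1 )
      = (log K)/η,

where we used ∑_i p_i = 1.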

The rigorous formulation of (5) is given in Theorem 27, which has been put in Appendix A for the sake of readability. We propose here its specialization to the function ψ(x) = (η/(−x))^q + γ/K with η > 0, q > 1 and γ ∈ [0,1). This function obviously satisfies conditions (1). We will refer to the associated forecasting strategy as “Poly INF”. Here the (normalizing) function C has no closed form expression (this is a consequence of Abel's impossibility theorem). Actually this remark holds in general, hence the name of the general policy. However, this does not lead to a major computational issue since, in the interval given by (2), C(x) is the unique solution of φ(c) = 1, where φ: c ↦ ∑_{i=1}^K ψ(x_i − c) is a decreasing function. We will prove that the Poly INF forecaster generates nicer probability updates than the exponentially weighted average forecaster, as, for bandit games (label efficient or not), it allows us to remove the extraneous log K factor in the pseudo-regret bounds and some regret bounds.
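In code, the two choices of ψ discussed above can be passed directly to the INF sketch given after Figure 2; this is again only an illustration, with placeholder parameter values.

    import numpy as np

    def psi_exp(eta, gamma, K):
        """psi(x) = exp(eta*x) + gamma/K, defined for x < 0 (exponential weights)."""
        return lambda x: np.exp(eta * x) + gamma / K

    def psi_poly(eta, q, gamma, K):
        """psi(x) = (eta/(-x))^q + gamma/K, defined for x < 0 (Poly INF)."""
        return lambda x: (eta / (-x)) ** q + gamma / K

    # e.g. inf_forecaster(psi_poly(eta=10.0, q=2.0, gamma=0.0, K=5), ...)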

Theorem 2 (General regret bound for Poly INF) Let ψ(x) = (η/(−x))^q + γ/K with q > 1, η > 0 and γ ∈ [0,1). Let (v_{i,t})_{1≤i≤K, 1≤t≤n} be a sequence of nonnegative real numbers, B_t = max_{1≤i≤K} v_{i,t}, and B = max_t B_t. If γ = 0 then INF satisfies:

    ( max_{1≤i≤K} ∑_{t=1}^n v_{i,t} ) − ∑_{t=1}^n ∑_{i=1}^K p_{i,t} v_{i,t} ≤ (q/(q−1)) η K^{1/q} + (q/(2η)) exp( 2(q+1)B/η ) ∑_{t=1}^n B_t²,   (6)

and

    ( max_{1≤i≤K} ∑_{t=1}^n v_{i,t} ) − ∑_{t=1}^n ∑_{i=1}^K p_{i,t} v_{i,t} ≤ (q/(q−1)) η K^{1/q} + (qB/η) exp(8qB/η) ∑_{t=1}^n ∑_{i=1}^K p_{i,t} v_{i,t}.   (7)

pi,tvi,t. (7) Forγ >0, if we havevi,t = pct

i,t1Ii=It for some random variablect taking values in[0, c]with 0< c < qη( γ

(q1)K

)(q1)/q

, then

(1−γ) (

1maxiK

n t=1

vi,t )

(1 +γζ)

n t=1

K i=1

pi,tvi,t q

q−1ηK1q, (8) where

ζ = 1

(q1)K

((q1)cKµ(1 +µ) 2γη

)q ,

with

    μ = exp{ (2(q+1)c/η) (K/γ)^{(q−1)/q} ( 1 − (c/(qη)) ((q−1)K/γ)^{(q−1)/q} )^{−q} }.


In all this work, the parameters η, q and γ will be chosen such that ζ and µ act as numerical constants. To derive concrete bounds from the above theorem, most of the work lies in relating the left-hand side with the different notions of regret we consider. This task is trivial for the pseudo-regret. To derive high probability regret bounds, deviation inequalities for supermartingales are used on top of (6) and (8) (which hold with probability one). Finally, the expected regret bounds are obtained by integration of the high probability bounds.

As long as numerical constants do not matter, one can use (7) to recover the bounds obtained from (6). The advantage of (7) over (6) is that it allows us to get regret bounds where the factor n is replaced by G_max = max_{1≤i≤K} G_{i,n}.

3. Exponentially Weighted Average Forecasters

The normalization by division that weighted average forecasters perform is different from the normalization by shift of the real axis that INF performs. Nonetheless, we can recover exactly the exponentially weighted average forecasters, because the exponential function turns the additive shift into a multiplicative renormalization.

Let ψ(x) = exp(ηx) + γ/K with η > 0 and γ ∈ [0,1). Then conditions (1) are clearly satisfied, and (3) is equivalent to exp(−ηC(x)) = (1−γ) / ∑_{i=1}^K exp(ηx_i), which implies

    p_{i,t+1} = (1−γ) exp(ηV_{i,t}) / ∑_{j=1}^K exp(ηV_{j,t}) + γ/K.

In other words, for the full information case (label efficient or not), we recover the exponentially weighted average forecaster (with γ = 0), while for the bandit game we recover EXP3. For the label efficient bandit game, it does not give us the GREEN policy proposed in Allenberg et al. (2006) but rather the straightforward modification of the exponentially weighted average forecaster to this game (György and Ottucsák, 2006). Theorem 3 below gives a unified view on this algorithm for these four games. In the following, we will refer to this algorithm as the “exponentially weighted average forecaster” whatever the game is.
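The closed-form update above can be implemented directly, without the implicit normalization step. The small sketch below (our illustration) computes p_{t+1} from the cumulative estimates V_t; subtracting the maximum before exponentiating does not change the distribution and avoids overflow.

    import numpy as np

    def exp_weights_update(V, eta, gamma):
        """p_{i,t+1} = (1-gamma) * exp(eta*V_i) / sum_j exp(eta*V_j) + gamma/K."""
        K = len(V)
        w = np.exp(eta * (V - V.max()))     # numerically stable exponential weights
        return (1.0 - gamma) * w / w.sum() + gamma / K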

Theorem 3 (General regret bound for the exponentially weighted average forecaster) Let ψ(x) = exp(ηx) + γ/K with η > 0 and γ ∈ [0,1). Let (v_{i,t})_{1≤i≤K, 1≤t≤n} be a sequence of nonnegative real numbers, B_t = max_{1≤i≤K} v_{i,t}, and B = max_{1≤t≤n} B_t. Consider the increasing function Θ: u ↦ (e^u − 1 − u)/u², equal to 1/2 by continuity at zero. If γ = 0 then INF satisfies:

    ( max_{1≤i≤K} ∑_{t=1}^n v_{i,t} ) − ∑_{t=1}^n ∑_{i=1}^K p_{i,t} v_{i,t} ≤ (log K)/η + (η/8) ∑_{t=1}^n B_t²,   (9)

and

    ( max_{1≤i≤K} ∑_{t=1}^n v_{i,t} ) − ∑_{t=1}^n ∑_{i=1}^K p_{i,t} v_{i,t} ≤ (log K)/η + η B Θ(ηB) ∑_{t=1}^n ∑_{i=1}^K p_{i,t} v_{i,t}.   (10)


If we have

    γ ≥ K η Θ(ηB) max_{i,t} p_{i,t} v_{i,t},   (11)

then INF satisfies:

    (1−γ) ( max_{1≤i≤K} ∑_{t=1}^n v_{i,t} ) − ∑_{t=1}^n ∑_{i=1}^K p_{i,t} v_{i,t} ≤ (1−γ) (log K)/η.   (12)

We have the same discussion about (9) and (10) as about (6) and (7): Inequality (10) allows us to prove bounds where the factor n is replaced by G_max = max_{1≤i≤K} G_{i,n}, but at the price of worsened numerical constants when compared to (9). We illustrate this point in Theorem 4, where (13) and (14) respectively come from (9) and (10).

The above theorem relies on the standard argument based on the cancellation of terms in a sum of logarithms of ratios (see Section C.2). For the sake of comparison, we have also applied our general result for INF forecasters, that is, Theorem 27 (see Appendix A). This leads to the same result with worsened constants. Precisely, η/8 becomes (η/2) exp(2ηB) in (9), while Θ(ηB) becomes exp(2Bη)[1 + exp(2Bη)]/2 in (11). This seems to be the price for having a theorem applying to a large class of forecasters.

4. The Full Information (FI) Game

The purpose of this section is to illustrate the general regret bounds given in Theorems 2 and 3 in the simplest case, when we set v_{i,t} = g_{i,t}, which is possible since the rewards of all arms are observed in the full information setting. The next theorem is given explicitly to show an easy application of Inequalities (9) and (10).

Theorem 4 (Exponentially weighted average forecaster in the FI game) Let ψ(x) = exp(ηx) with η > 0. Let v_{i,t} = g_{i,t}. Then in the full information game, INF satisfies

    max_{1≤i≤K} ∑_{t=1}^n g_{i,t} − ∑_{t=1}^n ∑_{i=1}^K p_{i,t} g_{i,t} ≤ (log K)/η + ηn/8,   (13)

and

    max_{1≤i≤K} ∑_{t=1}^n g_{i,t} − ∑_{t=1}^n ∑_{i=1}^K p_{i,t} g_{i,t} ≤ (log K)/η + η Θ(η) ∑_{t=1}^n ∑_{i=1}^K p_{i,t} g_{i,t}.   (14)

In particular, with η = √(8 log(K)/n), we get E R_n ≤ √(n log(K)/2), and there exists η > 0 such that E R_n ≤ √(2 E G_max log K).

Proof It comes from (9) and (10), since we have B ≤ 1 and ∑_{t=1}^n B_t² ≤ n. The only nontrivial result is the last inequality. It obviously holds for any η when E G_max = 0, and is achieved for η = log(1 + √(2(log K)/E G_max)) when E G_max > 0. Indeed, by taking the expectation in (14), we get

    E ∑_{t=1}^n ∑_{i=1}^K p_{i,t} g_{i,t} ≥ (η E G_max − log K) / (exp(η) − 1)
      = log(1 + √(2 log(K)/E G_max)) √((E G_max)³/(2 log K)) − √(E G_max log(K)/2)
      ≥ E G_max − 2 √(E G_max log(K)/2),

where we use log(1+x) ≥ x − x²/2 for any x ≥ 0 in the last inequality. Hence E R_n = E G_max − E ∑_{t=1}^n ∑_{i=1}^K p_{i,t} g_{i,t} ≤ √(2 E G_max log K).

Now we consider a new algorithm for the FI game, that is, INF with ψ(x) = (η/(−x))^q and v_{i,t} = g_{i,t}.

Theorem 5 (Poly INF in the FI game) Let ψ(x) = (η/(−x))^q with η > 0 and q > 1. Let v_{i,t} = g_{i,t}. Then in the full information game, INF satisfies:

    max_{1≤i≤K} ∑_{t=1}^n g_{i,t} − ∑_{t=1}^n ∑_{i=1}^K p_{i,t} g_{i,t} ≤ (q/(q−1)) η K^{1/q} + exp(4q/η) qn/(2η).   (15)

In particular, with q = 3 log K and η = 1.8 √(n log K), we get E R_n ≤ 7 √(n log K).

Proof It comes from (6), q + 1 ≤ 2q and ∑_{t=1}^n B_t² ≤ n.

Remark 6 By using the Hoeffding-Azuma inequality (see, e.g., Lemma A.7 of Cesa-Bianchi and Lugosi, 2006), one can derive high probability bounds from (13) and (15): for instance, from (15), for any δ > 0, with probability at least 1−δ, Poly INF satisfies:

    R_n ≤ (q/(q−1)) η K^{1/q} + exp(4q/η) qn/(2η) + √(n log(δ^{−1})/2).

5. The Limited Feedback Games

This section provides regret bounds for three limited feedback games: the label efficient game, the bandit game, and the mixed game, that is the label efficient bandit game.

5.1 Label Efficient Game (LE)

The variants of the LE game consider that the number of queried reward vectors is constrained either strictly or just in expectation. This section considers successively these two cases.


5.1.1 Constraint on the Expected Number of Queried Reward Vectors

As in Section 4, the purpose of this section is to show how to use INF in order to recover known minimax bounds (up to constant factors) in a slight modification of the LE game: the simple LE game, in which the requirement is that the expected number of queried reward vectors should be less than or equal to m.

Let us consider the following policy. At each round, we draw a Bernoulli random variable Z_t, with parameter ε = m/n, to decide whether we ask for the gains or not. Note that we do not fulfill exactly the requirement of the LE game, as we might ask for a bit more than m reward vectors, but we do fulfill the requirement of the simple LE game. We do so in order to avoid technical details and focus on the main argument of the proof. The exact LE game will be addressed in Section 5.1.2, where, in addition, we will prove bounds on the expected regret E R_n instead of just the pseudo-regret R̄_n.

In this section, the estimate of g_{i,t} is v_{i,t} = (g_{i,t}/ε) Z_t, which is observable since the rewards at time t for all arms are observed when Z_t = 1.
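An elementary check, spelled out here for convenience: since Z_t is drawn independently of the rewards at round t and E Z_t = ε, the estimate is unbiased,

    E v_{i,t} = g_{i,t} E[Z_t]/ε = g_{i,t},

and, since g_{i,t} ≤ 1, it satisfies v_{i,t} ≤ Z_t/ε, which is the bound on B_t used in the proof of Theorem 7 below.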

Theorem 7 (Exponentially weighted average forecaster in the simple LE game) Let ψ(x) = exp(ηx) with η = √(8 m log K)/n. Let v_{i,t} = (g_{i,t}/ε) Z_t with ε = m/n. Then in the simple LE game, INF satisfies

    R̄_n ≤ n √(log(K)/(2m)).

Proof The first inequality comes from (9). Since we have B_t ≤ Z_t/ε and v_{i,t} = (g_{i,t}/ε) Z_t, we obtain

    ( max_{1≤i≤K} ∑_{t=1}^n g_{i,t} Z_t/ε ) − ∑_{t=1}^n ∑_{i=1}^K p_{i,t} g_{i,t} Z_t/ε ≤ (log K)/η + (η/(8ε²)) ∑_{t=1}^n Z_t,

hence, by taking the expectation of both sides,

    R̄_n = max_{1≤i≤K} E ∑_{t=1}^n g_{i,t} Z_t/ε − E ∑_{t=1}^n ∑_{i=1}^K p_{i,t} g_{i,t} Z_t/ε ≤ (log K)/η + ηn/(8ε) = (log K)/η + n²η/(8m).

Straightforward computations conclude the proof.
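For completeness, the “straightforward computations” are just the evaluation of the bound at the stated value of η (our own one-line check): with η = √(8 m log K)/n,

    (log K)/η + n²η/(8m) = (n/2) √(log(K)/(2m)) + (n/2) √(log(K)/(2m)) = n √(log(K)/(2m)).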

A similar result can be proved for the INF forecaster with ψ(x) = (η/(−x))^q, η > 0 and q of order log K. We do not state it since we will prove a stronger result in the next section.

5.1.2 Hard Constraint on the Number of Queried Reward Vectors

The goal of this section is to push the idea that, by using high probability bounds as an intermediate step, one can control the expected regret E R_n = E max_{1≤i≤K} ∑_{t=1}^n (g_{i,t} − g_{I_t,t}) instead of just the pseudo-regret R̄_n = max_{1≤i≤K} E ∑_{t=1}^n (g_{i,t} − g_{I_t,t}). Most previous works have obtained results for R̄_n. These results are interesting for oblivious opponents, that is, when the adversary's choices of the rewards do not depend on the past draws and obtained rewards, since in this case Proposition 33 in Appendix D shows that one can extend bounds on the pseudo-regret R̄_n to the expected regret E R_n. For non-oblivious opponents, upper bounds on R̄_n are rather weak statements, and high probability bounds on R_n or bounds on E R_n are desirable. In Auer (2002) and Cesa-Bianchi and Lugosi (2006), high probability bounds on R_n have been given. Unfortunately, the policies proposed there depend on the confidence level of the bound. As a consequence, the resulting best bound on E R_n, obtained by choosing the policies with a confidence level parameter of order 1/n, has an extraneous log n term. Specifically, from Theorem 6.2 of Cesa-Bianchi and Lugosi (2006), one can immediately derive E R_n ≤ 8n √((log(4K) + log(n))/m) + 1. The theorems of this section essentially show that the log n term can be removed.

As in Section 5.1.1, we still use a draw of a Bernoulli random variable Z_t to decide whether we ask for the gains or not. The difference is that, if ∑_{s=1}^{t−1} Z_s ≥ m, we do not ask for the gains (as we are not allowed to do so). To prevent this last constraint from interfering with the analysis, the parameter of the Bernoulli random variable is set to ε = 3m/(4n), and the probability of the event ∑_{t=1}^n Z_t > m is upper bounded. The estimate of g_{i,t} remains v_{i,t} = (g_{i,t}/ε) Z_t.
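In code, the query rule of this section amounts to the following sketch (our illustration; `reveal_gains` is a hypothetical call returning the full reward vector when a query is made, and the handling of the exhausted budget is simplified):

    import numpy as np

    def le_estimates(t, K, m, n, queries_used, reveal_gains, rng):
        """Label efficient estimates v_t under the hard budget of m queries."""
        eps = 3.0 * m / (4.0 * n)                   # parameter of the Bernoulli Z_t
        Z = 1 if (queries_used < m and rng.random() < eps) else 0
        v = np.zeros(K)
        if Z == 1:                                  # query made: all gains observed
            v = np.asarray(reveal_gains(t)) / eps   # v_{i,t} = g_{i,t} Z_t / eps
        return v, queries_used + Z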

Theorem 8 (Exponentially weighted average forecaster in the LE game) Let ψ(x) = exp(ηx) with η = √(m log K)/n. Let v_{i,t} = (g_{i,t}/ε) Z_t with ε = 3m/(4n). Then in the LE game, for any δ > 0, with probability at least 1−δ, INF satisfies:

    R_n ≤ n √(27 log(2Kδ^{−1})/m),

and

    E R_n ≤ n √(27 log(6K)/m).

Theorem 9 (Poly INF in the LE game) Let ψ(x) = (η/(−x))^q with q = 3 log(2K) and η = 2n √(log(2K)/m). Let v_{i,t} = (g_{i,t}/ε) Z_t with ε = 3m/(4n). Then in the LE game, for any δ > 0, with probability at least 1−δ, INF satisfies:

    R_n ≤ (8 − √27) n √(log(2K)/m) + n √(27 log(2Kδ^{−1})/m),

and

    E R_n ≤ 8n √(log(6K)/m).

5.2 Bandit Game

This section is cut into two parts. In the first one, from Theorem 2 and Theorem 3, we derive upper bounds on the pseudo-regret R̄_n = max_{1≤i≤K} E ∑_{t=1}^n (g_{i,t} − g_{I_t,t}). To bound the expected regret E R_n = E max_{1≤i≤K} ∑_{t=1}^n (g_{i,t} − g_{I_t,t}), we will then use high probability bounds on top of these theorems. Since this makes the proofs more intricate, we have chosen to provide the less general but easier-to-obtain results in Section 5.2.1, and the more general ones in Section 5.2.2.

The main results here are that, by using INF with a polynomial function ψ, we obtain an upper bound of order √(nK) for R̄_n, which implies a bound of order √(nK) on E R_n for oblivious adversaries (Proposition 33 in Appendix D). In the general case (containing
