
Regret Bounds and Minimax Policies under Partial Monitoring



HAL Id: hal-00654356

https://hal-enpc.archives-ouvertes.fr/hal-00654356

Submitted on 21 Dec 2011


Jean-Yves Audibert, Sébastien Bubeck

To cite this version:

Jean-Yves Audibert, Sébastien Bubeck. Regret Bounds and Minimax Policies under Partial Monitoring. Journal of Machine Learning Research, Microtome Publishing, 2010, 11, pp. 2785-2836. ⟨hal-00654356⟩


Regret Bounds and Minimax Policies under Partial Monitoring

Jean-Yves Audibert audibert@imagine.enpc.fr

Imagine, Université Paris Est, 6 avenue Blaise Pascal

77455 Champs-sur-Marne, France

Sébastien Bubeck sebastien.bubeck@inria.fr

SequeL Project, INRIA Lille 40 avenue Halley

59650 Villeneuve d’Ascq, France

Editor: Nicolò Cesa-Bianchi

Abstract

This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient, as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster), based on an arbitrary function ψ, for which we propose a unified analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x) = exp(ηx) + γ/K, INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results, while for the expected regret we slightly tighten the bounds. On the other hand, with ψ(x) = (η/(−x))^q + γ/K, which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandit games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action.

Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays.

Keywords: Bandits (adversarial and stochastic), regret bound, minimax rate, label efficient, upper confidence bound (UCB) policy, online learning, prediction with limited feedback.

1. Introduction

This section starts by defining the prediction tasks, the different regret notions that we will consider, and the different adversaries of the forecaster. We will then recap existing lower and upper regret bounds for the different settings, and give an overview of our contributions.

Also at Willow, CNRS/ENS/INRIA, UMR 8548.


Parameters: the number of arms (or actions) K and the number of rounds n, with n ≥ K ≥ 2.

For each round t = 1, 2, . . . , n:

(1) The forecaster chooses an arm I_t ∈ {1, . . . , K}, possibly with the help of an external randomization.

(2) Simultaneously the adversary chooses a gain vector g_t = (g_{1,t}, . . . , g_{K,t}) ∈ [0,1]^K (see Section 8 for loss games or signed games).

(3) The forecaster receives the gain g_{I_t,t} (without systematically observing it). He observes

- the reward vector (g_{1,t}, . . . , g_{K,t}) in the full information game,
- the reward vector (g_{1,t}, . . . , g_{K,t}) if he asks for it, with the global constraint that he is not allowed to ask it more than m times for some fixed integer 1 ≤ m ≤ n; this prediction game is the label efficient game,
- only g_{I_t,t} in the bandit game,
- only his obtained reward g_{I_t,t} if he asks for it, with the global constraint that he is not allowed to ask it more than m times for some fixed integer 1 ≤ m ≤ n; this prediction game is the label efficient bandit game.

Goal: The forecaster tries to maximize his cumulative gain ∑_{t=1}^n g_{I_t,t}.

Figure 1: The four prediction tasks considered in this work.

1.1 The Four Prediction Tasks

We consider a general prediction game where, at each stage, a forecaster (or decision maker) chooses one action (or arm) and receives a reward from it. Then the forecaster receives a feedback about the rewards which he can use to make his choice at the next stage. His goal is to maximize his cumulative gain. In the simplest version, after choosing an arm the forecaster observes the rewards of all arms; this is the so-called full information game. In the label efficient game, originally proposed by Helmbold and Panizza (1997), after choosing his action at a stage, the forecaster decides whether to ask for the rewards of the different actions at this stage, knowing that he is allowed to do so only a limited number of times. Another classical setting is the bandit game, where the forecaster only observes the reward of the arm he has chosen. In its original version (Robbins, 1952), this game was considered in a stochastic setting, that is, nature draws the rewards from a fixed product-distribution.

Later it was considered in an adversarial framework (Auer et al., 1995), where there is an adversary choosing the rewards on the arms. A combination of the two previous settings is the label efficient bandit game (György and Ottucsák, 2006), in which the only observed rewards are the ones obtained and asked for by the forecaster, with again a limitation on the number of possible queries. These four games are described more precisely in Figure 1.

Their Hannan consistency has been considered in Allenberg et al. (2006) in the case of unbounded losses. Here we will focus on regret upper bounds and minimax policies for bounded losses.

1.2 Regret and Pseudo-regret

A natural way to assess the performance of a forecaster is to compute his regret with respect to the best action in hindsight (see Section 7 for a more general regret in which we compare to the best switching strategy having a fixed number of action-switches):

    R_n = max_{1≤i≤K} ∑_{t=1}^n (g_{i,t} − g_{I_t,t}).

A lot of attention has been drawn by the characterization of the minimax expected regret in the different games we have described. More precisely, for a given game, let us write sup for the supremum over all allowed adversaries and inf for the infimum over all forecaster strategies for this game. We are interested in the quantity

    inf sup E R_n,

where the expectation is with respect to the possible randomization of the forecaster and the adversary. Another related quantity which can be easier to handle is the pseudo-regret:

    R̄_n = max_{1≤i≤K} E ∑_{t=1}^n (g_{i,t} − g_{I_t,t}).

Note that, by Jensen’s inequality, the pseudo-regret is always smaller than the expected regret. In Appendix D we discuss cases where the converse inequality holds (up to an additional term).

1.3 The Different Adversaries

The simplest adversary is the deterministic one. It is characterized by a fixed matrix of nK rewards corresponding to (g_{i,t})_{1≤i≤K, 1≤t≤n}. Another adversary is the “stochastic” one, in which the reward vectors are independent and have the same distribution.1 This adversary is characterized by a distribution on [0,1]^K, corresponding to the common distribution of g_t, t = 1, . . . , n. A more general adversary is the fully oblivious one, in which the reward vectors are independent. Here the adversary is characterized by n distributions on [0,1]^K corresponding to the distributions of g_1, . . . , g_n. Deterministic and stochastic adversaries are fully oblivious adversaries.

An even more general adversary is the oblivious one, in which the only constraint on the adversary is that the reward vectors are independent of the past decisions of the forecaster.

The most general adversary is the one who may choose the reward vector g_t as a function of the past decisions I_1, . . . , I_{t−1} (non-oblivious adversary).

1. The term “stochastic” can be a bit misleading since the assumption is not just stochasticity but rather an i.i.d. assumption.


                              inf sup R̄_n                            inf sup E R_n
                              Lower bound      Upper bound           Lower bound      Upper bound
Full information game         √(n log K)       √(n log K)            √(n log K)       √(n log K)
Label efficient game          n√(log(K)/m)     n√(log(K)/m)          n√(log(K)/m)     n√(log(n)/m)
Bandit game                   √(nK)            √(nK log K)           √(nK)            √(nK log n)
Bandit label efficient game   n√(K/m)          n√(K log(K)/m)        n√(K/m)          n√(K log(n)/m)

Table 1: Existing bounds (apart from the lower bounds in the last line, which are proved in this paper) on the pseudo-regret and the expected regret. Except for the full information game, there are logarithmic gaps between lower and upper bounds.

1.4 Known Regret Bounds

Table 1 recaps existing lower and upper bounds on the minimax pseudo-regret and the minimax expected regret for general adversaries (i.e., possibly non-oblivious ones). For the first three lines, we refer the reader to the book of Cesa-Bianchi and Lugosi (2006) and references therein, particularly Cesa-Bianchi et al. (1997) and Cesa-Bianchi (1999) for the full information game, Cesa-Bianchi et al. (2005) for the label efficient game, Auer et al. (2002b) for the bandit game and György and Ottucsák (2006) for the label efficient bandit game. The lower bounds in the last line do not appear in the existing literature, but we prove them in this paper. Apart from the full information game, the upper bounds are usually proved on the pseudo-regret. The upper bounds on the expected regret are obtained by using high probability bounds on the regret. The parameters of the algorithm in the latter bounds usually depend on the confidence level δ that we want to obtain. Thus, to derive bounds on the expected regret, we cannot integrate the deviations but rather have to take δ of order 1/n, which leads to the gaps involving log(n). Table 1 exhibits several logarithmic gaps between upper and lower bounds on the minimax rate, namely:

- a log(K) gap for the minimax pseudo-regret in the bandit game as well as in the label efficient bandit game,

- a log(n) gap for the minimax expected regret in the bandit game as well as in the label efficient bandit game,

- a log(n)/log(K) gap for the minimax expected regret in the label efficient game.

1.5 Contributions of This Work

We reduce the above gaps by improving the upper bounds, as shown by Table 2. Different proof techniques are used and new forecasting strategies are proposed. The most original contribution is the introduction of a new forecaster, INF (Implicitly Normalized Forecaster), for which we propose a unified analysis of its regret in the four games we consider. The analysis is original (it avoids the traditional but scope-limiting argument based on the simplification of a sum of logarithms of ratios), and allows us to fill in the long open gap in the bandit problems with oblivious adversaries (and with general adversaries for the pseudo-regret notion). The analysis also applies to exponentially weighted average forecasters. It


allows us to prove a regret bound of order √(nKS log(nK/S)) when the forecaster's strategy is compared to a strategy allowed to switch S times between arms, while the best previously known bound was √(nKS log(nK)) (Auer, 2002), achieved by a different policy.

An “orthogonal” contribution is to propose a tuning of the parameters of the forecasting policies such that the high probability regret bounds hold for any confidence level (instead of holding just for a single confidence level as in previous works). Bounds on the expected regret that are deduced from these PAC (“probably approximately correct”) regret bounds are better than previous bounds by a logarithmic factor in the games with limited information (see the columns on inf sup E R_n in Tables 1 and 2). The arguments to obtain these bounds are not fundamentally new and rely essentially on a careful use of deviation inequalities for supermartingales. They can be used either in the standard analysis of exponentially weighted average forecasters or in the more general context of INF.

Another “orthogonal” contribution is the proposal of a new biased estimate of the rewards in bandit games, which allows us to achieve high probability regret bounds depending on the performance of the optimal arm: in this new bound, the factor n is replaced by G_max = max_{1≤i≤K} ∑_{t=1}^n g_{i,t}. If the forecaster draws I_t according to the distribution p_t = (p_{1,t}, . . . , p_{K,t}), then the new biased estimate of g_{i,t} is v_{i,t} = (1I_{I_t=i}/β) log( 1 / (1 − β g_{i,t}/p_{i,t}) ). This estimate should be compared to v_{i,t} = g_{i,t} 1I_{I_t=i} / p_{i,t}, for which bounds in terms of G_max exist in expectation, as shown in (Auer et al., 2002b, Section 3), and to v_{i,t} = g_{i,t} 1I_{I_t=i} / p_{i,t} + β/p_{i,t} for some β > 0, for which high probability bounds exist but are expressed with the factor n, and not G_max (see Section 6 of Auer et al., 2002b, and Section 6.8 of Cesa-Bianchi and Lugosi, 2006).
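For concreteness, the three estimates discussed above can be written side by side as follows. This is only an illustrative sketch (the function and variable names are ours, not the paper's): `arm`, `gain`, `probs` and `beta` stand for the drawn arm I_t, the observed gain g_{I_t,t}, the distribution p_t and the bias parameter β, and the log-based estimate is written as reconstructed in the paragraph above.

    import numpy as np

    def bandit_estimates(arm, gain, probs, beta):
        """Illustrative comparison of three gain estimates for one bandit round."""
        K = len(probs)
        unbiased = np.zeros(K)
        unbiased[arm] = gain / probs[arm]        # g_{i,t} 1{I_t=i} / p_{i,t}
        exp3p = unbiased + beta / probs          # adds beta / p_{i,t} to every arm
        log_based = np.zeros(K)                  # new estimate (requires beta*gain < probs[arm])
        log_based[arm] = (1.0 / beta) * np.log(1.0 / (1.0 - beta * gain / probs[arm]))
        return unbiased, exp3p, log_based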

We also propose a unified proof to obtain the lower bounds in Table 1. The contribution of this proof is two-fold. First, it gives the first lower bound for the label efficient bandit game. Second, in the case of the label efficient (full information) game, it is simpler than the proof proposed in Cesa-Bianchi et al. (2005). Indeed, in the latter proof, the authors use Birgé's version of Fano's lemma to prove the lower bound for deterministic forecasters. Then the extension to non-deterministic forecasters is done by a generalization of this information lemma and a decomposition of general forecasters into a convex combination of deterministic forecasters. The benefit of this proof technique is to be able to deal with the cases K = 2 and K = 3, while the basic version of Fano's lemma does not give any information in this case. Here we propose to use Pinsker's inequality for the cases K = 2 and K = 3. This allows us to use the basic version of Fano's lemma and to extend the result to non-deterministic forecasters with a simple application of Fubini's theorem.

The last contribution of this work is also independent of the previous ones and concerns the stochastic bandit game (that is, the bandit game with “stochastic” adversary). We prove that a modification of UCB1 (Auer et al., 2002a) attains the optimal distribution-free rate √(nK) as well as the logarithmic distribution-dependent rate. The key idea, compared to previous works, is to reduce exploration of sufficiently drawn arms.
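Although the modified policy (called MOSS in Section 9) is only analyzed later in the paper, the key idea of reducing the exploration of sufficiently drawn arms can be sketched as follows. This is our own illustrative rendering of an index of the form "empirical mean + √(max(log(n/(K·draws)), 0)/draws)", whose exploration term vanishes once an arm has been drawn more than n/K times; the exact index and constants used in Section 9 may differ.

    import math

    def moss_style_index(mean, draws, n, K):
        """Upper confidence index with exploration shrinking for often-drawn arms."""
        return mean + math.sqrt(max(math.log(n / (K * draws)), 0.0) / draws)

    def run_bandit(pull, n, K):
        """Play n rounds; pull(i) is a hypothetical call returning a reward in [0,1]."""
        sums, draws = [0.0] * K, [0] * K
        for i in range(K):                      # draw each arm once
            sums[i] += pull(i); draws[i] += 1
        for _ in range(n - K):
            idx = max(range(K),
                      key=lambda i: moss_style_index(sums[i] / draws[i], draws[i], n, K))
            sums[idx] += pull(idx); draws[idx] += 1
        return draws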

1.6 Outline

In Section 2, we describe a new class of forecasters, called INF, for prediction games. Then we present a new forecaster inside this class, called Poly INF, for which we propose a general theorem bounding its regret.


                                            inf sup R̄_n      inf sup E R_n     High probability bound on R_n
Label efficient game                        n√(log(K)/m)     n√(log(K)/m)      n√(log(Kδ^{-1})/m)
Bandit game with fully oblivious adversary  √(nK)            √(nK)             √(nK log(δ^{-1}))
Bandit game with oblivious adversary        √(nK)            √(nK)             √(nK/log K) log(Kδ^{-1})
Bandit game with general adversary          √(nK)            √(nK log K)       √(nK/log K) log(Kδ^{-1})
L.E. bandit with deterministic adversary    n√(K/m)          n√(K/m)           n√(K log(δ^{-1})/m)
L.E. bandit with oblivious adversary        n√(K/m)          n√(K/m)           n√(K/(m log K)) log(Kδ^{-1})
L.E. bandit with general adversary          n√(K/m)          n√(K log(K)/m)    n√(K/(m log K)) log(Kδ^{-1})

Table 2: New regret upper bounds proposed in this work. The high probability bounds are for a policy of the forecaster that does not depend on the confidence level δ (unlike previously known high probability bounds).

A more general statement on the regret of any INF can be found in Appendix A. Exponentially weighted average forecasters are a special case of INF, as shown in Section 3. In Section 4, we prove that our forecasters and analysis recover the known results for the full information game.

Section 5 contains the core contributions of the paper, namely all the regret bounds for the limited feedback games. The interest of Poly INF appears in the bandit games, where it satisfies a regret bound without a logarithmic factor, unlike exponentially weighted average forecasters. Section 6 provides high probability bounds in the bandit games that depend on the cumulative reward of the optimal arm: the factor n is replaced by max_{1≤i≤K} ∑_{t=1}^n g_{i,t}. In Section 7, we consider a stronger notion of regret, where we compare ourselves to a strategy allowed to switch between arms a fixed number of times. Section 8 shows how to generalize our results when one considers losses rather than gains, or signed games.

Section 9 considers a framework fundamentally different from the previous sections, namely the stochastic multi-armed bandit problem. There we propose a new forecaster, MOSS, for which we prove an optimal distribution-free rate as well as a logarithmic distribution-dependent rate.

Appendix A contains a general regret upper bound for INF and two useful technical lemmas. Appendix B contains the unified proof of the lower bounds. Appendix C contains the proofs that have not been detailed in the main body of the paper. Finally, Appendix D gathers the different results we have obtained regarding the relation between the expected regret and the pseudo-regret.

2. The Implicitly Normalized Forecaster

In this section, we define a new class of randomized policies for the general prediction game.

Let us consider a continuously differentiable function ψ: R_−^* → R_+^* satisfying

    ψ′ > 0,   lim_{x→−∞} ψ(x) < 1/K,   lim_{x→0} ψ(x) ≥ 1.   (1)


Lemma 1 There exists a continuously differentiable function C: R_+^K → R satisfying, for any x = (x_1, . . . , x_K) ∈ R_+^K,

    max_{1≤i≤K} x_i < C(x) ≤ max_{1≤i≤K} x_i − ψ^{−1}(1/K),   (2)

and

    ∑_{i=1}^K ψ(x_i − C(x)) = 1.   (3)

Proof Consider a fixed x = (x_1, . . . , x_K). The decreasing function φ: c ↦ ∑_{i=1}^K ψ(x_i − c) satisfies

    lim_{c→max_{1≤i≤K} x_i} φ(c) > 1   and   lim_{c→+∞} φ(c) < 1.

From the intermediate value theorem, there is a unique C(x) satisfying φ(C(x)) = 1. From the implicit function theorem, the mapping x ↦ C(x) is continuously differentiable.

INF (Implicitly Normalized Forecaster):

Parameters:

- the continuously differentiable function ψ: R_−^* → R_+^* satisfying (1),
- the estimates v_{i,t} of g_{i,t} based on the (drawn arms and) observed rewards at time t (and before time t).

Let p_1 be the uniform distribution over {1, . . . , K}. For each round t = 1, 2, . . . ,

(1) Draw an arm I_t from the probability distribution p_t.

(2) Use the observed reward(s) to build the estimate v_t = (v_{1,t}, . . . , v_{K,t}) of (g_{1,t}, . . . , g_{K,t}), and let V_t = ∑_{s=1}^t v_s = (V_{1,t}, . . . , V_{K,t}).

(3) Compute the normalization constant C_t = C(V_t).

(4) Compute the new probability distribution p_{t+1} = (p_{1,t+1}, . . . , p_{K,t+1}), where p_{i,t+1} = ψ(V_{i,t} − C_t).

Figure 2: The proposed policy for the general prediction game.

The implicitly normalized forecaster (INF) is defined in Figure 2. Equality (3) makes the fourth step in Figure 2 legitimate. From (2), C(V_t) is roughly equal to max_{1≤i≤K} V_{i,t}. Recall that V_{i,t} is an estimate of the cumulative gain at time t for arm i. This means that INF chooses the probability assigned to arm i as a function of the (estimated) regret. Note that, in spirit, it is similar to the traditional weighted average forecaster, see for example Section 2.1 of Cesa-Bianchi and Lugosi (2006), where the probabilities are proportional to a function of the difference between the (estimated) cumulative reward of arm i and the cumulative reward of the policy, which should be, for a well-performing policy, of order C(V_t).
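To make the procedure of Figure 2 concrete, here is a minimal Python sketch of INF (our illustration, not the authors' code). The environment call `get_gain` and the estimate rule `estimate` are placeholders depending on the game; C(V_t) is computed by bisection on the decreasing function φ(c) = ∑_i ψ(V_i − c), as justified by Lemma 1.

    import numpy as np

    def normalizer(psi, V, tol=1e-10):
        """Return C(V): the unique c > max_i V_i with sum_i psi(V_i - c) = 1 (Lemma 1)."""
        lo, hi = V.max() + 1e-12, V.max() + 1.0
        while psi(V - hi).sum() > 1.0:      # enlarge the bracket until phi(hi) < 1
            hi += hi - V.max() + 1.0
        while hi - lo > tol:                # bisection: phi is decreasing in c
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if psi(V - mid).sum() > 1.0 else (lo, mid)
        return hi

    def inf_forecaster(psi, estimate, get_gain, K, n, rng=np.random.default_rng()):
        """Sketch of the INF policy of Figure 2."""
        V = np.zeros(K)                     # cumulative estimates V_{i,t}
        p = np.full(K, 1.0 / K)             # p_1 is uniform
        total_gain = 0.0
        for t in range(1, n + 1):
            arm = rng.choice(K, p=p)        # (1) draw I_t from p_t
            gain = get_gain(arm, t)         # observed reward g_{I_t,t}
            total_gain += gain
            V += estimate(arm, gain, p, t)  # (2) V_t = V_{t-1} + v_t
            C = normalizer(psi, V)          # (3) C_t = C(V_t)
            p = psi(V - C)                  # (4) p_{i,t+1} = psi(V_{i,t} - C_t)
        return total_gain, p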

The interesting feature of the implicit normalization is the following argument, which allows us to recover the results concerning the exponentially weighted average forecasters, and more interestingly to propose a policy having a regret of order √(nK) in the bandit game with oblivious adversary. First note that ∑_{t=1}^n ∑_{i=1}^K p_{i,t} v_{i,t} roughly evaluates the cumulative reward ∑_{t=1}^n g_{I_t,t} of the policy. In fact, it is exactly the cumulative gain in the bandit game when v_{i,t} = g_{i,t} 1I_{I_t=i} / p_{i,t}, and its expectation is exactly the expected cumulative reward in the full information game when v_{i,t} = g_{i,t}. The argument starts with an Abel transformation and consequently is “orthogonal” to the usual argument given in the beginning of Section C.2. Letting V_0 = 0 ∈ R^K, we have

    ∑_{t=1}^n g_{I_t,t} ≈ ∑_{t=1}^n ∑_{i=1}^K p_{i,t} v_{i,t}
      = ∑_{t=1}^n ∑_{i=1}^K p_{i,t} (V_{i,t} − V_{i,t−1})
      = ∑_{i=1}^K p_{i,n+1} V_{i,n} + ∑_{i=1}^K ∑_{t=1}^n V_{i,t} (p_{i,t} − p_{i,t+1})
      = ∑_{i=1}^K p_{i,n+1} (ψ^{−1}(p_{i,n+1}) + C_n) + ∑_{i=1}^K ∑_{t=1}^n (ψ^{−1}(p_{i,t+1}) + C_t)(p_{i,t} − p_{i,t+1})
      = C_n + ∑_{i=1}^K p_{i,n+1} ψ^{−1}(p_{i,n+1}) + ∑_{i=1}^K ∑_{t=1}^n ψ^{−1}(p_{i,t+1})(p_{i,t} − p_{i,t+1}),

where the remarkable simplification in the last step is closely linked to our specific class of randomized algorithms. The equality is interesting since, from (2), C_n approximates the maximum estimated cumulative reward max_{1≤i≤K} V_{i,n}, which should be close to the cumulative reward of the optimal arm max_{1≤i≤K} G_{i,n}, where G_{i,n} = ∑_{t=1}^n g_{i,t}. Since the last term in the right-hand side satisfies

    ∑_{i=1}^K ∑_{t=1}^n ψ^{−1}(p_{i,t+1})(p_{i,t} − p_{i,t+1}) ≈ −∑_{i=1}^K ∑_{t=1}^n ∫_{p_{i,t}}^{p_{i,t+1}} ψ^{−1}(u) du = −∑_{i=1}^K ∫_{1/K}^{p_{i,n+1}} ψ^{−1}(u) du,   (4)

we obtain

    max_{1≤i≤K} G_{i,n} − ∑_{t=1}^n g_{I_t,t} ≲ −∑_{i=1}^K p_{i,n+1} ψ^{−1}(p_{i,n+1}) + ∑_{i=1}^K ∫_{1/K}^{p_{i,n+1}} ψ^{−1}(u) du.   (5)

The right-hand side is easy to study: it depends only on the final probability vector and has simple upper bounds for adequate choices of ψ. For instance, for ψ(x) = exp(ηx) + γ/K with η > 0 and γ ∈ [0,1), which corresponds to exponentially weighted average forecasters as we will explain in Section 3, the right-hand side is smaller than ((1−γ)/η) log(K/(1−γ)) + γ C_n. For ψ(x) = (η/(−x))^q + γ/K with η > 0, q > 1 and γ ∈ [0,1), which will appear to be a fruitful choice, it is smaller than (q/(q−1)) η K^{1/q} + γ C_n. For the sake of simplicity, we have been hiding the residual terms of (4) coming from the Taylor expansions of the primitive function of ψ^{−1}. However, these terms, when added together (nK terms!), are not that small, and in fact constrain the choice of the parameters γ and η if one wishes to get the tightest bound.
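As a quick sanity check of (5) (our own computation, not part of the original text): for γ = 0 and ψ(x) = exp(ηx), we have ψ^{−1}(u) = log(u)/η, and writing p_i for p_{i,n+1}, the right-hand side of (5) evaluates exactly to the familiar exponential-weights constant:

    −(1/η) ∑_i p_i log p_i + (1/η) ∑_i [u log u − u]_{1/K}^{p_i}
      = −(1/η) ∑_i p_i log p_i + (1/η) ( ∑_i p_i log p_i − 1 + log K + 1 )
      = (log K)/η,

where we used ∑_i p_i = 1.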

The rigorous formulation of (5) is given in Theorem 27, which has been put in Appendix A for the sake of readability. We propose here its specialization to the function ψ(x) = (η/(−x))^q + γ/K with η > 0, q > 1 and γ ∈ [0,1). This function obviously satisfies conditions (1). We will refer to the associated forecasting strategy as “Poly INF”. Here the (normalizing) function C has no closed form expression (this is a consequence of Abel's impossibility theorem). Actually this remark holds in general, hence the name of the general policy. However, this does not lead to a major computational issue since, in the interval given by (2), C(x) is the unique solution of φ(c) = 1, where φ: c ↦ ∑_{i=1}^K ψ(x_i − c) is a decreasing function. We will prove that the Poly INF forecaster generates nicer probability updates than the exponentially weighted average forecaster, as, for bandit games (label efficient or not), it allows us to remove the extraneous log K factor in the pseudo-regret bounds and some regret bounds.
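In code, the two choices of ψ discussed above can be passed directly to the INF sketch given after Figure 2; this is again only an illustration, with placeholder parameter values.

    import numpy as np

    def psi_exp(eta, gamma, K):
        """psi(x) = exp(eta*x) + gamma/K, defined for x < 0 (exponential weights)."""
        return lambda x: np.exp(eta * x) + gamma / K

    def psi_poly(eta, q, gamma, K):
        """psi(x) = (eta/(-x))^q + gamma/K, defined for x < 0 (Poly INF)."""
        return lambda x: (eta / (-x)) ** q + gamma / K

    # e.g. inf_forecaster(psi_poly(eta=10.0, q=2.0, gamma=0.0, K=5), ...)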

Theorem 2 (General regret bound for Poly INF) Let ψ(x) = (η/(−x))^q + γ/K with q > 1, η > 0 and γ ∈ [0,1). Let (v_{i,t})_{1≤i≤K, 1≤t≤n} be a sequence of nonnegative real numbers, B_t = max_{1≤i≤K} v_{i,t}, and B = max_t B_t. If γ = 0 then INF satisfies:

    ( max_{1≤i≤K} ∑_{t=1}^n v_{i,t} ) − ∑_{t=1}^n ∑_{i=1}^K p_{i,t} v_{i,t} ≤ (q/(q−1)) η K^{1/q} + (q/(2η)) exp( 2(q+1)B/η ) ∑_{t=1}^n B_t²,   (6)

and

    ( max_{1≤i≤K} ∑_{t=1}^n v_{i,t} ) − ∑_{t=1}^n ∑_{i=1}^K p_{i,t} v_{i,t} ≤ (q/(q−1)) η K^{1/q} + (qB/η) exp(8qB/η) ∑_{t=1}^n ∑_{i=1}^K p_{i,t} v_{i,t}.   (7)

pi,tvi,t. (7) Forγ >0, if we havevi,t = pct

i,t1Ii=It for some random variablect taking values in[0, c]with 0< c < qη( γ

(q1)K

)(q1)/q

, then

(1−γ) (

1maxiK

n t=1

vi,t )

(1 +γζ)

n t=1

K i=1

pi,tvi,t q

q−1ηK1q, (8) where

ζ = 1

(q1)K

((q1)cKµ(1 +µ) 2γη

)q ,

with

    μ = exp{ (2(q+1)c/η) (K/γ)^{(q−1)/q} ( 1 − (c/(qη)) ((q−1)K/γ)^{(q−1)/q} )^{−q} }.


In all this work, the parameters η, q and γ will be chosen such that ζ and µ act as numerical constants. To derive concrete bounds from the above theorem, most of the work lies in relating the left-hand side with the different notions of regret we consider. This task is trivial for the pseudo-regret. To derive high probability regret bounds, deviation inequalities for supermartingales are used on top of (6) and (8) (which hold with probability one). Finally, the expected regret bounds are obtained by integration of the high probability bounds.

As long as numerical constants do not matter, one can use (7) to recover the bounds obtained from (6). The advantage of (7) over (6) is that it allows us to get regret bounds where the factor n is replaced by G_max = max_{1≤i≤K} G_{i,n}.

3. Exponentially Weighted Average Forecasters

The normalization by division that weighted average forecasters perform is different from the normalization by shift of the real axis that INF performs. Nonetheless, we can recover exactly the exponentially weighted average forecasters, because the exponential function turns the additive shift into a multiplicative renormalization.

Let ψ(x) = exp(ηx) + γ/K with η > 0 and γ ∈ [0,1). Then conditions (1) are clearly satisfied, and (3) is equivalent to exp(−ηC(x)) = (1−γ) / ∑_{i=1}^K exp(ηx_i), which implies

    p_{i,t+1} = (1−γ) exp(ηV_{i,t}) / ∑_{j=1}^K exp(ηV_{j,t}) + γ/K.

In other words, for the full information case (label efficient or not), we recover the exponentially weighted average forecaster (with γ = 0), while for the bandit game we recover EXP3. For the label efficient bandit game, it does not give us the GREEN policy proposed in Allenberg et al. (2006) but rather the straightforward modification of the exponentially weighted average forecaster to this game (György and Ottucsák, 2006). Theorem 3 below gives a unified view on this algorithm for these four games. In the following, we will refer to this algorithm as the “exponentially weighted average forecaster” whatever the game is.
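The closed-form update above can be implemented directly, without the implicit normalization step. The small sketch below (our illustration) computes p_{t+1} from the cumulative estimates V_t; subtracting the maximum before exponentiating does not change the distribution and avoids overflow.

    import numpy as np

    def exp_weights_update(V, eta, gamma):
        """p_{i,t+1} = (1-gamma) * exp(eta*V_i) / sum_j exp(eta*V_j) + gamma/K."""
        K = len(V)
        w = np.exp(eta * (V - V.max()))     # numerically stable exponential weights
        return (1.0 - gamma) * w / w.sum() + gamma / K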

Theorem 3 (General regret bound for the exponentially weighted average forecaster) Let ψ(x) = exp(ηx) + γ/K with η > 0 and γ ∈ [0,1). Let (v_{i,t})_{1≤i≤K, 1≤t≤n} be a sequence of nonnegative real numbers, B_t = max_{1≤i≤K} v_{i,t}, and B = max_{1≤t≤n} B_t. Consider the increasing function Θ: u ↦ (e^u − 1 − u)/u², equal to 1/2 by continuity at zero. If γ = 0 then INF satisfies:

    ( max_{1≤i≤K} ∑_{t=1}^n v_{i,t} ) − ∑_{t=1}^n ∑_{i=1}^K p_{i,t} v_{i,t} ≤ (log K)/η + (η/8) ∑_{t=1}^n B_t²,   (9)

and

    ( max_{1≤i≤K} ∑_{t=1}^n v_{i,t} ) − ∑_{t=1}^n ∑_{i=1}^K p_{i,t} v_{i,t} ≤ (log K)/η + η B Θ(ηB) ∑_{t=1}^n ∑_{i=1}^K p_{i,t} v_{i,t}.   (10)


If we have

    γ ≥ K η Θ(ηB) max_{i,t} p_{i,t} v_{i,t},   (11)

then INF satisfies:

    (1−γ) ( max_{1≤i≤K} ∑_{t=1}^n v_{i,t} ) − ∑_{t=1}^n ∑_{i=1}^K p_{i,t} v_{i,t} ≤ (1−γ) (log K)/η.   (12)

We have the same discussion about (9) and (10) as about (6) and (7): Inequality (10) allows us to prove bounds where the factor n is replaced by G_max = max_{1≤i≤K} G_{i,n}, but at the price of worsened numerical constants when compared to (9). We illustrate this point in Theorem 4, where (13) and (14) respectively come from (9) and (10).

The above theorem relies on the standard argument based on the cancellation of terms in a sum of logarithms of ratios (see Section C.2). For the sake of comparison, we have also applied our general result for INF forecasters, that is, Theorem 27 (see Appendix A). This leads to the same result with worsened constants. Precisely, η/8 becomes (η/2) exp(2ηB) in (9), while Θ(ηB) becomes exp(2Bη)[1 + exp(2Bη)]/2 in (11). This seems to be the price for having a theorem applying to a large class of forecasters.

4. The Full Information (FI) Game

The purpose of this section is to illustrate the general regret bounds given in Theorems 2 and 3 in the simplest case, when we set v_{i,t} = g_{i,t}, which is possible since the rewards of all arms are observed in the full information setting. The next theorem is given explicitly to show an easy application of Inequalities (9) and (10).

Theorem 4 (Exponentially weighted average forecaster in the FI game) Let ψ(x) = exp(ηx) with η > 0. Let v_{i,t} = g_{i,t}. Then in the full information game, INF satisfies

    max_{1≤i≤K} ∑_{t=1}^n g_{i,t} − ∑_{t=1}^n ∑_{i=1}^K p_{i,t} g_{i,t} ≤ (log K)/η + ηn/8,   (13)

and

    max_{1≤i≤K} ∑_{t=1}^n g_{i,t} − ∑_{t=1}^n ∑_{i=1}^K p_{i,t} g_{i,t} ≤ (log K)/η + η Θ(η) ∑_{t=1}^n ∑_{i=1}^K p_{i,t} g_{i,t}.   (14)

In particular, with η = √(8 log(K)/n), we get E R_n ≤ √(n log(K)/2), and there exists η > 0 such that E R_n ≤ √(2 E G_max log K).

Proof It comes from (9) and (10), since we have B ≤ 1 and ∑_{t=1}^n B_t² ≤ n. The only nontrivial result is the last inequality. It obviously holds for any η when E G_max = 0, and is achieved for η = log(1 + √(2(log K)/E G_max)) when E G_max > 0. Indeed, by taking the expectation in (14), we get

    E ∑_{t=1}^n ∑_{i=1}^K p_{i,t} g_{i,t} ≥ (η E G_max − log K) / (exp(η) − 1)
      = log(1 + √(2 log(K)/E G_max)) √((E G_max)³/(2 log K)) − √(E G_max log(K)/2)
      ≥ E G_max − 2 √(E G_max log(K)/2),

where we use log(1+x) ≥ x − x²/2 for any x ≥ 0 in the last inequality. Hence E R_n = E G_max − E ∑_{t=1}^n ∑_{i=1}^K p_{i,t} g_{i,t} ≤ √(2 E G_max log K).

Now we consider a new algorithm for the FI game, that is, INF with ψ(x) = (η/(−x))^q and v_{i,t} = g_{i,t}.

Theorem 5 (Poly INF in the FI game) Let ψ(x) = (η/(−x))^q with η > 0 and q > 1. Let v_{i,t} = g_{i,t}. Then in the full information game, INF satisfies:

    max_{1≤i≤K} ∑_{t=1}^n g_{i,t} − ∑_{t=1}^n ∑_{i=1}^K p_{i,t} g_{i,t} ≤ (q/(q−1)) η K^{1/q} + exp(4q/η) qn/(2η).   (15)

In particular, with q = 3 log K and η = 1.8 √(n log K), we get E R_n ≤ 7 √(n log K).

Proof It comes from (6), q + 1 ≤ 2q and ∑_{t=1}^n B_t² ≤ n.

Remark 6 By using the Hoeffding-Azuma inequality (see, e.g., Lemma A.7 of Cesa-Bianchi and Lugosi, 2006), one can derive high probability bounds from (13) and (15): for instance, from (15), for any δ > 0, with probability at least 1−δ, Poly INF satisfies:

    R_n ≤ (q/(q−1)) η K^{1/q} + exp(4q/η) qn/(2η) + √(n log(δ^{−1})/2).

5. The Limited Feedback Games

This section provides regret bounds for three limited feedback games: the label efficient game, the bandit game, and the mixed game, that is the label efficient bandit game.

5.1 Label Efficient Game (LE)

The variants of the LE game consider that the number of queried reward vectors is constrained either strictly or just in expectation. This section considers successively these two cases.


5.1.1 Constraint on the Expected Number of Queried Reward Vectors

As in Section 4, the purpose of this section is to show how to use INF in order to recover known minimax bounds (up to constant factors) in a slight modification of the LE game: the simple LE game, in which the requirement is that the expected number of queried reward vectors should be less than or equal to m.

Let us consider the following policy. At each round, we draw a Bernoulli random variable Z_t, with parameter ε = m/n, to decide whether we ask for the gains or not. Note that we do not fulfill exactly the requirement of the LE game, as we might ask for a bit more than m reward vectors, but we do fulfill the requirement of the simple LE game. We do so in order to avoid technical details and focus on the main argument of the proof. The exact LE game will be addressed in Section 5.1.2, where, in addition, we will prove bounds on the expected regret E R_n instead of just the pseudo-regret R̄_n.

In this section, the estimate of g_{i,t} is v_{i,t} = (g_{i,t}/ε) Z_t, which is observable since the rewards at time t for all arms are observed when Z_t = 1.
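An elementary check, spelled out here for convenience: since Z_t is drawn independently of the rewards at round t and E Z_t = ε, the estimate is unbiased,

    E v_{i,t} = g_{i,t} E[Z_t]/ε = g_{i,t},

and, since g_{i,t} ≤ 1, it satisfies v_{i,t} ≤ Z_t/ε, which is the bound on B_t used in the proof of Theorem 7 below.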

Theorem 7 (Exponentially weighted average forecaster in the simple LE game) Let ψ(x) = exp(ηx) with η = √(8 m log K)/n. Let v_{i,t} = (g_{i,t}/ε) Z_t with ε = m/n. Then in the simple LE game, INF satisfies

    R̄_n ≤ n √(log(K)/(2m)).

Proof The first inequality comes from (9). Since we have B_t ≤ Z_t/ε and v_{i,t} = (g_{i,t}/ε) Z_t, we obtain

    ( max_{1≤i≤K} ∑_{t=1}^n g_{i,t} Z_t/ε ) − ∑_{t=1}^n ∑_{i=1}^K p_{i,t} g_{i,t} Z_t/ε ≤ (log K)/η + (η/(8ε²)) ∑_{t=1}^n Z_t,

hence, by taking the expectation of both sides,

    R̄_n = max_{1≤i≤K} E ∑_{t=1}^n g_{i,t} Z_t/ε − E ∑_{t=1}^n ∑_{i=1}^K p_{i,t} g_{i,t} Z_t/ε ≤ (log K)/η + ηn/(8ε) = (log K)/η + n²η/(8m).

Straightforward computations conclude the proof.
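For completeness, the “straightforward computations” are just the evaluation of the bound at the stated value of η (our own one-line check): with η = √(8 m log K)/n,

    (log K)/η + n²η/(8m) = (n/2) √(log(K)/(2m)) + (n/2) √(log(K)/(2m)) = n √(log(K)/(2m)).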

A similar result can be proved for the INF forecaster with ψ(x) = (η/(−x))^q, η > 0 and q of order log K. We do not state it since we will prove a stronger result in the next section.

5.1.2 Hard Constraint on the Number of Queried Reward Vectors

The goal of this section is to push the idea that, by using high probability bounds as an intermediate step, one can control the expected regret E R_n = E max_{1≤i≤K} ∑_{t=1}^n (g_{i,t} − g_{I_t,t}) instead of just the pseudo-regret R̄_n = max_{1≤i≤K} E ∑_{t=1}^n (g_{i,t} − g_{I_t,t}). Most previous works have obtained results for R̄_n. These results are interesting for oblivious opponents, that is, when the adversary's choices of the rewards do not depend on the past draws and obtained rewards, since in this case Proposition 33 in Appendix D shows that one can extend bounds on the pseudo-regret R̄_n to the expected regret E R_n. For non-oblivious opponents, upper bounds on R̄_n are rather weak statements, and high probability bounds on R_n or bounds on E R_n are desirable. In Auer (2002) and Cesa-Bianchi and Lugosi (2006), high probability bounds on R_n have been given. Unfortunately, the policies proposed there depend on the confidence level of the bound. As a consequence, the resulting best bound on E R_n, obtained by choosing the policies with a confidence level parameter of order 1/n, has an extraneous log n term. Specifically, from Theorem 6.2 of Cesa-Bianchi and Lugosi (2006), one can immediately derive E R_n ≤ 8n √((log(4K) + log(n))/m) + 1. The theorems of this section essentially show that the log n term can be removed.

As in Section 5.1.1, we still use a draw of a Bernoulli random variable Z_t to decide whether we ask for the gains or not. The difference is that, if ∑_{s=1}^{t−1} Z_s ≥ m, we do not ask for the gains (as we are not allowed to do so). To prevent this last constraint from interfering with the analysis, the parameter of the Bernoulli random variable is set to ε = 3m/(4n), and the probability of the event ∑_{t=1}^n Z_t > m is upper bounded. The estimate of g_{i,t} remains v_{i,t} = (g_{i,t}/ε) Z_t.
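In code, the query rule of this section amounts to the following sketch (our illustration; `reveal_gains` is a hypothetical call returning the full reward vector when a query is made, and the handling of the exhausted budget is simplified):

    import numpy as np

    def le_estimates(t, K, m, n, queries_used, reveal_gains, rng):
        """Label efficient estimates v_t under the hard budget of m queries."""
        eps = 3.0 * m / (4.0 * n)                   # parameter of the Bernoulli Z_t
        Z = 1 if (queries_used < m and rng.random() < eps) else 0
        v = np.zeros(K)
        if Z == 1:                                  # query made: all gains observed
            v = np.asarray(reveal_gains(t)) / eps   # v_{i,t} = g_{i,t} Z_t / eps
        return v, queries_used + Z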

Theorem 8 (Exponentially weighted average forecaster in the LE game) Let ψ(x) = exp(ηx) with η = √(m log K)/n. Let v_{i,t} = (g_{i,t}/ε) Z_t with ε = 3m/(4n). Then in the LE game, for any δ > 0, with probability at least 1−δ, INF satisfies:

    R_n ≤ n √(27 log(2Kδ^{−1})/m),

and

    E R_n ≤ n √(27 log(6K)/m).

Theorem 9 (Poly INF in the LE game) Let ψ(x) = (η/(−x))^q with q = 3 log(2K) and η = 2n √(log(2K)/m). Let v_{i,t} = (g_{i,t}/ε) Z_t with ε = 3m/(4n). Then in the LE game, for any δ > 0, with probability at least 1−δ, INF satisfies:

    R_n ≤ (8 − √27) n √(log(2K)/m) + n √(27 log(2Kδ^{−1})/m),

and

    E R_n ≤ 8n √(log(6K)/m).

5.2 Bandit Game

This section is cut into two parts. In the first one, from Theorem 2 and Theorem 3, we derive upper bounds on the pseudo-regret R̄_n = max_{1≤i≤K} E ∑_{t=1}^n (g_{i,t} − g_{I_t,t}). To bound the expected regret E R_n = E max_{1≤i≤K} ∑_{t=1}^n (g_{i,t} − g_{I_t,t}), we will then use high probability bounds on top of these theorems. Since this makes the proofs more intricate, we have chosen to provide the less general but easier-to-obtain results in Section 5.2.1, and the more general ones in Section 5.2.2.

The main results here are that, by using INF with a polynomial function ψ, we obtain an upper bound of order √(nK) for R̄_n, which implies a bound of order √(nK) on E R_n for oblivious adversaries (Proposition 33 in Appendix D). In the general case (containing
