A two armed bandit type problem revisited

(1)

ESAIM: Probability and Statistics

October 2005, Vol. 9, p. 277–282 DOI: 10.1051/ps:2005017

A TWO ARMED BANDIT TYPE PROBLEM REVISITED

Gilles Pag` es

¹

Abstract. In Bena¨ım and Ben Arous (2003) is solved a multi-armed bandit problem arising in the theory of learning in games. We propose a short and elementary proof of this result based on a variant of the Kronecker lemma.

Mathematics Subject Classification. 91A20, 91A12, 60F99.

Received December 10, 2004. Revised April 29, 2005.

In [2] a multi-armed bandit problem is addressed and investigated by Bena¨ım and Ben Arous. Letf₀, . . . , f_d denoted+ 1 real-valued continuous functions deﬁned on [0,1]^d+1. Given a sequencex= (x_n)_n≥1∈ {0, . . . , d}^N^∗ (the strategy), set for everyn≥1

¯

x_n:= (¯x⁰_n,x¯¹_n, . . . ,x¯^d_n) with x¯ⁱ_n:= 1 n

n

k=1

1{xk=i}, i= 0, . . . , d, and

Q(x) = lim inf

n→+∞

1 n

n−1

k=0

f_x_k+1(¯x_k).

(¯x₀ := (¯x⁰₀,x¯¹₀, . . . ,¯x^d₀)∈[0,1]^d+1, ¯x⁰₀+· · ·+ ¯x^d₀ = 1, is a starting distribution). Imagined+ 1 players enrolled in a cooperative/competitive game with the following simple rules: if playeri∈ {0, . . . , d}plays at timenhe is rewarded byf_i(¯x_n), otherwise he gets nothing; only one player can play at any given time. Then the sequencex is a playing strategy adopted by the group of players andQ(x) is theglobalworst cumulative payoff rate of the strategyx for the whole community of players (regardless of the cumulative payoff rate of each player). This interpretation slightly differs from that proposed in [2] where a single player is considered. This player has the choice among d+ 1 “arms” at every time n with a rewardf_i(¯x_n) when choosing “arm”i. We adopt the first one in view of our illustration.

In [2] an answer (see Th. 1 below) is provided to the following question What are the good strategies (for the group)?

The authors rely on some recent tools developed in stochastic approximation theory (seee.g.[1]). The aim of this note is to provide an elementary and shorter proof based on a slight improvement of the Kronecker lemma.

As an illustration, we emphasize that in such a game a greedy strategy is usually not optimal, even for the

“individual winner”.

Keywords and phrases. Two-armed bandit problem, Kronecker lemma, learning theory, stochastic fictitious play.

1 Laboratoire de Probabilités et Modèles Aléatoires, UMR 7599, Université Paris 6, case 188, 4, place Jussieu, 75252 Paris Cedex 5, France;[email protected]

c EDP Sciences, SMAI 2005

Article published by EDP Sciences and available at http://www.edpsciences.org/psor http://dx.doi.org/10.1051/ps:2005017

(2)

LetSd :={v = (v¹, . . . , v^d)∈[0,1]^d,_d

i=1v_i ≤1}and Pd+1:={u= (u⁰, u¹, . . . , u^d)∈[0,1]^d+1,_d+1

i=1u_i= 1}. Furthermore, for notational convenience, set

∀v∈ Sd, ˜v :=

1−

d

i=1

vⁱ, v¹, . . . , v^d

∈ P_d+1,

∀u∈ Pd+1, ^σu := (u¹, . . . , u^d)∈ Sd. (1) The canonical inner product onR^d will be denoted by (v|w) =_d

i=1vⁱwⁱ. The interior of a subsetAofR^d will be denoted byA^◦. For a sequenceu= (u_n)_n≥0, ∆u_n:=u_n−u_n−1,n≥1.

The main result is the following theorem (ﬁrst established in [2]).

Theorem 1. Assume there is a continuous function Φ : Sd →R, continuously diﬀerentiable on S^◦_d, having a continuous extension of its gradient ∇Φ onSd and satisfying:

∀v∈ Sd, ∇Φ(v) = (fi(˜v)−f₀(˜v))_1≤i≤d. (2) Set for everyu∈ P_d+1,

q(u) :=

d+1

i=0

uⁱf_i(u)

andQ^∗:= max{q(u), u∈ P_d+1}. Then, for every strategyx∈ {0,1, . . . , d}^N^∗, Q(x)≤Q^∗.

Furthermore, for any strategyxsuch that x¯_n →x¯_∞, 1

n

n−1

k=0

f_x_k+1(¯x_k)→q(¯x_∞) as n→ ∞ (so thatQ(x) =q(¯x_∞)).

In particular there is no better strategy than choosing the player at random according to an i.i.d. “Bernouilli strategy” with parameterx¯^∗∈argmaxq.

The key of the proof is the following slight extension of the Kronecker lemma.

Lemma 1 (“`a la Kronecker” lemma). Let (b_n)_n≥1 be a nondecreasing sequence of positive real numbers con- verging to+∞and let(a_n)_n≥1 be a sequence of real numbers. Then

lim inf

n→+∞

n

k=1

a_k

b_k ∈R =⇒ lim inf

n→+∞

1 b_n

n

k=1

a_k≤0.

Proof. SetC_n =

n

k=1

a_k

b_k, n≥1, andC₀= 0 so thata_n =b_n∆C_n. As a consequence, an Abel transform yields 1

b_n

n

k=1

a_k = 1 b_n

n

k=1

b_k∆C_k = 1 b_n

b_nC_n−

n

k=1

C_k−1∆b_k

= C_n− 1 b_n

n

k=1

C_k−1∆b_k.

(3)

Now, lim inf

n→+∞C_nbeing ﬁnite, for everyε >0, there is an integern_εsuch that for everyk≥n_ε,C_k≥lim inf

n→+∞C_n−ε.

Hence

1 b_n

n

k=1

C_k−1∆b_k≥ 1 b_n

nε

k=1

C_k−1∆b_k+b_n−b_n_ε b_n

lim inf

k C_k−ε

. Consequently, lim inf

n→+∞C_n being ﬁnite, one concludes that for everyε >0, lim inf

n→+∞

1 b_n

n

k=1

a_k≤lim inf

n→+∞C_n−0−1×

lim inf

k→+∞C_k−ε

=ε.

Proof of Theorem 1. First note that for everyu= (u⁰, u¹, . . . , u^d)∈ Pd+1,

q(u) :=

d+1

i=0

uⁱf_i(u) =f₀(u) +

d

i=1

uⁱ(f_i(u)−f₀(u)) so that

Q^∗= sup

v∈Sd

f₀(˜v) +

d

i=1

vⁱ(f_i(˜v)−f₀(˜v))

= sup

v∈Sd

{f₀(˜v) + (v|∇Φ(v))}.

Now, for everyk≥0,

f_x_k+1(¯x_k)−q(¯x_k) =

d

i=0

(f_i(¯x_k)1{xk+1=i}−x¯ⁱ_kf_i(¯x_k)) =

d

i=0

f_i(¯x_k)(1{xk+1=i}−x¯ⁱ_k)

=

d

i=0

f_i(¯x_k)(k+ 1)∆¯xⁱ_k+1

= (k+ 1)

d

i=1

(f_i(¯x_k)−f₀(¯x_k))∆¯xⁱ_k+1.

The last equality reads using Assumption (2) and notation (1),

f_x_k+1(¯x_k)−q(¯x_k) = (k+ 1)(∇Φ(^σx¯_k)|∆^σx¯_k+1).

Consequently, by the fundamental formula of calculus applied to Φ on (^σ¯x_k, ^σx¯_k+1)⊂ S^◦_d, 1

n

n−1

k=0

f_x_k+1(¯x_k)−q(¯x_k) = 1 n

n−1

k=0

(k+ 1) (Φ(^σx¯_k+1)−Φ(^σx¯_k))−R_n

with R_n := 1

n

n−1

k=0

(∇Φ(ξ_k)− ∇Φ(^σx¯_k)|(k+ 1)∆^σx¯_k+1)

andξ_k∈(^σx¯_k,^σ¯x_k+1), k= 0, . . . , n−1. The fact that|(k+ 1)∆^σx¯_k+1| ≤1 implies

|Rn| ≤ 1 n

n−1

k=0

w(∇Φ,|∆^σ¯x_k+1|)

(4)

wherew(g, δ) denotes the uniform continuityδ-modulus of a functiong. One derives from the uniform continuity of∇Φ on the compact setSd that

R_n→0 as n→+∞.

Finally, the continuous function Φ being bounded on the compact setS_d, the partial sums

n−1

k=0

Φ(^σx¯_k+1)−Φ(^σx¯_k) = Φ(^σx¯_n+1)−Φ(^σx¯₀) remain bounded asngoes to inﬁnity. Lemma 1 then implies that

lim inf

n→+∞

1 n

n−1

k=0

(k+ 1) (Φ(^σx¯_k+1)−Φ(^σx¯_k))≤0.

One concludes by noting that on one hand lim sup

n→+∞

1 n

n−1

k=0

q(¯x_k)≤Q^∗= sup

Pd+1

q

and that, on the other hand, the functionqbeing continuous,

n→+∞lim 1 n

n−1

k=0

q(¯x_k) =q(x^∗) as soon as x¯_n →x^∗.

Corollary 1. Whend+ 1 = 2(two players), Assumption (2) is satisﬁed as soon asf₀ andf₁ are continuous on P2 and then the conclusions of Theorem 1 hold true.

Proof. This follows from the obvious fact that the continuous functionu¹→f₁(1−u¹, u¹)−f₀(1−u¹, u¹) on

[0,1] has an antiderivative.

Further comments:

•If one considers a slightly more general game in which someweighted strategiesare allowed, the final result is not modified in any way provided the weight sequence satisfies a very light assumption. Namely, assume that at time nthe reward is

∆_n+1f_x_n+1(¯x_n) instead of f_x_n+1(¯x_n) where the weight sequence ∆ = (∆_n)_n≥1satisﬁes

∆_n≥0, n≥1, S_n=

n

k=1

∆_k→+∞, ∆_n

S_n →0 as n→ ∞ then the quantities ¯x^∆₀ ∈ P_d+1, ¯x^∆_n := (¯x^∆,0_n , . . . ,x¯^∆,d_n ) with ¯x^∆,i_n = _S¹

n

_n

k=1∆_k1{xk=i}, i= 0, . . . , d, n≥1, andQ^∆(x) = lim inf

n→+∞

1 S_n

n−1

k=0

∆_k+1f_x_k+1(¯x^∆_k) satisfy all the conclusions of Theorem 1mutatis mutandis.

• Several applications of Theorem 1 to the theory of learning in games and to stochastic ﬁctitious play are extensively investigated in [2] which we refer to for all these aspects. As far as we are concerned we will simply make a remark about some “natural” strategies which illustrates the theorem in an elementary way.

In the reward function at timek,i.e. f_x_k(¯x_k−1),x_k represents the competitive term (“who will play?”) and

¯

x_k−1 represents a cooperative term (everybody’s past behaviour has inﬂuence on everybody’s reward).

(5)

This cooperative/competitive antagonism induces that in such a game agreedycompetitive strategy is usually not optimal (when the players do not play a symmetric role). Let us be more speciﬁc. Assume for the sake of simplicity that d+ 1 = 2 (two players). Then one may consider without loss of generality that ¯x_n = ^σx¯_n i.e.

that ¯x_n is a [0,1]-valued real number. Agreedy competitivestrategy is deﬁned by

player 1 plays at time n (i.e. x_n= 1) iﬀ f₁(¯x_n−1)≥f₀(¯x_n−1) (3) i.e.the player with the highest reward is nominated to play. Then, for everyn≥1,

f_x_n(¯x_n−1) = max(f₀(¯x_n−1), f₁(¯x_n−1)) and it is clear that

f_x_n(¯x_n−1)−q(¯x_n−1) = max(f₀(¯x_n−1), f₁(¯x_n−1))−q(¯x_n−1) =:ϕ(¯x_n−1)≥0.

On the other hand, the proof of Theorem 1 implies that

lim inf

n→+∞

1 n

n−1

k=0

ϕ(¯x_k)≤0.

Hence, there is at least one weak limiting distribution ¯µ_∞ of the sequence of empirical measures ¯µ_n :=

1 n

0≤k≤n−1δ_x_¯_kon the compact interval [0,1] which is supported by the closed set{ϕ= 0} ⊂ {0,1}∪{f0=f₁};

on the other hand supp(µ_∞) is contained in the set ¯X∞ of the limiting values of the sequence (¯x_n) itself (in fact ¯X∞ is an interval since (¯x_n)_n is bounded and ¯x_n+1−x¯_n →0). Hence ¯X∞∩({0,1} ∪ {f0=f₁})=∅.

If the greedy strategy (¯x_n)_n is optimal then dist(¯x_n,argmaxq)→0 asn→ ∞i.e.X¯∞ ⊂argmaxq. Conse- quently if

argmaxq∩({0,1} ∪ {f₀=f₁}) =∅ (4) thenthe purely competitive strategy is never optimal for the group of two players.

Let us be more speciﬁc on the following example: set for two positive parametersa=b f₀(x) :=a x and f₁(x) :=b(1−x), x∈[0,1].

Then one checks that

argmaxq={1/2} and f₀(1/2)=f₁(1/2).

One first shows that the greedy strategyx= (x_n)_n≥1defined by (3) satisfies

¯

x_n→ b

a+b and Q(x) = ab

a+b as n→ ∞.

On the other hand, any optimal (cooperative) strategy (like the i.i.d.Bernoulli(1/2) one) yields an asymptotic (relative) global payoﬀ rate

Q^∗= max

[0,1]q=a+b 4 ·

Note thatQ^∗> _a+b^ab sincea=b. (Whena=bthe greedy strategy becomes optimal.) Now, if one looks at the individual performances (i.e.lim_n ¹_n

0≤k≤n−1f_i(¯x_k)1{xk+1=i}, i = 0,1) of both players when the greedy strategy is played, one checks that:

– the “winner” of the game is player 1 ifb > aand player 0 ifa > b,

– the asymptotic (relative) payoff rate of the winner is equal to âb_(a+b)^max(a,b)2 (and âb_(a+b)^min(a,b)2 for the “looser”).

(6)

If an optimal cooperative strategy is adopted by the players the “winner” remains the same but with an asymptotic payoﬀ rate equal to ^max(a,b)₄ (the “looser” gets ^min(a,b)₄ ). Consequently (when a =b), an optimal cooperative strategy always yields to the winner a strictly higher asymptotic payoﬀ rate than the greedy one.

This is also true for the looser.

• A more abstract version of Theorem 1 can be established using the same approach. The ﬁnite set {0,1, . . . , d} is replaced by a compact metric set K, P_d+1 is replaced by the convex set PK of probability distributions on K equipped with the weak topology and the continuous function f : K× PK → R is still supposed to derive from a potential function in some sense.

References

[1] M. Bena¨ım, Dynamics of stochastic algorithms, inS´eminaire de probabilit´es XXXIII, J. Az´emaet al.Eds., Springer-Verlag, Berlin.Lect. Notes Math.1708(1999) 1–68.

[2] M. Bena¨ım and G. Ben Arous, A two armed bandit type problem.Game Theory32(2003) 3–16.