Multi-Player Bandits Revisited


HAL Id: hal-02013847

https://hal.inria.fr/hal-02013847

Submitted on 11 Feb 2019



Lilian Besson

To cite this version:

Lilian Besson. Multi-Player Bandits Revisited. Séminaire "IETR : Interagir Evaluer Transmettre Réunir", Jun 2018, Vannes, France. ⟨hal-02013847⟩


Workshop on MAB and Learning Algorithms, May 2018, Rotterdam (NL)

PhD Student Day, June 2018, Vannes (Fr)

By: Lilian Besson – Lilian.Besson@CentraleSupelec.fr

SCEE Team, CentraleSupélec Rennes, IETR & SequeL Team, Inria Lille

1. INTRODUCTION & GOAL

Goal: insert objects into a wireless network while keeping a good Quality of Service.

• Hypothesis: at each time t, object j chooses a channel $A^j(t) \in \{1, \dots, K\}$ to communicate on.

• Idea: use on-line Machine Learning algorithms?

• Not so easy: each device takes its own decisions, without central control or communication, and has limited CPU, memory, etc.

• ⟹ Solution: decentralized MAB algorithms!

2. MODEL: TIME/FREQUENCY PROTOCOL

M PLAYERS IN THE NETWORK

Fixed-duration communications. Channels can be free or busy.

• K RF channels (of the same bandwidth), e.g., K = 9.

• Primary users create background traffic.

• Channel k follows a Bernoulli free/busy pattern of mean $\mu_k$, with $\mu_1, \dots, \mu_K \in [0, 1]$.

• Sensing gives binary feedback $Y_{k,t} \sim \mathcal{B}(\mu_k)$.

• $1 \le M \le K$ secondary users, $j \in \{1, \dots, M\}$.

• The base station replies if the packet is received.

• Collision indicator: $C^j(t) = \mathbb{1}(\exists j' \neq j,\ A^j(t) = A^{j'}(t))$.

• Non-stochastic 0/1 reward: $r^j(t) := Y_{A^j(t),t} \times (1 - C^j(t))$.
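As an illustration of the model above, here is a minimal Python sketch of one time step for all players. It assumes that a sensing value of 1 means "channel sensed free" and that a player earns its sensing value only when it is alone on its channel; the function name `play_round` and its arguments are hypothetical, not taken from the poster.

```python
import random


def play_round(mu, choices, rng):
    """Simulate one time step of the model for all players (illustrative sketch).

    mu      -- channel means mu_k (assumption: probability that channel k is sensed free)
    choices -- channel index A^j(t) chosen by each player j (0-based)
    rng     -- a random.Random instance
    Returns (sensing, collisions, rewards), one entry per player.
    """
    # One Bernoulli draw Y_{k,t} per channel, shared by every player on that channel.
    y = [1 if rng.random() < m else 0 for m in mu]

    # Count how many players picked each channel.
    counts = {}
    for k in choices:
        counts[k] = counts.get(k, 0) + 1

    sensing = [y[k] for k in choices]
    # C^j(t) = 1 when at least one other player chose the same channel.
    collisions = [1 if counts[k] > 1 else 0 for k in choices]
    # Assumed reward: the sensing value, zeroed out in case of collision.
    rewards = [s * (1 - c) for s, c in zip(sensing, collisions)]
    return sensing, collisions, rewards


if __name__ == "__main__":
    rng = random.Random(0)
    print(play_round([0.1, 0.5, 0.9], [2, 2, 1], rng))
```

Players 0 and 1 in the usage example pick the same channel, so both get a zero reward whatever the sensing value is.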

3. PREVIOUS WORKS (SINCE 2009)

Ideas: 1) use empirical means to learn the M best arms $(\mu^*_1, \dots, \mu^*_M)$, 2) and have each user orthogonalize on this set.

• RhoRand: uses an index policy (UCB) and random ranks, [Anandkumar et al, 2011]

• MEGA: uses a simple ε-greedy algorithm (hard to tune, and not so efficient), [Avner & Mannor, 2015]

• Musical Chair: pure exploration, then random hopping until convergence on the M best arms, [Shamir et al, 2016]

• Some others: TDFS uses time sharing, RhoLearn uses BayesUCB to learn the ranks, etc.

4. OUR PROPOSAL

1) Use kl-UCB indexes rather than UCB$_1$ (kl-UCB performs better) to identify the best arms.

2) New algorithm MCTopM for orthogonalization:

– Estimate the set of the M best arms, $\hat{M}^j(t)$,

– Play randomly in $\hat{M}^j(t)$ until fixed on a good arm (no collision: fixed on a "chair").

5. UCB$_1$ AND kl-UCB INDEXES

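Let $T_k^j(t)$ be the number of selections of channel k by player j, and $X_k^j(t)$ the sum of its sensing feedback on channel k, so that $X_k^j(t) / T_k^j(t)$ is the empirical mean of free sensing.

UCB$_1$: $g_k^j(t) := \frac{X_k^j(t)}{T_k^j(t)} + \sqrt{\frac{\log(t)}{2\, T_k^j(t)}}$

kl-UCB: $g_k^j(t) := \sup_{q \in [0,1]} \left\{ q : \mathrm{kl}\left( \frac{X_k^j(t)}{T_k^j(t)}, q \right) \le \frac{\log(t)}{T_k^j(t)} \right\}$

[Garivier & Cappé, 2011], [Cappé et al, 2013]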
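The two indexes can be computed as in the following Python sketch. It assumes Bernoulli feedback, so kl is the binary relative entropy, and it solves the supremum in the kl-UCB index by bisection, which is one common numerical choice rather than something the poster prescribes. The function names and the tolerance `tol` are hypothetical.

```python
import math


def kl_bernoulli(p, q, eps=1e-15):
    """Binary relative entropy kl(p, q), with clipping to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def ucb1_index(x_sum, pulls, t):
    """UCB1 index: empirical mean plus sqrt(log(t) / (2 * pulls))."""
    return x_sum / pulls + math.sqrt(math.log(t) / (2 * pulls))


def klucb_index(x_sum, pulls, t, tol=1e-6):
    """kl-UCB index: largest q in [mean, 1] with kl(mean, q) <= log(t) / pulls."""
    mean = x_sum / pulls
    level = math.log(t) / pulls
    lo, hi = mean, 1.0
    # q -> kl(mean, q) is increasing on [mean, 1], so bisection applies.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    # Channel sensed free 5 times out of 10 selections, at time t = 100.
    print(ucb1_index(5, 10, 100), klucb_index(5, 10, 100))
```

By Pinsker's inequality, kl(p, q) ≥ 2(p − q)², so the kl-UCB index is never larger than the UCB$_1$ index for the same data. This tighter confidence bound is the usual argument for preferring kl-UCB.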

6. MCTopM ALGORITHM

Let $A^j(1) \sim \mathcal{U}(\{1, \dots, K\})$, $C^j(1) = s^j(1) = \mathrm{False}$
for $t = 0, \dots, T - 1$ do
    if $A^j(t) \notin \hat{M}^j(t)$ then    // transition (3) or (5)
        $A^j(t+1) \sim \mathcal{U}\left( \hat{M}^j(t) \cap \{ k : g_k^j(t-1) \le g_{A^j(t)}^j(t-1) \} \right)$    // arm with smaller UCB at t − 1
        $s^j(t+1) = \mathrm{False}$
    else if $C^j(t)$ and $\overline{s^j(t)}$ then    // transition (2): collision while not fixed
        $A^j(t+1) \sim \mathcal{U}(\hat{M}^j(t))$
        $s^j(t+1) = \mathrm{False}$
    else    // transition (1) or (4)
        $A^j(t+1) = A^j(t)$    // same arm
        $s^j(t+1) = \mathrm{True}$    // on a "chair"
    end
    Play arm $A^j(t+1)$, get new observations,
    Compute the indexes $g_k^j(t+1)$, and set $\hat{M}^j(t+1)$ for the next step.
end
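The decision rule of the algorithm can be sketched in Python as below, for a single player and a single time step. The names `top_m` and `mctopm_step` are hypothetical, ties between indexes are broken by arm order, and the fallback used when the candidate set is empty is a safety assumption of this sketch, not part of the algorithm as stated. A player in the "fixed" state keeps its arm even when it collides, while a "not fixed" player resamples.

```python
import random


def top_m(indexes, m):
    """Estimated set of the m arms with the largest indexes (ties: lowest arm first)."""
    order = sorted(range(len(indexes)), key=lambda k: indexes[k], reverse=True)
    return set(order[:m])


def mctopm_step(arm, collided, fixed, prev_indexes, indexes, m, rng):
    """One MCTopM decision for a single player (illustrative sketch).

    arm          -- current arm A^j(t)
    collided     -- collision indicator C^j(t)
    fixed        -- "chair" flag s^j(t)
    prev_indexes -- indexes g_k^j(t - 1)
    indexes      -- indexes g_k^j(t), used to build the estimated best set
    Returns (next_arm, next_fixed).
    """
    best = top_m(indexes, m)
    if arm not in best:
        # Transitions (3) or (5): move to an arm of the best set whose
        # previous index was not larger than that of the current arm.
        candidates = [k for k in best if prev_indexes[k] <= prev_indexes[arm]]
        if not candidates:  # safety fallback, an assumption of this sketch
            candidates = list(best)
        return rng.choice(sorted(candidates)), False
    if collided and not fixed:
        # Transition (2): collision while not fixed, resample in the best set.
        return rng.choice(sorted(best)), False
    # Transitions (1) or (4): keep the same arm and sit on the "chair".
    return arm, True


if __name__ == "__main__":
    rng = random.Random(0)
    print(mctopm_step(2, False, False, [0.9, 0.05, 0.1], [0.9, 0.8, 0.1], 2, rng))
```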

7. ILLUSTRATION OF MCTOPM

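[Figure: state machine of MCTopM for player j, with two states, "Not fixed" ($\overline{s^j(t)}$) and "Fixed" ($s^j(t)$).]

(0) Start, t = 0, in the "Not fixed" state.

(1) Not fixed → Fixed: $\overline{C^j(t)}$, $A^j(t) \in \hat{M}^j(t)$.

(2) Not fixed → Not fixed: $C^j(t)$, $A^j(t) \in \hat{M}^j(t)$.

(3) Not fixed → Not fixed: $A^j(t) \notin \hat{M}^j(t)$.

(4) Fixed → Fixed: $A^j(t) \in \hat{M}^j(t)$.

(5) Fixed → Not fixed: $A^j(t) \notin \hat{M}^j(t)$.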

8. THEOREMS

Let $T_k(T) := \sum_{j=1}^{M} T_k^j(T)$ be the total number of selections of arm k, and let $C_k(T)$ count the collisions on arm k.

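1. Regret:

$$R_T(\mu, M) := \mathbb{E}_\mu\left[ \sum_{t=1}^{T} \sum_{j=1}^{M} \left( \mu^*_j - r^j(t) \right) \right] = \left( \sum_{k=1}^{M} \mu^*_k \right) T - \mathbb{E}_\mu\left[ \sum_{t=1}^{T} \sum_{j=1}^{M} r^j(t) \right]$$

$$R_T = \sum_{k \in M\text{-worst}} (\mu^*_M - \mu_k)\, \mathbb{E}_\mu[T_k(T)] + \sum_{k \in M\text{-best}} (\mu_k - \mu^*_M)\, (T - \mathbb{E}_\mu[T_k(T)]) + \sum_{k=1}^{K} \mu_k\, \mathbb{E}_\mu[C_k(T)].$$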

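2. Lemma for the upper bound: only two terms to focus on (bad selections and collisions).

$$\sum_{k \in M\text{-best}} (\mu_k - \mu^*_M)\, (T - \mathbb{E}_\mu[T_k(T)]) \le (\mu^*_1 - \mu^*_M) \left( \sum_{k \in M\text{-worst}} \mathbb{E}_\mu[T_k(T)] + \sum_{k \in M\text{-best}} \mathbb{E}_\mu[C_k(T)] \right).$$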

3. If all M players use MCTopM with kl-UCB,

$$\forall \mu,\ \exists G_{M,\mu},\quad R_T(\mu, M) \le G_{M,\mu} \times \log(T) + o(\log T).$$
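As a sanity check of the regret decomposition in item 1, the following Python sketch evaluates both forms on a fixed sequence of choices, replacing the expectations by realized counts. It assumes that $C_k(T)$ counts the colliding players on arm k (each colliding player earns no reward) and that arms are ranked by their means; the function name `regret_terms` is hypothetical.

```python
def regret_terms(mu, m, choices):
    """Compare the direct regret with its three-term decomposition (sketch).

    mu      -- arm means mu_k
    m       -- number of players M
    choices -- choices[t][j] is the arm (0-based) played by player j at step t
    Returns (direct, (bad_selections, missed_best, collisions)).
    """
    n_arms = len(mu)
    horizon = len(choices)
    order = sorted(range(n_arms), key=lambda k: mu[k], reverse=True)
    best, worst = order[:m], order[m:]
    mu_star_m = mu[best[-1]]  # mean of the M-th best arm

    pulls = [0] * n_arms      # T_k(T)
    colliding = [0] * n_arms  # C_k(T), assumed to count colliding players
    for round_choices in choices:
        for k in range(n_arms):
            n = round_choices.count(k)
            pulls[k] += n
            if n > 1:
                colliding[k] += n

    # Direct form: best achievable mean reward minus mean collected reward.
    collected = sum(mu[k] * (pulls[k] - colliding[k]) for k in range(n_arms))
    direct = horizon * sum(mu[k] for k in best) - collected

    # The three terms of the decomposition.
    bad_selections = sum((mu_star_m - mu[k]) * pulls[k] for k in worst)
    missed_best = sum((mu[k] - mu_star_m) * (horizon - pulls[k]) for k in best)
    collisions = sum(mu[k] * colliding[k] for k in range(n_arms))
    return direct, (bad_selections, missed_best, collisions)


if __name__ == "__main__":
    direct, terms = regret_terms([0.9, 0.5, 0.1], 2, [[0, 0], [0, 1], [2, 1]])
    print(direct, sum(terms))
```

The two quantities agree because every step contributes exactly M selections, so $\sum_k T_k(T) = M T$.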

9.1. ILLUSTRATION FOR M = K

[Figure: cumulated centralized regret, $\sum_{k=1}^{9} \mu^*_k t - \sum_{k=1}^{9} \mu_k \mathbb{E}_{200}[T_k(t)]$, as a function of the time steps t = 1..T, horizon T = 10000. Multi-players M = 9, averaged 200 times. 9 arms: B(0.1)*, B(0.2)*, B(0.3)*, B(0.4)*, B(0.5)*, B(0.6)*, B(0.7)*, B(0.8)*, B(0.9)*. Curves: 9× RandTopM-KLUCB, 9× MCTopM-KLUCB, 9× Selfish-KLUCB, 9× RhoRand-KLUCB. Our lower bound, Anandkumar et al.'s lower bound and the centralized lower bound are all equal to 0 log(t).]

Only RandTopM and MCTopM achieve constant regret in this saturated case (proven).

9.2. ILLUSTRATION FOR M < K

[Figure: cumulated centralized regret, $\sum_{k=1}^{6} \mu^*_k t - \sum_{k=1}^{9} \mu_k \mathbb{E}_{500}[T_k(t)]$, as a function of the time steps t = 1..T, horizon T = 5000. Multi-players M = 6, averaged 500 times. 9 arms: Bayesian MAB, Bernoulli with means on [0, 1]. Curves: 6× RandTopM-KLUCB, 6× MCTopM-KLUCB, 6× Selfish-KLUCB, 6× RhoRand-KLUCB.]

RhoRand < RandTopM < Selfish < MCTopM.

12. THANKS TO . . .

– Organizers of the Workshop on MAB and Learning Algorithms!

– ADDI association for the PhD Students Day 2018!

– SCEE team at IETR, CentraleSupélec (Rennes).

– SequeL team at Inria (Lille), and CNRS.

– My PhD advisors: Émilie Kaufmann, Christophe Moy.

11. MAIN REFERENCES

MORE ON-LINE → http://lbo.k.vu/JdD2018

[BBM+17] R. Bonnefoi, L. Besson, C. Moy, E. Kaufmann, and J. Palicot (2017). Multi-Armed Bandit Learning in IoT Networks: Learning helps even in non-stationary settings. In 12th EAI Conference on Cognitive Radio Oriented Wireless Network and Communication.

[BK18] L. Besson and E. Kaufmann (April 2018). Multi-Player Bandits Revisited. In Algorithmic Learning Theory. Lanzarote, Spain. URL https://hal.inria.fr/hal-01629733.

[B18] Simulation code on GitHub.com/SMPyBandits/SMPyBandits, open source (MIT license)!

10. CONCLUSIONS

• Our algorithm MCTopM is uniformly better than all previous proposals.

• Good bounds: the numbers of collisions, arm switches and bad selections, as well as the regret, are all of order log(T).

• Real-world implementation? → Yes, presented at ICT 2018!