
Multi-Player Bandits Revisited

Workshop on MAB and Learning Algorithms, May 2018, Rotterdam (NL)
PhD Student Day, June 2018, Vannes (FR)

By: Lilian Besson, Lilian.Besson@CentraleSupelec.fr
SCEE Team, CentraleSupélec Rennes, IETR, and SequeL Team, Inria Lille
HAL Id: hal-02013847, https://hal.inria.fr/hal-02013847

1. INTRODUCTION & GOAL

Goal: insert smart objects into a wireless network while keeping a good Quality of Service.

• Hypothesis: object j chooses a channel A^j(t) ∈ {1, . . . , K} to communicate on at time t.
• Idea: use on-line Machine Learning algorithms?
• Not so easy: each device takes its own decisions, without central control or communication, and has light CPU/memory, etc.
• ⟹ Solution: decentralized MAB algorithms!

2. MODEL: TIME/FREQUENCY PROTOCOL, M PLAYERS IN THE NETWORK

Fixed-duration communications. Channels can be free or busy.

• K RF channels (of the same bandwidth), e.g., K = 9.
• Primary users create a background traffic: channel k is free with mean availability µ_k, with µ_1, . . . , µ_K ∈ [0, 1].
• Sensing gives binary feedback Y_{k,t} ∼ B(µ_k).
• 1 ≤ M ≤ K secondary users, j ∈ {1, . . . , M}.
• The base station replies (acknowledgement) if the packet is received.
• Collision indicator C^j(t) := 1(∃ j' ≠ j, A^{j'}(t) = A^j(t)).
• Non-stochastic 0/1 reward r^j(t) := Y_{A^j(t),t} × (1 − C^j(t)): a packet goes through only if the channel is free and no other player chose it.
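To fix ideas, here is a minimal Python sketch of one time step of this model (all names are hypothetical; this illustrates the model above, not the authors' simulator):

```python
import random

def one_round(mu, actions):
    """Simulate one time step of the model (an illustrative sketch).
    mu[k]: mean availability of channel k; actions[j]: channel A^j(t)
    chosen by secondary user j.  Returns sensing feedback and rewards."""
    K, M = len(mu), len(actions)
    # Sensing feedback Y_{k,t} ~ B(mu_k): 1 if channel k is free at time t.
    Y = [1 if random.random() < mu[k] else 0 for k in range(K)]
    rewards = []
    for j in range(M):
        collision = any(actions[i] == actions[j] for i in range(M) if i != j)
        # r^j(t) = Y_{A^j(t),t} * (1 - C^j(t)): success only if free and alone.
        rewards.append(Y[actions[j]] * (0 if collision else 1))
    return Y, rewards

# Example: K = 9 channels, M = 6 users all picking channels at random.
mu = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
actions = [random.randrange(len(mu)) for _ in range(6)]
Y, rewards = one_round(mu, actions)
```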

3. PREVIOUS WORKS (SINCE 2009)

Ideas: 1) use empirical means to learn the M best arms (µ*_1, . . . , µ*_M), and 2) have each user orthogonalize on this set.

• RhoRand: use an index policy (UCB) and random ranks [Anandkumar et al., 2011].
• MEGA: use a simple ε-greedy algorithm (hard to tune, and not so efficient) [Avner & Mannor, 2015].
• Musical Chair: pure exploration, then random hopping until convergence on the M best arms [Shamir et al., 2016].
• Some others: TDFS uses time sharing, RhoLearn uses Bayes-UCB to learn the ranks, etc.

4. OUR PROPOSAL

1) kl-UCB > UCB1 for identifying the M best arms.
2) New algorithm MCTopM for orthogonalization:
• Estimate the set of M best arms, M̂^j(t);
• Randomly play in M̂^j(t) until fixed on a good one (no collision: fixed on a "chair").

5. UCB1 AND kl-UCB INDEXES

Let T^j_k(t) be the number of selections of channel k by player j, and X^j_k(t) the sum of its sensing observations, so X^j_k(t)/T^j_k(t) is the empirical mean of free sensing.

UCB1: g^j_k(t) := \frac{X^j_k(t)}{T^j_k(t)} + \sqrt{\frac{\log(t)}{2\,T^j_k(t)}}

kl-UCB: g^j_k(t) := \sup_{q \in [0,1]} \left\{ q : \mathrm{kl}\!\left(\frac{X^j_k(t)}{T^j_k(t)},\, q\right) \le \frac{\log(t)}{T^j_k(t)} \right\}

[Garivier & Cappé, 2011], [Cappé et al., 2013]
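As a concrete companion to these formulas, here is a minimal Python sketch of both indexes; the kl-UCB bound is computed by bisection, the usual approach (function and argument names are mine, not from the authors' code, and returning +inf for unseen arms is a convention to force initial exploration):

```python
import math

def klBern(p, q, eps=1e-7):
    """Bernoulli KL divergence kl(p, q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def ucb1_index(X, T, t):
    """UCB1 index: empirical mean + sqrt(log(t) / (2 T))."""
    if T == 0:
        return float('inf')  # unseen arm: force exploration
    return X / T + math.sqrt(math.log(t) / (2 * T))

def klucb_index(X, T, t, precision=1e-6):
    """kl-UCB index: largest q in [0, 1] with kl(mean, q) <= log(t)/T,
    found by bisection (kl(mean, q) is increasing in q for q >= mean)."""
    if T == 0:
        return float('inf')
    mean, budget = X / T, math.log(t) / T
    lo, hi = mean, 1.0
    while hi - lo > precision:
        mid = (lo + hi) / 2
        if klBern(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```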

6. MCTopM ALGORITHM

A^j(1) ∼ U({1, . . . , K}), C^j(1) = s^j(1) = False
for t = 0, . . . , T − 1 do
    if A^j(t) ∉ M̂^j(t) then                                        // transition (3) or (5)
        A^j(t+1) ∼ U(M̂^j(t) ∩ {k : g^j_k(t−1) ≤ g^j_{A^j(t)}(t−1)})  // arm with smaller UCB at t−1
        s^j(t+1) = False
    else if C^j(t) and not s^j(t) then                              // collision while not fixed: transition (2)
        A^j(t+1) ∼ U(M̂^j(t))
        s^j(t+1) = False
    else                                                            // transition (1) or (4)
        A^j(t+1) = A^j(t)                                           // same arm
        s^j(t+1) = True                                             // fixed on a "chair"
    end
    Play arm A^j(t+1), get new observations,
    compute the indexes g^j_k(t+1) and the set M̂^j(t+1) for the next step.
end
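Below is a minimal Python sketch of one decision step of MCTopM, following the pseudocode above (the data structures and function name are my own illustration, not the authors' implementation in SMPyBandits):

```python
import random

def mctopm_step(arm, fixed, collided, best_set, g_prev):
    """One decision step of MCTopM for a single player.
    arm: current arm A^j(t); fixed: chair flag s^j(t);
    collided: collision indicator C^j(t); best_set: estimated M best
    arms (the set hat{M}^j(t)); g_prev: dict arm -> index g^j_k(t-1).
    Returns (next_arm, next_fixed)."""
    if arm not in best_set:                        # transition (3) or (5)
        # Move to an estimated top-M arm with a smaller index at t-1
        # (non-empty in theory; the fallback is only defensive).
        candidates = [k for k in best_set if g_prev[k] <= g_prev[arm]]
        return random.choice(candidates or sorted(best_set)), False
    if collided and not fixed:                     # transition (2)
        return random.choice(sorted(best_set)), False
    return arm, True                               # (1) or (4): stay, on a "chair"
```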

7. ILLUSTRATION OF MCTopM

[State-machine diagram: two states, "Not fixed" (¬s^j(t)) and "Fixed" (s^j(t)), with transitions:]
(0) Start, t = 0: enter "Not fixed".
(1) No collision (¬C^j(t)) and A^j(t) ∈ M̂^j(t): move from "Not fixed" to "Fixed".
(2) Collision (C^j(t)) and A^j(t) ∈ M̂^j(t): stay "Not fixed", resample in M̂^j(t).
(3) A^j(t) ∉ M̂^j(t): stay "Not fixed".
(4) A^j(t) ∈ M̂^j(t): stay "Fixed" on the same arm.
(5) A^j(t) ∉ M̂^j(t): move from "Fixed" back to "Not fixed".

8. THEOREMS

Let T_k(T) := \sum_{j=1}^{M} T^j_k(T) be the total number of selections of arm k, and C_k(T) the number of collisions on arm k.

1. Regret:
R_T(\mu, M) := \mathbb{E}_{\mu}\left[\sum_{t=1}^{T} \sum_{j=1}^{M} \left(\mu^*_j - r^j(t)\right)\right] = \left(\sum_{k=1}^{M} \mu^*_k\right) T - \mathbb{E}_{\mu}\left[\sum_{t=1}^{T} \sum_{j=1}^{M} r^j(t)\right],
which decomposes as
R_T = \sum_{k \in M\text{-worst}} (\mu^*_M - \mu_k)\, \mathbb{E}_{\mu}[T_k(T)] + \sum_{k \in M\text{-best}} (\mu_k - \mu^*_M)\, (T - \mathbb{E}_{\mu}[T_k(T)]) + \sum_{k=1}^{K} \mu_k\, \mathbb{E}_{\mu}[C_k(T)].

2. Lemma for the upper bound: only two terms to focus on (bad selections and collisions), since
\sum_{k \in M\text{-best}} (\mu_k - \mu^*_M)\, (T - \mathbb{E}_{\mu}[T_k(T)]) \le (\mu^*_1 - \mu^*_M) \left( \sum_{k \in M\text{-worst}} \mathbb{E}_{\mu}[T_k(T)] + \sum_{k \in M\text{-best}} \mathbb{E}_{\mu}[C_k(T)] \right).

3. If all M players use MCTopM with kl-UCB, then \forall \mu, \exists G_{M,\mu} such that
R_T(\mu, M) \le G_{M,\mu} \times \log(T) + o(\log T).
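To make the decomposition concrete, here is a small Python sketch that evaluates its three terms from estimated selection and collision counts (a numerical illustration with my own naming, assuming NumPy arrays of per-arm counts; not the authors' code):

```python
import numpy as np

def regret_decomposition(mu, M, selections, collisions):
    """Evaluate the three terms of the regret decomposition above.
    mu: arm means (length K); selections[k] estimates E[T_k(T)];
    collisions[k] estimates E[C_k(T)].  The horizon T is recovered from
    selections, since each of the M players selects one arm per time step."""
    mu = np.asarray(mu, dtype=float)
    selections = np.asarray(selections, dtype=float)
    collisions = np.asarray(collisions, dtype=float)
    order = np.argsort(mu)[::-1]            # arms sorted by decreasing mean
    best, worst = order[:M], order[M:]      # M-best and M-worst arms
    mu_star_M = mu[order[M - 1]]            # M-th best mean, mu*_M
    T = selections.sum() / M
    term_bad = ((mu_star_M - mu[worst]) * selections[worst]).sum()      # bad selections
    term_missed = ((mu[best] - mu_star_M) * (T - selections[best])).sum()
    term_coll = (mu * collisions).sum()                                 # collisions
    return term_bad + term_missed + term_coll
```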

9.1. ILLUSTRATION FOR M = K

[Figure: cumulative centralized regret for M = 9 players, horizon T = 10000, averaged over 200 runs, on 9 Bernoulli arms [B(0.1), B(0.2), B(0.3), B(0.4), B(0.5), B(0.6), B(0.7), B(0.8), B(0.9)]. Curves: 9× RandTopM-klUCB, 9× MCTopM-klUCB, 9× Selfish-klUCB, 9× RhoRand-klUCB. Our lower bound, Anandkumar et al.'s lower bound, and the centralized lower bound all equal 0 × log(t) here.]

Only RandTopM and MCTopM achieve constant regret in this saturated case (proven).

9.2. ILLUSTRATION FOR M < K

[Figure: cumulative centralized regret for M = 6 players, horizon T = 5000, averaged over 500 runs, on K = 9 arms (Bayesian MAB: Bernoulli arms with means drawn uniformly on [0, 1]). Curves: 6× RandTopM-klUCB, 6× MCTopM-klUCB, 6× Selfish-klUCB, 6× RhoRand-klUCB.]

In increasing order of performance: RhoRand < RandTopM < Selfish < MCTopM, i.e., MCTopM obtains the lowest regret.

10. CONCLUSIONS

• Our algorithm MCTopM is uniformly better than all previous proposals.
• Good bounds: O(log T) collisions, arm switches, bad selections, and regret.
• Real-world implementation? Yes, presented at ICT 2018!

11. MAIN REFERENCES, MORE ONLINE → http://lbo.k.vu/JdD2018

[BBM+17] R. Bonnefoi, L. Besson, C. Moy, E. Kaufmann, and J. Palicot (2017). Multi-Armed Bandit Learning in IoT Networks: Learning Helps Even in Non-Stationary Settings. In 12th EAI Conference on Cognitive Radio Oriented Wireless Networks and Communications.

[BK18] L. Besson and E. Kaufmann (April 2018). Multi-Player Bandits Revisited. In Algorithmic Learning Theory, Lanzarote, Spain. URL: https://hal.inria.fr/hal-01629733.

[B18] Simulation code on GitHub.com/SMPyBandits/SMPyBandits, open source (MIT license)!

12. THANKS TO . . .

– Organizers of the Workshop on MAB and Learning Algorithms!
– The ADDI association for the PhD Students Day 2018!
– The SCEE team at IETR, CentraleSupélec (Rennes).
– The SequeL team at Inria (Lille), and CNRS.
– My PhD advisors: Émilie Kaufmann and Christophe Moy.
