HAL Id: hal-02013847
https://hal.inria.fr/hal-02013847
Submitted on 11 Feb 2019
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Multi-Player Bandits Revisited
Lilian Besson
To cite this version:
Lilian Besson. Multi-Player Bandits Revisited. Séminaire “IETR : Interagir Evaluer Transmettre Réunir”, Jun 2018, Vannes, France. ⟨hal-02013847⟩
Multi-Player Bandits Revisited
Workshop on MAB and Learning Algorithms, May 2018, Rotterdam (NL)
PhD Student Day, June 2018, Vannes (Fr)
By: Lilian Besson, Lilian.Besson@CentraleSupelec.fr
SCEE Team, CentraleSupélec Rennes, IETR & SequeL Team, Inria Lille
1. INTRODUCTION & GOAL
Goal: insert objects in a wireless network, keep a good Quality of Service.
• Hypothesis: object j chooses a channel $A^j(t) \in \{1, \dots, K\}$ to communicate on at time t.
• Idea: use on-line Machine Learning algorithms?
• Not so easy: each device takes its own decisions, without central control or communication, and has light CPU/memory, etc.
• ⟹ Solution: Decentralized MAB algorithms!
2. MODEL: TIME/FREQUENCY PROTOCOL
Fixed-duration communication. Channels can be free or busy.
• K RF channels (of same bandwidth), e.g., K = 9.
• Primary users create a background traffic.
• Channel k is free with mean $\mu_k$ (busy otherwise), where $\mu_1, \dots, \mu_K \in [0, 1]$.
• Sensing gives binary feedback $Y_{k,t} \sim B(\mu_k)$.
M PLAYERS IN THE NETWORK
• $1 \le M \le K$ secondary users, $j \in \{1, \dots, M\}$.
• Base station replies if packet is received.
• Collision indicator $C^j(t) = \mathbb{1}(\exists j' \neq j, A^j(t) = A^{j'}(t))$.
• Non-stochastic 0/1 reward $r^j(t) := Y_{A^j(t),t} \times \overline{C^j(t)}$, with $\overline{C^j(t)} = 1 - C^j(t)$.
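One time step of this model can be simulated in a few lines. The sketch below is only an illustration: the function name `play_round` and its arguments are hypothetical, channels are indexed from 0, and it assumes that a sensing value of 1 means "free" and that a player is rewarded only when its channel is free and no other player chose it.

```python
import random


def play_round(mu, choices, rng):
    """Simulate one time step for all players.

    mu[k]      -- probability that channel k is sensed free (assumed reading)
    choices[j] -- channel chosen by player j at this step
    Returns (sensing, collisions, rewards), one entry per player.
    """
    K = len(mu)
    # One Bernoulli draw per channel: 1 means the channel is free.
    free = [1 if rng.random() < mu[k] else 0 for k in range(K)]
    # Count how many players picked each channel.
    load = [0] * K
    for k in choices:
        load[k] += 1
    sensing = [free[k] for k in choices]
    # A player collides when at least one other player is on its channel.
    collisions = [1 if load[k] > 1 else 0 for k in choices]
    # Reward: channel sensed free AND no collision.
    rewards = [s * (1 - c) for s, c in zip(sensing, collisions)]
    return sensing, collisions, rewards


if __name__ == "__main__":
    rng = random.Random(0)
    print(play_round([0.1, 0.5, 0.9], [2, 2, 1], rng))
```

Drawing the sensing value once per channel, rather than once per player, means that two players on the same channel see the same feedback, which is how the model above reads.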
3. PREVIOUS WORKS (SINCE 2009)
Ideas: 1) use empirical means to learn the M best arms ($\mu^*_1, \dots, \mu^*_M$), and 2) each user orthogonalizes on this set.
• RhoRand: uses an index policy (UCB) and random ranks [Anandkumar et al, 2011].
• MEGA: uses a simple ε-greedy algorithm (hard to tune, and not so efficient) [Avner & Mannor, 2015].
• Musical Chair: pure exploration, then random hopping until convergence on the M best arms [Shamir et al, 2016].
• Some others: TDFS uses time sharing, RhoLearn uses BayesUCB to learn the ranks, etc.
4. OUR PROPOSAL
1) kl-UCB > UCB1 for best arms identification.
2) New algorithm MCTopM for orthogonalization:
– Estimate the set of the M best arms, $\widehat{M}^j(t)$,
– Randomly play in $\widehat{M}^j(t)$ until fixed on a good one (no collision: fixed on a “chair”).
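5. UCB1 AND kl-UCB INDEXES
Let $T_k^j(t)$ be the number of selections of channel k by player j, and $X_k^j(t)$ the sum of its sensing feedback on that channel, so that $X_k^j(t) / T_k^j(t)$ is the empirical mean of free sensing.
UCB1: $g_k^j(t) := \frac{X_k^j(t)}{T_k^j(t)} + \sqrt{\frac{\log(t)}{2 T_k^j(t)}}$
kl-UCB: $g_k^j(t) := \sup_{q \in [0,1]} \left\{ q : \mathrm{kl}\left( \frac{X_k^j(t)}{T_k^j(t)}, q \right) \le \frac{\log(t)}{T_k^j(t)} \right\}$
[Garivier & Cappé, 2011], [Cappé et al, 2013]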
6. MCTopM ALGORITHM
$A^j(1) \sim U(\{1, \dots, K\})$, $C^j(1) = s^j(1) = \text{False}$
for $t = 0, \dots, T - 1$ do
    if $A^j(t) \notin \widehat{M}^j(t)$ then  // transition (3) or (5)
        $A^j(t+1) \sim U\left( \widehat{M}^j(t) \cap \{ k : g_k^j(t-1) \le g_{A^j(t)}^j(t-1) \} \right)$  // arm with smaller UCB at $t - 1$
        $s^j(t+1) = \text{False}$
    else if $C^j(t)$ and $\overline{s^j(t)}$ then  // transition (2): collision while not fixed
        $A^j(t+1) \sim U(\widehat{M}^j(t))$
        $s^j(t+1) = \text{False}$
    else  // transition (1) or (4)
        $A^j(t+1) = A^j(t)$  // same arm
        $s^j(t+1) = \text{True}$  // on “chair”
    end
    Play arm $A^j(t+1)$, get new observations,
    Compute the indexes $g_k^j(t+1)$, and set $\widehat{M}^j(t+1)$ for the next step.
end
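The decision step of the algorithm above can be sketched as a pure function. This is an illustrative sketch under stated assumptions, not the reference implementation (see SMPyBandits for that): the names `top_m` and `mctopm_step` are hypothetical, arms are indexed from 0, a colliding player resamples only when it is not yet fixed on a chair, and the fallback for an empty candidate set is an addition of this example.

```python
import random


def top_m(indexes, M):
    """Estimated set of the M best arms: the M largest indexes (ties broken by arm number)."""
    order = sorted(range(len(indexes)), key=lambda k: (-indexes[k], k))
    return set(order[:M])


def mctopm_step(arm, collided, fixed, est_best, prev_indexes, rng):
    """One MCTopM decision for a single player.

    arm          -- arm played at time t
    collided     -- True if the player collided at time t
    fixed        -- True if the player is fixed on a chair at time t
    est_best     -- estimated set of the M best arms at time t
    prev_indexes -- index of every arm at time t - 1
    Returns (next_arm, next_fixed).
    """
    if arm not in est_best:
        # Transition (3) or (5): move to an arm of the estimated set whose
        # previous index was not larger than the one of the current arm.
        candidates = [k for k in sorted(est_best)
                      if prev_indexes[k] <= prev_indexes[arm]]
        if not candidates:
            # Fallback, assumption of this sketch only.
            candidates = sorted(est_best)
        return rng.choice(candidates), False
    if collided and not fixed:
        # Transition (2): collision while not fixed, resample in the set.
        return rng.choice(sorted(est_best)), False
    # Transition (1) or (4): keep the same arm, sit (or stay) on the chair.
    return arm, True


if __name__ == "__main__":
    rng = random.Random(1)
    indexes = [0.1, 0.9, 0.5, 0.7]
    print(mctopm_step(1, True, False, top_m(indexes, 2), indexes, rng))
```

Keeping the step free of side effects makes the five transitions easy to check one by one: a fixed player ignores collisions, and only leaving the estimated set can unseat it.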
7. ILLUSTRATION OF MCTopM
[Figure: state machine of player j, with two states, “Not fixed, $\overline{s^j(t)}$” and “Fixed, $s^j(t)$”.]
(0) Start at t = 0, in state “Not fixed”.
(1) $\overline{C^j(t)}$, $A^j(t) \in \widehat{M}^j(t)$: from “Not fixed” to “Fixed”.
(2) $C^j(t)$, $A^j(t) \in \widehat{M}^j(t)$: stay “Not fixed”.
(3) $A^j(t) \notin \widehat{M}^j(t)$: stay “Not fixed”.
(4) $A^j(t) \in \widehat{M}^j(t)$: stay “Fixed”.
(5) $A^j(t) \notin \widehat{M}^j(t)$: from “Fixed” to “Not fixed”.
8. THEOREMS
Let $T_k(T) := \sum_{j=1}^{M} T_k^j(T)$ be the total number of selections of arm k, and let $C_k(T)$ count the collisions on arm k.
1. Regret:
$R_T(\mu, M) := \mathbb{E}_\mu\left[ \sum_{t=1}^{T} \sum_{j=1}^{M} \left( \mu^*_j - r^j(t) \right) \right] = \left( \sum_{k=1}^{M} \mu^*_k \right) T - \mathbb{E}_\mu\left[ \sum_{t=1}^{T} \sum_{j=1}^{M} r^j(t) \right]$
$R_T = \sum_{k \in M\text{-worst}} (\mu^*_M - \mu_k) \, \mathbb{E}_\mu[T_k(T)] + \sum_{k \in M\text{-best}} (\mu_k - \mu^*_M) \left( T - \mathbb{E}_\mu[T_k(T)] \right) + \sum_{k=1}^{K} \mu_k \, \mathbb{E}_\mu[C_k(T)]$.
2. Lemma for upper-bound: only two terms to focus on (bad selections and collisions).
$\sum_{k \in M\text{-best}} (\mu_k - \mu^*_M) \left( T - \mathbb{E}_\mu[T_k(T)] \right) \le (\mu^*_1 - \mu^*_M) \left( \sum_{k \in M\text{-worst}} \mathbb{E}_\mu[T_k(T)] + \sum_{k \in M\text{-best}} \mathbb{E}_\mu[C_k(T)] \right)$.
3. If all M players use MCTopM with kl-UCB, then $\forall \mu, \exists G_{M,\mu}$ such that $R_T(\mu, M) \le G_{M,\mu} \times \log(T) + o(\log T)$.
9.1. ILLUSTRATION FOR M = K
[Figure: Multi-players M = 9: cumulated centralized regret, averaged 200 times. 9 arms: [B(0.1)*, B(0.2)*, B(0.3)*, B(0.4)*, B(0.5)*, B(0.6)*, B(0.7)*, B(0.8)*, B(0.9)*]. x-axis: time steps t = 1..T, horizon T = 10000. y-axis: cumulative centralized regret. Curves: 9× RandTopM-KLUCB, 9× MCTopM-KLUCB, 9× Selfish-KLUCB, 9× RhoRand-KLUCB. Our lower-bound, Anandkumar et al.'s lower-bound and the centralized lower-bound are all 0 log(t).]
Only RandTopM and MCTopM achieve constant regret in this saturated case (proven).
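The three terms of the regret decomposition of Section 8 can be evaluated from (expected) counts, which also shows why zero regret is attainable when M = K. The sketch below is only an illustration: the function name is hypothetical, arms are indexed from 0, and `collisions[k]` is assumed to count colliding plays (one per player involved) on arm k.

```python
def regret_decomposition(mu, M, T, selections, collisions):
    """Return the three terms (bad selections, missed best arms, collisions).

    mu[k]         -- mean of arm k
    selections[k] -- expected total number of selections of arm k by all players
    collisions[k] -- expected number of colliding plays on arm k (assumed meaning)
    """
    K = len(mu)
    order = sorted(range(K), key=lambda k: -mu[k])
    best, worst = order[:M], order[M:]
    mu_star_M = mu[best[-1]]  # mean of the M-th best arm
    bad_selections = sum((mu_star_M - mu[k]) * selections[k] for k in worst)
    missed_best = sum((mu[k] - mu_star_M) * (T - selections[k]) for k in best)
    collision_cost = sum(mu[k] * collisions[k] for k in range(K))
    return bad_selections, missed_best, collision_cost


if __name__ == "__main__":
    # Saturated case M = K: every arm is played at every step, without collision.
    print(regret_decomposition([0.1, 0.5, 0.9], 3, 100, [100, 100, 100], [0, 0, 0]))
```

When the selections sum to M × T, the three terms add up to the direct definition of the regret, that is, T times the sum of the M best means minus the expected collected rewards.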
9.2. ILLUSTRATION FOR M < K
[Figure: Multi-players M = 6: cumulated centralized regret, averaged 500 times. 9 arms: Bayesian MAB, Bernoulli with means on [0,1]. x-axis: time steps t = 1..T, horizon T = 5000. y-axis: cumulative centralized regret. Curves: 6× RandTopM-KLUCB, 6× MCTopM-KLUCB, 6× Selfish-KLUCB, 6× RhoRand-KLUCB.]
Ranking, from worst to best: RhoRand < RandTopM < Selfish < MCTopM.
10. CONCLUSIONS
• Our algorithm MCTopM is uniformly better than all previous proposals.
• Good bounds: log(T) collisions, arm switches, bad selections and regret.
• Real-world implementation? ↪ Yes, presented at ICT 2018!
11. MAIN REFERENCES
MORE ON-LINE → http://lbo.k.vu/JdD2018
[BBM+17] R. Bonnefoi, L. Besson, C. Moy, E. Kaufmann, and J. Palicot (2017). Multi-Armed Bandit Learning in IoT Networks: Learning helps even in non-stationary settings. In 12th EAI Conference on Cognitive Radio Oriented Wireless Network and Communication.
[BK18] L. Besson and E. Kaufmann (April 2018). Multi-Player Bandits Revisited. In Algorithmic Learning Theory. Lanzarote, Spain. URL https://hal.inria.fr/hal-01629733.
[B18] Simulation code on GitHub.com/SMPyBandits/SMPyBandits, open source (MIT license)!
12. THANKS TO...
– Organizers of the Workshop on MAB and Learning Algorithms!
– ADDI association for the PhD Students Day 2018!
– SCEE team at IETR, CentraleSupélec (Rennes).
– SequeL team at Inria (Lille), and CNRS.
– My PhD advisors: Émilie Kaufmann, Christophe Moy.