HAL Id: hal-01879251
https://hal.archives-ouvertes.fr/hal-01879251
Submitted on 22 Sep 2018
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Memory Bandits: Towards the Switching Bandit Problem Best Resolution
Réda Alami, Odalric-Ambrym Maillard, Raphaël Féraud
To cite this version:
Réda Alami, Odalric-Ambrym Maillard, Raphaël Féraud. Memory Bandits: Towards the Switching Bandit Problem Best Resolution. MLSS 2018 - Machine Learning Summer School, Aug 2018, Madrid, Spain. ⟨hal-01879251⟩
MEMORY BANDITS:
TOWARDS THE SWITCHING BANDIT PROBLEM BEST RESOLUTION

RÉDA ALAMI 1,3, ODALRIC-AMBRYM MAILLARD 2, RAPHAËL FÉRAUD 3
1 INRIA-Saclay (LRI), 2 INRIA Lille (SequeL), 3 Orange Labs
MULTI-ARMED BANDIT

For each step $t = 1, \ldots, T$:
• The player chooses an arm $k_t \in K$
• The reward of arm $k_t$ is revealed: $x_{k_t} \in [0, 1]$
• Bernoulli rewards: $x_{k_t} \sim \mathcal{B}(\mu_{k_t,t})$
Objective: minimize the pseudo-regret $R_T$:

$$R_T = \underbrace{\sum_{t=1}^{T} \mu^\star_t}_{\text{Best policy}} - \underbrace{\mathbb{E}\left[ \sum_{t=1}^{T} x_{k_t} \right]}_{\text{Your policy}}, \qquad \mu^\star_t = \max_k \mu_{k,t}$$
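As a small illustration, the pseudo-regret above can be computed directly from the per-step means (which replace the random rewards once the expectation is taken). This is a minimal sketch; the function name `pseudo_regret` is ours, not from the poster.

```python
import numpy as np

def pseudo_regret(means, chosen_arms):
    """Pseudo-regret: sum over t of the best mean mu*_t minus the mean
    of the arm actually played at t.
    means: (T, K) array of mu_{k,t}; chosen_arms: length-T arm indices."""
    means = np.asarray(means, dtype=float)
    best = means.max(axis=1)                              # mu*_t at each step
    played = means[np.arange(len(means)), chosen_arms]    # mu_{k_t,t}
    return float(np.sum(best - played))

# e.g. pseudo_regret([[0.2, 0.8], [0.9, 0.1]], [1, 1])  # -> 0.8 (0 + 0.8)
```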
SWITCHING ENVIRONMENT

$$\mu_{k,t} = \begin{cases} \mu_{k,t-1} & \text{with probability } 1 - \rho \\ \mu_{\text{new}} \sim \mathcal{U}(0, 1) & \text{with probability } \rho \end{cases}$$

where $\rho$ is the switching rate.
[Figure: arm means $\mu_k$ of arms 1, 2 and 3 over 10000 time steps in a switching environment.]
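The switching dynamics above can be simulated in a few lines. This is a sketch under the reading that each arm switches independently with probability $\rho$ per step; the function name and defaults are illustrative, not from the poster.

```python
import numpy as np

def simulate_switching_means(n_arms=3, horizon=10_000, rho=1e-3, seed=0):
    """Each arm keeps its mean with probability 1 - rho, and redraws it
    from U(0, 1) with probability rho, independently at every step."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform(size=n_arms)            # initial means mu_{k,0}
    means = np.empty((horizon, n_arms))
    for t in range(horizon):
        switches = rng.random(n_arms) < rho  # which arms switch at step t
        mu = np.where(switches, rng.uniform(size=n_arms), mu)
        means[t] = mu
    return means
```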
THOMPSON SAMPLING (TS)

Success and failure counters:

$$\alpha_k = \#(x_k = 1) + \alpha_0, \qquad \beta_k = \#(x_k = 0) + \beta_0$$

At each step $t = 1, \ldots, T$:
1. Characterization: $\theta_k \sim \mathrm{Beta}(\alpha_k, \beta_k)$
2. Decision: $k_t = \arg\max_k \theta_k$
3. Play: $x_{k_t} \sim \mathcal{B}(\mu_{k_t})$
4. Update: $\alpha_{k_t} \leftarrow \alpha_{k_t} + 1$ if $x_{k_t} = 1$; $\beta_{k_t} \leftarrow \beta_{k_t} + 1$ if $x_{k_t} = 0$

Regret bound, matching the Lai and Robbins (1985) lower bound:

$$R_T \leq (1 + \epsilon) \sum_k \frac{\mu^\star - \mu_k}{\mathrm{KL}(\mu_k, \mu^\star)} \left( \log T + \log \log T \right)$$

where $\mathrm{KL}(\cdot, \cdot)$ denotes the Kullback-Leibler divergence.
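The four steps above translate directly into a Beta-Bernoulli sampler. This is a minimal sketch on a stationary bandit, assuming Bernoulli rewards; the function name and signature are ours, not from the poster.

```python
import numpy as np

def thompson_sampling(true_means, horizon, alpha0=1.0, beta0=1.0, seed=0):
    """Beta-Bernoulli Thompson Sampling; returns the total reward."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    alpha = np.full(n_arms, alpha0)   # success counters + prior alpha_0
    beta = np.full(n_arms, beta0)     # failure counters + prior beta_0
    total = 0.0
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)              # 1. characterization
        k = int(np.argmax(theta))                  # 2. decision
        x = float(rng.random() < true_means[k])    # 3. play (Bernoulli reward)
        alpha[k] += x                              # 4. update counters
        beta[k] += 1.0 - x
        total += x
    return total
```

With a fixed horizon the total reward concentrates near the best arm's mean times the horizon, up to the logarithmic regret of the bound above.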
REFERENCES

R. P. Adams and D. J. C. MacKay. Bayesian online changepoint detection. arXiv, 2007.
J. Mellor and J. Shapiro. Thompson Sampling in switching environments with Bayesian online change point detection. AISTATS, 2013.
GLOBAL SWITCHING TS WITH BAYESIAN AGGREGATION

Learning with a growing number of Thompson Sampling experts $f_{i,t}$, where $i$ denotes the starting time and $t$ the current time. $P(f_{i,t})$ is the weight at time $t$ of the Thompson Sampling expert started at time $i$.

Initialization: $P(f_{1,1}) = 1$, $t = 1$, and $\forall k \in K$: $\alpha_{k,f_{1,1}} = \alpha_0$, $\beta_{k,f_{1,1}} = \beta_0$.

-1- Decision process: at each time $t$:
• $\forall i \leq t$, $\forall k$: $\theta_{k,f_{i,t}} \sim \mathrm{Beta}(\alpha_{k,f_{i,t}}, \beta_{k,f_{i,t}})$
• Play (Bayesian Aggregation): $k_t = \arg\max_k \sum_{i \leq t} P(f_{i,t})\, \theta_{k,f_{i,t}}$

-2- Instantaneous gain update: $\forall i \leq t$:

$$P(x_t \mid f_{i,t}) = \begin{cases} \dfrac{\alpha_{k_t,f_{i,t}}}{\alpha_{k_t,f_{i,t}} + \beta_{k_t,f_{i,t}}} & \text{if } x_{k_t} = 1 \\[2ex] \dfrac{\beta_{k_t,f_{i,t}}}{\alpha_{k_t,f_{i,t}} + \beta_{k_t,f_{i,t}}} & \text{if } x_{k_t} = 0 \end{cases}$$

-3- Arm hyperparameters update: $\forall i \leq t$: $\alpha_{k_t,f_{i,t}} \leftarrow \alpha_{k_t,f_{i,t}} + 1$ if $x_{k_t} = 1$; $\beta_{k_t,f_{i,t}} \leftarrow \beta_{k_t,f_{i,t}} + 1$ if $x_{k_t} = 0$
-4- Distribution of experts update:
• Update previous experts: $P(f_{i,t+1}) \propto (1 - \rho) \cdot P(x_t \mid f_{i,t}) \cdot P(f_{i,t})$, $\forall i \leq t$
• Create new expert $f_{t+1,t+1}$: $P(f_{t+1,t+1}) \propto \rho \sum_{i=1}^{t} P(x_t \mid f_{i,t}) \cdot P(f_{i,t})$
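Steps -1- to -4- can be sketched as a single agent class. This is an illustrative implementation under our own naming (`GlobalSwitchingTS`, `select_arm`, `update` are not from the poster); the proportionality in step -4- is resolved by normalizing the expert weights each round.

```python
import numpy as np

class GlobalSwitchingTS:
    """Thompson Sampling with a growing pool of experts f_{i,t} combined
    by Bayesian aggregation (sketch of the poster's steps -1- to -4-)."""

    def __init__(self, n_arms, rho, alpha0=1.0, beta0=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.n_arms, self.rho = n_arms, rho
        self.alpha0, self.beta0 = alpha0, beta0
        # one row per expert; the single initial expert starts at time 1
        self.alpha = np.full((1, n_arms), alpha0)
        self.beta = np.full((1, n_arms), beta0)
        self.weights = np.array([1.0])   # P(f_{i,t}), kept normalized

    def select_arm(self):
        # -1- sample theta_{k,f_{i,t}} per expert, then aggregate by weight
        theta = self.rng.beta(self.alpha, self.beta)   # (experts, arms)
        return int(np.argmax(self.weights @ theta))

    def update(self, k, x):
        a, b = self.alpha[:, k], self.beta[:, k]
        # -2- instantaneous gain P(x_t | f_{i,t}) under each expert
        lik = a / (a + b) if x == 1 else b / (a + b)
        # -3- hyperparameter update of the played arm for every expert
        if x == 1:
            self.alpha[:, k] += 1.0
        else:
            self.beta[:, k] += 1.0
        # -4- reweight old experts, create a new expert, renormalize
        old = (1.0 - self.rho) * lik * self.weights
        new = self.rho * np.sum(lik * self.weights)
        self.weights = np.append(old, new)
        self.weights /= self.weights.sum()
        self.alpha = np.vstack([self.alpha, np.full(self.n_arms, self.alpha0)])
        self.beta = np.vstack([self.beta, np.full(self.n_arms, self.beta0)])
```

A usage loop alternates `k = agent.select_arm()` with `agent.update(k, x)` after observing the Bernoulli reward `x`; the expert pool grows by one per step, which is the source of the method's memory footprint.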