HAL Id: hal-01879251
https://hal.archives-ouvertes.fr/hal-01879251
Submitted on 22 Sep 2018
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Memory Bandits: Towards the Switching Bandit Problem Best Resolution
Réda Alami, Odalric-Ambrym Maillard, Raphaël Féraud
To cite this version:
Réda Alami, Odalric-Ambrym Maillard, Raphaël Féraud. Memory Bandits: Towards the Switching Bandit Problem Best Resolution. MLSS 2018 - Machine Learning Summer School, Aug 2018, Madrid, Spain. ⟨hal-01879251⟩
MEMORY BANDITS:
TOWARDS THE SWITCHING BANDIT PROBLEM BEST RESOLUTION

RÉDA ALAMI 1,3, ODALRIC-AMBRYM MAILLARD 2, RAPHAËL FÉRAUD 3
1 INRIA-Saclay (LRI), 2 INRIA Lille (SequeL), 3 Orange Labs
MULTI-ARMED BANDIT

For each step $t = 1, \ldots, T$:
• The player chooses an arm $k_t \in K$
• The reward of arm $k_t$ is revealed: $x_{k_t} \in [0, 1]$
• Bernoulli rewards: $x_{k_t} \sim \mathcal{B}(\mu_{k_t,t})$
Objective: minimize the pseudo-regret $R_T$:

$$R_T = \underbrace{\sum_{t=1}^{T} \mu^\star_t}_{\text{Best policy}} - \underbrace{\mathbb{E}\left[ \sum_{t=1}^{T} x_{k_t} \right]}_{\text{Your policy}}, \qquad \mu^\star_t = \max_k \mu_{k,t}$$
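As a small illustration, the pseudo-regret above can be computed directly from the per-step means (which replace the random rewards once the expectation is taken). This is a minimal sketch; the function name `pseudo_regret` is ours, not from the poster.

```python
import numpy as np

def pseudo_regret(means, chosen_arms):
    """Pseudo-regret: sum over t of the best mean mu*_t minus the mean
    of the arm actually played at t.
    means: (T, K) array of mu_{k,t}; chosen_arms: length-T arm indices."""
    means = np.asarray(means, dtype=float)
    best = means.max(axis=1)                              # mu*_t at each step
    played = means[np.arange(len(means)), chosen_arms]    # mu_{k_t,t}
    return float(np.sum(best - played))

# e.g. pseudo_regret([[0.2, 0.8], [0.9, 0.1]], [1, 1])  # -> 0.8 (0 + 0.8)
```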
SWITCHING ENVIRONMENT

$$\mu_{k,t} = \begin{cases} \mu_{k,t-1} & \text{with probability } 1 - \rho \\ \mu_{\text{new}} \sim \mathcal{U}(0, 1) & \text{with probability } \rho \end{cases}$$

where $\rho$ is the switching rate.
[Figure: arm means $\mu_k$ of arms 1, 2 and 3 over 10000 time steps in a switching environment.]
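The switching dynamics above can be simulated in a few lines. This is a sketch under the reading that each arm switches independently with probability $\rho$ per step; the function name and defaults are illustrative, not from the poster.

```python
import numpy as np

def simulate_switching_means(n_arms=3, horizon=10_000, rho=1e-3, seed=0):
    """Each arm keeps its mean with probability 1 - rho, and redraws it
    from U(0, 1) with probability rho, independently at every step."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform(size=n_arms)            # initial means mu_{k,0}
    means = np.empty((horizon, n_arms))
    for t in range(horizon):
        switches = rng.random(n_arms) < rho  # which arms switch at step t
        mu = np.where(switches, rng.uniform(size=n_arms), mu)
        means[t] = mu
    return means
```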
THOMPSON SAMPLING (TS)

Success and failure counters:

$$\alpha_k = \#(x_k = 1) + \alpha_0, \qquad \beta_k = \#(x_k = 0) + \beta_0$$

At each step $t = 1, \ldots, T$:
1. Characterization: $\theta_k \sim \mathrm{Beta}(\alpha_k, \beta_k)$
2. Decision: $k_t = \arg\max_k \theta_k$
3. Play: $x_{k_t} \sim \mathcal{B}(\mu_{k_t})$
4. Update: $\alpha_{k_t} \leftarrow \alpha_{k_t} + 1$ if $x_{k_t} = 1$; $\beta_{k_t} \leftarrow \beta_{k_t} + 1$ if $x_{k_t} = 0$

Regret bound, matching the Lai and Robbins (1985) lower bound:

$$R_T \leq (1 + \epsilon) \sum_k \frac{\mu^\star - \mu_k}{\mathrm{KL}(\mu_k, \mu^\star)} \left( \log T + \log \log T \right)$$

where $\mathrm{KL}(\cdot, \cdot)$ denotes the Kullback-Leibler divergence.
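The four steps above translate directly into a Beta-Bernoulli sampler. This is a minimal sketch on a stationary bandit, assuming Bernoulli rewards; the function name and signature are ours, not from the poster.

```python
import numpy as np

def thompson_sampling(true_means, horizon, alpha0=1.0, beta0=1.0, seed=0):
    """Beta-Bernoulli Thompson Sampling; returns the total reward."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    alpha = np.full(n_arms, alpha0)   # success counters + prior alpha_0
    beta = np.full(n_arms, beta0)     # failure counters + prior beta_0
    total = 0.0
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)              # 1. characterization
        k = int(np.argmax(theta))                  # 2. decision
        x = float(rng.random() < true_means[k])    # 3. play (Bernoulli reward)
        alpha[k] += x                              # 4. update counters
        beta[k] += 1.0 - x
        total += x
    return total
```

With a fixed horizon the total reward concentrates near the best arm's mean times the horizon, up to the logarithmic regret of the bound above.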
REFERENCES

R. P. Adams and D. J. C. MacKay. Bayesian online changepoint detection. arXiv, 2007.
J. Mellor and J. Shapiro. Thompson Sampling in switching environments with Bayesian online change point detection. AISTATS, 2013.
GLOBAL SWITCHING TS WITH BAYESIAN AGGREGATION

Learning with a growing number of Thompson Sampling experts $f_{i,t}$, where $i$ denotes the starting time and $t$ the current time. $P(f_{i,t})$ is the weight at time $t$ of the Thompson Sampling expert started at time $i$.

Initialization: $P(f_{1,1}) = 1$, $t = 1$, and $\forall k \in K$: $\alpha_{k,f_{1,1}} = \alpha_0$, $\beta_{k,f_{1,1}} = \beta_0$.

-1- Decision process: at each time $t$:
• $\forall i \leq t$, $\forall k$: $\theta_{k,f_{i,t}} \sim \mathrm{Beta}(\alpha_{k,f_{i,t}}, \beta_{k,f_{i,t}})$
• Play (Bayesian Aggregation): $k_t = \arg\max_k \sum_{i \leq t} P(f_{i,t})\, \theta_{k,f_{i,t}}$

-2- Instantaneous gain update: $\forall i \leq t$:

$$P(x_t \mid f_{i,t}) = \begin{cases} \dfrac{\alpha_{k_t,f_{i,t}}}{\alpha_{k_t,f_{i,t}} + \beta_{k_t,f_{i,t}}} & \text{if } x_{k_t} = 1 \\[2ex] \dfrac{\beta_{k_t,f_{i,t}}}{\alpha_{k_t,f_{i,t}} + \beta_{k_t,f_{i,t}}} & \text{if } x_{k_t} = 0 \end{cases}$$

-3- Arm hyperparameters update: $\forall i \leq t$: $\alpha_{k_t,f_{i,t}} \leftarrow \alpha_{k_t,f_{i,t}} + 1$ if $x_{k_t} = 1$; $\beta_{k_t,f_{i,t}} \leftarrow \beta_{k_t,f_{i,t}} + 1$ if $x_{k_t} = 0$
-4- Distribution of experts update:
• Update previous experts: $P(f_{i,t+1}) \propto (1 - \rho) \cdot P(x_t \mid f_{i,t}) \cdot P(f_{i,t})$, $\forall i \leq t$
• Create new expert $f_{t+1,t+1}$: $P(f_{t+1,t+1}) \propto \rho \sum_{i=1}^{t} P(x_t \mid f_{i,t}) \cdot P(f_{i,t})$
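Steps -1- to -4- can be sketched as a single agent class. This is an illustrative implementation under our own naming (`GlobalSwitchingTS`, `select_arm`, `update` are not from the poster); the proportionality in step -4- is resolved by normalizing the expert weights each round.

```python
import numpy as np

class GlobalSwitchingTS:
    """Thompson Sampling with a growing pool of experts f_{i,t} combined
    by Bayesian aggregation (sketch of the poster's steps -1- to -4-)."""

    def __init__(self, n_arms, rho, alpha0=1.0, beta0=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.n_arms, self.rho = n_arms, rho
        self.alpha0, self.beta0 = alpha0, beta0
        # one row per expert; the single initial expert starts at time 1
        self.alpha = np.full((1, n_arms), alpha0)
        self.beta = np.full((1, n_arms), beta0)
        self.weights = np.array([1.0])   # P(f_{i,t}), kept normalized

    def select_arm(self):
        # -1- sample theta_{k,f_{i,t}} per expert, then aggregate by weight
        theta = self.rng.beta(self.alpha, self.beta)   # (experts, arms)
        return int(np.argmax(self.weights @ theta))

    def update(self, k, x):
        a, b = self.alpha[:, k], self.beta[:, k]
        # -2- instantaneous gain P(x_t | f_{i,t}) under each expert
        lik = a / (a + b) if x == 1 else b / (a + b)
        # -3- hyperparameter update of the played arm for every expert
        if x == 1:
            self.alpha[:, k] += 1.0
        else:
            self.beta[:, k] += 1.0
        # -4- reweight old experts, create a new expert, renormalize
        old = (1.0 - self.rho) * lik * self.weights
        new = self.rho * np.sum(lik * self.weights)
        self.weights = np.append(old, new)
        self.weights /= self.weights.sum()
        self.alpha = np.vstack([self.alpha, np.full(self.n_arms, self.alpha0)])
        self.beta = np.vstack([self.beta, np.full(self.n_arms, self.beta0)])
```

A usage loop alternates `k = agent.select_arm()` with `agent.update(k, x)` after observing the Bernoulli reward `x`; the expert pool grows by one per step, which is the source of the method's memory footprint.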