
Memory Bandits: Towards the Switching Bandit Problem Best Resolution


HAL Id: hal-01879251

https://hal.archives-ouvertes.fr/hal-01879251

Submitted on 22 Sep 2018

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.



Réda Alami, Odalric-Ambrym Maillard, Raphaël Féraud

To cite this version:

Réda Alami, Odalric-Ambrym Maillard, Raphaël Féraud. Memory Bandits: Towards the Switching Bandit Problem Best Resolution. MLSS 2018 - Machine Learning Summer School, Aug 2018, Madrid, Spain. ⟨hal-01879251⟩

MEMORY BANDITS: TOWARDS THE SWITCHING BANDIT PROBLEM BEST RESOLUTION

Réda Alami (1,3), Odalric-Ambrym Maillard (2), Raphaël Féraud (3)

(1) INRIA Saclay (LRI), (2) INRIA Lille (SequeL), (3) Orange Labs

MULTI-ARMED BANDIT

For each step $t = 1, \dots, T$:

• The player chooses an arm $k_t \in \mathcal{K}$.

• The reward of arm $k_t$ is revealed: $x_{k_t} \in [0, 1]$.

• Bernoulli rewards: $x_{k_t} \sim \mathcal{B}(\mu_{k_t,t})$.

Objective: minimize the pseudo-regret $R_T$:

$$R_T = \underbrace{\sum_{t=1}^{T} \mu_t^\star}_{\text{Best policy}} - \underbrace{\mathbb{E}\left[\sum_{t=1}^{T} x_{k_t}\right]}_{\text{Your policy}}, \qquad \mu_t^\star = \max_k \mu_{k,t}$$
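As a rough illustration of the objective above, the sketch below accumulates the per-step gap $\mu_t^\star - \mu_{k_t,t}$ for one run; averaging this quantity over many independent runs would estimate $R_T$. The function name `pseudo_regret` and the array layout (`means[t, k]` holding $\mu_{k,t}$) are assumptions made for this example, not something given on the poster.

```python
import numpy as np

def pseudo_regret(means, arms):
    """Cumulative pseudo-regret of a single run (illustrative helper).

    means: array of shape (T, K), means[t, k] = mu_{k,t}  (assumed layout)
    arms:  integer array of shape (T,), arms[t] = k_t
    """
    means = np.asarray(means, dtype=float)
    arms = np.asarray(arms, dtype=int)
    best = means.max(axis=1)                      # mu*_t = max_k mu_{k,t}
    played = means[np.arange(len(arms)), arms]    # mu_{k_t,t}
    return np.cumsum(best - played)               # running sum of the gaps

# Toy example: two arms, the best arm changes at the last step.
means = np.array([[0.2, 0.8], [0.2, 0.8], [0.9, 0.1]])
arms = np.array([0, 1, 1])
print(pseudo_regret(means, arms))
```

The helper replaces each observed reward by its mean, which removes the reward noise but keeps the randomness of the arm choices.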

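SWITCHING ENVIRONMENT

$$\mu_{k,t} = \begin{cases} \mu_{k,t-1} & \text{with probability } 1 - \rho \\ \mu_{\text{new}} \sim \mathcal{U}(0, 1) & \text{with probability } \rho \end{cases}$$

where $\rho$ is the switching rate.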

[Figure: means µ_k of arm 1, arm 2 and arm 3 as a function of the time step (0 to 10000).]
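The switching model above can be simulated in a few lines. The sketch below is only one possible reading: the poster does not say how the initial means are chosen (here they are drawn from $\mathcal{U}(0,1)$), nor whether all arms switch together. The flag `global_switch`, like the function name `simulate_switching_means`, is a hypothetical name introduced for this example; `True` redraws all arms at a common switch time, `False` lets each arm switch independently.

```python
import numpy as np

def simulate_switching_means(T, K, rho, rng=None, global_switch=True):
    """Return a (T, K) array of means mu_{k,t} (illustrative sketch)."""
    rng = np.random.default_rng(rng)
    means = np.empty((T, K))
    means[0] = rng.uniform(0.0, 1.0, size=K)   # assumption: initial means ~ U(0, 1)
    for t in range(1, T):
        if global_switch:
            # One coin flip with probability rho decides for all arms at once.
            switch = np.full(K, rng.random() < rho)
        else:
            # Each arm flips its own coin with probability rho.
            switch = rng.random(K) < rho
        fresh = rng.uniform(0.0, 1.0, size=K)  # candidate mu_new ~ U(0, 1)
        means[t] = np.where(switch, fresh, means[t - 1])
    return means

means = simulate_switching_means(T=10000, K=3, rho=0.001, rng=0)
print(means.shape)
```

With a small switching rate the trajectories are piecewise constant, which is the shape of the curves in the figure.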

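THOMPSON SAMPLING (TS)

Success counter: $\alpha_k = \#(x_k = 1) + \alpha_0$

Failure counter: $\beta_k = \#(x_k = 0) + \beta_0$

At each step $t = 1, \dots, T$:

1. Characterization: $\theta_k \sim \mathrm{Beta}(\alpha_k, \beta_k)$

2. Decision: $k_t = \arg\max_k \theta_k$

3. Play: $x_{k_t} \sim \mathcal{B}(\mu_{k_t})$

4. Update:

$$\begin{cases} \alpha_{k_t} = \alpha_{k_t} + 1 & \text{if } x_{k_t} = 1 \\ \beta_{k_t} = \beta_{k_t} + 1 & \text{if } x_{k_t} = 0 \end{cases}$$

$$R_T \le (1 + \varepsilon) \sum_k \frac{\mu^\star - \mu_k}{\mathrm{KL}(\mu_k, \mu^\star)} \left(\log T + \log\log T\right)$$

(Lai and Robbins (1985) lower bound)

$\mathrm{KL}(\cdot, \cdot)$ = Kullback-Leibler divergence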
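The four steps of Thompson Sampling can be sketched as follows for a stationary Bernoulli bandit. This is a minimal illustration rather than the authors' code: the function name `thompson_sampling`, the default prior $\alpha_0 = \beta_0 = 1$ and the use of NumPy are all assumptions of this example.

```python
import numpy as np

def thompson_sampling(mu, T, alpha0=1.0, beta0=1.0, rng=None):
    """Run Beta-Bernoulli Thompson Sampling for T steps (illustrative sketch).

    mu: true arm means, used only to simulate the rewards.
    Returns the chosen arms and the final (alpha, beta) counters.
    """
    rng = np.random.default_rng(rng)
    mu = np.asarray(mu, dtype=float)
    K = len(mu)
    alpha = np.full(K, alpha0)   # success counters, prior included
    beta = np.full(K, beta0)     # failure counters, prior included
    arms = np.empty(T, dtype=int)
    for t in range(T):
        theta = rng.beta(alpha, beta)   # 1. characterization: one sample per arm
        k = int(np.argmax(theta))       # 2. decision: arm with the largest sample
        x = rng.random() < mu[k]        # 3. play: Bernoulli reward of mean mu[k]
        if x:                           # 4. update the played arm only
            alpha[k] += 1.0
        else:
            beta[k] += 1.0
        arms[t] = k
    return arms, alpha, beta

arms, alpha, beta = thompson_sampling([0.1, 0.9], T=2000, rng=0)
print(np.bincount(arms, minlength=2))
```

Because the counters never forget old observations, this stationary version adapts slowly once the means switch, which is the situation the expert-aggregation scheme is meant to handle.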


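GLOBAL SWITCHING TS WITH BAYESIAN AGGREGATION

Learning with a growing number of Thompson Sampling experts $f_{i,t}$, where $i$ denotes the starting time and $t$ the current time. $P(f_{i,t})$ is the weight at time $t$ of the Thompson Sampling expert started at time $i$.

Initialization: $P(f_{1,1}) = 1$, $t = 1$, and $\forall k \in \mathcal{K}$: $\alpha_{k,f_{1,1}} = \alpha_0$, $\beta_{k,f_{1,1}} = \beta_0$.

-1- Decision process: at each time $t$:

• $\forall i \le t$, $\forall k$: $\theta_{k,f_{i,t}} \sim \mathrm{Beta}\left(\alpha_{k,f_{i,t}}, \beta_{k,f_{i,t}}\right)$

• Play (Bayesian Aggregation):

$$k_t = \arg\max_k \sum_{i \le t} P(f_{i,t}) \, \theta_{k,f_{i,t}}$$

-2- Instantaneous gain update:

$$\forall i \le t: \quad P(x_t \mid f_{i,t}) = \begin{cases} \dfrac{\alpha_{k_t,f_{i,t}}}{\alpha_{k_t,f_{i,t}} + \beta_{k_t,f_{i,t}}} & \text{if } x_{k_t} = 1 \\[2ex] \dfrac{\beta_{k_t,f_{i,t}}}{\alpha_{k_t,f_{i,t}} + \beta_{k_t,f_{i,t}}} & \text{if } x_{k_t} = 0 \end{cases}$$

-3- Arm hyperparameters update:

$$\forall i \le t: \quad \begin{cases} \alpha_{k_t,f_{i,t}} = \alpha_{k_t,f_{i,t}} + 1 & \text{if } x_{k_t} = 1 \\ \beta_{k_t,f_{i,t}} = \beta_{k_t,f_{i,t}} + 1 & \text{if } x_{k_t} = 0 \end{cases}$$

-4- Distribution of experts update:

• Update previous experts: $P(f_{i,t+1}) \propto (1 - \rho) \cdot P(x_t \mid f_{i,t}) \cdot P(f_{i,t}) \quad \forall i \le t$

• Create new expert $f_{t+1,t+1}$: $P(f_{t+1,t+1}) \propto \rho \sum_{i=1}^{t} P(f_{i,t})$

• Prior: $\alpha_{k,f_{t,t}} = \alpha_0$, $\beta_{k,f_{t,t}} = \beta_0$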
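Steps -1- to -4- can be put together as in the sketch below. It follows the update rules as they are written on the poster, including a new-expert weight proportional to $\rho \sum_i P(f_{i,t})$, and it aggregates over every expert alive at time $t$. The function name `global_switching_ts`, the preallocated array layout and the default prior $\alpha_0 = \beta_0 = 1$ are assumptions of this example, and no memory limit is applied, so the cost per step grows linearly with $t$.

```python
import numpy as np

def global_switching_ts(means, rho, alpha0=1.0, beta0=1.0, rng=None):
    """Global Switching TS with Bayesian aggregation (illustrative sketch).

    means: (T, K) array, means[t, k] = mu_{k,t}, used only to simulate rewards.
    Returns the chosen arms and the final expert weights (length T + 1).
    """
    rng = np.random.default_rng(rng)
    means = np.asarray(means, dtype=float)
    T, K = means.shape
    # Row i holds the counters of the expert started at time i + 1;
    # rows not yet active simply keep the prior (alpha0, beta0).
    alpha = np.full((T + 1, K), alpha0)
    beta = np.full((T + 1, K), beta0)
    w = np.zeros(T + 1)
    w[0] = 1.0                                  # P(f_{1,1}) = 1
    arms = np.empty(T, dtype=int)
    for t in range(T):
        n = t + 1                               # number of active experts
        # -1- decision: sample every expert, then aggregate with the weights
        theta = rng.beta(alpha[:n], beta[:n])   # shape (n, K)
        k = int(np.argmax(w[:n] @ theta))
        x = rng.random() < means[t, k]          # Bernoulli reward
        arms[t] = k
        # -2- instantaneous gain: predictive probability of the observed reward
        p_success = alpha[:n, k] / (alpha[:n, k] + beta[:n, k])
        lik = p_success if x else 1.0 - p_success
        # -3- hyperparameter update of the played arm, for every active expert
        if x:
            alpha[:n, k] += 1.0
        else:
            beta[:n, k] += 1.0
        # -4- expert distribution: reweight old experts, create a new one
        new_weight = rho * w[:n].sum()
        w[:n] *= (1.0 - rho) * lik
        w[n] = new_weight
        w[: n + 1] /= w[: n + 1].sum()          # normalize the "proportional to"
    return arms, w

# Toy run: two arms whose means swap halfway through.
means = np.vstack([np.tile([0.1, 0.9], (150, 1)), np.tile([0.9, 0.1], (150, 1))])
arms, w = global_switching_ts(means, rho=0.01, rng=0)
print(arms[:150].mean(), arms[150:].mean())
```

Setting `rho=0` gives zero weight to every new expert, so the procedure reduces to the single Thompson Sampling expert started at time 1.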

TRACKING THE OPTIMAL EXPERT

COMPARISON WITH STATE-OF-THE-ART

SENSITIVITY ANALYSIS OF PARAMETERS (ρ AND M)
