HAL Id: hal-01987001
https://hal.archives-ouvertes.fr/hal-01987001
Preprint submitted on 11 Feb 2019
Non-Stationary Thompson Sampling For Stochastic Bandits with Graph-Structured Feedback
Réda Alami
To cite this version:
Réda Alami. Non-Stationary Thompson Sampling For Stochastic Bandits with Graph-Structured
Feedback. 2019. ⟨hal-01987001⟩
Non-Stationary Thompson Sampling For Stochastic Bandits with Graph-Structured Feedback
Réda Alami Orange Labs [email protected]
Abstract
We propose an extension of the multi-armed bandit problem with graph feedback studied in [4] where the distributions of reward and feedback are non-stationary.
We also propose an extension, with two variants, of the Global Switching Thompson Sampling proposed in [3] to the graph feedback setting. This extension is based on the aggregation of a growing number of experts seen as learners. Finally, we conduct experiments providing evidence that, in practice, our proposal compares favorably with the oracle that knows the exact location of the environment changes.
1 Introduction
The Multi-Armed Bandit problem with graph feedback formalizes the fundamental exploration-exploitation dilemma that appears in decision-making problems with partial information carried by some unknown structure. Specifically, a set of K arms is available to the agent. At each turn, the agent chooses one arm and observes a reward corresponding to the played arm, as well as feedback from the neighboring arms. This setting has been studied in [4] by extending the classical Thompson Sampling to the graph feedback setting; the resulting algorithm is called the TS-N policy. However, in such settings, the distributions of rewards and feedback are assumed to be stationary. In this paper, we propose an extension of the bandit with graph feedback to the abrupt switching environment.
Moreover, we propose an adaptation of the TS-N policy for the switching environment.
2 The Non-Stationary Stochastic Bandit with Graph Feedback
Let us consider an agent facing a non-stationary stochastic multi-armed bandit with graph feedback characterized by a set $\mathcal{K} = \{1, ..., K\}$ of $K$ independent arms and an undirected graph $G = (\mathcal{K}, E)$. At each round $t \in [\![1, T]\!]$, the agent chooses one of the $K$ possible actions. When playing the arm $k_t$ at time $t$, two events take place. First, a reward $x_t$ is received, where $x_t \sim \mathcal{B}(\mu_{k_t,t})$ is a random variable drawn from a Bernoulli distribution of expectation $\mu_{k_t,t}$. Then, the agent observes the feedback $y_{k',t}$ of all neighboring arms $N_{k_t} = \{k' \in \mathcal{K} \mid (k_t, k') \in E\}$, where $y_{k',t} \sim \mathcal{B}(\mu_{k',t})$ is a random variable drawn from a Bernoulli distribution of expectation $\mu_{k',t}$. When the graph is empty ($E = \emptyset$), the setting is equivalent to the classical bandit problem. Moreover, when the graph is fully connected, the problem is considered as a pure prediction problem. Therefore, the graph-structured feedback can be seen as an intermediate setting between the bandit and the pure prediction. To assess the quality of a strategy for the bandit with graph feedback, we simply use the classical regret. It takes the following form:
$$R(T) = \sum_{t=1}^{T} \mu_t^\star - \sum_{t=1}^{T} \mu_{k_t,t},$$
where $\mu_t^\star = \max_{k \in \mathcal{K}} \{\mu_{k,t}\}$ denotes the best expected reward at round $t$ and $k_t$ the action chosen by the decision-maker at the same time $t$.
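The interaction protocol and the regret defined above can be illustrated with a minimal Python sketch. The helper names (`neighbors`, `play_round`, `regret`), the edge-set representation of the graph, and the toy means are assumptions made for illustration only; they do not come from the paper.

```python
import random

def neighbors(k, edges):
    """Neighbors of arm k in an undirected graph given as a set of (a, b) edges."""
    return {b if a == k else a for (a, b) in edges if k in (a, b)}

def play_round(k_t, mu_t, edges, rng):
    """Play arm k_t: Bernoulli reward x_t plus Bernoulli feedback of each neighbor."""
    x_t = int(rng.random() < mu_t[k_t])
    y_t = {k: int(rng.random() < mu_t[k]) for k in neighbors(k_t, edges)}
    return x_t, y_t

def regret(mu, played):
    """R(T) = sum_t max_k mu[t][k] - sum_t mu[t][k_t]."""
    return sum(max(m) - m[k] for m, k in zip(mu, played))

# Toy instance (assumed values): 3 arms on the path 0 - 1 - 2.
edges = {(0, 1), (1, 2)}
rng = random.Random(0)
x, y = play_round(1, [0.2, 0.5, 0.9], edges, rng)
```

Playing the middle arm reveals feedback for both of its neighbors, which is what places this setting between the bandit case (empty graph) and pure prediction (complete graph).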
Piece-wise stationary Bernoulli distributions. We assume that there exists a parameter $\rho \in (0, 1)$ such that the reward mean $\mu_{k,t}$ of arm $k$ at time $t$ follows a global switching model:
$$\mu_{k,t} = \begin{cases} \mu_{k,t-1} & \text{with probability } 1 - \rho \\ \mu_{\text{new}} \sim \mathcal{U}(0, 1) & \text{with probability } \rho \end{cases} \tag{1}$$
32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.
When the environment is modeled by eq (1) for all $k \in \mathcal{K}$, the problem setting is called a Global Switching Multi-Armed Bandit with Graph Feedback (GS-MAB-GF), i.e. when a switch occurs, all arms change their expected rewards and expected feedbacks. There exists a more general setting where changes occur independently for each arm $k$ (i.e. the change points of one arm are independent of those of the other arms). In this case, the problem setting is called a Per-arm Switching Multi-Armed Bandit with Graph Feedback. In this paper, we focus on the first setting (GS-MAB-GF).
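As an illustration of the global switching model of eq (1), the following Python sketch draws a sequence of mean vectors in which all arms switch at the same rounds. The function name, the 0-based time index, and the choice of drawing the initial means uniformly are assumptions of this sketch, not details given in the paper.

```python
import random

def global_switching_means(K, T, rho, rng):
    """Draw mu[t][k] for t = 0..T-1; with probability rho all arms switch together."""
    current = [rng.random() for _ in range(K)]  # initial means (assumed uniform)
    mu, change_points = [], []
    for t in range(T):
        if t > 0 and rng.random() < rho:
            # Global switch: every arm receives a fresh mean drawn from U(0, 1).
            current = [rng.random() for _ in range(K)]
            change_points.append(t)
        mu.append(list(current))
    return mu, change_points

mu, change_points = global_switching_means(K=3, T=10, rho=0.2, rng=random.Random(0))
```

A per-arm switching variant would instead draw one switch indicator per arm and per round.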
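Hidden sequence of change points. From eq (1), we may characterize each GS-MAB-GF with an unknown change points sequence of length $\Gamma_T$ denoted by $(\tau_\kappa)_{\kappa \in [\![1, \Gamma_T + 1]\!]} \in \mathbb{N}^{\Gamma_T + 1}$ where:
$$\begin{cases} \forall \kappa \in [\![1, \Gamma_T]\!],\ \forall t \in \mathcal{T}_\kappa = [\![\tau_\kappa + 1, \tau_{\kappa+1}]\!],\ \forall k \in \mathcal{K},\quad \mu_{k,t} = \mu_{k,[\kappa]} \\ \tau_1 = 0 < \tau_2 < ... < \tau_{\Gamma_T + 1} = T \end{cases}$$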
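To make the indexing of the stationary periods concrete, here is a small Python sketch that turns a change point sequence into the intervals $[\![\tau_\kappa + 1, \tau_{\kappa+1}]\!]$. The function name and the numeric values are hypothetical and chosen only for illustration.

```python
def stationary_segments(tau):
    """Given tau_1 = 0 < tau_2 < ... < tau_{Gamma+1} = T, return the
    inclusive intervals (tau_k + 1, tau_{k+1}) on which the means are constant."""
    return [(tau[i] + 1, tau[i + 1]) for i in range(len(tau) - 1)]

# Assumed toy sequence: horizon T = 3000 split into three stationary periods.
segments = stationary_segments([0, 1000, 2000, 3000])
```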
3 Global Switching Thompson Sampling for Graph-Structured Feedback
3.1 Elemental expert: The TS-N policy
The TS-N policy is an adaptation of the classical Thompson Sampling for graph-structured feedback.
It belongs to the Bayesian online learning family. It leverages Bayesian tools by maintaining a Beta posterior distribution $\pi_{k,t} = \mathrm{Beta}(\alpha_{k,t}, \beta_{k,t})$ on the reward distribution of each arm $k$. Based on the reward received $x_t$ and the set of observed feedback $\{y_{k',t} : k' \in N_{k_t}\}$, the posterior distribution $\pi_{k,t}$ is updated such that:
$$\pi_{k,t} = \mathrm{Beta}\big(\#(\text{feedback} = 1) + \#(\text{reward} = 1) + \alpha_0,\ \#(\text{feedback} = 0) + \#(\text{reward} = 0) + \beta_0\big)$$
At each time, the agent takes a sample $\theta_{k,t}$ from each $\pi_{k,t}$ and then plays the arm $k_t = \arg\max_k \{\theta_{k,t}\}$. Formally, by denoting $X_{t-1} = \bigcup_{i=1}^{t-1} x_i$ the history of past feedback we write:
$$\theta_t = (\theta_{1,t}, ..., \theta_{K,t}) \sim \mathbb{P}(\theta_t \mid X_{t-1}) = \prod_{k=1}^{K} \pi_{k,t}.$$
Recently, in [4] TS-N has been shown to have a matching regret bound. Indeed, the authors have obtained a problem-independent bound of $O\big(\sqrt{\chi(\bar{G})\, T}\big)$, where $\chi(\bar{G})$ denotes the clique cover number of the graph $G$.
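The sampling step and the posterior update described above can be sketched in a few lines of Python. This is only an illustrative reading of the TS-N policy under the assumption that the reward of the played arm and the feedback of each neighbor are counted in the same way; the function names are hypothetical.

```python
import random

def ts_n_choose(alpha, beta, rng):
    """Sample theta_k ~ Beta(alpha_k, beta_k) for every arm and return the arg max."""
    theta = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
    return max(range(len(theta)), key=theta.__getitem__)

def ts_n_update(alpha, beta, k_t, x_t, y_t):
    """Add the reward of k_t and the feedback of its neighbors to the Beta counts."""
    observations = dict(y_t)   # feedback y_{k',t} of the neighbors of k_t
    observations[k_t] = x_t    # reward of the played arm
    for k, obs in observations.items():
        alpha[k] += obs        # one more observed success if obs == 1
        beta[k] += 1 - obs     # one more observed failure if obs == 0

# Toy usage with a uniform prior alpha_0 = beta_0 = 1 on three arms.
alpha, beta = [1, 1, 1], [1, 1, 1]
k_t = ts_n_choose(alpha, beta, random.Random(0))
ts_n_update(alpha, beta, k_t=1, x_t=1, y_t={0: 0, 2: 1})
```

A single play therefore updates the posterior of every arm in the neighborhood of the played arm, not only the played arm itself.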
3.2 Decision-making based on a growing number of experts
Best achievable performance: TS-N oracle. Let TS-N$^\star$ denote the oracle that knows exactly the location of all the change points $\tau_\kappa$ observed until time $T$. It simply restarts a new TS-N at each of these change points. Assuming that $\Gamma_T$ is the overall number of change points observed until the horizon $T$, TS-N$^\star$ successively runs $\Gamma_T$ TS-N procedures, each starting at $\tau_\kappa + 1$ and ending at $\tau_{\kappa+1}$. Since TS-N has been analyzed with a matching regret bound [4], TS-N$^\star$ represents the best achievable performance in the global switching MAB with graph feedback.
Using the Bayesian online changepoint detector. As in [2, 3], in order to detect the occurrence of the change points, we connect the well-known Bayesian online changepoint detector of [1] with the graph-structured version of the multi-armed bandit setting. At each time step $t$, a new expert (i.e. a TS-N procedure) is introduced. One can see the expert $f_{i,t}$ as an index used to access the memory storing the hyperparameters of the model created at time $i$. Thus, given the expert distribution $w_{i,t} = \mathbb{P}(f_{i,t} \mid X_{t-1})$, the computation of the posterior distribution $\mathbb{P}(\theta_t \mid X_{t-1})$ takes the following form:
$$\mathbb{P}(\theta_t \mid X_{t-1}) = \sum_{i=1}^{t} \mathbb{P}(\theta_t \mid X_{t-1}, f_{i,t})\, \mathbb{P}(f_{i,t} \mid X_{t-1}) \tag{2}$$
Building the expert distribution. According to the work of [1], the computation of the expert distribution is done recursively such that:
$$\underbrace{\mathbb{P}(f_{i,t} \mid X_{t-1})}_{\text{Expert distribution at } t} \propto \sum_{i=1}^{t-1} \overbrace{\mathbb{P}(f_{i,t} \mid f_{i,t-1})}^{\text{change point prior}}\ \underbrace{\mathbb{P}(x_t \mid f_{i,t-1}, X_{t-2})}_{\text{Instantaneous gain}}\ \overbrace{\mathbb{P}(f_{i,t-1} \mid X_{t-2})}^{\text{Expert distribution at } t-1} \tag{3}$$
The change point prior $\mathbb{P}(f_{i,t} \mid f_{i,t-1})$ is naturally computed following eq (1):
$$\mathbb{P}(f_{i,t} \mid f_{i,t-1}) = (1 - \rho)\, \mathbb{1}(i < t) + \rho\, \mathbb{1}(i = t) \tag{4}$$
Thus, the inference model takes the following form (up to a normalization factor):
$$\begin{cases} \text{Growth probability:} & \mathbb{P}(f_{i,t} \mid X_{t-1}) \propto (1 - \rho)\, \mathbb{P}(x_t \mid f_{i,t-1}, X_{t-2})\, \mathbb{P}(f_{i,t-1} \mid X_{t-2}) \\ \text{Change point probability:} & \mathbb{P}(f_{t,t} \mid X_{t-1}) \propto \rho \sum_{i=1}^{t-1} \mathbb{P}(x_t \mid f_{i,t-1}, X_{t-2})\, \mathbb{P}(f_{i,t-1} \mid X_{t-2}) \end{cases} \tag{5}$$
where $\mathbb{P}(x_t \mid f_{i,t-1}, X_{t-2})$ corresponds to the likelihood of the Bernoulli distribution parametrized with $\frac{\alpha_{k_t,i,t-1}}{\alpha_{k_t,i,t-1} + \beta_{k_t,i,t-1}}$, where $\alpha_{k_t,i,t-1}$ and $\beta_{k_t,i,t-1}$ are the hyper-parameters of the arm $k_t$ learned by the expert $f_{i,t-1}$. Then, the quantity $\mathbb{P}(x_t \mid f_{i,t-1}, X_{t-2})$ takes the following form:
$$\mathbb{P}(x_t \mid f_{i,t-1}, X_{t-2}) = \exp(-l_{i,t-1}),$$
where $l_{i,t-1}$ denotes the instantaneous logarithmic loss incurred by the forecaster $f_{i,t-1}$ at time $t - 1$.
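The recursion of eq (5) can be sketched in Python as follows. The function names are hypothetical, and renormalizing the weights at every step is an assumption of this sketch, since the inference model is only stated up to a normalization factor.

```python
import math

def log_loss(x_t, alpha, beta):
    """Logarithmic loss of the Bernoulli forecast alpha / (alpha + beta) on x_t."""
    p = alpha / (alpha + beta)
    return -math.log(p) if x_t == 1 else -math.log(1.0 - p)

def update_expert_distribution(w, alpha_kt, beta_kt, x_t, rho):
    """One step of eq (5): grow the existing experts, append a new one, renormalize.

    w[i] is the weight of expert i; alpha_kt[i], beta_kt[i] are the Beta
    parameters that expert i holds for the played arm k_t.
    """
    gains = [wi * math.exp(-log_loss(x_t, a, b))
             for wi, a, b in zip(w, alpha_kt, beta_kt)]
    grown = [(1.0 - rho) * g for g in gains]   # growth probability
    new_expert = rho * sum(gains)              # change point probability
    total = sum(grown) + new_expert
    return [g / total for g in grown] + [new_expert / total]

# Two experts with opposite beliefs about the played arm observe a reward of 1.
w_next = update_expert_distribution([0.5, 0.5], [9, 1], [1, 9], x_t=1, rho=0.1)
```

The expert whose estimate agrees with the observation keeps most of the mass, while the newly created expert always receives a share proportional to $\rho$.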
Moreover, in order to build the index $\vartheta_{k,t}$ of arm $k$ at time $t$, we propose two alternative definitions for the indices:
Bayesian aggregation of experts. As in [2], one can interpret eq (2) as a Bayesian aggregation of a growing number of experts seen as learners. Thus, at each time step $t$, instead of having only one characterization for each arm $k$, a set $\Theta_{k,t} = \{\theta_{k,i,t} : \theta_{k,i,t} \sim \mathrm{Beta}(\alpha_{k,i,t}, \beta_{k,i,t})\ \forall i \in [\![1, t]\!]\}$ of $t$ characterizations is available. Each element of $\Theta_{k,t}$ is a sample from the Beta distribution associated with the model launched at time $i$. Finally, combining the Bayesian aggregation with the structured model of the environment leads us to build the index of each arm $k$ as follows:
$$\vartheta_{k,t} = \sum_{i=1}^{t} \frac{w_{i,t}}{\sum_{j=1}^{t} w_{j,t}}\, \theta_{k,i,t}$$
Picking the best estimated expert. The easiest way to deal with a growing number of experts is to take, at each time step, the "best" expert in terms of weight. Indeed, the Bayesian online change point detection tends to give the highest weight to the expert starting at the last change point. Thus, by letting $i^\star = \arg\max_i w_{i,t}$ be the best estimated expert at time $t$, the index of each arm $k$ is built as follows:
$$\vartheta_{k,t} = \theta_{k,i^\star,t}$$
Finally, by plugging one of the previous ways of building the arm index $\vartheta_{k,t}$ into the formalism of [3], we get the Global Switching Thompson Sampling with Graph Feedback (Global-STS-N) described in algorithm 1. It should be noted that Global-STS-N has two variants, according to the way the arm index is computed (Bayesian aggregation of experts or picking the best estimated expert).
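The two index definitions can be compared with a short Python sketch. The function name, the `variant` argument, and the nested-list layout `alpha[i][k]` (expert `i`, arm `k`) are assumptions made for illustration.

```python
import random

def arm_indices(w, alpha, beta, rng, variant="aggregation"):
    """Build the index of every arm from the per-expert Beta posteriors.

    w[i] is the weight of expert i; alpha[i][k], beta[i][k] are the Beta
    parameters of arm k held by expert i.
    """
    # One Thompson sample theta[i][k] per expert and per arm.
    theta = [[rng.betavariate(a, b) for a, b in zip(alpha_i, beta_i)]
             for alpha_i, beta_i in zip(alpha, beta)]
    n_arms = len(theta[0])
    if variant == "aggregation":
        total = sum(w)
        return [sum(w[i] * theta[i][k] for i in range(len(w))) / total
                for k in range(n_arms)]
    # Otherwise keep only the samples of the expert with the largest weight.
    best = max(range(len(w)), key=w.__getitem__)
    return list(theta[best])

alpha = [[1, 1], [8, 2]]
beta = [[1, 1], [2, 8]]
agg = arm_indices([0.2, 0.8], alpha, beta, random.Random(0), "aggregation")
top = arm_indices([0.2, 0.8], alpha, beta, random.Random(0), "best")
k_t = max(range(len(agg)), key=agg.__getitem__)
```

When the expert distribution puts all of its mass on a single expert, the two variants return the same indices.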
4 Experiments
In all the experiments, we consider a GS-MAB-GF of eight arms with three change points, one occurring every 1000 rounds. We compare the two versions of Global-STS-N with the TS-N oracle. Experiments are run 60 times.
(a) Graph structure (b) Comparison between the three versions of Global-STS-N and the oracle
Figure 1: Overall comparison of Global-STS-N and the oracle.
First, the performances of the two variants of Global-STS-N are very close to those of the oracle, which suggests that Global-STS-N handles the switching environment well. Second, the Bayesian aggregation variant achieves performance that rivals that of the oracle.
Discussion. In figure 1(b), Global-STS-N behaves as if it restarted a TS-N at each change point. This behavior results from the inference model of the experts presented in eq (5). When a switch occurs, the instantaneous gain $\mathbb{P}(x_t \mid f_{i,t-1}, X_{t-2})$ of all experts started before the change point suddenly drops because their estimate of the environment is no longer correct, which favors the newly created experts and drives the weights of the former ones toward zero. The total mass of the expert distribution $w_{i,t}$ then concentrates around the optimal expert, i.e. the expert started at the most recent change point $\tau_\kappa$, which corresponds to the most appropriate characterization of the environment. This gives the impression that Global-STS-N restarts a new TS-N at each change point.
5 Conclusion and future works
We have proposed Global-STS-N, an extension with two variants of Thompson Sampling for the switching bandit problem with graph feedback. In the experiments, the proposed algorithm performs close to the Thompson Sampling with Graph Feedback oracle, an oracle which already knows the exact location of all the change points. These results arise from the fact that Global-STS-N is based on the Bayesian concept of tracking the best experts, which allows it to catch the unknown change points efficiently. The proposed algorithm can naturally be extended to the Per-arm Switching Multi-Armed Bandit with Graph Feedback by maintaining an expert distribution per arm. The next step of this work is to analyze Global-STS-N in terms of pseudo cumulative regret.
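Algorithm 1 Global Switching Thompson Sampling for Graph Feedback
procedure Global-STS-N($K, G, T, \alpha_0, \beta_0, \rho$)
    $t \leftarrow 1$, $w_{1,t} \leftarrow 1$, and $\forall k \in \mathcal{K}$: $\alpha_{k,1,t} \leftarrow \alpha_0$, $\beta_{k,1,t} \leftarrow \beta_0$    ▷ Initializations
    for $t \le T$ do    ▷ Interaction with environment
        $k_t \leftarrow$ ChooseArm($\{w\}_t, \{\alpha\}_t, \{\beta\}_t$)
        $x_t \leftarrow$ PlayArm($k_t$)    ▷ Bernoulli trial of parameter $\mu_{k_t,t}$
        $\{y\}_{k_t} \leftarrow$ GetNeighborhoodFeedback($N_{k_t}$)
        $\{w\}_{t+1} \leftarrow$ UpdateExpertDistribution($\{w\}_t, \{\alpha\}_{k_t,t}, \{\beta\}_{k_t,t}, x_t, \rho$)
        $\{\alpha\}_{t+1}, \{\beta\}_{t+1} \leftarrow$ UpdateArmModel($\{\alpha\}_t, \{\beta\}_t, x_t, \{y\}_{k_t}, G, k_t$)
procedure ChooseArm($\{w\}_t, \{\alpha\}_t, \{\beta\}_t$)
    $\forall k \in \mathcal{K}\ \forall i \in [\![1, t]\!]$: $\theta_{k,i,t} \sim \mathrm{Beta}(\alpha_{k,i,t}, \beta_{k,i,t})$
    $\forall k \in \mathcal{K}$: $\vartheta_{k,t} \leftarrow \begin{cases} \sum_{i \in [\![1, t]\!]} \frac{w_{i,t}}{\sum_{j \in [\![1, t]\!]} w_{j,t}}\, \theta_{k,i,t} & \text{Bayesian aggregation} \\ \theta_{k,i^\star,t} \quad (i^\star = \arg\max_i w_{i,t}) & \text{Picking the best expert} \end{cases}$
    return $\arg\max_k \vartheta_{k,t}$
procedure GetNeighborhoodFeedback($N_{k_t}$)
    $\forall k' \in N_{k_t}$: $y_{k',t} \sim \mathcal{B}(\mu_{k',t})$    ▷ Bernoulli trial of parameter $\mu_{k',t}$
    return $\{y\}_{k_t}$
procedure UpdateExpertDistribution($\{w\}_t, \{\alpha\}_{k_t,t}, \{\beta\}_{k_t,t}, x_t, \rho$)
    $l_{i,t} \leftarrow -x_t \log \frac{\alpha_{k_t,i,t}}{\alpha_{k_t,i,t} + \beta_{k_t,i,t}} - (1 - x_t) \log \frac{\beta_{k_t,i,t}}{\alpha_{k_t,i,t} + \beta_{k_t,i,t}} \quad \forall i \in [\![1, t]\!]$
    $w_{i,t+1} \leftarrow (1 - \rho)\, w_{i,t} \exp(-l_{i,t}) \quad \forall i \in [\![1, t]\!]$    ▷ Increasing the size of expert $f_{i,t}$
    $w_{t+1,t+1} \leftarrow \rho \sum_{i}$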