A tour of time-uniform concentration inequalities:
Laplace, Peeling and Kernel.
May 14, Besançon. Odalric-Ambrym Maillard, Inria Lille – Nord Europe
.
Motivation
- Multi-armed bandits
- Time-uniform empirical mean
A generic peeling tool
The specific Laplace method
Open problems
Table of contents
.
Multi-armed bandits
“Facing the traveler tree, you ask yourself: which path shall I take this time?”
.
$A$ arms ↔ $A$ (unknown) probability distributions on $\mathbb{R}$: $\nu_1, \nu_2, \nu_3, \nu_4, \nu_5$
Means: $\mu_1, \mu_2, \mu_3, \mu_4, \mu_5$; maximal mean: $\mu^\star = \max_{a\in\mathcal{A}} \mu_a$.
At round $t$, an agent:
- chooses an arm $A_t \in \mathcal{A}$, with $A_t \sim \pi(A_1, Y_1, \dots, A_{t-1}, Y_{t-1})$
- observes a sample $Y_t = X_{A_t,t} \sim \nu_{A_t}$ (the reward, and only that!)
Goal: maximize the sum of collected rewards $\sum_{t=1}^T Y_t$ over time, i.e. minimize the regret
$$R^\pi_T(\nu) = T\mu^\star - \mathbb{E}_{\nu,\pi}\Big[\sum_{t=1}^T Y_t\Big] = \sum_{a\in\mathcal{A}} (\mu^\star - \mu_a)\,\mathbb{E}[N_T(a)], \qquad N_T(a) = \sum_{t=1}^T \mathbb{I}\{A_t = a\}.$$
The Stochastic Multi-Armed Bandit model
.
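The regret identity above (sum of rewards lost versus sum of gaps weighted by pull counts) can be checked numerically; a minimal sketch, with function names of my own choosing:

```python
def regret_decomposition(means, pull_counts):
    """Sum over arms of (mu_star - mu_a) * N_T(a)."""
    mu_star = max(means)
    return sum((mu_star - mu) * n for mu, n in zip(means, pull_counts))

def regret_direct(means, pull_counts):
    """T * mu_star minus the expected collected reward, with T = sum of pulls."""
    mu_star = max(means)
    T = sum(pull_counts)
    return T * mu_star - sum(mu * n for mu, n in zip(means, pull_counts))
```

Both forms agree whenever the pull counts sum to $T$, which is exactly the content of the displayed identity.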
Basic model (first approximation) for:
- Clinical trials (Thompson, 1933)
- Casino slot machines (Robbins, 1952)
- Ad placement (nowadays)
Multi-armed bandit applications
.
Eco-sustainable decision making:
- Plant health-care
- Ground health-care
Medical decision companion:
- Emergency admission filtering
Next generation of applications
.
Whenever we are uncertain about the outcome of an arm, we consider the best possible world and choose the best arm:
$$\operatorname*{argmax}_{a\in\mathcal{A}}\ \max\big\{ \mathbb{E}_{\tilde\nu_a}[X] : \tilde\nu_a \text{ compatible with observations on arm } a \big\}.$$
Why it works:
- If the best possible world is correct ⇒ no regret.
- If the best possible world is wrong ⇒ the reduction in uncertainty is maximized.
Optimism in Face of Uncertainty Learning (OFUL)
.
The idea
[Figure: empirical reward distributions with confidence intervals for four arms, pulled 10, 73, 3 and 23 times; x-axis: arms, y-axis: reward.]
The Upper–Confidence Bound (UCB) Algorithm
.
$$(\text{Oracle})\quad a_{t+1} \in \operatorname*{argmax}_{a\in\mathcal{A}}\ \mu_a = \hat\mu_{t,a} + \underbrace{(\mu_a - \hat\mu_{t,a})}_{}$$
- Choose the highest value compatible with the empirical means (Auer et al., 2002):
$$(\text{UCB})\quad A_{t+1} \in \operatorname*{argmax}_{a\in\mathcal{A}}\ \max\Big\{ \mathbb{E}_{\nu_a}[X] = \mu : \mu - \hat\mu_{t,a} \le \sqrt{\frac{\log(t^3)}{2N_t(a)}} \Big\}.$$
- Use the full empirical distributions $\hat\nu_{t,a}$ (exponential families) instead (Lai, Robbins 1987):
$$(\text{KL-UCB})\quad A_{t+1} \in \operatorname*{argmax}_{a\in\mathcal{A}}\ \max\big\{ \mathbb{E}_{\nu_a}[X] : \mathrm{KL}(\hat\nu_{t,a}, \nu_a) \le \dots \big\}.$$
- Extends to linear objective functions $f_\theta(a) = \langle\theta, \varphi(a)\rangle$ (Abbasi-Yadkori et al., 2011), or kernel (M., 2016):
$$(\text{OFUL})\quad A_{t+1} \in \operatorname*{argmax}_{a\in\mathcal{A}}\ \max\big\{ f_\theta(a) : \|\theta - \hat\theta_t\|_{V_t} \le \dots \big\}.$$
- Extends to a risk-averse criterion (M., 2013):
$$(\text{RA-UCB})\quad A_{t+1} \in \operatorname*{argmax}_{a\in\mathcal{A}}\ \max\big\{ \kappa_{\nu_a} : \mathrm{KL}(\hat\nu_{t,a}, \nu_a) \le \dots \big\}, \quad \text{where } \kappa_\nu = \inf_{q\in\mathcal{P}(\mathbb{R})} \mathbb{E}_q[X] + \frac{1}{\lambda}\mathrm{KL}(q, \nu).$$
The optimistic principle
.
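In the Bernoulli case, computing the KL-UCB index amounts to a one-dimensional root search over the mean, since $\mathrm{kl}(\hat\mu, \cdot)$ is increasing above $\hat\mu$. A minimal sketch (the bisection helper is my own, not from the talk):

```python
import math

def kl_bernoulli(p, q):
    """kl(p, q) = KL(B(p), B(q)); p, q clamped away from {0, 1}."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def klucb_upper(mu_hat, budget, tol=1e-9):
    """Largest mu >= mu_hat with kl(mu_hat, mu) <= budget, by bisection."""
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(mu_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```

The returned value is the optimistic mean $\max\{\mu : \mathrm{KL}(\hat\nu_{t,a}, \nu_a) \le \dots\}$ with budget $f(t)/N_a(t)$.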
Motivation
- Multi-armed bandits
- Time-uniform empirical mean
A generic peeling tool
The specific Laplace method
Open problems
Table of contents
.
The Upper Confidence Bound algorithm (Auer et al., 2002): choose $A_{t+1} = \operatorname{Argmax}\{\mu^+_{a,t}, a \in \mathcal{A}\}$, where
$$\mu^+_{a,t} = \hat\mu_{a,t} + \sqrt{\frac{\log(1/\delta_t)}{2N_a(t)}}.$$
Intuition: UCB should pull the suboptimal arms
- enough, so as to identify which arm is the best;
- not too much, so as to keep the regret as small as possible.
The confidence level $1-\delta$ has the following impact (similarly for $\alpha$):
- large $1-\delta$: high level of exploration;
- small $1-\delta$: high level of exploitation.
Solution: depending on the time horizon, we can tune the trade-off between exploration and exploitation.
Tuning the confidence δ of UCB
.
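A direct transcription of the index, using the choice $\delta_t = t^{-3}$ that appears earlier in the talk (a sketch, not the author's code):

```python
import math

def ucb_index(mu_hat, n_pulls, t):
    """Empirical mean plus exploration bonus sqrt(log(t^3) / (2 * n_pulls))."""
    return mu_hat + math.sqrt(math.log(t ** 3) / (2 * n_pulls))
```

A smaller $\delta_t$ inflates the bonus and hence the exploration; more pulls of an arm shrink its bonus, shifting toward exploitation.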
$N_a(t) = \sum_{s=1}^t \mathbb{I}\{A_s = a\}$ is a random stopping time.
Remark
.
Handle random stopping times $\tau_t, \tau$ carefully:
$$(\text{Union bound})\quad \mathbb{P}\bigg( \frac{1}{\tau_t}\sum_{i=1}^{\tau_t} (\mu - X_i) \ge \sqrt{\frac{\ln(t/\delta)}{2\tau_t}}\, \bigg) \le \delta$$
$$(\text{Peeling method})\quad \mathbb{P}\bigg( \frac{1}{\tau_t}\sum_{i=1}^{\tau_t} (\mu - X_i) \ge \sqrt{\frac{\alpha}{2\tau_t} \ln\Big(\frac{\ln(t)}{\ln(\alpha)}\,\frac{1}{\delta}\Big)}\, \bigg) \le \delta$$
$$(\text{Peeling method})\quad \mathbb{P}\bigg( \frac{1}{\tau}\sum_{i=1}^{\tau} (\mu - X_i) \ge \sqrt{\frac{\alpha}{2\tau} \ln\Big(\frac{\ln(\tau)\ln(\alpha\tau)}{\ln^2(\alpha)}\,\frac{1}{\delta}\Big)}\, \bigg) \le \delta$$
$$(\text{Laplace method})\quad \mathbb{P}\bigg( \frac{1}{\tau}\sum_{i=1}^{\tau} (\mu - X_i) \ge \sqrt{\frac{1 + \frac{1}{\tau}}{2\tau} \ln\Big(\frac{\sqrt{\tau+1}}{\delta}\Big)}\, \bigg) \le \delta$$
Provably reduces the regret, thus mistakes (saves lives).
Concentration with random stopping time
.
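For concreteness, the four deviation widths above can be evaluated for $1/2$-sub-Gaussian (e.g. $[0,1]$-bounded) samples; a sketch, with the default $\alpha$ chosen arbitrarily:

```python
import math

def union_bound(tau, t, delta):
    """Width sqrt(ln(t/delta) / (2 tau)) from the union bound."""
    return math.sqrt(math.log(t / delta) / (2 * tau))

def peeling_bounded(tau, t, delta, alpha=1.1):
    """Peeling width for stopping times bounded by t."""
    return math.sqrt(alpha / (2 * tau) * math.log(math.log(t) / (math.log(alpha) * delta)))

def peeling_unbounded(tau, delta, alpha=1.1):
    """Peeling width for unbounded stopping times (needs tau >= 2 so ln(tau) > 0)."""
    return math.sqrt(alpha / (2 * tau)
                     * math.log(math.log(tau) * math.log(alpha * tau)
                                / (delta * math.log(alpha) ** 2)))

def laplace(tau, delta):
    """Width sqrt((1 + 1/tau)/(2 tau) * ln(sqrt(tau+1)/delta)) from the Laplace method."""
    return math.sqrt((1 + 1 / tau) / (2 * tau) * math.log(math.sqrt(tau + 1) / delta))
```

At $\tau = t = 1000$ and $\delta = 0.01$, the Laplace width is the smallest of the three bounds applicable to bounded stopping times, in line with the numerical comparison in the talk.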
- Geometric peeling: $\alpha > 1$ (agnostic to the probability measure).
$$t_0 = 1,\ t_1 = \alpha,\ t_2 = \alpha^2,\ t_3 = \alpha^3,\ \dots,\ t_k = \alpha^k, \dots \qquad \mathbb{P}(\Omega(\tau_t)) \le \sum_{k\in\mathbb{N}} \mathbb{P}\big(\exists t \in [t_k, t_{k+1}),\ \Omega(t)\big).$$
The sum over $k$ yields the $\frac{\ln(n)}{\ln(\alpha)}$ term; the sup over $[t_k, t_{k+1})$ yields a slow-down $t_{k+1}/t_k = \alpha$ in the Chernoff inequality $\mathbb{P}(Z \ge t) \le \exp(-\varphi^\star(t))$.
- Laplace: integration ($\simeq$ infinitesimal peeling), closely following the distribution.
$$M^\lambda_t = \exp\Big( \lambda \sum_{s=1}^t (X_s - \mu) - t\lambda^2/8 \Big) = \exp\big( \lambda S_t - t\lambda^2/8 \big).$$
Let $\Lambda \sim \mathcal{N}(0, 4)$. Then $M_t = \mathbb{E}_\Lambda[M^\Lambda_t] = \exp\big(\frac{2S_t^2}{t+1}\big)\frac{1}{\sqrt{t+1}}$ and
$$\mathbb{P}\bigg( S_t \ge \sqrt{\frac{t+1}{2} \ln\Big(\frac{\sqrt{t+1}}{\delta}\Big)}\, \bigg) = \mathbb{P}\big( M_t \ge 1/\delta \big). \quad \text{Still valid for a stopping time } \tau\dots$$
Note: closed-form expression only for specific measures.
Time-Peeling and Laplace
.
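The Gaussian-mixture step above ($\Lambda \sim \mathcal{N}(0,4)$ integrated against the martingale $M^\lambda_t$) can be verified by direct numerical integration; a sketch:

```python
import math

def mixture_numeric(S_t, t, lam_max=40.0, n_grid=40001):
    """Numerically integrate E_Lambda[exp(Lambda * S_t - t * Lambda^2 / 8)]
    for Lambda ~ N(0, 4), on a uniform grid over [-lam_max, lam_max]."""
    dlam = 2 * lam_max / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        lam = -lam_max + i * dlam
        density = math.exp(-lam * lam / 8) / math.sqrt(8 * math.pi)  # N(0,4) pdf
        total += math.exp(lam * S_t - t * lam * lam / 8) * density * dlam
    return total

def mixture_closed_form(S_t, t):
    """exp(2 * S_t^2 / (t+1)) / sqrt(t+1), as stated on the slide."""
    return math.exp(2 * S_t ** 2 / (t + 1)) / math.sqrt(t + 1)
```

The two values agree to high accuracy, confirming the closed form obtained by completing the square in the Gaussian integral.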
[Figures: ratio of different time-uniform concentration bounds over that of the Laplace method, as a function of $t$, for a confidence level $\delta = 0.01$; one panel per stopping-time bound $n = 10^2, 10^3, \dots, 10^7$, plus one panel for unbounded stopping times with various choices of $\alpha = 1 + \dots$]
Here, the $\log\sqrt{t}$ bound is better than the $\log\log t$ bound: constants matter!
Comparison for bounded and unbounded stopping time
.
Motivation
A generic peeling tool
The specific Laplace method
Open problems
Table of contents
.
Let $Z = \{Z_i\}_{i=1}^\infty$ be a predictable process and $\mathcal{H}$ its natural filtration. Let $\varphi$ be an upper envelope of the log-Laplace transforms of the $Z_i$, and let $\varphi^\star$ denote its Cramér transform (Legendre–Fenchel dual), that is:
$$\forall \lambda \in \mathcal{D}, \forall i,\quad \ln \mathbb{E}\big[\exp(\lambda Z_i) \,\big|\, \mathcal{H}_{i-1}\big] \le \varphi(\lambda), \qquad \forall x \in \mathbb{R},\quad \varphi^\star(x) = \sup_{\lambda\in\mathbb{R}}\ \lambda x - \varphi(\lambda).$$
Let $\varphi^{-1}_{\star,+}: \mathbb{R} \to \mathbb{R}^+$ (resp. $\varphi^{-1}_{\star,-}$) be its inverse map on $\mathbb{R}^+$ (resp. $\mathbb{R}^-$).
Let $N_n$ be an $\mathbb{N}$-valued $\mathcal{H}$-measurable random variable a.s. bounded by $n$. Then
$$\forall \alpha \in (1, n],\ \delta \in (0,1),\qquad \mathbb{P}\bigg( \frac{1}{N_n}\sum_{i=1}^{N_n} Z_i \ge \varphi^{-1}_{\star,+}\Big( \frac{\alpha}{N_n} \ln\Big(\frac{\ln(n)}{\ln(\alpha)}\,\frac{1}{\delta}\Big) \Big) \bigg) \le \delta.$$
Now, if $N$ is a (possibly unbounded) $\mathbb{N}$-valued $\mathcal{H}$-measurable random variable,
$$\forall \alpha > 1,\ \delta \in (0,1),\qquad \mathbb{P}\bigg( \frac{1}{N}\sum_{i=1}^{N} Z_i \ge \varphi^{-1}_{\star,+}\Big( \frac{\alpha}{N} \ln\Big(\frac{\ln(N)\ln(\alpha N)}{\delta \ln^2(\alpha)}\Big) \Big) \bigg) \le \delta.$$
Time-Peeling
.
- sub-Gaussian $\mathcal{N}(\sigma^2)$:
$$\varphi(\lambda) \le \frac{\lambda^2\sigma^2}{2},\ \lambda \in \mathbb{R}; \qquad \varphi^{-1}_{\star,+}(z) = \sqrt{2\sigma^2 z},\quad \varphi^{-1}_{\star,-}(z) = -\sqrt{2\sigma^2 z}.$$
- sub-Gamma $\Gamma(v, c)$ (good for Bernstein bounds):
$$\varphi(\lambda) \le \frac{\lambda^2 v}{2(1 - c\lambda)},\ \lambda \in [0, 1/c); \qquad \varphi^{-1}_{\star,+}(z) = \sqrt{2vz} + cz,\quad \varphi^{-1}_{\star,-}(z) = -\sqrt{2vz} - cz.$$
- Bernoulli $\mathcal{B}(p)$ (good for discrete distributions):
$$\varphi(\lambda) \le \frac{\lambda^2}{2}\begin{cases} p(1-p) & \text{if } p > \frac{1}{2},\ \lambda \in \mathbb{R}^+ \\ g(p) & \text{if } p \in [0,1],\ \lambda \in \mathbb{R} \end{cases} \qquad \varphi^{-1}_{\star,+}(z) = \sqrt{2\bar g(p) z},\quad \varphi^{-1}_{\star,-}(z) = -\sqrt{2 g(p) z},$$
where $g(p) = \frac{1/2 - p}{\log(1/p - 1)}$ and $\bar g(p) = p(1-p)$ if $p > \frac{1}{2}$, $g(p)$ else.
- square of a Gaussian (good for empirical Bernstein bounds):
$$\varphi(\lambda) \le -\frac{1}{2}\log\big(1 - 2\lambda\sigma^2\big),\ \lambda \in [0, 1/2\sigma^2); \qquad \varphi^{-1}_{\star,+}(z) \le \sigma^2(1 + \sqrt{2z})^2,\quad \varphi^{-1}_{\star,-}(z) \ge \sigma^2(1 - 2\sqrt{z}).$$
Examples
.
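Plugging the sub-Gaussian inverse map into the bounded-$N$ theorem recovers the explicit peeling width quoted earlier in the talk; a sketch (helper names are mine):

```python
import math

def phi_inv_subgaussian(z, sigma):
    """phi_{*,+}^{-1}(z) = sqrt(2 * sigma^2 * z) for a sigma-sub-Gaussian envelope."""
    return math.sqrt(2 * sigma ** 2 * z)

def peeling_deviation(N, n, delta, sigma, alpha=1.1):
    """phi_{*,+}^{-1}( alpha/N * ln( ln(n)/ln(alpha) * 1/delta ) ) from the theorem."""
    z = alpha / N * math.log(math.log(n) / (math.log(alpha) * delta))
    return phi_inv_subgaussian(z, sigma)
```

With $\sigma = 1/2$ (e.g. $[0,1]$-bounded samples), this reduces to $\sqrt{\frac{\alpha}{2N}\ln\big(\frac{\ln(n)}{\ln(\alpha)}\frac{1}{\delta}\big)}$, the bounded-stopping-time peeling bound.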
Let $(\varepsilon_t)_t$ be non-increasing. Then
$$\mathbb{P}\bigg( \frac{1}{N_n}\sum_{i=1}^{N_n} Z_i \ge \varepsilon_{N_n} \bigg) \le \sum_{k=1}^K \mathbb{P}\bigg( \exists t \in (t_{k-1}, t_k] : \exp\Big(\lambda_k \sum_{i=1}^t Z_i\Big) \ge \exp(t\lambda_k\varepsilon_t) \bigg)$$
$$\le \sum_{k=1}^K \mathbb{P}\bigg( \exists t \in (t_{k-1}, t_k] : \underbrace{\exp\Big(\lambda_k \sum_{i=1}^t Z_i - t\varphi(\lambda_k)\Big)}_{W_{k,t}} \ge \exp\Big( t\underbrace{\big(\lambda_k\varepsilon_{t_k} - \varphi(\lambda_k)\big)}_{\varphi^\star(\varepsilon_{t_k})} \Big) \bigg)$$
$$\le \sum_{k=1}^K \mathbb{P}\bigg( \max_{t\in(t_{k-1}, t_k]} W_{k,t} \ge \exp\Big( \frac{t_k\,\varphi^\star(\varepsilon_{t_k})}{\alpha} \Big) \bigg) \quad \text{(using } t > t_k/\alpha\text{)} \quad \le\ \sum_{k=1}^K \exp\Big( -\frac{t_k\,\varphi^\star(\varepsilon_{t_k})}{\alpha} \Big),$$
the last step since each $(W_{k,t})_t$ is a nonnegative supermartingale, so Doob's maximal inequality applies.
Tuning for bounded $K$: choosing $\varepsilon_t$ such that $t\varphi^\star(\varepsilon_t) = \alpha\ln(K/\delta)$ gives $\sum_{k=1}^K \frac{\delta}{K} = \delta$.
A simple proof
.
$$R_\nu(T) = \sum_{a : \Delta_a > 0} \Delta_a\,\mathbb{E}[N_a(T)]$$
⇒ we only need to study the expected number of pulls of suboptimal arms.
Lower bound (Lai & Robbins '85; Burnetas & Katehakis '96): for any bandit $\nu \in \mathcal{D}$ and any "uniformly good" strategy knowing $\mathcal{D}$,
$$\forall a : \mu_a < \mu^\star,\quad \liminf_{T\to\infty} \frac{\mathbb{E}[N_a(T)]}{\ln T} \ge \frac{1}{\mathcal{K}_a(\nu_a, \mu^\star)}, \qquad \mathcal{K}_a(\nu_a, \mu^\star) = \inf\{ \mathrm{KL}(\nu_a, \nu) : \nu \in \mathcal{D},\ \mathbb{E}_\nu[X] > \mu^\star \}.$$
- Bernoulli: $\mathcal{K}_a(\nu_a, \mu^\star) = \mathrm{KL}(\mathcal{B}(\mu_a), \mathcal{B}(\mu^\star)) \stackrel{\text{def}}{=} \mathrm{kl}(\mu_a, \mu^\star)$.
Main tool: change of measure.
$$(\text{Probability})\quad \forall \Omega, \forall c \in \mathbb{R},\quad \mathbb{P}_\nu\Big( \Omega \cap \Big\{ \log\frac{d\nu}{d\tilde\nu}(X) \le c \Big\} \Big) \le \exp(c)\,\mathbb{P}_{\tilde\nu}(\Omega).$$
$$(\text{Expectation})\quad \mathbb{E}_\nu\Big[ \log\frac{d\nu}{d\tilde\nu}(X) \Big] \ge \sup_{g : \mathcal{X}\to[0,1]} \mathrm{kl}\big( \mathbb{E}_\nu[g(X)], \mathbb{E}_{\tilde\nu}[g(X)] \big).$$
Back to bandits?
.
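In the Bernoulli case the lower bound is fully explicit; a sketch computing the asymptotic constant in front of $\ln T$ (helper names are mine):

```python
import math

def kl_bernoulli(p, q):
    """kl(p, q) = KL(B(p), B(q)), for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lai_robbins_rate(means):
    """Sum over suboptimal arms of Delta_a / kl(mu_a, mu_star): any uniformly
    good strategy incurs regret at least (rate - o(1)) * ln T."""
    mu_star = max(means)
    return sum((mu_star - mu) / kl_bernoulli(mu, mu_star)
               for mu in means if mu < mu_star)
```

For instance, a two-armed Bernoulli bandit with means $0.4$ and $0.6$ gives rate $0.2/\mathrm{kl}(0.4, 0.6)$.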
- Extension of the time-peeling argument to KL concentration. For exponential families of dimension 1 (Bernoulli, Gaussian with known variance, Poisson, etc.), we have
$$(\text{Peeling method})\quad \mathbb{P}\bigg( \mathrm{KL}(\hat\nu_{\tau_t}, \nu) \ge \frac{\alpha}{\tau_t} \ln\Big(\frac{\ln(t)}{\ln(\alpha)}\,\frac{1}{\delta}\Big),\ \hat\mu_{\tau_t} < \mu \bigg) \le \delta.$$
- Let $p_\theta$ be in an $(F, \psi, \nu_0)$-exponential family: $p_\theta(dx) = \exp(\langle\theta, F(x)\rangle - \psi(\theta))\,\nu_0(dx)$.
It holds that $\mathrm{KL}(\hat\nu_n, \nu) = \mathcal{B}_\psi(\hat\theta_n, \theta^\star) = \Phi^\star(\hat F_n)$ for some explicit $\Phi^\star$.
In dimension 1, $\Phi^\star$ is monotonic on the positive cone $\mathbb{R}^+$ (and on $\mathbb{R}^-$).
- Application to bandits (Cappé, Garivier, Maillard, Munos, Stoltz, 2013):
$$\mathbb{P}\bigg( \mathrm{KL}(\hat\nu_{\tau_t}, \nu) \ge \frac{f(t)}{\tau_t},\ \hat\mu_{\tau_t} < \mu \bigg) \le \lceil f(t)\log(t) \rceil\, e^{-f(t)+1},$$
using $\alpha = f(t)/(f(t)-1)$, where $f(t) = \log(t) + \xi\log(\log(t))$ with $\xi > 2$.
Boundary crossing in dimension 1
.
- The KL-UCB strategy on $\mathcal{D}$ (parameterized by a threshold function $f$):
$$\operatorname*{argmax}_{a\in\mathcal{A}}\ \max\Big\{ \mathbb{E}_{\nu_a}[X] = \mu : N_a(t)\,\mathcal{K}(\hat\nu_{a,N_a(t)}, \mu) \le f(t) \Big\}, \quad \text{where } \mathcal{K}(\nu_a, \mu^\star) = \inf\{ \mathrm{KL}(\nu_a, \nu) : \mathbb{E}_\nu[X] > \mu^\star \}.$$
- Regret upper bound of the KL-UCB strategy, for $\varepsilon < \Delta_a$:
$$\mathbb{E}[N_a(T)] \le 2 + \inf_{n_0}\bigg\{ n_0 + \sum_{n=n_0+1}^T \mathbb{P}\Big( n\,\mathcal{K}(\hat\nu_{a,n}, \mu^\star - \varepsilon) < f(T) \Big) \bigg\} + \sum_{t=A}^T \underbrace{\mathbb{P}\Big( N^\star(t)\,\mathcal{K}(\hat\nu_{\star,N^\star(t)}, \mu^\star - \varepsilon) \ge f(t) \Big)}_{\text{boundary-crossing probability: } o(1/t)?}$$
[Figure: arms ordered by increasing means, $\mu_a \dots \mu^\star$]
Regret decomposition of KL-UCB
.
- Dimension 1: handle the blue part (one direction is enough: $\hat\mu_n < \mu^\star - \varepsilon$).
- Dimension $D$: handle the blue part, with $\hat\nu_n$ such that $\hat\mu_n < \mu^\star - \varepsilon$.
[Figure: arms ordered by increasing means, $\mu_a \dots \mu^\star$]
Dimension D: one difficulty
.
The exponential family $\mathcal{E}(F; \nu_0)$ generated by a function $F$ and a reference measure $\nu_0$ is
$$\Big\{ \nu_\theta \in \mathcal{M}_1(\mathcal{X}) : \forall x \in \mathcal{X},\ \nu_\theta(x) = \exp\big( \langle\theta, F(x)\rangle - \psi(\theta) \big)\,\nu_0(x),\ \theta \in \mathbb{R}^K \Big\},$$
with
- log-partition function: $\psi(\theta) \stackrel{\text{def}}{=} \ln \int_{\mathcal{X}} \exp\langle\theta, F(x)\rangle\,\nu_0(dx)$;
- canonical parameter set: $\Theta_D \stackrel{\text{def}}{=} \big\{ \theta \in \mathbb{R}^K : \psi(\theta) < \infty \big\}$;
- invertible parameter set: $\Theta_I \stackrel{\text{def}}{=} \big\{ \theta \in \Theta_D : 0 < \lambda_{\text{MIN}}(\nabla^2\psi(\theta)) \le \lambda_{\text{MAX}}(\nabla^2\psi(\theta)) < \infty \big\}$,
where $\lambda_{\text{MIN}}(M)$ and $\lambda_{\text{MAX}}(M)$ are the minimum and maximum eigenvalues of a positive semi-definite matrix $M$.
Examples: Bernoulli ($K = 1$, $F(x) = x$), Gaussian ($K = 2$, $F(x) = (x, x^2)$).
Bregman divergence: $\mathrm{KL}(\nu_\theta, \nu_{\theta'}) = \mathcal{B}_\psi(\theta, \theta') \stackrel{\text{def}}{=} \psi(\theta') - \psi(\theta) - \langle\theta' - \theta, \nabla\psi(\theta)\rangle$.
Exponential families
.
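The Bregman identity can be sanity-checked on the Bernoulli family, where $\theta = \operatorname{logit}(p)$ and $\psi(\theta) = \log(1 + e^\theta)$; a sketch:

```python
import math

def psi(theta):
    """Bernoulli log-partition function in the natural parameter."""
    return math.log(1 + math.exp(theta))

def bregman(theta_a, theta_b):
    """B_psi(theta, theta') = psi(theta') - psi(theta) - <theta' - theta, grad psi(theta)>."""
    grad = 1 / (1 + math.exp(-theta_a))  # grad psi(theta) = mean parameter p
    return psi(theta_b) - psi(theta_a) - (theta_b - theta_a) * grad

def kl_bernoulli(p, q):
    """KL(B(p), B(q)) for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def logit(p):
    return math.log(p / (1 - p))
```

Here `bregman(logit(p), logit(q))` coincides with `kl_bernoulli(p, q)`, illustrating $\mathrm{KL}(\nu_\theta, \nu_{\theta'}) = \mathcal{B}_\psi(\theta, \theta')$.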