
(1) A tour of time-uniform concentration inequalities: Laplace, Peeling and Kernel.

May 14, Besançon. Odalric-Ambrym Maillard, Inria Lille – Nord Europe.

(2–3) Table of contents

• Motivation
  – Multi-armed bandits
  – Time-uniform empirical mean
• A generic peeling tool
• The specific Laplace method
• Open problems

(4) Multi-armed bandits

"Facing the traveler tree, you ask yourself: which path shall I take this time?"

(5–7) The Stochastic Multi-Armed Bandit model

$\mathcal{A}$ arms ↔ $\mathcal{A}$ (unknown) probability distributions on $\mathbb{R}$: $\nu_1,\nu_2,\nu_3,\nu_4,\nu_5$, with means $\mu_1,\dots,\mu_5$ and maximal mean $\mu^\star = \max_{a\in\mathcal{A}} \mu_a$.

At round $t$, an agent:
• chooses an arm $A_t \in \mathcal{A}$, with $A_t \sim \pi(A_1,Y_1,\dots,A_{t-1},Y_{t-1})$;
• observes a sample $Y_t = X_{A_t,t} \sim \nu_{A_t}$ (the reward), and only that!

Goal: maximize the sum of collected rewards $\sum_{t=1}^T Y_t$ over time; minimize the regret
$$R_T^\pi(\nu) = T\mu^\star - \mathbb{E}_{\nu,\pi}\Big[\sum_{t=1}^T Y_t\Big] = \sum_{a\in\mathcal{A}} (\mu^\star - \mu_a)\,\mathbb{E}\big[N_T(a)\big], \qquad N_T(a) = \sum_{t=1}^T \mathbb{I}\{A_t = a\}.$$

(8–9) Multi-armed bandit applications

Basic model (first approximation) for:
• Clinical trials (Thompson, 1933);
• Casino slot machines (Robbins, 1952);
• Ad placement (nowadays...).

[The slides illustrate each arm set $\mathcal{A}$ with pictures.]

(10–11) Next generation of applications

Eco-sustainable decision making:
• Plant health-care;
• Ground health-care.

Medical decision companion:
• Emergency admission filtering.

(12–14) Optimism in the Face of Uncertainty Learning (OFUL)

Whenever we are uncertain about the outcome of an arm, we consider the best possible world and choose the best arm:
$$\operatorname*{argmax}_{a\in\mathcal{A}}\ \max\Big\{ \mathbb{E}_{\tilde\nu_a}[X] : \tilde\nu_a \text{ compatible with observations on arm } a \Big\}.$$
Why it works:
• If the best possible world is correct ⇒ no regret.
• If the best possible world is wrong ⇒ the reduction in uncertainty is maximized.

(15) The Upper Confidence Bound (UCB) Algorithm

The idea:
[Figure: empirical means and confidence intervals for arms 1 (10 pulls), 2 (73), 3 (3) and 4 (23); reward axis from −1.5 to 2.]

(16–20) The optimistic principle

$$\text{(Oracle)}\quad a_{t+1} \in \operatorname*{argmax}_{a\in\mathcal{A}} \mu_a, \qquad \mu_a = \hat\mu_{t,a} + (\mu_a - \hat\mu_{t,a}).$$

▸ Choose the highest value compatible with the empirical means (Auer et al. 2002):
$$\text{(UCB)}\quad A_{t+1} \in \operatorname*{argmax}_{a\in\mathcal{A}}\ \max\Big\{ \mathbb{E}_{\nu_a}[X]=\mu : \mu - \hat\mu_{t,a} \le \sqrt{\frac{\log(t^3)}{2N_t(a)}} \Big\}.$$

▸ Use the full empirical distributions $\hat\nu_{t,a}$ (exponential families) instead (Lai, Robbins 1987):
$$\text{(KL-UCB)}\quad A_{t+1} \in \operatorname*{argmax}_{a\in\mathcal{A}}\ \max\big\{ \mathbb{E}_{\nu_a}[X] : \mathrm{KL}(\hat\nu_{t,a},\nu_a) \le \dots \big\}.$$

▸ Extends to linear objective functions $f_\theta(a) = \langle\theta,\varphi(a)\rangle$ (Abbasi-Yadkori et al. 2011):
$$\text{(OFUL)}\quad A_{t+1} \in \operatorname*{argmax}_{a\in\mathcal{A}}\ \max\big\{ f_\theta(a) : \|\theta-\hat\theta_t\|_{V_t} \le \dots \big\} \quad \text{(or kernel (M. 2016))}.$$

▸ Extends to a risk-averse criterion (M. 2013):
$$\text{(RA-UCB)}\quad A_{t+1} \in \operatorname*{argmax}_{a\in\mathcal{A}}\ \max\big\{ \kappa_{\nu_a} : \mathrm{KL}(\hat\nu_{t,a},\nu_a) \le \dots \big\}, \qquad \text{where } \kappa_\nu = \inf_{q\in\mathcal{P}(\mathbb{R})} \mathbb{E}_q[X] + \tfrac{1}{\lambda}\mathrm{KL}(q,\nu).$$

(21) Table of contents (current section: Motivation — Time-uniform empirical mean)

(22–24) Tuning the confidence δ of UCB

The Upper Confidence Bound algorithm (Auer et al. 2002): choose $A_{t+1} = \operatorname{Argmax}\{\mu^+_{a,t} : a \in \mathcal{A}\}$, where
$$\mu^+_{a,t} = \hat\mu_{a,t} + \sqrt{\frac{\log(1/\delta_t)}{2N_a(t)}}.$$

Intuition: UCB should pull the suboptimal arms
• enough, so as to identify which arm is the best;
• not too much, so as to keep the regret as small as possible.

The confidence level $1-\delta$ has the following impact (similar for α):
• large $1-\delta$: high level of exploration;
• small $1-\delta$: high level of exploitation.

Solution: depending on the time horizon, we can tune the trade-off between exploration and exploitation.
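To make the index concrete, here is a minimal sketch of UCB in Python, assuming Bernoulli rewards in [0,1] and the hypothetical confidence schedule $\delta_t = 1/t^3$, so that the bonus matches the $\sqrt{\log(t^3)/2N_t(a)}$ form of the earlier UCB slide; the arm means 0.3 and 0.5 are illustrative choices, not from the slides.

```python
import math
import random

def ucb_index(mean_hat: float, n_pulls: int, t: int) -> float:
    """Optimistic index: empirical mean plus a Hoeffding-style bonus."""
    if n_pulls == 0:
        return float("inf")  # force initial exploration of each arm
    delta_t = 1.0 / t**3  # hypothetical schedule; log(1/delta_t) = log(t^3)
    return mean_hat + math.sqrt(math.log(1.0 / delta_t) / (2 * n_pulls))

# Toy run on two Bernoulli arms with (illustrative) means 0.3 and 0.5.
means = [0.3, 0.5]
pulls, sums = [0, 0], [0.0, 0.0]
for t in range(1, 10_001):
    a = max(range(len(means)),
            key=lambda a: ucb_index(sums[a] / pulls[a] if pulls[a] else 0.0,
                                    pulls[a], t))
    reward = 1.0 if random.random() < means[a] else 0.0
    pulls[a] += 1
    sums[a] += reward
print("pull counts:", pulls)  # the suboptimal arm is pulled O(log T) times
```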

(25) Remark

$N_a(t) = \sum_{s=1}^t \mathbb{I}\{A_s = a\}$ is a random stopping time.

(26–29) Concentration with random stopping times

Handle random stopping times $\tau_t$, $\tau$ carefully:

$$\text{(Union bound)}\quad \mathbb{P}\Big( \frac{1}{\tau_t}\sum_{i=1}^{\tau_t} (\mu - X_i) \ge \sqrt{\frac{\ln(t/\delta)}{2\tau_t}} \Big) \le \delta.$$

$$\text{(Peeling method)}\quad \mathbb{P}\Big( \frac{1}{\tau_t}\sum_{i=1}^{\tau_t} (\mu - X_i) \ge \sqrt{\frac{\alpha}{2\tau_t}\, \ln\Big(\frac{\ln(t)}{\ln(\alpha)}\,\frac{1}{\delta}\Big)} \Big) \le \delta.$$

$$\text{(Peeling method)}\quad \mathbb{P}\Big( \frac{1}{\tau}\sum_{i=1}^{\tau} (\mu - X_i) \ge \sqrt{\frac{\alpha}{2\tau}\, \ln\Big(\frac{\ln(\tau)\ln(\alpha\tau)}{\ln^2(\alpha)}\,\frac{1}{\delta}\Big)} \Big) \le \delta.$$

$$\text{(Laplace method)}\quad \mathbb{P}\Big( \frac{1}{\tau}\sum_{i=1}^{\tau} (\mu - X_i) \ge \sqrt{\frac{1+\frac{1}{\tau}}{2\tau}\, \ln\frac{\sqrt{\tau+1}}{\delta}} \Big) \le \delta.$$

Provably reduces regret, thus mistakes (saves lives).
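As a sketch of how these four thresholds compare numerically, the following assumes 1/2-sub-Gaussian observations (e.g. rewards in [0,1]) so the square-root expressions above apply verbatim; the values of τ, t, α and δ are illustrative.

```python
import math

# Deviation thresholds from the four bounds above; tau is the realized
# stopping time, t an a.s. bound on it, alpha > 1 the peeling parameter.

def union_bound(tau, t, delta):
    return math.sqrt(math.log(t / delta) / (2 * tau))

def peeling_bounded(tau, t, delta, alpha=1.1):
    return math.sqrt(alpha / (2 * tau)
                     * math.log(math.log(t) / math.log(alpha) / delta))

def peeling_unbounded(tau, delta, alpha=1.1):
    return math.sqrt(alpha / (2 * tau)
                     * math.log(math.log(tau) * math.log(alpha * tau)
                                / (delta * math.log(alpha) ** 2)))

def laplace(tau, delta):
    return math.sqrt((1 + 1 / tau) / (2 * tau)
                     * math.log(math.sqrt(tau + 1) / delta))

tau, t, delta = 500, 10**4, 0.01
for name, v in [("union", union_bound(tau, t, delta)),
                ("peeling (bounded)", peeling_bounded(tau, t, delta)),
                ("peeling (unbounded)", peeling_unbounded(tau, delta)),
                ("laplace", laplace(tau, delta))]:
    print(f"{name:>20}: {v:.4f}")
```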

(30–31) Time-Peeling and Laplace

▸ Geometric peeling: $\alpha > 1$ (agnostic to the probability measure).
$$t_0 = 1,\ t_1 = \alpha,\ t_2 = \alpha^2,\ t_3 = \alpha^3,\ \dots,\ t_k = \alpha^k,\ \dots \qquad \mathbb{P}(\Omega(\tau_t)) \le \sum_{k\in\mathbb{N}} \mathbb{P}\big(\exists t \in [t_k, t_{k+1}) : \Omega(t)\big).$$
The sum over $k$ yields the $\frac{\ln(n)}{\ln(\alpha)}$ term; the sup over $[t_k, t_{k+1})$ yields a $t_{k+1}/t_k = \alpha$ slow-down.
Chernoff inequality: $\mathbb{P}(Z \ge t) \le \exp(-\varphi^\star(t))$.

▸ Laplace: integration ($\simeq$ infinitesimal peeling), closely following the distribution.
$$M_t^\lambda = \exp\Big( \lambda \sum_{s=1}^t (X_s - \mu) - t\lambda^2/8 \Big) = \exp\big( \lambda S_t - t\lambda^2/8 \big).$$
Let $\Lambda \sim \mathcal{N}(0,4)$. Then $M_t = \mathbb{E}_\Lambda[M_t^\Lambda] = \exp\big(2S_t^2/(t+1)\big)\,(t+1)^{-1/2}$, and
$$\mathbb{P}\Big( S_t \ge \sqrt{\frac{t+1}{2}\, \ln\frac{\sqrt{t+1}}{\delta}} \Big) = \mathbb{P}(M_t \ge 1/\delta).$$
Still valid for a stopping time τ...
Note: a closed-form expression exists only for specific measures.
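A quick simulation sketch of the time-uniformity claim: with $X_s - \mu$ drawn i.i.d. $\mathcal{N}(0, 1/4)$ (so the $\lambda^2/8$ envelope above is exact), the probability that $S_t$ ever crosses the Laplace threshold before a horizon T should stay below δ; the horizon, δ and number of runs below are arbitrary choices.

```python
import math
import random

# Monte-Carlo check of the time-uniform Laplace bound:
# P(exists t <= T : S_t >= sqrt((t+1)/2 * ln(sqrt(t+1)/delta))) <= delta.

def threshold(t, delta):
    return math.sqrt((t + 1) / 2 * math.log(math.sqrt(t + 1) / delta))

delta, T, runs = 0.05, 2000, 2000
violations = 0
for _ in range(runs):
    s = 0.0
    for t in range(1, T + 1):
        s += random.gauss(0.0, 0.5)  # one increment X_s - mu, sigma = 1/2
        if s >= threshold(t, delta):
            violations += 1
            break
print(f"empirical crossing rate: {violations / runs:.4f} (should be <= {delta})")
```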

(32–37) Comparison for bounded stopping times

[Figures: ratio of the different time-uniform concentration bounds over that of the Laplace method, as a function of t, for a confidence level δ = 0.01 and horizon bounds n = 10², 10³, 10⁴, 10⁵, 10⁶, 10⁷.]

Here, the $\log\sqrt{t}$ bound is better than the $\log\log t$ bound: constants matter!

(38) Comparison for unbounded stopping times

[Figure: ratio of the different time-uniform concentration bounds over that of the Laplace method, as a function of t, for a confidence level δ = 0.01 and various choices of α = 1 + ....]

Again the $\log\sqrt{t}$ bound is better than the $\log\log t$ bound: constants matter!

(39) Table of contents (current section: A generic peeling tool)

(40) Time-Peeling

Let $Z = \{Z_i\}_{i\ge 1}$ be a predictable process and $\mathcal{H}$ its natural filtration. Let $\varphi$ be an upper envelope of the log-Laplace transform of the $Z_i$, and let $\varphi^{-1}_{\star,+}$ denote the positive inverse map of its Cramér transform (Legendre–Fenchel dual), that is:
$$\forall \lambda \in \mathcal{D},\ \forall i,\quad \ln \mathbb{E}\big[\exp(\lambda Z_i)\,\big|\,\mathcal{H}_{i-1}\big] \le \varphi(\lambda), \qquad \forall x \in \mathbb{R},\quad \varphi^\star(x) = \sup_{\lambda\in\mathbb{R}} \big(\lambda x - \varphi(\lambda)\big).$$
Let $\varphi^{-1}_{\star,+} : \mathbb{R} \to \mathbb{R}^+$ (resp. $\varphi^{-1}_{\star,-}$) be its inverse map on $\mathbb{R}^+$ (resp. $\mathbb{R}^-$).

Let $N_n$ be an $\mathbb{N}$-valued, $\mathcal{H}$-measurable random variable a.s. bounded by $n$. Then
$$\forall \alpha \in (1,n],\ \delta \in (0,1),\quad \mathbb{P}\Big( \frac{1}{N_n}\sum_{i=1}^{N_n} Z_i \ge \varphi^{-1}_{\star,+}\Big( \frac{\alpha}{N_n}\, \ln\Big(\frac{\ln(n)}{\ln(\alpha)}\,\frac{1}{\delta}\Big) \Big) \Big) \le \delta.$$
Now, if $N$ is a (possibly unbounded) $\mathbb{N}$-valued, $\mathcal{H}$-measurable random variable,
$$\forall \alpha > 1,\ \delta \in (0,1),\quad \mathbb{P}\Big( \frac{1}{N}\sum_{i=1}^{N} Z_i \ge \varphi^{-1}_{\star,+}\Big( \frac{\alpha}{N}\, \ln\Big(\frac{\ln(N)\ln(\alpha N)}{\delta\,\ln^2(\alpha)}\Big) \Big) \Big) \le \delta.$$
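A direct transcription of the bounded-$N_n$ bound as a sketch: given any positive inverse map $\varphi^{-1}_{\star,+}$, compute the deviation level at confidence δ. The sub-Gaussian inverse map $\varphi^{-1}_{\star,+}(z) = \sqrt{2\sigma^2 z}$ from the next slide serves as the example; function names and the parameter values are mine.

```python
import math

# Bounded-N_n peeling threshold from the theorem above:
# epsilon = phi_inv_plus( (alpha / N_n) * ln( ln(n)/ln(alpha) * 1/delta ) ).

def peeling_threshold(phi_inv_plus, N_n, n, delta, alpha=1.1):
    z = (alpha / N_n) * math.log(math.log(n) / math.log(alpha) / delta)
    return phi_inv_plus(z)

# Sub-Gaussian case: phi_inv_plus(z) = sqrt(2 sigma^2 z); sigma^2 = 1/4
# recovers the sqrt(alpha/(2 tau_t) ln(...)) bound of the earlier slide.
sub_gaussian_inv = lambda z, sigma2=0.25: math.sqrt(2 * sigma2 * z)
print(peeling_threshold(sub_gaussian_inv, N_n=500, n=10**4, delta=0.01))
```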

(41–44) Examples

▸ sub-Gaussian $\mathcal{N}(\sigma^2)$:
$$\varphi(\lambda) \le \frac{\lambda^2\sigma^2}{2},\ \lambda\in\mathbb{R}, \qquad \varphi^{-1}_{\star,+}(z) = \sqrt{2\sigma^2 z},\quad \varphi^{-1}_{\star,-}(z) = -\sqrt{2\sigma^2 z}.$$

▸ sub-Gamma $\Gamma(v,c)$ (good for Bernstein bounds):
$$\varphi(\lambda) \le \frac{\lambda^2 v}{2(1-c\lambda)},\ \lambda\in[0,1/c), \qquad \varphi^{-1}_{\star,+}(z) = \sqrt{2vz}+cz,\quad \varphi^{-1}_{\star,-}(z) = -\sqrt{2vz}-cz.$$

▸ Bernoulli $\mathcal{B}(p)$ (good for discrete distributions):
$$\varphi(\lambda) \le \frac{\lambda^2}{2}\begin{cases} p(1-p) & \text{if } p > \frac12,\ \lambda\in\mathbb{R}^+ \\ g(p) & \text{if } p\in[0,1],\ \lambda\in\mathbb{R}\end{cases} \qquad \varphi^{-1}_{\star,+}(z) = \sqrt{2\tilde g(p)z},\quad \varphi^{-1}_{\star,-}(z) = -\sqrt{2g(p)z},$$
where $g(p) = \frac{1/2-p}{\ln(1/p-1)}$ and $\tilde g(p) = p(1-p)$ if $p > \frac12$, $\tilde g(p) = g(p)$ otherwise.

▸ Square of a Gaussian (good for empirical Bernstein bounds):
$$\varphi(\lambda) \le -\frac12 \ln(1-2\lambda\sigma^2),\ \lambda\in[0,1/2\sigma^2), \qquad \varphi^{-1}_{\star,+}(z) \le \sigma^2(1+\sqrt{2z})^2,\quad \varphi^{-1}_{\star,-}(z) \ge \sigma^2(1-2\sqrt{z}).$$
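These inverse maps translate directly into code; a sketch (the function names are mine, and `peeling_threshold` refers to the sketch after the Time-Peeling theorem):

```python
import math

# The inverse maps listed above, usable as phi_inv_plus in peeling_threshold.

def sub_gaussian_inv(z, sigma2):             # sqrt(2 sigma^2 z)
    return math.sqrt(2 * sigma2 * z)

def sub_gamma_inv(z, v, c):                  # sqrt(2 v z) + c z  (Bernstein)
    return math.sqrt(2 * v * z) + c * z

def bernoulli_g(p):                          # g(p) = (1/2 - p) / ln(1/p - 1)
    return 0.25 if p == 0.5 else (0.5 - p) / math.log(1 / p - 1)  # limit 1/4

def bernoulli_inv(z, p):                     # sqrt(2 g~(p) z), g~ as above
    g_tilde = p * (1 - p) if p > 0.5 else bernoulli_g(p)
    return math.sqrt(2 * g_tilde * z)

# The sub-Gamma map adds the linear cz term on top of the sub-Gaussian one:
print(sub_gaussian_inv(1.0, 0.25), sub_gamma_inv(1.0, 0.25, 0.3))
```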

(45) A simple proof

▸ Let $(\varepsilon_t)_t$ be non-increasing. Then
$$\mathbb{P}\Big( \frac{1}{N_n}\sum_{i=1}^{N_n} Z_i \ge \varepsilon_{N_n} \Big) \le \sum_{k=1}^{K} \mathbb{P}\Big( \exists t\in(t_{k-1},t_k] : \exp\Big(\lambda_k \sum_{i=1}^t Z_i\Big) \ge \exp(t\lambda_k\varepsilon_t) \Big)$$
$$\le \sum_{k=1}^{K} \mathbb{P}\Big( \exists t\in(t_{k-1},t_k] : \underbrace{\exp\Big(\lambda_k \sum_{i=1}^t Z_i - t\varphi(\lambda_k)\Big)}_{W_{k,t}} \ge \exp\Big( t\underbrace{\big(\lambda_k \varepsilon_{t_k} - \varphi(\lambda_k)\big)}_{\varphi^\star(\varepsilon_{t_k})} \Big) \Big)$$
$$\le \sum_{k=1}^{K} \mathbb{P}\Big( \max_{t\in(t_{k-1},t_k]} W_{k,t} \ge \exp\Big( \frac{t_k\,\varphi^\star(\varepsilon_{t_k})}{\alpha} \Big) \Big) \quad \text{(using } t > t_k/\alpha\text{)} \quad \le\ \sum_{k=1}^{K} \exp\Big( -\frac{t_k\,\varphi^\star(\varepsilon_{t_k})}{\alpha} \Big),$$
where the last step applies Doob's maximal inequality to the supermartingale $W_{k,t}$.

▸ Tuning for bounded $K$: choose $\varepsilon_t$ such that $t\varphi^\star(\varepsilon_t) = \alpha\ln(K/\delta)$, which gives $\sum_{k=1}^K \frac{\delta}{K} = \delta$.

(46) Back to bandits?

$$R_\nu(T) = \sum_{a:\Delta_a>0} \Delta_a\, \mathbb{E}[N_a(T)]$$
⇒ we only need to study the expected number of pulls of suboptimal arms.

Lower bound (Lai & Robbins '85; Burnetas & Katehakis '96): for any bandit $\nu \in \mathcal{D}$ and any "uniformly good" strategy knowing $\mathcal{D}$,
$$\forall a : \mu_a < \mu^\star,\quad \liminf_{T\to\infty} \frac{\mathbb{E}[N_a(T)]}{\ln T} \ge \frac{1}{\mathcal{K}_a(\nu_a,\mu^\star)}, \qquad \mathcal{K}_a(\nu_a,\mu^\star) = \inf\{\mathrm{KL}(\nu_a,\nu) : \nu\in\mathcal{D},\ \mathbb{E}_\nu[X] > \mu^\star\}.$$
• Bernoulli: $\mathcal{K}_a(\nu_a,\mu^\star) = \mathrm{KL}(\mathcal{B}(\mu_a),\mathcal{B}(\mu^\star)) \overset{\mathrm{def}}{=} \mathrm{kl}(\mu_a,\mu^\star)$.

Main tool: change of measure.
$$\text{(Probability)}\quad \forall \Omega,\ \forall c\in\mathbb{R},\quad \mathbb{P}_\nu\Big( \Omega \cap \Big\{ \ln\frac{d\nu}{d\tilde\nu}(X) \le c \Big\} \Big) \le \exp(c)\,\mathbb{P}_{\tilde\nu}(\Omega).$$
$$\text{(Expectation)}\quad \mathbb{E}_\nu\Big[ \ln\frac{d\nu}{d\tilde\nu}(X) \Big] \ge \sup_{g:\mathcal{X}\to[0,1]} \mathrm{kl}\big( \mathbb{E}_\nu[g(X)],\, \mathbb{E}_{\tilde\nu}[g(X)] \big).$$
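For the Bernoulli case, the lower bound is easy to evaluate numerically; a sketch (the means and horizon are illustrative):

```python
import math

# Bernoulli change-of-measure cost kl(p, q) = KL(B(p), B(q)); by Lai-Robbins,
# a suboptimal arm with mean mu_a must be pulled at least
# ~ ln(T) / kl(mu_a, mu_star) times by any uniformly good strategy.

def kl_bernoulli(p, q):
    eps = 1e-12  # clip away from {0, 1} for numerical safety
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

mu_a, mu_star, T = 0.4, 0.5, 10**5
print(f"kl = {kl_bernoulli(mu_a, mu_star):.4f}, "
      f"lower bound on E[N_a(T)] ~ {math.log(T) / kl_bernoulli(mu_a, mu_star):.1f}")
```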

(47–48) Boundary crossing in dimension 1

▸ Extension of the time-peeling argument to KL concentration. For exponential families of dimension 1 (Bernoulli, Gaussian with known variance, Poisson, etc.), we have:
$$\text{(Peeling method)}\quad \mathbb{P}\Big( \mathrm{KL}(\hat\nu_{\tau_t},\nu) \ge \frac{\alpha}{\tau_t}\, \ln\Big(\frac{\ln(t)}{\ln(\alpha)}\,\frac{1}{\delta}\Big),\ \hat\mu_{\tau_t} < \mu \Big) \le \delta.$$

▸ Let $p_\theta$ be in an $(F,\psi,\nu_0)$-exponential family: $p_\theta(dx) = \exp(\langle\theta,F(x)\rangle - \psi(\theta))\,\nu_0(dx)$.
It holds that $\mathrm{KL}(\hat\nu_n,\nu) = \mathcal{B}_\psi(\hat\theta_n,\theta^\star) = \Phi^\star(\hat F_n)$ for some explicit $\Phi^\star$.
In dimension 1, $\Phi^\star$ is monotonic on the positive cone $\mathbb{R}^+$ (and on $\mathbb{R}^-$).

▸ Application to bandits (Cappé, Garivier, M., Munos, Stoltz 2013):
$$\mathbb{P}\Big( \mathrm{KL}(\hat\nu_{\tau_t},\nu) \ge \frac{f(t)}{\tau_t},\ \hat\mu_{\tau_t} < \mu \Big) \le \lceil f(t)\log(t)\rceil\, e^{-f(t)+1},$$
using $\alpha = f(t)/(f(t)-1)$, where $f(t) = \log(t) + \xi\log(\log(t))$ with $\xi > 2$.
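A small numeric sketch of why ξ > 2 matters: the bound above then decays faster than 1/t, which is what the regret decomposition on the next slide needs; ξ = 3 is an illustrative choice of mine.

```python
import math

# Boundary-crossing bound ceil(f(t) log(t)) * exp(-f(t) + 1),
# with f(t) = log(t) + xi * log(log(t)); check that t * bound(t) decreases.

def bound(t, xi=3.0):
    f = math.log(t) + xi * math.log(math.log(t))
    return math.ceil(f * math.log(t)) * math.exp(-f + 1)

for t in [10**2, 10**3, 10**4, 10**5]:
    print(f"t = {t:>6}:  bound = {bound(t):.3e},  t * bound = {t * bound(t):.3e}")
```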

(49–50) Regret decomposition of KL-ucb

▸ The KL-ucb strategy on $\mathcal{D}$ (parameterized by a threshold function $f$):
$$\operatorname*{argmax}_{a\in\mathcal{A}}\ \max\Big\{ \mathbb{E}_{\nu_a}[X]=\mu : N_a(t)\,\mathcal{K}^{\mathcal{D}}(\hat\nu_{a,N_a(t)},\mu) \le f(t) \Big\},$$
where $\mathcal{K}^{\mathcal{D}}(\nu_a,\mu^\star) = \inf\{\mathrm{KL}(\nu_a,\nu) : \nu\in\mathcal{D},\ \mathbb{E}_\nu[X] > \mu^\star\}$.

▸ Regret upper bound of the KL-ucb strategy, for $\varepsilon < \Delta_a$:
$$\mathbb{E}[N_a(T)] \le 2 + \inf_{n_0}\Big[ n_0 + \sum_{n=n_0+1}^{T} \mathbb{P}\big( n\,\mathcal{K}^{\mathcal{D}}(\hat\nu_{a,n},\mu^\star-\varepsilon) < f(T) \big) \Big] + \sum_{t=|\mathcal{A}|}^{T} \underbrace{\mathbb{P}\big( N_\star(t)\,\mathcal{K}^{\mathcal{D}}(\hat\nu_{\star,N_\star(t)},\mu^\star-\varepsilon) > f(t) \big)}_{\text{Boundary-crossing probability: } o(1/t)?}$$

[Figure: axis of means from $\mu_a$ to $\mu^\star$, increasing to the right.]

(51–53) Dimension D: one difficulty

▸ Dimension 1: handle the blue part (one direction is enough: $\hat\mu_n < \mu^\star - \varepsilon$).
▸ Dimension D: handle the blue part for all $\hat\nu_n$ such that $\hat\mu_n < \mu^\star - \varepsilon$.

[Figures: axis of means from $\mu_a$ to $\mu^\star$, with the boundary-crossing region highlighted in blue.]

(54–55) Exponential families

The exponential family $\mathcal{E}(F;\nu_0)$ generated by a function $F$ and a reference measure $\nu_0$ is
$$\Big\{ \nu_\theta \in \mathcal{M}_1(\mathcal{X}) : \forall x\in\mathcal{X},\ \nu_\theta(x) = \exp\big( \langle\theta,F(x)\rangle - \psi(\theta) \big)\,\nu_0(x),\ \theta\in\mathbb{R}^K \Big\},$$
with
• Log-partition function: $\psi(\theta) \overset{\mathrm{def}}{=} \ln \int_{\mathcal{X}} \exp\langle\theta,F(x)\rangle\,\nu_0(dx)$;
• Canonical parameter set: $\Theta_D \overset{\mathrm{def}}{=} \{\theta\in\mathbb{R}^K : \psi(\theta) < \infty\}$;
• Invertible parameter set: $\Theta_I \overset{\mathrm{def}}{=} \{\theta\in\Theta_D : 0 < \lambda_{\min}(\nabla^2\psi(\theta)) \le \lambda_{\max}(\nabla^2\psi(\theta)) < \infty\}$,
where $\lambda_{\min}(M)$ and $\lambda_{\max}(M)$ are the minimum and maximum eigenvalues of a positive semi-definite matrix $M$.

Examples: Bernoulli ($K=1$, $F(x)=x$), Gaussian ($K=2$, $F(x)=(x,x^2)$).

Bregman divergence:
$$\mathrm{KL}(\nu_\theta,\nu_{\theta'}) = \mathcal{B}_\psi(\theta,\theta') \overset{\mathrm{def}}{=} \psi(\theta') - \psi(\theta) - \langle \theta'-\theta,\, \nabla\psi(\theta)\rangle.$$
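A sketch checking the Bregman identity for the Bernoulli family, where $\psi(\theta) = \ln(1+e^\theta)$ and $\nabla\psi(\theta)$ is the sigmoid (the parameter values are arbitrary):

```python
import math

# For Bernoulli (K = 1, F(x) = x): KL(nu_theta, nu_theta') should equal
# B_psi(theta, theta') = psi(theta') - psi(theta) - (theta' - theta) psi'(theta).

def psi(theta):
    return math.log(1 + math.exp(theta))

def grad_psi(theta):  # mean parameter p = sigmoid(theta)
    return 1 / (1 + math.exp(-theta))

def bregman(theta, theta_p):
    return psi(theta_p) - psi(theta) - (theta_p - theta) * grad_psi(theta)

def kl_bernoulli(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

theta, theta_p = 0.3, -0.8
p, q = grad_psi(theta), grad_psi(theta_p)
print(bregman(theta, theta_p), kl_bernoulli(p, q))  # the two values agree
```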
