A tour of time-uniform concentration inequalities:
Laplace, Peeling and Kernel.
May 14, Besançon. Odalric-Ambrym Maillard, Inria Lille – Nord Europe
.
Motivation
- Multi-armed bandits
- Time-uniform empirical mean
A generic peeling tool
The specific Laplace method
Open problems
Table of contents
.
Multi-armed bandits
“Facing the traveler tree, you ask yourself: which path shall I take this time?”
.
$A$ arms ↔ $A$ (unknown) probability distributions on $\mathbb{R}$: $\nu_1, \nu_2, \nu_3, \nu_4, \nu_5$
Means: $\mu_1, \mu_2, \mu_3, \mu_4, \mu_5$; maximal mean: $\mu^\star = \max_{a\in\mathcal{A}} \mu_a$.
At round $t$, an agent:
- chooses an arm $A_t \in \mathcal{A}$, with $A_t \sim \pi(A_1, Y_1, \dots, A_{t-1}, Y_{t-1})$
- observes a sample $Y_t = X_{A_t,t} \sim \nu_{A_t}$ (the reward, and only that!)
Goal: maximize the sum of collected rewards $\sum_{t=1}^T Y_t$ over time, i.e. minimize the regret
$$R^\pi_T(\nu) = T\mu^\star - \mathbb{E}_{\nu,\pi}\Big[\sum_{t=1}^T Y_t\Big] = \sum_{a\in\mathcal{A}} (\mu^\star - \mu_a)\,\mathbb{E}[N_T(a)], \qquad N_T(a) = \sum_{t=1}^T \mathbb{I}\{A_t = a\}.$$
The Stochastic Multi-Armed Bandit model
.
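The regret identity above (sum of rewards lost versus sum of gaps weighted by pull counts) can be checked numerically; a minimal sketch, with function names of my own choosing:

```python
def regret_decomposition(means, pull_counts):
    """Sum over arms of (mu_star - mu_a) * N_T(a)."""
    mu_star = max(means)
    return sum((mu_star - mu) * n for mu, n in zip(means, pull_counts))

def regret_direct(means, pull_counts):
    """T * mu_star minus the expected collected reward, with T = sum of pulls."""
    mu_star = max(means)
    T = sum(pull_counts)
    return T * mu_star - sum(mu * n for mu, n in zip(means, pull_counts))
```

Both forms agree whenever the pull counts sum to $T$, which is exactly the content of the displayed identity.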
Basic model (first approximation) for:
- Clinical trials (Thompson, 1933)
- Casino slot machines (Robbins, 1952)
- Ad placement (nowadays)
Multi-armed bandit applications
.
Eco-sustainable decision making:
- Plant health-care
- Ground health-care
Medical decision companion:
- Emergency admission filtering
Next generation of applications
.
Whenever we are uncertain about the outcome of an arm, we consider the best possible world and choose the best arm:
$$\operatorname*{argmax}_{a\in\mathcal{A}}\ \max\big\{ \mathbb{E}_{\tilde\nu_a}[X] : \tilde\nu_a \text{ compatible with observations on arm } a \big\}.$$
Why it works:
- If the best possible world is correct ⇒ no regret.
- If the best possible world is wrong ⇒ the reduction in uncertainty is maximized.
Optimism in Face of Uncertainty Learning (OFUL)
.
The idea
[Figure: empirical reward distributions with confidence intervals for four arms, pulled 10, 73, 3 and 23 times; x-axis: arms, y-axis: reward.]
The Upper–Confidence Bound (UCB) Algorithm
.
$$(\text{Oracle})\quad a_{t+1} \in \operatorname*{argmax}_{a\in\mathcal{A}}\ \mu_a = \hat\mu_{t,a} + \underbrace{(\mu_a - \hat\mu_{t,a})}_{}$$
- Choose the highest value compatible with the empirical means (Auer et al., 2002):
$$(\text{UCB})\quad A_{t+1} \in \operatorname*{argmax}_{a\in\mathcal{A}}\ \max\Big\{ \mathbb{E}_{\nu_a}[X] = \mu : \mu - \hat\mu_{t,a} \le \sqrt{\frac{\log(t^3)}{2N_t(a)}} \Big\}.$$
- Use the full empirical distributions $\hat\nu_{t,a}$ (exponential families) instead (Lai, Robbins 1987):
$$(\text{KL-UCB})\quad A_{t+1} \in \operatorname*{argmax}_{a\in\mathcal{A}}\ \max\big\{ \mathbb{E}_{\nu_a}[X] : \mathrm{KL}(\hat\nu_{t,a}, \nu_a) \le \dots \big\}.$$
- Extends to linear objective functions $f_\theta(a) = \langle\theta, \varphi(a)\rangle$ (Abbasi-Yadkori et al., 2011), or kernel (M., 2016):
$$(\text{OFUL})\quad A_{t+1} \in \operatorname*{argmax}_{a\in\mathcal{A}}\ \max\big\{ f_\theta(a) : \|\theta - \hat\theta_t\|_{V_t} \le \dots \big\}.$$
- Extends to a risk-averse criterion (M., 2013):
$$(\text{RA-UCB})\quad A_{t+1} \in \operatorname*{argmax}_{a\in\mathcal{A}}\ \max\big\{ \kappa_{\nu_a} : \mathrm{KL}(\hat\nu_{t,a}, \nu_a) \le \dots \big\}, \quad \text{where } \kappa_\nu = \inf_{q\in\mathcal{P}(\mathbb{R})} \mathbb{E}_q[X] + \frac{1}{\lambda}\mathrm{KL}(q, \nu).$$
The optimistic principle
.
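In the Bernoulli case, computing the KL-UCB index amounts to a one-dimensional root search over the mean, since $\mathrm{kl}(\hat\mu, \cdot)$ is increasing above $\hat\mu$. A minimal sketch (the bisection helper is my own, not from the talk):

```python
import math

def kl_bernoulli(p, q):
    """kl(p, q) = KL(B(p), B(q)); p, q clamped away from {0, 1}."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def klucb_upper(mu_hat, budget, tol=1e-9):
    """Largest mu >= mu_hat with kl(mu_hat, mu) <= budget, by bisection."""
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(mu_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```

The returned value is the optimistic mean $\max\{\mu : \mathrm{KL}(\hat\nu_{t,a}, \nu_a) \le \dots\}$ with budget $f(t)/N_a(t)$.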
Motivation
- Multi-armed bandits
- Time-uniform empirical mean
A generic peeling tool
The specific Laplace method
Open problems
Table of contents
.
The Upper Confidence Bound algorithm (Auer et al., 2002): choose $A_{t+1} = \operatorname{Argmax}\{\mu^+_{a,t}, a \in \mathcal{A}\}$, where
$$\mu^+_{a,t} = \hat\mu_{a,t} + \sqrt{\frac{\log(1/\delta_t)}{2N_a(t)}}.$$
Intuition: UCB should pull the suboptimal arms
- enough, so as to identify which arm is the best;
- not too much, so as to keep the regret as small as possible.
The confidence level $1-\delta$ has the following impact (similarly for $\alpha$):
- large $1-\delta$: high level of exploration;
- small $1-\delta$: high level of exploitation.
Solution: depending on the time horizon, we can tune the trade-off between exploration and exploitation.
Tuning the confidence δ of UCB
.
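A direct transcription of the index, using the choice $\delta_t = t^{-3}$ that appears earlier in the talk (a sketch, not the author's code):

```python
import math

def ucb_index(mu_hat, n_pulls, t):
    """Empirical mean plus exploration bonus sqrt(log(t^3) / (2 * n_pulls))."""
    return mu_hat + math.sqrt(math.log(t ** 3) / (2 * n_pulls))
```

A smaller $\delta_t$ inflates the bonus and hence the exploration; more pulls of an arm shrink its bonus, shifting toward exploitation.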
$N_a(t) = \sum_{s=1}^t \mathbb{I}\{A_s = a\}$ is a random stopping time.
Remark
.
Handle random stopping times $\tau_t, \tau$ carefully:
$$(\text{Union bound})\quad \mathbb{P}\bigg( \frac{1}{\tau_t}\sum_{i=1}^{\tau_t} (\mu - X_i) \ge \sqrt{\frac{\ln(t/\delta)}{2\tau_t}}\, \bigg) \le \delta$$
$$(\text{Peeling method})\quad \mathbb{P}\bigg( \frac{1}{\tau_t}\sum_{i=1}^{\tau_t} (\mu - X_i) \ge \sqrt{\frac{\alpha}{2\tau_t} \ln\Big(\frac{\ln(t)}{\ln(\alpha)}\,\frac{1}{\delta}\Big)}\, \bigg) \le \delta$$
$$(\text{Peeling method})\quad \mathbb{P}\bigg( \frac{1}{\tau}\sum_{i=1}^{\tau} (\mu - X_i) \ge \sqrt{\frac{\alpha}{2\tau} \ln\Big(\frac{\ln(\tau)\ln(\alpha\tau)}{\ln^2(\alpha)}\,\frac{1}{\delta}\Big)}\, \bigg) \le \delta$$
$$(\text{Laplace method})\quad \mathbb{P}\bigg( \frac{1}{\tau}\sum_{i=1}^{\tau} (\mu - X_i) \ge \sqrt{\frac{1 + \frac{1}{\tau}}{2\tau} \ln\Big(\frac{\sqrt{\tau+1}}{\delta}\Big)}\, \bigg) \le \delta$$
Provably reduces the regret, thus mistakes (saves lives).
Concentration with random stopping time
.
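For concreteness, the four deviation widths above can be evaluated for $1/2$-sub-Gaussian (e.g. $[0,1]$-bounded) samples; a sketch, with the default $\alpha$ chosen arbitrarily:

```python
import math

def union_bound(tau, t, delta):
    """Width sqrt(ln(t/delta) / (2 tau)) from the union bound."""
    return math.sqrt(math.log(t / delta) / (2 * tau))

def peeling_bounded(tau, t, delta, alpha=1.1):
    """Peeling width for stopping times bounded by t."""
    return math.sqrt(alpha / (2 * tau) * math.log(math.log(t) / (math.log(alpha) * delta)))

def peeling_unbounded(tau, delta, alpha=1.1):
    """Peeling width for unbounded stopping times (needs tau >= 2 so ln(tau) > 0)."""
    return math.sqrt(alpha / (2 * tau)
                     * math.log(math.log(tau) * math.log(alpha * tau)
                                / (delta * math.log(alpha) ** 2)))

def laplace(tau, delta):
    """Width sqrt((1 + 1/tau)/(2 tau) * ln(sqrt(tau+1)/delta)) from the Laplace method."""
    return math.sqrt((1 + 1 / tau) / (2 * tau) * math.log(math.sqrt(tau + 1) / delta))
```

At $\tau = t = 1000$ and $\delta = 0.01$, the Laplace width is the smallest of the three bounds applicable to bounded stopping times, in line with the numerical comparison in the talk.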
- Geometric peeling: $\alpha > 1$ (agnostic to the probability measure).
$$t_0 = 1,\ t_1 = \alpha,\ t_2 = \alpha^2,\ t_3 = \alpha^3,\ \dots,\ t_k = \alpha^k, \dots \qquad \mathbb{P}(\Omega(\tau_t)) \le \sum_{k\in\mathbb{N}} \mathbb{P}\big(\exists t \in [t_k, t_{k+1}),\ \Omega(t)\big).$$
The sum over $k$ yields the $\frac{\ln(n)}{\ln(\alpha)}$ term; the sup over $[t_k, t_{k+1})$ yields a slow-down $t_{k+1}/t_k = \alpha$ in the Chernoff inequality $\mathbb{P}(Z \ge t) \le \exp(-\varphi^\star(t))$.
- Laplace: integration ($\simeq$ infinitesimal peeling), closely following the distribution.
$$M^\lambda_t = \exp\Big( \lambda \sum_{s=1}^t (X_s - \mu) - t\lambda^2/8 \Big) = \exp\big( \lambda S_t - t\lambda^2/8 \big).$$
Let $\Lambda \sim \mathcal{N}(0, 4)$. Then $M_t = \mathbb{E}_\Lambda[M^\Lambda_t] = \exp\big(\frac{2S_t^2}{t+1}\big)\frac{1}{\sqrt{t+1}}$ and
$$\mathbb{P}\bigg( S_t \ge \sqrt{\frac{t+1}{2} \ln\Big(\frac{\sqrt{t+1}}{\delta}\Big)}\, \bigg) = \mathbb{P}\big( M_t \ge 1/\delta \big). \quad \text{Still valid for a stopping time } \tau\dots$$
Note: closed-form expression only for specific measures.
Time-Peeling and Laplace
.
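The Gaussian-mixture step above ($\Lambda \sim \mathcal{N}(0,4)$ integrated against the martingale $M^\lambda_t$) can be verified by direct numerical integration; a sketch:

```python
import math

def mixture_numeric(S_t, t, lam_max=40.0, n_grid=40001):
    """Numerically integrate E_Lambda[exp(Lambda * S_t - t * Lambda^2 / 8)]
    for Lambda ~ N(0, 4), on a uniform grid over [-lam_max, lam_max]."""
    dlam = 2 * lam_max / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        lam = -lam_max + i * dlam
        density = math.exp(-lam * lam / 8) / math.sqrt(8 * math.pi)  # N(0,4) pdf
        total += math.exp(lam * S_t - t * lam * lam / 8) * density * dlam
    return total

def mixture_closed_form(S_t, t):
    """exp(2 * S_t^2 / (t+1)) / sqrt(t+1), as stated on the slide."""
    return math.exp(2 * S_t ** 2 / (t + 1)) / math.sqrt(t + 1)
```

The two values agree to high accuracy, confirming the closed form obtained by completing the square in the Gaussian integral.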
[Figures: ratio of different time-uniform concentration bounds over that of the Laplace method, as a function of $t$, for a confidence level $\delta = 0.01$; one panel per stopping-time bound $n = 10^2, 10^3, \dots, 10^7$, plus one panel for unbounded stopping times with various choices of $\alpha = 1 + \dots$]
Here, the $\log\sqrt{t}$ bound is better than the $\log\log t$ bound: constants matter!
Comparison for bounded and unbounded stopping time
.
Motivation
A generic peeling tool
The specific Laplace method
Open problems
Table of contents
.
Let $Z = \{Z_i\}_{i=1}^\infty$ be a predictable process and $\mathcal{H}$ its natural filtration. Let $\varphi$ be an upper envelope of the log-Laplace transforms of the $Z_i$, and let $\varphi^\star$ denote its Cramér transform (Legendre–Fenchel dual), that is:
$$\forall \lambda \in \mathcal{D}, \forall i,\quad \ln \mathbb{E}\big[\exp(\lambda Z_i) \,\big|\, \mathcal{H}_{i-1}\big] \le \varphi(\lambda), \qquad \forall x \in \mathbb{R},\quad \varphi^\star(x) = \sup_{\lambda\in\mathbb{R}}\ \lambda x - \varphi(\lambda).$$
Let $\varphi^{-1}_{\star,+}: \mathbb{R} \to \mathbb{R}^+$ (resp. $\varphi^{-1}_{\star,-}$) be its inverse map on $\mathbb{R}^+$ (resp. $\mathbb{R}^-$).
Let $N_n$ be an $\mathbb{N}$-valued $\mathcal{H}$-measurable random variable a.s. bounded by $n$. Then
$$\forall \alpha \in (1, n],\ \delta \in (0,1),\qquad \mathbb{P}\bigg( \frac{1}{N_n}\sum_{i=1}^{N_n} Z_i \ge \varphi^{-1}_{\star,+}\Big( \frac{\alpha}{N_n} \ln\Big(\frac{\ln(n)}{\ln(\alpha)}\,\frac{1}{\delta}\Big) \Big) \bigg) \le \delta.$$
Now, if $N$ is a (possibly unbounded) $\mathbb{N}$-valued $\mathcal{H}$-measurable random variable,
$$\forall \alpha > 1,\ \delta \in (0,1),\qquad \mathbb{P}\bigg( \frac{1}{N}\sum_{i=1}^{N} Z_i \ge \varphi^{-1}_{\star,+}\Big( \frac{\alpha}{N} \ln\Big(\frac{\ln(N)\ln(\alpha N)}{\delta \ln^2(\alpha)}\Big) \Big) \bigg) \le \delta.$$
Time-Peeling
.
- sub-Gaussian $\mathcal{N}(\sigma^2)$:
$$\varphi(\lambda) \le \frac{\lambda^2\sigma^2}{2},\ \lambda \in \mathbb{R}; \qquad \varphi^{-1}_{\star,+}(z) = \sqrt{2\sigma^2 z},\quad \varphi^{-1}_{\star,-}(z) = -\sqrt{2\sigma^2 z}.$$
- sub-Gamma $\Gamma(v, c)$ (good for Bernstein bounds):
$$\varphi(\lambda) \le \frac{\lambda^2 v}{2(1 - c\lambda)},\ \lambda \in [0, 1/c); \qquad \varphi^{-1}_{\star,+}(z) = \sqrt{2vz} + cz,\quad \varphi^{-1}_{\star,-}(z) = -\sqrt{2vz} - cz.$$
- Bernoulli $\mathcal{B}(p)$ (good for discrete distributions):
$$\varphi(\lambda) \le \frac{\lambda^2}{2}\begin{cases} p(1-p) & \text{if } p > \frac{1}{2},\ \lambda \in \mathbb{R}^+ \\ g(p) & \text{if } p \in [0,1],\ \lambda \in \mathbb{R} \end{cases} \qquad \varphi^{-1}_{\star,+}(z) = \sqrt{2\bar g(p) z},\quad \varphi^{-1}_{\star,-}(z) = -\sqrt{2 g(p) z},$$
where $g(p) = \frac{1/2 - p}{\log(1/p - 1)}$ and $\bar g(p) = p(1-p)$ if $p > \frac{1}{2}$, $g(p)$ else.
- square of a Gaussian (good for empirical Bernstein bounds):
$$\varphi(\lambda) \le -\frac{1}{2}\log\big(1 - 2\lambda\sigma^2\big),\ \lambda \in [0, 1/2\sigma^2); \qquad \varphi^{-1}_{\star,+}(z) \le \sigma^2(1 + \sqrt{2z})^2,\quad \varphi^{-1}_{\star,-}(z) \ge \sigma^2(1 - 2\sqrt{z}).$$
Examples
.
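Plugging the sub-Gaussian inverse map into the bounded-$N$ theorem recovers the explicit peeling width quoted earlier in the talk; a sketch (helper names are mine):

```python
import math

def phi_inv_subgaussian(z, sigma):
    """phi_{*,+}^{-1}(z) = sqrt(2 * sigma^2 * z) for a sigma-sub-Gaussian envelope."""
    return math.sqrt(2 * sigma ** 2 * z)

def peeling_deviation(N, n, delta, sigma, alpha=1.1):
    """phi_{*,+}^{-1}( alpha/N * ln( ln(n)/ln(alpha) * 1/delta ) ) from the theorem."""
    z = alpha / N * math.log(math.log(n) / (math.log(alpha) * delta))
    return phi_inv_subgaussian(z, sigma)
```

With $\sigma = 1/2$ (e.g. $[0,1]$-bounded samples), this reduces to $\sqrt{\frac{\alpha}{2N}\ln\big(\frac{\ln(n)}{\ln(\alpha)}\frac{1}{\delta}\big)}$, the bounded-stopping-time peeling bound.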
Let $(\varepsilon_t)_t$ be non-increasing. Then
$$\mathbb{P}\bigg( \frac{1}{N_n}\sum_{i=1}^{N_n} Z_i \ge \varepsilon_{N_n} \bigg) \le \sum_{k=1}^K \mathbb{P}\bigg( \exists t \in (t_{k-1}, t_k] : \exp\Big(\lambda_k \sum_{i=1}^t Z_i\Big) \ge \exp(t\lambda_k\varepsilon_t) \bigg)$$
$$\le \sum_{k=1}^K \mathbb{P}\bigg( \exists t \in (t_{k-1}, t_k] : \underbrace{\exp\Big(\lambda_k \sum_{i=1}^t Z_i - t\varphi(\lambda_k)\Big)}_{W_{k,t}} \ge \exp\Big( t\underbrace{\big(\lambda_k\varepsilon_{t_k} - \varphi(\lambda_k)\big)}_{\varphi^\star(\varepsilon_{t_k})} \Big) \bigg)$$
$$\le \sum_{k=1}^K \mathbb{P}\bigg( \max_{t\in(t_{k-1}, t_k]} W_{k,t} \ge \exp\Big( \frac{t_k\,\varphi^\star(\varepsilon_{t_k})}{\alpha} \Big) \bigg) \quad \text{(using } t > t_k/\alpha\text{)} \quad \le\ \sum_{k=1}^K \exp\Big( -\frac{t_k\,\varphi^\star(\varepsilon_{t_k})}{\alpha} \Big),$$
the last step since each $(W_{k,t})_t$ is a nonnegative supermartingale, so Doob's maximal inequality applies.
Tuning for bounded $K$: choosing $\varepsilon_t$ such that $t\varphi^\star(\varepsilon_t) = \alpha\ln(K/\delta)$ gives $\sum_{k=1}^K \frac{\delta}{K} = \delta$.
A simple proof
.
$$R_\nu(T) = \sum_{a : \Delta_a > 0} \Delta_a\,\mathbb{E}[N_a(T)]$$
⇒ we only need to study the expected number of pulls of suboptimal arms.
Lower bound (Lai & Robbins '85; Burnetas & Katehakis '96): for any bandit $\nu \in \mathcal{D}$ and any "uniformly good" strategy knowing $\mathcal{D}$,
$$\forall a : \mu_a < \mu^\star,\quad \liminf_{T\to\infty} \frac{\mathbb{E}[N_a(T)]}{\ln T} \ge \frac{1}{\mathcal{K}_a(\nu_a, \mu^\star)}, \qquad \mathcal{K}_a(\nu_a, \mu^\star) = \inf\{ \mathrm{KL}(\nu_a, \nu) : \nu \in \mathcal{D},\ \mathbb{E}_\nu[X] > \mu^\star \}.$$
- Bernoulli: $\mathcal{K}_a(\nu_a, \mu^\star) = \mathrm{KL}(\mathcal{B}(\mu_a), \mathcal{B}(\mu^\star)) \stackrel{\text{def}}{=} \mathrm{kl}(\mu_a, \mu^\star)$.
Main tool: change of measure.
$$(\text{Probability})\quad \forall \Omega, \forall c \in \mathbb{R},\quad \mathbb{P}_\nu\Big( \Omega \cap \Big\{ \log\frac{d\nu}{d\tilde\nu}(X) \le c \Big\} \Big) \le \exp(c)\,\mathbb{P}_{\tilde\nu}(\Omega).$$
$$(\text{Expectation})\quad \mathbb{E}_\nu\Big[ \log\frac{d\nu}{d\tilde\nu}(X) \Big] \ge \sup_{g : \mathcal{X}\to[0,1]} \mathrm{kl}\big( \mathbb{E}_\nu[g(X)], \mathbb{E}_{\tilde\nu}[g(X)] \big).$$
Back to bandits?
.
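In the Bernoulli case the lower bound is fully explicit; a sketch computing the asymptotic constant in front of $\ln T$ (helper names are mine):

```python
import math

def kl_bernoulli(p, q):
    """kl(p, q) = KL(B(p), B(q)), for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lai_robbins_rate(means):
    """Sum over suboptimal arms of Delta_a / kl(mu_a, mu_star): any uniformly
    good strategy incurs regret at least (rate - o(1)) * ln T."""
    mu_star = max(means)
    return sum((mu_star - mu) / kl_bernoulli(mu, mu_star)
               for mu in means if mu < mu_star)
```

For instance, a two-armed Bernoulli bandit with means $0.4$ and $0.6$ gives rate $0.2/\mathrm{kl}(0.4, 0.6)$.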
- Extension of the time-peeling argument to KL concentration. For exponential families of dimension 1 (Bernoulli, Gaussian with known variance, Poisson, etc.), we have
$$(\text{Peeling method})\quad \mathbb{P}\bigg( \mathrm{KL}(\hat\nu_{\tau_t}, \nu) \ge \frac{\alpha}{\tau_t} \ln\Big(\frac{\ln(t)}{\ln(\alpha)}\,\frac{1}{\delta}\Big),\ \hat\mu_{\tau_t} < \mu \bigg) \le \delta.$$
- Let $p_\theta$ be in an $(F, \psi, \nu_0)$-exponential family: $p_\theta(dx) = \exp(\langle\theta, F(x)\rangle - \psi(\theta))\,\nu_0(dx)$.
It holds that $\mathrm{KL}(\hat\nu_n, \nu) = \mathcal{B}_\psi(\hat\theta_n, \theta^\star) = \Phi^\star(\hat F_n)$ for some explicit $\Phi^\star$.
In dimension 1, $\Phi^\star$ is monotonic on the positive cone $\mathbb{R}^+$ (and on $\mathbb{R}^-$).
- Application to bandits (Cappé, Garivier, Maillard, Munos, Stoltz, 2013):
$$\mathbb{P}\bigg( \mathrm{KL}(\hat\nu_{\tau_t}, \nu) \ge \frac{f(t)}{\tau_t},\ \hat\mu_{\tau_t} < \mu \bigg) \le \lceil f(t)\log(t) \rceil\, e^{-f(t)+1},$$
using $\alpha = f(t)/(f(t)-1)$, where $f(t) = \log(t) + \xi\log(\log(t))$ with $\xi > 2$.
Boundary crossing in dimension 1
.
- The KL-UCB strategy on $\mathcal{D}$ (parameterized by a threshold function $f$):
$$\operatorname*{argmax}_{a\in\mathcal{A}}\ \max\Big\{ \mathbb{E}_{\nu_a}[X] = \mu : N_a(t)\,\mathcal{K}(\hat\nu_{a,N_a(t)}, \mu) \le f(t) \Big\}, \quad \text{where } \mathcal{K}(\nu_a, \mu^\star) = \inf\{ \mathrm{KL}(\nu_a, \nu) : \mathbb{E}_\nu[X] > \mu^\star \}.$$
- Regret upper bound of the KL-UCB strategy, for $\varepsilon < \Delta_a$:
$$\mathbb{E}[N_a(T)] \le 2 + \inf_{n_0}\bigg\{ n_0 + \sum_{n=n_0+1}^T \mathbb{P}\Big( n\,\mathcal{K}(\hat\nu_{a,n}, \mu^\star - \varepsilon) < f(T) \Big) \bigg\} + \sum_{t=A}^T \underbrace{\mathbb{P}\Big( N^\star(t)\,\mathcal{K}(\hat\nu_{\star,N^\star(t)}, \mu^\star - \varepsilon) \ge f(t) \Big)}_{\text{boundary-crossing probability: } o(1/t)?}$$
[Figure: arms ordered by increasing means, $\mu_a \dots \mu^\star$]
Regret decomposition of KL-UCB
.
- Dimension 1: handle the blue part (one direction is enough: $\hat\mu_n < \mu^\star - \varepsilon$).
- Dimension $D$: handle the blue part, with $\hat\nu_n$ such that $\hat\mu_n < \mu^\star - \varepsilon$.
[Figure: arms ordered by increasing means, $\mu_a \dots \mu^\star$]
Dimension D: one difficulty
.
The exponential family $\mathcal{E}(F; \nu_0)$ generated by a function $F$ and a reference measure $\nu_0$ is
$$\Big\{ \nu_\theta \in \mathcal{M}_1(\mathcal{X}) : \forall x \in \mathcal{X},\ \nu_\theta(x) = \exp\big( \langle\theta, F(x)\rangle - \psi(\theta) \big)\,\nu_0(x),\ \theta \in \mathbb{R}^K \Big\},$$
with
- log-partition function: $\psi(\theta) \stackrel{\text{def}}{=} \ln \int_{\mathcal{X}} \exp\langle\theta, F(x)\rangle\,\nu_0(dx)$;
- canonical parameter set: $\Theta_D \stackrel{\text{def}}{=} \big\{ \theta \in \mathbb{R}^K : \psi(\theta) < \infty \big\}$;
- invertible parameter set: $\Theta_I \stackrel{\text{def}}{=} \big\{ \theta \in \Theta_D : 0 < \lambda_{\text{MIN}}(\nabla^2\psi(\theta)) \le \lambda_{\text{MAX}}(\nabla^2\psi(\theta)) < \infty \big\}$,
where $\lambda_{\text{MIN}}(M)$ and $\lambda_{\text{MAX}}(M)$ are the minimum and maximum eigenvalues of a positive semi-definite matrix $M$.
Examples: Bernoulli ($K = 1$, $F(x) = x$), Gaussian ($K = 2$, $F(x) = (x, x^2)$).
Bregman divergence: $\mathrm{KL}(\nu_\theta, \nu_{\theta'}) = \mathcal{B}_\psi(\theta, \theta') \stackrel{\text{def}}{=} \psi(\theta') - \psi(\theta) - \langle\theta' - \theta, \nabla\psi(\theta)\rangle$.
Exponential families
.
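The Bregman identity can be sanity-checked on the Bernoulli family, where $\theta = \operatorname{logit}(p)$ and $\psi(\theta) = \log(1 + e^\theta)$; a sketch:

```python
import math

def psi(theta):
    """Bernoulli log-partition function in the natural parameter."""
    return math.log(1 + math.exp(theta))

def bregman(theta_a, theta_b):
    """B_psi(theta, theta') = psi(theta') - psi(theta) - <theta' - theta, grad psi(theta)>."""
    grad = 1 / (1 + math.exp(-theta_a))  # grad psi(theta) = mean parameter p
    return psi(theta_b) - psi(theta_a) - (theta_b - theta_a) * grad

def kl_bernoulli(p, q):
    """KL(B(p), B(q)) for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def logit(p):
    return math.log(p / (1 - p))
```

Here `bregman(logit(p), logit(q))` coincides with `kl_bernoulli(p, q)`, illustrating $\mathrm{KL}(\nu_\theta, \nu_{\theta'}) = \mathcal{B}_\psi(\theta, \theta')$.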