Inverse Statistical Learning

Minimax theory, adaptation and algorithm

Sébastien Loustau, with (in order of appearance)

C. Marteau, M. Chichignoud, C. Brunet and S. Souchet

Dijon, January 15, 2014

The problem of Inverse Statistical Learning

Given $(X,Y)\sim P$ on $\mathcal{X}\times\mathcal{Y}$, a class $\mathcal{G}$ and a loss function $\ell:\mathcal{G}\times(\mathcal{X}\times\mathcal{Y})\to\mathbb{R}_+$, we aim at:
$$g^\star \in \arg\min_{g\in\mathcal{G}} \mathbb{E}_P\,\ell(g,(X,Y)),$$
from an indirect sequence of observations:
$$(Z_1,Y_1),\dots,(Z_n,Y_n)\ \text{i.i.d. from }\tilde P,$$
where $Z_i\sim Af$, $A$ is a linear compact operator (and $X\sim f$).

Statistical Learning with errors in variables

Given $(X,Y)\sim P$ on $\mathcal{X}\times\mathcal{Y}$, a class $\mathcal{G}$ and a loss function $\ell:\mathcal{G}\times(\mathcal{X}\times\mathcal{Y})\to\mathbb{R}_+$, we aim at:
$$g^\star\in\arg\min_{g\in\mathcal{G}}\mathbb{E}_P\,\ell(g,(X,Y)),$$
from a noisy sequence of observations:
$$(X_1+\epsilon_1,Y_1),\dots,(X_n+\epsilon_n,Y_n)\ \text{i.i.d. from }\tilde P,$$
where $Z_i:=X_i+\epsilon_i\sim f*\eta$ and $\eta$ is the density of the i.i.d. sequence $(\epsilon_i)_{i=1}^n$.

- $\mathcal{Y}=\mathbb{R}$: regression with errors in variables,
- $\mathcal{Y}=\{1,\dots,M\}$: classification with errors in variables,
- $\mathcal{Y}=\emptyset$: unsupervised learning with errors in variables.
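As a concrete illustration (a minimal sketch of my own, not taken from the slides), the observation scheme can be simulated as follows; the two-component Gaussian mixture and the Laplace errors are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Latent (unobservable) direct sample: two Gaussian classes in R^2.
Y = rng.integers(0, 2, size=n)                 # labels in {0, 1}
means = np.array([[0.0, 0.0], [5.0, 0.0]])     # illustrative class means
X = means[Y] + rng.normal(size=(n, 2))         # X_i ~ N(mean_{Y_i}, I_2)

# Measurement errors eps_i with density eta (Laplace here, for illustration).
eps = rng.laplace(scale=1.0, size=(n, 2))

# The statistician only observes the contaminated pairs (Z_i, Y_i).
Z = X + eps
```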

Toy example (I)

[Panels: direct dataset (unobservable) vs. observations (available).]

Toy example (II)

[Panels: direct dataset (unobservable) vs. observations (available).]

Real-world example in oncology (I)

Fig. 1: The same tumor observed by two radiologists: $Z_i^j = X_i + \epsilon_i^j$, $j\in\{1,2\}$.

Real-world example in oncology (II)

Fig. 1: Batch effect in a micro-array dataset.

- J. A. Gagnon-Bartsch, L. Jacob and T. P. Speed, 2013

Contents

1. Minimax rates in discriminant analysis
2. Excess risk bound
3. The algorithm of noisy k-means
(4.) Adaptation

Origin: a minimax motivation (with C. Marteau)

- Direct case: $n^{-\frac{2\gamma}{2\gamma+1}}$ for density estimation, $n^{-\frac{\gamma(\alpha+1)}{\gamma(\alpha+2)+d}}$ for classification.
- Noisy case: $n^{-\frac{2\gamma}{2\gamma+2\beta+1}}$ for density estimation, ??? for classification.

Assumptions: $f\in\Sigma(\gamma,L)$, resp. $\mathbb P(Y=1\,|\,X=x)\in\Sigma(\gamma,L)$; margin parameter $\alpha\ge0$; $|\mathcal F[\eta](t)|\sim|t|^{-\beta}$, resp. $|\mathcal F[\eta_j](t)|\sim|t_j|^{-\beta_j}$ for all $j=1,\dots,d$.

Mammen and Tsybakov (1999)

Given two densities $f$ and $g$, for any $G\subset K$, the Bayes risk is defined as:
$$R_K(G)=\frac12\left[\int_{K\setminus G}f\,dQ+\int_G g\,dQ\right].$$
Given $X_1^1,\dots,X_n^1\sim f$ and $X_1^2,\dots,X_n^2\sim g$, we aim at:
$$G^\star=\arg\min_{G\in\mathcal G}R_K(G).$$
Goal: to obtain minimax fast rates
$$r_n(\mathcal F)\sim\inf_{\hat G}\ \sup_{(f,g)\in\mathcal F}\mathbb E\,d(\hat G,G^\star),\quad\text{where }d\in\{d_{f,g},d_\triangle\}.$$

Mammen and Tsybakov (1999) with errors in variables

We observe $Z_1^1,\dots,Z_n^1$ and $Z_1^2,\dots,Z_n^2$ such that:
$$Z_i^1=X_i^1+\epsilon_i^1\quad\text{and}\quad Z_i^2=X_i^2+\epsilon_i^2,\quad i=1,\dots,n,$$
where:

- $X_i^1\sim f$ and $X_i^2\sim g$,
- the $\epsilon_i^j$ are i.i.d. with density $\eta$.

Goal: to obtain minimax fast rates
$$r_n(\mathcal F,\beta)\sim\inf_{\hat G}\ \sup_{(f,g)\in\mathcal F}\mathbb E\,d(\hat G,G^\star),\quad\text{where }d\in\{d_{f,g},d_\triangle\}.$$

ERM approach

ERM principle in the direct case:
$$\frac{1}{2n}\sum_{i=1}^n\mathbf 1_{X_i^1\in G^C}+\frac{1}{2n}\sum_{i=1}^n\mathbf 1_{X_i^2\in G}\ \longrightarrow\ R_K(G).$$
The ERM principle in this model fails:
$$\frac{1}{2n}\sum_{i=1}^n\mathbf 1_{Z_i^1\in G^C}+\frac{1}{2n}\sum_{i=1}^n\mathbf 1_{Z_i^2\in G}\ \longrightarrow\ \frac12\left[\int_{G^C}f*\eta+\int_G g*\eta\right]\ \neq\ R_K(G).$$
Solution: define
$$R_n^\lambda(G)=\frac12\left[\int_{G^C}\hat f_n^\lambda(x)\,dx+\int_G\hat g_n^\lambda(x)\,dx\right]\ \longrightarrow\ R_K(G),$$
where $(\hat f_n^\lambda,\hat g_n^\lambda)$ are deconvolution estimators of $(f,g)$ of the form:
$$\hat f_n^\lambda(x)=\frac{1}{n\lambda}\sum_{i=1}^n\widetilde{\mathcal K}\!\left(\frac{Z_i^1-x}{\lambda}\right).$$

Details

$Z_1^1,\dots,Z_n^1$ i.i.d. $\sim f*\eta$ and $Z_1^2,\dots,Z_n^2$ i.i.d. $\sim g*\eta$. We consider:
$$R_n^\lambda(G)=\frac12\left[\int_{G^C}\hat f_n^\lambda(x)\,dx+\int_G\hat g_n^\lambda(x)\,dx\right],$$
where $\hat f_n^\lambda$ and $\hat g_n^\lambda$ are deconvolution kernel estimators. Then:
$$R_n^\lambda(G)=\frac{1}{2n}\left[\sum_{i=1}^n h^\lambda_{G^C}(Z_i^1)+\sum_{i=1}^n h^\lambda_{G}(Z_i^2)\right],$$
where:
$$h^\lambda_G(z)=\int_G\frac1\lambda\widetilde{\mathcal K}\!\left(\frac{z-x}{\lambda}\right)dx=\mathbf 1_G*\widetilde{\mathcal K}_\lambda(z).$$
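For intuition, here is a minimal one-dimensional sketch (my own, not the authors' code) of the deconvolution kernel estimator $\hat f_n^\lambda$ in the special case of Laplace measurement errors, where the deconvolution kernel has a closed form; the Gaussian kernel and the numerical values are illustrative choices.

```python
import numpy as np

def deconvolution_kde(z, x_grid, lam, a):
    """Deconvolution kernel density estimate of f from Z_i = X_i + eps_i,
    with eps_i ~ Laplace(scale=a) and a Gaussian kernel K.

    Since F[eta](t) = 1/(1 + a^2 t^2), the deconvolution kernel is
    K_tilde(u) = K(u) - (a/lam)^2 * K''(u), with K''(u) = (u^2 - 1) K(u).
    """
    u = (z[:, None] - x_grid[None, :]) / lam        # (n, m) rescaled arguments
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)    # Gaussian kernel K(u)
    K_tilde = K - (a / lam) ** 2 * (u**2 - 1) * K   # closed-form deconvolution kernel
    return K_tilde.mean(axis=0) / lam               # f_hat(x) = (1/(n lam)) sum_i K_tilde((Z_i - x)/lam)

# Usage: recover the density of X from the noisy sample Z.
rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=1000)
Z = X + rng.laplace(scale=0.5, size=1000)
grid = np.linspace(-4.0, 4.0, 200)
f_hat = deconvolution_kde(Z, grid, lam=0.4, a=0.5)
```

Plugging $\hat f_n^\lambda$ (and $\hat g_n^\lambda$) into $R_n^\lambda$ above gives the deconvoluted empirical risk used throughout the talk.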

Vapnik's bound ($\epsilon=0$)

The use of empirical processes comes from VC theory:
$$R_K(\hat G_n)-R_K(G^\star)\ \le\ R_K(\hat G_n)-R_n(\hat G_n)+R_n(G^\star)-R_K(G^\star)\ \le\ 2\sup_{G\in\mathcal G}|(R_n-R)(G)|.$$

Goal: to control uniformly the empirical process indexed by $\mathcal G$.

ISL: the empirical process is indexed by the class $\{\mathbf 1_G*\widetilde{\mathcal K}_\lambda,\ G\in\mathcal G\}$.

Theorem 1: Upper bound (j.w. with C. Marteau)

Suppose $(f,g)\in\mathcal G(\alpha,\gamma)$ and $|\mathcal F[\eta](t)|\sim\prod_{i=1}^d|t_i|^{-\beta_i}$, with $\beta_i>1/2$, $i=1,\dots,d$. Consider a kernel $\mathcal K$ of order $\lfloor\gamma\rfloor$ which satisfies some properties. Then:
$$\limsup_{n\to+\infty}\ \sup_{(f,g)\in\mathcal G(\alpha,\gamma)}n^{\tau_d(\alpha,\beta,\gamma)}\,\mathbb E_{f,g}\,d(\hat G_n,G^\star)<+\infty,$$
where
$$\tau_d(\alpha,\beta,\gamma)=\begin{cases}\dfrac{\gamma\alpha}{\gamma(2+\alpha)+d+2\sum_{i=1}^d\beta_i}&\text{for }d=d_\triangle,\\[3mm]\dfrac{\gamma(\alpha+1)}{\gamma(2+\alpha)+d+2\sum_{i=1}^d\beta_i}&\text{for }d=d_{f,g},\end{cases}$$
and $\lambda=(\lambda_1,\dots,\lambda_d)$ is chosen as:
$$\lambda_j=n^{-\frac{1}{\gamma(2+\alpha)+2\sum_{i=1}^d\beta_i+d}},\quad\forall j\in\{1,\dots,d\}.$$

Theorem 2: Lower bound (j.w. with C. Marteau)

Suppose $|\mathcal F[\eta](t)|\sim\prod_{i=1}^d|t_i|^{-\beta_i}$, with $\beta_i>1/2$, $i=1,\dots,d$. Then for $\alpha\le1$,
$$\liminf_{n\to+\infty}\ \inf_{\hat G_n}\ \sup_{(f,g)\in\mathcal G(\alpha,\gamma)}n^{\tau_d(\alpha,\beta,\gamma)}\,\mathbb E_{f,g}\,d(\hat G_n,G^\star)>0,$$
where the infimum is taken over all possible estimators of the set $G^\star$ and $\tau_d(\alpha,\beta,\gamma)$ is as in Theorem 1.

Conclusion (minimax)

- Direct case: $n^{-\frac{2\gamma}{2\gamma+1}}$ for density estimation, $n^{-\frac{\gamma(\alpha+1)}{\gamma(\alpha+2)+d}}$ for classification.
- Noisy case: $n^{-\frac{2\gamma}{2\gamma+2\beta+1}}$ for density estimation, $n^{-\frac{\gamma(\alpha+1)}{\gamma(\alpha+2)+2\bar\beta+d}}$ for classification, where $\bar\beta=\sum_{i=1}^d\beta_i$.

Assumptions: $f\in\Sigma(\gamma,L)$, resp. $\mathbb P(Y=1\,|\,X=x)\in\Sigma(\gamma,L)$; margin parameter $\alpha\ge0$; $|\mathcal F[\eta](t)|\sim|t|^{-\beta}$, resp. $|\mathcal F[\eta_j](t)|\sim|t_j|^{-\beta_j}$ for all $j=1,\dots,d$.
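As a quick numerical reading of the table (my own instantiation, not from the slides), take $d=1$, $\gamma=1$, margin parameter $\alpha=1$ and an ordinary-smooth error of degree $\beta=1$ (e.g. Laplace noise):
$$\text{direct: } n^{-\frac{\gamma(\alpha+1)}{\gamma(\alpha+2)+d}} = n^{-\frac{2}{4}} = n^{-1/2},\qquad\text{noisy: } n^{-\frac{\gamma(\alpha+1)}{\gamma(\alpha+2)+2\bar\beta+d}} = n^{-\frac{2}{6}} = n^{-1/3},$$
so in this configuration the errors in variables degrade the classification rate from $n^{-1/2}$ to $n^{-1/3}$.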

Sketch of the proofs, heuristic

1. Noisy quantization (for simplicity)
2. Excess risk decomposition
3. Bias control (easy and minimax)
4. Variance control: key lemma

Other results (I)

- (Un)supervised classification with errors in variables:
$$R_\ell(\hat g_n^\lambda)-R_\ell(g^\star)\le C\,n^{-\frac{\kappa\gamma}{\gamma(2\kappa+\rho-1)+(2\kappa-1)\sum_{i=1}^d\beta_i}},\quad\text{where } g^\star\in\arg\min_{g\in\mathcal G}\mathbb E_P\,\ell(g,(X,Y)).$$

- (Un)supervised classification with $Z_i\sim Af$, using the projection estimator
$$\hat f_n^N(x)=\sum_{k=1}^N\hat\theta_k\varphi_k(x),$$
where $\hat\theta_k=b_k^{-1}\frac1n\sum_{i=1}^n\psi_k(Z_i)$, $A^*A\varphi_k=b_k^2\varphi_k$, and
$$f\in\Theta(\gamma,L):=\Big\{f=\sum_{k\ge1}\theta_k\varphi_k:\ \sum_k\theta_k^2k^{2\gamma+1}\le L\Big\}.$$

Other results (II)

- If $f\in\Sigma(\vec\gamma,L)$, the anisotropic Hölder class:
$$R_\ell(\hat g_n^\lambda)-R_\ell(g^\star)\le C\,n^{-\frac{\kappa}{2\kappa+\rho-1+(2\kappa-1)\sum_{j=1}^d\beta_j/\gamma_j}},$$
where $\lambda=(\lambda_1,\dots,\lambda_d)$ is chosen as:
$$\lambda_j\sim n^{-\frac{2\kappa-1}{\gamma_j\big(2\kappa+\rho-1+(2\kappa-1)\sum_{l=1}^d\beta_l/\gamma_l\big)}},\quad\forall j=1,\dots,d.$$

- Non-exact oracle inequalities:
$$R_\ell(\hat g)\le(1+\epsilon)\inf_{g\in\mathcal G}R_\ell(g)+C(\epsilon)\,n^{-\frac{\gamma}{\gamma(1+\rho)+\sum_{i=1}^d\beta_i}},$$
without margin assumption.

Finite dimensional clustering

Given $k$, we aim at:
$$\arg\min_{\mathbf c=(c_1,\dots,c_k)\in\mathbb R^{dk}}\ \mathbb E\min_{j=1,\dots,k}\|X-c_j\|^2.$$
The empirical counterpart,
$$\hat{\mathbf c}_n\in\arg\min_{\mathbf c=(c_1,\dots,c_k)\in\mathbb R^{dk}}\ \frac1n\sum_{i=1}^n\min_{j=1,\dots,k}\|X_i-c_j\|^2,$$
gives rise to the popular $k$-means procedure studied in (Pollard, 1982).
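For reference (a minimal sketch of my own, not the authors' code), the empirical distortion above is classically minimized by Lloyd's iteration:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd iteration for the empirical distortion (direct data)."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # initial centers
    for _ in range(n_iter):
        # Assignment step: nearest center (Voronoi cells of c).
        labels = np.argmin(((X[:, None, :] - c[None, :, :]) ** 2).sum(-1), axis=1)
        # Update step: centroid of each cell.
        for j in range(k):
            if np.any(labels == j):
                c[j] = X[labels == j].mean(axis=0)
    return c
```

Run directly on the noisy $Z_i$, this minimizes the wrong criterion, which motivates the deconvoluted distortion introduced next.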

Finite dimensional noisy clustering (j.w. with C. Brunet)

We want to approximate a solution of the stochastic minimization above by solving:
$$\min_{\mathbf c=(c_1,\dots,c_k)\in\mathbb R^{dk}}\ \frac1n\sum_{i=1}^n\gamma_\lambda(\mathbf c,Z_i),$$
where
$$\gamma_\lambda(\mathbf c,z)=\int_K\min_{j=1,\dots,k}\|x-c_j\|^2\,\widetilde{\mathcal K}_\lambda(z-x)\,dx.$$

First order conditions (I)

Suppose $\|X\|\le M$ and Pollard's regularity assumptions are satisfied. Then, $\forall u\in\{1,\dots,d\}$, $\forall j\in\{1,\dots,k\}$, we have the following assertion:
$$c_j^u=\frac{\sum_{i=1}^n\int_{V_j}x_u\,\widetilde{\mathcal K}_\lambda(Z_i-x)\,dx}{\sum_{i=1}^n\int_{V_j}\widetilde{\mathcal K}_\lambda(Z_i-x)\,dx}\ \Longrightarrow\ \widetilde\nabla_{uj}J_n^\lambda(\mathbf c)=0,$$
where
$$J_n^\lambda(\mathbf c)=\sum_{i=1}^n\gamma_\lambda(\mathbf c,Z_i).$$

First order conditions (II)

- The standard $k$-means:
$$c_{u,j}=\frac{\sum_{i=1}^nX_{i,u}\mathbf 1_{X_i\in V_j}}{\sum_{i=1}^n\mathbf 1_{X_i\in V_j}}=\frac{\sum_{i=1}^n\int_{V_j}x_u\,\delta_{X_i}(x)\,dx}{\sum_{i=1}^n\int_{V_j}\delta_{X_i}(x)\,dx},\quad\forall u,j,$$
where $\delta_{X_i}$ is the Dirac mass at the point $X_i$.

- Another look:
$$c_{u,j}=\frac{\int_{V_j}x_u\,\hat f_n(x)\,dx}{\int_{V_j}\hat f_n(x)\,dx},\quad\forall u\in\{1,\dots,d\},\ \forall j\in\{1,\dots,k\},$$
where $\hat f_n(x)=\frac1n\sum_{i=1}^n\widetilde{\mathcal K}_\lambda(Z_i-x)$ is the kernel deconvolution estimator of the density $f$.

The algorithm of Noisy K-means (j.w. with C. Brunet)
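In the slides the algorithm itself is given as a figure. Below is a minimal sketch (my own, consistent with the first-order conditions above but not the authors' implementation) of a Lloyd-type iteration whose centroid update uses the deconvolution estimate $\hat f_n$ evaluated on a grid; the Gaussian-kernel/Laplace-error deconvolution kernel, the grid and the stopping rule are illustrative choices, reasonable only in low dimension.

```python
import numpy as np

def noisy_kmeans(Z, k, lam, a, grid_size=60, n_iter=50, seed=0):
    """Lloyd-type iteration for noisy k-means (sketch).

    Z   : (n, d) noisy observations Z_i = X_i + eps_i
    lam : bandwidth of the deconvolution kernel
    a   : scale of the (assumed) Laplace errors in each coordinate
    Centroid update (first-order conditions):
        c_{u,j} = int_{V_j} x_u f_hat(x) dx / int_{V_j} f_hat(x) dx,
    approximated by sums over a regular grid.
    """
    rng = np.random.default_rng(seed)
    n, d = Z.shape

    # Grid covering the data, and the deconvolution estimate f_hat on it
    # (product Gaussian kernel, Laplace errors => closed-form K_tilde).
    axes = [np.linspace(Z[:, u].min() - 1, Z[:, u].max() + 1, grid_size) for u in range(d)]
    mesh = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, d)
    Kt = np.ones((n, len(mesh)))
    for u in range(d):
        t = (Z[:, u][:, None] - mesh[:, u][None, :]) / lam
        K = np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi)
        Kt *= K - (a / lam) ** 2 * (t**2 - 1) * K
    f_hat = np.clip(Kt.mean(axis=0) / lam**d, 0.0, None)       # truncate negative parts

    c = Z[rng.choice(n, size=k, replace=False)].astype(float)  # initial centers
    for _ in range(n_iter):
        # Assignment step: Voronoi cells of the current centers, on the grid.
        labels = np.argmin(((mesh[:, None, :] - c[None, :, :]) ** 2).sum(-1), axis=1)
        # Update step: f_hat-weighted means of the grid points in each cell.
        for j in range(k):
            w = f_hat * (labels == j)
            if w.sum() > 0:
                c[j] = (mesh * w[:, None]).sum(axis=0) / w.sum()
    return c
```

The grid plays the role of the compact set $K$ appearing in the definition of $\gamma_\lambda$.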

Experimental setting: simulation study

1. We draw i.i.d. sequences $(X_i)_{i=1,\dots,n}$ (Gaussian mixtures) and $(\epsilon_i)_{i=1}^n$ (symmetric noise) for $n\in\{100,500\}$.
2. We draw $m=100$ independent repetitions (indexed by $j$).
3. We compute the noisy $k$-means clusters $\hat{\mathbf c}$, with an estimation step of the noise density thanks to 2.
4. We calculate the clustering risk:
$$r_n(\hat{\mathbf c})=\frac1{100}\sum_{i=1}^{100}\mathbf 1_{X_i^j\notin V_j(\hat{\mathbf c})}.$$

Experimental setting - Model 1

For $u\in\{1,\dots,10\}$, we call Mod1($u$):
$$Z_i=X_i+\epsilon_i(u),\quad i=1,\dots,n,\qquad\text{Mod1}(u)$$
where:

- $(X_i)_{i=1}^n$ are i.i.d. with density $f=\tfrac12f_{\mathcal N(0_2,I_2)}+\tfrac12f_{\mathcal N((5,0)^T,I_2)}$,
- $(\epsilon_i(u))_{i=1}^n$ are i.i.d. with law $\mathcal N(0_2,\Sigma(u))$, where $\Sigma(u)$ is a diagonal matrix with diagonal vector $(0,u)^T$, for $u\in\{1,\dots,10\}$.
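A minimal sketch of how one might simulate Mod1($u$) (my own code, not the authors'):

```python
import numpy as np

def simulate_mod1(n, u, seed=0):
    """Draw (Z_i, X_i) under Mod1(u): X_i from the two-component Gaussian
    mixture and eps_i(u) ~ N(0_2, diag(0, u))."""
    rng = np.random.default_rng(seed)
    comp = rng.integers(0, 2, size=n)                  # mixture component, prob. 1/2 each
    means = np.array([[0.0, 0.0], [5.0, 0.0]])
    X = means[comp] + rng.normal(size=(n, 2))          # X_i ~ N(mean, I_2)
    eps = rng.normal(size=(n, 2)) * np.sqrt([0.0, u])  # noise only on the second coordinate
    return X + eps, X

Z, X = simulate_mod1(n=500, u=5)
```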

Illustrations Mod1

Experimental setting - Model 2

For $u\in\{1,\dots,10\}$, we call Mod2($u$):
$$Z_i=X_i(u)+\epsilon_i,\quad i=1,\dots,n,\qquad\text{Mod2}(u)$$
where:

- $(X_i(u))_{i=1}^n$ are i.i.d. with density $f=\tfrac13f_{\mathcal N(0_2,I_2)}+\tfrac13f_{\mathcal N((a,b)^T,I_2)}+\tfrac13f_{\mathcal N((b,a)^T,I_2)}$, where $(a,b)=\big(15-(u-1)/2,\ 5+(u-1)/2\big)$, for $u\in\{1,\dots,10\}$,
- $(\epsilon_i)_{i=1}^n$ are i.i.d. with law $\mathcal N(0_2,\Sigma)$, where $\Sigma$ is a diagonal matrix with diagonal vector $(5,5)^T$.

Illustrations Mod2

Results Mod1 for n = 100

Results Mod1 for n = 500

Results Mod2

Adaptation!

To get the optimal rates, we act as follows:
$$R(\hat{\mathbf c}_\lambda,\mathbf c^\star)\le\inf_\lambda\left\{C_1\left(\frac{c(\lambda)}{\sqrt n}\right)^{2/(1+\rho)}+C_2\lambda^\gamma\right\}\le C\,n^{-\frac{\gamma}{2\gamma(1+\rho)+2\beta}},$$
where
$$\lambda^\star=O\big(n^{-\frac1{2\gamma(1+\rho)+2\beta}}\big).$$

Goal: to choose the bandwidth based on Lepski's principle.

Empirical Risk Comparison (j.w. with M. Chichignoud)

We choose $\lambda$ as follows:
$$\hat\lambda=\max\big\{\lambda\in\Lambda:\ R_n^{\lambda'}(\hat{\mathbf c}_\lambda)-R_n^{\lambda'}(\hat{\mathbf c}_{\lambda'})\le3\delta_{\lambda'},\ \forall\lambda'\le\lambda\big\},$$
where $\delta_\lambda$ is defined as:
$$\delta_\lambda=C_{\mathrm{adapt}}\,\lambda^{-2\beta}\,\frac{\log n}{n},$$
with $C_{\mathrm{adapt}}>0$ an explicit constant.
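As an illustration (my own sketch, not the authors' code), the ERC rule can be run on a finite bandwidth grid as soon as one can fit $\hat{\mathbf c}_\lambda$ and evaluate the deconvoluted empirical risk $R_n^{\lambda'}$ at given centers; fit_fn, risk_fn, the grid and the constant are placeholders to be supplied.

```python
import numpy as np

def erc_select(lambdas, fit_fn, risk_fn, c_adapt, beta, n):
    """Empirical Risk Comparison (ERC) bandwidth selection (sketch).

    lambdas : grid of candidate bandwidths (the set Lambda)
    fit_fn  : lam -> fitted centers c_hat_lam (e.g. output of noisy k-means)
    risk_fn : (lam_prime, centers) -> deconvoluted empirical risk R_n^{lam'}(centers)
    Selects the largest lambda whose risk, measured at every smaller
    bandwidth lam', exceeds the lam'-minimizer by at most 3 * delta_{lam'}.
    """
    lambdas = np.sort(np.asarray(lambdas, dtype=float))
    fits = {lam: fit_fn(lam) for lam in lambdas}
    delta = {lam: c_adapt * lam ** (-2 * beta) * np.log(n) / n for lam in lambdas}

    selected = lambdas[0]
    for lam in lambdas:
        admissible = all(
            risk_fn(lp, fits[lam]) - risk_fn(lp, fits[lp]) <= 3 * delta[lp]
            for lp in lambdas[lambdas <= lam]
        )
        if admissible:
            selected = lam          # keep the largest admissible bandwidth
    return selected, fits[selected]
```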

Adaptation: data-driven choices of λ

Uniform law for

Adaptation: stability of the ICI method

Real dataset: Iris

Adaptation using Empirical Risk Comparison (ERC)

To get the optimal rates, we act as follows:
$$R(\hat{\mathbf c}_\lambda,\mathbf c^\star)\le\inf_\lambda\left\{C_1\left(\frac{c(\lambda)}{\sqrt n}\right)^{2/(1+\rho)}+C_2\lambda^\gamma\right\}\le C\,n^{-\frac{\gamma}{2\gamma(1+\rho)+2\beta}},$$
where
$$\lambda^\star=O\big(n^{-\frac1{2\gamma(1+\rho)+2\beta}}\big).$$

Goal: to choose the bandwidth based on Lepski's principle.

Lepski's method

- $\{\hat f_h,\ h\in\mathcal H\}$: a family of (kernel) estimators, with associated (bandwidth) $h\in\mathcal H\subset\mathbb R$.
- BV decomposition: $\|\hat f_h-f\|\le C\{B(h)+V(h)\}$, where (usually) $V(\cdot)$ is known.
- Related to minimax theory:
$$f\in\Sigma(\gamma,L)\ \Rightarrow\ \|\hat f_{h(\gamma)}-f\|\le C\inf_h\{B(h)+V(h)\}=C\psi_n(\gamma).$$

Goal: a data-driven method to reach the bias-variance trade-off (minimax adaptive method).

Lepski's method: the rule

The rule:
$$\hat h=\max\{h>0:\ \forall h'\le h,\ \|\hat f_h-\hat f_{h'}\|\le cV(h')\}.$$
Heuristically,
$$\|\hat f_h-\hat f_{h'}\|\ \sim\ \|\hat f_h-f\|+\|f-\hat f_{h'}\|\ \sim\ B(h)+V(h)+B(h')+V(h')\ \overset{h'\le h}{\sim}\ B(h)+V(h').$$
The rule selects the biggest $h>0$ such that:
$$\sup_{h'\le h}\frac{B(h)+V(h')}{V(h')}\le c\ \Longleftrightarrow\ \forall h'\le h,\ B(h)\le(c-1)V(h').$$
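A minimal sketch of this rule (my own), on a finite bandwidth grid, with estimates evaluated on a fixed grid so that the norm can be discretized, and a user-supplied variance bound V:

```python
import numpy as np

def lepski_select(bandwidths, estimate_fn, V, c=2.0):
    """Classical Lepski rule on a finite grid of bandwidths.

    bandwidths  : candidate values of h
    estimate_fn : h -> estimate of f on a fixed grid (vector of values)
    V           : h -> known variance bound V(h)
    Returns the largest h such that ||f_hat_h - f_hat_h'|| <= c * V(h')
    for every smaller h' in the grid.
    """
    hs = np.sort(np.asarray(bandwidths, dtype=float))
    est = {h: np.asarray(estimate_fn(h)) for h in hs}

    selected = hs[0]
    for h in hs:
        ok = all(np.linalg.norm(est[h] - est[hp]) <= c * V(hp) for hp in hs[hs <= h])
        if ok:
            selected = h
    return selected, est[selected]
```

The ERC rule of the next slide replaces the comparison of estimators by a comparison of deconvoluted empirical risks.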

Empirical Risk Comparison (j.w. with M. Chichignoud)

We choose $\lambda$ as follows:
$$\hat\lambda=\max\big\{\lambda\in\Lambda:\ R_n^{\lambda'}(\hat{\mathbf c}_\lambda)-R_n^{\lambda'}(\hat{\mathbf c}_{\lambda'})\le3\delta_{\lambda'},\ \forall\lambda'\le\lambda\big\},$$
where $\delta_\lambda$ is defined as:
$$\delta_\lambda=C_{\mathrm{adapt}}\,\lambda^{-2\beta}\,\frac{\log n}{n},$$
with $C_{\mathrm{adapt}}>0$ an explicit constant.

Theorem 3: Adaptive upper bound (j.w. with M. Chichignoud)

Suppose $f\in\Sigma(\gamma,L)$, and that the noise assumption and Pollard's regularity assumptions are satisfied. Consider a kernel $\mathcal K$ of order $\lfloor\gamma\rfloor$ which satisfies the kernel assumption. Then:
$$\limsup_{n\to+\infty}\ \left(\frac{n}{\log n}\right)^{\frac{\gamma}{\gamma+\sum_{i=1}^d\beta_i}}\sup_{f\in\Sigma(\gamma,L)}\mathbb E\big[R(\hat{\mathbf c}_{\hat\lambda})-R(\mathbf c^\star)\big]<+\infty,$$
where
$$\hat{\mathbf c}_\lambda=\arg\min_{\mathbf c\in\mathcal C}\sum_{i=1}^n\ell_\lambda(\mathbf c,Z_i),$$
and $\hat\lambda$ is chosen with the ERC rule.

Proof for $\lambda^\star\in\{\lambda_1,\lambda_2\}$, $\lambda_1<\lambda_2$

The rule becomes:
$$\hat\lambda=\lambda_1\,\mathbf 1_\Omega+\lambda_2\,\mathbf 1_{\Omega^C},\quad\text{where}\quad\Omega=\{R_n^{\lambda_1}(\hat{\mathbf c}_{\lambda_2},\hat{\mathbf c}_{\lambda_1})>C\delta_{\lambda_1}\}.$$

- Case 1: $\lambda^\star=\lambda_1<\lambda_2$.
- Case 2: $\lambda^\star=\lambda_2>\lambda_1$.

Proof for $\lambda^\star\in\{\lambda_1,\lambda_2\}$, $\lambda_1<\lambda_2$ (continued)

- Case 1: $\lambda^\star=\lambda_1<\lambda_2$.
$$\mathbb E R(\hat{\mathbf c}_{\hat\lambda},\mathbf c^\star)=\mathbb E\big[R(\hat{\mathbf c}_{\hat\lambda},\mathbf c^\star)(\mathbf 1_\Omega+\mathbf 1_{\Omega^C})\big]\le\psi_n(\lambda^\star)+\mathbb E\big[R(\hat{\mathbf c}_{\hat\lambda},\mathbf c^\star)\mathbf 1_{\Omega^C}\big].$$
On $\Omega^C$, we have with high probability:
$$\begin{aligned}R(\hat{\mathbf c}_{\hat\lambda},\mathbf c^\star)&=(R-R^{\lambda^\star})(\hat{\mathbf c}_{\hat\lambda},\mathbf c^\star)+(R^{\lambda^\star}-R_n^{\lambda^\star})(\hat{\mathbf c}_{\hat\lambda},\mathbf c^\star)+R_n^{\lambda^\star}(\hat{\mathbf c}_{\hat\lambda},\mathbf c^\star)\\&\le B(\lambda^\star)+(R^{\lambda^\star}-R_n^{\lambda^\star})(\hat{\mathbf c}_{\hat\lambda},\mathbf c^\star)+3\delta_{\lambda^\star}\\&\le B(\lambda^\star)+r_{\lambda^\star}(2\log n)+3\delta_{\lambda^\star}\\&\le C\psi_n(\lambda^\star),\end{aligned}$$
where $r_\lambda(t)$ satisfies
$$\mathbb P\Big(\sup_{\mathbf c}|R_n^\lambda-R^\lambda|(\mathbf c,\mathbf c^\star)\ge r_\lambda(t)\Big)\le e^{-t}.$$

- Case 2: $\lambda^\star=\lambda_2>\lambda_1$.
$$\mathbb E R(\hat{\mathbf c}_{\hat\lambda},\mathbf c^\star)\le\psi_n(\lambda^\star)+\mathbb E\big[R(\hat{\mathbf c}_{\hat\lambda},\mathbf c^\star)\mathbf 1_\Omega\big]\le\psi_n(\lambda^\star)+\mathbb P(\Omega),\quad\text{where }\Omega=\{R_n^{\lambda_1}(\hat{\mathbf c}_{\lambda_2},\hat{\mathbf c}_{\lambda_1})>C\delta_{\lambda_1}\}.$$
$$\begin{aligned}R_n^{\lambda_1}(\hat{\mathbf c}_{\lambda_2},\hat{\mathbf c}_{\lambda_1})&=(R_n^{\lambda_1}-R^{\lambda_1})(\hat{\mathbf c}_{\lambda_2},\hat{\mathbf c}_{\lambda_1})+(R^{\lambda_1}-R)(\hat{\mathbf c}_{\lambda_2},\hat{\mathbf c}_{\lambda_1})+R(\hat{\mathbf c}_{\lambda_2},\hat{\mathbf c}_{\lambda_1})\\&\le(R_n^{\lambda_1}-R^{\lambda_1})(\hat{\mathbf c}_{\lambda_2},\hat{\mathbf c}_{\lambda_1})+2B(\lambda_1)+R(\hat{\mathbf c}_{\lambda_2},\mathbf c^\star).\end{aligned}$$
Since $B(\lambda_1)<B(\lambda_2)=B(\lambda^\star)=\delta_{\lambda^\star}=\delta_{\lambda_2}<\delta_{\lambda_1}$ and using Bousquet's inequality twice, we have with probability $1-2n^{-2}$:
$$R_n^{\lambda_1}(\hat{\mathbf c}_{\lambda_2},\hat{\mathbf c}_{\lambda_1})\le2r_{\lambda_1}(2\log n)+2\delta_{\lambda_1}+B(\lambda_2)+\delta_{\lambda_2}\le C\delta_{\lambda_1}.$$

ERC's Extension

Consider a family of $\lambda$-ERM $\{\hat g_\lambda,\ \lambda>0\}$. Assume:

1. There exists an increasing function $\mathrm{Bias}(\cdot)$ such that, for all $g\in\mathcal G$:
$$(R^\lambda-R)(g,g^\star)\le\mathrm{Bias}(\lambda)+\tfrac14R(g,g^\star).$$
2. There exists a decreasing function $\mathrm{Var}_t(\cdot)$ ($t\ge0$) such that, $\forall\lambda,t>0$:
$$\mathbb P\left(\sup_{g\in\mathcal G}\Big\{(R_n^\lambda-R^\lambda)(g,g^\star)-\tfrac14R(g,g^\star)\Big\}>\mathrm{Var}_t(\lambda)\right)\le e^{-t}.$$

Then there exists a universal constant $C_3$ such that, for all $t\ge0$:
$$\mathbb E\,R(\hat g_{\hat\lambda},g^\star)\le C_3\Big(\inf_{\lambda\in\Delta}\big\{\mathrm{Bias}(\lambda)+\mathrm{Var}_t(\lambda)\big\}+e^{-t}\Big).$$

Examples

- Nonparametric estimation
  - Image denoising: $R_n^\lambda(f_t)=\sum_i(Y_i-f_t)^2K_\lambda(X_i-x_0)$.
  - Local robust regression: $R_n^\lambda(t)=\sum_i\rho(Y_i-t)K_\lambda(X_i-x_0)$.
  - Fitted local likelihood: $R_n^\lambda(\theta)=\sum_i\log p(Y_i,\theta)K_\lambda(X_i-x_0)$.
- Inverse Statistical Learning
  - Quantile estimation: $R_n^\lambda(q)=\sum_i\int(x-q)(\tau-\mathbf 1_{x\le q})\,\widetilde{\mathcal K}_\lambda(Z_i-x)\,dx$.
  - Learning principal curves: $R_n^\lambda(g)=\sum_i\int\inf_t\|x-g(t)\|^2\,\widetilde{\mathcal K}_\lambda(Z_i-x)\,dx$.
  - Binary classification: $R_n^\lambda(G)=\sum_i\int\mathbf 1_{Y_i\neq\mathbf 1(x\in G)}\,\widetilde{\mathcal K}_\lambda(Z_i-x)\,dx$.

Open problems

- Anisotropic case
- Margin adaptation
- Model selection

Conclusion

Thanks for your attention!
