Inverse Statistical Learning
Minimax theory, adaptation and algorithms
Sébastien Loustau, with (in order of appearance)
C. Marteau, M. Chichignoud, C. Brunet and S. Souchet
Dijon, January 15, 2014
The problem of Inverse Statistical Learning

Given $(X,Y) \sim P$ on $\mathcal{X} \times \mathcal{Y}$, a class $\mathcal{G}$ and a loss function $\ell : \mathcal{G} \times (\mathcal{X} \times \mathcal{Y}) \to \mathbb{R}_+$, we aim at:
$$g^\star \in \arg\min_{g \in \mathcal{G}} \mathbb{E}_P\, \ell(g,(X,Y)),$$
from an indirect sequence of observations:
$$(Z_1,Y_1), \ldots, (Z_n,Y_n) \text{ i.i.d. from } \tilde{P},$$
where $Z_i \sim Af$, $A$ is a linear compact operator (and $X \sim f$).
Statistical Learning with errors in variables

Given $(X,Y) \sim P$ on $\mathcal{X} \times \mathcal{Y}$, a class $\mathcal{G}$ and a loss function $\ell : \mathcal{G} \times (\mathcal{X} \times \mathcal{Y}) \to \mathbb{R}_+$, we aim at $g^\star \in \arg\min_{g \in \mathcal{G}} \mathbb{E}_P\, \ell(g,(X,Y))$, from a noisy sequence of observations:
$$(X_1 + \epsilon_1, Y_1), \ldots, (X_n + \epsilon_n, Y_n) \text{ i.i.d. from } \tilde{P},$$
where $Z_i \sim f * \eta$ and $\eta$ is the density of the i.i.d. sequence $(\epsilon_i)_{i=1}^n$.

- $\mathcal{Y} = \mathbb{R}$: regression with errors in variables,
- $\mathcal{Y} = \{1, \ldots, M\}$: classification with errors in variables,
- $\mathcal{Y} = \emptyset$: unsupervised learning with errors in variables.
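As a concrete illustration of this observation scheme, here is a minimal simulation sketch; the Gaussian mixture for $f$ and the Laplace noise for $\eta$ are illustrative assumptions, not choices made in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Direct (unobservable) sample X_i ~ f: a two-component Gaussian mixture.
labels = rng.integers(0, 2, size=n)
X = rng.normal(loc=np.where(labels == 0, 0.0, 5.0), scale=1.0)

# Measurement errors eps_i ~ eta: Laplace noise (illustrative assumption).
eps = rng.laplace(loc=0.0, scale=0.5, size=n)

# Only the noisy sample Z_i = X_i + eps_i ~ f * eta is observed.
Z = X + eps
```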
Toy example (I)

[Figure: the direct dataset (unobservable) vs. the observations (available).]

Toy example (II)

[Figure: the direct dataset (unobservable) vs. the observations (available).]
Real-world example in oncology (I)

Fig. 1: The same tumor observed by two radiologists: $Z_{ij} = X_i + \epsilon_{ij}$, $j \in \{1,2\}$.

Real-world example in oncology (II)

Fig. 1: Batch effect in a micro-array dataset.
- J. A. Gagnon-Bartsch, L. Jacob and T. P. Speed, 2013.
Contents

1. Minimax rates in discriminant analysis
2. Excess risk bound
3. The algorithm of noisy k-means
(4.) Adaptation
Origin: a minimax motivation (with C. Marteau)

              | Density estimation                         | Classification
Direct case   | $n^{-\frac{2\gamma}{2\gamma+1}}$           | $n^{-\frac{\gamma(\alpha+1)}{\gamma(\alpha+2)+d}}$
Noisy case    | $n^{-\frac{2\gamma}{2\gamma+2\beta+1}}$    | ???
Assumptions   | $f \in \Sigma(\gamma,L)$                   | $\mathbb{E}(Y=1\,|\,X=x) \in \Sigma(\gamma,L)$, margin parameter $\alpha \ge 0$
              | $|\mathcal{F}[\eta](t)| \sim |t|^{-\beta}$ | $|\mathcal{F}[\eta_j](t_j)| \sim |t_j|^{-\beta_j}\ \forall j = 1, \ldots, d$
Mammen and Tsybakov (1999)

Given two densities $f$ and $g$, for any $G \subset K$, the Bayes risk is defined as:
$$R_K(G) = \frac{1}{2}\left[\int_{K \setminus G} f\,dQ + \int_G g\,dQ\right].$$
Given $X_1^1, \ldots, X_n^1 \sim f$ and $X_1^2, \ldots, X_n^2 \sim g$, we aim at:
$$G^\star = \arg\min_{G \in \mathcal{G}} R_K(G).$$

Goal: To obtain minimax fast rates
$$r_n(\mathcal{F}) \sim \inf_{\hat{G}} \sup_{(f,g) \in \mathcal{F}} \mathbb{E}\, d_2(\hat{G}, G^\star), \quad \text{where } d_2 \in \{d_{f,g}, d_\Delta\}.$$
Mammen and Tsybakov (1999) with errors in variables

We observe $Z_1^1, \ldots, Z_n^1$ and $Z_1^2, \ldots, Z_n^2$ such that:
$$Z_i^1 = X_i^1 + \epsilon_i^1 \ \text{ and } \ Z_i^2 = X_i^2 + \epsilon_i^2, \quad i = 1, \ldots, n,$$
where:
- $X_i^1 \sim f$ and $X_i^2 \sim g$,
- $\epsilon_i^j$ are i.i.d. with density $\eta$.

Goal: To obtain minimax fast rates
$$r_n(\mathcal{F}, \beta) \sim \inf_{\hat{G}} \sup_{(f,g) \in \mathcal{F}} \mathbb{E}\, d_2(\hat{G}, G^\star), \quad \text{where } d_2 \in \{d_{f,g}, d_\Delta\}.$$
ERM approach

ERM principle in the direct case:
$$\frac{1}{2n}\sum_{i=1}^n \mathbf{1}_{X_i^1 \in G^C} + \frac{1}{2n}\sum_{i=1}^n \mathbf{1}_{X_i^2 \in G} \longrightarrow R_K(G).$$

ERM principle in this model fails:
$$\frac{1}{2n}\sum_{i=1}^n \mathbf{1}_{Z_i^1 \in G^C} + \frac{1}{2n}\sum_{i=1}^n \mathbf{1}_{Z_i^2 \in G} \longrightarrow \frac{1}{2}\left[\int_{G^C} f * \eta + \int_G g * \eta\right] \neq R_K(G).$$

Solution: Define
$$R_n^\lambda(G) = \frac{1}{2}\left[\int_{G^C} \hat{f}_n^\lambda(x)\,dx + \int_G \hat{g}_n^\lambda(x)\,dx\right] \longrightarrow R_K(G),$$
where $(\hat{f}_n^\lambda, \hat{g}_n^\lambda)$ are estimators of $(f,g)$ of the form:
$$\hat{f}_n^\lambda(x) = \frac{1}{n\lambda}\sum_{i=1}^n \tilde{K}\left(\frac{Z_i^1 - x}{\lambda}\right).$$
Details

$Z_1^1, \ldots, Z_n^1$ i.i.d. $f * \eta$ and $Z_1^2, \ldots, Z_n^2$ i.i.d. $g * \eta$. We consider:
$$R_n^\lambda(G) = \frac{1}{2}\left[\int_{G^C} \hat{f}_n^\lambda(x)\,dx + \int_G \hat{g}_n^\lambda(x)\,dx\right],$$
where $\hat{f}_n^\lambda$ and $\hat{g}_n^\lambda$ are deconvolution kernel estimators. Then:
$$R_n^\lambda(G) = \frac{1}{2n}\left[\sum_{i=1}^n h_{G^C}^\lambda(Z_i^1) + \sum_{i=1}^n h_G^\lambda(Z_i^2)\right],$$
where:
$$h_G^\lambda(z) = \int_G \frac{1}{\lambda}\tilde{K}\left(\frac{z - x}{\lambda}\right)dx = \mathbf{1}_G * \tilde{K}_\lambda(z).$$
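A minimal one-dimensional sketch of the deconvolution kernel estimator $\hat{f}_n^\lambda$, using the sinc kernel ($\mathcal{F}[K] = \mathbf{1}_{[-1,1]}$) and assuming the noise characteristic function $\mathcal{F}[\eta]$ is known; the naive quadrature over the frequency domain and the Laplace noise example are implementation assumptions.

```python
import numpy as np

def deconvolution_kde(Z, x_grid, lam, noise_cf):
    """Deconvolution kernel density estimate of f from Z_i = X_i + eps_i.

    With the sinc kernel, F[K_tilde](t) = 1_{|t| <= 1} / F[eta](t / lam), so
    f_hat(x) = (1 / (2 pi lam)) * int_{-1}^{1}
               mean_i e^{i t (x - Z_i) / lam} / F[eta](t / lam) dt.
    """
    t = np.linspace(-1.0, 1.0, 201)
    est = np.empty(len(x_grid))
    for j, x in enumerate(x_grid):
        phase = np.exp(1j * np.outer((x - Z) / lam, t))  # e^{i t (x - Z_i)/lam}
        integrand = phase.mean(axis=0) / noise_cf(t / lam)
        est[j] = np.trapz(integrand, t).real / (2 * np.pi * lam)
    return est

# Example: Laplace(0, b) noise has characteristic function 1 / (1 + b^2 t^2).
b = 0.5
rng = np.random.default_rng(0)
Z = rng.normal(loc=2.0, size=500) + rng.laplace(scale=b, size=500)
f_hat = deconvolution_kde(Z, np.linspace(-2.0, 6.0, 100), lam=0.3,
                          noise_cf=lambda t: 1.0 / (1.0 + b**2 * t**2))
```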
Vapnik's bound ($\epsilon = 0$)

The use of empirical processes comes from VC theory:
$$R_K(\hat{G}_n) - R_K(G^\star) \le R_K(\hat{G}_n) - R_n(\hat{G}_n) + R_n(G^\star) - R_K(G^\star) \le 2\sup_{G \in \mathcal{G}} |(R_n - R)(G)|.$$

Goal: to control uniformly the empirical process indexed by $\mathcal{G}$.
ISL: the empirical process is indexed by $\{\mathbf{1}_G * \tilde{K}_\lambda,\ G \in \mathcal{G}\}$.
Theorem 1: Upper bound (j.w. with C. Marteau)

Suppose $(f,g) \in \mathcal{G}(\alpha, \gamma)$ and $|\mathcal{F}[\eta](t)| \sim \prod_{i=1}^d |t_i|^{-\beta_i}$, $\beta_i > 1/2$, $i = 1, \ldots, d$. Consider a kernel $K$ of order $\lfloor\gamma\rfloor$, which satisfies some properties. Then:
$$\limsup_{n \to +\infty}\ \sup_{(f,g) \in \mathcal{G}(\alpha,\gamma)} n^{\tau_d(\alpha,\beta,\gamma)}\, \mathbb{E}_{f,g}\, d(\hat{G}_n, G^\star) < +\infty,$$
where
$$\tau_d(\alpha, \beta, \gamma) = \begin{cases} \dfrac{\gamma\alpha}{\gamma(2+\alpha) + d + 2\sum_{i=1}^d \beta_i} & \text{for } d = d_\Delta,\\[2mm] \dfrac{\gamma(\alpha+1)}{\gamma(2+\alpha) + d + 2\sum_{i=1}^d \beta_i} & \text{for } d = d_{f,g},\end{cases}$$
and $\lambda = (\lambda_1, \ldots, \lambda_d)$ is chosen as:
$$\lambda_j = n^{-\frac{1}{\gamma(2+\alpha) + 2\sum_{i=1}^d \beta_i + d}}, \quad \forall j \in \{1, \ldots, d\}.$$
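To make the exponents concrete, a small numerical check of the rate and bandwidth formulas of Theorem 1 (the parameter values are arbitrary illustrative choices):

```python
# Numerical check of the exponents in Theorem 1 (illustrative values).
alpha, gamma, d = 1.0, 2.0, 2        # margin, smoothness, dimension
beta = [1.0, 1.0]                    # decay exponents of F[eta] per coordinate

denom = gamma * (2 + alpha) + d + 2 * sum(beta)
tau_delta = gamma * alpha / denom        # rate exponent for d = d_Delta
tau_fg = gamma * (alpha + 1) / denom     # rate exponent for d = d_{f,g}

n = 10_000
lam_j = n ** (-1.0 / denom)              # common bandwidth lambda_j
print(tau_delta, tau_fg, lam_j)          # 0.1667, 0.3333, ~0.46
```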
Theorem 2: Lower bound (j.w. with C. Marteau)

Suppose $|\mathcal{F}[\eta](t)| \sim \prod_{i=1}^d |t_i|^{-\beta_i}$, $\beta_i > 1/2$, $i = 1, \ldots, d$. Then for $\alpha \le 1$,
$$\liminf_{n \to +\infty}\ \inf_{\hat{G}_n}\ \sup_{(f,g) \in \mathcal{G}(\alpha,\gamma)} n^{\tau_d(\alpha,\beta,\gamma)}\, \mathbb{E}_{f,g}\, d(\hat{G}_n, G^\star) > 0,$$
where the infimum is taken over all possible estimators of the set $G^\star$ and
$$\tau_d(\alpha, \beta, \gamma) = \begin{cases} \dfrac{\gamma\alpha}{\gamma(2+\alpha) + d + 2\sum_{i=1}^d \beta_i} & \text{for } d = d_\Delta,\\[2mm] \dfrac{\gamma(\alpha+1)}{\gamma(2+\alpha) + d + 2\sum_{i=1}^d \beta_i} & \text{for } d = d_{f,g}.\end{cases}$$
Conclusion (minimax)

              | Density estimation                         | Classification
Direct case   | $n^{-\frac{2\gamma}{2\gamma+1}}$           | $n^{-\frac{\gamma(\alpha+1)}{\gamma(\alpha+2)+d}}$
Noisy case    | $n^{-\frac{2\gamma}{2\gamma+2\beta+1}}$    | $n^{-\frac{\gamma(\alpha+1)}{\gamma(\alpha+2)+2\bar{\beta}+d}}$
Assumptions   | $f \in \Sigma(\gamma,L)$                   | $\mathbb{E}(Y=1\,|\,X=x) \in \Sigma(\gamma,L)$, margin parameter $\alpha \ge 0$
              | $|\mathcal{F}[\eta](t)| \sim |t|^{-\beta}$ | $|\mathcal{F}[\eta_j](t_j)| \sim |t_j|^{-\beta_j}\ \forall j = 1, \ldots, d$
Sketch of the proofs, heuristics

1. Noisy quantization (for simplicity)
2. Excess risk decomposition
3. Bias control (easy and minimax)
4. Variance control: key lemma
Other results (I)

- (Un)supervised classification with errors in variables:
$$R_\ell(\hat{g}_n^\lambda) - R_\ell(g^\star) \le C n^{-\frac{\kappa\gamma}{\gamma(2\kappa + \rho - 1) + (2\kappa - 1)\sum_{i=1}^d \beta_i}},$$
where $g^\star = \arg\min R_\ell(g,(X,Y))$.
- (Un)supervised classification with $Z_i \sim Af$, using
$$\hat{f}_n^N(x) = \sum_{k=1}^N \hat{\theta}_k \phi_k(x),$$
where $\hat{\theta}_k = b_k^{-1}\frac{1}{n}\sum_{i=1}^n \psi_k(Z_i)$, $A^*A\phi_k = b_k^2\phi_k$, and
$$f \in \Theta(\gamma,L) := \Big\{f = \sum_{k=1}^{\infty} \theta_k \phi_k : \sum_k \theta_k^2 k^{2\gamma+1} \le L\Big\}.$$
Other results (II)

- If $f \in \Sigma(\vec{\gamma},L)$, the anisotropic Hölder class:
$$R_\ell(\hat{g}_n^\lambda) - R_\ell(g^\star) \le C n^{-\frac{\kappa}{2\kappa + \rho - 1 + \zeta(\kappa,\beta,\vec{\gamma})}},$$
where:
$$\zeta(\kappa, \beta, \vec{\gamma}) = (2\kappa - 1)\sum_{j=1}^d \frac{\beta_j}{\gamma_j},$$
and $\lambda = (\lambda_1, \ldots, \lambda_d)$ is chosen as:
$$\lambda_j \sim n^{-\frac{2\kappa - 1}{2\gamma_j(2\kappa + \rho - 1 + \zeta(\kappa,\beta,\vec{\gamma}))}}, \quad \forall j = 1, \ldots, d.$$
- Non-exact oracle inequalities:
$$R_\ell(\hat{g}) \le (1 + \epsilon)\inf_{g \in \mathcal{G}} R_\ell(g) + C(\epsilon)\, n^{-\frac{\gamma}{\gamma(1+\rho) + \sum_{i=1}^d \beta_i}},$$
without margin assumption.
Finite dimensional clustering

Given $k$, we aim at:
$$c^\star \in \arg\min_{c = (c_1, \ldots, c_k) \in \mathbb{R}^{dk}} \mathbb{E} \min_{j=1,\ldots,k} \|X - c_j\|^2.$$
The empirical counterpart:
$$\hat{c}_n \in \arg\min_{c = (c_1, \ldots, c_k) \in \mathbb{R}^{dk}} \frac{1}{n}\sum_{i=1}^n \min_{j=1,\ldots,k} \|X_i - c_j\|^2,$$
gives rise to the popular $k$-means, studied in (Pollard, 1982).
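For reference, a minimal Lloyd-style iteration for this empirical criterion (a standard implementation sketch, not code from the talk):

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Lloyd's algorithm for the empirical k-means criterion."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: nearest center for each point.
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center becomes the mean of its cell.
        for j in range(k):
            if (labels == j).any():
                c[j] = X[labels == j].mean(axis=0)
    return c, labels
```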
Finite dimensional noisy clustering (j.w. with C. Brunet)

We want to approximate a solution of the stochastic minimization:
$$\min_{c = (c_1, \ldots, c_k) \in \mathbb{R}^{dk}} \frac{1}{n}\sum_{i=1}^n \gamma_\lambda(c, Z_i),$$
where
$$\gamma_\lambda(c, z) = \int_K \min_{j=1,\ldots,k} \|x - c_j\|^2\, \tilde{K}_\lambda(z - x)\,dx.$$
First order conditions (I)

Suppose $\|X\|_\infty \le M$ and Pollard's regularity assumptions are satisfied. Then, $\forall u \in \{1,\ldots,d\}$, $\forall j \in \{1,\ldots,k\}$, we have the following assertion:
$$c_j^u = \frac{\sum_{i=1}^n \int_{V_j} x_u\, \tilde{K}_\lambda(Z_i - x)\,dx}{\sum_{i=1}^n \int_{V_j} \tilde{K}_\lambda(Z_i - x)\,dx} \implies \widetilde{\nabla}_{uj} J_n^\lambda(c) = 0, \quad \text{where } J_n^\lambda(c) = \sum_{i=1}^n \gamma_\lambda(c, Z_i).$$

First order conditions (II)

- The standard k-means:
$$c_{u,j} = \frac{\sum_{i=1}^n X_{i,u}\, \mathbf{1}_{X_i \in V_j}}{\sum_{i=1}^n \mathbf{1}_{X_i \in V_j}} = \frac{\sum_{i=1}^n \int_{V_j} x_u\, \delta_{X_i}\,dx}{\sum_{i=1}^n \int_{V_j} \delta_{X_i}\,dx}, \quad \forall u, j,$$
where $\delta_{X_i}$ is the Dirac function at point $X_i$.
- Another look:
$$c_{u,j} = \frac{\int_{V_j} x_u\, \hat{f}_n(x)\,dx}{\int_{V_j} \hat{f}_n(x)\,dx}, \quad \forall u \in \{1,\ldots,d\},\ \forall j \in \{1,\ldots,k\},$$
where $\hat{f}_n(x) = \frac{1}{n}\sum_{i=1}^n \tilde{K}_\lambda(Z_i - x)$ is the kernel deconvolution estimator of the density $f$.
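These fixed-point equations suggest a simple grid-based iteration. A minimal sketch, assuming the deconvolution estimate $\hat{f}_n$ has already been evaluated on a discretization of $K$ (the grid approximation of the integrals is an implementation shortcut, not the algorithm as stated in the talk):

```python
import numpy as np

def noisy_kmeans(grid, f_hat, k, n_iter=50, seed=0):
    """Lloyd-style iteration solving the first-order conditions on a grid.

    `grid` is an (m, d) array discretizing K and `f_hat` the deconvolution
    density estimate on it; each center is updated as the f_hat-weighted
    mean of its Voronoi cell V_j, as in the fixed-point equations above.
    """
    w0 = np.clip(f_hat, 0.0, None)      # deconvolution estimates can be negative
    rng = np.random.default_rng(seed)
    c = grid[rng.choice(len(grid), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d2 = ((grid[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        cell = d2.argmin(axis=1)        # Voronoi cell index of each grid point
        for j in range(k):
            w = np.where(cell == j, w0, 0.0)
            if w.sum() > 0:
                c[j] = (grid * w[:, None]).sum(axis=0) / w.sum()
    return c
```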
The algorithm of Noisy K-means (j.w. with C. Brunet)

Experimental setting: simulation study (a simulation sketch after the model definitions below illustrates steps 1 and 4).
1. We draw i.i.d. sequences $(X_i)_{i=1,\ldots,n}$ (Gaussian mixtures) and $(\epsilon_i)_{i=1}^n$ (symmetric noise), for $n \in \{100, 500\}$.
2. We draw repetitions $(\epsilon_j)_{j=1,\ldots,m}$ with $m = 100$.
3. We compute the Noisy k-means clusters $\hat{c}$, with an estimation step of $f_\eta$ thanks to step 2.
4. We calculate the clustering risk:
$$r_n(\hat{c}) = \frac{1}{100}\sum_{i=1}^{100} \mathbf{1}\{X_i^j \notin V_j(\hat{c})\}.$$
Experimental setting - Model 1

For $u \in \{1, \ldots, 10\}$, we call Mod1(u):
$$Z_i = X_i + \epsilon_i(u), \quad i = 1, \ldots, n, \qquad \text{(Mod1(u))}$$
where:
- $(X_i)_{i=1}^n$ are i.i.d. with density $f = \frac{1}{2} f_{\mathcal{N}(0_2, I_2)} + \frac{1}{2} f_{\mathcal{N}((5,0)^T, I_2)}$,
- $(\epsilon_i(u))_{i=1}^n$ are i.i.d. with law $\mathcal{N}(0_2, \Sigma(u))$, where $\Sigma(u)$ is a diagonal matrix with diagonal vector $(0, u)^T$.
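A minimal sketch of steps 1 and 4 of the protocol for Mod1(u); the center/component matching is simplified here (in practice one minimizes the risk over label permutations):

```python
import numpy as np

def simulate_mod1(n, u, seed=0):
    """Draw the direct sample X_i ~ f, the noise eps_i(u), and Z_i = X_i + eps_i(u)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    means = np.array([[0.0, 0.0], [5.0, 0.0]])
    X = rng.normal(size=(n, 2)) + means[labels]        # 1/2 N(0_2,I_2) + 1/2 N((5,0),I_2)
    eps = rng.normal(size=(n, 2)) * np.sqrt([0.0, u])  # N(0_2, diag(0, u))
    return X + eps, X, labels

def clustering_risk(X, labels, centers):
    """Fraction of direct points X_i falling outside the cell of their component
    (assumes centers[j] corresponds to component j)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(np.mean(d2.argmin(axis=1) != labels))

Z, X, labels = simulate_mod1(n=500, u=4)
```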
Illustrations Mod1
Experimental setting - Model 2

For $u \in \{1, \ldots, 10\}$, we call Mod2(u):
$$Z_i = X_i(u) + \epsilon_i, \quad i = 1, \ldots, n, \qquad \text{(Mod2(u))}$$
where:
- $(X_i(u))_{i=1}^n$ are i.i.d. with density $f = \frac{1}{3} f_{\mathcal{N}(0_2, I_2)} + \frac{1}{3} f_{\mathcal{N}((a,b)^T, I_2)} + \frac{1}{3} f_{\mathcal{N}((b,a)^T, I_2)}$, where $(a,b) = (15 - (u-1)/2,\ 5 + (u-1)/2)$,
- $(\epsilon_i)_{i=1}^n$ are i.i.d. with law $\mathcal{N}(0_2, \Sigma)$, where $\Sigma$ is a diagonal matrix with diagonal vector $(5,5)^T$.
Illustrations Mod2
Results Mod1 for n = 100
Results Mod1 for n = 500
Results Mod2
Adaptation!

To get the optimal rates, we act as follows:
$$R(\hat{c}_\lambda, c^\star) \le \inf_\lambda \left\{ C_1\left(\frac{c(\lambda)}{\sqrt{n}}\right)^{2/(1+\rho)} + C_2\lambda^{2\gamma} \right\} \le C n^{-\frac{\gamma}{2\gamma(1+\rho) + 2\beta}},$$
where
$$\lambda^\star = O\big(n^{-\frac{1}{2\gamma(1+\rho) + 2\beta}}\big).$$

Goal: to choose the bandwidth based on Lepski's principle.
Empirical Risk Comparison (j.w. with M. Chichignoud)

We choose $\lambda$ as follows:
$$\hat{\lambda} = \max\big\{\lambda \in \Lambda : R_n^{\lambda'}(\hat{c}_\lambda) - R_n^{\lambda'}(\hat{c}_{\lambda'}) \le 3\delta_{\lambda'},\ \forall \lambda' \le \lambda\big\},$$
where $\delta_\lambda$ is defined as:
$$\delta_\lambda = C_{\mathrm{adapt}}\, \lambda^{-2\beta}\, \frac{\log n}{n},$$
with $C_{\mathrm{adapt}} > 0$ an explicit constant.
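In schematic Python, the ERC rule reads as follows; `emp_risk`, `minimizers` and `delta` stand for $R_n^{\lambda'}$, the $\lambda$-ERMs $\hat{c}_\lambda$ and the thresholds $\delta_{\lambda}$, and are hypothetical helpers assumed supplied by the user:

```python
def erc_select(lambdas, emp_risk, minimizers, delta):
    """Empirical Risk Comparison rule (sketch).

    lambdas    : increasing list of candidate bandwidths Lambda
    emp_risk   : emp_risk(lam_prime, c) -> R_n^{lam'}(c)
    minimizers : dict lam -> c_hat_lam, the lam-ERM for each bandwidth
    delta      : delta(lam) -> threshold delta_lam
    Returns the largest lam whose ERM is accepted at every smaller bandwidth.
    """
    selected = lambdas[0]
    for lam in lambdas:
        ok = all(
            emp_risk(lp, minimizers[lam]) - emp_risk(lp, minimizers[lp]) <= 3 * delta(lp)
            for lp in lambdas if lp <= lam
        )
        if ok:
            selected = lam
    return selected
```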
Adaptation: data-driven choices of $\lambda$
Uniform law for
Adaptation: stability of the ICI method
Real dataset: Iris
Lepski's method

- $\{\hat{f}_h,\ h \in \mathcal{H}\}$: a family of (kernel) estimators, indexed by a (bandwidth) $h \in \mathcal{H} \subset \mathbb{R}$.
- Bias-variance decomposition: $\|\hat{f}_h - f\| \le C\{B(h) + V(h)\}$, where (usually) $V(\cdot)$ is known.
- Related to minimax theory:
$$f \in \Sigma(\gamma,L) \Rightarrow \|\hat{f}_{h^\star(\gamma)} - f\| \le C\inf_h\{B(h) + V(h)\} = C\psi_n(\gamma).$$

Goal: a data-driven method to reach the bias-variance trade-off (minimax adaptive method).

Lepski's method: the rule

The rule:
$$\hat{h} = \max\{h > 0 : \forall h' \le h,\ \|\hat{f}_h - \hat{f}_{h'}\| \le cV(h')\}.$$
Heuristically:
$$\|\hat{f}_h - \hat{f}_{h'}\| \sim \|\hat{f}_h - f\| + \|f - \hat{f}_{h'}\| \sim B(h) + V(h) + B(h') + V(h') \underset{h' \le h}{\sim} B(h) + V(h').$$
The rule selects the biggest $h > 0$ such that:
$$\sup_{h' \le h} \frac{B(h) + V(h')}{V(h')} \le c \iff \forall h' \le h,\ B(h) \le (c-1)V(h').$$
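A schematic implementation of the rule; the container `estimators` (e.g. each estimate evaluated on a grid) and the known variance bound `V` are assumptions supplied by the user:

```python
import numpy as np

def lepski_select(bandwidths, estimators, V, c=2.0, norm=np.linalg.norm):
    """Lepski's rule (sketch): the largest h whose estimate stays within
    c * V(h') of every estimate at a smaller bandwidth h'.

    bandwidths : increasing list of candidate h
    estimators : dict h -> f_hat_h (values of each estimate on a grid)
    V          : V(h) -> known variance term
    """
    selected = bandwidths[0]
    for h in bandwidths:
        ok = all(
            norm(estimators[h] - estimators[hp]) <= c * V(hp)
            for hp in bandwidths if hp <= h
        )
        if ok:
            selected = h
    return selected
```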
Theorem 3: Adaptive upper bound (j.w. with M. Chichignoud)

Suppose $f \in \Sigma(\gamma,L)$, and that the noise assumption and Pollard's regularity assumptions are satisfied. Consider a kernel $K$ of order $\lfloor\gamma\rfloor$ which satisfies the kernel assumption. Then:
$$\limsup_{n \to +\infty}\ \left(\frac{\log n}{n}\right)^{-\frac{\gamma}{\gamma + \sum_{i=1}^d \beta_i}} \sup_{f \in \Sigma(\gamma,L)} \mathbb{E}\big[R(\hat{c}_{\hat{\lambda}}) - R(c^\star)\big] < +\infty,$$
where
$$\hat{c}_\lambda = \arg\min_{c \in \mathcal{C}} \sum_{i=1}^n \ell_\lambda(c, Z_i),$$
and $\hat{\lambda}$ is chosen with the ERC rule.
Proof for $\lambda^\star \in \{\lambda_1, \lambda_2\}$, $\lambda_1 < \lambda_2$

The rule becomes:
$$\hat{\lambda} = \lambda_1\, \mathbf{1}_\Omega + \lambda_2\, \mathbf{1}_{\Omega^C}, \quad \text{where } \Omega = \big\{R_n^{\lambda_1}(\hat{c}_{\lambda_2}, \hat{c}_{\lambda_1}) > C\delta_{\lambda_1}\big\}.$$

- Case 1: $\lambda^\star = \lambda_1 < \lambda_2$.
- Case 2: $\lambda^\star = \lambda_2 > \lambda_1$.
- Case 1: $\lambda^\star = \lambda_1 < \lambda_2$.
$$\mathbb{E}\, R(\hat{c}_{\hat{\lambda}}, c^\star) = \mathbb{E}\, R(\hat{c}_{\hat{\lambda}}, c^\star)(\mathbf{1}_\Omega + \mathbf{1}_{\Omega^C}) \le \psi_n(\lambda^\star) + \mathbb{E}\, R(\hat{c}_{\hat{\lambda}}, c^\star)\, \mathbf{1}_{\Omega^C}.$$
On $\Omega^C$, we have with high probability:
$$\begin{aligned}
R(\hat{c}_{\hat{\lambda}}, c^\star) &= (R - R_{\lambda^\star})(\hat{c}_{\hat{\lambda}}, c^\star) + (R_{\lambda^\star} - R_n^{\lambda^\star})(\hat{c}_{\hat{\lambda}}, c^\star) + R_n^{\lambda^\star}(\hat{c}_{\hat{\lambda}}, c^\star)\\
&\le B(\lambda^\star) + (R_{\lambda^\star} - R_n^{\lambda^\star})(\hat{c}_{\hat{\lambda}}, c^\star) + 3\delta_{\lambda^\star}\\
&\le B(\lambda^\star) + r_{\lambda^\star}(2\log n) + 3\delta_{\lambda^\star}\\
&\le C\psi_n(\lambda^\star),
\end{aligned}$$
where $r_\lambda(t)$ is such that $\mathbb{P}\big(\sup_c |R_n^\lambda - R_\lambda|(c, c^\star) \ge r_\lambda(t)\big) \le e^{-t}$.
- Case 2: $\lambda^\star = \lambda_2 > \lambda_1$.
$$\mathbb{E}\, R(\hat{c}_{\hat{\lambda}}, c^\star) \le \psi_n(\lambda^\star) + \mathbb{E}\, R(\hat{c}_{\hat{\lambda}}, c^\star)\, \mathbf{1}_\Omega \le \psi_n(\lambda^\star) + \mathbb{P}(\Omega), \quad \text{where } \Omega = \big\{R_n^{\lambda_1}(\hat{c}_{\lambda_2}, \hat{c}_{\lambda_1}) > C\delta_{\lambda_1}\big\}.$$
$$\begin{aligned}
R_n^{\lambda_1}(\hat{c}_{\lambda_2}, \hat{c}_{\lambda_1}) &= (R_n^{\lambda_1} - R_{\lambda_1})(\hat{c}_{\lambda_2}, \hat{c}_{\lambda_1}) + (R_{\lambda_1} - R)(\hat{c}_{\lambda_2}, \hat{c}_{\lambda_1}) + R(\hat{c}_{\lambda_2}, \hat{c}_{\lambda_1})\\
&\le (R_n^{\lambda_1} - R_{\lambda_1})(\hat{c}_{\lambda_2}, \hat{c}_{\lambda_1}) + 2B(\lambda_1) + R(\hat{c}_{\lambda_2}, c^\star).
\end{aligned}$$
Since $B(\lambda_1) < B(\lambda_2) = B(\lambda^\star) = \delta_{\lambda^\star} = \delta_{\lambda_2} < \delta_{\lambda_1}$ and using Bousquet's inequality twice, we have with probability $1 - 2n^{-2}$:
$$R_n^{\lambda_1}(\hat{c}_{\lambda_2}, \hat{c}_{\lambda_1}) \le 2r_{\lambda_1}(2\log n) + 2\delta_{\lambda_1} + B(\lambda_2) + \delta_{\lambda_2} \le C\delta_{\lambda_1}.$$
ERC's Extension

Consider a family of $\lambda$-ERM $\{\hat{g}_\lambda, \lambda > 0\}$. Assume:
1. There exists an increasing function $\mathrm{Bias}(\cdot)$ such that, for all $g \in \mathcal{G}$:
$$(R_\lambda - R)(g, g^\star) \le \mathrm{Bias}(\lambda) + \frac{1}{4}R(g, g^\star).$$
2. There exists a decreasing function $\mathrm{Var}_t(\cdot)$ ($t \ge 0$) such that, $\forall \lambda, t > 0$:
$$\mathbb{P}\left(\sup_{g \in \mathcal{G}}\Big\{(R_n^\lambda - R_\lambda)(g, g^\star) - \frac{1}{4}R(g, g^\star)\Big\} > \mathrm{Var}_t(\lambda)\right) \le e^{-t}.$$
Then there exists a universal constant $C_3$ such that, for all $t \ge 0$:
$$\mathbb{E}\, R(\hat{g}_{\hat{\lambda}}, g^\star) \le C_3\left[\inf_{\lambda \in \Delta}\big\{\mathrm{Bias}(\lambda) + \mathrm{Var}_t(\lambda)\big\} + e^{-t}\right].$$
Examples

- Nonparametric estimation:
  - Image denoising: $R_n^\lambda(f_t) = \sum_i (Y_i - f_t)^2 K_\lambda(X_i - x_0)$.
  - Local robust regression (see the sketch after this list): $R_n^\lambda(t) = \sum_i \rho(Y_i - t) K_\lambda(X_i - x_0)$.
  - Fitted local likelihood: $R_n^\lambda(\theta) = \sum_i -\log p(Y_i, \theta) K_\lambda(X_i - x_0)$.
- Inverse Statistical Learning:
  - Quantile estimation: $R_n^\lambda(q) = \sum_i \int (x - q)(\tau - \mathbf{1}_{x \le q})\,\tilde{K}_\lambda(Z_i - x)\,dx$.
  - Learning principal curves: $R_n^\lambda(g) = \sum_i \int \inf_t \|x - g(t)\|^2\, \tilde{K}_\lambda(Z_i - x)\,dx$.
  - Binary classification: $R_n^\lambda(G) = \sum_i \int \mathbf{1}_{Y_i \neq \mathbf{1}(x \in G)}\, \tilde{K}_\lambda(Z_i - x)\,dx$.
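To make the first family concrete, a minimal local robust regression sketch minimizing $R_n^\lambda(t)$ over a grid of levels $t$; Huber's $\rho$ and the Gaussian kernel are illustrative choices, not prescribed by the talk.

```python
import numpy as np

def local_robust_fit(X, Y, x0, lam, delta=1.0, grid=None):
    """Minimize R_n^lam(t) = sum_i rho(Y_i - t) K_lam(X_i - x0) over a grid
    of candidate levels t, with Huber's rho and a Gaussian kernel."""
    if grid is None:
        grid = np.linspace(Y.min(), Y.max(), 200)
    w = np.exp(-0.5 * ((X - x0) / lam) ** 2)           # K_lam(X_i - x0)
    r = Y[None, :] - grid[:, None]                     # residuals Y_i - t
    rho = np.where(np.abs(r) <= delta,
                   0.5 * r**2,
                   delta * (np.abs(r) - 0.5 * delta))  # Huber loss
    return grid[np.argmin((rho * w[None, :]).sum(axis=1))]
```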
Open problems

- Anisotropic case
- Margin adaptation
- Model selection
Conclusion

Thanks for your attention!