Inverse Statistical Learning
Minimax theory, adaptation and algorithms
Sébastien Loustau, with (in order of appearance)
C. Marteau, M. Chichignoud, C. Brunet and S. Souchet
Dijon, January 15, 2014
The problem of Inverse Statistical Learning

Given $(X,Y) \sim P$ on $\mathcal{X} \times \mathcal{Y}$, a class $\mathcal{G}$ and a loss function $\ell : \mathcal{G} \times (\mathcal{X} \times \mathcal{Y}) \to \mathbb{R}_+$, we aim at:
$$g^\star \in \arg\min_{g \in \mathcal{G}} \mathbb{E}_P\, \ell(g,(X,Y)),$$
from an indirect sequence of observations:
$$(Z_1,Y_1), \ldots, (Z_n,Y_n) \text{ i.i.d. from } \tilde{P},$$
where $Z_i \sim Af$, $A$ is a linear compact operator (and $X \sim f$).
Statistical Learning with errors in variables

Given $(X,Y) \sim P$ on $\mathcal{X} \times \mathcal{Y}$, a class $\mathcal{G}$ and a loss function $\ell : \mathcal{G} \times (\mathcal{X} \times \mathcal{Y}) \to \mathbb{R}_+$, we aim at $g^\star \in \arg\min_{g \in \mathcal{G}} \mathbb{E}_P\, \ell(g,(X,Y))$, from a noisy sequence of observations:
$$(X_1 + \epsilon_1, Y_1), \ldots, (X_n + \epsilon_n, Y_n) \text{ i.i.d. from } \tilde{P},$$
where $Z_i \sim f * \eta$ and $\eta$ is the density of the i.i.d. sequence $(\epsilon_i)_{i=1}^n$.

- $\mathcal{Y} = \mathbb{R}$: regression with errors in variables,
- $\mathcal{Y} = \{1, \ldots, M\}$: classification with errors in variables,
- $\mathcal{Y} = \emptyset$: unsupervised learning with errors in variables.
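As a concrete illustration of this observation scheme, here is a minimal simulation sketch; the Gaussian mixture for $f$ and the Laplace noise for $\eta$ are illustrative assumptions, not choices made in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Direct (unobservable) sample X_i ~ f: a two-component Gaussian mixture.
labels = rng.integers(0, 2, size=n)
X = rng.normal(loc=np.where(labels == 0, 0.0, 5.0), scale=1.0)

# Measurement errors eps_i ~ eta: Laplace noise (illustrative assumption).
eps = rng.laplace(loc=0.0, scale=0.5, size=n)

# Only the noisy sample Z_i = X_i + eps_i ~ f * eta is observed.
Z = X + eps
```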
Toy example (I)

[Figure: the direct dataset (unobservable) vs. the observations (available).]

Toy example (II)

[Figure: the direct dataset (unobservable) vs. the observations (available).]
Real-world example in oncology (I)

Fig. 1: The same tumor observed by two radiologists: $Z_{ij} = X_i + \epsilon_{ij}$, $j \in \{1,2\}$.

Real-world example in oncology (II)

Fig. 1: Batch effect in a micro-array dataset.
- J. A. Gagnon-Bartsch, L. Jacob and T. P. Speed, 2013.
Contents

1. Minimax rates in discriminant analysis
2. Excess risk bound
3. The algorithm of noisy k-means
(4.) Adaptation
Origin: a minimax motivation (with C. Marteau)

              | Density estimation                         | Classification
Direct case   | $n^{-\frac{2\gamma}{2\gamma+1}}$           | $n^{-\frac{\gamma(\alpha+1)}{\gamma(\alpha+2)+d}}$
Noisy case    | $n^{-\frac{2\gamma}{2\gamma+2\beta+1}}$    | ???
Assumptions   | $f \in \Sigma(\gamma,L)$                   | $\mathbb{E}(Y=1\,|\,X=x) \in \Sigma(\gamma,L)$, margin parameter $\alpha \ge 0$
              | $|\mathcal{F}[\eta](t)| \sim |t|^{-\beta}$ | $|\mathcal{F}[\eta_j](t_j)| \sim |t_j|^{-\beta_j}\ \forall j = 1, \ldots, d$
Mammen and Tsybakov (1999)

Given two densities $f$ and $g$, for any $G \subset K$, the Bayes risk is defined as:
$$R_K(G) = \frac{1}{2}\left[\int_{K \setminus G} f\,dQ + \int_G g\,dQ\right].$$
Given $X_1^1, \ldots, X_n^1 \sim f$ and $X_1^2, \ldots, X_n^2 \sim g$, we aim at:
$$G^\star = \arg\min_{G \in \mathcal{G}} R_K(G).$$

Goal: To obtain minimax fast rates
$$r_n(\mathcal{F}) \sim \inf_{\hat{G}} \sup_{(f,g) \in \mathcal{F}} \mathbb{E}\, d_2(\hat{G}, G^\star), \quad \text{where } d_2 \in \{d_{f,g}, d_\Delta\}.$$
Mammen and Tsybakov (1999) with errors in variables

We observe $Z_1^1, \ldots, Z_n^1$ and $Z_1^2, \ldots, Z_n^2$ such that:
$$Z_i^1 = X_i^1 + \epsilon_i^1 \ \text{ and } \ Z_i^2 = X_i^2 + \epsilon_i^2, \quad i = 1, \ldots, n,$$
where:
- $X_i^1 \sim f$ and $X_i^2 \sim g$,
- $\epsilon_i^j$ are i.i.d. with density $\eta$.

Goal: To obtain minimax fast rates
$$r_n(\mathcal{F}, \beta) \sim \inf_{\hat{G}} \sup_{(f,g) \in \mathcal{F}} \mathbb{E}\, d_2(\hat{G}, G^\star), \quad \text{where } d_2 \in \{d_{f,g}, d_\Delta\}.$$
ERM approach

ERM principle in the direct case:
$$\frac{1}{2n}\sum_{i=1}^n \mathbf{1}_{X_i^1 \in G^C} + \frac{1}{2n}\sum_{i=1}^n \mathbf{1}_{X_i^2 \in G} \longrightarrow R_K(G).$$

ERM principle in this model fails:
$$\frac{1}{2n}\sum_{i=1}^n \mathbf{1}_{Z_i^1 \in G^C} + \frac{1}{2n}\sum_{i=1}^n \mathbf{1}_{Z_i^2 \in G} \longrightarrow \frac{1}{2}\left[\int_{G^C} f * \eta + \int_G g * \eta\right] \neq R_K(G).$$

Solution: Define
$$R_n^\lambda(G) = \frac{1}{2}\left[\int_{G^C} \hat{f}_n^\lambda(x)\,dx + \int_G \hat{g}_n^\lambda(x)\,dx\right] \longrightarrow R_K(G),$$
where $(\hat{f}_n^\lambda, \hat{g}_n^\lambda)$ are estimators of $(f,g)$ of the form:
$$\hat{f}_n^\lambda(x) = \frac{1}{n\lambda}\sum_{i=1}^n \tilde{K}\left(\frac{Z_i^1 - x}{\lambda}\right).$$
Details

$Z_1^1, \ldots, Z_n^1$ i.i.d. $f * \eta$ and $Z_1^2, \ldots, Z_n^2$ i.i.d. $g * \eta$. We consider:
$$R_n^\lambda(G) = \frac{1}{2}\left[\int_{G^C} \hat{f}_n^\lambda(x)\,dx + \int_G \hat{g}_n^\lambda(x)\,dx\right],$$
where $\hat{f}_n^\lambda$ and $\hat{g}_n^\lambda$ are deconvolution kernel estimators. Then:
$$R_n^\lambda(G) = \frac{1}{2n}\left[\sum_{i=1}^n h_{G^C}^\lambda(Z_i^1) + \sum_{i=1}^n h_G^\lambda(Z_i^2)\right],$$
where:
$$h_G^\lambda(z) = \int_G \frac{1}{\lambda}\tilde{K}\left(\frac{z - x}{\lambda}\right)dx = \mathbf{1}_G * \tilde{K}_\lambda(z).$$
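A minimal one-dimensional sketch of the deconvolution kernel estimator $\hat{f}_n^\lambda$, using the sinc kernel ($\mathcal{F}[K] = \mathbf{1}_{[-1,1]}$) and assuming the noise characteristic function $\mathcal{F}[\eta]$ is known; the naive quadrature over the frequency domain and the Laplace noise example are implementation assumptions.

```python
import numpy as np

def deconvolution_kde(Z, x_grid, lam, noise_cf):
    """Deconvolution kernel density estimate of f from Z_i = X_i + eps_i.

    With the sinc kernel, F[K_tilde](t) = 1_{|t| <= 1} / F[eta](t / lam), so
    f_hat(x) = (1 / (2 pi lam)) * int_{-1}^{1}
               mean_i e^{i t (x - Z_i) / lam} / F[eta](t / lam) dt.
    """
    t = np.linspace(-1.0, 1.0, 201)
    est = np.empty(len(x_grid))
    for j, x in enumerate(x_grid):
        phase = np.exp(1j * np.outer((x - Z) / lam, t))  # e^{i t (x - Z_i)/lam}
        integrand = phase.mean(axis=0) / noise_cf(t / lam)
        est[j] = np.trapz(integrand, t).real / (2 * np.pi * lam)
    return est

# Example: Laplace(0, b) noise has characteristic function 1 / (1 + b^2 t^2).
b = 0.5
rng = np.random.default_rng(0)
Z = rng.normal(loc=2.0, size=500) + rng.laplace(scale=b, size=500)
f_hat = deconvolution_kde(Z, np.linspace(-2.0, 6.0, 100), lam=0.3,
                          noise_cf=lambda t: 1.0 / (1.0 + b**2 * t**2))
```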
Vapnik's bound ($\epsilon = 0$)

The use of empirical processes comes from VC theory:
$$R_K(\hat{G}_n) - R_K(G^\star) \le R_K(\hat{G}_n) - R_n(\hat{G}_n) + R_n(G^\star) - R_K(G^\star) \le 2\sup_{G \in \mathcal{G}} |(R_n - R)(G)|.$$

Goal: to control uniformly the empirical process indexed by $\mathcal{G}$.
ISL: the empirical process is indexed by $\{\mathbf{1}_G * \tilde{K}_\lambda,\ G \in \mathcal{G}\}$.
Theorem 1: Upper bound (j.w. with C. Marteau)

Suppose $(f,g) \in \mathcal{G}(\alpha, \gamma)$ and $|\mathcal{F}[\eta](t)| \sim \prod_{i=1}^d |t_i|^{-\beta_i}$, $\beta_i > 1/2$, $i = 1, \ldots, d$. Consider a kernel $K$ of order $\lfloor\gamma\rfloor$, which satisfies some properties. Then:
$$\limsup_{n \to +\infty}\ \sup_{(f,g) \in \mathcal{G}(\alpha,\gamma)} n^{\tau_d(\alpha,\beta,\gamma)}\, \mathbb{E}_{f,g}\, d(\hat{G}_n, G^\star) < +\infty,$$
where
$$\tau_d(\alpha, \beta, \gamma) = \begin{cases} \dfrac{\gamma\alpha}{\gamma(2+\alpha) + d + 2\sum_{i=1}^d \beta_i} & \text{for } d = d_\Delta,\\[2mm] \dfrac{\gamma(\alpha+1)}{\gamma(2+\alpha) + d + 2\sum_{i=1}^d \beta_i} & \text{for } d = d_{f,g},\end{cases}$$
and $\lambda = (\lambda_1, \ldots, \lambda_d)$ is chosen as:
$$\lambda_j = n^{-\frac{1}{\gamma(2+\alpha) + 2\sum_{i=1}^d \beta_i + d}}, \quad \forall j \in \{1, \ldots, d\}.$$
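To make the exponents concrete, a small numerical check of the rate and bandwidth formulas of Theorem 1 (the parameter values are arbitrary illustrative choices):

```python
# Numerical check of the exponents in Theorem 1 (illustrative values).
alpha, gamma, d = 1.0, 2.0, 2        # margin, smoothness, dimension
beta = [1.0, 1.0]                    # decay exponents of F[eta] per coordinate

denom = gamma * (2 + alpha) + d + 2 * sum(beta)
tau_delta = gamma * alpha / denom        # rate exponent for d = d_Delta
tau_fg = gamma * (alpha + 1) / denom     # rate exponent for d = d_{f,g}

n = 10_000
lam_j = n ** (-1.0 / denom)              # common bandwidth lambda_j
print(tau_delta, tau_fg, lam_j)          # 0.1667, 0.3333, ~0.46
```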
Theorem 2: Lower bound (j.w. with C. Marteau)

Suppose $|\mathcal{F}[\eta](t)| \sim \prod_{i=1}^d |t_i|^{-\beta_i}$, $\beta_i > 1/2$, $i = 1, \ldots, d$. Then for $\alpha \le 1$,
$$\liminf_{n \to +\infty}\ \inf_{\hat{G}_n}\ \sup_{(f,g) \in \mathcal{G}(\alpha,\gamma)} n^{\tau_d(\alpha,\beta,\gamma)}\, \mathbb{E}_{f,g}\, d(\hat{G}_n, G^\star) > 0,$$
where the infimum is taken over all possible estimators of the set $G^\star$ and
$$\tau_d(\alpha, \beta, \gamma) = \begin{cases} \dfrac{\gamma\alpha}{\gamma(2+\alpha) + d + 2\sum_{i=1}^d \beta_i} & \text{for } d = d_\Delta,\\[2mm] \dfrac{\gamma(\alpha+1)}{\gamma(2+\alpha) + d + 2\sum_{i=1}^d \beta_i} & \text{for } d = d_{f,g}.\end{cases}$$
Conclusion (minimax)

              | Density estimation                         | Classification
Direct case   | $n^{-\frac{2\gamma}{2\gamma+1}}$           | $n^{-\frac{\gamma(\alpha+1)}{\gamma(\alpha+2)+d}}$
Noisy case    | $n^{-\frac{2\gamma}{2\gamma+2\beta+1}}$    | $n^{-\frac{\gamma(\alpha+1)}{\gamma(\alpha+2)+2\bar{\beta}+d}}$
Assumptions   | $f \in \Sigma(\gamma,L)$                   | $\mathbb{E}(Y=1\,|\,X=x) \in \Sigma(\gamma,L)$, margin parameter $\alpha \ge 0$
              | $|\mathcal{F}[\eta](t)| \sim |t|^{-\beta}$ | $|\mathcal{F}[\eta_j](t_j)| \sim |t_j|^{-\beta_j}\ \forall j = 1, \ldots, d$
Sketch of the proofs, heuristics

1. Noisy quantization (for simplicity)
2. Excess risk decomposition
3. Bias control (easy and minimax)
4. Variance control: key lemma
Other results (I)

- (Un)supervised classification with errors in variables:
$$R_\ell(\hat{g}_n^\lambda) - R_\ell(g^\star) \le C n^{-\frac{\kappa\gamma}{\gamma(2\kappa + \rho - 1) + (2\kappa - 1)\sum_{i=1}^d \beta_i}},$$
where $g^\star = \arg\min R_\ell(g,(X,Y))$.
- (Un)supervised classification with $Z_i \sim Af$, using
$$\hat{f}_n^N(x) = \sum_{k=1}^N \hat{\theta}_k \phi_k(x),$$
where $\hat{\theta}_k = b_k^{-1}\frac{1}{n}\sum_{i=1}^n \psi_k(Z_i)$, $A^*A\phi_k = b_k^2\phi_k$, and
$$f \in \Theta(\gamma,L) := \Big\{f = \sum_{k=1}^{\infty} \theta_k \phi_k : \sum_k \theta_k^2 k^{2\gamma+1} \le L\Big\}.$$
Other results (II)

- If $f \in \Sigma(\vec{\gamma},L)$, the anisotropic Hölder class:
$$R_\ell(\hat{g}_n^\lambda) - R_\ell(g^\star) \le C n^{-\frac{\kappa}{2\kappa + \rho - 1 + \zeta(\kappa,\beta,\vec{\gamma})}},$$
where:
$$\zeta(\kappa, \beta, \vec{\gamma}) = (2\kappa - 1)\sum_{j=1}^d \frac{\beta_j}{\gamma_j},$$
and $\lambda = (\lambda_1, \ldots, \lambda_d)$ is chosen as:
$$\lambda_j \sim n^{-\frac{2\kappa - 1}{2\gamma_j(2\kappa + \rho - 1 + \zeta(\kappa,\beta,\vec{\gamma}))}}, \quad \forall j = 1, \ldots, d.$$
- Non-exact oracle inequalities:
$$R_\ell(\hat{g}) \le (1 + \epsilon)\inf_{g \in \mathcal{G}} R_\ell(g) + C(\epsilon)\, n^{-\frac{\gamma}{\gamma(1+\rho) + \sum_{i=1}^d \beta_i}},$$
without margin assumption.
Finite dimensional clustering

Given $k$, we aim at:
$$c^\star \in \arg\min_{c = (c_1, \ldots, c_k) \in \mathbb{R}^{dk}} \mathbb{E} \min_{j=1,\ldots,k} \|X - c_j\|^2.$$
The empirical counterpart:
$$\hat{c}_n \in \arg\min_{c = (c_1, \ldots, c_k) \in \mathbb{R}^{dk}} \frac{1}{n}\sum_{i=1}^n \min_{j=1,\ldots,k} \|X_i - c_j\|^2,$$
gives rise to the popular $k$-means, studied in (Pollard, 1982).
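For reference, a minimal Lloyd-style iteration for this empirical criterion (a standard implementation sketch, not code from the talk):

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Lloyd's algorithm for the empirical k-means criterion."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: nearest center for each point.
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center becomes the mean of its cell.
        for j in range(k):
            if (labels == j).any():
                c[j] = X[labels == j].mean(axis=0)
    return c, labels
```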
Finite dimensional noisy clustering (j.w. with C. Brunet)

We want to approximate a solution of the stochastic minimization:
$$\min_{c = (c_1, \ldots, c_k) \in \mathbb{R}^{dk}} \frac{1}{n}\sum_{i=1}^n \gamma_\lambda(c, Z_i),$$
where
$$\gamma_\lambda(c, z) = \int_K \min_{j=1,\ldots,k} \|x - c_j\|^2\, \tilde{K}_\lambda(z - x)\,dx.$$
First order conditions (I)

Suppose $\|X\|_\infty \le M$ and Pollard's regularity assumptions are satisfied. Then, $\forall u \in \{1,\ldots,d\}$, $\forall j \in \{1,\ldots,k\}$, we have the following assertion:
$$c_j^u = \frac{\sum_{i=1}^n \int_{V_j} x_u\, \tilde{K}_\lambda(Z_i - x)\,dx}{\sum_{i=1}^n \int_{V_j} \tilde{K}_\lambda(Z_i - x)\,dx} \implies \widetilde{\nabla}_{uj} J_n^\lambda(c) = 0, \quad \text{where } J_n^\lambda(c) = \sum_{i=1}^n \gamma_\lambda(c, Z_i).$$

First order conditions (II)

- The standard k-means:
$$c_{u,j} = \frac{\sum_{i=1}^n X_{i,u}\, \mathbf{1}_{X_i \in V_j}}{\sum_{i=1}^n \mathbf{1}_{X_i \in V_j}} = \frac{\sum_{i=1}^n \int_{V_j} x_u\, \delta_{X_i}\,dx}{\sum_{i=1}^n \int_{V_j} \delta_{X_i}\,dx}, \quad \forall u, j,$$
where $\delta_{X_i}$ is the Dirac function at point $X_i$.
- Another look:
$$c_{u,j} = \frac{\int_{V_j} x_u\, \hat{f}_n(x)\,dx}{\int_{V_j} \hat{f}_n(x)\,dx}, \quad \forall u \in \{1,\ldots,d\},\ \forall j \in \{1,\ldots,k\},$$
where $\hat{f}_n(x) = \frac{1}{n}\sum_{i=1}^n \tilde{K}_\lambda(Z_i - x)$ is the kernel deconvolution estimator of the density $f$.
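These fixed-point equations suggest a simple grid-based iteration. A minimal sketch, assuming the deconvolution estimate $\hat{f}_n$ has already been evaluated on a discretization of $K$ (the grid approximation of the integrals is an implementation shortcut, not the algorithm as stated in the talk):

```python
import numpy as np

def noisy_kmeans(grid, f_hat, k, n_iter=50, seed=0):
    """Lloyd-style iteration solving the first-order conditions on a grid.

    `grid` is an (m, d) array discretizing K and `f_hat` the deconvolution
    density estimate on it; each center is updated as the f_hat-weighted
    mean of its Voronoi cell V_j, as in the fixed-point equations above.
    """
    w0 = np.clip(f_hat, 0.0, None)      # deconvolution estimates can be negative
    rng = np.random.default_rng(seed)
    c = grid[rng.choice(len(grid), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d2 = ((grid[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        cell = d2.argmin(axis=1)        # Voronoi cell index of each grid point
        for j in range(k):
            w = np.where(cell == j, w0, 0.0)
            if w.sum() > 0:
                c[j] = (grid * w[:, None]).sum(axis=0) / w.sum()
    return c
```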
The algorithm of Noisy K-means (j.w. with C. Brunet)

Experimental setting: simulation study (a simulation sketch after the model definitions below illustrates steps 1 and 4).
1. We draw i.i.d. sequences $(X_i)_{i=1,\ldots,n}$ (Gaussian mixtures) and $(\epsilon_i)_{i=1}^n$ (symmetric noise), for $n \in \{100, 500\}$.
2. We draw repetitions $(\epsilon_j)_{j=1,\ldots,m}$ with $m = 100$.
3. We compute the Noisy k-means clusters $\hat{c}$, with an estimation step of $f_\eta$ thanks to step 2.
4. We calculate the clustering risk:
$$r_n(\hat{c}) = \frac{1}{100}\sum_{i=1}^{100} \mathbf{1}\{X_i^j \notin V_j(\hat{c})\}.$$
Experimental setting - Model 1

For $u \in \{1, \ldots, 10\}$, we call Mod1(u):
$$Z_i = X_i + \epsilon_i(u), \quad i = 1, \ldots, n, \qquad \text{(Mod1(u))}$$
where:
- $(X_i)_{i=1}^n$ are i.i.d. with density $f = \frac{1}{2} f_{\mathcal{N}(0_2, I_2)} + \frac{1}{2} f_{\mathcal{N}((5,0)^T, I_2)}$,
- $(\epsilon_i(u))_{i=1}^n$ are i.i.d. with law $\mathcal{N}(0_2, \Sigma(u))$, where $\Sigma(u)$ is a diagonal matrix with diagonal vector $(0, u)^T$.
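A minimal sketch of steps 1 and 4 of the protocol for Mod1(u); the center/component matching is simplified here (in practice one minimizes the risk over label permutations):

```python
import numpy as np

def simulate_mod1(n, u, seed=0):
    """Draw the direct sample X_i ~ f, the noise eps_i(u), and Z_i = X_i + eps_i(u)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    means = np.array([[0.0, 0.0], [5.0, 0.0]])
    X = rng.normal(size=(n, 2)) + means[labels]        # 1/2 N(0_2,I_2) + 1/2 N((5,0),I_2)
    eps = rng.normal(size=(n, 2)) * np.sqrt([0.0, u])  # N(0_2, diag(0, u))
    return X + eps, X, labels

def clustering_risk(X, labels, centers):
    """Fraction of direct points X_i falling outside the cell of their component
    (assumes centers[j] corresponds to component j)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(np.mean(d2.argmin(axis=1) != labels))

Z, X, labels = simulate_mod1(n=500, u=4)
```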
Illustrations Mod1
Experimental setting - Model 2

For $u \in \{1, \ldots, 10\}$, we call Mod2(u):
$$Z_i = X_i(u) + \epsilon_i, \quad i = 1, \ldots, n, \qquad \text{(Mod2(u))}$$
where:
- $(X_i(u))_{i=1}^n$ are i.i.d. with density $f = \frac{1}{3} f_{\mathcal{N}(0_2, I_2)} + \frac{1}{3} f_{\mathcal{N}((a,b)^T, I_2)} + \frac{1}{3} f_{\mathcal{N}((b,a)^T, I_2)}$, where $(a,b) = (15 - (u-1)/2,\ 5 + (u-1)/2)$,
- $(\epsilon_i)_{i=1}^n$ are i.i.d. with law $\mathcal{N}(0_2, \Sigma)$, where $\Sigma$ is a diagonal matrix with diagonal vector $(5,5)^T$.
Illustrations Mod2
Results Mod1 for n = 100
Results Mod1 for n = 500
Results Mod2
Adaptation!

To get the optimal rates, we act as follows:
$$R(\hat{c}_\lambda, c^\star) \le \inf_\lambda \left\{ C_1\left(\frac{c(\lambda)}{\sqrt{n}}\right)^{2/(1+\rho)} + C_2\lambda^{2\gamma} \right\} \le C n^{-\frac{\gamma}{2\gamma(1+\rho) + 2\beta}},$$
where
$$\lambda^\star = O\big(n^{-\frac{1}{2\gamma(1+\rho) + 2\beta}}\big).$$

Goal: to choose the bandwidth based on Lepski's principle.
Empirical Risk Comparison (j.w. with M. Chichignoud)

We choose $\lambda$ as follows:
$$\hat{\lambda} = \max\big\{\lambda \in \Lambda : R_n^{\lambda'}(\hat{c}_\lambda) - R_n^{\lambda'}(\hat{c}_{\lambda'}) \le 3\delta_{\lambda'},\ \forall \lambda' \le \lambda\big\},$$
where $\delta_\lambda$ is defined as:
$$\delta_\lambda = C_{\mathrm{adapt}}\, \lambda^{-2\beta}\, \frac{\log n}{n},$$
with $C_{\mathrm{adapt}} > 0$ an explicit constant.
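In schematic Python, the ERC rule reads as follows; `emp_risk`, `minimizers` and `delta` stand for $R_n^{\lambda'}$, the $\lambda$-ERMs $\hat{c}_\lambda$ and the thresholds $\delta_{\lambda}$, and are hypothetical helpers assumed supplied by the user:

```python
def erc_select(lambdas, emp_risk, minimizers, delta):
    """Empirical Risk Comparison rule (sketch).

    lambdas    : increasing list of candidate bandwidths Lambda
    emp_risk   : emp_risk(lam_prime, c) -> R_n^{lam'}(c)
    minimizers : dict lam -> c_hat_lam, the lam-ERM for each bandwidth
    delta      : delta(lam) -> threshold delta_lam
    Returns the largest lam whose ERM is accepted at every smaller bandwidth.
    """
    selected = lambdas[0]
    for lam in lambdas:
        ok = all(
            emp_risk(lp, minimizers[lam]) - emp_risk(lp, minimizers[lp]) <= 3 * delta(lp)
            for lp in lambdas if lp <= lam
        )
        if ok:
            selected = lam
    return selected
```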
Adaptation: data-driven choices of $\lambda$
Uniform law for
Adaptation: stability of the ICI method
Real dataset: Iris
Lepski's method

- $\{\hat{f}_h,\ h \in \mathcal{H}\}$: a family of (kernel) estimators, indexed by a (bandwidth) $h \in \mathcal{H} \subset \mathbb{R}$.
- Bias-variance decomposition: $\|\hat{f}_h - f\| \le C\{B(h) + V(h)\}$, where (usually) $V(\cdot)$ is known.
- Related to minimax theory:
$$f \in \Sigma(\gamma,L) \Rightarrow \|\hat{f}_{h^\star(\gamma)} - f\| \le C\inf_h\{B(h) + V(h)\} = C\psi_n(\gamma).$$

Goal: a data-driven method to reach the bias-variance trade-off (minimax adaptive method).

Lepski's method: the rule

The rule:
$$\hat{h} = \max\{h > 0 : \forall h' \le h,\ \|\hat{f}_h - \hat{f}_{h'}\| \le cV(h')\}.$$
Heuristically:
$$\|\hat{f}_h - \hat{f}_{h'}\| \sim \|\hat{f}_h - f\| + \|f - \hat{f}_{h'}\| \sim B(h) + V(h) + B(h') + V(h') \underset{h' \le h}{\sim} B(h) + V(h').$$
The rule selects the biggest $h > 0$ such that:
$$\sup_{h' \le h} \frac{B(h) + V(h')}{V(h')} \le c \iff \forall h' \le h,\ B(h) \le (c-1)V(h').$$
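A schematic implementation of the rule; the container `estimators` (e.g. each estimate evaluated on a grid) and the known variance bound `V` are assumptions supplied by the user:

```python
import numpy as np

def lepski_select(bandwidths, estimators, V, c=2.0, norm=np.linalg.norm):
    """Lepski's rule (sketch): the largest h whose estimate stays within
    c * V(h') of every estimate at a smaller bandwidth h'.

    bandwidths : increasing list of candidate h
    estimators : dict h -> f_hat_h (values of each estimate on a grid)
    V          : V(h) -> known variance term
    """
    selected = bandwidths[0]
    for h in bandwidths:
        ok = all(
            norm(estimators[h] - estimators[hp]) <= c * V(hp)
            for hp in bandwidths if hp <= h
        )
        if ok:
            selected = h
    return selected
```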
Theorem 3: Adaptive upper bound (j.w. with M. Chichignoud)

Suppose $f \in \Sigma(\gamma,L)$, and that the noise assumption and Pollard's regularity assumptions are satisfied. Consider a kernel $K$ of order $\lfloor\gamma\rfloor$ which satisfies the kernel assumption. Then:
$$\limsup_{n \to +\infty}\ \left(\frac{\log n}{n}\right)^{-\frac{\gamma}{\gamma + \sum_{i=1}^d \beta_i}} \sup_{f \in \Sigma(\gamma,L)} \mathbb{E}\big[R(\hat{c}_{\hat{\lambda}}) - R(c^\star)\big] < +\infty,$$
where
$$\hat{c}_\lambda = \arg\min_{c \in \mathcal{C}} \sum_{i=1}^n \ell_\lambda(c, Z_i),$$
and $\hat{\lambda}$ is chosen with the ERC rule.
Proof for $\lambda^\star \in \{\lambda_1, \lambda_2\}$, $\lambda_1 < \lambda_2$

The rule becomes:
$$\hat{\lambda} = \lambda_1\, \mathbf{1}_\Omega + \lambda_2\, \mathbf{1}_{\Omega^C}, \quad \text{where } \Omega = \big\{R_n^{\lambda_1}(\hat{c}_{\lambda_2}, \hat{c}_{\lambda_1}) > C\delta_{\lambda_1}\big\}.$$

- Case 1: $\lambda^\star = \lambda_1 < \lambda_2$.
- Case 2: $\lambda^\star = \lambda_2 > \lambda_1$.
- Case 1: $\lambda^\star = \lambda_1 < \lambda_2$.
$$\mathbb{E}\, R(\hat{c}_{\hat{\lambda}}, c^\star) = \mathbb{E}\, R(\hat{c}_{\hat{\lambda}}, c^\star)(\mathbf{1}_\Omega + \mathbf{1}_{\Omega^C}) \le \psi_n(\lambda^\star) + \mathbb{E}\, R(\hat{c}_{\hat{\lambda}}, c^\star)\, \mathbf{1}_{\Omega^C}.$$
On $\Omega^C$, we have with high probability:
$$\begin{aligned}
R(\hat{c}_{\hat{\lambda}}, c^\star) &= (R - R_{\lambda^\star})(\hat{c}_{\hat{\lambda}}, c^\star) + (R_{\lambda^\star} - R_n^{\lambda^\star})(\hat{c}_{\hat{\lambda}}, c^\star) + R_n^{\lambda^\star}(\hat{c}_{\hat{\lambda}}, c^\star)\\
&\le B(\lambda^\star) + (R_{\lambda^\star} - R_n^{\lambda^\star})(\hat{c}_{\hat{\lambda}}, c^\star) + 3\delta_{\lambda^\star}\\
&\le B(\lambda^\star) + r_{\lambda^\star}(2\log n) + 3\delta_{\lambda^\star}\\
&\le C\psi_n(\lambda^\star),
\end{aligned}$$
where $r_\lambda(t)$ is such that $\mathbb{P}\big(\sup_c |R_n^\lambda - R_\lambda|(c, c^\star) \ge r_\lambda(t)\big) \le e^{-t}$.
- Case 2: $\lambda^\star = \lambda_2 > \lambda_1$.
$$\mathbb{E}\, R(\hat{c}_{\hat{\lambda}}, c^\star) \le \psi_n(\lambda^\star) + \mathbb{E}\, R(\hat{c}_{\hat{\lambda}}, c^\star)\, \mathbf{1}_\Omega \le \psi_n(\lambda^\star) + \mathbb{P}(\Omega), \quad \text{where } \Omega = \big\{R_n^{\lambda_1}(\hat{c}_{\lambda_2}, \hat{c}_{\lambda_1}) > C\delta_{\lambda_1}\big\}.$$
$$\begin{aligned}
R_n^{\lambda_1}(\hat{c}_{\lambda_2}, \hat{c}_{\lambda_1}) &= (R_n^{\lambda_1} - R_{\lambda_1})(\hat{c}_{\lambda_2}, \hat{c}_{\lambda_1}) + (R_{\lambda_1} - R)(\hat{c}_{\lambda_2}, \hat{c}_{\lambda_1}) + R(\hat{c}_{\lambda_2}, \hat{c}_{\lambda_1})\\
&\le (R_n^{\lambda_1} - R_{\lambda_1})(\hat{c}_{\lambda_2}, \hat{c}_{\lambda_1}) + 2B(\lambda_1) + R(\hat{c}_{\lambda_2}, c^\star).
\end{aligned}$$
Since $B(\lambda_1) < B(\lambda_2) = B(\lambda^\star) = \delta_{\lambda^\star} = \delta_{\lambda_2} < \delta_{\lambda_1}$ and using Bousquet's inequality twice, we have with probability $1 - 2n^{-2}$:
$$R_n^{\lambda_1}(\hat{c}_{\lambda_2}, \hat{c}_{\lambda_1}) \le 2r_{\lambda_1}(2\log n) + 2\delta_{\lambda_1} + B(\lambda_2) + \delta_{\lambda_2} \le C\delta_{\lambda_1}.$$
ERC's Extension

Consider a family of $\lambda$-ERM $\{\hat{g}_\lambda, \lambda > 0\}$. Assume:
1. There exists an increasing function $\mathrm{Bias}(\cdot)$ such that, for all $g \in \mathcal{G}$:
$$(R_\lambda - R)(g, g^\star) \le \mathrm{Bias}(\lambda) + \frac{1}{4}R(g, g^\star).$$
2. There exists a decreasing function $\mathrm{Var}_t(\cdot)$ ($t \ge 0$) such that, $\forall \lambda, t > 0$:
$$\mathbb{P}\left(\sup_{g \in \mathcal{G}}\Big\{(R_n^\lambda - R_\lambda)(g, g^\star) - \frac{1}{4}R(g, g^\star)\Big\} > \mathrm{Var}_t(\lambda)\right) \le e^{-t}.$$
Then there exists a universal constant $C_3$ such that, for all $t \ge 0$:
$$\mathbb{E}\, R(\hat{g}_{\hat{\lambda}}, g^\star) \le C_3\left[\inf_{\lambda \in \Delta}\big\{\mathrm{Bias}(\lambda) + \mathrm{Var}_t(\lambda)\big\} + e^{-t}\right].$$
Examples

- Nonparametric estimation:
  - Image denoising: $R_n^\lambda(f_t) = \sum_i (Y_i - f_t)^2 K_\lambda(X_i - x_0)$.
  - Local robust regression (see the sketch after this list): $R_n^\lambda(t) = \sum_i \rho(Y_i - t) K_\lambda(X_i - x_0)$.
  - Fitted local likelihood: $R_n^\lambda(\theta) = \sum_i -\log p(Y_i, \theta) K_\lambda(X_i - x_0)$.
- Inverse Statistical Learning:
  - Quantile estimation: $R_n^\lambda(q) = \sum_i \int (x - q)(\tau - \mathbf{1}_{x \le q})\,\tilde{K}_\lambda(Z_i - x)\,dx$.
  - Learning principal curves: $R_n^\lambda(g) = \sum_i \int \inf_t \|x - g(t)\|^2\, \tilde{K}_\lambda(Z_i - x)\,dx$.
  - Binary classification: $R_n^\lambda(G) = \sum_i \int \mathbf{1}_{Y_i \neq \mathbf{1}(x \in G)}\, \tilde{K}_\lambda(Z_i - x)\,dx$.
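To make the first family concrete, a minimal local robust regression sketch minimizing $R_n^\lambda(t)$ over a grid of levels $t$; Huber's $\rho$ and the Gaussian kernel are illustrative choices, not prescribed by the talk.

```python
import numpy as np

def local_robust_fit(X, Y, x0, lam, delta=1.0, grid=None):
    """Minimize R_n^lam(t) = sum_i rho(Y_i - t) K_lam(X_i - x0) over a grid
    of candidate levels t, with Huber's rho and a Gaussian kernel."""
    if grid is None:
        grid = np.linspace(Y.min(), Y.max(), 200)
    w = np.exp(-0.5 * ((X - x0) / lam) ** 2)           # K_lam(X_i - x0)
    r = Y[None, :] - grid[:, None]                     # residuals Y_i - t
    rho = np.where(np.abs(r) <= delta,
                   0.5 * r**2,
                   delta * (np.abs(r) - 0.5 * delta))  # Huber loss
    return grid[np.argmin((rho * w[None, :]).sum(axis=1))]
```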
Open problems

- Anisotropic case
- Margin adaptation
- Model selection
Conclusion

Thanks for your attention!