Optimal model selection

Sylvain Arlot
CNRS; École Normale Supérieure (Paris), LIENS, Willow Team

EMS 2009, Toulouse, 23/07/2009
Statistical framework: regression on a random design

Data: $(X_1, Y_1), \dots, (X_n, Y_n) \in \mathcal{X} \times \mathcal{Y}$ i.i.d., $(X_i, Y_i) \sim P$ unknown.

Model: $Y = s(X) + \sigma(X)\,\varepsilon$, with $X \in \mathcal{X} \subset \mathbb{R}^d$ and $Y \in \mathcal{Y} = [0, 1]$ or $\mathbb{R}$.

Noise $\varepsilon$: $\mathbb{E}[\varepsilon \mid X] = 0$, $\mathbb{E}[\varepsilon^2 \mid X] = 1$; noise level $\sigma(X)$.

Goal: build a predictor $t : \mathcal{X} \to \mathcal{Y}$.
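A minimal simulation sketch of this framework (Python); the particular choices of $s$, $\sigma$ and of a uniform design are assumptions of the sketch, loosely following the simulation setting quoted at the end of the talk:

```python
import numpy as np

def simulate(n, s, sigma, seed=0):
    """Draw (X_1, Y_1), ..., (X_n, Y_n) i.i.d. from Y = s(X) + sigma(X) * eps,
    with X uniform on [0, 1] (an assumption of this sketch) and eps standard
    Gaussian, so that E[eps | X] = 0 and E[eps^2 | X] = 1."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=n)
    eps = rng.standard_normal(n)
    return X, s(X) + sigma(X) * eps

# Example: a heteroscedastic setting in the spirit of the simulations below
X, Y = simulate(n=200, s=np.sin, sigma=lambda x: x)
```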
Loss function, least-squares estimator

Least-squares risk: $\mathbb{E}[\gamma(t, (X, Y))] = P\gamma(t, \cdot)$ with $\gamma(t, (x, y)) = (t(x) - y)^2$.

Loss function: $\ell(s, t) = P\gamma(t, \cdot) - P\gamma(s, \cdot) = \mathbb{E}\big[(t(X) - s(X))^2\big]$.

Empirical risk minimizer on $S_m$ (= model):
$$\hat{s}_m \in \arg\min_{t \in S_m} P_n \gamma(t, \cdot) = \arg\min_{t \in S_m} \frac{1}{n} \sum_{i=1}^{n} (t(X_i) - Y_i)^2 .$$

E.g., histograms on a partition $(I_\lambda)_{\lambda \in \Lambda_m}$ of $\mathcal{X}$:
$$\hat{s}_m = \sum_{\lambda \in \Lambda_m} \hat{\beta}_\lambda \mathbf{1}_{I_\lambda}, \qquad \hat{\beta}_\lambda = \frac{1}{\operatorname{Card}\{i : X_i \in I_\lambda\}} \sum_{i : X_i \in I_\lambda} Y_i .$$
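A short sketch of this histogram (regressogram) estimator, assuming $\mathcal{X} = [0, 1]$ and a partition given by its bin edges; the helper name is hypothetical:

```python
import numpy as np

def regressogram(X, Y, edges):
    """Least-squares histogram estimator on the partition of [0, 1] defined by
    `edges`: on each cell I_lambda the estimator equals the mean of the Y_i
    whose X_i fall in that cell (beta_hat_lambda); empty cells are set to 0."""
    inner = edges[1:-1]                       # interior cell boundaries
    idx = np.digitize(X, inner)               # cell index of each X_i
    D = len(edges) - 1
    beta = np.zeros(D)
    for lam in range(D):
        in_cell = (idx == lam)
        if in_cell.any():
            beta[lam] = Y[in_cell].mean()     # beta_hat_lambda
    def s_hat(x):
        return beta[np.digitize(x, inner)]
    return s_hat, beta
```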
Model selection

$(S_m)_{m \in \mathcal{M}} \longrightarrow (\hat{s}_m)_{m \in \mathcal{M}} \longrightarrow \hat{s}_{\hat{m}}$ ???

Goals:

Oracle inequality (in expectation, or with a large probability):
$$\ell(s, \hat{s}_{\hat{m}}) \leq C \inf_{m \in \mathcal{M}} \big\{ \ell(s, \hat{s}_m) + R(m, n) \big\}$$

Adaptivity (provided $(S_m)_{m \in \mathcal{M}_n}$ is well chosen), e.g., to the smoothness of $s$ or to the variations of $\sigma$.
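To make the oracle comparison concrete, a sketch of how such a ratio can be checked on simulated data, where $s$ is known; the Monte Carlo approximation of the loss and the uniform design are assumptions of the sketch:

```python
import numpy as np

def excess_loss(s, s_hat, n_mc=100_000, seed=1):
    """Monte Carlo approximation of l(s, s_hat) = E[(s_hat(X) - s(X))^2],
    assuming X uniform on [0, 1]; s must be known, so this is for
    simulated data only."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n_mc)
    return float(np.mean((s_hat(x) - s(x)) ** 2))

# Oracle-inequality check: with losses[m] = excess_loss(s, s_hat_m) for every
# model m and m_hat the selected model,
#     ratio = losses[m_hat] / min(losses.values())
# should stay within a constant C of 1 for a good selection procedure.
```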
Penalization

$$\hat{m} \in \arg\min_{m \in \mathcal{M}} \big\{ P_n \gamma(\hat{s}_m) + \operatorname{pen}(m) \big\}$$

Unbiased risk estimation principle
$\Rightarrow$ ideal penalty: $\operatorname{pen}_{\mathrm{id}}(m) = (P - P_n)\big(\gamma(\hat{s}_m, \cdot)\big)$

$\operatorname{pen}(m) = \dfrac{2 \sigma^2 D_m}{n}$ (Mallows 1973), $\operatorname{pen}(m) = \dfrac{2 \hat{\sigma}^2 D_m}{n}$, or $K \widehat{D}_m$.
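A sketch of this penalized selection with Mallows' $C_p$ over regular histogram models; it reuses the `regressogram` helper sketched earlier, and takes the noise variance `sigma2` as known (an assumption):

```python
import numpy as np

def select_mallows(X, Y, dims, sigma2):
    """Penalized least-squares model selection with Mallows' C_p:
    pen(m) = 2 * sigma2 * D_m / n, over regular histograms on [0, 1]
    with D_m bins (sigma2 = noise variance, assumed known here)."""
    n = len(X)
    crit = {}
    for D in dims:
        edges = np.linspace(0.0, 1.0, D + 1)
        s_hat, _ = regressogram(X, Y, edges)         # empirical risk minimizer
        emp_risk = np.mean((s_hat(X) - Y) ** 2)      # P_n gamma(s_hat_m)
        crit[D] = emp_risk + 2.0 * sigma2 * D / n    # penalized criterion
    return min(crit, key=crit.get)                   # m_hat
```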
Limitations of linear penalties: illustration

$Y = X + (1 + \mathbf{1}_{X \leq 1/2})\,\varepsilon$, $n = 1000$ data points.

Models: regular histograms on $[0, 1/2]$ ($D_{m,1}$ bins), then regular histograms on $[1/2, 1]$ ($D_{m,2}$ bins).

[Figure: a data sample; oracle model $D_{m,1} = 1$, $D_{m,2} = 3$.]

The ideal penalty is not a linear function of the dimension.
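A sketch of the model collection used in this illustration, indexed by the pair $(D_{m,1}, D_{m,2})$ of bin counts on the two halves of $[0, 1]$; the helper name and the grid of pairs are assumptions:

```python
import numpy as np

def two_piece_edges(D1, D2):
    """Bin edges of the model m = (D_{m,1}, D_{m,2}): D1 regular bins on
    [0, 1/2] followed by D2 regular bins on [1/2, 1]; dimension D_m = D1 + D2."""
    left = np.linspace(0.0, 0.5, D1 + 1)
    right = np.linspace(0.5, 1.0, D2 + 1)
    return np.concatenate([left, right[1:]])

# Model collection of the illustration: all pairs (D1, D2) in a finite grid.
models = [(D1, D2) for D1 in range(1, 11) for D2 in range(1, 11)]
```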
Limitations of linear penalties: $\hat{m}(K^\star) \neq m^\star$

[Figure: density of $(D_{\hat{m}(K^\star),1}, D_{\hat{m}(K^\star),2})$ and of $(D_{m^\star,1}, D_{m^\star,2})$ over $N = 1000$ samples; panels: $\hat{m}(K^\star)$ and $m^\star$.]
Limitations of linear penalties: theory

$Y = X + \sigma(X)\,\varepsilon$ with $X \sim \mathcal{U}([0, 1])$, $\mathbb{E}[\varepsilon \mid X] = 0$, $\mathbb{E}[\varepsilon^2 \mid X] = 1$, and
$$\int_0^{1/2} (\sigma(x))^2 \, dx \neq \int_{1/2}^{1} (\sigma(x))^2 \, dx .$$

Models: regular histograms on $[0, 1/2]$ ($1 \leq D_{m,1} \leq n / (2 \ln(n)^2)$ bins), then regular histograms on $[1/2, 1]$ ($1 \leq D_{m,2} \leq n / (2 \ln(n)^2)$ bins).

Theorem (A. 2008, arXiv:0812.3141)
There exist constants $C, \eta > 0$ (depending only on $\sigma(\cdot)$) and an event of probability at least $1 - C n^{-2}$ on which
$$\forall K > 0, \quad \forall \hat{m}(K) \in \arg\min_{m \in \mathcal{M}_n} \big\{ P_n \gamma(\hat{s}_m) + K D_m \big\}, \qquad \ell(s, \hat{s}_{\hat{m}(K)}) \geq (1 + \eta) \inf_{m \in \mathcal{M}_n} \big\{ \ell(s, \hat{s}_m) \big\} .$$
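A small experiment sketch of what the theorem rules out: scan a grid of constants $K$ and compare the loss of $\hat{m}(K)$ with the oracle loss. It reuses the `regressogram`, `two_piece_edges` and `excess_loss` helpers sketched earlier; the grid choices are assumptions:

```python
import numpy as np

def best_ratio_over_K(X, Y, s, models, K_grid):
    """For each constant K, select m_hat(K) minimizing P_n gamma(s_hat_m) + K * D_m
    over the two-piece histogram models, then return the best achievable ratio
    min_K  l(s, s_hat_{m_hat(K)}) / inf_m l(s, s_hat_m).
    The theorem states that this ratio stays >= 1 + eta with high probability."""
    emp, loss, dim = {}, {}, {}
    for (D1, D2) in models:
        s_hat, _ = regressogram(X, Y, two_piece_edges(D1, D2))
        emp[(D1, D2)] = float(np.mean((s_hat(X) - Y) ** 2))   # P_n gamma(s_hat_m)
        loss[(D1, D2)] = excess_loss(s, s_hat)                 # l(s, s_hat_m)
        dim[(D1, D2)] = D1 + D2                                # D_m
    oracle = min(loss.values())
    return min(loss[min(emp, key=lambda m: emp[m] + K * dim[m])] / oracle
               for K in K_grid)
```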
Cross-validation

Split the data: $(X_1, Y_1), \dots, (X_q, Y_q)$ (training) and $(X_{q+1}, Y_{q+1}), \dots, (X_n, Y_n)$ (validation).

$$\hat{s}_m^{(e)} \in \arg\min_{t \in S_m} \bigg\{ \sum_{i=1}^{q} \gamma(t, (X_i, Y_i)) \bigg\}, \qquad P_n^{(v)} = \frac{1}{n - q} \sum_{i=q+1}^{n} \delta_{(X_i, Y_i)} \quad \Rightarrow \quad P_n^{(v)} \gamma\big(\hat{s}_m^{(e)}\big)$$

V-fold cross-validation: $(B_j)_{1 \leq j \leq V}$ partition of $\{1, \dots, n\}$
$$\Rightarrow \quad \hat{m} \in \arg\min_{m \in \mathcal{M}} \frac{1}{V} \sum_{j=1}^{V} P_n^{(j)} \gamma\big(\hat{s}_m^{(-j)}\big), \qquad \tilde{s} = \hat{s}_{\hat{m}} .$$
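A sketch of V-fold cross-validation for choosing the number of bins of a regular histogram; it reuses the `regressogram` helper above, and the random fold assignment is just one possible choice of the partition $(B_j)$:

```python
import numpy as np

def vfold_select(X, Y, dims, V=5, seed=2):
    """V-fold cross-validation over regular histograms with D bins on [0, 1]:
    for each model, average the validation risks P_n^{(j)} gamma(s_hat_m^{(-j)})
    over a random partition (B_j) of {1, ..., n}, then pick the minimizer."""
    n = len(X)
    folds = np.random.default_rng(seed).permutation(n) % V    # B_1, ..., B_V
    cv = {}
    for D in dims:
        edges = np.linspace(0.0, 1.0, D + 1)
        risks = []
        for j in range(V):
            train, val = folds != j, folds == j
            s_hat, _ = regressogram(X[train], Y[train], edges)   # s_hat_m^{(-j)}
            risks.append(np.mean((s_hat(X[val]) - Y[val]) ** 2)) # validation risk
        cv[D] = np.mean(risks)
    return min(cv, key=cv.get)                                   # m_hat
```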
Bias of cross-validation

Ideal criterion: $P\gamma(\hat{s}_m)$.

Regression on a model of histograms with $D_m$ bins ($\sigma(X) \equiv \sigma$ for simplicity):
$$\mathbb{E}[P\gamma(\hat{s}_m)] \approx P\gamma(s_m) + \frac{D_m \sigma^2}{n}$$
$$\mathbb{E}\Big[P_n^{(j)} \gamma\big(\hat{s}_m^{(-j)}\big)\Big] = \mathbb{E}\Big[P\gamma\big(\hat{s}_m^{(-j)}\big)\Big] \approx P\gamma(s_m) + \frac{V}{V - 1} \cdot \frac{D_m \sigma^2}{n}$$

$\Rightarrow$ bias if $V$ is fixed ("overpenalization").
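A quick numeric illustration of this bias factor:

```python
# Inflation of the variance term D_m * sigma^2 / n in the expected V-fold
# criterion, relative to the ideal criterion: factor V / (V - 1).
for V in (2, 5, 10, 20):
    print(f"V = {V:2d}: overpenalization factor = {V / (V - 1):.3f}")
# V = 2 doubles the variance term; the bias only vanishes as V grows.
```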
Suboptimality of V-fold cross-validation

$Y = X + \sigma\varepsilon$ with $\varepsilon$ bounded and $\sigma > 0$; $\mathcal{M}$: family of regular histograms on $\mathcal{X} = [0, 1]$; $\hat{m}$ selected by V-fold cross-validation with $V$ fixed as $n$ grows.

Theorem (A. 2008, arXiv:0802.0566)
With probability at least $1 - \diamond\, n^{-2}$,
$$\ell(s, \hat{s}_{\hat{m}}) \geq (1 + \kappa(V)) \inf_{m \in \mathcal{M}} \big\{ \ell(s, \hat{s}_m) \big\}$$
with $\kappa(V) > 0$.
Simulations: sin, $n = 200$, $\sigma(x) = x$, 2 bin sizes

[Figure: two panels, $x \in [0, 1]$, $y \in [-4, 4]$.]