HAL Id: hal-00668212
https://hal.archives-ouvertes.fr/hal-00668212
Submitted on 9 Feb 2012
Classification and regression based on derivatives: a consistency result
Nathalie Villa-Vialaneix, Fabrice Rossi
To cite this version:
Nathalie Villa-Vialaneix, Fabrice Rossi. Classification and regression based on derivatives: a consistency result. II Simposio sobre Modelamiento Estadístico, Dec 2010, Valparaíso, Chile. hal-00668212.
Classification and regression based on derivatives: a consistency result
Nathalie Villa-Vialaneix (Joint work with Fabrice Rossi)
http://www.nathalievilla.org
II Simposio sobre Modelamiento Estadístico
Valparaíso, December 3rd, 2010
Introduction and motivations
Outline
1 Introduction and motivations
2 A general consistency result
Introduction and motivations
Regression and classification from an infinite dimensional predictor

Settings
$(X, Y)$ is a random pair of variables where
- $Y \in \{-1, 1\}$ (binary classification problem) or $Y \in \mathbb{R}$;
- $X \in (\mathcal{X}, \langle \cdot, \cdot \rangle_{\mathcal{X}})$, an infinite dimensional Hilbert space.
We are given a learning set $S_n = \{(X_i, Y_i)\}_{i=1}^n$ of $n$ i.i.d. copies of $(X, Y)$.

Purpose: Find $\phi_n : \mathcal{X} \to \{-1, 1\}$ or $\mathbb{R}$ that is universally consistent:
- Regression case: $\lim_{n \to +\infty} \mathbb{E}\left[\phi_n(X) - Y\right]^2 = L^*$, where $L^* = \inf_{\phi : \mathcal{X} \to \mathbb{R}} \mathbb{E}\left[\phi(X) - Y\right]^2$ will also be called the Bayes risk.
- Classification case: $\lim_{n \to +\infty} \mathbb{P}(\phi_n(X) \neq Y) = L^*$, where $L^* = \inf_{\phi : \mathcal{X} \to \{-1,1\}} \mathbb{P}(\phi(X) \neq Y)$.
Introduction and motivations
Using derivatives

In practice, $X^{(m)}$ is often more relevant than $X$ for the prediction. But $X \to X^{(m)}$ induces information loss and
$$\inf_{\phi : D^m\mathcal{X} \to \{-1,1\}} \mathbb{P}\left(\phi(X^{(m)}) \neq Y\right) \geq \inf_{\phi : \mathcal{X} \to \{-1,1\}} \mathbb{P}(\phi(X) \neq Y) = L^*$$
and
$$\inf_{\phi : D^m\mathcal{X} \to \mathbb{R}} \mathbb{E}\left[\phi(X^{(m)}) - Y\right]^2 \geq \inf_{\phi : \mathcal{X} \to \mathbb{R}} \mathbb{E}\left[\phi(X) - Y\right]^2 = L^*.$$
Introduction and motivations
Sampled functions

In practice, the $(X_i)_i$ are not perfectly known; only a discrete sampling is given: $X_i^{\tau_d} = (X_i(t))_{t \in \tau_d}$ where $\tau_d = \{t_1^{\tau_d}, \ldots, t_{|\tau_d|}^{\tau_d}\}$. The sampling can be non uniform...

Then $X_i^{(m)}$ is estimated from $X_i^{\tau_d}$ by $\widehat{X}^{(m)}_{\tau_d}$, which also induces information loss:
$$\inf_{\phi : D^m\mathcal{X} \to \{-1,1\}} \mathbb{P}\left(\phi(\widehat{X}^{(m)}_{\tau_d}) \neq Y\right) \geq \inf_{\phi : D^m\mathcal{X} \to \{-1,1\}} \mathbb{P}\left(\phi(X^{(m)}) \neq Y\right) \geq L^*$$
and
$$\inf_{\phi : D^m\mathcal{X} \to \mathbb{R}} \mathbb{E}\left[\phi(\widehat{X}^{(m)}_{\tau_d}) - Y\right]^2 \geq \inf_{\phi : D^m\mathcal{X} \to \mathbb{R}} \mathbb{E}\left[\phi(X^{(m)}) - Y\right]^2 \geq L^*.$$
Introduction and motivations
Purpose of the presentation

Find a classifier or a regression function $\phi_{n,\tau_d}$ built from $\widehat{X}^{(m)}_{\tau_d}$ such that the risk of $\phi_{n,\tau_d}$ asymptotically reaches the Bayes risk $L^*$:
$$\lim_{|\tau_d| \to +\infty} \lim_{n \to +\infty} \mathbb{P}\left(\phi_{n,\tau_d}(\widehat{X}^{(m)}_{\tau_d}) \neq Y\right) = L^* \quad \text{or} \quad \lim_{|\tau_d| \to +\infty} \lim_{n \to +\infty} \mathbb{E}\left[\phi_{n,\tau_d}(\widehat{X}^{(m)}_{\tau_d}) - Y\right]^2 = L^*.$$

Main idea: use a relevant way to estimate $X^{(m)}$ from $X^{\tau_d}$ (by smoothing splines) and combine the consistency of splines with the consistency of an $\mathbb{R}^{|\tau_d|}$-classifier or regression function.
A general consistency result
Outline
1 Introduction and motivations
2 A general consistency result
A general consistency result
Basics about smoothing splines I

Suppose that $\mathcal{X}$ is the Sobolev space
$$\mathcal{H}^m = \left\{ h \in L^2([0,1]) : \forall\, j = 1, \ldots, m,\ D^j h \text{ exists (weak sense) and } D^m h \in L^2 \right\}$$
equipped with the scalar product
$$\langle u, v \rangle_{\mathcal{H}^m} = \langle D^m u, D^m v \rangle_{L^2} + \sum_{j=1}^m B^j u\, B^j v,$$
where the $B^j$ are $m$ boundary conditions such that $\mathrm{Ker}\, B \cap \mathcal{P}^{m-1} = \{0\}$.

$(\mathcal{H}^m, \langle \cdot, \cdot \rangle_{\mathcal{H}^m})$ is a RKHS: there exist $k_0 : \mathcal{P}^{m-1} \times \mathcal{P}^{m-1} \to \mathbb{R}$ and $k_1 : \mathrm{Ker}\, B \times \mathrm{Ker}\, B \to \mathbb{R}$ such that
$$\forall\, u \in \mathcal{P}^{m-1},\ t \in [0,1],\quad \langle u, k_0(t, \cdot) \rangle_{\mathcal{H}^m} = u(t)$$
and
$$\forall\, u \in \mathrm{Ker}\, B,\ t \in [0,1],\quad \langle u, k_1(t, \cdot) \rangle_{\mathcal{H}^m} = u(t).$$
See [Berlinet and Thomas-Agnan, 2004] for further details.
A general consistency result
Basics about smoothing splines II

A simple example of boundary conditions: $h(0) = h^{(1)}(0) = \ldots = h^{(m-1)}(0) = 0$. Then,
$$k_0(s, t) = \sum_{k=0}^{m-1} \frac{t^k s^k}{(k!)^2} \quad \text{and} \quad k_1(s, t) = \int_0^1 \frac{(t - w)_+^{m-1} (s - w)_+^{m-1}}{((m-1)!)^2}\, dw.$$
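Numerically, both kernels are straightforward to evaluate. The sketch below is my own illustration (not from the talk), written in Python under the assumptions of this simple example; the integral defining $k_1$ is approximated by a midpoint rule.

```python
# A minimal numerical sketch of the two reproducing kernels of H^m for the
# boundary conditions h(0) = h'(0) = ... = h^{(m-1)}(0) = 0.
import numpy as np
from math import factorial

def k0(s, t, m):
    """Kernel of the polynomial part P^{m-1}: sum_{k<m} (st)^k / (k!)^2."""
    return sum((s * t) ** k / factorial(k) ** 2 for k in range(m))

def k1(s, t, m, n_grid=2000):
    """Kernel of Ker B: int_0^1 (t-w)_+^{m-1} (s-w)_+^{m-1} / ((m-1)!)^2 dw,
    approximated by a midpoint rule on [0, 1]."""
    w = (np.arange(n_grid) + 0.5) / n_grid
    integrand = (np.maximum(t - w, 0.0) ** (m - 1)
                 * np.maximum(s - w, 0.0) ** (m - 1)) / factorial(m - 1) ** 2
    return integrand.mean()

# Sanity check: for m = 2, k0(s, t) = 1 + st.
print(k0(0.3, 0.7, m=2))   # 1.21
print(k1(0.3, 0.7, m=2))
```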
A general consistency result
Estimating the predictors with smoothing splines I

Assumption (A1)
- $|\tau_d| \geq m - 1$;
- the sampling points are distinct in $[0, 1]$;
- the $B^j$ are linearly independent from $h \to h(t)$ for all $t \in \tau_d$.

[Kimeldorf and Wahba, 1971]: for $x^{\tau_d}$ in $\mathbb{R}^{|\tau_d|}$, there exists a unique $\hat{x}_{\lambda,\tau_d} \in \mathcal{H}^m$ solution of
$$\arg\min_{h \in \mathcal{H}^m} \frac{1}{|\tau_d|} \sum_{l=1}^{|\tau_d|} \left(h(t_l) - x_l^{\tau_d}\right)^2 + \lambda \int_{[0,1]} \left(h^{(m)}(t)\right)^2 dt,$$
and $\hat{x}_{\lambda,\tau_d} = S_{\lambda,\tau_d} x^{\tau_d}$ where $S_{\lambda,\tau_d} : \mathbb{R}^{|\tau_d|} \to \mathcal{H}^m$.

These assumptions are fulfilled by the previous simple example as long as $0 \notin \tau_d$.
A general consistency result
Estimating the predictors with smoothing splines II

$S_{\lambda,\tau_d}$ is given by
$$S_{\lambda,\tau_d} = \omega^T \left(U (K_1 + \lambda I_{|\tau_d|})^{-1} U^T\right)^{-1} U (K_1 + \lambda I_{|\tau_d|})^{-1} + \eta^T (K_1 + \lambda I_{|\tau_d|})^{-1} \left(I_{|\tau_d|} - U^T \left(U (K_1 + \lambda I_{|\tau_d|})^{-1} U^T\right)^{-1} U (K_1 + \lambda I_{|\tau_d|})^{-1}\right) = \omega^T M_0 + \eta^T M_1$$
with
- $\{\omega_1, \ldots, \omega_m\}$ a basis of $\mathcal{P}^{m-1}$, $\omega = (\omega_1, \ldots, \omega_m)^T$ and $U = (\omega_i(t))_{i=1,\ldots,m;\ t \in \tau_d}$;
- $\eta = (k_1(t, \cdot))^T_{t \in \tau_d}$ and $K_1 = (k_1(t, t'))_{t,t' \in \tau_d}$.

The observations of the predictor $X$ (NIR spectra) are then estimated from their sampling $X^{\tau_d}$ by $\widehat{X}_{\lambda,\tau_d}$.
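As a concrete illustration, here is a sketch (my own, under the assumptions $m = 2$, boundary conditions $h(0) = h'(0) = 0$, the basis $\{1, t\}$ of $\mathcal{P}^{m-1}$, and the function `k1` from the previous sketch) of the matrices $M_0$ and $M_1$ of the smoother.

```python
# Sketch of the matrix form S = omega^T M0 + eta^T M1 of the spline smoother
# (assumptions: m = 2, basis {1, t} of P^{m-1}, k1 from the sketch above).
import numpy as np

def spline_smoother_matrices(tau, lam, m=2):
    """Return (M0, M1, K1) for the smoother at sampling points tau."""
    U = np.vstack([np.ones_like(tau), tau])                   # m x |tau_d|
    K1 = np.array([[k1(s, t, m) for t in tau] for s in tau])  # Gram of k1
    Minv = np.linalg.inv(K1 + lam * np.eye(len(tau)))
    M0 = np.linalg.inv(U @ Minv @ U.T) @ U @ Minv             # m x |tau_d|
    M1 = Minv @ (np.eye(len(tau)) - U.T @ M0)                 # |tau_d| x |tau_d|
    return M0, M1, K1

tau = np.linspace(0.05, 1.0, 20)     # sampling grid avoiding t = 0
M0, M1, K1 = spline_smoother_matrices(tau, lam=1e-3)
```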
A general consistency result
Two important consequences

1. No information loss:
$$\inf_{\phi : \mathcal{H}^m \to \{-1,1\}} \mathbb{P}\left(\phi(\widehat{X}_{\lambda,\tau_d}) \neq Y\right) = \inf_{\phi : \mathbb{R}^{|\tau_d|} \to \{-1,1\}} \mathbb{P}\left(\phi(X^{\tau_d}) \neq Y\right)$$
and
$$\inf_{\phi : \mathcal{H}^m \to \mathbb{R}} \mathbb{E}\left[\phi(\widehat{X}_{\lambda,\tau_d}) - Y\right]^2 = \inf_{\phi : \mathbb{R}^{|\tau_d|} \to \mathbb{R}} \mathbb{E}\left[\phi(X^{\tau_d}) - Y\right]^2.$$

2. Easy way to use derivatives: there is a symmetric, positive definite matrix $M_{\lambda,\tau_d}$ such that $(u^{\tau_d})^T M_{\lambda,\tau_d} v^{\tau_d} = \langle \hat{u}_{\lambda,\tau_d}, \hat{v}_{\lambda,\tau_d} \rangle_{\mathcal{H}^m}$, and hence
$$(Q_{\lambda,\tau_d} u^{\tau_d})^T (Q_{\lambda,\tau_d} v^{\tau_d}) = \langle \hat{u}_{\lambda,\tau_d}, \hat{v}_{\lambda,\tau_d} \rangle_{\mathcal{H}^m} \simeq \langle \hat{u}^{(m)}_{\lambda,\tau_d}, \hat{v}^{(m)}_{\lambda,\tau_d} \rangle_{L^2},$$
where $Q_{\lambda,\tau_d}$ is the Cholesky triangle of $M_{\lambda,\tau_d}$: $Q_{\lambda,\tau_d}^T Q_{\lambda,\tau_d} = M_{\lambda,\tau_d}$.

Remark: $Q_{\lambda,\tau_d}$ is calculated only from the RKHS, $\lambda$ and $\tau_d$: it does not depend on the data set.
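Continuing the sketch above, one possible way to assemble $M_{\lambda,\tau_d}$ and its Cholesky triangle is shown below; the decomposition $M = M_0^T M_0 + M_1^T K_1 M_1$ is my own derivation under the assumption that, with the basis $\{1, t\}$ and these boundary conditions, the polynomial block of the $\mathcal{H}^m$ product is the identity.

```python
# Sketch: assemble M_{lambda,tau_d} and Q (assumption: the H^m inner product
# of two smoothed curves splits as (M0 u)^T (M0 v) + (M1 u)^T K1 (M1 v)).
M = M0.T @ M0 + M1.T @ K1 @ M1        # symmetric, positive definite here
Q = np.linalg.cholesky(M).T           # upper triangle, Q^T Q = M

# (Q u)^T (Q v) is then the H^m inner product of the two smoothed curves:
u = np.sin(2 * np.pi * tau)
v = np.cos(2 * np.pi * tau)
print((Q @ u) @ (Q @ v))
```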
A general consistency result
Classification and regression based on derivatives

Suppose that we know a consistent classifier or regression function in $\mathbb{R}^{|\tau_d|}$ that is based on the $\mathbb{R}^{|\tau_d|}$ scalar product or norm. The corresponding derivative based classifier or regression function is given by using the norm induced by $Q_{\lambda,\tau_d}$.

Example: Nonparametric kernel regression, with $\Psi : u \in \mathbb{R}^{|\tau_d|} \mapsto \frac{\sum_{i=1}^n T_i K\left(\|u - U_i\|_{\mathbb{R}^{|\tau_d|}} / h_n\right)}{\sum_{i=1}^n K\left(\|u - U_i\|_{\mathbb{R}^{|\tau_d|}} / h_n\right)}$:
$$\phi_{n,d} = \Psi \circ Q_{\lambda,\tau_d} : x \in \mathcal{H}^m \mapsto \frac{\sum_{i=1}^n Y_i\, K\left(\frac{\|Q_{\lambda,\tau_d} x^{\tau_d} - Q_{\lambda,\tau_d} X_i^{\tau_d}\|_{\mathbb{R}^{|\tau_d|}}}{h_n}\right)}{\sum_{i=1}^n K\left(\frac{\|Q_{\lambda,\tau_d} x^{\tau_d} - Q_{\lambda,\tau_d} X_i^{\tau_d}\|_{\mathbb{R}^{|\tau_d|}}}{h_n}\right)}.$$
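A sketch of this derivative based kernel regression on toy data (my own illustration; `Q` and `tau` come from the sketches above, and the Gaussian kernel and bandwidth `h` are arbitrary choices):

```python
# Sketch of phi_{n,d} = Psi o Q: Nadaraya-Watson regression in R^{|tau_d|}
# applied to Q-transformed samplings (toy data; Gaussian kernel K).
import numpy as np

def nw_predict(x_new, X, Y, Q, h):
    """Nadaraya-Watson estimate with the norm induced by Q."""
    dist = np.linalg.norm((X - x_new) @ Q.T, axis=1)   # ||Q x - Q X_i||
    w = np.exp(-0.5 * (dist / h) ** 2)                 # Gaussian weights
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, len(tau)))                 # 100 toy sampled curves
Y = X.sum(axis=1) + rng.normal(scale=0.1, size=100)  # toy responses
print(nw_predict(X[0], X, Y, Q, h=1.0))
```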
A general consistency result
Remark for consistency

Classification case (approximately the same is true for regression):
$$\mathbb{P}\left(\phi_{n,\tau_d}(\widehat{X}_{\lambda,\tau_d}) \neq Y\right) - L^* = \left[\mathbb{P}\left(\phi_{n,\tau_d}(\widehat{X}_{\lambda,\tau_d}) \neq Y\right) - L_d^*\right] + \left[L_d^* - L^*\right]$$
where $L_d^* = \inf_{\phi : \mathbb{R}^{|\tau_d|} \to \{-1,1\}} \mathbb{P}(\phi(X^{\tau_d}) \neq Y)$.

1. For all fixed $d$,
$$\lim_{n \to +\infty} \mathbb{P}\left(\phi_{n,\tau_d}(\widehat{X}_{\lambda,\tau_d}) \neq Y\right) = L_d^*$$
as long as the $\mathbb{R}^{|\tau_d|}$-classifier is consistent, because there is a one-to-one mapping between $X^{\tau_d}$ and $\widehat{X}_{\lambda,\tau_d}$.

2. $L_d^* - L^* \leq \mathbb{E}\left|\mathbb{E}(Y \mid \widehat{X}_{\lambda,\tau_d}) - \mathbb{E}(Y \mid X)\right|$: with the consistency of the spline estimate $\widehat{X}_{\lambda,\tau_d}$ and an assumption on the regularity of $\mathbb{E}(Y \mid X = \cdot)$, consistency would be proved.
A general consistency result
Spline consistency

Let $\lambda$ depend on $d$ and denote by $(\lambda_d)_d$ the sequence of regularization parameters. Also introduce
$$\overline{\Delta \tau_d} := \max\{t_1, t_2 - t_1, \ldots, 1 - t_{|\tau_d|}\}, \qquad \underline{\Delta \tau_d} := \min_{1 \leq i < |\tau_d|} \{t_{i+1} - t_i\}.$$

Assumption (A2)
- there exists $R$ such that $\overline{\Delta \tau_d} / \underline{\Delta \tau_d} \leq R$ for all $d$;
- $\lim_{d \to +\infty} |\tau_d| = +\infty$;
- $\lim_{d \to +\infty} \lambda_d = 0$.
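The quasi-uniformity condition in (A2) is easy to check for a given design; a small sketch of my own:

```python
# Sketch: check the mesh-ratio condition of (A2) for a sampling grid tau.
import numpy as np

def mesh_ratio(tau):
    """Largest gap (including the boundary gaps t_1 and 1 - t_last) divided
    by the smallest interior gap; (A2) asks this to stay bounded over d."""
    all_gaps = np.diff(np.concatenate(([0.0], tau, [1.0])))
    return np.max(all_gaps) / np.min(np.diff(tau))

print(mesh_ratio(np.linspace(0.05, 1.0, 20)))   # uniform grid: ratio ~ 1
```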
A general consistency result
Bayes risk consistency

Assumption (A3a): $\mathbb{E}\|D^m X\|^2_{L^2}$ is finite and $Y \in \{-1, 1\}$,
or
Assumption (A3b): $\tau_d \subset \tau_{d+1}$ for all $d$ and $\mathbb{E}(Y^2)$ is finite.

Under (A1)-(A3), $\lim_{d \to +\infty} L_d^* = L^*$.
A general consistency result
Proof under assumption (A3a)

Assumption (A3a): $\mathbb{E}\|D^m X\|^2_{L^2}$ is finite and $Y \in \{-1, 1\}$.

The proof is based on a result of [Faragó and Györfi, 1975]: for a pair of random variables $(X, Y)$ taking their values in $\mathcal{X} \times \{-1, 1\}$, where $\mathcal{X}$ is an arbitrary metric space, and for a sequence of functions $T_d : \mathcal{X} \to \mathcal{X}$ such that
$$\mathbb{E}(\delta(T_d(X), X)) \xrightarrow{d \to +\infty} 0,$$
we have $\lim_{d \to +\infty} \inf_{\phi : \mathcal{X} \to \{-1,1\}} \mathbb{P}(\phi(T_d(X)) \neq Y) = L^*$.

Here:
- $T_d$ is the spline estimate based on the sampling;
- the inequality of [Ragozin, 1983] about this estimate is exactly the assumption of Faragó and Györfi's theorem.
A general consistency result
Proof under assumption (A3b)

Assumption (A3b): $\tau_d \subset \tau_{d+1}$ for all $d$ and $\mathbb{E}(Y^2)$ is finite.

Under (A3b), $(\mathbb{E}(Y \mid \widehat{X}_{\lambda_d,\tau_d}))_d$ is a uniformly bounded martingale and thus converges in $L^1$-norm. Using the consistency of $(\widehat{X}_{\lambda_d,\tau_d})_d$ to $X$ ends the proof.
A general consistency result
Concluding result (consistency)

Theorem
Under assumptions (A1)-(A3),
$$\lim_{|\tau_d| \to +\infty} \lim_{n \to +\infty} \mathbb{P}\left(\phi_{n,\tau_d}(\widehat{X}_{\lambda_d,\tau_d}) \neq Y\right) = L^*$$
and
$$\lim_{|\tau_d| \to +\infty} \lim_{n \to +\infty} \mathbb{E}\left[\phi_{n,\tau_d}(\widehat{X}_{\lambda_d,\tau_d}) - Y\right]^2 = L^*.$$

Proof: For an $\epsilon > 0$, fix $d_0$ such that, for all $d \geq d_0$, $L_d^* - L^* \leq \epsilon/2$. Then conclude by the consistency of the $\mathbb{R}^{|\tau_d|}$-classifier or regression function.
A general consistency result
A practical application to SVM I

Recall that, for a learning set $(U_i, T_i)_{i=1,\ldots,n}$ in $\mathbb{R}^p \times \{-1, 1\}$, the Gaussian SVM is the classifier
$$u \in \mathbb{R}^p \mapsto \mathrm{Sign}\left(\sum_{i=1}^n \alpha_i T_i e^{-\gamma \|u - U_i\|^2_{\mathbb{R}^p}}\right)$$
where the $(\alpha_i)_i$ satisfy the quadratic optimization problem
$$\arg\min_w \sum_{i=1}^n \left[1 - T_i w(U_i)\right]_+ + C \|w\|^2_{\mathcal{S}}$$
with $w(u) = \sum_{i=1}^n \alpha_i e^{-\gamma \|u - U_i\|^2_{\mathbb{R}^p}}$, $\mathcal{S}$ the RKHS associated with the Gaussian kernel, and $C$ a regularization parameter. Under suitable assumptions, [Steinwart, 2002] proves the consistency of SVM classifiers.
A general consistency result
A practical application to SVM II

Additional assumptions related to SVM:
Assumptions (A4)
- For all $d$, the regularization parameter depends on $n$ in such a way that $\lim_{n \to +\infty} n C_n^d = +\infty$ and $C_n^d = O(n^{\beta_d - 1})$ for a $0 < \beta_d < 1/d$.
- For all $d$, there is a bounded subset $\mathcal{B}_d$ of $\mathbb{R}^{|\tau_d|}$ such that $X^{\tau_d}$ belongs to $\mathcal{B}_d$.

Result: Under assumptions (A1)-(A4), the SVM
$$\phi_{n,d} : x \in \mathcal{H}^m \mapsto \mathrm{Sign}\left(\sum_{i=1}^n \alpha_i Y_i e^{-\gamma \|Q_{\lambda_d,\tau_d} x^{\tau_d} - Q_{\lambda_d,\tau_d} X_i^{\tau_d}\|^2_{\mathbb{R}^{|\tau_d|}}}\right) \simeq \mathrm{Sign}\left(\sum_{i=1}^n \alpha_i Y_i e^{-\gamma \|x^{(m)} - X_i^{(m)}\|^2_{L^2}}\right)$$
is consistent: $\lim_{|\tau_d| \to +\infty} \lim_{n \to +\infty} \mathbb{P}\left(\phi_{n,\tau_d}(\widehat{X}_{\lambda_d,\tau_d}) \neq Y\right) = L^*$.
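In practice this classifier is just a standard Gaussian SVM trained on the Q-transformed samplings. A sketch with scikit-learn on toy data (my own; `Q` and `tau` come from the earlier sketches, and the hyper-parameters `gamma` and `C` are arbitrary rather than tuned):

```python
# Sketch: derivative-based Gaussian SVM = ordinary RBF SVM on rows Q X_i.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, len(tau)))                # toy sampled curves
y = np.sign(X.sum(axis=1) + rng.normal(size=200))   # toy labels in {-1, 1}

clf = SVC(kernel="rbf", gamma=0.1, C=1.0)
clf.fit(X @ Q.T, y)                  # train on Q-transformed samplings
print(clf.predict(X[:5] @ Q.T))
```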
A general consistency result
Additional remark about the link between $n$ and $|\tau_d|$

Under suitable (and usual) regularity assumptions on $\mathbb{E}(Y \mid X = \cdot)$, and if $n \sim \nu |\tau_d| \log |\tau_d|$, the rate of convergence of this method is of order $d^{-\frac{2\nu}{2\nu + 1}}$, where $\nu$ is either equal to $m$ or to a Lipschitz constant related to $\mathbb{E}(Y \mid X = \cdot)$.
Examples
Outline
1 Introduction and motivations
2 A general consistency result
3 Examples
Examples
Chosen regression method: kernel ridge regression

Recall that kernel ridge regression in $\mathbb{R}^p$ is given by solving
$$\arg\min_w \sum_{i=1}^n (T_i - w(U_i))^2 + C \|w\|^2_{\mathcal{S}}$$
where $\mathcal{S}$ is a RKHS induced by a given kernel (such as the Gaussian kernel) and $(U_i, T_i)_i$ is a training sample in $\mathbb{R}^p \times \mathbb{R}$.

In the following examples, $U_i$ is either:
- the original (sampled) functions $X_i$ (viewed as $\mathbb{R}^{|\tau_d|}$ vectors);
- $Q_{\lambda,\tau_d} X_i^{\tau_d}$ for derivatives of order 1 or 2.
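A sketch of the two variants with scikit-learn's `KernelRidge` on toy data (my own illustration; `Q` and `tau` come from the earlier sketches, and the hyper-parameters are arbitrary here rather than CV-tuned as in the experiments):

```python
# Sketch: Gaussian kernel ridge regression on raw samplings vs. on the
# Q-transformed samplings (derivative-based variant).
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, len(tau)))   # toy R^{|tau_d|} predictors
T = rng.normal(size=100)               # toy real-valued responses

krr_raw = KernelRidge(kernel="rbf", gamma=0.1, alpha=1.0).fit(X, T)
krr_der = KernelRidge(kernel="rbf", gamma=0.1, alpha=1.0).fit(X @ Q.T, T)
print(krr_raw.predict(X[:3]), krr_der.predict(X[:3] @ Q.T))
```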
Examples
Example 1: Predicting yellow berry in durum wheat from NIR spectra

953 wheat samples were analyzed:
- NIR spectrometry: 1049 wavelengths regularly ranging from 400 to 2498 nm;
- Yellow berry: manual count (%) of affected grains.

Methodology for comparison:
- Split the data into train/test sets (50 times);
- Train 50 regression functions on the 50 train sets (hyper-parameters tuned by CV);
- Evaluate these regression functions by calculating the MSE on the 50 corresponding test sets.
Examples
Example 1: Predicting yellow berry in durum wheat from NIR spectra

Kernel (SVM)                             MSE on test (and sd $\times 10^{-3}$)
Linear (L)                               0.122 (8.77)
Linear on derivatives (L(1))             0.138 (9.53)
Linear on second derivatives (L(2))      0.122 (1.71)
Gaussian (G)                             0.110 (20.2)
Gaussian on derivatives (G(1))           0.098 (7.92)
Gaussian on second derivatives (G(2))    0.094 (8.35)
Examples
Comparison with PLS...

Method                   MSE (mean)   MSE (sd)
PLS                      0.154        0.012
Kernel PLS               0.154        0.013
KRR splines (reg. D2)    0.094        0.008

Error decrease: almost 40%.

[Figure: boxplots of the test MSE for SVM-D2, KPLS and PLS.]
Examples
Example 2: Simulated noisy spectra

Noisy data: $\widehat{X}_i(t) = X_i(t) + \epsilon_{it}$, $\epsilon_{it} \sim \mathcal{N}(0, 0.01)$, i.i.d.
Examples
Methodology for comparison

- Split the data into train/test sets (250 times);
- Train 250 regression functions on the 250 train sets (hyper-parameters tuned by CV), with the predictors being:
  - the original (sampled) functions $X_i$ (viewed as $\mathbb{R}^{|\tau_d|}$ vectors);
  - $Q_{\lambda,\tau_d} X_i^{\tau_d}$ for derivatives of order 1 or 2: smoothing splines derivatives;
  - $Q_{0,\tau_d} X_i^{\tau_d}$ for derivatives of order 1 or 2: interpolating splines derivatives;
  - derivatives of order 1 or 2 evaluated by $\frac{X_i(t_{j+1}) - X_i(t_j)}{t_{j+1} - t_j}$: finite differences derivatives (see the sketch below);
- Evaluate these regression functions by calculating the MSE on the 250 corresponding test sets.
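For completeness, a small sketch (my own) of the finite-differences baseline:

```python
# Sketch of the finite-differences derivative estimate used as a baseline:
# (X_i(t_{j+1}) - X_i(t_j)) / (t_{j+1} - t_j), applied once (order 1) or
# twice (order 2).
import numpy as np

def finite_diff(X, tau):
    """First-order finite-difference derivatives of sampled curves (rows)."""
    return np.diff(X, axis=1) / np.diff(tau)

tau_grid = np.linspace(0.0, 1.0, 50)
X = np.sin(2 * np.pi * tau_grid)[None, :]    # one toy curve
d1 = finite_diff(X, tau_grid)                # approx. first derivative
d2 = finite_diff(d1, tau_grid[:-1])          # approx. second derivative
```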
Examples
Performances

[Figure: test MSE of the regression functions built on the different derivative estimates.]
References

Berlinet, A. and Thomas-Agnan, C. (2004). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers.

Faragó, T. and Györfi, L. (1975). On the continuity of the error distortion function for multiple-hypothesis decisions. IEEE Transactions on Information Theory, 21(4):458–460.

Kimeldorf, G. and Wahba, G. (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33(1):82–95.

Ragozin, D. (1983). Error bounds for derivative estimation based on spline smoothing of exact or noisy data. Journal of Approximation Theory, 37:335–355.

Steinwart, I. (2002). Support vector machines are universally consistent. Journal of Complexity, 18:768–791.