HAL Id: hal-00668212
https://hal.archives-ouvertes.fr/hal-00668212
Submitted on 9 Feb 2012
Classification and regression based on derivatives: a consistency result
Nathalie Villa-Vialaneix, Fabrice Rossi
To cite this version:
Nathalie Villa-Vialaneix, Fabrice Rossi. Classification and regression based on derivatives: a consistency result. II Simposio sobre Modelamiento Estadístico, Dec 2010, Valparaíso, Chile. hal-00668212.
Classification and regression based on derivatives: a consistency result
Nathalie Villa-Vialaneix (Joint work with Fabrice Rossi)
http://www.nathalievilla.org
II Simposio sobre Modelamiento Estadístico
Valparaíso, December 3rd, 2010
Introduction and motivations
Outline
1 Introduction and motivations
2 A general consistency result
Introduction and motivations
Regression and classification from an infinite dimensional predictor

Settings
$(X, Y)$ is a random pair of variables where
- $Y \in \{-1, 1\}$ (binary classification problem) or $Y \in \mathbb{R}$;
- $X \in (\mathcal{X}, \langle \cdot, \cdot \rangle_{\mathcal{X}})$, an infinite dimensional Hilbert space.
We are given a learning set $S_n = \{(X_i, Y_i)\}_{i=1}^n$ of $n$ i.i.d. copies of $(X, Y)$.

Purpose: Find $\phi_n : \mathcal{X} \to \{-1, 1\}$ or $\mathbb{R}$ that is universally consistent:
- Regression case: $\lim_{n \to +\infty} \mathbb{E}\left[\phi_n(X) - Y\right]^2 = L^*$, where $L^* = \inf_{\phi : \mathcal{X} \to \mathbb{R}} \mathbb{E}\left[\phi(X) - Y\right]^2$ will also be called the Bayes risk.
- Classification case: $\lim_{n \to +\infty} \mathbb{P}(\phi_n(X) \neq Y) = L^*$, where $L^* = \inf_{\phi : \mathcal{X} \to \{-1,1\}} \mathbb{P}(\phi(X) \neq Y)$.
Introduction and motivations
Using derivatives

In practice, $X^{(m)}$ is often more relevant than $X$ for the prediction. But $X \to X^{(m)}$ induces information loss and
$$\inf_{\phi : D^m\mathcal{X} \to \{-1,1\}} \mathbb{P}\left(\phi(X^{(m)}) \neq Y\right) \geq \inf_{\phi : \mathcal{X} \to \{-1,1\}} \mathbb{P}(\phi(X) \neq Y) = L^*$$
and
$$\inf_{\phi : D^m\mathcal{X} \to \mathbb{R}} \mathbb{E}\left[\phi(X^{(m)}) - Y\right]^2 \geq \inf_{\phi : \mathcal{X} \to \mathbb{R}} \mathbb{E}\left[\phi(X) - Y\right]^2 = L^*.$$
Introduction and motivations
Sampled functions

In practice, the $(X_i)_i$ are not perfectly known; only a discrete sampling is given: $X_i^{\tau_d} = (X_i(t))_{t \in \tau_d}$ where $\tau_d = \{t_1^{\tau_d}, \ldots, t_{|\tau_d|}^{\tau_d}\}$. The sampling can be non uniform...

Then $X_i^{(m)}$ is estimated from $X_i^{\tau_d}$ by $\widehat{X}^{(m)}_{\tau_d}$, which also induces information loss:
$$\inf_{\phi : D^m\mathcal{X} \to \{-1,1\}} \mathbb{P}\left(\phi(\widehat{X}^{(m)}_{\tau_d}) \neq Y\right) \geq \inf_{\phi : D^m\mathcal{X} \to \{-1,1\}} \mathbb{P}\left(\phi(X^{(m)}) \neq Y\right) \geq L^*$$
and
$$\inf_{\phi : D^m\mathcal{X} \to \mathbb{R}} \mathbb{E}\left[\phi(\widehat{X}^{(m)}_{\tau_d}) - Y\right]^2 \geq \inf_{\phi : D^m\mathcal{X} \to \mathbb{R}} \mathbb{E}\left[\phi(X^{(m)}) - Y\right]^2 \geq L^*.$$
Introduction and motivations
Purpose of the presentation

Find a classifier or a regression function $\phi_{n,\tau_d}$ built from $\widehat{X}^{(m)}_{\tau_d}$ such that the risk of $\phi_{n,\tau_d}$ asymptotically reaches the Bayes risk $L^*$:
$$\lim_{|\tau_d| \to +\infty} \lim_{n \to +\infty} \mathbb{P}\left(\phi_{n,\tau_d}(\widehat{X}^{(m)}_{\tau_d}) \neq Y\right) = L^* \quad \text{or} \quad \lim_{|\tau_d| \to +\infty} \lim_{n \to +\infty} \mathbb{E}\left[\phi_{n,\tau_d}(\widehat{X}^{(m)}_{\tau_d}) - Y\right]^2 = L^*.$$

Main idea: use a relevant way to estimate $X^{(m)}$ from $X^{\tau_d}$ (by smoothing splines) and combine the consistency of splines with the consistency of an $\mathbb{R}^{|\tau_d|}$-classifier or regression function.
A general consistency result
Outline
1 Introduction and motivations
2 A general consistency result
A general consistency result
Basics about smoothing splines I

Suppose that $\mathcal{X}$ is the Sobolev space
$$\mathcal{H}^m = \left\{ h \in L^2([0,1]) : \forall\, j = 1, \ldots, m,\ D^j h \text{ exists (weak sense) and } D^m h \in L^2 \right\}$$
equipped with the scalar product
$$\langle u, v \rangle_{\mathcal{H}^m} = \langle D^m u, D^m v \rangle_{L^2} + \sum_{j=1}^m B^j u\, B^j v,$$
where the $B^j$ are $m$ boundary conditions such that $\mathrm{Ker}\, B \cap \mathcal{P}^{m-1} = \{0\}$.

$(\mathcal{H}^m, \langle \cdot, \cdot \rangle_{\mathcal{H}^m})$ is a RKHS: there exist $k_0 : \mathcal{P}^{m-1} \times \mathcal{P}^{m-1} \to \mathbb{R}$ and $k_1 : \mathrm{Ker}\, B \times \mathrm{Ker}\, B \to \mathbb{R}$ such that
$$\forall\, u \in \mathcal{P}^{m-1},\ t \in [0,1],\quad \langle u, k_0(t, \cdot) \rangle_{\mathcal{H}^m} = u(t)$$
and
$$\forall\, u \in \mathrm{Ker}\, B,\ t \in [0,1],\quad \langle u, k_1(t, \cdot) \rangle_{\mathcal{H}^m} = u(t).$$
See [Berlinet and Thomas-Agnan, 2004] for further details.
A general consistency result
Basics about smoothing splines II

A simple example of boundary conditions: $h(0) = h^{(1)}(0) = \ldots = h^{(m-1)}(0) = 0$. Then,
$$k_0(s, t) = \sum_{k=0}^{m-1} \frac{t^k s^k}{(k!)^2} \quad \text{and} \quad k_1(s, t) = \int_0^1 \frac{(t - w)_+^{m-1} (s - w)_+^{m-1}}{((m-1)!)^2}\, dw.$$
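Numerically, both kernels are straightforward to evaluate. The sketch below is my own illustration (not from the talk), written in Python under the assumptions of this simple example; the integral defining $k_1$ is approximated by a midpoint rule.

```python
# A minimal numerical sketch of the two reproducing kernels of H^m for the
# boundary conditions h(0) = h'(0) = ... = h^{(m-1)}(0) = 0.
import numpy as np
from math import factorial

def k0(s, t, m):
    """Kernel of the polynomial part P^{m-1}: sum_{k<m} (st)^k / (k!)^2."""
    return sum((s * t) ** k / factorial(k) ** 2 for k in range(m))

def k1(s, t, m, n_grid=2000):
    """Kernel of Ker B: int_0^1 (t-w)_+^{m-1} (s-w)_+^{m-1} / ((m-1)!)^2 dw,
    approximated by a midpoint rule on [0, 1]."""
    w = (np.arange(n_grid) + 0.5) / n_grid
    integrand = (np.maximum(t - w, 0.0) ** (m - 1)
                 * np.maximum(s - w, 0.0) ** (m - 1)) / factorial(m - 1) ** 2
    return integrand.mean()

# Sanity check: for m = 2, k0(s, t) = 1 + st.
print(k0(0.3, 0.7, m=2))   # 1.21
print(k1(0.3, 0.7, m=2))
```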
A general consistency result
Estimating the predictors with smoothing splines I

Assumption (A1)
- $|\tau_d| \geq m - 1$;
- the sampling points are distinct in $[0, 1]$;
- the $B^j$ are linearly independent from $h \to h(t)$ for all $t \in \tau_d$.

[Kimeldorf and Wahba, 1971]: for $x^{\tau_d}$ in $\mathbb{R}^{|\tau_d|}$, there exists a unique $\hat{x}_{\lambda,\tau_d} \in \mathcal{H}^m$ solution of
$$\arg\min_{h \in \mathcal{H}^m} \frac{1}{|\tau_d|} \sum_{l=1}^{|\tau_d|} \left(h(t_l) - x_l^{\tau_d}\right)^2 + \lambda \int_{[0,1]} \left(h^{(m)}(t)\right)^2 dt,$$
and $\hat{x}_{\lambda,\tau_d} = S_{\lambda,\tau_d} x^{\tau_d}$ where $S_{\lambda,\tau_d} : \mathbb{R}^{|\tau_d|} \to \mathcal{H}^m$.

These assumptions are fulfilled by the previous simple example as long as $0 \notin \tau_d$.
A general consistency result
Estimating the predictors with smoothing splines II

$S_{\lambda,\tau_d}$ is given by
$$S_{\lambda,\tau_d} = \omega^T \left(U (K_1 + \lambda I_{|\tau_d|})^{-1} U^T\right)^{-1} U (K_1 + \lambda I_{|\tau_d|})^{-1} + \eta^T (K_1 + \lambda I_{|\tau_d|})^{-1} \left(I_{|\tau_d|} - U^T \left(U (K_1 + \lambda I_{|\tau_d|})^{-1} U^T\right)^{-1} U (K_1 + \lambda I_{|\tau_d|})^{-1}\right) = \omega^T M_0 + \eta^T M_1$$
with
- $\{\omega_1, \ldots, \omega_m\}$ a basis of $\mathcal{P}^{m-1}$, $\omega = (\omega_1, \ldots, \omega_m)^T$ and $U = (\omega_i(t))_{i=1,\ldots,m;\ t \in \tau_d}$;
- $\eta = (k_1(t, \cdot))^T_{t \in \tau_d}$ and $K_1 = (k_1(t, t'))_{t,t' \in \tau_d}$.

The observations of the predictor $X$ (NIR spectra) are then estimated from their sampling $X^{\tau_d}$ by $\widehat{X}_{\lambda,\tau_d}$.
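As a concrete illustration, here is a sketch (my own, under the assumptions $m = 2$, boundary conditions $h(0) = h'(0) = 0$, the basis $\{1, t\}$ of $\mathcal{P}^{m-1}$, and the function `k1` from the previous sketch) of the matrices $M_0$ and $M_1$ of the smoother.

```python
# Sketch of the matrix form S = omega^T M0 + eta^T M1 of the spline smoother
# (assumptions: m = 2, basis {1, t} of P^{m-1}, k1 from the sketch above).
import numpy as np

def spline_smoother_matrices(tau, lam, m=2):
    """Return (M0, M1, K1) for the smoother at sampling points tau."""
    U = np.vstack([np.ones_like(tau), tau])                   # m x |tau_d|
    K1 = np.array([[k1(s, t, m) for t in tau] for s in tau])  # Gram of k1
    Minv = np.linalg.inv(K1 + lam * np.eye(len(tau)))
    M0 = np.linalg.inv(U @ Minv @ U.T) @ U @ Minv             # m x |tau_d|
    M1 = Minv @ (np.eye(len(tau)) - U.T @ M0)                 # |tau_d| x |tau_d|
    return M0, M1, K1

tau = np.linspace(0.05, 1.0, 20)     # sampling grid avoiding t = 0
M0, M1, K1 = spline_smoother_matrices(tau, lam=1e-3)
```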
A general consistency result
Two important consequences

1. No information loss:
$$\inf_{\phi : \mathcal{H}^m \to \{-1,1\}} \mathbb{P}\left(\phi(\widehat{X}_{\lambda,\tau_d}) \neq Y\right) = \inf_{\phi : \mathbb{R}^{|\tau_d|} \to \{-1,1\}} \mathbb{P}\left(\phi(X^{\tau_d}) \neq Y\right)$$
and
$$\inf_{\phi : \mathcal{H}^m \to \mathbb{R}} \mathbb{E}\left[\phi(\widehat{X}_{\lambda,\tau_d}) - Y\right]^2 = \inf_{\phi : \mathbb{R}^{|\tau_d|} \to \mathbb{R}} \mathbb{E}\left[\phi(X^{\tau_d}) - Y\right]^2.$$

2. Easy way to use derivatives: there is a symmetric, positive definite matrix $M_{\lambda,\tau_d}$ such that $(u^{\tau_d})^T M_{\lambda,\tau_d} v^{\tau_d} = \langle \hat{u}_{\lambda,\tau_d}, \hat{v}_{\lambda,\tau_d} \rangle_{\mathcal{H}^m}$, and hence
$$(Q_{\lambda,\tau_d} u^{\tau_d})^T (Q_{\lambda,\tau_d} v^{\tau_d}) = \langle \hat{u}_{\lambda,\tau_d}, \hat{v}_{\lambda,\tau_d} \rangle_{\mathcal{H}^m} \simeq \langle \hat{u}^{(m)}_{\lambda,\tau_d}, \hat{v}^{(m)}_{\lambda,\tau_d} \rangle_{L^2},$$
where $Q_{\lambda,\tau_d}$ is the Cholesky triangle of $M_{\lambda,\tau_d}$: $Q_{\lambda,\tau_d}^T Q_{\lambda,\tau_d} = M_{\lambda,\tau_d}$.

Remark: $Q_{\lambda,\tau_d}$ is calculated only from the RKHS, $\lambda$ and $\tau_d$: it does not depend on the data set.
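Continuing the sketch above, one possible way to assemble $M_{\lambda,\tau_d}$ and its Cholesky triangle is shown below; the decomposition $M = M_0^T M_0 + M_1^T K_1 M_1$ is my own derivation under the assumption that, with the basis $\{1, t\}$ and these boundary conditions, the polynomial block of the $\mathcal{H}^m$ product is the identity.

```python
# Sketch: assemble M_{lambda,tau_d} and Q (assumption: the H^m inner product
# of two smoothed curves splits as (M0 u)^T (M0 v) + (M1 u)^T K1 (M1 v)).
M = M0.T @ M0 + M1.T @ K1 @ M1        # symmetric, positive definite here
Q = np.linalg.cholesky(M).T           # upper triangle, Q^T Q = M

# (Q u)^T (Q v) is then the H^m inner product of the two smoothed curves:
u = np.sin(2 * np.pi * tau)
v = np.cos(2 * np.pi * tau)
print((Q @ u) @ (Q @ v))
```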
A general consistency result
Classification and regression based on derivatives

Suppose that we know a consistent classifier or regression function in $\mathbb{R}^{|\tau_d|}$ that is based on the $\mathbb{R}^{|\tau_d|}$ scalar product or norm. The corresponding derivative based classifier or regression function is given by using the norm induced by $Q_{\lambda,\tau_d}$.

Example: Nonparametric kernel regression, with $\Psi : u \in \mathbb{R}^{|\tau_d|} \mapsto \frac{\sum_{i=1}^n T_i K\left(\|u - U_i\|_{\mathbb{R}^{|\tau_d|}} / h_n\right)}{\sum_{i=1}^n K\left(\|u - U_i\|_{\mathbb{R}^{|\tau_d|}} / h_n\right)}$:
$$\phi_{n,d} = \Psi \circ Q_{\lambda,\tau_d} : x \in \mathcal{H}^m \mapsto \frac{\sum_{i=1}^n Y_i\, K\left(\frac{\|Q_{\lambda,\tau_d} x^{\tau_d} - Q_{\lambda,\tau_d} X_i^{\tau_d}\|_{\mathbb{R}^{|\tau_d|}}}{h_n}\right)}{\sum_{i=1}^n K\left(\frac{\|Q_{\lambda,\tau_d} x^{\tau_d} - Q_{\lambda,\tau_d} X_i^{\tau_d}\|_{\mathbb{R}^{|\tau_d|}}}{h_n}\right)}.$$
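A sketch of this derivative based kernel regression on toy data (my own illustration; `Q` and `tau` come from the sketches above, and the Gaussian kernel and bandwidth `h` are arbitrary choices):

```python
# Sketch of phi_{n,d} = Psi o Q: Nadaraya-Watson regression in R^{|tau_d|}
# applied to Q-transformed samplings (toy data; Gaussian kernel K).
import numpy as np

def nw_predict(x_new, X, Y, Q, h):
    """Nadaraya-Watson estimate with the norm induced by Q."""
    dist = np.linalg.norm((X - x_new) @ Q.T, axis=1)   # ||Q x - Q X_i||
    w = np.exp(-0.5 * (dist / h) ** 2)                 # Gaussian weights
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, len(tau)))                 # 100 toy sampled curves
Y = X.sum(axis=1) + rng.normal(scale=0.1, size=100)  # toy responses
print(nw_predict(X[0], X, Y, Q, h=1.0))
```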
A general consistency result
Remark for consistency

Classification case (approximately the same is true for regression):
$$\mathbb{P}\left(\phi_{n,\tau_d}(\widehat{X}_{\lambda,\tau_d}) \neq Y\right) - L^* = \left[\mathbb{P}\left(\phi_{n,\tau_d}(\widehat{X}_{\lambda,\tau_d}) \neq Y\right) - L_d^*\right] + \left[L_d^* - L^*\right]$$
where $L_d^* = \inf_{\phi : \mathbb{R}^{|\tau_d|} \to \{-1,1\}} \mathbb{P}(\phi(X^{\tau_d}) \neq Y)$.

1. For all fixed $d$,
$$\lim_{n \to +\infty} \mathbb{P}\left(\phi_{n,\tau_d}(\widehat{X}_{\lambda,\tau_d}) \neq Y\right) = L_d^*$$
as long as the $\mathbb{R}^{|\tau_d|}$-classifier is consistent, because there is a one-to-one mapping between $X^{\tau_d}$ and $\widehat{X}_{\lambda,\tau_d}$.

2. $L_d^* - L^* \leq \mathbb{E}\left|\mathbb{E}(Y \mid \widehat{X}_{\lambda,\tau_d}) - \mathbb{E}(Y \mid X)\right|$: with the consistency of the spline estimate $\widehat{X}_{\lambda,\tau_d}$ and an assumption on the regularity of $\mathbb{E}(Y \mid X = \cdot)$, consistency would be proved.
A general consistency result
Spline consistency

Let $\lambda$ depend on $d$ and denote by $(\lambda_d)_d$ the sequence of regularization parameters. Also introduce
$$\overline{\Delta \tau_d} := \max\{t_1, t_2 - t_1, \ldots, 1 - t_{|\tau_d|}\}, \qquad \underline{\Delta \tau_d} := \min_{1 \leq i < |\tau_d|} \{t_{i+1} - t_i\}.$$

Assumption (A2)
- there exists $R$ such that $\overline{\Delta \tau_d} / \underline{\Delta \tau_d} \leq R$ for all $d$;
- $\lim_{d \to +\infty} |\tau_d| = +\infty$;
- $\lim_{d \to +\infty} \lambda_d = 0$.
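The quasi-uniformity condition in (A2) is easy to check for a given design; a small sketch of my own:

```python
# Sketch: check the mesh-ratio condition of (A2) for a sampling grid tau.
import numpy as np

def mesh_ratio(tau):
    """Largest gap (including the boundary gaps t_1 and 1 - t_last) divided
    by the smallest interior gap; (A2) asks this to stay bounded over d."""
    all_gaps = np.diff(np.concatenate(([0.0], tau, [1.0])))
    return np.max(all_gaps) / np.min(np.diff(tau))

print(mesh_ratio(np.linspace(0.05, 1.0, 20)))   # uniform grid: ratio ~ 1
```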
A general consistency result
Bayes risk consistency

Assumption (A3a): $\mathbb{E}\|D^m X\|^2_{L^2}$ is finite and $Y \in \{-1, 1\}$,
or
Assumption (A3b): $\tau_d \subset \tau_{d+1}$ for all $d$ and $\mathbb{E}(Y^2)$ is finite.

Under (A1)-(A3), $\lim_{d \to +\infty} L_d^* = L^*$.
A general consistency result
Proof under assumption (A3a)

Assumption (A3a): $\mathbb{E}\|D^m X\|^2_{L^2}$ is finite and $Y \in \{-1, 1\}$.

The proof is based on a result of [Faragó and Györfi, 1975]: for a pair of random variables $(X, Y)$ taking their values in $\mathcal{X} \times \{-1, 1\}$, where $\mathcal{X}$ is an arbitrary metric space, and for a sequence of functions $T_d : \mathcal{X} \to \mathcal{X}$ such that
$$\mathbb{E}(\delta(T_d(X), X)) \xrightarrow{d \to +\infty} 0,$$
we have $\lim_{d \to +\infty} \inf_{\phi : \mathcal{X} \to \{-1,1\}} \mathbb{P}(\phi(T_d(X)) \neq Y) = L^*$.

Here:
- $T_d$ is the spline estimate based on the sampling;
- the inequality of [Ragozin, 1983] about this estimate is exactly the assumption of Faragó and Györfi's theorem.
A general consistency result
Proof under assumption (A3b)

Assumption (A3b): $\tau_d \subset \tau_{d+1}$ for all $d$ and $\mathbb{E}(Y^2)$ is finite.

Under (A3b), $(\mathbb{E}(Y \mid \widehat{X}_{\lambda_d,\tau_d}))_d$ is a uniformly bounded martingale and thus converges in $L^1$-norm. Using the consistency of $(\widehat{X}_{\lambda_d,\tau_d})_d$ to $X$ ends the proof.
A general consistency result
Concluding result (consistency)

Theorem
Under assumptions (A1)-(A3),
$$\lim_{|\tau_d| \to +\infty} \lim_{n \to +\infty} \mathbb{P}\left(\phi_{n,\tau_d}(\widehat{X}_{\lambda_d,\tau_d}) \neq Y\right) = L^*$$
and
$$\lim_{|\tau_d| \to +\infty} \lim_{n \to +\infty} \mathbb{E}\left[\phi_{n,\tau_d}(\widehat{X}_{\lambda_d,\tau_d}) - Y\right]^2 = L^*.$$

Proof: For an $\epsilon > 0$, fix $d_0$ such that, for all $d \geq d_0$, $L_d^* - L^* \leq \epsilon/2$. Then conclude by the consistency of the $\mathbb{R}^{|\tau_d|}$-classifier or regression function.
A general consistency result
A practical application to SVM I

Recall that, for a learning set $(U_i, T_i)_{i=1,\ldots,n}$ in $\mathbb{R}^p \times \{-1, 1\}$, the Gaussian SVM is the classifier
$$u \in \mathbb{R}^p \mapsto \mathrm{Sign}\left(\sum_{i=1}^n \alpha_i T_i e^{-\gamma \|u - U_i\|^2_{\mathbb{R}^p}}\right)$$
where the $(\alpha_i)_i$ satisfy the quadratic optimization problem
$$\arg\min_w \sum_{i=1}^n \left[1 - T_i w(U_i)\right]_+ + C \|w\|^2_{\mathcal{S}}$$
with $w(u) = \sum_{i=1}^n \alpha_i e^{-\gamma \|u - U_i\|^2_{\mathbb{R}^p}}$, $\mathcal{S}$ the RKHS associated with the Gaussian kernel, and $C$ a regularization parameter. Under suitable assumptions, [Steinwart, 2002] proves the consistency of SVM classifiers.
A general consistency result
A practical application to SVM II

Additional assumptions related to SVM:
Assumptions (A4)
- For all $d$, the regularization parameter depends on $n$ in such a way that $\lim_{n \to +\infty} n C_n^d = +\infty$ and $C_n^d = O(n^{\beta_d - 1})$ for a $0 < \beta_d < 1/d$.
- For all $d$, there is a bounded subset $\mathcal{B}_d$ of $\mathbb{R}^{|\tau_d|}$ such that $X^{\tau_d}$ belongs to $\mathcal{B}_d$.

Result: Under assumptions (A1)-(A4), the SVM
$$\phi_{n,d} : x \in \mathcal{H}^m \mapsto \mathrm{Sign}\left(\sum_{i=1}^n \alpha_i Y_i e^{-\gamma \|Q_{\lambda_d,\tau_d} x^{\tau_d} - Q_{\lambda_d,\tau_d} X_i^{\tau_d}\|^2_{\mathbb{R}^{|\tau_d|}}}\right) \simeq \mathrm{Sign}\left(\sum_{i=1}^n \alpha_i Y_i e^{-\gamma \|x^{(m)} - X_i^{(m)}\|^2_{L^2}}\right)$$
is consistent: $\lim_{|\tau_d| \to +\infty} \lim_{n \to +\infty} \mathbb{P}\left(\phi_{n,\tau_d}(\widehat{X}_{\lambda_d,\tau_d}) \neq Y\right) = L^*$.
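In practice this classifier is just a standard Gaussian SVM trained on the Q-transformed samplings. A sketch with scikit-learn on toy data (my own; `Q` and `tau` come from the earlier sketches, and the hyper-parameters `gamma` and `C` are arbitrary rather than tuned):

```python
# Sketch: derivative-based Gaussian SVM = ordinary RBF SVM on rows Q X_i.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, len(tau)))                # toy sampled curves
y = np.sign(X.sum(axis=1) + rng.normal(size=200))   # toy labels in {-1, 1}

clf = SVC(kernel="rbf", gamma=0.1, C=1.0)
clf.fit(X @ Q.T, y)                  # train on Q-transformed samplings
print(clf.predict(X[:5] @ Q.T))
```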
A general consistency result
Additional remark about the link between $n$ and $|\tau_d|$

Under suitable (and usual) regularity assumptions on $\mathbb{E}(Y \mid X = \cdot)$, and if $n \sim \nu |\tau_d| \log |\tau_d|$, the rate of convergence of this method is of order $d^{-\frac{2\nu}{2\nu + 1}}$, where $\nu$ is either equal to $m$ or to a Lipschitz constant related to $\mathbb{E}(Y \mid X = \cdot)$.
Examples
Outline
1 Introduction and motivations
2 A general consistency result
3 Examples
Examples
Chosen regression method: kernel ridge regression

Recall that kernel ridge regression in $\mathbb{R}^p$ is given by solving
$$\arg\min_w \sum_{i=1}^n (T_i - w(U_i))^2 + C \|w\|^2_{\mathcal{S}}$$
where $\mathcal{S}$ is a RKHS induced by a given kernel (such as the Gaussian kernel) and $(U_i, T_i)_i$ is a training sample in $\mathbb{R}^p \times \mathbb{R}$.

In the following examples, $U_i$ is either:
- the original (sampled) functions $X_i$ (viewed as $\mathbb{R}^{|\tau_d|}$ vectors);
- $Q_{\lambda,\tau_d} X_i^{\tau_d}$ for derivatives of order 1 or 2.
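A sketch of the two variants with scikit-learn's `KernelRidge` on toy data (my own illustration; `Q` and `tau` come from the earlier sketches, and the hyper-parameters are arbitrary here rather than CV-tuned as in the experiments):

```python
# Sketch: Gaussian kernel ridge regression on raw samplings vs. on the
# Q-transformed samplings (derivative-based variant).
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, len(tau)))   # toy R^{|tau_d|} predictors
T = rng.normal(size=100)               # toy real-valued responses

krr_raw = KernelRidge(kernel="rbf", gamma=0.1, alpha=1.0).fit(X, T)
krr_der = KernelRidge(kernel="rbf", gamma=0.1, alpha=1.0).fit(X @ Q.T, T)
print(krr_raw.predict(X[:3]), krr_der.predict(X[:3] @ Q.T))
```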
Examples
Example 1: Predicting yellow berry in durum wheat from NIR spectra

953 wheat samples were analyzed:
- NIR spectrometry: 1049 wavelengths regularly ranging from 400 to 2498 nm;
- Yellow berry: manual count (%) of affected grains.

Methodology for comparison:
- Split the data into train/test sets (50 times);
- Train 50 regression functions on the 50 train sets (hyper-parameters tuned by CV);
- Evaluate these regression functions by calculating the MSE on the 50 corresponding test sets.
Examples
Example 1: Predicting yellow berry in durum wheat from NIR spectra

Kernel (SVM)                             MSE on test (and sd $\times 10^{-3}$)
Linear (L)                               0.122 (8.77)
Linear on derivatives (L(1))             0.138 (9.53)
Linear on second derivatives (L(2))      0.122 (1.71)
Gaussian (G)                             0.110 (20.2)
Gaussian on derivatives (G(1))           0.098 (7.92)
Gaussian on second derivatives (G(2))    0.094 (8.35)
Examples
Comparison with PLS...

Method                   MSE (mean)   MSE (sd)
PLS                      0.154        0.012
Kernel PLS               0.154        0.013
KRR splines (reg. D2)    0.094        0.008

Error decrease: almost 40%.

[Figure: boxplots of the test MSE for SVM-D2, KPLS and PLS.]
Examples
Example 2: Simulated noisy spectra

Noisy data: $\widehat{X}_i(t) = X_i(t) + \epsilon_{it}$, $\epsilon_{it} \sim \mathcal{N}(0, 0.01)$, i.i.d.
Examples
Methodology for comparison

- Split the data into train/test sets (250 times);
- Train 250 regression functions on the 250 train sets (hyper-parameters tuned by CV), with the predictors being:
  - the original (sampled) functions $X_i$ (viewed as $\mathbb{R}^{|\tau_d|}$ vectors);
  - $Q_{\lambda,\tau_d} X_i^{\tau_d}$ for derivatives of order 1 or 2: smoothing splines derivatives;
  - $Q_{0,\tau_d} X_i^{\tau_d}$ for derivatives of order 1 or 2: interpolating splines derivatives;
  - derivatives of order 1 or 2 evaluated by $\frac{X_i(t_{j+1}) - X_i(t_j)}{t_{j+1} - t_j}$: finite differences derivatives (see the sketch below);
- Evaluate these regression functions by calculating the MSE on the 250 corresponding test sets.
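For completeness, a small sketch (my own) of the finite-differences baseline:

```python
# Sketch of the finite-differences derivative estimate used as a baseline:
# (X_i(t_{j+1}) - X_i(t_j)) / (t_{j+1} - t_j), applied once (order 1) or
# twice (order 2).
import numpy as np

def finite_diff(X, tau):
    """First-order finite-difference derivatives of sampled curves (rows)."""
    return np.diff(X, axis=1) / np.diff(tau)

tau_grid = np.linspace(0.0, 1.0, 50)
X = np.sin(2 * np.pi * tau_grid)[None, :]    # one toy curve
d1 = finite_diff(X, tau_grid)                # approx. first derivative
d2 = finite_diff(d1, tau_grid[:-1])          # approx. second derivative
```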
Examples
Performances

[Figure: test MSE of the regression functions built on the different derivative estimates.]
References

Berlinet, A. and Thomas-Agnan, C. (2004). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers.

Faragó, T. and Györfi, L. (1975). On the continuity of the error distortion function for multiple-hypothesis decisions. IEEE Transactions on Information Theory, 21(4):458–460.

Kimeldorf, G. and Wahba, G. (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33(1):82–95.

Ragozin, D. (1983). Error bounds for derivative estimation based on spline smoothing of exact or noisy data. Journal of Approximation Theory, 37:335–355.

Steinwart, I. (2002). Support vector machines are universally consistent. Journal of Complexity, 18:768–791.