
Asymptotically Optimal Choice of ε-Loss for Support Vector Machines

A. J. Smola†‡, N. Murata∗, B. Schölkopf†‡, and K.-R. Müller†

† GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany
∗ RIKEN Brain Science Institute, Saitama 351-0198, Japan
‡ Australian National University, FEIT, Canberra ACT 0200, Australia

{smola, bs, klaus}@first.gmd.de, mura@brain.riken.go.jp

Abstract

Under the assumption of asymptotically unbiased estimators we show that there exists a nontrivial choice of the insensitivity parameter in Vapnik's ε-insensitive loss function which scales linearly with the input noise of the training data. This finding is backed by experimental results.

1 Introduction

Support Vector (SV) machines have been shown to be an effective tool for regression and function estimation [9, 3]. However, good theoretical results for controlling the model selection parameters are still lacking. In particular, there are two quantities which determine the behaviour of an SV machine once the kernel and the training data are fixed: the regularization constant and the insensitivity ε of the corresponding cost function. So far the approach has mainly been to use cross-validation or bootstrap techniques for model selection (e.g. [3]).

The purpose of this paper is to give some theoretical analysis of the problem of how to choose an optimal parameter ε for Vapnik's cost function. We will completely ignore the choice of different regularization schemes or kernels; detailed information on this topic can be found in [2, 6]. For the sake of being self-contained, let us briefly review some basic notions of SV regression.

2 Support Vector Regression

Given some training set {(x_1, y_1), ..., (x_ℓ, y_ℓ)} ⊂ ℝⁿ × ℝ, one tries to estimate a linear function

f: \mathbb{R}^n \to \mathbb{R} \quad\text{with}\quad f(x) = (w \cdot x) + b, \qquad w \in \mathbb{R}^n,\; b \in \mathbb{R} \qquad (1)

that minimizes the following regularized risk functional

R_{\mathrm{reg}}[f] = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{\ell} |f(x_i) - y_i|_\varepsilon. \qquad (2)

Here |x|_ε := max(0, |x| − ε) is Vapnik's ε-insensitive loss function [8].



The desired accuracy ε is specified a priori. It is then attempted to fit the flattest tube with radius ε to the data. The trade-off between model complexity and points lying outside of the tube (with positive slack variables) is determined by minimizing (2). One can show [8] that this leads to a quadratic optimization problem which can be solved efficiently. It can also be shown that it is quite straightforward to estimate nonlinear functions by replacing the dot products in input space by kernels which represent dot products in feature space [1].
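As a concrete illustration, the following minimal sketch (not the authors' implementation; it assumes NumPy and scikit-learn, and uses scikit-learn's SVR solver in place of the quadratic program referred to above) evaluates Vapnik's ε-insensitive loss and fits a linear SV regressor whose parameters C and epsilon play the roles of C and ε in (2).

import numpy as np
from sklearn.svm import SVR

def eps_insensitive_loss(residuals, eps):
    # |x|_eps = max(0, |x| - eps): zero inside the tube, linear outside
    return np.maximum(0.0, np.abs(residuals) - eps)

# toy data: a linear target with additive Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 1))
y = 2.0 * X[:, 0] + 0.3 + rng.normal(scale=0.1, size=100)

# C and epsilon play the roles of C and eps in the regularized risk (2)
model = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X, y)
residuals = y - model.predict(X)
print("mean eps-insensitive training loss:",
      eps_insensitive_loss(residuals, eps=0.1).mean())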

Vapnik's ε-insensitive cost function has algorithmic advantages such as providing sparse decompositions and rendering the computations more amenable [9]. In most cases, unfortunately, |x|_ε is not the cost function with respect to which we would like to train our estimator. The problem that now arises is to find a value of ε such that, say, the variance of y_i − f(x_i) is minimized.

Intuitively one might think that setting ε = 0 is the best choice that can be made, as |x|_0 = |x|, which leads to the standard Laplacian loss function. Experiments show, however, that this is not the case and hint that there exists a linear dependency between the level of the additive noise in the target values y_i and the size of ε. We will show that under some assumptions this finding is indeed correct.

3 Asymptotic Efficiency

What we will show is that there is a nontrivial choice of ε for which the statistical efficiency of estimating a location parameter using the ε-insensitive loss function is maximized, and that the optimal ε is proportional to the standard deviation of the random variable (here the additive noise) under consideration.

For this purpose we need to introduce some statistical notation. Denote by θ̂(X) an estimator of the parameters θ based on the sample X, and let X be drawn from some probability density function p(X) (also parametrized by θ). Finally, denote by ⟨·⟩ the expectation of a random variable with respect to p(X). We can then define an unbiased estimator θ̂(X) by requiring ⟨θ̂(X)⟩ = θ. Moreover, in this case we can introduce the Fisher information matrix I with

I_{ij} := \left\langle \partial_i \ln p(X)\, \partial_j \ln p(X) \right\rangle \qquad (3)

and the covariance matrix B of the estimator θ̂ by

B_{ij} := \left\langle (\hat\theta_i - \langle\hat\theta_i\rangle)(\hat\theta_j - \langle\hat\theta_j\rangle) \right\rangle. \qquad (4)

Then the Cramér-Rao inequality [5] states that det(IB) ≥ 1 for all possible estimators θ̂. This allows us to define the statistical efficiency e of an estimator as e := 1/det(IB). Comparing the quality of unbiased estimators in this context reduces to comparing their statistical efficiencies.
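To make the comparison of estimators by their statistical efficiency concrete, here is a small Monte Carlo sketch (not part of the paper; it assumes NumPy) comparing the sample mean and the sample median as location estimators for Gaussian data against the Cramér-Rao bound.

import numpy as np

# For N(theta, sigma^2) and a sample of size n, the Fisher information is
# I = n / sigma^2, so the Cramer-Rao bound on the variance is sigma^2 / n
# and the scalar efficiency is e = 1 / (I * B).
rng = np.random.default_rng(0)
n, trials, sigma = 200, 5000, 1.0
samples = rng.normal(loc=0.0, scale=sigma, size=(trials, n))

var_mean = np.var(samples.mean(axis=1))          # sample mean
var_median = np.var(np.median(samples, axis=1))  # sample median

crb = sigma**2 / n
print("efficiency of the mean:  ", crb / var_mean)    # close to 1
print("efficiency of the median:", crb / var_median)  # close to 2/pi = 0.64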

For a special class of estimators, where θ̂ is defined by

\hat\theta(X) := \operatorname{argmin}_\theta\, d(X, \theta) \qquad (5)

and d is a twice differentiable function in θ, one can show [4, Lemma 3] that asymptotically B = Q^{-1} G Q^{-1} with

G_{ij} := \operatorname{Cov}\!\left[ \partial_i d(X,\theta),\, \partial_j d(X,\theta) \right] \quad\text{and}\quad Q_{ij} := \left\langle \partial^2_{ij} d(X,\theta) \right\rangle \qquad (6)

and therefore e = (det Q)²/det(IG). In our case we are only dealing with a one-parametric model, namely estimating a location parameter (the mean of a distribution). Denote by ν_ε(ξ) the noise model given by the ε-insensitive loss function, i.e.

\nu_\varepsilon(\xi) = c_0 \exp(-|\xi|_\varepsilon) = \frac{1}{2(1+\varepsilon)} \begin{cases} 1 & \text{if } |\xi| \le \varepsilon \\ \exp(\varepsilon - |\xi|) & \text{otherwise} \end{cases} \qquad (7)

and by ν(ξ) the actual noise model of the data. Without loss of generality we will assume the location parameter to be 0, or formally ⟨X_i⟩ = 0 (one could always redefine the problem for a nonzero location parameter by shifting all variables by the corresponding amount). Then the maximum likelihood estimator is given by setting d(X, θ) = −(1/ℓ) Σ_{i=1}^ℓ ln ν_ε(X_i − θ), and therefore

I = \left\langle (\partial_\theta \ln \nu(\xi))^2 \right\rangle \qquad (8)

G = \left\langle (\partial_\theta \ln \nu_\varepsilon(\xi))^2 \right\rangle = 2 \int_\varepsilon^\infty \nu(\xi)\, d\xi \qquad (9)

Q = \left\langle \partial_\theta^2 \left( -\ln \nu_\varepsilon(\xi) \right) \right\rangle = 2\nu(\varepsilon). \qquad (10)

Now let us consider different choices of ν and the corresponding values of e.

Gaussian Noise

Set ν(ξ) = (2πσ²)^{-1/2} exp(−ξ²/(2σ²)). Then I = σ^{-2} and G = 1 − erf(ε/(√2 σ)), and therefore

\frac{1}{e} = \frac{\det(GI)}{\det Q^2} = \frac{\pi}{2}\, \exp\!\left(\varepsilon^2/\sigma^2\right) \left[ 1 - \operatorname{erf}\!\left( \frac{\varepsilon}{\sqrt{2}\,\sigma} \right) \right]. \qquad (11)

The maximum of e is obtained for ε/σ = 0.6166, and therefore we have a linear dependency between ε and the noise level σ. As a sanity check we will compute the optimal ε for Laplacian noise.
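As a quick numerical cross-check of expression (11) as reconstructed here (a sketch assuming NumPy and SciPy, not code from the paper), one can minimize 1/e over the ratio ε/σ directly; the constant prefactor does not affect the location of the minimum, which comes out at roughly 0.61, in the same range as the value quoted above.

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import erf

def inv_efficiency(t):
    # 1/e as a function of t = eps/sigma; the constant factor pi/2 is
    # omitted since it does not move the location of the minimum
    return np.exp(t**2) * (1.0 - erf(t / np.sqrt(2.0)))

res = minimize_scalar(inv_efficiency, bounds=(0.0, 3.0), method="bounded")
print(f"efficiency-optimal eps/sigma ~ {res.x:.3f}")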



Laplacian Noise

Set ν(ξ) = (2σ)^{-1} exp(−|ξ|/σ). Then I = σ^{-2} and G = exp(−ε/σ), and therefore 1/e = exp(ε/σ). Here the maximum of e is achieved for ε/σ = 0, i.e. the case where the ε-insensitive loss degenerates to the L1 loss, exactly matching the Laplacian noise. In this case the estimator is asymptotically efficient (e = 1). Finally, let us analyze the case of arbitrary polynomial noise models.

Polynomial Noise

Many cases, e.g. that of Gaussian noise, can be dealt with in a more general setting, namely

\nu(\xi) = \frac{p}{2\sigma\,\Gamma(1/p)} \exp\!\big(-(|\xi|/\sigma)^p\big),

where Γ(x) is the gamma function and var ξ = σ² Γ(3/p)/Γ(1/p). There we have

I = \frac{p^2}{\sigma^2}\,\frac{\Gamma(2 - 1/p)}{\Gamma(1/p)} \quad\text{and}\quad G = \frac{p}{\Gamma(1/p)}\, F_p(\varepsilon/\sigma), \quad\text{where}\quad F_p(x) := \int_x^\infty \exp(-t^p)\, dt.

In most cases F_p(x) cannot be computed in closed form. Putting everything together we get

\frac{1}{e} = 4\, F_p(\varepsilon/\sigma)\, \exp\!\big(2(\varepsilon/\sigma)^p\big)\, \Gamma(2 - 1/p). \qquad (12)

One can observe that e depends on ε and σ only through the quotient ε/σ which, since var ξ = σ² Γ(3/p)/Γ(1/p), may equally be expressed in terms of the standard deviation and p. Hence the optimal ε scales linearly with the noise level, provided that the optimal quotient ε/σ ≠ 0. That this is the case for p > 1 can be shown by computing the derivative of 1/e with respect to ε/σ at ε/σ = 0. One gets

\frac{d(1/e)}{d(\varepsilon/\sigma)}\bigg|_{\varepsilon/\sigma = 0} = -4\,\Gamma(2 - 1/p) < 0,

which proves that the minimum of 1/e is not attained at ε/σ = 0. Hence also in the case of general polynomial noise we have a linear dependency between the noise level and the optimal choice of ε.
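The location of the optimum for general p can be found numerically. The sketch below (assuming NumPy and SciPy; not from the paper) minimizes F_p(ε/σ)·exp(2(ε/σ)^p), since constant prefactors in (12) do not affect the minimizer; note that σ here is the scale parameter of the polynomial noise model, not its standard deviation.

import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

def F(x, p):
    # F_p(x) = integral from x to infinity of exp(-t^p) dt
    val, _ = quad(lambda t: np.exp(-t**p), x, np.inf)
    return val

def inv_efficiency(t, p):
    # proportional to 1/e in (12); constant prefactors are irrelevant here
    return F(t, p) * np.exp(2.0 * t**p)

for p in (1.5, 2.0, 3.0, 4.0):
    res = minimize_scalar(lambda t: inv_efficiency(t, p),
                          bounds=(0.0, 3.0), method="bounded")
    print(f"p = {p}: optimal eps/sigma ~ {res.x:.3f}")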

Arbitrary Symmetric Noise

One can derive general conditions for a linear scaling behaviour of ε. Assume ν̄(ξ) to be a symmetric density with unit variance; hence σ^{-1} ν̄(ξ/σ) has standard deviation σ. Now rewrite G and Q in terms of χ := ε/σ as

G = 2 \int_\varepsilon^\infty \frac{1}{\sigma}\,\bar\nu(\xi/\sigma)\, d\xi = 2 \int_\chi^\infty \bar\nu(t)\, dt \qquad (13)

together with I = I_1 σ^{-2} and Q = (2/σ) ν̄(χ). This leads to

e = \frac{\det Q^2}{\det(GI)} = \frac{4\sigma^{-2}\,\bar\nu^2(\chi)}{\sigma^{-2} I_1 \cdot 2\int_\chi^\infty \bar\nu(t)\, dt} = \frac{2}{I_1}\, \frac{\bar\nu^2(\chi)}{\int_\chi^\infty \bar\nu(t)\, dt}. \qquad (14)

What remains is to check that e does not have a maximum at χ = 0. Computing the derivative of e with respect to χ at χ = 0, analogously to the polynomial case, yields the sufficient condition ∂_χ ν̄(0) + ν̄²(0) > 0. For instance, any density with ∂_χ ν̄(0) = 0 satisfies this property.
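A short numerical sketch (again an illustration with NumPy/SciPy, not from the paper) evaluates the efficiency expression (14), up to the constant 2/I_1, for a unit-variance Gaussian and a unit-variance Laplacian density; it recovers the two special cases above, namely a peak near ε/σ ≈ 0.61 for Gaussian noise and a peak at ε/σ = 0 for Laplacian noise.

import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

def gauss(t):
    # unit-variance Gaussian density
    return np.exp(-t**2 / 2.0) / np.sqrt(2.0 * np.pi)

def laplace(t):
    # unit-variance Laplacian density
    return np.exp(-np.sqrt(2.0) * np.abs(t)) / np.sqrt(2.0)

def neg_efficiency(chi, density):
    # -(nu(chi)^2 / int_chi^inf nu(t) dt), i.e. (14) without the factor 2/I_1
    tail, _ = quad(density, chi, np.inf)
    return -density(chi) ** 2 / tail

for name, density in (("Gaussian", gauss), ("Laplacian", laplace)):
    res = minimize_scalar(lambda c: neg_efficiency(c, density),
                          bounds=(0.0, 3.0), method="bounded")
    print(f"{name} noise: efficiency peaks near eps/sigma = {res.x:.3f}")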

It is not directly possible to carry over our conclusions to the SV case, due to two assumptions that may not be satisfied: normally we are neither dealing with the asymptotic case, nor is an SV machine estimating only a single parameter at a time. Instead we have a finite sample size and estimate a function. Experiments illustrate, however, that the conclusions obtained from this simplified situation are still valid in the more complex SV case.

4 Experiments and Conclusions

We consider a toy example, namely f(x) = 0.9 sinc(10x/π) on [−1, 1]. 100 data points x_i were drawn independently from a uniform distribution on [−1, 1], and the sample was generated via y_i = f(x_i) + ξ_i, where ξ_i is a Gaussian random variable with zero mean and variance σ². In the implementation we used a spline kernel of degree 1 with 100 grid points (see [9] for details). As the purpose of these experiments was to exhibit the dependency on ε, we carried out "model selection" for the regularization parameter in such a way as to always choose the value that led to the smallest (L1 or L2) error on the test set. By doing so it was possible to exclude side effects of possible model selection criteria. For statistical reliability we averaged the results over multiple trials.
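For illustration, a rough reconstruction of the toy experiment is sketched below (assumptions: scikit-learn's SVR with an RBF kernel stands in for the degree-1 spline kernel, sinc(10x/π) is read as sin(10x)/(10x), and the values of C and the kernel width are arbitrary). It only reproduces the qualitative dependence of the test error on ε, not the exact numbers behind Figure 1.

import numpy as np
from sklearn.svm import SVR

def target(x):
    # np.sinc(z) = sin(pi z) / (pi z), so this equals 0.9 * sin(10 x) / (10 x)
    return 0.9 * np.sinc(10.0 * x / np.pi)

rng = np.random.default_rng(0)
sigma = 0.2
x_train = rng.uniform(-1.0, 1.0, size=100)
y_train = target(x_train) + rng.normal(scale=sigma, size=100)
x_test = np.linspace(-1.0, 1.0, 500)

for eps in (0.01, 0.05, 0.1, 0.18, 0.3):
    # gamma and C are arbitrary illustration values, not tuned as in the paper
    model = SVR(kernel="rbf", gamma=10.0, C=10.0, epsilon=eps)
    model.fit(x_train[:, None], y_train)
    l1 = np.mean(np.abs(model.predict(x_test[:, None]) - target(x_test)))
    print(f"epsilon = {eps:.2f}: L1 test error = {l1:.4f}")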

Figure 1: Left: asymptotic efficiency (plotted as 1/efficiency versus ε/σ) for an ε-insensitive model and data with Gaussian noise. Right: L1 and L2 loss for different values of ε and fixed noise level σ = 0.2, averaged over 25 trials. Note the minimum at ε = 0.18.

Figure 2: Experimentally optimal choice of ε for different levels of Gaussian noise, together with the theoretically best value in the asymptotic unbiased case. For each noise level, after determining the optimal regularization parameter, we chose the corresponding ε with minimal L1 error. Although the obtained values do not match the exact values of the theoretical predictions, the scaling behaviour with the noise is clearly linear.

Observe the qualitatively similar behaviour of the errors and of the efficiency of the estimator in Figure 1. Also note the linear dependency of ε_opt on σ in Figure 2. Even though it does not exhibit the same scaling factor as one obtains from the asymptotic calculations, the scaling property itself remains unchanged.

Our finding is corroborated by results of Solla and Levin [7], where it was shown that for linear Boltzmann machines the best performance is achieved when the "internal noise" matches the external noise.

Acknowledgements:

This work was supported in part by a grant of the DFG (# Ja 379/71) and the Australian Research Council. The authors thank Peter Bartlett and Robert Williamson for helpful discussions and comments.

References

[1] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, 5th Annual ACM Workshop on COLT, pages 144-152, Pittsburgh, PA, 1992. ACM Press.

[2] F. Girosi, M. Jones, and T. Poggio. Priors, stabilizers and basis functions: From regularization to radial, tensor and additive splines. A.I. Memo No. 1430, MIT, 1993.

[3] K.-R. Müller, A. J. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik. Predicting time series with support vector machines. In Artificial Neural Networks - ICANN'97, 1997.

[4] N. Murata, S. Yoshizawa, and S. Amari. Network information criterion - determining the number of hidden units for artificial neural network models. IEEE Transactions on Neural Networks, 5:865-872, 1994.

[5] C. R. Rao. Linear Statistical Inference and its Applications. Wiley, New York, 1973.

[6] A. J. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 1998. In press.

[7] S. A. Solla and E. Levin. Learning in linear neural networks: The validity of the annealed approximation. Physical Review A, 46(4):2124-2130, 1992.

[8] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.

[9] V. Vapnik, S. Golowich, and A. Smola. Support vector method for function approximation, regression estimation, and signal processing. In NIPS 9, San Mateo, CA, 1997.
