Linear Programs for Automatic Accuracy Control in Regression

Alex Smola, Bernhard Schölkopf & Gunnar Rätsch
GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany
Australian National University, FEIT, Canberra ACT 0200, Australia

{smola, bs, raetsch}@first.gmd.de

Abstract

We have recently proposed a new approach to control the number of basis functions and the accuracy in Support Vector Machines. The latter is transferred to a linear programming setting, which inherently enforces sparseness of the solution.

The algorithm computes a nonlinear estimate in terms of kernel functions and an ε > 0 with the property that at most a fraction ν of the training set has an error exceeding ε. The algorithm is robust to local perturbations of these points' target values.

We give an explicit formulation of the optimization equations needed to solve the linear program and point out which modifications of the standard optimization setting are necessary to take advantage of the particular structure of the equations in the regression case.

1 Introduction

Support Vector (SV) regression comprises a new class of learning algorithms motivated by results of statistical learning theory [11]. Originally developed for pattern recognition, the basic properties carry over to regression by choosing a suitable cost function, the so-called ε-insensitive loss:

$$|y - f(x)|_{\varepsilon} = \max\{0,\; |y - f(x)| - \varepsilon\}. \tag{1}$$

This does not penalize errors below some ε > 0, chosen a priori. A possible algorithm, which will henceforth be called ε-SVR, seeks to estimate functions

$$f(x) = \langle w, x \rangle + b \quad \text{with } w, x \in \mathbb{R}^d,\; b \in \mathbb{R}, \tag{2}$$

based on independent identically distributed (iid) data

$$(x_1, y_1), \ldots, (x_\ell, y_\ell) \in X \times \mathbb{R}. \tag{3}$$

Here, X is the space in which the input patterns live (e.g., for vectorial data, X = R^d). The goal of the learning process is to find a function f with a small risk (or test error)

$$R[f] = \int_X l(f, x, y)\, dP(x, y), \tag{4}$$

where P is the probability measure which is assumed to be responsible for the generation of the observations (3), and l is a loss function such as l(f, x, y) = (f(x) − y)², depending on the specific regression estimation problem at hand.

Note that this loss does not necessarily have to coincide with the loss function used in our learning algorithm: there might be both algorithmic and theoretical reasons for choosing one that is easier to implement or more robust. The problem, however, is that we cannot minimize (4) directly in the first place, since we do not know P. Instead, we are given the sample (3), and we try to obtain a small risk by minimizing the regularized risk functional

$$R_{\mathrm{reg}}[f] := \tfrac{1}{2}\|w\|^2 + C R^{\varepsilon}_{\mathrm{emp}}[f]. \tag{5}$$

Here, ‖w‖² is a term which characterizes the model complexity, and

$$R^{\varepsilon}_{\mathrm{emp}}[f] := \frac{1}{\ell} \sum_{i=1}^{\ell} |y_i - f(x_i)|_{\varepsilon} \tag{6}$$

measures the ε-insensitive training error; C is a constant determining the trade-off. In short, minimizing (5) captures the main insight of statistical learning theory, stating that in order to obtain a small risk, one needs to control both training error and model complexity, i.e., explain the data with a simple model.

Nonlinearity of the algorithm is achieved by mapping x into a feature space F via the feature map Φ: X → F and computing a linear estimate there [1, 3] to obtain f(x) = ⟨w, Φ(x)⟩ + b.

However, as F may be very high dimensional, this direct approach is often not computationally feasible. Hence one uses kernels instead, i.e., one rewrites the algorithm in terms of dot products, which do not require explicit knowledge of Φ(x), and introduces kernels by letting

$$k(x, x') := \langle \Phi(x), \Phi(x') \rangle. \tag{7}$$

Since w may be written as a linear combination of the (mapped) training patterns x_i [3], one obtains the following well-known kernel expansion of f:

$$f(x) = \sum_{i=1}^{\ell} \alpha_i k(x_i, x) + b. \tag{8}$$

The training algorithm itself can also be formulated in terms of k(x, x') such that the basic optimization problem remains unchanged [3].
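To make the expansion concrete, the following is a minimal sketch (not from the paper) of evaluating (8) with a Gaussian RBF kernel; the kernel width, array shapes and function names are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x, x_train, width=0.2):
    # k(x_i, x) = exp(-||x_i - x||^2 / width); the width is an assumed parameter
    return np.exp(-np.sum((x_train - x) ** 2, axis=1) / width)

def predict(x, x_train, alpha, b, width=0.2):
    # kernel expansion (8): f(x) = sum_i alpha_i k(x_i, x) + b
    return float(alpha @ rbf_kernel(x, x_train, width)) + b
```

Here x_train has shape (ℓ, d), alpha holds the ℓ expansion coefficients, and b is the offset.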

The following two modifications to the original setting (5) will allow us to obtain a new type of algorithm:

- Explicit enforcement of sparsity via a linear regularizer (cf. section 2).

- Automatic adaptation of the width ε of the ε-insensitive loss zone (cf. section 3).

Moreover, we will show in section 4 how the new approach can be related to robust and asymptotically efficient estimators.

2 Sparsity Regularization

Recently, several modifications were proposed to change the SV problem from a quadratic programming to a linear programming problem. This is commonly achieved [12, 2, 8] by a regularizer derived from sparse coding [4]. Instead of choosing the flattest function as in (5), we seek the w contained in the smallest convex combination of training patterns x_i (or Φ(x_i)): we minimize

$$R_{\mathrm{reg}}[f] := \frac{1}{\ell} \sum_{i=1}^{\ell} |\alpha_i| + C R^{\varepsilon}_{\mathrm{emp}}[f], \tag{9}$$

where w = Σ_{i=1}^ℓ α_i x_i. This will allow us to obtain solutions w generated from a linear combination of only a few patterns x_i, i.e., functions f generated from only a few kernel functions k. Minimizing (9) can be written as a linear programming problem (we set k_ij := k(x_i, x_j)):

$$\begin{array}{ll}
\text{minimize} & \dfrac{1}{\ell} \sum_{i=1}^{\ell} (\alpha_i + \alpha_i^*) + \dfrac{C}{\ell} \sum_{i=1}^{\ell} (\xi_i + \xi_i^*) \\[2mm]
\text{subject to} & \sum_{j=1}^{\ell} (\alpha_j - \alpha_j^*)\, k_{ij} + b - y_i \le \varepsilon + \xi_i \\[1mm]
 & y_i - \sum_{j=1}^{\ell} (\alpha_j - \alpha_j^*)\, k_{ij} - b \le \varepsilon + \xi_i^* \\[1mm]
 & \alpha_i, \alpha_i^*, \xi_i, \xi_i^* \ge 0.
\end{array}$$

Here we substituted α_i by the two positive variables α_i and α_i^* in order to overcome problems with |α_i| in the objective function.¹
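As an illustration of how (9) maps onto a standard LP solver, the following sketch builds the constraint matrix densely and hands it to scipy.optimize.linprog. The RBF kernel, the parameter values and the helper name eps_lpr are assumptions made for this example; the authors' own implementation is an interior point code (cf. section 5).

```python
import numpy as np
from scipy.optimize import linprog

def eps_lpr(X, y, C=100.0, eps=0.1, width=0.2):
    """Sketch of the epsilon-LPR program (9); returns expansion coefficients and b."""
    l = len(y)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / width)                                  # k_ij = k(x_i, x_j)
    I, O, one = np.eye(l), np.zeros((l, l)), np.ones((l, 1))
    # variable vector: [alpha, alpha*, xi, xi*, b]
    c = np.r_[np.ones(2 * l) / l, C * np.ones(2 * l) / l, 0.0]
    A_ub = np.block([[ K, -K, -I,  O,  one],     # f(x_i) - y_i <= eps + xi_i
                     [-K,  K,  O, -I, -one]])    # y_i - f(x_i) <= eps + xi*_i
    b_ub = np.r_[y + eps, eps - y]
    bounds = [(0, None)] * (4 * l) + [(None, None)]          # b is free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    beta = res.x[:l] - res.x[l:2 * l]                        # alpha_i - alpha*_i
    return beta, res.x[-1]
```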

3 Automatic ε-Tuning

Recently, a new algorithm [5] was proposed to achieve automatic accuracy control in Support Vector (SV) Machines. This was done by making ε, the width of the tube itself, part of the optimization problem.² Hence instead of minimizing (9), which we will henceforth call ε-LPR (Linear Programming Regression), we minimize

$$R_{\mathrm{reg}}[f] + C\nu\varepsilon = \frac{1}{\ell} \sum_{i=1}^{\ell} |\alpha_i| + C R^{\varepsilon}_{\mathrm{emp}}[f] + C\nu\varepsilon. \tag{10}$$

Consequently, the goal is not only to achieve a small training error (with respect to ε) but also to obtain a solution with a small ε itself. Rewriting (10) as a linear program yields

$$\begin{array}{ll}
\text{minimize} & \dfrac{1}{\ell} \sum_{i=1}^{\ell} (\alpha_i + \alpha_i^*) + \dfrac{C}{\ell} \sum_{i=1}^{\ell} (\xi_i + \xi_i^*) + C\nu\varepsilon \\[2mm]
\text{subject to} & \sum_{j=1}^{\ell} (\alpha_j - \alpha_j^*)\, k_{ij} + b - y_i \le \varepsilon + \xi_i \\[1mm]
 & y_i - \sum_{j=1}^{\ell} (\alpha_j - \alpha_j^*)\, k_{ij} - b \le \varepsilon + \xi_i^* \\[1mm]
 & \alpha_i, \alpha_i^*, \xi_i, \xi_i^*, \varepsilon \ge 0.
\end{array}$$

Hence the difference between (10) and (9) lies in the fact that ε has become a positively constrained variable of the optimization problem itself. Before proceeding to the implementation issues (cf. section 5), let us analyze the theoretical aspects of the new optimization problem.

¹There is no point in computing the Wolfe dual as in SV machines, since the primal problem can already be solved conveniently by linear programming codes without any further problems. In fact, a simple calculation reveals that the dual does not give any computational advantage.

²This is equivalent to imposing a Laplacian prior on the accuracy parameter ε and computing the maximum a posteriori (MAP) estimate.
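A corresponding hypothetical sketch of the ν-LPR linear program above, in the same style as before: ε simply joins the variable vector with objective weight Cν and moves to the left-hand side of the constraints. Here K is assumed to be the precomputed ℓ × ℓ kernel matrix with entries k_ij; the function name and parameter values are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def nu_lpr(K, y, C=100.0, nu=0.2):
    """Sketch of the nu-LPR program (10); K is the precomputed kernel matrix."""
    l = len(y)
    I, O, one = np.eye(l), np.zeros((l, l)), np.ones((l, 1))
    # variable vector: [alpha, alpha*, xi, xi*, eps, b]
    c = np.r_[np.ones(2 * l) / l, C * np.ones(2 * l) / l, C * nu, 0.0]
    A_ub = np.block([[ K, -K, -I,  O, -one,  one],    # f(x_i) - y_i <= eps + xi_i
                     [-K,  K,  O, -I, -one, -one]])   # y_i - f(x_i) <= eps + xi*_i
    b_ub = np.r_[y, -y]
    bounds = [(0, None)] * (4 * l + 1) + [(None, None)]      # eps >= 0, b free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    beta = res.x[:l] - res.x[l:2 * l]                        # alpha_i - alpha*_i
    return beta, res.x[-1], res.x[-2]                        # coefficients, b, eps
```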


The core aspect can be captured in the proposition stated below.

Proposition 1 Assume ε > 0. The following statements hold:

(i) ν is an upper bound on the fraction of errors (i.e., points outside the ε-tube).

(ii) ν is a lower bound on the fraction of points not inside (i.e., outside or on the edge of) the ε-tube.

(iii) Suppose the data (3) were generated iid from a distribution P(x, y) = P(x) P(y|x) with P(y|x) continuous. With probability 1, asymptotically, ν equals both the fraction of points not inside the tube and the fraction of errors.

Proof

Ad (i): Imagine increasing ε starting from 0. The second term in (1/ℓ) Σ_{i=1}^ℓ (ξ_i + ξ_i^*) + νε will increase proportionally to ν, while the first term will decrease proportionally to the fraction of points outside the tube. Hence, ε will grow as long as that fraction is larger than ν. At the optimum, it therefore must be at most ν.

Ad (ii): Next, imagine decreasing ε starting from some large value. Again, the change in the second term is proportional to ν, but this time the change in the first term is proportional to the fraction of points not inside the tube (even points on the edge will contribute). Hence, ε will shrink as long as the fraction of such points is smaller than ν, eventually leading to the stated claim.

Ad (iii): The strategy of proof is to show that asymptotically, the probability of a point lying on the edge of the tube vanishes. For lack of space, we do not give the argument here; it can be found in [5].

Hence, 0 ≤ ν ≤ 1 can be used to control the number of errors. Moreover, since by construction f(x) = ⟨w, Φ(x)⟩ + b allows shifting f, this degree of freedom implies that Proposition 1 actually holds for the upper and the lower edge of the tube separately, with ν/2 each (proof by shifting f). As an aside, note that by the same argument, the numbers of mispredictions larger than ε above and below the tube of the standard ε-LPR asymptotically agree.

Note, finally, that even though the proposition reads almost as in the ν-SVR case, there is an important distinction. In the ν-SVR case [5], the set of points not inside the tube coincides with the set of SVs. Thus, the geometrical statement of the proposition translates into a statement about the SV set. In the LP context, this is no longer true: although the solution is still sparse, any point could be an SV, even if it is inside the tube. By de-coupling the geometrical and the computational properties of SVs, LP machines therefore come with two different notions of SVs. In our usage of the term, we shall stick to the geometrical notion.

4 Robustness and Efficiency

Using the ε-insensitive loss function, only the patterns outside of the ε-tube enter the empirical risk term, whereas the patterns closest to the actual regression have zero loss. This, however, does not mean that it is only the 'outliers' that determine the regression. In fact, the contrary is the case.

Proposition 2 Using Linear Programming Regression with the ε-insensitive loss function (1), local movements of target values of points outside the tube do not influence the regression.

Proof Shifting y_i locally to y_i' does not change the status of (x_i, y_i) as a point outside the tube. Without loss of generality assume that ξ_i > 0, i.e., y_i < f(x_i) − ε.³ All we have to show is that by changing ξ_i into ξ_i' = ξ_i + (y_i − y_i'), while keeping all the other variables, and therefore also the estimate f, unchanged, we obtain an optimal solution again.

By construction, the new set of variables is still feasible for the modified problem, and the same constraints are active as in the initial solution. Finally, the gradients of both the objective function and the constraints with respect to each variable remain unchanged, since ξ_i only appears in a linear fashion and all other variables did not change at all. Thus the new solution is optimal again, leading to the same estimate of f as before.

Robustness is not the only quality measure applicable to a statistical estimator. The other criterion is (asymptotic) efficiency. While an efficiency of 1 obviously cannot be guaranteed for unmatched loss functions and noise models (e.g., one should be using squared loss for Gaussians), it is still important to choose ν (or ε) so as to obtain as efficient an estimator as possible. In [6, 8, 5] it was shown how this could be done for Support Vector Regression. The results thereof carry over directly to the Linear Programming case without any further modification. Hence we only state the main result for convenience.

³The case of ξ_i^* > 0 works in the same way, just with opposite signs.

Proposition 3 Denote by p a density with unit variance,⁴ and by P a family of noise models generated from p by

$$\mathcal{P} := \left\{ p_\sigma \,\middle|\, p_\sigma(y) = \tfrac{1}{\sigma}\, p\!\left(\tfrac{y}{\sigma}\right),\; \sigma > 0 \right\}.$$

Moreover, assume that the data were generated iid from a distribution p(x, y) = p(x) p(y − f(x)) with p(y − f(x)) continuous. Under the assumption that LP regression produces an estimate f̂ converging to the underlying functional dependency f, the asymptotically optimal ν, for the estimation-of-location-parameter model of LP regression, is

$$\nu = 1 - \int_{-\varepsilon}^{\varepsilon} p(t)\, dt, \tag{11}$$

where

$$\varepsilon := \operatorname*{argmin}_{\tau} \; \frac{1 - \int_{-\tau}^{\tau} p(t)\, dt}{\left(p(-\tau) + p(\tau)\right)^{2}}.$$

For explicit constants on how to set an asymptotically optimal value of ν in the presence of polynomial noise, see [8, 5]. Section 6 contains some experiments covering this topic.
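As a numerical illustration of Proposition 3 and (11), the asymptotically optimal ν can be computed for an assumed unit-variance Gaussian noise density; the sketch below uses standard SciPy routines and is not part of the paper.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def tail(tau):
    # 1 - int_{-tau}^{tau} p(t) dt for the standard normal density p
    return 2.0 * norm.sf(tau)

def objective(tau):
    # (1 - int_{-tau}^{tau} p) / (p(-tau) + p(tau))^2, cf. Proposition 3
    return tail(tau) / (2.0 * norm.pdf(tau)) ** 2

eps_opt = minimize_scalar(objective, bounds=(1e-3, 5.0), method="bounded").x
nu_opt = tail(eps_opt)          # (11); roughly 0.54 for Gaussian noise
print(eps_opt, nu_opt)
```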

5 Optimization

We adopt an interior point primal-dual strategy as pointed out in [10], with modifications aimed at exploiting the special structure of the ν-LPR problem (the ε-LPR problem is similar). After adding slack variables [10] for the inequality constraints, we obtain the following system of equations:

$$\begin{array}{ll}
\text{minimize} & c^\top a \\
\text{subject to} & A a + B b + y - s = 0, \quad a, s \ge 0, \quad b \text{ free,}
\end{array}$$

where

$$a := [\tilde{\alpha},\, \tilde{\alpha}^*,\, \tilde{\xi},\, \tilde{\xi}^*,\, \varepsilon] \in \mathbb{R}^{4\ell+1}, \quad b \in \mathbb{R}, \quad s \in \mathbb{R}^{2\ell},$$

$$A = \begin{bmatrix} -K & K & \mathbf{1} & 0 & \tilde{1} \\ K & -K & 0 & \mathbf{1} & \tilde{1} \end{bmatrix}, \quad
B = \begin{bmatrix} -\tilde{1} \\ \tilde{1} \end{bmatrix}, \quad
y = \begin{bmatrix} \tilde{y} \\ -\tilde{y} \end{bmatrix}, \quad
c = \begin{bmatrix} \tilde{1} \\ \tilde{1} \\ C\tilde{1} \\ C\tilde{1} \\ C\ell\nu \end{bmatrix}. \tag{12}$$

Here K is the ℓ × ℓ kernel matrix with entries k_ij, 𝟏 the ℓ × ℓ identity matrix, and 1̃ the ℓ-vector of ones.
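Purely for illustration, the quantities in (12) could be assembled densely as follows (K is the ℓ × ℓ kernel matrix, y_vec the target vector; in practice a sparse representation would be preferable):

```python
import numpy as np

def assemble_lp(K, y_vec, C, nu):
    # variable ordering a = [alpha, alpha*, xi, xi*, eps], cf. (12)
    l = len(y_vec)
    I, O, one = np.eye(l), np.zeros((l, l)), np.ones((l, 1))
    A = np.block([[-K,  K, I, O, one],
                  [ K, -K, O, I, one]])                  # shape (2l, 4l+1)
    B = np.r_[-np.ones(l), np.ones(l)].reshape(-1, 1)    # coefficient of b
    y = np.r_[y_vec, -y_vec]
    c = np.r_[np.ones(2 * l), C * np.ones(2 * l), C * l * nu]   # objective of (10), scaled by l
    return A, B, y, c
```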

⁴p is just a prototype generating the class of densities P. Normalization assumptions are made merely for ease of notation.

Next we compute the corresponding dual optimization problem together with the constraints and KKT conditions. This is needed for the optimization strategy, since we want to find a solution by finding a set of variables that is both primal and dual feasible and satisfies the KKT conditions. For details see [10, 7]. We obtain

$$\begin{array}{ll}
\text{minimize} & y^\top z \\
\text{subject to} & c - g - A^\top z = 0, \quad B^\top z = 0, \quad g, z \ge 0,
\end{array}$$

with g ∈ R^{4ℓ+1}, z ∈ R^{2ℓ}, together with the KKT conditions

$$s_i z_i = 0 \;\text{ for all } 1 \le i \le 2\ell, \qquad a_i g_i = 0 \;\text{ for all } 1 \le i \le 4\ell + 1. \tag{13}$$

Hence we have to solve

$$\begin{array}{rcl}
A\,\Delta a + B\,\Delta b - \Delta s &=& -A a - B b - y + s \;=:\; \rho_A \\
-\Delta g - A^\top \Delta z &=& -c + g + A^\top z \;=:\; \rho_g \\
B^\top \Delta z &=& -B^\top z \;=:\; \rho_B \\
s_i \Delta z_i + z_i \Delta s_i &=& \mu - s_i z_i - \Delta s_i \Delta z_i \;=:\; \rho_{sz,i} \\
a_i \Delta g_i + g_i \Delta a_i &=& \mu - a_i g_i - \Delta a_i \Delta g_i \;=:\; \rho_{ag,i}
\end{array} \tag{14}$$

iteratively by a predictor-corrector method while decreasing μ, until the gap between the primal and dual objective functions is sufficiently small.⁵ Manual pivoting of (14) for Δg, Δa, and Δs yields

$$\begin{array}{rcl}
\Delta g &=& -\rho_g - A^\top \Delta z \\
\Delta a &=& g^{-1}\rho_{ag} - a g^{-1}\,\Delta g \;=\; g^{-1}\rho_{ag} + a g^{-1}\rho_g + a g^{-1} A^\top \Delta z \\
\Delta s &=& z^{-1}\rho_{sz} - s z^{-1}\,\Delta z.
\end{array} \tag{15}$$

Here a, g, s, z denote diagonal matrices with entries taken from the corresponding vectors a, g, s, z. Finally, Δz and Δb are found as the solution of the so-called reduced KKT system

$$\begin{bmatrix} A\, a\, g^{-1} A^\top + s z^{-1} & B \\ B^\top & 0 \end{bmatrix}
\begin{bmatrix} \Delta z \\ \Delta b \end{bmatrix}
= \begin{bmatrix} \rho_z \\ \rho_B \end{bmatrix} \tag{16}$$

with ρ_z = ρ_A − A g^{-1} ρ_{ag} − A a g^{-1} ρ_g + z^{-1} ρ_{sz}. This (2ℓ+1)-dimensional system (16) could be reduced further by applying an orthogonal transformation, such that only ℓ-dimensional matrices have to be inverted to solve for Δz and Δb.
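For illustration, a dense solve of the reduced KKT system (16) could look like the following hypothetical sketch, where A and B are the matrices from (12) and a, g, s, z are the current positive iterates stored as vectors:

```python
import numpy as np

def solve_reduced_kkt(A, B, a, g, s, z, rho_z, rho_B):
    # left upper block: A diag(a/g) A^T + diag(s/z), cf. (16)
    M = A @ np.diag(a / g) @ A.T + np.diag(s / z)
    k = B.shape[1]                                        # number of free variables (here 1)
    KKT = np.block([[M, B], [B.T, np.zeros((k, k))]])
    sol = np.linalg.solve(KKT, np.r_[rho_z, np.atleast_1d(rho_B)])
    return sol[:M.shape[0]], sol[M.shape[0]:]             # delta z, delta b
```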

After finding a good starting value (cf. e.g. [10]), one iterates over the system until the gap between the primal and dual objective functions becomes sufficiently small, while keeping the feasibility conditions (i.e., the linear constraints) satisfied. It is beyond the scope (and space) of this paper to explain these issues, which are well known in optimization theory, in detail.

⁵In both the predictor and the corrector step, quadratic dependencies in the KKT conditions are ignored. Moreover, the predictor step sets μ = 0.


Our only aim was to show how the linear programming problem arising from LP regression can be solved efficiently. In our experiments we used a MATLAB implementation of the interior point algorithm described above.

6 Experiments

In this first study, we merely show a set of toy experiments illustrating some crucial points of ν-LPR. We used the kernel k(x, y) = exp(−‖x − y‖²/0.2), C = 100, and the target function f(x) = cos(x)(sin(5x) + sin(4x)) + 1. A number ℓ of training points was generated as (x_i, f(x_i) + ξ), where ξ is a random variable, normally distributed with standard deviation σ. A minimal sketch of this setup is given below.
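The data-generating setup just described can be sketched as follows; the input range is an assumption read off Figures 1 and 2 rather than stated in the text.

```python
import numpy as np

def make_toy_data(l=80, sigma=0.3, seed=0):
    # training points (x_i, f(x_i) + xi) with Gaussian noise of standard deviation sigma
    rng = np.random.default_rng(seed)
    x = rng.uniform(1.5, 4.0, size=l)                        # assumed input range
    f = np.cos(x) * (np.sin(5 * x) + np.sin(4 * x)) + 1.0    # target function
    return x, f + rng.normal(scale=sigma, size=l)
```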

We obtained the following results.

- ν-LPR automatically adapts to the noise level in the data. Figure 1 shows how the algorithm, with the same parameter ν, deals with two problems which differ in the noise that is added to the targets, by increasing the ε-parameter automatically.

- ν controls the fraction of points outside the tube. Figure 2 presents two different solutions to the same problem, with different values of ν, leading to different sizes of the ε-tubes. Averaged over 100 training sets, one can see how the fraction of points nicely increases with ν (Figure 3).

- The test error (in the L2 metric), averaged over 100 training sets, exhibits a rather flat minimum in ν (Figure 4). This indicates that, just as for ν-SVR, where corresponding results have been obtained on toy data as well as real-world data, ν is a well-behaved parameter in the sense that slightly misadjusting it is not harmful. As a side note, we add that the minimum is very close to the value which asymptotic calculations for the case of Gaussian noise predict (0.54, cf. [5, 8]).

7 Summary

We have presented an algorithm which uses an ε-insensitive loss function and an L1 sparsity regularizer to estimate regression functions via SV kernel expansions. Via a parameter ν, it automatically adapts the accuracy ε to the noise level in the data.


Figure 1: ν-LPR with ν = 0.2, run on two datasets with very small (σ = 0.05) and large (σ = 0.3) noise (ℓ = 80 points), automatically adapts the ε-tube to the data (ε = 0.06 and 0.5, respectively). Patterns inside the ε-tube are plotted as 'x', all others as 'o'. The solid line shows the fit, the dotted line the tube and the dash-dotted line the original data.


Figure 2: Running ν-LPR with different values of ν (left: 0.2, right: 0.8) on the same dataset (ℓ = 80 points with noise σ = 0.3) automatically adapts the ε-tube (to ε = 0.5 and 0.05, respectively) such that at most a fraction ν of the points is outside (cf. Prop. 1).

At least two extensions of the present algorithm are possible. One can include semiparametric modelling techniques [9] (which is easily done by replacing the vector B by a matrix corresponding to the basis functions to be chosen beforehand) as well as non-constant tube shapes (or parametric variants thereof) to deal with heteroscedastic noise [5].

Future experiments should evaluate how the present linear programming algorithm compares to the standard SV regularization via Σ_{ij} α_i α_j k(x_i, x_j) in terms of sparsity, accuracy, and computational complexity on real-world data.

Acknowledgements The authors would like to thank Peter Bartlett and Robert Williamson for helpful discussions. This work was supported by a grant of the DFG (Ja 379/51).



Figure 3: Number of SVs (i.e., points outside or on the edge of the tube) vs. ν (100 runs on training sets of size ℓ = 50). Note that ν directly controls the number of SVs.


Figure 4: Mean squared error of the regression estimate vs. ν (100 runs on training sets of size ℓ = 50), with a flat minimum close to the theoretically predicted value of 0.54 (cf. text).

References

[1] M. A. Aizerman, E. M. Braverman, and L. I. Rozonoér. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.

[2] K. Bennett. Combining support vector and mathematical programming methods for induction. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - SV Learning, pages 307–326, Cambridge, MA, 1999. MIT Press.

[3] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, 5th Annual ACM Workshop on COLT, pages 144–152, Pittsburgh, PA, 1992. ACM Press.

[4] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. Technical Report 479, Department of Statistics, Stanford University, 1995.

[5] B. Schölkopf, A. Smola, R. Williamson, and P. Bartlett. New support vector algorithms. Technical Report NC-TR-98-027, NeuroCOLT2, University of London, UK, 1998.

[6] A. Smola, N. Murata, B. Schölkopf, and K.-R. Müller. Asymptotically optimal choice of ε-loss for support vector machines. In L. Niklasson, M. Bodén, and T. Ziemke, editors, Proceedings of ICANN'98, Perspectives in Neural Computing, pages 105–110, Berlin, 1998. Springer Verlag.

[7] A. Smola, B. Schölkopf, and K.-R. Müller. Convex cost functions for support vector regression. In L. Niklasson, M. Bodén, and T. Ziemke, editors, Proceedings of the 8th International Conference on Artificial Neural Networks, Perspectives in Neural Computing, Berlin, 1998. Springer Verlag.

[8] A. J. Smola. Learning with Kernels. PhD thesis, Technische Universität Berlin, 1998.

[9] A. J. Smola, T. Frieß, and B. Schölkopf. Semiparametric support vector and linear programming machines. In Advances in Neural Information Processing Systems 11. MIT Press, 1998. In press.

[10] R. J. Vanderbei. Linear Programming: Foundations and Extensions. Kluwer Academic Publishers, Hingham, MA, 1997.

[11] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

[12] J. Weston, A. Gammerman, M. Stitson, V. Vapnik, V. Vovk, and C. Watkins. Support vector density estimation. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 293–306, Cambridge, MA, 1999. MIT Press.
