
Radial Basis Function Parameter Estimation

2.1 General Methods

2.1.1 Levenberg-Marquardt

A simple approach to minimizing the merit function $\chi^2$¹ is to use information about the gradient of the function to step along the surface of the function "downhill" towards a minimum, i.e. update the parameters $\mathbf{a}$ at each iteration using the rule

$\Delta\mathbf{a} = -\eta\,\nabla\chi^2(\mathbf{a})$    (2.2)

where the gradient $\nabla\chi^2(\mathbf{a})$ is composed of the $p$ first derivatives $\partial\chi^2/\partial a_k$ of $\chi^2$, and $\eta$ is a small positive constant. This, for instance, is one of the estimation strategies proposed in Poggio and Girosi (1990) for radial basis functions.

¹ This choice of $\chi^2$ is often motivated by assuming independent and normally distributed measurement errors with constant variance, in which case minimizing $\chi^2$ is a maximum likelihood estimator, but in fact it is a reasonable choice even if those assumptions are not true.

The problem with the gradient descent approach is in choosing $\eta$: we'd like it to be small, so that we stay on the $\chi^2$ surface and thus ensure we make progress moving downhill, but we'd also like it to be big so that we converge to the solution quickly.

Solutions to this dilemma include varying $\eta$ in response to how well previous steps worked, or iteratively finding the minimum in the direction of the gradient (i.e. "line minimization").
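As a concrete illustration of the update rule (2.2), the following sketch applies fixed-step gradient descent to a toy least-squares merit function. The linear model, the particular step size $\eta$, and the stopping tolerance are illustrative assumptions, not choices made in this chapter.

```python
import numpy as np

# Minimal sketch of the gradient descent rule (2.2) on a toy chi^2,
# here the plain sum of squared residuals of a linear model y ~ X a.
# The data, fixed step size eta, and stopping tolerance are assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
a_true = np.array([2.0, -3.0, 5.0])
y = X @ a_true + 0.1 * rng.normal(size=100)

def chi2(a):
    r = y - X @ a
    return r @ r

def grad_chi2(a):
    # gradient of sum((y - X a)^2) with respect to the parameters a
    return -2.0 * X.T @ (y - X @ a)

a = np.zeros(3)
eta = 1e-3                        # the small positive constant in (2.2)
for _ in range(5000):
    step = -eta * grad_chi2(a)    # delta a = -eta * grad chi^2(a)
    a = a + step
    if np.linalg.norm(step) < 1e-10:
        break

print(chi2(a), a)                 # a should approach a_true
```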

The Levenberg-Marquardt method (see Marquardt (1963)) takes a different approach, by recognizing that the curvature of the function gives us some information about how far to move along the slope of the function. It approximates the $\chi^2$ function with a second order Taylor series expansion around the current point $\mathbf{a}_0$:

$\chi^2(\mathbf{a}) \approx \chi^2(\mathbf{a}_0) + \nabla\chi^2(\mathbf{a}_0)^T \Delta\mathbf{a} + \frac{1}{2}\,\Delta\mathbf{a}^T \mathbf{H}\,\Delta\mathbf{a}$    (2.3)

where $\mathbf{H}$ is the Hessian matrix evaluated at $\mathbf{a}_0$, i.e.

$[\mathbf{H}]_{kl} \equiv \left.\frac{\partial^2 \chi^2}{\partial a_k\,\partial a_l}\right|_{\mathbf{a}_0}$    (2.4)

Since the approximating function is quadratic, its minimum can easily be moved to using the step

$\Delta\mathbf{a} = -\mathbf{H}^{-1}\,\nabla\chi^2(\mathbf{a}_0)$    (2.5)

However, this approximation will not always be a good one (especially early in the estimation process), and thus the Levenberg-Marquardt method allows the user to adopt any combination of the simple gradient descent rule and the inverse Hessian rule by multiplying the diagonal of $\mathbf{H}$ by a constant, i.e. $[\mathbf{H}']_{kk} \equiv [\mathbf{H}]_{kk}(1+\lambda)$. Thus a typical iterative strategy is to use more of a gradient descent step by increasing $\lambda$ when the previous $\Delta\mathbf{a}$ doesn't work (i.e. increases $\chi^2$), and to use more of an inverse Hessian step by decreasing $\lambda$ when the previous $\Delta\mathbf{a}$ does work (i.e. decreases $\chi^2$).
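A compact sketch of this $\lambda$-adjustment strategy is given below; it is written for a model supplied as a residual vector and its Jacobian. The Gauss-Newton approximation $\mathbf{H} \approx 2\mathbf{J}^T\mathbf{J}$ (which drops the second derivatives of the model, as discussed below), the factor-of-ten updates to $\lambda$, and the fixed iteration count are common conventions assumed here, not details taken from the text.

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, a0, lam=1e-3, n_iter=100):
    """Sketch of the Levenberg-Marquardt iteration described above.

    residual(a) -> r, the residual vector, so chi^2(a) = r @ r.
    jacobian(a) -> J, with J[i, k] = d r_i / d a_k.
    The Gauss-Newton Hessian H ~ 2 J^T J (second derivatives of the
    model dropped) and the factor-of-ten lambda updates are assumed
    conventions, not prescriptions from the text.
    """
    a = np.asarray(a0, dtype=float)
    r = residual(a)
    chi2 = r @ r
    for _ in range(n_iter):
        J = jacobian(a)
        grad = 2.0 * J.T @ r                       # gradient of chi^2 at a
        H = 2.0 * J.T @ J                          # approximate Hessian
        H_prime = H + lam * np.diag(np.diag(H))    # H'_kk = H_kk (1 + lambda)
        da = -np.linalg.solve(H_prime, grad)
        r_new = residual(a + da)
        chi2_new = r_new @ r_new
        if chi2_new < chi2:            # step worked: lean on inverse Hessian
            a, r, chi2 = a + da, r_new, chi2_new
            lam /= 10.0
        else:                          # step failed: lean on gradient descent
            lam *= 10.0
    return a
```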

Some comments about using the Hessian for RBF estimation are in order. First, note that since the function $f$ inside of $\chi^2$ is an RBF model, calculating the Hessian can be done analytically, although following Press et al. (1988) we drop the second derivatives of $f$ in evaluating $\mathbf{H}$ to minimize the effects of outliers on the quadratic approximation. Second, the large number of parameters in RBF models is problematic; the added cost of assembling and inverting the Hessian may not be worth the speedup in convergence gained for large models. In fact, even gradient descent may be unsatisfactory for very non-smooth cost functions; we will pursue this thought in the next section.

Perhaps a more immediate problem is that the Hessian matrix is likely to be ill-conditioned for large models, causing numerical problems for the inversion procedure despite the tendency of the Levenberg-Marquardt method to avoid these problems by increasing $\lambda$ and making the Hessian diagonally dominant. This problem is especially severe early in the estimation process, when we are far from a minimum and there are many conflicting ways the parameters could be improved. We avoid this problem by using a Singular Value Decomposition for inverting $\mathbf{H}'$, and zeroing singular values less than $10^{-6}$ times the largest value.
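A sketch of that SVD-based inversion safeguard is shown below; the cutoff of $10^{-6}$ times the largest singular value follows the text, while the function name and interface are illustrative assumptions.

```python
import numpy as np

def svd_inverse(H_prime, rel_tol=1e-6):
    """Invert H' via SVD, zeroing singular values below rel_tol * s_max.

    Sketch of the safeguard described in the text; the name and
    interface are illustrative.
    """
    U, s, Vt = np.linalg.svd(H_prime)
    # Zero (rather than invert) the small singular values, so that
    # near-degenerate parameter directions contribute nothing.
    s_inv = np.zeros_like(s)
    keep = s > rel_tol * s.max()
    s_inv[keep] = 1.0 / s[keep]
    return Vt.T @ np.diag(s_inv) @ U.T

# Inside the Levenberg-Marquardt step: da = -svd_inverse(H_prime) @ grad
```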

On the other hand, the inverse Hessian provides valuable information about the accuracy of parameter estimates, and can be useful for pruning unnecessary parameters from our model. If the quadratic approximation made above is reasonably accurate and the residuals from our model are normally distributed, $2\mathbf{H}^{-1}$ is an estimate of the covariance matrix of the standard errors in the fitted parameters $\mathbf{a}$ (see Press et al. (1988)), thus a t-test on $a_k / \sqrt{2\,[\mathbf{H}^{-1}]_{kk}}$ can be used to determine if the $k$-th parameter should be removed from the model. Even if the residuals are not normally distributed, the above test may be a reasonable way to identify unnecessary parameters. Hassibi and Stork (1992) use this approach for multilayer perceptron networks, and give some evidence that it is superior to schemes that assume diagonal dominance of the Hessian.
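Under the stated assumptions (quadratic approximation, normal residuals, $2\mathbf{H}^{-1}$ as the parameter covariance), the per-parameter test looks roughly as follows. Using a normal reference distribution for the statistic, and the helper name itself, are simplifications assumed here.

```python
import numpy as np
from scipy import stats

def parameter_pruning_test(a, H_inv):
    """Per-parameter significance test from the inverse Hessian.

    Assumes 2 * H_inv estimates the covariance of the fitted parameters,
    as in the text; a normal reference distribution stands in for the
    t-test as a simplification.
    """
    se = np.sqrt(2.0 * np.diag(H_inv))    # standard error of each a_k
    t = a / se                            # test statistic per parameter
    p = 2.0 * stats.norm.sf(np.abs(t))    # two-sided p-value
    return t, p                           # large p => candidate for pruning
```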

However, we would like to point out the advantage of looking at eliminating multiple parameters simultaneously. In RBF networks, for instance, it makes little sense to talk about eliminating just the scale parameter from one basis function. Similarly, why should we eliminate the coefficient $c$ connecting a basis function to the output without eliminating all of the parameters of that basis function? Instead it makes sense to consider the change in our cost function that would occur if we eliminated an entire basis function at once. If the above quadratic approximation holds and we are at the minimum, the increase in $\chi^2$ for setting $q$ parameters to zero is

$\Delta\chi^2 = \frac{1}{2}\,\mathbf{a}^T \mathbf{P}^T \left[\mathbf{P}\,\mathbf{H}^{-1}\,\mathbf{P}^T\right]^{-1} \mathbf{P}\,\mathbf{a}$    (2.6)

where $\mathbf{P}$ is a $q \times p$ projection matrix which selects the $q$ dimensions of interest from $\mathbf{a}$ and $\mathbf{H}^{-1}$. If our model residuals are normally distributed, this quantity follows a $\chi^2$ distribution with $q$ degrees of freedom and we have an exact confidence test.

Note that commonly we will be uncertain both about the normality assumption and the accuracy of the quadratic approximation, and thus we typically must take this statistic with a grain of salt, especially if we attempt to test a confidence region that is large with respect to the nonlinear structure in the $\chi^2$ surface.
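A sketch of the joint test (2.6) is given below; the index-based construction of $\mathbf{P}$, the use of the $\chi^2$ survival function for the p-value, and the function name are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def joint_pruning_test(a, H_inv, idx):
    """Increase in chi^2 for setting the parameters a[idx] to zero, per (2.6).

    P is the q x p projection matrix selecting the parameters in idx;
    under normal residuals the statistic is chi^2 with q degrees of freedom.
    """
    q, p_dim = len(idx), len(a)
    P = np.zeros((q, p_dim))
    P[np.arange(q), idx] = 1.0       # selects the q dimensions of interest
    Pa = P @ a
    S = P @ H_inv @ P.T              # q x q block of H^{-1}
    delta_chi2 = 0.5 * Pa @ np.linalg.solve(S, Pa)
    p_value = stats.chi2.sf(delta_chi2, df=q)
    return delta_chi2, p_value

# e.g. testing whether an entire basis function (say its output coefficient
# and scale, hypothetically parameters 0 and 4) can be dropped together:
# joint_pruning_test(a_hat, H_inv, idx=[0, 4])
```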

As a simple example of the above pruning methods, we can model a linear equation with an RBF network and see if the above diagnostics tell us to drop the nonlinear unit. To check this we fit the equation $y = 2x_1 - 3x_2 + 5 + \epsilon$, where $x_1$ and $x_2$ are evenly spaced along $[-50, 50]$ and $\epsilon \sim N(0, 0.1)$ is independent gaussian noise, and we used an RBF model with one gaussian basis function². The results in Table 2.1 show that the above tests correctly reject the single nonlinear basis function parameters, both individually and jointly.

² Note that because of the degeneracy of this simple example, the input weights $\mathbf{W}$ and the center $\vec{z}$ were held fixed to prevent complete dependence between the parameters.


Parameter    Value        $\chi^2_1$     p-value
$c_1$        0.0004276    -1.836e-07     0.9997
$c_2$        2            -1.298e+04     0.0000
$c_3$        -3           -12.5          0.0004
$c_4$        5            6.467e+08      0.0000
$\sigma$     0.6408       -0.4107        0.5216

Table 2.1: Example of pruning RBF parameters using the inverse Hessian (see text for details). The p-value given is the probability that the confidence region includes zero, which is significant for the scale parameter $\sigma$ and the linear coefficient $c_1$ of the gaussian unit. The joint test for these parameters yields a $\chi^2_2$ statistic of 0.4106 with a p-value of 0.8144. Thus both tests correctly indicate that we should drop the nonlinear unit.

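For concreteness, a sketch of the data behind Table 2.1 is given below. The grid of inputs, the fixed gaussian center at the origin, and the fixed scale are illustrative assumptions (interpreting "evenly spaced along [-50, 50]" as a two-dimensional grid, and following footnote 2 in holding the center and input weights fixed); only the coefficients $c_1, \ldots, c_4$ are fit here, by ordinary least squares, rather than running the full Levenberg-Marquardt estimation and pruning tests.

```python
import numpy as np

# Sketch of the synthetic test behind Table 2.1 (assumptions noted above).
rng = np.random.default_rng(0)
grid = np.linspace(-50.0, 50.0, 21)
X1, X2 = np.meshgrid(grid, grid)       # "evenly spaced" read as a 2-D grid
x1, x2 = X1.ravel(), X2.ravel()
noise = rng.normal(0.0, 0.1, size=x1.size)   # 0.1 taken as the std dev
y = 2.0 * x1 - 3.0 * x2 + 5.0 + noise

# One gaussian unit with fixed center z = (0, 0) and fixed (assumed) scale.
scale = 25.0
g = np.exp(-(x1**2 + x2**2) / (2.0 * scale**2))

# Design matrix [gaussian unit, x1, x2, bias] -> coefficients c1..c4.
D = np.column_stack([g, x1, x2, np.ones_like(x1)])
c_hat, *_ = np.linalg.lstsq(D, y, rcond=None)
print(c_hat)   # c2, c3, c4 should be near 2, -3, 5; c1 near 0
```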