
$$
X = \begin{bmatrix}
h_1(\|x_1 - z_1\|) & \cdots & h_k(\|x_1 - z_k\|) & x_1^1 & \cdots & x_1^d & 1 \\
h_1(\|x_2 - z_1\|) & \cdots & h_k(\|x_2 - z_k\|) & x_2^1 & \cdots & x_2^d & 1 \\
\vdots & & \vdots & \vdots & & \vdots & \vdots \\
h_1(\|x_n - z_1\|) & \cdots & h_k(\|x_n - z_k\|) & x_n^1 & \cdots & x_n^d & 1
\end{bmatrix}
$$

and $x_i^j$ denotes the $j$-th component of observation $i$. Inverting $X^T X$, however, is unusually subject to numerical problems because of the likelihood in our high-dimensional nonlinear setting of the matrix being ill-conditioned. Dyn and Levin (1983) proposed a solution to this problem in the context of vanilla interpolation RBFs that preconditioned the $X$ matrix through the use of an iterated Laplacian operator, which tends to make the matrix diagonally dominant and thus more easily inverted. We instead adopt the solution of using a robust inversion routine; in particular, following Press et al. (1988) we use the Singular Value Decomposition method for solving (2.8), zeroing singular values that are six orders of magnitude smaller than the largest singular value. As noted in further discussion of this issue in Chapter 6, however, numerical stability alone does not guarantee good, meaningful solutions to the overall problem. We also note that depending on the cost function of interest, other linear solvers (e.g. least median or least $L_1$ regression) may be more appropriate than least squares.
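For concreteness, the following sketch (our own NumPy illustration, using Gaussian basis functions and invented variable names rather than anything prescribed by the text) builds a matrix with the layout of $X$ above and solves the resulting least-squares problem by SVD, zeroing singular values six orders of magnitude below the largest:

import numpy as np

def gaussian_basis(r, scale=1.0):
    # One illustrative choice of radial basis function h_i; the text leaves h_i general.
    return np.exp(-(r / scale) ** 2)

def design_matrix(x, centers, scale=1.0):
    # x: (n, d) observations; centers: (k, d) RBF centers z_i.
    # Columns: the k basis-function outputs, then the d raw input components, then a
    # constant 1, matching the layout of the matrix X shown above.
    r = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)   # (n, k) distances
    return np.hstack([gaussian_basis(r, scale), x, np.ones((len(x), 1))])

def svd_solve(X, y, rel_tol=1e-6):
    # Least-squares solution of X b = y via the SVD, zeroing singular values more than
    # six orders of magnitude below the largest one.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_inv = np.zeros_like(s)
    keep = s > rel_tol * s.max()
    s_inv[keep] = 1.0 / s[keep]
    return Vt.T @ (s_inv * (U.T @ y))

# Example on synthetic data: 200 three-dimensional inputs, 10 centers drawn from the data.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
z = x[rng.choice(200, size=10, replace=False)]
y = np.sin(x[:, 0]) + 0.1 * rng.normal(size=200)
b = svd_solve(design_matrix(x, z), y)   # basis coefficients, linear terms, and bias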

2.2.2 Relation to Kernel Regression

There is an obvious similarity between RBF models and the class of models derived from a standard statistical method known as kernel regression. The goal in kernel regression is to come up with nonparametric models of $\mu$ in the system

$$y_j = \mu(x_j) + \epsilon_j, \qquad j = 1, \ldots, n \tag{2.9}$$

where we are given $n$ observations $\{(x_j, y_j) : j = 1, \ldots, n\}$, and the $\epsilon_j$ are zero-mean, uncorrelated random variables. For ease of exposition we will restrict ourselves to scalar-valued $x_j$ here, although all of the results discussed have straightforward extensions to the vector case. The general form of a kernel estimator is a weighted average of the example outputs $y_j$ near the input $x$, i.e.

$$\hat{\mu}(x) = \sum_{j=1}^{n} y_j K_\lambda(x, x_j) \tag{2.10}$$

where $K(u)$ is the kernel function, parameterized by a bandwidth or smoothing parameter $\lambda$. Typically assumptions are made about the function $K$, for instance that its integral is 1 and that it is symmetric about the origin.

We note that Equation (2.10) can be seen as a restrictive form of the general RBF Equation (1.1). This is obviously the case for a carefully chosen set of RBF parameters; in particular, Equation (1.1) reduces to Equation (2.10) if we use the observed outputs $y$ as the coefficients $c$, all $n$ observations as centers, the kernel function $K$ for each basis function $h_i$, the smoothing parameter $\lambda$ for each basis function scale parameter, the simple Euclidean norm for the distance measure (i.e. $W = I$), and if we drop the polynomial term $p$.
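As a small numerical check of this correspondence (a sketch using a Gaussian kernel of our own choosing; none of the function names come from the text):

import numpy as np

def K(u, lam=0.5):
    # An illustrative Gaussian kernel with bandwidth lam (not a choice made in the text).
    return np.exp(-(u / lam) ** 2)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
y = np.sin(3 * x) + 0.1 * rng.normal(size=50)

def kernel_estimate(x0):
    # Equation (2.10): a weighted sum of the observed outputs y_j.
    return np.sum(y * K(np.abs(x0 - x)))

def restricted_rbf(x0):
    # Equation (1.1) with c = y, all observations as centers z, h_i = K for every i,
    # W = I (plain Euclidean distance), and the polynomial term dropped.
    c, z = y, x
    return np.sum(c * K(np.abs(x0 - z)))

assert np.isclose(kernel_estimate(0.3), restricted_rbf(0.3))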

However, there is a broader connection between RBFs and kernel regression in the context of estimation than the above correspondence of variables implies. To see this, consider solving the linear portion of the RBF equations once we have fixed the parameters "inside" the basis functions. At this point Equation (1.1) is just

$$y = \vec{c} \cdot \vec{h}(\vec{x}, W, \vec{z}, \vec{\sigma}) \tag{2.11}$$

which must be satisfied at every data point $(\vec{x}, y)$. If we choose the typical least squares solution to this linear problem, this becomes

$$y = \vec{y}\,(H^T H)^{-1} H^T\, \vec{h}(\vec{x}, W, \vec{z}, \vec{\sigma}) \tag{2.12}$$

where $\vec{y}$ is the vector of all outputs in the data set, and $H$ is the matrix of all basis function outputs for the data set. Thus we can think of RBF networks as being linear combinations of the outputs $\vec{y}$, and note that the kernel regression of Equation (2.10) is a dual form of the RBF representation if we choose the kernel function $K_\lambda$ such that

$$K_\lambda = (H^T H)^{-1} H^T\, \vec{h}(\vec{x}, W, \vec{z}, \vec{\sigma}) \tag{2.13}$$

This duality encourages us to apply results from the kernel regression literature, at least in this restrictive sense. We point out a few such results here; interested readers are encouraged to pursue others in the excellent review in Eubank (1988).
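To make the dual form concrete, the sketch below (our own NumPy illustration; here $H$ is taken as the n-by-k matrix of basis-function outputs with one row per observation, which may differ from the text's ordering of the factors) computes the effective weights that re-express an RBF prediction as a linear combination of the observed outputs:

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=(100, 1))
y = np.cos(2 * x[:, 0]) + 0.05 * rng.normal(size=100)
z = x[rng.choice(100, size=8, replace=False)]          # centers

def h(points, centers, scale=0.4):
    # Basis-function outputs: one Gaussian bump per center (an illustrative choice).
    r = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(r / scale) ** 2)

H = h(x, z)                                            # n x k matrix of basis outputs
c = np.linalg.lstsq(H, y, rcond=None)[0]               # least-squares coefficients

x0 = np.array([[0.25]])
h0 = h(x0, z)[0]                                       # basis outputs at the query point

# Effective kernel weights: the RBF prediction h0 . c equals a weighted sum of the
# outputs y, which is the kernel-regression-like dual form discussed above.
w = H @ np.linalg.solve(H.T @ H, h0)
assert np.isclose(w @ y, h0 @ c)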

First, a basic observation is that the most important problem associated with the use of a kernel estimator is the selection of a good value for the smoothing parameter $\lambda$ (and thus, by our above duality principle, the RBF scale parameters $\vec{\sigma}$ and $W$).

Although asymptotic results concerning the optimal choice of $\lambda$ have been obtained (e.g. see Theorem 4.2 in Eubank (1988)), their practical usefulness is dubious since they rely on knowledge of the unknown function $\mu$. Similarly, in deriving RBF networks from regularization theory and adopting a Bayesian interpretation, Poggio and Girosi (1990) note that $\vec{\sigma}$ and $W$ are in principle specified by the prior probability distribution of the data, although in practice this can be quite difficult to estimate. Thus from either perspective, a reasonable approach for these parameters is to set them from the data using some robust technique such as cross-validation.
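A minimal sketch of this cross-validation idea for the RBF scale parameter (the Gaussian basis functions, fixed grid of centers, and candidate scales below are our own illustrative choices, not a prescription from the text):

import numpy as np

def rbf_fit_predict(x_tr, y_tr, x_te, centers, scale):
    # Plain least-squares RBF fit with Gaussian basis functions of a given scale.
    def H(pts):
        return np.exp(-((pts[:, None] - centers[None, :]) / scale) ** 2)
    c = np.linalg.lstsq(H(x_tr), y_tr, rcond=None)[0]
    return H(x_te) @ c

def cv_error(x, y, centers, scale, folds=5):
    # k-fold cross-validation error for one candidate scale parameter.
    idx = np.arange(len(x)) % folds
    err = 0.0
    for f in range(folds):
        tr, te = idx != f, idx == f
        pred = rbf_fit_predict(x[tr], y[tr], x[te], centers, scale)
        err += np.sum((y[te] - pred) ** 2)
    return err / len(x)

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=100)
y = np.sin(2 * x) + 0.2 * rng.normal(size=100)
centers = np.linspace(-2, 2, 9)

scales = [0.1, 0.25, 0.5, 1.0, 2.0]
best_scale = min(scales, key=lambda s: cv_error(x, y, centers, s))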

Other kernel regression results have implications for how we choose the basis function in RBF models. In particular, the Nadaraya-Watson kernel estimator

$$\hat{\mu}(x) = \frac{\sum_{j=1}^{n} y_j K_\lambda(x, x_j)}{\sum_{j=1}^{n} K_\lambda(x, x_j)} \tag{2.14}$$

was originally derived for use when the $(x_j, y_j)$ are independent and identically distributed as a continuous bivariate random variable $(X, Y)$ (i.e. the $x_j$ are not fixed experimental design points). Note that in this case $\hat{\mu}(x)$ estimates the conditional mean $E[Y \mid X = x]$, and the $\epsilon_j$ will be independent but not identically distributed.

Under certain restrictions this estimator is known to have a variety of consistency properties (see Eubank (1988)). For RBF modeling problems that fit the above assumptions, then, a natural constraint to place on the basis functions $h$ is to normalize their outputs in the above fashion.
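A sketch of the Nadaraya-Watson estimator of Equation (2.14), whose weights sum to one and thus illustrate the kind of normalized basis-function outputs suggested here (the Gaussian kernel and bandwidth are our own illustrative assumptions):

import numpy as np

def nadaraya_watson(x0, x, y, lam=0.3):
    # Equation (2.14): a weighted average of the outputs y_j in which the weights
    # are normalized to sum to one.
    w = np.exp(-((x0 - x) / lam) ** 2)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=60)
y = x ** 2 + 0.05 * rng.normal(size=60)

print(nadaraya_watson(0.5, x, y))   # close to 0.25 for this smooth target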

If our inputs (i.e. the $x_j$'s) are nonstochastic, we can be more specific and suggest a particular choice for the basis functions. It is well known that if we want to minimize the sum squared error cost function from Equation (2.1), the asymptotically optimal choice of unparameterized kernel is the quadratic or Epanechnikov kernel (see Epanechnikov (1969))

$$K(u) = \begin{cases} 0.75\,(1 - u^2) & |u| \le 1 \\ 0 & |u| > 1 \end{cases} \tag{2.15}$$

Note that to obtain this result we must make some assumption about the second moment of the kernel such as

$$\int_{-\infty}^{\infty} u^2 K(u)\, du \neq 0 \tag{2.16}$$

so that we cannot shrink the kernel functions (and thus the cost function) arbitrarily small.
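The Epanechnikov kernel of Equation (2.15) is straightforward to implement and can be dropped into the kernel averages above; a minimal sketch:

import numpy as np

def epanechnikov(u):
    # Equation (2.15): 0.75 * (1 - u^2) for |u| <= 1, and zero outside that interval.
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def epanechnikov_weights(x0, x, lam):
    # Scaled by a bandwidth lam, e.g. for use as finite-support weights in (2.14).
    return epanechnikov((x0 - x) / lam)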

A final suggestion from the kernel estimation literature concerns estimation bias (i.e. $E[\hat{\mu}(x) - \mu(x)]$). It can be shown that kernels with support on the entire line have difficulties with bias globally, since it is typically impossible to choose one smoothing parameter to minimize bias everywhere. If the inputs $x$ are unequally spaced, the situation is even worse, since the estimator will effectively be computed from different numbers of observations in different input regions. Kernels with finite support (such as the Epanechnikov kernel above), however, localize the bias problems to the boundary of the data, where specialized boundary kernels can often be defined to lessen the bias (see Gasser and Müller (1979) for examples). Variable bandwidth estimators, which vary $\lambda$ depending on the density of $x$, can also help here, although it is not known how to do this optimally in general. For RBF networks this would correspond to using different input weights $W$ in different regions of the input space, an idea we will pursue further next.