
Prediction and Model Confidence

4.1 Prediction Confidence

4.1.3 RBF Case

Can we derive similar confidence intervals for RBF network outputs? In a restricted sense we can; although our method is shown to be asymptotically correct only for RBF networks of the kernel type (see Chapter 2), in fact it may nonetheless be useful for more general cases. Before we present our result, however, two notable differences from the linear case above deserve some discussion.

A first difference from the linear case concerns the uniformity of the confidence interval across the input space. In the linear case the width of the confidence intervals is relatively uniform, partly because of our assumption of i.i.d. errors, and partly because our variance estimate is a global function of all of the $\vec{x}_j$'s. However, as the function complexity and input dimensionality of the problems we consider increase, it becomes desirable to find more local estimates of confidence, in the sense that they depend strongly on local input data density and error estimates. In fact this is possible for the kernel-type RBF networks, due to their local representation.

The second difference from the linear case concerns bias. In the linear case it is easy to show that $\hat{f}(\vec{x})$ is an unbiased estimator of $f(\vec{x})$ (i.e. $E[\hat{f}(\vec{x}) - f(\vec{x})] = 0$), so that on average the only contribution to the error term is from the variance of the estimator. Unfortunately, this proof is not easy for the RBF case, and in fact the best results we know of in this regard are various asymptotic proofs of unbiasedness for kernel regression (e.g. Krzyżak (1986), Härdle (1990)). For this reason, from a practical viewpoint a careful assessment of the bias situation is in order for real problems with finite data sets, despite any claims we might make about unbiasedness in showing other asymptotic results.

We now present a pointwise confidence interval result for kernel-type RBF networks based on the heuristic method of Leonard, Kramer and Ungar (1991). Following the notation of RBF Equation (1.1), let us assume we use all of the data as centers (i.e. $k = n$ and $\vec{z}_i = \vec{x}_i$ for $i = 1, \ldots, n$), identical basis functions which are decreasing functions with maxima at the centers, and identical scale parameters (e.g. the $\sigma$ of the gaussian) which are decreasing functions of the number of data points $n$. Then our conjecture is that for a new input $\vec{x}$ the random variable

\[
\frac{\hat{f}(\vec{x}) - f(\vec{x})}{Q(\vec{x})} \tag{4.5}
\]

where

\[
Q(\vec{x}) = \sqrt{A}\;\frac{\sum_{i=1}^{k} h(\|\vec{x} - \vec{z}_i\|)\, s_i}{\sum_{i=1}^{k} h(\|\vec{x} - \vec{z}_i\|)},
\qquad
s_i = \sqrt{\frac{\sum_{j=1}^{n} h(\|\vec{z}_i - \vec{z}_j\|)\,\bigl(y_j - \hat{f}(\vec{z}_j)\bigr)^2}{\sum_{j=1}^{n} h(\|\vec{z}_i - \vec{z}_j\|)}},
\]

and

\[
A = \int h^2(u)\,du
\]

converges in distribution to $N(B, 1)$, where the bias term $B$ is a function of the derivatives of $f$ and the marginal density of the data $\vec{x}$ (denote this by $g(\vec{x})$). In cases where this bias term is judged to be small relative to the variance term above, we can then write approximate confidence intervals for the RBF case as

\[
\hat{f}_l(\vec{x}),\ \hat{f}_u(\vec{x}) = \hat{f}(\vec{x}) \pm z(\alpha/2)\, Q(\vec{x}) \tag{4.6}
\]

where $z(p)$ is the $100p$ quantile of the standard normal distribution.
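As a concrete illustration, the following is a minimal sketch of how Equations (4.5) and (4.6) might be evaluated for one-dimensional inputs, assuming Gaussian basis functions $h(u) = \exp(-u^2/2\sigma^2)$, all training inputs used as centers ($k = n$, $\vec{z}_i = \vec{x}_i$), and an already-fitted predictor passed in as f_hat; the function and variable names are illustrative, not from the text.

```python
import numpy as np
from scipy.stats import norm

def rbf_confidence_limits(x_new, x_train, y_train, f_hat, sigma, alpha=0.05):
    """Approximate pointwise confidence limits in the spirit of Eq. (4.6)."""
    h = lambda u: np.exp(-u**2 / (2.0 * sigma**2))           # Gaussian basis function

    # s_i: basis-weighted local standard deviation of the residuals at each
    # center z_i = x_i (the ratio of sums inside the square root defining s_i).
    resid2 = (y_train - f_hat(x_train)) ** 2                 # (y_j - f_hat(z_j))^2
    w_zz = h(np.abs(x_train[:, None] - x_train[None, :]))    # h(||z_i - z_j||), 1-D inputs
    s = np.sqrt(w_zz @ resid2 / w_zz.sum(axis=1))

    # A = integral of h(u)^2 du, evaluated numerically on a wide grid.
    u = np.linspace(-10.0 * sigma, 10.0 * sigma, 2001)
    A = np.trapz(h(u) ** 2, u)

    # Q(x): sqrt(A) times the basis-weighted spatial average of the s_i.
    w_x = h(np.abs(np.asarray(x_new)[:, None] - x_train[None, :]))
    Q = np.sqrt(A) * (w_x @ s) / w_x.sum(axis=1)

    z = norm.ppf(1.0 - alpha / 2.0)                          # two-sided normal quantile
    center = f_hat(x_new)
    return center - z * Q, center + z * Q
```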

Proof of this conjecture relies upon a similar result for kernel regression due to Härdle (1990). In particular we can see that our above conjecture is equivalent to a smoothed version of Härdle's Theorem 4.2.1, if we can show two things. First, we note that Härdle's assumptions must hold, notably that our scale parameter $\sigma$ must decrease at a rate proportional to $n^{-1/5}$, $f(\vec{x})$ and $g(\vec{x})$ must both be twice differentiable, the conditional variance of the error $\epsilon$ must be continuous around $\vec{x}$, and the basis functions $h(u)$ must have a bounded moment higher than the second.¹ Second,

¹Note that the bounded moment assumption excludes increasing basis functions such as the multiquadric.

we must show that the spatial averaging of our conditional variance estimate $s_i$ in $Q(\vec{x})$ is correct at the data, i.e. that $Q(\vec{x}_i) \to \sqrt{A}\, s_i$ as $n \to \infty$ for all $i = 1, \ldots, n$; but this is clearly the case if $\sigma \propto n^{-1/5}$ and the moments of $h(u)$ are bounded as above.

Thus to complete the analogy we note that the marginal density $g(\vec{x})$ is approximated in the limit by $\frac{1}{n}\sum_{i=1}^{n} h(\|\vec{x} - \vec{x}_i\|)$ given the above conditions.
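For intuition, this limiting approximation of $g(\vec{x})$ is simply a kernel density estimate. The small sketch below (with assumed names, one-dimensional inputs, and the same Gaussian basis as above) computes it; the normalizing constant is added so that the estimate integrates to one, whereas the expression in the text omits it.

```python
import numpy as np

def marginal_density_estimate(x_new, x_train, sigma):
    """Kernel estimate of g(x) in the spirit of (1/n) * sum_i h(||x - x_i||)."""
    h = lambda u: np.exp(-u**2 / (2.0 * sigma**2))            # same Gaussian basis
    w = h(np.abs(np.asarray(x_new)[:, None] - x_train[None, :]))
    # The 1/(sigma*sqrt(2*pi)) factor normalizes the Gaussian basis so the
    # estimate is a proper density in one dimension.
    return w.mean(axis=1) / (sigma * np.sqrt(2.0 * np.pi))
```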

4.1.4 Examples

A few examples will serve to illustrate some properties of the RBF pointwise confidence limits. First, a 1D example of these confidence limits is shown in Figure 4-2 for data chosen on an evenly spaced grid. Here the data were generated according to the relation $y = \sin x + x\,N(0, (0.05)^2)$. Selection of the basis function scale parameter $\sigma$ is a careful balance between the bias in $\hat{f}(\vec{x})$ from oversmoothing and the variance in our confidence limit from undersmoothing. For a suitable choice, though, the $Q$ statistic is able to estimate the size of the heteroskedastic error term of this example quite well, something the global linear method has no chance of doing.
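A hypothetical reconstruction of this setup, using the rbf_confidence_limits sketch above, might look as follows; the input range of [0, 10] is an assumption (the text does not state it), and a simple normalized kernel smoother stands in for the fitted kernel-type RBF network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data on an evenly spaced grid: y = sin(x) + x * N(0, 0.05^2).
x_train = np.linspace(0.0, 10.0, 17)
y_train = np.sin(x_train) + x_train * rng.normal(0.0, 0.05, size=x_train.shape)

sigma = 0.5

def f_hat(x):
    # Normalized kernel-smoother fit standing in for the trained RBF network.
    w = np.exp(-(np.asarray(x)[:, None] - x_train[None, :]) ** 2 / (2.0 * sigma**2))
    return (w @ y_train) / w.sum(axis=1)

x_test = np.linspace(0.0, 10.0, 83)
lower, upper = rbf_confidence_limits(x_test, x_train, y_train, f_hat, sigma, alpha=0.05)
```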

In general, variance arises in our estimates not just from the inherent noise of the problem, but also from not having enough data points in some regions of the input space. In Figure 4-3 we show an example where the locations of the input data are chosen randomly in such a way that the marginal density of $x$ does not exactly coincide with the "interesting" features of the function. In particular, the input data $x$ are drawn from a $N(1, 5^2)$ distribution, but the function is $y = \sin x\, e^{-0.1(x+1)^2} + (x - 10)\,N(0, (0.01)^2)$, which has interesting features well away from $x = 1$, the point of highest marginal density. However, the increased size of the confidence intervals for lower marginal densities is quickly dominated by the variance from the noise term.
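A parallel sketch of the Figure 4-3 setup is given below, again with assumed names and the same stand-in smoother; drawing the test inputs from the same $N(1, 5^2)$ distribution is an assumption, as the text does not state how they were chosen.

```python
import numpy as np

rng = np.random.default_rng(1)

# Input locations drawn from N(1, 5^2), so the marginal density g(x) is
# highest near x = 1, away from the function's interesting features.
x_train = np.sort(rng.normal(1.0, 5.0, size=20))
y_train = (np.sin(x_train) * np.exp(-0.1 * (x_train + 1.0) ** 2)
           + (x_train - 10.0) * rng.normal(0.0, 0.01, size=x_train.shape))

sigma = 1.0

def f_hat(x):
    # Same normalized kernel-smoother stand-in for the fitted RBF network.
    w = np.exp(-(np.asarray(x)[:, None] - x_train[None, :]) ** 2 / (2.0 * sigma**2))
    return (w @ y_train) / w.sum(axis=1)

x_test = np.sort(rng.normal(1.0, 5.0, size=80))
lower, upper = rbf_confidence_limits(x_test, x_train, y_train, f_hat, sigma)
g_hat = marginal_density_estimate(x_test, x_train, sigma)   # wider bands where g_hat is small
```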



Figure 4-2: 1D example of RBF pointwise confidence limits. Graph (a) shows the data, true function, and approximate 95% confidence limits computed by Equation (4.6) for $\sigma = 0.5$. 17 evenly spaced training points (shown with o's) were used to fit the curve, and 83 new points (shown with x's) were used for testing. Graph (b) shows the true versus estimated conditional standard deviation of the error term for various values of the scale parameter $\sigma$.


Figure 4-3: Example of RBF pointwise confidence limits for an uneven data distribution.

20 randomly sampled training points (shown with o's) were used to fit the curve, and 80 new points (shown with x's) were used to test the confidence limits. The value of 1.0 is used here for the scale parameter $\sigma$.
