
Bounding the Generalization Error

At this stage it might be worthwhile to review and remark on some general features of the problem of learning from examples. Let us remember that our goal is to minimize the expected risk $I[f]$ over the set $F$. If we were to use a finite number of parameters, then we have already seen that the best we could possibly do is to minimize our functional over the set $H_n$, yielding the estimator $f_n$:

\[
f_n \equiv \arg\min_{f \in H_n} I[f] \; .
\]

However, not only is the parametrization limited, but the data is also finite, and we can only minimize the empirical risk $I_{\rm emp}$, obtaining as our final estimate the function $\hat{f}_{n,l}$. Our goal is to bound the distance of $\hat{f}_{n,l}$, that is, our solution, from $f_0$, that is, the "optimal" solution. If we choose to measure the distance in the $L^2(P)$ metric (see appendix 2-A), the quantity that we need to bound, which we will call the generalization error, is:

\[
E[(f_0 - \hat{f}_{n,l})^2] = \int_X d\mathbf{x}\, P(\mathbf{x}) \left( f_0(\mathbf{x}) - \hat{f}_{n,l}(\mathbf{x}) \right)^2 = \| f_0 - \hat{f}_{n,l} \|^2_{L^2(P)} \; .
\]
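As a purely illustrative aside (not part of the original development), the $L^2(P)$ distance above can be estimated numerically by Monte Carlo sampling from $P$. In the sketch below the target $f_0$, the estimate $\hat{f}_{n,l}$ and the distribution $P$ are hypothetical stand-ins chosen only to make the formula concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target f0 and learned estimate fhat (illustrative stand-ins only).
f0   = lambda x: np.sin(2 * np.pi * x)
fhat = lambda x: np.sin(2 * np.pi * x) + 0.1 * x   # imagine this came from minimizing I_emp

# Monte Carlo estimate of E[(f0 - fhat)^2] = ||f0 - fhat||^2_{L2(P)},
# here taking P to be the uniform distribution on [0, 1].
x = rng.uniform(0.0, 1.0, size=100_000)
generalization_error = np.mean((f0(x) - fhat(x)) ** 2)
print(f"estimated ||f0 - fhat||^2_L2(P) ~= {generalization_error:.4f}")
```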

There are two main factors that contribute to the generalization error, and we are going to analyze them separately for the moment.

1. A first cause of error comes from the fact that we are trying to approximate an infinite dimensional object, the regression function $f_0 \in F$, with a finite number of parameters. We call this error the approximation error, and we measure it by the quantity $E[(f_0 - f_n)^2]$, that is, the $L^2(P)$ distance between the best function in $H_n$ and the regression function. The approximation error can be expressed in terms of the expected risk using the decomposition (2.2) as

\[
E[(f_0 - f_n)^2] = I[f_n] - I[f_0] \; . \tag{2.6}
\]
Notice that the approximation error does not depend on the data set $D_l$, but depends only on the approximating power of the class $H_n$. The natural framework in which to study it is approximation theory, which abounds with bounds on the approximation error for a variety of choices of $H_n$ and $F$. In the following we will always assume that it is possible to bound the approximation error as follows:

\[
E[(f_0 - f_n)^2] \le \varepsilon(n)
\]

where "(n) is a function that goes to zero as n goes to innity if H is dense in

F. In other words, as shown in gure (2-6), as the numbern of parameters gets larger the representation capacity ofHnincreases, and allows a better and better approximation of the regression function f0. This issue has been studied by a number of researchers (Cybenko, 1989; Hornik, Stinchcombe, and White, 1989;

Barron, 1991, 1993; Funahashi, 1989; Mhaskar, and Micchelli, 1992; Mhaskar, 1993) in the neural networks community.

2. Another source of error comes from the fact that, due to finite data, we minimize the empirical risk $I_{\rm emp}[f]$, and obtain $\hat{f}_{n,l}$, rather than minimizing the expected risk $I[f]$, and obtaining $f_n$. As the number of data points goes to infinity we hope that $\hat{f}_{n,l}$ will converge to $f_n$, and convergence will take place if the empirical risk converges to the expected risk uniformly in probability (Vapnik, 1982). The quantity

\[
| I_{\rm emp}[f] - I[f] |
\]

is called the estimation error, and conditions for the estimation error to converge to zero uniformly in probability have been investigated by Vapnik and Chervonenkis, Pollard, Dudley (1987), and Haussler (1989). Under a variety of different hypotheses it is possible to prove that, with probability $1 - \delta$, a bound of this form is valid:

\[
| I_{\rm emp}[f] - I[f] | \le \omega(l,n,\delta) \quad \forall f \in H_n \; . \tag{2.7}
\]
The specific form of $\omega$ depends on the setting of the problem but, in general, we expect $\omega(l,n,\delta)$ to be a decreasing function of $l$. However, we also expect it to be an increasing function of $n$. The reason is that, if the number of parameters is large, then the expected risk is a very complex object, and more data will be needed to estimate it. Therefore, keeping the number of data points fixed and increasing the number of parameters will result, on average, in a larger distance between the expected risk and the empirical risk; a small numerical sketch of this interplay between $n$ and $l$ is given right after this list.
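The sketch below (an illustration under assumed choices, not taken from the text) fits polynomials of increasing degree $n$ to $l$ noisy samples of a fixed target and estimates the resulting $L^2(P)$ error; the target function, noise level, sample size, and polynomial hypothesis classes $H_n$ are arbitrary choices made only for the demonstration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

# Hypothetical regression problem: P uniform on [0, 1], noisy samples of a smooth target.
f0 = lambda x: np.sin(2 * np.pi * x)
l = 20                                          # number of examples
x_train = rng.uniform(0.0, 1.0, size=l)
y_train = f0(x_train) + 0.25 * rng.standard_normal(l)

x_test = rng.uniform(0.0, 1.0, size=50_000)     # Monte Carlo points for the L2(P) error

for n in [1, 3, 5, 9, 15]:                      # H_n = polynomials of degree n
    p_hat = Polynomial.fit(x_train, y_train, deg=n)   # least squares = empirical risk minimization over H_n
    err = np.mean((f0(x_test) - p_hat(x_test)) ** 2)  # estimate of ||f0 - fhat_{n,l}||^2_{L2(P)}
    print(f"degree n = {n:2d}  ->  estimated generalization error = {err:.4f}")
```

Typically the measured error first decreases with $n$ (the approximation error shrinks) and then stops improving or grows again as $n$ becomes comparable to $l$ (the estimation error takes over), which is precisely the tradeoff formalized in the next statement.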

The approximation and estimation errors are clearly two components of the generalization error, and it is interesting to notice that, as shown in the next statement, the generalization error can be bounded by the sum of the two:

Statement 2.2.1

The following inequality holds:

\[
\| f_0 - \hat{f}_{n,l} \|^2_{L^2(P)} \le \varepsilon(n) + 2\,\omega(l,n,\delta) \; . \tag{2.8}
\]

Proof:

using the decomposition of the expected risk (2.2), the generalization error can be written as:

\[
\| f_0 - \hat{f}_{n,l} \|^2_{L^2(P)} = E[(f_0 - \hat{f}_{n,l})^2] = I[\hat{f}_{n,l}] - I[f_0] \; . \tag{2.9}
\]
A natural way of bounding the generalization error is as follows:

\[
E[(f_0 - \hat{f}_{n,l})^2] \le | I[f_n] - I[f_0] | + | I[f_n] - I[\hat{f}_{n,l}] | \; . \tag{2.10}
\]

In the first term of the right hand side of the previous inequality we recognize the approximation error (2.6). If a bound of the form (2.7) is known for the estimation error, it is simple to show (see appendix 2-C) that the second term can be bounded as

\[
| I[f_n] - I[\hat{f}_{n,l}] | \le 2\,\omega(l,n,\delta)
\]
and statement (2.2.1) follows. $\Box$
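For completeness, here is a sketch of the standard argument behind this last bound (presumably the one spelled out in appendix 2-C): with probability $1 - \delta$, since $f_n$ minimizes $I$ over $H_n$ and $\hat{f}_{n,l}$ minimizes $I_{\rm emp}$ over $H_n$,
\[
0 \le I[\hat{f}_{n,l}] - I[f_n]
= \left( I[\hat{f}_{n,l}] - I_{\rm emp}[\hat{f}_{n,l}] \right)
+ \left( I_{\rm emp}[\hat{f}_{n,l}] - I_{\rm emp}[f_n] \right)
+ \left( I_{\rm emp}[f_n] - I[f_n] \right)
\le \omega(l,n,\delta) + 0 + \omega(l,n,\delta) \; ,
\]
where the first and third terms are bounded by (2.7) and the middle term is non-positive because $\hat{f}_{n,l}$ minimizes the empirical risk over $H_n$.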

Thus we see that the generalization error has two components: one, bounded by $\varepsilon(n)$, is related to the approximation power of the class of functions $\{H_n\}$, and is studied in the framework of approximation theory. The second, bounded by $\omega(l,n,\delta)$, is related to the difficulty of estimating the parameters given finite data, and is studied in the framework of statistics. Consequently, results from both these fields are needed in order to provide an understanding of the problem of learning from examples. Figure (2-6) also shows a picture of the problem.

Figure 2-6: This figure shows a picture of the problem. The outermost circle represents the set $F$. Embedded in this are the nested subsets, the $H_n$'s. $f_0$ is an arbitrary target function in $F$, $f_n$ is the closest element of $H_n$ and $\hat{f}_{n,l}$ is the element of $H_n$ which the learner hypothesizes on the basis of data.
