
Learning via Uniform Convergence


The first formal learning model that we have discussed was the PAC model. In Chapter 2 we have shown that under the realizability assumption, any finite hypothesis class is PAC learnable. In this chapter we will develop a general tool, uniform convergence, and apply it to show that any finite class is learnable in the agnostic PAC model with general loss functions, as long as the range of the loss function is bounded.

4.1 UNIFORM CONVERGENCE IS SUFFICIENT FOR LEARNABILITY

The idea behind the learning condition discussed in this chapter is very simple.

Recall that, given a hypothesis class, H, the ERM learning paradigm works as follows: Upon receiving a training sample, S, the learner evaluates the risk (or error) of each h in H on the given sample and outputs a member of H that minimizes this empirical risk. The hope is that an h that minimizes the empirical risk with respect to S is a risk minimizer (or has risk close to the minimum) with respect to the true data probability distribution as well. For that, it suffices to ensure that the empirical risks of all members of H are good approximations of their true risk. Put another way, we need that uniformly over all hypotheses in the hypothesis class, the empirical risk will be close to the true risk, as formalized in the following.
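To make the ERM rule concrete, here is a minimal Python sketch for a toy finite class of 21 threshold classifiers under the 0-1 loss; the class, the data distribution (uniform instances with 10% label noise), and all function names are illustrative assumptions, not part of the formal development:

    import numpy as np

    def empirical_risk(h, S):
        # L_S(h): average 0-1 loss of hypothesis h over the sample S
        return np.mean([float(h(x) != y) for x, y in S])

    def erm(H, S):
        # ERM_H(S): return some h_S in argmin_{h in H} L_S(h)
        return min(H, key=lambda h: empirical_risk(h, S))

    # Illustrative finite class: threshold classifiers h_theta(x) = sign(x - theta)
    thetas = np.linspace(-1.0, 1.0, 21)
    H = [lambda x, t=t: 1 if x > t else -1 for t in thetas]

    # Toy sample from a distribution D unknown to the learner:
    # x uniform on [-1, 1], label sign(x - 0.3) flipped with probability 0.1
    rng = np.random.default_rng(0)
    xs = rng.uniform(-1, 1, size=200)
    ys = np.where(xs > 0.3, 1, -1)
    ys[rng.random(200) < 0.1] *= -1
    S = list(zip(xs, ys))

    h_S = erm(H, S)
    print("empirical risk L_S(h_S):", empirical_risk(h_S, S))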

Definition 4.1 (ε-representative sample). A training set S is called ε-representative (w.r.t. domain Z, hypothesis class H, loss function ℓ, and distribution D) if

∀h ∈ H,   |L_S(h) − L_D(h)| ≤ ε.

The next simple lemma states that whenever the sample is (ε/2)-representative, the ERM learning rule is guaranteed to return a good hypothesis.

Lemma 4.2. Assume that a training set S is (ε/2)-representative (w.r.t. domain Z, hypothesis class H, loss function ℓ, and distribution D). Then, any output of ERM_H(S), namely, any h_S ∈ argmin_{h∈H} L_S(h), satisfies

L_D(h_S) ≤ min_{h∈H} L_D(h) + ε.


Proof. For every h ∈ H,

L_D(h_S) ≤ L_S(h_S) + ε/2 ≤ L_S(h) + ε/2 ≤ L_D(h) + ε/2 + ε/2 = L_D(h) + ε,

where the first and third inequalities are due to the assumption that S is (ε/2)-representative (Definition 4.1) and the second inequality holds since h_S is an ERM predictor.

The preceding lemma implies that to ensure that the ERM rule is an agnostic PAC learner, it suffices to show that with probability of at least 1 − δ over the random choice of a training set, it will be an ε-representative training set. The uniform convergence condition formalizes this requirement.

Definition 4.3 (Uniform Convergence). We say that a hypothesis class H has the uniform convergence property (w.r.t. a domain Z and a loss function ℓ) if there exists a function m^UC_H : (0,1)² → N such that for every ε, δ ∈ (0,1) and for every probability distribution D over Z, if S is a sample of m ≥ m^UC_H(ε, δ) examples drawn i.i.d. according to D, then, with probability of at least 1 − δ, S is ε-representative.

Similar to the definition of sample complexity for PAC learning, the function m^UC_H measures the (minimal) sample complexity of obtaining the uniform convergence property, namely, how many examples we need to ensure that with probability of at least 1 − δ the sample would be ε-representative.

The term uniform here refers to having a fixed sample size that works for all members of H and over all possible probability distributions over the domain.
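To see what this uniformity means empirically, the following sketch estimates sup_{h∈H} |L_S(h) − L_D(h)| over a toy finite class of threshold classifiers for increasing sample sizes; the distribution (uniform instances, 10% label noise) and the closed-form expression for L_D are illustrative assumptions rather than part of the text:

    import numpy as np

    rng = np.random.default_rng(1)
    thetas = np.linspace(-1.0, 1.0, 41)      # finite class of threshold hypotheses (illustrative)

    def true_risk(theta, true_theta=0.3, p_flip=0.1):
        # L_D(h_theta) for x uniform on [-1, 1] and labels sign(x - true_theta) flipped w.p. p_flip:
        # h_theta disagrees with the noiseless labeler on a region of probability |theta - true_theta| / 2
        disagree = abs(theta - true_theta) / 2.0
        return disagree * (1 - p_flip) + (1 - disagree) * p_flip

    for m in [50, 500, 5000]:
        xs = rng.uniform(-1, 1, size=m)
        ys = np.where(xs > 0.3, 1, -1)
        ys[rng.random(m) < 0.1] *= -1
        emp = [np.mean(np.where(xs > t, 1, -1) != ys) for t in thetas]
        gap = max(abs(e - true_risk(t)) for e, t in zip(emp, thetas))
        print(f"m = {m:5d}   sup_h |L_S(h) - L_D(h)| = {gap:.3f}")

As m grows, the largest gap over the whole class shrinks, which is exactly the behavior the uniform convergence property quantifies.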

The following corollary follows directly from Lemma 4.2 and the definition of uniform convergence.

Corollary 4.4. If a class H has the uniform convergence property with a function m^UC_H, then the class is agnostically PAC learnable with sample complexity m_H(ε, δ) ≤ m^UC_H(ε/2, δ). Furthermore, in that case, the ERM_H paradigm is a successful agnostic PAC learner for H.

4.2 FINITE CLASSES ARE AGNOSTIC PAC LEARNABLE

In view of Corollary 4.4, the claim that every finite hypothesis class is agnostic PAC learnable will follow once we establish that uniform convergence holds for a finite hypothesis class.

To show that uniform convergence holds we follow a two step argument, similar to the derivation in Chapter 2. The first step applies the union bound while the second step employs a measure concentration inequality. We now explain these two steps in detail.

Fix some ε, δ. We need to find a sample size m that guarantees that for any D, with probability of at least 1 − δ over the choice of S = (z_1, ..., z_m) sampled i.i.d. from D, we have that for all h ∈ H, |L_S(h) − L_D(h)| ≤ ε. That is,

D^m({S : ∀h ∈ H, |L_S(h) − L_D(h)| ≤ ε}) ≥ 1 − δ.

Equivalently, we need to show that

D^m({S : ∃h ∈ H, |L_S(h) − L_D(h)| > ε}) < δ.


Writing

{S : ∃h ∈ H, |L_S(h) − L_D(h)| > ε} = ∪_{h∈H} {S : |L_S(h) − L_D(h)| > ε},

and applying the union bound (Lemma 2.2) we obtain

D^m({S : ∃h ∈ H, |L_S(h) − L_D(h)| > ε}) ≤ Σ_{h∈H} D^m({S : |L_S(h) − L_D(h)| > ε}).   (4.1)

Our second step will be to argue that each summand of the right-hand side of this inequality is small enough (for a sufficiently large m). That is, we will show that for any fixed hypothesis h (which is chosen in advance, prior to the sampling of the training set), the gap between the true and empirical risks, |L_S(h) − L_D(h)|, is likely to be small.

Recall that L_D(h) = E_{z∼D}[ℓ(h, z)] and that L_S(h) = (1/m) Σ_{i=1}^m ℓ(h, z_i). Since each z_i is sampled i.i.d. from D, the expected value of the random variable ℓ(h, z_i) is L_D(h). By the linearity of expectation, it follows that L_D(h) is also the expected value of L_S(h). Hence, the quantity |L_D(h) − L_S(h)| is the deviation of the random variable L_S(h) from its expectation. We therefore need to show that the measure of L_S(h) is concentrated around its expected value.
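Written out, the linearity-of-expectation step is simply

E_{S∼D^m}[L_S(h)] = (1/m) Σ_{i=1}^m E_{z_i∼D}[ℓ(h, z_i)] = (1/m) · m · L_D(h) = L_D(h).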

A basic statistical fact, the law of large numbers, states that when m goes to infinity, empirical averages converge to their true expectation. This is true for L_S(h), since it is the empirical average of m i.i.d. random variables. However, since the law of large numbers is only an asymptotic result, it provides no information about the gap between the empirically estimated error and its true value for any given, finite, sample size.

Instead, we will use a measure concentration inequality due to Hoeffding, which quantifies the gap between empirical averages and their expected value.

Lemma 4.5 (Hoeffding's Inequality). Let θ_1, ..., θ_m be a sequence of i.i.d. random variables and assume that for all i, E[θ_i] = μ and P[a ≤ θ_i ≤ b] = 1. Then, for every ε > 0,

P[ |(1/m) Σ_{i=1}^m θ_i − μ| > ε ] ≤ 2 exp(−2mε² / (b − a)²).

The proof can be found in Appendix B.
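As a quick numerical sanity check of the lemma, the following sketch compares the empirical frequency of a large deviation with the bound 2 exp(−2mε²/(b − a)²) for Bernoulli variables; the parameter values are arbitrary choices for illustration:

    import numpy as np

    rng = np.random.default_rng(2)
    m, eps, trials = 100, 0.1, 20000
    a, b, mu = 0.0, 1.0, 0.4                 # theta_i ~ Bernoulli(0.4), so theta_i in [0, 1] and E[theta_i] = 0.4

    means = (rng.random((trials, m)) < mu).mean(axis=1)   # (1/m) sum_i theta_i, one value per trial
    empirical = np.mean(np.abs(means - mu) > eps)         # frequency of a deviation larger than eps
    bound = 2 * np.exp(-2 * m * eps**2 / (b - a)**2)      # Hoeffding's bound

    print(f"empirical P[|mean - mu| > eps] = {empirical:.4f}")
    print(f"Hoeffding bound                = {bound:.4f}")

The empirical frequency should come out well below the bound, which is distribution-free and therefore typically loose.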

Getting back to our problem, let θ_i be the random variable ℓ(h, z_i). Since h is fixed and z_1, ..., z_m are sampled i.i.d., it follows that θ_1, ..., θ_m are also i.i.d. random variables. Furthermore, L_S(h) = (1/m) Σ_{i=1}^m θ_i and L_D(h) = μ, and since the range of the loss function is [0,1], each θ_i takes values in [0,1]. Applying Hoeffding's inequality, we obtain that for every fixed h,

D^m({S : |L_S(h) − L_D(h)| > ε}) ≤ 2 exp(−2mε²).

Combining this with Equation (4.1) yields

D^m({S : ∃h ∈ H, |L_S(h) − L_D(h)| > ε}) ≤ Σ_{h∈H} 2 exp(−2mε²) = 2|H| exp(−2mε²).


Finally, if we choose

m ≥ log(2|H|/δ) / (2ε²),

then

D^m({S : ∃h ∈ H, |L_S(h) − L_D(h)| > ε}) ≤ δ.

Corollary 4.6. Let H be a finite hypothesis class, let Z be a domain, and let ℓ : H × Z → [0,1] be a loss function. Then, H enjoys the uniform convergence property with sample complexity

m^UC_H(ε, δ) ≤ ⌈ log(2|H|/δ) / (2ε²) ⌉.

Furthermore, the class is agnostically PAC learnable using the ERM algorithm with sample complexity

m_H(ε, δ) ≤ m^UC_H(ε/2, δ) ≤ ⌈ 2 log(2|H|/δ) / ε² ⌉.
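A small helper that evaluates the two bounds of Corollary 4.6; the function names and the example values of ε, δ, and |H| are illustrative:

    import math

    def uc_sample_complexity(eps, delta, H_size):
        # Bound on m^UC_H(eps, delta): ceil( log(2|H|/delta) / (2 eps^2) )
        return math.ceil(math.log(2 * H_size / delta) / (2 * eps**2))

    def agnostic_pac_sample_complexity(eps, delta, H_size):
        # Bound on m_H(eps, delta) via m^UC_H(eps/2, delta) = ceil( 2 log(2|H|/delta) / eps^2 )
        return uc_sample_complexity(eps / 2, delta, H_size)

    # Example: |H| = 2^20, eps = 0.05, delta = 0.01
    print(uc_sample_complexity(0.05, 0.01, 2**20))            # -> 3833
    print(agnostic_pac_sample_complexity(0.05, 0.01, 2**20))  # -> 15330

Note that the dependence on |H| is only logarithmic, so even very large finite classes remain learnable from a modest number of examples.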

Remark 4.1 (The “Discretization Trick”). While the preceding corollary only applies to finite hypothesis classes, there is a simple trick that allows us to get a very good estimate of the practical sample complexity of infinite hypothesis classes.

Consider a hypothesis class that is parameterized by d parameters. For example, let X = R, Y = {±1}, and let the hypothesis class, H, be all functions of the form h_θ(x) = sign(x − θ). That is, each hypothesis is parameterized by one parameter, θ ∈ R, and the hypothesis outputs 1 for all instances larger than θ and outputs −1 for instances smaller than θ. This is a hypothesis class of infinite size. However, if we are going to learn this hypothesis class in practice, using a computer, we will probably maintain real numbers using a floating point representation, say, of 64 bits.

It follows that in practice, our hypothesis class is parameterized by the set of scalars that can be represented using a 64-bit floating point number. There are at most 2^64 such numbers; hence the actual size of our hypothesis class is at most 2^64. More generally, if our hypothesis class is parameterized by d numbers, in practice we learn a hypothesis class of size at most 2^(64d). Applying Corollary 4.6 we obtain that the sample complexity of such classes is bounded by (128d + 2 log(2/δ)) / ε². This upper bound on the sample complexity has the deficiency of being dependent on the specific representation of real numbers used by our machine. In Chapter 6 we will introduce a rigorous way to analyze the sample complexity of infinite size hypothesis classes.

Nevertheless, the discretization trick can be used to get a rough estimate of the sample complexity in many practical situations.
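As a rough calculation in the spirit of the remark, the following sketch treats a class with d parameters, each stored in 64 bits, as a finite class with |H| ≤ 2^(64d) and plugs this into the bound of Corollary 4.6; the parameter values are arbitrary, and natural logarithms are used, which gives a slightly smaller number than the 128d + 2 log(2/δ) expression above:

    import math

    def discretized_sample_complexity(d, eps, delta, bits=64):
        # Treat the class as finite with |H| <= 2^(bits*d) and apply the bound
        # m_H(eps, delta) <= 2 log(2|H|/delta) / eps^2 from Corollary 4.6
        log_H = bits * d * math.log(2)                      # log |H|
        return math.ceil(2 * (log_H + math.log(2 / delta)) / eps**2)

    # A class parameterized by d = 10 doubles, with eps = 0.1, delta = 0.01
    print(discretized_sample_complexity(10, 0.1, 0.01))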

4.3 SUMMARY

If the uniform convergence property holds for a hypothesis class H, then in most cases the empirical risks of hypotheses in H will faithfully represent their true risks. Uniform convergence suffices for agnostic PAC learnability using the ERM rule. We have shown that finite hypothesis classes enjoy the uniform convergence property and are hence agnostic PAC learnable.


4.4 BIBLIOGRAPHIC REMARKS

Classes of functions for which the uniform convergence property holds are also called Glivenko-Cantelli classes, named after Valery Ivanovich Glivenko and Francesco Paolo Cantelli, who proved the first uniform convergence result in the 1930s. See (Dudley, Gine & Zinn 1991). The relation between uniform convergence and learnability was thoroughly studied by Vapnik – see (Vapnik 1992, Vapnik 1995, Vapnik 1998). In fact, as we will see later in Chapter 6, the fundamental theorem of learning theory states that in binary classification problems, uniform convergence is not only a sufficient condition for learnability but is also a necessary condition. This is not the case for more general learning problems (see (Shalev-Shwartz, Shamir, Srebro & Sridharan 2010)).

4.5 EXERCISES

4.1 In this exercise, we show that the (ε, δ) requirement on the convergence of errors in our definitions of PAC learning is, in fact, quite close to a simpler-looking requirement about averages (or expectations). Prove that the following two statements are equivalent (for any learning algorithm A, any probability distribution D, and any loss function whose range is [0,1]):

1. For every ε, δ > 0, there exists m(ε, δ) such that for every m ≥ m(ε, δ),

   P_{S∼D^m}[L_D(A(S)) > ε] < δ.

2. lim_{m→∞} E_{S∼D^m}[L_D(A(S))] = 0

   (where E_{S∼D^m} denotes the expectation over samples S of size m).

4.2 Bounded loss functions: In Corollary 4.6 we assumed that the range of the loss function is [0,1]. Prove that if the range of the loss function is [a, b], then the sample complexity satisfies

m_H(ε, δ) ≤ m^UC_H(ε/2, δ) ≤ ⌈ 2 log(2|H|/δ) (b − a)² / ε² ⌉.
