
3.3 Statistical Learning


The learning process faces problems that are partially unknown, too complex, or too noisy. As we mentioned in the previous section, statistical learning techniques deal with the difficult learning process of finding the desired dependence relation over an infinite domain using a finite amount of given data [9, 35, 36]. In fact, there is an infinite number of such relations. Thus, the problem is to select the relation (i.e., to infer the dependence) which is the most appropriate. The answer to this problem is provided by the principle of Occam's razor (also known as the principle of parsimony):

"Entities should not be multiplied beyond necessity." This principle means that one should not increase the number of entities unnecessarily or make more assumptions than are needed to explain anything. In general, one should pursue the simplest hypothesis available. Thus, when many solutions are available for a given problem, we should select the simplest one because, according to the interpretation of Occam's razor, "the simplest explanation is the best" [9, 35, 36].

Consequently, we have to define what is considered the simplest solution.

In order to define what a simple solution is, we need to use prior knowledge of the problem to be solved. Often, this prior knowledge is the principle of smoothness. Specifically, the smoothness principle states that physical changes do not occur instantaneously, but all changes take place in a continuous way. If two points x1, x2 are close, then so should be the corresponding outputs y1, y2. For example, if we have two points, one of which is red and the other one is green, and we get a new point which is very close to the red one, the principle of smoothness says that this new point will most probably be red. This means that the function (which describes the relation between the location of points and their color) does not change abruptly, but its change takes place in a continuous way. Therefore, the learning problem can be described as the search for a function which solves our task.
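As an illustration of how the smoothness principle is used in practice, the following sketch (a toy example, not taken from the text; all data and names are illustrative) labels a new point with the color of its nearest already-observed point:

```python
import numpy as np

# Toy illustration of the smoothness principle: points that are close in
# input space are assumed to share the same output (here, a color label).
points = np.array([[0.0, 0.0],   # a "red" point
                   [5.0, 5.0]])  # a "green" point
colors = np.array(["red", "green"])

def predict_color(x):
    """Assign to x the color of its nearest known point."""
    distances = np.linalg.norm(points - x, axis=1)
    return colors[np.argmin(distances)]

# A new point very close to the red one is predicted to be red.
print(predict_color(np.array([0.1, -0.2])))  # -> "red"
```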

Taking the previous facts into consideration, two main paradigms have been developed in statistical inference: the particular parametric paradigm (parametric inference) and the general paradigm (nonparametric inference).

3.3.1 Classical Parametric Paradigm

The classical parametric paradigm aims at creating simple statistical methods of inference [36]. This approach to statistical inference was influenced by Fisher's work in the area of discriminant analysis. The problem was to infer a functional dependence from a given collection of empirical data. In other words, the investigator of a specific problem knows the physical law that generates the stochastic properties of the data in the form of a function to be determined up to a finite number of parameters. Thus, the problem is reduced to the estimation of the parameters using the data. Fisher's goal was to estimate the model that generates the observed signal.

More specifically, Fisher proposed that any signal Y could be modeled as the sum of two components, namely a deterministic component and a random component:

Y = f(x,a)+ε. (3.1)

In this model, f(x,a) is the deterministic part, defined by the values of a function which is determined up to a limited number of parameters. On the other hand, ε corresponds to the random part describing noise added to the signal, which is defined by a known density function. Thus, the goal was the estimation of the unknown parameters in the function f(x,a). For this purpose, Fisher adopted the Maximum Likelihood Method. The classical parametric paradigm could be called Model Identification, as its main philosophical idea is the traditional goal of science: to discover or identify an existing law of nature that generates the statistical properties of data.
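As a minimal sketch of this idea, assuming a linear-in-parameters model f(x, a) = a0 + a1·x with zero-mean Gaussian noise ε (for which maximum likelihood estimation reduces to least squares), one might estimate the parameters as follows; the data and "true" parameter values are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assume f(x, a) = a0 + a1 * x and that epsilon is zero-mean Gaussian;
# then maximum likelihood estimation of (a0, a1) is ordinary least squares.
true_a = np.array([1.0, 2.0])          # illustrative "true" parameters
x = rng.uniform(-1.0, 1.0, size=100)
y = true_a[0] + true_a[1] * x + rng.normal(scale=0.1, size=x.shape)

# Design matrix [1, x]; solve the least-squares problem for a_hat.
X = np.column_stack([np.ones_like(x), x])
a_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("estimated parameters:", a_hat)   # close to (1.0, 2.0)
```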

The classical parametric paradigm is based on three beliefs. The first belief states that the number of free parameters that define a set of functions, linear in those parameters and containing a good approximation to the desired function, is small. This belief is based on Weierstrass's theorem, according to which "any continuous function can be approximated by polynomials at any degree of accuracy" [36].
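A brief numerical illustration of this belief (a sketch, not taken from the text) is to approximate a smooth function by polynomials of increasing degree and observe the error shrink:

```python
import numpy as np

# Approximate a continuous function by polynomials of increasing degree
# on [-1, 1]; the maximum error decreases as the degree grows.
x = np.linspace(-1.0, 1.0, 400)
f = np.exp(x) * np.cos(3 * x)

for degree in (1, 3, 5, 9):
    coeffs = np.polyfit(x, f, degree)          # least-squares polynomial fit
    error = np.max(np.abs(np.polyval(coeffs, x) - f))
    print(f"degree {degree}: max error {error:.4f}")
```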

The second belief refers to the Central Limit Theorem which states that “the sum of a large number of variables, under wide conditions, is approximated by normal (Gaussian) law.” Therefore, the underlying belief is that for most real life problems, if randomness is the result of interaction among large numbers of random components, then the statistical law of the stochastic element is the normal law [36].
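A small simulation (illustrative only) shows the effect this belief relies on: sums of many independent non-Gaussian components look approximately Gaussian.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sum many independent uniform random variables; by the Central Limit
# Theorem the standardized sums are approximately normally distributed.
n_components, n_samples = 50, 100_000
sums = rng.uniform(0.0, 1.0, size=(n_samples, n_components)).sum(axis=1)
standardized = (sums - sums.mean()) / sums.std()

# Roughly 68% of a standard normal lies within one standard deviation.
print("fraction within 1 std:", np.mean(np.abs(standardized) < 1.0))
```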

The third belief states that the maximum likelihood method, as an induction engine, is a good tool for estimating the parameters of models even for small sample sizes. This belief was supported by many theorems related to the conditional optimality of the method [36].

However, the parametric paradigm demonstrates shortcomings in all of the beliefs it was based on. In real-life multidimensional problems, it seems naive to assume that we could define a small set of functions that contains the desired function, because it is known that, if the desired function is not very smooth, the desired level of accuracy causes an exponential increase in the number of free terms as the number of variables grows. Additionally, real-life problems cannot be described by only the classical distribution functions, while it has also been proven that the maximum likelihood method is not the best one even for simple problems of density estimation [36].
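To see this blow-up concretely: the number of monomials of degree at most d in n variables is C(n + d, d), which is the number of free parameters of a full polynomial model. The short calculation below (an illustrative aside, not from the text) shows how quickly this grows with the number of input variables.

```python
from math import comb

# Number of monomials of degree at most d in n variables: C(n + d, d).
degree = 5
for n_variables in (2, 5, 10, 20, 50):
    n_terms = comb(n_variables + degree, degree)
    print(f"{n_variables:3d} variables -> {n_terms:>12,d} terms")
```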

3.3.2 General Nonparametric—Predictive Paradigm

The shortcomings of the parametric paradigm for solving high-dimensional problems have led to the creation of new paradigms as an attempt to analyze the problem of generalization of statistical inference for the problem of pattern recognition. This study was started by Glivenko, Cantelli and Kolmogorov [36], who proved that the empirical distribution function always converges to the actual distribution function. The main philosophical idea of this paradigm arises from the fact that there is no reliable a priori information about the desired function that we would like to approximate.

Due to this fact, we need a method that infers an approximation of the desired function from the given data. This means that the method should be the best for the given data, and a description of the conditions should be provided under which it achieves the best approximation to the desired function we are looking for.

For these reasons, the need arose to state a general principle of inductive inference, which is known as the principle of Empirical Risk Minimization (ERM). The principle of ERM suggests a decision rule (an indicator function) that minimizes the so-called "empirical risk", i.e. the number of training errors [9, 36].
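The following sketch makes the ERM principle concrete under a toy assumption: the input is one-dimensional and the set of indicator functions consists of simple thresholds. Among the candidate rules, we pick the one with the fewest training errors. The data and threshold grid are illustrative.

```python
import numpy as np

# Toy ERM: the function set is {sign(x - t) : t a threshold}; select the
# threshold with the smallest empirical risk (fraction of training errors).
x_train = np.array([0.5, 1.2, 1.9, 3.1, 3.8, 4.4])
y_train = np.array([-1, -1, -1, 1, 1, 1])

def empirical_risk(threshold):
    predictions = np.where(x_train > threshold, 1, -1)
    return np.mean(predictions != y_train)

candidate_thresholds = np.linspace(0.0, 5.0, 51)
best = min(candidate_thresholds, key=empirical_risk)
print("chosen threshold:", best, "empirical risk:", empirical_risk(best))
```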

The construction of a general type of learning machine started in 1962, when Rosenblatt [2, 36] proposed the first model of a learning machine, known as the perceptron. The underlying idea of the perceptron was already known in the neurophysiology literature. However, Rosenblatt was the first to formulate the physiological concepts of learning with reward and punishment stimuli as a program for computers, and he showed with his experiments that this model can generalize. He proposed a simple algorithm for constructing a rule that separates data into categories using given examples. The main idea was to choose appropriate coefficients for each neuron during the learning process. The next step was taken in 1986 with the construction of the back-propagation algorithm for simultaneously finding the weights of multiple neurons in a neural network.
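A minimal sketch of the perceptron learning rule in its standard textbook form (the data here are illustrative, not from the text) shows how the coefficients are adjusted from the examples:

```python
import numpy as np

def perceptron(X, y, epochs=20, lr=1.0):
    """Rosenblatt's rule: on each mistake, move the weights toward the example."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified example
                w += lr * yi * xi               # reward/punishment update
                b += lr * yi
    return w, b

# Linearly separable toy data with labels in {-1, +1}.
X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print("weights:", w, "bias:", b)
print("predictions:", np.sign(X @ w + b))
```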

At the end of the 1960s, Vapnik and Chervonenkis started a new paradigm known as Model Predictions or the Predictive Paradigm [36]. The focus here is not on an accurate estimation of the parameters or on the adequacy of a model on past observations, but on the predictive ability, that is, the capacity of making good predictions for new observations. The classical paradigm (Model Identification) looks for a parsimonious function f(x,a) belonging to a predefined set. On the other hand, in the predictive paradigm the aim is not to approximate the true function (the optimal solution) f(x,a0), but to get a function f(x,a) which gives predictions that are as accurate as possible. Hence, the goal is not to discover the hidden mechanism, but to perform well. This is known as the "black box model" [34], which illustrates the above conception while keeping the same formulation as the classical paradigm, y = f(x,a) + ε.

For example, in the pattern recognition problem (also known as the classification problem), understanding the phenomenon (the statistical law that causes the stochastic properties of data) would be a very complex task. In the predictive paradigm, models such as artificial neural networks, support vector machines, etc. attempt to achieve good accuracy in predicting events without the need for identification, i.e. a deep understanding of the observed events. The fundamental idea behind models that belong to the predictive paradigm is that the problem of estimating a model of events is "hard" or "ill-posed", as it requires a large number of observations to be solved "well" [36]. According to Hadamard, a problem is "well-posed" if its solution

1. exists,
2. is unique, and
3. is stable.

If the solution of the problem violates at least one of the above requirements, then the problem is considered ill-posed.

Consequently, finding a function for the estimation problem has led to the derivation of bounds on the quality of any possible solution. In other words, these bounds express the generalization ability of this function. This resulted in the adoption of the inductive principle of Structural Risk Minimization, which controls these bounds.

This theory was constructed by Vapnik and Chervonenkis, and its quintessence is the use of what is called a capacity concept for a set of events (a set of indicator functions). Specifically, Vapnik and Chervonenkis introduced the Vapnik-Chervonenkis (VC) dimension, which characterizes the variability of a set of indicator functions implemented by a learning machine. Thus, the underlying idea for controlling the generalization ability of the learning machine pertains to "achieving the smallest bound on the test error by controlling (minimizing) the number of the training errors; the machine (the set of functions) with the smallest VC-dimension should be used" [9, 36].

Thus, there is a trade-off between the following two requirements:

• to minimize the number of training errors and

• to use a set of functions with small VC-dimension.

On the one hand, the minimization of the number of training errors requires selecting a function from a wide set of functions. On the other hand, this function has to be selected from a narrow set of functions with small VC-dimension.

As a result, the selection of the indicator function that we will utilize to minimize the number of errors has to compromise between the accuracy of approximation of the training data and the VC-dimension of the set of functions. The control of these two contradictory requirements is provided by the Structural Risk Minimization principle [9, 36].
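A rough sketch of the SRM idea follows, with a deliberately simplified complexity penalty standing in for the VC-type bound (it is not Vapnik's exact expression, and the data are illustrative): fit a nested sequence of function classes of growing capacity and select the element that minimizes training error plus the capacity term.

```python
import numpy as np

rng = np.random.default_rng(2)

# Structural Risk Minimization, toy version: a nested structure of
# polynomial models S_1 ⊂ S_2 ⊂ ... of growing capacity.  We select the
# element minimizing (training error + complexity penalty).  The penalty
# below is only an illustrative stand-in for the VC-type bound.
x = rng.uniform(-1.0, 1.0, size=30)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=x.shape)
n = len(x)

def training_error(degree):
    coeffs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

for degree in range(1, 10):
    capacity = degree + 1                                   # free parameters
    bound = training_error(degree) + np.sqrt(capacity / n)  # toy bound
    print(f"degree {degree}: train err {training_error(degree):.3f}, "
          f"bound {bound:.3f}")
```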

3.3.3 Transductive Inference Paradigm

The next step beyond the model prediction paradigm was introduced by Vladimir Vapnik with the publication of the Transductive Inference Paradigm [34]. The key ideas behind the transductive inference paradigm arose from the need to create efficient methods of inference from small sample sizes. Specifically, in transductive inference an effort is made to estimate the values of an unknown predictive function only at a given restricted subset of its domain in which we are interested, and not over the entire domain of its definition. This led Vapnik to formulate the Main Principle [9, 34, 36]:

If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a more general problem as an intermediate step. It is possible that the available information is sufficient for a direct solution, but may be insufficient to solve a more general intermediate problem.

The main principle constitutes the essential difference between the newer approaches and the classical paradigm of statistical inference based on the use of the maximum likelihood method to estimate a number of free parameters. While the classical paradigm is useful in simple problems that can be analyzed with few variables, real-world problems are much more complex and require large numbers of variables. Thus, when dealing with real-life problems that are by nature high-dimensional, the goal is to define a less demanding (i.e. less complex) problem which admits well-posed solutions. This involves finding the values of the unknown function reasonably well only at the given points of interest, while outside of these points the function may not be well estimated.

The setting of the learning problem in the predictive paradigm, which is analyzed in the next section, uses a two-step procedure. The first step is the induction stage, in which we estimate the function from a given set of functions using an induction principle. The second step is the deduction stage, in which we evaluate the values of the unknown function only at given points of interest. In other words, the solution given by the inductive principle derives results first from particular to general (inductive step) and then from general to particular (deductive step).

On the contrary, the paradigm of transductive inference forms a solution that derives results directly from particular (training samples) to particular (testing samples).

In many problems, we do not care about finding a specific function with good generalization ability, but rather are interested in classifying a given set of examples (i.e. a test set of data) with the minimum possible error. For this reason, the inductive formulation of the learning problem is unnecessarily complex.

Transductive inference embeds the unlabeled (test) data in the decision-making process that will be responsible for their final classification. Transductive inference "works because the test set can give you a non-trivial factorization of the (discrimination) function class" [7]. Additionally, the unlabeled examples provide prior information about the labeled examples and "guide the linear boundary away from the dense region of labeled examples" [39].

For a given set of labeled data points TrainSet = {(x1, y1), (x2, y2), . . . , (xn, yn)}, with yi ∈ {−1, 1}, and a set of test data points TestSet = {xn+1, xn+2, . . . , xn+k}, where xi ∈ Rd, transduction seeks, among the feasible corresponding labelings, the labeling yn+1, yn+2, . . . , yn+k that has the minimum number of errors.

Also, transduction is useful, compared with other ways of inference, when either only a small number of labeled data points is available or the cost of annotating data points is prohibitive. Hence, the ERM principle helps in selecting the "best function from the set of indicator functions defined in Rd, while transductive inference targets only the functions defined on the working set WorkingSet = TrainSet ∪ TestSet," which is a discrete space.
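The following sketch (a brute-force toy, not a practical algorithm) illustrates the transductive setting on the working set: it scores every possible labeling of a small test set by how smoothly it extends the training labels and keeps the best one. The smoothness score is an illustrative choice, not Vapnik's formulation, and the data are made up.

```python
import numpy as np
from itertools import product

# Toy transduction: choose labels for the test points directly, by searching
# over all candidate labelings of the working set (train + test) and keeping
# the "smoothest" one (similarity-weighted label disagreement).
X_train = np.array([[0.0], [0.5], [4.0], [4.5]])
y_train = np.array([-1, -1, 1, 1])
X_test = np.array([[0.8], [3.6]])

X_work = np.vstack([X_train, X_test])
similarity = np.exp(-np.square(X_work - X_work.T))   # pairwise similarity

def roughness(labels):
    disagree = labels[:, None] != labels[None, :]
    return np.sum(similarity * disagree)

best_labels, best_score = None, np.inf
for candidate in product([-1, 1], repeat=len(X_test)):   # 2^k labelings
    labels = np.concatenate([y_train, candidate])
    score = roughness(labels)
    if score < best_score:
        best_labels, best_score = np.array(candidate), score

print("test labels:", best_labels)   # expected: [-1  1]
```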

To conclude, the goal of inductive learning (the classical parametric paradigm and the predictive paradigm) is to generalize for any future test set, while the goal of transductive inference is to make predictions for a specific working set. In inductive inference, the error probability is not meaningful when the prediction rule is updated very abruptly and the data points may not be independently and identically distributed, as, for example, in data streaming. On the contrary, Vapnik [36] illustrated that the results from transductive inference are accurate even when the data points of interest and the training data are not independently and identically distributed. Therefore, the predictive power of transductive inference can be estimated at any time instance in a data stream for both future and previously observed data points that are not independently and identically distributed. In particular, empirical findings suggest that transductive inference is more suitable than inductive inference for problems with small training sets and large test sets [39].
