

4.1.1.2 Bayesian Approach

Imagine two boxes, one red and one blue, with the red box containing 2 apples and 6 oranges, and the blue box containing 3 apples and 1 orange. Suppose that one of the boxes is picked at random and one fruit is selected from it; after observing which sort of fruit was picked, the fruit is put back in the box. Assume that the red box is picked 40% of the time and the blue box the other 60%. Denoting the random variables B for box and F for fruit, with values r = red, b = blue, a = apple, and o = orange, the probability of selecting the red box is expressed as p(B = r) = 4/10 and p(B = b) = 6/10. Note that all probabilities must lie in the interval [0, 1], and that the probability of an event is defined as the fraction of times that the event occurs out of the total number of trials, in the limit that the total number of trials tends to infinity. Note also that p(B = r) + p(B = b) = 1.

In the same way, since the probability of selecting an apple from the blue box is 3/4, the conditional probabilities of each fruit given each box follow from the box contents:

• p(F = a | B = r) = 1/4 and p(F = o | B = r) = 3/4
• p(F = a | B = b) = 3/4 and p(F = o | B = b) = 1/4

Using the rules of sum and product of probability, the overall probability of choosing an apple can then be evaluated as:

p(F = a) = p(F = a | B = r) p(B = r) + p(F = a | B = b) p(B = b) = 1/4 × 4/10 + 3/4 × 6/10 = 11/20

and, consequently, p(F = o) = 1 − p(F = a) = 9/20.

Now, let us assume that a fruit has been selected and it is an orange, and that it is desired to know from which box it came. The problem of reversing the conditional probability can be solved by using the Bayes theorem, which yields:

p(B = r | F = o) = p(F = o | B = r) p(B = r) / p(F = o) = (3/4 × 4/10) / (9/20) = 2/3

An important interpretation of the Bayes theorem can be provided. If it had been asked which box was chosen before being told the selected fruit, then the most complete information available is provided by the probability p(B). This is called the prior probability, because it is the probability available before the observation of the identity of the fruit. Once it is known that the fruit is an orange, the Bayes theorem can be used to compute the probability p(B|F), which is called the posterior probability, because it is the probability obtained after the observation of F. Note that the prior probability of selecting the red box was 4/10, so a priori it is more likely that the blue box was selected. However, once it is observed that the selected fruit is an orange, the posterior probability of the red box becomes 2/3, so it is now more likely that the box selected was in fact the red one. The observation that the fruit was an orange therefore provides significant evidence favouring the red box. Moreover, this evidence is strong enough to outweigh the prior probability, which favoured the blue box.
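These numbers can be verified with a few lines of code. The following minimal sketch (Python, with illustrative variable names that are not taken from the thesis) applies the sum rule and the Bayes theorem to the box-and-fruit example:

```python
# Priors and box contents taken from the example above; variable names are illustrative.
p_B = {"r": 4 / 10, "b": 6 / 10}          # prior p(B)
p_F_given_B = {                            # conditional p(F | B)
    "r": {"a": 2 / 8, "o": 6 / 8},         # red box: 2 apples, 6 oranges
    "b": {"a": 3 / 4, "o": 1 / 4},         # blue box: 3 apples, 1 orange
}

# Sum and product rules: p(F = a) = sum_B p(F = a | B) p(B)
p_apple = sum(p_F_given_B[box]["a"] * p_B[box] for box in p_B)
print(p_apple)                             # 0.55, i.e. 11/20

# Bayes theorem: p(B = r | F = o) = p(F = o | B = r) p(B = r) / p(F = o)
p_orange = 1.0 - p_apple                   # 0.45, i.e. 9/20
p_red_given_orange = p_F_given_B["r"]["o"] * p_B["r"] / p_orange
print(p_red_given_orange)                  # 2/3, as stated above
```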

Now, after pointing out the advantages of a Bayesian approach, where probabilities provide a quantification of uncertainty rather than the typical frequentist interpretation of probability based on frequencies of random, repeatable events, let us reconsider the polynomial curve fitting example.

In the polynomial fitting, assumptions about the parameters w can be made before observing the data, in the form of a prior probability distribution p(w). Then, the effect of the observed data D = {t_1, . . . , t_N} is expressed through the conditional probability p(D|w), and it will be shown how this can be represented explicitly.

Hence, the Bayes theorem takes the form of

p(w|D) = p(D|w) p(w) / p(D)    (4.8)

which allows the evaluation of the uncertainty in w after the observation of D, in the form of the posterior probability p(w|D).

The quantity p(D|w) of (4.8) is evaluated for the observed data set D and can be viewed as a function of the parameter vector w, in which case it is called the likelihood function. It expresses how probable the observed data set is for different settings of the parameter vector w. Note that the likelihood is not a probability distribution over w; hence, its integral with respect to w does not (necessarily) equal one. In other words, the Bayes theorem can be expressed as:

posterior ∝ likelihood × prior

where all these quantities are viewed as functions of w. The term p(D) in (4.8) is the normalization constant, which ensures that the left-hand side of the equation is a valid probability density and integrates to one.
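A grid-based sketch can make this relation concrete for a single scalar parameter: the unnormalized product likelihood × prior is computed on a grid of w values and then divided by its integral, which plays the role of p(D). The linear model, prior, and noise level below are assumed purely for illustration.

```python
# Minimal sketch of "posterior ∝ likelihood × prior" for one scalar parameter w,
# using a grid approximation; model and data are illustrative, not from the thesis.
import numpy as np

rng = np.random.default_rng(0)
w_true, beta = 1.5, 25.0                       # assumed true slope and noise precision
x = np.linspace(0, 1, 10)
t = w_true * x + rng.normal(0, 1 / np.sqrt(beta), x.size)   # observed data D

w_grid = np.linspace(-2, 4, 601)
prior = np.exp(-0.5 * w_grid ** 2)             # unnormalized Gaussian prior p(w), zero mean

# Likelihood p(D | w): product of Gaussian factors, evaluated on the grid (in log form)
log_lik = np.array([-0.5 * beta * np.sum((t - w * x) ** 2) for w in w_grid])
unnormalized = prior * np.exp(log_lik - log_lik.max())      # ∝ likelihood × prior

# p(D) plays the role of the normalization constant: divide so the posterior integrates to one
dw = w_grid[1] - w_grid[0]
posterior = unnormalized / (unnormalized.sum() * dw)
print(w_grid[np.argmax(posterior)])            # posterior mode, close to w_true
```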

In both the Bayesian and frequentist paradigms, the likelihood function p(D|w) is fundamental, but it is used differently depending on the approach. In the frequentist setting, w is considered to be a fixed parameter, which is determined by some "estimator", and error bars on this estimate are obtained by considering the distribution of possible data sets D. On the other hand, in the Bayesian approach, there is only a single data set D, the one actually observed, and the uncertainty in the parameters is expressed through a probability distribution over w.

In the frequentist maximum likelihood estimator, w is set to the value that maximizes the likelihood function. This corresponds to choosing w such that the probability of the observed data set is maximized.

In the machine learning literature, the negative logarithm of the likelihood function is called the error function: since the negative logarithm is a monotonically decreasing function, maximizing the likelihood is equivalent to minimizing this error.

The advantage of the Bayesian approach lies in the inclusion of prior knowledge. Suppose, for example, that a coin is flipped three times and lands heads every time. The classical maximum likelihood estimate of the probability of landing heads would be 1, implying that all future tosses will land heads. In contrast, the Bayesian approach with a reasonable prior will lead to a far less extreme conclusion.
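As an illustration of this remark, the following sketch compares the maximum likelihood estimate with a Bayesian one; the conjugate Beta prior used here is an assumption chosen for simplicity, not a prescription from the thesis.

```python
# Coin-flip remark: maximum likelihood vs. a Bayesian estimate with a Beta prior
# (the conjugate prior for a Bernoulli likelihood; the prior choice is assumed).
heads, tails = 3, 0                      # three tosses, all heads

# Maximum likelihood estimate of p(heads): fraction of heads observed
p_ml = heads / (heads + tails)           # = 1.0, predicts heads forever

# Bayesian estimate: Beta(a0, b0) prior updated to Beta(a0 + heads, b0 + tails)
a0, b0 = 2.0, 2.0                        # a mildly informative prior centred on 0.5
a, b = a0 + heads, b0 + tails
p_posterior_mean = a / (a + b)           # = 5/7 ≈ 0.71, a far less extreme conclusion

print(p_ml, p_posterior_mean)
```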

Therefore, in the curve fitting problem viewed from a probabilistic perspective, the goal is to make predictions for the target variable t given a new value of the scalar input variable x, using a training set comprising N input observation values X = (x_1, . . . , x_N)^T and their corresponding target values T = (t_1, . . . , t_N)^T. In this manner, the uncertainty over the value of the target variable can be expressed using a probability distribution. For this purpose, it is assumed that, given the value of x, the corresponding value of t has a Gaussian distribution¹ with a mean equal to the value y(x, w) of the polynomial curve given by (4.1). Thus:

p(t | x, w, β) = N(t | y(x, w), β^{-1})    (4.9)

where β corresponds to the inverse variance (the precision) of the Gaussian distribution.
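The noise model (4.9) can be written down in a few lines. The sketch below (with illustrative coefficients, not fitted values) evaluates the Gaussian density of the target t around the polynomial mean y(x, w):

```python
# Small sketch of the noise model (4.9): given x and w, the target t is Gaussian
# with mean y(x, w) and variance 1/beta. The coefficients used here are assumed.
import numpy as np

def y(x, w):
    """Polynomial curve y(x, w) = w0 + w1*x + ... + wM*x^M, as in (4.1)."""
    return np.polynomial.polynomial.polyval(x, w)

def p_t_given_x(t, x, w, beta):
    """Gaussian density N(t | y(x, w), beta^{-1})."""
    mean, var = y(x, w), 1.0 / beta
    return np.exp(-0.5 * (t - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

w = np.array([0.0, 1.0, -0.5])       # illustrative coefficients, M = 2
print(p_t_given_x(t=0.4, x=0.8, w=w, beta=25.0))
```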

Now, the values of the unknown parameters w and β are determined using the training data {X, T} via maximum likelihood. If the training data are assumed to be drawn independently from the distribution (4.9), then the likelihood function is given by

p(T | X, w, β) = ∏_{n=1}^{N} N(t_n | y(x_n, w), β^{-1})    (4.10)

As shown in Appendix B.2, where the maximum likelihood approach is used to perform a regression of a Gaussian distribution, it is convenient to maximize the logarithm of the likelihood function. In this manner, using (B.1), which describes the Gaussian distribution mathematically, one obtains

ln p(T | X, w, β) = −(β/2) Σ_{n=1}^{N} {y(x_n, w) − t_n}^2 + (N/2) ln β − (N/2) ln(2π)    (4.11)

First, the maximum likelihood solution for the polynomial coefficients, denoted by w_{ML}, is obtained by maximizing (4.11) with respect to w. Some interesting results are derived from this. Note that the last two terms of the right-hand side can be omitted, as they do not depend on w. Then, scaling the log likelihood by a positive constant does not alter the location of the maximum with respect to w, so the coefficient β/2 can be replaced by 1/2. Finally, instead of maximizing the log likelihood, it is equivalent to minimize the negative log likelihood. Therefore, maximizing the likelihood to obtain the parameters w is the same as minimizing the sum-of-squares error function defined previously in (4.2). Thus the sum-of-squares error function is a consequence of maximizing the likelihood under the assumption of a Gaussian noise distribution.
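This equivalence can be checked numerically. In the sketch below, a polynomial is fitted by ordinary least squares to assumed synthetic data, and the least-squares solution is verified to also minimize the negative log-likelihood of (4.11):

```python
# Minimal sketch (assumed synthetic data) checking that maximizing the Gaussian
# log-likelihood (4.11) over w is the same as minimizing the sum-of-squares error.
import numpy as np

rng = np.random.default_rng(1)
N, M, beta_true = 20, 3, 100.0                 # assumed sample size, order, noise precision
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 1 / np.sqrt(beta_true), N)

# Design matrix of an M-th order polynomial: columns 1, x, x^2, ..., x^M
Phi = np.vander(x, M + 1, increasing=True)

# w_ML from least squares (minimizing the sum-of-squares error)
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

def neg_log_likelihood(w, beta):
    sse = np.sum((Phi @ w - t) ** 2)
    return 0.5 * beta * sse - 0.5 * N * np.log(beta) + 0.5 * N * np.log(2 * np.pi)

# Perturbing w_ML in any direction increases the negative log-likelihood
for _ in range(5):
    w_pert = w_ml + 0.01 * rng.normal(size=w_ml.shape)
    assert neg_log_likelihood(w_pert, beta_true) > neg_log_likelihood(w_ml, beta_true)
```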

Also, maximizing (4.11) with respect to the precision parameter β of the Gaussian conditional distribution yields

1/β_{ML} = (1/N) Σ_{n=1}^{N} {y(x_n, w_{ML}) − t_n}^2    (4.12)

Now that the parameters w and β have been determined, predictions for new values of x can be made. Since a probabilistic model is obtained, the predictions are expressed by a predictive distribution that gives the probability distribution over t rather than simply a point estimate. Hence (4.9) turns into

p(t | x, w_{ML}, β_{ML}) = N(t | y(x, w_{ML}), β_{ML}^{-1})    (4.13)
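A minimal sketch of these two steps, on assumed synthetic data, computes β_{ML} from the mean squared residual as in (4.12) and then reports the full predictive Gaussian of (4.13) at a new input:

```python
# Sketch of the predictive distribution (4.13) once w_ML and beta_ML are known.
# The data and fitted polynomial here are assumed placeholders, not thesis results.
import numpy as np

x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + np.random.default_rng(2).normal(0, 0.1, x.size)

w_ml = np.polyfit(x, t, deg=3)                      # least-squares polynomial coefficients
residuals = np.polyval(w_ml, x) - t

# (4.12): 1 / beta_ML is the mean squared residual of the fitted curve
beta_ml = 1.0 / np.mean(residuals ** 2)

# (4.13): the prediction at a new input x* is a full Gaussian, not just a point estimate
x_new = 0.37                                        # arbitrary new input
mean = np.polyval(w_ml, x_new)
std = np.sqrt(1.0 / beta_ml)
print(f"p(t | x*={x_new}) = N(t | {mean:.3f}, {std:.3f}^2)")
```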

Therefore, in order to advance towards a Bayesian approach, a Gaussian prior distribution over the polynomial coefficients w is introduced, given by

p(w | α) = N(w | 0, α^{-1} I) = (α/(2π))^{(M+1)/2} exp{−(α/2) w^T w}    (4.14)

where α is the precision of the distribution and M + 1 is the total number of elements in the vector w for an Mth-order polynomial. Variables such as α, which control the distribution of model parameters, are called hyperparameters. Then, using the Bayes theorem, the posterior distribution for w is proportional to the product of the prior distribution and the likelihood function, denoted by

¹ For more information on the Gaussian distribution, refer to Appendix B.1.

p(w | X, T, α, β) ∝ p(T | X, w, β) p(w | α)    (4.15)

In this manner, w can be determined by finding its most probable value given the data, which involves maximizing the posterior distribution; this technique is called maximum posterior (MAP). Taking the negative logarithm of (4.15) and using (4.11) and (4.14), the maximum of the posterior is given by the minimum of

(β/2) Σ_{n=1}^{N} {y(x_n, w) − t_n}^2 + (α/2) w^T w

At this point, it is clear that maximizing the posterior distribution is equivalent to minimizing the regularized sum-of-squares error function encountered in (4.4), with a regularization parameter given by λ = α/β.
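The following sketch illustrates this correspondence on assumed synthetic data: the MAP solution under the Gaussian prior (4.14) is computed in closed form as a ridge-regularized least-squares fit with λ = α/β, and is checked to minimize the regularized error above.

```python
# Hedged sketch of the closing remark: the MAP solution under the Gaussian prior (4.14)
# coincides with a regularized (ridge) least-squares fit with lambda = alpha / beta.
# Data, alpha and beta are assumed values for illustration only.
import numpy as np

rng = np.random.default_rng(3)
N, M = 15, 3
alpha, beta = 0.5, 25.0                              # assumed hyperparameter and noise precision
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 1 / np.sqrt(beta), N)
Phi = np.vander(x, M + 1, increasing=True)           # polynomial design matrix

# MAP / regularized least squares: minimize (beta/2)||Phi w - t||^2 + (alpha/2)||w||^2,
# whose closed-form minimizer uses lambda = alpha / beta
lam = alpha / beta
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

def regularized_error(w):
    # Negative log posterior of (4.15), up to additive constants
    return 0.5 * beta * np.sum((Phi @ w - t) ** 2) + 0.5 * alpha * np.sum(w ** 2)

# Any small perturbation of w_map increases the regularized error
for _ in range(5):
    assert regularized_error(w_map + 0.01 * rng.normal(size=M + 1)) > regularized_error(w_map)
```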
