
$$\sum_{s=1}^{p} \left( x_{is} - x^{(t)}_{sl} \right)^2$$

where $x^{(t)}_{l} = [x^{(t)}_{1l}, \ldots, x^{(t)}_{pl}]$ is the centroid of group l calculated at the tth iteration. This shows that the k-means method searches for the partition of the n observations into g groups (with g fixed in advance) that satisfies a criterion of internal cohesion based on the minimisation of the within-group deviance W; the goodness of the obtained partition can therefore be evaluated by calculating the R² index or the pseudo-F statistic. A disadvantage of the k-means method is the possibility of obtaining distorted results when there are outliers in the data.

The non-anomalous units will then tend to be classified into very few groups, while the outliers will tend to be placed in very small groups of their own. This can create so-called ‘elephant clusters’: clusters that are too big and contain most of the observations. Chapter 9 looks at an application of the k-means clustering method.
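As an illustration of the criterion just described, the following minimal sketch (not code from the text) computes the within-group deviance W of a k-means partition and the R² index derived from it. It assumes a generic numeric data matrix X and scikit-learn's KMeans; the data and the number of groups are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data matrix X (n x p) and number of groups g fixed in advance.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
g = 3

km = KMeans(n_clusters=g, n_init=10, random_state=0).fit(X)

# Within-group deviance W: squared distances of each point to its own centroid
# (this equals km.inertia_).
W = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()

# Total deviance T: squared distances to the overall centroid.
T = ((X - X.mean(axis=0)) ** 2).sum()

R2 = 1 - W / T   # closer to 1 means stronger internal cohesion of the partition
print(f"W = {W:.2f}, R^2 = {R2:.3f}")
```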

4.3 Linear regression

In Chapter 3, dealing with correlation and association between statistical variables, the variables were treated in a symmetric way. We now consider the common situation where we wish to deal with the variables in a non-symmetric way, to derive a predictive model for one (or more) response variables on the basis of one (or more) of the others. This section focuses on quantitative response variables and the next section focuses on qualitative response variables. Chapter 1 introduced the distinction between descriptive, predictive and local data mining methods. Linear regression is a predictive data mining method.

We will initially suppose that only two variables are available. Later we will consider the multivariate case.

4.3.1 Bivariate linear regression

In many applications it is interesting to evaluate whether one variable, called the dependent variable or the response, can be caused, explained and therefore predicted as a function of another, called the independent variable, the explanatory variable, the covariate or the feature. We will use Y for the dependent (or response) variable and X for the independent (or explanatory) variable. The simplest statistical model that can describe Y as a function of X is linear regression.

The linear regression model specifies a noisy linear relationship between the variables Y and X, and for each paired observation (x_i, y_i) this can be expressed by the so-called regression function:

$$y_i = a + b x_i + e_i \qquad (i = 1, 2, \ldots, n)$$

where a is the intercept of the regression function, b is the slope coefficient of the regression function, also called the regression coefficient, and e_i is the random error of the regression function, relative to the ith observation.
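To make the two ingredients of the model concrete, here is a minimal sketch, with illustrative parameter values, that generates data from the regression function y_i = a + b x_i + e_i:

```python
import numpy as np

# Illustrative values only: simulate observations from y_i = a + b*x_i + e_i.
rng = np.random.default_rng(1)
n, a, b = 100, 2.0, 0.5
x = rng.uniform(0, 10, size=n)    # values of the explanatory variable X
e = rng.normal(0, 1, size=n)      # random errors e_i
y = a + b * x + e                 # observed responses: linear part plus noise
```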

Note that the regression function has two main parts: the regression line and the error term. The regression line can be built empirically, starting from the matrix of available data. The error term describes how well the regression line approximates the observed response variable. From an exploratory viewpoint, determination of the regression line can be described as a problem of fitting a straight line to the observed dispersion diagram. The regression line is the linear function

$$\hat{y}_i = a + b x_i \qquad (i = 1, 2, \ldots, n)$$

where ŷ_i indicates the ith fitted value of the dependent variable, calculated on the basis of the ith value of the explanatory variable, x_i. Having defined the regression line, it follows that the error term e_i in the expression of the regression function represents, for each observation y_i, the residual, namely the difference between the observed response value y_i and the corresponding value fitted with the regression line, ŷ_i:

$$e_i = y_i - \hat{y}_i$$

Each residual can be interpreted as the part of the corresponding value that is not explained by the linear relationship with the explanatory variable. What we have just described can be represented graphically as in Figure 4.2. To obtain the analytic expression of the regression line it is sufficient to calculate the parameters a and b on the basis of the available data. The method of least squares is often used for this. It chooses the straight line that minimises the sum of the squares

Figure 4.2 Representation of the regression line.

of the errors of the fit (SSE), defined by

$$\mathrm{SSE} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$

To find the minimum of SSE we need to take the first partial derivatives of the SSE function with respect to a and b and equate them to zero. Since the sum of the squares of the errors is a quadratic function, if an extremal point exists then it is a minimum. Therefore the parameters of the regression line are found by solving the following system of equations, called the normal equations:

$$\sum_{i=1}^{n} y_i = n a + b \sum_{i=1}^{n} x_i, \qquad \sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2$$
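In matrix form the normal equations are a 2×2 linear system that can be solved directly. Here is a minimal sketch using numpy; the function name, variable names and data are illustrative, not from the text.

```python
import numpy as np

def fit_by_normal_equations(x, y):
    """Solve the normal equations for the intercept a and slope b."""
    n = len(x)
    A = np.array([[n,        x.sum()],
                  [x.sum(),  (x ** 2).sum()]])
    rhs = np.array([y.sum(), (x * y).sum()])
    a, b = np.linalg.solve(A, rhs)
    return a, b

# Quick check on synthetic data generated with a = 1 and b = 2.
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(size=50)
print(fit_by_normal_equations(x, y))   # approximately (1.0, 2.0)
```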

From the first equation we obtain

$$a = \frac{\sum_i y_i}{n} - b \, \frac{\sum_i x_i}{n} = \mu_Y - b\,\mu_X$$

Substituting it into the second equation and simplifying, we obtain

$$b = \frac{\sigma_{XY}}{\sigma_X^2} = r(X, Y)\,\frac{\sigma_Y}{\sigma_X}$$

where σ_XY is the covariance, and σ_X and σ_Y the standard deviations, of the variables Y and X, and r(X, Y) is the correlation coefficient between X and Y. Regression is a simple and powerful predictive tool. To use it in real situations, it is only necessary to calculate the parameters of the regression line, according to the previous formulae, on the basis of the available data. Then a value for Y is predicted simply by substituting a value for X into the equation of the regression line. The predictive ability of the regression line is a function of the goodness of fit of the regression line, which is very seldom perfect.
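A minimal numeric check of these closed-form formulae, on illustrative data (names and values are not from the text), could look as follows:

```python
import numpy as np

# Illustrative data generated with a = 3 and b = -0.8.
rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 200)
y = 3.0 - 0.8 * x + rng.normal(size=200)

b = np.cov(x, y, bias=True)[0, 1] / np.var(x)    # b = sigma_XY / sigma_X^2
a = y.mean() - b * x.mean()                      # a = mu_Y - b * mu_X

# Equivalent expression for the slope: b = r(X, Y) * sigma_Y / sigma_X.
b_alt = np.corrcoef(x, y)[0, 1] * y.std() / x.std()
print(a, b, np.isclose(b, b_alt))   # approximately 3.0, -0.8, True
```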

If the variables were both standardised, with zero mean and unit variance, then a = 0 and b = r(X, Y). In that case ŷ_i = r(X, Y) x_i, and the regression line of X as a function of Y is simply obtained by inverting the linear relation between Y and X. Even though this does not hold in general, this particular case shows the link between a symmetric analysis of the relationships between variables (described by the linear correlation coefficient) and an asymmetric analysis (described by the regression coefficient b).
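The following sketch, on illustrative data, verifies numerically that the slope computed on the standardised variables coincides with r(X, Y):

```python
import numpy as np

# Illustrative data; any linearly related pair of variables would do.
rng = np.random.default_rng(3)
x = rng.normal(size=500)
y = 1.5 + 0.7 * x + rng.normal(scale=0.5, size=500)

# Standardise both variables: zero mean, unit variance.
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

b_std = np.cov(zx, zy, bias=True)[0, 1] / np.var(zx)  # slope on standardised data
r = np.corrcoef(x, y)[0, 1]                           # correlation r(X, Y)
print(np.isclose(b_std, r))                           # True
```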

Here is a simple regression model for the real data introduced in Section 3.2, on the weekly returns of an investment fund. The period considered goes from 4th October 1994 to 4th October 1999. The objective of the analysis is to study the dependence of the returns on the weekly variations of a stock market index typically used as a benchmark (predictor) of the returns themselves; the index is named MSCI WORLD.

Figure 4.3 Example of a regression line fit.

Figure 4.3 shows the behaviour of a simple regression model for these data, along with the scatterplot of the observations. The intercept parameter a was set to zero before fitting the model. This was done to obtain a fitted model as close as possible to the theoretical financial model known as the capital asset pricing model (CAPM). The slope parameter of the regression line in Figure 4.3 is calculated on the basis of the data, according to the formula presented earlier, from which it turns out that b = 0.8331. Therefore the regression line can be described analytically by the following equation:

REND = 0.8331 · WORLD

where REND is the response variable and WORLD is the explanatory variable.

The main utility of this model is in prediction; for example, on the basis of the fitted model, we can forecast that if the WORLD index increases by 10% in a week, the fund returns will increase by 8.331%.
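As a sketch of how such a zero-intercept (CAPM-style) regression can be fitted and used for prediction, the following uses synthetic weekly data rather than the actual fund and MSCI WORLD series; with the intercept forced to zero, least squares gives b = Σ x_i y_i / Σ x_i².

```python
import numpy as np

# Hypothetical weekly data, not the fund's actual returns or the MSCI WORLD series.
rng = np.random.default_rng(2)
world = rng.normal(0, 0.02, size=260)             # weekly benchmark variations
rend = 0.8 * world + rng.normal(0, 0.01, 260)     # synthetic fund returns

# Least squares with the intercept forced to zero: b = sum(x*y) / sum(x^2).
b = (world * rend).sum() / (world ** 2).sum()

# Prediction: expected change in the fund return for a 10% rise of the index.
print(f"b = {b:.4f}, predicted return for WORLD +10%: {10 * b:.3f}%")
```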