

4.3 Linear regression

4.3.1 Bivariate linear regression

In many applications it is interesting to evaluate whether one variable, called the dependent variable or the response, can be caused by, explained by, and therefore predicted as a function of another, called the independent variable, the explanatory variable, the covariate or the feature. We will use $Y$ for the dependent (or response) variable and $X$ for the independent (or explanatory) variable. The simplest statistical model that can describe $Y$ as a function of $X$ is linear regression.

The linear regression model specifies a noisy linear relationship between the variables $Y$ and $X$; for each paired observation $(x_i, y_i)$ this can be expressed by the so-called regression function,

$$y_i = a + b x_i + e_i, \qquad i = 1, 2, \ldots, n,$$

where $a$ is the intercept of the regression function, $b$ is the slope coefficient of the regression function, also called the regression coefficient, and $e_i$ is the random error of the regression function, relative to the $i$th observation.
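To make the specification concrete, here is a minimal sketch that simulates data from this model. The parameter values, sample size and noise scale are illustrative assumptions, not taken from the text.

```python
# Minimal sketch: simulate data from the bivariate linear regression model
# y_i = a + b*x_i + e_i (all numeric values below are illustrative).
import numpy as np

rng = np.random.default_rng(0)

n = 100
a, b = 1.0, 2.5              # intercept and slope (assumed values)
x = rng.uniform(0, 10, n)    # explanatory variable X
e = rng.normal(0, 1.0, n)    # random errors e_i
y = a + b * x + e            # regression function
```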

Note that the regression function has two main parts: the regression line and the error term. The regression line can be constructed empirically, starting with the matrix of available data. The error term describes how well the regression line approximates the observed response variable. From an exploratory point of view, determination of the regression line can be described as the problem of fitting a straight line to the observed dispersion diagram. The regression line is described by the linear function:

$$\hat{y}_i = a + b x_i, \qquad i = 1, 2, \ldots, n,$$

where $\hat{y}_i$ denotes the fitted $i$th value of the dependent variable, calculated on the basis of the $i$th value of the explanatory variable, $x_i$. Having defined the regression line, it follows that the error term $e_i$ in the expression for the regression function represents, for each observation $y_i$, the residual, that is, the difference between the observed response value, $y_i$, and the corresponding value fitted with the regression line, $\hat{y}_i$:

$$e_i = y_i - \hat{y}_i.$$

Figure 4.2 Representation of the regression line.

Each residual can be interpreted as the part of the corresponding response value that is not explained by the linear relationship with the explanatory variable. What we have just described can be represented graphically, as in Figure 4.2. To obtain the analytic expression of the regression line it is sufficient to calculate the parameters $a$ and $b$ on the basis of the available data. The method of least squares is often used for this purpose: it chooses the straight line that minimises the sum of squared errors of the fit (SSE), defined by

$$\mathrm{SSE} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2.$$
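As a concrete illustration, the following sketch evaluates the SSE for one candidate pair $(a, b)$ on synthetic data; all numeric values are illustrative assumptions.

```python
# Sketch: residuals and SSE for a candidate line y = a + b*x
# on synthetic data (illustrative values, not the text's data).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 1.0 + 2.5 * x + rng.normal(0, 1.0, 100)

a, b = 0.8, 2.6                 # a candidate (not necessarily optimal) line
y_hat = a + b * x               # fitted values
residuals = y - y_hat           # e_i = y_i - y_hat_i
sse = np.sum(residuals ** 2)    # SSE(a, b)
print(f"SSE at (a={a}, b={b}): {sse:.2f}")
```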

To find the minimum of the SSE we take its first partial derivatives with respect to $a$ and $b$ and equate them to zero. Since the sum of squared errors is a quadratic (and convex) function, if an extremal point exists then it is a minimum. Therefore the parameters of the regression line are found by solving the following system of equations, called the normal equations:

$$\frac{\partial \sum_i (y_i - a - b x_i)^2}{\partial a} = -2 \sum_i (y_i - a - b x_i) = 0,$$

$$\frac{\partial \sum_i (y_i - a - b x_i)^2}{\partial b} = -2 \sum_i x_i (y_i - a - b x_i) = 0.$$

From the first equation we obtain

$$a = \frac{\sum_i y_i}{n} - b\,\frac{\sum_i x_i}{n} = \mu_Y - b\,\mu_X.$$

Substituting this into the second equation and simplifying gives

$$b = \frac{\sum_i x_i y_i / n - \left(\sum_i y_i / n\right)\left(\sum_i x_i / n\right)}{\sum_i x_i^2 / n - \left(\sum_i x_i / n\right)^2} = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)} = r(X, Y)\,\frac{\sigma_Y}{\sigma_X},$$

where $\mu_Y$ and $\mu_X$ are the means and $\sigma_Y$ and $\sigma_X$ the standard deviations of the variables $Y$ and $X$, while $r(X, Y)$ denotes the linear correlation coefficient between $X$ and $Y$.
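The closed-form solution above is easy to compute directly. Here is a sketch, on synthetic data, that applies the formulae $b = \operatorname{Cov}(X, Y)/\operatorname{Var}(X)$ and $a = \mu_Y - b\,\mu_X$ and cross-checks them against NumPy's built-in least squares fit; the data are simulated, not the text's.

```python
# Sketch: least squares estimates via the derived closed-form solution,
# verified against NumPy's degree-1 polynomial fit (same SSE criterion).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 1.0 + 2.5 * x + rng.normal(0, 1.0, 100)

# b = Cov(X, Y) / Var(X),  a = mean(Y) - b * mean(X)
b_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)
a_hat = y.mean() - b_hat * x.mean()

# Cross-check: np.polyfit returns [slope, intercept] for deg=1
b_np, a_np = np.polyfit(x, y, deg=1)
assert np.allclose([a_hat, b_hat], [a_np, b_np])
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
```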

Regression is a simple and powerful predictive tool. To use it in real situations, it is only necessary to calculate the parameters of the regression line from the available data, according to the previous formulae. A value for $Y$ is then predicted simply by substituting a value for $X$ into the equation of the regression line. The predictive ability of the regression line depends on its goodness of fit, which is very seldom perfect.

If the variables were both standardised, with zero mean and unit variance, then $a = 0$ and $b = r(X, Y)$. In that case $\hat{y}_i = r(X, Y)\,x_i$, and the regression line of $X$, as a function of $Y$, is simply obtained by inverting the linear relation between $Y$ and $X$. Even though not generally true, this particular case shows the link between a symmetric analysis of the relationship between the variables (described by the linear correlation coefficient) and an asymmetric analysis (described by the regression coefficient $b$).
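A quick numerical check of this special case, again on synthetic data (the standardisation step and all values are illustrative):

```python
# Sketch: after standardising both variables, the fitted intercept is 0
# and the fitted slope equals the correlation coefficient r(X, Y).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 1.0 + 2.5 * x + rng.normal(0, 1.0, 100)

xs = (x - x.mean()) / x.std()   # standardised X
ys = (y - y.mean()) / y.std()   # standardised Y

b_std, a_std = np.polyfit(xs, ys, deg=1)
r = np.corrcoef(x, y)[0, 1]
print(f"a = {a_std:.6f} (approx. 0), b = {b_std:.6f}, r(X, Y) = {r:.6f}")
```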

Here is a simple regression model for the weekly returns of an investment fund. The period considered goes from 4 October 1994 to 4 October 1999. The objective of the analysis is to study the dependence of the returns on the weekly variations of a stock market index, the MSCI WORLD, typically used as a benchmark (predictor) for the returns themselves.

Figure 4.3 shows the behaviour of a simple regression model for this data set, together with the scatterplot of the data (the fund returns, REND, against the index variations, WORLD). The intercept parameter $a$ has been set to zero before fitting the model, to obtain a fitted model as close as possible to the theoretical financial model known as the capital asset pricing model.

Figure 4.3 Example of a regression line fit.

The slope parameter of the regression line in Figure 4.3 is calculated on the basis of the data, according to the formula previously presented, from which it turns out that $b = 0.8331$. Therefore, the obtained regression line can be analytically described by the equation

$$\mathrm{REND} = 0.8331\,\mathrm{WORLD},$$

where REND is the response variable and WORLD the explanatory variable. The main utility of this model is in prediction: on the basis of the fitted model, we can forecast that if the WORLD index increases by 10% in a week, the fund returns will increase by 8.331%.
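As a sketch of how such a zero-intercept fit and prediction could be reproduced: when $a$ is fixed at zero, minimising $\mathrm{SSE} = \sum_i (y_i - b x_i)^2$ gives $b = \sum_i x_i y_i / \sum_i x_i^2$. The weekly return series below are synthetic stand-ins; the original REND and MSCI WORLD data are not reproduced here.

```python
# Sketch: regression through the origin (intercept fixed at zero),
# as in the CAPM-style fit above, on synthetic stand-in return series.
import numpy as np

rng = np.random.default_rng(0)
world = rng.normal(0, 2, 260)                   # weekly % index variations (synthetic)
rend = 0.8331 * world + rng.normal(0, 1, 260)   # fund returns with noise (synthetic)

# With a = 0, the least squares slope is b = sum(x*y) / sum(x^2)
b = np.sum(world * rend) / np.sum(world ** 2)

# Prediction: expected fund return if the index moves by 10% in a week
print(f"b = {b:.4f}; predicted return for a 10% index move: {b * 10:.3f}%")
```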