
From the document A practical introduction to statistics (Page 101-104)

6.2 Ordinary least squares regression

6.2.1 Nonlinearities

We have already studied a regression model with a nonlinear relation between the predictor and the dependent variable. We could add a quadratic term to the model, using lm(),

> english.lm = lm(RTlexdec ~ WrittenFrequency + I(WrittenFrequency^2) +
+   AgeSubject + LengthInLetters, data = english)

> summary(english.lm)

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)            6.9181819  0.0100832 686.112  < 2e-16
WrittenFrequency      -0.0773456  0.0029733 -26.013  < 2e-16
I(WrittenFrequency^2)  0.0038209  0.0002732  13.987  < 2e-16
AgeSubjectyoung       -0.2217215  0.0025380 -87.362  < 2e-16
LengthInLetters        0.0050257  0.0015131   3.321 0.000903

and it is clear from the summary that the quadratic term for WrittenFrequency is justified. In technical terms, we have made use of a QUADRATIC POLYNOMIAL.
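The same quadratic fit can also be specified with R's poly() function (a sketch, not part of the original text; with raw = TRUE the coefficients match the I(WrittenFrequency^2) formulation exactly, while the default orthogonal polynomials give the same fitted values under a different parameterization):

```r
# Sketch (assumes the english data frame from languageR is loaded):
# poly(..., 2, raw = TRUE) expands WrittenFrequency into its linear and
# quadratic terms, so we need not write the I() term ourselves.
english.lmP = lm(RTlexdec ~ poly(WrittenFrequency, 2, raw = TRUE) +
  AgeSubject + LengthInLetters, data = english)
```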

It is not possible (nor necessary, as we shall see) to add a quadratic term in the same way to the model formula when using ols(). This is because ols() tries to look up the quadratic term in the data distribution object that we constructed for our data frame. As there is no separate quadratic term available in our data frame, ols() reports an error and quits. Fortunately, ols() provides alternative ways of modeling nonlinearities that are in fact simpler to specify in the model formula. In order to include a quadratic term for WrittenFrequency, we use the function pol(), an abbreviation for POLYNOMIAL. It takes two arguments, the name of the predictor and a number specifying the complexity of the polynomial function. A 2 specifies a linear and a quadratic component, a 3 defines the combination of a linear, a quadratic, and a cubic component, etc. Here, we opt for minimal nonlinearity with a quadratic fit:


> english.olsB = ols(RTlexdec ~ pol(WrittenFrequency, 2) + AgeSubject +
+   LengthInLetters, data = english)

> english.olsB

Coefficients:
                    Value     Std. Error        t  Pr(>|t|)
Intercept            6.918182  0.0100832  686.112 0.0000000
WrittenFrequency    -0.077346  0.0029733  -26.013 0.0000000
WrittenFrequency^2   0.003821  0.0002732   13.987 0.0000000
AgeSubject=young    -0.221721  0.0025380  -87.362 0.0000000
LengthInLetters      0.005026  0.0015131    3.321 0.0009026

The estimates of the coefficients are identical to those estimated by lm(), but we did not have to spell out the quadratic term ourselves.

The use of ols() has some further, more important advantages, however. First, the anova table lists the overall significance of WrittenFrequency, and separately the significance of its nonlinear component(s):

> anova(english.olsB)

Analysis of Variance    Response: RTlexdec

Factor            d.f. Partial SS   MS            F       P
WrittenFrequency     2 21.3312650  10.665632502  1508.39 <.0001
 Nonlinear           1  1.4462474   1.446247447   204.54 <.0001
AgeSubject           1 54.4400676  54.440067616  7699.22 <.0001
LengthInLetters      1  0.0821155   0.082115506    11.61  7e-04
REGRESSION           4 76.0907743  19.022693573  2690.30 <.0001
ERROR             4461 31.5430668   0.007070851

Unlike when anova() is applied to model objects produced by lm(), the anova() method for ols objects provides a NON-SEQUENTIAL analysis of variance table. This table lists, for each predictor, the F statistics and associated p-values given that all the other predictors are already in the model.
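With lm(), by contrast, anova() is sequential: each term is tested given only the terms listed before it, so reordering the predictors changes the table. The lm() analogue of the partial tests above is drop1() (a sketch, not part of the original text, assuming the english data frame is loaded):

```r
# Sketch: a sequential anova() depends on the order of the predictors ...
english.lmA = lm(RTlexdec ~ WrittenFrequency + AgeSubject + LengthInLetters,
  data = english)
english.lmB = lm(RTlexdec ~ LengthInLetters + AgeSubject + WrittenFrequency,
  data = english)
anova(english.lmA)       # sums of squares differ from anova(english.lmB)
# ... whereas drop1() tests each term with all others already in the model,
# as in the non-sequential table produced by anova() on an ols object.
drop1(english.lmA, test = "F")
```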

A second advantage of ols() is that it is straightforward to visualize the effects of the predictors. For this example, we begin with creating space for three panels with mfrow(), and then we apply plot() to the model object. When setting up the plot regions, we also specify that we need a smaller font size (0.6 of the standard) with the cex parameter, so that the text accompanying each panel is fully readable.

> par(mfrow = c(2, 2), cex = 0.6)

> plot(english.olsB)

> par(mfrow = c(1, 1), cex = 1.0)

Figure 6.3 shows the PARTIAL EFFECTS of each of the predictors, i.e., the effect of a given predictor when the other predictors in the model are held constant. The position of each curve with respect to the vertical axis depends on the actual values for which the other parameters in the model are held constant. These values are spelled out beneath each


[Figure: three panels plotting RTlexdec against WrittenFrequency (adjusted to AgeSubject=old, LengthInLetters=4), AgeSubject (adjusted to WrittenFrequency=4.832, LengthInLetters=4), and LengthInLetters (adjusted to WrittenFrequency=4.832, AgeSubject=old).]

Figure 6.3: Partial effects of the predictors in the model english.olsB.

panel. For instance, the curve for frequency represents the old subjects, and words with four letters (the median word length). The line for the effect of length is adjusted so that it describes the effect for the old subjects and for a written frequency of 4.8 (the median frequency). The dashed lines show the 95% confidence bands for the regression lines.

Confidence intervals are indicated by hyphens above and below the points representing factor levels. For AgeSubject, the intervals are so small that the hyphens coincide with the point symbols.

There are disadvantages to the use of polynomials, however. A quadratic polynomial presupposes that the data follow part of a parabola. For more complex curvature, higher-order polynomials can be used (i.e., models including additional cubic or higher terms), but they are costly in the number of parameters they require, they tend to overfit the data, and they impose a priori a very specific functional form on the curve. A more flexible alternative


is to use RESTRICTED CUBIC SPLINES. In construction, a spline is a flexible strip of metal or a piece of rubber that is used for drawing the curved parts of objects. In statistics, a spline is a function for modeling nonlinear relations. The spline function combines a series of simpler functions (in fact, cubic polynomials) defined over a corresponding series of intervals. These simpler functions are constrained to have smooth transitions where they meet, at the KNOTS of the spline. The number of knots determines the number of intervals. When you use more intervals, the simpler functions are defined over smaller intervals, so this allows you to model more subtle nonlinearities. In other words, the number of knots controls how smooth your curve will be. The minimal number of knots is 3 (so two intervals), in which case the curve is maximally smooth. As more knots are added, more wriggly curves can be fitted. Restricted cubic splines are cubic splines that are adjusted to avoid overfitting for the more extreme values of the predictor. For details, the reader is referred to Harrell [2001, 16–24] and references cited there.
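The basis functions that a restricted cubic spline is built from can be inspected directly with rcspline.eval() from the Hmisc package (a sketch, not part of the original text; the knot positions below are illustrative, not those that would be chosen for the english data):

```r
library(Hmisc)  # provides rcspline.eval()
# Sketch: the restricted cubic spline basis for a predictor x with k knots
# consists of k - 1 columns (with inclx = TRUE): x itself plus k - 2
# nonlinear terms, each constrained to be linear beyond the outer knots.
x = seq(0, 12, by = 0.1)
basis3 = rcspline.eval(x, knots = c(2, 5, 9), inclx = TRUE)
ncol(basis3)  # 2 columns for 3 knots: the linear term and one nonlinear term
```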

Let's consider two models, one with a restricted cubic spline with three knots, and one with seven knots. In the model formula, we replace pol() by rcs(). The first argument of rcs() specifies the predictor for which a spline is requested; the second specifies the number of knots.

> english.olsC = ols(RTlexdec ~ rcs(WrittenFrequency, 3) + AgeSubject +
+   LengthInLetters, data = english)

> english.olsC

                   Value     Std. Error        t Pr(>|t|)
Intercept           6.903062   0.009248  746.411 0.000000
WrittenFrequency   -0.059213   0.001650  -35.882 0.000000
WrittenFrequency'   0.030576   0.002055   14.881 0.000000
AgeSubject=young   -0.221721   0.002531  -87.598 0.000000
LengthInLetters     0.004875   0.001508    3.232 0.001238

The mathematics of restricted cubic splines work out so that the number of parameters required is one less than the number of knots. This explains why the summary lists two coefficients for WrittenFrequency. For seven knots, we get six coefficients:

> english.olsD = ols(RTlexdec ~ rcs(WrittenFrequency, 7) + AgeSubject +
+   LengthInLetters, data = english)

> english.olsD

                       Value     Std. Error        t  Pr(>|t|)
Intercept               6.794645   0.013904  488.697 0.000e+00
WrittenFrequency       -0.010971   0.005299   -2.070 3.847e-02
WrittenFrequency'      -0.348645   0.052381   -6.656 3.147e-11
WrittenFrequency''      2.101416   0.474765    4.426 9.814e-06
WrittenFrequency'''    -2.987002   1.081374   -2.762 5.764e-03
WrittenFrequency''''    1.880416   1.121685    1.676 9.372e-02
WrittenFrequency'''''  -0.951205   0.649998   -1.463 1.434e-01
AgeSubject=young       -0.221721   0.002497  -88.784 0.000e+00
LengthInLetters         0.005238   0.001491    3.513 4.468e-04


Note that the last two coefficients for WrittenFrequency have large p-values. This suggests that 5 knots should be sufficient to capture the nonlinearity without undersmoothing or oversmoothing. Figure 6.4 compares the different spline curves with the curve obtained with a quadratic polynomial. With only three knots (so two intervals), we basically get two straight lines with a smooth bend, which together are very similar to the polynomial curve. With seven knots (so six intervals), the curve becomes somewhat wriggly in the center, with several points of inflection. These are removed when the number of intervals is reduced to four.

Figure 6.4 is built panel by panel. Presuming the plot region is defined properly with mfrow(), we obtain the upper left panel by setting WrittenFrequency to NA.

> plot(english.olsC, WrittenFrequency = NA, ylim = c(6.5, 7.0), conf.int = F)

This tells the plot method for ols objects that it should suppress panels for the other predictors in the model. As we want to avoid cluttering our plot with very similar confidence intervals, we set conf.int = F. In order to add the polynomial curve to the same plot, we specify add = T.

> plot(english.olsB, WrittenFrequency = NA, add = T,
+   lty = 2, conf.int = F)

> mtext("3 knots, undersmoothing", 3, 1, cex = 0.8)

The other two panels are obtained in a similar way. Note that we force the same interval on the vertical axis across all panels.

> plot(english.olsD, WrittenFrequency=NA, ylim=c(6.5, 7.0), conf.int=F)

> plot(english.olsB, WrittenFrequency=NA, add=T, lty=2, conf.int=F)

> mtext("7 knots, oversmoothing", 3, 1, cex = 0.8)

> english.olsE = ols(RTlexdec ~ rcs(WrittenFrequency, 5) + AgeSubject +
+   LengthInLetters, english)

> plot(english.olsE, WrittenFrequency=NA, ylim=c(6.5, 7.0), conf.int=F)

> plot(english.olsB, WrittenFrequency=NA, add=T, lty=2, conf.int=F)

> mtext("5 knots", 3, 1, cex = 0.8)

It turns out that there is an interaction of WrittenFrequency by age.

> english.olsE = ols(RTlexdec ~ rcs(WrittenFrequency, 5) + AgeSubject +
+   LengthInLetters + rcs(WrittenFrequency, 5) : AgeSubject,
+   data = english)

The summary shows that there are four coefficients for the interaction of age by frequency, matching the four coefficients for frequency by itself.

> english.olsE

Coefficients:
                   Value ...
Intercept           6.856846
WrittenFrequency   -0.039530


[Figure: three panels, each plotting RTlexdec against WrittenFrequency and adjusted to AgeSubject=old and LengthInLetters=4, labeled "3 knots, undersmoothing", "7 knots, oversmoothing", and "5 knots".]

Figure 6.4: The partial effect of written frequency using a restricted cubic spline with three knots (upper left), seven knots (upper right), and five knots (lower left). The dashed line represents a quadratic polynomial.


WrittenFrequency'                      -0.136373
WrittenFrequency''                      0.749955
WrittenFrequency'''                    -0.884461
AgeSubject=young                       -0.275166
LengthInLetters                         0.005218
WrittenFrequency * AgeSubject=young     0.017493
WrittenFrequency' * AgeSubject=young   -0.043592
WrittenFrequency'' * AgeSubject=young   0.010664
WrittenFrequency''' * AgeSubject=young  0.171251
...

Residual standard error: 0.08448 on 4557 degrees of freedom
Adjusted R-Squared: 0.7102

The anova table confirms that all these coefficients are really necessary.

> anova(english.olsE)

Analysis of Variance    Response: RTlexdec

Factor                                  df    SS       MS        F        P
WrittenFrequency
 (Factor+Higher Order Factors)             8  23.5123   2.9390   411.80  <.0001
 All Interactions                          4   0.1093   0.0273     3.83  0.0041
Nonlinear
 (Factor+Higher Order Factors)             6   2.4804   0.4134    57.92  <.0001
AgeSubject
 (Factor+Higher Order Factors)             5  56.2505  11.2501  1576.29  <.0001
 All Interactions                          4   0.1093   0.0273     3.83  0.0041
LengthInLetters                            1   0.0874   0.0874    12.24  0.0005
WrittenFrequency * AgeSubject
 (Factor+Higher Order Factors)             4   0.1093   0.0273     3.83  0.0041
 Nonlinear                                 3   0.1092   0.0364     5.10  0.0016
TOTAL NONLINEAR                            6   2.4804   0.4134    57.92  <.0001
TOTAL NONLINEAR + INTERACTION              7   2.4806   0.3544    49.65  <.0001
REGRESSION                                10  79.9318   7.9932  1119.95  <.0001
ERROR                                   4557  32.5237   0.0071

It is worth taking a closer look at this anova table. It first lists the statistics for WrittenFrequency as a whole, including its nonlinear terms and its interactions. The column labeled df lists the number of coefficients in the model for the different predictors and their interactions. For WrittenFrequency, for instance, we have 8 coefficients, 4 for the main effect and another 4 for the interaction with AgeSubject. The nonlinearity of WrittenFrequency is accounted for with 6 coefficients (the ones listed with one or more apostrophes in the summary table for the coefficients and their p-values). For AgeSubject, we spend 5 parameters: one coefficient for AgeSubject itself, and 4 for the interaction with WrittenFrequency. The last lines of the summary evaluate the


combined nonlinearities as well as the nonlinearities and interactions considered jointly, and conclude with the F-test for the regression model as a whole.

Each coefficient costs us a degree of freedom. In the present model, we have 4557 degrees of freedom left. If we were to add another predictor requiring one coefficient, the residual degrees of freedom would become 4556. Since p-values for the t and F tests become larger as the degrees of freedom decrease, it becomes more and more difficult to observe significant effects as we add more parameters to the model. This is exactly what is needed, as we want our model to be parsimonious and to avoid overfitting the data.

Figure 6.5 shows the partial effects of the predictors in this model. As before, we add the curve representing WrittenFrequency for the young subjects to the plot for the old subjects with the option add = T.

> par(mfrow = c(2, 2), cex = 0.7)

> plot(english.olsE, WrittenFrequency = NA, ylim = c(6.2, 7.0))

> plot(english.olsE, WrittenFrequency = NA, AgeSubject = "young", + add = T, col = "darkgrey")

> plot(english.olsE, LengthInLetters = NA, ylim = c(6.2, 7.0))

> plot(english.olsE, AgeSubject = NA, ylim = c(6.2, 7.0))

> par(mfrow = c(1, 1), cex = 1)

With the same range of values on the vertical axis, the huge differences in the sizes of the partial effects of frequency, length, and age group become apparent.

You now know how to run a multiple regression with ols(), how to handle potential nonlinearities, and how to plot the partial effects of the predictors. For the present data set, the analysis is far from complete, however, as there are many more variables in the model that we have not yet considered. As many of these additional predictors are pairwise correlated, we run into the problem of collinearity.
