6.3 Generalized linear models
6.3.2 Ordinal logistic regression
Logistic regression is appropriate for dichotomous response variables. ORDINAL REGRESSION is appropriate for dependent variables that are factors with ordered levels. For a factor such as gender in German, the factor levels 'masculine', 'feminine' and 'neuter' are not intrinsically ordered. In contrast, vowel length in Estonian has the ordered levels 'short', 'long' and 'extra long'. Regression models for such ORDERED FACTORS are available. The technique that we introduce here, ORDINAL LOGISTIC REGRESSION, is a generalization of the logistic regression technique.
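Before turning to the data, the structure of such a model can be sketched in a few lines of R. The numbers below are made up for illustration and are not fitted to any data set: each cumulative probability P(Y >= j) receives its own intercept, while a single slope is shared across all cutoff points.

```r
# Sketch of a cumulative-logit model with hypothetical numbers:
# an ordered response with four levels has three cutoff points.
alphas = c(4, 2, -1)    # one intercept per cutoff (hypothetical values)
beta   = 0.5            # a single slope shared by all cutoffs
x      = 1              # a hypothetical predictor value

# cumulative probabilities P(Y >= level 2), P(Y >= level 3), P(Y >= level 4)
p.cum = plogis(alphas + beta * x)

# probabilities of the individual levels follow by differencing
p.levels = c(1 - p.cum[1], -diff(p.cum), p.cum[length(p.cum)])
sum(p.levels)   # the four level probabilities sum to 1
```

Note that the intercepts must decrease across successive cutoffs for the level probabilities to be non-negative.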
DRAFT
[Figure 6.11 appears here: eight panels plotting Pr(regular) against WrittenFrequency, FamilySize, NcountStem, InflectionalEntropy, Auxiliary, Valency, NVratio, and WrittenSpokenRatio.]

Figure 6.11: Partial effects of the predictors for the log odds ratio of a Dutch simplex verb from the native (Germanic) stratum being regular.
As an example, we consider the data set studied by Tabak et al. [2005]. The model predicting regularity for Dutch verbs developed in the preceding section showed that the likelihood of regularity decreased with increasing valency. An increase in valency (here, the number of different subcategorization frames in which a verb can be used) is closely related to an increase in the verb’s number of meanings.
Irregular verbs are generally described as the older verbs of the language. Hence, it could be that they have more meanings and a greater valency because they have had a longer period of time in which they could spawn new meanings and uses. Irregular verbs also tend to be more frequent than regular verbs, and it is reasonable to assume that this high frequency protects irregular verbs through time against regularization.
In order to test these lines of reasoning, we need some measure of the age of a verb.
A rough indication of this age is the kind of cognates a Dutch verb has in other Indo-European languages. On the basis of an etymological dictionary, Tabak et al. [2005]
established whether a verb appears only in Dutch, in Dutch and German, in Dutch, German and other West-Germanic languages, in any Germanic language, or in Indo-European. This classification according to etymological age is available in the column labeled EtymAge in the data set etymology.
> colnames(etymology)
[1] "Verb" "WrittenFrequency" "NcountStem"
[4] "MeanBigramFrequency" "InflectionalEntropy" "Auxiliary"
[7] "Regularity" "LengthInLetters" "Denominative"
[10] "FamilySize" "EtymAge" "Valency"
[13] "NVratio" "WrittenSpokenRatio"
When a data frame is read into R, the levels of any factor are assumed to be unordered by default. In order to make EtymAge into an ORDERED FACTOR with the levels in the appropriate order, we use the function ordered():
> etymology$EtymAge = ordered(etymology$EtymAge, levels = c("Dutch",
+   "DutchGerman", "WestGermanic", "Germanic", "IndoEuropean"))

When we inspect the factor,
> etymology$EtymAge
...
[276] WestGermanic Germanic     IndoEuropean Germanic     Germanic
[281] Germanic     WestGermanic Germanic     Germanic     DutchGerman
Levels: Dutch < DutchGerman < WestGermanic < Germanic < IndoEuropean

we see that the ordering relation between its levels is now made explicit. We leave it as an exercise to the reader to verify that etymological age is a predictor for whether a verb is regular or irregular over and above the predictors studied in the preceding section. Here, we study whether etymological age itself can be predicted from frequency, regularity, family size, etc. We create a data distribution object, set the appropriate variable to point to this object,
> etymology.dd = datadist(etymology)
> options(datadist = "etymology.dd")
and fit a logistic regression model to the data with lrm().
> etymology.lrm = lrm(EtymAge ~ WrittenFrequency + NcountStem +
+   MeanBigramFrequency + InflectionalEntropy + Auxiliary +
+   Regularity + LengthInLetters + Denominative + FamilySize +
+   Valency + NVratio + WrittenSpokenRatio,
+   data = etymology, x = T, y = T)
> anova(etymology.lrm)
Wald Statistics Response: EtymAge
Factor Chi-Square d.f. P
WrittenFrequency 0.45 1 0.5038
NcountStem 3.89 1 0.0487
MeanBigramFrequency 1.89 1 0.1687
InflectionalEntropy 0.94 1 0.3313
Auxiliary 0.38 2 0.8281
Regularity 14.86 1 0.0001
LengthInLetters 0.30 1 0.5827
Denominative 8.84 1 0.0029
FamilySize 0.42 1 0.5191
Valency 0.26 1 0.6080
NVratio 0.07 1 0.7894
WrittenSpokenRatio 0.18 1 0.6674
TOTAL 35.83 13 0.0006
The anova table suggests three significant predictors: Regularity, as expected, the neighborhood density of the stem (NcountStem), and whether the verb is denominative (Denominative). We simplify the model, and inspect the summary.
> etymology.lrmA = lrm(EtymAge ~ NcountStem + Regularity + Denominative,
+   data = etymology, x = T, y = T)
> etymology.lrmA
Frequencies of Responses
Dutch DutchGerman WestGermanic Germanic IndoEuropean
8 28 43 173 33
Obs Max Deriv Model L.R. d.f. P C
285 2e-08 30.92 3 0 0.661
Dxy Gamma Tau-a R2 Brier
0.322 0.329 0.189 0.114 0.026
                       Coef    S.E.    Wald Z P
y>=DutchGerman      4.96248 0.59257  8.37 0.0000
y>=WestGermanic     3.30193 0.50042  6.60 0.0000
y>=Germanic         2.26171 0.47939  4.72 0.0000
y>=IndoEuropean    -0.99827 0.45704 -2.18 0.0289
NcountStem          0.07038 0.02014  3.49 0.0005
Regularity=regular -1.03409 0.25123 -4.12 0.0000
Denominative=N     -1.48182 0.43657 -3.39 0.0007
The summary lists the frequencies with which the different levels of our ordered factor for etymological age are attested, followed by the usual measures for gauging the predictivity of the model. The values of C, Dxy, and R2 are all low, so we have to be careful when drawing conclusions.
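As an aside, C and Dxy are not independent measures: Somers' Dxy is related to the concordance index by Dxy = 2(C - 0.5), which can be checked against the values printed in the summary (C = 0.661, Dxy = 0.322):

```r
# Somers' Dxy follows directly from the concordance index C printed by lrm()
C = 0.661
Dxy = 2 * (C - 0.5)
round(Dxy, 3)   # 0.322, as in the summary above
```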
The first four lines of the table of coefficients are new, and specific to ordinal logistic regression. These four lines represent four intercepts. The first intercept is for a normal binary logistic model that contrasts data points with Dutch as etymological age with all other data points, for which the etymological age (represented by y in the summary) is greater than or equal to DutchGerman. For this standard binary model, the probability of greater age increases with neighborhood density, and it is smaller for regular verbs and for denominative verbs. The second intercept represents a second binary split, now between Dutch and DutchGerman on the one hand, and WestGermanic, Germanic and IndoEuropean on the other. Again, the coefficients for the three predictors show how the probability of having a greater etymological age has to be adjusted for neighborhood density, regularity, and whether the verb is denominative. The remaining two intercepts work in the same way, each shifting the criterion for 'young' versus 'old' further towards the greatest age level.
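To see how the pieces of the coefficient table combine, consider a worked example with illustrative predictor values (a regular verb with ten neighbors and Denominative = N; these values are chosen for the example, not taken from the data set). Summing the relevant intercept and the adjustments gives the probability of an etymological age of at least Germanic:

```r
# Combining the printed coefficients by hand for a hypothetical verb:
# NcountStem = 10, Regularity = regular, Denominative = N
lp = 2.26171 +       # intercept for y >= Germanic
  0.07038 * 10 +     # adjustment for NcountStem
  -1.03409 +         # adjustment for Regularity = regular
  -1.48182           # adjustment for Denominative = N
plogis(lp)           # P(EtymAge >= Germanic), approximately 0.61
```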
There are two things to note here. First, the four intercepts are steadily decreasing. This simply reflects the distribution of successes (old etymological age) and failures (young etymological age) as we shift our cutoff point for old versus young further towards IndoEuropean. To see this, we first count the data points classified as 'old' versus 'young'.
> tab = xtabs(˜etymology$EtymAge)
> tab
etymology$EtymAge
Dutch DutchGerman WestGermanic Germanic IndoEuropean
8 28 43 173 33
> sum(tab)
[1] 285
For the cutoff point between Dutch and DutchGerman, we have 285 - 8 = 277 old observations (successes) and 8 young observations (failures), and hence a log odds ratio of 3.54. The following code loops through the different cutoff points and lists the counts of old and young observations, and the corresponding log odds ratio.
> for (i in 0:3) {
+   cat(sum(tab[(2 + i) : 5]), sum(tab[1 : (1 + i)]),
+     log(sum(tab[(2 + i) : 5]) / sum(tab[1 : (i + 1)])), "\n")
+ }
277 8 3.544576
249 36 1.933934
206 79 0.9584283
33 252 -2.032922
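The same four logits can also be obtained without an explicit loop; the following vectorized sketch uses cumsum() on the (named) vector of level counts tabulated above:

```r
# counts per level of EtymAge, as tabulated above
tab = c(Dutch = 8, DutchGerman = 28, WestGermanic = 43,
        Germanic = 173, IndoEuropean = 33)
old   = rev(cumsum(rev(tab)))[-1]   # 'old' counts at or above each cutoff
young = cumsum(tab)[-length(tab)]   # 'young' counts below each cutoff
log(old / young)   # 3.545  1.934  0.958  -2.033
```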
We see the same downwards progression in the logits as in the table of intercepts. The numbers are not the same, as our logits do not take into account any of the other predictors in the model. In other words, the progression of intercepts is by itself not of interest, just as the intercept in least squares regression or standard logistic regression is generally not of interest by itself.
The second thing to note is that lrm() assumes that the effects of our predictors, NcountStem, Regularity and Denominative, are the same, irrespective of the cutoff point for etymological age. In other words, these predictors are taken to have the same proportional effect across all levels of our ordered factor. Hence, this kind of model is referred to as a PROPORTIONAL ODDS MODEL.

The assumption of proportionality should be checked. One way of doing so is to plot, for each cutoff point, the mean of the partial binary residuals together with their 95% confidence intervals. If the proportionality assumption holds, these means should be close to zero. As can be seen in the first three panels of Figure 6.12, the proportionality assumption is not violated for our data. The means are very close to zero in all cases. The last panel takes a closer look at our continuous predictor, NcountStem. For each successive factor level, two points are plotted. The circles connected by the solid line show the means as actually observed, the dashed line shows what these means should be if the proportionality assumption were satisfied perfectly. There is a slight discrepancy for the first level, Dutch, for which we also have the lowest number of observations. But since the two lines are otherwise quite similar, we conclude that a proportional odds model is justified. The diagnostic plots shown in Figure 6.12 were produced with two functions from the Design package, resid() and plot.xmean.ordinaly(), as follows.
> par(mfrow = c(2, 2))
> resid(etymology.lrmA, ’score.binary’, pl = T)
> plot.xmean.ordinaly(EtymAge ˜ NcountStem, data = etymology)
> par(mfrow = c(1, 1))
Bootstrap validation calls attention to changes in slope and intercept,
> validate(etymology.lrmA, bw=T, B=200)
  1   2   3
  2   7 191

          index.orig  training       test    optimism index.corrected
Dxy        0.3222059 0.3314785 0.31487666  0.01660182      0.30560403
R2         0.1138586 0.1227111 0.10597692  0.01673422      0.09712436
Intercept  0.0000000 0.0000000 0.04821578 -0.04821578      0.04821578
Slope      1.0000000 1.0000000 0.95519326  0.04480674      0.95519326
Emax       0.0000000 0.0000000 0.01871305  0.01871305      0.01871305
D          0.1049774 0.1147009 0.09714786  0.01755301      0.08742437

but the optimism is fairly small, and pentrace() recommends a penalty of zero,
[Figure 6.12 appears here: four diagnostic panels. The first three plot, for each cutoff point (DutchGerman, WestGermanic, Germanic, IndoEuropean), the binary residuals for NcountStem, Regularity=regular, and Denominative=N; the lower right panel plots the mean of NcountStem against the levels of EtymAge (n = 285).]

Figure 6.12: Diagnostics for the proportionality assumption for the ordinal logistic regression model for etymological age. The lower right panel compares observed (solid) and expected (given proportionality, dashed) mean neighborhood density for each level of etymological age; the remaining panels plot for each predictor the distribution of residuals for each cutoff point.
> pentrace(etymology.lrmA, seq(0, 0.8, by = 0.05))

Best penalty:

 penalty df
       0  3
so we accept etymology.lrmA as our final model, and plot the partial effects (Figure 6.13).
> plot(etymology.lrmA, fun = plogis, ylim = c(0.8, 1))
We conclude that the neighborhood density of the stem is a predictor for the age of a verb.
Words with a higher neighborhood density are phonologically more regular, and easier to articulate. Apparently, phonological regularity and ease of articulation contribute to a verb’s continued existence through time, in addition to morphological regularity. It is remarkable that frequency is not predictive at all.