Visualizing two or more variables - A practical introduction to statistics

In Chapter 1, we created a contingency table for the counts of clauses cross-classified by the animacy of the recipient and the realization of the recipient (NP versus PP), using the data analyzed by Bresnan et al. [2007]. We recreate this contingency table,

> verbs.xtabs = xtabs( ~ AnimacyOfRec + RealizationOfRec,
+   data = verbs[verbs$AnimacyOfTheme != "animate", ])

> verbs.xtabs

            RealizationOfRec
AnimacyOfRec  NP  PP
   animate   517 300
   inanimate  33  47

and visualize it by means of a bar plot. We use the same barplot() function as above. However, as our input is not a vector but a table, we have to decide what kind of bar plot we want. Figure 2.5 illustrates the two options. The left panel shows two bars, each composed of subbars proportional to the two counts in the columns of verbs.xtabs. The right panel shows two pairs of bars, the first pair representing the counts for animacy within NP realizations, the second pair representing the same counts within the realizations of the recipient as a PP.

Figure 2.5: Bar plots for the counts of clauses cross-classified by the realization of the recipient as NP or PP and the animacy of the recipient.

> par(mfrow = c(1, 2))

> barplot(verbs.xtabs, legend.text=c("anim", "inanim"))

> barplot(verbs.xtabs, beside = TRUE, legend.text = rownames(verbs.xtabs))

> par(mfrow = c(1, 1))
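The two layouts can also be tried without loading the original data. The following is a minimal sketch that rebuilds the table by hand from the counts printed above; the object name verbs.tab is our own choice and does not occur in the original analysis.

```r
# Rebuild the 2 x 2 table from the printed counts (filled column by column).
counts <- matrix(c(517, 33, 300, 47), nrow = 2,
                 dimnames = list(AnimacyOfRec = c("animate", "inanimate"),
                                 RealizationOfRec = c("NP", "PP")))
verbs.tab <- as.table(counts)

# Column totals: the total heights of the two stacked bars.
colSums(verbs.tab)

# Share of animate and inanimate recipients within each realization.
round(prop.table(verbs.tab, margin = 2), 3)

par(mfrow = c(1, 2))
barplot(verbs.tab, legend.text = rownames(verbs.tab))                 # stacked
barplot(verbs.tab, beside = TRUE, legend.text = rownames(verbs.tab))  # side by side
par(mfrow = c(1, 1))
```

The stacked version makes the column totals easy to compare, whereas the side-by-side version makes it easier to compare the individual cell counts.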

In Chapter 1 we had a first look at the data of Bresnan and colleagues on the dative alternation in English. Let’s consider their data once more, but now we make use of the full data set (dative), and cross-tabulate the realization of the recipient by its animacy and accessibility.

> verbs.xtabs =
+   xtabs( ~ AnimacyOfRec + AccessOfRec + RealizationOfRecipient,
+   data = dative)

> verbs.xtabs

, , RealizationOfRecipient = NP

            AccessOfRec
AnimacyOfRec accessible given  new
   animate          290  1931   78
   inanimate         11    99    5

, , RealizationOfRecipient = PP

            AccessOfRec
AnimacyOfRec accessible given  new
   animate          259   239  227
   inanimate         55    33   36

Such a contingency table might be visualized with a barplot, but 12 bars or smaller numbers of stacked bars quickly become rather complex to interpret. An attractive alternative is to make use of a mosaic plot, as shown in the left panel of Figure 2.6.

> mosaicplot(verbs.xtabs, main = "dative")

The areas of the twelve rectangles in the plot are proportional to the counts for the twelve cells of the contingency table. When there is no structure in the data, as in the mosaic plot in the right panel of Figure 2.6, each rectangle is approximately equally large. The many asymmetries in the left panel show, for instance, that in the actual data set given recipients are more likely to be realized as NP than new or accessible recipients, both for animate and inanimate recipients, irrespective of the overall preponderance of given recipients.
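The comparison between structured and unstructured counts can be sketched as follows. The three-way table is rebuilt from the counts printed above. The text does not say how the random counts for the right panel were generated, so drawing them from a multinomial distribution with equal cell probabilities is an assumption on our part, as are the object names dative.tab and uniform.tab.

```r
# Rebuild the 2 x 3 x 2 table from the printed counts (first index varies fastest).
dative.tab <- as.table(array(
  c(290, 11, 1931, 99, 78, 5,       # RealizationOfRecipient = NP
    259, 55, 239, 33, 227, 36),     # RealizationOfRecipient = PP
  dim = c(2, 3, 2),
  dimnames = list(AnimacyOfRec = c("animate", "inanimate"),
                  AccessOfRec = c("accessible", "given", "new"),
                  RealizationOfRecipient = c("NP", "PP"))))

# Assumption: "no structure" is simulated by spreading the same total
# number of clauses evenly (up to chance) over the twelve cells.
set.seed(1)
uniform.tab <- dative.tab
uniform.tab[] <- as.vector(rmultinom(1, size = sum(dative.tab),
                                     prob = rep(1 / 12, 12)))

# Proportion of NP realizations within each animacy-by-accessibility cell.
np.share <- prop.table(dative.tab, margin = c(1, 2))[ , , "NP"]
round(np.share, 2)

par(mfrow = c(1, 2))
mosaicplot(dative.tab, main = "dative")
mosaicplot(uniform.tab, main = "uniform")
par(mfrow = c(1, 1))
```

The table of proportions gives a numerical counterpart to the asymmetries in the mosaic plot: for both animate and inanimate recipients, the share of NP realizations is highest for given recipients.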

The relation between two numerical variables with many different values is often brought to light by means of a SCATTERPLOT. Figure 2.7 displays two versions of the same scatterplot for variables in the ratings data set. The upper panel was produced in two steps. The first step consisted of plotting the data points.

> plot(ratings$Frequency, ratings$FamilySize)

All we have to do is specify the vectors of X and Y values as arguments to plot(). By default, the names of the two input vectors are used as labels for the axes. You can see that words with a very high frequency tend to have a very high family size. In other words, the two variables are positively CORRELATED. At the same time, it is also clear that there is a lot of noise, and that the scatter (or variance) in family sizes is greater for lower frequencies. Such an uneven pattern is referred to as HETEROSKEDASTIC, and is endemic in lexical statistics.

The second step consisted of adding the grey line to highlight the main trend.

> lines(lowess(ratings$Frequency, ratings$FamilySize), col="darkgrey")

This line shows that you have to proceed almost 2 log frequency units along the horizontal axis before you begin to see an increase in family size. For larger frequencies, the family size increases, slowly at first, but then faster and almost like a straight line. A curve like this is often referred to as a SCATTERPLOT SMOOTHER, as it smooths away all the turbulence around the main trend in the data. The smoothing function that we used here is lowess(), which takes as input the X and Y coordinates of the data points and produces as output the X and Y coordinates of the smooth line. To plot this line, we fed its coordinates into lines().

Figure 2.6: A mosaic plot for observed counts of clauses cross-classified by the animacy of the recipient, the accessibility of the recipient, and the realization of the recipient (left panel), and for random counts (right).

Figure 2.7: Scatterplots for Family Size as a function of Frequency for 81 English nouns.

The basic idea underlying smoothers is to use the observations in a given span (or bin) of values of X to calculate the average increase in Y. You then move this span from left to right along the horizontal axis, each time calculating the new increase in Y. There are many ways in which you can estimate these increases, and many ways in which you can combine all these estimated increases into a line. Recall that Figure 2.2 illustrated that the smoothness of a histogram depends on the width of its bins (bars). In a similar way, the smoothness of the line produced by lowess() is determined by the bin width used. As lowess() makes use of a sensible rule of thumb for calculating a reasonable bin width, we need not do anything ourselves. However, if you think that lowess() engages in too much smoothing (the line hides variation you suspect to be there) or too little smoothing (the line has too many idiosyncratic bumps) for your data, you can change the bin width manually, as documented in the on-line help. Venables & Ripley [2000:228–232] provide detailed information on various important smoothers that are available in R.
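To get a feel for the effect of the bin width, one can compare smoothers with different spans on the same data. In lowess(), the span is set with the argument f, the proportion of data points that influence the smooth at each value (2/3 by default). The sketch below uses simulated data as a stand-in for the ratings data; the numbers are made up for illustration only, and the values 0.2 and 1 for f are arbitrary choices.

```r
# Simulated stand-in data (not the ratings data): flat up to x = 2,
# then a roughly linear increase, plus noise.
set.seed(42)
x <- runif(81, min = 0, max = 8)
y <- pmax(0, 0.5 * (x - 2)) + rnorm(81, sd = 0.4)

default.fit <- lowess(x, y)           # default span, f = 2/3
narrow.fit  <- lowess(x, y, f = 0.2)  # narrow span: follows local bumps
wide.fit    <- lowess(x, y, f = 1)    # widest span: smoothest line

# lowess() returns a list with the sorted x values and the smoothed y values.
str(default.fit)

plot(x, y)
lines(default.fit, col = "darkgrey")
lines(narrow.fit, lty = 2)
lines(wide.fit, lty = 3)
```

A narrow span tends to produce a wigglier line, a wide span a flatter one; which of the two is more appropriate depends on how much local variation you believe to be real.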

The lower panel of Figure 2.7 shows a different version of the same scatterplot. Data points are now labeled by the words they represent. It is now easy to see that horse and dog are the words with the highest frequency and family size in the sample. This scatterplot was also made in two steps. The first step consisted of setting up the axes, now with our own labels for the axes, specified with xlab and ylab. However, we instructed plot() not to add the data points by setting the plot type to "none" with type = "n".

> plot(ratings$Frequency, ratings$FamilySize, type = "n",
+   xlab = "Frequency", ylab = "Family Size")

The second step consisted of adding the words to the plot with text(). Like plot(), it requires input vectors for the X and Y coordinates. Its third argument should be a vector with the strings that are to be placed in the plot. In the data frame ratings, the column labeled Word is a factor, so we first convert it into a vector of strings with as.character() before handing it over to text(). Finally, we set the font size to 0.7 of its default with cex = 0.7.

> text(ratings$Frequency, ratings$FamilySize,
+   as.character(ratings$Word), cex = 0.7)
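The same two steps can be tried on a small data frame of our own. In the sketch below, the data frame toy and all its values are invented for illustration; only the column names mirror those of ratings.

```r
# Hypothetical miniature data frame; the values are made up.
toy <- data.frame(Word = factor(c("horse", "dog", "ant", "pear")),
                  Frequency = c(7.8, 7.7, 5.3, 4.7),
                  FamilySize = c(3.0, 3.3, 1.1, 0.7))

# Word is a factor; as.character() turns it into a vector of strings.
is.factor(toy$Word)
labels <- as.character(toy$Word)
is.character(labels)

# Step 1: set up the axes without drawing any points.
plot(toy$Frequency, toy$FamilySize, type = "n",
     xlab = "Frequency", ylab = "Family Size")

# Step 2: place each word at its (X, Y) position, at 0.7 of the default size.
text(toy$Frequency, toy$FamilySize, labels, cex = 0.7)
```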

Thus far, we have considered scatterplots involving two variables only. Many data sets have more than two variables, however, and although we might consider inspecting all possible pairwise combinations with a series of scatterplots, it is often more convenient and insightful to make a single multipanel figure that shows all pairwise scatterplots simultaneously. Figure 2.8 shows such a SCATTERPLOT MATRIX for all two by two combinations of the five numerical variables in ratings. The panels on the main diagonal provide the labels for the axes of the panels. For instance, all the panels on the top row have Frequency on the vertical axis, and all the panels of the first column have Frequency on the horizontal axis. Each pair of variables is plotted twice, once with a given variable on the horizontal axis, and once with the same variable on the vertical axis. Such pairs of plots have coordinates that are mirrored in the main diagonal. Thus, panel (1,2) is the mirror image of panel (2,1). Similarly, panel (5,1) in the lower left has its opposite in the upper right corner at location (1,5). The reason for having mirrored panels is that sometimes a pattern strikes the eye in one orientation, but not in the other.

Figure 2.8: A pairs plot for the five numerical variables in the ratings data frame.

Figure 2.8 was made with the pairs() plot function, which requires a data frame with numerical columns as input.

> pairs(ratings[ , -c(1, 6:8, 10:14)])

The condition on the columns has a minus sign, indicating that all columns specified to its right should be excluded instead of included. The columns that we exclude here are all factors. Factors cannot be visualized in scatterplots, hence we take them out before applying pairs(). Figure 2.8 reveals that a fair number of pairs of predictors enter into correlations, a phenomenon that is known as MULTICOLLINEARITY. Strong multicollinearity among a set of predictor variables may make it impossible to ascertain which predictor variables best explain the dependent variable. We will return to this issue in more detail when discussing multiple regression.
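Excluding columns by position works, but it requires counting columns by hand. As an alternative sketch, the numerical columns can be selected by type, and the correlations that the scatterplot matrix suggests can be inspected with cor(). The data frame toy below is simulated for illustration and is not the ratings data; the column Class and all values are hypothetical.

```r
# Simulated data frame with two factors and three numerical columns.
set.seed(1)
n <- 50
toy <- data.frame(
  Word = factor(paste0("w", seq_len(n))),
  Frequency = rnorm(n, mean = 5),
  Class = factor(sample(c("animal", "plant"), n, replace = TRUE))
)
toy$FamilySize <- 0.5 * toy$Frequency + rnorm(n, sd = 0.5)  # tied to Frequency
toy$Length <- rnorm(n, mean = 6)                            # unrelated noise

# Two ways of dropping the factors: by position, or by column type.
by.position <- toy[ , -c(1, 3)]
by.type <- toy[ , sapply(toy, is.numeric)]
names(by.position)
names(by.type)

# Scatterplot matrix and the matching matrix of pairwise correlations.
pairs(by.type)
round(cor(by.type), 2)
```

Selecting by type keeps working when columns are added or reordered, whereas negative positions have to be updated by hand.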
