
Tables with measurements: Principal components analysis

From the document A practical introduction to statistics (pages 69-73)


5.1.1 Tables with measurements: Principal components analysis

Words such as goodness and sharpness can be analyzed as consisting of a stem (good, sharp) and an affix, the suffix -ness. Some affixes are used in many words; -ness is an example.

Other affixes occur only in a limited number of words, for instance, the -th in warmth and strength. The extent to which affixes are used and available for the creation of new words is referred to as the productivity of the affix. Baayen [1994] addressed the question of the extent to which the productivity of an affix is co-determined by stylistic factors. Do different kinds of texts favor the use of different kinds of affixes?

The data set affixProductivity lists, for 44 texts with varying authors and genres, a productivity index for 27 derivational affixes. The 44 texts represent four different text types: religious texts (e.g., the Book of Mormon, coded B), books written for children (e.g., Alice’s Adventures in Wonderland, coded C), literary texts (e.g., novels by Austen, Conrad, James, coded L), and other texts (including officialese from the US government accounting office), coded O. The classification codes are given in the column labeled Registers.

> affixProductivity[c("Mormon", "Austen", "Carroll", "Gao"), c(5:10, 29)]

        ian    ful      y   ness   able     ly Registers
Mormon    0 0.1887 0.5660 2.0755 0.0000 2.2642         B
Austen    0 1.2891 1.5654 1.6575 1.0129 6.2615         L
Carroll   0 0.2717 1.0870 0.2717 0.4076 6.3859         C
Gao       0 0.3306 1.9835 0.8264 0.8264 4.4628         O

The question of interest is whether there is any structure in this 44 by 27 table of numbers that sheds light on the relation between productivity and style. The tool that we will use here is PRINCIPAL COMPONENTS ANALYSIS.

Figure 5.1: Different distributions of points (highlighted in grey) in a cube.

In order to understand the main idea underlying principal components analysis, consider Figure 5.1. The upper left panel shows a cube, and the grey coloring of the cube indicates that data points are spread out everywhere in the cube. In order to describe a point in the cube, we need all three axes. The cube in the upper right describes the situation in which all the points are located on the grey plane. We could describe the location of a point on this plane using the three axes of the cube. But we can also choose new axes in this plane, in which case we can still describe each and every relevant point. This description is more economical, as it dispenses with the superfluous third dimension.

The cube in the lower left panel also involves a plane, but now there is more variation (a greater range of values) in the Y and Z directions than in the X direction. The final cube depicts the case where all the points are located on a line. To describe the location of these points, a single axis (the line through these points) is sufficient. Here, we have only one dimension left.

What principal components analysis does is try to reduce the number of dimensions required for locating the approximate positions of the data points. For the upper left cube, this is impossible. For the upper right cube, this is possible: we can get rid of one dimension. Principal components analysis achieves this by rotating the axes in such a way that you get two new axes in the diagonal plane of the original, unrotated, axes. If you imagine the points to be fixed in their location, while the cube itself can be moved around, then what happens is that the cube is rotated so that all the data points are lying on the bottom.

In the case of the lower left panel of Figure 5.1, principal components analysis will rotate the cube so that all the points are on its floor. It will then choose the dimension with most variation as its first axis (named principal component 1, henceforth PC1), in this example the axis going up and back. The second axis (PC2) will be, in this example, the original X axis. The third axis of the rotated cube (PC3) is one we don’t need anymore, as it does not account for any variability in the data.

Of course, this example simplifies what happens in real data sets. It rarely happens that all data points are exactly on a plane; there is nearly always a little scatter around the plane. And instead of three dimensions, there may be many more dimensions, and the plane around which points cluster may be a hyperplane instead of a standard two-dimensional plane. But the key idea remains the same: we rotate our hypercube, and work with a reduced set of dimensions, ordered by how much variability they account for.
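The idea of the rotated cube can be tried out on simulated data. The following is a minimal sketch, not part of the affix data: it generates points that lie close to a plane in three dimensions, with more variation along the Y and Z directions than along X, and a little scatter around the plane. All object names (pts, u, v, noise, pts.pr) and the chosen standard deviations are assumptions made for the illustration.

```r
# Simulated illustration (not the affixProductivity data):
# 200 points in three dimensions that lie close to a plane.
set.seed(1)
n <- 200
u <- rnorm(n, sd = 3)         # large variation along one direction in the plane
v <- rnorm(n, sd = 1)         # less variation along a second direction
noise <- rnorm(n, sd = 0.05)  # small scatter around the plane

# The plane is spanned by the directions (0, 1, 1) and (1, 0, 0);
# the scatter is along (0, 1, -1), perpendicular to the plane.
pts <- cbind(X = v,
             Y = u + noise,
             Z = u - noise)

pts.pr <- prcomp(pts)
round(pts.pr$sdev, 2)      # PC1 and PC2 carry the variation, PC3 almost none
round(pts.pr$rotation, 2)  # the new axes expressed in terms of X, Y and Z
```

In this simulated setting the third standard deviation should be close to zero, which is the numerical counterpart of the superfluous third dimension, and the second new axis should be close to the original X axis.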

Returning to our data, we can regard the 44 texts as 44 points in a 27-dimensional space. Do we need all these 27 dimensions, or can we reduce the number of dimensions to a (much) smaller number? And do these new dimensions tell us something about how affixes are used in different kinds of texts?

Let’s consider how we can address this question with the function prcomp(), which requires a matrix (or a data frame, but then only the numerical columns in that data frame) as input. As the last columns of our data frame affixProductivity contain labels for authors and text types, we select only columns 1:27 as input.

> affixes.pr = prcomp(affixProductivity[, 1:(ncol(affixProductivity)-3)])

We have now created a principal components object that has several components, as shown when we request a list of the names of these components with the function names().


> names(affixes.pr)

[1] "sdev" "rotation" "center" "scale" "x"

Let’s consider these components step by step. The first component, sdev, is the standard deviation corresponding to each PC.

> round(affixes.pr$sdev, 4)

 [1] 1.8598 1.1068 0.7044 0.5395 0.5320 0.4343 0.4095 0.3778
 [9] 0.3303 0.2952 0.2574 0.2270 0.2113 0.1893 0.1617 0.1503
[17] 0.1265 0.1126 0.1039 0.0870 0.0742 0.0674 0.0585 0.0429
[25] 0.0260 0.0098 0.0087

These standard deviations are also listed by summary(). Only part of the output is shown.

> summary(affixes.pr)
Importance of components:
                         PC1   PC2    PC3    PC4    PC5    PC6
Standard deviation     1.860 1.107 0.7044 0.5395 0.5320 0.4343
Proportion of Variance 0.512 0.181 0.0734 0.0431 0.0419 0.0279
Cumulative Proportion  0.512 0.693 0.7663 0.8094 0.8512 0.8791
...
                          PC23    PC24   PC25    PC26    PC27
Standard deviation     0.05853 0.04292 0.0260 0.00977 0.00872
Proportion of Variance 0.00051 0.00027 0.0001 0.00001 0.00001
Cumulative Proportion  0.99960 0.99987 1.0000 0.99999 1.00000

The proportions of variance are simply the squared standard deviations divided by the sum of the squared standard deviations. Compare:

> props = round((affixes.pr$sdev^2/sum(affixes.pr$sdev^2)), 3)

> props[1:6]

[1] 0.512 0.181 0.073 0.043 0.042 0.028
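Written as a formula, the same computation looks as follows. The symbols are introduced here only for illustration: s_i stands for the standard deviation of the i-th principal component and p for the number of components (27 in the present analysis). The cumulative proportion reported by summary() is the running sum of these proportions.

```latex
\[
\text{proportion}_i = \frac{s_i^2}{\sum_{j=1}^{p} s_j^2},
\qquad
\text{cumulative}_k = \sum_{i=1}^{k} \text{proportion}_i
\]
```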

The first principal component explains more than half of the variance; the last component has no explanatory value whatsoever. The question we now have to address is which dimensions are relevant, and which irrelevant. There is a rule of thumb stating that only those principal components are important that account for at least 5% of the variance.

Figure 5.2 plots the proportions of variance accounted for by the principal components. The ‘significant’ components are shown in black.

> barplot(props, col = as.numeric(props > 0.05),
+   xlab = "principal components",
+   ylab = "proportion of variance explained")

> abline(h = 0.05)

A very similar plot is obtained with

> plot(affixes.pr)


Another rule of thumb is to locate the cutoff point where there is a clear discontinuity as you move from right to left. In the present example, the first minor discontinuity is at the fifth PC, and the first large discontinuity at the third PC. From the summary, we learn that we can reduce 27 dimensions to 3 dimensions without losing much of the structure in the data: the first three PCs jointly account for slightly more than three quarters of the variance (76.6%). In other words, with just three dimensions, we can already get very close to the location of our 44 texts in the original 27-dimensional productivity space.

Figure 5.2: Screeplot for the principal components analysis of texts in affix productivity space (horizontal axis: principal components; vertical axis: proportion of variance explained).
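The 5% rule of thumb and the cumulative proportion can be checked without the data set, using only the standard deviations printed earlier in this section. The sketch below copies those rounded values by hand; the object names sdev, props and keep are chosen for the illustration, and results based on rounded input can differ slightly from what the full-precision values in affixes.pr would give.

```r
# Standard deviations copied from the rounded output of affixes.pr$sdev.
sdev <- c(1.8598, 1.1068, 0.7044, 0.5395, 0.5320, 0.4343, 0.4095, 0.3778,
          0.3303, 0.2952, 0.2574, 0.2270, 0.2113, 0.1893, 0.1617, 0.1503,
          0.1265, 0.1126, 0.1039, 0.0870, 0.0742, 0.0674, 0.0585, 0.0429,
          0.0260, 0.0098, 0.0087)

props <- sdev^2 / sum(sdev^2)   # proportion of variance per PC
keep <- which(props >= 0.05)    # rule of thumb: at least 5% of the variance
keep                            # indices of the PCs that pass the threshold
round(sum(props[keep]), 3)      # variance retained by those PCs jointly
```

With these values, only the first three components pass the threshold, and together they retain roughly 77% of the variance, in line with the summary output.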

The coordinates of the texts in the new three-dimensional space spanned by the new axes, the first three principal components, are available in the component of affixes.pr labeled x. This component lists the coordinates on all 27 PCs; here we only need the first three.

> affixes.pr$x[c("Mormon", "Austen", "Carroll", "Gao"), 1:3]

               PC1        PC2        PC3
Mormon  -3.7613247  1.5552693  1.4117837
Austen  -0.1745206 -1.5247233  0.3285241
Carroll  0.3363524  1.5711792 -0.2937536
Gao     -1.8250509 -0.8581186 -1.2897237
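It may help to see where such coordinates come from. With the default settings of prcomp(), the x component is the column-centered input rotated by the rotation component (listed above in the output of names()). The following sketch checks this on simulated data with hypothetical column names; it is an illustration under those assumptions, not a computation on the affix data.

```r
# Simulated data with hypothetical column names; not the affix data.
set.seed(42)
dat <- matrix(rnorm(50 * 4), ncol = 4,
              dimnames = list(NULL, c("a", "b", "c", "d")))
dat.pr <- prcomp(dat)

# Subtract the column means stored in the object, then rotate.
centered <- sweep(dat, 2, dat.pr$center)
coords <- centered %*% dat.pr$rotation

# Compare with the coordinates stored by prcomp(), ignoring dimnames.
all.equal(coords, dat.pr$x, check.attributes = FALSE)
```

Selecting the first three columns of coords then corresponds to keeping the first three principal components, as in the output for the four texts above.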

Figure 5.3 plots the texts in this 3-dimensional space by means of a scatterplot matrix displaying all three pairs of combinations of PCs. You can think of this as looking into a cube from three different sides: once from the top, once from the front, and once from the side. We can observe some clustering, especially in the panel for PC1 and PC2 (first panel of second row). The literary texts are in the center, the religious texts in the upper left, the texts for children are more to the lower right, and the officialese tends towards the bottom of the graph.

Visualization with scatterplot matrices is an important part of exploratory data analysis with principal components analysis. Figure 5.3 was made with a trellis function, splom() (for scatterplot matrices). This is a powerful function with many options that are explained in the on-line help. We first load the lattice package.

> library(lattice)

The next line of code determines how points should be represented in terms of plot symbols and color coding. If you are using the R graphics window, it will use color coding. If you are saving the plot as PostScript or jpeg, it will use plotting symbols in black and white instead.

> super.sym = trellis.par.get("superpose.symbol")

The plot itself can now be produced with the following lines of code:

> splom(data.frame(affixes.pr$x[,1:3]),
+   groups = affixProductivity$Registers,
+   panel = panel.superpose,
+   key = list(
+     title = "texts in productivity space",
+     text = list(c("Religious", "Children",
+       "Literary", "Other")),
+     points = list(pch = super.sym$pch[1:4],
+       col = super.sym$col[1:4])))

A third important component of a principal components object is the rotation matrix, which looks like this:

> dim(affixes.pr$rotation)
[1] 27 27

> affixes.pr$rotation[1:10, 1:4]
               PC1          PC2          PC3           PC4
semi  0.0018753121 -0.001359615  0.003074151 -0.0033841237
anti -0.0003107270 -0.002017771 -0.002695399  0.0005929162
ee   -0.0019930399  0.001106277 -0.017102260 -0.0033997410
ism   0.0087251807 -0.046360929  0.046553003  0.0300832267
ian  -0.0459376905 -0.008605163 -0.010271978 -0.0937441773
ful   0.0334764289  0.013734791  0.010000845 -0.0966573851
y     0.1113180755 -0.043908360 -0.276324337 -0.5719405630
ness  0.0297280626 -0.112768134  0.700249340 -0.1374734621
able  0.0084568997 -0.124364821  0.012313097  0.1119376764
ly    0.9729027985 -0.111160032 -0.020500850  0.1585457448

Figure 5.3: Scatterplot matrix for the distribution of texts in the space spanned by the three first principal components of affix productivity scores.

Figure 5.4: Biplot with principal components 1 and 2 for authors in productivity space, and the loadings of the affixes on these principal components.

This matrix lists the LOADINGS of the affixes on each principal component. These loadings are proportional to the correlation of the original productivity values of an affix with the PC. Therefore, you can get some idea of what a PC might indicate by looking at which affixes have large positive or negative loadings. For instance, the suffix -ly (as in badly) has a very high positive loading on PC1 compared to the other affixes shown above.
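The relation between loadings and correlations can be made concrete for the case in which the columns are standardized (the scaled analysis taken up at the end of this section). In that case, the correlation of a variable with a PC equals its loading multiplied by the standard deviation of that PC; in an unscaled analysis, the standard deviation of the variable itself enters as well. The sketch below checks the scaled case on simulated data; the data and all object names are assumptions made for the illustration.

```r
# Simulated data with hypothetical column names; not the affix data.
set.seed(7)
dat <- matrix(rnorm(60 * 3), ncol = 3,
              dimnames = list(NULL, c("a", "b", "c")))
dat[, "b"] <- dat[, "b"] + 0.8 * dat[, "a"]   # induce some correlation

dat.pr <- prcomp(dat, scale. = TRUE)

# Correlations between the original columns and the PC coordinates.
cors <- cor(dat, dat.pr$x)

# Loadings multiplied, column by column, by the PC standard deviations.
scaled.loadings <- sweep(dat.pr$rotation, 2, dat.pr$sdev, "*")

all.equal(cors, scaled.loadings, check.attributes = FALSE)
```

Within one PC, the loadings of a scaled analysis are therefore the correlations up to a single constant factor, which is why large positive or negative loadings point to the variables that are most strongly associated with that PC.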

What makes principal components analysis attractive is the insights offered when we plot affixes and texts together in a BIPLOT. As you can see in Figure 5.4, the variation on PC1 is dominated by the suffix -ly, which seems to have been favored especially in the Barrie novel. There is somewhat more diversification on PC2. Comparatives and superlatives are somewhat more characteristic for texts with high values on PC2, such as Kipling, Carroll and Grimm. On the other hand, -ation emerges as characteristic for the Federalist papers and also the texts by James and Austen.

The biplot shown in Figure 5.4 is obtained with the biplot() function, which in its simplest form simply takes the principal components object as input. Here, we make use of a number of options to fine-tune the plot.

> biplot(affixes.pr, scale = 0, var.axes = F,
+   col = c("darkgrey", "black"), cex = c(0.9, 1.2))

By default, biplot() rescales the principal components and the loadings. This rescaling is disabled with scale = 0. I have also disabled the displaying of arrows pointing to the affixes with var.axes = F. The parameter col controls the colors for the texts (darkgrey) and the affixes (black), and the parameter cex controls the font sizes. Note that the primary coordinate system (bottom and left axes) represents the principal components, and that the secondary coordinate system (upper and right axes) represents the corresponding loadings.

When carrying out a principal components analysis, there are two things that should be kept in mind. First, the variables should have reasonably symmetric distributions.

Second, and more importantly, it is almost always advisable to scale the columns. If the columns contain variables with very different ranges, then the columns with the greatest ranges may dominate the results. We have seen for the present data that two affixes dominate the first two principal components, -ly on PC1 and -ation on PC2. This lopsided effect of a few variables is avoided by running the prcomp() function with the option scale = TRUE. Technically, this amounts to running the analysis not on the covariance matrix, but on the correlation matrix. The upper panel of Figure 5.5 shows the biplot for a principal components analysis when the correlation matrix is used.

> affixes.pr = prcomp(affixProductivity[ ,1:27], scale = T, center = T)

> biplot(affixes.pr, var.axes = F, col = c("darkgrey", "black"),
+   cex = c(0.6, 1), xlim = c(-0.42, 0.38))

The loadings of the affixes now reveal more interesting structure. Native affixes (e.g., -ness, -less, -er) tend to occur more in the upper and right parts of the plot. Nonnative affixes (e.g., -ation, super-, anti-) tend to occur in the lower left of the biplot. The use of nonnative affixes is more typical for officialese (e.g., congress hearings, labeled Hearing) and formal texts such as the Federalist papers. Native affixes are more typical for, for instance, the stories for children by Carroll and Baum. In other words, nonnative affixes are more productive in more formal and educated registers.
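Both points made about scaling, that an unscaled analysis is dominated by the columns with the greatest ranges and that scaling amounts to an analysis of the correlation matrix, can be checked on simulated data. The sketch below uses three independent columns with very different spreads; the data and the names small, medium and large are assumptions made for the illustration. Note that the full name of the argument of prcomp() is scale. (with a trailing dot); scale = TRUE as used in the text works through R's partial argument matching.

```r
# Simulated columns with very different ranges; not the affix data.
set.seed(3)
dat <- cbind(small  = rnorm(80, sd = 0.1),
             medium = rnorm(80, sd = 1),
             large  = rnorm(80, sd = 25))

unscaled <- prcomp(dat)
scaled <- prcomp(dat, scale. = TRUE)

# Without scaling, the column with the largest range dominates PC1.
round(unscaled$rotation[, "PC1"], 3)

# With scaling, the PC variances are the eigenvalues of the correlation matrix.
all.equal(scaled$sdev^2, eigen(cor(dat))$values)
```

In the unscaled analysis the loading of the column named large on PC1 should be close to 1 in absolute value, mirroring the way -ly dominates PC1 in the unscaled affix analysis.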
