Operations on data frames

In the document A practical introduction to statistics (pages 11-15)

1.4.1 Sorting a data frame by one or more columns

In the previous section, we created the data frame verbs.rs, the rows of which appeared in the arbitrary order specified by our vector of row numbers rs. It is often useful to sort the entries in a data frame by the values in one of the columns, for instance, by the realization of the recipient,

> verbs.rs[order(verbs.rs$RealizationOfRec), ]

RealizationOfRec Verb AnimacyOfRec AnimacyOfTheme LengthOfTheme

390 NP lend animate animate 0.6931472

638 PP pay animate inanimate 0.6931472

799 PP sell animate inanimate 1.3862944

569 PP sell animate inanimate 1.6094379

567 PP send inanimate inanimate 1.3862944

or by verb and then by the length of the theme,

> verbs.rs[order(verbs.rs$Verb, verbs.rs$LengthOfTheme), ]

RealizationOfRec Verb AnimacyOfRec AnimacyOfTheme LengthOfTheme

390 NP lend animate animate 0.6931472

638 PP pay animate inanimate 0.6931472

799 PP sell animate inanimate 1.3862944

569 PP sell animate inanimate 1.6094379

567 PP send inanimate inanimate 1.3862944

The crucial work is done by order(). Its first argument is the primary column of the data frame by which the rows should be sorted (alphabetically or numerically, depending on the column values). The second argument is the column that provides the sort key for those rows that are tied on the first column. Additional columns for sorting can be supplied as third or fourth argument, and so on.

Note that the order() function occupies the slot in the subscript of the data frame that specifies the conditions on the rows. What order() actually does is supply a vector of row numbers, with the row number of the row that is to be listed first as first element, the row number that is to be listed second as second element, and so on. For instance, when we sort the rows by Verb, order() returns a vector of row numbers

> order(verbs.rs$Verb)

[1] 10 7 8 3 1 9 2 4 6 5

that will move the last row (for cost) to the first row, the seventh row (for give) to the second row, and so on.
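The behaviour of order() can be checked on a small example. The following is a minimal sketch that assumes a made-up data frame called toy with columns Verb and Len; neither the name nor the values come from the verbs data.

```r
# Toy data frame; the values are invented for illustration only.
toy <- data.frame(
  Verb = c("pay", "sell", "lend", "sell"),
  Len  = c(2, 5, 2, 4),
  stringsAsFactors = FALSE
)

# order() returns row positions, not sorted values.
idx <- order(toy$Verb)
idx

# Ties on Verb (the two "sell" rows) are broken by Len.
idx2 <- order(toy$Verb, toy$Len)
toy[idx2, ]

# decreasing = TRUE reverses the sort direction.
toy[order(toy$Len, decreasing = TRUE), ]
```

Because order() only produces positions, the same index vector can be reused to reorder any other object of the same length.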

The elements of a vector can be sorted in the same way. When sorting the vector

> v = c("pay", "sell", "lend", "sell", "send",
+   "sell", "give", "give", "pay", "cost")

(note that R changes the prompt from > to + when a command is not finished by the end of the line, so don’t type the + symbol when defining this vector) we subscript it with order() applied to itself:

> v[order(v)]

[1] "cost" "give" "give" "lend" "pay"

[6] "pay" "sell" "sell" "sell" "send"

However, a more straightforward function for sorting the elements of a vector is sort():

> sort(v)

It is important to keep in mind that in all preceding examples we never assigned the output of the reordering operations, so v is still unsorted. In order to obtain sorted versions, simply assign the output to the original data object.

> v = sort(v)
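The point about assignment can be verified directly. A small sketch, starting again from the unsorted vector v defined above; v.desc is a name introduced here only for illustration.

```r
v <- c("pay", "sell", "lend", "sell", "send",
       "sell", "give", "give", "pay", "cost")

# sort(v) and v[order(v)] give the same result.
identical(sort(v), v[order(v)])

# Neither call modifies v; its first element is still "pay".
v[1]

# Reverse alphabetical order, stored under a new name
# so that the original vector is kept.
v.desc <- sort(v, decreasing = TRUE)
v.desc[1]
```

Assigning to a new name rather than overwriting v is a cautious choice when the original order may still be needed.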

1.4.2 Changing information in a data frame

Information in a data frame can be changed. For instance, we could manipulate the data in verbs.rs and change the realization of the recipient for the verb to pay (originally on line 638 in verbs) from PP into NP. (In what follows, I assume that this command is not actually carried out.)

> verbs.rs["638", ]$RealizationOfRec = "NP"

If many such changes have to be made, for instance in order to correct coding errors, then it may be more convenient to do this in a spreadsheet, save the result as a .csv file, and load the corrected data into R with read.csv().

Changes that are easily carried out in R are changes that affect whole columns or subparts of the table. For instance, in order to reconstruct the length of the theme (in words) from the logarithmically transformed values listed in verbs.rs, all we have to do is apply the exp() function to the appropriate column. All values in the column will be changed accordingly.

> verbs.rs$LengthOfTheme

[1] 0.6931472 1.3862944 0.6931472 1.6094379 1.3862944

> exp(verbs.rs$LengthOfTheme)
[1] 2 4 2 5 4
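The relation between the two scales can be illustrated without the data set. The sketch below starts from the five word counts printed above; the data frame toy and the column name LengthInWords are hypothetical names used only here.

```r
# Lengths in words, taken from the exp() output shown above.
len.words <- c(2, 4, 2, 5, 4)

# log() gives the transformed values; exp() undoes the transformation.
len.log <- log(len.words)
len.log
all.equal(exp(len.log), len.words)

# Calling exp() alone only prints the result. To keep the
# back-transformed values, assign them to a column.
toy <- data.frame(LengthOfTheme = len.log)
toy$LengthInWords <- exp(toy$LengthOfTheme)
toy
```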

We can also add new columns to a data frame. For instance, we might consider adding a column with the length of the verb (in letters). There is a function, nchar(), that conveniently reports the number of letters in its input, provided that its input is a character string or a vector of character strings. We illustrate nchar() for the longest word (without intervening spaces or hyphens) of English [Sproat, 1992] and the shortest word of English.

> nchar(c("antidisestablishmentarianism", "a"))
[1] 28  1

When applying nchar() to a column in a data frame, we have to keep in mind that non-numerical columns typically are not vectors of strings, but factors. So we must first convert the factor into a character vector with as.character() before applying nchar(). We add the result to verbs.rs with the $ operator.

> verbs.rs$Length = nchar(as.character(verbs.rs$Verb))

We display only the first four rows of the result, and only the verb and its orthographic length.

> verbs.rs[1:4, c("Verb", "Length")]
    Verb Length
638  pay      3
799 sell      4
390 lend      4
569 sell      4
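The difference between a factor and a character vector can be made visible with a few toy values. This is a sketch with invented data; note also that in R 4.0.0 and later, data.frame() and read.csv() no longer turn strings into factors by default, so whether the conversion is needed depends on how the data were loaded.

```r
# Toy verbs, stored once as a factor and once as character strings.
verb.factor <- factor(c("pay", "sell", "lend", "sell"))
verb.chars  <- as.character(verb.factor)

# A factor stores integer codes plus a set of levels.
as.integer(verb.factor)
levels(verb.factor)

# nchar() on the converted vector counts letters per element.
nchar(verb.chars)

# is.factor() tells you whether the conversion is needed at all.
is.factor(verb.factor)
is.factor(verb.chars)
```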

1.4.3 Extracting contingency tables from data frames

How many observations are characterized by animate recipients realized as an NP? Questions like this are easily addressed with the help of CONTINGENCY TABLES, tables that cross-tabulate counts for combinations of factor levels. Since the factors RealizationOfRec and AnimacyOfRec each have two levels, as shown by the function levels(),

> levels(verbs$RealizationOfRec)
[1] "NP" "PP"
> levels(verbs$AnimacyOfRec)
[1] "animate"   "inanimate"

a cross-tabulation of RealizationOfRec and AnimacyOfRec with xtabs() results in a table with four cells.

> xtabs(~ RealizationOfRec + AnimacyOfRec, data = verbs)
                AnimacyOfRec
RealizationOfRec animate inanimate
              NP     521        34
              PP     301        47

The first argument of xtabs() is a FORMULA. Formulas have the following general structure, with the tilde (~) denoting 'depends on' or 'is a function of'.

dependent variable ~ predictor 1 + predictor 2 + ...

A DEPENDENT VARIABLE is a variable the value of which we try to predict. The other variables are often referred to as INDEPENDENT VARIABLES. This terminology is somewhat misleading, however, because sets of predictors are often characterized by all kinds of interdependencies. A more appropriate term is simply PREDICTOR. In the study of Bresnan et al. [2007] that we are considering here, the dependent variable is the realization of the recipient. All other variables are predictor variables.

When we construct a contingency table, however, there is no dependent variable. A contingency table allows us to see how counts are distributed over conditions, without making any claim as to whether one variable might be explainable in terms of other variables. Therefore, the formula for xtabs() has nothing to the left of the tilde operator. We only have predictors, which we list to the right of the tilde, separated by plus signs.
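A one-sided formula can be tried out on a handful of rows. The sketch below uses an invented data frame toy with hypothetical columns Realization and Animacy; the counts have nothing to do with the verbs data.

```r
# Toy data; the values are invented and are not the verbs data.
toy <- data.frame(
  Realization = c("NP", "NP", "PP", "NP", "PP", "PP"),
  Animacy     = c("animate", "animate", "animate",
                  "inanimate", "inanimate", "animate")
)

# One-sided formula: nothing to the left of the tilde, so cells hold counts.
counts <- xtabs(~ Realization + Animacy, data = toy)
counts

# table() cross-tabulates the same two columns.
counts2 <- table(toy$Realization, toy$Animacy)
all(counts == counts2)

# Every row of the data frame lands in exactly one cell.
sum(counts) == nrow(toy)
```

The formula interface has the practical advantage that the dimension names of the result are taken from the column names.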

More than two factors can be cross-tabulated:

> verbs.xtabs =
+   xtabs(~ AnimacyOfRec + AnimacyOfTheme + RealizationOfRec,
+   data = verbs)
> verbs.xtabs
, , RealizationOfRec = NP

            AnimacyOfTheme
AnimacyOfRec animate inanimate
   animate         4       517
   inanimate       1        33

, , RealizationOfRec = PP

            AnimacyOfTheme
AnimacyOfRec animate inanimate
   animate         1       300
   inanimate       0        47

As three factors enter into this cross-classification, the result is a three-dimensional contingency table that is displayed in the form of two 2 by 2 contingency tables. It is clear from this table that animate themes are extremely rare. It therefore makes sense to restrict our attention to the clauses with inanimate themes. We implement this restriction by conditioning on the rows of verbs.

> verbs.xtabs = xtabs(~ AnimacyOfRec + RealizationOfRec,
+   data = verbs, subset = AnimacyOfTheme != "animate")
> verbs.xtabs
            RealizationOfRec
AnimacyOfRec  NP  PP
   animate   517 300
   inanimate  33  47

It seems that recipients are somewhat more likely to be realized as an NP when animate and as a PP when inanimate.
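The effect of the subset argument can be checked on toy data. In this sketch, the data frame toy and its columns Rec, Real and Theme are invented stand-ins for the corresponding columns of verbs.

```r
# Toy data; the values are invented for illustration.
toy <- data.frame(
  Rec   = c("animate", "animate", "inanimate", "inanimate", "animate"),
  Real  = c("NP", "PP", "NP", "PP", "NP"),
  Theme = c("inanimate", "inanimate", "inanimate", "inanimate", "animate")
)

# All rows versus only the rows with inanimate themes.
all.tab <- xtabs(~ Rec + Real, data = toy)
sub.tab <- xtabs(~ Rec + Real, data = toy, subset = Theme != "animate")
all.tab
sub.tab

# Subsetting the rows first gives the same counts.
sub.tab2 <- xtabs(~ Rec + Real, data = toy[toy$Theme != "animate", ])
all(sub.tab == sub.tab2)
```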

This contingency table can be recast as a table of proportions by dividing each cell in the table by the sum of all cells, the total number of observations in the data frame with inanimate themes. We obtain this sum with the help of the function sum(), which returns the sum of the elements in a vector or table:

> sum(verbs.xtabs)
[1] 897

We verify that this is indeed equal to the number of rows in the data frame with inanimate themes only, with the help of the nrow() function.

> sum(verbs.xtabs) == nrow(verbs[verbs$AnimacyOfTheme != "animate",])
[1] TRUE

A table of proportions is obtained straightforwardly by dividing the contingency table by this sum.

> verbs.xtabs/sum(verbs.xtabs)
            RealizationOfRec
AnimacyOfRec         NP         PP
   animate   0.57636566 0.33444816
   inanimate 0.03678930 0.05239688

For percentages instead of proportions, we simply multiply by 100.

> 100 * verbs.xtabs/sum(verbs.xtabs)
            RealizationOfRec
AnimacyOfRec        NP        PP
   animate   57.636566 33.444816
   inanimate  3.678930  5.239688

It is often useful to recast counts as proportions (relative frequencies) with respect to row or column totals. Such proportions can be calculated with prop.table(). When its second argument is 1, prop.table() calculates relative frequencies with respect to the row totals,

> prop.table(verbs.xtabs, 1)    # rows sum to 1
            RealizationOfRec
AnimacyOfRec        NP        PP
   animate   0.6328029 0.3671971
   inanimate 0.4125000 0.5875000

when its second argument is 2, it produces proportions relative to column totals.

> prop.table(verbs.xtabs, 2)    # columns sum to 1
            RealizationOfRec
AnimacyOfRec        NP        PP
   animate   0.9400000 0.8645533
   inanimate 0.0600000 0.1354467
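What prop.table() computes can be reproduced by hand. The sketch below rebuilds the table of counts for inanimate themes as a plain matrix (the four counts are copied from the output above; the object names counts, by.row, by.col and by.hand are introduced only for this illustration).

```r
# Counts copied from the table for inanimate themes shown above.
counts <- matrix(
  c(517, 300,
     33,  47),
  nrow = 2, byrow = TRUE,
  dimnames = list(AnimacyOfRec = c("animate", "inanimate"),
                  RealizationOfRec = c("NP", "PP"))
)

by.row <- prop.table(counts, 1)   # each row sums to 1
by.col <- prop.table(counts, 2)   # each column sums to 1

rowSums(by.row)
colSums(by.col)

# The same row proportions by hand: divide each row by its total.
by.hand <- counts / rowSums(counts)
all.equal(by.row, by.hand)
```

Dividing a matrix by a vector of row totals works here because R recycles the vector down the columns, so every cell is divided by the total of its own row.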

These tables show that the row proportions are somewhat different for animate versus inanimate recipients, and that column proportions are slightly different for NP versus PP realizations of the recipient. Later we shall see that there is indeed reason for surprise: the observed asymmetry between rows and columns is unlikely to arise under chance conditions. For animate recipients, the NP realization is more likely than the PP realization. Inanimate recipients have a non-trivial preference for the PP realization.

1.4.4 Calculations on data frames

Another question that arises with respect to the data in verbs is to what extent the length of the theme, i.e., the complexity of the theme measured in terms of the number of words used to express it, covaries with the animacy of the recipient. Could it be that animate recipients show a preference for more complex themes, compared to inanimate recipients? To assess this possibility, we calculate the mean length of the theme for animate and inanimate recipients. We obtain these means with the help of the function mean(), which takes a numerical vector as input, and returns the arithmetic mean.

> mean(1:5)
[1] 3

We could use this function to calculate the means for the animate and inanimate recipients separately,

> mean(verbs[verbs$AnimacyOfRec == "animate", ]$LengthOfTheme)
[1] 1.540278
> mean(verbs[verbs$AnimacyOfRec != "animate", ]$LengthOfTheme)
[1] 1.071130

but a much more convenient way of obtaining these means simultaneously is to make use of the tapply() function. This function takes three arguments. The first argument specifies a numeric vector for which we want to calculate means. The second argument specifies how this numeric vector should be split into groups, namely, on the basis of its factor levels. The third argument specifies the function that is to be applied to these groups. The function that we want to apply to our data frame is mean(), but other functions (e.g., sum(), sqrt()) could also be specified.

> tapply(verbs$LengthOfTheme, verbs$AnimacyOfRec, mean)
  animate inanimate
 1.540278  1.071130

The output of tapply() is a table, here a table with two means labeled by the levels of the factor for which they were calculated. Later we shall see that the difference between these two group means is unlikely to be due to chance.
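The three arguments of tapply() can be seen at work on a few invented numbers. In this sketch, len and group are hypothetical stand-ins for LengthOfTheme and AnimacyOfRec.

```r
# Toy data; values are invented for illustration.
len   <- c(2, 4, 6, 1, 3)
group <- factor(c("animate", "animate", "animate",
                  "inanimate", "inanimate"))

# tapply(): split len by group, apply mean() to each piece.
means <- tapply(len, group, mean)
means

# The same computation spelled out with split() and sapply().
means2 <- sapply(split(len, group), mean)
all.equal(as.vector(means), as.vector(means2))

# Any function that accepts a numeric vector can replace mean().
tapply(len, group, length)
```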

It is also possible to calculate means for subsets of data defined by the levels of more than one factor, in which case the second argument for tapply() should be a LIST of the relevant factors. Like vectors, lists are ordered sequences of elements, but unlike for vectors, the elements of a list can themselves have more than one element. Thus we can have lists of vectors, lists of data frames, or lists containing a mixture of numbers, strings, vectors, data frames, and other lists. Lists are created with the list() function. For tapply(), all we have to do is specify the factors as arguments to the function list(). Here is an example for the means of the length of the theme cross-classified for the levels of AnimacyOfRec and AnimacyOfTheme, illustrating an alternative, slightly shorter way of using tapply() with the help of with():

> with(verbs, tapply(LengthOfTheme,
+   list(AnimacyOfRec, AnimacyOfTheme), mean))
           animate inanimate
animate   1.647496  1.539622
inanimate 2.639057  1.051531
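How the rows and columns of such a table are laid out is easiest to see on toy data. In the sketch below, the data frame toy and its columns Len, Rec and Theme are invented; one combination of levels is deliberately left without observations.

```r
# Toy data; values are invented for illustration.
toy <- data.frame(
  Len   = c(2, 4, 6, 1, 3, 5),
  Rec   = c("animate", "animate", "animate",
            "inanimate", "inanimate", "inanimate"),
  Theme = c("animate", "inanimate", "inanimate",
            "inanimate", "inanimate", "inanimate")
)

# Rows of the result are levels of Rec, columns are levels of Theme.
cell.means <- with(toy, tapply(Len, list(Rec, Theme), mean))
cell.means

# A combination of levels with no observations yields NA.
is.na(cell.means["inanimate", "animate"])
```

The first factor in the list defines the rows and the second defines the columns, which is worth remembering when both factors share the same level names, as they do here.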

A final operation on data frames is best illustrated by means of a data set (heid) concerning reaction times in visual lexical decision elicited from Dutch subjects for neologisms ending in the suffix -heid ('-ness').

> heid[1:5, ]

Subject Word RT BaseFrequency

1 pp1 basaalheid 6.69 3.56

2 pp1 markantheid 6.81 5.16

3 pp1 ontroerdheid 6.51 5.55

4 pp1 contentheid 6.58 4.50

5 pp1 riantheid 6.86 4.53

This data frame comprises log reaction times for 26 subjects to 40 words. For each combination of subject and word, a reaction time (RT) was recorded. For each word, the frequency of its base word was extracted from the CELEX lexical database [Baayen et al., 1995]. Given what we know about frequency effects in lexical processing in general, we expect that neologisms with a higher base frequency elicit shorter reaction times.

Psycholinguistic studies often report two analyses, one for reaction times averaged over subjects, and one for reaction times averaged over words. The aggregate() function carries out these averaging procedures. Its syntax is similar to that of tapply(). Its first argument is the numerical vector for which we want averages according to the subsets defined by the list supplied by the second argument. Here is how we average over words:

> heid2 = aggregate(heid$RT, list(heid$Word), mean)

> heid2[1:5, ]
      Group.1        x
1 aftandsheid 6.705000
2  antiekheid 6.542353
3  banaalheid 6.587727
4  basaalheid 6.585714
5 bebrildheid 6.673333

As aggregate() does not retain the original names of our data frame, we change the column names so that the columns of heid2 remain well interpretable.

> colnames(heid2) = c("Word", "MeanRT")
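The renaming step can be shortened in two ways, sketched here on invented reaction times. The data frame toy, the subject labels s1 and s2, and the word labels w1 and w2 are all hypothetical.

```r
# Toy reaction-time data; values are invented for illustration.
toy <- data.frame(
  Subject = c("s1", "s1", "s2", "s2"),
  Word    = c("w1", "w2", "w1", "w2"),
  RT      = c(6.0, 7.0, 6.4, 7.2)
)

# Naming the list element gives the grouping column its name directly;
# only the column of means (called x) still needs renaming.
by.word <- aggregate(toy$RT, list(Word = toy$Word), mean)
colnames(by.word)[2] <- "MeanRT"
by.word

# The formula interface keeps both column names from the data frame.
by.word2 <- aggregate(RT ~ Word, data = toy, FUN = mean)
by.word2
```

With the formula interface the column of means keeps the name RT, so a rename to something like MeanRT may still be desirable for clarity.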

In the averaging process, we lost the information about the base frequencies of the words. We add this information in two steps. We begin with creating a data frame with just the information pertaining to the words and their frequencies.

> items = heid[, c("Word", "BaseFrequency")]

Because each subject responded to each item, this data frame has multiple identical rows for each word. We remove these redundant rows with unique().

> nrow(items)
[1] 832
> items = unique(items)
> nrow(items)
[1] 40

> items[1:4, ]

Word BaseFrequency

1 basaalheid 3.56

2 markantheid 5.16

3 ontroerdheid 5.55

4 contentheid 4.50

The final step is to add the information in items to the information already available in heid2. We do this with merge(). As arguments to merge(), we first specify the receiving data frame (heid2), and then the donating data frame (items). We also specify the columns in the two data frames that provide the keys for the merging: by.x should point to the key in the receiving data frame and by.y should point to the key in the donating data frame. In the present example, the keys for both data frames have the same value, Word.

> heid2 = merge(heid2, items, by.x = "Word", by.y = "Word")

> head(heid2, n = 4)
         Word   MeanRT BaseFrequency
1 aftandsheid 6.705000          4.20
2  antiekheid 6.542353          6.75
3  banaalheid 6.587727          5.74
4  basaalheid 6.585714          3.56
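The two steps, unique() followed by merge(), can be rehearsed on a toy data frame. All names and values in this sketch are invented; Freq stands in for BaseFrequency.

```r
# Toy data: two subjects, two words; values are invented.
toy <- data.frame(
  Subject = c("s1", "s1", "s2", "s2"),
  Word    = c("w1", "w2", "w1", "w2"),
  RT      = c(6.0, 7.0, 6.4, 7.2),
  Freq    = c(3.5, 5.0, 3.5, 5.0)
)

means <- aggregate(RT ~ Word, data = toy, FUN = mean)

# One row per word: drop the duplicates created by repeated subjects.
items <- unique(toy[, c("Word", "Freq")])
nrow(items)

# When the key has the same name in both data frames, 'by' is enough.
merged <- merge(means, items, by = "Word")
merged

# Without unique(), every mean would be repeated once per subject.
nrow(merge(means, toy[, c("Word", "Freq")], by = "Word"))
```

The last line shows why the unique() step matters: merging against a table with duplicated keys multiplies the rows of the result.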

Make sure you understand why the next sequence of steps leads to the same results.

> heid3 = aggregate(heid$RT, list(heid$Word, heid$BaseFrequency), mean)

> colnames(heid3) = c("Word", "BaseFrequency", "MeanRT")

> head(heid3[order(heid3$Word),], 4)

We shall see shortly that the MeanRT indeed tends to be shorter as BaseFrequency increases.
