Some Recent Results on Cross-Linguistic, Corpus-Based Quantitative Modelling of Word Order and Aspect

Proceedings Chapter

Reference

MERLO, Paola

Abstract

One of the most striking features of human languages is their extreme variety. Even more striking is the existence, behind the apparent variety, of strong representational and cognitive regularities that govern their form and their function: language universals. We discuss here some recent work from our group, where large-scale, data-intensive computational modelling techniques are used to address fundamental linguistic questions on language regularities. In the area of word order, we report here on work that leverages large amounts of monolingual and parallel corpus data to develop computational models of the internal structure of the noun phrase (Universal 20) and of general structural minimisation principles. In the area of event duration, we report on work that leverages both deep similarities and surface differences to develop truly cross-linguistic natural language processing tools.

MERLO, Paola. Some Recent Results on Cross-Linguistic, Corpus-Based Quantitative Modelling of Word Order and Aspect. In: Blochowiak, J. Formal Models in the Study of Language . Cham : Springer, 2017. p. 451-464

DOI : 10.1007/978-3-319-48832-5_24

Available at:

http://archive-ouverte.unige.ch/unige:123646

Corpus-Based Quantitative Modelling of Word Order and Aspect

Paola Merlo

Keywords Language universals ⋅ Universal 20 ⋅ Word order ⋅ Dependency length minimization ⋅ Event duration ⋅ Corpus modeling ⋅ Computational modeling

1 Introduction

One of the striking properties of human languages is their extreme variety in form and expression. Another, apparently contradictory, property is that they all are, at some level, similar, as shown by the fact that any two languages can be translations of each other and that their structural forms exhibit very strong regularities.

Several decades of formal grammar have developed complex representations and sophisticated theories of these regularities. Current availability of very large corpora for many languages provides observational data about variation. The main scientific challenge for computational linguistics is the creation of theories and methods that fruitfully combine large-scale, corpus-based approaches with the linguistic depth of more theoretical symbolic methods.

The space of intersection of generative grammar, corpus work and computational modelling is a lonely place. Thanks to Jacques, for sharing some of these interests.

P. Merlo (✉)

Computational Learning and Computational Linguistics Research Group (clcl.unige.ch), Department of Linguistics, University of Geneva, Geneva, Switzerland

e-mail: [email protected]

We discuss here some recent work from our group, where large-scale, data-intensive computational modelling techniques are used to address fundamental linguistic questions. We investigate both similarities and differences across languages.

On the one hand, we report on work that investigates, in the area of word order, whether frequencies—both typological and corpus-based—are systematically correlated to abstract syntactic principles, and to higher level functional principles of eﬃciency and optimisation. On the other hand, much like the comparative method in linguistics, cross-lingual corpus investigations take advantage of corresponding annotation or linguistic knowledge across languages. We report on work that exploits diﬀerences across languages in the surface expression of meaning to show that complementary information about one language can be extracted from their translations in a second language, considerably improving the performance on current NLP tasks.

These two approaches, a type-based approach and a token-based approach, will be illustrated in this paper by our current work on language universals (specifically on Universal 20 and the dependency minimisation effect) and by work on cross-lingual transfer between Serbian and English to learn the duration of events.

Languages vary greatly in one of their most fundamental and apparent properties, the order of words. For example, languages can position the verb at the beginning or end of the sentence, and adjectives can precede or follow the noun. Word orders vary greatly cross-linguistically, but each language has very strong preferences for a few or even only one order, and, across languages, not all orders are equally preferred (Greenberg 1966; Dryer 1992; Cinque 2013; Baker 2002). Very many theories and descriptions have attempted to explain word order differences and universals and their typological frequencies.

Some authors develop generative mechanisms to account for typological observations by a system of costs and constraints that generate statistical universals (Cinque 2005; Abels and Neeleman 2009; Cinque 2013; Steedman 2011; Steddy and Samek-Lodovici 2011). For example, in a paper that has received much commentary (Cinque 2005), Greenberg’s Universal 20, the universal governing the order of nominal modifiers, is derived from independently motivated principles of syntax organised in a derivational explanation. Factorial, but not derivational, explanations have also been proposed. They identify the predictive properties of the frequency distributions of word order and their relative importance, either based on proximity preference or on general principles of symmetry and harmony (Cysouw 2010; Dryer 2006, 2009). Their predictive power is compared to the generative one in Merlo (2015). We review this work below.

Another area of study of word order investigates the preferred order in those cases where several are possible. Several factors can influence the choice of word order, both related to grammatical principles and to processing principles. Processing theories propose that alternations of word order are attempts to minimise effort. Hawkins (1994, 2004) shows, for example, that syntactic choices generally respect the preference for placing short elements closer to the head than long elements. Hawkins’s work is representative of much work on language processing which attributes parsing performance to the distance or locality of linguistic constituents and their dependents (Gibson 1998, 2000; Lohse et al. 2004; Demberg and Keller 2008). Recently, evidence has been provided for an optimisation process called dependency length minimisation (DLM). Temperley (2007) finds evidence for DLM in a variety of syntactic choice phenomena in written English. Global measures of dependency length on a larger scale have been proposed, and cross-linguistic work has used these measures and demonstrated their minimisation (Gildea and Temperley 2010; Futrell et al. 2015; Gulordava and Merlo 2015a,b; Gulordava et al. 2015). We review our group’s work in this area, which shows that DLM also applies to very short spans and presents interesting diachronic trends.

Statistical approaches in natural language processing have experienced unhampered success for almost two decades on very many tasks. These methods, however, rely on the preparation and use of costly hand-annotated resources, which, in practice, exist for only a few languages. Much like the comparative method in linguistics, most cross-lingual learning takes advantage of any corresponding annotation or linguistic knowledge that is available in one or more languages to transfer it to other languages (Tsang and Stevenson 2001; van der Plas et al. 2010). For example, exploiting multi-lingual resources to solve mono-lingual problems, Merlo et al. (2002) have proposed a verb classification methodology using multi-lingual dictionaries. They exploit the differences across languages in the surface expression of meaning, to show that complementary information about English verbs can be extracted from their translations in a second language (Chinese, German), considerably improving the performance on English (Tsang et al. 2002). We report on recent development of this line of work below (Samardžić and Merlo 2016), where we learn the duration of events in English by transferring information from Serbian.

2 Universal 20

One interesting goal of large-scale corpus and computational linguistics is to investigate the link between syntactic frequencies of word order and underlying grammatical principles (for instance, movement). This line of work develops the hypothesis that the differential costs of basic syntactic operations and structures yield differential frequencies. One such area is the study of the costs of operations that give rise to different word orders across languages in the internal structure of NPs and the observational universal called Universal 20 (Greenberg 1966), the universal governing the linear order of a noun and its modifiers. This line of work started in Merlo (2015), summarised here, and is currently being developed in finer-grained theoretical detail in Merlo and Ouwayda (forthcoming).

Greenberg’s Universal 20

When any or all the items (demonstrative, numeral, and descriptive adjective) precede the noun, they are always found in this order. If they follow, the order is exactly the same or its exact opposite.

Table 1 Attested word orders of Universal 20 and their estimated frequencies

Word order       Dryer’s counts   Dryer’s counts   Cinque’s counts
                 by languages     by genera        discretised
Dem Num Adj N    74               44               V. many
Dem Adj Num N    3                2                0
Num Dem Adj N    0                0                0
Num Adj Dem N    0                0                0
Adj Dem Num N    0                0                0
Adj Num Dem N    0                0                0
Dem Num N Adj    22               17               Many
Dem Adj N Num    11               6                V. few
Num Dem N Adj    0                0                0
Num Adj N Dem    4                3                V. few
Adj Dem N Num    0                0                0
Adj Num N Dem    0                0                0
Dem N Adj Num    28               22               Many
Dem N Num Adj    3                3                V. few
Num N Dem Adj    5                3                0
Num N Adj Dem    38               21               Few
Adj N Dem Num    4                2                V. few
Adj N Num Dem    2                1                V. few
N Dem Num Adj    4                3                Few
N Dem Adj Num    6                4                V. few
N Num Dem Adj    1                1                0
N Num Adj Dem    9                7                Few
N Adj Dem Num    19               11               Few
N Adj Num Dem    108              57               V. many

Some aspects of Greenberg’s formulation have been confirmed by the collection of larger samples, but some others have been found to be too strong. Table 1 reports the 24 combinatorially possible orders of the four elements (N, Dem, Num, Adj) and the actual counts that have been proposed in several publications: the first two columns are Dryer’s (2006) counts by language and by genera, and the third column gives Cinque’s ranks, as can be deduced from the 2005 paper. As can be observed, despite differences across the different counting methods and across authors, which have been discussed in detail in the related publications, the rank of languages or genera based on frequencies shows only one disagreement (the order Num N Adj Dem). The other counts agree on the two most frequent orders, the rare or unattested orders, as well as the rankings in-between. Given the observations of striking differences in the frequency of use of word orders across languages, the question is: why are there such differences? Many different explanations have been proposed.

Merlo (2015) illustrates a method to compare them and analyse how well these explanations predict the data in a precise experimental setting. Three pieces of work are compared. Cinque (2005) proposes that typological frequency rankings are explained by a system of costs of derivational and movement operations: the more costly the derivation the less frequent the order. In a different proposal, a factorial, but not derivational, explanation is proposed (Cysouw 2010). An explanation of typological frequencies is produced by the cumulative combination of three characteristics: hierarchical structure, noun-adjective order, and whether the noun is at the phrase boundary. Finally, Dryer proposes an unweighted factorial explanation based on general principles of symmetry and harmony (Dryer 2006).

Merlo (2015) defines a formal encoding based on a map of the explanatory properties of each proposal to a feature-vector representation. In this method, each word order is represented by a vector of features, which encodes the theory. These vectors are then used by a supervised classifier to predict the word order frequency of unseen word orders.

The most important aspect for our discussion is that once the data are encoded in an appropriate way, Cinque’s way of assigning markedness values (fitting the weights), which is done by hand, or Cysouw’s way of fitting the model to the data is reproduced in this classification scenario by training the model. There are numerous algorithms for learning the weights of a model in a supervised setting, and many regimes for training and testing such algorithms. In Merlo’s (2015) experiments, two probabilistic learning algorithms are used, Naive Bayes and the Weighted Average One-Dependence Estimator (WAODE), with n-fold cross-validation as the protocol for training and testing (Russell and Norvig 1995; Webb et al. 2005).
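The classification set-up can be illustrated with a minimal sketch: word orders are mapped to feature vectors, labelled with a frequency class, and a Naive Bayes classifier is evaluated on held-out items. The three binary features only loosely echo the factors attributed to Cysouw above; the feature definitions, the subset of orders, the class thresholds and all function names are assumptions made for illustration, not the encoding of Merlo (2015). Leave-one-out is used here as a special case of n-fold cross-validation, and WAODE is not implemented.

```python
from collections import Counter, defaultdict
import math

# Dryer's counts by language for a subset of the orders in Table 1.
COUNTS = {
    "Dem Num Adj N": 74, "Dem Adj Num N": 3, "Num Dem Adj N": 0,
    "Dem Num N Adj": 22, "Dem N Adj Num": 28, "Num N Adj Dem": 38,
    "N Dem Num Adj": 4, "N Adj Dem Num": 19, "N Adj Num Dem": 108,
    "Adj Num N Dem": 0, "Adj Dem N Num": 0, "Num Adj Dem N": 0,
}

def features(order):
    """Hypothetical encoding: three binary features per word order."""
    w = order.split()
    n, adj = w.index("N"), w.index("Adj")
    return (
        adj > n,                # adjective follows the noun
        n in (0, len(w) - 1),   # noun at a phrase boundary
        abs(adj - n) == 1,      # adjective adjacent to the noun
    )

def label(count):
    """Hypothetical discretisation of counts into frequency classes."""
    if count == 0:
        return "unattested"
    return "frequent" if count >= 20 else "rare"

def train(rows):
    """Estimate class counts and per-feature value counts."""
    priors = Counter(y for _, y in rows)
    cond = defaultdict(Counter)  # (class, feature index) -> value counts
    for x, y in rows:
        for i, v in enumerate(x):
            cond[(y, i)][v] += 1
    return priors, cond

def predict(model, x):
    """Pick the class with the highest log posterior (add-one smoothing)."""
    priors, cond = model
    total = sum(priors.values())

    def score(y):
        s = math.log(priors[y] / total)
        for i, v in enumerate(x):
            s += math.log((cond[(y, i)][v] + 1) / (priors[y] + 2))
        return s

    return max(priors, key=score)

def leave_one_out(data):
    """Train on all items but one, test on the held-out item, average."""
    hits = 0
    for i, (x, y) in enumerate(data):
        model = train(data[:i] + data[i + 1:])
        hits += predict(model, x) == y
    return hits / len(data)

DATA = [(features(o), label(c)) for o, c in COUNTS.items()]
ACCURACY = leave_one_out(DATA)
```

Because every prediction is made for an order that was excluded from training, `ACCURACY` estimates generalisation rather than fit, which is the point of the comparison described above.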

This method of model comparison yields many novel results. First of all, it provides an unbiased estimate of how good at generalising the models really are, since they are tested on unseen data. The results show that all the models are at least ten percent below ceiling, some below baseline. This result shows that manual data fitting yields overly optimistic results compared to those obtained by comparing on a properly held-out test set. Secondly, one can identify kinds of errors specific to each model by looking at confusion matrices. Cinque’s model tends to classify word orders as more frequent than they really are. Cysouw’s model does not appear to be able to predict the unattested class. Dryer’s results show that features based on harmony and symmetry can only precisely predict the very frequent orders. Finally, this method allows us to test whether the proposed elementary operations, encoded as the attributes of the models, are indeed primitive and independent. This is possible because the Naive Bayes learner is predicated on a strong independence assumption of the attributes. Results show that Cinque’s and Dryer’s models yield better results when a classifier that uses weaker independence assumptions (the WAODE classifier) is used, while Cysouw’s has the same performance. The fact that a classifier that makes weaker independence assumptions about its attributes yields better performance than Naive Bayes indicates that the attributes are not independent. Finding a statistical dependence among factors suggests that part of the explanation of the data is given by the interaction of the factors. This interaction is specific to the problem and the given data, and shows that part of the explanation is not independently motivated.

This work on word order is an illustration of how to use simple computational methods to represent theories of language variation coming from typology, and compare them to each other, but it relies on extremely summarised data. It does not entirely exploit the large amounts of naturalistic data that are available these days and that enable us to study variation in much more detail. Corpora and other large sources of data need to be used to attain this goal, as illustrated in the next section.

3 Dependency Length Minimisation

Another universal that has recently been claimed to govern both word order and syntactic structure is Dependency Length Minimisation (DLM) (Futrell et al. 2015). The DLM principle can be stated as follows: if there exist possible alternative orderings of a phrase, the one with the shortest overall dependency length (DL) is preferred.¹

This principle replaces previous formulations where it had been observed that heavy, complex phrases have a tendency to move to the end of the sentence. While the effect of size or heaviness is well-documented, a preference for end weight does not describe the preference of head-final languages: Object-Verb languages, like Japanese or Korean, put long constituents before short ones (Hawkins 1994; Wasow 2002). DLM captures the fact that short elements prefer being close to their head and is therefore more valid typologically.

Corpus and treebank data allow us to verify how these claims fare cross-linguistically. DLM has recently been confirmed at the sentential level for many languages (Futrell et al. 2015). Our group has recently investigated short spans, studying DLM effects in the noun phrase. While DLM has been demonstrated on a large scale and explanations have been proposed based on human sentence processing facts in the verbal domain, it is not clear what the effects of DLM are in the more limited nominal domain. If the explanations are really rooted in memory and efficiency, will they still hold in phrases that might span only a few words? Gulordava et al. (2015) and Gulordava and Merlo (2015b) look at the structural factors that play a role in adjective-noun word order alternations in Romance languages. They demonstrate that, unlike results for the verbal domain, it is not only the length of the dependency that is at stake, but also the interaction with the surrounding dependencies. More precisely, it is demonstrated that the presence of a right dependent of the noun affects the position of the adjective which modifies this noun: the prenominal position is more often preferred in such cases. This effect is highly consistent for five Romance languages and for all noun and adjective types.

1 The length of a dependency is measured as the number of words between the head and its dependent.

Fig. 1 Right external dependent, prenominal adjective

Fig. 2 Right external dependent, postnominal adjective

Specifically, Gulordava and colleagues analyse the adjective placement using the dependency-annotated treebanks of Romance languages, from which they collect large amounts of noun phrases and their structures. They consider a prototypical simple noun phrase with one modifying adjective phrase, which in turn can contain preadjectival and postadjectival material. The adjective modifier can be a complex phrase with both left and right dependents (𝛼 and 𝛽, respectively). The noun phrase can have parents and right modifiers (X and Y, respectively). These structures correspond to examples like those shown in (1), in Italian (X = ‘è’, Adj = ‘principale’, N = ‘isola’, Y = ‘del Lago Maggiore’). Example structures for two of the possible cases are shown in Figs. 1 and 2.

(1) a. Questa è la principale isola del Lago Maggiore.
       (‘This is the main island of Lake Maggiore.’)
    b. Questa è l’isola principale del Lago Maggiore.
       (‘This is the island main of Lake Maggiore.’)
    c. La principale isola del Lago Maggiore è questa.
       (‘The main island of Lake Maggiore is this one.’)
    d. L’isola principale del Lago Maggiore è questa.
       (‘The island main of Lake Maggiore is this one.’)

DLM makes predictions on adjective placement with respect to the noun (prenominal or postnominal) given the dependents of the adjective, 𝛼 and 𝛽, and given the dependent of the noun, Y.

Let us assume that DL differences are always calculated as $DL_1 - DL_2$, that is, prenominal order minus postnominal order. Consider the dependency length difference for the two cases shown in Figs. 1 and 2, where $d_1$, $d_2$ and $d_3$ denote the lengths of the three dependencies in the respective figure:

$$DL_1 = d_1 + d_2 + d_3 = |\beta| + |Y|$$

$$DL_2 = d_1 + d_2 + d_3 = |\alpha| + |\beta| + |Y| + 1 + |\alpha| + |\alpha| + |\beta| + 1 = 3|\alpha| + 2|\beta| + |Y| + 2$$

The differential $DL_1 - DL_2$ is $-3|\alpha| - |\beta| - 2$. The negative value shows that $DL_1 < DL_2$, hence there will be a preference for a prenominal placement. Similar calculations can be done for the other cases.
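The differential can be checked mechanically by laying out the words of each configuration and counting the words between each head and its dependent. The sketch below makes several assumptions that are not spelled out in the text: the noun's parent X stands to the right of the noun phrase (as in Figs. 1 and 2), the noun's right dependent Y attaches through its first word, and only the three dependencies whose length changes between the two orders are counted. The function names are hypothetical.

```python
def between(i, j):
    """Dependency length: number of words strictly between positions i and j."""
    return abs(i - j) - 1

def dl_prenominal(alpha, beta, y):
    """Order: [alpha Adj beta] N [Y] X, with the noun's parent X on the right."""
    adj = alpha                 # alpha words precede the adjective
    noun = adj + beta + 1       # beta words separate adjective and noun
    y_head = noun + 1           # assumption: Y attaches through its first word
    x = noun + y + 1            # X follows the y words of Y
    return between(adj, noun) + between(noun, y_head) + between(noun, x)

def dl_postnominal(alpha, beta, y):
    """Order: N [alpha Adj beta] [Y] X."""
    noun = 0
    adj = noun + alpha + 1
    y_head = adj + beta + 1
    x = y_head + y
    return between(noun, adj) + between(noun, y_head) + between(noun, x)

def differential(alpha, beta, y):
    """DL of the prenominal order minus DL of the postnominal order."""
    return dl_prenominal(alpha, beta, y) - dl_postnominal(alpha, beta, y)
```

For a bare adjective with a three-word right dependent of the noun, as in (1c) and (1d), `differential(0, 0, 3)` evaluates to -2, in line with the closed form above.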

The predictions of the DLM theory are tested in a mixed-effects model estimated on the data provided by the dependency-annotated corpora of five main Romance languages (Catalan, French, Italian, Portuguese, Spanish). The different elements in the DLM configuration, illustrated in Figs. 1 and 2 and in example (1), are encoded as four factors: LeftAP, the cumulative length (in words) of all left dependents of the adjective, indicated as 𝛼 in Figs. 1 and 2; RightAP, the cumulative length (in words) of all right dependents of the adjective, indicated as 𝛽 in Figs. 1 and 2; ExtDep, the direction of the arc from the noun to its parent X, an indicator variable; and RightNP, the indicator variable representing the presence or absence of the right dependent of the noun, indicated as Y.

The findings partly confirm the predictions about adjective placement with respect to the noun given the adjective dependents (presence of preadjectival material induces prenominal placement and postadjectival material triggers postnominal placement). The prediction about the effect of a right dependent of the noun on the placement of the adjective is also confirmed.

These results confirm that a principle of minimisation of dependencies is also active in Noun Phrases and even in very short spans (recall that Noun Phrases with only one adjective were considered). They therefore suggest that DLM applies also in situations where a functional motivation of this principle is only minimally supported, as the need to reduce cognitive or memory load is not compelling.

Observational language universals, such as Universal 20, suggest that languages prefer harmonic orders, while work on DL preferences shows that languages prefer shorter dependencies. However, these two pressures are in contradiction, as a reduction in dependency length can be obtained by placing modifiers at the two sides of the head, increasing variation in head directionality and consequently inducing less directional harmony in the structure. How exactly languages balance these two pressures and evolve, then, is worthy of investigation.

Gulordava and Merlo (2015a) study texts of Latin and Ancient Greek, two well-documented free word order languages, spanning different time periods. The texts in Latin range from the Classical Latin period (Caesar and Cicero) to the Late Latin of the 4th century (Vulgate and Peregrinatio). Jerome’s Vulgate is a translation from the Greek New Testament. The two Greek texts are Herodotus (4th century BC) and New Testament (4th century AD). The sizes of the texts vary, but each is at least 900 sentences long.

Following Gildea and Temperley (2010) and Futrell et al. (2015), upper and lower bounds are established by computing the optimal and random dependency length of a sentence. More precisely, to compute the random dependency length, the positions of the words in the sentence are permuted, preserving its unordered dependency tree available from the gold annotation, and the new random dependency length is calculated.
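The random baseline can be sketched as follows. The sketch assumes unconstrained permutations of the word positions (no projectivity constraint), measures length as the number of words between head and dependent, and uses a hypothetical toy annotation; it does not compute the optimal dependency length, which requires a dedicated linearisation algorithm. The function names, the sample size and the seed are illustrative choices.

```python
import random

def total_dl(heads, position):
    """Sum of dependency lengths, counting the words between head and dependent.

    heads[i] is the index of word i's head (None for the root);
    position[i] is the linear position of word i.
    """
    return sum(
        abs(position[i] - position[h]) - 1
        for i, h in enumerate(heads)
        if h is not None
    )

def random_dl(heads, samples=1000, seed=0):
    """Average dependency length over random reorderings of the same tree."""
    rng = random.Random(seed)
    position = list(range(len(heads)))
    total = 0
    for _ in range(samples):
        rng.shuffle(position)  # permute positions; the tree itself is unchanged
        total += total_dl(heads, position)
    return total / samples

# Toy tree for "La principale isola è questa" (hypothetical annotation):
# La -> isola, principale -> isola, isola -> è, è = root, questa -> è
HEADS = [2, 2, 3, None, 3]
ACTUAL = total_dl(HEADS, list(range(len(HEADS))))
RANDOM = random_dl(HEADS)
```

Here `ACTUAL` is the dependency length of the attested order and `RANDOM` the sampled upper bound for the same unordered tree; averaging such values per sentence length gives curves of the kind compared in the text.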

Interesting results are obtained by comparing the random, optimal and actual dependency lengths averaged for sentences of the same length.² Figure 3 shows the curves for the aggregated data from the classical Latin texts and from the late Latin texts. The first observation is that both periods show dependency lengths that are optimised, to a certain extent, as they are all better than the (red) random curve.

2 Since the optimal and random dependency length values depend (non-linearly) on the sentence length n, it is customary not to average DL globally over all sentences in a treebank (Cancho and Liu 2014).

Fig. 3 Average random, average optimal and actual dependency lengths of sentences by sentence length for texts in classical Latin (left panel) and in late Latin (right panel)

Also, all languages have a very similar minimised dependency length, and the actual curves lie between random and optimal.

If we calculate the ratio of actual dependency length compared to optimal dependency length, over all the dependencies in the texts, we can calculate how much a given time period optimises DL. According to the ratio measure used in Gulordava and Merlo (2016),³ the different texts show the following rates of minimisation: Cicero (Classical Latin) = 1.26; Vulgate (late Latin) = 1.17; Herodotus (early Ancient Greek) = 1.33; New Testament (late Ancient Greek) = 1.19. These numbers clearly show a trend towards greater minimisation, in both languages. For Latin, this tendency is confirmed by the minimisation rate of current Romance languages: Italian = 1.13, French and Spanish = 1.15.

The work described so far investigates quantitative factors underlying typological and corpus variation of word order, to discover the operations and principles that govern this variation. Two main conclusions emerge from these investigations. Despite their great variety, the world’s languages do exhibit common underlying principles and properties, but these properties are very abstract and are predicated on underlying unobservable structures. Moreover, the properties can be formal (like movement operations) or quantitative, like minimising dependencies. This dual property of languages (common abstract unifying principles at an unobserved level, expressed by cross-linguistic variation at the surface level) can also be leveraged in a natural language processing perspective, as discussed in the next section.

3 The measure is defined as follows: $\mathrm{DLMRatio} = \dfrac{\sum_s DL_s / |s|^2}{\sum_s OptDL_s / |s|^2}$.
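As a reading aid for footnote 3, the ratio can be sketched in a few lines: each sentence contributes its actual and its optimal dependency length, both normalised by the square of the sentence length, and the two sums are divided. The input triples below are invented toy values, not data from the texts discussed above, and the function name is hypothetical.

```python
def dlm_ratio(sentences):
    """Ratio of length-normalised actual DL to length-normalised optimal DL.

    sentences: iterable of (actual_dl, optimal_dl, length) triples.
    """
    sentences = list(sentences)
    actual = sum(dl / n ** 2 for dl, _, n in sentences)
    optimal = sum(opt / n ** 2 for _, opt, n in sentences)
    return actual / optimal

# Invented toy values: (actual DL, optimal DL, sentence length)
TOY = [(12, 10, 6), (30, 24, 10), (5, 5, 4)]
RATIO = dlm_ratio(TOY)
```

A value of 1 would mean that every sentence is optimally ordered; larger values indicate less minimisation, which is how the figures reported for the Latin and Greek texts are to be read.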

4 Learning Event Duration Using Parallel Corpora

In natural language processing, one of the problems that applications such as question answering, multi-document summarisation or generation have to solve is the correct ordering of the events in the text (Lapata and Lascarides 2006; UzZaman et al. 2013). The ordering of events depends, among other factors, on the duration of each event separately. For example, the sentences in (2) represent two different orderings. In (2a), the two events are interpreted as a, possibly causal, sequence. This is possible because they are interpreted as short events. In contrast, in (2b), the second event is interpreted as durative, preexisting the first event and spanning over its duration. A sequential reading is not possible.

(2) a. John walked into the baby’s room. The baby woke up.

b. John walked into the baby’s room. The baby slept soundly.

modified from (Dowty 1986)

People have intuitions about how long an event can last, as shown by the fact that annotations of duration of events have been possible with reasonable inter-annotator agreement (Pan et al. 2011). But in many languages this information is implicit and not marked on the form of the verb expressing the event. For example, in English, event duration is not a grammatical category like tense or person. Proposals have recently been made that event duration can be inferred from the context, such as time adverbials or surrounding words, and from morphological properties of words that are implicitly correlated with the property of duration (Costa and Branco 2013; Pan et al. 2011; Gusev et al. 2011; Williams and Katz 2012).

Samardžić (2013) and Samardžić and Merlo (2016) present a very different idea that is based on two basic insights. First, the duration of an event is a lexical property correlated to the general cross-linguistic property of verb aspect. Two of the main properties of events expressed by aspect are whether it has an end point or not (boundedness) and how long it lasts (duration). These properties are correlated. Second, while in English verb aspect is often implicit, there are languages that mark aspect much more explicitly, like the Slavic languages. One can leverage the surface properties of these languages, given the appropriate parallel resources, transfer aspect information to languages that do not express it explicitly, like English, and then use this information to infer event duration. Thus, they develop a quantitative representation of verb aspect, which is based on the distribution of morpho-syntactic realisations of Serbian verbs, they apply it to a set of parallel English-Serbian verb instances and use it to predict event duration in English. This work is discussed in more detail below.

Slavic languages differ from most of the other European languages, because they encode verb aspect through lexical derivations, by a complex system of prefixes and suffixes that encode perfectivity. These perfective and imperfective derivations can potentially encode numerous boundedness and duration classes, distinguishing long events from short events.

There are potentially two kinds of long events: basic imperfectives and secondary imperfectives. A basic imperfective presents a morphological form such as <stem> + <inflection> (e.g. kuva-ti, meaning ‘boil’), while a secondary imperfective has the form <prefix> + <stem> + <suffix> + <inflection> (e.g. pro-kuva-va-ti, meaning ‘boil continuously or repeatedly’). These two forms do not have the same properties. In particular, the secondary imperfective has a resultative meaning introduced by the prefix (Arsenijević 2007), while the basic imperfective does not. Since the meaning of secondary imperfectives is more specific, they describe shorter events than basic imperfective verbs. The system of prefixes and suffixes also encodes two kinds of short events. The prefixed forms of the verb (e.g. pro-kuva-ti, meaning ‘complete and specify boil’) express longer events than the suffixed forms (e.g. kuv-nu-ti, which means ‘boil once, instantaneously’), since the perfective suffix is specialised for very short or instantaneous, clearly bounded events. The short events are shorter than the long events, thus the four event types define a total order by duration: suffixed perfective < prefixed perfective < secondary imperfective < basic imperfective.
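The four derivational types described above depend on just two surface properties, the presence of a prefix and the presence of a suffix. A minimal sketch of that mapping follows; the function names are hypothetical, and the sketch takes the two properties as given rather than attempting any morphological analysis of Serbian verb forms.

```python
def aspect_class(has_prefix, has_suffix):
    """Map the two surface properties to one of the four derivational types."""
    if has_prefix and has_suffix:
        return "secondary imperfective"   # e.g. pro-kuva-va-ti
    if has_prefix:
        return "prefixed perfective"      # e.g. pro-kuva-ti
    if has_suffix:
        return "suffixed perfective"      # e.g. kuv-nu-ti
    return "basic imperfective"           # e.g. kuva-ti

def is_long(has_prefix, has_suffix):
    """Imperfectives describe long events, perfectives short ones."""
    return aspect_class(has_prefix, has_suffix).endswith("imperfective")
```

The coarse long/short split returned by `is_long` corresponds to the binary duration classes used in the evaluation reported later in this section.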

This analysis of the Serbian aspect system illustrates that one can infer the duration of events based only on some simple, unlexicalised lexical properties: whether a verb has a prefix, and whether it has a suffix. Samardžić and colleagues define a simple Bayesian model that formalises the relationships between the observed properties (suffix and prefix) and the duration of the event. The Bayesian logic indicates that an unobserved property, the duration, surfaces as different verbal forms. Their model assumes that suffixes depend both on the latent variable of interest (the duration of the event) and on the prefix, since a suffix can be added to a prefixed verb form as a means of deriving secondary imperfectives (potentially expressing long events) or it can be added to a bare form, and it can result in a perfective (expressing a short event). Prefixes depend only on the duration of the event. This model might appear overly simple, but in fact it has been shown to perform as well as both more complex models of the same kind and more complex models trained on Web-scale resources.
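The dependency structure just described (duration influences the prefix; duration and prefix together influence the suffix) corresponds to the factorisation P(duration, prefix, suffix) = P(duration) P(prefix | duration) P(suffix | duration, prefix). The sketch below computes the posterior over duration from that factorisation. Every probability value is invented for illustration, and the names are hypothetical; the published model estimates its parameters from parallel data and is not reproduced here.

```python
# All probabilities below are invented for illustration only.
P_DURATION = {"short": 0.5, "long": 0.5}
P_PREFIX = {"short": 0.6, "long": 0.3}      # P(prefix present | duration)
P_SUFFIX = {                                 # P(suffix present | duration, prefix)
    ("short", True): 0.1, ("short", False): 0.8,
    ("long", True): 0.9, ("long", False): 0.05,
}

def bernoulli(p, observed):
    """Probability of a binary observation given P(observed is True) = p."""
    return p if observed else 1.0 - p

def posterior(has_prefix, has_suffix):
    """P(duration | prefix, suffix) by Bayes' rule over the factorised joint."""
    joint = {
        d: P_DURATION[d]
        * bernoulli(P_PREFIX[d], has_prefix)
        * bernoulli(P_SUFFIX[(d, has_prefix)], has_suffix)
        for d in P_DURATION
    }
    z = sum(joint.values())
    return {d: p / z for d, p in joint.items()}
```

With these toy parameters, a form carrying both a prefix and a suffix comes out as more probably long, while a form with a prefix only comes out as more probably short, mirroring the morphological description above.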

Table 2 shows comparative percent accuracy results of the unsupervised and supervised versions of the aspect-based classifier, in comparison with two unsupervised approaches proposed by Gusev et al. (2011) and supervised approaches by Gusev et al. (2011) and by Pan et al. (2011).

Table 2 Percent accuracy of the unsupervised and supervised versions of the aspect-based classifier, in comparison with other approaches

                              Classification accuracy (%)
Classifier                    In-domain set   WSJ set
AspectNet Unsupervised        67.6            73.4
Gusev-etal Unsupervised 1     70.7            73.5
Gusev-etal Unsupervised 2     72.4            74.8
AspectNet Supervised          70.8            74.8
Gusev-etal Supervised         73.0            74.8
Pan-etal Supervised           73.3            73.5

Pan et al. (2011) learn a binary classification of events into short and long using a set of lexical features and a set of features extracted from the context of the event. The best classification is learned using Support Vector Machines. Gusev et al. (2011) learn event duration using predefined word patterns, indicating either a long or a short event. One of the patterns used to signal a short event, for example, is Past Tense + yesterday. The occurrence data are extracted from the Web.

These are very strong competitors, based in one case on very large, Web-scale data collections. On the in-domain test set, the performance of the model described here approaches that of the other methods, remaining within about five percentage points in the unsupervised setting and within three in the supervised setting. On the out-of-domain WSJ test set, the model’s performance in the unsupervised setting is similar to that of Gusev et al. (2011) (73.4% against 73.5% and 74.8%). In the supervised setting on the same set, the aspect-based model, despite its simplicity, reaches the same score as Gusev et al. (2011) (74.8%).
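The size of these differences can be checked directly from the figures in Table 2. The short sketch below copies the reported accuracies and computes, for each setting and test set, how far the aspect-based classifier is from its best competitor. The data structure and the function name are hypothetical conveniences introduced here; only the numbers come from the table.

```python
# Accuracies (%) copied from Table 2 as (in-domain set, WSJ set).
IN_DOMAIN, WSJ = 0, 1

TABLE_2 = {
    ("AspectNet", "unsupervised"): (67.6, 73.4),
    ("Gusev-etal 1", "unsupervised"): (70.7, 73.5),
    ("Gusev-etal 2", "unsupervised"): (72.4, 74.8),
    ("AspectNet", "supervised"): (70.8, 74.8),
    ("Gusev-etal", "supervised"): (73.0, 74.8),
    ("Pan-etal", "supervised"): (73.3, 73.5),
}


def gap_to_best_competitor(setting, column):
    """Best competitor accuracy minus AspectNet accuracy, in percentage points."""
    ours = TABLE_2[("AspectNet", setting)][column]
    others = [
        scores[column]
        for (name, row_setting), scores in TABLE_2.items()
        if row_setting == setting and name != "AspectNet"
    ]
    # Round to one decimal to hide floating-point noise in the subtraction.
    return round(max(others) - ours, 1)


if __name__ == "__main__":
    for setting in ("unsupervised", "supervised"):
        for label, column in (("in-domain", IN_DOMAIN), ("WSJ", WSJ)):
            print(setting, label, gap_to_best_competitor(setting, column))
```

The gaps are 4.8 and 2.5 points on the in-domain set (unsupervised and supervised respectively), and 1.4 and 0.0 points on the WSJ set.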

These results clearly show the power of linguistically-informed approaches. Samardžić and Merlo’s model is much simpler than the others, with only two word-level features, and consequently very little data is needed to estimate its parameters. The model is trained on only a few hundred examples, without need for external resources, such as syntactic parsers or ontologies, which are necessary in other approaches. Other methods require many learning patterns and Web-scale amounts of data. Finally, this method exemplifies how to transfer information from a language where it is explicitly marked on the surface to a language where it is only implicit. In this way, languages become annotations of each other and, as in this case, information can flow from an otherwise resource-poor language to a resource-rich one.

5 Conclusions

The pieces of work described in this chapter are strongly inspired by linguistic theory and linguistic constructs. The use of articulated, abstract theories leads to models that are small in size and require little data to estimate a few parameters, but maintain good predictive power. These models provide a link between symbolic, traditional forms of linguistic explanation and quantitative properties of the linguistic phenomena under study, properties that can be typological or corpus-based. Exploring this link between linguistic and statistical properties enables us to develop powerful, yet simple NLP tools and, in so doing, (re)build bridges between linguistics, computational linguistics and natural language processing.

Acknowledgements I gratefully thank the Swiss NSF, which has funded the collaborators who have worked on this research program, Kristina Gulordava and Tanja Samardžić, under grants 122643 and 144362. Thanks to the members of the Computational Learning and Computational Linguistics group, who have given comments at different stages of this work. All remaining errors are my own.


References

Abels K, Neeleman A (2009) Universal 20 without the LCA. In: Brucart JM, Gavarro G, Sola J (eds) Merging features: computation, interpretation, and acquisition. Oxford University Press, Oxford/New York, pp 60–79

Arsenijević B (2007) Slavic verb prefixes are resultative. Cah Chronos 17:197–213

Baker MC (2002) The atoms of language: the mind’s hidden rules of grammar. Basic Books

Ferrer-i-Cancho R, Liu H (2014) The risks of mixing dependency lengths from sequences of different length. Glottotheory 5(2):143–155

Cinque G (2005) Deriving Greenberg’s universal 20 and its exceptions. Linguist Inq 36(3):315–332

Cinque G (2013) Typological studies: word order and relative clauses. Routledge, New York/London

Costa F, Branco A (2013) Temporal relation classification based on temporal reasoning. In: Proceedings of the 10th international conference on computational semantics (IWCS 2013). Potsdam, Germany, pp 59–70

Cysouw M (2010) Dealing with diversity: towards an explanation of NP word order frequencies. Linguist Typol 14(2):253–287

Demberg V, Keller F (2008) Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition 109(2):193–210

Dowty DR (1986) The effects of aspectual class on the temporal structure of discourse: semantics or pragmatics. Linguist Philos 9:37–61

Dryer M (2006) The order demonstrative, numeral, adjective and noun: an alternative to Cinque. http://exadmin.matita.net/uploads/pagine/1898313034_cinqueH09.pdf

Dryer MS (1992) The Greenbergian word order correlations. Language 68:81–138. doi:10.2307/416370

Dryer MS (2009) The branching direction theory of word order correlations revisited. In: Universals of language today. Springer, pp 185–207

Futrell R, Mahowald K, Gibson E (2015) Large-Scale evidence of dependency length minimization in 37 languages. In: Proceedings of the national academy of sciences of the United States of America

Gibson E (1998) Linguistic complexity: locality of syntactic dependencies. Cognition 68(1):1–76

Gibson E (2000) The dependency locality theory: a distance-based theory of linguistic complexity. In: Image, language, brain, pp 95–126

Gildea D, Temperley D (2010) Do grammars minimize dependency length? Cogn Sci 34(2):286–310. http://dx.doi.org/10.1111/j.1551-6709.2009.01073.x

Greenberg JH (1966) Language universals. Mouton, The Hague/Paris

Gulordava K, Merlo P (2015a) Diachronic trends in word order freedom and dependency length in dependency-annotated corpora of Latin and Ancient Greek. In: Proceedings of the international conference on dependency linguistics (DepLing’15). Uppsala, Sweden

Gulordava K, Merlo P (2015b) Structural and lexical factors in adjective placement in complex noun phrases across romance languages. In: Proceedings of the 19th conference on computational natural language learning. Beijing, China, pp 247–257

Gulordava K, Merlo P (2016) Multi-lingual dependency parsing evaluation: a large-scale analysis of word order properties using artificial data. Trans Assoc Comput Linguist

Gulordava K, Merlo P, Crabbé B (2015) Dependency length minimisation effects in short spans: a large-scale analysis of adjective placement in complex noun phrases. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (Volume 2: Short Papers). Beijing, China, pp 477–482


Gusev A, Chambers N, Khaitan P, Khilnani D, Bethard S, Jurafsky D (2011) Using query patterns to learn the durations of events. In: IEEE IWCS-2011, 9th international conference on web service, institute of electrical and electronics engineers (IEEE). Oxford, UK, pp 145–155

Hawkins JA (1994) A performance theory of order and constituency. Cambridge University Press, Cambridge

Hawkins JA (2004) Efficiency and complexity in grammars. Oxford University Press, Oxford

Lapata M, Lascarides A (2006) Learning sentence-internal temporal relations. J Artif Intell Res 27:85–117

Lohse B, Hawkins JA, Wasow T (2004) Domain minimization in English verb-particle constructions. Language, pp 238–261

Merlo P (2015) Predicting word order universals. J Lang Model 3(2):317–344

Merlo P, Ouwayda S (forthcoming) Movement and structure effects on universal 20 word order frequencies: a quantitative study, manuscript, University of Geneva

Merlo P, Stevenson S, Tsang V, Allaria G (2002) A multi-lingual paradigm for automatic verb classification. In: Proceedings of the 40th annual meeting of the association for computational linguistics (ACL’02). Philadelphia, PA, pp 207–214

Pan F, Mulkar-Mehta R, Hobbs JR (2011) Annotating and learning event durations in text. Comput Linguist 37(4):727–753

van der Plas L, Samardžić T, Merlo P (2010) Cross-lingual validity of PropBank in the manual annotation of French. In: Proceedings of the 4th linguistic annotation workshop, Association for Computational Linguistics. Uppsala, Sweden, pp 113–117. http://www.aclweb.org/anthology/W10-1814

Russell S, Norvig P (1995) Artificial intelligence: a modern approach. Prentice Hall Series in Artificial Intelligence, Prentice Hall, Upper Saddle River, NJ

Samardžić T (2013) Dynamics, causation, duration in the predicate-argument structure of verbs: a computational approach based on parallel corpora. Ph.D. thesis, University of Geneva

Samardžić T, Merlo P (to appear) Unlexicalised learning of event duration using parallel corpora. In: Villavicencio A, Diab M (eds) Volume in honour of Adam Kilgarriff. Springer

Steddy S, Samek-Lodovici V (2011) On the ungrammaticality of remnant movement in the derivation of Greenberg’s universal 20. Linguist Inq 42(3):445–469

Steedman M (2011) Greenberg’s 20th: the view from the long tail, unpublished manuscript. University of Edinburgh

Temperley D (2007) Minimization of dependency length in written English. Cognition 105(2):300–333

Tsang V, Stevenson S (2001) Automatic verb classification using multilingual resources. In: Proceedings of the conference on computational language learning (CONLL 2001)

Tsang V, Stevenson S, Merlo P (2002) Cross-linguistic transfer in automatic verb classification. In: Proceedings of the 19th international conference on computational linguistics (COLING 2002). Taipei, Taiwan, pp 1023–1029

UzZaman N, Llorens H, Derczynski L, Allen J, Verhagen M, Pustejovsky J (2013) SemEval-2013 task 1: TempEval-3: evaluating time expressions, events, and temporal relations. In: Second joint conference on lexical and computational semantics (*SEM), volume 2: Proceedings of the seventh international workshop on semantic evaluation (SemEval 2013). Atlanta, Georgia, USA, pp 1–9

Wasow T (2002) Postverbal behavior. CSLI Publications

Webb GI, Boughton J, Wang Z (2005) Not so Naive Bayes: aggregating one-dependence estimators. Mach Learn 58(1):5–24

Williams J, Katz G (2012) Extracting and modeling durations for habits and events from Twitter. In: Proceedings of the 50th annual meeting of the association for computational linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Jeju Island, Korea, pp 223–227