Proceedings Chapter
Reference
Some Recent Results on Cross-Linguistic, Corpus-Based Quantitative Modelling of Word Order and Aspect
MERLO, Paola
Abstract
One of the most striking features of human languages is their extreme variety. Even more striking is the existence, behind the apparent variety, of strong representational and cognitive regularities that govern their form and their function: language universals. We discuss here some recent work from our group, where large-scale, data-intensive computational modelling techniques are used to address fundamental linguistic questions on language regularities. In the area of word order, we report here on work that leverages large amounts of monolingual and parallel corpus data to develop computational models of the internal structure of the noun phrase (Universal 20) and of general structural minimisation principles. In the area of event duration, we report on work that leverages both deep similarities and surface differences to develop truly cross-linguistic natural language processing tools.
MERLO, Paola. Some Recent Results on Cross-Linguistic, Corpus-Based Quantitative Modelling of Word Order and Aspect. In: Blochowiak, J. Formal Models in the Study of Language . Cham : Springer, 2017. p. 451-464
DOI : 10.1007/978-3-319-48832-5_24
Available at:
http://archive-ouverte.unige.ch/unige:123646
Disclaimer: layout of this document may differ from the published version.
1 / 1
Corpus-Based Quantitative Modelling of Word Order and Aspect
Paola Merlo
Abstract One of the most striking features of human languages is their extreme variety. Even more striking is the existence, behind the apparent variety, of strong representational and cognitive regularities that govern their form and their function:
language universals. We discuss here some recent work from our group, where large- scale, data-intensive computational modelling techniques are used to address funda- mental linguistic questions on language regularities. In the area of word order, we report here on work that leverages large amounts of monolingual and parallel corpus data to develop computational models of the internal structure of the noun phrase (Universal 20) and of general structural minimisation principles. In the area of event duration, we report on work that leverages both deep similarities and surface differ- ences to develop truly cross-linguistic natural language processing tools.
Keywords Language universals
⋅
Universal 20⋅
Word order⋅
Dependency length minimization⋅
Event duration⋅
Corpus modeling⋅
Computational modeling1 Introduction
One of the striking properties of human languages is their extreme variety in form and expression. Another, apparently contradictory, property is that they all are, at some level, similar, as shown by the fact that any two languages can be transla- tions of each other and that their structural forms exhibit very strong regularities.
Several decades of formal grammar have developed complex representations and sophisticated theories of these regularities. Current availability of very large corpora for many languages provides observational data about variation. The main scientific
The space of intersection of generative grammar, corpus work and computational modelling is a lonely place. Thanks to Jacques, for sharing some of these interests.
P. Merlo (✉)
Computational Learning and Computational Linguistics Research Group (clcl.unige.ch), Department of Linguistics, University of Geneva, Geneva, Switzerland
e-mail: [email protected]
© Springer International Publishing AG 2017
J. Blochowiak et al. (eds.),Formal Models in the Study of Language, DOI 10.1007/978-3-319-48832-5_24
451
challenge for computational linguistics is the creation of theories and methods that fruitfully combine large-scale, corpus-based approaches with the linguistic depth of more theoretical symbolic methods.
We discuss here some recent work from our group, where large-scale, data- intensive computational modelling techniques are used to address fundamental lin- guistic questions. We investigate both similarities and differences across languages.
On the one hand, we report on work that investigates, in the area of word order, whether frequencies—both typological and corpus-based—are systematically cor- related to abstract syntactic principles, and to higher level functional principles of efficiency and optimisation. On the other hand, much like the comparative method in linguistics, cross-lingual corpus investigations take advantage of corresponding annotation or linguistic knowledge across languages. We report on work that exploits differences across languages in the surface expression of meaning to show that com- plementary information about one language can be extracted from their translations in a second language, considerably improving the performance on current NLP tasks.
These two approaches, a type-based approach and a token-based approach, will be illustrated in this paper by our current work on language universals—specifically on universal 20 and the dependency minimisation effect—and work on cross-lingual transfer between Serbian and English to learn the duration of events.
Languages vary greatly in one of their most fundamental and apparent properties, the order of words. Languages can position the verb at the beginning or end of the sentence, adjectives can precede or follow the noun, for example. Word orders vary greatly cross-linguistically, but each language has very strong preferences for a few or even only one order, and, across languages, not all orders are equally preferred (Greenberg1966; Dryer1992; Cinque2013; Baker2002). Very many theories and descriptions have attempted to explain word order differences and universals and their typological frequencies.
Some authors develop generative mechanisms to account for typological observa- tions by a system of costs and constraints that generate statistical universals (Cinque 2005; Abels and Neeleman2009; Cinque2013; Steedman2011; Steddy and Samek- Lodovici2011). For example, in a paper that has received much commentary (Cinque 2005), Greenberg’s Universal 20—the universal governing the order of nominal modifiers—is derived from independently motivated principles of syntax organised in a derivational explanation. Factorial, but not derivational, explanations have also been proposed. They identify the predictive properties of the frequency distributions of word order and their relative importance, either based on proximity preference or on general principles of symmetry and harmony (Cysouw2010; Dryer2006,2009).
Their predictive power is compared to the generative one in Merlo (2015). We review this work below.
Another area of study of word order investigates the preferred order in those cases where several are possible. Several factors can influence the choice of word order, both related to grammatical principles and to processing principles. Process- ing theories propose that alternations of word order are attempts to minimise effort.
Hawkins (1994, 2004) shows, for example, that syntactic choices generally respect the preference for placing short elements closer to the head than long elements.
Hawkins’s work is representative of much work on language processing which attributes parsing performance to the distance or locality of linguistic constituents and their dependents (Gibson1998, 2000; Lohse et al.2004; Demberg and Keller 2008). Recently, evidence has been provided for an optimisation process called dependency length minimisation (DLM). Temperley (2007) finds evidence for DLM in a variety of syntactic choice phenomena in written English. Global measures of dependency length on a larger scale have been proposed, and cross-linguistic work has used these measures and demonstrated their minimisation (Gildea and Temper- ley2010; Futrell et al.2015; Gulordava and Merlo2015a,b; Gulordava et al.2015).
We review our group’s work in this area, which shows that DLM also applies to very short spans and presents interesting diachronic trends.
Statistical approaches in natural language processing have experienced unham- pered success for almost two decades on very many tasks. These methods, however, rely on the preparation and use of costly hand-annotated resources, which, in prac- tice, exist for only a few languages. Much like the comparative method in linguistics, most cross-lingual learning takes advantage of any corresponding annotation or lin- guistic knowledge that is available in one or more languages to transfer it to other lan- guages (Tsang and Stevenson2001; van der Plas et al.2010). For example, exploiting multi-lingual resources to solve mono-lingual problems, Merlo et al. (2002) have proposed a verb classification methodology using multi-lingual dictionaries. They exploit the differences across languages in the surface expression of meaning, to show that complementary information about English verbs can be extracted from their translations in a second language (Chinese, German), considerably improving the performance on English (Tsang et al.2002). We report on recent development of this line of work below (Samardžić and Merlo2016), where we learn the duration of events in English by transferring information from Serbian.
2 Universal 20
One interesting goal of large-scale corpus and computational linguistics is to inves- tigate the link between syntactic frequencies of word order and underlying grammat- ical principles (for instance, movement). This line of work develops the hypothesis that the differential costs of basic syntactic operations and structures yield differ- ential frequencies. One of such areas is the study of the costs of operations that give rise to different word orders across languages in the internal structure of NPs and the observational universal called Universal 20 (Greenberg1966)—the univer- sal governing the linear order of a noun and its modifiers. This line of work started in (Merlo2015), summarised here, and currently is being developed in Merlo and Ouwayda (forthcoming) to finer-grained theoretical detail.
Greenberg’s Universal 20
When any or all the items (demonstrative, numeral, and descriptive adjective) precede the noun, they are always found in this order. If they follow, the order is exactly the same or its exact opposite.
Table 1 Attested word orders of Universal 20 and their estimated frequencies Dryer’s counts by
languages
Dryer’s counts by genera
Cinque’s counts discretised
Dem Num Adj N 74 44 V. many
Dem Adj Num N 3 2 0
Num Dem Adj N 0 0 0
Num Adj Dem N 0 0 0
Adj Dem Num N 0 0 0
Adj Num Dem N 0 0 0
Dem Num N Adj 22 17 Many
Dem Adj N Num 11 6 V. few
Num Dem N Adj 0 0 0
Num Adj N Dem 4 3 V. few
Adj Dem N Num 0 0 0
Adj Num N Dem 0 0 0
Dem N Adj Num 28 22 Many
Dem N Num Adj 3 3 V. few
Num N Dem Adj 5 3 0
Num N Adj Dem 38 21 Few
Adj N Dem Num 4 2 V. few
Adj N Num Dem 2 1 V. few
N Dem Num Adj 4 3 Few
N Dem Adj Num 6 4 V. few
N Num Dem Adj 1 1 0
N Num Adj Dem 9 7 Few
N Adj Dem Num 19 11 Few
N Adj Num Dem 108 57 V. many
Some aspects of Greenberg’s formulation have been confirmed by the collection of larger samples, but some others have been found to be too strong. Table1reports the 24 combinatorially possible orders of the four elements: N, Dem, Num, Adj and the actual counts that have been proposed in several publications: the first two columns are Dryer’s (2006) counts by language and by genera; and the following col- umn are Cinque’s ranks, as can be deduced from the 2005 paper. As can be observed, despite differences across the different counting methods and across authors, which have been discussed in detail in the related publications, the rank of languages or genera based on frequencies shows only one disagreement (the order Num N Adj Dem). The other counts agree on the two most frequent orders, the rare or unattested orders, as well as the rankings in-between. Given the observations of striking differ- ences in the frequency of use of word orders across languages, the question is: why are there such differences? Many different explanations have been proposed.
Merlo (2015) illustrates a method to compare them and analyse how well these explanations predict the data in a precise experimental setting. Three pieces of work are compared. Cinque (2005) proposes that typological frequency rankings are explained by a system of costs of derivational and movement operations: the more costly the derivation the less frequent the order. In a different proposal, a factor- ial, but not derivational, explanation is proposed (Cysouw2010). An explanation of typological frequencies is produced by the cumulative combination of three charac- teristics: hierarchical structure, noun-adjective order, and whether the noun is at the phrase boundary. Finally, Dryer proposes an unweighted factorial explanation based on general principles of symmetry and harmony (Dryer2006).
Merlo (2015) defines a formal encoding based on a map of the explanatory prop- erties of each proposal to a feature-vector representation. In this method, each word order is representd by a vector of features, which encodes the theory. These vectors are then used by a supervised classifier to predict the word order frequency of unseen word orders.
The most important aspect for our discussion is that once the data are encoded in an appropriate way, Cinque’s way of assigning markedness values (fitting the weights), which is done by hand, or Cysouw’s way of fitting the model to the data is reproduced in this classification scenario by training the model. There are numer- ous algorithms for learning the weights of a model in a supervised setting, and many regimes for training and testing such algorithms. In Merlo’s (2015) experiments, two probabilistic learning algorithms are used—Naive Bayes and the Weighted Average One-dependence Estimator (WAODE)—, andn-fold cross-validation, in the proto- col used for training and testing (Russel and Norvig1995; Webb et al.2005).
This method of model comparison yields many novel results. First of all, it pro- vides an unbiased estimate of how good at generalising the models really are, since they are tested on unseen data. The results show that all the models are at least ten percent below ceiling, some below baseline. This result shows that manual data fit- ting yields overly optimistic results compared to those obtained by comparing on a properly held-out test set. Secondly, one can identify kinds of errors specific to each model by looking at confusion matrices. Cinque’s model tends to classify word orders as more frequent than they really are. Cysouw’s model does not appear to be able to predict the unattested class. Dryer’s result show that features based on har- mony and symmetry can only precisely predict the very frequent orders. Finally, this method allows us to test whether the proposed elementary operations, encoded as the attributes of the models, are indeed primitive and independent. This is possible, because the Naive Bayes learner is predicated on a strong independence assump- tion of the attributes. Results show that Cinque’s and Dryer’s models yield better results when a classifier that uses weaker independence assumptions—the WAODE classifier—is used, while Cysouw’s has the same performance. The fact that a clas- sifier that makes weaker independence assumptions about its attributes yields better performance than Naive Bayes indicates that the attributes are not independent. Find- ing a statistical dependence among factors suggests that part of the explanation of the data is given by the interaction of the factors. This interaction is specific to the
problem and the given data, and shows that part of the explanation is not indepen- dently motivated.
This work on word order is an illustration of how to use simple computational methods to represent theories of language variation coming from typology, and com- pare them to each other, but it relies on extremely summarised data. It does not entirely exploit the large amounts of naturalistic data that are available these days and that enable us to study variation in much more detail. Corpora and other large sources of data need to be used to attain this goal, as illustrated in the next section.
3 Dependency Length Minimisation
Another universal that has recently been claimed to govern both word order and syntactic structure is Dependency Length Minimisation (DLM) (Futrell et al.2015).
The DLM principle can be stated as follows: if there exist possible alternative orderings of a phrase, the one with the shortest overall dependency length (DL) is preferred.1
This principle replaces previous formulations where it had been observed that heavy, complex phrases have a tendency to move to the end of the sentence. While the effect of size or heaviness is well-documented, a preference of end weight does not describe the preference of head-final languages: Object-Verb languages, like Japanese or Korean, put long constituents before short (Hawkins 1994; Wasow 2002). DLM captures the fact that short elements prefer being close to their head and is therefore more valid typologically.
Corpus and treebank data allows us to verify how these claims fare cross- linguistically. DLM has recently been confirmed at the sentential level for many lan- guages (Futrell et al.2015). Our group has recently investigated short spans, studying DLM effects in the noun phrase. While DLM has been demonstrated on a large scale and explanations have been proposed based on human sentence processing facts in the verbal domain, it is not clear what the effects of DLM are in the more limited nominal domain. If the explanations are really rooted in memory and efficiency, will they still hold in phrases that might span only a few words? Gulordava et al. (2015) and Gulordava and Merlo (2015b) look at the structural factors that play a role in adjective-noun word order alternations in Romance languages. They demonstrate that, unlike results for the verbal domain, it is not only the length of the dependency that is at stake, but also the interaction with the surrounding dependencies. More pre- cisely, it is demonstrated that the presence of a right dependent of the noun affects the position of the adjective which modifies this noun: the prenominal position is more often preferred in such cases. This effect is highly consistent for five Romance languages and for all noun and adjective types.
1The length of a dependency is measured as the number of words between the head and its depen- dent.
Fig. 1 Right external dependent, prenominal adjective
Fig. 2 Right external dependent, postnominal adjective
Specifically, Gulordava and colleagues analyse the adjective placement using the dependency annotated treebanks of Romance languages, from which they collect large amounts of noun phrases and their structures. They consider a prototypical simple noun phrase with one modifying adjective phrase, which in turn can contain preadjectival and postadjectival material. The adjective modifier can be a complex phrase with both left and right dependents (𝛼and𝛽, respectively). The noun phrase can have parents and right modifiers (X and Y, respectively). These structures cor- respond to examples like those shown in (1), in Italian (X =‘è’, Adj =‘principale’, N =‘isola’, Y = ‘del Lago Maggiore’). Example structures for two of the possible cases are shown in Figs.1and2.
(1) a. Questa è la principale isola del Lago Maggiore.
(‘This is the main island of Lake Maggiore.’) b. Questa è l’isola principale del Lago Maggiore.
(‘This is the island main of Lake Maggiore.’) c. La principale isola del Lago Maggiore è questa.
(‘The main island of Lake Maggiore is this one.’) d. L’isola principale del Lago Maggiore è questa.
(‘The island main of Lake Maggiore is this one.’)
DLM makes predictions on adjective placement with respect to the noun—
prenominal or postnominal—given the dependents of the adjectives,𝛼and𝛽, and given the dependent of the noun Y.
Let’s assume that the calculation of DL differences is always calculated as DL1−DL2, that is prenominal order—postnominal order. Consider the dependency length difference for the two cases shown in the picture: DL1=d1+d2+d3 =
|𝛽|+|Y| and DL2=d1+d2+d3=|𝛼|+|𝛽|+|Y|+ 1 +|𝛼|+|𝛼|+|𝛽|+ 1=
3|𝛼|+ 2|𝛽|+|Y|+ 2. The differentialDL1−DL2is−3|𝛼|−|𝛽|− 2. The negative value shows thatDL1<DL2, hence there will be a preference for a prenominal place- ment. Similar calculations can be done for the other cases.
The predictions of the DLM theory are tested in a mixed-effects model estimated on the data provided by the dependency annotated corpora of five main Romance languages (Catalan, French, Italian, Portuguese, Spanish). The different elements in the DLM configuration are encoded as four factors: corresponding to the factors illustrated in Figs.1to2and example (1):LeftAP—the cumulative length (in words)
of all left dependents of the adjective, indicated as𝛼in Figs.1to2;RightAP—the cumulative length (in words) of all right dependents of the adjective, indicated as 𝛽 in Figs.1 to2;ExtDep—the direction of the arc from the noun to its parent X, an indicator variable;RightNP—the indicator variable representing the presence or absence of the right dependent of the noun, indicated as Y.
The findings partly confirm the predictions about adjective placement with respect to the noun given the adjective dependents (presence of preadjectival material induces prenominal placement and postadjectival material triggers postnominal placement). The prediction related to the presence of a right dependent of the noun on the placement of the adjective are also confirmed.
These results confirm that a principle of minimisation of dependencies is also active in Noun Phrases and even in very short spans—recall that Noun Phrases with only one adjective were considered. They therefore suggest that DLM applies also in situations where a functional motivation of this principle is only minimally sup- ported, as the need to reduce cognitive or memory load is not compelling.
Observational language universals, such as universal 20, suggest that languages prefer harmonic orders, while work on DL preferences shows that languages prefer shorter dependencies. However, these two pressures are in contradiction, as a reduc- tion in dependency length can be obtained by placing modifiers at the two sides of the head, increasing variation in head directionality and consequently inducing less directional harmony in the structure. How exactly languages balance these two pres- sures and evolve, then, is worthy of investigation.
Gulordava and Merlo (2015a) study texts of Latin and Ancient Greek, two well- documented free word order languages, spanning different time periods. The texts in Latin range from the Classical Latin period (Caesar and Cicero) to the Late Latin of 4th century (Vulgate and Peregrinatio). Jerome’s Vulgate is a translation from the Greek New Testament. The two Greek texts are Herodotus (4th century BC) and New Testament (4th century AD). The sizes of the texts vary but they are at least 900-sentence long.
Following Gildea and Temperley (2010) and Futrell et al. (2015), upper and lower bounds are established by computing the optimal and random dependency length of a sentence. More precisely, to compute the random dependency length, the positions of the words in the sentence are permuted, preserving its unordered dependency tree available from the gold annotation, and the new random dependency length is calculated.
Interesting results are obtained by comparing the random, optimal and actual dependency lengths averaged for sentences of the same length.2Figure3shows the curves for the aggregated data from the classical Latin texts and from the late Latin texts. The first observation is that both periods show dependency lengths that are optimised, to a certain extent, as they are all better than the (red) random curve.
2Since the optimal and random dependency length values depend (non-linearly) on the sentence lengthn, it is customary not to average DL globally over all sentences in a treebank (Cancho and Liu,2014).
Fig. 3 Average random, average optimal and actual dependency lengths of sentences by sentence length for texts in classical Latin (left panel) and in late Latin (right panel)
Also, all languages have a very similar minimised dependency length, and the actual curves lie between random and optimal.
If we calculate the ratio of actual dependency length compared to optimal depen- dency length, over all the dependencies in the texts, we can calculate how much a given time period optimises DL. According to the ratio measure used in Gulordava and Merlo (2016),3 the different texts show the following rates of min- imisation: Cicero (Classical Latin) = 1.26; Vulgate (late Latin) = 1.17; Herodotus (early Ancient Greek) = 1.33; New testament (late Ancient Greek) = 1.19. These numbers clearly show a trend towards greater minimisation, in both languages. For Latin, this tendency is confirmed by the minimisation rate of current Romance lan- guages: Italian = 1.13, French and Spanish = 1.15.
The work described so far investigates quantitative factors underlying typolog- ical and corpus variation of word order, to discover the operations and principles that govern this variation. Two main conclusions emerge from these investigations.
The great variety of the world’s languages do exhibit common underlying principles and properties, but these properties are very abstract and are predicated on underly- ing unobservable structures. Moreover, the properties can be formal (like movement operations) or quantitative, like minimising dependencies. This dual property of languages—common abstract unifying principles at an unobserved level expressed by cross-linguistic variation at the surface level—can also be leveraged in a natural language processing perspective, as discussed in the next section.
3The measure is defined as follows:DLMRatio= Σs DLs
|s|2∕Σs OptDLs
|s|2 .
4 Learning Event Duration Using Parallel Corpora
In natural language processing, one of the problems that applications such as ques- tion answering, multi-document summarisation or generation have to solve is the correct ordering of the events in the text (Lapata and Lascarides2006; UzZaman et al.2013).
The ordering of events depends, among other factors, on the duration of each event separately. For example, the sentences in (2) represent two different orderings.
In (2a), the two events are interpreted as a, possibly causal, sequence. This is possible because they are interpreted as short events. In contrast, in (2b), the second event is interpreted as durative, preexisting the first event and spanning over its duration.
A sequential reading is not possible.
(2) a. John walked into the baby’s room. The baby woke up.
b. John walked into the baby’s room. The baby slept soundly.
modified from (Dowty1986) People have intuitions over how long an event can last, as it is shown by the fact that annotations of duration of events have been possible with reasonable inter- annotator agreement (Pan et al. 2011). But in many languages this information is implicit and not marked on the form of the verb expressing the event.
For example, in English, event duration is not a grammatical category like tense or person. Proposals have recently been made that event duration can be inferred by the context—such as time adverbials or surrounding words— and by morphological properties of words that are implicitly correlated with the property of duration (Costa and Branco2013; Pan et al.2011; Gusev et al.2011; Williams and Katz2012).
Samardžić (2013) and Samardžić and Merlo (2016) present a very different idea that is based on two basic insights. First, the duration of an event is a lexical prop- erty correlated to the general cross-linguistic property of verb aspect. Two of the main properties of events expressed by aspect are whether it has an end point or not (boundedness) and how long it lasts (duration). These properties are correlated.
Second, while in English verb aspect is often implicit, there are languages that mark aspect much more explicitly, like the Slavic languages. One can leverage the sur- face properties of these languages, given the appropriate parallel resources, transfer aspect information to languages that do not express it explicitly, like English, and then use this information to infer event duration. Thus, they develop a quantitative representation of verb aspect, which is based on the distribution of morpho-syntactic realisations of Serbian verbs, they apply it to a set of parallel English-Serbian verb instances and use it to predict event duration in English. This work is discussed in more detail below.
Slavic languages differ from most of the other European languages, because they encode verb aspect through lexical derivations, by a complex system of prefixes and suffixes that encode perfectivity. These perfective and imperfective derivations can potentially encode numerous boundedness and duration classes, distinguishing long events from short events.
There are potentially two kinds of long events: basic imperfective and secondary imperfectives. A basic imperfective presents a morphological form such as<stem>
+<inflection>(e.g.kuva-ti, meaningboil), while a secondary imperfective has a
form<prefix>+<stem>+<suffix>+<inflection>(e.g.pro-kuva-va-ti, meaning:
‘boilcontinuously or repeatedly’). These two forms do not have the same proper- ties. In particular, the secondary imperfective has a resultative meaning introduced by the prefix (Arsenijević2007), while the basic imperfective does not. Since the meaning of secondary imperfectives is more specific, they describe shorter events than basic imperfective verbs. The system of prefixes and suffixes also encodes two kinds of short events. The prefixed forms of the verb (e.g. pro-kuva-ti, meaning
‘complete and specifyboil’) express longer events than the suffixed forms (e.g.kuv- nu-ti, which means ‘boilonce, instantaneously’), since the perfective suffix is spe- cialised for very short or instantaneous events, clearly bounded. The short events are shorter than the long events, thus the four event types define a total order: prefixed perfective<suffixed perfective<secondary imperfective<basic imperfective.
This analysis of the Serbian aspect system illustrates that one can infer the dura- tion of events based only on some simple, unlexicalised lexical properties: whether a verb as a prefix, and whether it has a suffix. Samardžić and colleagues define a simple Bayesian model that formalises the relationships between the observed properties—
suffix and prefix—and the duration of the event. The Bayesian logic indicates that an unobserved property, the duration, surfaces as different verbal forms. Their model assumes that suffixes depend both on the latent variable of interest—the duration of the event—and on the prefix, since a suffix can be added to a prefixed verb form as a means of deriving secondary imperfectives (potentially expressing long events) or it can be added to a bare form, and it can result in a perfective (expressing a short event). Prefixes depend only on the duration of the event. This model might appear overly simple but in fact it has been shown to have as good performance as both more complex models of the same kind and as more complex models trained on Web-scale resources.
Table2 shows comparative percent accuracy results of the unsupervised and supervised version of the aspect-based classifier, in comparison with two unsuper-
Table 2 Percent accuracy of the unsupervised and supervised version of the aspect-based classi- fier, in comparison with other approaches
Classification accuracy (%)
Classifier In-domain set WJS set
AspectNet Unsupervised 67.6 73.4
Gusev-etal Unsupervised 1 70.7 73.5
Gusev-etal Unsupervised 2 72.4 74.8
AspectNet Supervised 70.8 74.8
Gusev-etal Supervised 73.0 74.8
Pan-etal Supervised 73.3 73.5
vised approaches proposed by Gusev et al. (2011) and supervised approaches by Gusev et al. (2011) and by Pan et al. (2011).
Pan et al. (2011) learn a binary classification of events into short and long using a set of lexical features and a set of features extracted from the context of the event.
The best classification is learned using Support Vector Machines. Gusev et al. (2011) learn event duration using predefined word patterns, indicating either a long or a short event. One of the patterns used to signal a short event, for example, is Past Tense +yesterday. The occurrence data are extracted from the Web.
These are very strong competitors, based in one case on very large, Web-scale data collections. The performance of the model described here on the in-domain test set approaches the other methods. On the out-of-domain WSJ test set, the model’s performance is similar to Gusev et al. (2011) in the unsupervised setting. The aspect- based model, despite its simplicity, reaches identical scores as Gusev et al. (2011) in the supervised setting.
These results clearly show the power of linguistically-informed approaches.
Samardžić and Merlo’s model is much simpler than the others, with only two, word- level features and consequently very little data is needed to estimate its parameters.
The model is trained on only a few hundred examples, without need for external resources, such as syntactic parsers or ontologies, which are necessary in other approaches. Other methods requires many learning patterns and Web-scale amounts of data. Finally, this method exemplifies how to transfer surface information from a language where this information is explicitly marked on the surface to a language where it is only implicit. In this way, languages become annotations of each other and, as in this case, the flow of information can flow from an otherwise resource- poor to a resource-rich language.
5 Conclusions
The pieces of work described in this chapter are strongly inspired by linguistic the- ory and linguistic constructs. The use of articulated, abstract theories leads to models that are small in size and require little data to estimate a few parameters, but maintain good predictive power. These models provide a link between symbolic, traditional forms of linguistic explanation and quantitative properties of the linguistic phenom- ena under study, properties that can be typological or corpus-based. Exploring this link between linguistic and statistical properties enables us to develop powerful, yet simple NLP tools and, in so doing, (re) build bridges between linguistics, computa- tional linguistics and natural language processing.
Acknowledgements I gratefully thank the Swiss NSF which has funded the collaborators who have worked on this research program, Kristina Gulordava and Tanja Samardžić, under grants 122643 and 144362. Thanks to the members of the Computational Learning and Computational Linguistics group, who have given comments at different stages of this work. All remaining errors are my own.
References
Abels K, Neeleman A (2009) Universal 20 without the LCA. In: Brucart JM, Gavarro G, Sola J (eds) Merging features: computation, interpretation, and acquisition. Oxford University Press, Oxford/New York, pp 60–79
Arsenijević, B. (2007). Slavic verb prefixes are resultative.Cah Chronos,17, 197–213.
Baker MC (2002) The atoms of language: the mind’s hidden rules of grammar. Basic books Cancho, R F-i, & Liu, H. (2014). The risks of mixing dependency lengths from sequences of dif-
ferent length.Glottotheory,5(2), 143–155.
Cinque, G. (2005). Deriving Greenberg’s universal 20 and its exceptions.Linguist Inq,36(3), 315–
332.
Cinque G (2013) Typological studies: word order and relative clauses. Routledge, New York/
London
Costa F, Branco A (2013) Temporal relation classification based on temporal reasoning. In:
Proceedings of the 10th international conference on computational semantics (IWCS 2013).
Potsdam, Germany, pp 59–70
Cysouw, M. (2010). Dealing with diversity: towards an explanation of NP word order frequencies.
Linguist Typol,14(2), 253–287.
Demberg, V., & Keller, F. (2008). Data from eye-tracking corpora as evidence for theories of syn- tactic processing complexity.Cognition,109(2), 193–210.
Dowty, D. R. (1986). The effects of aspectual class on the temporal structure of discourse: semantics or pragmatics.Linguist Philos,9, 37–61.
Dryer M (2006) The order demonstrative, numeral, adjective and noun: an alternative to Cinque.
http://exadmin.matita.net/uploads/pagine/1898313034_cinqueH09.pdf
Dryer, M. S. (1992). The Greenbergian word order correlations.Language,68, 81–138. doi:10.
2307/416370.
Dryer MS (2009) The branching direction theory of word order correlations revisited. In: Universals of language today. Springer, pp 185–207
Futrell R, Mahowald K, Gibson E (2015) Large-Scale evidence of dependency length minimization in 37 languages. In: Proceedings of the national academy of sciences of the United States of America
Gibson, E. (1998). Linguistic complexity: locality of syntactic dependencies.Cognition,68(1), 1–
76.
Gibson E (2000) The dependency locality theory: a distance-based theory of linguistic complexity.
Image, language, brain pp 95–126
Gildea D, Temperley D (2010) Do grammars minimize dependency length? Cogn. Sci. 34(2):286–
310.http://dx.doi.org/10.1111/j.1551-6709.2009.01073.x
Greenberg, J. H. (1966).Language universals. The Hague, Paris: Mouton.
Gulordava K, Merlo P (2015a) Diachronic trends in word order freedom and dependency length in dependency-annotated corpora of Latin and Ancient Greek. In: Proceedings of the international conference on dependency linguistics (DepLing’15). Uppsala, Sweden
Gulordava K, Merlo P (2015b) Structural and lexical factors in adjective placement in complex noun phrases across romance languages. In: Proceedings of the 19th conference on computational natural language learning. Beijing, China, pp 247–257
Gulordava K, Merlo P (2016) Multi-lingual dependency parsing evaluation: a large-scale analysis of word order properties using artificial data. Trans Assoc Comput Linguist
Gulordava K, Merlo P, Crabbé B (2015) Dependency length minimisation effects in short spans:
a large-scale analysis of adjective placement in complex noun phrases. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (Volume 2: Short Papers). Beijing, China, pp 477–482
Gusev A, Chambers N, Khaitan P, Khilnani D, Bethard S, Jurafsky D (2011) Using query patterns to learn the durations of events. In: IEEE IWCS-2011, 9th international conference on web service, institute of electrical and electronics engineers (IEEE). Oxford, UK, pp 145–155
Hawkins JA (1994) A performance theory of order and constituency. Cambridge University Press, Cambridge
Hawkins, J. A. (2004).Efficiency and complexity in grammars. Oxford: Oxford University Press.
Lapata, M., & Lascarides, A. (2006). Learning sentence-internal temporal relations.J Artif Intell Res,27, 85–117.
Lohse B, Hawkins JA, Wasow T (2004) Domain minimization in English verb-particle construc- tions. Language, pp 238–261
Merlo, P. (2015). Predicting word order universals.J Lang Model,3(2), 317–344.
Merlo P, Ouwayda S (forthcoming) Movement and structure effects on universal 20 word order frequencies: a quantitative study, manuscript, University of Geneva
Merlo P, Stevenson S, Tsang V, Allaria G (2002) A multi-lingual paradigm for automatic verb classification. In: Proceedings of the 40th annual meeting of the association for computational linguistics (ACL’02). Philadelphia, PA, pp 207–214
Pan, F., Mulkar-Mehta, R., & Hobbs, J. R. (2011). Annotating and learning event durations in text.
Comput Linguist,37(4), 727–753.
van der Plas L, Samardžić T, Merlo P (2010) Cross-lingual validity of PropBank in the manual annotation of French. In: Proceedings of the 4th linguistic annotation workshop, Association for Computational Linguistics. Uppsala, Sweden, pp 113–117.http://www.aclweb.org/anthology/
W10-1814
Russel, S., & Norvig, P. (1995).Artificial intelligence: a modern approach. Upper Saddle River, NJ: Prentice Hall Series in Artificial Intelligence, Prentice Hall.
Samardžić T (2013) Dynamics, causation, duration in the predicate-argument structure of verbs:
a computational approach based on parallel corpora. Ph.D. thesis, University of Geneva Samardžić T, Merlo P (to appear) Unlexicalised learning of event duration using parallel corpora.
In: Villavincencio A, Diab M (eds) Volume in honour of Adam Kilgarriff. Springer
Steddy S, Samek-Lodovici V (2011) On the ungrammaticality of remnant movement in the deriva- tion of Greenberg’s universal 20. Linguist Inq 42(3):445–469
Steedman M (2011) Greenberg’s 20th: the view from the long tail, unpublished manuscript. Uni- versity of Edinburgh
Temperley, D. (2007). Minimization of dependency length in written English.Cognition,105(2), 300–333.
Tsang V, Stevenson S (2001) Automatic verb classification using multilingual resources. In: Pro- ceedings of the conference on computational language learning (CONLL 2001)
Tsang V, Stevenson S, Merlo P (2002) Cross-linguistic transfer in automatic verb classification. In:
Proceedings of the 19th international conference on computational linguistics (COLING 2002).
Taipei, Taiwan, pp 1023–1029
UzZaman N, Llorens H, Derczynski L, Allen J, Verhagen M, Pustejovsky J (2013) SemEval-2013 task 1: TempEval-3: evaluating time expressions, events, and temporal relations. Second joint conference on lexical and computational semantics (*sem), volume 2: Proceedings of the seventh international workshop on semantic evaluation (SemEval 2013). Atlanta, Georgia, USA, pp 1–9 Wasow T (2002) Postverbal behavior. CSLI Publications
Webb, G. I., Boughton, J., & Wang, Z. (2005). Not so Naive Bayes: aggregating one-dependence estimators.Mach Learn,58(1), 5–24.
Williams J, Katz G (2012) Extracting and modeling durations for habits and events from Twitter. In:
Proceedings of the 50th annual meeting of the association for computational linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Jeju Island, Korea, pp 223–227