
Biases in Segmenting Non-concatenative Morphology

by

Michelle Alison Fullwood

B.A., Cornell University (2004)

Submitted to the Department of Linguistics and Philosophy

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Linguistics

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2018

© Massachusetts Institute of Technology 2018. All rights reserved.

Author: Signature redacted
Department of Linguistics and Philosophy
September 1, 2018

Certified by: Signature redacted
Adam Albright
Professor of Linguistics
Thesis Supervisor

Accepted by: Signature redacted
Adam Albright
Professor of Linguistics
Linguistics Section Head


Biases in Segmenting Non-concatenative Morphology

by

Michelle Alison Fullwood

Submitted to the Department of Linguistics and Philosophy on September 1, 2018, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Linguistics

Abstract

Segmentation of words containing non-concatenative morphology into their component morphemes, such as Arabic /kita:b/ 'book' into root /ktb/ and vocalism /i-a:/ (McCarthy, 1979, 1981), is a difficult task due to the size of its search space of possibilities, which grows exponentially as word length increases, versus the linear growth that accompanies concatenative morphology.

In this dissertation, I investigate via computational and typological simulations, as well as an artificial grammar experiment, the task of morphological segmentation in root-and-pattern languages, as well as the consequences for majority-concatenative languages such as English when we do not presuppose concatenative segmentation and its smaller hypothesis space.

In particular, I examine the necessity and sufficiency conditions of three biases that may be hypothesised to govern the learning of such a segmentation: a bias towards a parsimonious morpheme lexicon with a power-law (Zipfian) distribution over tokens drawn from this lexicon, as has successfully been used in Bayesian models of word segmentation and morphological segmentation of concatenative languages (Goldwater et al., 2009; Poon et al., 2009, et seq.); a bias towards concatenativity; and a bias against interleaving morphemes that are mixtures of consonants and vowels.

I demonstrate that while computationally, the parsimony bias is sufficient to segment Arabic verbal stems into roots and residues, typological considerations argue for the existence of biases towards concatenativity and towards separating consonants and vowels in root-and-pattern-style morphology. Further evidence for these as synchronic biases comes from the artificial grammar experiment, which demonstrates that languages respecting these biases have a small but significant learnability advantage.

Thesis Supervisor: Adam Albright
Title: Professor of Linguistics


Acknowledgments

I write these acknowledgments with minutes to go to the official deadline, as one does, but I've been composing them in my head since long before I ever wrote a word of this dissertation, because it took the help and kindness of many different people to get me to the point where I could even write a word. So here goes:

Thanks go first and foremost to Adam Albright, who has been my advisor from my first day at MIT. I'm not sure how someone's worst quality can be that they're too nice, but Adam has managed it. After hearing innumerable horror stories about other people's advisors, I've come to realise how lucky I am to have experienced his kindness, guidance, and wide-ranging knowledge of the minutiae of so many languages. I can only hope that as I go on to supervise other people in my working life, I can be half as patient and understanding as he has been with me through the years.

Thanks also to the other members of my committee, Donca Steriade and Edward Flemming, who were always there when I needed them, and provided much advice on linguistics, statistics, teaching, and life along the way. Michael Kenstowicz taught me so much about phonology through the classes I took from him and our one-on-one meetings.

Tim O'Donnell was the one who started me on the project that grew into this dissertation, and I'm very grateful to him for teaching me almost everything I know about Bayesian statistics and their applications to language.

I'll not hide that I burned out, pretty badly, for a couple of years while working on my dissertation. I'd like to thank MIT Mental Health and Counseling for giving me the tools to work through my anxiety and depression, and all of my professors for being understanding and patient during the dark times. I'd also like to thank in this regard everyone at department HQ, particularly Jen, Matt, Mary and Chrissy, for bailing me out when I missed deadlines because things felt too impossible, and for always being there with smiles and chocolate. To any very lost grad students who might stumble upon this dissertation: if sitting down to work feels incapacitating, please talk to someone. No one I know has enjoyed writing their dissertation, but it shouldn't be an object of terror.

Back to the good times: I loved sharing an office with Suyeon Yun, Yusuke Imanishi, Hrayr Khanjian, and Amanda Swenson, and the inimitable members of ling-10: Ayaka Sugawara, Coppe van Urk, Gretchen Kern, Isaac Gould, Ryo Masuda, Sam Steddy, Ted Levin, and Wataru Uegaki. Thanks also go to my extended family of officemates in Masako Imanishi, Yuika-chan, Joetaro-kun, and Sally Yun. Another bright spot was TAing for Donca Steriade and David Pesetsky. Those courses, and the bright undergrads who participated in them, rekindled my love for linguistics when it was flagging.

It may be a trope that the real treasure is the friends you made along the way, but I am very glad that my rocky journey through MIT led, directly or indirectly, to me knowing Abdul-Raza Sulemana, Alya Abbott, Anna Jurgensen, Benjamin Storme, Brian Hsu, Bronwyn Bjorkman, Chew Lin Kay, Chrissy Wheeler, Christine Riordan, Claire Halpert, Danfeng Wu, Despina Oikonomou, Diviya Sinha, Erin Olson, Eva Csipak, Fen Tung, Gaja Jarosz, Giorgio Magri, Grace Chua, Hadas Kotek, Ishani Guha, Janice Lee, Jiahao Chen, Jonah Katz, Laine Stranahan, Lauren Eby Clemens, Lilla Magyar, Louis Tee, Luka Crnic, Maria Giavazzi, Michelle Yuan, Milena Sisovics, mitcho Erlewine, Nina Topintzi, Pamela Pan, Patrick Grosz, Patrick Jones, Pooja Paul, Presley Pizzo, Pritty Patel-Grosz, Ruth Brillman, Sam Al Khatib, Sam Zukoff, Sarah Ouwayda, Sixun Chen, Snejana Iovtcheva, Sunghye Cho, Yang Lin, Yasutada Sudo, Yimei Xiang, Yoong Keok, Youngah Do, Yujing Huang, Yukiko Ueda, among many others.

Thanks also to the many who, wittingly or unwittingly, helped me procrastinate on my dissertation and kept me sane: my puzzle hunt crew, Chelsea, Tim, Jit Hin, Ahmed, Fabian, Nicholas, and Eric; fellow organisers and members of Boston Python, PyLadies Boston, and the Python community at large, including Adam Palay, Lynn Cherny, Liza Daly, Alex Dorsk, Eric Ma, Frederick Wagner, Janet Riley, Jennifer Konikowski, Jenny Cheng, Laura Stone, Leena, Lina Kim, Marianne Foos, and Thao Nguyen; my fellow mapmakers at Maptime Boston, particularly Jake Wasserman and Ray Cha; and fellow volunteers at the Prison Book Program. Muchas gracias also to colleagues at Vista Higher Learning, Cobalt Speech, and Spirit AI, who not only helped me procrastinate but paid me to do it.

Ever since I came to Boston, I've gone back home to Singapore once a year, where my friends have helped me decompress. I'd like to thank in particular the Cornell contingent (Yann Fang, Clara, Peishan, Pris, Ray, Wenshan), my CSIT "bros" (CKL, GKC, TYL, KKH, BS, OYS), and Janet.

In Singapore also are my parents, to whom eternal thanks go for simply everything. They had to leave school at 16 and 18 to start working and supporting their families, but that didn't stop them from having a start-up before start-ups were ever a thing, or from producing two daughters with PhDs either. Thank you for always ensuring we had opportunities to read and educate ourselves, while at the same time showing through your own example that you don't need higher education to be wise, and that you definitely don't need a degree to be loved.

To my sister Melissa, I send absolutely no thanks at all for finishing her PhD in three years, though I guess at least that means that between us, we have a respectable average time to completion.

Much love also to "the mob", i.e. my extended family in Singapore: Der Yao, Mema, UTH, Eleanor, Veronica, Kelvin, AWW, AMM, Uncle James, Aunty Siew Lan, Gabriel and Christine; as well as my in-laws, Candy, Dave, Amy, and all the Padowski aunts, uncles, and cousins.

Lastly, we come to my boys: Greg and our purr machines Salem and Morris, who have been here for me through my best days and my worst days, and made them all brighter just by being there. My love for you is monotonically non-decreasing.

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. 1122374.


Contents

1 Introduction

1.1 The problem of non-concatenative morphological segmentation
1.1.1 What about meaning?
1.2 Outline
1.3 Models of Arabic morphology
1.3.1 Root-based accounts
1.3.2 Non-root-based accounts
1.3.3 Evidence for the Semitic root
1.3.4 Evidence for other morphemes in Semitic
1.3.5 Our model

2 Computational Modelling
2.1 Bayesian morphological segmentation
2.1.1 The Goldwater-Griffiths-Johnson model
2.1.2 Inference
2.2 Non-concatenative generative model
2.2.1 Inference
2.3 Related work
2.4 Experiments
2.4.1 Data
2.4.2 Results
2.4.3 Error analysis: Arabic
2.4.4 Error analysis: English
2.5 Discussion and Future Work

3 Artificial Grammar Experiment
3.1 Pathological predictions
3.1.1 Predictions of Prosodic Morphology
3.1.2 Replacing gradient ALIGN with categorical ANCHOR
3.1.3 Predictions of Mirror Alignment-induced fixed rankings
3.1.4 Previous work
3.2 Experiment
3.2.1 Task
3.2.2 Stimuli
3.2.3 Controlling for alternative testing strategies
3.2.4 Participants
3.3 Results
3.3.1 Descriptive analysis
3.3.2 Statistical analysis
3.4 Discussion
3.5 Conclusion

4 Towards a new model
4.1 The concatenative bias
4.1.1 Implementing the concatenative bias in the model
4.1.2 Implementing the concatenative bias in Optimality Theory
4.1.3 Implementing the concatenative bias in MaxEnt models
4.2 CV bias
4.2.1 Implementing the CV bias in the model
4.2.2 Implementing the CV bias in Optimality Theory
4.2.3 Biases: Implications and Future Work
4.3 Spreading and truncation
4.3.1 Implementing spreading and truncation in the model
4.3.2 Inference over this model
4.4 Memory limitations
4.5 Conclusion

A Formal specification of the generative model


Chapter 1

Introduction

One of the many tasks faced by language learners is morphological segmentation, the task of breaking down complex words into their component morphemes.

In the vast majority of languages, morphology is mostly concatenative, meaning that morphemes are joined together linearly, as in the example of learners → learn + er + s. One class of languages, the Semitic family being the most prominent, displays a different kind of morphology, in which morphemes can be discontinuous and are interleaved together to form a stem. The following list of words from Arabic is a classic example. There is no contiguous sequence of segments that all the words share, yet they are all related. Instead, they share a discontinuous subsequence ktb.

(1) Partial list of Arabic words containing the discontinuous subsequence ktb

kataba 'he wrote'                 kutiba 'it was written'
yaktaba 'he is writing'           yuktabu 'it is being written'
kattaba 'he caused to write'      kuttiba 'it was caused to be written'
kaataba 'he corresponded (with)'  kuutiba 'it was corresponded (with)'
kitaab 'book'                     kutub 'books'
kaatib 'writer'                   kaatibuun 'writers'
maktab 'office'


Traditionally, this triconsonantal subsequence has been analysed as a morpheme in its own right, called the "root" (though some theories disagree; this will be discussed in further detail in section 1.3.2).

If it is a morpheme, then the learner needs to be able to segment it out. However, morphological segmentation with non-concatenative morphology is an exponentially more difficult task than with concatenative morphology, as the following section will demonstrate.

1.1 The problem of non-concatenative morphological segmentation

Since the seminal work of Saffran et al. (1996a), which showed that infants could use transition probabilities between phones to segment words, one prominent algorithm for how sentences can be segmented into words and words into morphemes has been to count these transition probabilities (TPs) and postulate boundaries where the TP is low.
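As a rough illustration of the TP idea (a small sketch of my own, not code from the dissertation), transition probabilities between adjacent segments can be estimated by simple counting:

```python
from collections import Counter

def transition_probs(corpus):
    """Estimate P(next segment | current segment) from a list of words (a sketch)."""
    bigrams, unigrams = Counter(), Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}

# Boundaries would then be postulated where the TP between adjacent
# segments is low, e.g. at local minima within a word.
```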

Counting TPs between adjacent segments cannot account for non-concatenative morphology, however. Follow-up experiments by Newport and Aslin (2004) suggest that infants and adults can use vowel-to-vowel and consonant-to-consonant transition probabilities to perform word segmentation. Given that in the Semitic languages the root is consonantal and interleaved with a vocalism, this is a possible escape hatch for morphological segmentation. Such counting presupposes, however, knowledge about consonants and vowels. But to what extent is this knowledge necessary to perform successful morphological segmentation in non-concatenative languages?

An alternative approach for word and morphological segmentation is not to concentrate on identifying boundaries, but to optimise the lexicon of words and morphemes that results. This is the general approach used in Bayesian and Minimum Description Length approaches to segmentation (Goldwater et al., 2009;


Goldsmith, 2000). Such approaches look to place biases on the types of lexicon that result, requiring them to be parsimonious and, in the case of Bayesian segmentation, to have a power-law distribution over tokens. Experiments comparing the predictions of the Bayesian approach to the TP-based approach have shown that the predictions of Bayesian word segmentation better accord with the performance of human learners (Frank et al., 2010a,b).

Most experiments in Bayesian morphological segmentation have concentrated on concatenative languages such as English, Mandarin Chinese, and Japanese (e.g. Xu et al. (2008); Mochihashi et al. (2009)). When Semitic languages have been examined (e.g. Poon et al. (2009)), morphological segmentation was applied to undiacriticised written forms of the language, which do not render short vowels, indicators of consonant doubling, or case endings. The words above will thus be presented as:

(2) Partial list of Arabic words in (1), undiacriticised representation

ktb 'he wrote' ktb 'it was written'

yktb 'he is writing' yktb 'it is being written' ktb 'he caused to write' ktb 'it was caused to be written' ktb 'he corresponded (with)' ktb 'it was corresponded (with)'

ktab 'book' ktb 'books'

katb 'writer' katban 'writers'

mktb 'office'

Undiacriticised texts make Semitic languages look mostly concatenative. Where long vowels do intervene in the root as in ktab, this is not an issue for engineering-oriented approaches to morphological segmentation as these words have different base meanings.

Now, suppose we want to tackle the full problem of morphological segmentation on the Semitic languages, and we choose to take the Bayesian approach of using biases on the lexicon to optimise our segmentation. Then we come up against a fundamental problem of non-concatenative morphology, which is that it presents an extraordinarily large search space of solutions to segmentation.

Suppose we know that a word with n segments consists of two morphemes, and that it has been formed purely concatenatively. Then the problem of morpheme segmentation reduces to figuring out where to place the boundary between the two morphemes. There are only n - 1 possibilities for its placement.

(3) [Diagram: the n - 1 possible boundary positions in a four-segment word, e.g. c|a|t|s]

Let us be a little more concrete by casting the search space in terms of templates interspersing segments of morpheme 1 (represented by r) and morpheme 2 (represented by s).

(4) Search space for n = 4, assuming concatenative morphology: there are 4 - 1 = 3 possibilities.

a. rrrs
b. rrss
c. rsss

When we can choose to interleave two morphemes in any order, however, the number of possibilities for how to segment the word goes up from n - 1 to 2^(n-1) - 1.

(5) Templatic search space for n = 4: there are 2^(4-1) - 1 = 7 possibilities.[1]

a. rrrs
b. rrss
c. rsss
d. rrsr
e. rsrr
f. rsrs
g. rssr

[1] Due to symmetry, sssr is equivalent to rrrs.

Not only are the non-concatenative templates a superset of the concatenative templates, but as the length of the word increases, the number of possible segmentations of the word into morphemes increases exponentially if we assume non-concatenative morphology, versus linearly if the morphemes are concatenated.

(6) Size of the search space when the mechanism of word formation is concatenation, versus interleaving.

Length Concatenation Interleaving

4 3 7

5 4 15

6 5 31

7 6 63
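The contrast in the table can be reproduced with a couple of lines of arithmetic; the following small script (mine, for illustration only) prints the two columns:

```python
def concatenative_count(n: int) -> int:
    """Number of two-morpheme segmentations when morphemes are contiguous."""
    return n - 1

def interleaving_count(n: int) -> int:
    """Number of two-morpheme r/s templates over n slots, up to symmetry."""
    return 2 ** (n - 1) - 1

for n in range(4, 8):
    print(n, concatenative_count(n), interleaving_count(n))
# 4 3 7
# 5 4 15
# 6 5 31
# 7 6 63
```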

As in the transition probability-based approach, we can reduce this non-concatenative search space using our knowledge that roots are mostly consonantal. But how necessary is this assumption to successful morphological segmentation?

A separate question that arises is what searching the larger non-concatenative space of solutions would do to languages with concatenative morphology, which can successfully be segmented to a high degree of accuracy when we restrict their search space to concatenation only. Will the Bayesian approach then fail, and if so, how can we implement a morphological learner that is robust enough to acquire both concatenative and non-concatenative root-and-template morphology?

We will examine all these issues in this dissertation.

1.1.1 What about meaning?

Our models and experiments will involve no reference to meaning. All words are presented as individual units, with no information about possible relations between them such as shared meaning, shared features, or being members of the same paradigm. The only information available is the segments in each word.


morphological segmentation. So far, this work has largely concentrated on concatenative languages, and has, as mentioned above, been quite successful. In particular, Bayesian word and morphological segmentation represent the state of the art in the field. If the bias towards parsimonious lexicons yields largely correct segmentations in concatenative languages, then we are interested in seeing if the same holds true in non-concatenative languages. If not, what additional biases do we need to incorporate for segmentation to be successful?

Secondly, evidence from artificial grammar experiments in infants and adults shows that word and morphological segmentation occur early and are possible without access to meaning. In addition to the word segmentation results reviewed above, Mintz (2004) found that 15-month-old infants acquiring English are able to separate the verbal suffix -ing from pseudo-roots, while Marquis and Shi (2012) found that 11-month-olds acquiring French could segment an artificial suffix -u from a set of nonce roots, among others.

Lastly, I note that there is a growing trend towards simultaneous modelling of multiple levels of language acquisition, such as joint phonetic category induction and lexical learning in Feldman et al. (2013), with the addition of modelling phonetic variability in Elsner et al. (2013). Such modelling almost always shows that joint learning boosts the performance of learning at each level, and thus a future direction for non-concatenative morphological learning would be to add meaning into the mix. However, for now, my focus is on what the necessary and sufficient biases are for segmenting roots and patterns in non-concatenative morphology.

1.2 Outline

This dissertation examines the problem of non-concatenative morphological segmentation, with a focus on root-and-pattern morphology of the sort most famously seen in the Semitic languages. Particular questions I aim to answer are:


a. What biases are necessary and sufficient for a learner to segment Arabic morphology?

b. In particular, is a Bayesian learner that implements a parsimony bias sufficient to learn to segment Arabic morphology into roots and non-root material?

(i) If not, what elements does it learn?

(ii) Does it require a bias towards certain morpheme shapes, such as all-consonant roots and all-vocalic residues?

c. What happens to English when we allow a morphological learner to explore non-concatenative segmentations?

(i) Does an English learner require a bias towards concatenativity?

(ii) Do English speakers in an artificial grammar experiment display biases towards concatenativity and against interleaving morphemes that consist of mixtures of consonants and vowels?

In Chapter 2, which is extended and updated from previously published work with Timothy O'Donnell (Fullwood and O'Donnell, 2013), I present results from a computational simulation of unsupervised segmentation of Arabic verbal stems. We adapt the Bayesian segmentation approach of Goldwater et al. (2009) to the non-concatenative case by switching out its generative model, a procedure by which words are generated, for one that can handle non-concatenative morphology.

This model is informed by existing accounts of Arabic morphology, which is the subject of the remainder of Chapter 1. After exploring both root- and non-root-based accounts of how Arabic words are structured, we introduce a simplified, pared-down root-based model, which is then adapted into a hierarchical Bayesian model in Chapter 2. The reason for this simplification is two-fold: one, to render the model computationally tractable; and two, to investigate what assumptions are necessary to learn templatic morphology. This is where we answer the question of the sufficiency of the parsimony bias and the question of a bias towards particular morpheme shapes. We will find that there is sufficient statistical information in our corpus of Arabic verbs to learn to segment out the root, without requiring any additional biases in the form of additional linguistic knowledge such as the consonant-vowel distinction.

In Chapter 2, we not only look at how well the model does at learning Arabic morphology, but also at how well it does on an equivalent set of English verbal stems, when the hypothesis space is expanded to include non-concatenative segmentations. We find that the model performs poorly on English, despite theoretically being capable of handling both concatenative and non-concatenative morphology. This suggests that for English, the assumption that the primary mode of word formation is concatenation is crucial for proper segmentation, and that learners may have a bias towards concatenativity.

In Chapter 3, I look further at the problem of concatenativity, as well as questions raised by overgeneration of unattested templates by the model. I show that these pathological predictions are not merely the result of a linguistically deficient model, but surface in Prosodic Morphology (McCarthy and Prince, 1993b) as well.

I report an Artificial Grammar experiment designed to test for synchronic biases for concatenativity and preference for consonantal roots and vocalic residues in root-and-pattern morphology via three "Martian languages".

(8) a. English-like: CVC-CV

b. Arabic-like: CVCCV where root=CCC, residue=VV

c. Unattested: CVCCV where root=CCV (1st, 3rd, 5th segment) and residue=VC (2nd, 4th segment)

[Diagram: root and residue affiliations for the three languages in (8). In (8a), the root spans C1V1C2 and the residue C3V2; in (8b), the root spans C1C2C3 and the residue V1V2; in (8c), the root spans C1C2V2 and the residue V1C3.]

We find evidence from this experiment for synchronic biases towards concatenativity and division of the stem into consonantal roots and vocalic residues, which would reduce the degree of pathological overgeneration of unattested patterns.


The final chapter, Chapter 4, discusses linguistic issues that arise from the results of Chapters 2 and 3, and looks at how to integrate the findings of Chapter 3 into the model, specifically, how the model may be modified to incorporate the proposed biases.

1.3 Models of Arabic morphology

Descriptively, Arabic verbs are formed from a set of consonants, called the "root", which conveys a semantic field such as "writing", and a vocalism that conveys voice (active/passive) and aspect (perfect/imperfect). The verbal stem is formed by interleaving the root and vocalism, and sometimes adding further derivational prefixes or infixes. To this stem, inflectional affixes indicating the subject's person, number and gender are concatenated.

There are nine common forms of the Arabic verbal stem, also known by the Arabic grammatical term wazn and the Hebrew grammatical term binyan. In (9), /fʕl/ represents the triconsonantal root. Only the perfect forms are given.

(9) List of common Arabic verbal binyanim

Form    Active        Passive
I       faʕal         fuʕil
II      faʕʕal        fuʕʕil
III     faaʕal        fuuʕil
IV      ?afʕal        ?ufʕil
V       tafaʕʕal      tufuʕʕil
VI      tafaaʕal      tufuuʕil
VII     ?infaʕal
VIII    ?iftaʕal      ?iftuʕil
X       ?istafʕal     ?istufʕil


For example, Form II verbs are generally causatives of Form I verbs, as is kattab "to cause to write" (cf. katab "to write"). However, as is often the case with derivational morphology, these semantic associations are not completely regular: many forms have been lexicalized with alternative or more specific meanings.

1.3.1 Root-based accounts

The traditional Arab grammarians' account of the Arabic verb was as follows: each form was associated with a template with slots labelled C1, C2 and C3, traditionally represented with the consonants f, ʕ, l, as described above. The actual root consonants were slotted into these gaps. Thus the template of the Form V active perfect verb stem was taC1aC2C2aC3. This, combined with the triconsonantal root, made up the verbal stem.

(10) Traditional analysis of the Arabic Form V verb

Root:     f ʕ l
Template: t a C1 a C2 C2 a C3

The first generative linguistic treatment of Arabic verbal morphology (McCarthy, 1979, 1981) adopted the notion of the root and template, but split off the derivational prefixes and infixes and the vocalism from the template. Borrowing from the technology of autosegmental phonology (Goldsmith, 1976), the template was now composed of C(onsonant) and V(owel) slots. Rules governing the spreading of segments ensured that consonants and vowels appeared in the correct positions within a template.


Under McCarthy's model, the analysis for [tafaʕʕal] would be as follows:

(11) McCarthy analysis of the Arabic Form V verb

Prefix:      t
Root:        f ʕ l
CV Template: C V C V C C V C
Vocalism:    a

[Association lines link the prefix, root consonants, and vocalism to the C and V slots of the template.]

While increasing the number of morphemes associated with each verb, the McCarthy approach economized on the variety of such units in the lexicon. The inventory of CV templates was limited to those generated by the following rules:

(12) Rules generating (and limiting) verbal templates (McCarthy, 1981)

a. (C(V))CV([+seg])CVC

b. V → ∅ / [CVC _ CVC]

Further, there were only three vocalisms, corresponding to active and passive voice intersecting with perfect and imperfect aspect; and only four derivational prefixes (/?/,/n/,/t/,/st/), one of which becomes an infix via morphophonological rule in Form VIII.

1.3.2 Non-root-based accounts

Hockett (1954) divides morphological theories into two categories, item-and-arrangement and item-and-process. The theories above fall more or less into the item-and-arrangement side of the divide. However, there are also theories of word formation in Semitic that fall more into the item-and-process category, wherein some word in a paradigm is the base, which is then morphed into other members of the paradigm by some grammatical process. Because the relation here is word-to-word, there is no need to refer to a root.


I will discuss two such non-root accounts of Semitic morphology in particular, Bat-El (1994), in which the grammatical process is stem modification and vowel overwriting, and McCarthy and Prince (1990), in which iambic prosodic templates are overlaid onto singular Arabic nouns to produce broken plurals.

These accounts are both motivated by transfer phenomena, in which a property of the base transfers over to the derived form. Bat-El (1994) looks at cluster transfer in Hebrew denominal verbs. In such verbs, if they have 5 or more consonants, there are several possible CV templates they can map to. However, which one is chosen cannot be determined from the root alone. Instead, the template that maintains clusters present in the base noun is selected.

(13) Clusters are maintained between noun bases and derived denominal verbs in Hebrew (Bat-El, 1994)

a. /praklit/ 'lawyer' → /priklet/ 'to practice law'

b. /traklin/ 'salon, parlour' → /triklen/ 'to make something neat'

c. /xantarif/ 'nonsense' → /xintref/ 'to talk nonsense'

d. /?abstrikti/ 'abstract' → /?ibstrekt/ 'to abstract'

e. /stenograf/ 'stenographer' → /stingref/ 'to take shorthand'

Bat-El (1994) proposes the following process for converting nouns into denominal verbs: first, a bisyllabic template is imposed on the noun. As much of the segmental material of the noun as can be mapped onto those syllables, working from the outside edge in, is syllabified. Afterwards, the vocalic pattern /i,e/ overwrites the original vowels. Lastly, any unsyllabified material is erased.


(14) Bat-El (1994)'s three-step process for deriving Hebrew denominal verbs

a. Maximal syllabification, working edge-in:  x a n t a r i f

b. Vocalic overwriting with /i, e/

c. Delete unsyllabified segments:  x i n t r e f

Clusters are maintained under this framework because there is no mechanism for vowels to be epenthesised between them at any step of the process. No reference to a root is necessary at any point in the derivation.

Similarly, McCarthy and Prince (1990) are concerned with the transfer of vowel length in the final syllable of Arabic nouns in the singular and plural form.

(15) a. /xaatam/ → /xawaatim/ 'signet ring(s)'

b. /jaamuus/ → /jawaamiis/ 'buffalo'

McCarthy and Prince (1990) propose that in the majority of Arabic broken plurals, an iambic template is imposed on the initial syllable of the singular noun, and a vowel overwriting process imposes the vocalic melody /i/ on the final syllable, maintaining the existing length. Again, no reference to the root is necessary at any point in this derivation.

In and of themselves, these are not necessarily arguments against the root. In Optimality Theory, there are input-output faithfulness constraints and output-output faithfulness constraints (Benua, 1997). The transfer phenomena described here could be the result of phonological output-output faithfulness constraints, applied to words formed morphologically from a root-and-template-based interleaving. Furthermore, Aronoff (1994) studies several base-derivative relations in Hebrew and concludes that it is impossible to stipulate any given form as the base, disputing accounts of transfer phenomena that rely on derivations from a base form.


1.3.3 Evidence for the Semitic root

Despite being contested in Ratcliffe (1997); Bat-El (1994, 2003); Ussishkin (2003) and others, the root retains a healthy status in modern linguistic theory, because of a considerable body of evidence, largely external, that has built up in favour of its existence. Prunet (2006) gives a comprehensive overview of this corpus of evidence; here I will briefly sketch a few key pieces.

Much evidence has come from priming experiments, showing that a word with a given root primes other words sharing the same root, to a greater degree than would be expected due to phonological or semantic similarity (Frost et al., 1997; Boudelaa and Marslen-Wilson, 2004, et seq.).

Furthermore, evidence from metatheses in language games (Bagemihl, 1989; McCarthy, 1981, et seq.), slips-of-the-tongue (Berg and Abd-El-Jawad, 1996), and aphasic errors (Prunet et al., 2000, et seq.) shows that when the consonants of a Semitic stem are metathesised, it is invariably the root consonants that get metathesised and not consonants belonging to the template or affix.

All this suggests that roots are psychologically real units of the mental lexicon, on a separate level from the template and affixes. On the basis of this evidence, the linguistic model we will adopt is derived from the root-based approaches outlined in section 1.3.1.

1.3.4 Evidence for other morphemes in Semitic

Much of the processing literature addressing Semitic morphology has focused on investigating the psychological reality of the root, as opposed to the other morphemes proposed by McCarthy (1979, 1981), namely the vocalism and the CV-template.

One exception to this is Boudelaa and Marslen-Wilson (2004), a psycholinguistic study using the priming paradigm that tests the place of these two morphemes in the organisation of the Arabic mental lexicon. Boudelaa and Marslen-Wilson find that words with a certain CV-skeleton primed other words sharing the same CV-skeleton, even when they have few to no segments in common. Conversely, sharing a vocalism does not cause priming.

(16) a. /nuuqiʃ/ "be discussed" (CV-skeleton: CVVCVC)
primes
/saaʕad/ "help" (CV-skeleton: CVVCVC)

b. /ʒaazaf/ "act blindly" (Vocalism: /a/)
does not prime
/?indaʕar/ "be wiped out" (Vocalism: /a/)

(Boudelaa and Marslen-Wilson, 2004)

Faust (2012), working in the framework of Distributed Morphology, suggests that the vocalism, at least in Modern Hebrew, should be split into two positions, V1 between the first two consonants of the root and V2 between the last two consonants of the root, with each position having different DM realisation rules.

Our model will remain agnostic with respect to the status of the vocalism, and indeed will not attempt to model affixation either. Instead, we will combine all non-root elements of the stem into what we term the "residue", as the following section describes. We retain the notion of a template that dictates how the root and residue are interleaved, though it will not be a CV-skeleton; nevertheless, it serves the same purpose and could induce the same priming effect found in Boudelaa and Marslen-Wilson (2004).

1.3.5 Our model

Our model of root-and-template morphology adopts McCarthy's notion of an abstract template, but reduces the number of segment-bearing tiers to just two: the root and the residue. Their interleaving is dictated by a template with slots for root and residue segments. We also consider this a morpheme.


(17) Breakdown of the Form VIII verb ?iftaʕal under our approach:

Root:     f ʕ l
Template: s s r s s r s r
Residue:  ? i t a a

Clearly, there are deficiencies to this rather simplistic model of Arabic morphology. Not only are prefixes and suffixes not modelled, but other crucial elements of the McCarthy analysis, such as spreading when there are more consonantal slots than consonants and more vocalic slots than vowels, are not modelled.

Thus, while under the McCarthy analysis the vocalism of katab and kaatab is /a/, with spreading of the single segment to all two or three vocalic slots of the CV template, in this model they are /aa/ and /aaa/ respectively, with no sharing taking place. Similarly, Arabic doubled roots such as smm, which are biconsonantal under McCarthy's analysis, have to be modelled as underlyingly triconsonantal.

Nor do we model segmental alternations, which occur when one or more of the consonants of the root is a glottal stop or semi-vowel. Thus, in this first pass at Arabic morphology, we deal only with the "sound" roots of Arabic - those with three consonants that do not induce segmental alternations, and we leave the modelling of spreading to future models.

Furthermore, though this model was inspired by root-based approaches, there is nothing in particular forcing the root to be one of the morphemes that is found. Since there is no concept of consonants or vowels built into the model, in learning structure from words, a learner may posit a "root" that consists of a mixture of consonants and vowels. What is found in our computational experiment in Chapter 2 will be driven entirely by the statistical strength of various possible morphological segmentations.

While this model is clearly inadequate to capture the nuances of Arabic or any other language's morphology - lacking the additional prefixes and suffixes that transform a stem into a word, for example, or any concept of spreading, which is crucial to the analysis in McCarthy (1981) - this is a deliberate choice: our goal in performing this simulation is to evaluate how much machinery a model needs to successfully learn the commonly-accepted units of Arabic words.

Though the model above is inspired by templatic morphology, concatenative morphology can theoretically also be captured in this framework by grouping all the root segments on one side and all the residue segments on the other.

(18) Breakdown of the English verb cooked [kukt] under our approach:

Root:     k u k
Template: r r r s
Residue:  t
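To make the template notation concrete, here is a minimal sketch (mine, not the dissertation's) of how an r/s template carves a word into root and residue; split_by_template is an illustrative helper name:

```python
def split_by_template(word, template):
    """Split a word into (root, residue) according to an r/s template (a sketch)."""
    root = "".join(seg for seg, slot in zip(word, template) if slot == "r")
    residue = "".join(seg for seg, slot in zip(word, template) if slot == "s")
    return root, residue

# Matching the breakdowns in (17) and (18):
# split_by_template("?iftaʕal", "ssrssrsr")  ->  ("fʕl", "?itaa")
# split_by_template("kukt", "rrrs")          ->  ("kuk", "t")
```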

We will use this in Chapter 2 to examine what happens to English morphological segmentation when we allow for non-concatenative segmentations.


Chapter 2

Computational Modelling

In this chapter, we perform a computational modelling experiment on unsupervised morphological segmentation of Arabic and English verbs, based on the theoretical model of Arabic morphology presented in section 1.3.5.

The overall goal of this computational experiment is to figure out whether this somewhat impoverished model, which lacks any prior knowledge of possible structures of existing templates, or even about the distinction between consonants and vowels, can successfully segment Arabic, or whether some amount of prior knowledge is necessary. In addition, we test whether English can be successfully segmented when we expand its hypothesis space to include non-concatenation.

This experiment is situated within the general approach of Bayesian segmentation (Goldwater et al., 2009), which recasts segmentation as the task of finding an optimal lexicon, where optimality is defined by a probabilistic model that we build up in accordance with our intuitions about what constitutes a "good" segmentation.

2.1 Bayesian morphological segmentation

Suppose we are given the following dataset of words and a choice of three morphological segmentations:


(1) Three possible segmentation hypotheses for dataset D

Hypothesis 1        Hypothesis 2    Hypothesis 3
t-r-a-i-n-e-d       trained         train-ed
t-e-s-t-e-d         tested          test-ed
t-r-a-i-n-i-n-g     training        train-ing
t-e-s-t-i-n-g       testing         test-ing
u-n-t-r-a-i-n-e-d   untrained       un-train-ed

The core insight of Bayesian segmentation is that a segmentation automatically induces a morpheme lexicon, and some lexicons are more likely than others.

(2) Three morpheme lexicons L induced by the segmentations in (1)

Lexicon 1 Lexicon 2 Lexicon 3

t trained train

r tested test

a training ed

i testing ing

n untrained un

We can define a hierarchical generative model of how morphemes and words are constructed. This is a probabilistic model, with a probability distribution placed on every choice we make in constructing each morpheme and word. Each hypothesis will then have a concrete probability that is the product of the probabilities of each of these choices. Hypothesised lexicons associated with over- and under-generation, as in Hypothesis 1 and Hypothesis 2, will receive a lower probability under the model, while lexicons that more closely match our intuitions about segmentation, as in Segmentation/Lexicon 3, will receive a higher probability.


2.1.1 The Goldwater-Griffiths-Johnson model

Let us begin with the generative model of the Goldwater-Griffiths-Johnson (henceforth GGJ) approach to word segmentation (Goldwater et al., 2009), adapted to concatenative morphological segmentation by Lee (2012).[1]

We build up the morpheme lexicon and word dataset as follows. For each word, we decide how many morphemes n it will contain by drawing a value from a Poisson distribution with a small λ parameter. This has the following shape:

(3) Poisson distribution P(x) = λ^x e^(-λ) / x!, for λ = 1.

[Plot: the Poisson probability mass function for λ = 1, plotted against the number of morphemes in a word; the probability falls off sharply as the number of morphemes grows.]

Notice how quickly the probability drops as the number of morphemes gets large. This reflects our assumption that words should not consist of a large number of morphemes, putting lower probability on grammars that over-segment the dataset.
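For instance, a quick check of the Poisson probabilities at λ = 1 (a small illustrative script of my own, not from the dissertation) shows how sharply the mass decays:

```python
import math

def poisson_pmf(k: int, lam: float = 1.0) -> float:
    """P(X = k) for a Poisson distribution with rate parameter lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

for k in range(1, 7):
    print(k, round(poisson_pmf(k), 4))
# 1 0.3679, 2 0.1839, 3 0.0613, 4 0.0153, 5 0.0031, 6 0.0005
```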

Suppose we sample n as 2. Our word will then be composed of two morphemes, and we now need to decide what forms these will take. For each of the n morphemes that will make up the word, we draw a lexical entry from a morpheme lexicon.

Every time we draw a token from the lexicon, we will have a choice of either (a) drawing an existing morpheme from the current lexicon, or (b) creating an entirely new morpheme and adding it to the existing lexicon. Each of these is associated with a probability:

[1] Aside from replacing utterances with words and words with morphemes, the primary change is the addition of a Poisson distribution on the number of morphemes, based on work by Xu et al. (2008).

(4) a. Draw an existing morpheme m_i with probability (y_i - a)/(N + b), where:

(i) y_i is the number of times the morpheme has previously been drawn, i.e. its token frequency

(ii) N is the total number of morphemes already drawn

(iii) a, b are hyperparameters on the distribution; 0 < a < 1 and b > -a.

b. Draw a new morpheme with probability (Ka + b)/(N + b), where:

(i) K is the existing size of the morpheme lexicon

(ii) N, a, b are as defined above.

What I have just described is a probability distribution commonly used in Bayesian non-parametric statistics called the Pitman-Yor Process (Pitman and Yor, 1995). Let us briefly verify that this is indeed a probability distribution in that the sum of these probabilities adds up to 1:

(5) Pitman-Yor probabilities sum up to 1:

sum_{i=1}^{K} (y_i - a)/(N + b) + (Ka + b)/(N + b) = (N - Ka)/(N + b) + (Ka + b)/(N + b) = (N - Ka + Ka + b)/(N + b) = 1

As a concrete example of drawing a morpheme from the Pitman-Yor process, consider the fourth word in our dataset D, testing. Suppose we have already chosen our first three words as in Segmentation 3, so that the existing state of the morpheme lexicon is as follows:


(6) State of the morpheme lexicon having already selected the words train-ed, test-ed, train-ing (N = 6, K = 4).

Morpheme     Existing tokens      Current probability of drawing this morpheme
train        trained, training    (2 - a)/(6 + b)
ed           trained, tested      (2 - a)/(6 + b)
test         tested               (1 - a)/(6 + b)
ing          training             (1 - a)/(6 + b)
(new type)                        (4a + b)/(6 + b)

With probability (1 - a)/(6 + b), we choose the existing morpheme test. This updates the morpheme lexicon to the following:

(7) State of the morpheme lexicon having already selected the words train-ed, test-ed, train-ing, and the first morpheme of test-ing (N = 7, K = 4).

Morpheme     Existing tokens      Current probability of drawing this morpheme
train        trained, training    (2 - a)/(7 + b)
ed           trained, tested      (2 - a)/(7 + b)
test         tested, test-??      (2 - a)/(7 + b)
ing          training             (1 - a)/(7 + b)
(new type)                        (4a + b)/(7 + b)

With probability (1 - a)/(7 + b), we choose the existing morpheme ing, and, having reached the desired number of morphemes, n = 2, we concatenate them to form the word testing.
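The draw just described can be written down compactly. The following is a minimal sketch of a single Pitman-Yor draw over the morpheme lexicon; draw_morpheme and new_morpheme are illustrative names of my own, not code from the dissertation.

```python
import random
from collections import Counter

def draw_morpheme(counts, a, b, new_morpheme):
    """One draw from a Pitman-Yor process over a morpheme lexicon (a sketch).

    counts maps each morpheme type to its token frequency y_i; N is the
    total number of tokens drawn so far and K the number of types.  An
    existing morpheme is returned with probability (y_i - a)/(N + b), and
    a brand-new one, supplied by new_morpheme(), with the leftover mass
    (Ka + b)/(N + b).
    """
    N = sum(counts.values())
    u = random.random() * (N + b)
    for morpheme, y in counts.items():
        u -= y - a
        if u < 0:
            counts[morpheme] += 1
            return morpheme
    fresh = new_morpheme()          # lands in the (Ka + b) remainder
    counts[fresh] += 1
    return fresh

# Starting from the lexicon in (6):
lexicon = Counter({"train": 2, "ed": 2, "test": 1, "ing": 1})
# a call like draw_morpheme(lexicon, 0.5, 1.0, lambda: "new") returns
# "test" with probability (1 - a)/(6 + b), as in the text.
```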

Now let us take a closer look at the probabilities associated with drawing morphemes from the lexicon and figure out what assumptions they reflect.

The probability of choosing an existing morpheme, (y_i - a)/(N + b), has the existing token frequency of that token in the numerator. Thus, the larger the token frequency, the more likely that morpheme is to be drawn again; this rich-get-richer dynamic is what gives rise to power laws. This biases the model towards choosing lexicons that have a roughly power-law (Zipfian) distribution (Goldwater et al., 2009), as we observe in natural language lexicons.

Furthermore, as N grows larger, the probability (Ka + b)/(N + b) of creating a new morpheme grows smaller, reflecting our confidence that as we observe more and more of the dataset, we have seen a large subset of the morphemes, and require more extraordinary evidence to propose a new lexical entry in our morpheme lexicon. Overall, this means that more compact morpheme lexicons will have a higher probability than very expansive ones.

Next, we generate the word untrained. Having picked n = 3, as this will be generated as un-train-ed, we now go down the choice point of creating an entirely new morpheme. We then have to generate the form of the new morpheme, which we do by drawing each segment from a uniform distribution over the segments. If the size of the segment inventory is |S|, the probability of drawing any one segment is 1/|S|. This is repeated as many times as we have segments l in the morpheme, and thus:

(8) Probability of drawing a morpheme of length l = (1/|S|)^l

The longer the morpheme, the smaller its probability; this disfavours under-segmentation, which results in very long morphemes such as untrained.

Once we complete the process of drawing every morpheme and composing them into words, we will have a morpheme lexicon L and a dataset of words D. The joint probability of this generative process is the product of each and every one of the choices we made in generating our words: POISSON(n) for each word that was split into n morphemes (tokens), the Pitman-Yor probabilities for each choice of token, and the probability of generating the actual segmental sequence of each lexical entry in the morpheme lexicon. This probability is thus the product of the probability of the lexicon, P(L), and the probability of drawing our dataset D from the lexicon, P(D|L). Since the lexicon is equivalent to our segmentation hypothesis H, we'll re-label these P(H) and P(D|H).
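Written out schematically (the notation here is mine, not the dissertation's), the joint probability multiplies together every one of these choices:

P(H, D) = Π_{w ∈ D} Poisson(n_w; λ) × Π_{t=1..N} P_PY(m_t | m_1, ..., m_{t-1}) × Π_{m ∈ L} (1/|S|)^{l_m}

where m_t ranges over the morpheme tokens in the order they were drawn, and l_m is the length of lexical entry m.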

Looking back at the three segmentation hypotheses at the start of this chapter, we see that the over- and under-segmentation hypotheses are each disfavoured by two elements of the model:

Over-segmentation, such as decomposing training into t-r-a-i-n-i-n-g, is penalised by the Poisson process used to pick the number of morphemes in each word, as it would require selecting a high n for each word every time. In addition, every time we pick a morpheme to append to our word, we incur the cost of drawing yet another token. Since we have to draw from our lexicon for many morphemes in each word, the overall probability of our P(D|H) is small.

Meanwhile, the under-segmentation hypothesis is disfavoured by the form of the Pitman-Yor probability of choosing a new morpheme, (Ka + b)/(N + b). With such a small number of words, this might seem to favour the over-segmentation hypothesis, but consider what happens when we extend the dataset to a larger number of words. For instance, the CELEX database (Baayen et al., 1995) contains only 54 phones but 160,595 words. Under-segmented hypotheses will result in overly large lexicons, which are disfavoured by this generative model. Furthermore, under-segmented hypotheses will result in long morphemes, which are disfavoured by how we modelled the selection of segments that comprise a new lexical entry.

Ultimately, because we have defined our generative model such that it is biased in favour of more compact lexicons, each item of which is reused many times, but at the same time so that each word does not contain inordinately many morphemes, a lexicon like the one in Hypothesis 3 will prove to be best, in the tradition of Goldilocks: not too many or too few morphemes in the lexicon (there are approximately 17,500 unique morphemes in CELEX); words whose individual morphemes are neither too long nor too short; and words that are not composed of too many morphemes (on average, just 2 per word, again according to CELEX). Such a lexicon will have the highest P(H|D), threading the line between over- and under-segmentation, optimally balancing novelty and reuse.


2.1.2 Inference

Now, our goal is really to obtain hypotheses H with high posterior probability P(H|D): the probability of a segmentation hypothesis, given our dataset of words. But by Bayes' rule:

(9) P(H|D) = P(D|H)P(H) / P(D)

We can drop the denominator, as our dataset is supplied to us and so its probability is irrelevant. Thus:

(10) P(H|D) ∝ P(D|H)P(H)

The optimal segmentation is the H that maximises P(H|D):

(11) argmax_H P(H|D) = argmax_H P(D|H)P(H)

the right-hand side of which is simply the joint probability defined by our generative model.

But how do we calculate this probability over all the possible segmentation hypotheses H? The short answer is that we cannot: for any dataset of a reasonable size, there are so many possible segmentations that this is prohibitively computationally expensive. Instead, we will conduct inference by drawing samples from the distribution P(H|D).

For this we will employ a class of sampling algorithms called Markov Chain Monte Carlo (MCMC). Since this is an implementational detail (though by far the most challenging part of the simulation, engineering-wise), I will only give a brief, informal overview of the two most important algorithms in this class as applied to the problem of concatenative morphological segmentation.

The first algorithm is Gibbs sampling (Geman and Geman, 1984), which was applied by Goldwater et al. (2009) to perform inference on the GGJ model.


Gibbs sampling begins with a random segmentation of each word in the dataset, which yields a morpheme lexicon with tokens assigned to each lexical entry.

Next, for each word, we consider each boundary between two segments and decide whether or not to place a morpheme boundary at that spot. Let us call this variable bij when we are considering the ith word of the dataset and the jth phone boundary of word i. bij =1 when a morpheme boundary is placed there; bij = 0 otherwise. For example, for the word untrained, we have the following possible boundaries:

(12) Variables to be considered for word 5, untrained, currently segmented as un-t-rain-ed.

u n t r a i n e d

b51 = 0   b52 = 1   b53 = 1   b54 = 0   b55 = 0   b56 = 0   b57 = 1   b58 = 0

Let's say that we are at a stage where the other words have all already been segmented correctly. Thus our current morpheme lexicon looks like this:

(13) Current state of the morpheme lexicon (N = 12, K = 7)

Morpheme   Existing tokens
train      trained, training
ed         trained, tested, untrained
test       tested, testing
ing        training, testing
un         untrained
t          untrained
rain       untrained

We can calculate the probabilities of all the choices that led up to this state H and obtain the probability P(H|D).

Now consider b53. It is currently 1, meaning there is a hypothesised morpheme boundary between t and r. We contrast the current hypothesis H with a hypothesis H' in which b53 = 0, so that t and rain merge into the single morpheme train:


(14) State of the morpheme lexicon H' (N = 11, K = 5)

Morpheme   Existing tokens
train      trained, training, untrained
ed         trained, tested, untrained
test       tested, testing
ing        training, testing
un         untrained

We then calculate whether we should switch from hypothesis H to H' by flipping a coin with probability:

(15) P(H → H') = P(D|H')P(H') / [P(D|H)P(H) + P(D|H')P(H')]

We flip a coin with this weight to determine whether to move to this new hypothesis or not. Thus if the posterior probability of H' is 3 times that of H, then there is a 3/4 chance we move from the morpheme lexicon in H to the morpheme lexicon in H', along with its hypothesised segmentation.

We do this cyclically, passing through the dataset several times, flipping a coin for every bij and adding the new hypothesised lexicon and segmentation to our Markov chain of posterior samples ("Markov" because each sample depends only on the previous one). Though at the beginning the samples are completely random, eventually, as we switch from model to model, we begin to spend more time in the high-probability regions of the posterior distribution. In fact, the Gibbs sampling algorithm is guaranteed to provably converge to the correct posterior distribution from any random starting point.

Finally, we take our Markov chain, discard the first m "burn-in" samples, since it is unlikely that we randomly start in a good area of probability space, and, since samples close to each other are highly correlated, retain every nth sample and discard the rest. We now have a good representation of the posterior distribution P(H|D), and can find the optimal segmentation H* = argmax_H P(H|D).
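As a rough illustration of a single Gibbs sweep over the boundary variables, the loop can be sketched as follows. This is a simplified sketch of my own (the joint probability is recomputed from scratch, whereas a practical implementation updates the lexicon counts incrementally); gibbs_sweep and joint_prob are illustrative names.

```python
import random

def gibbs_sweep(words, boundaries, joint_prob):
    """One Gibbs sweep over the morpheme-boundary variables b_ij (a sketch).

    boundaries[i][j] is 1 if a morpheme boundary sits after segment j of
    word i.  joint_prob(boundaries) should return P(D|H)P(H) for the
    segmentation implied by the current boundary settings.
    """
    for i, word in enumerate(words):
        for j in range(len(word) - 1):
            boundaries[i][j] = 1
            p_with = joint_prob(boundaries)
            boundaries[i][j] = 0
            p_without = joint_prob(boundaries)
            # Keep or place the boundary in proportion to the two joint
            # probabilities, exactly the coin flip described above.
            p_boundary = p_with / (p_with + p_without)
            boundaries[i][j] = 1 if random.random() < p_boundary else 0
    return boundaries
```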


The second algorithm is Metropolis-Hastings sampling, a generalisation of Gibbs sampling in which the jumps we make between hypotheses are not limited to changing a single variable. New hypotheses h' are generated from the existing hypothesis h by a proposal distribution g(h → h') of one's own choosing. For example, the proposal distribution could be: pick a number of bij's according to some distribution and toggle each of the bij's picked.

At each step, we sample a new hypothesis from the proposal distribution, and decide whether to accept the new proposal by calculating the Metropolis-Hastings criterion:

(16) A(h → h') = min(1, [P(h') g(h' → h)] / [P(h) g(h → h')])

We flip a coin with this weight to decide whether to move to the new hypothesis. The resultant chain is once again a Markov chain of samples that provably reflects the posterior distribution once enough samples have been accepted, and as in the Gibbs case, we can use these samples to find the optimal segmentation.

The more general Metropolis-Hastings inference algorithm is useful under certain conditions, for example when the equivalent of bij, unlike the binary choice above, has too many possibilities to efficiently enumerate, or when we wish to jump around hypotheses that differ by more than a single variable. This will be the case when we move towards sampling from our generative model for non-concatenative morphology, where the word formation mechanism is intercalation between root and residue.

A good choice of proposal distribution is the key to the efficient use of this technique. While the algorithm is guaranteed to return a Markov chain that faithfully reflects the posterior distribution no matter what proposal distribution is used, even the uniform distribution, choosing a proposal distribution as close as possible to the true posterior will result in more of the proposals being accepted, yielding more samples in a shorter amount of time than a less informed proposal distribution.


2.2 Non-concatenative generative model

In this section, we re-cast the model introduced in section 1.3.5 as a hierarchical generative model. This model has the following steps:

(17) Generative model of non-concatenative morphology

a. For each word we wish to generate, pick a template length in terms of segments n for that word from a Poisson distribution.[2]

b. Draw a template of the appropriate length n from a template lexicon governed by a Pitman-Yor process.

(i) If we decide to draw a new template type, then for each of the n slots in the template, flip a coin to determine whether it will be a root r or residue s slot.

c. Count the number of r and s slots to determine how long the root and residue for this word will be.

d. Draw a root of the appropriate length from a root lexicon governed by a Pitman-Yor process.

(i) If we decide to draw a new root, then draw a segment from a uniform distribution over all segments until we reach the desired length.

e. Repeat step (17-d) with residues, drawing a residue of appropriate length from a residue lexicon.

f. Intercalate the root and residue according to the template.

[2] For the purposes of inference, this is unnecessary as each word in our dataset has a given length. However, in Chapter 3 we will run this generative model "forward" to obtain a typological distribution, which requires generation of word lengths; hence I make this step explicit here.

As a concrete example, we might first pick a length of 5 for the template. We then go to the template lexicon, where we have a choice of one existing length-5 template rsrsr, or a completely new length-5 template. Drawing according to the Pitman-Yor probabilities, we happen to pick rsrsr. This means that our root will be of length 3 and the residue of length 2.[3] We then move to the root lexicon and choose among the existing length-3 roots, or have the option to draw a new root of that length; we pick a root ktb. Similarly, we pick a length-2 residue ui. We then intercalate according to the template and obtain the word kutib. This is illustrated in the following diagram.

(18) Pictorial representation of our generative model for Arabic verb stems

[Diagram: a template lexicon containing rsrsr, rsrssr, and the option of a new template; a lexicon of length-3 roots containing ktb, drs, and the option of a new root; and a lexicon of length-2 residues containing aa, ui, and the option of a new residue. Drawing the template rsrsr, the root ktb, and the residue ui, and intercalating them, yields kutib.]
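Run "forward", the generative model in (17) amounts to a few draws plus an intercalation step. The sketch below assumes hypothetical lexicon objects exposing a draw(length) method that either reuses an existing entry or creates a new one according to the Pitman-Yor probabilities; none of these names come from the dissertation.

```python
def generate_word(draw_length, template_lex, root_lex, residue_lex):
    """Run the non-concatenative generative model of (17) forward (a sketch).

    draw_length() stands in for the Poisson draw over template lengths;
    the three lexicon arguments stand in for Pitman-Yor-governed lexicons.
    """
    n = draw_length()                                 # step (a)
    template = template_lex.draw(n)                   # step (b), e.g. "rsrsr"
    root = root_lex.draw(template.count("r"))         # steps (c)-(d), e.g. "ktb"
    residue = residue_lex.draw(template.count("s"))   # step (e), e.g. "ui"
    # step (f): intercalate root and residue according to the template
    r_iter, s_iter = iter(root), iter(residue)
    return "".join(next(r_iter) if slot == "r" else next(s_iter)
                   for slot in template)

# With template "rsrsr", root "ktb", and residue "ui", this yields "kutib".
```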

2.2.1 Inference

Inference over this model was performed with Metropolis-Hastings sampling. We started with random templates being assigned to each word in the dataset.

For each word in the corpus, the algorithm sampled a new template, root, and residue for the word by removing its current template, root, and residue from the three respective lexicons. Then we considered all the possible current templates plus one new, randomly sampled template.

[3] This is a design decision; we could also pick any root or residue of any length, and discard extra segments if too long or iterate to fill the template if too short, as is proposed in McCarthy (1979, 1981).


For example, if the word in question was katab, we would consider all the length-5 templates in the template lexicon - for example, there might be three existing templates of length 5 rsrsr, rssss and rsrrr. Besides this, we consider the possibility of adding a fourth template of length 5, and sample an actual template for it, say rssrs.

We now need a distribution over this set of four hypothesised templates - our proposal distribution. In order to bias our choice towards templates that are more likely, we consider the probability of adding the template, the resultant root, and the resultant residue to the three respective lexicons, according to the Pitman-Yor distributions, including the probability of creating the template, root, and residue, if entirely new.

The sum of these probabilities will not be 1, so we renormalise to get a probability distribution, then sample a template from this distribution. This will be our new hypothesis, and the renormalised probability with which it was picked is g(h → h'). We calculate g(h' → h) in the reverse way, calculating the likelihood of proceeding from the new hypothesis to the old hypothesis with which we began.

We then calculate the Metropolis-Hastings criterion and flip a coin with that probability as to whether to accept the new hypothesis. If we do, we add the new template, resultant root and residue to their respective lexicons. If not, we revert to the old template, root and residue, and proceed to the next word. At the end of each "sweep" through the data, we capture a snapshot of the current segmentation and add that to our posterior distribution.

As a result of incorporating downstream information into the proposal distribution, we boost the probability of trying out hypotheses that have a higher P(h') and thus are more likely correct. For instance, if ktb has a high token frequency in the current lexicon, the probability of picking the template rsrsr from the proposal distribution will be higher for the word katab. This reduces the average number of proposals we discard before accepting a sample, reducing processing time for the algorithm.
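The proposal step just described can be summarised as: score each candidate template by the Pitman-Yor probabilities of the template and the root and residue it induces, renormalise, and sample. A minimal sketch of my own, with illustrative names:

```python
import random

def propose_template(candidates, score):
    """Sample a template for one word from the renormalised proposal (a sketch).

    candidates: the existing templates of the right length plus one newly
    sampled template; score(t) should return the product of the Pitman-Yor
    probabilities of adding t and the root and residue it induces for this
    word.  Returns the chosen template and the probability with which it
    was chosen, i.e. g(h -> h') for this move.
    """
    weights = [score(t) for t in candidates]
    total = sum(weights)
    probs = [w / total for w in weights]
    u, acc = random.random(), 0.0
    for template, p in zip(candidates, probs):
        acc += p
        if u <= acc:
            return template, p
    return candidates[-1], probs[-1]
```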
