
Phonetic Distance Measures for the Induction of a Translation Lexicon for Dialects - A Study on Bernese Swiss German and Standard German



SCHERRER, Yves

Abstract

The subject of this thesis relates to two major research directions in computational linguistics. On the one hand, the field of machine translation requires multilingual data. On the other hand, computational tools are increasingly called upon for descriptive research in dialectology. This work proposes to use methods developed in dialectology to build a bilingual lexicon, an essential part of any machine translation system. This novel choice is motivated by the language pair at the heart of our research, namely an Alemannic dialect and the standard variety of German.

At the basis of any machine translation system lies a bilingual dictionary associating words of the two languages between which the translation operates. These bilingual lexicons can be built by hand, or derived automatically from linguistic material. The standard approach to automatic construction requires parallel corpora – texts translated by human translators – in which structural similarities are sought [...]

SCHERRER, Yves. Phonetic Distance Measures for the Induction of a Translation Lexicon for Dialects - A Study on Bernese Swiss German and Standard German. Master d'études avancées : Univ. Genève, 2007

Available at:

http://archive-ouverte.unige.ch/unige:33930



Phonetic Distance Measures for the Induction of a Translation Lexicon for Dialects

A Study on Bernese Swiss German and Standard German

Mémoire de DEA
Supervisor: Dr. Paola Merlo
Université de Genève, Département de Linguistique

July 13, 2007

Yves Scherrer

yves.scherrer@lettres.unige.ch


Contents

1 Introduction
1.1 Motivation
1.2 The sociolinguistic and dialectal situation of German-speaking Switzerland
1.2.1 Diglossia
1.2.2 Dialects
1.3 Bilingual lexicons
1.4 A two-stage framework for translation lexicon induction
1.5 Terminology
1.6 Overview of the thesis

2 Related Work
2.1 Overview
2.2 Machine translation and lexicon induction
2.3 String distance measures
2.4 Cognate languages, string distance measures and lexicon induction
2.5 String distance measures and dialectology
2.6 Concluding remarks

3 Data
3.1 Available Data
3.2 Data acquisition
3.2.1 Building a target lexicon
3.2.2 Building a gold standard corpus
3.3 Phonemes and Graphemes

4 Two-stage models of lexical induction
4.1 The lexicon induction framework
4.1.1 The noisy channel metaphor
4.1.2 Transducers
4.1.3 Finite-state automata and tries
4.1.4 Composing finite-state machines
4.1.5 Translation lexicon induction
4.1.6 Implementation
4.1.7 String distance measures
4.2 Static distance measures
4.2.1 The basic model: Levenshtein distance
4.2.2 Some variants based on general linguistic intuitions
4.2.3 A model based on phonetic features
4.3 Adaptive distance measures obtained by learning algorithms
4.3.1 The basic model
4.3.2 The class model
4.3.3 A model trained without a bilingual corpus
4.3.4 N-gram models
4.4 Adaptive distance measures based on transformation rules
4.4.1 The rules
4.4.2 The model
4.5 Summary

5 Experiments and Results
5.1 Goals of the evaluation
5.2 Transducer tests
5.2.1 Experimental setup
5.2.2 Static models and rule-based system
5.2.3 Learning models
5.3 Framework tests
5.3.1 Experimental setup
5.3.2 Evaluation methodology
5.3.3 Test corpus statistics
5.3.4 Results
5.4 Discussion
5.5 Phonetic and syntactic similarity measures
5.6 Concluding remarks

6 Conclusion
6.1 Main contributions
6.2 Outlook
6.2.1 Model variations
6.2.2 Data variations and dialect comparisons
6.2.3 Towards a three-stage framework
6.2.4 Towards machine translation
6.3 Final remarks

Bibliography


Chapter 1

Introduction

1.1 Motivation

Building lexical resources is a very important step in the development of any natural language processing system. A great variety of systems, ranging from shallow parsers to complex machine translation models, based on symbolic or statistical approaches, rely on some kind of lexical knowledge. However, building a lexicon is a time-consuming and repetitive task, which makes research on automatic induction of lexicons particularly appealing.

Lexical information can be represented at varying degrees of detail, from simple word lists to complex entries containing phonetic, morphological, syntactic and semantic information. Furthermore, lexical entries of different languages can be linked together in order to create bilingual dictionaries for machine translation. In particular, much work has been done in the last few years on extracting bilingual dictionaries from parallel corpora. These approaches are restricted to language pairs for which large parallel corpora are available. This constraint on the availability of data, combined with economic and political motivations, leads to a large over-representation of some language pairs in machine translation research: English - French (due to the early availability of the Canadian Hansard parallel corpus), English - Spanish/German/Japanese/Chinese/Arabic. These particular language configurations share some interesting properties. First, most research deals with machine translation from or into English. Second, the other language of the pair is, from a linguistic point of view, unrelated to English, or a rather distant cognate language of English. Third, all of these languages are used by millions of people, thereby ensuring the availability of large data sets. Finally, research focuses on the standardized, written versions of these languages.

Our work will not rely on any of these properties. In fact, automatic lexicon induction between Swiss German dialects and Standard German contrasts in several ways with the properties of mainstream machine translation research. The Swiss German dialects are closely related to Standard German – much more closely than are, for example, English and French. It is likely that this property will simplify the induction of a lexicon. On the other hand, there is little data available for dialects, and even less of it is in parallel form. This is particularly problematic for corpus-based approaches. Another drawback is that a standardized version of Swiss German does not exist. In fact, there is considerable variation between the different dialects.

We will argue that these fundamental differences require different techniques. But before introducing the precise goals of this work, we would like to explain the three aforementioned points – the close relation between the chosen languages, the lack of dialect data, and the lack of a standardized dialect form – on the basis of the sociolinguistic (1.2.1) and dialectological (1.2.2) situation of German-speaking Switzerland.

1.2 The sociolinguistic and dialectal situation of German-speaking Switzerland

1.2.1 Diglossia

It is commonly accepted that the sociolinguistic situation of German-speaking Switzerland is a model case of diglossia. The term diglossia was introduced by Ferguson (1959) to describe an environment where two languages (or, most frequently, two varieties of one language) are used in a complementary way in functionally different contexts. This implies that every speaker who is regularly confronted with the two types of context is competent in both varieties. Ferguson qualifies these varieties as High and Low. The High variety is typically used in formal and written contexts, while the Low variety is used in informal and oral contexts. However, the High vs. Low distinction does not depict the Swiss German reality adequately. In Switzerland, the degree of formality does not influence the choice of the variety: dialects are used in all situations of oral communication, in television, church and administration. Standard German is used for writing, regardless of the formality of the situation. This type of diglossia, depending solely on the communication medium, is called medial diglossia.

Nevertheless, other factors influence the choice of a language variety. Standard German is spoken when addressing non-dialect speakers, for example in the Swiss Federal Parliament, where French- and Italian-speaking members take part in the deliberations, or in discussions involving Germans who do not understand Swiss dialects. On the other hand, dialects can be written under certain circumstances, for example in dialect literature, or more recently in informal uses of electronic communication media.

Recent studies show that 58% of e-mails and 75% of SMS written by Swiss Germans are in dialect (Scharloth 2004). This shift is not so much to be considered as an introduction of the formal vs. informal distinction, but rather as one consequence of the increasing use of dialect in general.

The definition of diglossia implies that the two varieties are clearly distinct. While this is the case in German-speaking Switzerland, it contrasts for example with most regions of Austria and Southern Germany, which exhibit a continuum between Standard German and the local dialects. The context (speakers, situation) determines the positioning of the utterances on that continuum. In Switzerland, the distinction between dialect and standard language is clear-cut. However, there are various mutual influences. For example, technical words are often borrowed ad hoc from Standard German and phonologically adapted. In the other direction, dialect speakers often carry over some dialectal features (accent, lexicon, syntax, morphology) to the standard language.

This particular sociolinguistic situation has an impact on the setup of our work. In particular, translation between (written) dialects and (written) Standard German virtually never occurs: in most contexts, only one variety is appropriate. As translation does not occur, parallel corpora between the two varieties do not exist – after all, most speakers are bilingual. One might argue that, in this situation, automatic lexicon induction is useless: if all speakers are competent in both language varieties, and written dialect data are rare anyway, there is no need to induce lexical correspondences with the help of computers. However, some arguments of a practical and theoretical nature are worth noting.

First, there are many German speakers (whether as a second language or as a mother tongue) who do not have sufficient dialect competence: Germans from the northern part of Germany usually have difficulties understanding Swiss German dialects; for instance, television programs involving dialect speakers are broadcast with subtitles for German audiences. The French- and Italian-speaking Swiss only learn Standard German at school; this virtually excludes them from contexts in which dialect is used. It is especially for these audiences that natural language processing of Swiss German dialects could be useful.

Second, the task of automatic lexicon induction should be viewed as a first step towards a machine translation system between spoken Swiss German and written Standard German. Such a system would entirely respect the diglossia described above. As speech recognition of Swiss German dialects is a rather challenging project per se, we begin by using written dialect data. In further work, the presented methods can be extended to use oral data in phonetic transcription.

Besides these rather concrete considerations, there is also a psycholinguistic interest to our work, given that Swiss Germans use word-matching techniques every day. While this is not the place to present psycholinguistic theories about the diglossic mind, we can nevertheless assume that lexical entries of dialect words are somehow linked to Standard German entries, and that this strong linkage is crucial for maintaining a diglossic society: its members must have an extensive lexical knowledge in both varieties. Furthermore, they are able to adapt a word which they have only seen in one variety in order to use it in the other. Such transfers are only possible if they have acquired an implicit model of the differences between the two varieties. Lexicon induction mechanisms also play a role in language acquisition. Most Swiss Germans are first confronted with dialects, and learn Standard German later on (at school, or through the influence of television and other media). They are likely to generalize this linking process by learning a phonological model. Thus, our models can also be viewed as a modest contribution to the modelling of the psycholinguistic activity of a diglossic speaker.

1.2.2 Dialects

All dialects of German-speaking Switzerland belong to the same subgroup of German dialects, the Alemannic subgroup1. This subgroup also contains the dialects spoken in Alsace, South-West Germany, western Austria and Liechtenstein. Even if there is some dialectological unity, dialectal variation between different regions is considerable. As there are no clear linguistic boundaries between dialects, it is more adequate to speak of a dialect continuum. Even if political boundaries (cantons) are still used to distinguish dialectological features, their importance tends to decline with the increasing mobility of the speakers.

1 There is one village at the Austrian border that uses a Bavarian dialect. We will not consider this dialect here.

In the last few decades, some trends of language change have been identified (Christen 1998). The lexicon tends towards uniformisation among the dialects on the one hand, and between the dialects and Standard German on the other hand. This means that dialect-specific words become increasingly rare. This goes in parallel with the declining importance of agriculture and its highly specialized, dialect-specific vocabulary. Neologisms are frequently taken over from Standard German, in most cases with phonological and morphological adaptation. In contrast, grammar (phonology, morphology, syntax) as well as functional and high-frequency words largely resist these uniformisation tendencies, as they generally do not hinder successful communication, but rather provide the speakers with an element of identification with regional traditions. These two contrasting trends – uniformisation to simplify intercomprehension vs. keeping particularities to allow regional identification – have already been proposed by de Saussure (1916: 281) as the two leading forces of language change.

There is some debate, largely motivated by political considerations about (second) language learning, on the relationship between Swiss German and Standard German. The traditional view considers Swiss German merely as a set of dialects that differ to a variable extent from Standard German. Some researchers consider Swiss German as a full-fledged language, distinct from Standard German. We will not present the details of this debate here (for an overview, see Hägi and Scharloth 2005); in any case, the two entities, whether we call them languages or language varieties, are very closely related.

1.3 Bilingual lexicons

In general, a bilingual lexicon, or translation lexicon, for languages l1 and l2 can be defined as a set of expressions from l1 that are related to a set of expressions from l2. The expressions of both languages can be annotated with other linguistic information: part-of-speech tags, morphological information like inflection classes or gender, and subcategorization frames for verbs. We propose to analyze the relation between the two sets of expressions as a two-fold many-to-many correspondence. The paradigmatic many-to-many correspondence accounts for the fact that one source word may have several possible translations (one of which is chosen depending on the context), and that several source words may have the same translation. The syntagmatic many-to-many correspondence captures the property that one source word may be most adequately translated by a multi-word expression, and that a multi-word expression can be translated by a single word.

These two relation types are the most general possible. Any language pair satisfies them. However, we suppose that the closer the two languages of the pair, the simpler the correspondences. We should therefore expect more one-to-one correspondences (of both types) when dealing with closely related languages. Restricting the many-to-many correspondences to simpler ones is not only linguistically motivated, but also appealing for computational reasons. This is particularly apparent for the syntagmatic correspondences: for one-to-one correspondences, the search space for a given source word includes all target words. For one-to-many correspondences, the search space includes all target words, all two-word combinations of target words, and even longer combinations.
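To give a rough idea of how quickly this search space grows, the following Python snippet (our own illustration, not a figure from the thesis) computes its size for a target vocabulary of about 200'000 word forms, the order of magnitude of the Standard German lexicon described in Chapter 3:

# Rough size of the candidate space for a single source word, assuming a target
# vocabulary of V word forms and target expressions of at most k words.
V = 200_000  # order of magnitude of the Standard German target lexicon (Chapter 3)

def search_space(k):
    return sum(V ** i for i in range(1, k + 1))

print(search_space(1))  # one-to-one candidates:      200000
print(search_space(2))  # up to two-word candidates:  40000200000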

For these reasons, we will restrict our work to syntagmatic one-to-one correspondences. The language pair studied does nevertheless include some cases of one-to-many correspondences. For example, some pronouns can be combined with articles in the dialects as well as in Standard German. However, the distributions are not identical (examples from Bern dialect and Standard German): im - im ‘in the-MASC’, but ir - in der ‘in the-FEM’. In Swiss German, some subject pronouns following the verb can be dropped2, which is not the case in Standard German: gsesch - siehst du ‘you see’. Then, there are dialect words which do not have simple equivalents in Standard German and which must be translated by a more complex description: allpott - immer wieder ‘again and again’ or Chummerzhülf - Retter in der Not ‘knight in shining armour’.

We will also restrict our work to paradigmatic one-to-one correspondences. There is no inherent limitation forcing us to do this, but the evaluation of the results is easier if there is a single solution to find. As a consequence of this restriction, we only need to induce one direction of the lexicon: connecting a dialect word to a Standard German word is equivalent to connecting a Standard German word to a dialect word.

Our work is based on another simplifying assumption. The goal of this approach is essentially to find correspondences between dialect and Standard German words. But for many applications, a bilingual lexicon needs to contain more information. Given the linguistic similarity of the two languages involved, we assume that this information can in most cases be carried over from the Standard German word to the equivalent dialect word. It remains to be shown to what extent this hypothesis holds.

2 Alternatively, they can be analyzed as 0-clitics (Nübling 1992: 261).

1.4 A two-stage framework for translation lexicon induction

In the previous section, we have assumed that lexical mappings follow paradigmatic and syntagmatic one-to-one correspondences. With these assumptions in mind, we can reduce the problem of inducing a translation lexicon to the following question: For a given dialect word, what is the most probable equivalent Standard German word?

Finding such lexical mappings amounts to finding word pairs that are maximally similar, with respect to a particular definition of similarity. Similarity measures can be based on any level of linguistic analysis. Syntactic similarity is based on the alignment of parallel corpora (Brown, Pietra, Pietra, and Mercer 1993). Words that occur in similar syntactic contexts and in the same sentence pairs are considered to be translations of each other. Semantic similarity measures (Rapp 1999) rely on co-occurrence vectors: words that occur together are likely to belong to the same semantic field. Once the semantic fields are defined for the two languages, the word mappings are easy to induce. Phonetic similarity measures try to match word pairs according to their phonetic (or rather graphemic) similarity. Different similarity definitions can also be combined to improve the lexicon induction.

Our work is based on the assumption that phonetic similarity measures are the most appropriate in the given language context, because they require less sophisticated training data than semantic or syntactic similarity models. For instance, they do not require parallel corpora. However, phonetic similarity measures can only be used for cognate language pairs, i.e. language pairs that can be traced back to a common historical origin and that possess highly similar linguistic (in particular, phonological and morphological) characteristics. Moreover, we can only expect phonetic similarity measures to induce cognate word pairs, i.e. word pairs whose forms and meanings are similar as a result of a historical relationship. For illustration, the following list shows some cognate word pairs for Bernese Swiss German and Standard German:


schpeter - später ‘later’
du - du ‘you’ (singular)
mache - machen ‘make’
Sägu - Segel ‘sail’ (noun)
Hut - Haut ‘skin’
Prüefig - Prüfung ‘exam’
bblybe - geblieben ‘stayed’ (participle)

[Figure 1.1: The two-stage architecture of the proposed framework. A dialect word enters the first stage, which generates similar strings (using the static models, the learning models or the rule-based model); the second stage filters these candidates against a Standard German lexicon, yielding a list of Standard German words.]

For language pairs without parallel corpora, the syntactic similarity methods are not usable. Even if small parallel corpora are available for some Swiss German dialects, our work will be exclusively phonology-based. We want to build models that work for any type of dialect, regardless of the amount of data available for it. However, we will use a parallel corpus to compare the syntactic and the phonetic approach.

We will present three types of models for phonology-based lexicon induction. All of them share the same two-stage framework, illustrated in Figure 1.1. In the first stage, given a dialect word, a set of phonological variants is generated. These variants do not necessarily form words of the target language (Standard German). Therefore, a second stage, based on a lexicon, is necessary to filter out the non-words. While the second stage is identical for all models, the three types of models use different techniques to generate and rank the set of phonological variants. One type of model is language-independent. Another type uses machine learning techniques to learn the correspondences between Swiss German and Standard German. The third model is based on manually built phonetic rules. Our framework thus follows a generate-and-filter approach, where the first stage generates a series of hypotheses, which are filtered in the second stage.

We have already mentioned that this framework does not require parallel corpora. The data need not be in sentence form, either. These limited data requirements will also allow us to use transcribed oral corpora. However, if high-quality data are available, the framework is unable to benefit from them. To avoid this possible shortcoming, a third stage could be added to the framework. In this stage, the remaining word candidates would be filtered according to the syntactic and semantic context in which they occur. This third stage would thus combine different similarity definitions.

Let us illustrate our framework with an example. The dialect word ha ‘[I] have’ should be linked to the Standard German word habe. First, an ordered list of phonological variants of the dialect word is generated:

ha, he, hk, ho, hab, hat, cha, im, nah, gabe, habe, ...

Then, these variants are looked up in the Standard German lexicon. Only the variants that are found in that list are kept:

hat, im, nah, gabe, habe, ...

In a possible third step, the syntactically and semantically incompatible variants are excluded. In the context I ha Hunger ‘I am hungry’, the suggestions im, nah, gabe should be ruled out because they are not verbs, and hat ‘has’ should be ruled out because it is incompatible with the first-person subject.
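To make the generate-and-filter idea concrete, the following minimal Python sketch reproduces the example above. It is our own illustration rather than the implementation used in this thesis (which is transducer-based, see Chapter 4): the toy generate() function simply hard-codes the candidate list for ha, and the miniature lexicon stands in for the full Standard German word list.

def generate(dialect_word):
    """Toy first stage: return candidate strings, best-ranked first."""
    # A real model would generate and rank phonological variants;
    # here we hard-code the example from the text for the word 'ha'.
    if dialect_word == "ha":
        return ["ha", "he", "hk", "ho", "hab", "hat", "cha",
                "im", "nah", "gabe", "habe"]
    return [dialect_word]

def filter_candidates(candidates, lexicon):
    """Second stage: keep only candidates that are words of the target language."""
    return [c for c in candidates if c in lexicon]

# Miniature stand-in for the Standard German lexicon (~202'000 word forms).
standard_german_lexicon = {"hat", "im", "nah", "gabe", "habe", "machen"}

print(filter_candidates(generate("ha"), standard_german_lexicon))
# -> ['hat', 'im', 'nah', 'gabe', 'habe']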

1.5 Terminology

The abstract two-stage architecture will be referred to as the framework. Each instance of our framework consists of a particular distance model – the first stage – and of a lexicon model – the second stage. However, this study will not focus on the impact of different lexicon models, but will use the same lexicon model in all instances. As the distance model is the only distinctive constituent of the different instances, the term model may, by extension, also refer to a specific framework instance.

Each distance model encodes a distance measure, or similarity measure, assuming that the two terms are each other’s inverses. In the same way, “the most similar word” is considered equivalent to “the least distant word”. However, this simple relationship between similarity and distance does not always hold, as Heeringa, Kleiweg, Gooskens, and Nerbonne (2006) note. In fact, the fundamental measure of our models is always the distance measure: the higher the number, the more distant the compared words. Furthermore, we will distinguish “candidate strings” from “candidate words”: the former are not necessarily words of the target language, while the latter are.

1.6 Overview of the thesis

This thesis will be organized as follows.

Chapter 2 reviews some related work. Most of this work deals with the use of different distance measures to account for word similarity. While some of this work is inspired by similar techniques used in speech recognition, other studies deal more precisely with issues in dialectology. Further work describes translation lexicon induction from different perspectives.

Chapter 3 presents the available data for Swiss German dialects as well as for Standard German. The data constraints will have an impact on the setup of the study.

In Chapter 4, we present the two-stage framework of lexicon induction in some detail. As briefly mentioned in the preceding section, three main models will be discussed, based on different approaches for the generation of phonological variants. For each of these models, we will lay out their underlying hypotheses and discuss some variants and possible extensions.

The theoretical description of the models in Chapter 4 is followed by a comparative analysis in Chapter 5. The performance of the models will be assessed in an empirical study based on a corpus.

Finally, some directions for future research will be illustrated in Chapter 6.


Chapter 2

Related Work

2.1 Overview

The purpose of this chapter is to review some previous studies that are related to our work. We begin by outlining the main characteristics of our research which have guided the selection and presentation of related work.

In this thesis, we propose a framework for translation lexicon induction. Lexicon induction is generally considered a subproblem of machine translation and has been investigated from various perspectives, using various methods (2.2).

In our framework, we use string distance measures to induce the word mappings. These measures have been successfully used in text-to-speech and speech-to-text applications (2.3), and have become increasingly popular for a variety of tasks in Computational Linguistics (Kessler 2005), in particular for problems related to cognate languages (2.4).

Our framework will be applied to two dialectal varieties of German. Therefore, we may be interested in the results obtained by computational approaches to dialectology. Most studies in this field make extensive use of string distance measures (Nerbonne, Heeringa, and Kleiweg 1999) (2.5).

The language varieties we have chosen form a diglossia – the Bernese dialect does not fulfill the same sociolinguistic functions as Standard German. Moreover, the Swiss German dialects can be considered languages with scarce resources. These two research domains have become increasingly popular in the last few years, mostly in parallel with the growing interest in Arabic and its dialects (2.2).

While research has been done in all of these domains, there have been few attempts to connect them. In the following sections, we will present previous work in the light of these particular domains, and we will try to establish some connections between them.

2.2 Machine translation and lexicon induction

A translation lexicon is an important part of every machine translation framework. Translation lexicons contain mappings between words or lexemes of two languages. While traditional approaches rely on manual construction of such lexicons or on the use of professional dictionaries, the free availability of large corpora has helped to automate the construction of such lexicons. Different algorithms have been proposed to find word alignments in parallel corpora (Brown et al. 1993; Och and Ney 2003). Starting with sentence pairs, they induce the maximally likely word pairs, i.e. word pairs that occur simultaneously in different sentence pairs and that are likely to be related in each sentence pair. However, parallel corpora may not be available for particular language pairs. Therefore, other approaches to lexicon induction have been proposed.

Rapp (1999) assumes that translation pairs, having the same sense, share co-occurrence patterns. If we know that German Schule is a translation of English school, and we see in our corpora that Schule often co-occurs with Lehrer, and school with teacher, then we may conclude that Lehrer is a good translation candidate for teacher. This method requires a large German corpus and a large English corpus to count co-occurrences, but the corpora need not be parallel. Moreover, for this method to work, we need to start off with a seed lexicon – we need to know that Schule translates to school. This approach is thus a lexicon extension model rather than a lexicon induction model.
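The following Python sketch illustrates the co-occurrence intuition behind Rapp’s approach in a heavily simplified form; it is our own toy rendering, not Rapp’s actual algorithm. The miniature corpora, the window-based counting and the overlap score are placeholder assumptions: real systems use large corpora, proper association measures and vector similarity.

from collections import Counter

def cooc_vector(word, corpus, window=2):
    """Count words co-occurring with `word` within a +/- `window` context."""
    counts = Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            if w == word:
                for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                    if j != i:
                        counts[sent[j]] += 1
    return counts

def translation_score(src_vec, tgt_vec, seed_lexicon):
    """Overlap of co-occurrence profiles, mapped through a seed lexicon."""
    score = 0
    for src_ctx, freq in src_vec.items():
        tgt_ctx = seed_lexicon.get(src_ctx)
        if tgt_ctx is not None:
            score += min(freq, tgt_vec.get(tgt_ctx, 0))
    return score

# Toy monolingual corpora (tokenized sentences) and a one-entry seed lexicon.
german = [["der", "Lehrer", "der", "Schule", "ist", "nett"],
          ["die", "Schule", "hat", "einen", "neuen", "Lehrer"]]
english = [["the", "school", "teacher", "is", "nice"],
           ["the", "school", "has", "a", "new", "teacher"]]
seed = {"Schule": "school"}

print(translation_score(cooc_vector("Lehrer", german),
                        cooc_vector("teacher", english), seed))
# prints 1: one shared context (Schule ~ school) supports the candidate 'teacher'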

Koehn and Knight (2002) combine different heuristics – Rapp’s technique among others – to induce translation lexicons without using parallel corpora. They also exploit loanwords and cognate words to improve the induction of word pairs.

Hwa, Nichols, and Sima’an (2006) have applied Rapp’s model to an interesting language pair: they built a translation lexicon for Modern Standard Arabic and the Levantine Arabic dialect, spoken in Syria, Palestine, western Jordan and Lebanon. Like our varieties of German, this language pair forms a diglossia, which restricts the usefulness (and thus the existence) of parallel corpora. Hwa et al. (2006) compare the performance of Rapp’s model on various types of corpora, differing in mode (spoken or written), topic and size. The work of Hwa et al. (2006) is part of a larger research program on parsing Arabic dialects (Rambow, Chiang, Diab, Habash, Hwa, Sima’an, Lacey, Levy, Nichols, and Shareef 2006; Chiang, Diab, Habash, Rambow, and Shareef 2006).

2.3 String distance measures

String distance measures are used in situations where similar, but not identical, sequences have to be compared. Since the beginning of computer science, they have been applied to a large range of problems: error correction (Levenshtein 1966), spell checking (Peterson 1980), comparing texts (Hunt and McIlroy 1976), and comparing DNA sequences in biology (Sankoff and Kruskal 1999). In computational linguistics, string distance algorithms, based on the simple Levenshtein distance algorithm (Levenshtein 1966) or on more sophisticated Hidden Markov Models (HMMs) or finite-state transducers, have been used in speech-to-text (Rentzepopoulos and Kokkinakis 1996; Ristad and Yianilos 1998) and text-to-speech applications (Minker 1996; Jansche 2003).

Ristad and Yianilos (1998) tackle a problem of speech recognition. Each word usually has one prototypical phonological representation, listed in a pronunciation lexicon, but different speakers can pronounce it slightly differently depending on the context, the accent, and other parameters. String distance measures define the distance between two strings as the sum of the distances between the characters composing them. The simplest model, Levenshtein distance, uses a binary measure for character distance: two characters are either identical or different. Roughly speaking, Levenshtein distance counts the number of differing characters1. For many applications, this binary character distance measure is too coarse-grained. For instance, Ristad and Yianilos (1998) argue that some differences between prototypes and actual realisations are more stable and systematic than others. Therefore, the distance between prototypical phoneme strings and actual phone strings should not be based on a binary character distance measure. Rather, the character distance values should be defined separately for each phoneme-phone pair. However, the number of values is too high to be assigned manually on the basis of phonetic intuitions.

1 See Chapter 4.2 for a more precise description of Levenshtein distance.
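For reference, here is a minimal Python formulation of Levenshtein distance with a pluggable character-distance function. The default binary cost reproduces plain Levenshtein distance, while passing a custom function corresponds to the per-pair character distances discussed above. This is the generic textbook dynamic-programming version, not the transducer-based formulation used later in this thesis (see Chapter 4.2).

def levenshtein(source, target, sub_cost=None):
    """Edit distance with unit insertion/deletion costs and a pluggable substitution cost."""
    if sub_cost is None:
        sub_cost = lambda a, b: 0 if a == b else 1  # binary character distance
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]  # d[i][j]: distance of source[:i], target[:j]
    for i in range(1, m + 1):
        d[i][0] = i
    for j in range(1, n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                                   # deletion
                          d[i][j - 1] + 1,                                   # insertion
                          d[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]))
    return d[m][n]

print(levenshtein("schpeter", "später"))  # distance between a cognate pair
print(levenshtein("ha", "habe"))          # two insertions -> 2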


Ristad and Yianilos (1998) propose to induce the character distance values by training the distance model with the Expectation-Maximisation (EM) algorithm (Dempster, Laird, and Rubin 1977). They provide a training corpus that contains manually assigned pairs of phonetic and phonological word representations. In an iterative way, the character pair values are adapted such that they minimize the cumulative distance over the word pairs of the training corpus2. In Ristad and Yianilos’s tests, the error rate of this trained model was reduced by a factor of five compared to the Levenshtein distance model.

2 More details about this approach will be presented in Section 4.3.
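To give a flavour of this training loop, the sketch below implements a deliberately simplified, hard-EM (Viterbi-style) variant of our own: in each iteration it computes the cheapest edit alignment of every training pair under the current costs, counts the character pairs used, and resets each cost to the negative log of its relative frequency. The real Ristad and Yianilos model is a stochastic transducer trained with full (soft) Expectation-Maximisation; see Section 4.3 for the version actually used in this thesis.

import math
from collections import Counter

def align(source, target, cost):
    """Cheapest edit alignment (Viterbi); returns the list of character pairs used.
    '-' stands for the empty string (insertion or deletion)."""
    m, n = len(source), len(target)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + cost((source[i - 1], "-"))
        back[i][0] = (i - 1, 0, (source[i - 1], "-"))
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + cost(("-", target[j - 1]))
        back[0][j] = (0, j - 1, ("-", target[j - 1]))
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            options = [(d[i - 1][j - 1] + cost((source[i - 1], target[j - 1])),
                        (i - 1, j - 1, (source[i - 1], target[j - 1]))),
                       (d[i - 1][j] + cost((source[i - 1], "-")),
                        (i - 1, j, (source[i - 1], "-"))),
                       (d[i][j - 1] + cost(("-", target[j - 1])),
                        (i, j - 1, ("-", target[j - 1])))]
            d[i][j], back[i][j] = min(options, key=lambda x: x[0])
    ops, i, j = [], m, n
    while back[i][j] is not None:
        i, j, op = back[i][j]
        ops.append(op)
    return ops

def train(word_pairs, iterations=5):
    """Hard EM: alternately align the pairs and re-estimate character-pair costs."""
    cost = lambda pair: 0.0 if pair[0] == pair[1] else 1.0  # start from Levenshtein
    for _ in range(iterations):
        counts = Counter()
        for src, tgt in word_pairs:
            counts.update(align(src, tgt, cost))
        total = sum(counts.values())
        probs = {pair: c / total for pair, c in counts.items()}
        floor = min(probs.values()) / 10  # unseen pairs stay expensive
        cost = lambda pair, p=probs, f=floor: -math.log(p.get(pair, f))
    return cost

# Hypothetical mini training corpus of dialect / Standard German cognate pairs.
trained = train([("schpeter", "später"), ("mache", "machen"),
                 ("Hut", "Haut"), ("Prüefig", "Prüfung")])
print(trained(("e", "e")) < trained(("e", "a")))  # True: observed pairs become cheap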

Jansche (2003) reviews the work of Ristad and Yianilos (1998) in some detail, correcting some errors in the presentation of their algorithms. He then extends these algorithms and applies them to build letter-to-sound conversion rules for speech synthesis.

2.4 Cognate languages, string distance measures and lexicon induction

While much work deals with the use of string distance measures in phonetics and phonology, these measures have also been applied in research on cognate languages.

Mann and Yarowsky (2001) use string distance measures for lexicon induction between cognate languages. They exploit the fact that, for cognate language pairs, the words constituting translation pairs often have similar forms due to a common origin. To detect these similarities, they use string distance measures (called cognate models). They distinguish static and adaptive models. The static models are independent of a particular language pair; they include Levenshtein distance and a variant distinguishing vowels and consonants. The adaptive models are adapted to a particular language pair by training. Mann and Yarowsky (2001) use the trained stochastic transducer proposed by Ristad and Yianilos (1998), as well as a Hidden Markov Model (HMM). They also add a variant based on learned transition classes (see Section 4.3.2).

Mann and Yarowsky (2001) use these cognate models inside a framework of transitive multi-path lexicon induction. They induce translation lexicons between a resource-rich language (typically English) and a scarce-resource language of another family (for example, Portuguese) by using a resource-rich bridge language of the same family (for example, Spanish). They show that the results can be improved with additional bridge languages (for example, English-French-Portuguese). While they rely on existing translation lexicons for the source-to-bridge step (English-Spanish/French), they use the cognate models for the bridge-to-target step (Spanish/French-Portuguese).

We will closely follow Mann and Yarowsky’s approach. We will use most of their cognate models, like Levenshtein distance and the stochastic transducer. We will replace the HMM (which performed worst in their study) by a transducer based on manually selected symbolic rules. However, we will not use the whole bridge-language framework, but rather focus on the step involving the two cognate languages – in our case the two varieties of German. Adding an existing German-to-English or German-to-French translation lexicon would be relatively straightforward.

Schafer and Yarowsky (2002) applied Mann and Yarowsky’s framework to other language pairs and also changed the direction of the lexicon induction: from a scarce-resource language of family f1, through a bridge language of family f1, to a target language of family f2. This will also be the direction used in our studies. Furthermore, they do not limit themselves to string distance models, but add several other similarity measures largely inspired by Information Retrieval techniques. For example, they argue that news texts talk about the same subjects at the same time, with the same words. Therefore, they induce word pairs on the basis of their date distributions in (comparable, but non-parallel) news corpora. They also stipulate that word pairs should only be formed with words that have similar relative frequencies. Even if these enhancements pay off, we will not be able to use them, because our dialect corpora are too small to compute reliable frequency measures, and because news corpora in Swiss German dialects do not exist.
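As a rough illustration of this idea, the sketch below scores a candidate word pair by the similarity of its publication-date profiles, after discarding pairs whose relative frequencies are too dissimilar. It is our own toy rendering of the intuition, not Schafer and Yarowsky’s actual scoring functions; the date counts, the frequency threshold and the example values are hypothetical.

import math

def cosine(p, q):
    """Cosine similarity of two date-count dictionaries."""
    dot = sum(p.get(k, 0) * q.get(k, 0) for k in set(p) | set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def candidate_score(src_dates, tgt_dates, src_relfreq, tgt_relfreq, max_ratio=2.0):
    """Date-profile similarity, gated by a relative-frequency filter."""
    ratio = max(src_relfreq, tgt_relfreq) / min(src_relfreq, tgt_relfreq)
    if ratio > max_ratio:  # relative frequencies too dissimilar: reject the pair
        return 0.0
    return cosine(src_dates, tgt_dates)

# Hypothetical per-day mention counts of a word in two comparable news corpora.
src_dates = {"2007-01-12": 8, "2007-01-13": 5, "2007-03-02": 1}
tgt_dates = {"2007-01-12": 6, "2007-01-13": 4, "2007-03-02": 2}
print(candidate_score(src_dates, tgt_dates, src_relfreq=3e-4, tgt_relfreq=4e-4))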

Besides Information Retrieval techniques, Schafer and Yarowsky (2002) also use Rapp’s co-occurrence vectors in combination with distance measures. They start with a stochastic transducer to find easily detectable cognate pairs, compute co-occurrence patterns to extend the lexicon, then train the transducer on these new words, and iterate several times. Surprisingly, Hwa et al. (2006) do not consider the use of this mixed model in their study – it seems ideal for two closely related languages like Modern Standard Arabic and Levantine Arabic.

De Yzaguirre, Ribas, Vivaldi, and Cabré (2000) developed a framework to improve alignments of parallel corpora for cognate language pairs. They compute the Word Similarity Coefficient (WSC) to measure the similarity of cognate word pairs. This coefficient depends on three variables: the Graphematic Similarity Coefficient (GSC, see below), the part-of-speech information of the two words, and the relative position of the two words in the aligned sentence pairs. Thus, like Schafer and Yarowsky (2002), they use syntactic and semantic clues to improve the word pair mappings generated by string distance measures (represented here by the GSC). The GSC is computed from the number of letters shared between the words, the position of these letters and the difference in the length of the words. Despite its different formulation, this measure is comparable to Levenshtein distance. In order to further improve their model, de Yzaguirre et al. (2000) do not compute the GSC from the observed word forms, but from their “SIMEX-reductions”3. These reductions are used to minimize the impact of regular orthographical and morphological differences between the two languages. SIMEX applies some transformations to both of the two candidate words. For example, it reduces both Catalan autoritat and Spanish autoridad to the generic form autoritat; the two SIMEX-reductions are identical in this case, and the GSC is optimal. However, it is unclear whether these reductions are performed on an ad-hoc basis for each word pair, or whether they are generalized in a set of rules. Moreover, the choice of the generic form can be questioned: there is no apparent reason to choose autoritat rather than autoridad or even autoritad as the common generic form for the example given above.

Kondrak and Sherif (2006) use phonetic similarity models for cognate word identification. In cognate language pairs, not all word pairs are cognate pairs, i.e. not all word pairs have similar forms due to a common historical origin. Therefore, it is important to first identify the cognate pairs. They compare different similarity measures used before (Mann and Yarowsky 2001; Ristad and Yianilos 1998; Kondrak 2002) and propose some novel measures based on Dynamic Bayesian Nets and Pair HMMs.

While Ristad and Yianilos (1998), Jansche (2003) and their predecessors use transducers to model intra-speaker and inter-speaker variation (i.e. accent), Mann and Yarowsky (2001) and others use them to model the variation between the standardized versions of different languages of the same family. We will take an intermediate approach: while we abstract away from intra-speaker variation, we will use two dialectal varieties (one of which is not standardized) of the same language group. In that sense, our work can also be compared to Riesa, Mohit, Knight, and Marcu (2006). They use a simple string distance model, based on some empirically selected transformation rules, to reduce the orthographic variation of non-standardized Iraqi Arabic to one canonical form.

3 De Yzaguirre et al. (2000) do not specify the meaning of the abbreviation SIMEX.


2.5 String distance measures and dialectology

String distance measures are also used in dialectometry. Dialectometry is a field of dialectology that assesses differences between dialects with objective, numerical measures (Goebl 1982). While these differences can be measured on any kind of linguistic data, the most convenient approach is to compare words of different dialects with string distance measures (Kessler 1995; Nerbonne et al. 1999; Heeringa 2004). The results of these comparisons can then be used for a cluster analysis to obtain a classification of the dialects. The underlying distance measures should produce results that coincide as much as possible with the distance judgements of native speakers.

Heeringa et al. (2006) present different measures based on Levenshtein distance. They include models that use bigrams and trigrams as basic entities (instead of single letters), and other models that use normalized distances to minimize the impact of the lengths of the compared words. Heeringa et al. (2006) implement Mann and Yarowsky’s intuition of distinguishing vowels and consonants in another way: they introduce constraints on the type of allowed substitutions, only allowing vowel-to-vowel and consonant-to-consonant substitutions. Heeringa et al. (2006) test their models on Norwegian and Dutch data, but find very similar results for all of their models. Nerbonne and Siedle (2005) only work with pure Levenshtein distance, but focus on data from German dialects.

Heeringa (2004) discusses several models that distinguish phonemes on the basis of their distinctive features. Kondrak (2002) also proposes a feature-based model, but applies it to historical linguistics: its purpose is to find cognate word pairs to be used in the reconstruction of a proto-language.

It might be useful to compare the role of string distance in the lexicon induction task to its role in dialectometry. For lexicon induction, we use a given source word and a given distance metric to generate target words. The adequacy of the induced target words is measured by semantic criteria: the words must be translations of each other, and therefore have the same signification. The quality of the distance metric as a whole is measured by the percentage of correct word pairs induced. In dialectometry, a given word pair and a given distance metric are used to compute a distance value. The adequacy of the computed distance value is measured by its correlation with human judgements about the perceived distance of the word pair. The quality of the distance measure is again measured by the overall correlation with human judgements. While these two use cases may seem very different, the dialectometric task corresponds in fact to the training stage of an adaptive metric used for lexicon induction. In this training stage, a given word pair and a (possibly inaccurate) distance metric are used to compute a distance value. At each step, the distance metric is adapted so as to gradually approach minimal distance values for all word pairs. Finding dialectometric models that correlate best with given human judgements can thus be based on the same techniques as finding the best parameters for lexicon induction models.

2.6 Concluding remarks

String distance metrics are used for detecting and generating phonetic realizations and orthographical variants of words. They are also used for expressing dialectological intuitions in numbers, and they have proven useful for detecting and classifying word mappings between languages of the same family. However, lexicon induction for cognate languages need not rely solely on string distance measures. While string distance measures only apply to the forms of the words, several studies have incorporated measures that compute some kind of content similarity, generally based on occurrence and co-occurrence patterns. While most of the work presented in the following chapters deals with string distance measures, we also hope to show how alternative features can be integrated into our framework.


Chapter 3

Data

Like any empirical model of natural language processing, automatic lexicon induction depends very much on the nature and amount of available data. The data define the strategies and methods to be used in the development of the models. The data also determine the expectations one may have about the results. Finally, a succinct presentation of the data is also a convenient way to introduce the architecture of the proposed framework.

3.1 Available Data

When working with Swiss German dialects and Standard German, one of the most striking contrasts is the great difference in the amount of available data.

Without taking into account the dialectal variation, there are more than 90 million native speakers of German. In terms of speakers, German is the first language in the European Union, and among the top ten languages of the world. It is the third most popular language used by websites, and the most popular language for translation: German accounts for the most written translations into and from a language.1 All speakers of German have good active and passive competence in the standard variety, and all of them are able to use the standard variety in written contexts. In other words, Standard German texts are produced and consumed by more than 90 million people. The economic force of the German-speaking countries confers additional importance on Standard German. From a scientific point of view, German is an interesting language because of its verb-second word order and its rich morphology. For these reasons, Standard German has become an attractive language for research in computational linguistics. Just to cite some examples, there are syntactically annotated corpora like NEGRA2 or TIGER3. The Europarl corpus4 contains German text aligned with translations of other European languages.

1 All data stem from http://en.wikipedia.org/wiki/German_language, 1 March 2007. For older data, see Ammon (1991).
2 http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/
3 http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/
4 http://people.csail.mit.edu/koehn/publications/europarl/

This rather comfortable situation for Standard German contrasts with the scarcity of dialect data. There are about 5 million speakers of Swiss German dialects. However, they do not produce official written dialect texts, and they do not require texts addressed to them to be written in dialect. The few dialect texts that exist do not present uniformity, but dialectal variation. Written dialect material essentially comes from three sources: dialect literature, data collections obtained from speech transcription, and material collected from electronic communication media.

In the 1980s and the early 1990s, many dialect books were published, most of them in Bernese dialect. The use of dialect in artistic contexts like literature and music has traditionally been most popular in the area of Bern. The advantage of using dialect literature in linguistic applications is that such books constitute reasonably large corpora with a standardized orthography, written in a less colloquial style than the other, more informal data sources. However, the exploitation of such corpora is made difficult by copyright restrictions and by the lack of availability in electronic formats.

Another data source is speech transcriptions. Since the beginning of the 20th century, there has been intensive dialectological research in Switzerland. This research has been based on oral interviews with dialect speakers5. However, only a small percentage of these interviews has been fully transcribed, and even less data are accessible electronically. Other speech transcriptions may be available from regional radio or television stations. However, using speech transcriptions for lexicon induction can be problematic. Usually, transcriptions aim for phonetic and prosodic adequacy with the spoken source. Preprocessing would be required to unify the transcription conventions and to delete prosodic marking that is not relevant for lexical induction. Moreover, performance errors like repetitions, omissions and incomplete sentences are frequent in speech. While this is a minor problem in the word-based approach proposed here, it excludes robust syntactic processing, which would be needed to improve the performance of the framework.

5 Most of these data are conserved at the Phonogrammarchiv of the University of Zurich (http://www.phonogrammarchiv.unizh.ch).

The spread of electronic communication media in the last decade has increased the written use of dialects in a spectacular way. For example, there are websites devoted to Swiss German dialects that contain dictionaries with the most typical dialect words. One might argue that, given the existence and free availability of such dictionaries, translation lexicon induction is not necessary at all. However, we see the two approaches as complementary. Automatic lexicon induction can be expected to induce cognate word pairs only – words that differ only slightly and which are often “too obvious” to be found in dictionaries. In contrast, dictionaries mostly focus on typical, sometimes archaic, dialect words without etymological relation to their Standard German translations.

There are also numerous blogs written in Swiss German. But as some blogs mix dialect and Standard German texts, and others are travelogues written on keyboards without umlauts, manual correction is necessary in any case. Moreover, as stated in Section 1.2, a lot of e-mail, chat and SMS communication in German-speaking Switzerland is done in dialect. However, such data are not easily accessible and may present the same limitations as blog texts. Nevertheless, chat data have been exploited for dialectological research (Siebenhaar 2005), and e-mail and SMS data are being collected for a sociolinguistically oriented research project on youth language (Werlen 2004).

One potentially promising data source is the Alemannic version of Wikipedia6, but it has some shortcomings. First, it does not cover only Swiss German dialects, but also other dialects of the Alemannic group. As the project was initiated by Alsatians, most articles are written in Alsatian dialects. Second, the collaborative nature of Wikipedia allows one article to be written by several authors, possibly in different dialects. This makes the texts inconsistent and difficult to exploit. Third, many articles are translations of Standard German Wikipedia entries. In this translation process, the syntactic structures are often insufficiently adapted, which leads to unnatural and sometimes even ungrammatical dialect text.

For this first study on Swiss German dialects, we chose to use dialect literature, as this genre seemed to have the fewest drawbacks. But once the induction methods prove sufficiently robust, they should also be tested on less polished data.

6 http://als.wikipedia.org

[Figure 3.1: Schematic representations of the data flow for the evaluation of a given model (left), and for the training of an adaptive model (right). For evaluation, a model uses the bilingual gold standard corpus and the target lexicon to produce induced words, which are then evaluated; for training, the bilingual gold standard corpus is used to adapt the model.]

3.2 Data acquisition

Our framework needs two data sources. The first is a list of dialect words that need to be associated with their Standard German translations by the model. The second is a lexicon of Standard German words, among which the model chooses the proposed translations. However, if we want to evaluate the performance of a particular model, we also need to know the correct translations. Moreover, some of the models need to be trained with correct word pairs. Thus, we also need some held-out data to perform these trainings. These data will be concentrated in a gold standard lexicon – a word pair list with dialect words and their correct translations. The different subtasks may or may not use parts of these data sources (see Figure 3.1). The following sections present these two data sources in detail.


3.2.1 Building a target lexicon

The target lexicon is essentially a list of Standard German words. It needs to be as complete as possible, in order not to be too strict a filter. It should also cover the different morphological forms of the words, which is of crucial importance for a morphology-rich language like German. As a lot of research has been done on German, existing resources can be reused. In our case, we used the lexical data associated with the Fips project (Wehrli 2004).

The Fips lexicon is a collection of lexemes. Each lexeme contains general information (like part-of-speech tags), a set of subcategorization frames (if necessary), and a set of inflected word forms with their respective morphological features. The inflected forms are generated with the help of morphological rules; the paradigms are thus complete. We only use the word form strings in our lexicon. The morphological and syntactic information may be exploited in future work. The Fips lexicon includes some named entities like town names, and some compound nouns. In total, it contains about 202'000 inflected word forms. Following Swiss High German orthography, all occurrences of ß are replaced by ss.
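A minimal sketch of how such a target lexicon could be loaded might look as follows. The file name and the one-word-form-per-line format are assumptions made for illustration; the only operation taken from the text is the Swiss High German normalization of ß to ss.

def load_target_lexicon(path):
    """Load the list of Standard German word forms, replacing ß by ss."""
    lexicon = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            form = line.strip()
            if form:
                lexicon.add(form.replace("ß", "ss"))
    return lexicon

# Hypothetical export of the Fips word forms, one inflected form per line.
# lexicon = load_target_lexicon("fips_wordforms.txt")  # about 202'000 forms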

3.2.2 Building a gold standard corpus

The corpus used for testing and training the models is extracted from a work of dialect literature, written in a Bernese dialect (Beutler 1996). It contains 57 short texts of two pages each. They had been presented as “thoughts of the day” on a Swiss German radio station. It is one of the rare books that presents the same texts in dialect and in Standard German, thus constituting a small parallel corpus. However, the present work only uses the dialect texts, reserving the Standard German counterparts for future research. As the book was not available in electronic form, the texts had to be scanned and converted with optical character recognition (OCR) software. While most OCR programs use integrated word lists to improve their performance, this feature had to be avoided here.7 The open source software Ocrad8 did not rely on such word lists and proved quite efficient. However, all texts had to be proof-read in order to eliminate recognition mistakes.

7 In the absence of Swiss German word lists, the automatic language detection of the OCR software would use Standard German word lists, which would lead to bad results. Swiss German word lists could not be used simply because they do not exist, and because the creation of such lists was the very goal of this task.
8 http://www.gnu.org/software/ocrad/ocrad.html

From these texts, we extracted a simple list of dialect words. By using only this word list, no copyright issues should arise – there are no copyrights on words. One additional preprocessing step was applied to analyse sandhi phenomena, i.e. the insertion of a binding n between two vowels at word boundaries in Swiss German. For example, the word Nichte ‘niece’ is usually used in the cited form, but it can also be found in the form Nichten if the following word begins with a vowel: d Nichten isch nett ‘the niece is nice’. In order to avoid such double word entries, all ns in the context <vowel>n<space><vowel> were stripped. A small number of proper nouns were erroneously stripped9 and had to be corrected manually. Another difficulty was that the dialect texts contained quotations in French, English and Standard German. These words were annotated as foreign words in order to exclude them from the experiments.
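A minimal sketch of this stripping step, assuming it is applied to the running text before word extraction, could look as follows. The vowel inventory used here is an assumption; real corpus data may need additional characters and an exception list for proper nouns such as Berlin.

```python
import re

# Sketch: remove the sandhi-binding n in the context <vowel>n<space><vowel>.
# The vowel class below is an assumption; further characters may be needed,
# and proper nouns (Berlin, London) would have to be protected separately.
VOWELS = "aeiouäöüy"

def strip_sandhi_n(text):
    pattern = re.compile(rf"(?<=[{VOWELS}])n(?=\s[{VOWELS}])", re.IGNORECASE)
    return pattern.sub("", text)

# Example: strip_sandhi_n("d Nichten isch nett") -> "d Nichte isch nett"
```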

In a second step, the dialect words had to be annotated with the correct Standard German translations. Each Swiss German word was translated to one single Standard German word, respecting the constraints on syntagmatic and paradigmatic one-to-one correspondence (see Section 1.3). Of course, this simplifying assumption is problematic for some word pairs. For example, the dialect word liecht presents the same ambiguity as its English counterpart light and can correspond either to the Standard German adjective leicht or to the noun Licht. In this case, the second sense was chosen arbitrarily. However, the adjective sense was present in the corpus with the words fäderliecht ‘light as a feather’ and liechter ‘lighter’. In other cases, an etymologically related secondary sense was preferred to a non-etymologically related primary sense. For example, fyschter is a generic adjective meaning ‘dark’. However, it was not translated by the generic Standard German adjective dunkel ‘dark’, but rather by the etymologically related adjective finster with the specific sense ‘darksome, sable’. For some other dialect words, it was difficult to find an adequate translation. For example, Gäggelizüg describes any kind of little gadget that is difficult to manipulate, but Standard German translations like Kleinkram do not convey the “difficult to manipulate” sense. Probably the most typical word of Bernese German, the sentential adverb allwä, has about five fundamental senses. In such cases, the Standard German text of the corpus was used to choose one approximate translation.

9 For example, Berli instead of Berlin, Londo instead of London.

Other words did have adequate translations, but these translations were etymologically unrelated. For example, Assmäntu is translated by Latz ‘bib’, gnepfe is translated by neigen ‘to tend to’, and Himugüegeli corresponds to Marienkäferchen ‘ladybird’. Moreover, the high-frequency verbs luege ‘look’ and lose ‘listen’ turn out to be less distant from their English counterparts than from their Standard German translations schauen and (zu)hören. In order to take such discrepancies into account, all word pairs were manually annotated as cognate or non-cognate. These annotations allowed for the building of different corpus variants. The Full corpus (4731 word pairs) contains cognate and non-cognate word pairs, but foreign words and Standard German words occurring on the dialect side are excluded. The Cognate corpus (4384 word pairs) is a subset of Full and contains only word pairs annotated as cognates. Both corpora were divided into two equally-sized parts; one half is used for testing the models, while the other half is reserved for training some of the models.
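A possible way of deriving the two corpus variants and the test/training split is sketched below. The tuple format, the exclusion flag and the random shuffling before splitting are all assumptions about the bookkeeping, not a description of how the actual gold standard file is organized.

```python
import random

# Sketch: build the Full and Cognate corpus variants and split each into
# equally-sized test and training halves. Each gold standard entry is
# assumed to be (dialect_word, german_word, is_cognate, is_excluded),
# where is_excluded covers foreign words and Standard German words
# appearing on the dialect side.

def build_variants(entries, seed=0):
    full = [(d, g) for d, g, cog, excl in entries if not excl]
    cognate = [(d, g) for d, g, cog, excl in entries if not excl and cog]
    variants = {}
    for name, pairs in (("Full", full), ("Cognate", cognate)):
        shuffled = list(pairs)
        random.Random(seed).shuffle(shuffled)
        half = len(shuffled) // 2
        variants[name] = {"test": shuffled[:half], "train": shuffled[half:]}
    return variants
```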

The cognate/non-cognate annotation has been made on the basis of linguistic intuitions, without formal criteria. Mann and Yarowsky (2001) have presented a simple measure to perform this annotation automatically. They count the number of letter substitutions, insertions and deletions needed to transform one word into its translation. If the minimal number of operations is smaller than the arbitrarily fixed threshold of 3, the word pair is considered cognate. In other words, a word pair is considered cognate if its Levenshtein distance is less than 3 (see Section 4.2). We compared our manual annotation to Mann and Yarowsky's heuristic. Of the 2365 tested word pairs, 2200 were manually annotated as cognates. The heuristic detected 1859 cognate pairs, of which 1849 coincided with the manually annotated pairs. Thus, the heuristic yields quite precise results (only 10 false positives), but its definition of cognate pairs is much more restrictive than the one implicitly admitted for the manual annotation. For the experiments, we will stick to the manual annotations.
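The heuristic itself is easy to reproduce; the sketch below implements the plain Levenshtein distance with unit costs and the threshold of 3 described above. The function names and the lowercasing step are illustrative choices, not taken from Mann and Yarowsky (2001).

```python
# Sketch of the Mann & Yarowsky (2001) cognate heuristic:
# a word pair counts as cognate if its Levenshtein distance is below 3.

def levenshtein(s, t):
    # Standard dynamic-programming edit distance with unit costs.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def is_cognate(dialect_word, german_word, threshold=3):
    return levenshtein(dialect_word.lower(), german_word.lower()) < threshold

# Example: is_cognate("liecht", "Licht") -> True (edit distance 1)
```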

The gold standard (bilingual) lexicon and the target (monolingual) lexicon are two completely independent resources. This means that the Standard German side of the gold standard corpus may contain words that do not occur in the lexicon. Most of these words are compound nouns. As German permits virtually unlimited composition of nouns, it is illusory to list all compound nouns explicitly in the lexicon. For example, Mietzinsaufschlag ‘rent increase’ occurs in the corpus, but not in the lexicon, although its three components Miete ‘rent’, Zins ‘interest’ and Aufschlag ‘increase’ are present. This problem also affects verbs with detachable particles. As the verb stems and the particles are stored separately, they will be recognized correctly in the phrase er zählt auf ‘he enumerates’, but the form with the particle attached in front of the stem, used in verb-final contexts, will not be recognized: weil er aufzählt ‘because he enumerates’. Words that can be used in Standard German, but with a marked Swiss connotation, are also absent from the lexicon, for example verstampfen ‘bruise’ or Rappenspalter ‘cheapskate’, literally ‘penny splitter’.
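To make the compound problem concrete, the naive check below asks whether an out-of-lexicon word can be segmented into known lexicon entries by pure concatenation. It is only a sketch: since it ignores linking elements (the dropped e of Miete in Mietzins) and detachable particles, forms like Mietzinsaufschlag or aufzählt would still not be recovered.

```python
# Sketch: naive check whether a word is a concatenation of known lexicon
# entries (case-insensitive). Linking elements and particle verbs are
# deliberately ignored, so many legitimate German compounds are not found.

def decomposable(word, lexicon, min_part=3):
    word = word.lower()
    lex = {w.lower() for w in lexicon}

    def helper(rest):
        if not rest:
            return True
        for i in range(min_part, len(rest) + 1):
            if rest[:i] in lex and helper(rest[i:]):
                return True
        return False

    return helper(word)

# Example with a toy lexicon (real coverage would come from the Fips word list):
# decomposable("Mantelsack", {"Mantel", "Sack"}) -> True
```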

3.3 Phonemes and Graphemes

As explained in the preceding sections, written data is easier to obtain and to handle than phonetic transcriptions. This holds for Standard German as well as for dialects.

However, the suggested lexicon induction methods rely on the pronunciation similarities of the two language varieties. This means that the more the writing conventions of the language varieties differ, the more they may obscure underlying phonetic similarities and deteriorate the induction performance. Therefore, it is important to look at the correspondence between pronunciation units (phonemes) and writing units (graphemes) of both language varieties discussed.

Grapheme   Phoneme(s)        Grapheme      Phoneme(s)
e          [e], [E], [@]     ah, eh, ...   [a:], [e:], ...
v          [f], [v]          ie            [i:]
ch         [x], [ç]          ng            [N]
ck         [k]               z             [ts]
sch        [S]               x             [ks]

Table 3.1: Complex phoneme-grapheme correspondences in Standard German. The three pronunciations of the letter e are illustrated in the word Erdbeben [Erdbe:b@n] ‘earthquake’.

In Standard German, the correspondence between letters and sounds is one-to-one in most cases. There are fewer exceptions than in languages like French or English. Table 3.1 lists some exceptions. These relatively straightforward correspondences should not penalize us if we use graphemic data instead of phonetic data. However, in the context of lexicon induction, we should not look at this correspondence in an isolated way, but rather take into account the differences between the two language varieties we intend to use.
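To make the point about graphemic data concrete, the sketch below applies the correspondences of Table 3.1 (and identity for all other letters) with a longest-match scan. The first listed variant is always taken, so context-dependent alternations such as [x] versus [ç] are deliberately over-simplified; this is an illustration of the grapheme-phoneme relation only, not a component of the induction models.

```python
# Sketch: a very rough grapheme-to-phoneme conversion for Standard German,
# using only the correspondences of Table 3.1 plus identity for all other
# letters. Ambiguous graphemes are resolved by always taking the first
# listed variant, which is an illustrative simplification.

CORRESPONDENCES = [  # longest graphemes first
    ("sch", "S"), ("ah", "a:"), ("eh", "e:"), ("ie", "i:"),
    ("ch", "x"), ("ng", "N"), ("ck", "k"),
    ("v", "f"), ("z", "ts"), ("x", "ks"), ("e", "e"),
]

def rough_g2p(word):
    word = word.lower()
    phonemes = []
    i = 0
    while i < len(word):
        for grapheme, phoneme in CORRESPONDENCES:
            if word.startswith(grapheme, i):
                phonemes.append(phoneme)
                i += len(grapheme)
                break
        else:
            phonemes.append(word[i])
            i += 1
    return phonemes

# Example: rough_g2p("Schach") -> ['S', 'a', 'x']
```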

Swiss German dialects – as implied by the use of the term “dialect” – do not have a long writing tradition. Therefore, there are no well-established orthographic conventions.

Orthographic rules are often chosen ad hoc. These choices bear on two antithetical principles.10 On the one hand, it is desirable to reflect the pronunciation of the text as truly as possible. On the other hand, one should strive for high readability. As the potential readers are used to reading Standard German texts, striving for readability is in practice equivalent to striving for similarity with the Standard German orthography. When the first dialect texts were written in the Bern area at the beginning of the 20th century, the readability aspect was emphasized. Later on, Dieth (1986) broke with this tradition in favor of a phonetically transparent writing. While his proposal gained significant support in the scientific community, it could not supplant the Bernese tradition. Marti (1972) managed to find a compromise between these two opposing views, so that most contemporary Bernese writing is inspired by his guidelines. However, many authors – and a fortiori blog or SMS writers – do not know about these guidelines, or do not consciously apply them. They rather try to find their own compromise between readability and phonetic faithfulness. For instance, many non-professional writers use the grapheme ä for [@] to emphasize the phonetic nuance between the Swiss German schwa and the Standard German schwa, even if this is suggested neither by Dieth nor by Marti.

10 Cf. http://de.wikipedia.org/wiki/Schweizerdeutsch and http://members.aol.com/minoritas/vergflr.htm. See also Dieth (1986).

Readability-oriented   Phonetically faithful   Standard German   Gloss
Vorbild                Vorbiud                 Vorbild           ‘ideal, example’
Guldesle               Guwdesle                Goldesel          ‘cash cows’, lit. ‘gold donkeys’
Mantelsack             Mantusack               Mantelsack        ‘coat pocket’

Table 3.2: Some examples of [l]-vocalization. The first column represents a readability-oriented transcription of Bernese, while the second column presents a transcription based on phonetic faithfulness. The third column contains the Standard German equivalent. The words in bold face are those which occur in the corpus; they testify to the orthographic variation found in the corpus.

Our data set reflects the opposition between readability and phonetic faithfulness. One of the most typical phonetic characteristics of Bernese Swiss German, the vocalization of [l] to [u], is only partially reflected in the corpus, as illustrated in Table 3.2. Long vowels are written according to the Standard German tradition, i.e. ah for [a:], not aa, with the exception of ie, which represents a diphthong in Swiss German but a long [i:] in Standard German. In contrast to Standard German, our corpus – like Bernese texts in general – distinguishes the two vowels [i] and [I], rendered as i and y respectively.

Our choice of a Bernese dialect corpus is based on the long tradition of dialect literature in that area, and on its importance for research on Swiss dialects. Incidentally, this choice presents some advantages for lexicon induction. Because the writing tradition of Bernese dialects rates the readability constraint highly, the differences between written dialect data and written Standard German data are actually less marked than they would be between phonetically transcribed material. Therefore, the proposed methods of lexicon induction may work less well for texts which are written in a phonetically more transparent way.


Chapter 4

Two-stage models of lexical induction

4.1 The lexicon induction framework

4.1.1 The noisy channel metaphor

Like the standard models of statistical machine translation, our approach to lexicon induction is based on the noisy channel metaphor. According to this metaphor, an observed utterance is not viewed as the starting point of a transformation, but rather as the end point of a hidden transformation whose starting point has to be recovered.

Assume that an observed utterance a of language A is to be translated into an utterance b of language B. Following the noisy channel metaphor, b corresponds to the source utterance, which has been transmitted through a noisy channel and distorted in the process.1 The result of this distortion is a. The translation task is then to recover the original utterance b on the basis of a, that is, a has to be decoded (see Figure 4.1).

The decoding algorithm has two components (see Figure 4.2). The first component, called the channel model, contains information about the noisy channel. It specifies which parts of the utterances are affected by the noise, and in what manner they are affected. The channel model may not be sufficiently precise to retrieve the original utterance unambiguously. Therefore, there is a second component, the source model, which encodes prior knowledge about which utterances are plausible in the source language.
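In its standard Bayesian formulation – given here for illustration; the exact notation adopted later in this chapter may differ – the decoder searches for the source utterance that maximizes the product of the two components:

\[
\hat{b} \;=\; \operatorname*{arg\,max}_{b}\, P(b \mid a)
        \;=\; \operatorname*{arg\,max}_{b}\, \frac{P(a \mid b)\, P(b)}{P(a)}
        \;=\; \operatorname*{arg\,max}_{b}\, P(a \mid b)\, P(b),
\]

where P(a | b) is supplied by the channel model, P(b) by the source model, and the constant factor P(a) can be dropped because it does not depend on b.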

1 In our case, dialect utterances would be viewed as corrupted, imperfect variants of Standard German. While the noisy channel metaphor obviously does not provide any linguistic justification for this claim, it remains a useful concept for computational models.
