When working with Swiss German dialects and Standard German, one of the most striking differences is the amount of available data.

Without taking into account the dialectal variation, there are more than 90 million native speakers of German. In terms of speakers, German is the first language in the European Union, and among the top ten languages of the world. It is the third most popular language used by websites, and the most popular language for translation: German accounts for the most written translations into and from a language.1 All speakers of German have good active and passive competence in the standard variety, and all of them are able to use the standard variety in written contexts. In other words, Standard German texts are produced and consumed by more than 90 million people. The economic force of the German-speaking countries confers additional importance to Standard German. From a scientific point of view, German is an interesting language because of its verb-second word order and its rich morphology. For these reasons, Standard German has become an attractive language for research in computational linguistics. To cite just a few examples, there are syntactically annotated corpora like NEGRA2 or TIGER3, and the Europarl corpus4 contains German text aligned with translations in other European languages.

1 All data stem from http://en.wikipedia.org/wiki/German_language, 1 March 2007. For older data, see Ammon (1991).

This rather comfortable situation for Standard German contrasts with the scarcity of dialect data. There are about 5 million speakers of Swiss German dialects. However, they do not produce official written dialect texts, and they do not require texts addressed to them to be written in dialect. The few dialect texts that exist do not present uniformity, but dialectal variation. Written dialect material essentially comes from three sources: dialect literature, data collections obtained from speech transcription, and material collected from electronic communication media.

In the 1980s and the early 1990s, many dialect books were published, most of them in Bernese dialect. The use of dialect in artistic contexts like literature and music has traditionally been most popular in the area of Bern. The advantage of using dialect literature in linguistic applications is that such books constitute reasonably large corpora with a standardized orthography, written in a less colloquial style than the other, more informal data sources. However, the exploitation of such corpora is rendered difficult by copyright restrictions and by lack of availability in electronic formats.

Another data source is speech transcriptions. Since the beginning of the 20th century, there has been intensive dialectological research in Switzerland. This research has been based on oral interviews with dialect speakers5. However, only a small percentage of these interviews has been fully transcribed, and even fewer data are accessible electronically. Other speech transcriptions may be available from regional radio or television stations. However, using speech transcriptions for lexicon induction can be problematic. Usually, transcriptions aim for phonetic and prosodic adequacy with respect to the spoken source. Preprocessing would be required to unify the transcription conventions and to delete prosodic marking that is not relevant for lexicon induction.

2 http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/

3 http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/

4 http://people.csail.mit.edu/koehn/publications/europarl/

5 Most of these data are conserved at the Phonogrammarchiv of the University of Zurich (http://www.phonogrammarchiv.unizh.ch).

Moreover, performance errors like repetitions, omissions and incomplete sentences are frequent in speech. While this is a minor problem in the word-based approach proposed here, it excludes robust syntactic processing, which would be needed to improve the performance of the framework.
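As a rough illustration of the preprocessing step mentioned above, the following sketch strips prosodic marking and unifies a transcript line for lexicon induction. The marker inventory (slashes and arrows for pauses and intonation, double parentheses for transcriber comments, square brackets for overlaps) is hypothetical; real transcription conventions differ between projects and would have to be inventoried first.

```python
import re

# Hypothetical marker inventory: slashes and arrows for pauses/intonation,
# double parentheses for transcriber comments, square brackets for overlaps.
PROSODY_MARKS = re.compile(r"[/\\=↑↓]|\(\(.*?\)\)|\[[^\]]*\]")
LENGTHENING = re.compile(r"(\w)\1{2,}")  # collapse notated lengthening: "jaaa" -> "ja"

def normalize_transcript(line: str) -> str:
    """Strip prosodic marking and unify a transcript line for lexicon induction."""
    line = PROSODY_MARKS.sub(" ", line)        # remove prosodic symbols and comments
    line = LENGTHENING.sub(r"\1", line)        # undo notated vowel lengthening
    line = re.sub(r"\s+", " ", line).strip()   # normalize whitespace
    return line.lower()

print(normalize_transcript("((lacht)) das isch guet /"))
```

A real pipeline would additionally map project-specific phonetic notation to a common orthography before the lexicon induction step.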

The spread of electronic communication media in the last decade has increased the written use of dialects in a spectacular way. For example, there are websites devoted to Swiss German dialects that contain dictionaries with the most typical dialect words. One might argue that, given the existence and free availability of such dictionaries, translation lexicon induction would not be necessary at all. However, we rather see the two approaches as complementary. Automatic lexicon induction can be expected to induce cognate word pairs only – words that differ only slightly and which are often “too obvious” to be found in dictionaries. In contrast, dictionaries mostly focus on typical, sometimes archaic, dialect words without etymological relation to their Standard German translations.
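A simple surface-similarity measure illustrates why induction targets cognate pairs. The sketch below scores hypothetical Swiss German / Standard German word pairs with difflib's SequenceMatcher; it is only an illustration of the cognate versus non-cognate contrast, not the induction method used in this work.

```python
from difflib import SequenceMatcher

def cognate_score(dialect: str, standard: str) -> float:
    """Surface similarity in [0, 1]; high values suggest a cognate pair."""
    return SequenceMatcher(None, dialect.lower(), standard.lower()).ratio()

# Illustrative Swiss German / Standard German pairs; "Anke" ("butter")
# stands in for a typical dialect word without a cognate relation.
pairs = [("Huus", "Haus"), ("Chind", "Kind"), ("Züri", "Zürich"), ("Anke", "Butter")]
for dialect, standard in pairs:
    # Cognate pairs score high; the non-cognate pair scores low.
    print(dialect, standard, round(cognate_score(dialect, standard), 2))
```

In practice, a threshold on such a score separates candidate cognate pairs from pairs that only a hand-built dictionary can cover.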

There are also numerous blogs written in Swiss German. But as some blogs mix dialect and Standard German texts, and others are travelogues written on keyboards without umlauts, manual correction is necessary in any case. Moreover, as stated in section 1.2, a lot of e-mail, chat and SMS communication in German-speaking Switzerland is done in dialect. However, such data are not easily accessible and may present the same limitations as blog texts. Nevertheless, chat data have been exploited for dialectological research (Siebenhaar 2005), and e-mail and SMS data are being collected for a sociolinguistically oriented research project on youth language (Werlen 2004).
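The umlaut problem mentioned above can be made concrete with a naive digraph-rewriting sketch. The rules below are a deliberate oversimplification: as the second example shows, "ue" does not always stand for an umlaut, which is precisely why manual correction remains necessary.

```python
# Naive digraph-to-umlaut rewriting for text typed on umlaut-less keyboards.
# These rules are a deliberate oversimplification: "ue" is not always an
# umlaut digraph, so the output still requires manual correction.
REWRITES = [("ae", "ä"), ("oe", "ö"), ("ue", "ü"),
            ("Ae", "Ä"), ("Oe", "Ö"), ("Ue", "Ü")]

def restore_umlauts(text: str) -> str:
    for digraph, umlaut in REWRITES:
        text = text.replace(digraph, umlaut)
    return text

print(restore_umlauts("schoen"))  # correctly restored
print(restore_umlauts("Dauer"))   # wrongly rewritten: "ue" here is no umlaut
```

Disambiguating such cases would require at least a lexicon lookup, so a fully automatic cleanup of blog data cannot be assumed.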

One potentially promising data source is the Alemannic version of Wikipedia6, but it has some shortcomings. First, it does not only cover Swiss German dialects, but also other dialects of the Alemannic group. As the project has been initiated by Alsatians, most articles are written in Alsatian dialects. Second, the collaborative nature of Wikipedia allows one article to be written by several authors, possibly in different dialects. This makes the texts inconsistent and difficult to exploit. Third, many articles are translations of Standard German Wikipedia entries. In this translation process, the syntactic structures are often insufficiently adapted, which leads to unnatural and sometimes even ungrammatical dialect text.

For this first study on Swiss German dialects, we chose to use dialect literature, as this genre seemed to have the fewest drawbacks. But once the induction methods prove sufficiently robust, they should also be tested on less polished data.

6 http://als.wikipedia.org

[Figure omitted: two data-flow diagrams with the components Model, Bilingual gold standard corpus, Target lexicon, Induced words and Evaluation (left), and Model, Bilingual gold standard corpus and Adaptation of the model (right).]

Figure 3.1: Schematic representations of the data flow for the evaluation of a given model (left), and for the training of an adaptive model (right).