Conference Presentation
Reference
Normalizing orthographic and dialectal variants in the ArchiMob corpus of spoken Swiss German
SCHERRER, Yves, SAMARDŽIĆ, Tanja, GLASER, Elvira
Abstract
To study and automatically process Swiss German, it is necessary to resolve the issue of variation in the written representation of words. Due to the lack of a written tradition and to considerable regional variation, Swiss German writing is highly inconsistent, making it hard to establish identity between lexical items that are felt to be "the same word". This poses a problem for any task that requires establishing lexical identities, such as efficient corpus querying for linguistic research, semantic processing, and information retrieval. In the context of building the general-purpose electronic corpus ArchiMob, we have chosen to create an additional annotation layer that maps the original word forms to unified normalised representations. In this paper, we argue that these normalised representations can be induced in a semi-automatic fashion using techniques from machine translation. A lexical unit can be pronounced, and therefore transcribed, in various ways, due to dialectal variation, intra-speaker variation, code-switching or occasional transcription errors. In order to establish lexical identities between the [...]
SCHERRER, Yves, SAMARDŽIĆ, Tanja, GLASER, Elvira. Normalizing orthographic and dialectal variants in the ArchiMob corpus of spoken Swiss German. In: 6th Days of Swiss Linguistics, Geneva, 2016.
Available at:
http://archive-ouverte.unige.ch/unige:90850
Disclaimer: layout of this document may differ from the published version.
Normalizing orthographic and dialectal variants in the ArchiMob corpus of spoken Swiss German
Yves Scherrer (UNIGE) Tanja Samardžić (UZH) Elvira Glaser (UZH)
9DSL, Geneva, June 2016
The ArchiMob corpus
Archimob (Archives de la mobilisation) is an oral history project aimed at collecting and archiving testimonies from the Second World War period in Switzerland. The Archimob association was founded in 1998 by filmmaker Frédéric Gonseth and has more than 40 members, mainly historians and filmmakers.
The ArchiMob corpus
Key figures:
555 interviews recorded on video, 1–2 hours long
- Recorded in 1999–2001
- Informants from all linguistic regions, both genders, different social backgrounds, different political views
44 interviews selected for transcription
- Swiss German dialect speakers
- Least exposed to dialect/language contact
- Currently, 34 transcriptions completed
Annotation steps:
1. Transcription
2. Normalization
3. Part-of-speech tagging
Transcription
Transcribers are native Swiss German speakers
Specific transcription guidelines were developed at the UZH German Department, based on Dieth's recommendations
Transcriptions are aligned with the audio at the utterance level
Excerpt from the transcription guidelines (Universität Zürich, Deutsches Seminar, Lehrstuhl Prof. Dr. E. Glaser):
For the sake of readability, slashes may also be used to mark an (intonationally marked) unit of sense, even when there is in fact no measurable pause.
Filled pauses are rendered as ehm (with the letter h!) and placed between slashes: / ehm /.
Interruptions in the recording, such as cassette changes, are noted in square brackets.
When a word is broken off, the slash is placed directly after the broken-off word: aber die Fr/ Fa/ Frau landolt
When a sentence is broken off, a slash is only inserted if the speaker actually pauses (falters).
Passages in other languages (English, French, but also Standard German utterances) are transcribed according to the same principles as dialect: de eersch Film wäiss ich naa isch Alooma di Härin däär Süüdsee gsii.
Some sounds naturally cannot be rendered exactly with this spelling system. In each case, the character that comes closest to the sound in question should be used. English <th> is rendered as th, the soft Standard German <ch> as ch.
7. Transcription examples
7.1. Zürich-Höngg:
{06:19} 0115 I hät Iri Mueter no es zwäits Maal ghüraate
{06:22} 0116 G näi näi si hät nüme ghüraatet
{06:26} 0117 I sii isch Iri Mueter dänn au / ehm /
{06:31} 0118 jaa politisch ãgaschiirt gsii i de Arbäiterbewegig
{06:37} 0119 oder Iri Gschwüschterti ufgrund ez vo Irem Schiksaal
{06:43} 0120 G also aso vo miim Halbbrüeder
{06:47} 0121 wäiss ich das er au (big) bi dä Roote Falke gsii isch
{06:51} 0122 I was häisst au
{06:52} 0123 G ja wien iich also
{06:54} 0124 iich schpööter natüürli dänn i de Viirs/ also aafangs Viirzgerjaare
{06:59} 0125 und äär natüürli vil früener wo s im Entschtaa gsii isch
{07:02} 0126 es isch im i de Zwänzgerjaare im Entschtaa gsii
7.2. Bern (Köniz):
{05:04} 0140 I also
{05:06} 0141 wämer daa wänd aachnüpfen aa
{05:08} 0142 a vorhär
{05:09} 0143 Sii sind i däm Fluug- und Mäldebeobachtigsdienscht händ Sii sich freiwilig gmäldet
{05:12} 0144 chönd Si no chli
Transcription
Circumstances of the transcription process:
Three phases (2006–2012, 2011–2013, 2015)
Four transcribers
Different tools (none, FOLKER, EXMARaLDA)
Transcription guidelines were gradually simplified over time
- No uppercase/lowercase distinction
- No distinction of close/open vowels (ò ↔ o)
- No distinction of long/short fricatives (chch ↔ ch)
Why normalize?
Variation in the transcriptions:
- Transcription inconsistencies
- Dialectal variation (informants of different dialectal origins)
- Intra-speaker variation
- Code-switching (Swiss German – Standard German)

Example: min maa het immer gsaait

                            min                 maa   het            immer             gsaait
Variants in the same text   mi, mii, miin       ma    hät                              gsait
Variants in other texts     mine                      hed, hèd, hèt  ime, imer, emmer  gsäit, gsääit
Code-switching              määin, mäin, main   man
Normalization
Goal: Establish identities between forms that are felt to be "the same word"
Guidelines:
- Normalize word by word, keeping the original syntax
- Use Standard German forms whenever they correspond in etymology and meaning
- Otherwise, create etymologically motivated artificial forms (based on Idiotikon lemmas if possible)
- Disambiguate ambiguous dialect forms: a dialect form may correspond to more than one normalized word
- Difficult cases are listed
Transcription   Normalization
jaa             ja
de              dann
het             hat
me              man
no              noch
gluegt          gelugt
tänkt           gedacht
dasch           das ist
ez              jetzt
de              der
genneraal       general
Other examples
öpper           etwer
gheie           geheien
go              gan
hettemers       hätten wir es
bimene          bei einem
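These identities are what makes corpus querying tractable: a single normalized form retrieves all of its dialectal and orthographic variants. Below is a minimal Python sketch under that assumption; the token list and variable names are illustrative, not the released corpus API.

```python
from collections import defaultdict

# Hypothetical (transcribed form, normalized form) pairs, as produced
# by the word-by-word normalization layer described above.
tokens = [
    ("jaa", "ja"), ("de", "dann"), ("het", "hat"), ("me", "man"),
    ("hät", "hat"), ("hed", "hat"), ("hèt", "hat"), ("tänkt", "gedacht"),
]

# Index the transcription variants by their normalized form.
variants = defaultdict(set)
for form, norm in tokens:
    variants[norm].add(form)

# One normalized query now finds all attested variants of "hat".
print(sorted(variants["hat"]))  # ['hed', 'het', 'hät', 'hèt']
```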
Learning to normalize
Currently, 6 texts (out of 34) have been manually normalized
Normalization of a text takes between 30 and 60 hours, depending on length, dialect and annotator experience
Can we speed up the normalization process for the remaining texts?
Two experimental stages:
Cross-validation Develop automatic normalization models using ~80% of each of the 6 texts[training]
Apply models to remaining ~20% of each of the 6 texts[test]
Bootstrapping Recreate best model using 100% of the 6 texts[training]
Apply model to new texts[test]
Three types of dialect forms in test: Cross-validation:
Unique Dialect forms with exactly one normalization in training 47%
Ambiguous Dialect forms with more than one normalization 41%
New Dialect forms that have not been observed, and for which no
normalization is available 12%
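As a concrete reading of the three categories, here is a minimal Python sketch, assuming the training data is available as plain (form, normalization) pairs; all names are illustrative.

```python
from collections import defaultdict

def categorize(train_pairs, test_forms):
    """Split test word forms into unique / ambiguous / new, according to
    how many distinct normalizations each form received in training."""
    norms = defaultdict(set)
    for form, norm in train_pairs:
        norms[form].add(norm)

    categories = {"unique": [], "ambiguous": [], "new": []}
    for form in test_forms:
        n = len(norms.get(form, ()))
        key = "new" if n == 0 else ("unique" if n == 1 else "ambiguous")
        categories[key].append(form)
    return categories

# Example: 'de' is ambiguous (dann/der), 'jaa' is unique, 'öppis' is unseen.
train = [("jaa", "ja"), ("de", "dann"), ("de", "der"), ("het", "hat")]
print(categorize(train, ["jaa", "de", "öppis"]))
```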
Cross-validation experiments
1. Baseline
Do nothing: how many dialect forms are already identical to their normalizations?
Unique: 23.55%   Ambiguous: 22.08%   New: 7.05%   Total: 21.26%
2. Most frequent normalization
Choose the most frequently seen normalization; in case of ties, choose one of the tied normalizations at random.
Unique: 98.13%   Ambiguous: 81.77%   New: 7.05%   Total: 80.69%
3. Character-level machine translation
Normalize new words character by character, using the character correspondences learned from the training data.
Unique: 98.13%   Ambiguous: 81.77%   New: 35.90%   Total: 84.05%
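A minimal Python sketch of methods 1 and 2, under the same assumptions as above (training data as (form, normalization) pairs, random tie-breaking as described); function names are illustrative.

```python
import random
from collections import Counter, defaultdict

def baseline(form):
    """Method 1: leave the dialect form unchanged."""
    return form

def train_most_frequent(train_pairs):
    """Method 2: build a normalizer that picks the most frequently seen
    normalization, breaking ties randomly; unseen ('new') forms fall
    back to the identity baseline."""
    counts = defaultdict(Counter)
    for form, norm in train_pairs:
        counts[form][norm] += 1

    def normalize(form):
        if form not in counts:
            return baseline(form)          # no information for new forms
        ranked = counts[form].most_common()
        best = ranked[0][1]
        return random.choice([n for n, c in ranked if c == best])

    return normalize
```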
Character-level machine translation
[Slide reproduces an excerpt from a published paper on cognate production with character-based statistical machine translation (the COP system, built on the Moses SMT engine): words are treated as sentences and characters as words, so that context-dependent character correspondences, such as English -tion becoming Spanish -ción in nation–nación or addition–adición, can be learned from training pairs.]
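Concretely, character-level MT only requires recasting each word pair as a parallel "sentence" of characters before training a standard SMT system. A minimal sketch of this data preparation, using word pairs from the examples above; the file names are illustrative, and the actual pipeline would pass such files to a toolkit like Moses.

```python
def to_char_tokens(word):
    """Treat characters as tokens: 'gluegt' -> 'g l u e g t'."""
    return " ".join(word)

# Dialect/normalization pairs taken from the normalization examples above.
train_pairs = [("gluegt", "gelugt"), ("tänkt", "gedacht"), ("genneraal", "general")]

with open("train.dial", "w", encoding="utf-8") as src, \
     open("train.norm", "w", encoding="utf-8") as tgt:
    for dial, norm in train_pairs:
        src.write(to_char_tokens(dial) + "\n")  # source side, e.g. 't ä n k t'
        tgt.write(to_char_tokens(norm) + "\n")  # target side, e.g. 'g e d a c h t'
```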
Cross-validation experiments
3. Character-level machine translation
Unique: 98.13%   Ambiguous: 81.77%   New: 35.90%   Total: 84.05%
4. Language model
For ambiguous words, choose the most likely normalization based on the 1–6 surrounding normalized words.
The probabilities are estimated on the ArchiMob normalizations as well as on the TüBa-D/S corpus of spoken Standard German.
Unique: 98.13%   Ambiguous: 71.38%   New: 35.90%   Total: 79.72%
In particular: +16% for ambiguous words with tied counts, but -11% for ambiguous words with non-tied counts.
5. Language model for tied counts only (see the sketch below)
Unique: 98.13%   Ambiguous: 81.96%   New: 35.90%   Total: 84.13%
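A minimal sketch of method 5, assuming the frequency counts from method 2 and a scoring function `lm_score` over sequences of normalized words; `lm_score` is a hypothetical stand-in for an n-gram model estimated on the ArchiMob normalizations and TüBa-D/S.

```python
def normalize_ambiguous(form, left_context, counts, lm_score):
    """Method 5: keep the most frequent normalization when there is a
    clear winner; consult the language model only when counts are tied.
    counts[form] is a Counter of normalizations seen in training;
    left_context is the list of preceding normalized words."""
    ranked = counts[form].most_common()
    best = ranked[0][1]
    tied = [n for n, c in ranked if c == best]
    if len(tied) == 1:
        return tied[0]  # clear frequency winner, as in method 2
    # Tied counts: score each candidate in its normalized context.
    return max(tied, key=lambda n: lm_score(left_context + [n]))
```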
Bootstrapping experiments
The cross-validation setting yields 84.13% correct normalizations, but it is optimistic:
- The test utterances are in a dialect that has already been seen during training.
- The test utterances contain proper nouns that have likely already been seen during training.
How does the normalization method perform when applied to different texts from different dialects?
- Bootstrapping setting
Bootstrapping experiments
1. Apply the model to Test1 and Test2.
2. Evaluate and correct the normalizations in Test1 and Test2.
3. Create an augmented model by adding Test1 and Test2.
4. Apply the model to Test3.
5. Evaluate and correct the normalizations in Test3.
[Figure: the training texts (marked !) and the test texts Test1, Test2 and Test3 (marked ^).]
Training data   Initial            Initial   Initial   Initial   Augmented
Test data       Cross-validation   Test1     Test2     Test3     Test3
Unique          98.13              96.50     98.60     95.28     95.33
Ambiguous       81.96              76.42     87.96     78.42     78.39
New             35.90              50.40     51.47     44.45     42.01
All             84.13              78.08     87.58     78.44     78.90
Discussion
Key figures:
- Automatic methods allow us to correctly normalize about 3/4 of the words, more if the dialects are similar
- Time spent normalizing a document by hand: 30–60 hours
- Time spent correcting an automatically normalized document: 8–10 hours
Recent advances:
- Use character-level machine translation for all types of words
- Apply it to entire utterances, not to isolated words (a sketch of this encoding follows below):
_ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t _ d a s c h _ e z _ d e _ g e n n e r a a l _
_ j a _ d e n _ h a t _ m a n _ n o c h _ g e l u g t _ g e d a c h t _ d a s _ i s t _ j e t z t _ d e r _ g e n e r a l _
- Use a larger external language model (Standard German film subtitles)
Total accuracy on the cross-validation set: >90%
All remaining documents were normalized with this model
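A minimal Python sketch of the utterance-level encoding shown above: `_` marks word boundaries as ordinary characters, which lets the character-level model merge or split words (as in dasch → das ist) instead of operating on isolated tokens. The function names are illustrative.

```python
def utterance_to_chars(utterance):
    """Encode an utterance as one character sequence with explicit
    word boundaries: 'jaa de het' -> '_ j a a _ d e _ h e t _'."""
    tokens = ["_"]
    for word in utterance.split():
        tokens.extend(word)   # one token per character
        tokens.append("_")
    return " ".join(tokens)

def chars_to_utterance(encoded):
    """Invert the encoding: '_ j a _ d e n _' -> 'ja den'."""
    words = "".join(encoded.split()).split("_")
    return " ".join(w for w in words if w)

print(utterance_to_chars("jaa de het me no gluegt"))
```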
Conclusions
Summary:
- First corpus of spoken Swiss German developed using up-to-date technology
- Suitable for studying regional variation and for training natural language processing tools
- Goal: 44 documents, around 700 000 tokens
Status:
- 34 documents (around 500 000 tokens) completely transcribed and aligned with the audio source
- 6 documents manually normalized
- 3 documents automatically normalized and hand-corrected
- 25 documents automatically normalized
- 4 documents manually annotated with part-of-speech tags
- 30 documents automatically tagged
Conclusions
Availability:
Online lookup using Sketch Engine
XML download
Audio/video files on request
http://www.spur.uzh.ch/en/departments/korpuslab