Conference Presentation
Reference
Normalizing orthographic and dialectal variants in the ArchiMob corpus of spoken Swiss German
SCHERRER, Yves, SAMARDŽIĆ, Tanja, GLASER, Elvira
Abstract
To study and automatically process Swiss German, it is necessary to resolve the issue of variation in the written representation of words. Due to the lack of a written tradition and to considerable regional variation, Swiss German writing is highly inconsistent, making it hard to establish identity between lexical items that are felt to be "the same word". This poses a problem for any task that requires establishing lexical identities, such as efficient corpus querying for linguistic research, semantic processing, and information retrieval. In the context of building the general-purpose electronic corpus ArchiMob, we have chosen to create an additional annotation layer that maps the original word forms to unified normalised representations. In this paper, we argue that these normalised representations can be induced in a semi-automatic fashion using techniques from machine translation. A lexical unit can be pronounced, and therefore transcribed, in various ways, due to dialectal variation, intra-speaker variation, code-switching or occasional transcription errors. In order to establish lexical identities between the [...]
SCHERRER, Yves, SAMARDŽIĆ, Tanja, GLASER, Elvira. Normalizing orthographic and dialectal variants in the ArchiMob corpus of spoken Swiss German. In: 6th Days of Swiss Linguistics, Geneva, 2016.
Available at:
http://archive-ouverte.unige.ch/unige:90850
Disclaimer: layout of this document may differ from the published version.
Normalizing orthographic and dialectal variants in the ArchiMob corpus of spoken Swiss German
Yves Scherrer (UNIGE) Tanja Samardžić (UZH) Elvira Glaser (UZH)
9DSL, Geneva, June 2016
The ArchiMob corpus
Archimob (Archives de la mobilisation) is an oral history project aimed at collecting and archiving testimonies from the Second World War period in Switzerland. The Archimob association was founded in 1998 by filmmaker Frédéric Gonseth and has more than 40 members, mainly historians and filmmakers.
The ArchiMob corpus
Key figures:
555 interviews recorded on video, 1–2 hours long
- Recorded in 1999–2001
- Informants from all linguistic regions, both genders, different social backgrounds, different political views
44 interviews selected for transcription
- Swiss German dialect speakers
- Least exposed to dialect/language contact
- Currently, 34 transcriptions completed
Annotation steps:
1. Transcription
2. Normalization
3. Part-of-speech tagging
Transcription
Transcribers are native Swiss German speakers
Specific transcription guidelines were developed at the UZH German Department, based on Dieth's recommendations
Transcriptions are aligned with the audio at the utterance level
Excerpt from the transcription guidelines (Universität Zürich, Deutsches Seminar, Lehrstuhl Prof. Dr. E. Glaser):
For the sake of readability, slashes may also be used to mark an (intonationally marked) unit of sense, even when there is in fact no measurable pause.
Filled pauses are rendered as ehm (with the letter h!) and placed between slashes: / ehm /.
Interruptions in the recording, such as cassette changes, are noted in square brackets.
When a word is broken off, the slash is placed directly after the broken-off word: aber die Fr/ Fa/ Frau landolt
When a sentence is broken off, a slash is only inserted if the speaker actually pauses (falters).
Passages in other languages (English, French, but also Standard German utterances) are transcribed according to the same principles as dialect: de eersch Film wäiss ich naa isch Alooma di Härin däär Süüdsee gsii.
Some sounds naturally cannot be rendered exactly with this spelling system. In each case, the character that comes closest to the sound in question should be used. English <th> is rendered as th, the soft Standard German <ch> as ch.
7. Transcription examples
7.1. Zürich-Höngg:
{06:19} 0115 I hät Iri Mueter no es zwäits Maal ghüraate
{06:22} 0116 G näi näi si hät nüme ghüraatet
{06:26} 0117 I sii isch Iri Mueter dänn au / ehm /
{06:31} 0118 jaa politisch ãgaschiirt gsii i de Arbäiterbewegig
{06:37} 0119 oder Iri Gschwüschterti ufgrund ez vo Irem Schiksaal
{06:43} 0120 G also aso vo miim Halbbrüeder
{06:47} 0121 wäiss ich das er au (big) bi dä Roote Falke gsii isch
{06:51} 0122 I was häisst au
{06:52} 0123 G ja wien iich also
{06:54} 0124 iich schpööter natüürli dänn i de Viirs/ also aafangs Viirzgerjaare
{06:59} 0125 und äär natüürli vil früener wo s im Entschtaa gsii isch
{07:02} 0126 es isch im i de Zwänzgerjaare im Entschtaa gsii
7.2. Bern (Köniz):
{05:04} 0140 I also
{05:06} 0141 wämer daa wänd aachnüpfen aa
{05:08} 0142 a vorhär
{05:09} 0143 Sii sind i däm Fluug- und Mäldebeobachtigsdienscht händ Sii sich freiwilig gmäldet
{05:12} 0144 chönd Si no chli
Transcription
Circumstances of the transcription process:
Three phases (2006–2012, 2011–2013, 2015)
Four transcribers
Different tools (none, FOLKER, EXMARaLDA)
Transcription guidelines were gradually simplified over time
- No uppercase/lowercase distinction
- No distinction of close/open vowels (ò ↔ o)
- No distinction of long/short fricatives (chch ↔ ch)
Why normalize?
Variation in the transcriptions:
- Transcription inconsistencies
- Dialectal variation (informants of different dialectal origins)
- Intra-speaker variation
- Code-switching (Swiss German – Standard German)

Example: min maa het immer gsaait

                            min                 maa   het            immer             gsaait
Variants in the same text   mi, mii, miin       ma    hät                              gsait
Variants in other texts     mine                      hed, hèd, hèt  ime, imer, emmer  gsäit, gsääit
Code-switching              määin, mäin, main   man
Normalization
Goal: Establish identities between forms that are felt to be "the same word"
Guidelines:
- Normalize word by word, keeping the original syntax
- Use Standard German forms whenever they correspond in etymology and meaning
- Otherwise, create etymologically motivated artificial forms (based on Idiotikon lemmas if possible)
- Disambiguate ambiguous dialect forms: a dialect form may correspond to more than one normalized word
- Difficult cases are listed
Transcription   Normalization
jaa             ja
de              dann
het             hat
me              man
no              noch
gluegt          gelugt
tänkt           gedacht
dasch           das ist
ez              jetzt
de              der
genneraal       general
Other examples
öpper           etwer
gheie           geheien
go              gan
hettemers       hätten wir es
bimene          bei einem
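These identities are what makes corpus querying tractable: a single normalized form retrieves all of its dialectal and orthographic variants. Below is a minimal Python sketch under that assumption; the token list and variable names are illustrative, not the released corpus API.

```python
from collections import defaultdict

# Hypothetical (transcribed form, normalized form) pairs, as produced
# by the word-by-word normalization layer described above.
tokens = [
    ("jaa", "ja"), ("de", "dann"), ("het", "hat"), ("me", "man"),
    ("hät", "hat"), ("hed", "hat"), ("hèt", "hat"), ("tänkt", "gedacht"),
]

# Index the transcription variants by their normalized form.
variants = defaultdict(set)
for form, norm in tokens:
    variants[norm].add(form)

# One normalized query now finds all attested variants of "hat".
print(sorted(variants["hat"]))  # ['hed', 'het', 'hät', 'hèt']
```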
Learning to normalize
Currently, 6 texts (out of 34) have been manually normalized
Normalization of a text takes between 30 and 60 hours, depending on length, dialect and annotator experience
Can we speed up the normalization process for the remaining texts?
Two experimental stages:
Cross-validation Develop automatic normalization models using ~80% of each of the 6 texts[training]
Apply models to remaining ~20% of each of the 6 texts[test]
Bootstrapping Recreate best model using 100% of the 6 texts[training]
Apply model to new texts[test]
Three types of dialect forms in test: Cross-validation:
Unique Dialect forms with exactly one normalization in training 47%
Ambiguous Dialect forms with more than one normalization 41%
New Dialect forms that have not been observed, and for which no
normalization is available 12%
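As a concrete reading of the three categories, here is a minimal Python sketch, assuming the training data is available as plain (form, normalization) pairs; all names are illustrative.

```python
from collections import defaultdict

def categorize(train_pairs, test_forms):
    """Split test word forms into unique / ambiguous / new, according to
    how many distinct normalizations each form received in training."""
    norms = defaultdict(set)
    for form, norm in train_pairs:
        norms[form].add(norm)

    categories = {"unique": [], "ambiguous": [], "new": []}
    for form in test_forms:
        n = len(norms.get(form, ()))
        key = "new" if n == 0 else ("unique" if n == 1 else "ambiguous")
        categories[key].append(form)
    return categories

# Example: 'de' is ambiguous (dann/der), 'jaa' is unique, 'öppis' is unseen.
train = [("jaa", "ja"), ("de", "dann"), ("de", "der"), ("het", "hat")]
print(categorize(train, ["jaa", "de", "öppis"]))
```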
Cross-validation experiments
1. Baseline
Do nothing: how many dialect forms are already identical to their normalizations?
Unique: 23.55%   Ambiguous: 22.08%   New: 7.05%   Total: 21.26%
2. Most frequent normalization
Choose the most frequently seen normalization; in case of ties, choose one of the tied normalizations at random.
Unique: 98.13%   Ambiguous: 81.77%   New: 7.05%   Total: 80.69%
3. Character-level machine translation
Normalize new words character by character, using the character correspondences learned from the training data.
Unique: 98.13%   Ambiguous: 81.77%   New: 35.90%   Total: 84.05%
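A minimal Python sketch of methods 1 and 2, under the same assumptions as above (training data as (form, normalization) pairs, random tie-breaking as described); function names are illustrative.

```python
import random
from collections import Counter, defaultdict

def baseline(form):
    """Method 1: leave the dialect form unchanged."""
    return form

def train_most_frequent(train_pairs):
    """Method 2: build a normalizer that picks the most frequently seen
    normalization, breaking ties randomly; unseen ('new') forms fall
    back to the identity baseline."""
    counts = defaultdict(Counter)
    for form, norm in train_pairs:
        counts[form][norm] += 1

    def normalize(form):
        if form not in counts:
            return baseline(form)          # no information for new forms
        ranked = counts[form].most_common()
        best = ranked[0][1]
        return random.choice([n for n, c in ranked if c == best])

    return normalize
```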
Character-level machine translation
[Slide reproduces an excerpt from a published paper on cognate production with character-based statistical machine translation (the COP system, built on the Moses SMT engine): words are treated as sentences and characters as words, so that context-dependent character correspondences, such as English -tion becoming Spanish -ción in nation–nación or addition–adición, can be learned from training pairs.]
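Concretely, character-level MT only requires recasting each word pair as a parallel "sentence" of characters before training a standard SMT system. A minimal sketch of this data preparation, using word pairs from the examples above; the file names are illustrative, and the actual pipeline would pass such files to a toolkit like Moses.

```python
def to_char_tokens(word):
    """Treat characters as tokens: 'gluegt' -> 'g l u e g t'."""
    return " ".join(word)

# Dialect/normalization pairs taken from the normalization examples above.
train_pairs = [("gluegt", "gelugt"), ("tänkt", "gedacht"), ("genneraal", "general")]

with open("train.dial", "w", encoding="utf-8") as src, \
     open("train.norm", "w", encoding="utf-8") as tgt:
    for dial, norm in train_pairs:
        src.write(to_char_tokens(dial) + "\n")  # source side, e.g. 't ä n k t'
        tgt.write(to_char_tokens(norm) + "\n")  # target side, e.g. 'g e d a c h t'
```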
Cross-validation experiments
3. Character-level machine translation
Unique: 98.13%   Ambiguous: 81.77%   New: 35.90%   Total: 84.05%
4. Language model
For ambiguous words, choose the most likely normalization based on the 1–6 surrounding normalized words.
The probabilities are estimated on the ArchiMob normalizations as well as on the TüBa-D/S corpus of spoken Standard German.
Unique: 98.13%   Ambiguous: 71.38%   New: 35.90%   Total: 79.72%
In particular: +16% for ambiguous words with tied counts, but -11% for ambiguous words with non-tied counts.
5. Language model for tied counts only (see the sketch below)
Unique: 98.13%   Ambiguous: 81.96%   New: 35.90%   Total: 84.13%
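A minimal sketch of method 5, assuming the frequency counts from method 2 and a scoring function `lm_score` over sequences of normalized words; `lm_score` is a hypothetical stand-in for an n-gram model estimated on the ArchiMob normalizations and TüBa-D/S.

```python
def normalize_ambiguous(form, left_context, counts, lm_score):
    """Method 5: keep the most frequent normalization when there is a
    clear winner; consult the language model only when counts are tied.
    counts[form] is a Counter of normalizations seen in training;
    left_context is the list of preceding normalized words."""
    ranked = counts[form].most_common()
    best = ranked[0][1]
    tied = [n for n, c in ranked if c == best]
    if len(tied) == 1:
        return tied[0]  # clear frequency winner, as in method 2
    # Tied counts: score each candidate in its normalized context.
    return max(tied, key=lambda n: lm_score(left_context + [n]))
```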
Bootstrapping experiments
The cross-validation setting yields 84.13% correct normalizations, but it is optimistic:
- The test utterances are in a dialect that has already been seen during training.
- The test utterances contain proper nouns that have likely already been seen during training.
How does the normalization method perform when applied to different texts from different dialects?
- Bootstrapping setting
Bootstrapping experiments
1. Apply the model to Test1 and Test2.
2. Evaluate and correct the normalizations in Test1 and Test2.
3. Create an augmented model by adding Test1 and Test2.
4. Apply the model to Test3.
5. Evaluate and correct the normalizations in Test3.
[Figure: the training texts (marked !) and the test texts Test1, Test2 and Test3 (marked ^).]
Training data   Initial            Initial   Initial   Initial   Augmented
Test data       Cross-validation   Test1     Test2     Test3     Test3
Unique          98.13              96.50     98.60     95.28     95.33
Ambiguous       81.96              76.42     87.96     78.42     78.39
New             35.90              50.40     51.47     44.45     42.01
All             84.13              78.08     87.58     78.44     78.90
Discussion
Key figures:
- Automatic methods allow us to correctly normalize about 3/4 of the words, more if the dialects are similar
- Time spent normalizing a document by hand: 30–60 hours
- Time spent correcting an automatically normalized document: 8–10 hours
Recent advances:
- Use character-level machine translation for all types of words
- Apply it to entire utterances, not to isolated words (a sketch of this encoding follows below):
_ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t _ d a s c h _ e z _ d e _ g e n n e r a a l _
_ j a _ d e n _ h a t _ m a n _ n o c h _ g e l u g t _ g e d a c h t _ d a s _ i s t _ j e t z t _ d e r _ g e n e r a l _
- Use a larger external language model (Standard German film subtitles)
Total accuracy on the cross-validation set: >90%
All remaining documents were normalized with this model
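A minimal Python sketch of the utterance-level encoding shown above: `_` marks word boundaries as ordinary characters, which lets the character-level model merge or split words (as in dasch → das ist) instead of operating on isolated tokens. The function names are illustrative.

```python
def utterance_to_chars(utterance):
    """Encode an utterance as one character sequence with explicit
    word boundaries: 'jaa de het' -> '_ j a a _ d e _ h e t _'."""
    tokens = ["_"]
    for word in utterance.split():
        tokens.extend(word)   # one token per character
        tokens.append("_")
    return " ".join(tokens)

def chars_to_utterance(encoded):
    """Invert the encoding: '_ j a _ d e n _' -> 'ja den'."""
    words = "".join(encoded.split()).split("_")
    return " ".join(w for w in words if w)

print(utterance_to_chars("jaa de het me no gluegt"))
```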
Conclusions
Summary:
- First corpus of spoken Swiss German developed using up-to-date technology
- Suitable for studying regional variation and for training natural language processing tools
- Goal: 44 documents, around 700 000 tokens
Status:
- 34 documents (around 500 000 tokens) completely transcribed and aligned with the audio source
- 6 documents manually normalized
- 3 documents automatically normalized and hand-corrected
- 25 documents automatically normalized
- 4 documents manually annotated with part-of-speech tags
- 30 documents automatically tagged
Conclusions
Availability:
Online lookup using Sketch Engine
XML download
Audio/video files on request
http://www.spur.uzh.ch/en/departments/korpuslab