• Aucun résultat trouvé

Natural Language Processing for Swiss German Dialects

N/A
N/A
Protected

Academic year: 2022

Partager "Natural Language Processing for Swiss German Dialects"

Copied!
23
0
0

Texte intégral

(1)

Conference Presentation

Reference

Natural Language Processing for Swiss German Dialects

SCHERRER, Yves

Abstract

Most Natural Language Processing (NLP) applications focus on standardized, written language varieties. From a practical point of view, this focus is understandable: such systems are most likely to be used on written data of a standard variety, and this kind of data is also most easily available for training and parametrizing NLP systems. However, in many regions of the world, the linguistic reality is somewhat more complex: many speakers use some kind of non-standard language variety -- mostly in speech, but sometimes also in writing.

Non-standard lects are subject to continuous variation along the dialectal and sociolinguistic level. From a methodological as well as a practical point of view, it is therefore interesting to include findings of variational linguistics in existing NLP methods. Our work focuses on Swiss German dialects. The German-speaking part of Switzerland has been subject to more than a century of dialectological research that has resulted in dialect atlases, grammars and lexicons.

Today, the dialects represent the default variety of oral communication (Standard German is only used for writing). [...]

SCHERRER, Yves. Natural Language Processing for Swiss German Dialects. In: 55th Annual Conference of the International Linguistic Association , New Paltz, NY (USA), April 2010, 2010

Available at:

http://archive-ouverte.unige.ch/unige:22810

Disclaimer: layout of this document may differ from the published version.

(2)

Natural Language Processing for Swiss German Dialects

Yves Scherrer

Université de Genève, Geneva, Switzerland Columbia University, New York (2009-2010)

Joint work with Owen Rambow at Columbia University

4/16/2010

(3)

Outline

1

Introduction: Swiss German

2

Computational Linguistics and Dialectology

3

Three applications for Swiss German dialects

4

There’s a map for that!

5

Outlook

(4)

Languages in Switzerland

Quelle: Eidgenössische Volkszählung 2000, BFS Source: Recensement fédéral de la population 2000, OFS

© Bundesamt für Statistik / Office fédéral de la statistique, ThemaKart, Neuchâtel 2004 – Relief: swisstopo, Wabern / K16.16

Thematische Karten Cartes thématiques

BFS OFS UST SH

ZH

ZG BS

LU

BE

TI

VD FR

NE

GE

GR SO

BL AG

ARAI

SZ

OW

SG

VS JU

TG

NW GL

UR

0 25 50 km

keine Dominanz aucune prédominance Deutschsprachige Dominanz Prédominance de l'allemand Französischsprachige Dominanz Prédominance du français Italienischsprachige Dominanz Prédominance de l'italien Rätoromanischsprachige Dominanz Prédominance du romanche 70 – 84,9 %≥ 85 %

schwach faible stark

forte Dominierende Landessprache Langue nationale la plus parlée Landessprachen in den Gemeinden, 2000 Langues nationales parlées dans les communes, en 2000

Standard German/ Swiss German Diglossia

French Italian Romansh

http://www.bfs.admin.ch/bfs/portal/de/index/regionen/thematische_karten/maps/bevoelkerung/sprachen_religionen.html

(5)

Variation in Swiss German dialects

Standard German:

Warst du auf demMarkt einkaufen?

Were you at the market shopping?

‘Have you been shopping at the market?’

Swiss German Dialects:

ZH Bisch uf em Mèrt gsy go poschte?

SZ Bisch ufem Märcht gsi go poschte?

SO BeschofemMäred gsy go boschte?

OW Bischt ufm Märt gsii go ichoifä?

SG Bisch uf dä Märt goposchtä?

BE Bisch uf e Märit ga kömerle?

FR Büschu z’Määret gsi gaiichufe?

WS Bisch düuf umMarkt gsii gaichöifu?

http://als.wikipedia.org/wiki/Alemannischer_Beispielsatz

Differences between Standard German and Swiss German:

Preterite tense replaced by present perfect

go/ga+ infinitive construction

Differences within Swiss German dialects:

No standardized spelling Phonetics

Lexicon Morphology Syntax

(6)

Research questions

How can we present the results of

dialectological

research in a more intuitive and dynamic way?

Research on Swiss German dialects since 100 years, but not easily accessible to laymen.

A computational approach can give new insights into interdialectal variation, dialect change, . . .

How can we apply methods of

computational linguistics / natural language processing

to dialect data?

Few work on language variation (except in speech processing).

Reuse and adaptation of resources for Standard German.

Explicit dialectological data (rather than large corpora of raw text) calls for a knowledge-based approach.

(7)

Three applications

Examples of computational linguistics applications:

Speech recognition systems, text-to-speech synthesizers, automated voice response systems, machine translation systems, parsers, part-of-speech taggers, web search engines. . .

1 Machine translation

from Standard German to a given Swiss German dialect

User provides a Standard German text and clicks on a Swiss map to select one target dialect, obtains a translation of the text.

2 Identification

of Swiss German dialects

User provides a snippet of dialect text, obtains a shaded map.

3 Parsing

of Swiss German dialect text

User provides a dialect text, obtains a structural analysis and a shaded map indicating where this analysis applies.

Written representations of dialect, as used in electronic media.

(8)

Ingredients

Existing resources for Standard German:

Lemmatizers

Morphological annotation Part-of-speech taggers Parsers

Word-form lexicons

Rules to transform Standard German into Swiss German:

Phonetic rulesthat adapt Standard German words phonetically.

Lexicon correspondencesthat translate Standard German words which cannot be generated with phonetic rules.

Syntactic rulesthat modify the sentence structure.

Most of these rules yield different results in different Swiss

German dialects. Hence, the rules are associated with probability

maps.

(9)

Machine Translation

Standard German Text Target dialect

coordinates

Standard German lemmatization/tagging/parsing Analyzed Standard German Text

Phonetic/lexical/syntactic

transfer rules

Swiss German Text

(10)

Dialect identification

Morphologically annotated Standard German Lexicon

Phonetic/lexical transfer rules Morphologically annotated

Swiss German Lexicon

Each entry is associated to a map.

Swiss German Text

Lexicon lookup and map combination

Probability map

(11)

Dialect parsing

Morphologically annotated Standard German Lexicon

Phonetic/lexical transfer rules Morphologically annotated

Swiss German Lexicon

Standard German parsing model

Syntactic transfer rules Swiss German

parsing model

Swiss German Text

Parsing

Parse tree and

probability map

(12)

Georeferenced rules

Example of a phonetic rule:

Variable Standard German

-nd

in word-final position e.g.

Hund ‘dog’, Kind ‘child’

Variants

-nn[n],-ng[N],-nt [nt],-nd [nd

˙]

nd →nn nd →ng nd →nt

keep

nd

Black: p = 1 White: p = 0

(13)

Georeferenced rules

Example of a lexical rule:

Variable Standard German

immer ‘always’

Variants

geng immer all

Example of a syntactic rule:

Variable Word order of auxiliary verb and past participle in subordinate clauses

Variants

Aux + PP PP + Aux (like

Standard German)

(14)

Map sources and processing

SDS (Sprachatlas der deutschen Schweiz):

Data collection 1939-1958, publication 1962-1997 8 volumes, > 1000 hand-drawn maps

Need to scan and digitize maps Phonetics, Morphology, Lexicon

SADS (Syntaktischer Atlas der deutschen Schweiz):

Data collection recently completed, publication scheduled 2011 118 items on questionnaire, syntax and morphosyntax only Access to numerical raw data

Some rules apply uniformly to all dialects and do not require maps.

Interpolation:

Atlas maps only show values at the inquiry points.

To obtain values at the points in between, use interpolation.

(15)

Original hand-drawn map from SDS

SDS II/120

(16)

Digitized point map

(17)

Set of interpolated surface maps

nd →nn nd →ng

nd →nt

keep

nd

Black: p = 1 White: p = 0

(18)

Map interpolation – Open questions

The current interpolation algorithm merely looks at distance between inquiry points.

Influence of topographical, political, denominational borders?

Importance and population size of inquiry points?

Continuous probability maps suggest continuous dialect change.

Rumpf et al. (2010) use polygons (Voronoi mosaics), i.e. clear-cut borders.

19th century debate in dialectology.

Atlas data is from the 1950s, but dialects have changed over time (Christen 1998):

Small-scale variants tend to disappear.

City dialects and suburb dialects influence each other.

Convergence in lexicon, divergence in grammar.

Different interpolation settings yield different results on the three tasks.

(19)

Current status

Models

Translation and dialect identification models working.

Parsing model to be built, depends on results on dialect identification.

Rule base

108 phonetic rules (fairly complete) 28 morphology rules (50% of target) 43 lexicon rules

Some syntactic rules for testing

Development and evaluation corpus

Text data from 5 dialects extracted from Swiss German Wikipedia.

(20)

Text data for evaluation

Need:

Written dialect data

Annotated with the author’s dialect

For parsing: annotated with syntactic structure Have:

Blogs, Chat

No dialect annotation, huge differences in spelling

Dialect literature

Only a few regions have literary traditions Copyright restrictions

Swiss German Wikipedia

Annotated with dialect (~6 categories with sufficient material) Different spelling conventions in use, but fairly consistent

Scientific data

Mostly non-digitized phonetic transcriptions

Dialect corpus under construction (University of Zurich)

(21)

Conclusion

Main characteristics

Systems take advantage of existing Standard German data.

Different applications use the same maps and rules.

Knowledge-based methods are more adapted to the data situation.

Contributions to dialectology and computational linguistics

New insights into dialectological questions by using methods from computational linguistics.

Modelling of language variation in computational linguistics.

Perspectives

Test different interpolation methods to model dialect change.

Dialect speech synthesis to respect modality-related diglossia.

Dialect corpora open new perspectives for model tuning and evaluation.

(22)

Danke! Dank schöön!

Merci! Vergelt’s Gott!

(23)

References

C. BUCHELI& E. GLASER(2002): The Syntactic Atlas of Swiss German Dialects [SADS]: empirical and methodological problems. In Sjef

Barbiers, Leonie Cornips et Susanne van der Kleij, eds.,Syntactic Microvariation, Vol. II, 41-74.

H. CHRISTEN(1998). Convergence and Divergence in the Swiss German Dialects. In:Folia Linguistica 32(1-2), 53–67.

D. CHIANG, M. DIAB, N. HABASH, O. RAMBOW& S. SHAREEF(2006).

Parsing Arabic dialects. InProceedings of EACL’06, 369-376.

R. HOTZENKÖCHERLE, R. SCHLÄPFER, R. TRÜB& P. ZINSLI, eds.

(1962-1997).Sprachatlas der deutschen Schweiz[SDS].

J. RUMPF, S. PICKL, S. ELSPASS, W. KÖNIG& V. SCHMIDT(to appear).

Structural analysis of dialect maps using methods from spatial statistics.

In:Zeitschrift für Dialektologie und Linguistik.

Références

Documents relatifs

b) Socio-demographics: recent studies show systematic variation (esp. with respect to age groups) which points to apparent time change (cf. Friedli 2012, Stoeckle accepted)..

Children learning Standard German build Standard German words from known dialect words. (overgeneralizations

5: Mean similarity values for different linguistic levels (lexis: 64 working maps; phonology: 100 working maps; morphology: 118 working maps; syntax: 68 working maps)..

The tagger trained on T¨uBa-D/S – unsurprisingly – per- forms badly when applied directly to the original ArchiMob corpus: 72.78% of the words are unknown to the tagger, since they

To sum up, the results of the trained models based on unigrams (the Basic EM and the Class model) are situated between those of the static models and the rule-based model, while

Pursue the converging trends between NLP and dialectology by building a machine translation model that uses dialect maps..

A syntactic transfer rule: (SADS I/9) Verb raising in subordinate clauses Syntactic transfer rules change the struc- ture of the syntax tree. StdG ob er einmal heiraten will if he

Yves Scherrer (Université de Genève) MT for Swiss German Dialects 10/14/2009 11 / 40.. representations of) all Swiss German dialects. Rule-based transfer model, parametrized