
This thesis is based on the assumption that phonetic (or graphemic) similarity measures facilitate the induction of word mappings between cognate words. Cognate words appear most frequently in closely related languages. Moreover, the phonetic models rely on word lists only; they do not need more sophisticated data such as parallel corpora. Thus, phonetic similarity models are well suited to closely related language pairs in which at least one language is resource-scarce. However, these models, working on limited data, cannot perform as well as models working on more informative data. We showed that even a small parallel corpus yields better lexicon mappings than those obtained with phonetic similarity measures on simple word lists.

We have proposed ten models that encode phonetic similarity in different ways and grouped them into three categories: static models, models of direct adaptation, and models of indirect adaptation. We showed that adaptive models perform better than static ones, provided that enough data is available to adapt the models to the language pair. The information for the adaptation can be presented as a linguistic description of the phoneme transformations, as in the rule-based model, or as a list of reference word pairs, as in the learning models. Furthermore, we found that the models which take the phonetic context into account, such as the rule-based model or the n-gram models, perform better than simpler ones based on the comparison of single letters. However, even the models that rely on phoneme contexts still induce every word pair in isolation. Hence, these models may constitute a basis for a more complex framework that takes into account the syntactic contexts of the word pairs to be induced.
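The contrast between a static and an adapted similarity measure can be illustrated with a minimal sketch. It is not one of the ten models of this thesis: it assumes a plain edit distance over single letters, and all function names (`edit_distance`, `static_cost`, `make_adapted_cost`, `induce_mapping`), the toy strings, and the cost value 0.2 are hypothetical choices made for illustration only. The static variant charges 1 for every substitution, whereas the adapted variant lowers the cost of letter pairs that are assumed to correspond regularly in the language pair.

```python
from typing import Callable, Dict, List, Tuple

CostFn = Callable[[str, str], float]


def edit_distance(src: str, tgt: str, sub_cost: CostFn) -> float:
    """Edit distance with unit insertion/deletion costs and a
    pluggable substitution cost (row-by-row dynamic programming)."""
    prev = [float(j) for j in range(len(tgt) + 1)]
    for i, a in enumerate(src, start=1):
        curr = [float(i)]
        for j, b in enumerate(tgt, start=1):
            curr.append(min(
                prev[j] + 1.0,                 # delete a
                curr[j - 1] + 1.0,             # insert b
                prev[j - 1] + sub_cost(a, b),  # substitute a by b (or match)
            ))
        prev = curr
    return prev[-1]


def static_cost(a: str, b: str) -> float:
    """Static model: every substitution of distinct letters costs 1."""
    return 0.0 if a == b else 1.0


def make_adapted_cost(pair_costs: Dict[Tuple[str, str], float]) -> CostFn:
    """Adapted model (assumption): letter pairs listed in pair_costs
    get a reduced cost; all other substitutions cost 1."""
    def cost(a: str, b: str) -> float:
        if a == b:
            return 0.0
        return pair_costs.get((a, b), 1.0)
    return cost


def induce_mapping(sources: List[str], targets: List[str],
                   sub_cost: CostFn) -> Dict[str, str]:
    """Map each source word to its closest target word.
    Every word pair is induced in isolation; ties go to the
    target that appears first in the list."""
    return {
        s: min(targets, key=lambda t: edit_distance(s, t, sub_cost))
        for s in sources
    }


# Invented toy strings, not data from the thesis.
targets = ["hups", "haus"]
adapted_cost = make_adapted_cost({("u", "a"): 0.2})  # hypothetical weight

static_result = induce_mapping(["huus"], targets, static_cost)
adapted_result = induce_mapping(["huus"], targets, adapted_cost)
```

With the static cost both candidates lie at distance 1 from the source string, so the choice depends only on list order. With the adapted cost the candidate reached through the cheap substitution is preferred. Like the models discussed above, this sketch scores each word pair on its own and uses no syntactic context.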
