• Aucun résultat trouvé

Challenges for language technologies in Ayapaneco

N/A
N/A
Protected

Academic year: 2021

Partager "Challenges for language technologies in Ayapaneco"

Copied!
2
0
0

Texte intégral

(1)

HAL Id: hal-02919375

https://hal.archives-ouvertes.fr/hal-02919375

Submitted on 22 Aug 2020

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Challenges for language technologies in Ayapaneco

Jhonnatan Rangel

To cite this version:

Jhonnatan Rangel. Challenges for language technologies in Ayapaneco. UNESCO International Con-ference Language Technologies for All (LT4All), Dec 2019, Paris, France. �hal-02919375�

(2)

C

HALLENGES FOR LANGUAGE TECHNOLOGIES IN

A

YAPANECO

J

HONNATAN

R

ANGEL

, P

H

D

O

BJECTIVES

This poster aims to:

• Present challenges to implementing language technologies in Ayapaneco

• Illustrate how almost 10% of the world’s lan-guages could face these same challenges

I

NTRODUCTION

Ayapaneco is a critically endangered and low-resourced language. Like many other languages in this situation, it poses various fundamental chal-lenges for the implementation of language technolo-gies. These challenges limit the scope of its docu-mentation, preservation, reclamation and revitaliza-tion.

A

YAPANECO

Figure 1: Geographic location

Ayapa Zoque, Ayapaneco or numde ‘oode (autonym) belongs to the Mixe-Zoque language family and is spoken in Ayapa, a village of 5,500 inhabitants in the municipality of Jalpa de Méndez, Tabasco, Mexico [3].

• Approximately 11 elder speakers representing only 0.3% of Ayapa’s total population

• Not transmitted for over 60 years • Limited use of the language

• No written tradition, practical orthography in progress

• Under-documented & under-studied • Recent language revitalization efforts • Critically endangered & low-resourced

R

EFERENCES

[1] DUONG, L. Natural Language Processing forResource-Poor Languages. PhD dissertation, University of Mel-bourne, 2017.

[2] MOSELEY, C. Atlas of the World’s Languages in Danger, 3rd ed. UNESCO, 2010.

[3] RANGEL, J. Variations linguistiques et langue en dan-ger. Le cas du numte oote ou zoque ayapaneco dans l’état de Tabasco, Mexique. PhD dissertation, INALCO, 2019.

F

UTURE

P

ERSPECTIVES

The implementation of language technologies could help us maximize the limited time we have left to engage with critically endangered languages. Given the multiple and complex challenges we face, we must explore from a theoretical and methodologi-cal point of view new multidisciplinary approaches. We must also create synergies among varied actors (academia, governments, NGOs, communities and

civil society) to better address these challenges and support capacity building in the long term. Simulta-neously, and more fundamentally, we must address the root causes of these challenges such as inequality, poverty and discrimination as critically endangered and low-resourced languages are spoken by minor-ity and marginalized communities.

C

ONTACT

I

NFORMATION

Email [email protected] Twitter @JhonnRan Blog https://wils.hypotheses.org/ Academia /jhonnatanrangel HAL jhonnatan-rangel

Research Gate jhonnatan_rangel

ORCID https://orcid.org/0000-0003-2493-7634

C

HALLENGES FOR

A

YAPANECO

Human resources Annotation bottleneck Financial resources

Capacity Infrastructure Linguistic

As Ayapaneco is a critically endangered, low-resourced, under-documented and under-studied language, we are confronted with multiple chal-lenges to implementing language technologies.These can be categorized in six major groups.

1. Annotation bottleneck: 1-2 hours spent anno-tating for every minute of recording

2. Human resources: only 1-2 manual annotators who are not speakers; all speakers are elders with no digital literacy

3. Financial resources: no state funding for lan-guage technologies in critically endangered languages

4. Capacity: scarce computer access; written sys-tem still in progress

5. Infrastructure: limited, unreliable and expen-sive access to internet

6. Linguistic: limited language documentation and description; high proportion of unstruc-tured variation [3]

C

RITICALLY ENDANGERED

There are currently 577 critically endangered lan-guages in the world, making up almost 10% of the world’s 6,000+ languages [2].

Figure 2: Vitality of the world’s languages

Critically endangered languages present the follow-ing characteristics:

• Small proportion of speakers to larger commu-nity

• Remaining speakers are elders (grandparent generation)

• Inter-generational transmission interrupted for decades

• Limited use in daily language practices • Lack of written tradition and literacy

L

OW

-

RESOURCED

Low-resourced languages (LRL) lack the necessary annotated corpora and tools for their utilization in language technologies such as the fast growing field of Natural Language Processing. It is estimated that out of the world’s 6,000+ languages, only about 20 of them could be considered high-resourced (HRL). An additional 60 languages have some language technology tools available or are medium-resourced (MRL)[1].

HRL MRL LRL

0.4% 1% 98.6%

Table 1: Resource distribution of world languages

Some examples include:

• HRL: English, German, French, Spanish • MRL: Hebrew, Hindi, Czech

Références

Documents relatifs

Building automatic speech recognition system (ASR) for under-resourced languages has many constraints such as: lack of transcribed speech, few speaker diversity for acoustic

These results are in parallel with system combination outputs as shown in Table 8; a combina- tion of Hybrid G2P with Malay G2P system gave better result compared to Hybrid G2P

In this context, it is important to point out the specificities of endangered language dictionaries as opposed to major language dictionaries, which will lead us

Keywords: automatic workstation; terminology data bases; text corpora; term ex- traction, termhood, terminology management; user translation dictionaries; professional

This leaves pupils with an idea of an imagined and rather mythical language community based on a “language” they have little first-hand experience of outside school time, Occitan

Features were added to the app in order to facilitate the collection of parallel speech data in line with the require- ments of the French-German ANR/DFG BULB (Breaking the

With the objective of obtaining a neural state-of-the-art TTS for Gascon Occitan with relatively few recording hours which would pronounce French proper names in a standard way, we

§ Using Malay (closely-related) data in the lexicon design for Iban is better than using English (not a close language). § Cross-lingual effect on acoustic model is more evident on