
Non-standard texts: from theoretical positions to Natural Language Processing normalisation


HAL Id: hal-01487025

https://hal.archives-ouvertes.fr/hal-01487025

Submitted on 10 Mar 2017



Cédric Lopez, Mathieu Roche, Rachel Panckhurst

To cite this version:

Cédric Lopez, Mathieu Roche, Rachel Panckhurst. Non-standard texts: from theoretical positions to Natural Language Processing normalisation. PLIN-Day, May 2016, Louvain-la-Neuve, Belgium. 2016. ⟨hal-01487025⟩


› annotation is not neutral

› annotation is linked to interpretative frameworks

› researchers should not be trapped

› researchers need to conduct their own annotation

› full corpus (88,000 SMS) & samples available for download

theoretical position

alignment algorithms for transcoding

88milSMS corpus

SMS writing (eSMS)

knowledge building

http://88milsms.huma-num.fr/


Cédric Lopez* Mathieu Roche** Rachel Panckhurst***

*R&D, Viseo, Grenoble

cedric.lopez@viseo.com

**UMR TETIS, Cirad, Irstea, AgroParisTech, Montpellier; LIRMM, UMR 5506 CNRS & Université Montpellier

mathieu.roche@cirad.fr

***Praxiling UMR 5267 CNRS & Université Paul-Valéry Montpellier 3

rachel.panckhurst@univ-montp3.fr

Case 1: plus svt possible → plus souvent possible

Case 2: Vasi lâche moi → Vas-y lâche-moi

Case 3: T’as eu → Tu as eu
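The three cases above pair a raw SMS string with its transcoded form. As a minimal sketch of how such pairs might be aligned token by token, the following uses Python's standard `difflib.SequenceMatcher`. This is an illustrative stand-in, not the alignment algorithm used by the authors, and the function name `align` is hypothetical.

```python
from difflib import SequenceMatcher


def align(raw, norm):
    """Align whitespace-separated tokens of a raw SMS with its normalised form.

    Identical tokens are paired one-to-one; every differing stretch is
    returned as a single (raw chunk, normalised chunk) pair, which covers
    one-to-one, one-to-many and many-to-many rewrites. An empty string on
    one side means a pure insertion or deletion.
    """
    a, b = raw.split(), norm.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pairs.extend(zip(a[i1:i2], b[j1:j2]))
        else:
            pairs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return pairs


case1 = align("plus svt possible", "plus souvent possible")
case2 = align("Vasi lâche moi", "Vas-y lâche-moi")
case3 = align("T’as eu", "Tu as eu")
```

In Case 1 only `svt` is rewritten, in Case 3 one raw token maps to two normalised tokens, and in Case 2 no token is shared, so the whole message is returned as one chunk. A character-level aligner would be needed to split Case 2 further.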

language classification

C1.1: LEFF

C1.2: LEFF no accents

C2.1 (sole letters): a, c, f, j, p, v…

C2.2 (time): 8heures, 10minaperdre, 6-7h

C2.3 (repetition): Mdrrr, Loool, tkkkkt, Huuuummm

C2.4 (special): Conn*rd, désannule, resto+ciné

C2.5 (numbers): numb3rs, mc2, 106ounette, 3615ma-vie

C2.6 (smileys): ^^ :p ;) :d <3 :-) xd :( :/

C3 (INSO): tkt, jte, cc, voituration, cinglicité, tetrangle

‘unknown’ non-standard items (INSO)
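The classification above can be read as a cascade of tests applied to each token. The sketch below is one possible reading, under stated assumptions: `TOY_LEXICON` is a hypothetical stand-in for the LEFF lexicon, the regular expressions are guesses derived only from the examples listed, and the order in which the tests are tried is chosen here rather than taken from the poster. Items such as `désannule` would need morphological rules that are not attempted, so they fall through to C3.

```python
import re
import unicodedata

# Hypothetical stand-in for the LEFF lexicon (assumption, not the real resource).
TOY_LEXICON = {"plus", "souvent", "possible", "lâche", "moi", "tu", "as", "eu"}
# Smileys listed on the poster.
SMILEYS = {"^^", ":p", ";)", ":d", "<3", ":-)", "xd", ":(", ":/"}

TIME = re.compile(r"\d+(-\d+)?(h|heures?|min)")  # 8heures, 10minaperdre, 6-7h
REPETITION = re.compile(r"(.)\1\1")              # same character three times in a row
SPECIAL = re.compile(r"\w[*+]\w")                # Conn*rd, resto+ciné
NUMBERS = re.compile(r"\d")                      # numb3rs, mc2


def strip_accents(word):
    """Remove combining diacritics: 'lâche' becomes 'lache'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


UNACCENTED = {strip_accents(w) for w in TOY_LEXICON}


def classify(token):
    """Return the category label of a token; the test order is an assumption."""
    t = token.lower()
    if t in TOY_LEXICON:
        return "C1.1"
    if t in UNACCENTED:
        return "C1.2"
    if t in SMILEYS:  # tested early so that '<3' is not counted as a number
        return "C2.6"
    if len(t) == 1 and t.isalpha():
        return "C2.1"
    if TIME.match(t):
        return "C2.2"
    if REPETITION.search(t):
        return "C2.3"
    if SPECIAL.search(t):
        return "C2.4"
    if NUMBERS.search(t):
        return "C2.5"
    return "C3"  # unknown non-standard item (INSO)


labels = {tok: classify(tok) for tok in ["possible", "lache", "8heures", "Mdrrr", "tkt"]}
```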

› new typology of detected ‘mistakes’

› normalisation based on most frequent errors

› confrontation with: traditional automatic translation, speech recognition, spelling/grammatical checker principles

› comparison between different types of instant media (SMS, forums, tweets)

automatic normalisation techniques
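One way to read "normalisation based on most frequent errors" is as a lookup table learned from aligned pairs, where each non-standard form is replaced by the normalised form it is most often paired with. The sketch below illustrates that reading only. The function names and the `toy_pairs` data are hypothetical and invented for illustration; they are not drawn from the 88milSMS corpus.

```python
from collections import Counter, defaultdict


def build_table(pairs):
    """Map each raw form to the normalised form most often aligned with it."""
    counts = defaultdict(Counter)
    for raw, norm in pairs:
        counts[raw.lower()][norm] += 1
    return {raw: c.most_common(1)[0][0] for raw, c in counts.items()}


def normalise(message, table):
    """Replace every token found in the table; leave other tokens untouched."""
    return " ".join(table.get(tok.lower(), tok) for tok in message.split())


# Toy aligned pairs (invented for illustration, not corpus data).
toy_pairs = [
    ("svt", "souvent"),
    ("svt", "souvent"),
    ("svt", "savant"),
    ("vasi", "vas-y"),
]
table = build_table(toy_pairs)
result = normalise("plus svt possible", table)
```

A table of this kind ignores context, so an ambiguous form always receives its majority reading. That limitation is one reason to compare it with the translation, speech recognition and spell-checking approaches listed above.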

PLIN Day, Louvain-la-Neuve, 12 May 2016
