• Aucun résultat trouvé

Non-standard texts: from theoretical positions to Natural Language Processing normalisation

N/A
N/A
Protected

Academic year: 2021

Partager "Non-standard texts: from theoretical positions to Natural Language Processing normalisation"

Copied!
1
0
0

Texte intégral

(1)

› annotation is not neutral

› annotation is linked to interpretative frameworks

› researchers should not be trapped

› researchers need to conduct own annotation

› full corpus (88,000 sms) & samples available for download

theoretical position

alignment algorithms

for transcoding

88milSMS corpus

SMS writing

(eSMS)

knowledge

building

http://88milsms.huma-num.fr/

Non-standard texts: from theoretical positions to

Natural Language Processing normalisation

Cédric Lopez* Mathieu Roche** Rachel Panckhurst***

*R&D, Viseo, Grenoble [email protected] **UMR TETIS,Cirad, Irstea, AgroParisTech, Montpellier;

LIRMM, UMR 5506 CNRS & Université Montpellier [email protected] ***Praxiling UMR 5267 CNRS & Université Paul-Valéry Montpellier 3

[email protected]

Case 1

plus

svt

possible

plus

souvent possible

Case 2

Vasi

lâche

moi

Vas-y

lâche-moi

Case 3

T’as

eu

Tu

as

eu

language classification

C1.2: LEFF no accents

C1.1: LEFF

C2.1 (sole letters)

a, c, f, j, p, v…

C2.2 (time)

8heures, 10minaperdre, 6-7h

C2.3 (repetition)

Mdrrr, Loool, tkkkkt, Huuuummm

C2.4 (special)

Conn*rd, désannule, resto+ciné

C2.5 (numbers)

numb3rs, mc2, 106ounette, 3615ma-vie

C2.6 (smileys)

^^ :p ;) :d <3 :-) xd :( :/

C3 (INSO)

tkt, jte, cc, voituration, cinglicité, tetrangle

‘unknown’ non-standard items (INSO)

› new typology of detected ‘mistakes’

› normalisation based on most frequent errors

› confrontation with:

traditional automatic translation,

speech recognition,

spelling/grammatical checker principles

› comparison between different types of instant media (SMS, forums, tweets)

automatic normalisation techniques

PLIN Day,

Louvain-la-Neuve, 12 May 2016

ARTS, LETTRES, LANGUES, SCIENCES HUMAINES ET SOCIALES

Références

Documents relatifs

We describe in Section 4 an experiment with probabilistic topic models on a large collection of reports, with mixed results that cannot conclude on the utility of this method in

In spite of the increasingly large textual datasets humanities researchers are confronted with, and the need for automatic tools to extract information from them,

In the CLEF 2005 Ad-Hoc Track we experimented with language-specific morphosyntactic processing and light Natural Language Processing (NLP) for the retrieval of Bulgarian,

This module searches a file of code or documentation for blocks of text that look like an interactive Python session, of the form you have already seen many times in this book.

We have described how we induced useful word embeddings by applying our architecture to a language modeling task trained using a large amount of unlabeled text data.. These

Consider some experimental results. Associations were created on both formal contexts regarding the position of the subject of the action − the first position for the three-element

Keywords: Natural Language Processing, Compositional Distributional Model of Meaning, Quantum Logic, Similarity Measures..

We add additional tags if special words (extreme, obligately, etc.) recognized in the texts. A BioNLP data is always domain-specific. All the texts in the corpus [4] are about