A Method to Lexical Normalisation of Tweets

Un método de normalización léxica de tweets

Pablo Gamallo and Marcos Garcia
CITIUS, Univ. de Santiago de Compostela
pablo.gamallo@usc.es

José Ramom Pichel
Imaxin Software
jramompichel@imaxin.com

Resumen: Este artículo describe una estrategia de normalización léxica de palabras “out-of-vocabulary” (OOV) en tweets escritos en español. Para corregir OOV incorrectos, el sistema de normalización genera candidatos “in-vocabulary” (IV) que aparecen en diferentes recursos léxicos y selecciona el más adecuado. Nuestro método genera dos tipos de candidatos, primarios y secundarios, que serán ordenados de diferentes maneras en el proceso de selección del mejor candidato.

Palabras clave: Normalización léxica, Mensajes cortos de texto, Procesamiento de tweets

Abstract: This paper describes a strategy to perform lexical normalisation of out-of-vocabulary (OOV) words in Spanish tweets. To correct any ill-formed OOV, the normalisation system generates in-vocabulary (IV) candidates found in several lexical resources, and selects the best one. Our method generates two types of candidates, primary and secondary IV candidates, which are ranked in different ways to select the best candidate.

Keywords: Lexical Normalisation, Short Text Messages, Tweet Processing

1 Introduction

In this paper, we describe a strategy to perform lexical normalisation of out-of-vocabulary (OOV) words in Spanish tweets.

The task can be described as follows: given an OOV, the algorithm must decide whether the OOV is correct or ill-formed and, in the latter case, it must propose an in-vocabulary (IV) word found in a lexical resource to restore the incorrect OOV.

There has been little work on lexical normalisation of short messages. So far, the most successful strategy for normalising English tweets is described in (Han and Baldwin, 2012b; Han and Baldwin, 2013). They propose merging two different strategies: normalisation dictionary lookup and selection of the best in-vocabulary (IV) candidate.

The first strategy simply consists of looking up a normalisation dictionary, which contains specific abbreviations and other types of lexical variants found in the Twitter language. Each lexical variant is associated with its standard form, for instance gf → girlfriend. The dictionary lookup method achieves very high precision, but with low recall. As recall relies on the size of the dictionary, (Han and Baldwin, 2012a) propose building wide-coverage normalisation dictionaries automatically, by considering that lexical variants occur in contexts similar to those of their standard forms. The normalisation dictionary should only contain unambiguous “variant-standard” pairs; ambiguous variants are tackled using the second strategy.

This work has been supported by the Ministerio de Ciencia e Innovación, within the project OntoPedia, ref. FFI2010-14986.
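As an illustration, the lookup strategy reduces to a direct map from variants to standard forms. The following is a minimal sketch in Python; the entries shown are invented examples, not taken from any actual normalisation dictionary:

```python
# Minimal sketch of normalisation dictionary lookup.
# The entries below are illustrative, not from a real resource.
NORMALISATION_DICT = {
    "gf": "girlfriend",
    "2morrow": "tomorrow",
    "bn": "bien",  # hypothetical Spanish variant-standard pair
}

def lookup_standard_form(variant: str) -> str | None:
    """Return the standard form of a lexical variant, or None if unknown."""
    return NORMALISATION_DICT.get(variant.lower())
```

An OOV found in the dictionary is rewritten directly; anything else falls through to candidate generation and selection.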

The second strategy is applied when the OOV is a lexical variant that has not been found in the normalisation dictionary. It consists of the following two tasks:

• Generation of IV candidates (standard forms) for each particular OOV (lexical variant).

• Selection of the best IV candidate.

The objective of the first task is to build, for each OOV, a list of standard forms derived from the OOV using different processes: for instance, reduction of character repetitions (e.g., carrrr → car), or generation of those IV words whose Edit distance with regard to the target OOV is within a given threshold.
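The repetition-reduction process can be sketched with a single regular expression. Note that collapsing every run to one character is an approximation, since Spanish has legitimate double letters (ll, rr) that a real system would also preserve as alternative variants:

```python
import re

def reduce_repetitions(token: str) -> str:
    """Collapse runs of repeated characters: carrrr -> car.
    A production system would also generate the double-letter variant
    (e.g. carrrr -> carr) to cover legitimate Spanish geminates."""
    return re.sub(r"(.)\1+", r"\1", token)
```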

The second task consists of selecting the best candidate from the list generated in the previous step. Two different selection methods can be used: string similarity and context inference. To compute string similarity between the OOV and the different IV candidates, several measures and strategies can be used: lexical Edit distance, phonemic Edit distance, the longest common subsequence, affix substrings, and so on. For context inference, the IV candidates of a given OOV can be ranked and then filtered on the basis of their local contexts, which are compared against a language model. The main problem of this method is that the local context of an OOV often consists of other incorrect lexical variants that are not found in the language model.

These two selection methods (string similarity and context inference) are complementary and can therefore be used together to select the best candidate.
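For reference, the lexical Edit distance mentioned above is typically the Levenshtein distance, computable with the standard dynamic-programming recurrence; a minimal sketch:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein (lexical Edit) distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]
```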

There are at least two significant differences between the task evaluated in (Han and Baldwin, 2013) and the one proposed at the Tweet Normalization Workshop at SEPLN 2013. On the one hand, the task in (Han and Baldwin, 2013) relies on the basic assumption that lexical variants have already been identified; this means that only ill-formed OOV are taken as input to the selection process. By contrast, the task defined by the Workshop guidelines includes the detection of ill-formed OOV. On the other hand, (Han and Baldwin, 2013) do not consider one-to-several correspondences, for instance imo → in my opinion. At the Workshop, by contrast, it is required to search for one-to-several correspondences, since the IV standard forms used to correct OOV can be multiword expressions. In sum, the task defined at the Tweet Normalization Workshop is more complex than the one described in (Han and Baldwin, 2013).

Finally, there are other approaches to SMS and tweet normalisation based on very different strategies. For instance, (Beaufort et al., 2010) and (Kaufmann and Kalita, 2010) make use of the Statistical Machine Translation framework, as well as of the noisy channel model, which is very common in speech processing. The main problem of these approaches is that they rely on large quantities of labelled training data, which are not available for microblogs.

2 The method

The normalisation method we propose combines the main strategies and tasks described in (Han and Baldwin, 2013), namely: normalisation dictionary lookup, generation of IV candidates, and selection of the best IV candidate with context information. In addition, given the conditions of the Workshop, we also include ill-formed OOV detection in our algorithm.

The design of our algorithm was motivated by the conclusions we drew from the analysis of the development corpus. We observed that the most frequent types of incorrect Spanish OOV are the following: (1) uppercase/lowercase confusion: patri → Patri; (2) character repetition for emphasis: Buuenoo → Bueno; (3) language-dependent spelling problems, namely for Spanish: missing accents and letter confusion (v/b, g/j, ll/y, h/∅, . . . ).

These three types of errors can be solved using simple specific rules. For the remaining phenomena, which correspond to more heterogeneous problems, we make use of generic strategies such as those described in the previous section: dictionary lookup and selection of the best IV candidate. For the detection of correct/incorrect OOV, we use the following method: if no IV associated with an OOV is found using specific rules or generic strategies, then the OOV is considered correct; otherwise it is taken as an ill-formed OOV. Text is lemmatised and PoS tagged using FreeLing (Padró and Stanilovsky, 2012).

Our method contains two modules: a set of lexical resources and an algorithm to detect and correct ill-formed OOV.

2.1 Lexical resources

Our system makes use of three different lexical resources:

ND Normalisation dictionary, containing incorrect lexical variants and their standard forms.

SD Standard dictionary, a list of correct forms generated from the lemmas found in the Real Academia Española dictionary (DRAE).

PND Proper names dictionary, containing proper names extracted from the Spanish Wikipedia.

In the following, we describe how these three dictionaries have been built.

2.1.1 Normalisation Dictionary (ND)

It was mainly built using the development data distributed by the organizers of the Tweet Normalization Workshop at SEPLN 2013. We also used as sources of data the list of emoticons accessible at http://en.wikipedia.org/wiki/List_of_emoticons, as well as the list of Spanish abbreviations released at http://www.rae.es/dpd/apendices/apendice2.html. Our final normalisation dictionary contains 824 entries.

2.1.2 Standard Dictionary (SD)

The standard dictionary is constituted by all the forms automatically generated from the lemmas found in the DRAE. These lemmas have been extracted and freely distributed by the project http://olea.org/proyectos/lemarios. Verb forms were generated with the Cilenis verb conjugator (Gamallo et al., 2013), whereas we used specific morphological rules to generate noun and adjective forms. The final dictionary consists of 778,149 forms, which is significantly larger than the one provided by the last version of FreeLing (556,509 Spanish forms in FreeLing 3.0).

2.1.3 Proper Names Dictionary (PND)

To ease the detection of correct OOV (for instance, proper names and domain-specific terms that are not in a standard vocabulary), it is useful to make use of a large list of OOV extracted from an encyclopaedic resource, for instance Wikipedia. Several PND were automatically extracted. The PND allowing the best performance in the normalisation task was extracted as follows: first, using CorpusPedia (Gamallo and González, 2010), a simplified format derived from the original downloadable XML file (Wikipedia dump of May 2011), the names of articles belonging to categories related to persons, locations, and organisations were identified, using the strategy described in (Gamallo and Garcia, 2011). Then, these names were tokenised, and those unigrams whose lowercase variants are found in the standard dictionary (SD) were filtered out. The result is a list of 107,980 unigrams taking part in the names of persons, locations, and organisations.
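The final filtering step can be sketched as follows, assuming article names have already been extracted from the Wikipedia dump; the function name is ours, not part of CorpusPedia:

```python
def build_pnd(article_names: list[str], standard_dict: set[str]) -> set[str]:
    """Tokenise article names and keep only the unigrams whose lowercase
    variant is NOT found in the standard dictionary (SD)."""
    pnd = set()
    for name in article_names:
        for unigram in name.split():  # naive whitespace tokenisation
            if unigram.lower() not in standard_dict:
                pnd.add(unigram)
    return pnd
```

For example, build_pnd(["Pedro Almodóvar"], {"pedro"}) keeps "Almodóvar" but filters out "Pedro", whose lowercase form is a standard word.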

2.2 The algorithm

The system takes a list of OOV as input.

An OOV is considered correct if Dictionary Lookup is true. Dictionary Lookup is a process that consists of searching for a token in one of the three lexical dictionaries: ND, SD, or PND. If the OOV is found in one of them, then it is considered correct. However, even if Dictionary Lookup is false, the OOV will still be considered correct if Affix Check is true. Affix Check is a process that strips regular suffixes and prefixes from the OOV and verifies whether the stem of the OOV is part of an entry found in one of the three dictionaries. Otherwise, the OOV may be incorrect.

Given an incorrect OOV, we generate a list of variants. A variant of an OOV is an IV candidate if either Dictionary Lookup or Affix Check is true. We distinguish between primary and secondary variants.
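The two checks can be sketched as follows. The suffix list is a hypothetical placeholder, since the paper does not enumerate the affixes actually used:

```python
# Hypothetical affixes for illustration; the actual rule set is not listed.
SUFFIXES = ("s", "es", "mente")

def dictionary_lookup(token: str, nd: dict, sd: set, pnd: set) -> bool:
    """True if the token appears in one of the three lexical resources."""
    return token in nd or token in sd or token in pnd

def affix_check(token: str, sd: set) -> bool:
    """Strip a regular suffix and test whether the stem begins an SD entry."""
    for suffix in SUFFIXES:
        if token.endswith(suffix):
            stem = token[: -len(suffix)]
            if stem and any(entry.startswith(stem) for entry in sd):
                return True
    return False

def is_iv(token: str, nd: dict, sd: set, pnd: set) -> bool:
    """A token counts as in-vocabulary if either check succeeds."""
    return dictionary_lookup(token, nd, sd, pnd) or affix_check(token, sd)
```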

2.2.1 Generation of primary variants

Primary variants of an OOV are its most likely IV candidates, according to the types of errors we found in the development corpus. Primary variants are favoured in the process of candidate selection: if at least one primary variant is found, then the system does not consider secondary variants.

Primary variants of an OOV are those IV candidates derived from the OOV that differ from the source OOV only with regard to one of these linguistic phenomena: uppercase/lowercase confusion, character repetition, or frequent Spanish spelling errors. The frequent spelling errors include not only typical problems with accents and frequent letter confusions (v/b, j/g, etc.), but also some phonemic conventions, namely the use of “x” for “ch” (e.g. xicle → chicle). Primary variants generated by simplifying repetition include the cases of interjection reduction: jejeeje → je. For uppercase and lowercase variation, we take into account that words can be written with only lowercase letters, with capitalisation (proper names or the first position in the sentence), or with only uppercase letters (e.g. acronyms). For instance, given the OOV “pedro”, two other variants are generated: “Pedro” and “PEDRO”. If one of them is found in the lexical resources, then it is considered a primary IV candidate. Let us note that a primary variant is considered an IV candidate if either Dictionary Lookup or Affix Check is true.
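A sketch of the primary-variant generator follows. The table of confusion pairs is an assumed, incomplete illustration; the actual rule set is richer than what is shown here:

```python
import re

ACCENTS = {"a": "á", "e": "é", "i": "í", "o": "ó", "u": "ú"}
# Assumed confusion pairs; the paper's full rule set is not enumerated.
SPELLING_SUBS = [("v", "b"), ("b", "v"), ("g", "j"), ("j", "g"),
                 ("ll", "y"), ("y", "ll"), ("x", "ch")]

def primary_variants(oov: str) -> set[str]:
    variants = set()
    low = oov.lower()
    # (1) Uppercase/lowercase confusion: pedro -> Pedro, PEDRO.
    variants.update({low, low.capitalize(), oov.upper()})
    # (2) Character repetition, including interjections: jejeeje -> je.
    collapsed = re.sub(r"(.)\1+", r"\1", oov)
    variants.add(collapsed)
    variants.add(re.sub(r"(\w{1,3})\1+", r"\1", collapsed))
    # (3) Frequent spelling errors: letter confusions and missing accents.
    for wrong, right in SPELLING_SUBS:
        if wrong in low:
            # Replaces every occurrence; a finer rule set would generate
            # per-position variants instead.
            variants.add(low.replace(wrong, right))
    for i, ch in enumerate(low):
        if ch in ACCENTS:
            variants.add(low[:i] + ACCENTS[ch] + low[i + 1:])
    variants.discard(oov)
    return variants
```

Each generated variant then becomes a primary IV candidate only if Dictionary Lookup or Affix Check accepts it, as stated above.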

2.2.2 Generation of secondary variants

If no primary variant is found as an IV candidate, then a large list of secondary variants is generated using Edit distance. In our experiments, we only generate those variants that are at Edit distance 1 from the original OOV. Dictionary Lookup and Affix Check allow us to identify the list of secondary IV candidates. In the next step, we select the best candidate.
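Variants at Edit distance 1 can be generated exhaustively in the style of a classic spelling corrector; the alphabet shown (including accented vowels and ñ) is our assumption:

```python
# Assumed Spanish character set for substitutions and insertions.
ALPHABET = "abcdefghijklmnopqrstuvwxyzáéíóúüñ"

def variants_at_distance_1(token: str) -> set[str]:
    """All strings one deletion, substitution, or insertion away
    from the token (plain Levenshtein distance 1)."""
    splits = [(token[:i], token[i:]) for i in range(len(token) + 1)]
    deletes = {head + tail[1:] for head, tail in splits if tail}
    substitutions = {head + c + tail[1:]
                     for head, tail in splits if tail for c in ALPHABET}
    inserts = {head + c + tail for head, tail in splits for c in ALPHABET}
    return (deletes | substitutions | inserts) - {token}
```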

2.2.3 Candidate selection

To select the best IV candidate of a given OOV, we compare the local context of each candidate against a language model containing bigrams of tokens found within a window of size 4 (2 tokens to the left and 2 to the right of a given token). More precisely, for each candidate, a chi-square measure is computed by comparing observed frequencies in the local context against expected frequencies in the language model. The language model was built by selecting lemmas of the following PoS categories: nouns, verbs, adjectives, prepositions, and adverbs. Text was processed with FreeLing. We also introduced an important restriction that takes into account whether the IV candidate is a primary or a secondary variant: a primary variant is always selected even if its chi-square score is 0, that is, even if it is not found in the language model. By contrast, for secondary variants, the chi-square score must be higher than 0 for the candidate to be selected. Candidates are ranked considering the chi-square values and the above restriction, and the best IV candidate at the top of the ranking is selected and given as the correction of the OOV. At the end, we apply a capitalisation rule which considers the position of the original ill-formed OOV in the sentence: if it is the first word in the sentence, then the selected IV candidate must be written with its first letter in uppercase.
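The selection logic with the primary/secondary restriction can be sketched as below; chi_square stands for the score of a candidate against the bigram language model, whose implementation is abstracted away here:

```python
from typing import Callable, Optional

def select_best(primary: list[str], secondary: list[str],
                chi_square: Callable[[str], float]) -> Optional[str]:
    """Pick the best IV candidate. Primary variants always take precedence
    and may win with a score of 0; secondary variants need a score > 0."""
    if primary:
        return max(primary, key=chi_square)
    scored = [c for c in secondary if chi_square(c) > 0]
    if scored:
        return max(scored, key=chi_square)
    return None  # no candidate selected: the OOV is considered correct
```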

Finally, if no IV candidate (primary or secondary variant) is selected, then the OOV is considered correct. So, correct OOV are detected in two different ways: either Dictionary Lookup or Affix Check is true for the original OOV, or no IV candidate is selected.

3 Experiments

Some experiments were performed using as test set the development corpus provided by the organisation of the Tweet Normalization Workshop. This corpus contains 500 tweets and 651 manually corrected OOV.

The language model used by our system was built from two text sources: the collection of 227,255 tweets provided by the Workshop, which were captured between April 1st and 2nd of 2013, and a collection of news from El Pais and El Mundo captured via RSS crawling. In sum, the language model was created from 50MB of text. The normalisation dictionary contains annotated information from the sample corpus of 100 tweets provided by the Workshop. For the final tests, this dictionary also includes the annotated pairs of the development corpus.

Two versions of our system were tested, “Standard” and “Restricted”, and compared against two baselines, “Baseline1” and “Baseline2”. The standard version has been described in the previous section. The restricted version includes a constraint on short proper names and short acronyms (with fewer than 5 letters): the constraint prevents short proper names and acronyms from being expanded with secondary variants. For instance, if the OOV is “BBC”, the system does not create IV candidates such as “BBV”, “ABC”, and so on. In Baseline1, we do not separate primary from secondary variants, and all IV candidates are treated as primary variants. Baseline2 does not separate primary from secondary variants either, and all IV candidates are treated as secondary variants.

Table 1 shows the results obtained from the experiments performed on the development set. The best performance is achieved with “Restricted”, which is based on the algorithm that makes use of restrictions on short proper names. The low scores reached by the baseline systems clearly show that candidates must be separated into different levels and treated in different ways. On the test set, “Restricted” achieved 66.3% accuracy, the second best score among the 13 participants in the Tweet-Norm competition.

Systems      pos   neg   accuracy
Baseline1    273   378   41.80
Baseline2    288   363   44.10
Standard     444   207   67.99
Restricted   451   200   69.06

Table 1: Results on the development set

Bibliography

Beaufort, Richard, Sophie Roekhaut, Louise-Amélie Cougnon, and Cédrick Fairon. 2010. A hybrid rule/model-based finite-state framework for normalizing SMS messages. In 48th Annual Meeting of the Association for Computational Linguistics, pages 770–779, Uppsala, Sweden.

Gamallo, P., M. Garcia, I. González, M. Muñoz, and I. del Río. 2013. Learning verb inflection using Cilenis conjugators. Eurocall Review, 21(1):12–19.

Gamallo, Pablo and Marcos Garcia. 2011. A resource-based method for named entity extraction and classification. LNCS, 7026:610–623.

Gamallo, Pablo and Isaac González. 2010. Wikipedia as a multilingual source of comparable corpora. In LREC 2010 Workshop on Building and Using Comparable Corpora, pages 19–26, Valletta, Malta.

Han, B. and T. Baldwin. 2012a. Automatically constructing a normalisation dictionary for microblogs. In Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2012), Jeju, Korea.

Han, B. and T. Baldwin. 2012b. Lexical normalisation of short text messages: Makn sens a #twitter. In 49th Annual Meeting of the Association for Computational Linguistics, pages 368–378, Portland, Oregon, USA.

Han, B. and T. Baldwin. 2013. Lexical normalisation of social media text. ACM Transactions on Intelligent Systems and Technology, 4(1):15–27.

Kaufmann, J. and J. Kalita. 2010. Syntactic normalization of Twitter messages. In Conference on Natural Language Processing, Kharagpur, India.

Padró, Lluís and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards wider multilinguality. In Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey.
