• Aucun résultat trouvé

Descriptors for the Detection of the Chemical Risk

N/A
N/A
Protected

Academic year: 2022

Partager "Descriptors for the Detection of the Chemical Risk"

Copied!
2
0
0

Texte intégral

(1)

Descriptors for the detection of the chemical risk

Natalia Grabar UMR8163 STL CNRS, Universit´e Lille 3 Villeneuve d’Ascq, France

natalia.grabar@univ-lille3.fr

Thierry Hamon LIMSI-CNRS, Orsay

Universit´e Paris 13 Sorbonne Paris Cit´e, France

hamon@limsi.fr

Abstract

We propose an experience on the automatic detection of sentences conveying the notion of chemical risk. Our objective is to study which resources are useful for the automatic detection of such sentences. Lexical, se- mantic and opinion-oriented content of the sentences is studied. Our results indicate that not only lexical and semantic content must be taken into account, but also markers related to the modality, opinion and polarity.

1 Introduction

Chemical risk is relative to situations in which chemical products are dangerous for human or animal health and consumption, and for environ- ment. The automatization of the process can help the experts to control and manage large amounts of scientific literature, that have to be analyzed to support the decision making process (van der Sluijs et al., 2008). The sentences that must be rec- ognized are for instance:The Panel concluded that the current NOAEL for BPA (5 mg/kg b.w./day) would be sufficiently low to exclude any concern for this effect, or Despite this lack of evidence, the possibility of poultry and egg consumption as an exposure route to HPAIV remains a concern to food safety experts. Such sentences are to be as- signed in categories related to the chemical risk:

the first sentence is related to the significance of the results, while the second is related to the qual- ity of the scientific hypothesis. If such sentences are detected in scientific publications or reports, it means that these publications or reports contain information not fully reliable and can possibly in- dicate the insufficiency of the corresponding stud- ies and the presence of the risk.

The chemical risk is poorly studied, although the notion of the risk is addressed by other works:

building of the dedicated resources (Makki et al., 2008), exploring of known industrial incidents (Tulechki and Tanguy, 2012), computing the ex- position to the risk (Marre et al., 2010). Our objec- tive is to study which resources are useful for the automatic detection of the sentences which convey the notion of the chemical risk.

2 Material and Methods

In addition to the lexical and semantic content of the text, we use several kinds of resources in order to favour one aspect or another. These resources contain markers oriented on modality, opinion and polarity expressed by the authors on the proposed experiements: (1) uncertainty (possible, should, may, usually) indicates that there are doubts on the results presented, their interpretation, etc.;(2) negation (no, neither, lack, absent, missing) indi- cates that the results have not been observed, that the study does not respect the expected norms, etc.; (3) limitations (only, shortcoming, insuffi- cient) indicates that there are some limits of the work, such as unsufficient sample size, small num- ber of tests or doses explored, etc.; (4) approxi- mation (approximately, commonly, estimated) in- dicates other kinds of insufficiency related to im- precise values of substances, samples, dosage, etc.

The work is done with the corpus on chemi- cal risk reporting on several chemical experiments with bisphenol A (EFSA Panel, 2010). It contains over 80,000 occurrences. The reference data are obtained through a manual categorization of the corpus sentences: 425 sentences are assigned to 55 classes of the chemical risk.

Proceedings of the conference Terminology and Artificial Intelligence 2015 (Granada, Spain) 191

(2)

0 0.2 0.4 0.6 0.8 1

all form lemm lf lft stag tag

performance

descripteurs freq norm tfidf

(a) Significance of the results

0 0.2 0.4 0.6 0.8 1

all form lemm lf lft stag tag

performance

descripteurs freq norm tfidf

(b) Natural variability

Figure 1: F-measure obtained during the categorization of sentences into classes of the chemical risk.

We tackle the problem through the supervized categorization with the Weka platform (Witten and Frank, 2005). Sentences correspond to the units, while 7 classes (most frequent) of the chem- ical risk are the categories to which the sentences have to be assigned. The resources and the linguis- tic annotation of corpus (Schmid, 1994) provide several descriptors. These are used to build sev- eral sets of descriptors. They represent the seman- tic and linguistic content of the sentences: forms (the forms such as they occur in the corpus),lem- mas(lemmatized forms),lf(combination of forms and lemmas),tag(POS tags, such as nouns, verbs, adjectives),lft(combination of forms, lemmas and POS-tags), stag(semantic tags of words, such as uncertainty, negation, limitations), all (combina- tion of all the descriptors available). The descrip- tors are weighted with various methods (freqraw frequency,normnormalization by the length of the sentences, andtfidftf-idf normalization).

3 Results

Figure 1 presents some results obtained for two categories:Significance of the resultsandNatural variability of the results. We can observe some difference according to the descriptors: the ex- ploitation of forms, semantic tags (with Signifi- cance of the results) and various combinations of descriptors provide results that are often better for these two categories and for other categories. We assume that these two kinds of descriptors (lexical and semantic content of corpus and the descriptors related to modality, polarity and opinion (Vinod- hini and Chandrasekaran, 2012)) provide comple-

mentary views on the content and should be com- bined. These results also indicate that chemical risk is not fully conceptual category but is also re- lated to subjective and contextual values.

References

EFSA Panel. 2010. Scientific opinion on Bisphenol A:

evaluation of a study investigating its neurodevelop- mental toxicity, review of recent scientific literature on its toxicity and advice on the danish risk assess- ment of Bisphenol A.EFSA journal, 8(9):1–110.

J Makki, AM Alquier, and V Prince. 2008. Ontology population via NLP techniques in risk management.

InProceedings of ICSWE.

A Marre, S Biver, M Baies, C Defreneix, and C Aventin. 2010. Gestion des risques en ra- dioth´erapie.Radioth´erapie, 724:55–61.

H Schmid. 1994. Probabilistic part-of-speech tag- ging using decision trees. In International Con- ference on New Methods in Language Processing, pages 44–49.

N Tulechki and L Tanguy. 2012. Effacement de di- mensions de similarit´e textuelle pour l’exploration de collections de rapports d’incidents a´eronautiques.

InTALN, pages 439–446.

Jeroen P van der Sluijs, Arthur C Petersen, Peter H M Janssen, James S Risbey, and Jerome R Ravetz.

2008. Exploring the quality of evidence for com- plex and contested policy decisions. Environ. Res.

Lett., 3(2).

G Vinodhini and RM Chandrasekaran. 2012. Senti- ment analysis and opinion mining: A survey. Inter- national Journal of Advanced Research in Computer Science and Software Engineering, 2(6):282–292.

I.H. Witten and E. Frank. 2005. Data mining: Practi- cal machine learning tools and techniques. Morgan Kaufmann, San Francisco.

Proceedings of the conference Terminology and Artificial Intelligence 2015 (Granada, Spain) 192

Références

Documents relatifs

3.2 Information fusion for Drama genre To evaluate the ”Drama” character of a movie thanks to its image content, the different symbols are fused according different symbolic

Estimates are made by multiple regression where the dependent variable is in turn the frequency, intensity, and duration of inversions and the explanatory variables

[10] M. Lenz, Emotion related structures in large image databases, in: Proceedings of the ACM International Conference on Image and Video Retrieval, ACM, 2010, pp.. Hanbury,

∙ SURF 1024 and SURF 4096: SURF features extracted on a dense grid, from the original video keyframe; K-means vocabulary clustering produces 1024 and 4096 visual words (therefore a

A good characterization descriptor must be stable over environments. The stability of the chosen descriptors have not been verified.?. )if not heritable, the descriptors values could

Face detection based on generic local descriptors and spatial constraints.. Veronika Vogelhuber,

Besides unilateral contact and Coulomb friction, other parameters relevant to the shape and contact anisotropy, which influence the anisotropy of stress

Studied system Modeled property Encoded information Local descriptors Training set size Machine- learning method 1 Complexes of organic molecules with I 2 in hexane i