Information extraction in documents


Mathieu Roche

Cirad – TETIS – Montpellier, France

Web : http://www.textmining.biz

Analysis of spatial and thematic dynamics using text mining methods

M. Roche – GAST’2017, 24 January 2017, Grenoble

Data Science

Volume

Velocity

Variety

Variability, Veracity, Value, Visualisation, Valorization

Multidisciplinary domain; 3 Vs of Big Data


Songes Project

Logs

0/1 x B_1404 [WARNING]: "Asynchronous reset/set/load <%item> exists in module/unit"

0/1 x B_1405 [WARNING]: "<%value> asynchronous resets in this unit detected"

0/1 x B_1406 [WARNING]: "<%value> synchronous resets in this unit detected"

0/1 x B_1407 [ERROR]: "Do not use active high asynchronous reset/set/load"

// Total Module Instance Coverage Summary TOTAL COVERED PERCENT

lines 501 158 31.54 statements 501 158 31.54 Policy: DESIGN Ruleset: RESETS

<violated>/<checked> x <label> [<severity>]:<message>

--- --- ---

0/1 x NTL_RST04 [ERROR]: "A reset signal is not allowed to be used as

Et il meismes en descovri son corage a Lancelot et dist que, a l'ore que la guerre commença, baoit il a tot le monde conquerre: et bien i parut, kar il fu a vint et cinc ans chevaliers et puis conquist il .XXVIII. roialmes [72d]

et a trente noef ans fu la fin de son aage. Mais de totes ces choses le traist Lancelos ariere et il li mostra bien, la ou il fist de sa grant honor sa grant honte, quant il estoit au desus le roi Artu et il li ala merci crier...

Textual Data Science

Textual data and satellite images – ANIMITEX (2013-2014)

Textual data and databases – QDoSSI (since 2016)

Multidisciplinary projects (Mastodons)

Generic Process

Step 1: Spatio-temporal entities

Step 2: Themes

Step 3: Sentiments

Senterritoire and

Outline

Part 1 Data Science and Big Data

Part 2 Information extraction in documents

Part 3 Sentiment Analysis

Part 4 QDoSSI project

Part 5 Future Work

Part 2

Information extraction in documents

Animal disease surveillance

Generic Process

Step 1: Spatio-temporal entities

Step 2: Themes

Step 3: Sentiments

Knowledge

Why do we need epidemic intelligence?

More than 60% of the initial outbreak reports come from unofficial, informal, and heterogeneous sources, including sources other than the electronic media, which require verification [Arsevska et al. ISVEE'2015]

How to detect disease outbreaks on the Web?

•  Five animal disease models: African swine fever (ASF), foot-and-mouth disease (FMD), bluetongue (BTV), Schmallenberg virus (SBV), and influenza

Methodology

Methodology - step 1

•  Step 1: Data acquisition

[Falala et al. IC-Demo'2016]

Methodology - step 2

•  Step 2: Data classification

Methodology - step 3

•  Step 3: Information extraction and management

Methodology - step 3

•  Step 3: Information extraction (I)

Aim: Automatically detect key information in Web news articles (country, species, diseases, number of cases, dates, …)

“Since its initial appearance in Poland in February 2014, 72 cases of African Swine Fever have been detected in wild boars and there have been three outbreaks in pigs.” - http://www.thenews.pl

•  Use dictionaries (Geonames, HeidelTime, disease names, species names, etc.) and data mining techniques to learn extraction rules

Rules associated with ‘case numbers’:

(number)(species_name,1-3) with frequency 26% and confidence 83%
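As an illustration only, a rule such as (number)(species_name,1-3) could be applied as sketched below. The sketch assumes that "1-3" means the species name appears 1 to 3 tokens after the number, which the slide does not state explicitly. The `SPECIES` and `NUMBER_WORDS` sets are hypothetical stand-ins for the real dictionaries, and `match_case_rule` is an illustrative name.

```python
import re

# Hypothetical mini-dictionaries standing in for the real resources.
SPECIES = {"pigs", "boars", "cattle", "sheep"}
NUMBER_WORDS = {"one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"}


def is_number(token):
    """True for digit strings and a few spelled-out numbers."""
    return token.isdigit() or token.lower() in NUMBER_WORDS


def match_case_rule(text, min_gap=1, max_gap=3):
    """Return (number, species) pairs where a species name occurs
    min_gap..max_gap tokens after a number (assumed rule semantics)."""
    tokens = re.findall(r"\w+", text)
    matches = []
    for i, token in enumerate(tokens):
        if not is_number(token):
            continue
        for offset in range(min_gap, max_gap + 1):
            j = i + offset
            if j < len(tokens) and tokens[j].lower() in SPECIES:
                matches.append((token, tokens[j]))
                break  # keep only the closest species for this number
    return matches


sentence = ("Since its initial appearance in Poland in February 2014, "
            "72 cases of African Swine Fever have been detected in wild boars "
            "and there have been three outbreaks in pigs.")
print(match_case_rule(sentence))  # [('three', 'pigs')]
```

On the example sentence, "72 ... wild boars" is not matched because the species is more than three tokens away from the number, which suggests why a single rule reaches limited frequency and why several learned rules are combined.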

Methodology - step 3

•  Step 3: Information extraction and disambiguation (I)

Spatial features taken into account in the learning approach:

-  The frequency in the document: the fraction of occurrences of this candidate geographical entity in the document, among all the location candidates in the document.

-  The country frequency in the document: the fraction of occurrences of the country of this candidate geographical entity in the document, among all the location candidates in the document.

-  The relative position in the document: the position of this geographical entity in the document, among all other candidate geographical entities, normalized between 0 and 1.

-  The total number of candidate geographical entities in the document.
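The four spatial features listed above can be sketched as follows. This is a minimal illustration, assuming each candidate is already available as a (name, country) pair in order of appearance; the function name `spatial_features`, the dictionary keys, and the sample data are hypothetical and not taken from the original system.

```python
from collections import Counter


def spatial_features(candidates):
    """candidates: list of (name, country) tuples in order of appearance.
    Returns one feature dictionary per candidate occurrence."""
    n = len(candidates)
    name_counts = Counter(name for name, _ in candidates)
    country_counts = Counter(country for _, country in candidates)
    features = []
    for rank, (name, country) in enumerate(candidates):
        features.append({
            "name": name,
            # share of all location candidates that are this entity
            "frequency": name_counts[name] / n,
            # share of all location candidates located in the same country
            "country_frequency": country_counts[country] / n,
            # rank among candidates, scaled to [0, 1]
            "relative_position": rank / (n - 1) if n > 1 else 0.0,
            # total number of candidates in the document
            "n_candidates": n,
        })
    return features


doc_candidates = [("Warsaw", "PL"), ("Poland", "PL"), ("Kyiv", "UA"), ("Warsaw", "PL")]
for row in spatial_features(doc_candidates):
    print(row)
```

Each dictionary could then serve as part of the feature vector given to a classifier that decides whether a candidate location is the correct one.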

Methodology - step 3

•  Step 3: Information extraction (I)

- Results for the rule-based approach on the annotated corpus

- Classification based on SVM (features are rules)

- 2 classes: correct, incorrect

- 10-fold cross validation

Type        Accuracy
Locations   0.80
Dates       0.88
Diseases    0.96
Cases       0.85
Species     0.96

Julien Rabatel, LIRMM, Numev, France

Methodology - step 3

•  Step 3: Information management (II)

Methodology - step 3

•  II. Querying the Web: (a) Terminology extraction

[Lossio Ventura et al. ISWC-Demo'2014]

Methodology - step 3

•  II. Querying the Web: (b) Terminology ranking

-  BioTex Ranking [Lossio Ventura et al. IRJ'2016]

-  A new ranking function to take into account the heterogeneity of the sources (Si) [Arsevska et al. CEA'2016]

Methodology - step 3

•  II. Querying the Web: (c) Terminology validation

Using a Delphi method [Arsevska et al. LREC'2016]

The Delphi method seeks group consensus among experts (5 to 7 experts for each disease) when knowledge is not sufficient to answer a given scientific question

Methodology - step 3

•  II. Querying the Web: (d) Association of terms

$$\mathrm{DANDweb} = \frac{2 \times \operatorname{hit}(h \ \mathrm{AND}\ cs)}{\operatorname{hit}(h) + \operatorname{hit}(cs)}$$

[Roche and Prince Informatica’2010 ; Arsevska et al. IJAEIS'2016]
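A minimal sketch of this Dice-style measure is given below. It assumes the three hit counts (pages matching h, pages matching cs, and pages matching both) have already been retrieved from a search engine; the function name `dand_web` and the numbers in the usage example are hypothetical.

```python
def dand_web(hits_h, hits_cs, hits_both):
    """Dice-style association from web hit counts:
    2 * hit(h AND cs) / (hit(h) + hit(cs))."""
    denominator = hits_h + hits_cs
    if denominator == 0:
        return 0.0  # neither term was found, so no evidence of association
    return 2 * hits_both / denominator


# Hypothetical counts: 1000 pages for h, 500 for cs, 300 containing both.
print(dand_web(1000, 500, 300))
```

The score lies between 0 (the two terms never co-occur) and 1 (they always appear together), so term pairs can be ranked directly by it.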

Influenza

Multidisciplinary results

Scientific results in Epidemiology:

•  Wider spatial detection than official surveillance (e.g. ASF detected in Uganda in February and March 2016)

•  Detection of important information related to diseases not given by official surveillance systems (e.g. preventive vaccinations)

•  Detection of information about other diseases (e.g. SBV)

Scientific results in Computer Science:

•  New visualisation systems

•  New ranking measures to extract keywords from texts

•  New methods to extract information in textual documents (i.e. new features based on sequential patterns)

•  Implemented system in collaboration between LIRMM, TETIS, and ASTRE

Part 3

Sentiment Analysis

Sentiment analysis

Methods in order to identify sentiments:

« Towards a sentiment lexicon » [Cruz et al., SimBig'2016]

Step 1: choice of seeds related to opinions

P = {good; nice; excellent; positive; fortunate; correct; superior}

N = {bad; nasty; poor; negative; unfortunate; wrong; inferior}

Construction of 14 corpora related to a specific domain

Step 2: PoS, Association rules, choice of a window

Sentiment analysis

Step 3: Statistical selection and web mining

Statistical measures (Dice and MI) quantify the association between seed adjectives and candidate adjectives, based on "hits" from the web (i.e. search engine result counts) and contextual information
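To make the MI-based selection concrete, here is a small sketch that labels a candidate adjective by comparing its mean mutual information with positive and negative seeds. It assumes pointwise MI estimated from hit counts and a known total number of indexed pages; the exact formulation used in the cited work may differ, and all counts below are invented for illustration.

```python
import math


def mutual_information(hits_x, hits_y, hits_xy, n_pages):
    """Pointwise MI in bits, estimated from hit counts over n_pages pages."""
    if hits_x == 0 or hits_y == 0 or hits_xy == 0:
        return float("-inf")  # no co-occurrence evidence
    return math.log2(hits_xy * n_pages / (hits_x * hits_y))


def polarity(candidate, hits, co_hits, pos_seeds, neg_seeds, n_pages):
    """Label a candidate by its mean MI with each seed set."""
    def mean_mi(seeds):
        scores = [
            mutual_information(hits[candidate], hits[seed],
                               co_hits.get((candidate, seed), 0), n_pages)
            for seed in seeds
        ]
        return sum(scores) / len(scores)

    return "positive" if mean_mi(pos_seeds) > mean_mi(neg_seeds) else "negative"


# Invented hit counts for one candidate and a subset of the seeds.
hits = {"hilarious": 200, "good": 1000, "nice": 500, "bad": 800, "nasty": 100}
co_hits = {("hilarious", "good"): 120, ("hilarious", "nice"): 60,
           ("hilarious", "bad"): 20, ("hilarious", "nasty"): 2}
print(polarity("hilarious", hits, co_hits, ["good", "nice"], ["bad", "nasty"], 100_000))
```

With these toy counts the candidate is labelled positive, since it co-occurs with the positive seeds far more often than chance would predict.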

Examples of learnt adjectives:

great, hilarious, funny, happy, perfect, important, beautiful, amazing, complete, major, helpful

Agriculture domain:

{gmo; agricultural biotechnology; biotechnology for agriculture}

Examples of learnt adjectives: green, healthy, enthusiastic, creative, etc.

Laura Vanessa Cruz, San Agustin University, Peru

Sentiment analysis

Work in progress: tweets and SMS (88milSMS corpus)

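-  Impact of repetition of characters for sentiment analysis: « j’aaaadore IC’2016 ! » ("I looove IC’2016!") [Khiari et al., JADT'2016]

-  Identification of new spatial entities and new spatial relations: « IC’2016 est organisé sur montpeul !» ("IC’2016 is being held in montpeul!", where "montpeul" is an informal name for Montpellier) [Zenasni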
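As a rough illustration of how character repetition could be turned into a feature, the sketch below flags tokens containing a character repeated three or more times and produces a normalized text. The threshold of three, the collapse to a single character, and the name `elongation_features` are assumptions made for this example, not details from the cited work.

```python
import re

# A word character repeated three or more times in a row, e.g. "aaaa".
ELONGATION = re.compile(r"(\w)\1{2,}")


def elongation_features(text):
    """Count elongated tokens and return the text with runs collapsed."""
    tokens = text.split()
    elongated = [token for token in tokens if ELONGATION.search(token)]
    # Collapse each run to one character; this is lossy for words that
    # legitimately contain a doubled letter inside the run.
    normalized = ELONGATION.sub(r"\1", text)
    return {"n_elongated": len(elongated), "normalized": normalized}


print(elongation_features("j'aaaadore IC'2016 !"))
```

The count can be used as an intensity cue for sentiment, while the normalized form lets the token be looked up in a standard lexicon.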

Part 4

QDoSSI

Data quality

QDoSSI

CNRS Mastodons 2016

Quality of multi-source data (Qualité des Données multi-Sources)

A twofold challenge for the social sciences and computer science

Different challenges

QDoSSI – Objectives

Sources (left to right): map excerpts / Lucie Bacon, 2015, « Variations sur l’épisode migratoire de Kreuz. Cartographies mises en scène », Mediapart ; Cyril Roussel, 2013, « Les migrations des membres du Komala entre l’Iran et le Kurdistan d’Irak », Atlas du Kurdistan d’Irak ; Nelly Robin, 2013,

Balkan Corridor; Iraqi Kurdistan; Trans-Saharan and sea lanes

QDoSSI – Corpora

Text mining on interviews

3 corpora:

- Balkan Corridor - Asylum seekers (in English and French)

- Atlantic sea lanes - Economic migrants (in French)

- Trans-Saharan Africa - Isolated foreign minors (in French)

QDoSSI – Methodology and results

Methodology:

- Construction of multilingual corpora

- Automatic pre-processing of texts

- Adaptation of BioTex

- Manual analysis of extracted terms in order to identify (sub)concepts

Results: Multilingual thesaurus

- 3 levels (e.g., Actors -> Authorities -> Police Authorities)

- 5 main concepts: Actors (family, friends, authorities, organisation, clandestine actors, solidarity network, population), Resources (food and accommodation, transport, communication, papers, work, money, information), Migration management, Motivations, Sentiments [+ temporal and spatial information]

- More than 300 instances in English / 400 instances in French: words (e.g., road) and multi-word terms (e.g., airline tickets).
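One simple way to picture the three-level thesaurus is as a nested dictionary, sketched below. Only the Actors -> Authorities -> Police Authorities path and the concept names come from the slide; the choice of a nested `dict`, the partial content, and the helper `concept_path` are assumptions for illustration.

```python
# Partial, illustrative thesaurus: concept -> sub-concepts.
THESAURUS = {
    "Actors": {
        "Authorities": {"Police Authorities": {}},
        "Family": {},
        "Friends": {},
    },
    "Resources": {
        "Transport": {},
        "Money": {},
    },
}


def concept_path(tree, target, prefix=()):
    """Depth-first search returning the path from a top-level concept
    down to `target`, or None if the concept is absent."""
    for name, children in tree.items():
        path = prefix + (name,)
        if name == target:
            return path
        found = concept_path(children, target, path)
        if found is not None:
            return found
    return None


print(" -> ".join(concept_path(THESAURUS, "Police Authorities")))
```

Extracted terms, in either language, could then be attached to the leaf concepts so that occurrences in interviews are counted per concept and per corpus.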

QDoSSI – Themes

Analysis of concepts intra-corpus and inter-corpus

QDoSSI – Spatial Entities

Extraction of 3 types of information (spatial, theme, temporal)

QDoSSI – Future Work

-  Analysis of concepts (data mining + visualisation)

-  Vocabulary of spatiality: forêt proche (nearby forest), ville là-bas (town over there), lieu de départ (place of departure), big forest

-  Sentiment analysis based on emotions: total disaster, horrible trip, lovely connection, good camp, nice room, stress, panic

-  Construction of itineraries

Main issues: aggregation of itineraries, integration of semantic knowledge

- And… other data, other issues

Part 5

Conclusions and future work

Conclusions

New challenges of Textual Data Science:

-  Matching different types of documents (image/text, video/text, and so forth)

-  Integration of visual analytics skills [Fadloun, Inforsid'2016]

How to match documents?

•  Data and Issue

•  Hard disk (157,188 files)
