
Text mining on COVID19 datasets Terminology extraction


Academic year: 2021


(1)

Text mining on COVID19 datasets

Terminology extraction

Mathieu Roche

Cirad – TETIS – Montpellier, France

(2)

Informal sources: press, social media, blogs

Data: not verified, not validated

Unstructured data

Context: Event-based surveillance

How to use terminology and text mining for event-based surveillance systems?

(3)

(Arsevska et al., Plos One 2018) (Valentin et al., CEA 2020)

(4)

> 300.000 news

Dataverse

(5)

Context: PADI-Web and Terminology

Terminology for

(6)

Terminology extraction and COVID-19

Objectives

Aim 1: Terminology extraction for surveillance systems and information extraction

- Disease-based surveillance

- Symptom-based surveillance

Aim 2: Trend analysis of COVID-19 terminology per period and location

(7)

- Scientific publications: COVID-19 Open Research Dataset (CORD-19): https://pages.semanticscholar.org/coronavirus-research [Allen Institute for AI]

- Media data: Medisys and PADI-web data:

  - https://dataverse.cirad.fr/dataset.xhtml?persistentId=doi:10.18167/DVN1/ZUA8MF

  - https://dataverse.cirad.fr/dataset.xhtml?persistentId=doi:10.18167/DVN1/MSLEFC

- Social media data (tweets): https://github.com/echen102/COVID-19-TweetIDs (corpus collected since January 22, 2020: https://arxiv.org/abs/2003.07372)


(9)

Terminology extraction in scientific papers

BioTex (Lossio et al., 2016)

Example term: respiratory syndrome

1. Part-of-Speech tagging

2. Candidate term extraction

3. Ranking of candidate terms
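The three steps listed on this slide (part-of-speech tagging, candidate term extraction, ranking of candidate terms) can be sketched in a few lines of Python. This is only a toy illustration, not the BioTex implementation: the lexicon `TOY_LEXICON`, the tag patterns in `PATTERNS`, and the ranking score (frequency multiplied by term length) are all assumptions made for the example. A real system would rely on a trained tagger and a dedicated termhood measure.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for a real part-of-speech tagger.
TOY_LEXICON = {
    "severe": "ADJ", "acute": "ADJ", "respiratory": "ADJ",
    "syndrome": "NOUN", "coronavirus": "NOUN", "infection": "NOUN",
    "causes": "VERB", "is": "VERB", "a": "DET", "the": "DET",
}

# Hypothetical syntactic patterns (tag sequences) accepted as candidate terms.
PATTERNS = {("NOUN",), ("ADJ", "NOUN"), ("NOUN", "NOUN"), ("ADJ", "ADJ", "NOUN")}


def pos_tag(sentence):
    """Step 1: attach a part-of-speech tag to each token."""
    tokens = sentence.lower().replace(".", "").split()
    return [(tok, TOY_LEXICON.get(tok, "X")) for tok in tokens]


def extract_candidates(tagged):
    """Step 2: keep every n-gram (n <= 3) whose tag sequence matches a pattern."""
    candidates = []
    for size in range(1, 4):
        for start in range(len(tagged) - size + 1):
            window = tagged[start:start + size]
            if tuple(tag for _, tag in window) in PATTERNS:
                candidates.append(" ".join(tok for tok, _ in window))
    return candidates


def rank_candidates(sentences):
    """Step 3: rank candidates, here by frequency weighted by term length."""
    counts = Counter()
    for sentence in sentences:
        counts.update(extract_candidates(pos_tag(sentence)))
    scored = {term: freq * len(term.split()) for term, freq in counts.items()}
    # Highest score first; ties are broken alphabetically.
    return sorted(scored.items(), key=lambda item: (-item[1], item[0]))


corpus = [
    "The coronavirus causes a severe acute respiratory syndrome.",
    "A respiratory syndrome is a severe infection.",
]
ranking = rank_candidates(corpus)
print(ranking[:3])
```

On this two-sentence corpus the multi-word term "respiratory syndrome" comes out first, because it is both frequent and longer than a single word.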

(10)

Terminology extraction in scientific papers [free extraction]

BioTex (Lossio et al., 2016)

FASTR (Jacquemin et al., 1997)

(Kafando et al., MOOD003)

Example term: respiratory syndrome

(11)

Terminology extraction in scientific papers

Results (Kafando et al., MOOD003)

Variations: 72% are relevant


(14)

COVID-19 and Media Dataset: Mining textual data according to periods and countries

UK

Rank | Period 1 (March 2020) | Period 2 (May 2020) | Period 3 (July 2020)
1 | face mask (10) | coronavirus mask (19) | masks (2)
2 | gas mask (59) | masks (40) | mask (17)
3 | protective mask (167) | mask mess (53) | mandatory mask (237)
4 | mask (202) | surgical mask (57) | mandatory mask-wearing (238)
5 | masks (227) | face masks (80) | mandatory masks (239)

Spain

Rank | Period 1 (March 2020) | Period 2 (May 2020) | Period 3 (July 2020)
1 | máscara de protección (1) | máscara facial (87) | máscaras antigás con filtro (10)
2 | máscara de snorkel (34) | máscaras quirúrgicas (152) | máscaras antigás (21)
3 | máscara antigás (47) | máscaras (170) | máscara obligatoria en comercios (23)
4 | máscara (80) | máscara (212) | máscara obligatoria (37)
5 | máscara de tristeza (489) | máscara marrón (366) | diseños de máscaras (86)

France

Rank | Period 1 (March 2020) | Period 2 (May 2020) | Period 3 (July 2020)
1 | masques (3) | masques (1) | port du masque (21)
2 | masque à abidjan (21) | masque de protection (161) | masque (32)

(15)

A new ranking measure

TF-IDF:

- Popular Information Retrieval measure

- Better score for terms frequent in a document and rare in the others

- For Twitter: 1 document = 1 tweet
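As a minimal sketch of the TF-IDF idea described above, the snippet below treats each tweet as one document and scores a term by its relative frequency in the tweet multiplied by log(N / df), where N is the number of tweets and df the number of tweets containing the term. This is one common TF-IDF variant among several; the function name `tf_idf`, the whitespace tokenization, and the example tweets are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document (here, one per tweet)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Invented example tweets: 1 document = 1 tweet.
tweets = [
    "mask shortage in hospitals",
    "mandatory mask in shops",
    "lockdown extended in march",
]
scores = tf_idf(tweets)
print(scores[0])
```

A term present in every tweet ("in") gets a score of zero, while a term found in a single tweet ("shortage") scores higher than one shared by several tweets ("mask").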

H-TFIDF:

- Exploration of different spatio-temporal criteria in order to extract discriminative terms
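The slide does not define H-TFIDF, so the following is only a hedged illustration of what a spatio-temporal criterion could look like, not the authors' measure. It assumes that tweets are grouped by a (country, period) key, that each group is treated as one aggregated document, and that a TF-IDF style score is then computed across groups. The function names, the grouping key, and the example tweets are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def spatio_temporal_tfidf(tweets):
    """Score terms per (country, period) group instead of per tweet.

    `tweets` is a list of (country, period, text) tuples; this grouping key
    is an assumption made for illustration only.
    """
    # Aggregate all tweets sharing a country and a period into one bag of words.
    groups = defaultdict(Counter)
    for country, period, text in tweets:
        groups[(country, period)].update(text.lower().split())

    n_groups = len(groups)
    # Number of groups in which each term occurs.
    group_freq = Counter(term for bag in groups.values() for term in bag)

    result = {}
    for key, bag in groups.items():
        total = sum(bag.values())
        result[key] = {
            term: (count / total) * math.log(n_groups / group_freq[term])
            for term, count in bag.items()
        }
    return result


def top_terms(scores, k=3):
    """Return the k best-scoring terms of one group, ties broken alphabetically."""
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


# Invented example tweets.
tweets = [
    ("UK", "2020-03", "face mask shortage"),
    ("UK", "2020-03", "mask shortage in hospitals"),
    ("UK", "2020-07", "mandatory mask in shops"),
    ("FR", "2020-07", "port du masque obligatoire"),
]
scores = spatio_temporal_tfidf(tweets)
print(top_terms(scores[("UK", "2020-03")]))
```

With this grouping, "shortage" stands out for the UK in March 2020 and "mandatory" for the UK in July 2020, which is the kind of period- and location-specific term the slide refers to as discriminative.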

(16)

Experiments conducted on 270,000 tweets

(17)

Conclusion and challenges

Genericity

Social media

Quality criteria

Informal sources: press, social media, blogs

Data: not verified, not validated

Unstructured data

(18)

Publications

Roche M. COVID-19 and Media Datasets: Period- and location-specific textual data mining.

Data in Brief. Elsevier. Volume 33, December 2020

https://doi.org/10.1016/j.dib.2020.106356

Valentin S, Mercier A, Roche M, Lancelot R, Arsevska E. Monitoring online media reports for early detection of unknown diseases: insight from a retrospective study of COVID-19 emergence. Transboundary and Emerging Diseases. 2020

https://onlinelibrary.wiley.com/doi/10.1111/tbed.13738

Decoupes R, Kafando R, Roche M, Teisseire M. H-TFIDF: What makes areas specific over time in the massive flow of tweets related to the covid pandemic? In Proc. of AGILE conference (Association of Geographic Information Laboratories in Europe), to appear 2021

Data

Valentin S, Mercier A, Lancelot R, Roche M, Arsevska E. PADI-web COVID-19 corpus: news
