Text mining on COVID-19 datasets
Terminology extraction
Mathieu Roche
Cirad – TETIS – Montpellier, France
Informal sources: press, social media, blogs
Data: not verified, not validated
Unstructured data
Context: Event-based surveillance
How to use terminology and text mining for event-based surveillance systems?
(Arsevska et al., PLoS ONE 2018) (Valentin et al., CEA 2020)
> 300,000 news articles
Dataverse
Context: PADI-Web and Terminology
Terminology for
Terminology extraction and COVID-19
Objectives
Aim 1: Terminology extraction for surveillance systems and information extraction
- Disease-based surveillance
- Symptom-based surveillance
Aim 2: Trend analysis of COVID-19 terminology per period and location
- Scientific publications: COVID-19 Open Research Dataset (CORD-19): https://pages.semanticscholar.org/coronavirus-research [Allen Institute for AI]
- Media data: Medisys and PADI-web data:
  - https://dataverse.cirad.fr/dataset.xhtml?persistentId=doi:10.18167/DVN1/ZUA8MF
  - https://dataverse.cirad.fr/dataset.xhtml?persistentId=doi:10.18167/DVN1/MSLEFC
- Social media data (tweets): https://github.com/echen102/COVID-19-TweetIDs (corpus collected since January 22, 2020: https://arxiv.org/abs/2003.07372)
Aim 1: Terminology extraction for surveillance systems and information extraction
Terminology extraction in scientific papers
BioTex (Lossio et al., 2016), example term: respiratory syndrome
1. Part-of-Speech tagging
2. Candidate term extraction
3. Ranking of candidate terms
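The three steps above can be sketched in a few lines of Python. This is a simplified illustration, not BioTex itself: the toy lexicon `TOY_LEXICON`, the adjective/noun pattern, and the frequency-based ranking are all assumptions chosen to keep the example self-contained. A real pipeline would rely on a trained tagger and a more elaborate ranking measure.

```python
import re
from collections import Counter

# Toy part-of-speech lexicon (assumption for this sketch only).
TOY_LEXICON = {
    "severe": "ADJ", "acute": "ADJ", "respiratory": "ADJ", "viral": "ADJ",
    "syndrome": "NOUN", "infection": "NOUN", "patients": "NOUN",
    "virus": "NOUN", "disease": "NOUN",
}

def pos_tag(sentence):
    """Step 1: tag each token; unknown words get the tag 'OTHER'."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [(tok, TOY_LEXICON.get(tok, "OTHER")) for tok in tokens]

def extract_candidates(tagged, max_len=4):
    """Step 2: keep adjective/noun sequences (2 to max_len words) ending in a noun."""
    candidates = []
    n = len(tagged)
    for start in range(n):
        for end in range(start + 2, min(start + max_len, n) + 1):
            window = tagged[start:end]
            only_adj_noun = all(tag in ("ADJ", "NOUN") for _, tag in window)
            if only_adj_noun and window[-1][1] == "NOUN":
                candidates.append(" ".join(tok for tok, _ in window))
    return candidates

def rank_candidates(candidates):
    """Step 3: rank by raw frequency; ties go to longer terms, then alphabetical."""
    counts = Counter(candidates)
    return sorted(counts.items(),
                  key=lambda kv: (-kv[1], -len(kv[0].split()), kv[0]))

def extract_terms(text):
    """Run the three steps sentence by sentence so terms never span sentences."""
    candidates = []
    for sentence in re.split(r"[.!?]+", text):
        candidates.extend(extract_candidates(pos_tag(sentence)))
    return rank_candidates(candidates)

text = ("Severe acute respiratory syndrome is a viral infection. "
        "Patients with respiratory syndrome were tested for the virus.")
ranked = extract_terms(text)
for term, freq in ranked:
    print(freq, term)
```

On this two-sentence sample, "respiratory syndrome" comes first because it is the only candidate occurring twice.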
Terminology extraction in scientific papers [free extraction]
respiratory syndrome: BioTex (Lossio et al., 2016), Faster (Jacquemin et al., 1997)
(Kafando et al., MOOD003)
Terminology extraction in scientific papers
Results
Variations: 72% are relevant
Aim 2: Trend analysis of COVID-19 terminology per period and location
COVID-19 and Media Dataset:
Mining textual data according to periods and countries

UK
# | Period 1 (March 2020) | Period 2 (May 2020) | Period 3 (July 2020)
1 | face mask (10) | coronavirus mask (19) | masks (2)
2 | gas mask (59) | masks (40) | mask (17)
3 | protective mask (167) | mask mess (53) | mandatory mask (237)
4 | mask (202) | surgical mask (57) | mandatory mask-wearing (238)
5 | masks (227) | face masks (80) | mandatory masks (239)

Spain
# | Period 1 (March 2020) | Period 2 (May 2020) | Period 3 (July 2020)
1 | máscara de protección (1) | máscara facial (87) | máscaras antigás con filtro (10)
2 | máscara de snorkel (34) | máscaras quirúrgicas (152) | máscaras antigás (21)
3 | máscara antigás (47) | máscaras (170) | máscara obligatoria en comercios (23)
4 | máscara (80) | máscara (212) | máscara obligatoria (37)
5 | máscara de tristeza (489) | máscara marrón (366) | diseños de máscaras (86)

France
# | Period 1 (March 2020) | Period 2 (May 2020) | Period 3 (July 2020)
1 | masques (3) | masques (1) | port du masque (21)
2 | masque à abidjan (21) | masque de protection (161) | masque (32)
A new ranking measure
TF-IDF:
- Popular Information Retrieval measure
- Higher score for terms that are frequent in one document and rare in the others
- For Twitter: 1 document = 1 tweet
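As a minimal sketch of the TF-IDF idea with one tweet treated as one document: the weighting below (relative term frequency times log(N / df)) is one common variant, assumed here for illustration, and the sample tweets are invented.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: score} dict per document (here, one per tweet)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of tweets that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

# Invented sample tweets (assumption for this sketch only).
tweets = [
    "mask mandatory in shops",
    "mask shortage in hospitals",
    "lockdown extended in the city",
]
scores = tfidf(tweets)
for term, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")
```

A term present in every tweet ("in") scores zero, while a term found in a single tweet ("mandatory") scores higher than one shared by two tweets ("mask").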
H-TFIDF:
- Exploration of different spatio-temporal criteria in order to extract
discriminative terms
Experiments conducted on 270,000 tweets
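One possible way to bring spatio-temporal criteria into the ranking is to merge all tweets from the same location and period into a single document before scoring. The sketch below illustrates that reading only; it is not the published H-TFIDF definition, and the function name `grouped_tfidf`, the grouping key, and the sample tweets are assumptions.

```python
import math
from collections import Counter, defaultdict

def grouped_tfidf(tweets):
    """tweets: iterable of (location, period, text) tuples.

    Returns {(location, period): [(term, score), ...]}, best terms first.
    """
    # Merge every tweet of a (location, period) pair into one bag of words.
    groups = defaultdict(Counter)
    for location, period, text in tweets:
        groups[(location, period)].update(text.lower().split())
    n_groups = len(groups)
    # Document frequency counted over groups, not over single tweets.
    df = Counter(term for counts in groups.values() for term in counts)
    ranked = {}
    for key, counts in groups.items():
        total = sum(counts.values())
        scored = {
            term: (count / total) * math.log(n_groups / df[term])
            for term, count in counts.items()
        }
        ranked[key] = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked

# Invented sample data (assumption for this sketch only).
sample = [
    ("UK", "2020-03", "face mask shortage"),
    ("UK", "2020-03", "face mask advice"),
    ("UK", "2020-07", "mandatory mask in shops"),
    ("France", "2020-07", "port du masque obligatoire"),
]
top_terms = grouped_tfidf(sample)
for key, terms in top_terms.items():
    print(key, [term for term, _ in terms[:3]])
```

With this grouping, a term scores well when it is frequent in one location and period but absent from the others, which is the kind of discriminative term the slide refers to.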
Conclusion and challenges
Genericity
Social media
Quality criteria
Informal sources: press, social media, blogs
Data: not verified, not validated
Unstructured data
Publications
Roche M. COVID-19 and Media Datasets: Period- and location-specific textual data mining.
Data in Brief. Elsevier. Volume 33, December 2020
https://doi.org/10.1016/j.dib.2020.106356
Valentin S, Mercier A, Roche M, Lancelot R, Arsevska E. Monitoring online media reports for early detection of unknown diseases: insight from a retrospective study of COVID-19 emergence. Transboundary and Emerging Diseases. 2020
https://onlinelibrary.wiley.com/doi/10.1111/tbed.13738
Decoupes R, Kafando R, Roche M, Teisseire M. H-TFIDF: What makes areas specific over time in the massive flow of tweets related to the covid pandemic? In Proc. of AGILE conference (Association of Geographic Information Laboratories in Europe), to appear 2021
Data
Valentin S, Mercier A, Lancelot R, Roche M, Arsevska E. PADI-web COVID-19 corpus: news