
HAL Id: hal-01577975

https://hal.inria.fr/hal-01577975

Submitted on 28 Aug 2017

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Distributed under a Creative Commons Attribution 4.0 International License

CENDARI Virtual Research Environment & Named Entity Recognition techniques

Patrice Lopez, Alexander Meyer, Laurent Romary

To cite this version:

Patrice Lopez, Alexander Meyer, Laurent Romary. CENDARI Virtual Research Environment & Named Entity Recognition techniques. Grenzen überschreiten – Digitale Geisteswissenschaft heute und morgen, Feb 2014, Berlin, Germany. ⟨hal-01577975⟩


Inria (Institut national de recherche en informatique et en automatique) & HU Berlin, forename.surname@inria.fr

with support from the Inria Aviz team (Jean-Daniel Fekete et al.)

The CENDARI project

Overview

CENDARI (Collaborative European Digital Archive Infrastructure) is a research infrastructure project aimed at integrating digital archives and resources for research on medieval and modern European history.

The project brings together information and computer scientists with historians and existing historical research infrastructures (archives, libraries, other digital projects) to improve conditions for digital historical scholarship. CENDARI has engaged in extensive networking with the archives and libraries of Europe, especially those in Eastern Europe.

CENDARI is a 4-year, European-Commission-funded project led by Trinity College Dublin, in partnership with 14 institutions across 8 countries.

Case study areas

• Medieval culture

• World War I

CENDARI has carried out multiple participatory design workshops in order for historians to articulate their needs and wishes regarding a digital research environment, and for the computer scientists to understand those needs. Two major outcomes of those workshops are discussed on this poster:

• VRE: Virtual Research Environment

• NER: Named Entity Recognition & Resolution techniques (called from within the VRE)


Historians’ requirements and wishes

• two data spaces: personal and project-wide

• collecting notes (taken e.g. in archives)

• uploading files (e.g. scans of documents)

• recognition of Named Entities in notes

• visualization of found entities & documents, to foster exploration and analysis

• collaboration and sharing of notes & documents (if desired)

• enrichment of a common repository of historical information

VRE prototype

[Figure: Cendari VRE interface mockup (slide dated Dec. 12, 2013). Left: "My Library" of Jakub Beneš, with "All My Projects" listing Green Cadres (Entities [18], Notes [1], Images [3], Books [82]) and the Cendari Database. Middle: editor showing the note Cadres_Notes_KA, whose text is reproduced below. Right: entity visualizations ("Cadres Visualizations"), all still empty: Artifacts [0], Dates [0] (timeline), Locations [0], References [0], Persons [0], Events [0] (no events).]

Kriegsarchiv (KA), Vienna
Kriegsministerium (KM) 1918/19, Abteilung (Abt.) 5 [Abt. 5: military intelligence]

64-41/8-72
Excerpt from Die Reichspost [date unclear] about several railway thefts undertaken by a gang calling itself “die grüne Brigade” in Galicia
Mehrere Eisenbahndiebe bei einem Kampf getötet. Aus Krakau, 28.d wird Gemeldet:[....]
daß ein Dieb bei einem Einbrüche Verwendung Einer Dynamitpatrone beide Füße verlor.
Wie der "Gniec Krakowski" weiters meldet,wurde […...]"

64-41/8-76
16.10.1918 report of a shooting attack by bandits on an Ueberwachungs Patrouille in Lemberg.

64-41/8-92
17.4.1918 report of a theft by soldiers of flour sacks from the train station in Ung. Hradisch

64-42
10.1.1918 report from MilKmmdo in Zagreb on growing Desertion und Rauberunwesen in the area on the example of a certain Gličanov Gliša [scan]
2.7.1818 “meldet das Gemeindeamt Voganj: Seit einigen Tagen sind erpressungen in der Gemeinde Voganj an […..]
wurde bei Tage von zwei Des. Avram Maleti
bedroht,‘dass man ihm die Schweine holen wird’Zu kamen sie abends und verlangten Br[...] schaffen werden.
Zum Dorfnotär E. Wolf kamen zwei Des. [….]

VRE mockups by Jean-Daniel Fekete

3 major components:

• LEFT: Storage (user data space / CENDARI shared data space)

• MIDDLE: Note-taking environment (creation, modification & use)

• RIGHT: Entity visualization (overview, exploration & use)

✓ Simplicity: minimalistic design for ease of learning

✓ Gentle learning: typing in an editor is the only pre-required user knowledge

✓ Unification: all CENDARI services in one platform

Dealing with entities

Historians have named the recognition and resolution of Named Entities as one of the most important features of the CENDARI VRE, and one that would distinguish it from other software currently used in the field.

Named Entities are persons, organizations, places, dates, events and artifacts.

The VRE will allow for

• manual tagging of entities

• automatic tagging of entities (with possible manual correction)

[Figure: second Cendari VRE interface mockup (slide dated Dec. 12, 2013). The editor shows the same note, Cadres_Notes_KA, with its Named Entities highlighted, and the visualization panels beside it are now populated:]

• Artifacts [1]: Dynamitpatrone

• Dates [4], on a timeline for the year 1918 (January, April, July, October): 10.1.1918, 17.4.1918, 2.7.1918, 16.10.1918

• Locations [5]: Galicia, Krakau, Lemberg, Vienna, Ung. Hradisch

• References [5]: Kriegsministerium (KM), 64-41/8-72, 64-41/8-76, 64-41/8-92, 64-42

• Persons [3]: Gličanov Gliša, Avram Maleti, E. Wolf

• Events [0]: no events

Named Entity highlighting in the notes & visualizations directly beside!
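As a rough illustration of how tagged entities could feed the per-type panels of the mockup (counts such as Dates [4] or Locations [5]), the sketch below groups tags by entity type and counts distinct mentions. It is a minimal sketch, not the CENDARI implementation: the `Tag` class, its `manual` flag and the functions `group_by_kind` and `panel_labels` are hypothetical names introduced here, and the type names follow the list of Named Entities given on the poster.

```python
from dataclasses import dataclass

# The six entity types named on the poster.
ENTITY_KINDS = ("person", "organization", "place", "date", "event", "artifact")


@dataclass(frozen=True)
class Tag:
    """Hypothetical record for one tagged mention in a note."""
    text: str     # surface form as it appears in the note
    kind: str     # one of ENTITY_KINDS
    manual: bool  # True if tagged by hand, False if tagged automatically


def group_by_kind(tags):
    """Group distinct entity texts by kind, keeping first-seen order."""
    groups = {kind: [] for kind in ENTITY_KINDS}
    for tag in tags:
        if tag.kind not in groups:
            raise ValueError(f"unknown entity kind: {tag.kind!r}")
        if tag.text not in groups[tag.kind]:
            groups[tag.kind].append(tag.text)
    return groups


def panel_labels(groups):
    """Render one label per panel in the style of the mockup, e.g. 'Places [1]'."""
    return [f"{kind.capitalize()}s [{len(texts)}]" for kind, texts in groups.items()]


# A few tags taken from the sample note; which ones are manual is made up.
tags = [
    Tag("Lemberg", "place", manual=False),
    Tag("16.10.1918", "date", manual=False),
    Tag("E. Wolf", "person", manual=True),
    Tag("Lemberg", "place", manual=False),  # repeated mention, counted once
    Tag("Dynamitpatrone", "artifact", manual=True),
]
groups = group_by_kind(tags)
labels = panel_labels(groups)
print(labels)
```

Keeping the `manual` flag on each tag is one simple way to let automatic tags be corrected later without overwriting what a historian entered by hand.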

Named Entity Recognition

Problems in the historical domain

Statistical approaches are state-of-the-art in NER. They are accurate, provide high coverage and are portable to new domains. However, customizing such algorithms for the historical domain raises several specific challenges:

• Lack of training data and reference/evaluation corpora

• Lack of knowledge resources (gazetteers, terminological databases). Existing gazetteers and terminological databases are only partially helpful for the historical researcher; they are more relevant to contemporary history.

• Multilinguality of sources and heterogeneous writing systems in use (for the World War I domain, especially Eastern European languages)

• Digitization at an early stage: the documents to be processed are poorly integrated and normalized.

The gap between unstructured and semi-structured data on the one hand and semantic representation on the other is thus significantly larger than for fields such as biotechnology or chemistry, where NER is currently most advanced.

Solutions

For contemporary English, we use statistical NER based on Conditional Random Fields (CRF), allowing for very accurate and fine-grained resolution (e.g. not only choosing entity type “person”, but “military person”).

For other languages, we are currently developing an original approach based on intensive exploitation of existing generalist knowledge bases:

1. A huge lexicon is assembled from Freebase and Wikipedia, containing entity names and translations into various languages.

2. Text in a target language is searched for entities from the lexicon. (The lexicon is large enough that some will always be found.)

3. For unambiguous matches, machine learning is applied using language-independent features to find entities not in the lexicon.

4. Resolution of entities against the lexicon is done using measures of semantic relatedness.

Proof-of-concept demo for steps 1 and 2: http://dev1.cendari.saclay.inria.fr/bulgarian/
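Steps 1, 2 and 4 of the lexicon-based approach can be illustrated with a toy example. This is only a sketch under stated assumptions, not the project's code: the tiny `LEXICON` and `RELATED` tables, the entity identifiers and the function names are all hypothetical, and the overlap count used in `resolve` is a simple stand-in for the measures of semantic relatedness mentioned in the poster, which are not specified there. Step 3 (learning to find entities outside the lexicon) is not covered.

```python
# Hypothetical toy lexicon: lowercased surface form -> candidate entity ids.
# A real lexicon would be assembled from a knowledge base (step 1).
LEXICON = {
    "lemberg": {"place:Lviv"},
    "krakau": {"place:Kraków"},
    "ung. hradisch": {"place:Uherské_Hradiště"},
    "galicia": {"place:Galicia_(Eastern_Europe)", "place:Galicia_(Spain)"},
}

# Hypothetical relatedness data: entity id -> ids it is linked to.
RELATED = {
    "place:Galicia_(Eastern_Europe)": {"place:Lviv", "place:Kraków"},
    "place:Galicia_(Spain)": {"place:Santiago_de_Compostela"},
}

PUNCT = ".,;:!?()\"'"
MAX_WORDS = max(len(name.split()) for name in LEXICON)


def find_entities(text):
    """Step 2: longest-match lookup of lexicon names in whitespace tokens."""
    tokens = text.split()
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            key = " ".join(tokens[i:i + n]).lower().strip(PUNCT)
            if key in LEXICON:
                found.append((key, sorted(LEXICON[key])))
                i += n
                break
        else:
            i += 1  # no lexicon name starts at this token
    return found


def resolve(found):
    """Step 4 (simplified): pick the candidate most related to the context.

    The context is the set of entities that matched unambiguously.
    """
    context = {cands[0] for _, cands in found if len(cands) == 1}
    resolved = {}
    for name, cands in found:
        resolved[name] = max(
            cands, key=lambda c: len(RELATED.get(c, set()) & context)
        )
    return resolved


found = find_entities("Attack in Lemberg. Theft in Ung. Hradisch, unrest in Galicia")
resolved = resolve(found)
print(resolved)
```

In this toy run, "Galicia" has two candidates in the lexicon, and the unambiguous match for Lemberg tips the choice towards the Eastern European region.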

Poster made with LaTeX and fancytikzposter by Elena Botoeva: http://www.inf.unibz.it/~ebotoeva/fancytikzposter.html
