Open Datasets for Evaluating the Interpretation of Bibliographic Records


HAL Id: hal-01302830

https://hal.archives-ouvertes.fr/hal-01302830v2

Submitted on 18 Oct 2016

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Joffrey Decourselle, Fabien Duchateau, Trond Aalberg, Naimdjon Takhirov, Nicolas Lumineau

To cite this version:

Joffrey Decourselle, Fabien Duchateau, Trond Aalberg, Naimdjon Takhirov, Nicolas Lumineau. Open Datasets for Evaluating the Interpretation of Bibliographic Records. Joint Conference on Digital Libraries, Jun 2016, Newark, United States. pp. 253-254, doi:10.1145/2910896.2925457. hal-01302830v2

[Figure: 4 – Extract of a unit test from T42]

1 – Background

FRBRization is a metadata migration process that extracts FRBR entities from MARC records.

• Crucial for the adoption of Semantic Web technologies in libraries

• Many tools have been proposed over the last decades to perform the migration

• No benchmark exists to compare and evaluate these tools

We provide two open datasets dedicated to the evaluation of FRBRization tools. They cover different specificities of MARC catalogs, such as cataloguing practices, inconsistencies, and bibliographic patterns.
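To make the idea of FRBRization concrete, here is a deliberately naive sketch. It assumes simplified dictionary records rather than real MARC, and it assumes that a Work can be keyed by author and uniform title and an Expression by language. These field names and grouping rules are illustrative assumptions, not the method of any tool evaluated with the datasets.

```python
from collections import defaultdict

# Toy stand-ins for MARC records (hypothetical fields, not real MARC tags).
records = [
    {"id": "r1", "author": "Shakespeare, William", "uniform_title": "Hamlet", "language": "eng"},
    {"id": "r2", "author": "Shakespeare, William", "uniform_title": "Hamlet", "language": "fre"},
    {"id": "r3", "author": "Shakespeare, William", "uniform_title": "Hamlet", "language": "eng"},
]


def frbrize(records):
    """Group records into a Work -> Expression -> Manifestation tree."""
    works = defaultdict(lambda: defaultdict(list))
    for rec in records:
        work_key = (rec["author"], rec["uniform_title"])  # assumed Work identity
        expression_key = rec["language"]                  # assumed Expression identity
        works[work_key][expression_key].append(rec["id"])  # one Manifestation per record
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {work: dict(expressions) for work, expressions in works.items()}


tree = frbrize(records)
```

In this toy case the three records collapse into one Work with two Expressions (English and French). Real catalogs rarely behave this cleanly, which is why the datasets target inconsistencies and more complex patterns.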

Joffrey Decourselle (1), Fabien Duchateau (1), Trond Aalberg (2), Naimdjon Takhirov (3) and Nicolas Lumineau (1)

(1) LIRIS, UMR5205, Université Lyon 1, Lyon, France. firstname.lastname@liris.cnrs.fr

(2) NTNU, Trondheim, Norway. trondaal@idi.ntnu.no

(3) Westerdals - Oslo School of Arts, Communication and Technology, Faculty of Technology, Oslo, Norway. taknai@westerdals.no

2 – Specificities of MARC records

Cataloguing practices and inconsistencies:

• Missing information (missing publication information or authoritative data, leading to misunderstandings)

• Linkage errors (errors in title or responsibility identifiers, leading to dead links between records)

• Cataloguing practices and norms (specific form of data in the record, e.g., ISBD punctuation)

Bibliographic patterns:

• Core pattern (basic bibliographic cases)

• Augmentation pattern (any addition of a Work)

• Derivation pattern (intellectual modification)

• Aggregation pattern (whole-part relationships)

• Complementary pattern (other related works)

[Figure: Example of derivation patterns in FRBR (adaptation and translations)]

3 – Open Datasets

Both datasets include MARC files and a FRBR gold standard.

Features                    T42           BIB-RCAT
Number of unit tests        42            -
Number of collections       126           3
Number of languages         3             1
Number of media types       8             4
Average MARC records        10 / test     560
Average fields / record     18            17
Average FRBR entities       73 / test     1922
Average FRBR properties     241 / test    9517

http://bib-r.github.io/

T42 allows the evaluation of a migration tool in terms of bibliographic patterns and cataloguing issues.

BIB-RCAT offers a larger collection for evaluating the interpretation of MARC records in a real-world context.
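Because each dataset ships a FRBR gold standard alongside the MARC files, a tool's output can be scored against it. The sketch below is one possible way to do so, assuming entities can be reduced to comparable keys and that precision, recall and F1 are the measures of interest. The poster does not prescribe these metrics, and the entity keys shown are hypothetical.

```python
from typing import Dict, Hashable, Set


def score_entities(predicted: Set[Hashable], gold: Set[Hashable]) -> Dict[str, float]:
    """Compare a tool's extracted entities with a gold standard (set overlap)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical entity keys: (entity type, normalized identifier).
gold = {
    ("Work", "hamlet"),
    ("Expression", "hamlet|eng"),
    ("Expression", "hamlet|fre"),
    ("Manifestation", "isbn:0000000001"),
}
predicted = {
    ("Work", "hamlet"),
    ("Expression", "hamlet|eng"),
    ("Manifestation", "isbn:0000000001"),
    ("Manifestation", "isbn:0000000002"),  # spurious entity not in the gold standard
}

scores = score_entities(predicted, gold)
```

Here the tool misses the French Expression and produces one spurious Manifestation, so precision and recall are both 3 out of 4. Exact-match set comparison is a simplification: a real evaluation would also need to compare relationships and properties.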
