• Aucun résultat trouvé

Pipelines for semantic annotation and FAIR data production

N/A
N/A
Protected

Academic year: 2021

Partager "Pipelines for semantic annotation and FAIR data production"

Copied!
2
0
0

Texte intégral

(1)

HAL Id: hal-03234314

https://hal.inrae.fr/hal-03234314

Submitted on 25 May 2021

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Distributed under a Creative Commons Attribution| 4.0 International License

Pipelines for semantic annotation and FAIR data production

Christian Pichot, Damien Maurice, Philippe Clastre, Benjamin Jaillet, Rachid Yahiaoui

To cite this version:

Christian Pichot, Damien Maurice, Philippe Clastre, Benjamin Jaillet, Rachid Yahiaoui. Pipelines for semantic annotation and FAIR data production. RDA 17th Plenary Meeting, Apr 2021, Edinburgh, United Kingdom. �hal-03234314�

(2)

Dataverse repository

²

Pipelines for semantic annotation

and FAIR data production

PICHOT C. (1) , MAURICE D. (2), YAHIAOUI R. (3), CLASTRE P. (1) , JAILLET B. (1)

The AnaEE (Analysis and Experimentation on Ecosystems) Research Infrastructure offers experimental facilities for studying ecosystems and biodiversity. The data generated by the AnaEE platforms are most often managed in relational database.

A distributed Information System (IS) is developed, based on semantic interoperability of its components and using common vocabularies (AnaeeThes thesaurus and OBOE-based ontology). Discovery and access portals are fed by information (rdf triples) produced by the semantic annotation of the resources that also generate metadata and datasets.

Two pipelines are developed for facilitating the semantic annotation and exploitation processes which may represent a huge conceptual and practical work.

Context and objective

Relational databases

Metadata & data for discovery portals

and repositories

Semantic data generation

from a pipeline for database annotation

2. INRAE UMR SILVA route d'Amance 54280 Champenoux . 1. INRAE URFM 228 route de l’Aérodrome 84914 Avignon

Experimental, analytical and modelling platforms

Graph database

Ontology

(OBOE-based for AnaEE)

RDB raw data mapping conf file (.odba) Step1_pipeline processing

End Point inferred triplesraw data with raw data

ofEntity hasContext hasMeasurement hasValue usesStandard ofCharacteristic Observation Temporal information

hasContext hasContext hasContext

Semantic data graph (per category) Variable semantic description

Step2_pipeline processing Step3_pipeline processing Step4_pipeline processing

Semantic data exploitation

by a pipeline for metadata and datasets generation

AnaEE

standard Category Context Entity Characteristic Unit

Phytoplankton Biodiversity Water Phytoplankton Volume Per Volume

MicroMeterCubed Per Millimeter

WaterPH Physical

Chemistry Water Column Water pH pHUnit

... ... ... ... ... ...

+

+

← A csv input file describes the main characteristics of the variables submitted for semantic annotation, allowing to parametrize generic semantic graphs (Yed* format).

The annotation pipeline** can be used in different contexts of ontologies and databases.

It includes shell scripts and specific java developments and requires the Yed, Ontop, Corese and Blazegraph softwares. It is deployed as a classical scripted application or as a dockerised one.

The annotation pipeline consists in 4 steps:

1. generation of the mapping files to be used by Ontop*** (database connexion parameters, sets of sql queries and of the corresponding triples)

2. production of the initial triples (ttl files) by a specific program and Ontop tool.

3. production of inferred triples by Corese**** from the ontology rules

4. uploading of the ttl files within the Blazegraph***** software and initialisation of the end point

Current developments will allow data annotation from several relational database software (PostgreSQL, MySQL, …) and csv flat files.

*https://www.yworks.com/products/yed **https://forgemia.inra.fr/anaee-dev/coby

***Diego Calvanese, Benjamin Cogrel, Sarah Komla-Ebri, Roman Kontchakov, Davide Lanti, Martin Rezk, Mariano Rodriguez-Muro, and Guohui Xiao. Ontop: Answering SPARQL Queries over Relational Databases. In: Semantic Web Journal 8.3 (2017), pp. 471–487.

****http://wimmics.inria.fr/corese

*****https://wiki.blazegraph.com/wiki/index.php/Main_Page

Outputs of the annotation pipeline are used by an exploitation pipeline* for:

1. generation (through a SPARQL query on raw data) of synthetic data that feeds the discovery portal,

2. generation of standardised GeoDCAT and ISO19115/19139 metadata records,

3. generation of data file (NetCDF as first format) from selected perimeters (e.g years, experimental sites , variable categories..). In that case, the annotation pipeline is

launched using a dedicated webservice.

Metadata and data products are transferred to a Dataverse repository * https://forgemia.inra.fr/anaee-dev/semdata

End Point delimitationperimeter OBOE

metadata

conversion GeoDCAT

metadata

API

(XSLT) 19139

3. INRAE US INFOSOL 2163 avenue de la Pomme de Pin 45075 Orléans

WaterColumn Water pHUnit pH {value} Measurement Observation ofEntity Observation Spatial information Observation Site information Specific component Discovery Portal Graph database End Point GeoDCAT metadata 19115/19139 Spatial information Observation

Références

Documents relatifs

The Schema Editor also combines the extension of ontologies (the Sensor Type Editor) with using them to annotate instance data (the Sensor Instance Editor). This is one of the

Note that the standard data in the portal refers to a set of datasets by using the public data open standard guidelines of the government that defines an item name (data field) and

In this paper, we have proposed two semantic annotation interfaces to enable authors to link their content with DBpedia entities as convenient, fast, and accu- rate as possible

To apply the annotation path method to web service discovery we implemented a logics based service matcher [1] that operates only on path-based annotations of the inputs and outputs

Since our formalization allows for reusing existing lin- guistic resources, we assume that texts have been preprocessed beforehand by linguistic annotation tools, such as named

The computer application part of the project has started in September 2008, a prototype has been held with MediaWiki and Semantic MediaWiki through May to July 2009. After

We have grouped the results according to the number of unique tags assigned to the documents (Figure 1), the number of times a user has tagged the document (Figure 2(a)), and the

Thus, in this work we will show how social metadata can be included in service descriptions, let them be SOA services or RESTful services, in order to