• Aucun résultat trouvé

Semantic Web Technologies for a Knowledge Base of Biomedical Facts Extracted from Scientific Literature

N/A
N/A
Protected

Academic year: 2022

Partager "Semantic Web Technologies for a Knowledge Base of Biomedical Facts Extracted from Scientific Literature"

Copied!
2
0
0

Texte intégral

(1)

Semantic Web technologies for a knowledge base of biomedical facts extracted from scientific

literature

Maria Biryukov1, Valentin Groues1, Christophe Trefois1, Venkata Satagopam1, and Reinhard Schneider1

Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-Belval, Luxembourg

1 Objective

Biomedical literature, including scientific articles, public health reports and books become more and more available due to massive digitalization. Explo- ration and analysis of this rich source of data requires assistance of automatic tools capable of dealing with large volumes of text. We are developing a pipeline for processing publicly available biomedical text, abstracts, full text articles, conference proceedings, eventually books and electronic health records, starting from searching the web and downloading raw files, to extraction and storing concepts (entities) and semantic relations between them into a knowledge base.

The goal is to create a biomedical knowledge base publicly available for both human and machine access (SPARQL endpoint and REST API).

2 Events Extraction Pipeline

For the extraction of biomedical entities from the literature, we rely first on Re- flect (httpd://reflect.ws) - a named entity recognition engine to identify biomed- ical concepts in the text. GeniaTagger is used to obtain basic morphological and syntactic information. The latter is completed by application of the Stan- ford Syntactic Parser which converts sentences into syntactic trees representing dependencies between the words. Combined with a set of rules and dedicated patterns, this information allows for getting semantic interpretation of sentences and extraction of meaningful relationships between the concepts.

3 Semantic Web technologies

An ontology has been created to represent the biomedical events extracted from the literature. Examples of extracted biomedical events include, among others, proteins interactions, chemicals effects and genes expression. Named graphs are used to add metadata about those events (e.g. scoring). The events are stored in a triple store (OpenLink Virtuoso). A hierarchy of biomedical relationships has been defined and the reasoning capabilities of Virtuoso are used to dynamically

(2)

use this hierarchy at query time. From 1 million publications already processed, 11 millions events were extracted resulting in about 150 millions of triples. A first prototype of a web-based visualization tool has been created to browse this knowledge base.

Références

Documents relatifs

Example of a coreference relation involving a protein entity, and results of coreference resolution performed by both the TEES and the Stanford

The some- what unusual multiple-choice nature of this question answering task meant that we were able to attempt interesting transformations of questions by inserting answer graphs

Two runs use Term Frequency-Inverted Document Frequency (TF-IDF) of the query words to weight the sentences.. The two unique runs differ in the way that when multiple answers get

Since MeSH indexing can be viewed as a categorization task, we use machine learn- ing in the post-processing stage in an effort to improve both Recall and Precision on the

Structured information is best represented in a knowledge base (the semantics is then explicit in the data) but the bulk of our knowledge of the world will always be in the form

Akin to other semantically powered federated query engines like DARQ [32] and ANAPSID [33,] introduced to facilitate transparent query to distributed linked data resources,

In this paper we propose a pipeline to improve the extraction and normalization of biomedical ontology terms in EHR by crowd- sourcing the validation of the results obtained

Our pipeline output consists to 37% of non-horizontal words while ex- tracting 41% more words on average than actually present in the gold standard.. On the other hand, the