• Aucun résultat trouvé

Semantic Knowledge Graph Embeddings for biomedical Research: Data Integration using Linked Open Data

N/A
N/A
Protected

Academic year: 2022

Partager "Semantic Knowledge Graph Embeddings for biomedical Research: Data Integration using Linked Open Data"

Copied!
5
0
0

Texte intégral

(1)

Semantic Knowledge Graph Embeddings for biomedical Research: Data Integration using

Linked Open Data

Jens D¨orpinghaus1,2, Marc Jacobs1

1 Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany

2 jens.doerpinghaus@scai.fraunhofer.de

Abstract. Knowledge Graphs are becoming a key instrument for biomed- ical knowledge discovery and modeling. These approaches rely on struc- tured data, e.g. about related proteins or genes, and form cause-and- effect networks or – if enriched with literature data and other linked data sources – knowledge graphs. A key aspect of analysis on these graphs is the missing context. Here we present a novel semantic approach towards a context enriched Knowledge Graph for biomedical research utilizing data integration with linked data. The result is a general graph concept that can be used for graph embeddings in different contexts or layers.

1 Introduction

Copyright©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Biological and medical researchers considering computational approaches rely on structured data, e.g. about related proteins or genes, see [9]. Cause-and-effect networks are a special subtype of more general Knowledge Graphs. In principle, the integration of external data sources and manual curated data is key. Although several commercial solutions exist, Fakhry et al. state, that the ”adoption and extension of such methods in the academic community has been hampered by the lack of freely available, efficient algorithms and an accompanying demonstration of their applicability using current public networks” [4].

This and the emerging improvements on large-scale Knowledge Graphs and machine learning approaches are the motivation for our novel approach on se- mantic Knowledge Graph embeddings for biomedical research utilizing data in- tegration with linked open data. Several similar approaches (often in the context of drug-repurposing) have been described like Bio2RDF [2], hetionet [6], or Open PHACTS [5]. Our approach is more focussed on integrating the literature itself in a FAIR [10] and open knowledge graph which is also accessible from public a public resource: SCAIView3. SCAIView is an information retrieval system that allows semantic searches in large textual collections by ontological representa- tions of automatic recognized biological entities [7].

3 https://www.scaiview.com/

(2)

Knowledge Graphs: Context Embeddings for Knowledge Discovery and Data Mining on biomedical data

Information Highway Macro-Context

Knowledge Graph Layer n e.g. Document Layer

Layer n-1 e.g. Mechanism Layer (BEL)

Knowledge Foundation Layer 0

Micro-Context Layer 1 e.g. Molecular Layer

Data Integration Adding more data will increase the Knowledge-Foundation and gives a more precise view on the micro-context and helps to unveil new context and insights.

Data Layers Data within the Knowledge Graph can be ordered according to context and information to data layers (e.g. a molecular or mechanism layer). This helps to examine novel causal connections and context.

Knowledge Graph The Knowledge Graph is a key concept connecting unstructured knowledge entities into a network structure. It allows easy data integration and data mining utilizing graph- theoretical algorithms and technologies.

Information Highway To allow an easy and FAIR access to the data and benefit from semantic graph-queries we provide users with an SPARQL and Cypher frontend.

Data Uplifting (Data Mining)

Data Integration

Fig. 1.Illustration of the knowledge graph embedding between different layers. Here, every layer corresponds to a context defining new contexts on several other layers.

Thus layers and contexts are flexible and can be defined in a feasible way for every application.

The basis for generating our large-scale Knowledge Graph representation is the biomedical literature, e.g. MedLine and PubMed4. These articles or abstracts are the source for biological relations mentioned above. In addition, meta infor- mation like authors, journals, keywords (so called MeSH-Terms, Medical Subject Headings), etc. are freely available. Ontologies can be used to contextualize en- tities in the Knowledge Graph providing biological or medical relations (cf.5).

Every ontology will form another knowledge (sub-)graph.

Using methods of natural language processing (NLP) and text mining, we can combine and link these knowledge graphs to a giant and very dense new knowledge graph. This will meet a very general definition ofcontext. We can see every knowledge (sub-)graph as context to another. Biological expressions are context of the corresponding literature, authors are context of a text, named entities from ontologies found in a text are context to it or to the corresponding biological expressions.

Our overarching integration schema is based on the Biological Expression Language6is widely applied in biomedical domain to convert unstructured tex- tual knowledge into a computable form. The BEL statements that form knowl- edge graphs are semantic triples that consist of concepts, functions and relation- ships. Thus they can be easily added to a knowledge graph representing another layer or context. An example for a large Alzheimer network can be found in [8].

In the next section we describe the novel concept of semantic graph embed- dings within large-scale Knowledge Graphs. We will present several use-cases

4 See https://www.ncbi.nlm.nih.gov/pubmed/.

5 OLS, https://www.ebi.ac.uk/ols/index

6 BEL, www.openbel.org

(3)

and application examples as well as the semantic interoperability layer using RDF and SPARQL.

2 Knowledge Graph architecture

A Knowledge Graph is a systematic way to connect information and data to represent common knowledge. As described above, the context is the most im- portant topic to generate knowledge or even wisdom.

We define knowledge graphsG= (E, R) with entitiese∈E coming from a formal structure like an ontologyO, see [1] and [11]. The relationsr∈Rcan be ontology relations, thus in general we can say every ontologyOwhich is part of the data model is a subgraph of G which means O ⊆ G. In addition we allow inter-ontology relations between two nodes e1, e2 with e1 ∈ O1, e2 ∈ O2 and O1 6=O2. More general we define R ={R1, ..., Rn} as list of either ontologies, terminologies or any sort of controlled vocabulary containing relations or not.

We define contexts C = {c1, ..., cm} as a finite, discrete set. Every node v∈Gand every edger∈Rmay have one ore more contexts c∈C denoted by con(v) or con(r). It is also possible to setcon(v) =∅. Thus we have a mapping con:E∪R→ P(C). If we use a quite general approach towards context, we may set C =E. Thus every inter-ontology relation defines context of two entities, but also the relations within an ontology can be seen as context, see figure 1 for an illustration. Here every context is identified as a layer (e.g. a document layer, a molecular layer, a mechanism layer, ...). This allows new connections between different contexts or layers: If two edgese1, e2∈R1 are connected and e01, e02 ∈R2 withcon(e1) =e01 andcon(e2) =e02 are not connected, we may add another edge (e1, e2) with provenance information that this connection comes from a different context, namelyR2. Since every layer or context can be seen as a subgraph forming a surface we can denote the relation between two layers a knowledge graph embedding.

It is also possible to get the context of a subgraph Ri ⊆ G which can be denominated by con(Ri) or with the notation of graph theory as the extended induced subgraph by the vertex set Ei from Ri given by Gc[Ei]. This is quite trivial if context fromRi can only be annotated to vertices inG. Then

Gc[Ei] =G[Ei]∪ {(e, e0)∀e0∈N(e), e∈Ei}

Herecon|Ei =Gc[Ei] is the context ofEirestricted to the set of edges (relations) in the graph. The two edgese0, e00are implicitly given by this context. It is quite easy to see that the restriction on context annotated to edges makes the problem more easy from a computational perspective. Nevertheless, context on edges is needed from a real-world perspective.

The technical design was done with respect to the microservice architecture of SCAIView [3]. We offer both a REST API as well as a Java Message Service (JMS) interface. As a database backend, we used Neo4j. Here, we used Spring Data Neo4j to map objects to graphs. Thus our software can be used to perform Cypher and SPARQL queries. Data can be retrieved in JSON Graph Format or RDF format.

(4)

BELRelation

BELR elation

hasE vidence hasF

unc tion

hasFunction

hasBel hasBel

hasST ATEME NT_GROUP hasSu

bgraph hasC

ondition hasSu

bgraph hasF

unc tion

hasSu bgraph

hasC ondition hasST ATEME

NT_GROUP hasF

unc tion

hasSu bgraph

hasSu bgraph

hasST ATEME

NT_GROUP

hasC ondition hasF

unc tion

hasSu

bgraphhasST

ATEME NT_GROUP

hasC ondition

hasFunction

hasSu bgraph

hasC ondition hasST

ATEME NT_GROUP hasFunction

hasSu bgraph

hasSu bgraph

hasST ATEME

NT_GROUP

hasC ondition hasFunction

hasSu bgraphhasST

ATEME NT_GROUP

hasC ondition

hasBel

hasBel

hasE

vidence BELR

elation

BELR elation BELRelation

BELR elation

BELR elation

hasF unction hasBel

hasBel

hasBel

hasBel

hasBel hasBel hasBel

hasSu bgraphhasSu

bgraph

hasMeSHAn atomy BELR elation

hasSu bgraph

hasMeSHAn atomy BELRelation

hasF unc

tion hasSu

bgraph

hasBel hasBel hasFunc

tion hasSu

bgraph hasSu

bgraph hasST

ATEMENT_GROUP hasC

ondition

hasF unction

hasSubgraph hasSubgraph hasSTATEMENT_GROUP hasCondition hasF

unction

hasSu bgraph hasSu

bgraph hasST

ATEME NT_GROUP hasC

ondition

hasBel hasBel

hasBel

hasBel hasBel

hasBel hasBel

hasBel hasBel

hasBel

hasSubgraph

hasSu bgraph

hasC ondition hasMeSHAn

hasST ATEME

NT_GROUP

HGNC HGNC

HGNC

p

act Group 1

proteinTau subgr…

Normal Healthy State Amyloi…

HGNC

HGNC HGNC

deg Caspase subgra…

Regulat…

Neurons GOBP:"…

hasBel

hasBel hasBel

hasBel

hasBel hasBel

hasBel

hasBel

hasBel

hasBel hasBel

hasBel hasBel hasBel hasBel hasBelhasBel hasBelhasBel

hasBelhasBel hasBelhasBel hasBel hasBelhasBel hasBel hasBel hasBel hasBel hasBel

hasBel hasBel

hasBel hasBel hasBel

hasBel hasBel

hasBel hasBel

BELR elation

BELR elation

BELR elation

BELR elation

BELR elation BELRelation BELRelationBELRelationBELRelation BELRelation BELRelation BELRelation BELRelation BELRelation BELRelation

BELRelation BELRelation BELRelation BELRelation BELRelation

hasF unction

hasST ATEME NT_GROUP hasSu bgraph hasSu

bgraph hasDisease

hasFunctionhasDisease

hasST AT…

hasF unc tion hasDisease

hasSTATEMENT_GRO…

hasSTATEMENT_GROUP hasSTATEMENT_GROUP hasSTATEMENT_GROUPhasSTATEMENT_GROUPhasSTATEMENT_GROUPhasSTATEMENT_GROUPhasSTATEMENT_GROUPhasSTATEMENT_GROUPhasSTATEMENT_GROUP hasSTATEMENT_GROUP hasSubgraphhasSu

bgraph hasDisease hasDisease hasDisease hasDisease hasDisease hasDisease hasDisease hasDisease hasDisease hasDisease

hasF unc tion hasDisease

hasST ATEME NT_GROUP

hasSu bgraph hasSu

b…

hasSubgr…

hasSu bgraph hasSu

bgraph

BELR elation BELR

elation BELRelation

BELRelation BELRelation

BELR elation

BELRelation BELRelation hasSubgraph

hasFunction hasF

unction hasF

unction hasFunc

tion hasF

unction hasFunc

tion hasF

unction hasF

unction hasF

unction hasFunction

hasSu bgraph hasSu bgraph BELR elation BELR

elation BELR

elation

hasSubgraph hasSubgraph BELRelation

hasBel hasBel

hasBel hasBel

hasBel hasBel hasBel hasBel hasBel hasSubgraphhasSubgraph

hasSubgraph

hasSu bg…

hasBel

hasBel hasBel

hasSubgraph hasSu bgraph

dbSNP:r…

dbSNP:r…

MESHD…

dbSNP:r…

dbSNP:r…

g

Group 1 Neurotr…

Calciu…

Alzhei…

adhesionCell subgr…

cycleCell subgr…

HGNC HGNC

HGNC

HGNC HGNC

HGNC Endoso…

path Glutath…

Respon…

HGNC:…

GOBP:"…

Cholest…

Non-a…

CHEBI:…

TGF-Be…

(b)

hasSubgraph hasSubgraph hasSu

bgraph

hasSu bgraph BELR elation

BELRelation

BELR elation

BELRelation hasEvidence

BELRelation BELRelation

BELRelation BELRelation hasSubgraph

hasSubgraph hasFunction

hasFunction

hasFunction

hasFunction hasFunction

hasFunction hasSubgraph

BELRelation hasSubgraph

hasSubgraph

hasSubgraph

hasSubgraph hasBel hasBel hasBel hasBel hasBel hasBel hasBel hasBel hasBel hasBel hasBel hasBel

hasBel hasBel

hasBel

hasBel

hasBel BELRelation

BELRelation BELRelation BELRelation

BELR elation BELRelation

BELR elation BELRelation

hasSTATEMENT_GROUP hasSTATEMENT_GROUP hasSTATEMENT_GROUP hasSTATEMENT_GROUP

BELRelation BELRelation

hasSubgraph

hasCondition hasCondition hasCondition

hasSubgraph hasDisease hasMeSHAnatomy

hasMeSHAnatomy

BELRelation BELRelation

BELRelation BELRelation

BELR elation

BELRelation

BELRelation

hasFunction hasFunction

hasFunction hasFunction hasF unction

hasFunction hasCondition

hasCondition

hasCondition hasSTATEMENT_GROUP

hasSTATEMENT_GROUP

hasSTATEMENT_GROUP

hasSTATEMENT_GROUP hasSubgraph

hasSubgraph hasSubgraph

hasSubgraph hasSubgraphhasSubgraph hasSubgraph

hasSubgraph

hasSubgraph hasDisease hasMeSHAn

atomy

hasMeSHAn atomy hasF unction hasFunction

hasSubgraph hasSu

bgraph hasMeSHAn atomy

hasSTATEMENT_GROUP hasCondition

hasFunction hasF unction

hasSubgraph hasSu

bgraph

hasMeSHAn atomyhasSTATEMENT_GROUP

hasCondition

hasF unction

hasFunction

hasSu

bgraph hasSu

bgraph hasMeSHAn

atomy

hasSTATEMENT_GROUP hasCondition

hasFunction hasSu

bgraph

hasSu bgraph hasMeSHAn

atomy

hasST ATEME NT_GROUP hasC ondition

hasFunction

hasFunction

hasSu

bgraph hasSu

bgraph hasMeSHAn

atomy

hasSTATEME

NT_GROUPhasC

ondition

hasF unction hasSu

b…

hasSubgr…

hasMeSHAn atomy hasConditionhasSubgraph

hasSu

b…

hasST

ATEME

NT_GROUP hasC ondition

hasBel hasBel hasBel hasBel hasBelhasBel hasBel hasBel hasBelhasBel

hasBel

hasBelhasBel

hasBel hasBel

hasBel hasBel

hasBel hasBel hasBel

hasBel hasBel hasBel

hasBel hasBel

hasBelhasBel

hasBel

hasBel hasBel hasBel

hasBel hasBel

hasBel hasBel hasBel hasBel hasBel hasBel

hasBel hasBel

hasBel hasBel

hasBel hasBel hasBel hasBel hasBel hasBel

hasBel hasBel

hasBel

hasBel hasBel

hasBel

hasBel hasBel hasBel hasBel hasBel

hasBel hasBel

hasBel BELRelation

hasSubgraph

hasFunc tion hasFunction

hasSubgraph hasST

ATEMENT_GROUP hasCondition

BELRelation

hasSu bgraph

hasF unction hasFunction

hasSu bgraph

hasST ATEME NT_GROUP hasCondition

hasSu bgraph

hasFunction hasSu

bgraph

hasST ATEME NT_GROUP hasCondition

hasSu bgraph

hasFunction hasSubgraph

hasSTATEME NT_GROUP hasC

ondition

hasSubgraph

hasFunction hasSubgraph

hasSTATEMENT_GROUP hasCondition hasSubgraph

hasFunction hasSubgraph

hasSTATEMENT_GROUP hasCondition

hasSubgraph

hasFunction hasSubgraph

hasSTATEMENT_GROUP hasConditionhasSubgraph

hasFunction hasSubgraph

hasSTATEMENT_GROUP hasCondition hasCondition hasC ondition hasC ondition hasC ondition hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasSu bgraph

hasSu

bgraph hasSu bgraph

hasSu

bgraph hasSu

bgraph

hasSubgraph

hasSubgraph

hasSubgraph

hasSu bgraph

hasSubgraph

hasSu bgraph

hasDisease hasDisease hasDisease hasDis…

hasDis…

hasDis…

hasDisease hasDisease hasDisease

hasMeSHAn atomy hasBel hasBel

Synucl…

HGNC

HGNC HGNC

HGNC

HGNC HGNC

HGNC HGNC Chemo…

p act

Binding and Up…

GOBP:"…

Interleu…

HGNC HGNC

HGNC

HGNC

HGNC HGNC

HGNC HGNC Group 1 GOBP:"…

Caspase subgra… Amyloi…HealthyNormalState

Alzhei…

Astrocy…

Hippoc…

(a) (c)

Fig. 2. (a) This is an illustration of the context found for the BEL statement act(p(HGNC:KLC1)) => p(HGNC:MAPT) (found on the bottom of the graph). Both HGNC terms have an evidence in two different documents (purple) and both form a relation in another document (PMID:22272245 in the middle). The green nodes form a manually curated context (e.g. ”Normal Healthy State” or ”Tau protein subgraph”).

All HGNC entities are connected to other HGNC elements, documents and function.

(b) This is an illustration of the context of a single document (purple, left). (c) This is an illustration of the context of a context (green, left).

3 Application

The initial research question was how a general context could be added to biomedical knowledge graphs to answer generic questions according to context, e.g. time, location or biological layer. We have integrated subsets of PubMed data, several ontologies like GO, HGNC, MGI and mappings, BEL networks from Parkinson’s and Alzheimer’s disease as well as data obtained from KEGG.

See fig. 2 for some illustrations of different context layers. For example, seman- tic questions can be formulated as subgraph structures of the initial knowledge graphs. We may think of complex examples, e.g. ”Give me all pathways from protein A to B in the context of Disease C focusing on clinical trials”.

Hypothesis generation within medical research and digital health may lead to search for genomic or moleculare patterns, diagnosis or build longitudinal models which build the basis for a multitude of predictive and personalised medicine ML and AI approaches. This information system can be used to retrieve data by context (cohort size, settings, results, ..) and by content (imaging data, genomic or moleculare measures, ...). For example, this system may answer questions like Give me a clinical trial to reproduce my results or to apply my model or Give me literature for phenotype A, disease B age between C and D and a CT-scan with characteristic E.

(5)

4 Conclusion

Here we presented a novel approach that annotates research data with context information. The result is a knowledge graph representation of data, the context graph. It contains computable statement representation (e.g. RDF or BEL). This graph allows to compare research data records from different sources as well as the selection of relevant data sets using graph-theoretical algorithms.

References

1. Guidelines for the construction, format, and management of monolingual controlled vocabularies. Standard, National Information Standards Organization, Baltimore, Maryland, U.S.A. (2005)

2. Belleau, F., Nolin, M.A., Tourigny, N., Rigault, P., Morissette, J.: Bio2rdf: to- wards a mashup to build bioinformatics knowledge systems. Journal of biomedical informatics41(5), 706–716 (2008)

3. Drpinghaus, J., Klein, J., Darms, J., Madan, S., Jacobs, M.: Scaiview – a seman- tic search engine for biomedical research utilizing a microservice architecture. In:

Proceedings of the Posters and Demos Track of the 14th International Conference on Semantic Systems - SEMANTiCS2018 (2018)

4. Fakhry, C.T., Choudhary, P., Gutteridge, A., Sidders, B., Chen, P., Ziemek, D., Zarringhalam, K.: Interpreting transcriptional changes using causal graphs: new methods and their practical utility on public networks. BMC bioinformatics17(1), 318 (2016)

5. Harland, L.: Open phacts: A semantic knowledge infrastructure for public and commercial drug discovery research. In: International Conference on Knowledge Engineering and Knowledge Management. pp. 1–7. Springer (2012)

6. Himmelstein, D.S., Lizee, A., Hessler, C., Brueggeman, L., Chen, S.L., Hadley, D., Green, A., Khankhanian, P., Baranzini, S.E.: Systematic integration of biomedical knowledge prioritizes drugs for repurposing. Elife6, e26726 (2017)

7. Hodapp, S., Madan, S., Fluck, J., Zimmermann, M.: Integration of UIMA Text Mining Components into an Event-based Asynchronous Microservice Architecture.

In: Proceedings of the LREC 2016 Workshop ”Cross-Platform Text Mining and Natural Language Processing Interoperability”. pp. 19–23. European Language Resources Association (ELRA), Portoroˇz, Slovenia (2016)

8. Kodamullil, A.T., Younesi, E., Naz, M., Bagewadi, S., Hofmann-Apitius, M.: Com- putable cause-and-effect models of healthy and alzheimer’s disease states and their mechanistic differential analysis. Alzheimer’s & Dementia11(11), 1329–1339 (2015)

9. Martin, F., Sewer, A., Talikka, M., Xiang, Y., Hoeng, J., Peitsch, M.C.: Quantifi- cation of biological network perturbations for mechanistic insight and diagnostics using two-layer causal models. BMC bioinformatics15(1), 238 (2014)

10. Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.W., da Silva Santos, L.B., Bourne, P.E., et al.:

The fair guiding principles for scientific data management and stewardship. Scien- tific data3(2016)

11. Zeng, M.: Knowledge organization systems (kos)35, 160–182 (01 2008)

Références

Documents relatifs

We trained a Random Forest classifier on the training data (positive and negative statements) plus the verified test statements and made predic- tions for test data.. Keywords:

The basic idea is to represent the available data in the form of a graph, learn embeddings for entities using Knowledge graph embedding methods [12] and incorporate them

Topic modeling techniques has been applied in many scenar- ios in recent years, spanning textual content, as well as many different data sources. The existing researches in this

In this work, we have applied Knowledge Graph Embedding approaches to extract feature vector representation of drugs using DrugBank LOD from Bio2RDF to predict potential

We implement a prototype of this framework using (a) a repository of feature vector rep- resentations associated with entities in a knowledge graph, (b) a mapping between the

Linked Data principles and standards provide a powerful means for data integration across multiple sources.. However, these stan- dards are often too open

property is a N-to-N property (a user typically likes N items and an item is liked by N users), which is the typical case where TransE fails to generate good predictions [9,14].

We fuse information about network infrastructure, security posture, cyber threats, and mission dependencies to build a CyGraph knowledge base for enterprise risk analysis,