Semantic Knowledge Graph Embeddings for biomedical Research: Data Integration using
Linked Open Data
Jens D¨orpinghaus1,2, Marc Jacobs1
1 Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
2 jens.doerpinghaus@scai.fraunhofer.de
Abstract. Knowledge Graphs are becoming a key instrument for biomed- ical knowledge discovery and modeling. These approaches rely on struc- tured data, e.g. about related proteins or genes, and form cause-and- effect networks or – if enriched with literature data and other linked data sources – knowledge graphs. A key aspect of analysis on these graphs is the missing context. Here we present a novel semantic approach towards a context enriched Knowledge Graph for biomedical research utilizing data integration with linked data. The result is a general graph concept that can be used for graph embeddings in different contexts or layers.
1 Introduction
Copyright©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Biological and medical researchers considering computational approaches rely on structured data, e.g. about related proteins or genes, see [9]. Cause-and-effect networks are a special subtype of more general Knowledge Graphs. In principle, the integration of external data sources and manual curated data is key. Although several commercial solutions exist, Fakhry et al. state, that the ”adoption and extension of such methods in the academic community has been hampered by the lack of freely available, efficient algorithms and an accompanying demonstration of their applicability using current public networks” [4].
This and the emerging improvements on large-scale Knowledge Graphs and machine learning approaches are the motivation for our novel approach on se- mantic Knowledge Graph embeddings for biomedical research utilizing data in- tegration with linked open data. Several similar approaches (often in the context of drug-repurposing) have been described like Bio2RDF [2], hetionet [6], or Open PHACTS [5]. Our approach is more focussed on integrating the literature itself in a FAIR [10] and open knowledge graph which is also accessible from public a public resource: SCAIView3. SCAIView is an information retrieval system that allows semantic searches in large textual collections by ontological representa- tions of automatic recognized biological entities [7].
3 https://www.scaiview.com/
Knowledge Graphs: Context Embeddings for Knowledge Discovery and Data Mining on biomedical data
Information Highway Macro-Context
Knowledge Graph Layer n e.g. Document Layer
Layer n-1 e.g. Mechanism Layer (BEL)
…
Knowledge Foundation Layer 0
Micro-Context Layer 1 e.g. Molecular Layer
Data Integration Adding more data will increase the Knowledge-Foundation and gives a more precise view on the micro-context and helps to unveil new context and insights.
Data Layers Data within the Knowledge Graph can be ordered according to context and information to data layers (e.g. a molecular or mechanism layer). This helps to examine novel causal connections and context.
Knowledge Graph The Knowledge Graph is a key concept connecting unstructured knowledge entities into a network structure. It allows easy data integration and data mining utilizing graph- theoretical algorithms and technologies.
Information Highway To allow an easy and FAIR access to the data and benefit from semantic graph-queries we provide users with an SPARQL and Cypher frontend.
Data Uplifting (Data Mining)
Data Integration
Fig. 1.Illustration of the knowledge graph embedding between different layers. Here, every layer corresponds to a context defining new contexts on several other layers.
Thus layers and contexts are flexible and can be defined in a feasible way for every application.
The basis for generating our large-scale Knowledge Graph representation is the biomedical literature, e.g. MedLine and PubMed4. These articles or abstracts are the source for biological relations mentioned above. In addition, meta infor- mation like authors, journals, keywords (so called MeSH-Terms, Medical Subject Headings), etc. are freely available. Ontologies can be used to contextualize en- tities in the Knowledge Graph providing biological or medical relations (cf.5).
Every ontology will form another knowledge (sub-)graph.
Using methods of natural language processing (NLP) and text mining, we can combine and link these knowledge graphs to a giant and very dense new knowledge graph. This will meet a very general definition ofcontext. We can see every knowledge (sub-)graph as context to another. Biological expressions are context of the corresponding literature, authors are context of a text, named entities from ontologies found in a text are context to it or to the corresponding biological expressions.
Our overarching integration schema is based on the Biological Expression Language6is widely applied in biomedical domain to convert unstructured tex- tual knowledge into a computable form. The BEL statements that form knowl- edge graphs are semantic triples that consist of concepts, functions and relation- ships. Thus they can be easily added to a knowledge graph representing another layer or context. An example for a large Alzheimer network can be found in [8].
In the next section we describe the novel concept of semantic graph embed- dings within large-scale Knowledge Graphs. We will present several use-cases
4 See https://www.ncbi.nlm.nih.gov/pubmed/.
5 OLS, https://www.ebi.ac.uk/ols/index
6 BEL, www.openbel.org
and application examples as well as the semantic interoperability layer using RDF and SPARQL.
2 Knowledge Graph architecture
A Knowledge Graph is a systematic way to connect information and data to represent common knowledge. As described above, the context is the most im- portant topic to generate knowledge or even wisdom.
We define knowledge graphsG= (E, R) with entitiese∈E coming from a formal structure like an ontologyO, see [1] and [11]. The relationsr∈Rcan be ontology relations, thus in general we can say every ontologyOwhich is part of the data model is a subgraph of G which means O ⊆ G. In addition we allow inter-ontology relations between two nodes e1, e2 with e1 ∈ O1, e2 ∈ O2 and O1 6=O2. More general we define R ={R1, ..., Rn} as list of either ontologies, terminologies or any sort of controlled vocabulary containing relations or not.
We define contexts C = {c1, ..., cm} as a finite, discrete set. Every node v∈Gand every edger∈Rmay have one ore more contexts c∈C denoted by con(v) or con(r). It is also possible to setcon(v) =∅. Thus we have a mapping con:E∪R→ P(C). If we use a quite general approach towards context, we may set C =E. Thus every inter-ontology relation defines context of two entities, but also the relations within an ontology can be seen as context, see figure 1 for an illustration. Here every context is identified as a layer (e.g. a document layer, a molecular layer, a mechanism layer, ...). This allows new connections between different contexts or layers: If two edgese1, e2∈R1 are connected and e01, e02 ∈R2 withcon(e1) =e01 andcon(e2) =e02 are not connected, we may add another edge (e1, e2) with provenance information that this connection comes from a different context, namelyR2. Since every layer or context can be seen as a subgraph forming a surface we can denote the relation between two layers a knowledge graph embedding.
It is also possible to get the context of a subgraph Ri ⊆ G which can be denominated by con(Ri) or with the notation of graph theory as the extended induced subgraph by the vertex set Ei from Ri given by Gc[Ei]. This is quite trivial if context fromRi can only be annotated to vertices inG. Then
Gc[Ei] =G[Ei]∪ {(e, e0)∀e0∈N(e), e∈Ei}
Herecon|Ei =Gc[Ei] is the context ofEirestricted to the set of edges (relations) in the graph. The two edgese0, e00are implicitly given by this context. It is quite easy to see that the restriction on context annotated to edges makes the problem more easy from a computational perspective. Nevertheless, context on edges is needed from a real-world perspective.
The technical design was done with respect to the microservice architecture of SCAIView [3]. We offer both a REST API as well as a Java Message Service (JMS) interface. As a database backend, we used Neo4j. Here, we used Spring Data Neo4j to map objects to graphs. Thus our software can be used to perform Cypher and SPARQL queries. Data can be retrieved in JSON Graph Format or RDF format.
BELRelation
BELR elation
hasE vidence hasF
unc tion
hasFunction
hasBel hasBel
hasST ATEME NT_GROUP hasSu
bgraph hasC
ondition hasSu
bgraph hasF
unc tion
hasSu bgraph
hasC ondition hasST ATEME
NT_GROUP hasF
unc tion
hasSu bgraph
hasSu bgraph
hasST ATEME
NT_GROUP
hasC ondition hasF
unc tion
hasSu
bgraphhasST
ATEME NT_GROUP
hasC ondition
hasFunction
hasSu bgraph
hasC ondition hasST
ATEME NT_GROUP hasFunction
hasSu bgraph
hasSu bgraph
hasST ATEME
NT_GROUP
hasC ondition hasFunction
hasSu bgraphhasST
ATEME NT_GROUP
hasC ondition
hasBel
hasBel
hasE
vidence BELR
elation
BELR elation BELRelation
BELR elation
BELR elation
hasF unction hasBel
hasBel
hasBel
hasBel
hasBel hasBel hasBel
hasSu bgraphhasSu
bgraph
hasMeSHAn atomy BELR elation
hasSu bgraph
hasMeSHAn atomy BELRelation
hasF unc
tion hasSu
bgraph
hasBel hasBel hasFunc
tion hasSu
bgraph hasSu
bgraph hasST
ATEMENT_GROUP hasC
ondition
hasF unction
hasSubgraph hasSubgraph hasSTATEMENT_GROUP hasCondition hasF
unction
hasSu bgraph hasSu
bgraph hasST
ATEME NT_GROUP hasC
ondition
hasBel hasBel
hasBel
hasBel hasBel
hasBel hasBel
hasBel hasBel
hasBel
hasSubgraph
hasSu bgraph
hasC ondition hasMeSHAn
…
hasST ATEME
NT_GROUP
HGNC HGNC
HGNC
p
act Group 1
proteinTau subgr…
Normal Healthy State Amyloi…
HGNC
HGNC HGNC
deg Caspase subgra…
Regulat…
Neurons GOBP:"…
hasBel
hasBel hasBel
hasBel
hasBel hasBel
hasBel
hasBel
hasBel
hasBel hasBel
hasBel hasBel hasBel hasBel hasBelhasBel hasBelhasBel
hasBelhasBel hasBelhasBel hasBel hasBelhasBel hasBel hasBel hasBel hasBel hasBel
hasBel hasBel
hasBel hasBel hasBel
hasBel hasBel
hasBel hasBel
BELR elation
BELR elation
BELR elation
BELR elation
BELR elation BELRelation BELRelationBELRelationBELRelation BELRelation BELRelation BELRelation BELRelation BELRelation BELRelation
BELRelation BELRelation BELRelation BELRelation BELRelation
hasF unction
hasST ATEME NT_GROUP hasSu bgraph hasSu
bgraph hasDisease
hasFunctionhasDisease
hasST AT…
hasF unc tion hasDisease
hasSTATEMENT_GRO…
hasSTATEMENT_GROUP hasSTATEMENT_GROUP hasSTATEMENT_GROUPhasSTATEMENT_GROUPhasSTATEMENT_GROUPhasSTATEMENT_GROUPhasSTATEMENT_GROUPhasSTATEMENT_GROUPhasSTATEMENT_GROUP hasSTATEMENT_GROUP hasSubgraphhasSu
bgraph hasDisease hasDisease hasDisease hasDisease hasDisease hasDisease hasDisease hasDisease hasDisease hasDisease
hasF unc tion hasDisease
hasST ATEME NT_GROUP
hasSu bgraph hasSu
b…
hasSubgr…
hasSu bgraph hasSu
bgraph
BELR elation BELR
elation BELRelation
BELRelation BELRelation
BELR elation
BELRelation BELRelation hasSubgraph
hasFunction hasF
unction hasF
unction hasFunc
tion hasF
unction hasFunc
tion hasF
unction hasF
unction hasF
unction hasFunction
hasSu bgraph hasSu bgraph BELR elation BELR
elation BELR
elation
hasSubgraph hasSubgraph BELRelation
hasBel hasBel
hasBel hasBel
hasBel hasBel hasBel hasBel hasBel hasSubgraphhasSubgraph
hasSubgraph
hasSu bg…
hasBel
hasBel hasBel
hasSubgraph hasSu bgraph
dbSNP:r…
dbSNP:r…
MESHD…
dbSNP:r…
dbSNP:r…
g
Group 1 Neurotr…
Calciu…
Alzhei…
adhesionCell subgr…
cycleCell subgr…
HGNC HGNC
HGNC
HGNC HGNC
HGNC Endoso…
path Glutath…
Respon…
HGNC:…
GOBP:"…
Cholest…
Non-a…
CHEBI:…
TGF-Be…
(b)
hasSubgraph hasSubgraph hasSu
bgraph
hasSu bgraph BELR elation
BELRelation
BELR elation
BELRelation hasEvidence
BELRelation BELRelation
BELRelation BELRelation hasSubgraph
hasSubgraph hasFunction
hasFunction
hasFunction
hasFunction hasFunction
hasFunction hasSubgraph
BELRelation hasSubgraph
hasSubgraph
hasSubgraph
hasSubgraph hasBel hasBel hasBel hasBel hasBel hasBel hasBel hasBel hasBel hasBel hasBel hasBel
hasBel hasBel
hasBel
hasBel
hasBel BELRelation
BELRelation BELRelation BELRelation
BELR elation BELRelation
BELR elation BELRelation
hasSTATEMENT_GROUP hasSTATEMENT_GROUP hasSTATEMENT_GROUP hasSTATEMENT_GROUP
BELRelation BELRelation
hasSubgraph
hasCondition hasCondition hasCondition
hasSubgraph hasDisease hasMeSHAnatomy
hasMeSHAnatomy
BELRelation BELRelation
BELRelation BELRelation
BELR elation
BELRelation
BELRelation
hasFunction hasFunction
hasFunction hasFunction hasF unction
hasFunction hasCondition
hasCondition
hasCondition hasSTATEMENT_GROUP
hasSTATEMENT_GROUP
hasSTATEMENT_GROUP
hasSTATEMENT_GROUP hasSubgraph
hasSubgraph hasSubgraph
hasSubgraph hasSubgraphhasSubgraph hasSubgraph
hasSubgraph
hasSubgraph hasDisease hasMeSHAn
atomy
hasMeSHAn atomy hasF unction hasFunction
hasSubgraph hasSu
bgraph hasMeSHAn atomy
hasSTATEMENT_GROUP hasCondition
hasFunction hasF unction
hasSubgraph hasSu
bgraph
hasMeSHAn atomyhasSTATEMENT_GROUP
hasCondition
hasF unction
hasFunction
hasSu
bgraph hasSu
bgraph hasMeSHAn
atomy
hasSTATEMENT_GROUP hasCondition
hasFunction hasSu
bgraph
hasSu bgraph hasMeSHAn
atomy
hasST ATEME NT_GROUP hasC ondition
hasFunction
hasFunction
hasSu
bgraph hasSu
bgraph hasMeSHAn
atomy
hasSTATEME
NT_GROUPhasC
ondition
hasF unction hasSu
b…
hasSubgr…
hasMeSHAn atomy hasConditionhasSubgraph
hasSu
b…
hasST
ATEME
NT_GROUP hasC ondition
hasBel hasBel hasBel hasBel hasBelhasBel hasBel hasBel hasBelhasBel
hasBel
hasBelhasBel
hasBel hasBel
hasBel hasBel
hasBel hasBel hasBel
hasBel hasBel hasBel
hasBel hasBel
hasBelhasBel
hasBel
hasBel hasBel hasBel
hasBel hasBel
hasBel hasBel hasBel hasBel hasBel hasBel
hasBel hasBel
hasBel hasBel
hasBel hasBel hasBel hasBel hasBel hasBel
hasBel hasBel
hasBel
hasBel hasBel
hasBel
hasBel hasBel hasBel hasBel hasBel
hasBel hasBel
hasBel BELRelation
hasSubgraph
hasFunc tion hasFunction
hasSubgraph hasST
ATEMENT_GROUP hasCondition
BELRelation
hasSu bgraph
hasF unction hasFunction
hasSu bgraph
hasST ATEME NT_GROUP hasCondition
hasSu bgraph
hasFunction hasSu
bgraph
hasST ATEME NT_GROUP hasCondition
hasSu bgraph
hasFunction hasSubgraph
hasSTATEME NT_GROUP hasC
ondition
hasSubgraph
hasFunction hasSubgraph
hasSTATEMENT_GROUP hasCondition hasSubgraph
hasFunction hasSubgraph
hasSTATEMENT_GROUP hasCondition
hasSubgraph
hasFunction hasSubgraph
hasSTATEMENT_GROUP hasConditionhasSubgraph
hasFunction hasSubgraph
hasSTATEMENT_GROUP hasCondition hasCondition hasC ondition hasC ondition hasC ondition hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasST ATEME NT_GROUP hasSu bgraph
hasSu
bgraph hasSu bgraph
hasSu
bgraph hasSu
bgraph
hasSubgraph
hasSubgraph
hasSubgraph
hasSu bgraph
hasSubgraph
hasSu bgraph
hasDisease hasDisease hasDisease hasDis…
hasDis…
hasDis…
hasDisease hasDisease hasDisease
hasMeSHAn atomy hasBel hasBel
Synucl…
HGNC
HGNC HGNC
HGNC
HGNC HGNC
HGNC HGNC Chemo…
p act
Binding and Up…
GOBP:"…
Interleu…
HGNC HGNC
HGNC
HGNC
HGNC HGNC
HGNC HGNC Group 1 GOBP:"…
Caspase subgra… Amyloi…HealthyNormalState
Alzhei…
Astrocy…
Hippoc…
(a) (c)
Fig. 2. (a) This is an illustration of the context found for the BEL statement act(p(HGNC:KLC1)) => p(HGNC:MAPT) (found on the bottom of the graph). Both HGNC terms have an evidence in two different documents (purple) and both form a relation in another document (PMID:22272245 in the middle). The green nodes form a manually curated context (e.g. ”Normal Healthy State” or ”Tau protein subgraph”).
All HGNC entities are connected to other HGNC elements, documents and function.
(b) This is an illustration of the context of a single document (purple, left). (c) This is an illustration of the context of a context (green, left).
3 Application
The initial research question was how a general context could be added to biomedical knowledge graphs to answer generic questions according to context, e.g. time, location or biological layer. We have integrated subsets of PubMed data, several ontologies like GO, HGNC, MGI and mappings, BEL networks from Parkinson’s and Alzheimer’s disease as well as data obtained from KEGG.
See fig. 2 for some illustrations of different context layers. For example, seman- tic questions can be formulated as subgraph structures of the initial knowledge graphs. We may think of complex examples, e.g. ”Give me all pathways from protein A to B in the context of Disease C focusing on clinical trials”.
Hypothesis generation within medical research and digital health may lead to search for genomic or moleculare patterns, diagnosis or build longitudinal models which build the basis for a multitude of predictive and personalised medicine ML and AI approaches. This information system can be used to retrieve data by context (cohort size, settings, results, ..) and by content (imaging data, genomic or moleculare measures, ...). For example, this system may answer questions like Give me a clinical trial to reproduce my results or to apply my model or Give me literature for phenotype A, disease B age between C and D and a CT-scan with characteristic E.
4 Conclusion
Here we presented a novel approach that annotates research data with context information. The result is a knowledge graph representation of data, the context graph. It contains computable statement representation (e.g. RDF or BEL). This graph allows to compare research data records from different sources as well as the selection of relevant data sets using graph-theoretical algorithms.
References
1. Guidelines for the construction, format, and management of monolingual controlled vocabularies. Standard, National Information Standards Organization, Baltimore, Maryland, U.S.A. (2005)
2. Belleau, F., Nolin, M.A., Tourigny, N., Rigault, P., Morissette, J.: Bio2rdf: to- wards a mashup to build bioinformatics knowledge systems. Journal of biomedical informatics41(5), 706–716 (2008)
3. Drpinghaus, J., Klein, J., Darms, J., Madan, S., Jacobs, M.: Scaiview – a seman- tic search engine for biomedical research utilizing a microservice architecture. In:
Proceedings of the Posters and Demos Track of the 14th International Conference on Semantic Systems - SEMANTiCS2018 (2018)
4. Fakhry, C.T., Choudhary, P., Gutteridge, A., Sidders, B., Chen, P., Ziemek, D., Zarringhalam, K.: Interpreting transcriptional changes using causal graphs: new methods and their practical utility on public networks. BMC bioinformatics17(1), 318 (2016)
5. Harland, L.: Open phacts: A semantic knowledge infrastructure for public and commercial drug discovery research. In: International Conference on Knowledge Engineering and Knowledge Management. pp. 1–7. Springer (2012)
6. Himmelstein, D.S., Lizee, A., Hessler, C., Brueggeman, L., Chen, S.L., Hadley, D., Green, A., Khankhanian, P., Baranzini, S.E.: Systematic integration of biomedical knowledge prioritizes drugs for repurposing. Elife6, e26726 (2017)
7. Hodapp, S., Madan, S., Fluck, J., Zimmermann, M.: Integration of UIMA Text Mining Components into an Event-based Asynchronous Microservice Architecture.
In: Proceedings of the LREC 2016 Workshop ”Cross-Platform Text Mining and Natural Language Processing Interoperability”. pp. 19–23. European Language Resources Association (ELRA), Portoroˇz, Slovenia (2016)
8. Kodamullil, A.T., Younesi, E., Naz, M., Bagewadi, S., Hofmann-Apitius, M.: Com- putable cause-and-effect models of healthy and alzheimer’s disease states and their mechanistic differential analysis. Alzheimer’s & Dementia11(11), 1329–1339 (2015)
9. Martin, F., Sewer, A., Talikka, M., Xiang, Y., Hoeng, J., Peitsch, M.C.: Quantifi- cation of biological network perturbations for mechanistic insight and diagnostics using two-layer causal models. BMC bioinformatics15(1), 238 (2014)
10. Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.W., da Silva Santos, L.B., Bourne, P.E., et al.:
The fair guiding principles for scientific data management and stewardship. Scien- tific data3(2016)
11. Zeng, M.: Knowledge organization systems (kos)35, 160–182 (01 2008)