• Aucun résultat trouvé

Drug Discovery and Big Linked Data

N/A
N/A
Protected

Academic year: 2022

Partager "Drug Discovery and Big Linked Data"

Copied!
2
0
0

Texte intégral

(1)

Drug Discovery and Big Linked Data

Ronald Siebes1, Victor de Boer1, Bryn Williams-Jones2, and Stian Soiland-Reyes3

1 VU University Amsterdam, the Netherlands, rm.siebes@few.vu.nl, v.de.boer@vu.nl

2 Open PHACTS Foundation, Cambridge, United Kingdom bryn@openphactsfoundation.org

3 School of Computer Science, University of Manchester, United Kingdom soiland-reyes@cs.manchester.ac.uk

1 The Open PHACTS Drug Discovery Platform

A large part of the daily practice of a researcher doing in vitro Drug Discov- ery is comparing and manually matching high-quality information from multiple disciplines in the Life and Biomedical Sciences. The Open PHACTS Discovery Platform4 is an initiative to integrate publicly available data relevant for both academia and the pharmaceutical industry. It integrates numerous datasets in- cluding for example ChEBI, ChemSpider, DrugBank and the GeneOntology.

The platform provides an easy interface that allows researchers to consult the database without being confronted with the complexity of defining efficient Linked Data queries. A set of services are accessible via a RESTful interface.

The Open PHACTS Discovery Platform provides an interpretation of biomed- ical research activities (identified by domain experts) asworkflows that are au- thored using visual tools. Workflows retrieve data via API calls. The platform executes the resulting instantiated queries at an endpoint that serves relevant data.Currently, the infrastructure uses commercial software to reason over the vast amount of RDF data and the Big Data Europe (BDE) project took up the challenge to get the same functionality but via open source Big Data technology.

2 The Big Data Europe infrastructure

The BDE project5 is developing a re-usable Big Data infrastructure (BDI) needed by data-intensive science practitioners tackling a wide range of soci- etal challenges. The infrastructure is designed to cover aspects of publishing and consuming semantically interoperable, large-scale, multi-lingual data assets and knowledge. This BDE infrastructure is designed to minimize the disruption of current workflows, and maximizes the opportunities by taking advantage of the latest European RTD developments, including multilingual data harvesting, data analytics, and data visualization. To test the effectiveness of the platform,

4 http://www.openphactsfoundation.org

5 https://www.big-data-europe.eu

(2)

multiple pilot implementations are developed in the various domains. The first of these pilots is the Drug Discovery Pilot implementation, which replicates much of the functionalities of the Open PHACTS platform. The infrastructure relies heavily on the Docker containers6and configuration via Docker Compose where generic 3rd party Docker containers (e.g. MemCached, MySQL, SPARK, HDFS) are combined with custom made pilot specific Docker containers.

3 Drug Discovery Pilot implementation

In the pilot we propose to demonstrate7, the Open PHACTS functionality is implemented on the BDI. One goal of this pilot is to investigate dealing with the significant diversity of the entity name space in the bio-medical domain and exploring how this issue affects a generic big data infrastructure. Mapping this vast amount of entities leads to a significant increase of triples. A second goal is covering data and query security and privacy requirements and exploring how the methods used to handle this in the current implementation of the Open PHACTS Discovery Platform can be used to guide development of the generic BDE platform. An important challenge for this pilot is to replace the commercial cluster version RDF store, with an open source variant version: 4Store. To this end, we are implementing a 4Store BDE docker component and improving it in such a way that it can serve as a generic component on the BDE infrastructure8. The pilot integrates multiple datasets, available in RDF. The mappings be- tween the identifiers used in the various datasets are freely available as RDF linksets9. Most datasets have a metadata description published in VoID. The functionality of the Open PHACTS services is described in SWAGGER. The following processing is carried out:

– Real time processing: Using an external service (such as the Scientific Lenses keyword expansion service) to process a query and then to execute the pro- cessed query on the data stored in the infrastructure.

– Batch processing: Data transformations that align and link datasets at inges- tion time. The datasets above are regularly updated and must be periodically re-ingested.

The pilot implementation exposes a querying endpoint as well as a data ingestion endpoint for visualization or further processing.

The pilot itself is available in its entirety as Open Source software10. Both BDI and the pilot-specific components are implemented as Docker components.

Acknowledgements.This work is supported by European Union’s Horizon 2020 research and innovation programme under grant agreement No 644564 www.big- data-europe.eu. We thank our BDE collaborators for their support.

6 https://www.docker.com/what-docker

7 https://github.com/big-data-europe/pilot-sc1-cycle1

8 https://github.com/big-data-europe/docker-4store

9 https://www.openphacts.org/2/sci/data.html

10Download and instructions at https://github.com/big-data-europe/pilot-sc1-cycle1

Références

Documents relatifs

Therefore, to further increase the number of indexed datasets, we integrated in H-BOLD a procedure through which the user is able to upload the URL of a SPARQL endpoint and to see,

We present the pipeline of enabling vis- ual query creation over a SPARQL endpoint and ready-to-use data schemas over existing public Linked data endpoints, available in the

The Schema View shows a portion of the complete Schema Summary (see Figure 3 (c)) that con- tains only the classes grouped in the “Concept” Cluster selected in the previous level.

In order to avoid the need for users to specify a complex structured query upfront [2, 4] or for local free-text indexes to be built [5, 6], we propose an approach that combines

In the current design, the platform offers six different types of basic data operations for which Web APIs have been designed: DB-RDFization (for mapping data from re- lational

over the Linked Data would see a difference in the data returned when applying lenses with different notions of operational equivalence.. Note that although we are developing our

The platform hides the complexities of interacting with the linked data and concepts by exposing an API that provides the core functionality to support a wide variety of drug

Third, we build upon our translation algorithm to develop a rewriting optimization that converts graph traversal queries into equivalent SPARQL queries that execute over RDF graphs