Integration and provenance control of proteomics data using SWOMed, a Product Lifecycle Management framework for biomedical research


HAL Id: hal-01654383

https://hal.archives-ouvertes.fr/hal-01654383

Submitted on 3 Dec 2017

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.



Amel Raboudi, Marianne Allanic, Pierre-Yves Hervé, Daniel Balvay, Joevin Sourdon, Philippe Boutinaud, Bertrand Tavitian

To cite this version:

Amel Raboudi, Marianne Allanic, Pierre-Yves Hervé, Daniel Balvay, Joevin Sourdon, et al. Integration and provenance control of proteomics data using SWOMed, a Product Lifecycle Management framework for biomedical research. SMMAP2017, Oct 2017, Marne-la-Vallée, France. ⟨hal-01654383⟩


Amel Raboudi (1,2,3), Marianne Allanic (1), Pierre-Yves Hervé (1), Daniel Balvay (2,4), Joevin Sourdon (2,4), Philippe Boutinaud (1), Bertrand Tavitian (2,4,5)

1. FEALINX, 37 rue Adam Ledoux, 92400 Courbevoie, France
2. INSERM, UMR970, Paris-Cardiovascular Research Center at HEGP, Paris, France
3. Université de Technologie de Compiègne (UTC), Roberval Laboratory, Compiègne, France
4. Université Paris Descartes, Sorbonne Paris Cité, Faculté de Médecine, F-75006 Paris, France
5. Department of Radiology, Georges Pompidou European Hospital, Paris, France

Context

Because of the complexity of living organisms, biomedical research draws on multiple data sources from multiple instruments, techniques and protocols, e.g. various in vivo and in vitro imaging techniques, various omics methods, physiology, pharmacology, etc. At present, tools are lacking to efficiently integrate multiple heterogeneous biomedical data sets and to exploit their significance for addressing specific research questions.

Product Lifecycle Management (PLM) was developed by the industry to provide collaborative, secure, and reliable tools for industrial manufacturing. It provides traceability, versioning, strict access rights and data integrity to complex data from multiple sources in multiple formats.

SWOMed is a biomedical PLM system recently developed during the interdisciplinary research project BIOMIST (ANR-13-CORD-0007). It provides a collaborative framework for biomedical data lifecycle management, with a focus on cohort imaging and human cognitive neuroscience studies (Allanic et al. [1]), but had not yet been tested in an experimental preclinical study incorporating proteomics data.

Case study

Materials and Methods

Results

[Figure: data sources and acquisition flows of the case study, with their BMI-LM representation.]

Data sources: US, histology, qRT-PCR, Western blot, mass spectrometry, PET-CT.

In vitro use case (mass spectrometry): euthanasia, extraction, lysis, digestion (trypsin), C18 desalting, SCX, HPLC, MS, MS/MS, spectrum.

BMI-LM objects: Subject: Mouse; Exam: LC-MS followed by MS/MS; Acquisition: MS/MS; DataUnit: Spectrum.

PET-CT acquisition: fasting, anesthesia, physiological monitoring, radioactive agent (FDG) injection, CT, dynamic PET scan, DICOM images, quality control.

BMI-LM objects: Exam: PET-CT; Quality control; Agent: FDG, with annotations about injection parameters.

[1] M. Allanic et al., "BIOMIST: A Platform for Biomedical Data Lifecycle Management of Neuroimaging Cohorts," Front. ICT, vol. 3, Jan. 2017.

1. Collect data from multiple sources.
2. Understand and annotate data through interviews and domain ontologies: OBI, QIBO, MSO.
3. Analyse, correct and validate data.
4. Model data using the SWOMed XML input format.
5. Automate data staging in SWOMed.
6. (Re)use data for workflows and processing using SWOMed.
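The modelling step can be illustrated with a minimal sketch. The SWOMed XML input format is not described on this poster, so every tag and attribute name below (`study`, `subject`, `exam`, `acquisition`, `dataunit`, `ontology_term`) is a hypothetical stand-in loosely based on the BMI-LM object names shown here, and the identifiers, file name and ontology term are placeholders.

```python
import xml.etree.ElementTree as ET


def build_staging_xml(subject_id: str, spectrum_file: str) -> ET.Element:
    """Describe one MS/MS spectrum and its acquisition context as XML.

    Hypothetical layout: the tag and attribute names are illustrative,
    not the actual SWOMed XML input schema.
    """
    study = ET.Element("study", name="cardiotoxicity")
    subject = ET.SubElement(study, "subject", id=subject_id, species="mouse")
    exam = ET.SubElement(subject, "exam", type="LC-MS followed by MS/MS")
    acquisition = ET.SubElement(exam, "acquisition", type="MS/MS")
    # Slot for a domain ontology annotation (OBI, QIBO, MSO, ...).
    # The value is a placeholder, not a real term identifier.
    acquisition.set("ontology_term", "PLACEHOLDER_TERM")
    data_unit = ET.SubElement(acquisition, "dataunit", type="Spectrum")
    data_unit.text = spectrum_file
    return study


root = build_staging_xml("mouse-01", "mouse-01_fraction-1.mzML")
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Generating such a file from the validated metadata, rather than writing it by hand, is one way the staging step could be automated.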

BMI-LM objects for the mass spectrometry exam: Device: Q Exactive, with specific configuration version; Sample(s): peptide fractions, with annotations about lysis, digestion and C18 desalting.

BMI-LM objects for the PET-CT exam: Intervention: Fasting; Agent: Isoflurane, with annotations about anesthesia parameters; Acquisition: Monitoring; Acquisition: CT; Acquisition: PET; Device: Mediso NSPC10, version 2.021; DataUnit: DICOM.

Conclusion

Objective

• Traceability

• Provenance

• Versioning

• Multisite studies

• Strict access rights

• Access to previous research data

• Integrated workflows

• FAIR guidelines

• Comprehensive metadata

• Use of ontologies

Each represented object must reference its definition object. For visibility, only major BMI-LM objects are shown.
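The rule that each represented object references its definition object lends itself to a mechanical check. The sketch below is only an assumption about how such a check might look: `BmiObject`, `definition_id` and `find_undefined` are hypothetical names and are not part of SWOMed or BMI-LM.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BmiObject:
    """A represented object (hypothetical structure, for illustration)."""
    object_id: str
    kind: str                     # e.g. "Exam", "Acquisition", "DataUnit"
    definition_id: Optional[str]  # id of its definition object, if any


def find_undefined(objects: list[BmiObject], definitions: set[str]) -> list[str]:
    """Return the ids of objects whose definition is missing or unknown."""
    return [
        obj.object_id
        for obj in objects
        if obj.definition_id is None or obj.definition_id not in definitions
    ]


known_definitions = {"def-exam-petct", "def-acq-pet"}
objects = [
    BmiObject("exam-1", "Exam", "def-exam-petct"),
    BmiObject("acq-1", "Acquisition", "def-acq-pet"),
    BmiObject("acq-2", "Acquisition", None),           # no definition set
    BmiObject("du-1", "DataUnit", "def-du-unknown"),   # unknown definition
]
undefined = find_undefined(objects, known_definitions)
print(undefined)
```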

[Figure: SWOMed architecture. Elements shown: web service; PET-CT scanner; mass spectrometry raw data; transfer formats and protocols ({ftp, XML}, DICOM); cluster node 002.]

Data description services

• Data annotation using SWOMed classification.

• Data modeling using BMI-LM objects.

• Data and vocabulary mapping.

Integrated scientific workflows

1. Peptide identification and quantification
2. Protein inference

Quality control workflows

• Manual and automated validation.

• Visual QC

• Notification
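As a rough illustration of how automated validation and notification could fit together, the sketch below rejects records with incomplete metadata and passes the others on to manual validation. The required fields, status labels and function names are assumptions made for this example and do not describe the SWOMed quality control implementation.

```python
from typing import Callable

# Assumed minimal metadata for a data unit; not a SWOMed requirement list.
REQUIRED_FIELDS = ("subject", "exam", "acquisition", "device_version")


def validate_metadata(record: dict) -> list[str]:
    """Automated check: list required fields that are missing or empty."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


def run_quality_control(
    records: dict[str, dict], notify: Callable[[str], None]
) -> dict[str, str]:
    """Reject incomplete records and notify; queue the rest for manual review."""
    status = {}
    for name, record in records.items():
        missing = validate_metadata(record)
        if missing:
            status[name] = "rejected"
            notify(f"{name}: missing {', '.join(missing)}")
        else:
            status[name] = "awaiting manual validation"
    return status


messages: list[str] = []
records = {
    "pet_scan_01": {"subject": "mouse-01", "exam": "PET-CT",
                    "acquisition": "PET", "device_version": "2.021"},
    "spectrum_07": {"subject": "mouse-01", "exam": "LC-MS followed by MS/MS",
                    "acquisition": "MS/MS"},
}
status = run_quality_control(records, messages.append)
print(status)
print(messages)
```

Here `messages.append` stands in for whatever notification channel is actually used.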

[Figure, architecture elements: High Performance Computing cluster frontend; Nipype workflows [3]; nodes 001 to N; graphical interface; transfers via scp and scp/ssh; MaxQuant analysis results; user-generated derived data; reference database.]

Our main objective is to integrate proteomics and experimental multimodal preclinical studies using SWOMed.

The specific objectives are to guarantee research data quality, improve data sharing and collaboration, and ensure the reproducibility and reuse of heterogeneous study data.

We adapted the generic data model of SWOMed (BMI-LM) to the needs of DRIVE-SPC (Déploiement du Réseau d’Images du Vivant de Sorbonne Paris Cité), a joint project of the PARCC-Inserm laboratory and the company Fealinx that aims to bridge the gap between multi-source heterogeneous data and final research results.

Our first use case is an experimental cardiotoxicity study that combines results from proteomics, histology and two imaging modalities (positron emission tomography and cardiac ultrasound) to understand the mechanisms underlying the cardiotoxicity of an anti-angiogenic anticancer treatment in mice [2].

[2] J. Sourdon et al., "Cardiac Metabolic Deregulation Induced by the Tyrosine Kinase Receptor Inhibitor Sunitinib is rescued by Endothelin Receptor Antagonism," Theranostics, vol. 7, no. 11, pp. 2757-2774, 2017.

[Figure label: a university cloud service.]

Figure: results from integrated proteomics workflows. Shown are the workflow for converting raw files to mzXML and the results of the workflows for peptide (PCR_proteomics_peptides) and protein (PCR_proteomics_proteins) identification and quantification.

In vivo use case

[Workflow steps: convert raw files to mzML; identify and quantify peptides and proteins.]


Features

BMI-LM objects: Sample: Heart; Intervention: Euthanasia; Acquisition: LC; Acquisition: MS.

Proteomics processing stages: feature extraction, data analysis, result publication.

Tools: FIDO, X!Tandem.

Workflow data shown: mzML/mzXML files, FASTA files; identified peptides, quantified peptides; feature list, Id list, protein list.

[Figure: processing flow represented with BMI-LM objects, listed here by object type.]

Processing: MaxQuant; PMOD; GraphPad Prism statistics; PathwayStudio.

ProcessingUnitResult: all results from MaxQuant; formatted results for next analysis; all PMOD results; metabolic flux analysis; generated group comparisons; chosen graph for publication; chosen interesting pathways for publication.

Dataset: ProteinGroup.txt; TAC VOI; AIF VOI; metabolic flux (PKIN folder); All-group-results.xlsx.

Reference Data: published data in PRIDE; UniProt.

BibliographicReference: published article.

SoftwareTool: MaxQuant version 1.5.2.8.

SubjectGroup: Serie2.

WorkflowInput.

Link: to proteomics DataUnits.

Legend: acquisition flow, processing flow.
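Provenance control amounts to being able to walk from any result back to the data and software that produced it. The sketch below shows one way to do this with a small in-memory graph. The `Node` class and `provenance` function are hypothetical, and the links between the objects (which inputs fed which result) are assumed for illustration, since the figure layout is not recoverable here; only the object names come from the poster.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One object in a provenance graph (hypothetical structure)."""
    name: str
    kind: str                 # BMI-LM object type, e.g. "Dataset"
    tool: str = ""            # software and version that produced it, if any
    inputs: list["Node"] = field(default_factory=list)


def provenance(node: Node) -> list[str]:
    """Depth-first walk from a result back to its sources, one line per object."""
    lines: list[str] = []

    def visit(current: Node, depth: int) -> None:
        label = f"{current.kind}: {current.name}"
        if current.tool:
            label += f" [{current.tool}]"
        lines.append("  " * depth + label)
        for parent in current.inputs:
            visit(parent, depth + 1)

    visit(node, 0)
    return lines


# Assumed links, for illustration only.
spectrum = Node("Spectrum", "DataUnit")
uniprot = Node("Uniprot", "Reference Data")
maxquant_result = Node("All results from Maxquant", "ProcessingUnitResult",
                       tool="Maxquant 1.5.2.8", inputs=[spectrum, uniprot])
protein_groups = Node("ProteinGroup.txt", "Dataset", inputs=[maxquant_result])

trace = provenance(protein_groups)
print("\n".join(trace))
```

Recording the software version on the processing result, as the SoftwareTool object does, is what lets a reader of the published dataset tell which analysis settings produced it.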

[3] K. Gorgolewski et al., “Nipype: A Flexible, Lightweight and Extensible Neuroimaging Data Processing Framework in Python,” Front. Neuroinformatics, vol. 5, 2011.

We have built a centralized management framework for heterogeneous research data that covers the lifecycle of both imaging and proteomics data. It relies on a standards-based methodology that guarantees research data quality and ensures comprehensive metadata. We now wish to extend this centralized data management solution to complex workflows integrating increasingly diverse data sources. During this study we also encountered an unexpectedly high rate of protocol changes and system evolutions. We will therefore develop new tools and approaches that account for the evolution of biomedical research ecosystems, in order to adapt PLM methods to high protocol mutation rates and to improve the stability and resilience of our management framework for heterogeneous research data.