Road to effective data curation for translational research

(1)

feature

Road to effective data curation for

translational research

WeiGu^1,2,z,SamiulHasan^3,z,PhilippeRocca-Serra^4,zand VenkataP.Satagopam^1,2,venkata.satagopam@uni.lu

Translational research today is data-intensive and requires multi-stakeholder collaborations to generate

and pool data together for integrated analysis. This leads to the challenge of harmonization of data from

different sources with different formats and standards, which is often overlooked during project

planning and thus becomes a bottleneck of the research progress. We report on our experience and

lessons learnt about data curation for translational research garnered over the course of the European

Translational Research Infrastructure & Knowledge management Services (eTRIKS) program (https://

www.etriks.org), a unique, 5-year, cross-organizational, cross-cultural collaboration project funded by

the Innovative Medicines Initiative of the EU. Here, we discuss the obstacles and suggest what steps are

needed for effective data curation in translational research, especially for projects involving multiple

organizations from academia and industry.

Introduction

Billionsofdollarsarespentannuallyongenerat- ingclinicalandtranslationalresearchdata.Yet despitesuchsignificantlevelsofinvestment,the amountofdatathatarereusedremainssurpris- inglylow.Besidesthelegalconstraints,twomain reasonsexplainthisdeficit:first,thedifficultiesin findingandaccessingdatasetsthemselves,and second,theeffortrequiredtoharmonizedatasets bothsyntacticallyand semantically.Theperpetual needfor addedcuration efforts highlights alackof interoperabilitybetweendigitalassets.Thisis symptomaticofthedichotomyplaguingthelife sciencesdomainwhenitcomestodataman- agementandassethandling.Ontheonehand, there is an operational gapin termsoftraining life- scienceresearchersinthenecessaryskills,tools andculturerequiredtoturndatasetsintoFAIR

(findable,accessible,interoperable,reusable) resources[1].Andontheotherhand,thereisa culturaldividebetweentheworldsofacademia andindustry,whichtranslatesintodiverseand frequentlydivergentattitudestowardsthe adoptionofdata-managementstandards(Fig.1).

Thisiscompoundedbyanarrayofoftencom- petingstandardsspecificationssponsoredby differentstakeholders,whoseviewsoftheworld arenotnecessarilydiscordant,butarenotentirely aligned,either.Hence,unlessalife-sciencelingua francaemerges,realisingthevisionofaFAIRdata exchangerequiresstrategicthinkingwithagood understandingofexistingstandardsresources andaclearcommitmenttousingthem.

Furthermore,funders,sponsors,program managers,dataanalystsandscientistsneedto fullyappreciatetheaddedburdenofformat

conversions,languagetranslationsandmap- pingbetweenterminologies.Theywillalsohave tobearinmindthatstandardsarenotstatic artefactsbutevolvingentitiesdesignedtore- flectthegrowthofdomainknowledge.There- fore,theyneedtoplanforupgrades,migration andobsolescencestrategies,leveraginglearn- ingsandpracticeswell-entrenchedinother domainsinthesoftwareindustry.

TheInnovativeMedicinesInitiative(IMI)isthe largestpublic–privatepartnershipinthelife- sciencesdomaininEurope,anditisfocusedon developingbetterandsafermedicinesfor patients.Centraltothisstrategyisaknowledge management(KM)environmentthatprovides sustainableaccesstothedatainanintegrated manner.Tomeetthischallenge,theIMIEuro- peanTranslationalResearchInfrastructure&

FeaturesPERSPECTIVE

1359-6446/ã2020TheAuthors.PublishedbyElsevierLtd.ThisisanopenaccessarticleundertheCCBYlicense(http://creativecommons.org/licenses/by/4.0/).

1

(2)

KnowledgemanagementServices(eTRIKS) projectfocusedonbuildingasustainableKM platformandprovidedsupportatthelevelof datamanagementtootherIMIprojects.Cura- tion,asanessentialpartofdatamanagement, connectskeycomponents,suchasvocabulary managementanddatahosting,inalonger

‘valuechain’thatleadstoasubstantialim- provementoftheutilityofdataforresearch.

BasedonourlearningsfromeTRIKS,weoutline thekeystepstodefine,specify,rolloutand enactrobust,proven,standards-awaredata- managementplans(DMPs).Theseexperiences arerelevanttoalldata-intensiveclinicaland translationalstudies.Thisworkbuildsonthe eTRIKSstandardsstarterpack(SSP)[2],which surveyed,evaluatedandcompiledacollection ofdatastandardsofrelevancetothefield.

Ouraimistosharethepracticalknowledge thusgainedtoassistorganizationstomaximize thevalueoftheirownandpublicdataandto practicallyassesswhentoimplementdata curationintranslational-researchoperational processes.Thisinvolvesoutliningacompre- hensiveconceptofdatacurationasanintegral partoftheglobalconceptofaDMP.Itcovers datahandlingprocessingfromendtoend, introducingtheuseofdatastandardsfromthe studyplanningstagestodata-setsocialization throughplannedreleases.

ExperienceofdatacurationinIMIeTRIKS supportedprojects

eTRIKSdirectlyengagedwithIMIprojectsby providingsupportindatacurationandtraining.

Thisfacilitateddataaccesstoconsortium membersviadata-sharingplatformssuchas tranSMART[3].Thecurationactivitiestargeted compliancetostandardsasrecommendedby

theSSP[2].Curationthereforecoveredaspects ofmetadatastructuring(thesyntax)andele- mentsofcontentannotation(thesemantics).

TheIMIprojectsdirectlyinvolvedwere:UBI- OPRED[4],OncoTrack[5],RA-MAP[6],ABIRISK [7],APPROACH(https://www.approachproject.

eu/about-approach)andAETIONOMY[8].

Initially,eTRIKSadopteda‘fullsupport’model byallocatingexperiencedpersonneltocurate dataforleadclientprojects.However,asifto confirmtheaphorism‘nobattleplaneversurvives contactwiththeenemy’,thissoonresultedin havingtomakeadjustments,oftensignificant ones,owingtothefollowingshareableobstacles:

Obstacle 1: underestimation of the time

required to clear legal hurdles

Thenegotiationofmaterialordataprocessing agreementsbetweenconsortiaturnedouttobe atime-consumingprocess.Thecomplexdetails resultedin>20monthsspenttograntdata accessforcuration.Thisrestrictionwascom- poundedfurtherbyapprehensivenessinshar- inginformationaboutstudyvariables,which resultedinfurtherunanticipateddelays.

Itiscrucialforfutureconsortiatoclearthe pathtodataexchangebetweenpartnersto preventbeingboggeddowninjuridicalno man’sland.

Obstacle 2: under resourcing of the

curation activities and sustainability

issues

ThenumberofcuratorsallocatedtoeTRIKSonly allowedustorealisticallyengagewithfiveto eightleadprojects,whichwouldhavefallenfar shortofeTRIKS’engagementgoalof40projects.

Therealcostsandthescaleofthegapbetween thecuratorialresourcesandthenumberofIMI

projectsbeingfundedwasquicklyandcrudely exposed.

Toprovidesustainablecurationsupporttosub- sequentIMIprojects,anadditional‘lightweight’ supportmodelwasintroduced.Underthisremit, eTRIKShasdevelopeddata-curationguidelinesand an IMI curatortraining course[9]. The guidelines and aseriesoftrainingsessionswereprovidedtosup- portedprojectstoenableindependentcuration.

Eventhismodeloftenfacedchallengesowingtoa lackofpersonnelinmanyprojects.

Westronglysuggestthatfutureconsortia/

projectsshouldplansufficientcuration resourcesintheirproposalstoensureaneffi- cientdataflowfromdataproductiontoanaly- ses.

Obstacle 3: underestimation of the

‘cultural divide’ of standards

Industrystrengthstandardshaveadaunting complexity.Thismeansthatbringingdata managerstotherequiredlevelsofcompetence demandsadequatetrainingandresourcing.By contrast,datastandardsusedinacademiaoften haveadifferentorigin,withpragmatism,agility andflexibilityattheircore,andwithloosely coupledimplementation.Thesedistinctatti- tudestendtopolarizepracticeandmakefinding commongroundchallenging.Havingthesetwo worldsexchange,coordinateanddevelop globalstandardsconsumestime,expertise, trainingandknowledgetransferforharmoni- zationandstability.

Weurgescientificleadersandresearchers frombothacademiaandindustrytoreachoutto fundersandsponsorsregardingthisissue,so thatitwillbeacknowledgedbyfundingagen- ciesandstudysponsorstopromoteconver- genceofefforts.

Drug Discovery Today

FIGURE1

Fundamentaldatacurationpolicies,technicalinfrastructureandculturaladoptionarenecessarytohelppublic–privatepartnershipsdelivernewandbetter- structuredclinicalandomicsdatatosupportpersonalizedmedicine.

2 www.drugdiscoverytoday.com FeaturesPERSPECTIVE

(3)

Obstacle 4: lack of a proper data-

management plan

ADMPisvitalforvariousactivitiesinvolvingthe handlingofdataorinformationthatisoutlined intheprotocoltobecollected,analysedand shared.Italsocontainsdetailedinformation necessaryforthecuratorstounderstandthe datatobeprocessed.Yetalltoooften,DMPsare missingor,whenavailable,aretoothinlywritten tobootstrapthecurationprocess.

BecauseoftheessentialroleofaproperDMP inefficientandeffectivedatacuration,wehave madesomesuggestionsindetailinadedicated section‘Supportforeffectivecurationwithina data-managementplan’.

Obstacle 5: lack of metadata descriptors

AlthoughitrelatestotheDMP,thispointisworth separateattentionowingtoitslargepractical impactoncuration,especiallyforretrospective datasetsforwhichaDMPmightnotbeavailable.

Assumingthatcuratorshaveovercomethepre- vioushurdlesandobtainedadataset,thevariables andtheirvaluesetsareoftensopoorlyannotated thatitisnearimpossibletodecipherthemwithout furtherinteractionswiththeprimaryresearchers, whoaretypicallypressedfortimeorunreachable.

Mostly,itfallsonthecuratorstounderstandthe pre-curated(‘straightfromthehose’)data.This affectstheefficiencyofcurationandfrequently resultsinerroneousinterpretations.

Westronglyrecommendthataproperdata dictionarybedevelopedtodocumentthemeta- dataofeachvariable(name,label,type,unit,value ranges,controlledvocabularies,dependencies, etc.).Thisinformationshouldbecollectedasearly aspossibleor,ifpossible,evenbeforetheproject starts:akeysteptomakedataFAIRbydesign.

Supportforeffectivecurationwithina data-managementplan

Themajorresearchfundershavenowevolved relatively(orsomewhat)homogenous approachestotheirDMPs.DMPtemplatesare availablefromnumeroussources,buttheUK DigitalCurationCentre[10]andELIXIRDMP- wizard(https://ds-wizard.org/)provide resourcesthatreflectbroadcurrentthinking.

DMPsarenowevolvingfromsimplecheckliststo documentscoveringtheentiredatacustody planfromcollectiontopublication,ensuring compliancewiththeformatsandannotation requirementsofrepositories(e.g.,CONSORT, http://www.consort-statement.org/),regulators (e.g.,CDISC,https://www.cdisc.org/standards/

therapeutic-areas)andEuropeanlaw(EUGen- eralDataProtectionRegulation).Keyperfor- manceindicatorsarebeingdevelopedto

determinethelevelofadherencetoprinciples suchastheFAIRprinciplesbothintheEU(e.g., FAIR-DOM.org)[11]andintheUnitedStates(US NationalInstitutesofHealthDCCPCKC1FAIR access)[1].Theseeffortsshareacommonob- jective:Toselectdatastandards,terminologies orontologiesthatwillbeimplementedinthe data-capturingphasewhereverpossible,lead- ingtoanideallimitof‘freeoffreetext’datasets.

Theupfrontdeclarationofstandards,theirname andversionthusallowstheavailabilityofkey provenancedescriptors,whichoughttobeas- sociatedtometadatacapturetemplates,which themselvesoughttobeidentified,licensedand versioned.ThisfollowedtheworkbyDietrich etal.onDMPs[12].Insupportofthesetasks, effortssuchasFAIRsharing[13]andCDISCShare arelaudableinitiatives.

However,inspiteofthisprogress,many instancesofDMPsseemtobetooweak,orevenan afterthought.DMPsneedtobepreparedatthe earlystagestoprovideameanstodeliverpro- spective,designdriven,compliantdata-collection blueprintsthataredigitalartefactsintheirown right,bothmachinereadableandactionable.To reachthisstage,itisessentialthatfrominception, datacontrollersanddataprocessors(including usersandstatisticians)shouldbecloselyinvolved, alongwithstudydesigners.Theyshouldascertain thattheprotocolsembedsufficientsafeguardsto accountforbiasanddocumenthowlurkingvari- ablesaredealtwith;thisiscrucial,andthelackof suchinformationinlegacyorpublicdatasetsis usuallyacausefortheirexclusioninmeta-analyses andreviews,astheuseofinsufficientlydescribed datasetscarriesariskofmisinterpretingsignals.

Somedatamodalitiespresentnewchal- lenges,suchasthehigh-dimensionaldatasets generatedbyomicstechnologies.Thesechal- lengesarebasedontheemergenceofnew platformandtechnologydescriptionsandsig- nal-processingmethodologies(e.g.,normaliza- tionandbatch-effectcorrections).Inaddition, thecomputationalworkflowsandinformation technology(IT)infrastructuresforstoringand executingthecomputationsareevolving,add- ingtothecomplexities.

TheDMPmustbeaugmentedwithdetailed data-accessconditionsandtermsofuse(suchas theDataUseOntologyGA4GHstandard(https://

www.ga4gh.org/news/data-use-ontology- approved-as-a-ga4gh-technical-standard)and theEuropeanGenome-phenomeArchiveterms [14])tobecompliantwithlegalframeworksand tosafeguardpatients’rightsandprivacy.Hence, dataprocessorsshouldprovide(iftheyhostthe data)technicalimplementationplanstohost (transfer),process,analyseandsharedataina

secureandscalablemanner.Theyoughtto makeavailableaclearpictureofthedataflow andprovidestandardoperatingproceduresfor dataaccess.ModelssuchasDataTagSuite (DATS)[15,16]andthedataaccesscomponents couldbefollowed.

TheIMIpublishedthe‘GuidelinesonFAIR DataManagementinHorizon2020’[17]witha FAIRDMPtemplatetoharmonizeandimprove curationactivities.Anothergoodresourceisthe

‘GoodClinicalDataManagementPractices’, whichaimstohelpimplementsustainable curationactivities[18].

Supportforeffectivecurationwithina curationcommunity

ButbeyondaDMP,whatwouldbenefitthe translation-researchcommunitymostisan open,precompetitivecurationcommunity.This oughttoincludeacademia,datamanagers, bioinformaticians,standardsdevelopment,IT developers,physicians,statisticiansandcura- tors.InthecaseoftheIMI,thisincludesEuro- peanFederationofPharmaceuticalIndustries andAssociations(EFPIA)partners,academiaand smallandmediumenterprises.Forthat,mem- bersoftheresearchcommunityshouldestablish andmaintainthefollowingkeypillars.

People and organizations

Buildupanetworkofpeoplewithdiverseex- pertise(domainknowledge,ITskills,algorithms, legalandethicalrequirementsandcoordina- tion)andcreateaforumforallcurationstake- holderstoexchangeideas,requirementsand resources.DespiteinitiativesliketheBiocuration SocietyandthePistoiaAlliance,datacreatedby thepharmaceuticalindustryarestilltreatedas tradesecrets,leadingtoalotofwheelrein- ventionineveryorganization.

Projects and ideas

Seekoutopportunitiestoapplyjointgrantsto improvecurationmethods,resources(e.g., metadataandreference-datarepositories)and infrastructures.

Data sets

Releasetrainingdatasetsofvariables,their valuesetandactual‘dirtydata’.Asmachine learningandartificialintelligence(AI)method- ologiesaregainingimportance,itiscrucialthat acommunityisbuiltaroundassemblingtraining datasetsthatcanestablishthebasisfortools gearedtobootstrapcurationefforts.Tothat effect,obtainingcommondataelementsfrom theNationalCancerInstitute(NCI)(https://www.

cancer.gov/),USNationalInstitutesofHealth,

www.drugdiscoverytoday.com 3 FeaturesPERSPECTIVE

(4)

FederalInteragencyTraumaticBrainInjuryRe- search(FITBIR)(https://fitbir.nih.gov/)andmany others,organizedbyclinicaldomain,wouldbe essentialtoseedanAIengine.

Training

Providetrainingnotonlytocurators,butalsoto principalinvestigators,clinicalresearchers,IT staffandlegalexpertstohelpthemunderstand theimportanceofcurationandotherpeople’s viewsandrolesinthewholedata-management processrelatedtocuration.

Supportforeffectivecurationwithin infrastructuredevelopments

Thankstoitswidespreaduseofgenomicsand high-throughputorhigh-resolutionimaging techniques,translationalresearchsitsfirmlyin therealmofbigdatascience.Effectiveand

efficientdatacurationrequiresnowmorethan everstrongsupportfrominfrastructure,refer- ence(meta)dataresources,andtoolssothat FAIRprinciplescanbeimplemented.Thegoals ofsuchinfrastructurearetoreducetheoverall timeofcurationatthesametimeasimproving thequalityoftheoutcome.Otherbenefitsin- cludeloweringthebarrierofcurationandbeing abletotraceandreproducecurationsteps.A moderncurationinfrastructurerequirescare andproperset-up,followingthestepwisepro- cessshowninFigure2.

Wewouldliketoemphasizethatmanyofthe functionalitiesmentionedherearealready availablepiecebypieceinexistingsoftware applications.Forexample,someofthecom- ponentsneededarealreadyprovidedbyon- tologysupporttools(inEurope,https://www.

ebi.ac.uk/spot/;intheUnitedStates,https://

bioportal.bioontology.org/),OpenRefine(http://

openrefine.org/)andmanycommunityversions ofcommercialtools,suchasPentaho-Kettle (https://github.com/pentaho/pentaho-kettle) andTalendOpenStudio(https://github.com/

Talend/tbd-studio-se).Whenacurationinfra- structureistobesetup,oneshouldreuseand integratetheexistingtoolsasmuchaspossible toavoidreinventingthesamesolutions.

Conclusion

Thisarticledeliversanhonestreviewofthe lessonslearntfromourexperiencesinsup- portingseveralIMIprojectswiththeirdata- curationneeds.Theoutcomeoftheseeffortshas resultedinthereuseofdatainseveralmulti- partyprojectswithinIMI[4–8]andbeyond.For example,theH2020-SYSCIDproject(http://

www.syscid.eu/)reusedthecurateddatarelated

Drug Discovery Today

FIGURE2

ToolssuchasanefficientITinfrastructureareneededforeffectiveandefficientdatacuration.Theupperexampleconversationisatypicalstartingpointfor researchers(ordatamanagersanddataanalysts)whotrytoperformdatacurationwithoutthepropertools.Thelowerpartisalistoffunctionalcomponentsfor anidealcurationinfrastructurethroughoutthelifecycleofdatacuration,frompre-curationtocurationproperandpost-curation.

4 www.drugdiscoverytoday.com FeaturesPERSPECTIVE

(5)

toautoimmunedisease,andtheNCER-PD project(https://parkinson.lu)reusedthecurated datarelatedtoParkinson’sdisease.Datacura- tionremainsthemaintechnicalchallengefor thereuseofdatainclinicalandtranslational research.

Cleaningandscrubbingdataisnotonlya time-consumingandtediousprocess,itisalso anexpensiveone.Yetprojectsstillconsentto thatexpenditure,eventhoughtheyareless inclinedtoagreetoamorerobust,‘front-loaded’

approachtodatamanagement.Indeed,possibly themosteffectivewaytowrestletheproblemis toformawell-structuredDMPatthestartthat coversallthestepsfromdatagenerationor capturetodataanalysis,withthehighestpos- sibleprecision,definitionsanddiscipline.The resources,rolesandresponsibilitiesofeach partyshouldbeclearlydefinedandplanned.

Necessaryinfrastructureshouldbedeployed and,ifneeded,developedtoaccelerateand improvethecurationactivities.Acuration communityisneededtosustainandcontinue developingknowledge,technologiesandex- pertise.Lastbutnotleast,awarenessofthe importanceofdatacurationandtheriskofalack ofplanningshouldbedisseminatedtothe clinicalandtranslationalresearchfield,so researcherscanavoidmakingthesamemistakes timeandtimeagain.

DeclarationofCompetingInterest Theauthorsdeclarethattheyhavenoknown competingfinancialinterestsorpersonal

relationshipsthatcouldhaveappearedtoin- fluencetheworkreportedinthispaper.

Acknowledgements

Wehighlyappreciatetheessentialcontributions oftheeTRIKSconsortiummembers:Dorina Bratfalean,AdrianoBarbosa-Silva,SergeEifes, FranciscoCapdevilaBonachela,IbrahimEmam, ManfredHendlich,PaulHouston,ChrisMarshall, PaulPeeters,KavitaRege,FabienRichard,Martin Romacker,AndreasTielmann,Michael

Braxenthaler,Susanna-AssuntaSansoneand ReinhardSchneider.Thisworkhasreceived supportfromtheIMIJointUndertakingunder grantagreementno.115446(eTRIKS),the resourcesofwhicharecomposedoffinancial contributionfromtheEU’sSeventhFramework Programme(FP7/2007–2013)andTheEuropean FederationofPharmaceuticalIndustriesand Associations(EFPIA)companies’in-kind contributions(www.imi.europa.eu).

References

1 Wilkinson,M.D.etal.(2016)TheFAIRGuidingPrinciples forscientificdatamanagementandstewardship.Sci.

Data3,160018

2 Rocca-Serra,P.etal.(2016)eTRIKSStandardsStarter PackRelease1.1April2016.Zenodo.http://dx.doi.org/

10.5281/zenodo.50825

3 Szalma,S.etal.(2010)Effectiveknowledge managementintranslationalmedicine.Brief.Bioinform.

8,68

4 Wheelock,C.E.etal.(2013)Applicationof’omics technologiestobiomarkerdiscoveryininflammatory lungdiseases.Eur.Respir.J.42,802–825

5 Gu,W.etal.(2019)Dataandknowledgemanagement intranslationalresearch:implementationoftheeTRIKS

platformfortheIMIOncoTrackconsortium.BMC Bioinform.20,164

6 Cope,A.P.etal.(2018)TheRA-MAPConsortium:a workingmodelforacademia–industrycollaboration.

Nat.Rev.Rheumatol.14,53–60

7 Bachelet,D.(2016)Occurrenceofanti-drugantibodies againstinterferon-betaandnatalizumabinmultiple sclerosis:Acollaborativecohortanalysis.PLoSOne11, e0162752

8 Hofmann-Apitius,M.etal.(2015)Bioinformaticsmining andmodelingmethodsfortheidentificationofdisease mechanismsinneurodegenerativedisorders.Int.J.

Mol.Sci.16,29179–29206

9 Marchetti,G.andJullian,N.(2017)eTRIKSFinalTraining Curriculum:DeliverableD6.6.eTRICKS

10 DCC(2013)ChecklistforaDataManagementPlan,v.4.0.

DigitalCurationCentre

11 Wolstencroft,K.etal.(2016)FAIRDOMHub:arepository andcollaborationenvironmentforsharingsystems biologyresearch.NucleicAcidsRes.45,D404–D407 12 Dietrich,D.etal.(2012)De-mystifyingthedata

managementrequirementsofresearchfunders.Issues Sci.Technol.Librariansh70.http://dx.doi.org/10.5062/

F44M92G2

13 Sansone,S.A.etal.(2019)FAIRsharingasacommunity approachtostandards,repositoriesandpolicies.Nat.

Biotechnol37,358–367

14 Lappalainen,I.etal.(2015)TheEuropeanGenome- phenomeArchiveofhumandataconsentedfor biomedicalresearch.Nat.Genet.47,692–695 15 Sansone,S.A.etal.(2017)DATS,thedatatagsuiteto

enablediscoverabilityofdatasets.Sci.Data4,170059 16 Alter,G.etal.(2020)TheDataTagsSuite(DATS)model

fordiscoveringdataaccessanduserequirements.

Gigascience9 giz165

17 EuropeanCommission(2016)H2020Programme:

GuidelinesonFAIRDataManagementinHorizon2020.

EuropeanCommission

18 SocietyforClinicalDataManagement(2020)Good ClinicalDataManagementPractices(GCDMP).Society forClinicalDataManagement

www.drugdiscoverytoday.com 5 FeaturesPERSPECTIVE