feature
Road to effective data curation for
translational research
WeiGu1,2,z,SamiulHasan3,z,PhilippeRocca-Serra4,zand VenkataP.Satagopam1,2,venkata.satagopam@uni.lu
Translational research today is data-intensive and requires multi-stakeholder collaborations to generate
and pool data together for integrated analysis. This leads to the challenge of harmonization of data from
different sources with different formats and standards, which is often overlooked during project
planning and thus becomes a bottleneck of the research progress. We report on our experience and
lessons learnt about data curation for translational research garnered over the course of the European
Translational Research Infrastructure & Knowledge management Services (eTRIKS) program (https://
www.etriks.org), a unique, 5-year, cross-organizational, cross-cultural collaboration project funded by
the Innovative Medicines Initiative of the EU. Here, we discuss the obstacles and suggest what steps are
needed for effective data curation in translational research, especially for projects involving multiple
organizations from academia and industry.
Introduction
Billionsofdollarsarespentannuallyongenerat- ingclinicalandtranslationalresearchdata.Yet despitesuchsignificantlevelsofinvestment,the amountofdatathatarereusedremainssurpris- inglylow.Besidesthelegalconstraints,twomain reasonsexplainthisdeficit:first,thedifficultiesin findingandaccessingdatasetsthemselves,and second,theeffortrequiredtoharmonizedatasets bothsyntacticallyand semantically.Theperpetual needfor addedcuration efforts highlights alackof interoperabilitybetweendigitalassets.Thisis symptomaticofthedichotomyplaguingthelife sciencesdomainwhenitcomestodataman- agementandassethandling.Ontheonehand, there is an operational gapin termsoftraining life- scienceresearchersinthenecessaryskills,tools andculturerequiredtoturndatasetsintoFAIR
(findable,accessible,interoperable,reusable) resources[1].Andontheotherhand,thereisa culturaldividebetweentheworldsofacademia andindustry,whichtranslatesintodiverseand frequentlydivergentattitudestowardsthe adoptionofdata-managementstandards(Fig.1).
Thisiscompoundedbyanarrayofoftencom- petingstandardsspecificationssponsoredby differentstakeholders,whoseviewsoftheworld arenotnecessarilydiscordant,butarenotentirely aligned,either.Hence,unlessalife-sciencelingua francaemerges,realisingthevisionofaFAIRdata exchangerequiresstrategicthinkingwithagood understandingofexistingstandardsresources andaclearcommitmenttousingthem.
Furthermore,funders,sponsors,program managers,dataanalystsandscientistsneedto fullyappreciatetheaddedburdenofformat
conversions,languagetranslationsandmap- pingbetweenterminologies.Theywillalsohave tobearinmindthatstandardsarenotstatic artefactsbutevolvingentitiesdesignedtore- flectthegrowthofdomainknowledge.There- fore,theyneedtoplanforupgrades,migration andobsolescencestrategies,leveraginglearn- ingsandpracticeswell-entrenchedinother domainsinthesoftwareindustry.
TheInnovativeMedicinesInitiative(IMI)isthe largestpublic–privatepartnershipinthelife- sciencesdomaininEurope,anditisfocusedon developingbetterandsafermedicinesfor patients.Centraltothisstrategyisaknowledge management(KM)environmentthatprovides sustainableaccesstothedatainanintegrated manner.Tomeetthischallenge,theIMIEuro- peanTranslationalResearchInfrastructure&
FeaturesPERSPECTIVE
1359-6446/ã2020TheAuthors.PublishedbyElsevierLtd.ThisisanopenaccessarticleundertheCCBYlicense(http://creativecommons.org/licenses/by/4.0/).
1
KnowledgemanagementServices(eTRIKS) projectfocusedonbuildingasustainableKM platformandprovidedsupportatthelevelof datamanagementtootherIMIprojects.Cura- tion,asanessentialpartofdatamanagement, connectskeycomponents,suchasvocabulary managementanddatahosting,inalonger
‘valuechain’thatleadstoasubstantialim- provementoftheutilityofdataforresearch.
BasedonourlearningsfromeTRIKS,weoutline thekeystepstodefine,specify,rolloutand enactrobust,proven,standards-awaredata- managementplans(DMPs).Theseexperiences arerelevanttoalldata-intensiveclinicaland translationalstudies.Thisworkbuildsonthe eTRIKSstandardsstarterpack(SSP)[2],which surveyed,evaluatedandcompiledacollection ofdatastandardsofrelevancetothefield.
Ouraimistosharethepracticalknowledge thusgainedtoassistorganizationstomaximize thevalueoftheirownandpublicdataandto practicallyassesswhentoimplementdata curationintranslational-researchoperational processes.Thisinvolvesoutliningacompre- hensiveconceptofdatacurationasanintegral partoftheglobalconceptofaDMP.Itcovers datahandlingprocessingfromendtoend, introducingtheuseofdatastandardsfromthe studyplanningstagestodata-setsocialization throughplannedreleases.
ExperienceofdatacurationinIMIeTRIKS supportedprojects
eTRIKSdirectlyengagedwithIMIprojectsby providingsupportindatacurationandtraining.
Thisfacilitateddataaccesstoconsortium membersviadata-sharingplatformssuchas tranSMART[3].Thecurationactivitiestargeted compliancetostandardsasrecommendedby
theSSP[2].Curationthereforecoveredaspects ofmetadatastructuring(thesyntax)andele- mentsofcontentannotation(thesemantics).
TheIMIprojectsdirectlyinvolvedwere:UBI- OPRED[4],OncoTrack[5],RA-MAP[6],ABIRISK [7],APPROACH(https://www.approachproject.
eu/about-approach)andAETIONOMY[8].
Initially,eTRIKSadopteda‘fullsupport’model byallocatingexperiencedpersonneltocurate dataforleadclientprojects.However,asifto confirmtheaphorism‘nobattleplaneversurvives contactwiththeenemy’,thissoonresultedin havingtomakeadjustments,oftensignificant ones,owingtothefollowingshareableobstacles:
Obstacle 1: underestimation of the time
required to clear legal hurdles
Thenegotiationofmaterialordataprocessing agreementsbetweenconsortiaturnedouttobe atime-consumingprocess.Thecomplexdetails resultedin>20monthsspenttograntdata accessforcuration.Thisrestrictionwascom- poundedfurtherbyapprehensivenessinshar- inginformationaboutstudyvariables,which resultedinfurtherunanticipateddelays.
Itiscrucialforfutureconsortiatoclearthe pathtodataexchangebetweenpartnersto preventbeingboggeddowninjuridicalno man’sland.
Obstacle 2: under resourcing of the
curation activities and sustainability
issues
ThenumberofcuratorsallocatedtoeTRIKSonly allowedustorealisticallyengagewithfiveto eightleadprojects,whichwouldhavefallenfar shortofeTRIKS’engagementgoalof40projects.
Therealcostsandthescaleofthegapbetween thecuratorialresourcesandthenumberofIMI
projectsbeingfundedwasquicklyandcrudely exposed.
Toprovidesustainablecurationsupporttosub- sequentIMIprojects,anadditional‘lightweight’ supportmodelwasintroduced.Underthisremit, eTRIKShasdevelopeddata-curationguidelinesand an IMI curatortraining course[9]. The guidelines and aseriesoftrainingsessionswereprovidedtosup- portedprojectstoenableindependentcuration.
Eventhismodeloftenfacedchallengesowingtoa lackofpersonnelinmanyprojects.
Westronglysuggestthatfutureconsortia/
projectsshouldplansufficientcuration resourcesintheirproposalstoensureaneffi- cientdataflowfromdataproductiontoanaly- ses.
Obstacle 3: underestimation of the
‘cultural divide’ of standards
Industrystrengthstandardshaveadaunting complexity.Thismeansthatbringingdata managerstotherequiredlevelsofcompetence demandsadequatetrainingandresourcing.By contrast,datastandardsusedinacademiaoften haveadifferentorigin,withpragmatism,agility andflexibilityattheircore,andwithloosely coupledimplementation.Thesedistinctatti- tudestendtopolarizepracticeandmakefinding commongroundchallenging.Havingthesetwo worldsexchange,coordinateanddevelop globalstandardsconsumestime,expertise, trainingandknowledgetransferforharmoni- zationandstability.
Weurgescientificleadersandresearchers frombothacademiaandindustrytoreachoutto fundersandsponsorsregardingthisissue,so thatitwillbeacknowledgedbyfundingagen- ciesandstudysponsorstopromoteconver- genceofefforts.
Drug Discovery Today
FIGURE1
Fundamentaldatacurationpolicies,technicalinfrastructureandculturaladoptionarenecessarytohelppublic–privatepartnershipsdelivernewandbetter- structuredclinicalandomicsdatatosupportpersonalizedmedicine.
2 www.drugdiscoverytoday.com FeaturesPERSPECTIVE
Obstacle 4: lack of a proper data-
management plan
ADMPisvitalforvariousactivitiesinvolvingthe handlingofdataorinformationthatisoutlined intheprotocoltobecollected,analysedand shared.Italsocontainsdetailedinformation necessaryforthecuratorstounderstandthe datatobeprocessed.Yetalltoooften,DMPsare missingor,whenavailable,aretoothinlywritten tobootstrapthecurationprocess.
BecauseoftheessentialroleofaproperDMP inefficientandeffectivedatacuration,wehave madesomesuggestionsindetailinadedicated section‘Supportforeffectivecurationwithina data-managementplan’.
Obstacle 5: lack of metadata descriptors
AlthoughitrelatestotheDMP,thispointisworth separateattentionowingtoitslargepractical impactoncuration,especiallyforretrospective datasetsforwhichaDMPmightnotbeavailable.Assumingthatcuratorshaveovercomethepre- vioushurdlesandobtainedadataset,thevariables andtheirvaluesetsareoftensopoorlyannotated thatitisnearimpossibletodecipherthemwithout furtherinteractionswiththeprimaryresearchers, whoaretypicallypressedfortimeorunreachable.
Mostly,itfallsonthecuratorstounderstandthe pre-curated(‘straightfromthehose’)data.This affectstheefficiencyofcurationandfrequently resultsinerroneousinterpretations.
Westronglyrecommendthataproperdata dictionarybedevelopedtodocumentthemeta- dataofeachvariable(name,label,type,unit,value ranges,controlledvocabularies,dependencies, etc.).Thisinformationshouldbecollectedasearly aspossibleor,ifpossible,evenbeforetheproject starts:akeysteptomakedataFAIRbydesign.
Supportforeffectivecurationwithina data-managementplan
Themajorresearchfundershavenowevolved relatively(orsomewhat)homogenous approachestotheirDMPs.DMPtemplatesare availablefromnumeroussources,buttheUK DigitalCurationCentre[10]andELIXIRDMP- wizard(https://ds-wizard.org/)provide resourcesthatreflectbroadcurrentthinking.
DMPsarenowevolvingfromsimplecheckliststo documentscoveringtheentiredatacustody planfromcollectiontopublication,ensuring compliancewiththeformatsandannotation requirementsofrepositories(e.g.,CONSORT, http://www.consort-statement.org/),regulators (e.g.,CDISC,https://www.cdisc.org/standards/
therapeutic-areas)andEuropeanlaw(EUGen- eralDataProtectionRegulation).Keyperfor- manceindicatorsarebeingdevelopedto
determinethelevelofadherencetoprinciples suchastheFAIRprinciplesbothintheEU(e.g., FAIR-DOM.org)[11]andintheUnitedStates(US NationalInstitutesofHealthDCCPCKC1FAIR access)[1].Theseeffortsshareacommonob- jective:Toselectdatastandards,terminologies orontologiesthatwillbeimplementedinthe data-capturingphasewhereverpossible,lead- ingtoanideallimitof‘freeoffreetext’datasets.
Theupfrontdeclarationofstandards,theirname andversionthusallowstheavailabilityofkey provenancedescriptors,whichoughttobeas- sociatedtometadatacapturetemplates,which themselvesoughttobeidentified,licensedand versioned.ThisfollowedtheworkbyDietrich etal.onDMPs[12].Insupportofthesetasks, effortssuchasFAIRsharing[13]andCDISCShare arelaudableinitiatives.
However,inspiteofthisprogress,many instancesofDMPsseemtobetooweak,orevenan afterthought.DMPsneedtobepreparedatthe earlystagestoprovideameanstodeliverpro- spective,designdriven,compliantdata-collection blueprintsthataredigitalartefactsintheirown right,bothmachinereadableandactionable.To reachthisstage,itisessentialthatfrominception, datacontrollersanddataprocessors(including usersandstatisticians)shouldbecloselyinvolved, alongwithstudydesigners.Theyshouldascertain thattheprotocolsembedsufficientsafeguardsto accountforbiasanddocumenthowlurkingvari- ablesaredealtwith;thisiscrucial,andthelackof suchinformationinlegacyorpublicdatasetsis usuallyacausefortheirexclusioninmeta-analyses andreviews,astheuseofinsufficientlydescribed datasetscarriesariskofmisinterpretingsignals.
Somedatamodalitiespresentnewchal- lenges,suchasthehigh-dimensionaldatasets generatedbyomicstechnologies.Thesechal- lengesarebasedontheemergenceofnew platformandtechnologydescriptionsandsig- nal-processingmethodologies(e.g.,normaliza- tionandbatch-effectcorrections).Inaddition, thecomputationalworkflowsandinformation technology(IT)infrastructuresforstoringand executingthecomputationsareevolving,add- ingtothecomplexities.
TheDMPmustbeaugmentedwithdetailed data-accessconditionsandtermsofuse(suchas theDataUseOntologyGA4GHstandard(https://
www.ga4gh.org/news/data-use-ontology- approved-as-a-ga4gh-technical-standard)and theEuropeanGenome-phenomeArchiveterms [14])tobecompliantwithlegalframeworksand tosafeguardpatients’rightsandprivacy.Hence, dataprocessorsshouldprovide(iftheyhostthe data)technicalimplementationplanstohost (transfer),process,analyseandsharedataina
secureandscalablemanner.Theyoughtto makeavailableaclearpictureofthedataflow andprovidestandardoperatingproceduresfor dataaccess.ModelssuchasDataTagSuite (DATS)[15,16]andthedataaccesscomponents couldbefollowed.
TheIMIpublishedthe‘GuidelinesonFAIR DataManagementinHorizon2020’[17]witha FAIRDMPtemplatetoharmonizeandimprove curationactivities.Anothergoodresourceisthe
‘GoodClinicalDataManagementPractices’, whichaimstohelpimplementsustainable curationactivities[18].
Supportforeffectivecurationwithina curationcommunity
ButbeyondaDMP,whatwouldbenefitthe translation-researchcommunitymostisan open,precompetitivecurationcommunity.This oughttoincludeacademia,datamanagers, bioinformaticians,standardsdevelopment,IT developers,physicians,statisticiansandcura- tors.InthecaseoftheIMI,thisincludesEuro- peanFederationofPharmaceuticalIndustries andAssociations(EFPIA)partners,academiaand smallandmediumenterprises.Forthat,mem- bersoftheresearchcommunityshouldestablish andmaintainthefollowingkeypillars.
People and organizations
Buildupanetworkofpeoplewithdiverseex- pertise(domainknowledge,ITskills,algorithms, legalandethicalrequirementsandcoordina- tion)andcreateaforumforallcurationstake- holderstoexchangeideas,requirementsand resources.DespiteinitiativesliketheBiocuration SocietyandthePistoiaAlliance,datacreatedby thepharmaceuticalindustryarestilltreatedas tradesecrets,leadingtoalotofwheelrein- ventionineveryorganization.
Projects and ideas
Seekoutopportunitiestoapplyjointgrantsto improvecurationmethods,resources(e.g., metadataandreference-datarepositories)and infrastructures.
Data sets
Releasetrainingdatasetsofvariables,their valuesetandactual‘dirtydata’.Asmachine learningandartificialintelligence(AI)method- ologiesaregainingimportance,itiscrucialthat acommunityisbuiltaroundassemblingtraining datasetsthatcanestablishthebasisfortools gearedtobootstrapcurationefforts.Tothat effect,obtainingcommondataelementsfrom theNationalCancerInstitute(NCI)(https://www.
cancer.gov/),USNationalInstitutesofHealth,
www.drugdiscoverytoday.com 3 FeaturesPERSPECTIVE
FederalInteragencyTraumaticBrainInjuryRe- search(FITBIR)(https://fitbir.nih.gov/)andmany others,organizedbyclinicaldomain,wouldbe essentialtoseedanAIengine.
Training
Providetrainingnotonlytocurators,butalsoto principalinvestigators,clinicalresearchers,IT staffandlegalexpertstohelpthemunderstand theimportanceofcurationandotherpeople’s viewsandrolesinthewholedata-management processrelatedtocuration.
Supportforeffectivecurationwithin infrastructuredevelopments
Thankstoitswidespreaduseofgenomicsand high-throughputorhigh-resolutionimaging techniques,translationalresearchsitsfirmlyin therealmofbigdatascience.Effectiveand
efficientdatacurationrequiresnowmorethan everstrongsupportfrominfrastructure,refer- ence(meta)dataresources,andtoolssothat FAIRprinciplescanbeimplemented.Thegoals ofsuchinfrastructurearetoreducetheoverall timeofcurationatthesametimeasimproving thequalityoftheoutcome.Otherbenefitsin- cludeloweringthebarrierofcurationandbeing abletotraceandreproducecurationsteps.A moderncurationinfrastructurerequirescare andproperset-up,followingthestepwisepro- cessshowninFigure2.
Wewouldliketoemphasizethatmanyofthe functionalitiesmentionedherearealready availablepiecebypieceinexistingsoftware applications.Forexample,someofthecom- ponentsneededarealreadyprovidedbyon- tologysupporttools(inEurope,https://www.
ebi.ac.uk/spot/;intheUnitedStates,https://
bioportal.bioontology.org/),OpenRefine(http://
openrefine.org/)andmanycommunityversions ofcommercialtools,suchasPentaho-Kettle (https://github.com/pentaho/pentaho-kettle) andTalendOpenStudio(https://github.com/
Talend/tbd-studio-se).Whenacurationinfra- structureistobesetup,oneshouldreuseand integratetheexistingtoolsasmuchaspossible toavoidreinventingthesamesolutions.
Conclusion
Thisarticledeliversanhonestreviewofthe lessonslearntfromourexperiencesinsup- portingseveralIMIprojectswiththeirdata- curationneeds.Theoutcomeoftheseeffortshas resultedinthereuseofdatainseveralmulti- partyprojectswithinIMI[4–8]andbeyond.For example,theH2020-SYSCIDproject(http://
www.syscid.eu/)reusedthecurateddatarelated
Drug Discovery Today
FIGURE2
ToolssuchasanefficientITinfrastructureareneededforeffectiveandefficientdatacuration.Theupperexampleconversationisatypicalstartingpointfor researchers(ordatamanagersanddataanalysts)whotrytoperformdatacurationwithoutthepropertools.Thelowerpartisalistoffunctionalcomponentsfor anidealcurationinfrastructurethroughoutthelifecycleofdatacuration,frompre-curationtocurationproperandpost-curation.
4 www.drugdiscoverytoday.com FeaturesPERSPECTIVE
toautoimmunedisease,andtheNCER-PD project(https://parkinson.lu)reusedthecurated datarelatedtoParkinson’sdisease.Datacura- tionremainsthemaintechnicalchallengefor thereuseofdatainclinicalandtranslational research.
Cleaningandscrubbingdataisnotonlya time-consumingandtediousprocess,itisalso anexpensiveone.Yetprojectsstillconsentto thatexpenditure,eventhoughtheyareless inclinedtoagreetoamorerobust,‘front-loaded’
approachtodatamanagement.Indeed,possibly themosteffectivewaytowrestletheproblemis toformawell-structuredDMPatthestartthat coversallthestepsfromdatagenerationor capturetodataanalysis,withthehighestpos- sibleprecision,definitionsanddiscipline.The resources,rolesandresponsibilitiesofeach partyshouldbeclearlydefinedandplanned.
Necessaryinfrastructureshouldbedeployed and,ifneeded,developedtoaccelerateand improvethecurationactivities.Acuration communityisneededtosustainandcontinue developingknowledge,technologiesandex- pertise.Lastbutnotleast,awarenessofthe importanceofdatacurationandtheriskofalack ofplanningshouldbedisseminatedtothe clinicalandtranslationalresearchfield,so researcherscanavoidmakingthesamemistakes timeandtimeagain.
DeclarationofCompetingInterest Theauthorsdeclarethattheyhavenoknown competingfinancialinterestsorpersonal
relationshipsthatcouldhaveappearedtoin- fluencetheworkreportedinthispaper.
Acknowledgements
Wehighlyappreciatetheessentialcontributions oftheeTRIKSconsortiummembers:Dorina Bratfalean,AdrianoBarbosa-Silva,SergeEifes, FranciscoCapdevilaBonachela,IbrahimEmam, ManfredHendlich,PaulHouston,ChrisMarshall, PaulPeeters,KavitaRege,FabienRichard,Martin Romacker,AndreasTielmann,Michael
Braxenthaler,Susanna-AssuntaSansoneand ReinhardSchneider.Thisworkhasreceived supportfromtheIMIJointUndertakingunder grantagreementno.115446(eTRIKS),the resourcesofwhicharecomposedoffinancial contributionfromtheEU’sSeventhFramework Programme(FP7/2007–2013)andTheEuropean FederationofPharmaceuticalIndustriesand Associations(EFPIA)companies’in-kind contributions(www.imi.europa.eu).
References
1 Wilkinson,M.D.etal.(2016)TheFAIRGuidingPrinciples forscientificdatamanagementandstewardship.Sci.
Data3,160018
2 Rocca-Serra,P.etal.(2016)eTRIKSStandardsStarter PackRelease1.1April2016.Zenodo.http://dx.doi.org/
10.5281/zenodo.50825
3 Szalma,S.etal.(2010)Effectiveknowledge managementintranslationalmedicine.Brief.Bioinform.
8,68
4 Wheelock,C.E.etal.(2013)Applicationof’omics technologiestobiomarkerdiscoveryininflammatory lungdiseases.Eur.Respir.J.42,802–825
5 Gu,W.etal.(2019)Dataandknowledgemanagement intranslationalresearch:implementationoftheeTRIKS
platformfortheIMIOncoTrackconsortium.BMC Bioinform.20,164
6 Cope,A.P.etal.(2018)TheRA-MAPConsortium:a workingmodelforacademia–industrycollaboration.
Nat.Rev.Rheumatol.14,53–60
7 Bachelet,D.(2016)Occurrenceofanti-drugantibodies againstinterferon-betaandnatalizumabinmultiple sclerosis:Acollaborativecohortanalysis.PLoSOne11, e0162752
8 Hofmann-Apitius,M.etal.(2015)Bioinformaticsmining andmodelingmethodsfortheidentificationofdisease mechanismsinneurodegenerativedisorders.Int.J.
Mol.Sci.16,29179–29206
9 Marchetti,G.andJullian,N.(2017)eTRIKSFinalTraining Curriculum:DeliverableD6.6.eTRICKS
10 DCC(2013)ChecklistforaDataManagementPlan,v.4.0.
DigitalCurationCentre
11 Wolstencroft,K.etal.(2016)FAIRDOMHub:arepository andcollaborationenvironmentforsharingsystems biologyresearch.NucleicAcidsRes.45,D404–D407 12 Dietrich,D.etal.(2012)De-mystifyingthedata
managementrequirementsofresearchfunders.Issues Sci.Technol.Librariansh70.http://dx.doi.org/10.5062/
F44M92G2
13 Sansone,S.A.etal.(2019)FAIRsharingasacommunity approachtostandards,repositoriesandpolicies.Nat.
Biotechnol37,358–367
14 Lappalainen,I.etal.(2015)TheEuropeanGenome- phenomeArchiveofhumandataconsentedfor biomedicalresearch.Nat.Genet.47,692–695 15 Sansone,S.A.etal.(2017)DATS,thedatatagsuiteto
enablediscoverabilityofdatasets.Sci.Data4,170059 16 Alter,G.etal.(2020)TheDataTagsSuite(DATS)model
fordiscoveringdataaccessanduserequirements.
Gigascience9 giz165
17 EuropeanCommission(2016)H2020Programme:
GuidelinesonFAIRDataManagementinHorizon2020.
EuropeanCommission
18 SocietyforClinicalDataManagement(2020)Good ClinicalDataManagementPractices(GCDMP).Society forClinicalDataManagement
www.drugdiscoverytoday.com 5 FeaturesPERSPECTIVE