Contents lists available at ScienceDirect
Journal of Informetrics
journal homepage: www.elsevier.com/locate/joi
Regular article
DataCite as a novel bibliometric source: Coverage, strengths and limitations 夽
Nicolas Robinson-Garcia a,∗, Philippe Mongeon b, Wei Jeng c, Rodrigo Costas d,e
a INGENIO (CSIC-UPV), Universitat Politècnica de València, Spain
b École de bibliothéconomie et des sciences de l’information, Université de Montréal, Canada
c Department of Library and Information Science, National Taiwan University, Taiwan
d CWTS, Leiden University, The Netherlands
e Centre for Research on Evaluation, Science and Technology (CREST), Stellenbosch University, Private Bag X1, Matieland 7602, South Africa
Article info
Article history:
Received 6 March 2017
Received in revised form 17 July 2017
Accepted 17 July 2017
Available online 30 August 2017
Keywords:
Data sharing, Data citations, Bibliometric sources, Open data, Data infrastructure, Data metrics, DataCite
Abstract
This paper explores the characteristics of DataCite to determine its possibilities and potential as a new bibliometric data source to analyze the scholarly production of open data. Open science and the increasing data sharing requirements from governments, funding bodies, institutions and scientific journals have led to a pressing demand for the development of data metrics. As a very first step towards reliable data metrics, we need to better comprehend the limitations and caveats of the information provided by sources of open data. In this paper, we critically examine records downloaded from DataCite’s OAI API and elaborate a series of recommendations regarding the use of this source for bibliometric analyses of open data. We highlight issues related to metadata incompleteness, lack of standardization, and ambiguous definitions of several fields. Despite these limitations, we emphasize DataCite’s value and potential to become one of the main sources for data metrics development.
© 2017 Elsevier Ltd. All rights reserved.
1. Introduction
Calls for data availability and sharing can be traced back to the beginning of the 20th century when Galton stated: “I have begun to think that no one ought to publish biometric results, without lodging a well arranged and well bound manuscript copy of all his data, in some place where it should be accessible, under reasonable restrictions, to those who desire to verify his work” (Galton, 1901, as cited in Perneger, 2011). However, it has been just a few decades since technology has made possible the development of the necessary infrastructure to make this happen (Peng, 2011). In the last decade, public funding agencies, publishers and institutions have directed their efforts towards developing such infrastructure as well as towards incentivizing data sharing and reuse within the scientific community by promoting data citations (Robinson-García et al., 2015).
☆ “The peer review process of this paper was handled by Staša Milojević, Associate Editor of Journal of Informetrics.”
∗ Corresponding author.
E-mail addresses: [email protected] (N. Robinson-Garcia), [email protected] (P. Mongeon), [email protected] (W. Jeng), [email protected] (R. Costas).
http://dx.doi.org/10.1016/j.joi.2017.07.003
1751-1577/© 2017 Elsevier Ltd. All rights reserved.
Initiatives such as the launch of the Data Citation Index and the DataCite consortium are examples of efforts directed at promoting data citations. However, little is known about the production of data, field-specific practices, and other basic requirements such as the format a data record should have to facilitate information retrieval and bibliometric analyses.
Previous studies focusing on Thomson Reuters’ Data Citation Index (now Clarivate Analytics) have explored disciplinary biases and data types included (Torres-Salinas, Martín-Martín, & Fuente-Gutiérrez, 2014), data citation practices between fields (Robinson-García et al., 2015), and the relation between data citations and data mentions in social media (Peters, Kraker, Lex, Gumpenberger, & Gorraiz, 2016).
In a recent report, Costas et al. (2013) highlighted the need for developing data publication standards, reducing the dispersion of data repositories, and facilitating the traceability, citation and measurement of data records. The most comprehensive source for open data currently available is DataCite, which contains more than 7 million freely accessible records, almost doubling the figures last reported for the Data Citation Index (Peters et al., 2016).
In line with the open science movement and calls for increased data sharing and reuse, we highlight the importance of data publications and citations. This paper analyzes the structure and type of metadata offered by DataCite to assess its potential to become an important source for developing data-level metrics. DataCite is an international non-profit organization formed in 2009. It is a consortium of public research institutions, funding bodies and publishers worldwide whose mission is to promote open research data accessibility and tracking. For the latter, DataCite advocates for the use of Digital Object Identifiers (DOI) by assigning DOIs to their records (DataCite Metadata Working Group, 2015).
2. Objectives
This paper aims to explore the characteristics of the data collected by DataCite to determine its potential as a new source of bibliometric data for the study of open data production. Specifically, we examine the database structure and the level of standardization of the information provided in each field, to assess the usability of the data for bibliometric purposes.
The paper is structured as follows. Firstly, we present the metadata scheme of DataCite records (2015). Then we assess the completeness of the data in each specific field and give an overview of the database coverage. Finally, we discuss the potential of DataCite as a source for tracking open data production, and we provide some recommendations for its use as a tool for studying data production and citation patterns.
3. Data and methods
This section is structured in three parts. The first describes the different points of access made available by DataCite and the advantages and limitations of using one or the other. Second, we compile and describe the information provided by DataCite as to its structure, the definition of data record fields, and the information requested from each repository. The aim is to give the reader a full account of what DataCite expects to receive from each data repository and how this information is expected to be presented to the final user. The last part describes the dataset downloaded from DataCite’s public OAI API. The information retrieved and its structure are compared with the information provided in the first subsection.
3.1. Points of access to DataCite
DataCite provides two APIs to the public for downloading records indexed in its database. These two points of access contain the same number of records but differ in the structure in which they are presented as well as in the detail of information provided.
DataCite Metadata Store (https://oai.datacite.org/). The DataCite Metadata Store (MDS) is a service to manage activities related to Digital Object Identifier (DOI) registration at DataCite. The MDS is used to create, register, store and manage DOIs and associated dataset metadata created by DataCite’s users and members. Here we are presented with raw data as provided by DataCite’s members, which have not yet been processed by DataCite.
DataCite REST API (https://api.datacite.org/). The DataCite REST API includes the same contents as the DataCite Metadata Store but with added layers of information by record. The DataCite team adds to each record information regarding funding, ORCIDs and citations that is not provided by the data centers themselves.
As well as these two points of access, DataCite allows bulk queries via two additional URLs: search SOLR (https://search.datacite.org/ui) and search (https://search.datacite.org/). In this paper, we have used the DataCite Metadata Store to retrieve all records from DataCite. Throughout the rest of the paper all references made to DataCite’s metadata structure are based on such information.
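As an illustration of what harvesting through an OAI-PMH point of access involves, the following is a minimal Python sketch, not the procedure actually used in this study. It relies only on features defined by the OAI-PMH protocol itself (the `ListRecords` verb, the `oai_dc` metadata prefix and the `resumptionToken` paging mechanism). The function names, the dictionary layout and the `base_url` argument are assumptions made for the example; the exact endpoint path must be supplied by the user.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def parse_oai_page(xml_text):
    """Parse one ListRecords response into (records, resumption_token)."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI_NS + "record"):
        header = rec.find(OAI_NS + "header")
        fields = {}
        # Collect every non-empty Dublin Core element, keeping repeated values.
        for elem in rec.iter():
            if elem.tag.startswith(DC_NS) and elem.text and elem.text.strip():
                name = elem.tag[len(DC_NS):]
                fields.setdefault(name, []).append(elem.text.strip())
        records.append({
            "oai_id": header.findtext(OAI_NS + "identifier"),
            "set_specs": [s.text for s in header.findall(OAI_NS + "setSpec")],
            "fields": fields,
        })
    token = root.findtext(f".//{OAI_NS}resumptionToken")
    return records, (token.strip() if token and token.strip() else None)


def harvest(base_url, metadata_prefix="oai_dc"):
    """Yield all records, following resumption tokens (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        with urlopen(base_url + "?" + urlencode(params)) as response:
            records, token = parse_oai_page(response.read())
        yield from records
        if token is None:
            break
        # Per OAI-PMH, a resumed request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token}
```

Keeping the parsing step separate from the network loop makes it possible to test the parser on stored responses and to re-parse a saved harvest without querying the service again.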
3.2. DataCite metadata scheme v.3.1
In April 2016, we retrieved all records from DataCite using their public OAI API (https://oai.datacite.org). DataCite provides a metadata scheme which shows the record structure and defines each field (DataCite Metadata Working Group, 2015). Note that although a 4.0 version of the metadata scheme has recently been implemented, in this paper we refer to version 3.1 as it was the schema in place at the time of data collection. This version includes mandatory, recommended and optional fields. In the following sections, we briefly describe the main fields retrieved from the DataCite Metadata Store.
3.2.1. Mandatory fields
Identifier. While in principle DataCite encourages and promotes the use of DOI numbers, it also allows the inclusion of other unique identifiers (e.g. URN, CCDC, InChIKey, URL).
Creator. This field includes the name, surname or affiliation name of the creators of the data records. It would be equivalent to the author field of bibliographic records.
Title. The name by which the resource is known. Sometimes it also includes a subtitle as a sub-field.
Publisher. DataCite defines publisher as “[t]he name of the entity that holds, archives, publishes prints, distributes, releases, issues, or produces the resource” (DataCite, 2015). In current practice, this definition is open to different interpretations, as these functions may be performed by different actors. Hence, it can result in ambiguity about the type of entities assigned as publisher, namely individual authors, institutions, or individual data repositories. We discuss this limitation in subsection 4.2.
PublicationYear. The year in which the data record was made publicly available, which may differ from the year of its creation. DataCite’s documentation acknowledges that this can be problematic in certain cases, leaving it up to the user depositing the data to choose their preferred date for citation purposes.
3.2.2. Recommended fields
Subject. This is a free text field that can include keywords, classification codes, subjects, or key phrases. It includes as a subfield the subject scheme used, if any, with a link to the subject scheme.
Contributor. This field includes the institutions and individuals involved in the collection, management, distribution or other types of contributions to the production of the data. It includes as a subfield the type of contribution (e.g., contact person, data collector, etc.).
Date. Due to the potential ambiguity of the publication year, this field allows more than one date that may be relevant for the user to be specified, such as data availability, collection, publication, etc.
ResourceType. Here, a two-level classification of data types is introduced. While the top level is a closed list of 15 data types, the second level classification is a free text field.
RelatedIdentifier. This field contains identifiers different from the DOI.
Description. This is a structured field. If used, free text can be entered but the type of content (abstract, methods, series information, table of contents, and other) must be specified.
GeoLocation. Includes the geographical location in which the data presented was collected.
3.3. General description of the retrieved database
Data were parsed and organized into an SQL database. A total of 7,440,415 records were retrieved. The API does not provide the recommended GeoLocation field. This field was included in September 2016. The API provides five optional fields: Relation, Format, Language, and Rights. Furthermore, the fields Identifier and RelatedIdentifier and the fields PublicationYear and Date are combined into two fields (Identifier and Date). Additionally, it indicates the DataCenter providing the records to DataCite. 762 organizations were included as data centers at the time of the download. These organizations have contracted with an individual DataCite member to assign DOIs. Appendix A includes a detailed description of each field retrieved and the information they contain.
Fig. 1 shows the share of records in DataCite with information in each of the fields described in Appendix A. We see that many records contain empty fields (even mandatory ones). A total of 1,092,131 records (14.7% of all records collected) include no data at all. This appears to be caused by modifications made by DataCite in the data structure. More specifically, DataCite employs the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and assigns an OAI id to each record. It appears that when a record needs to be modified, a new record is created with the updated information. The information in the old record is deleted (except for the OAI id and the data center information), but not the record itself. Fig. 2 shows an example of an empty record. This is an important element to consider when working with DataCite’s API as these records should be removed from the sample.
Fig. 1. Distribution of metadata information by fields.
Fig. 2. Example of an empty record retrieved from DataCite’s API.
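The removal of empty records described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in the study: each record is assumed to be a dictionary with an `oai_id` and a `fields` mapping from field name to a list of string values, and a record counts as empty when no field holds any non-blank value. Both the layout and the function names are hypothetical.

```python
def is_empty_record(record):
    """True when the record carries no metadata value besides its header."""
    fields = record.get("fields") or {}
    return not any(value.strip() for values in fields.values() for value in values)


def drop_empty(records):
    """Return the non-empty records and the number of records removed."""
    kept = [r for r in records if not is_empty_record(r)]
    return kept, len(records) - len(kept)


# Consistency check on the figures reported in the text:
# 7,440,415 retrieved records minus 1,092,131 empty ones.
remaining = 7_440_415 - 1_092_131
empty_share = 1_092_131 / 7_440_415
```

With the reported counts, `remaining` equals the 6,348,284 records analyzed in the rest of the section, and `empty_share` rounds to the 14.7% given in the text.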
When focusing on the records that do include information (6,348,284 records), we still find that 1306 records (0.02%) do not include a title or publisher information. Resource type and language are reported in 60% and 51% of the records, respectively. The contributor (18%) and relation (25%) fields have the lowest presence in DataCite records.
4. Results
In this section, we report our findings regarding the content of each field and the level of standardization of the data. First, we present descriptive statistics on different types of data records. Then we analyze the geographical distribution of data centers and the number of records by country. We also analyze the publisher field to disentangle the different types of entities it contains. We also present an overview of the different types of dates included in the database. Finally, we focus on the description of the relation field, which contains DOIs of related records, trying to understand the type(s) of linkages captured by DataCite.
4.1. Resource types
The ResourceType field presents a controlled list of 15 values, complemented by a free-text subtype. Table 1 reports the total number of records by resource type and the three most common subtypes. We observe that 42% of the records are categorized as datasets, followed by text (18%), image (14%), and collection (7%). As observed in Table 1, most records with a ResourceType ‘text’ are manuscripts, conference papers or journal articles. Records tagged as images are heterogeneous, ranging from academic posters to historical manuscripts or data figures. The subtype is not mandatory and is thus empty in many records. For instance, only 4.3%, 6% and 6% of records with the resource type “Model”, “Sound” and “Film”, respectively, have a subtype. Overall, we find 158,781 different variations of resource subtypes, a natural consequence of it being a free-text field, but one which reflects different understandings of what data is and of what each of the 15 data types includes.
Table 1
Records by resource type and share of the top 3 most common subtypes in DataCite. Subtypes appearing in more than one data type category are shown in bold italics.
Resource type          N            %        Most frequent subtypes
Dataset                1,867,627    41.69    Dataset (63.5%), Metadata (5.8%), Data package (4.1%)
Text                   786,882      17.56    Conference papers (15.5%), Journal articles (15.4%), Report (10.1%)
Image                  641,404      14.32    Image (11.9%), Figure (11.2%), Plate (8.1%)
Collection             303,638      6.78     Collection (20.7%), Gaussian job archive (9.1%), Report (4.7%)
Software               1234         0.03     Simulation tool (16.9%), Software (10.8%), Code (5.3%)
Audiovisual            4470         0.10     Audiovisual (43.8%), Media (23.9%), Teaching material (8.5%)
Film                   960          0.02     Experiment (5.4%), Video (0.4%), Animation (0.1%)
PhysicalObject         587          0.01     Archival object (63.9%), HIAPER-HAIS airborne sensor (2.4%), Physical object (0.9%)
Event                  508          0.01     Conference presentation (73.4%), Presentation (9.6%), Event (1.6%)
Model                  470          0.01     Model (2.8%), Ontology (0.9%), Shapefiles (0.2%)
InteractiveResources   287          0.01     Interactive resources (12.2%), Learning object (2.1%), Sites Web (0.3%)
Sound                  234          0.01     Recording, oral (4.3%), Sound (0.4%), Conference (0.4%)
Workflow               209          <0.01    Taverna 2 workflow (7.2%), Workflow (1.0%), RapidMiner workflow (0.5%)
Service                18           <0.01    Service (88.9%), S-map (5.6%), Data provider (5.6%)
Other                  871,549      19.45    Datasheet (98.2%), Oceanographic cruise (0.7%), Field expedition (0.7%)
Total                  4,480,077    100
We also observe classification redundancies between the two levels. For example, the resource type “dataset” has a subtype also called “dataset”. There are also redundant subtypes between different resource types. For example, the subtype “report” appears as a subtype of both the resource types “collection” and “text”. A particularly problematic case is the Resource Type “other”, for which 98.2% of the records have a subtype labeled as “Datasheet”. This suggests that these records could perhaps be considered as datasets. Taking a closer look at these records, we found that they were all derived from the same repository, Data-Planet. Actually, all records from Data-Planet are classified as “Datasheet”. This variability in the distribution of records may reflect some inconsistencies in the way data centers classify records according to the scheme proposed by DataCite.
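The two kinds of redundancy noted above (a subtype that repeats its own resource type, and a subtype shared by several resource types) can be detected mechanically. The following Python sketch is illustrative only: it assumes the records have been reduced to (resource type, subtype) pairs, compares labels case-insensitively, and uses hypothetical function names.

```python
from collections import defaultdict


def shared_subtypes(pairs):
    """Map each subtype used under more than one resource type to those types."""
    types_by_subtype = defaultdict(set)
    for resource_type, subtype in pairs:
        if subtype and subtype.strip():
            key = subtype.strip().lower()
            types_by_subtype[key].add(resource_type.strip().lower())
    return {s: sorted(t) for s, t in types_by_subtype.items() if len(t) > 1}


def self_named_subtypes(pairs):
    """Return the resource types that have a subtype with the same label."""
    return sorted({
        resource_type.strip().lower()
        for resource_type, subtype in pairs
        if subtype and subtype.strip().lower() == resource_type.strip().lower()
    })
```

Exact matching after lowercasing is a deliberately conservative choice. Near-duplicates such as singular and plural forms would need an additional normalization step that is not shown here.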
From now on, we will use “data records” to refer to all those records in DataCite that have a resource type other than “text” (i.e. we consider as data-related records all records that are not articles such as manuscripts or pre-prints).
4.2. The geographic distribution of data infrastructures
In this section, we focus on the data providers and the countries in which they are based, to provide insights on how data infrastructures are being developed in different countries. DataCite provides a closed list of 762 institutions from which records are retrieved. The distribution of records across these data centers is uneven: 15 (2%) data centers account for more than 80% of all records. Fig. 3 shows the distribution of records by resource type (excluding the records where this field is empty) for the 20 data centers that provided the most records.
The data highlight the variety of institutions providing data: from thematic data repositories (Data-Planet, PANGAEA, Digital Science), to scientific social platforms (ResearchGate) or universities (Imperial College London, ETH Zürich). Data-Planet is the largest data center in DataCite, providing 20% of all the records. As mentioned before, all records provided by Data-Planet are “datasheets”. Also, some data centers (ResearchGate, E-Periodica, Universität Zürich, Zora, and ETH E-Collection) provide only “text” records.
In Table 2 we assign each data center to its country. This information was retrieved from DataCite Statistics (https://stats.datacite.org/). It is important to note that the classification was based on the location of their headquarters, and that some data centers were associated with more than one country if they have headquarters in different countries. The country distribution in Table 2 does not reflect the affiliation of data creators nor the geographic origin of the data, but provides an overview of countries contributing towards the development of an open data infrastructure. We find that the distribution of records by country is very skewed: the United States, Germany and the United Kingdom account for 82% of the total records. The distribution of resource types also differs by country. For instance, almost 100% of records coming from Estonia, Denmark, and Canada are data records, while this proportion is much smaller in other countries such as Hungary (0.8%), Italy (4.2%), Ireland (16.1%), Australia (19.6%), and Germany (26.5%). Moreover, no data records were found in data centers based in Austria, Russia, Iran, South Korea, Liechtenstein, Slovenia, and Japan.
The second source of information relating to open data providers is obtained from the publisher field. It is a non-standardized free-text field in which we found 118,136 different names. The distribution of records is highly skewed, hence by manually disambiguating the most common 1148 publishers we managed to cover about 90% of all the records that include publisher information.
Fig. 3. Top 20 data centers by data types.
Table 2
Number of data centers, number of records and share of records after excluding records labeled as data type “text”, by country. Countries are ordered by total number of records.
Country          Data centers   # records   % data records (a)
USA              217            2952086     58.6%
Germany          185            1795638     26.5%
UK               66             1382661     49.9%
Switzerland      48             1120868     32.1%
Estonia          6              489896      99.5%
Denmark          5              138640      98.0%
Canada           24             85984       93.5%
Thailand         1              61529       87.9%
Italy            35             50350       4.2%
Netherlands      16             49900       80.8%
Austria          7              36450       0.0%
Australia        41             24122       19.6%
Ireland          3              23181       16.1%
France           32             13093       48.7%
New Zealand      2              3081        39.4%
Sweden           6              2835        97.9%
Unknown          8              2722        1.8%
Hungary          37             1809        0.8%
Poland           4              1713        1.3%
Russia           3              1388        0.0%
Iran             2              1292        0.0%
Romania          3              1032        47.2%
China            2              703         31.2%
Czech Republic   1              470         100.0%
South Korea      1              188         0.0%
Belgium          1              106         79.2%
South Africa     1              105         93.3%
Liechtenstein    1              56          0.0%
Ghana            1              53          98.1%
Spain            2              37          100.0%
Slovenia         1              18          0.0%
Japan            1              15          0.0%
Tanzania         1              10          90.0%
Uruguay          1              1           100.0%
(a) Data records are defined as all data types excluding text.
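The trade-off behind the manual disambiguation of the publisher field (how many of the most frequent name variants must be handled to reach a given share of records) can be estimated before any manual work starts. The sketch below is a hypothetical helper, not the authors' procedure: it takes the raw publisher strings, one per record, and returns the number of top-ranked distinct names needed to cover a target share.

```python
from collections import Counter


def names_needed_for_coverage(publisher_names, target_share):
    """Number of most frequent distinct names covering target_share of records."""
    counts = Counter(name for name in publisher_names if name)
    total = sum(counts.values())
    covered = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= target_share:
            return rank
    # Reached only when there are no names at all (returns 0).
    return len(counts)
```

In a highly skewed distribution such as the one reported here, this number is a small fraction of the distinct names (1148 out of 118,136 for roughly 90% coverage in this study), which is what makes manual cleaning feasible.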
For each of these 1148 publishers, we assigned two variables: country and type of entity. The country information was retrieved from the publishers’ websites and corresponds to the country where the publisher is located (as with data centers, multiple countries can be assigned to a single publisher). Fig. 4 presents the number of records for each country. Only records including resource type and publisher information are represented (3,704,161 records). While the distribution of records by country is similar using either the data center or publisher information, there are notable differences. We find that the number of countries contributing to DataCite is lower when using the publisher information than when using data center location. For example, no record would be assigned to Estonia, Thailand or Ireland using this method. However, they occupy the third, eighth and twelfth positions respectively when using the data center. At the other extreme, Italy, Belgium and Spain are clearly underrepresented according to data centers’ location.
Fig. 4. Total number of data records (excluding data type “text”) by country using data center and publisher affiliation data. Y-axes are logarithmic. Countries are ordered according to the total number of records using the data center affiliation.
Fig. 5. Number of records and share of data records (after excluding text) by type of publisher. Only records with publisher information and data type are shown.
We also divided the publishers into 11 types of entity to better comprehend what users understand as “data publisher”, but also to identify the different types of institutions that publish data products. We distinguish four types of repositories (i.e., national, institutional, disciplinary, and multidisciplinary repositories), and the other entities are diverse groups (research body, professional body, and educational body), publishers, firms, conferences and individuals. Appendix B provides more details on this classification.
As shown in Fig. 5, a total of 156 distinct entities are identified from the 1148 name variants disambiguated from the publisher field. Most of the records were assigned to 18 thematic repositories (43%). Among the 156 entities, 35 are institutional repositories, followed by 33 research bodies (e.g., research centers and scientific associations), and 24 academic publishers (journals). In second and third place but with a substantially lower proportion of data records, we find institutional repositories (17%) and research bodies (15%). The proportion of data records varies substantially by publisher type. While 89% of records included in multidisciplinary repositories are data records, none of the records published by professional bodies, conferences and authors are data records. These results reflect the conceptual problem still existing on the meaning that “publishing” has in the data production model (Costas et al., 2013) or, at the very least, the effect of the diversity of records included in DataCite.
Fig. 6. Number of records per year using the publication year in DataCite. 1950–2020 period.
4.3. Publication year and related dates
Publication year is a key field in any bibliometric analysis intending to provide a longitudinal perspective or to frame the study period(s). DataCite requires the publication year to be presented in a four-digit format. However, an important point to consider for the development of data metrics is that data records can be subject to different actions occurring on different dates, all of which may be included in the metadata. Thus, DataCite (2015) has two date-related fields: publication year and date. The publication year field is a mandatory field that the DataCite Metadata Working Group (2015) defines as “the year when the data was or will be made publicly available”. Still, DataCite acknowledges that this information may be unclear or unavailable, providing alternatives such as “[if] that date cannot be determined, use the date of registration” or “[i]f an embargo period has been in effect, use the date when the embargo period ends”. It concludes that “[i]f there is no standard publication year value, use the date that would be preferred from a citation perspective”.
The date field is an optional free-text field that can refer to different dates relevant to the record. These can be related to the date when the dataset was created, uploaded to a repository, made publicly available, updated, etc. Thus, when information is provided in the date field, one of the following 9 subtypes is required: accepted, available, copyrighted, collected, created, issued, submitted, updated and valid.
As mentioned before and presented in Appendix A, the field “date” retrieved from the DataCite Metadata Store OAI API combines both the publication year and date in a single field. Hence the distinctions discussed above are not available. This means that multiple dates may be assigned to a single record and that the publication year field can only be distinguished from the date field when the latter is not in a four-digit format. Therefore, the date information retrieved with the API must be processed before use. In this study, we define “publication year” as a date presented in a four-digit format. We identified 4,242,804 data records with this format. This cleaning process is not completely accurate, as a total of 50,679 records reported publication years above 2099 or from the early 1000s and were thus not considered.1 Fig. 6 shows the number of records for the 1950–2020 period. We observe many records dated from 2016 onwards, which is due to the embargoes that restrict them.
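The rule described above (treat a four-digit value in the merged date field as the publication year and discard implausible years) can be expressed as a short Python sketch. It is an illustration under stated assumptions, not the cleaning script used in the study: the function name is hypothetical, the merged field is assumed to arrive as a list of strings, and the lower bound `min_year=1100` is an arbitrary stand-in for “the early 1000s”, which the text does not define precisely.

```python
import re

FOUR_DIGITS = re.compile(r"[0-9]{4}")


def publication_year(date_values, min_year=1100, max_year=2099):
    """Return the first plausible four-digit year among the values, else None."""
    for value in date_values:
        value = value.strip()
        # Only values that consist of exactly four digits count as a year;
        # full dates such as 2014-05-01 belong to the Date field proper.
        if FOUR_DIGITS.fullmatch(value):
            year = int(value)
            if min_year <= year <= max_year:
                return year
    return None
```

A record whose date field holds only full dates is left without a publication year under this rule, which is one source of the information loss discussed in the text.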
The fact that there is no clear definition for the publication year field may lead to some discrepancies in the data. This is especially meaningful in the case of historical data where the user could choose to indicate the date of the historic record or the date of its retrieval. Fig. 7 provides the example of a digitized photograph which had already been published in its physical form. Here, the publication year field contains the value 1929, which is in fact the date when the photograph was taken.
Regarding records including additional dates, we identified 2,095,183 records, of which 43% reported the availability date, 25% the date of creation, 14% the collection date, 12% an update date and 3% an issue date. Less than 0.2% of the records reported the date of copyright, submission, validity or acceptance.
1 Although there are cases of data records dating from the early 1000s, e.g., digitized archival objects.
Fig. 7. Example of a record with a date that predates the development of data repositories. A. Contents of a photograph taken in 1929. B. Data record in DataCite. The date of publication of the record is 1929.
4.4. Related DOI numbers
The OAI DataCite API also provides a field named relation, which is equivalent to the RelatedIdentifier field in the DataCite Metadata Schema. The main difference is that here we retrieve only the information provided by the data centers, while the RelatedIdentifier field retrieved from the REST API includes additional relations provided by the DataCite team. It contains identifiers for publications (e.g., DOIs, arXiv, bibcode, handles; not necessarily in DataCite). As all records in DataCite include a DOI number along with other associated identifiers, we crossed related DOI numbers with: 1) the DataCite database itself, to find potential relations among data records within DataCite; and 2) the Web of Science, to identify potential relations with scientific publications. As shown in Fig. 8A, 23% of all DataCite records include related DOIs. The number of related DOI numbers by record varies greatly, showing a highly skewed distribution (Fig. 8B). Fig. 8C crosses DataCite related DOIs with DataCite records, with DataCite records defined as datasets, and with Web of Science records. Less than 25% of the related DOI numbers belong to other DataCite records. Approximately 15% belong to articles indexed in the Web of Science (Fig. 8C). When we focus on the data type of related DOIs contained in DataCite (Fig. 8D), we observe that 90% of these are datasets.
After a cursory check of some of these cases, we observe that occasionally the relation is formed by a container data record (i.e., a database) and its tables (i.e., datasets). For example, the database http://dx.doi.org/10.15468/dl.qnbifh included, at the time of the data collection, 5192 related datasets. This partially explains the skewed distribution observed in Fig. 8B. In other cases, the relation indicates data (re)use by linking the data with a paper. However, this field does not seem to contain the DOI of articles citing the data record, and we find no evident criteria for characterizing the types of relations reported in this field.
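Crossing related DOI numbers with another set of records, as described above, requires the DOIs to be written in a comparable form first. The following Python sketch illustrates one possible approach and is not the matching procedure used in the study. It strips common resolver prefixes, lowercases the result (DOI names are case-insensitive) and computes the share of related DOIs found in a reference set. The function names and the list of prefixes handled are assumptions.

```python
import re

DOI_PREFIX = re.compile(r"^(https?://(dx\.)?doi\.org/|doi:)", re.IGNORECASE)


def normalize_doi(value):
    """Return a lowercase bare DOI, or None when the value is not a DOI."""
    doi = DOI_PREFIX.sub("", value.strip()).lower()
    return doi if doi.startswith("10.") else None


def match_share(related_identifiers, known_dois):
    """Share of related DOIs that also appear in known_dois."""
    related = [d for d in map(normalize_doi, related_identifiers) if d]
    if not related:
        return 0.0
    known = {d for d in map(normalize_doi, known_dois) if d}
    return sum(d in known for d in related) / len(related)
```

Non-DOI identifiers in the relation field (for example arXiv identifiers or handles) are dropped by `normalize_doi`, so the share is computed over related DOIs only, in line with the analysis reported here.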
Interestingly, Robinson-García, Jiménez-Contreras, and Torres-Salinas (2016) reported a similar type of relations also recorded in the Thomson Reuters’ Data Citation Index, although in that case, only relations between datasets and scientific papers were included. However, they found that the reporting of these relations was repository-dependent, that is, depending on the repository we would find records with relations or not. In DataCite there is evidence suggesting that such a dependency also exists, in this case with data centers: only 226 (30%) data centers reported at least one data record with a related DOI number, and 44 (5%) of them reported related DOI numbers in all their records (see Fig. 9).
Fig. 8. Analysis of the relation field in DataCite. A. Share of records in DataCite with related DOI numbers within DataCite records. B. Distribution of the number of related DOI numbers by data record. C. Share of related DOI numbers included in DataCite by their data type. D. Share of related DOI numbers indexed in DataCite, indexed in DataCite and with data type information, and indexed in Web of Science.
5. Concluding remarks and recommendations
The research on data sharing and open data is growing, while at the same time funding bodies are encouraging greater research transparency. Terms like data-driven science, data-intensive science, and open science are becoming more and more common in policy documents and statements such as the European Union’s Horizon 2020 (European Commission, 2016). In this context, DataCite is called to play an important role as a source for the analysis and study of data publication and reuse. While the demand for data metrics has been a constant since the beginning of the 2010s (Costas et al., 2013), there is still a long way to go until the movement expands to broader fields of science and to more countries.
This paper presents the first large-scale data collection and analysis of DataCite to assess its potential as a bibliometric tool able to provide information and metrics about open data activities at a macro-scale. Compared with other similar products such as the Data Citation Index, the size and richness of DataCite data offer greater possibilities as a bibliometric source for developing open data metrics. Still, this richness of data comes at a price. Conceptual problems, such as what data is or to which scientific field or discipline different datasets belong, along with technical problems such as the lack of standardization of many of its fields, may still give an advantage to the Data Citation Index, whose field structure follows to some extent the structure of bibliographic records. This is an advantage for the Data Citation Index because it allows bibliometric analyses without prior processing (e.g., Robinson-Garcia et al., 2016). However, this analytical simplicity of the Data Citation Index overlooks some of the key issues found when exploring the nature and heterogeneity of open data. As shown in this paper, the metadata of DataCite records is very rich and heterogeneous. Here we describe some of the important issues that need to be considered when using DataCite as a source of data for open data analytics.
5.1. Central issues regarding the metadata provided by DataCite
5.1.1. Data types and the definition of “data”
An important element that needs to be considered when working with DataCite is that not all records included in the database are strictly data-related. For example, more than 12% of the valid records in DataCite are text or articles. Therefore, in order to properly identify and analyze the production of data, diverse filters need to be applied by types of data. However, we have highlighted the important diversity of data types included in DataCite. In a way, the many types of data covered in DataCite suggest that a broader understanding of what constitutes research data is very necessary. In fact, the presence of multiple data-related types such as “image”, “collection” or “software” reinforces the idea that we need to stop considering “data” as a homogeneous publication type.
Fig. 9. Share of records with related DOI numbers assigned to them. Blue represents records with related DOI numbers. Grey represents records with no related DOI numbers reported.
5.1.2. DataCite metadata fields
The DataCite schema is closely aligned with Dublin Core, which allows interoperability between different platforms and record types as well as ensuring minimum levels of quality of author-generated metadata (Greenberg, Pattuelli, Parsia, & Robertson, 2002). However, the simplicity of the model (Lagoze, 2001) leaves room for ambiguity in many of the fields required in order to develop any type of bibliometric analysis. We found that a major issue in DataCite is that a lot of records are missing information in many of the fields (even mandatory ones). In addition, making some of the recommended fields mandatory (e.g., the subject, the institutional affiliation of the creator) would enhance DataCite’s potential for bibliometric analyses. It would also be useful to make mandatory a “type of relation” subfield for the “Relation” field which is one of the
are critical issues regarding the structure and cleanliness of DataCite records that would need to be addressed to improve its usability. In any case, the conclusions drawn here are based on the DataCite Metadata Store and do not consider any improved functionalities available through the DataCite REST API. In this sense, the advantages and limitations of using different points of access should be made clearer so that users can choose one or the other depending on the analysis they wish to conduct.
Inthissense,potentialusersofDataCiteshouldconsiderthefollowingissues:First,emptyrecordsshouldberemoved beforeattemptingtomakeanystatementregardingtheactualdatacontainedbyDataCite.Asnotedinsubsection‘General descriptionoftheretrieveddatabase’,over1millionrecordswerefoundemptyatthetimeoftheretrievalofthedata.The non-removaloftheserecordsmaymisleadthecountsoftheactualsizeofthedatabase.
Second,issuesrelatedtodatacompletenessreducetheanalyzabledatasetasmorefiltersareusedtoretrieverecords.
Forexample,tofocusononlydata-relatedrecords(e.g.datasets)itisnecessarytofilterbyResourceType.However,this fieldisemptyforasubstantialamount(40%)ofrecords.Inaddition,theDataCiteMetadataStorecontainsawidevariety of“resourcetypes”.Thus,usersmustdecidebeforehandwhichdatatypesarerelevantfortheanalysisandunderstandthe potentiallossesofinformationthatthefilterswillimpose.
Third, a considerable amount of data processing and cleaning will most likely be needed, as most fields are not standardized. Furthermore, the fact that some fields are merged (e.g. publication date and date) makes it necessary to process and clean the data before analyzing it.
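As an illustration of the kind of cleaning involved, the sketch below separates bare publication years from typed dates in a merged date field. It assumes the merged field arrives as a list of strings shaped like the two forms described in Appendix A (a four-digit year, or a date type followed by a colon and a value). The function name `split_date_field` and the regular expressions are assumptions made for this example.

```python
import re

YEAR_ONLY = re.compile(r"^\d{4}$")
TYPED_DATE = re.compile(r"^(?P<type>[A-Za-z]+)\s*:\s*(?P<value>.+)$")
ANY_YEAR = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def split_date_field(values):
    """Split a merged date field into publication years, typed dates and leftovers."""
    publication_years, typed_dates, unparsed = [], [], []
    for raw in values:
        text = raw.strip()
        if YEAR_ONLY.match(text):
            publication_years.append(int(text))
            continue
        typed = TYPED_DATE.match(text)
        if typed is None:
            unparsed.append(text)  # keep for manual inspection instead of guessing
            continue
        value = typed.group("value").strip()
        year = ANY_YEAR.search(value)
        typed_dates.append(
            (typed.group("type"), value, int(year.group(1)) if year else None)
        )
    return publication_years, typed_dates, unparsed


years, dates, unparsed = split_date_field(
    ["2005", "Available: 01/2/2005", "Created: 2004-11-30", "n.d."]
)
print(years)
print(dates)
print(unparsed)
```

Values that match neither pattern are returned separately rather than silently dropped, so the share of unparseable dates can itself be reported.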
Finally, an important issue for the potential usability of the database for metric purposes is the lack of standardization of many metadata fields. Having many free-text fields (e.g. Publication year, publisher, creator) makes data retrieval more arduous and makes it necessary to disambiguate the data. Simply imposing a standard format for certain fields such as the creator field, or including a closed list for the ResourceType field and subfield or for the subject field, would greatly improve the quality of the data and facilitate its analysis.
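To give a sense of what disambiguating a free-text creator field involves, the following naive sketch maps a few common name layouts to a "Surname, Initials" key. The function `normalize_creator` is hypothetical, and the heuristic (last token is the surname when no comma is present) is an assumption that fails for compound surnames and institutional creators, so it should be read as a starting point only.

```python
def normalize_creator(name: str) -> str:
    """Map 'Given Names Surname' or 'Surname, Given Names' to 'Surname, Initials'.

    Naive heuristic: without a comma, the last token is taken as the surname.
    Compound surnames and institutional creators are NOT handled correctly.
    """
    # Put a space after every period, then collapse repeated whitespace.
    cleaned = " ".join(name.replace(".", ". ").split())
    if "," in cleaned:
        surname, given = (part.strip() for part in cleaned.split(",", 1))
    else:
        tokens = cleaned.split(" ")
        surname, given = tokens[-1], " ".join(tokens[:-1])
    initials = "".join(
        f"{token[0].upper()}." for token in given.split() if token[0].isalpha()
    )
    return f"{surname}, {initials}" if initials else surname


variants = ["Jane Q. Doe", "Doe, Jane Q.", "Doe, J.Q."]
keys = {normalize_creator(v) for v in variants}
print(keys)
```

In this made-up example the three spellings collapse to a single key, which is the effect a standard input format for the creator field would achieve at the source.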
6.1. Further research
DataCite is currently one of the main data sources available for the development of data metrics, and a great promoter of data sharing and reuse. Indeed, despite its recent creation, DataCite is probably the largest database, with a vast and heterogeneous set of data records, bringing us a step closer to an ideal of open science characterized by its transparency and its capacity to optimize the use of resources. By providing an overview of the structure and content of the DataCite records, this paper has hopefully served as a first step towards a better understanding of data production, publication and reuse by the scientific community. Further research will focus on comparisons between different points of access to DataCite records, the study of the relationships between authors of scientific publications and creators of datasets, the development of suitable classifications of data records and the presence of mentions of DOIs in the references of scientific publications to data.
Author contributions
Nicolas Robinson-Garcia: Conceived and designed the analysis, Contributed data or analysis tools, Performed the analysis, Wrote the paper.
Philippe Mongeon: Contributed data or analysis tools.
Wei Jeng: Collected the data, Contributed data or analysis tools.
Rodrigo Costas: Conceived and designed the analysis, Contributed data or analysis tools, Performed the analysis.
2 The current recommended data citation format from DataCite is the following: Creator (PublicationYear). Title. Publisher. Identifier (DataCite, 2015).
Acknowledgements
Preliminary results of this paper were reported at the 3:AM Conference held in Bucharest (Romania), 27–29 September, 2016. The authors would like to thank Henri de Winter from CWTS for helping in the retrieval of the data and Kristian Garza from DataCite for fruitful and helpful discussions on points of access to DataCite and structure of records. The two anonymous reviewers are also thanked for their constructive comments and recommendations. This study has been partially supported by the European Commission project RTD-B6-00964-2013 Monitoring the evolution and benefits of Responsible Research and Innovation (MoRRI). Nicolas Robinson-Garcia is currently supported by a Juan de la Cierva-Formación grant from the Spanish Ministry of Economy and Competitiveness.
Appendix A. Retrieved fields and description of their contents

| Field | Description |
| --- | --- |
| Identifier | Unique identifier. DataCite assigns DOIs to all data records, although many include additional identifiers such as CCDC (Cambridge Crystallographic Data Centre) or InChI (International Chemical Identifier). |
| Creator | Author of the data record. This field is not presented in a standardized format (e.g., Surname, Initials). |
| Title | Name of the dataset or file stored in the repository. |
| Publisher | Non-standardized format which includes a great variety of different entities ranging from repositories, journals, institutions, etc. |
| Date | This field includes the mandatory field ‘PublicationYear’ as well as the ‘Date’ field, which means that each record can have more than one publication year. The format is standardized but heterogeneous: ‘PublicationYear’ information appears as a four-digit number, while ‘Date’ states the type of date and the actual year (e.g., Available: 01/2/2005). |
| Subject | Keywords assigned to each data record. While we observe that some repositories employ a fixed classification system, this is not systematized for all data records. |
| Contributor | Individuals and institutions collaborating on the creation of the data but not considered as creators. As with the ‘Creator’ field, this field is not presented in a standardized format. |
| ResourceType | This field includes both the first-level data type classification and the second-level data type classification. |
| Description | This field includes in its content the five distinct subsections described by DataCite. However, not all records include all subsections. |
| DataCenter | Institution in charge of feeding DataCite with records. Each data center has a unique identifier constructed in two parts: first the intermediary institution and second the sending institution. For instance, BL.IMPERIAL is the identifier for Imperial College London: BL stands for British Library, the intermediary institution, and IMPERIAL for the sending institution. |
| Relation | This field relates each data record to additional DOIs. How such a relation is established is not formally declared in the record. Although DataCite offers a controlled list of values indicating the type of relation established between records, we did not find this information in the data retrieved. More on this in subsection 3.4. |
| Format | Non-standardized field which includes a formal description of the contents of the record. Here we find information ranging from a catalographic description of the contents (e.g., Zwei Teile in 1 Band; 17 cm) to the actual format of the submitted file (e.g., SPSS file). |
| Language | Non-standardized field indicating the language of the record. Language is indicated using a two-letter code, a three-letter code or the full name. In some cases, more than one language is reported (e.g., fr-en). |
| Rights | Non-standardized format including the holder of the copyright, if any, or the license by which the data record is protected. Information is reported here not only in English but also in other languages. |
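The two-part structure of the DataCenter identifier described in the table lends itself to simple parsing, for instance to aggregate records by intermediary institution. The sketch below assumes the identifier always follows the `INTERMEDIARY.SENDER` pattern of the BL.IMPERIAL example. The names `DataCenterId` and `parse_datacenter_id` are hypothetical.

```python
from typing import NamedTuple


class DataCenterId(NamedTuple):
    intermediary: str  # e.g. "BL" (British Library)
    sender: str        # e.g. "IMPERIAL" (Imperial College London)


def parse_datacenter_id(identifier: str) -> DataCenterId:
    """Split an identifier such as 'BL.IMPERIAL' at the first dot."""
    intermediary, separator, sender = identifier.strip().partition(".")
    if not separator or not intermediary or not sender:
        raise ValueError(f"expected 'INTERMEDIARY.SENDER', got {identifier!r}")
    return DataCenterId(intermediary, sender)


parsed = parse_datacenter_id("BL.IMPERIAL")
print(parsed.intermediary, parsed.sender)
```

Raising an error on identifiers that lack either part, rather than returning a partial result, makes deviations from the assumed pattern visible during cleaning.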
Appendix B. Classification of publisher types

Publishers were classified into eleven mutually exclusive categories to analyze different national data infrastructures. The eleven types of publishers identified are listed below, along with examples for each of them.

| Publisher type | Examples | # records |
| --- | --- | --- |
| Thematic repository | Data-Planet™ Statistical Ready Reference by Conquest Systems, Inc.; Cambridge Crystallographic Data Centre | 2,205,204 |
| Institutional repository | Imperial College London, ETH-Bibliothek Zürich, Bildarchiv, University of Pittsburgh | 852,954 |
| Research body | Partnership for Interdisciplinary Studies of Coastal Oceans (PISCO), Leibniz Institut für Astrophysik Potsdam (AIP) | 764,962 |
| Multidisciplinary repository | Figshare, ZENODO | 408,355 |
| Scientific publisher | German Medical Science GMS Publishing House, Zofinger Tagblatt, PeerJ | 149,305 |
| National repository | Digital Repository of Ireland, Colchester, Essex: UK Data Archive | 40,634 |
| Firm | Huber & Co. AG, Verlegergemeinschaft Werk, Bauen + Wohnen Bauen + Wohnen GmbH | 20,704 |
| Professional body | Bund Schweizer Architekten, Freidenker-Vereinigung der Schweiz, Union syndicale Suisse | 19,215 |
| Conference | European Congress of Radiology | 18,571 |
| Individual | W. Jegher & A. Ostertag, J. F. Boscovits | 8,025 |
| Educational body | nanoHUB | 2,326 |
Association for Information Science and Technology, 68(6), 1341–1359.
Mayernik, M. S. (2012). Bridging data lifecycles: Tracking data use via data citations workshop report. NCAR Technical Note NCAR/TN-494+PROC. Boulder, CO: National Center for Atmospheric Research (NCAR). http://dx.doi.org/10.5065/D6PZ56TX
Missier, P. (2016). Data trajectories: Tracking reuse of published data for transitive credit attribution. International Journal of Digital Curation, 11(1), 1–16.
Parsons, M. A., & Fox, P. A. (2013). Is data publication the right metaphor? Data Science Journal, 12, WDS32–WDS46.
Peng, R. D. (2011). Reproducible research in computational science. Science, 334(6060), 1226.
Perneger, T. V. (2011). Sharing raw data: Another of Francis Galton’s ideas. British Medical Journal, 342, d3035.
Peters, I., Kraker, P., Lex, E., Gumpenberger, C., & Gorraiz, J. (2016). Research data explored: An extended analysis of citations and altmetrics. Scientometrics, 107(2), 723–744.
Piwowar, H. A., Day, R. S., & Fridsma, D. B. (2007). Sharing detailed research data is associated with increased citation rate. Public Library Of Science, 2(3), e308.
Piwowar, H. A., Becich, M. J., Bilofsky, H., & Crowley, R. S. (2008). Towards a data sharing culture: Recommendations for leadership from academic health centers. PLoS Medicine, 5(9), e183.
Robinson-García, N., Jiménez-Contreras, E., & Torres-Salinas, D. (2016). Analyzing data citation practices using the Data Citation Index. Journal of the Association for Information Science and Technology, 67(12), 2964–2975.
Torres-Salinas, D., Robinson-García, N., & Cabezas-Clavijo. (2012). Compartir los datos de investigación en ciencia: introducción al data sharing. El Profesional De La Información, 21(2), 173–184.
Torres-Salinas, D., Martín-Martín, A., & Fuente-Gutiérrez, E. (2014). Análisis de la cobertura del Data Citation Index – Thomson Reuters: disciplinas, tipologías documentales y repositorios. Revista Española De Documentación Científica, 37(1), e036.