Discoverers in scientific citation data
Gui-Yuan Shi a, Yi-Xiu Kong a, Guang-Hui Yuan b, Rui-Jie Wu a, An Zeng c,∗, Matúš Medo d,a,e,∗∗
a Department of Physics, University of Fribourg, Fribourg 1700, Switzerland
b Fintech Research Institute, Shanghai University of Finance and Economics, Shanghai 200433, PR China
c School of Systems Science, Beijing Normal University, Beijing 100875, PR China
d Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, PR China
e Department of Radiation Oncology, Inselspital, University Hospital of Bern and University of Bern, Bern 3010, Switzerland
Keywords: Citation networks; Preferential attachment; Microscopic mechanisms; Prediction
Identifying the future influential papers among newly published ones is an important yet challenging issue in bibliometrics. As newly published papers have little or no citation history, linear extrapolation of their citation counts, which is motivated by the well-known preferential attachment mechanism, is not applicable. We translate the recently introduced notion of discoverers to the citation network setting and show that there are authors who frequently cite recent papers that become highly-cited in the future; these authors are referred to as discoverers. We develop a method for early identification of highly-cited papers based on the early citations from discoverers. The results show that the identified discoverers have a consistent citing pattern over time, and that early citations from them are a valuable indicator for predicting the future citation counts of a paper. The discoverers themselves are potential future outstanding researchers, as their papers receive more citations than average.
1. Introduction
Preferential attachment, also known as the rich-get-richer phenomenon, cumulative advantage, or the Matthew effect, is omnipresent in bibliometrics. This phenomenon was first systematically studied in 1965 (Price, 1965), and later modeled under the name cumulative advantage process (Price, 1976). At the beginning of the surge of interest in complex networks, preferential attachment was independently proposed to explain the power-law distribution of the number of links connecting to web pages (Barabási & Albert, 1999). See Mitzenmacher (2004), Perc (2014) and Zeng et al. (2017) for reviews of preferential attachment and related models.
In the context of bibliometric research, the preferential attachment model assumes that the rate at which a paper receives new citations is proportional to the number of citations that the paper already has. The simplistic original model was later generalized by introducing node fitness (Bianconi & Barabási, 2001; Caldarelli, Capocci, De Los Rios, & Munoz, 2002), aging (Dorogovtsev & Mendes, 2000; Wu, Fu, & Chiu, 2014), and node similarity (Papadopoulos, Kitsak, Serrano, Boguná, & Krioukov, 2012). The introduction of aging is particularly important as in its absence, preferential attachment naturally leads to a strong first-mover advantage (Newman, 2009). Without aging, the oldest papers will consistently be the most cited
∗ Corresponding author. ∗∗ Corresponding author.
E-mail addresses: anzeng@bnu.edu.cn (A. Zeng), matus.medo@unifr.ch (M. Medo).
http://doc.rero.ch
Published in "Journal of Informetrics 13(2): 717–725, 2019"
which should be cited to refer to this work.
(Berset & Medo, 2013; Mariani, Medo, & Zhang, 2016). In particular, aging in combination with heterogeneous node fitness (Medo, Cimini, & Gualdi, 2011) produces various degree distributions that are commonly found in real systems. In Medo and Cimini (2016), this model was used to compare researcher citation impact metrics in a controlled setting. At the same time, there are many more factors that can be included to make the models more realistic (Golosovsky & Solomon, 2017; Wang, Yu, & Yu, 2009).
A particular example of extending the basic preferential attachment paradigm is due to Medo, Mariani, Zeng, and Zhang (2016) who show that in e-commerce data represented by bipartite user-item networks, there are users who are repeatedly among the first to collect items that become popular much later. These users are referred to as discoverers (see Cervellini, Menezes, and Mago (2016) for a related concept of trendsetters). One simple and fundamental explanation of this observation is that besides the common users who are driven by item popularity (in agreement with preferential attachment), there are also users who are driven by intrinsic item fitness. Here fitness is the intrinsic item quality as viewed by a given audience; high fitness items are likely to become popular. A user who is sensitive to item fitness can thus link with a high-fitness item despite its lower popularity. As shown in Medo et al. (2016), discoverers are ubiquitous across various e-commerce systems. Another mechanism that, albeit fundamentally different, can lead to a similar pattern has been put forward by Wu and Holme (2009) who studied the process of the citations made by influential researchers being copied by their followers, and its impact on the network clustering coefficient.
While scholarly citation data can be naturally represented as author-paper bipartite networks, and it is thus natural to apply the framework of discoverers to them, this has not been done before. We use the much-studied American Physical Society (APS) data (Chen & Redner, 2010; Martin, Ball, Karrer, & Newman, 2013; Medo et al., 2011) to show that discoverers are also common in citation data. Upon identifying the discoverers, we track the papers first cited by them in the future and find that they are cited above average. Finally, we focus on the discoverers themselves and assess the success of their own publications and their overall impact (as compared with the impact of other authors).
2. Data and model
2.1. Data description and author name disambiguation
We use the citation data made available by the American Physical Society (APS). The data span from 1893 to 2016 with a time resolution of one day. There are 593443 papers and 5553199 citations among them in total. Note that the papers published by Reviews of Modern Physics are excluded from the analysis because review articles usually have many more citations, so they are in principle “easy targets” which should not contribute to the authors’ record of early citing papers that later become very popular. We also exclude self-citations, i.e., citations where the citing paper and the cited paper have at least one author in common. It is conventional for researchers to cite their own papers, so we believe that such citations deviate substantially from what we define as a “discovery”.
We consider two authors to be the same individual if they satisfy the following two conditions: (1) their last names and the initials of both first names and given names are exactly the same; (2) one has cited or been cited by the other, or they have co-authors whose names satisfy the first condition. After the name disambiguation, we obtain 559940 authors in total.
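The two merging conditions can be sketched as a union-find pass over author mentions. This is only an illustration under stated assumptions, not the procedure's actual implementation: the input layout (`papers`, `citations`), the treatment of hyphenated given names as two initials, and the function names `name_key` and `disambiguate` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def name_key(last, given):
    """Last name plus initials of the given names (assumed normalization)."""
    initials = "".join(part[0].upper() for part in given.replace("-", " ").split())
    return (last.strip().lower(), initials)

def disambiguate(papers, citations):
    """papers: {paper_id: [(last, given), ...]}; citations: set of (citing, cited).
    Returns {(paper_id, author_index): cluster_root}."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    by_key = defaultdict(list)
    for pid, authors in papers.items():
        for idx, (last, given) in enumerate(authors):
            parent[(pid, idx)] = (pid, idx)
            by_key[name_key(last, given)].append((pid, idx))

    def coauthor_keys(mention):
        pid, idx = mention
        return {name_key(*a) for j, a in enumerate(papers[pid]) if j != idx}

    # Condition (1): only mentions with the same name key are compared.
    for mentions in by_key.values():
        for a, b in combinations(mentions, 2):
            # Condition (2): a citation link or a co-author with a matching key.
            cites = (a[0], b[0]) in citations or (b[0], a[0]) in citations
            if cites or (coauthor_keys(a) & coauthor_keys(b)):
                parent[find(a)] = find(b)
    return {m: find(m) for m in parent}
```

The pairwise comparison is quadratic in the number of mentions sharing a name key, which is acceptable for a sketch but would need blocking or indexing on data of the APS size.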
2.2. Model
For a given time point $T$, the papers published between $T-K-10$ and $T-K$ are used as the training set (here 10 represents 10 years). To identify the discoverers, i.e., the researchers who are often early to cite papers that are eventually highly-cited, the first step is to set the thresholds for “cite early” and “highly-cited”. Since in reality we cannot know any information after time $T$, we use a short-term citation count $C_K$ (the citation count $K$ years after publication) as a proxy for the paper’s eventual popularity. As $K$ increases, $C_K$ becomes closer to the eventual popularity. At the same time, large $K$ means that the training set is far from the current time $T$, which limits the predictive power of the identified discoverers for two main reasons: (1) authors active in the training set may become inactive by time $T$; (2) authors’ discovery ability does not have to be stationary, so the authors who were discoverers in the training set need not be discoverers at time $T$. The papers whose $C_K$ is in the top $f_p$% among all papers published between 1970 and 2006 form the target group of highly-cited papers. For each training set paper, its first $F$ citations received within one year after the publication date are denoted as explorations of the paper’s potential. If a paper turns out to be a highly-cited paper, those $F$ explorations are also labeled as discoveries. The authors are assessed by the number of explorations and discoveries that they have made (see below). Fig. 1 shows a schematic diagram of the evaluation scheme. Note that our setting is more controlled than that used in Medo et al. (2016) where the current popularity of items at time $T$ was used instead of $C_K$. The use of $C_K$ is advantageous when analyzing citation data where highly-cited papers are known to accumulate citations over extremely long periods of time (Golosovsky & Solomon, 2013).
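The counting of explorations and discoveries described above can be sketched as follows. The data layout, the use of 365 days for "one year", and the conversion of the top-$f_p$% criterion into a fixed `threshold` on $C_K$ are assumptions made for this illustration; `count_explorations` is a hypothetical name.

```python
from collections import defaultdict
from datetime import date, timedelta

def count_explorations(train_papers, citations, authors, c_k, threshold, F=5):
    """
    train_papers: {paper_id: publication date}
    citations:    list of (citing_id, cited_id, citation date)
    authors:      {paper_id: set of author ids}
    c_k:          {paper_id: citation count K years after publication}
    threshold:    minimal C_K of a highly-cited paper (assumed precomputed)
    Returns (E, D): explorations and discoveries per author.
    """
    incoming = defaultdict(list)
    for citing, cited, when in citations:
        if cited in train_papers:
            incoming[cited].append((when, citing))

    E, D = defaultdict(int), defaultdict(int)
    for paper, published in train_papers.items():
        deadline = published + timedelta(days=365)  # "within one year" (assumed)
        # Keep at most the first F citations that arrive before the deadline.
        early = sorted(c for c in incoming[paper] if c[0] <= deadline)[:F]
        for _, citing in early:
            for author in authors[citing]:
                E[author] += 1                      # every early citation is an exploration
                if c_k[paper] >= threshold:
                    D[author] += 1                  # ... and a discovery if the paper is highly-cited
    return dict(E), dict(D)
```

Every author of an early citing paper is credited, which matches the scheme in Fig. 1; citations arriving on the same day are ordered arbitrarily (here by identifier).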
Fig. 1. A schematic plot of the evaluation of discoverers. 1) $\alpha$ is a paper in the training set. If $\beta$ is one of $\alpha$’s first $F$ citations within 1 year, the number of explorations of each of $\beta$’s authors increases by 1. If $\alpha$’s citation count after $K$ years ranks in the top $f_p$%, the number of discoveries of each of $\beta$’s authors increases by 1. 2) $\gamma$ is a paper in the test set that is first cited by $\delta$; if one of $\delta$’s authors is a discoverer, $\gamma$ is predicted to be a highly-cited paper.
Using the above-described procedure, we determine the total number of explorations, $E_i$, and discoveries, $D_i$, for each researcher $i$. The null hypothesis is that there are no differences between the authors in how they cite papers that later become highly-cited. In our terminology, each “exploration” has the same chance to become a “discovery” (i.e., each early citation has the same chance to be a citation of a future highly-cited paper). The overall probability that an exploration turns out to be a discovery is $P_d = \sum_i D_i / \sum_i E_i$. Under the null hypothesis $H_0$, the number of discoveries a researcher makes follows the binomial distribution

$$P(D_i; E_i, P_d, H_0) = \binom{E_i}{D_i} P_d^{D_i} (1 - P_d)^{E_i - D_i}. \tag{1}$$

We shall test whether the null hypothesis is supported by the data and, if not, by how much the results disagree with it.
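As a small numerical companion to Eq. (1), the overall discovery probability and the binomial probability can be computed directly. The dictionaries `E` and `D` (per-author counts) and the function names are assumptions of this sketch.

```python
from math import comb

def discovery_probability(E, D):
    """Overall probability P_d that an exploration turns out to be a discovery."""
    return sum(D.values()) / sum(E.values())

def binomial_pmf(d, e, p_d):
    """Eq. (1): probability of exactly d discoveries in e explorations under H0."""
    return comb(e, d) * p_d**d * (1 - p_d) ** (e - d)
```

For instance, `binomial_pmf(2, 2, p_d)` equals `p_d**2`: under the null hypothesis, an author with only two explorations has both of them turn into discoveries with probability $P_d^2$.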
3. Results
3.1. Statistical significance of discoverers
With a large number of researchers, some of them can achieve many discoveries even if the null hypothesis actually holds. To be able to focus on really notable authors, we compare the empirical numbers of explorations and discoveries with their expected behavior under the null hypothesis.
Similarly to the discovery probability $P_d$, one can also introduce the exploration probability $P_e = \sum_i E_i / \sum_i L_i$ where $L_i$ is the number of citations made by author $i$ in the training set. The expected distribution of the number of explorations is then closely intertwined with the distribution of the number of citations. Under the null hypothesis, the expected number of authors who made $E$ explorations is

$$N_e(E) = \sum_{x=1}^{\infty} A(x) \binom{x}{E} P_e^{E} (1 - P_e)^{x - E}, \tag{2}$$

where $A(x)$ is the number of authors who cited $x$ papers. The expected number of authors who made $D$ discoveries, $N_d(D)$, can be obtained similarly. The results in Fig. 2a,b demonstrate that there are many researchers who make many more explorations and discoveries than expected under the null hypothesis.
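Eq. (2) is a mixture of binomial distributions weighted by the number of authors at each citation count. A minimal sketch, assuming the histogram $A(x)$ is available as a dictionary (the name `expected_author_counts` is hypothetical):

```python
from math import comb

def expected_author_counts(A, p_e, max_e):
    """Eq. (2): expected number of authors with E = 0..max_e explorations.

    A: {x: number of authors who made x citations in the training set}
    p_e: overall exploration probability P_e
    """
    return [
        sum(n * comb(x, e) * p_e**e * (1 - p_e) ** (x - e)
            for x, n in A.items() if x >= e)  # an author cannot explore more than x times
        for e in range(max_e + 1)
    ]
```

Summing the returned list over all feasible values of $E$ recovers the total number of authors, which is a convenient sanity check before comparing the curve with the empirical histogram.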
Similarly to Medo et al. (2016), we use the p-value to quantify the statistical significance of author $i$ making $D_i$ discoveries out of $E_i$ explorations. For a given author, the p-value, i.e., the probability of making at least $D_i$ discoveries under the null hypothesis $H_0$, is

$$P_v(D_i, E_i) = \sum_{D = D_i}^{E_i} P(D; E_i, P_d, H_0). \tag{3}$$

An author achieves a small p-value $P_v$ only if they cite new papers sufficiently often (when $E_i$ is small or zero, even the maximal possible $D_i = E_i$ may be insufficient for a statistically significant observation; see Fig. 2d for an illustration) and a disproportionate number of those papers turn out to be highly-cited in the future. To identify the statistically significant discoverers, we control the False Discovery Rate (FDR) at the level of 5% by applying the Benjamini-Hochberg procedure. The procedure works as follows: (1) rank all the $N$ active authors who made at least one exploration in ascending order of $P_v$; (2) find the largest $x$ such that $P_v(x) \le x\alpha/N$, where $P_v(x)$ is the p-value of the author ranked $x$ and $\alpha$ is the FDR level; (3) the $x$ researchers who have the smallest $P_v$ are identified as discoverers. Benjamini and Hochberg proved that the expected proportion of false discoveries (authors identified as discoverers because of pure luck) is at most $\alpha$ (Benjamini & Hochberg, 1995). See Fig. 2c for an example.
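The p-value of Eq. (3) and the three steps of the Benjamini-Hochberg procedure can be sketched in a few lines. This is an illustrative implementation under the assumption that per-author p-values are held in a dictionary; the function names are hypothetical.

```python
from math import comb

def p_value(d, e, p_d):
    """Eq. (3): probability of at least d discoveries in e explorations under H0."""
    return sum(comb(e, k) * p_d**k * (1 - p_d) ** (e - k) for k in range(d, e + 1))

def benjamini_hochberg(p_values, alpha=0.05):
    """p_values: {author: p-value}. Returns the set of authors flagged as discoverers."""
    ranked = sorted(p_values.items(), key=lambda item: item[1])  # step (1)
    n = len(ranked)
    cutoff = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= rank * alpha / n:
            cutoff = rank                                        # step (2): largest such rank
    return {author for author, _ in ranked[:cutoff]}             # step (3)
```

Note that step (2) takes the largest qualifying rank, so an author can be flagged even if their own p-value exceeds the threshold at a lower rank. With the Fig. 2 example values $P_d = 7.6\%$ and $p_m = 8.9 \times 10^{-5}$, an author with $D_i = E_i$ needs at least four explorations, since $0.076^3 \approx 4.4 \times 10^{-4}$ exceeds $p_m$ while $0.076^4 \approx 3.3 \times 10^{-5}$ does not.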
We find that the results of our algorithm are only weakly sensitive to the choice of parameters. To identify as many discoverers as possible, we use multiple parameter choices and mark as discoverers the authors who are identified by the statistical procedure for at least one parameter setting. In particular, we used $F \in \{1, \dots, 5\}$, $K \in \{1, \dots, 10\}$, and $f_p \in \{0.2, 0.4, \dots, 2.0\}$, which corresponds to $5 \times 10 \times 10 = 500$ different parameter settings.
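The parameter sweep can be sketched as a loop over the Cartesian product of the three parameter ranges. The callback `identify`, which stands for the full statistical procedure at one setting, is a hypothetical placeholder.

```python
from collections import Counter
from itertools import product

F_VALUES = range(1, 6)                                   # F = 1..5
K_VALUES = range(1, 11)                                  # K = 1..10 (years)
FP_VALUES = [round(0.2 * i, 1) for i in range(1, 11)]    # f_p = 0.2..2.0 (percent)

def count_flags(identify):
    """identify(F, K, f_p) -> set of authors significant at that setting (hypothetical).

    Returns a Counter with the number of settings in which each author is flagged;
    every author with a positive count is marked as a discoverer.
    """
    flags = Counter()
    for F, K, f_p in product(F_VALUES, K_VALUES, FP_VALUES):
        flags.update(identify(F, K, f_p))
    return flags
```

Keeping the per-author count rather than only the union makes it possible to rank discoverers by how robustly they are identified across settings.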
We identify the discoverers from 1990 to 2006 with the method described above. The total number of discoverers in each year, shown in Fig. 3a, fluctuates around 900. In Fig. 3b, we show the proportion of papers first cited by discoverers in each year. About 8% of all papers are first cited by discoverers and thus predicted to be potential highly-cited papers.
Fig. 2. Statistical significance of discoverers. We use the parameters $F = 5$, $K = 3$, $f_p = 1.0$ (%), and $T = 2006$ as an example, which imply $P_d = 7.6\%$ and $P_e = 5.4\%$. (a) Number of researchers who make different numbers of explorations (red dots); the black line represents the result under the null hypothesis, see Eq. (2). (b) Number of researchers who make different numbers of discoveries (red dots); the black line represents the result under the null hypothesis. (c) Author p-values obtained with Eq. (3) and the Benjamini-Hochberg procedure to control the False Discovery Rate. The solid line is $r\alpha/N$ where $N$ is the number of all active authors who made at least one exploration under the above parameter setting at $T$ (here $N = 72931$), $r$ is the author rank (the x-axis), and $\alpha$ is the FDR level (we choose 5%). The dots below the line mark the authors who significantly deviate from the null hypothesis (red symbols); in this case, 124 discoverers are identified whose p-values are below $p_m = 8.9 \times 10^{-5}$. (d) The discovery rate ($D_i/E_i$) of the identified discoverers. The solid line shows the smallest discovery rate that would suffice to achieve a p-value below $p_m$ at a given number of explorations.
Fig. 3. (a) Number of discoverers in each year. (b) The proportion of papers first cited by discoverers among all the papers with $C_{10} \ge 1$.
3.2. An example of discoverers
Previous studies (Jeong, Néda, & Barabási, 2003; Redner, 2004; Wang, Yu, & Yu, 2008) show that most researchers cite papers according to the preferential attachment mechanism, and that there is a strong tendency for those researchers to cite popular papers. However, we find that there exists a tiny fraction of researchers who repeatedly cite new papers that will become highly-cited in the future. Here we give an example of a typical discoverer identified in year 1990, whose publishing time series is illustrated in Fig. 4. This researcher is marked as a discoverer in 301 of the 500 parameter settings at $T = 1990$, which is the highest count among all researchers. The discoverer’s first paper was published in July 1973.
4. The behavior of discoverers
4.1. Future success of papers
Since the discoverers have historically high success in early citing highly-cited papers, it is natural to investigate whether they also have the ability to discover highly-cited papers in the future. To this end, we employ the procedure illustrated in Fig. 1: using the described procedure, we identify the discoverers at the beginning of a year (time $T$, from 1990 to 2006) and consider all papers published between $T$ and $T+1$ that have $C_{10} \ge 1$. Among those papers, if a paper’s first citation comes from another paper that has at least one discoverer among the authors, we say that the paper is “first cited by discoverers”.
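The "first cited by discoverers" rule amounts to finding each test paper's earliest citation and checking the authors of the citing paper. A minimal sketch, with an assumed data layout and a hypothetical function name; ties between citations with the same timestamp are resolved in favor of the one encountered first, which is an arbitrary choice.

```python
def first_cited_by_discoverers(test_papers, citations, authors, discoverers):
    """
    test_papers: set of paper ids published between T and T+1
    citations:   iterable of (citing_id, cited_id, time)
    authors:     {paper_id: set of author ids}
    discoverers: set of author ids identified at time T
    Returns the test papers whose first citation comes from a discoverer's paper.
    """
    first = {}
    for citing, cited, when in citations:
        if cited in test_papers and (cited not in first or when < first[cited][0]):
            first[cited] = (when, citing)
    return {paper for paper, (_, citing) in first.items()
            if authors[citing] & discoverers}
```

Papers without any citation never enter the result, consistent with the restriction to papers with $C_{10} \ge 1$.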
Fig. 4. The most outstanding discoverer identified in the year 1990. Blue bars show the total citation counts of all the papers the discoverer has cited, and the red bars show the citation count of each paper when the discoverer cited it.
Fig. 5. (a) A comparison between $C_{10}$ of the papers first cited by the discoverers (red), high h-index authors (blue), and all researchers (black); each dot represents the mean $C_{10}$ of papers published in the year. The error bars represent the standard error of papers’ $C_{10}$. (b) $C_{10}$ of the papers that rank $r$ by $C_{10}$, where $r$ is the normalized ranking of the paper in the group. The red, blue, and black lines represent the groups first cited by discoverers, high h-index authors, and all authors, respectively.
In the APS data, from the beginning of 1990 to the end of 2006, 186567 papers have been published and have $C_{10} \ge 1$. Among them, 13157 papers have been first cited by discoverers. On average, the papers first cited by discoverers have $C_{10}$ of 16.7, while the overall average (for papers with $C_{10} \ge 1$) is 9.85. As an additional benchmark, we compute the h-index (Hirsch, 2005) of all authors at time $T$ (here the beginning of each year), select the authors with high h-index, and evaluate the papers that are first cited by high h-index authors. The h-index thresholds are 14–20, differing from year to year so that the number of the selected papers is close to the number of papers cited by discoverers. The resulting 14001 papers first cited by high h-index researchers have on average $C_{10}$ of 14.4. Fig. 5a shows $C_{10}$ of the papers chosen by the three criteria (all papers with $C_{10} \ge 1$, first cited by discoverers, and first cited by high h-index authors) by paper publication year. We see that the papers cited by the discoverers have the highest $C_{10}$ in each year. As shown in Fig. 5b, if we sort the papers in each group by their $C_{10}$, the papers first cited by discoverers outperform the papers with the same rank in the other two groups.
In our procedure, the papers first cited by the discoverers are predicted to be highly-cited in the future. To measure the quality of this prediction, including how many highly-cited papers it misses, we calculate the False Positive Rate (FPR) and the True Positive Rate (TPR) of our method. The FPR is defined as $\mathrm{FPR} = FP/N_n$, where $FP$ is the number of false positives and $N_n$ is the total number of negatives. The TPR is defined as $\mathrm{TPR} = TP/N_p$, where $TP$ is the number of true positives and $N_p$ is the total number of positives.
We define a paper that has $C_{10} > 68$ (i.e., it ranks in the top 1% of all papers with $C_{10} \ge 1$ from 1990 to 2006) as highly-cited. Consequently, a true positive event is a paper first cited by discoverers that reaches $C_{10} > 68$. We find $TP_d = 368$ true positive events. A false positive event is a paper first cited by discoverers that has $C_{10} \le 68$. We obtain in total $FP_d = 12789$ false positive events. These values imply $\mathrm{FPR}_d = 6.9\%$ and $\mathrm{TPR}_d = 20.3\%$ for the prediction of highly-cited papers using discoverers. For comparison, we also study the papers first cited by high h-index authors, finding $\mathrm{FPR}_h = 7.4\%$ and $\mathrm{TPR}_h = 17.0\%$. The discoverer-based method thus outperforms the method based on high h-index authors in both FPR and TPR. We further tested other thresholds for defining the positive events and find that the conclusion holds for all of them.
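The two rates follow directly from set operations on paper identifiers. The sketch below uses hypothetical names and toy inputs; it only illustrates the definitions of FPR and TPR given above.

```python
def rates(predicted, highly_cited, all_papers):
    """
    predicted:    set of papers predicted to be highly-cited (e.g. first cited by discoverers)
    highly_cited: set of papers that actually are highly-cited (positives)
    all_papers:   set of all evaluated papers
    Returns (FPR, TPR).
    """
    tp = len(predicted & highly_cited)      # predicted and highly-cited
    fp = len(predicted - highly_cited)      # predicted but not highly-cited
    n_pos = len(highly_cited)
    n_neg = len(all_papers) - n_pos
    return fp / n_neg, tp / n_pos
```

As a consistency check on the reported counts, the true and false positives add up to the number of papers first cited by discoverers: $TP_d + FP_d = 368 + 12789 = 13157$.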
4.2. Discoverers and social influence
The observation that the papers first cited by the discoverers are more likely to be cited in the future calls for possible explanations. In line with Medo et al. (2016), one explanation is that the discoverers are fitness-driven researchers who are sensitive to the intrinsic quality of papers; thus the papers that they cite have overall higher quality, which in turn leads to more citations. Another possible explanation is that the discoverers themselves are influential researchers, which means they have many followers who frequently cite their papers and also possibly copy their references (Wu & Holme, 2009). Such copying of references is known to be an important element of the growth of citation data (Simkin & Roychowdhury, 2005, 2007). In addition, when entering a new research field, a possible way to start is to review the milestone papers by outstanding researchers in the field (and the references therein). In summary, it is plausible that researchers are more exposed to the papers that were cited by famous researchers.
Fig. 6. (a) A comparison between the papers first cited by discoverers and all papers ($C_{10} \ge 1$). The red bars indicate the first citation; green bars and orange bars represent those citations from papers which cited the first citing paper as well, or not, respectively. See (b) for illustration.
To better understand the role of discoverers and their contribution to boosting paper popularity, it is crucial to distinguish between the impacts of these two effects. Similar effects have rarely been discussed before because most datasets lack a direct way to determine whether a choice is an independent decision or a copy of the behavior of others. Fortunately, the scientific citation data give us information about the paths of knowledge propagation and thus provide a way to further investigate this problem.
To this end, we divide paper $\alpha$’s citations into three groups:
1. The first citation of paper $\alpha$, labeled as $\beta$. (Red bar in Fig. 6a)
2. Papers that cite both $\alpha$ and $\beta$. These papers can be further divided into three types: (i) the papers that cite $\alpha$ because of the information from $\beta$; (ii) the papers that cite $\beta$ because of the information from $\alpha$; (iii) the papers that cite both $\alpha$ and $\beta$ just because of coincidence or other unknown causes. It is generally difficult to distinguish these types of citations. For simplicity, we consider the number of all these papers as an upper limit on the social influence of $\beta$. (Green bar in Fig. 6a)
3. Papers that cite $\alpha$ but do not cite $\beta$. (Orange bar in Fig. 6a)
While class two corresponds to the hypothesis that the discoverers utilize their social influence and boost the citation count of a target paper by having it co-cited with their paper, class three corresponds to the hypothesis of discoverers being the first ones to spot high quality papers. As shown in Fig. 6a, the third class of citations is clearly the dominant component. While some social influence of discoverers may still exist, we see that copying of the discoverers’ behavior is a minor influence here.
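The three-way split of a paper's citations can be sketched as follows. The citation tuple layout and the name `split_citations` are assumptions of this illustration; the second returned set is, as in the text, only an upper limit on the social influence of the first citing paper.

```python
def split_citations(alpha, citations):
    """
    alpha:     id of the cited paper under study
    citations: list of (citing_id, cited_id, time)
    Returns (beta, co_citing, independent):
      beta        - the first paper citing alpha (class 1)
      co_citing   - later papers citing both alpha and beta (class 2)
      independent - later papers citing alpha but not beta (class 3)
    """
    to_alpha = sorted((t, citing) for citing, cited, t in citations if cited == alpha)
    if not to_alpha:
        return None, set(), set()
    beta = to_alpha[0][1]
    cites_beta = {citing for citing, cited, _ in citations if cited == beta}
    later = {citing for _, citing in to_alpha[1:]}
    return beta, later & cites_beta, later - cites_beta
```

Comparing `len(co_citing)` with `len(independent)` per paper, and averaging over groups of papers, gives the kind of comparison summarized by the green and orange bars of Fig. 6a.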
4.3. Do the discoverers act as followers?
Another question is whether the success of discoverers can possibly lie in preferentially citing (following) highly-cited authors, which in turn increases their chance of making discoveries. To investigate this possibility, we compute the maximum h-index of each paper’s authors at the time the paper was published, $h_{\max}$. We then divide the papers into groups by $h_{\max}$. As shown in Fig. 7, discoverers have a preference for citing papers with high $h_{\max}$: for example, 3.12% of papers first cited by discoverers have $h_{\max} \ge 30$, while for all the papers with $C_{10} \ge 1$, the proportion is 1.12%. Despite this preference, in each $h_{\max}$ group, papers first cited by the discoverers receive significantly more citations than those cited by all authors. This shows that the papers first cited by the discoverers are on average more cited regardless of whether the papers’ authors are highly-cited or not.
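The quantity $h_{\max}$ is the largest h-index among a paper's authors. A minimal sketch, assuming that the caller supplies each author's per-paper citation counts as of the publication date of the paper in question (the names are hypothetical):

```python
def h_index(citation_counts):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    # In descending order, the condition c >= rank holds exactly for the first h papers.
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)

def h_max(paper_authors, counts_by_author):
    """Highest h-index among a paper's authors.

    counts_by_author[a] lists the citation counts of author a's papers,
    evaluated at the time the paper under study was published (assumed input).
    """
    return max(h_index(counts_by_author[a]) for a in paper_authors)
```

Papers can then be binned by `h_max` and the mean $C_{10}$ compared within each bin, which separates the effect of citing prominent authors from the discoverers' own selection.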
4.4. Papers authored by the discoverers
Another pertinent question is whether the discoverers’ papers also receive more citations. Of the 233087 papers published in 1990–2006, 16358 papers have at least one discoverer among their authors, and 17966 papers have at least one high h-index author among their authors (the h-index thresholds are 14–21, differing from year to year so that the number of the selected papers is close to the number of papers authored by discoverers). Mean $C_{10}$ values for these three groups of papers (all papers, papers authored by discoverers, and papers authored by high h-index authors) are 7.88, 17.1, and 14.9, respectively. Similarly to the papers first cited by the discoverers, papers authored by the discoverers also have a significant citation lead over the two benchmark groups (see Fig. 8a for a comparison by paper publication year).
Fig. 7. A comparison of $C_{10}$ for the papers first cited by the discoverers (blue bars) and by all authors (yellow bars). Papers published in 1990–2006 are considered. The papers are divided into groups by their $h_{\max}$ (the highest h-index of their authors in the paper’s publication year). The error bars represent the standard errors. The percentages above the bars indicate the proportion of the contributing papers.
Fig. 8. (a) A comparison between $C_{10}$ of the papers authored by the discoverers (red), high h-index authors (blue), and all authors (black); each dot represents the mean $C_{10}$ of all papers published in the year. The error bars represent the standard error of papers’ $C_{10}$. (b) $C_{10}$ of the papers that rank $r$ by $C_{10}$, where $r$ is the normalized ranking of the paper in the group. The red, blue, and black lines represent the groups authored by discoverers, high h-index authors, and all authors, respectively.
Table 1
Mean and its standard error of $C_{10}$ for various subsets of papers; $N_s$ denotes the subset size. The bottom part includes the results for various intersections of paper subsets discussed in the top part.
            First cited by                                      Authored by
            (A) Discoverers  (B) High h-index  All              (C) Discoverers  (D) High h-index  All
$C_{10}$    16.7±0.3         14.4±0.2          9.85±0.05        17.1±0.2         14.9±0.2          7.88±0.03
$N_s$       13157            14001             186567           16358            17966             233087

            A∩B              A∩C               A∩D              B∩C              B∩D               C∩D
$C_{10}$    17.2±0.4         24.4±0.7          24.3±0.8         23.0±0.8         21.5±0.8          19.6±0.4
$N_s$       5490             3172              2292             2211             2161              6822
4.5. Further improving the predictions
Finally, we compare the citation counts of various subsets of papers in Table 1. Notably, papers belonging to intersections of the paper subsets discussed so far (e.g., papers that are both first cited and authored by a discoverer, intersection A∩C) have even higher $C_{10}$, which suggests further possibilities for studying the predictive power of various author subsets (here we focused on the discoverers and used high h-index authors as a benchmark) in the future.
5. Discussion
This paper reveals a heterogeneity of researchers’ citing behavior. We identify the researchers, referred to as discoverers, who are frequently among the first to cite papers that become highly-cited in the future. We focus on these discoverers to assess their future citing behavior. Our results show that the discoverers’ ability is persistent in time to the extent that the discoverers identified at time $T$ can be used to identify future highly-cited papers. We show that social influence plays a minor role in explaining the discoverers’ behavior, for which the main hypothesis remains that the discoverers are the authors who are driven by the intrinsic paper fitness, or are at least more sensitive to it than the average author (as originally suggested by Medo et al. (2016)).
Despite the huge success of the preferential attachment mechanism in explaining the rich-get-richer phenomenon, there are a number of citation patterns that it cannot account for: for example, newly published papers quickly becoming popular, and highly cited old papers no longer being cited often. To address these deviations from preferential attachment, researchers have developed models like the fitness model (Bianconi & Barabási, 2001; Caldarelli et al., 2002) and the aging model (Dorogovtsev & Mendes, 2000; Wu et al., 2014). While these two models are able to provide possible explanations for the anomalies in the citation data that diverge from the preferential attachment prediction, a detailed model that utilizes both the fitness information and temporal information to extract the fitness-driven behavior of users has been lacking for years. Our work provides a well-calibrated method assuming a mix of users with different sensitivity to paper fitness.
We divide the researchers into two groups: fitness-driven researchers and popularity-driven researchers. Fitness-driven researchers, namely the discoverers, can find high-quality papers at an early stage and increase the early citation counts of these papers, which makes high-quality papers easier to notice for popularity-driven researchers and helps them finally stand out. This contributes to the relation between the quality of a paper and its citation count. Although the fitness model that serves as the foundation of this paper could offer a possible explanation of the phenomenon that a few researchers can consistently discover new papers that become popular in the future, a comprehensive understanding of why these researchers have this predictive ability still awaits further study.
Our findings can be used for early assessment of a paper’s potential impact. According to our study, focusing on the discoverers and assigning more weight to their citations when evaluating new papers is helpful in finding high-quality papers shortly after they are published. Another interesting result is that the papers authored by discoverers tend to receive more citations than those of researchers with high h-index. For the first time, this paper reveals a correlation between researchers’ citing behavior and their academic potential. This may in turn provide new inputs for the problem of identifying potential future outstanding researchers.
Author contributions
Gui-Yuan Shi: Collected the data; Performed the analysis; Wrote the paper.
Yi-Xiu Kong: Collected the data; Performed the analysis.
Guang-Hui Yuan: Collected the data; Performed the analysis.
Rui-Jie Wu: Performed the analysis; Wrote the paper.
An Zeng: Conceived and designed the analysis; Contributed data or analysis tools; Wrote the paper.
Matúš Medo: Conceived and designed the analysis; Contributed data or analysis tools; Wrote the paper.
Acknowledgments
The authors would like to thank Prof. Yi-Cheng Zhang for helpful discussion. This work was supported by the National Natural Science Foundation of China (Grant No. 61603046), the Natural Science Foundation of Beijing (Grant No. L160008), and the China Scholarship Council.
References
Barabási, A. L., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439), 509–512.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 289–300.
Berset, Y., & Medo, M. (2013). The effect of the initial network configuration on preferential attachment. European Physical Journal B, 86, 260.
Bianconi, G., & Barabási, A. L. (2001). Bose-Einstein condensation in complex networks. Physical Review Letters, 86(24), 5632.
Caldarelli, G., Capocci, A., De Los Rios, P., & Munoz, M. A. (2002). Scale-free networks from varying vertex intrinsic fitness. Physical Review Letters, 89(25), 258702.
Cervellini, P., Menezes, A. G., & Mago, V. K. (2016). Finding trendsetters on Yelp dataset. 2016 IEEE Symposium Series on Computational Intelligence (SSCI). pp. 1.
Chen, P., & Redner, S. (2010). Community structure of the Physical Review citation network. Journal of Informetrics, 4, 278.
Dorogovtsev, S. N., & Mendes, J. F. F. (2000). Evolution of networks with aging of sites. Physical Review E, 62(2), 1842.
Golosovsky, M., & Solomon, S. (2013). The transition towards immortality: Non-linear autocatalytic growth of citations to scientific papers. Journal of Statistical Physics, 151, 340.
Golosovsky, M., & Solomon, S. (2017). Growing complex network of citations of scientific papers: Modeling and measurements. Physical Review E, 95, 012324.
Hirsch, J. E. (2005). An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102(46), 16569.
Jeong, H., Néda, Z., & Barabási, A. L. (2003). Measuring preferential attachment in evolving networks. EPL (Europhysics Letters), 61(4), 567.
Mariani, M. S., Medo, M., & Zhang, Y. C. (2016). Identification of milestone papers through time-balanced network centrality. Journal of Informetrics, 10(4), 1207–1223.
Martin, T., Ball, B., Karrer, B., & Newman, M. E. J. (2013). Coauthorship and citation patterns in the Physical Review. Physical Review E, 88, 012814.
Medo, M., Cimini, G., & Gualdi, S. (2011). Temporal effects in the growth of networks. Physical Review Letters, 107(23), 238701.
Medo, M., Mariani, M. S., Zeng, A., & Zhang, Y. C. (2016). Identification and impact of discoverers in online social systems. Scientific Reports, 6, 34218.
Medo, M., & Cimini, G. (2016). Model-based evaluation of scientific impact indicators. Physical Review E, 94, 032312.
Mitzenmacher, M. (2004). A brief history of generative models for power law and lognormal distributions. Internet Mathematics, 1, 226.
Newman, M. E. (2009). The first-mover advantage in scientific publication. EPL (Europhysics Letters), 86(6), 68001.
Papadopoulos, F., Kitsak, M., Serrano, M. Á., Boguná, M., & Krioukov, D. (2012). Popularity versus similarity in growing networks. Nature, 489(7417), 537.
Perc, M. (2014). The Matthew effect in empirical data. Journal of The Royal Society Interface, 11, 20140378.
Price, D. J. D. S. (1965). Networks of scientific papers. Science, 510–515.
Price, D. D. S. (1976). A general theory of bibliometric and other cumulative advantage processes. Journal of the Association for Information Science and Technology, 27(5), 292–306.
Redner, S. (2004). Citation statistics from more than a century of Physical Review. arXiv preprint physics/0407137.
Simkin, M. V., & Roychowdhury, V. P. (2005). Stochastic modeling of citation slips. Scientometrics, 62, 367.
Simkin, M. V., & Roychowdhury, V. P. (2007). A mathematical theory of citing. Journal of the American Society for Information Science and Technology, 58, 1661.
Wang, M., Yu, G., & Yu, D. (2008). Measuring the preferential attachment mechanism in citation networks. Physica A: Statistical Mechanics and its Applications, 387(18), 4692–4698.
Wang, M., Yu, G., & Yu, D. (2009). Effect of the age of papers on the preferential attachment in citation networks. Physica A: Statistical Mechanics and its Applications, 388(19), 4273–4276.
Wu, Y., Fu, T. Z., & Chiu, D. M. (2014). Generalized preferential attachment considering aging. Journal of Informetrics, 8(3), 650–658.
Wu, Z. X., & Holme, P. (2009). Modeling scientific-citation patterns and other triangle-rich acyclic networks. Physical Review E, 80(3), 037101.
Zeng, A., Shen, Z., Zhou, J., Wu, J., Fan, Y., Wang, Y., & Stanley, H. E. (2017). The science of science: From the perspective of complex systems. Physics Reports, 714/715, 1–73.