Discoverers in scientific citation data

(1)

Discoverers in scientiﬁc citation data

Gui-Yuan Shi

a

_{, Yi-Xiu Kong}

a

_{, Guang-Hui Yuan}

b

_{, Rui-Jie Wu}

a

_{, An Zeng}

c,∗

_,

Matúˇs Medo

d,a,e,∗∗

a_{Department of Physics, University of Fribourg, Fribourg 1700, Switzerland}

b_{Fintech Research Institute, Shanghai University of Finance and Economics, Shanghai 200433, PR China} c_{School of Systems Science, Beijing Normal University, Beijing 100875, PR China}

d_{Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, PR China} e_{Department of Radiation Oncology, Inselspital, University Hospital of Bern and University of Bern, Bern 3010, Switzerland}

Keywords: Citation networks Preferential attachment Microscopic mechanisms Prediction

Identifying the future influential papers among the newly published ones is an important yet challenging issue in bibliometrics. As newly published papers have no or limited citation history, linear extrapolation of their citation counts—which is motivated by the well-known preferential attachment mechanism—is not applicable. We translate the recently intro-duced notion of discoverers to the citation network setting, and show that there are authors who frequently cite recent papers that become highly-cited in the future; these authors are referred to as discoverers. We develop a method for early identification of highly-cited papers based on the early citations from discoverers. The results show that the identified discoverers have a consistent citing pattern over time, and the early citations from them can be used as a valuable indicator to predict the future citation counts of a paper. The discoverers themselves are potential future outstanding researchers as they receive more citations than average.

1. Introduction

Preferential attachment, also known as the rich-get-richer phenomenon, cumulative advantage, or the Matthew effect, is omnipresent in bibliometrics. This phenomenon was ﬁrst systematically studied in 1965 (Price, 1965), and later modeled under the name cumulative advantage process (Price, 1976). At the beginning of the surge of interest in complex networks, preferential attachment was independently proposed to explain the power-law distribution of the number of links con-necting to web pages (Barabási & Albert, 1999). SeeMitzenmacher (2004),Perc (2014)andZeng et al. (2017)for reviews of preferential attachment and related models.

In the context of bibliometric research, the preferential attachment model assumes that the rate at which a paper receives new citations is proportional to the number of citations that the paper already has. The simplistic original model has been later generalized by introducing node ﬁtness (Bianconi & Barabási, 2001;Caldarelli, Capocci, De Los Rios, & Munoz, 2002), aging (Dorogovtsev & Mendes, 2000;Wu, Fu, & Chiu, 2014), node similarity (Papadopoulos, Kitsak, Serrano, Boguná, &

Krioukov, 2012). The introduction of aging is particularly important as in its absence, preferential attachment naturally

leads to strong ﬁrst-mover advantage (Newman, 2009). Without aging, the oldest papers will constistenly be the most cited

∗ Corresponding author. ∗∗ Corresponding author.

E-mail addresses:anzeng@bnu.edu.cn(A. Zeng),matus.medo@unifr.ch(M. Medo).

http://doc.rero.ch

Published in "Journal of Informetrics 13(2): 717–725, 2019"

which should be cited to refer to this work.

(2)

(Berset&Medo,2013;Mariani,Medo,&Zhang,2016).Inparticular,agingincombinationwithheterogeneousnodeﬁtness

(Medo,Cimini,&Gualdi,2011)producesvariousdegreedistributionsthatarecommonlyfoundinrealsystems.InMedoand

Cimini(2016),thismodelwasusedtocompareresearchercitationimpactmetricsinacontrolledsetting.Atthesametime,

therearemanymorefactorsthatcanbeincludedtomakethemodelsmorerealistic(Golosovsky&Solomon,2017;Wang,

Yu,&Yu,2009).

AparticularexampleofextendingthebasicpreferentialparadigmisduetoMedo,Mariani,Zeng,andZhang(2016)who showthatine-commercedatarepresentedbybipartiteuser-itemnetworks,thereareuserswhoarerepeatedlyamongthe ﬁrstincollectingitemsthatbecomepopularmuchlater.Theseusersarereferredtoasdiscoverers(seeCervellini,Menezes,

andMago(2016)forarelatedconceptoftrendsetters.)Onesimpleandfundamentalexplanationofthisobservationisthat

besidesthecommonuserswhoaredrivenbyitempopularity(inagreementwithpreferentialattachment),therearealso userswhoaredrivenbyintrinsicitemfitness.Herefitnessistheintrinsicitemqualityasviewedbyagivenaudience;high fitnessitemsarelikelytobecomepopular.Auserwhoissensitivetoitemfitnesscanthuslinkwithahigh-fitnessitem despiteitslowerpopularity.AsshowninMedoetal.(2016),discoverersareubiquitousacrossvariouse-commercesystems. Anothermechanismthat,albeitfundamentallydifferent,canleadtoasimilarpatternhasbeenputforwardbyWuand

Holme(2009)whostudiedtheprocessofthecitationsmadebyinﬂuentialresearchersbeingcopiedbytheirfollowers,and

itsimpactonthenetworkclusteringcoefﬁcient.

Whilescholarlycitationdatacanbenaturallyrepresentedasanauthor-paperbipartitenetworks,anditisthusnatural toapplytheframeworkofdiscoverersonthem,thishasnotbeendonebefore.Weusethemuch-studiedAmericanPhysical Society(APS)data(Chen&Redner,2010;Martin,Ball,Karrer,&Newman,2013;Medoetal.,2011)toshowthatdiscoverers arecommonalsoincitationdata.Uponidentifyingthediscoverers,wetrackthepapersﬁrstcitedbytheminthefutureand ﬁndthattheyareabove-averagecited.Finally,wefocusonthediscoverersthemselvesandassessthesuccessoftheirown publicationsandtheiroverallimpact(ascomparedwiththeimpactofotherauthors).

2. Dataandmodel

2.1. Datadescriptionandauthornamedisambiguation

WeusethecitationdatamadeavailablebytheAmericanPhysicalSociety(APS).Thedataspanfrom1893to2016with thetimeresolutionofoneday.Thereare593443papers,and5553199citationsamongthemintotal.Notethatthepapers publishedbyReviewsofModernPhysicsareexcludedfromtheanalysis,becausereviewarticlesusuallyhavemanymore citations,sotheyareinprinciple“easytargets”whichshouldnotcontributetotheauthors’recordofearlycitingpapersthat laterbecomeverypopular.Wealsoexcludeself-citations,i.e.,thecitingpaperandthecitedpaperhaveatleastoneauthor incommon.Itisconventionalthatresearcherscitetheirownpapers,sowebelievethisislargelydeviatedfromhowwe deﬁnea“discovery”.

Weconsidertwoauthorstobethesameindividualiftheysatisfythefollowingtwoconditions:(1)Theirlastnames, andtheinitialsofbothfirstnamesandgivennamesareexactlythesame,(2)Onehasbeencitedbyorcitingtheotherone, ortheyhaveco-authorswhosenamesatisfiesthefirstcondition.Afterthenamedisambiguation,wefinallyobtain559940 authorsintotal.

2.2. Model

ForagiventimepointT,thepaperspublishedbetweenT−K−10andT−Kareusedasthetrainingset(here10represents 10years).Toidentifythediscoverers,i.e.theresearcherswhooftenearlycitepapersthatareeventuallyhighly-cited,the ﬁrststepistosetthethresholdsfor“citeearly”and“highly-cited”.Sinceinrealitywecannotknowanyinformationafter timeT,weuseashort-termcitationcountCK(thecitationcountKyearsafterpublication)asaproxyforthepaper’seventual

popularity.AsKincreases,CKbecomesclosertotheeventualpopularity.Atthesametime,largeKmeansthatthetraining

setisfarfromthecurrenttimeTwhichlimitsthepredictivepoweroftheidentiﬁeddiscoverersfortwomainreasons: (1)AuthorsactiveinthetrainingsetmaybecomeinactiveuntiltimeT,(2)Authors’discoveryabilitydoesnothavetobe stationary,sotheauthorswhowerediscoverersinthetrainingsetneednottobediscoverersattimeT.Thepaperswhose CKisinthetopfp%amongallpaperspublishedbetween1970and2006formthetargetgroupofhighly-citedpapers.For

eachtrainingsetpaper,itsﬁrstFcitationsreceivedwithinoneyearafterthepublicationdatearedenotedasexplorations ofthepaper’spotential.Ifapaperturnsouttobeahighly-citedpaper,thoseFexplorationsarealsolabeledasdiscoveries. Theauthorsareassessedbythenumberofexplorationsanddiscoveriesthattheyhavemade(seebelow).Fig.1showsa schematicdiagramoftheevaluationscheme.NotethatoursettingismorecontrolledthanthatusedinMedoetal.(2016)

wherethecurrentpopularityofitemsattimeTwasusedinsteadofCK.TheuseofCKisadvantageouswhenanalyzingcitation

datawherecitationcountsofhighly-citedpapersareknowntoaccumulatecitationsoverextremelylongperiodsoftime

(Golosovsky&Solomon,2013).

Usingtheabove-describedprocedure,wedeterminethetotalnumberofexplorations,Ei,anddiscoveries,Di,foreach

researcheri.Nowthenullhypothesisisthattherearenodifferencesbetweentheauthorsinhowtheycitepapersthatlater becomehighly-cited.Inourterminology,each“exploration”hasthesamechancetobecomea“discovery”(i.e.,eachearly citationhasthesamechancetobeacitationofafuturehighly-citedpaper).Theoverallprobabilitythatanexplorationturns

(3)

Fig.1.Aschematicplotoftheevaluationofdiscoverers.1)˛isapaperinthetrainingset.Ifˇisoneof˛’sﬁrstFcitationswithin1year,ˇ’sauthors’number ofexplorationsincrease1.If˛’scitationafterKyearsranksthetopfp%,ˇ’sauthors’numberofdiscoveriesincrease1.2)isapaperinthetestset,and isﬁrstcitedbyı,ifoneofı’sauthorsisadiscoverer,ispredictedtobeahighly-citedpaper.

outtobeadiscoveryisPd=

Di/

Ei.Underthenullhypothesis,thenumberofdiscoveriesaresearchermakesfollowsthe

binomialdistribution P(D_i;E_i,P_d,H0)=

Ei Di

PDi d (1−Pd)Ei−Di. (1)

Weshalltestwhetherthenullhypothesisissupportedbythedata.Ifnot,howmuchtheresultdisagreewiththenull hypothesis?

3. Results

3.1. Statisticalsigniﬁcanceofdiscoverers

Withalargenumberofresearchers,someofthemcanachievemanydiscoveriesevenifthenullhypothesisactuallyholds. Tobeabletofocusonreallynotableauthors,wecomparethefoundempiricalnumbersofexplorationsanddiscoverieswith theirexpectedbehaviorunderthenullhypothesis.

SimilarlytothediscoveryprobabilityPD,onecanintroducealsotheexplorationprobabilityPe=

Ei/

LiwhereLiisthe

numberofcitationsmadebyauthoriinthetrainingset.Theexpecteddistributionofthenumberofexplorationsisthen closelyinterwoundwiththedistributionofthenumberofcitations.Underthenullhypothesis,theexpectednumberof authorswhomadeEexplorationsis

Ne(E)= ∞

x=1 A(x)

x E

PE e(1−Pe)x−E, (2)

whereA(x)isthenumberofauthorswhocitedxpapers.TheexpectednumberofauthorswhomadeDdiscoveries,Nd(D),can

beobtainedsimilarly.TheresultsinFig.2a,bdemonstratetherearemanyresearcherswhomakemanymoreexplorations anddiscoveriesthanexpectedunderthenullhypothesis.

SimilarlytoMedoetal.(2016),weusethep-valuetoquantifythestatisticalsigniﬁcanceofauthorimakingDidiscoveries

outofEiexplorations.Foragivenauthor,thep-value,i.e.theprobabilityofmakingatleastDidiscoveriesunderthenull

hypothesisH0is Pv(Di,Ei)= Ei

D=Di P(D;Ei,Pd,H0). (3)

Authorachievesmallp-valuePvonlyiftheycitenewpaperssufﬁcientlyoften(whenEiissmallorzero,eventhemaximal

possibleDi=Eimaybeinsufﬁcientforastatisticallysigniﬁcantobservation,seeFig.2dforanillustration),anda

dispro-portionatenumberofthemturnsouttobehighly-citedinthefuture.Toidentifythestatisticallysigniﬁcantdiscoverers, wecontroltheFalseDiscoveryRate(FDR)atthelevelof5%byapplyingtheBenjamini-Hochbergprocedure.Theprocedure worksasfollows:(1)rankalltheactiveNauthorswhomadeatleastoneexplorationinascendingorderofPv;(2)ﬁndthe

largestxsuchthatPv(x)≤x˛/N,here˛istheFDRlevel;(3)thesexresearcherswhohavetheminimalPvareidentiﬁed

asdiscoverers.BenjaminiandHochbergprovedthattheexpectedproportionoffalsediscoveries(identiﬁedasdiscoverers becauseofpureluck)is˛(Benjamini&Hochberg,1995).SeeFig.2cforanexample.

Weﬁndthattheresultsoftheouralgorithmarelittlesensitivetothechoiceofparameters.Toidentifyasmanydiscoverers aspossible,weusemultipleparameterchoicesandmarktheauthorswhoareidentiﬁedbythestatisticalprocedureforat leastoneparametersettingasdiscoverers.Inparticular,weusedF∈{1,...,5},K∈{1,...,10},andfP∈{0.2,0.4,...,2.0},

whichcorrespondsto5×10×10=500differentparametersettings.

Weidentifythediscoverersfrom1990to2006withthemethoddescribedabove.Thetotalnumberofdiscoverersineach year,showninFig.3a,waversaround900.InFig.3b,weshowtheproportionofpapersﬁrstcitedbydiscoverersineach year.About8%ofallpapersareﬁrstcitedbydiscoverersandthuspredictedtobepotentialhighly-citedpapers.

(4)

Fig.2. Statisticalsigniﬁcanceofdiscoverers.WeusetheparametersF=5,K=3,fp=1.0(%),andT=2006asanexample,thatimplyPD=7.6%andPE=5.4%.

(a)Numberofresearcherswhomakedifferentnumberofexplorations(reddots);theblacklinerepresentstheresultunderthenullhypothesis,seeEq.(2).

(b)Numberofresearcherswhomakedifferentnumberofdiscoveries(reddots);theblacklinerepresenttheresultunderthenullhypothesis.(c)Author p-valuesobtainedwithEq.(3)andtheBenjamini-HochbergproceduretocontroltheFalseDiscoveryRate.Thesolidlineisr˛/NwhereNisthenumber ofallactiveauthorswhomadeatleastoneexplorationundertheaboveparametersettingatT(hereN=72931),ristheauthorrank(thex-axis),and˛is theFDRlevel(wechoose5%).Thedotsbelowthelinemarktheauthorswhosignificantlydeviatefromthenullhypothesis(redsymbols);inthiscase,124 discoverersareidentifiedwhosep-valuesarebelowpm=8.9×10−5.(d)Thediscoveryrate(Di/Ei)oftheidentifieddiscoverers.Thesolidlineshowsthe smallestdiscoveryratethatwouldsufficetoachievethep-valueunderpmatagivennumberofexplorations.

Fig.3.(a)Numberofdiscoverersineachyear.(b)TheproportionofpapersﬁrstcitedbydiscoverersamongallthepaperswithC10≥1.

3.2. Anexampleofdiscoverers

Previousstudies(Jeong,Néda,&Barabási,2003;Redner,2004;Wang,Yu,&Yu,2008)showthatmostoftheresearchers citepapersaccordingtothepreferentialattachmentmechanism,andthereexistsastrongtendencyforthoseresearchers tocitepopularpapers.However,wefindthatthereexistsatinyfractionofresearcherswhorepeatedlycitenewpapersthat willbecomehighly-citedinthefuture.Herewegiveanexampleofatypicaldiscovereridentifiedinyear1990,withhis/her publishingtimeseriesillustratedinFig.4.Thediscovererismarkedasdiscoverer301timesamongallthe500parameter settingsatT=1990,whichisthehighestofallresearchers.Thediscoverer’sfirstpaperwaspublishedinJuly,1973. 4. Thebehaviorofdiscoverers

4.1. Futuresuccessofpapers

Sincethediscoverershavehistoricallyhighsuccessinearlycitinghighly-citedpapers,itisnaturaltoinvestigatewhether theyalsohavetheabilitytodiscoverhighly-citedpapersinthefuture.Tothisend,weemployaprocedureillustratedin

Fig.1:usingthedescribedprocedure,weidentifythediscoversatthebeginningofayear(timeT,from1990to2006)and considerallpaperspublishedbetweenTandT+1thathaveC10≥1.Amongthosepapers,ifapaper’sﬁrstcitationcomes

fromanotherpaperthathasatleastonediscovereramongtheauthors,wesaythatthepaperis“ﬁrstcitedbydiscoverers”.

(5)

Fig.4.Themostoutstandingdiscovereridentiﬁedintheyearof1990.Bluebarsshowthetotalcitationcountsofallthepapersthediscovererhascited, andtheredbarsshowthecitationcountofthepaperwhenthediscoverercitedit.

Fig.5.(a)AcomparisonbetweenC10ofthepapersﬁrstcitedbythediscoverers(red),highh-indexauthors(blue),andallresearchers(black),eachdot representsthemeanC10ofpaperspublishedintheyear.Theerrorbarsrepresentthestandarderrorofpapers’C10.(b)C10ofthepapersthatranksrby C10,risthenormalizedrankingofthepaperinthegroup.Theredline,bluelineandblacklinerepresentthegroupﬁrstcitedbydiscoverers,highh-index authorsandallauthorsrespectively.

IntheAPSdata,fromthebeginningof1990totheendof2006,186567papershavebeenpublishedandhaveC10≥1

citation.Amongthem,13157papershavebeenﬁrstcitedbydiscoverers.Onaverage,thepapersﬁrstcitedbydiscoverers haveC10of16.7,whiletheoverallaverage(forpaperswithC10≥1)is9.85.Asanadditionalbenchmark,wecomputethe

h-indexHirsch(2005)ofallauthorsattimeT(herethebeginningofeachyear),selecttheauthorswithhighh-index,and evaluatethepapersthatareﬁrstcitedbyhighh-indexauthors.Theh-indexthresholdsare14–20,differentfromyearto yearsothatthenumberoftheselectedpapersisclosetothenumberofpaperscitedbydiscoverers.Theresulting14001 papersﬁrstcitedbyhighh-indexresearchershaveonaverageC10of14.4.Fig.5ashowsC10ofthepaperschosenbythe

threecriteria(allpaperswithC10≥1,ﬁrstcitedbydiscoverers,andﬁrstcitedbyhighh-indexauthors)bypaperpublication

year.WeseethatthepaperscitedbythediscoverershavethehighestC10ineachyear.AsshowninFig.5b,ifwesortthe

papersbyitsrankingofC10ineachgroup,thepapersﬁrstcitedbydiscoverersoutperformsthepaperswhichhavethesame

rankingsintheothertwogroups.

Inourprocedure,thepapersﬁrstcitedbythediscoverersarefoundtobehighly-citedinthefuture.Tomeasurehow manyhighly-citedpapersaremissedbyourprediction,wecalculatetheFalsePositiveRate(FPR)andtheTruePositiveRate (TPR)inourmethod.TheFPRisdeﬁnedasFPR=FP/Nn,whereFPisthenumberoffalsepositives,andNnisthetotalnumberof

negatives.MeanwhiletheTPRisdeﬁnedasTPR=TP/Np,whereTPisthenumberoftruepositives,andNpisthetotalnumber

ofpositives.

WedeﬁneapaperthathasC10>68(ranksinthetop1%ofallpaperswithC10≥1from1990to2006)ashighly-cited.As

aconsequence,atruepositiveeventisdefinedasapaperfirstcitedbydiscoverershavingC10>68inthefuture.Wefindthe

numberofoccasionsoftruepositiveeventsTPd=368.Afalsepositiveeventisdeﬁnedasapaperﬁrstcitedbydiscoverer

havingC10<68inthefuture.WeobtainintotalFPd=12789falsepositiveevents.ThesevaluesimplytheFPRFPRd=6.9%

andtheTPRTPR_d=20.3%forthepredictionofhighly-citedpapersusingdiscoverers.Forcomparison,wealsostudythe papersﬁrstcitedbyhighh-indexauthors,ﬁndingtheFPRFPRh=7.4%andtheTPRTPRh=17.0%.Discoverer-basedmethod

outperformsthemethodbasedonhighh-indexauthorsinbothFPRandTPR.Wefurthertestedotherthresholdsofdeﬁning thepositiveeventsandﬁndthattheconclusionholdsforallthresholds.

4.2. Discoverersandsocialinﬂuence

Theobservationthatthepapersfirstcitedbythediscoverersaremorelikelytobecitedinthefuturecallsforpossible explanations.InlinewithMedoetal.(2016),oneexplanationisthatthediscoverersarefitness-drivenresearcherswhoare sensitivetotheintrinsicqualityofpapers,thusthepapersthattheycitehaveoverallhigherqualitywhichinturnleadsto themorecitations.Anotherpossibleexplanationisthatthediscoverersthemselvesareinfluentialresearchers,whichmeans

(6)

Fig.6.(a)Acomparisonbetweenthepapersfirstcitedbydiscoverersandallpapers(c10≥1).Theredbarsindicatethefirstcitation,greenbarsandorange barsrepresentthosecitationsfrompaperswhichcitedthefirstcitationpaperaswell,ornot,respectively.See(b)forillustration.

theyhavemanyfollowerswhofrequentlycitetheirpapersandalsopossiblycopytheirreferences(Wu&Holme,2009). Suchcopyingofreferencesisknowntobeanimportantelementofthegrowthofcitationdata(Simkin&Roychowdhury,

2005,2007).Inaddition,whenenteringanewresearchﬁeld,apossiblewaytostartistoreviewthemilestonepapers

byoutstandingresearchersintheﬁeld(andthereferencestherein).Insummary,itisplausiblethatresearchersaremore exposedtothepapersthatwerecitedbyfamousresearchers.

Tobetterunderstandtheroleofdiscoverersandtheircontributiontoboostingpaperpopularity,itiscrucialtodistinguish betweenimpactsof thesetwo effects.Similareffectshave beenrarelydiscussedbeforebecausemostdatasetslacka directwaytodeterminewhetherachoiceisanindependentdecisionorcopyingthebehaviorofothers.Fortunately,the scientiﬁccitationdatagiveusaninformationaboutthepathsofknowledgepropagationandthusprovideusawaytofurther investigatethisproblem.

Tothisend,wedividepaper˛’scitationsintothreegroups: 1 Theﬁrstcitationofpaper˛islabeledasˇ.(RedbarinFig.6a)

2Papersthatciteboth˛andˇ.Thesepaperscanbefurtherdividedintothreetypes:(i)Thepapersthatcite˛becauseof theinformationfromˇ.(ii)Thepapersthatciteˇbecauseoftheinformationfrom˛.(iii)Thepapersthatciteboth˛and ˇjustbecauseofcoincidenceorotherunknowncauses.Itisgenerallydifﬁculttodistinguishthesetypesofcitations.For simplicity,weconsiderthenumberofallthesepapersasameasureofupperlimitofthesocialinﬂuenceofˇ.(Greenbar inFig.6a)

3Papersthatcite˛butdonotciteˇ.(OrangebarinFig.6a)

Whileclasstwocorrespondstothehypothesisthatthediscoverersutilizetheirsocialinfluenceandboostthecitation countofatargetpaperbyhavingitco-citedwiththeirpaper,classthreecorrespondstothehypothesisofdiscoverersbeing thefirstonestospothighqualitypapers.AsshowninFig.6a,thethirdclassofcitationsisclearlythedominantcomponent. Whilesomesocialinfluenceofdiscoverersstillmayexist,weseethatcopyingofthediscoverers’behaviorisaminorinfluence here.

4.3. Dothediscoverersactasfollowers?

Anotherquestioniswhetherthesuccessofdiscovererscanpossiblylieinpreferentiallyciting(following)highly-cited authorsthat,inturn,increasestheirchanceofmakingdiscoveries.Toinvestigatethispossibility,wecomputethemaximum h-indexofeachpaper’sauthorsatthetimethatthepaperpublished,hmax.Wethendividethepapersingroupsbyhmax.

AsshowninFig.7,discoverershavepreferenceincitingpaperswithhighhmax,forexample,3.12%ofpapersﬁrstcitedby

discoverershavehmax≥30,whileforallthepaperswithC10≥1,theproportionis1.12%.Despitethepreference,ineach

hmaxgroup,papersﬁrstcitedbythediscoverersreceivesigniﬁcantlymorecitationsthanthosecitedbyallauthors,this

showsthatthepapersﬁrstcitedbythediscoverersareonaveragemorecitedregardlessofwhetherthepapers’authorsare highly-citedornot.

4.4. Papersauthoredbythediscoverers

Anotherpertinentquestioniswhetherthediscoverers’papersalsoreceivemorecitations.Fromthe233087papers publishedin1990–2006,16358papershaveatleastonediscovereramongtheirauthors,and17966papershaveatleast onehighh-indexauthoramongtheirauthors(Theh-indexthresholdsare14–21,differentfromyeartoyearsothatthe

(7)

Fig.7. Acomparisonofc10forthepapersﬁrstcitedbythediscoverers(bluebars),andallauthors(yellowbars).Paperspublishedin1990–2006are considered.Thepapersaredividedingroupsbytheirhmax(thehighesth-indexoftheirauthorsinthepaper’spublicationyear).Theerrorbarsrepresent thestandarderrors.Thepercentagesabovethebarsindicatetheproportionofthecontributingpapers.

Fig.8.(a)AcomparisonbetweenC10ofthepapersauthoredbythediscoverers(red),highh-indexauthors(blue),andallauthors(black),eachdotrepresents themeanC10ofallpaperspublishedintheyear.Theerrorbarsrepresentthestandarderrorofpapers’C10.(b)C10ofthepapersthatranksrbyC10,risthe normalizedrankingofthepaperinthegroup.Theredline,bluelineandblacklinerepresentthegroupauthoredbydiscoverers,highh-indexauthorsand allauthorsrespectively.

Table1

MeananditsstandarderrorofC10forvarioussubsetsofpapers;Nsdenotesthesubsetsize.Thebottompartincludestheresultsforvariousintersections ofpapersubsetsdiscussedinthetoppart.

Firstcitedby Authoredby

(A)Discoverers (B)Highh-index All (C)Discoverers (D)Highh-index All

C10 16.7±0.3 14.4±0.2 9.85±0.05 17.1±0.2 14.9±0.2 7.88±0.03

Ns 13157 14001 186567 16358 17966 233087

A∩B A∩C A∩D B∩C B∩D C∩D

C10 17.2±0.4 24.4±0.7 24.3±0.8 23.0±0.8 21.5±0.8 19.6±0.4

Ns 5490 3172 2292 2211 2161 6822

numberoftheselectedpapersisclosetothenumberofpapersauthoredbydiscoverers.).MeanC10valuesforthesethree

groupsofpapersare7.88,17.1,and14.9,respectively.Similarlytothepapersﬁrstcitedbythediscoverers,papersauthored bythediscoverersalsohavesigniﬁcantcitationleadoverthetwobenchmarkgroups(seeFig.8aforacomparisonbypaper publicationyear).

4.5. Furtherimprovingthepredictions

Finally,wecomparethecitationcountsofvarioussubsetsofpapersinTable1.Notably,papersbelongingtomultiple inter-sectionsofthepapersubsetsdiscussedsofar(e.g.,papersthatarebothﬁrstcitedandauthoredbyadiscoverer—intersection A∩C)haveevenhigherC10whichsuggestsfurtherpossibilitiesforstudyingthepredictivepowerofvariousauthorsubsets

(herewefocusedonthediscoverersandusedhighh-indexauthorsasabenchmark)inthefuture.

(8)

5. Discussion

Thispaperrevealsaheterogeneityofresearcher’scitingbehavior.Weidentifytheresearchers—thediscoverers—who frequentlyciteamongthefirstpapersthatbecomehighly-citedinthefuture.Wefocusonthesediscovererstoassesstheir futurecitingbehavior.Ourresultsshowthatthediscoverers’abilityispersistentintimetotheextentthatthediscoverers identifiedattimeTcanbeusedtoidentifyfuturehighly-citedpapers.Weshowthatthesocialinfluenceplaysaminorrole inexplainingthediscoverers’behavior,forwhichthemainhypothesisstillremainsthatthediscoverersaretheauthorsthat aredriven(oratleastmoresensitivethantheaverage)bytheintrinsicpaperfitness(asoriginallysuggestedbyMedoetal.

(2016)).

Despitethehugesuccessofpreferentialattachmentmechanisminexplainingthericher-get-richerphenomenon,there isanumberofcitationpatternsthatitcannotaccountfor:newlypublishedpapersquicklybecomingpopular,andhighly citedoldpapersbeingnotoftencitedanymore,forexample.Tosolvethesedeviationsfromthepreferentialattachment, researchershavedevelopedmodelslikethefitnessmodel(Bianconi&Barabási,2001;Caldarellietal.,2002)andtheaging model(Dorogovtsev&Mendes,2000;Wuetal.,2014).Whilethesetwomodelsareabletoprovidepossibleexplanations fortheanomaliesinthecitationdatadivergingfromthepreferentialattachmentprediction,adetailedmodelthatutilizes boththefitnessinformationandtemporalinformationtoextractthefitness-drivenbehaviorofusershasbeenlackingfor years.Ourworkprovidesawell-calibratedmethodassumingamixofuserswithdifferentsensitivitytopaperfitness.

Wedividetheresearchersintotwogroups:fitness-drivenresearchersandpopularity-drivenresearchers.Fitness-driven researchers,namelytheDiscoverers,canfindhigh-qualitypapersatearlystageandincreasetheearlycitationofthese papers,whichmakeshigh-qualitypaperseasiertobenoticedbypopularity-drivenresearchersandfinallystandout.This contributestotherelationbetweenthequalityofapaperanditscitationcount.Althoughthefitnessmodelservingas thefundamentalofthispapercouldofferapossibleexplanationofthephenomenonthatafewresearcherscanconstantly discoverthenewpapersthatcanbecomepopularinthefuture,acomprehensiveunderstandingofwhytheseresearchers havethepredictivitystillawaitfurtherstudies.

Ourfindingscanbeusedforearlyassessmentofpaperpotentialimpact.Accordingtoourstudy,focusingonthe discover-ersandassigningmoreweighttotheircitationswhenevaluatingnewpapersishelpfulinfindinghigh-qualitypapersshort aftertheyarepublished.Anotherinterestingresultisthatthepapersauthoredbydiscovererstendtoreceivemorecitations thanthoseofresearcherswithhighh-index.Forthefirsttime,thispaperrevealsacorrelationbetweenresearchers’citing behaviorandtheiracademicpotential.Thismayinturnprovidenewinputsfortheproblemofidentifyingpotentialfuture outstandingresearchers.

Authorcontributions

Guan-YuanShi:Collectedthedata;Performedtheanalysis;Wrotethepaper. Yi-XiuKong:Collectedthedata;Performedtheanalysis.

Guang-HuiYuan:Collectedthedata;Performedtheanalysis. Rui-JieWu:Performedtheanalysis;Wrotethepaper.

AnZeng:Concievedanddesignedtheanalysis;Contributeddataoranalysistools;Wrotethepaper. MatusMedo:Concievedanddesignedtheanalysis;Contributeddataoranalysistools;Wrotethepaper. Acknowledgments

TheauthorswouldliketothankProf.Yi-ChengZhangforhelpfuldiscussion.ThisworkwassupportedbytheNational NaturalScienceFoundationofChina(GrantNo.61603046),theNaturalScienceFoundationofBeijing(GrantNo.L160008), andtheChinaScholarshipCouncil.

References

Barabási,A.L.,&Albert,R.(1999).Emergenceofscalinginrandomnetworks.Science,286(5439),509–512.

Benjamini,Y.,&Hochberg,Y.(1995).Controllingthefalsediscoveryrate:apracticalandpowerfulapproachtomultipletesting.JournaloftheRoyal

StatisticalSociety.SeriesB(Methodological),289–300.

Berset,Y.,&Medo,M.(2013).Theeffectoftheinitialnetworkconﬁgurationonpreferentialattachment.EuropeanPhysicalJournalB,86,260.

Bianconi,G.,&Barabási,A.L.(2001).Bose-Einsteincondensationincomplexnetworks.PhysicalReviewLetters,86(24),5632.

Caldarelli,G.,Capocci,A.,DeLosRios,P.,&Munoz,M.A.(2002).Scale-freenetworksfromvaryingvertexintrinsicﬁtness.PhysicalReviewLetters,89(25), 258702.

Cervellini,P.,Menezes,A.G.,&Mago,V.K.(2016).FindingTrendsettersonYelpDataset.2016IEEESymposiumSeriesonComputationalIntelligence(SSCI).pp. 1.

Chen,P.,&Redner,S.(2010).Communitystructureofthephysicalreviewcitationnetwork.JournalofInformetrics,4,p278.

Dorogovtsev,S.N.,&Mendes,J.F.F.(2000).Evolutionofnetworkswithagingofsites.PhysicalReviewE,62(2),1842.

Golosovsky,M.,&Solomon,S.(2013).Thetransitiontowardsimmortality:Non-linearautocatalyticgrowthofcitationstoscientiﬁcpapers.Journalof

StatisticalPhysics,151,340.

Golosovsky,M.,&Solomon,S.(2017).Growingcomplexnetworkofcitationsofscientiﬁcpapers:Modelingandmeasurements.PhysicalReviewE,95, 012324.

Hirsch,J.E.(2005).Anindextoquantifyanindividual’sscientiﬁcresearchoutput.ProceedingsoftheNationalacademyofSciencesoftheUnitedStatesof

America,102(46),16569.

(9)

Jeong,H.,Néda,Z.,&Barabási,A.L.(2003).Measuringpreferentialattachmentinevolvingnetworks.EPL(EurophysicsLetters),61(4),567.

Mariani,M.S.,Medo,M.,&Zhang,Y.C.(2016).Identiﬁcationofmilestonepapersthroughtime-balancednetworkcentrality.JournalofInformetrics,10(4), 1207–1223.

Martin,T.,Ball,B.,Karrer,B.,&Newman,M.E.J.(2013).CoauthorshipandcitationpatternsinthePhysicalReview.PhysicalReviewE,88,012814.

Medo,M.,Cimini,G.,&Gualdi,S.(2011).Temporaleffectsinthegrowthofnetworks.PhysicalReviewLetters,107(23),238701.

Medo,M.,Mariani,M.S.,Zeng,A.,&Zhang,Y.C.(2016).Identiﬁcationandimpactofdiscoverersinonlinesocialsystems.ScientiﬁcReports,6,34218.

Medo,M.,&Cimini,G.(2016).Model-basedevaluationofscientiﬁcimpactindicators.PhysicalReviewE,94,032312.

Mitzenmacher,M.(2004).Abriefhistoryofgenerativemodelsforpowerlawandlognormaldistributions.InternetMathematics,1,226.

Newman,M.E.(2009).Theﬁrst-moveradvantageinscientiﬁcpublication.EPL(EurophysicsLetters),86(6),68001.

Papadopoulos,F.,Kitsak,M.,Serrano,M.Á.,Boguná,M.,&Krioukov,D.(2012).Popularityversussimilarityingrowingnetworks.Nature,489(7417),537.

Perc,M.(2014).TheMattheweffectinempiricaldata.JournalofTheRoyalSocietyInterface,11,20140378.

Price,D.J.D.S.(1965).Networksofscientiﬁcpapers.Science,510–515.

Price,D.D.S.(1976).Ageneraltheoryofbibliometricandothercumulativeadvantageprocesses.JournaloftheAssociationforInformationScienceand

Technology,27(5),292–306.

Redner,S.(2004).Citationstatisticsfrommorethanacenturyofphysicalreview.,arXivpreprintphysics/0407137.

Simkin,M.V.,&Roychowdhury,V.P.(2005).Stochasticmodelingofcitationslips.Scientometrics,62,367.

Simkin,M.V.,&Roychowdhury,V.P.(2007).Amathematicaltheoryofciting.JournaloftheAmericanSocietyforInformationScienceandTechnology,58, 1661.

Wang,M.,Yu,G.,&Yu,D.(2008).Measuringthepreferentialattachmentmechanismincitationnetworks.PhysicaA:StatisticalMechanicsandits

Applications,387(18),4692–4698.

Wang,M.,Yu,G.,&Yu,D.(2009).Effectoftheageofpapersonthepreferentialattachmentincitationnetworks.PhysicaA:StatisticalMechanicsandits

Applications,388(19),4273–4276.

Wu,Y.,Fu,T.Z.,&Chiu,D.M.(2014).Generalizedpreferentialattachmentconsideringaging.JournalofInformetrics,8(3),650–658.

Wu,Z.X.,&Holme,P.(2009).Modelingscientiﬁc-citationpatternsandothertriangle-richacyclicnetworks.PhysicalReviewE,80(3),037101.

Zeng,A.,Shen,Z.,Zhou,J.,Wu,J.,Fan,Y.,Wang,Y.,&Stanley,H.E.(2017).Thescienceofscience:Fromtheperspectiveofcomplexsystems.Physics

Reports,714/715,1–73.