• Aucun résultat trouvé

Discoverers in scientific citation data

N/A
N/A
Protected

Academic year: 2021

Partager "Discoverers in scientific citation data"

Copied!
9
0
0

Texte intégral

(1)

Discoverers in scientific citation data

Gui-Yuan Shi

a

, Yi-Xiu Kong

a

, Guang-Hui Yuan

b

, Rui-Jie Wu

a

, An Zeng

c,∗

,

Matúˇs Medo

d,a,e,∗∗

aDepartment of Physics, University of Fribourg, Fribourg 1700, Switzerland

bFintech Research Institute, Shanghai University of Finance and Economics, Shanghai 200433, PR China cSchool of Systems Science, Beijing Normal University, Beijing 100875, PR China

dInstitute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, PR China eDepartment of Radiation Oncology, Inselspital, University Hospital of Bern and University of Bern, Bern 3010, Switzerland

Keywords: Citation networks Preferential attachment Microscopic mechanisms Prediction

Identifying the future influential papers among the newly published ones is an important yet challenging issue in bibliometrics. As newly published papers have no or limited citation history, linear extrapolation of their citation counts—which is motivated by the well-known preferential attachment mechanism—is not applicable. We translate the recently intro-duced notion of discoverers to the citation network setting, and show that there are authors who frequently cite recent papers that become highly-cited in the future; these authors are referred to as discoverers. We develop a method for early identification of highly-cited papers based on the early citations from discoverers. The results show that the identified discoverers have a consistent citing pattern over time, and the early citations from them can be used as a valuable indicator to predict the future citation counts of a paper. The discoverers themselves are potential future outstanding researchers as they receive more citations than average.

1. Introduction

Preferential attachment, also known as the rich-get-richer phenomenon, cumulative advantage, or the Matthew effect, is omnipresent in bibliometrics. This phenomenon was first systematically studied in 1965 (Price, 1965), and later modeled under the name cumulative advantage process (Price, 1976). At the beginning of the surge of interest in complex networks, preferential attachment was independently proposed to explain the power-law distribution of the number of links con-necting to web pages (Barabási & Albert, 1999). SeeMitzenmacher (2004),Perc (2014)andZeng et al. (2017)for reviews of preferential attachment and related models.

In the context of bibliometric research, the preferential attachment model assumes that the rate at which a paper receives new citations is proportional to the number of citations that the paper already has. The simplistic original model has been later generalized by introducing node fitness (Bianconi & Barabási, 2001;Caldarelli, Capocci, De Los Rios, & Munoz, 2002), aging (Dorogovtsev & Mendes, 2000;Wu, Fu, & Chiu, 2014), node similarity (Papadopoulos, Kitsak, Serrano, Boguná, &

Krioukov, 2012). The introduction of aging is particularly important as in its absence, preferential attachment naturally

leads to strong first-mover advantage (Newman, 2009). Without aging, the oldest papers will constistenly be the most cited

∗ Corresponding author. ∗∗ Corresponding author.

E-mail addresses:anzeng@bnu.edu.cn(A. Zeng),matus.medo@unifr.ch(M. Medo).

http://doc.rero.ch

Published in "Journal of Informetrics 13(2): 717–725, 2019"

which should be cited to refer to this work.

(2)

(Berset&Medo,2013;Mariani,Medo,&Zhang,2016).Inparticular,agingincombinationwithheterogeneousnodefitness

(Medo,Cimini,&Gualdi,2011)producesvariousdegreedistributionsthatarecommonlyfoundinrealsystems.InMedoand

Cimini(2016),thismodelwasusedtocompareresearchercitationimpactmetricsinacontrolledsetting.Atthesametime,

therearemanymorefactorsthatcanbeincludedtomakethemodelsmorerealistic(Golosovsky&Solomon,2017;Wang,

Yu,&Yu,2009).

AparticularexampleofextendingthebasicpreferentialparadigmisduetoMedo,Mariani,Zeng,andZhang(2016)who showthatine-commercedatarepresentedbybipartiteuser-itemnetworks,thereareuserswhoarerepeatedlyamongthe firstincollectingitemsthatbecomepopularmuchlater.Theseusersarereferredtoasdiscoverers(seeCervellini,Menezes,

andMago(2016)forarelatedconceptoftrendsetters.)Onesimpleandfundamentalexplanationofthisobservationisthat

besidesthecommonuserswhoaredrivenbyitempopularity(inagreementwithpreferentialattachment),therearealso userswhoaredrivenbyintrinsicitemfitness.Herefitnessistheintrinsicitemqualityasviewedbyagivenaudience;high fitnessitemsarelikelytobecomepopular.Auserwhoissensitivetoitemfitnesscanthuslinkwithahigh-fitnessitem despiteitslowerpopularity.AsshowninMedoetal.(2016),discoverersareubiquitousacrossvariouse-commercesystems. Anothermechanismthat,albeitfundamentallydifferent,canleadtoasimilarpatternhasbeenputforwardbyWuand

Holme(2009)whostudiedtheprocessofthecitationsmadebyinfluentialresearchersbeingcopiedbytheirfollowers,and

itsimpactonthenetworkclusteringcoefficient.

Whilescholarlycitationdatacanbenaturallyrepresentedasanauthor-paperbipartitenetworks,anditisthusnatural toapplytheframeworkofdiscoverersonthem,thishasnotbeendonebefore.Weusethemuch-studiedAmericanPhysical Society(APS)data(Chen&Redner,2010;Martin,Ball,Karrer,&Newman,2013;Medoetal.,2011)toshowthatdiscoverers arecommonalsoincitationdata.Uponidentifyingthediscoverers,wetrackthepapersfirstcitedbytheminthefutureand findthattheyareabove-averagecited.Finally,wefocusonthediscoverersthemselvesandassessthesuccessoftheirown publicationsandtheiroverallimpact(ascomparedwiththeimpactofotherauthors).

2. Dataandmodel

2.1. Datadescriptionandauthornamedisambiguation

WeusethecitationdatamadeavailablebytheAmericanPhysicalSociety(APS).Thedataspanfrom1893to2016with thetimeresolutionofoneday.Thereare593443papers,and5553199citationsamongthemintotal.Notethatthepapers publishedbyReviewsofModernPhysicsareexcludedfromtheanalysis,becausereviewarticlesusuallyhavemanymore citations,sotheyareinprinciple“easytargets”whichshouldnotcontributetotheauthors’recordofearlycitingpapersthat laterbecomeverypopular.Wealsoexcludeself-citations,i.e.,thecitingpaperandthecitedpaperhaveatleastoneauthor incommon.Itisconventionalthatresearcherscitetheirownpapers,sowebelievethisislargelydeviatedfromhowwe definea“discovery”.

Weconsidertwoauthorstobethesameindividualiftheysatisfythefollowingtwoconditions:(1)Theirlastnames, andtheinitialsofbothfirstnamesandgivennamesareexactlythesame,(2)Onehasbeencitedbyorcitingtheotherone, ortheyhaveco-authorswhosenamesatisfiesthefirstcondition.Afterthenamedisambiguation,wefinallyobtain559940 authorsintotal.

2.2. Model

ForagiventimepointT,thepaperspublishedbetweenT−K−10andT−Kareusedasthetrainingset(here10represents 10years).Toidentifythediscoverers,i.e.theresearcherswhooftenearlycitepapersthatareeventuallyhighly-cited,the firststepistosetthethresholdsfor“citeearly”and“highly-cited”.Sinceinrealitywecannotknowanyinformationafter timeT,weuseashort-termcitationcountCK(thecitationcountKyearsafterpublication)asaproxyforthepaper’seventual

popularity.AsKincreases,CKbecomesclosertotheeventualpopularity.Atthesametime,largeKmeansthatthetraining

setisfarfromthecurrenttimeTwhichlimitsthepredictivepoweroftheidentifieddiscoverersfortwomainreasons: (1)AuthorsactiveinthetrainingsetmaybecomeinactiveuntiltimeT,(2)Authors’discoveryabilitydoesnothavetobe stationary,sotheauthorswhowerediscoverersinthetrainingsetneednottobediscoverersattimeT.Thepaperswhose CKisinthetopfp%amongallpaperspublishedbetween1970and2006formthetargetgroupofhighly-citedpapers.For

eachtrainingsetpaper,itsfirstFcitationsreceivedwithinoneyearafterthepublicationdatearedenotedasexplorations ofthepaper’spotential.Ifapaperturnsouttobeahighly-citedpaper,thoseFexplorationsarealsolabeledasdiscoveries. Theauthorsareassessedbythenumberofexplorationsanddiscoveriesthattheyhavemade(seebelow).Fig.1showsa schematicdiagramoftheevaluationscheme.NotethatoursettingismorecontrolledthanthatusedinMedoetal.(2016)

wherethecurrentpopularityofitemsattimeTwasusedinsteadofCK.TheuseofCKisadvantageouswhenanalyzingcitation

datawherecitationcountsofhighly-citedpapersareknowntoaccumulatecitationsoverextremelylongperiodsoftime

(Golosovsky&Solomon,2013).

Usingtheabove-describedprocedure,wedeterminethetotalnumberofexplorations,Ei,anddiscoveries,Di,foreach

researcheri.Nowthenullhypothesisisthattherearenodifferencesbetweentheauthorsinhowtheycitepapersthatlater becomehighly-cited.Inourterminology,each“exploration”hasthesamechancetobecomea“discovery”(i.e.,eachearly citationhasthesamechancetobeacitationofafuturehighly-citedpaper).Theoverallprobabilitythatanexplorationturns

(3)

Fig.1.Aschematicplotoftheevaluationofdiscoverers.1)˛isapaperinthetrainingset.Ifˇisoneof˛’sfirstFcitationswithin1year,ˇ’sauthors’number ofexplorationsincrease1.If˛’scitationafterKyearsranksthetopfp%,ˇ’sauthors’numberofdiscoveriesincrease1.2)isapaperinthetestset,and isfirstcitedbyı,ifoneofı’sauthorsisadiscoverer,ispredictedtobeahighly-citedpaper.

outtobeadiscoveryisPd=



Di/



Ei.Underthenullhypothesis,thenumberofdiscoveriesaresearchermakesfollowsthe

binomialdistribution P(Di;Ei,Pd,H0)=



Ei Di



PDi d (1−Pd)Ei−Di. (1)

Weshalltestwhetherthenullhypothesisissupportedbythedata.Ifnot,howmuchtheresultdisagreewiththenull hypothesis?

3. Results

3.1. Statisticalsignificanceofdiscoverers

Withalargenumberofresearchers,someofthemcanachievemanydiscoveriesevenifthenullhypothesisactuallyholds. Tobeabletofocusonreallynotableauthors,wecomparethefoundempiricalnumbersofexplorationsanddiscoverieswith theirexpectedbehaviorunderthenullhypothesis.

SimilarlytothediscoveryprobabilityPD,onecanintroducealsotheexplorationprobabilityPe=



Ei/



LiwhereLiisthe

numberofcitationsmadebyauthoriinthetrainingset.Theexpecteddistributionofthenumberofexplorationsisthen closelyinterwoundwiththedistributionofthenumberofcitations.Underthenullhypothesis,theexpectednumberof authorswhomadeEexplorationsis

Ne(E)= ∞



x=1 A(x)



x E



PE e(1−Pe)x−E, (2)

whereA(x)isthenumberofauthorswhocitedxpapers.TheexpectednumberofauthorswhomadeDdiscoveries,Nd(D),can

beobtainedsimilarly.TheresultsinFig.2a,bdemonstratetherearemanyresearcherswhomakemanymoreexplorations anddiscoveriesthanexpectedunderthenullhypothesis.

SimilarlytoMedoetal.(2016),weusethep-valuetoquantifythestatisticalsignificanceofauthorimakingDidiscoveries

outofEiexplorations.Foragivenauthor,thep-value,i.e.theprobabilityofmakingatleastDidiscoveriesunderthenull

hypothesisH0is Pv(Di,Ei)= Ei



D=Di P(D;Ei,Pd,H0). (3)

Authorachievesmallp-valuePvonlyiftheycitenewpaperssufficientlyoften(whenEiissmallorzero,eventhemaximal

possibleDi=Eimaybeinsufficientforastatisticallysignificantobservation,seeFig.2dforanillustration),anda

dispro-portionatenumberofthemturnsouttobehighly-citedinthefuture.Toidentifythestatisticallysignificantdiscoverers, wecontroltheFalseDiscoveryRate(FDR)atthelevelof5%byapplyingtheBenjamini-Hochbergprocedure.Theprocedure worksasfollows:(1)rankalltheactiveNauthorswhomadeatleastoneexplorationinascendingorderofPv;(2)findthe

largestxsuchthatPv(x)≤x˛/N,here˛istheFDRlevel;(3)thesexresearcherswhohavetheminimalPvareidentified

asdiscoverers.BenjaminiandHochbergprovedthattheexpectedproportionoffalsediscoveries(identifiedasdiscoverers becauseofpureluck)is˛(Benjamini&Hochberg,1995).SeeFig.2cforanexample.

Wefindthattheresultsoftheouralgorithmarelittlesensitivetothechoiceofparameters.Toidentifyasmanydiscoverers aspossible,weusemultipleparameterchoicesandmarktheauthorswhoareidentifiedbythestatisticalprocedureforat leastoneparametersettingasdiscoverers.Inparticular,weusedF∈{1,...,5},K∈{1,...,10},andfP∈{0.2,0.4,...,2.0},

whichcorrespondsto5×10×10=500differentparametersettings.

Weidentifythediscoverersfrom1990to2006withthemethoddescribedabove.Thetotalnumberofdiscoverersineach year,showninFig.3a,waversaround900.InFig.3b,weshowtheproportionofpapersfirstcitedbydiscoverersineach year.About8%ofallpapersarefirstcitedbydiscoverersandthuspredictedtobepotentialhighly-citedpapers.

(4)

Fig.2. Statisticalsignificanceofdiscoverers.WeusetheparametersF=5,K=3,fp=1.0(%),andT=2006asanexample,thatimplyPD=7.6%andPE=5.4%.

(a)Numberofresearcherswhomakedifferentnumberofexplorations(reddots);theblacklinerepresentstheresultunderthenullhypothesis,seeEq.(2).

(b)Numberofresearcherswhomakedifferentnumberofdiscoveries(reddots);theblacklinerepresenttheresultunderthenullhypothesis.(c)Author p-valuesobtainedwithEq.(3)andtheBenjamini-HochbergproceduretocontroltheFalseDiscoveryRate.Thesolidlineisr˛/NwhereNisthenumber ofallactiveauthorswhomadeatleastoneexplorationundertheaboveparametersettingatT(hereN=72931),ristheauthorrank(thex-axis),and˛is theFDRlevel(wechoose5%).Thedotsbelowthelinemarktheauthorswhosignificantlydeviatefromthenullhypothesis(redsymbols);inthiscase,124 discoverersareidentifiedwhosep-valuesarebelowpm=8.9×10−5.(d)Thediscoveryrate(Di/Ei)oftheidentifieddiscoverers.Thesolidlineshowsthe smallestdiscoveryratethatwouldsufficetoachievethep-valueunderpmatagivennumberofexplorations.

Fig.3.(a)Numberofdiscoverersineachyear.(b)TheproportionofpapersfirstcitedbydiscoverersamongallthepaperswithC10≥1.

3.2. Anexampleofdiscoverers

Previousstudies(Jeong,Néda,&Barabási,2003;Redner,2004;Wang,Yu,&Yu,2008)showthatmostoftheresearchers citepapersaccordingtothepreferentialattachmentmechanism,andthereexistsastrongtendencyforthoseresearchers tocitepopularpapers.However,wefindthatthereexistsatinyfractionofresearcherswhorepeatedlycitenewpapersthat willbecomehighly-citedinthefuture.Herewegiveanexampleofatypicaldiscovereridentifiedinyear1990,withhis/her publishingtimeseriesillustratedinFig.4.Thediscovererismarkedasdiscoverer301timesamongallthe500parameter settingsatT=1990,whichisthehighestofallresearchers.Thediscoverer’sfirstpaperwaspublishedinJuly,1973. 4. Thebehaviorofdiscoverers

4.1. Futuresuccessofpapers

Sincethediscoverershavehistoricallyhighsuccessinearlycitinghighly-citedpapers,itisnaturaltoinvestigatewhether theyalsohavetheabilitytodiscoverhighly-citedpapersinthefuture.Tothisend,weemployaprocedureillustratedin

Fig.1:usingthedescribedprocedure,weidentifythediscoversatthebeginningofayear(timeT,from1990to2006)and considerallpaperspublishedbetweenTandT+1thathaveC10≥1.Amongthosepapers,ifapaper’sfirstcitationcomes

fromanotherpaperthathasatleastonediscovereramongtheauthors,wesaythatthepaperis“firstcitedbydiscoverers”.

(5)

Fig.4.Themostoutstandingdiscovereridentifiedintheyearof1990.Bluebarsshowthetotalcitationcountsofallthepapersthediscovererhascited, andtheredbarsshowthecitationcountofthepaperwhenthediscoverercitedit.

Fig.5.(a)AcomparisonbetweenC10ofthepapersfirstcitedbythediscoverers(red),highh-indexauthors(blue),andallresearchers(black),eachdot representsthemeanC10ofpaperspublishedintheyear.Theerrorbarsrepresentthestandarderrorofpapers’C10.(b)C10ofthepapersthatranksrby C10,risthenormalizedrankingofthepaperinthegroup.Theredline,bluelineandblacklinerepresentthegroupfirstcitedbydiscoverers,highh-index authorsandallauthorsrespectively.

IntheAPSdata,fromthebeginningof1990totheendof2006,186567papershavebeenpublishedandhaveC10≥1

citation.Amongthem,13157papershavebeenfirstcitedbydiscoverers.Onaverage,thepapersfirstcitedbydiscoverers haveC10of16.7,whiletheoverallaverage(forpaperswithC10≥1)is9.85.Asanadditionalbenchmark,wecomputethe

h-indexHirsch(2005)ofallauthorsattimeT(herethebeginningofeachyear),selecttheauthorswithhighh-index,and evaluatethepapersthatarefirstcitedbyhighh-indexauthors.Theh-indexthresholdsare14–20,differentfromyearto yearsothatthenumberoftheselectedpapersisclosetothenumberofpaperscitedbydiscoverers.Theresulting14001 papersfirstcitedbyhighh-indexresearchershaveonaverageC10of14.4.Fig.5ashowsC10ofthepaperschosenbythe

threecriteria(allpaperswithC10≥1,firstcitedbydiscoverers,andfirstcitedbyhighh-indexauthors)bypaperpublication

year.WeseethatthepaperscitedbythediscoverershavethehighestC10ineachyear.AsshowninFig.5b,ifwesortthe

papersbyitsrankingofC10ineachgroup,thepapersfirstcitedbydiscoverersoutperformsthepaperswhichhavethesame

rankingsintheothertwogroups.

Inourprocedure,thepapersfirstcitedbythediscoverersarefoundtobehighly-citedinthefuture.Tomeasurehow manyhighly-citedpapersaremissedbyourprediction,wecalculatetheFalsePositiveRate(FPR)andtheTruePositiveRate (TPR)inourmethod.TheFPRisdefinedasFPR=FP/Nn,whereFPisthenumberoffalsepositives,andNnisthetotalnumberof

negatives.MeanwhiletheTPRisdefinedasTPR=TP/Np,whereTPisthenumberoftruepositives,andNpisthetotalnumber

ofpositives.

WedefineapaperthathasC10>68(ranksinthetop1%ofallpaperswithC10≥1from1990to2006)ashighly-cited.As

aconsequence,atruepositiveeventisdefinedasapaperfirstcitedbydiscoverershavingC10>68inthefuture.Wefindthe

numberofoccasionsoftruepositiveeventsTPd=368.Afalsepositiveeventisdefinedasapaperfirstcitedbydiscoverer

havingC10<68inthefuture.WeobtainintotalFPd=12789falsepositiveevents.ThesevaluesimplytheFPRFPRd=6.9%

andtheTPRTPRd=20.3%forthepredictionofhighly-citedpapersusingdiscoverers.Forcomparison,wealsostudythe papersfirstcitedbyhighh-indexauthors,findingtheFPRFPRh=7.4%andtheTPRTPRh=17.0%.Discoverer-basedmethod

outperformsthemethodbasedonhighh-indexauthorsinbothFPRandTPR.Wefurthertestedotherthresholdsofdefining thepositiveeventsandfindthattheconclusionholdsforallthresholds.

4.2. Discoverersandsocialinfluence

Theobservationthatthepapersfirstcitedbythediscoverersaremorelikelytobecitedinthefuturecallsforpossible explanations.InlinewithMedoetal.(2016),oneexplanationisthatthediscoverersarefitness-drivenresearcherswhoare sensitivetotheintrinsicqualityofpapers,thusthepapersthattheycitehaveoverallhigherqualitywhichinturnleadsto themorecitations.Anotherpossibleexplanationisthatthediscoverersthemselvesareinfluentialresearchers,whichmeans

(6)

Fig.6.(a)Acomparisonbetweenthepapersfirstcitedbydiscoverersandallpapers(c10≥1).Theredbarsindicatethefirstcitation,greenbarsandorange barsrepresentthosecitationsfrompaperswhichcitedthefirstcitationpaperaswell,ornot,respectively.See(b)forillustration.

theyhavemanyfollowerswhofrequentlycitetheirpapersandalsopossiblycopytheirreferences(Wu&Holme,2009). Suchcopyingofreferencesisknowntobeanimportantelementofthegrowthofcitationdata(Simkin&Roychowdhury,

2005,2007).Inaddition,whenenteringanewresearchfield,apossiblewaytostartistoreviewthemilestonepapers

byoutstandingresearchersinthefield(andthereferencestherein).Insummary,itisplausiblethatresearchersaremore exposedtothepapersthatwerecitedbyfamousresearchers.

Tobetterunderstandtheroleofdiscoverersandtheircontributiontoboostingpaperpopularity,itiscrucialtodistinguish betweenimpactsof thesetwo effects.Similareffectshave beenrarelydiscussedbeforebecausemostdatasetslacka directwaytodeterminewhetherachoiceisanindependentdecisionorcopyingthebehaviorofothers.Fortunately,the scientificcitationdatagiveusaninformationaboutthepathsofknowledgepropagationandthusprovideusawaytofurther investigatethisproblem.

Tothisend,wedividepaper˛’scitationsintothreegroups: 1 Thefirstcitationofpaper˛islabeledasˇ.(RedbarinFig.6a)

2Papersthatciteboth˛andˇ.Thesepaperscanbefurtherdividedintothreetypes:(i)Thepapersthatcite˛becauseof theinformationfromˇ.(ii)Thepapersthatciteˇbecauseoftheinformationfrom˛.(iii)Thepapersthatciteboth˛and ˇjustbecauseofcoincidenceorotherunknowncauses.Itisgenerallydifficulttodistinguishthesetypesofcitations.For simplicity,weconsiderthenumberofallthesepapersasameasureofupperlimitofthesocialinfluenceofˇ.(Greenbar inFig.6a)

3Papersthatcite˛butdonotciteˇ.(OrangebarinFig.6a)

Whileclasstwocorrespondstothehypothesisthatthediscoverersutilizetheirsocialinfluenceandboostthecitation countofatargetpaperbyhavingitco-citedwiththeirpaper,classthreecorrespondstothehypothesisofdiscoverersbeing thefirstonestospothighqualitypapers.AsshowninFig.6a,thethirdclassofcitationsisclearlythedominantcomponent. Whilesomesocialinfluenceofdiscoverersstillmayexist,weseethatcopyingofthediscoverers’behaviorisaminorinfluence here.

4.3. Dothediscoverersactasfollowers?

Anotherquestioniswhetherthesuccessofdiscovererscanpossiblylieinpreferentiallyciting(following)highly-cited authorsthat,inturn,increasestheirchanceofmakingdiscoveries.Toinvestigatethispossibility,wecomputethemaximum h-indexofeachpaper’sauthorsatthetimethatthepaperpublished,hmax.Wethendividethepapersingroupsbyhmax.

AsshowninFig.7,discoverershavepreferenceincitingpaperswithhighhmax,forexample,3.12%ofpapersfirstcitedby

discoverershavehmax≥30,whileforallthepaperswithC10≥1,theproportionis1.12%.Despitethepreference,ineach

hmaxgroup,papersfirstcitedbythediscoverersreceivesignificantlymorecitationsthanthosecitedbyallauthors,this

showsthatthepapersfirstcitedbythediscoverersareonaveragemorecitedregardlessofwhetherthepapers’authorsare highly-citedornot.

4.4. Papersauthoredbythediscoverers

Anotherpertinentquestioniswhetherthediscoverers’papersalsoreceivemorecitations.Fromthe233087papers publishedin1990–2006,16358papershaveatleastonediscovereramongtheirauthors,and17966papershaveatleast onehighh-indexauthoramongtheirauthors(Theh-indexthresholdsare14–21,differentfromyeartoyearsothatthe

(7)

Fig.7. Acomparisonofc10forthepapersfirstcitedbythediscoverers(bluebars),andallauthors(yellowbars).Paperspublishedin1990–2006are considered.Thepapersaredividedingroupsbytheirhmax(thehighesth-indexoftheirauthorsinthepaper’spublicationyear).Theerrorbarsrepresent thestandarderrors.Thepercentagesabovethebarsindicatetheproportionofthecontributingpapers.

Fig.8.(a)AcomparisonbetweenC10ofthepapersauthoredbythediscoverers(red),highh-indexauthors(blue),andallauthors(black),eachdotrepresents themeanC10ofallpaperspublishedintheyear.Theerrorbarsrepresentthestandarderrorofpapers’C10.(b)C10ofthepapersthatranksrbyC10,risthe normalizedrankingofthepaperinthegroup.Theredline,bluelineandblacklinerepresentthegroupauthoredbydiscoverers,highh-indexauthorsand allauthorsrespectively.

Table1

MeananditsstandarderrorofC10forvarioussubsetsofpapers;Nsdenotesthesubsetsize.Thebottompartincludestheresultsforvariousintersections ofpapersubsetsdiscussedinthetoppart.

Firstcitedby Authoredby

(A)Discoverers (B)Highh-index All (C)Discoverers (D)Highh-index All

C10 16.7±0.3 14.4±0.2 9.85±0.05 17.1±0.2 14.9±0.2 7.88±0.03

Ns 13157 14001 186567 16358 17966 233087

A∩B A∩C A∩D B∩C B∩D C∩D

C10 17.2±0.4 24.4±0.7 24.3±0.8 23.0±0.8 21.5±0.8 19.6±0.4

Ns 5490 3172 2292 2211 2161 6822

numberoftheselectedpapersisclosetothenumberofpapersauthoredbydiscoverers.).MeanC10valuesforthesethree

groupsofpapersare7.88,17.1,and14.9,respectively.Similarlytothepapersfirstcitedbythediscoverers,papersauthored bythediscoverersalsohavesignificantcitationleadoverthetwobenchmarkgroups(seeFig.8aforacomparisonbypaper publicationyear).

4.5. Furtherimprovingthepredictions

Finally,wecomparethecitationcountsofvarioussubsetsofpapersinTable1.Notably,papersbelongingtomultiple inter-sectionsofthepapersubsetsdiscussedsofar(e.g.,papersthatarebothfirstcitedandauthoredbyadiscoverer—intersection A∩C)haveevenhigherC10whichsuggestsfurtherpossibilitiesforstudyingthepredictivepowerofvariousauthorsubsets

(herewefocusedonthediscoverersandusedhighh-indexauthorsasabenchmark)inthefuture.

(8)

5. Discussion

Thispaperrevealsaheterogeneityofresearcher’scitingbehavior.Weidentifytheresearchers—thediscoverers—who frequentlyciteamongthefirstpapersthatbecomehighly-citedinthefuture.Wefocusonthesediscovererstoassesstheir futurecitingbehavior.Ourresultsshowthatthediscoverers’abilityispersistentintimetotheextentthatthediscoverers identifiedattimeTcanbeusedtoidentifyfuturehighly-citedpapers.Weshowthatthesocialinfluenceplaysaminorrole inexplainingthediscoverers’behavior,forwhichthemainhypothesisstillremainsthatthediscoverersaretheauthorsthat aredriven(oratleastmoresensitivethantheaverage)bytheintrinsicpaperfitness(asoriginallysuggestedbyMedoetal.

(2016)).

Despitethehugesuccessofpreferentialattachmentmechanisminexplainingthericher-get-richerphenomenon,there isanumberofcitationpatternsthatitcannotaccountfor:newlypublishedpapersquicklybecomingpopular,andhighly citedoldpapersbeingnotoftencitedanymore,forexample.Tosolvethesedeviationsfromthepreferentialattachment, researchershavedevelopedmodelslikethefitnessmodel(Bianconi&Barabási,2001;Caldarellietal.,2002)andtheaging model(Dorogovtsev&Mendes,2000;Wuetal.,2014).Whilethesetwomodelsareabletoprovidepossibleexplanations fortheanomaliesinthecitationdatadivergingfromthepreferentialattachmentprediction,adetailedmodelthatutilizes boththefitnessinformationandtemporalinformationtoextractthefitness-drivenbehaviorofusershasbeenlackingfor years.Ourworkprovidesawell-calibratedmethodassumingamixofuserswithdifferentsensitivitytopaperfitness.

Wedividetheresearchersintotwogroups:fitness-drivenresearchersandpopularity-drivenresearchers.Fitness-driven researchers,namelytheDiscoverers,canfindhigh-qualitypapersatearlystageandincreasetheearlycitationofthese papers,whichmakeshigh-qualitypaperseasiertobenoticedbypopularity-drivenresearchersandfinallystandout.This contributestotherelationbetweenthequalityofapaperanditscitationcount.Althoughthefitnessmodelservingas thefundamentalofthispapercouldofferapossibleexplanationofthephenomenonthatafewresearcherscanconstantly discoverthenewpapersthatcanbecomepopularinthefuture,acomprehensiveunderstandingofwhytheseresearchers havethepredictivitystillawaitfurtherstudies.

Ourfindingscanbeusedforearlyassessmentofpaperpotentialimpact.Accordingtoourstudy,focusingonthe discover-ersandassigningmoreweighttotheircitationswhenevaluatingnewpapersishelpfulinfindinghigh-qualitypapersshort aftertheyarepublished.Anotherinterestingresultisthatthepapersauthoredbydiscovererstendtoreceivemorecitations thanthoseofresearcherswithhighh-index.Forthefirsttime,thispaperrevealsacorrelationbetweenresearchers’citing behaviorandtheiracademicpotential.Thismayinturnprovidenewinputsfortheproblemofidentifyingpotentialfuture outstandingresearchers.

Authorcontributions

Guan-YuanShi:Collectedthedata;Performedtheanalysis;Wrotethepaper. Yi-XiuKong:Collectedthedata;Performedtheanalysis.

Guang-HuiYuan:Collectedthedata;Performedtheanalysis. Rui-JieWu:Performedtheanalysis;Wrotethepaper.

AnZeng:Concievedanddesignedtheanalysis;Contributeddataoranalysistools;Wrotethepaper. MatusMedo:Concievedanddesignedtheanalysis;Contributeddataoranalysistools;Wrotethepaper. Acknowledgments

TheauthorswouldliketothankProf.Yi-ChengZhangforhelpfuldiscussion.ThisworkwassupportedbytheNational NaturalScienceFoundationofChina(GrantNo.61603046),theNaturalScienceFoundationofBeijing(GrantNo.L160008), andtheChinaScholarshipCouncil.

References

Barabási,A.L.,&Albert,R.(1999).Emergenceofscalinginrandomnetworks.Science,286(5439),509–512.

Benjamini,Y.,&Hochberg,Y.(1995).Controllingthefalsediscoveryrate:apracticalandpowerfulapproachtomultipletesting.JournaloftheRoyal

StatisticalSociety.SeriesB(Methodological),289–300.

Berset,Y.,&Medo,M.(2013).Theeffectoftheinitialnetworkconfigurationonpreferentialattachment.EuropeanPhysicalJournalB,86,260.

Bianconi,G.,&Barabási,A.L.(2001).Bose-Einsteincondensationincomplexnetworks.PhysicalReviewLetters,86(24),5632.

Caldarelli,G.,Capocci,A.,DeLosRios,P.,&Munoz,M.A.(2002).Scale-freenetworksfromvaryingvertexintrinsicfitness.PhysicalReviewLetters,89(25), 258702.

Cervellini,P.,Menezes,A.G.,&Mago,V.K.(2016).FindingTrendsettersonYelpDataset.2016IEEESymposiumSeriesonComputationalIntelligence(SSCI).pp. 1.

Chen,P.,&Redner,S.(2010).Communitystructureofthephysicalreviewcitationnetwork.JournalofInformetrics,4,p278.

Dorogovtsev,S.N.,&Mendes,J.F.F.(2000).Evolutionofnetworkswithagingofsites.PhysicalReviewE,62(2),1842.

Golosovsky,M.,&Solomon,S.(2013).Thetransitiontowardsimmortality:Non-linearautocatalyticgrowthofcitationstoscientificpapers.Journalof

StatisticalPhysics,151,340.

Golosovsky,M.,&Solomon,S.(2017).Growingcomplexnetworkofcitationsofscientificpapers:Modelingandmeasurements.PhysicalReviewE,95, 012324.

Hirsch,J.E.(2005).Anindextoquantifyanindividual’sscientificresearchoutput.ProceedingsoftheNationalacademyofSciencesoftheUnitedStatesof

America,102(46),16569.

(9)

Jeong,H.,Néda,Z.,&Barabási,A.L.(2003).Measuringpreferentialattachmentinevolvingnetworks.EPL(EurophysicsLetters),61(4),567.

Mariani,M.S.,Medo,M.,&Zhang,Y.C.(2016).Identificationofmilestonepapersthroughtime-balancednetworkcentrality.JournalofInformetrics,10(4), 1207–1223.

Martin,T.,Ball,B.,Karrer,B.,&Newman,M.E.J.(2013).CoauthorshipandcitationpatternsinthePhysicalReview.PhysicalReviewE,88,012814.

Medo,M.,Cimini,G.,&Gualdi,S.(2011).Temporaleffectsinthegrowthofnetworks.PhysicalReviewLetters,107(23),238701.

Medo,M.,Mariani,M.S.,Zeng,A.,&Zhang,Y.C.(2016).Identificationandimpactofdiscoverersinonlinesocialsystems.ScientificReports,6,34218.

Medo,M.,&Cimini,G.(2016).Model-basedevaluationofscientificimpactindicators.PhysicalReviewE,94,032312.

Mitzenmacher,M.(2004).Abriefhistoryofgenerativemodelsforpowerlawandlognormaldistributions.InternetMathematics,1,226.

Newman,M.E.(2009).Thefirst-moveradvantageinscientificpublication.EPL(EurophysicsLetters),86(6),68001.

Papadopoulos,F.,Kitsak,M.,Serrano,M.Á.,Boguná,M.,&Krioukov,D.(2012).Popularityversussimilarityingrowingnetworks.Nature,489(7417),537.

Perc,M.(2014).TheMattheweffectinempiricaldata.JournalofTheRoyalSocietyInterface,11,20140378.

Price,D.J.D.S.(1965).Networksofscientificpapers.Science,510–515.

Price,D.D.S.(1976).Ageneraltheoryofbibliometricandothercumulativeadvantageprocesses.JournaloftheAssociationforInformationScienceand

Technology,27(5),292–306.

Redner,S.(2004).Citationstatisticsfrommorethanacenturyofphysicalreview.,arXivpreprintphysics/0407137.

Simkin,M.V.,&Roychowdhury,V.P.(2005).Stochasticmodelingofcitationslips.Scientometrics,62,367.

Simkin,M.V.,&Roychowdhury,V.P.(2007).Amathematicaltheoryofciting.JournaloftheAmericanSocietyforInformationScienceandTechnology,58, 1661.

Wang,M.,Yu,G.,&Yu,D.(2008).Measuringthepreferentialattachmentmechanismincitationnetworks.PhysicaA:StatisticalMechanicsandits

Applications,387(18),4692–4698.

Wang,M.,Yu,G.,&Yu,D.(2009).Effectoftheageofpapersonthepreferentialattachmentincitationnetworks.PhysicaA:StatisticalMechanicsandits

Applications,388(19),4273–4276.

Wu,Y.,Fu,T.Z.,&Chiu,D.M.(2014).Generalizedpreferentialattachmentconsideringaging.JournalofInformetrics,8(3),650–658.

Wu,Z.X.,&Holme,P.(2009).Modelingscientific-citationpatternsandothertriangle-richacyclicnetworks.PhysicalReviewE,80(3),037101.

Zeng,A.,Shen,Z.,Zhou,J.,Wu,J.,Fan,Y.,Wang,Y.,&Stanley,H.E.(2017).Thescienceofscience:Fromtheperspectiveofcomplexsystems.Physics

Reports,714/715,1–73.

Figure

Fig. 1. A schematic plot of the evaluation of discoverers. 1) ˛ is a paper in the training set
Fig. 2. Statistical significance of discoverers. We use the parameters F = 5, K = 3, f p = 1.0(%), and T = 2006 as an example, that imply P D = 7.6% and P E = 5.4%.
Fig. 4. The most outstanding discoverer identified in the year of 1990. Blue bars show the total citation counts of all the papers the discoverer has cited, and the red bars show the citation count of the paper when the discoverer cited it.
Fig. 6. (a) A comparison between the papers first cited by discoverers and all papers (c 10 ≥ 1)
+2

Références

Documents relatifs

Our main conclusions can then be restated as follows: (i) intuitions of Euclidean geometry as studied by mathematical cognition are likely to play a role in reasoning to

Figure XII: Comparison of applying two-variable block CD to solve dual of SVM with/without a linear constraint. All settings are the same as Figure 5, though for the

For C = 1, 000, L-CommDir-BFGS is the fastest on most data sets, and the only exception is KDD2010-b, on which L-CommDir-BFGS is slightly slower than L-CommDir- Step, but faster

Keywords: scientific journal information system, Open Journal Systems, peer review workflow, automated reviewers selection, Mathematics Subject Classifi- cation 2010,

We studied the distributions of sentences in the body of the papers that are reused in abstracts with respect to the IMRaD structure [2] and found out that the beginning of

When a noun phrase is annotated, the borders of the head word are highlighted with HeadBegin and HeadEnd attributes, its grammatical form is normalized (reduced to

Proceedings of the conference Terminology and Artificial Intelligence 2015 (Granada, Spain)... The time factor as an associative concept relation

Manso, Luis Calderita, Marco Antonio Guti´ errez Giraldo, Pedro N´ u˜ nez, Antonio Bandera, Adri´ an. Romero-Garc´ es, Juan Bandera and