
Multi-cultural Wikipedia mining of geopolitics interactions leveraging reduced Google matrix analysis


HAL Id: hal-01422586

https://hal.archives-ouvertes.fr/hal-01422586

Submitted on 28 May 2020



Klaus Frahm, Samer El Zant, Katia Jaffrès-Runser, Dima Shepelyansky

To cite this version:

Klaus Frahm, Samer El Zant, Katia Jaffrès-Runser, Dima Shepelyansky. Multi-cultural Wikipedia mining of geopolitics interactions leveraging reduced Google matrix analysis. Physics Letters A, Elsevier, 2017, 381 (33), pp. 2677-2685. ⟨10.1016/j.physleta.2017.06.021⟩. ⟨hal-01422586⟩


This is an author’s version published in: http://oatao.univ-toulouse.fr/22156


Klaus M. Frahm (a), Samer El Zant (b), Katia Jaffrès-Runser (b), Dima L. Shepelyansky (a)

(a) Laboratoire de Physique Théorique du CNRS, IRSAMC, Université de Toulouse, UPS, F-31062 Toulouse, France
(b) Institut de Recherche en Informatique de Toulouse, Université de Toulouse, INPT, 31061 Toulouse, France

Abstract

Geopolitics focuses on political power in relation to geographic space. Interactions among world countries have been widely studied at various scales, observing economic exchanges, world history or international politics among others. This work exhibits the potential of Wikipedia mining for such studies. Indeed, Wikipedia stores valuable fine-grained dependencies among countries by linking webpages together for diverse types of interactions (not only related to economic, political or historical facts). We mine herein the Wikipedia networks of several language editions using the recently proposed method of reduced Google matrix analysis. This approach makes it possible to establish direct and hidden links between a subset of nodes that belong to a much larger directed network. Our study concentrates on 40 major countries chosen worldwide. Our aim is to offer a multicultural perspective on their interactions by comparing networks extracted from five different Wikipedia language editions, emphasizing the English, Russian and Arabic ones. We demonstrate that this approach makes it possible to recover meaningful direct and hidden links among the 40 countries of interest.

Keywords: Markov chains; Google matrix; Wikipedia network; Geopolitical interactions of countries

1. Introduction

Political and economic interactions between regions of the world have always been of utmost interest for measuring and predicting their relative influence. Such studies belong to the field of geopolitics, which focuses on political power in relation to geographic space. Interactions among world countries have been widely studied at various scales (worldwide, continental or regional) using different types of information. Studies are driven by observing economic exchanges, social changes, history, international politics and diplomacy among others [1,2]. The major finding of this paper is that meaningful worldwide interactions can be automatically extracted from the global and free online encyclopaedia Wikipedia [3] for a given set of countries. All information gathered in this collaborative knowledge base can be leveraged to provide a picture of the relationships between countries, fostering a new framework for thorough geopolitics studies.

Wikipedia has become the largest open source of knowledge, being close to Encyclopaedia Britannica [4] in the accuracy of its scientific entries [5] and surpassing the latter in the enormous quantity of available information. A detailed analysis of the strong and weak features of Wikipedia is given in [6,7]. Wikipedia articles cite each other, providing a direct relationship between webpages and topics. As such, Wikipedia generates a large directed network of article titles with a rather clear meaning. For these reasons, it is interesting to apply algorithms developed for search engines of the World Wide Web (WWW), such as the PageRank algorithm [8] (see also [9]), to analyze the ranking properties and relations between Wikipedia articles. For various language editions of Wikipedia it was shown that the PageRank vector produces a reliable ranking of historical figures over 35 centuries of human history [10–14] and a solid Wikipedia ranking of world universities (WRWU) [10,15]. It has been shown that the Wikipedia ranking of historical figures is in good agreement with the well-known Hart ranking [16], while the WRWU is in good agreement with the Shanghai Academic ranking of world universities [17].

* Corresponding author.
E-mail addresses: frahm@irsamc.ups-tlse.fr (K.M. Frahm), samer.elzant@enseeiht.fr (S. El Zant), kjr@enseeiht.fr (K. Jaffrès-Runser), dima@irsamc.ups-tlse.fr (D.L. Shepelyansky).

At present, directed networks of real systems can be very large (about 4.2 million articles for the English Wikipedia edition in 2013 [13], or 3.5 billion web pages (also called nodes) for a publicly accessible web crawl that was gathered by the Common Crawl Foundation in 2012 [18]). For some studies, one might be interested only in the particular interactions between a very small subset of nodes compared to the full network size. For instance, in this paper, we are interested in capturing the interactions of the 40 countries represented in Fig. 1 using the networks extracted from five Wikipedia language editions, each covering a few million articles. However, let us assume that there is a rather important person (with a Wikipedia article corresponding to a node C) who was born in country A and worked for the main part of their life in country B; therefore A and B may have links to C (in either direction), and thus there may be an indirect link between the two nodes A and B via the node C (or other nodes). A solution to this general problem has been proposed in previous works [19,20] by defining the reduced Google matrix theory. The main elements of the reduced Google matrix $G_R$ will be presented next, but in a few words, it captures in a 40-by-40 Perron–Frobenius matrix the full contribution of direct and indirect interactions happening in the full Google matrix between the 40 nodes of interest (we took the top 40 countries of the PageRank vector of EnWiki). The elements $G_R(i,j)$ of the reduced matrix can be interpreted as the probability for a random surfer starting at webpage $j$ to arrive at webpage $i$ using direct and indirect interactions. Indirect interactions refer to paths composed in part of webpages different from the 40 of interest. Even more interesting, and unique to reduced Google matrix theory, we show here that intermediate computation steps of $G_R$ offer a decomposition of $G_R$ into matrices that clearly distinguish direct from indirect interactions. As such, it is possible to extract the probability for an indirect interaction between two nodes to happen.

Fig. 1. Geographical distribution of the 40 selected countries. The color code groups countries into 7 sets: orange (OC) for English speaking countries, blue (BC) for former Soviet Union ones, red (RC) for European ones, green (GC) for Latin American ones, yellow (YC) for Middle Eastern ones, purple (PUC) for North-East Asian ones and finally pink (PIC) for South-Eastern countries (see colors and country names in Table 1; other countries are shown in black). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Reduced Google matrix theory is a perfect candidate for analyzing the direct and indirect interactions between countries selected worldwide. In this paper, we use $G_R$ and its decomposition into direct and indirect matrices to analyze the selected subset network of $N_r = 40$ countries. The Google matrix of this subset network of $N_r$ nodes is computed taking into account direct and hidden (i.e. indirect) directed links. More specifically, we deduce a fine-grained classification of countries that captures what we call the hidden friends and hidden followers of a given country. The structure of these graphs provides relevant social information: communities of countries with strong ties can be clearly exhibited, while countries acting as bridges are present as well. This is mainly the case for the hidden interaction networks of friends (or followers), which offer new information compared to the direct networks of friends (or followers), whose topology is mainly enforced by top PageRank countries. The mathematical procedure of the reduced $G_R$ matrix construction for $N_r$ nodes is described in detail in Section 2.

Table 1
List of names of the 40 selected countries with PageRank index K and CheiRank index K* for EnWiki, ArWiki and RuWiki, ordered by increasing PageRank index of the EnWiki edition. Fig. 1 gives color correspondence details. A Wikipedia article with a country name represents one node of the whole network with N nodes, e.g. https://en.wikipedia.org/wiki/France for “France” in EnWiki. (For interpretation of the references to color in this table, the reader is referred to the web version of this article.)

Country                   Group    EnWiki     ArWiki     RuWiki
                                   K   K*     K   K*     K   K*
United States (US)        OC       1    9     1    5     2   27
France (FR)               RC       2   19     3   31     3   14
United Kingdom (UK)       OC       3   25     6   20     7    3
Germany (DE)              RC       4   33     8   14     4   24
Canada (CA)               OC       5   26    13   19    12   26
India (IN)                PIC      6   23     9   25    13    8
Australia (AU)            OC       7   35    16   22    18   12
Italy (IT)                RC       8   15     5    1     6   32
Japan (JP)                PUC      9    4    11    9    11    7
China (CN)                PUC     10    8    12   17     9   21
Russia (RU)               BC      11    6     7    2     1    2
Spain (ES)                RC      12   30     4    8     8   15
Poland (PL)               RC      13   12    26   32    10   17
The Netherlands (NL)      RC      14   37    18   33    15   31
Iran (IR)                 YC      15    2    14   15    30   22
Brazil (BR)               GC      16    3    21   26    20    1
Sweden (SE)               RC      17   22    22    7    19    5
New Zealand (NZ)          OC      18   28    34   24    36    4
Mexico (MX)               GC      19   40    23   38    22   37
Switzerland (CH)          RC      20   38    20   34    16   18
Norway (NO)               RC      21   32    35   16    27   11
Romania (RO)              RC      22   10    19    6    32   36
Turkey (TR)               YC      23    7    15   13    21   38
South Africa (ZA)         OC      24   24    29   39    35   20
Belgium (BE)              RC      25   18    27   37    29   30
Austria (AT)              RC      26   39    28   28    14   28
Greece (GR)               RC      27   21    10   36    25   25
Argentina (AR)            GC      28    1    32   29    33   23
Philippines (PH)          PIC     29   17    36   21    39   33
Portugal (PT)             RC      30   36    24   12    17    9
Pakistan (PK)             PUC     31    5    25   35    37   29
Denmark (DK)              RC      32   16    33   10    31   19
Israel (IL)               YC      33   20    17   18    28    6
Finland (FI)              RC      34   14    37    4    26   16
Egypt (EG)                YC      35   31     2    3    24   39
Indonesia (ID)            PIC     36   13    31   11    34   10
Hungary (HU)              RC      37   11    40   40    23   40
Taiwan (TW)               PUC     38   27    39   27    40   34
South Korea (KR)          PUC     39   34    38   30    38   35
Ukraine (UA)              BC      40   29    30   23     5   13

The networks of $G_R$ direct and hidden interactions can be calculated for different Wikipedia language editions. In this paper, reduced Google matrix analysis is applied to the same set of 40 countries on networks representing five different Wikipedia editions: English (EnWiki), Arabic (ArWiki), Russian (RuWiki), French (FrWiki) and German (DeWiki) editions. We take for analysis the top 40 countries according to the EnWiki PageRank. Wikipedia language editions are usually modified by authors who mainly belong to the region associated with this language. Thus our study shows the impact of this cultural bias when comparing direct and hidden networks of friends (or followers) among different language editions. We show that part of the interactions are cross-cultural while others are clearly biased by the culture of the authors.

In Section 2 we introduce the main elements of reduced Google matrix theory. Section 3 describes $G_R$ calculated for 40 countries and for five different Wikipedia editions. Specific emphasis is given to the very different English, Arabic and Russian editions. Networks of friends and followers for direct and hidden interaction matrices are created and discussed in Section 4, and conclusions are drawn in Section 5.


Fig. 2. Position of countries in the local $(K, K^*)$ plane of the reduced network of 40 countries in the EnWiki (left), ArWiki (middle) and RuWiki (right) networks.

2. Reduced Google matrix theory

2.1. Google matrix

It is convenient to describe the network of $N$ Wikipedia articles as follows: a network node is given by the article name, all nodes are numbered, and some nodes correspond to articles with country names (see Table 1); the number of countries is much smaller than the total number of articles $N$. The Google matrix $G$ is then constructed from the adjacency matrix $A_{ij}$, with elements 1 if article (node) $j$ points to article (node) $i$ and zero otherwise. In this case, elements of the Google matrix take the standard form [8,9]

$$G_{ij} = \alpha S_{ij} + (1-\alpha)/N , \tag{1}$$

where $S$ is the matrix of Markov transitions with elements $S_{ij} = A_{ij}/k_{out}(j)$, $k_{out}(j) = \sum_{i=1}^{N} A_{ij} \neq 0$ being the out-degree of node $j$ (number of outgoing links), and with $S_{ij} = 1/N$ if $j$ has no outgoing links (dangling node). Here $0 < \alpha < 1$ is the damping factor, which for a random surfer determines the probability $(1-\alpha)$ to jump to any node; below we use $\alpha = 0.85$. The right eigenvectors $\psi_i(j)$ of $G$ are defined by:

$$\sum_{j'} G_{jj'} \, \psi_i(j') = \lambda_i \psi_i(j) . \tag{2}$$

The PageRank eigenvector $P(j) = \psi_{i=0}(j)$ corresponds to the largest eigenvalue $\lambda_{i=0} = 1$ [8,9]. It has positive elements which give the probability to find a random surfer on a given node in the stationary long time limit of the Markov process. All nodes can be ordered by a monotonically decreasing probability $P(K_i)$ with the highest probability at $K = 1$. The index $K$ is the PageRank index. The PageRank vector is computed by the PageRank algorithm with iterative multiplication of an initial random vector by the matrix $G$ [8,9,14]. Left eigenvectors are biorthogonal to right eigenvectors of different eigenvalues. The left eigenvector for $\lambda = 1$ has identical (unit) entries due to the column sum normalization of $G$. In the following we use the notations $\psi_L^T$ and $\psi_R$ for left and right eigenvectors, respectively. The notation $T$ stands for vector or matrix transposition.
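As an illustration of Eqs. (1) and (2), the following minimal sketch builds $G$ from a small adjacency matrix and obtains the PageRank vector by iterative multiplication. The 4-node network, the function names `google_matrix` and `pagerank`, the uniform (rather than random) initial vector and the stopping tolerance are all assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def google_matrix(A, alpha=0.85):
    """Build G of Eq. (1) from adjacency A, where A[i, j] = 1 if node j links to node i."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    k_out = A.sum(axis=0)                # out-degree of each node j (column sums)
    S = np.full((N, N), 1.0 / N)         # dangling nodes keep the uniform column 1/N
    has_links = k_out > 0
    S[:, has_links] = A[:, has_links] / k_out[has_links]
    return alpha * S + (1.0 - alpha) / N

def pagerank(G, tol=1e-12, max_iter=1000):
    """Power iteration P <- G P; a uniform start is used here for reproducibility."""
    N = G.shape[0]
    P = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        P_next = G @ P
        P_next /= P_next.sum()           # keep the probability normalization
        if np.abs(P_next - P).sum() < tol:
            return P_next
        P = P_next
    return P

# Hypothetical toy network: node 3 has no outgoing links (dangling node).
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
])
G = google_matrix(A)
P = pagerank(G)

# PageRank index K: K = 1 for the node with the highest probability.
K = np.empty(len(P), dtype=int)
K[np.argsort(-P, kind="stable")] = np.arange(1, len(P) + 1)
print("P =", P)
print("K =", K)
```

Because every column of $G$ sums to one, the iteration preserves total probability, and the converged vector satisfies $GP = P$ up to the chosen tolerance.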

In addition to the matrix $G$ it is useful to introduce a Google matrix $G^*$ constructed from the adjacency matrix of the same network but with inverted direction of all links [21]. The vector $P^*(K^*)$ is called the CheiRank vector [10,21], and the index numbering nodes in order of monotonic decrease of probability $P^*$ is denoted as the CheiRank index $K^*$. Thus, nodes with many ingoing (or outgoing) links have small values of $K = 1, 2, 3, \ldots$ (or of $K^* = 1, 2, 3, \ldots$) [9,14]. We show the distribution of selected countries on the PageRank–CheiRank plane $(K, K^*)$ in Fig. 2.
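The relation between the two rankings can be sketched on a toy example: the CheiRank vector is obtained by applying the same construction to the transposed adjacency matrix. The network below and the helper names `google_matrix`, `stationary_vector` and `rank_index` are hypothetical; node 0 is chosen to receive many links and node 1 to emit many links.

```python
import numpy as np

def google_matrix(A, alpha=0.85):
    """Google matrix from adjacency A, with A[i, j] = 1 if node j links to node i."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    k_out = A.sum(axis=0)
    S = np.full((N, N), 1.0 / N)         # uniform columns for dangling nodes
    has_links = k_out > 0
    S[:, has_links] = A[:, has_links] / k_out[has_links]
    return alpha * S + (1.0 - alpha) / N

def stationary_vector(G, n_iter=200):
    """Power iteration for the eigenvector of G at eigenvalue 1."""
    P = np.full(G.shape[0], 1.0 / G.shape[0])
    for _ in range(n_iter):
        P = G @ P
    return P / P.sum()

def rank_index(P):
    """1-based rank index: the node with the largest probability gets index 1."""
    idx = np.empty(len(P), dtype=int)
    idx[np.argsort(-P, kind="stable")] = np.arange(1, len(P) + 1)
    return idx

# Hypothetical links: 1->0, 2->0, 3->0, 1->2, 1->3, 0->1.
A = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
])
P = stationary_vector(google_matrix(A))          # PageRank (original links)
P_star = stationary_vector(google_matrix(A.T))   # CheiRank (inverted links)
K, K_star = rank_index(P), rank_index(P_star)
for node in range(len(P)):
    print(f"node {node}: K = {K[node]}, K* = {K_star[node]}")
```

In this toy network node 0, which has the most ingoing links, takes $K = 1$, while node 1, which has the most outgoing links, takes $K^* = 1$.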

2.2. Reduced Google matrix

We construct the reduced Google matrix for a certain subset of $N_r$ selected nodes from the global Wikipedia network with $N$ nodes ($N_r \ll N$). As a subset we choose the top 40 countries with the largest PageRank probabilities for the EnWiki network (the names are given in Table 1). The reduced Google matrix $G_R$ is constructed on the mathematical basis described below. The main element of this construction is to keep the same PageRank probabilities of the $N_r$ nodes as in the global network (up to a fixed multiplier coefficient) and to take into account all indirect links between the $N_r$ nodes coupled by transitions via the $N - N_r$ nodes of the global network (see also [20]).

Let $G$ be a typical Google matrix (1) for a network with $N$ nodes such that $G_{ij} \ge 0$ and the column sum normalization $\sum_{i=1}^{N} G_{ij} = 1$ is verified. We consider a sub-network with $N_r < N$ nodes, called “reduced network”. In this case we can write $G$ in a block form:

$$G = \begin{pmatrix} G_{rr} & G_{rs} \\ G_{sr} & G_{ss} \end{pmatrix} \tag{3}$$

where the index “r” refers to the nodes of the reduced network and “s” to the other $N_s = N - N_r$ nodes, which form a complementary network that we will call the “scattering network”. Thus $G_{rr}$ is given by the direct links between the selected $N_r$ nodes, $G_{ss}$ describes links between all other $N - N_r$ nodes, and $G_{rs}$ and $G_{sr}$ give links between these two parts. The PageRank vector of the full network is given by:

$$P = \begin{pmatrix} P_r \\ P_s \end{pmatrix} \tag{4}$$

which satisfies the equation $GP = P$; in other words, $P$ is the right eigenvector of $G$ for the unit eigenvalue. This eigenvalue equation reads in block notation:

$$(\mathbf{1} - G_{rr}) P_r - G_{rs} P_s = 0, \tag{5}$$

$$-G_{sr} P_r + (\mathbf{1} - G_{ss}) P_s = 0. \tag{6}$$

Here $\mathbf{1}$ is the unit matrix of corresponding size $N_r$ or $N_s$. Assuming that the matrix $\mathbf{1} - G_{ss}$ is not singular, i.e. all eigenvalues of $G_{ss}$ are strictly smaller than unity (in modulus), we obtain from (6) that

$$P_s = (\mathbf{1} - G_{ss})^{-1} G_{sr} P_r \tag{7}$$
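The block relations (3)–(7) can be checked numerically on a small random network. The sketch below is only an illustration under stated assumptions: the network, its size, the random seed and the helper name `google_matrix` are hypothetical. The last step substitutes (7) into (5), which gives $\left[G_{rr} + G_{rs}(\mathbf{1} - G_{ss})^{-1} G_{sr}\right] P_r = P_r$; this substitution is not shown in the text above and is included here as an assumption about where the derivation leads, consistent with the stated goal of keeping the PageRank probabilities of the $N_r$ nodes.

```python
import numpy as np

def google_matrix(A, alpha=0.85):
    """Google matrix of Eq. (1); A[i, j] = 1 if node j links to node i."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    k_out = A.sum(axis=0)
    S = np.full((N, N), 1.0 / N)         # uniform columns for dangling nodes
    has_links = k_out > 0
    S[:, has_links] = A[:, has_links] / k_out[has_links]
    return alpha * S + (1.0 - alpha) / N

# Hypothetical random network: N = 8 nodes, the first Nr = 3 form the reduced set.
rng = np.random.default_rng(0)
N, Nr = 8, 3
A = (rng.random((N, N)) < 0.4).astype(float)
np.fill_diagonal(A, 0.0)
G = google_matrix(A)

# PageRank of the full network by power iteration.
P = np.full(N, 1.0 / N)
for _ in range(500):
    P = G @ P
P /= P.sum()

# Block form of Eq. (3) and PageRank components of Eq. (4).
G_rr, G_rs = G[:Nr, :Nr], G[:Nr, Nr:]
G_sr, G_ss = G[Nr:, :Nr], G[Nr:, Nr:]
P_r, P_s = P[:Nr], P[Nr:]

# Eq. (7): P_s = (1 - G_ss)^{-1} G_sr P_r, solved without forming the inverse.
one_s = np.eye(N - Nr)
P_s_from_eq7 = np.linalg.solve(one_s - G_ss, G_sr @ P_r)
print("max |P_s - Eq.(7)| =", np.abs(P_s - P_s_from_eq7).max())

# Substituting (7) into (5): an Nr x Nr matrix that has P_r as eigenvector at eigenvalue 1.
G_R = G_rr + G_rs @ np.linalg.solve(one_s - G_ss, G_sr)
print("max |G_R P_r - P_r| =", np.abs(G_R @ P_r - P_r).max())
print("column sums of G_R:", G_R.sum(axis=0))
```

Using `np.linalg.solve` rather than an explicit inverse is a numerical design choice for this small example; for networks with millions of nodes a direct solve of this kind would not be practical.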
