Web Search Graph Mining Pierre Senellart (

(1)

Basics of Graph Theory

Web Search

Graph Mining

Pierre Senellart

(pierre.senellart@telecom-paristech.fr)

18 March 2009

(2)

Outline

1 Basics of Graph Theory

2 The Web as a graph

3 Social Networking Sites

4 Conclusion

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 1 / 46

(3)

Graphs

1 2 3

4 5 6

Definition

A directed graphis a pair(S,A) where:

S is a finite set ofvertices(or nodes)

Ais a subset of S² defining theedges (orarcs)

1 2 3

4 5 6

Definition

An undirected graphis a pair (S,A)where:

S is a finite set ofvertices(or nodes)

Ais a set of (unordered) pairs of elements of S defining theedges (orarcs)

Remark

Graph is the mathematical term,network is used to describe real-world graphs.

(4)

Paths and Connectedness

Definition

A pathis a sequence of verticesv1. . .vn such thatvk is connected by an edge to v_k+1 for 1≤k ≤n−1.

Definition

The underlying undirected graphof a directed graphG is the graph obtained by adding all reverse edges.

Definition

An undirected graph isconnectedif for every two verticesu andv, there exists a path starting from u and ending in v.

A directed graph is strongly connectedif it is connected, and is weakly connectedif the underlying undirected graph is connected.

(5)

Connected Components

Definition

(S⁰,A⁰)is a subgraph of(S,A) ifS⁰⊆S andA⁰ is the restriction ofA to edges whose vertices are in S⁰.

Connected component: maximal connected subgraph

Strongly connected component: maximal strongly connected subgraph Weakly connected component: maximal weakly connected subgraph

1 2

4 5

3 6

(6)

Connected Components

Definition

1 2

4 5

3 6

Strongly connected components

(7)

Connected Components

Definition

1 2

4 5

3 6

Weakly connected components

(8)

Vocabulary

Incident: an edge is said to beincidentto a vertex if it it hasthis vertex for endpoint

Degree (of a vertex): number of edgesincident to a vertex, in an undirected graph

Indegree (of a vertex): number of edges arriving to a vertex, in a directed graph

Outdegree (of a vertex): number of edgesleaving from a vertex, in a directed graph

Cycle: Path whose start and end vertex is thesame Distance: Length of theshortest pathbetween two vertices

Sparse: a graph(S,A) is sparse if|A| |S|²

(9)

Bipartite graphs

Definition

A bipartitegraph is an undirected graph (S,A) such thatS =S1∪S2

(with S₁∩S₂ =∅), and no edge of Ais incident to two vertices inS₁ or two vertices inS₂.

Paths of length 2 in a bipartite graph define two regular undirected graphs.

1 2 3 4

5 6 7

1 2

3 4

5 6

7

(10)

Bipartite graphs

Definition

1 2 3 4

5 6 7

1 2

3 4

5 6

7

(11)

Bipartite graphs

Definition

1 2 3 4

5 6 7

1 2

3 4

5 6

7

(12)

Matrix Representation of a Graph

A graph G can be represented by itsadjacency matrixM:

1 2 3

4 5 6







0 1 0 0 0 0

0 1 0 1 1 0

0 0 0 0 0 1

1 0 0 0 1 0

0 0 0 1 0 0

0 0 0 0 0 0







Adjacency matrices of undirectedgraphs aresymmetric.

M^Tis the adjacency matrix obtained fromG byreversing the arrows.

Mⁿ is the matrix of the graph of allpaths of length n in G.

(13)

Concrete Representation of Graphs

In programming, graphs are usually represented as:

its adjacency matrix(stored as a multidimensional array), for non-sparsegraphs

the list of alledges incident to each node, for sparse graphs

1 2 3

4 5 6

1 2 2 2,4,5 3 6 4 1,5 5 4 6

(14)

The Web as a graph

Outline

2 The Web as a graph Introduction

Computing the importance of a page Spamdexing

Discovering communities

4 Conclusion

(15)

The Web as a graph Introduction

The Web Graph

The World Wide Web seen as a (directed) graph:

Vertices: Web pages Edges: hyperlinks

Same for other interlinkedenvironments:

dictionaries encyclopedias

scientific publications social networks

(16)

The shape of the Web graph

[Broder et al., 2000]

(17)

Characteristics of Interest of a Graph

Sparsity. Is the graph sparse (|A| |S|²)?

Yes, the World Wide Web is sparse

Typical distance. What is the mean distancebetween any pairs of vertices? Logarithmic typical distance

Local clustering. Ifa is connected to bothb andc, is the probability that b is connected toc significantly greater than the probability any two nodes are connected? Strong local clustering Degree distribution. What is the distribution of the degree of vertices?

Power-lawindegree and outdegree distribution (2≤γ ≤3) [Broder et al., 2000]

k P

(18)

Characteristics of Interest of a Graph

k P

(19)

Characteristics of Interest of a Graph

Typical distance. What is the mean distancebetween any pairs of vertices?

Logarithmic typical distance

k P

(20)

Characteristics of Interest of a Graph

k P

(21)

Characteristics of Interest of a Graph

Local clustering. Ifa is connected to bothb andc, is the probability that b is connected toc significantly greater than the probability any two nodes are connected?

Strong local clustering Degree distribution. What is the distribution of the degree of vertices?

k P

(22)

Characteristics of Interest of a Graph

Local clustering. Ifa is connected to bothb andc, is the probability that b is connected toc significantly greater than the probability any two nodes are connected? Strong local clustering

Degree distribution. What is the distribution of the degree of vertices? Power-lawindegree and outdegree distribution (2≤γ ≤3) [Broder et al., 2000]

k P

(23)

Characteristics of Interest of a Graph

k P

(24)

Characteristics of Interest of a Graph

k P

(25)

The Web as a graph Computing the importance of a page

PageRank (Google’s Ranking [Brin and Page, 1998])

Idea

Important pages are pages pointed to by importantpages.

(g_ĳ =0 if there is no link between page i and j;

g_ĳ = _n¹

i otherwise, with n_i the number of outgoing links of page i.

Definition (Tentative)

Probability that the surfer following therandom walkin G has arrived on page i at some distant given point in the future.

pr(i) =

k→+∞lim (G^T)^kv

i

where v is some initial column vector.

(26)

Illustrating PageRank Computation

0.100 0.100

0.100

0.100 0.100

0.100

(27)

Illustrating PageRank Computation

0.033 0.317

0.075

0.108

0.025 0.058

0.083

0.150

0.117

0.033

(28)

Illustrating PageRank Computation

0.036 0.193

0.108

0.163

0.079 0.090

0.074

0.154

0.094

0.008

(29)

Illustrating PageRank Computation

0.054 0.212

0.093

0.152

0.048 0.051

0.108

0.149

0.106

0.026

(30)

Illustrating PageRank Computation

0.051 0.247

0.078

0.143

0.053 0.062

0.097

0.153

0.099

0.016

(31)

Illustrating PageRank Computation

0.048 0.232

0.093

0.156

0.062 0.067

0.087

0.138

0.099

0.018

(32)

Illustrating PageRank Computation

0.052 0.226

0.092

0.148

0.058 0.064

0.098

0.146

0.096

0.021

(33)

Illustrating PageRank Computation

0.049 0.238

0.088

0.149

0.057 0.063

0.095

0.141

0.099

0.019

(34)

Illustrating PageRank Computation

0.050 0.232

0.091

0.149

0.060 0.066

0.094

0.143

0.096

0.019

(35)

Illustrating PageRank Computation

0.050 0.233

0.091

0.150

0.058 0.064

0.095

0.142

0.098

0.020

(36)

Illustrating PageRank Computation

0.050 0.234

0.090

0.148

0.058 0.065

0.095

0.143

0.097

0.019

(37)

Illustrating PageRank Computation

0.049 0.233

0.091

0.149

0.058 0.065

0.095

0.142

0.098

0.019

(38)

Illustrating PageRank Computation

0.050 0.233

0.091

0.149

0.058 0.065

0.095

0.143

0.097

0.019

(39)

Illustrating PageRank Computation

0.050 0.234

0.091

0.149

0.058 0.065

0.095

0.142

0.097

0.019

(40)

PageRank with Damping

May not always converge, or convergence may not be unique.

To fix this, the random surfer can at each step randomly jumpto any page of the Web with some probability d (1−d: damping factor).

pr(i) =

k→+∞lim ((1−d)G^T +dU)^kv

i

where U is the matrix with all _N¹ values withN the number of vertices.

(41)

Using PageRank to Score Query Results

PageRank: globalscore, independent of the query Can be used to raise the weight ofimportant pages:

weight(t,d) =tfidf(t,d)×pr(d), This can be directly incorporatedin the index.

(42)

Iterative Computation of PageRank

1 ComputeG (often stored as its adjacency list). Make sure lines sum to 1.

2 Let u be the uniform vector of sum 1,v =u,w thezero vector.

3 While v is different enoughfromw:

I Setw =v.

I Setv = (1−d)G^Tv+du.

Exercise

1 2

4 3

Run the first iteration of the PageRank computation.

(43)

Another take on PageRank: OPIC [Abiteboul et al., 2003]

Online Page Importance Computation

Allows computing PageRankwithout storingthe Web graph Remember for each page itscash and itshistory

Initially, each page has some initial cash

Each time a page is accessed, distribute uniformly its cash to outgoing links

PageRank is approximated as its cashflow: history divided by number of visits

Convergewhatever the strategy for crawling the Web is (as soon as each page is accessed repeatedly)

Good strategy: crawl page with largest cash first

(44)

HITS (Kleinberg, [Kleinberg, 1999])

Idea

Two kinds of important pages: hubsand authorities. Hubs are pages that point to good authorities, whereas authorities are pages that are pointed to by good hubs.

G⁰ transition matrix (with 0 and 1 values) of a subgraph of the Web. We use the following iterative process (starting with aandh vectors of norm 1):







a:= _kG0T¹hk G^0Th h:= _kG¹0ak G⁰a

Converges under some technical assumptions to authorityandhub scores.

(45)

Using HITS to Order Web Query Results

1 Retrieve the set D of Web pagesmatching a keyword query.

2 Retrieve the set D^∗ of Web pages obtained from D by adding all linked pages, as well as all pages linking topages ofD.

3 Build fromD^∗ the correspondingsubgraph G⁰ of the Web graph.

4 Computeiterativelyhubs and authority scores.

5 Sort documents fromD byauthority scores.

Less efficient than PageRank, becauselocal scores.

(46)

Biasing PageRank: Green Measures

Equivalent definitions of Green measure centered at node i

PageRank with source ati: standard PageRank computation while, at each iteration, adding 1 to the measure ofi, and subtractingpr(j) to every nodej.

Time spent at a node knowing the initial node is i.

(47)

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.050 -0.233

-0.091

-0.149

-0.058 0.935

-0.095

-0.143

-0.097

-0.019

(48)

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.099 0.034

0.319

-0.298

-0.117 0.870

-0.190

-0.285

-0.194

-0.039

(49)

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.149 -0.199

0.353

-0.322

-0.050 0.930

-0.035

-0.178

-0.291

-0.058

(50)

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.157 -0.078

0.325

-0.304

-0.109 0.865

-0.026

-0.258

-0.222

-0.036

(51)

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.151 -0.096

0.322

-0.333

-0.078 0.903

-0.034

-0.203

-0.273

-0.055

(52)

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.161 -0.096

0.337

-0.300

-0.083 0.892

-0.045

-0.256

-0.242

-0.045

(53)

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.150 -0.108

0.331

-0.328

-0.083 0.895

-0.027

-0.218

-0.267

-0.047

(54)

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.159 -0.086

0.330

-0.312

-0.086 0.892

-0.039

-0.246

-0.248

-0.047

(55)

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.154 -0.104

0.334

-0.321

-0.081 0.897

-0.034

-0.228

-0.262

-0.048

(56)

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.157 -0.094

0.332

-0.314

-0.085 0.892

-0.035

-0.241

-0.252

-0.046

(57)

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.155 -0.098

0.332

-0.320

-0.083 0.895

-0.034

-0.231

-0.259

-0.047

(58)

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.157 -0.095

0.332

-0.315

-0.084 0.893

-0.036

-0.239

-0.253

-0.046

(59)

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.155 -0.098

0.332

-0.318

-0.084 0.894

-0.034

-0.234

-0.257

-0.047

(60)

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.156 -0.095

0.332

-0.316

-0.084 0.893

-0.035

-0.238

-0.254

-0.046

(61)

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.156 -0.097

0.332

-0.318

-0.084 0.894

-0.034

-0.235

-0.256

-0.047

(62)

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.156 -0.096

0.332

-0.316

-0.084 0.893

-0.035

-0.237

-0.254

-0.046

(63)

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.156 -0.097

0.332

-0.317

-0.084 0.894

-0.034

-0.236

-0.255

-0.046

(64)

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.156 -0.096

0.332

-0.316

-0.084 0.893

-0.034

-0.237

-0.254

-0.046

(65)

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.156 -0.096

0.332

-0.317

-0.084 0.893

-0.034

-0.236

-0.255

-0.046

(66)

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.156 -0.096

0.332

-0.316

-0.085 0.893

-0.034

-0.237

-0.254

-0.046

(67)

The Web as a graph Spamdexing

Spamdexing

Definition

Fraudulent techniques that are used by unscrupulous webmasters to artificially raise the visibility of their website to users of search engines Purpose: attracting visitors to websites to make profit.

Unceasing war between spamdexersandsearch engines

(68)

Spamdexing: Lying about the Content

Technique

Put unrelatedterms in:

meta-information (<meta name="description">,

text content hidden to the user with JavaScript, CSS, or HTML presentational elements

Countertechnique

Ignoremeta-information Try and detectinvisible text

(69)

Link Farm Attacks

Technique

Huge number of hosts on the Internet used for the sole purpose ofreferencing each other, without any content in themselves, to raise the importance of a given website or set of websites.

Countertechnique

Detection of websites with emptyor duplicatecontent

Use of heuristics to discoversubgraphs that look like link farms

(70)

Link Pollution

Technique

Pollute user-editablewebsites (blogs, wikis) or exploit security bugs to add artificiallinks to websites, in order to raise its importance.

Countertechnique

rel="nofollow" attribute to<a> links not validated by a page’s owner

(71)

The Web as a graph Discovering communities

Discovery of communities

Identifyingcommunitiesof Web pages using thegraph structure: can be used for clustering the Web, or for finding boundaries oflogical Web sites

Two subproblems:

1 Given some initial vertex or vertex set, finding the corresponding community

2 Given the graph as a whole, finding a partition in communities

(72)

Maximum Flow / Minimum Cut

/6 /2

/1 /5 /2

/3

sink source

/4

Use of a maximum flow computation algorithm [Goldberg and Tarjan, 1988] to separate aseed of users from the remaining of the graph ComplexityO(n²m) (n: vertices,m: edges)

(73)

Maximum Flow / Minimum Cut

/6 /2

/1 /5 /2

/3 source

4 0

3 2

1 /4 4

1 sink

Use of a maximum flow computation algorithm [Goldberg and Tarjan, 1988] to separate aseed of users from the remaining of the graph ComplexityO(n²m)(n: vertices,m: edges)

(74)

Maximum Flow / Minimum Cut

/6 /2

/1 /5 /2

/3

sink source

4 0

3 2

1 /4 4

1

Use of a maximum flow computation algorithm [Goldberg and Tarjan, 1988] to separate aseed of users from the remaining of the graph ComplexityO(n²m)(n: vertices,m: edges)

(75)

Markov Cluster Algorithm (MCL) [van Dongen, 2000]

Graph clusteringalgorithm

Based as well on maximum flow simulation, in the whole graph Iteration of a matrix computation alternating:

I Expansion(matrix multiplication, corresponding to flow propagation)

I Inflation(non-linear operation to increase heterogeneity) Complexity: O(n³) for an exact computation,O(n) for an approximate one

[van Dongen, 2000]

(76)

Markov Cluster Algorithm (MCL) [van Dongen, 2000]

Graph clusteringalgorithm

Based as well on maximum flow simulation, in the whole graph Iteration of a matrix computation alternating:

I Expansion(matrix multiplication, corresponding to flow propagation)

I Inflation(non-linear operation to increase heterogeneity) Complexity: O(n³) for an exact computation,O(n) for an approximate one

[van Dongen, 2000]

(77)

Deletion of the edges with the highest betwenness [Newman and Girvan, 2004]

Top-downgraph clustering algorithm

Betwenness of an edge: number of minimal paths between two arbitrary vertices going through this edge

General principle:

1 Compute thebetweennessof each edge in the graph

2 Removethe edge with the highest betweenness

3 Redo the whole process, betweenness computation included Complexity: O(n³) for a sparse graph

[Newman and Girvan, 2004]

(78)

Social Networking Sites

Outline

3 Social Networking Sites Typology

Social Network Graphs Algorithms on social networks

4 Conclusion

(79)

Social Networking Sites

Graph Mining on Other Graphs

Graph miningtools used for the Web: also applicable to a variety of different graphs:

I dictionary

I encyclopedia

I citation graph

I . . .

Among these: social networking Web sites

Attention: on undirected graphs, PageRank is proportional to degree!

Not very useful.

(80)

Social Networking Sites Typology

Most popular social networks

Social networking sites the most popular in the world and in France (Web site rank with the most traffic, according to Alexa)

Monde France

SkyRock 51 3

YouTube 3 4

MySpace 17 7

Facebook 5 8

Dailymotion 61 11

EBay 18 12

Wikipedia 8 13

Meetic 565 27

ImageShack 47 53

hi5 15 59

Megavideo 133 80

Adult Friendfinder 55 82

Wat.tv 1568 88

Flickr 33 94

Orkut 19 >100

V Kontakte 28 >100

Friendster 39 >100

(81)

Social Networking Sites Typology

Typology of social networking sites

Site de réseau social

Orienté contenu Orienté utilisateur

Catalogue Partage Édition Vente Discussion Pur Communautés de blogs Rencontre

Livres Musique Liens Films Publications Jeux Images Vidéos Adulte everything2 Wikipedia EBay Yahoo! Answers

Flickr (Yahoo!) Photobucket (Fox) YouTube (Google) Dailymotion Megavideo Wat (TF1)

Personnel Professionnel Mélangé SkyRock Twitter FriendFinder Meetic

MySpace (Fox) hi5 Friendster LinkedIn Facebook Orkut (Google) V Kontakte LibraryThing Shelfari (Amazon) Last.fm (CBS) Delicious (Yahoo!) Flixster Yahoo! Movies CiteULike MobyGames

(82)

Social Networking Sites Social Network Graphs

Graphs of Social Networks

Natural model: social network = graph Entities =nodes, Relations =edges Depending on the specific case:

I mostly undirected graphs

I bipartite,n-partite

I annotated edges, weighted edges

(83)

Undirected graph

Adapted forpure social networking sites with symmetrical relations (e.g., LinkedIn)

(84)

Multipartite graph

Adapted to most social networks for sharing content, with annotations, users, content, etc. (e.g., Flickr)

mason.flickr manufrakass

france chateau

(85)

Six degrees of separation

Idea that any two persons on Earth are separated bya chain of six personssuch that any two successive links know each other Demonstrated by an experiment by Stanley Milgram [Travers and Milgram, 1969] (package to transmit to an unknown person, using acquaintances)

Popularized in numerous media

Number6 is not too be taken too seriously! But main ideavalidated in more recent experiments

In other fields:

I Erdős numberfor scientific publications

I Kevin Baconfor Hollywood movies

Common Characteristicsof (most) social networks !

(86)

Characteristics of Social Networking Sites

Four important characteristics [Newman et al., 2006]:

Sparse graphs: much more edges than a complete graph

Small typical distance: shortest path between two nodes logarithmic with respect to the size of the graph

High clustering: ifa is connected tob andb to c, thenb has good probability of being linked toc

Power-law degree distribution: the number of nodes with degree k of the order ofk^−γ (γ constant)

k P

More on this in Athens course on Collective Intelligence (TPT09)

(87)

Social Networking Sites Algorithms on social networks

Information retrieval with social score [Schenkel et al., 2008]

Context: Multipartite graph, e.g., Flickr Purpose: bias results with one’s social network Social weighting:

I Given a friendship relationF(u,u⁰)(explicit or implicit) between two users, anextended friendshiprelation can be computed

F(u,˜ u⁰) = α

|U|+ (1−α) max

cheminu=u0. . .uk=u⁰ k−1

Y

i=0

F(ui,ui+1) (0< α <1 constant;|U|: number of users)

I Instead of aglobal weighting

tf-idf(t,d) =tf(t,d)×idf(t,d) we choose asocial weightingdependent of u:

tf-idfu(t,d) = X

u⁰∈U

F(u,u⁰)·tfu⁰(t,d)

!

×idf(t,d)

(88)

Social Networking Sites Algorithms on social networks

Top-k with social score [Benedikt et al., 2008]

Possible to adapt Fagin’s trheshold algorithm. . .

. . . butimpossible to precomputethe scores tf-idf_u(t,d) for each user To avoid too much computation time:

1 Cluster the graph of users intostrongly similarcomponents

2 Use the scoresinside these componentsas estimates of the threshold in Fagin’s algorithm

3 ⇒givesapproximateresults, but good quality

(89)

Conclusion

Outline

4 Conclusion

(90)

Conclusion

What you should remember The Web seen as a graph

PageRank, and its iterative computation

Graph mining techniques applicable to a wide range of graphs Social networks share important common characteristics with the Web graph

To go further

A good textbook [Chakrabarti, 2003]

Two accessible influential research papers:

I On HITS [Kleinberg, 1999]

I On PageRank [Brin and Page, 1998]

(91)

Bibliography I

Serge Abiteboul, Mihai Preda, and Gregory Cobena. Adaptive on-line page importance computation. InProc. WWW, May 2003.

Michael Benedikt, Sihem Amer Yahia, Laks Lakshmanan, and Julia Stoyanovich. Efficient network-aware search in collaborative tagging sites. In Proc. VLDB, Auckland, New Zealand, August 2008.

Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks, 30(1–7):107–117, April 1998.

Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. Graph structure in the web. Computer Networks, 33(1-6):309–320, 2000.

Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, San Fransisco, USA, 2003.

(92)

Bibliography II

Andrew V. Goldberg and Robert E. Tarjan. A new approach to the maximum-flow problem. Journal of the ACM, 35(4):921–940, October 1988.

Jon M. Kleinberg. Authoritative Sources in a Hyperlinked Environment.

Journal of the ACM, 46(5):604–632, 1999.

M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2), 2004.

Mark Newman, Albert-László Barabási, and Duncan J. Watts. The Structure and Dynamics of Networks. Princeton University Press, 2006.

Yann Ollivier and Pierre Senellart. Finding related pages using Green measures: An illustration with Wikipedia. InProc. AAAI, Vancouver, Canada, July 2007.

(93)

Bibliography III

Ralf Schenkel, Tom Crecelius, Mouna Kacimi, Sebastian Michel, Thomas Neumann, Josiane X. Parreira, and Gerhard Weikum. Efficient top-k querying over social-tagging networks. InProc. SIGIR, Singapore, Singapore, July 2008.

Jeffrey Travers and Stanley Milgram. An experimental study of the small world problem. Sociometry, 34(4), December 1969.

Stĳn Marinus van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.