Basics of Graph Theory
Web Search
Graph Mining
Pierre Senellart
(pierre.senellart@telecom-paristech.fr)
18 March 2009
Outline
1 Basics of Graph Theory
2 The Web as a graph
3 Social Networking Sites
4 Conclusion
P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 1 / 46
Graphs
[Figure: example graph with vertices 1 to 6]
Definition
A directed graph is a pair (S, A) where:
S is a finite set of vertices (or nodes)
A is a subset of S² defining the edges (or arcs)
[Figure: example graph with vertices 1 to 6]
Definition
An undirected graph is a pair (S, A) where:
S is a finite set of vertices (or nodes)
A is a set of (unordered) pairs of elements of S defining the edges (or arcs)
Remark
Graph is the mathematical term; network is used to describe real-world graphs.
Paths and Connectedness
Definition
A path is a sequence of vertices $v_1 \dots v_n$ such that $v_k$ is connected by an edge to $v_{k+1}$ for $1 \le k \le n-1$.
Definition
The underlying undirected graph of a directed graph G is the graph obtained by adding all reverse edges.
Definition
An undirected graph is connected if for every two vertices u and v, there exists a path starting from u and ending in v.
A directed graph is strongly connected if for every two vertices u and v there exists a directed path from u to v, and is weakly connected if the underlying undirected graph is connected.
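The definitions above can be checked mechanically. The following is a minimal sketch, assuming a directed graph stored as a dictionary mapping every vertex to the set of its successors; the function names and the two example graphs are illustrative, not part of the lecture.

```python
from collections import deque

def reachable(adj, start):
    """Set of vertices reachable from `start` by following edges (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def underlying_undirected(adj):
    """Add all reverse edges, as in the definition of the underlying undirected graph."""
    und = {u: set(vs) for u, vs in adj.items()}
    for u, vs in adj.items():
        for v in vs:
            und.setdefault(v, set()).add(u)
    return und

def is_strongly_connected(adj):
    # Every vertex must reach every other vertex along directed paths.
    vertices = set(adj)
    return all(reachable(adj, u) == vertices for u in vertices)

def is_weakly_connected(adj):
    # Connectedness of the underlying undirected graph: one BFS is enough.
    und = underlying_undirected(adj)
    vertices = set(und)
    return not vertices or reachable(und, next(iter(vertices))) == vertices

# Hypothetical examples: a directed path and a directed cycle on three vertices.
path_graph = {1: {2}, 2: {3}, 3: set()}
cycle_graph = {1: {2}, 2: {3}, 3: {1}}
```

The directed path is weakly but not strongly connected, whereas the cycle is both. The strong-connectivity check runs one search per vertex, which is simple but not the most efficient approach.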
Connected Components
Definition
(S′, A′) is a subgraph of (S, A) if S′ ⊆ S and A′ is the restriction of A to edges whose vertices are in S′.
Connected component: maximal connected subgraph
Strongly connected component: maximal strongly connected subgraph
Weakly connected component: maximal weakly connected subgraph
[Figure: example directed graph on vertices 1 to 6, shown with its strongly connected components and then its weakly connected components highlighted]
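As an illustration of these definitions, here is a small sketch that computes both kinds of components by plain reachability. It assumes every vertex appears as a key and that successors are stored as sets; the example graph is hypothetical (the edges of the figure are not recoverable from the slides), and the quadratic method is chosen for readability rather than speed.

```python
def reach(adj, start):
    """Vertices reachable from `start` (depth-first search)."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def reverse(adj):
    """Graph with every arrow reversed."""
    rev = {u: set() for u in adj}
    for u, vs in adj.items():
        for v in vs:
            rev[v].add(u)
    return rev

def strongly_connected_components(adj):
    # u and v are in the same component iff each one reaches the other.
    rev = reverse(adj)
    remaining, components = set(adj), []
    while remaining:
        u = min(remaining)
        comp = reach(adj, u) & reach(rev, u)
        components.append(comp)
        remaining -= comp
    return components

def weakly_connected_components(adj):
    # Components of the underlying undirected graph.
    rev = reverse(adj)
    und = {u: adj[u] | rev[u] for u in adj}
    remaining, components = set(adj), []
    while remaining:
        comp = reach(und, min(remaining))
        components.append(comp)
        remaining -= comp
    return components

# Hypothetical directed graph on vertices 1 to 6.
example = {1: {2}, 2: {1}, 3: {6}, 4: {5}, 5: {2, 4}, 6: set()}
scc = strongly_connected_components(example)
wcc = weakly_connected_components(example)
```

On this example the strongly connected components are {1, 2}, {3}, {4, 5}, {6}, and the weakly connected components are {1, 2, 4, 5} and {3, 6}.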
Vocabulary
Incident: an edge is said to be incident to a vertex if it has this vertex as an endpoint
Degree (of a vertex): number of edges incident to the vertex, in an undirected graph
Indegree (of a vertex): number of edges arriving at the vertex, in a directed graph
Outdegree (of a vertex): number of edges leaving the vertex, in a directed graph
Cycle: path whose start and end vertices are the same
Distance: length of the shortest path between two vertices
Sparse: a graph (S, A) is sparse if $|A| \ll |S|^2$
Bipartite graphs
Definition
A bipartite graph is an undirected graph (S, A) such that S = S1 ∪ S2 (with S1 ∩ S2 = ∅), and no edge of A is incident to two vertices in S1 or to two vertices in S2.
Paths of length 2 in a bipartite graph define two ordinary undirected graphs, one on S1 and one on S2.
[Figure: a bipartite graph with S1 = {1, 2, 3, 4} and S2 = {5, 6, 7}, and the two undirected graphs that its paths of length 2 define on S1 and on S2]
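The remark about paths of length 2 can be made concrete with a short sketch. It assumes the bipartite graph is given as a list of (vertex of S1, vertex of S2) pairs; the edge list below is hypothetical, since the edges of the figure are not recoverable, and `project` is an illustrative name.

```python
from itertools import combinations

def project(edges, side):
    """Undirected graph on `side`: u and v are linked when a path of length 2 joins them."""
    neighbours = {}
    for a, b in edges:
        # Each edge has exactly one endpoint in `side` (bipartite assumption);
        # group the vertices of `side` by their neighbour on the other side.
        if a in side:
            neighbours.setdefault(b, set()).add(a)
        else:
            neighbours.setdefault(a, set()).add(b)
    projected = set()
    for shared in neighbours.values():
        for u, v in combinations(sorted(shared), 2):
            projected.add((u, v))
    return projected

S1, S2 = {1, 2, 3, 4}, {5, 6, 7}
edges = [(1, 5), (2, 5), (2, 6), (3, 6), (4, 7)]   # hypothetical edges
graph_on_S1 = project(edges, S1)
graph_on_S2 = project(edges, S2)
```

Here vertices 1 and 2 are linked in the graph on S1 because both are adjacent to 5, and 5 and 6 are linked in the graph on S2 because both are adjacent to 2.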
Matrix Representation of a Graph
A graph G can be represented by its adjacency matrix M:
[Figure: example directed graph on vertices 1 to 6]
0 1 0 0 0 0
0 1 0 1 1 0
0 0 0 0 0 1
1 0 0 0 1 0
0 0 0 1 0 0
0 0 0 0 0 0
Adjacency matrices of undirected graphs are symmetric.
$M^T$ is the adjacency matrix of the graph obtained from G by reversing the arrows.
$M^n$ is the matrix of the graph of all paths of length n in G: its entry (i, j) counts the paths of length n from i to j.
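To see the last two properties at work, the sketch below squares and transposes the adjacency matrix shown on this slide, in pure Python. The helper names are illustrative; a numerical library would normally be used for larger graphs.

```python
# Adjacency matrix from the slide (row i, column j is 1 when there is an edge i -> j).
M = [
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(a, n):
    """n-th power of a square matrix by repeated multiplication."""
    result = [[int(i == j) for j in range(len(a))] for i in range(len(a))]
    for _ in range(n):
        result = mat_mul(result, a)
    return result

def transpose(a):
    return [list(row) for row in zip(*a)]

M2 = mat_pow(M, 2)   # M2[i][j] counts the paths of length 2 from i+1 to j+1
MT = transpose(M)    # adjacency matrix of the graph with reversed arrows
```

For instance, `M2[1][3]` is 2 because there are two paths of length 2 from vertex 2 to vertex 4: 2 → 2 → 4 (using the self-loop on 2) and 2 → 5 → 4.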
Concrete Representation of Graphs
In programming, a graph is usually represented by:
its adjacency matrix (stored as a multidimensional array), for non-sparse graphs
the list of all edges incident to each node, for sparse graphs
[Figure: example directed graph on vertices 1 to 6]
Vertex  Successors
1       2
2       2, 4, 5
3       6
4       1, 5
5       4
6       (none)
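The two representations carry the same information, as the following sketch suggests. It stores the adjacency list of the slide as a Python dictionary and converts in both directions; the function names are illustrative.

```python
# Adjacency list from the slide: vertex -> list of successors.
adjacency = {1: [2], 2: [2, 4, 5], 3: [6], 4: [1, 5], 5: [4], 6: []}

def to_matrix(adj):
    """Dense n x n adjacency matrix (vertices are assumed to be numbered 1..n)."""
    n = len(adj)
    matrix = [[0] * n for _ in range(n)]
    for u, successors in adj.items():
        for v in successors:
            matrix[u - 1][v - 1] = 1
    return matrix

def to_adjacency(matrix):
    """Adjacency lists recovered from a dense matrix."""
    return {i + 1: [j + 1 for j, x in enumerate(row) if x]
            for i, row in enumerate(matrix)}

matrix = to_matrix(adjacency)
edge_count = sum(len(s) for s in adjacency.values())   # memory used by the lists
cell_count = len(matrix) ** 2                          # memory used by the matrix
```

The adjacency list stores 8 entries where the matrix stores 36 cells, which is why lists are preferred when $|A| \ll |S|^2$.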
The Web as a graph
Outline
1 Basics of Graph Theory
2 The Web as a graph
Introduction
Computing the importance of a page
Spamdexing
Discovering communities
3 Social Networking Sites
4 Conclusion
The Web as a graph Introduction
The Web Graph
The World Wide Web seen as a (directed) graph:
Vertices: Web pages
Edges: hyperlinks
Same for other interlinked environments:
dictionaries
encyclopedias
scientific publications
social networks
The shape of the Web graph
[Figure: shape of the Web graph, from Broder et al., 2000]
Characteristics of Interest of a Graph
Sparsity. Is the graph sparse ($|A| \ll |S|^2$)?
Yes, the World Wide Web is sparse.
Typical distance. What is the mean distance between any pair of vertices?
Logarithmic typical distance.
Local clustering. If a is connected to both b and c, is the probability that b is connected to c significantly greater than the probability that any two nodes are connected?
Strong local clustering.
Degree distribution. What is the distribution of the degree of vertices?
Power-law indegree and outdegree distribution ($2 \le \gamma \le 3$) [Broder et al., 2000]
[Figure: plot of P against k]
The Web as a graph Computing the importance of a page
PageRank (Google's Ranking [Brin and Page, 1998])
Idea
Important pages are pages pointed to by important pages.
$$g_{ij} = \begin{cases} 0 & \text{if there is no link from page } i \text{ to page } j \\ \frac{1}{n_i} & \text{otherwise, with } n_i \text{ the number of outgoing links of page } i \end{cases}$$
Definition (Tentative)
Probability that the surfer following the random walk in G has arrived on page i at some distant given point in the future.
$$\mathrm{pr}(i) = \left( \lim_{k \to +\infty} (G^T)^k v \right)_i$$
where v is some initial column vector.
Illustrating PageRank Computation
[Figure: 14 frames showing successive iterations of the PageRank computation on a 10-node example graph. Every node starts with score 0.100; the scores in the last frame, in extraction order, are 0.050, 0.234, 0.091, 0.149, 0.058, 0.065, 0.095, 0.142, 0.097, 0.019.]
PageRank with Damping
The iteration may not always converge, or its limit may not be unique.
To fix this, the random surfer can at each step randomly jump to any page of the Web with some probability d (1−d: damping factor).
$$\mathrm{pr}(i) = \left( \lim_{k \to +\infty} \left( (1-d)\, G^T + d\, U \right)^k v \right)_i$$
where U is the matrix whose entries are all equal to $\frac{1}{N}$, with N the number of vertices.
Using PageRank to Score Query Results
PageRank: global score, independent of the query
Can be used to raise the weight of important pages:
weight(t, d) = tfidf(t, d) × pr(d) (here d denotes a document, not the jump probability)
This can be directly incorporated in the index.
Iterative Computation of PageRank
1 Compute G (often stored as its adjacency list). Make sure rows sum to 1.
2 Let u be the uniform vector of sum 1, v = u, w the zero vector.
3 While v is different enough from w:
  - Set w = v.
  - Set $v = (1-d)\, G^T v + d\, u$.
Exercise
[Figure: directed graph on vertices 1 to 4]
Run the first iteration of the PageRank computation.
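The iterative scheme can be sketched as follows. This is a minimal illustration rather than a production implementation: it assumes the graph is stored as adjacency lists, that every page has at least one outgoing link (so that rows of G sum to 1), and a jump probability d = 0.15 chosen only for the example. The four-page graph is hypothetical and is not the graph of the exercise, whose edges are not recoverable from the slides.

```python
def pagerank(adj, d=0.15, tol=1e-10, max_iter=1000):
    """Iterate v = (1 - d) G^T v + d u until v stops changing."""
    nodes = sorted(adj)
    u = {i: 1.0 / len(nodes) for i in nodes}   # uniform vector of sum 1
    v = dict(u)
    for _ in range(max_iter):
        w = v
        v = {i: d * u[i] for i in nodes}       # random-jump term d u
        for i in nodes:
            # Page i shares (1 - d) of its score equally among its outgoing links.
            share = (1 - d) * w[i] / len(adj[i])
            for j in adj[i]:
                v[j] += share
        if sum(abs(v[i] - w[i]) for i in nodes) < tol:
            break
    return v

# Hypothetical graph: page 4 has no incoming link.
web = {1: [2, 3], 2: [3], 3: [1], 4: [1, 3]}
ranks = pagerank(web)
```

Because page 4 is never linked to, its score is exactly the jump term d/N, while page 3, which collects links from three pages, ends up ahead of page 2.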
Another take on PageRank: OPIC [Abiteboul et al., 2003]
Online Page Importance Computation
Allows computing PageRank without storing the Web graph
Remember for each page its cash and its history
Initially, each page has some initial cash
Each time a page is accessed, distribute its cash uniformly to its outgoing links
PageRank is approximated as its cashflow: history divided by number of visits
Converges whatever the strategy for crawling the Web is (as long as each page is accessed repeatedly)
Good strategy: crawl the page with the largest cash first
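The cash and history bookkeeping can be sketched as below. This is only an illustration under several assumptions: the graph is small enough to be held in memory (which defeats the purpose of OPIC but keeps the example short), every page has at least one outgoing link, the crawl always picks the page with the largest cash, and the estimate is normalized by the total history so that it sums to 1, which is one possible reading of the cashflow on the slide.

```python
def opic(adj, steps):
    """Greedy OPIC sketch: always visit the page currently holding the most cash."""
    cash = {p: 1.0 / len(adj) for p in adj}     # some initial cash for each page
    history = {p: 0.0 for p in adj}
    for _ in range(steps):
        page = max(cash, key=cash.get)          # "largest cash first" strategy
        amount, cash[page] = cash[page], 0.0    # the page spends all its cash
        history[page] += amount                 # and records it in its history
        share = amount / len(adj[page])         # uniform split over outgoing links
        for target in adj[page]:
            cash[target] += share
    total = sum(history.values())
    return {p: history[p] / total for p in adj}

# Hypothetical three-page graph whose undamped PageRank is (0.4, 0.2, 0.4).
web = {1: [2, 3], 2: [3], 3: [1]}
estimate = opic(web, steps=20000)
```

On this toy graph the estimates approach the stationary distribution of the random walk, without any matrix ever being built.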
HITS [Kleinberg, 1999]
Idea
Two kinds of important pages: hubs and authorities. Hubs are pages that point to good authorities, whereas authorities are pages that are pointed to by good hubs.
G′: transition matrix (with 0 and 1 values) of a subgraph of the Web. We use the following iterative process (starting with a and h vectors of norm 1):
$$a := \frac{1}{\lVert {G'}^T h \rVert}\, {G'}^T h \qquad h := \frac{1}{\lVert G' a \rVert}\, G' a$$
Converges under some technical assumptions to authority and hub scores.
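The two update rules translate almost directly into code. The sketch below assumes the subgraph is given as adjacency lists and uses the Euclidean norm and a fixed number of iterations, both of which are choices made for the example; the four-page graph and its page names are hypothetical.

```python
from math import sqrt

def normalize(vec):
    """Rescale a vector (stored as a dict) to Euclidean norm 1."""
    norm = sqrt(sum(x * x for x in vec.values()))
    return {k: x / norm for k, x in vec.items()}

def hits(adj, iterations=50):
    nodes = list(adj)
    a = normalize({p: 1.0 for p in nodes})
    h = normalize({p: 1.0 for p in nodes})
    for _ in range(iterations):
        # a := G'^T h / ||G'^T h||: a page is a good authority if good hubs point to it.
        new_a = {p: 0.0 for p in nodes}
        for p in nodes:
            for q in adj[p]:
                new_a[q] += h[p]
        a = normalize(new_a)
        # h := G' a / ||G' a||: a page is a good hub if it points to good authorities.
        h = normalize({p: sum(a[q] for q in adj[p]) for p in nodes})
    return a, h

# Hypothetical subgraph: two pages linking to two others.
subgraph = {"hub1": ["auth1", "auth2"], "hub2": ["auth1"], "auth1": [], "auth2": []}
authority, hub = hits(subgraph)
```

Here `auth1` gets the best authority score because both hubs point to it, and `hub1` gets the best hub score because it points to both authorities.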
Using HITS to Order Web Query Results
1 Retrieve the set D of Web pages matching a keyword query.
2 Retrieve the set D* of Web pages obtained from D by adding all linked pages, as well as all pages linking to pages of D.
3 Build from D* the corresponding subgraph G′ of the Web graph.
4 Compute iteratively hub and authority scores.
5 Sort documents from D by authority scores.
Less efficient than PageRank, because the scores are local: they depend on the query and must be computed at query time.
Biasing PageRank: Green Measures
Equivalent definitions of the Green measure centered at node i:
PageRank with source at i: standard PageRank computation while, at each iteration, adding 1 to the measure of i and subtracting pr(j) from the measure of every node j.
Time spent at a node knowing the initial node is i.
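The first definition can be sketched as an iteration. The code below is an illustration under stated assumptions: it uses the random walk without damping, on a small hypothetical graph that is strongly connected and aperiodic so that the iteration converges, and a fixed number of iterations instead of a convergence test.

```python
def step(adj, m):
    """One step of the random walk: returns G^T m for the row-normalized graph."""
    out = {j: 0.0 for j in adj}
    for i, targets in adj.items():
        for j in targets:
            out[j] += m[i] / len(targets)
    return out

def stationary(adj, iterations=500):
    """Undamped PageRank by power iteration from the uniform vector."""
    v = {i: 1.0 / len(adj) for i in adj}
    for _ in range(iterations):
        v = step(adj, v)
    return v

def green_measure(adj, source, iterations=500):
    pr = stationary(adj, iterations)
    m = {j: 0.0 for j in adj}
    for _ in range(iterations):
        m = step(adj, m)          # standard PageRank propagation
        m[source] += 1.0          # add 1 to the measure of the source
        for j in adj:
            m[j] -= pr[j]         # subtract pr(j) from every node j
    return m

# Hypothetical three-page graph (strongly connected, aperiodic).
web = {1: [2, 3], 2: [3], 3: [1]}
green = green_measure(web, source=2)
```

Each iteration adds 1 and removes a total of 1, so the measure always sums to 0: nodes with a positive value are those visited more often than at equilibrium when the walk starts from the source.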
Illustrating Green Measures [Ollivier and Senellart, 2007]
[Figure: 20 frames showing successive iterations of the Green measure computation on the 10-node example graph, centered at one node. The first frame shows -0.050, -0.233, -0.091, -0.149, -0.058, 0.935, -0.095, -0.143, -0.097, -0.019; the last frame shows -0.156, -0.096, 0.332, -0.316, -0.085, 0.893, -0.034, -0.237, -0.254, -0.046 (values in extraction order).]
The Web as a graph Spamdexing
Spamdexing
Definition
Fraudulent techniques that are used by unscrupulous webmasters to artificially raise the visibility of their website to users of search engines.
Purpose: attracting visitors to websites to make profit.
Unceasing war between spamdexers and search engines
Spamdexing: Lying about the Content
Technique
Put unrelated terms in:
meta-information (<meta name="description">, <meta name="keywords">)
text content hidden to the user with JavaScript, CSS, or HTML presentational elements
Countertechnique
Ignore meta-information
Try and detect invisible text
Link Farm Attacks
Technique
Huge number of hosts on the Internet used for the sole purpose of referencing each other, without any content in themselves, to raise the importance of a given website or set of websites.
Countertechnique
Detection of websites with empty or duplicate content
Use of heuristics to discover subgraphs that look like link farms
Link Pollution
Technique
Pollute user-editable websites (blogs, wikis) or exploit security bugs to add artificial links to a website, in order to raise its importance.
Countertechnique
Add the rel="nofollow" attribute to <a> links not validated by a page's owner
The Web as a graph Discovering communities
Discovery of communities
Identifying communities of Web pages using the graph structure: can be used for clustering the Web, or for finding boundaries of logical Web sites
Two subproblems:
1 Given some initial vertex or vertex set, finding the corresponding community
2 Given the graph as a whole, finding a partition in communities
Maximum Flow / Minimum Cut
[Figure: example flow network between a source and a sink, with edge capacities 6, 2, 1, 5, 2, 3, 4, and a maximum flow on it]
Use of a maximum flow computation algorithm [Goldberg and Tarjan, 1988] to separate a seed of users from the remainder of the graph
Complexity $O(n^2 m)$ (n: vertices, m: edges)
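To illustrate how a maximum flow yields a cut around a seed, here is a compact sketch. Note that it uses the simpler Edmonds-Karp method (shortest augmenting paths found by BFS), not the push-relabel algorithm of Goldberg and Tarjan cited above; the network, its capacities, and the function name are hypothetical.

```python
from collections import deque

def max_flow_min_cut(capacity, source, sink):
    """Return (value of a maximum flow, source side of a minimum cut)."""
    # Residual capacities, with reverse edges initialised to 0.
    residual = {}
    for u, targets in capacity.items():
        for v, c in targets.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)
    flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            # No augmenting path: vertices still reachable form the source side of the cut.
            return flow, set(parent)
        # Find the bottleneck along the path, then push that much flow.
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

# Hypothetical network: the seed {s, a} is tied to the rest by two weak edges.
capacity = {"s": {"a": 10}, "a": {"b": 1, "c": 1}, "b": {"t": 10}, "c": {"t": 10}}
value, community = max_flow_min_cut(capacity, "s", "t")
```

The maximum flow has value 2, and the minimum cut separates the community {s, a} from the remainder of the graph by cutting the two edges of capacity 1.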
Markov Cluster Algorithm (MCL) [van Dongen, 2000]
Graph clustering algorithm
Also based on flow simulation, here over the whole graph
Iteration of a matrix computation alternating:
- Expansion (matrix multiplication, corresponding to flow propagation)
- Inflation (non-linear operation to increase heterogeneity)
Complexity: $O(n^3)$ for an exact computation, $O(n)$ for an approximate one
[Figure: illustration from van Dongen, 2000]
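The alternation of expansion and inflation can be sketched in a few lines of pure Python. Several choices below are assumptions made for the example rather than details taken from the slide: the matrix is column-stochastic, self-loops are added before starting, inflation raises entries to the power r = 2, a fixed number of iterations is used, and clusters are read off the rows that keep non-negligible entries. The six-node graph (two triangles joined by one edge) is hypothetical.

```python
def normalize_columns(m):
    """Rescale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]

def expand(m):
    """Expansion: matrix squaring, i.e. flow propagation along paths of length 2."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(m, r):
    """Inflation: entrywise power then renormalization, strengthening strong flows."""
    return normalize_columns([[x ** r for x in row] for row in m])

def mcl(adjacency, r=2.0, iterations=30):
    n = len(adjacency)
    # Self-loops are added first (a common convention, assumed here).
    m = normalize_columns([[adjacency[i][j] + (1 if i == j else 0) for j in range(n)]
                           for i in range(n)])
    for _ in range(iterations):
        m = inflate(expand(m), r)
    # Each row that still carries flow describes one cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return clusters

# Hypothetical undirected graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
clusters = mcl(A)
```

Both operations keep the matrix column-stochastic, so every node ends up in at least one cluster. The dense matrix product is what makes the exact computation cubic.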
Deletion of the edges with the highest betweenness [Newman and Girvan, 2004]
Top-down graph clustering algorithm
Betweenness of an edge: number of shortest paths between pairs of vertices that go through this edge
General principle:
1 Compute the betweenness of each edge in the graph
2 Remove the edge with the highest betweenness
3 Redo the whole process, betweenness computation included
Complexity: $O(n^3)$ for a sparse graph
[Figure: illustration from Newman and Girvan, 2004]
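One round of this procedure can be sketched as follows. The betweenness computation uses a breadth-first search from every vertex with a backward accumulation pass (in the style of Brandes' algorithm), which is an implementation choice for this example; when several shortest paths join two vertices, each counts for a fraction. The graph, stored as a dictionary of neighbour sets, is hypothetical.

```python
from collections import deque

def edge_betweenness(adj):
    """Edge betweenness of an unweighted undirected graph."""
    score = {}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Backward pass: push path counts from the farthest vertices towards s.
        delta = {v: 0.0 for v in adj}
        for v in reversed(order):
            for u in preds[v]:
                c = sigma[u] / sigma[v] * (1 + delta[v])
                edge = frozenset((u, v))
                score[edge] = score.get(edge, 0.0) + c
                delta[u] += c
    # Every pair of vertices was counted once from each endpoint.
    return {e: c / 2 for e, c in score.items()}

def remove_most_central_edge(adj):
    """Steps 1 and 2 of the principle above; call repeatedly for step 3."""
    scores = edge_betweenness(adj)
    u, v = max(scores, key=scores.get)
    adj[u].discard(v)
    adj[v].discard(u)
    return {u, v}

# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
scores = edge_betweenness(graph)
removed = remove_most_central_edge(graph)
```

All 9 shortest paths between the two triangles go through the edge 2-3, so it has the highest betweenness and is removed first, which splits the graph into its two natural communities.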
Social Networking Sites
Outline
1 Basics of Graph Theory
2 The Web as a graph
3 Social Networking Sites
Typology
Social Network Graphs
Algorithms on social networks
4 Conclusion
Graph Mining on Other Graphs
Graph mining tools used for the Web are also applicable to a variety of different graphs:
- dictionary
- encyclopedia
- citation graph
- . . .
Among these: social networking Web sites
Caution: on undirected graphs, PageRank is proportional to degree, so it is not very useful there.
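The claim about undirected graphs can be verified directly. The derivation below is a sketch under two assumptions that are not stated on the slide: the graph (S, A) is connected, and PageRank is computed without damping. The notation $\deg(i)$ for the degree of vertex $i$ is introduced here to avoid a clash with the jump probability $d$.

```latex
% Assumptions: connected undirected graph (S, A), no damping,
% so that g_{ij} = 1/\deg(i) whenever \{i, j\} \in A and 0 otherwise.
\[
  \pi_i = \frac{\deg(i)}{2|A|}
  \quad\Longrightarrow\quad
  (G^T \pi)_j
    = \sum_{i \,:\, \{i,j\} \in A} \frac{1}{\deg(i)} \cdot \frac{\deg(i)}{2|A|}
    = \frac{\deg(j)}{2|A|}
    = \pi_j .
\]
% \pi sums to 1 because the degrees sum to 2|A|, so \pi is the limit
% of the PageRank iteration whenever that limit is unique.
```

With damping the proportionality only holds approximately, but the ranking still carries little information beyond the degree.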
Social Networking Sites Typology
Most popular social networks
Most popular social networking sites in the world and in France (rank of each Web site by traffic, according to Alexa)
Site                World   France
SkyRock             51      3
YouTube             3       4
MySpace             17      7
Facebook            5       8
Dailymotion         61      11
EBay                18      12
Wikipedia           8       13
Meetic              565     27
ImageShack          47      53
hi5                 15      59
Megavideo           133     80
Adult Friendfinder  55      82
Wat.tv              1568    88
Flickr              33      94
Orkut               19      >100
V Kontakte          28      >100
Friendster          39      >100
Typology of social networking sites
Social networking site
  Content-oriented
    Catalog
      Books: LibraryThing, Shelfari (Amazon)
      Music: Last.fm (CBS)
      Links: Delicious (Yahoo!)
      Movies: Flixster, Yahoo! Movies
      Publications: CiteULike
      Games: MobyGames
    Sharing
      Images: Flickr (Yahoo!), Photobucket (Fox)
      Videos: YouTube (Google), Dailymotion, Megavideo, Wat (TF1)
    Editing: everything2, Wikipedia
    Sales: EBay
    Discussion: Yahoo! Answers
  User-oriented
    Pure (personal, professional, or mixed): MySpace (Fox), hi5, Friendster, LinkedIn, Facebook, Orkut (Google), V Kontakte
    Blog communities: SkyRock, Twitter
    Dating: Meetic; adult dating: FriendFinder
Social Networking Sites Social Network Graphs
Graphs of Social Networks
Natural model: social network = graph
Entities = nodes, Relations = edges
Depending on the specific case:
- mostly undirected graphs
- bipartite, n-partite
- annotated edges, weighted edges
Undirected graph
Suited to pure social networking sites with symmetrical relations (e.g., LinkedIn)
Multipartite graph
Suited to most social networks for sharing content, with annotations, users, content, etc. (e.g., Flickr)
[Figure: example multipartite graph; node labels include mason.flickr, manufrakass, france, chateau]
Six degrees of separation
Idea that any two persons on Earth are separated by a chain of six persons such that any two successive links know each other
Demonstrated by an experiment by Stanley Milgram [Travers and Milgram, 1969] (package to transmit to an unknown person, using acquaintances)
Popularized in numerous media
The number 6 is not to be taken too seriously! But the main idea has been validated in more recent experiments
In other fields:
- Erdős number for scientific publications
- Kevin Bacon number for Hollywood movies
Common characteristic of (most) social networks!
Characteristics of Social Networking Sites
Four important characteristics [Newman et al., 2006]:
Sparse graphs: far fewer edges than a complete graph
Small typical distance: the shortest path between two nodes is logarithmic with respect to the size of the graph
High clustering: if a is connected to b and b to c, then a has a good probability of being linked to c
Power-law degree distribution: the number of nodes with degree k is of the order of $k^{-\gamma}$ ($\gamma$ constant)
[Figure: plot of P against k]
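Two of these characteristics are easy to measure on a given graph. The sketch below assumes an undirected graph stored as a dictionary of neighbour sets and computes the local clustering of a node (the fraction of pairs of its neighbours that are themselves linked) and the empirical degree distribution; the four-node graph and the function names are illustrative.

```python
from collections import Counter
from itertools import combinations

def local_clustering(adj, v):
    """Fraction of pairs of neighbours of v that are linked to each other."""
    neighbours = adj[v]
    if len(neighbours) < 2:
        return 0.0
    pairs = list(combinations(neighbours, 2))
    linked = sum(1 for a, b in pairs if b in adj[a])
    return linked / len(pairs)

def degree_distribution(adj):
    """Proportion P(k) of nodes having each degree k."""
    counts = Counter(len(ns) for ns in adj.values())
    return {k: c / len(adj) for k, c in sorted(counts.items())}

# Hypothetical undirected graph: a triangle {1, 2, 3} plus a pendant node 4.
graph = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
clustering_of_1 = local_clustering(graph, 1)
distribution = degree_distribution(graph)
```

On a real social network one would compare the average local clustering to the density of the graph, and plot the distribution on a log-log scale to look for a power law.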
More on this in Athens course on Collective Intelligence (TPT09)
Social Networking Sites Algorithms on social networks
Information retrieval with social score [Schenkel et al., 2008]
Context: Multipartite graph, e.g., Flickr
Purpose: bias results with one's social network
Social weighting:
- Given a friendship relation F(u, u′) (explicit or implicit) between two users, an extended friendship relation can be computed:
$$\tilde{F}(u, u') = \frac{\alpha}{|U|} + (1-\alpha) \max_{\text{path } u = u_0, \dots, u_k = u'} \prod_{i=0}^{k-1} F(u_i, u_{i+1})$$
($0 < \alpha < 1$ constant; $|U|$: number of users)
- Instead of a global weighting
$$\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, d)$$
we choose a social weighting dependent on u:
$$\text{tf-idf}_u(t, d) = \left( \sum_{u' \in U} F(u, u') \cdot \mathrm{tf}_{u'}(t, d) \right) \times \mathrm{idf}(t, d)$$
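The extended friendship relation amounts to a best-path computation. The sketch below is one way to evaluate it, assuming friendship strengths lie in (0, 1] so that a Dijkstra-style search on products is valid, and taking the maximum to be 0 when no path exists; the users, strengths, value of α, and function name are all hypothetical.

```python
import heapq

def extended_friendship(F, users, u, alpha):
    """Extended friendship from u to every other user.

    F maps each user to a dict {friend: strength in (0, 1]}.
    """
    best = {u: 1.0}               # best product of strengths found so far
    heap = [(-1.0, u)]            # max-heap simulated with negated products
    done = set()
    while heap:
        neg, x = heapq.heappop(heap)
        if x in done:
            continue
        done.add(x)
        for y, strength in F.get(x, {}).items():
            candidate = -neg * strength
            if candidate > best.get(y, 0.0):
                best[y] = candidate
                heapq.heappush(heap, (-candidate, y))
    # alpha / |U| is a uniform share; unreachable users get only that share.
    return {v: alpha / len(users) + (1 - alpha) * best.get(v, 0.0)
            for v in users if v != u}

users = ["ann", "bob", "cat", "dan"]
F = {"ann": {"bob": 0.9, "cat": 0.2}, "bob": {"cat": 0.8}, "cat": {}}
scores = extended_friendship(F, users, "ann", alpha=0.2)
```

Here the indirect path ann → bob → cat (product 0.72) beats the direct but weak link ann → cat (0.2), and dan, who is unreachable, only receives the uniform share α/|U|.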
Top-k with social score [Benedikt et al., 2008]
Possible to adapt Fagin's threshold algorithm. . .
. . . but impossible to precompute the scores tf-idf_u(t, d) for each user
To avoid too much computation time:
1 Cluster the graph of users into strongly similar components
2 Use the scores inside these components as estimates of the threshold in Fagin's algorithm
3 ⇒ gives approximate results, but of good quality
Conclusion
Outline
1 Basics of Graph Theory
2 The Web as a graph
3 Social Networking Sites
4 Conclusion
Conclusion
What you should remember
The Web seen as a graph
PageRank, and its iterative computation
Graph mining techniques applicable to a wide range of graphs
Social networks share important common characteristics with the Web graph
To go further
A good textbook [Chakrabarti, 2003]
Two accessible influential research papers:
- On HITS [Kleinberg, 1999]
- On PageRank [Brin and Page, 1998]
Bibliography
Serge Abiteboul, Mihai Preda, and Gregory Cobena. Adaptive on-line page importance computation. In Proc. WWW, May 2003.
Michael Benedikt, Sihem Amer-Yahia, Laks Lakshmanan, and Julia Stoyanovich. Efficient network-aware search in collaborative tagging sites. In Proc. VLDB, Auckland, New Zealand, August 2008.
Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks, 30(1–7):107–117, April 1998.
Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. Graph structure in the Web. Computer Networks, 33(1–6):309–320, 2000.
Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, San Francisco, USA, 2003.
Andrew V. Goldberg and Robert E. Tarjan. A new approach to the maximum-flow problem. Journal of the ACM, 35(4):921–940, October 1988.
Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2), 2004.
Mark Newman, Albert-László Barabási, and Duncan J. Watts. The Structure and Dynamics of Networks. Princeton University Press, 2006.
Yann Ollivier and Pierre Senellart. Finding related pages using Green measures: An illustration with Wikipedia. In Proc. AAAI, Vancouver, Canada, July 2007.
Ralf Schenkel, Tom Crecelius, Mouna Kacimi, Sebastian Michel, Thomas Neumann, Josiane X. Parreira, and Gerhard Weikum. Efficient top-k querying over social-tagging networks. In Proc. SIGIR, Singapore, July 2008.
Jeffrey Travers and Stanley Milgram. An experimental study of the small world problem. Sociometry, 32(4), December 1969.
Stijn Marinus van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.