7 July 2011
Web search
Graph Mining
7 July 2011
Outline
Basics of Graph Theory
The Web as a graph Social Networking Sites Conclusion
7 July 2011
Graphs
1 2 3
4 5 6
Definition
Adirected graphis a pair(S,A)where:
Sis a finite set ofvertices(ornodes) Ais a subset ofS2defining theedges(or arcs)
1 2 3
4 5 6
Definition
Anundirected graphis a pair(S,A)where:
Sis a finite set ofvertices(ornodes)
Ais a set of (unordered) pairs of elements of Sdefining theedges(orarcs)
Remark
Graphis the mathematical term,networkis used to describe real-world graphs.
7 July 2011
Paths and Connectedness
Definition
Apathis a sequence of verticesv1. . .vnsuch thatvk is connected by an edge tovk+1for 1≤k ≤n−1.
Definition
Theunderlying undirected graphof a directed graphGis the graph obtained by adding all reverse edges.
Definition
An undirected graph isconnectedif for every two verticesu andv, there exists a path starting fromu and ending inv.
A directed graph isstrongly connectedif it is connected, and isweakly connectedif the underlying undirected graph is connected.
7 July 2011
Connected Components
Definition
(S′,A′)is asubgraphof(S,A)ifS′ ⊆SandA′ is the restriction ofAto edges whose vertices are inS′.
Connected component: maximal connected subgraph
Strongly connected component: maximal strongly connected subgraph
Weakly connected component: maximal weakly connected subgraph
1 2
4 5
3 6
7 July 2011
Connected Components
Definition
(S′,A′)is asubgraphof(S,A)ifS′ ⊆SandA′ is the restriction ofAto edges whose vertices are inS′.
Connected component: maximal connected subgraph
Strongly connected component: maximal strongly connected subgraph
Weakly connected component: maximal weakly connected subgraph
1 2
4 5
3 6
Strongly connected components
7 July 2011
Connected Components
Definition
(S′,A′)is asubgraphof(S,A)ifS′ ⊆SandA′ is the restriction ofAto edges whose vertices are inS′.
Connected component: maximal connected subgraph
Strongly connected component: maximal strongly connected subgraph
Weakly connected component: maximal weakly connected subgraph
1 2
4 5
3 6
Weakly connected components
7 July 2011
Vocabulary
Incident: an edge is said to beincidentto a vertex if it it hasthis vertex for endpoint
Degree (of a vertex): number of edgesincident toa vertex, in an undirected graph
Indegree (of a vertex): number of edgesarriving toa vertex, in a directed graph
Outdegree (of a vertex): number of edgesleaving froma vertex, in a directed graph
Cycle: Path whose start and end vertex is thesame Distance: Length of theshortest pathbetween two vertices
Sparse: a graph(S,A)is sparse if|A| ≪ |S|2
7 July 2011
Bipartite graphs
Definition
Abipartitegraph is an undirected graph(S,A)such thatS =S1∪S2 (withS1∩S2=∅), and no edge ofAis incident to two vertices inS1or two vertices inS2.
Paths of length 2 in a bipartite graph define two regular undirected graphs.
1 2 3 4
5 6 7
1 2
3 4
5 6
7
7 July 2011
Bipartite graphs
Definition
Abipartitegraph is an undirected graph(S,A)such thatS =S1∪S2 (withS1∩S2=∅), and no edge ofAis incident to two vertices inS1or two vertices inS2.
Paths of length 2 in a bipartite graph define two regular undirected graphs.
1 2 3 4
5 6 7
1 2
3 4
5 6
7
7 July 2011
Bipartite graphs
Definition
Abipartitegraph is an undirected graph(S,A)such thatS =S1∪S2 (withS1∩S2=∅), and no edge ofAis incident to two vertices inS1or two vertices inS2.
Paths of length 2 in a bipartite graph define two regular undirected graphs.
1 2 3 4
5 6 7
1 2
3 4
5 6
7
7 July 2011
Matrix Representation of a Graph
A graphGcan be represented by itsadjacency matrixM:
1 2 3
4 5 6
⎛
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎝
0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0
⎞
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎠
Adjacency matrices ofundirectedgraphs aresymmetric.
MT is the adjacency matrix obtained fromGbyreversingthe arrows.
Mnis the matrix of the graph of allpaths of lengthninG.
7 July 2011
Concrete Representation of Graphs
In programming, graphs are usually represented as:
itsadjacency matrix(stored as a multidimensional array), for non-sparsegraphs
the list of alledgesincident to each node, forsparsegraphs
1 2 3
4 5 6
1 2 2 2,4,5 3 6 4 1,5 5 4 6
7 July 2011
Outline
Basics of Graph Theory The Web as a graph
Introduction
Computing the importance of a page Spamdexing
Discovering communities Social Networking Sites Conclusion
7 July 2011
Outline
Basics of Graph Theory The Web as a graph
Introduction
Computing the importance of a page Spamdexing
Discovering communities Social Networking Sites Conclusion
7 July 2011
The Web Graph
The World Wide Web seenas a (directed) graph:
Vertices: Web pages Edges: hyperlinks
Same for otherinterlinkedenvironments:
dictionaries encyclopedias scientific publications social networks
7 July 2011
The shape of the Web graph
[Broder et al., 2000]
7 July 2011
Characteristics of Interest of a Graph
Sparsity. Is the graph sparse (|A| ≪ |S|2)?
Yes, the World Wide Web is sparse
Typical distance. What is themean distancebetween any pairs of vertices?Logarithmic typical distance
Local clustering. Ifais connected to bothbandc, is the probability thatbis connected toc significantly greater than the probability any two nodes are connected?Strong local clustering
Degree distribution. What is the distribution of the degree of vertices? Power-lawindegree and outdegree distribution
(2≤𝛾 ≤3) [Broder et al., 2000]
𝑘 𝑃
7 July 2011
Characteristics of Interest of a Graph
Sparsity. Is the graph sparse (|A| ≪ |S|2)?
Yes, the World Wide Web is sparse
Typical distance. What is themean distancebetween any pairs of vertices?Logarithmic typical distance
Local clustering. Ifais connected to bothbandc, is the probability thatbis connected toc significantly greater than the probability any two nodes are connected?Strong local clustering
Degree distribution. What is the distribution of the degree of vertices? Power-lawindegree and outdegree distribution
(2≤𝛾 ≤3) [Broder et al., 2000]
𝑘 𝑃
7 July 2011
Characteristics of Interest of a Graph
Sparsity. Is the graph sparse (|A| ≪ |S|2)?
Yes, the World Wide Web is sparse
Typical distance. What is themean distancebetween any pairs of vertices?
Logarithmic typical distance
Local clustering. Ifais connected to bothbandc, is the probability thatbis connected toc significantly greater than the probability any two nodes are connected?Strong local clustering
Degree distribution. What is the distribution of the degree of vertices? Power-lawindegree and outdegree distribution
(2≤𝛾 ≤3) [Broder et al., 2000]
𝑘 𝑃
7 July 2011
Characteristics of Interest of a Graph
Sparsity. Is the graph sparse (|A| ≪ |S|2)?
Yes, the World Wide Web is sparse
Typical distance. What is themean distancebetween any pairs of vertices? Logarithmic typical distance
Local clustering. Ifais connected to bothbandc, is the probability thatbis connected toc significantly greater than the probability any two nodes are connected?Strong local clustering
Degree distribution. What is the distribution of the degree of vertices? Power-lawindegree and outdegree distribution
(2≤𝛾 ≤3) [Broder et al., 2000]
𝑘 𝑃
7 July 2011
Characteristics of Interest of a Graph
Sparsity. Is the graph sparse (|A| ≪ |S|2)?
Yes, the World Wide Web is sparse
Typical distance. What is themean distancebetween any pairs of vertices? Logarithmic typical distance
Local clustering. Ifais connected to bothbandc, is the probability thatbis connected toc significantly greater than the probability any two nodes are connected?
Strong local clustering
Degree distribution. What is the distribution of the degree of vertices? Power-lawindegree and outdegree distribution
(2≤𝛾 ≤3) [Broder et al., 2000]
𝑘 𝑃
7 July 2011
Characteristics of Interest of a Graph
Sparsity. Is the graph sparse (|A| ≪ |S|2)?
Yes, the World Wide Web is sparse
Typical distance. What is themean distancebetween any pairs of vertices? Logarithmic typical distance
Local clustering. Ifais connected to bothbandc, is the probability thatbis connected toc significantly greater than the probability any two nodes are connected? Strong local clustering
Degree distribution. What is the distribution of the degree of vertices? Power-lawindegree and outdegree distribution
(2≤𝛾 ≤3) [Broder et al., 2000]
𝑘 𝑃
7 July 2011
Characteristics of Interest of a Graph
Sparsity. Is the graph sparse (|A| ≪ |S|2)?
Yes, the World Wide Web is sparse
Typical distance. What is themean distancebetween any pairs of vertices? Logarithmic typical distance
Local clustering. Ifais connected to bothbandc, is the probability thatbis connected toc significantly greater than the probability any two nodes are connected? Strong local clustering
Degree distribution. What is the distribution of the degree of vertices?
Power-lawindegree and outdegree distribution (2≤𝛾 ≤3) [Broder et al., 2000]
𝑘 𝑃
7 July 2011
Characteristics of Interest of a Graph
Sparsity. Is the graph sparse (|A| ≪ |S|2)?
Yes, the World Wide Web is sparse
Typical distance. What is themean distancebetween any pairs of vertices? Logarithmic typical distance
Local clustering. Ifais connected to bothbandc, is the probability thatbis connected toc significantly greater than the probability any two nodes are connected? Strong local clustering
Degree distribution. What is the distribution of the degree of vertices?
Power-lawindegree and outdegree distribution (2≤𝛾 ≤3) [Broder et al., 2000]
𝑘 𝑃
7 July 2011
Outline
Basics of Graph Theory The Web as a graph
Introduction
Computing the importance of a page Spamdexing
Discovering communities Social Networking Sites Conclusion
7 July 2011
PageRank (Google’s Ranking [Brin and Page, 1998])
Idea
Importantpages are pages pointed to byimportantpages.
{︃gij =0 if there is no link between pageiandj;
gij = n1
i otherwise, withni the number of outgoing links of pagei.
Definition (Tentative)
Probabilitythat the surfer following therandom walkinGhas arrived on pagei at some distant given point in the future.
pr(i) = (︂
k→+∞lim (GT)kv )︂
i
wherev is some initial column vector.
7 July 2011
Illustrating PageRank Computation
0.100 0.100
0.100
0.100
0.100 0.100
0.100
0.100
0.100
0.100
7 July 2011
Illustrating PageRank Computation
0.033 0.317
0.075
0.108
0.025 0.058
0.083
0.150
0.117
0.033
7 July 2011
Illustrating PageRank Computation
0.036 0.193
0.108
0.163
0.079 0.090
0.074
0.154
0.094
0.008
7 July 2011
Illustrating PageRank Computation
0.054 0.212
0.093
0.152
0.048 0.051
0.108
0.149
0.106
0.026
7 July 2011
Illustrating PageRank Computation
0.051 0.247
0.078
0.143
0.053 0.062
0.097
0.153
0.099
0.016
7 July 2011
Illustrating PageRank Computation
0.048 0.232
0.093
0.156
0.062 0.067
0.087
0.138
0.099
0.018
7 July 2011
Illustrating PageRank Computation
0.052 0.226
0.092
0.148
0.058 0.064
0.098
0.146
0.096
0.021
7 July 2011
Illustrating PageRank Computation
0.049 0.238
0.088
0.149
0.057 0.063
0.095
0.141
0.099
0.019
7 July 2011
Illustrating PageRank Computation
0.050 0.232
0.091
0.149
0.060 0.066
0.094
0.143
0.096
0.019
7 July 2011
Illustrating PageRank Computation
0.050 0.233
0.091
0.150
0.058 0.064
0.095
0.142
0.098
0.020
7 July 2011
Illustrating PageRank Computation
0.050 0.234
0.090
0.148
0.058 0.065
0.095
0.143
0.097
0.019
7 July 2011
Illustrating PageRank Computation
0.049 0.233
0.091
0.149
0.058 0.065
0.095
0.142
0.098
0.019
7 July 2011
Illustrating PageRank Computation
0.050 0.233
0.091
0.149
0.058 0.065
0.095
0.143
0.097
0.019
7 July 2011
Illustrating PageRank Computation
0.050 0.234
0.091
0.149
0.058 0.065
0.095
0.142
0.097
0.019
7 July 2011
PageRank with Damping
May not always converge, or convergence may not be unique.
To fix this, the random surfer can at each steprandomly jumpto any page of the Web with some probabilityd (1−d:damping factor).
pr(i) = (︂
k→+∞lim ((1−d)GT +dU)kv )︂
i
whereU is the matrix with all N1 values withN the number of vertices.
7 July 2011
Using PageRank to Score Query Results
PageRank: globalscore, independent of the query Can be used to raise the weight ofimportantpages:
weight(t,d) =tfidf(t,d)×pr(d), This can be directly incorporatedin the index.
7 July 2011
Iterative Computation of PageRank
1. ComputeG(often stored as its adjacency list). Make sure lines sum to 1.
2. Letu be the uniform vector of sum 1,v =u,w thezerovector.
3. Whilev isdifferent enoughfromw:
Setw =v.
Setv = (1−d)GTv+du.
Exercise
1 2
3 4
Run the first iteration of the PageRank computation.
7 July 2011
Another take on PageRank: OPIC [Abite- boul et al., 2003]
Online Page Importance Computation
Allows computing PageRankwithout storingthe Web graph Remember for each page itscashand itshistory
Initially, each page has some initial cash
Each time a page is accessed, distribute uniformly its cash to outgoing links
PageRank is approximated as itscashflow: history divided by number of visits
Convergewhatever the strategy for crawling the Web is (as soon as each page is accessed repeatedly)
Good strategy: crawl page with largest cash first
7 July 2011
HITS (Kleinberg, [Kleinberg, 1999])
Idea
Two kinds of important pages: hubs and authorities. Hubs are pages that point to good authorities, whereas authorities are pages that are pointed to by good hubs.
G′ transition matrix (with 0 and 1 values) of a subgraph of the Web. We use the following iterative process (starting withaandhvectors of norm 1):
⎧
⎨
⎩
a:= ‖G′T1h‖ G′Th h:= ‖G1′a‖ G′a
Convergesunder some technical assumptions toauthorityandhub scores.
7 July 2011
Using HITS to Order Web Query Results
1. Retrieve the setD of Web pagesmatchinga keyword query.
2. Retrieve the setD* of Web pages obtained fromDby addingall linked pages, as well as allpages linking topages ofD.
3. Build fromD*the correspondingsubgraphG′ of the Web graph.
4. Computeiterativelyhubs and authority scores.
5. Sort documents fromDbyauthority scores.
Less efficient than PageRank, becauselocalscores.
7 July 2011
Biasing PageRank: Green Measures
Equivalent definitions of Green measure centered at nodei PageRank with sourceati: standard PageRank computation while, at each iteration, adding 1 to the measure ofi, and subtractingpr(j)to every nodej.
Time spent at a nodeknowing the initial node isi.
7 July 2011
Illustrating Green Measures [Ollivier and Senellart, 2007]
-0.050 -0.233
-0.091
-0.149
-0.058 0.935
-0.095
-0.143
-0.097
-0.019
7 July 2011
Illustrating Green Measures [Ollivier and Senellart, 2007]
-0.099 0.034
0.319
-0.298
-0.117 0.870
-0.190
-0.285
-0.194
-0.039
7 July 2011
Illustrating Green Measures [Ollivier and Senellart, 2007]
-0.149 -0.199
0.353
-0.322
-0.050 0.930
-0.035
-0.178
-0.291
-0.058
7 July 2011
Illustrating Green Measures [Ollivier and Senellart, 2007]
-0.157 -0.078
0.325
-0.304
-0.109 0.865
-0.026
-0.258
-0.222
-0.036
7 July 2011
Illustrating Green Measures [Ollivier and Senellart, 2007]
-0.151 -0.096
0.322
-0.333
-0.078 0.903
-0.034
-0.203
-0.273
-0.055
7 July 2011
Illustrating Green Measures [Ollivier and Senellart, 2007]
-0.161 -0.096
0.337
-0.300
-0.083 0.892
-0.045
-0.256
-0.242
-0.045
7 July 2011
Illustrating Green Measures [Ollivier and Senellart, 2007]
-0.150 -0.108
0.331
-0.328
-0.083 0.895
-0.027
-0.218
-0.267
-0.047
7 July 2011
Illustrating Green Measures [Ollivier and Senellart, 2007]
-0.159 -0.086
0.330
-0.312
-0.086 0.892
-0.039
-0.246
-0.248
-0.047
7 July 2011
Illustrating Green Measures [Ollivier and Senellart, 2007]
-0.154 -0.104
0.334
-0.321
-0.081 0.897
-0.034
-0.228
-0.262
-0.048
7 July 2011
Illustrating Green Measures [Ollivier and Senellart, 2007]
-0.157 -0.094
0.332
-0.314
-0.085 0.892
-0.035
-0.241
-0.252
-0.046
7 July 2011
Illustrating Green Measures [Ollivier and Senellart, 2007]
-0.155 -0.098
0.332
-0.320
-0.083 0.895
-0.034
-0.231
-0.259
-0.047
7 July 2011
Illustrating Green Measures [Ollivier and Senellart, 2007]
-0.157 -0.095
0.332
-0.315
-0.084 0.893
-0.036
-0.239
-0.253
-0.046
7 July 2011
Illustrating Green Measures [Ollivier and Senellart, 2007]
-0.155 -0.098
0.332
-0.318
-0.084 0.894
-0.034
-0.234
-0.257
-0.047
7 July 2011
Illustrating Green Measures [Ollivier and Senellart, 2007]
-0.156 -0.095
0.332
-0.316
-0.084 0.893
-0.035
-0.238
-0.254
-0.046
7 July 2011
Illustrating Green Measures [Ollivier and Senellart, 2007]
-0.156 -0.097
0.332
-0.318
-0.084 0.894
-0.034
-0.235
-0.256
-0.047
7 July 2011
Illustrating Green Measures [Ollivier and Senellart, 2007]
-0.156 -0.096
0.332
-0.316
-0.084 0.893
-0.035
-0.237
-0.254
-0.046
7 July 2011
Illustrating Green Measures [Ollivier and Senellart, 2007]
-0.156 -0.097
0.332
-0.317
-0.084 0.894
-0.034
-0.236
-0.255
-0.046
7 July 2011
Illustrating Green Measures [Ollivier and Senellart, 2007]
-0.156 -0.096
0.332
-0.316
-0.084 0.893
-0.034
-0.237
-0.254
-0.046
7 July 2011
Illustrating Green Measures [Ollivier and Senellart, 2007]
-0.156 -0.096
0.332
-0.317
-0.084 0.893
-0.034
-0.236
-0.255
-0.046
7 July 2011
Illustrating Green Measures [Ollivier and Senellart, 2007]
-0.156 -0.096
0.332
-0.316
-0.085 0.893
-0.034
-0.237
-0.254
-0.046
7 July 2011
Outline
Basics of Graph Theory The Web as a graph
Introduction
Computing the importance of a page Spamdexing
Discovering communities Social Networking Sites Conclusion
7 July 2011
Spamdexing
Definition
Fraudulent techniques that are used by unscrupulous webmasters to artificially raise the visibility of their website to users of search engines Purpose: attracting visitors to websites to make profit.
Unceasing war betweenspamdexersandsearch engines
7 July 2011
Spamdexing: Lying about the Content
Technique
Putunrelatedterms in:
meta-information (<meta name="description">,
<meta name="keywords">)
text content hidden to the user with JavaScript, CSS, or HTML presentational elements
Countertechnique
Ignoremeta-information Try anddetectinvisible text
7 July 2011
Link Farm Attacks
Technique
Huge number of hosts on the Internet used for the sole purpose ofref- erencing each other, without any content in themselves, to raise the importanceof a given website or set of websites.
Countertechnique
Detection of websites withemptyorduplicatecontent
Use of heuristics to discoversubgraphsthat look like link farms
7 July 2011
Link Pollution
Technique
Pollute user-editablewebsites (blogs, wikis) or exploit security bugs to addartificiallinks to websites, in order to raise its importance.
Countertechnique
rel="nofollow"attribute to<a>links not validated by a page’s owner
7 July 2011
Outline
Basics of Graph Theory The Web as a graph
Introduction
Computing the importance of a page Spamdexing
Discovering communities Social Networking Sites Conclusion
7 July 2011
Discovery of communities
Identifyingcommunitiesof Web pages using thegraph structure:
can be used for clustering the Web, or for finding boundaries of logical Web sites
Two subproblems:
1. Given some initial vertex or vertex set, finding the corresponding community
2. Given the graph as a whole, finding a partition in communities
7 July 2011
Maximum Flow / Minimum Cut
/6 /2
/1 /5 /2
/3
sink source
/4
Use of a maximum flow computation algorithm [Goldberg and Tarjan, 1988] to separate aseedof users from the remaining of the graph
ComplexityO(n2m)(n: vertices,m: edges)
7 July 2011
Maximum Flow / Minimum Cut
/6 /2
/1 /5 /2
/3 source
4 0
3 2
1 /4 4
1 sink
Use of a maximum flow computation algorithm [Goldberg and Tarjan, 1988] to separate aseedof users from the remaining of the graph
ComplexityO(n2m)(n: vertices,m: edges)
7 July 2011
Maximum Flow / Minimum Cut
/6 /2
/1 /5 /2
/3
sink source
4 0
3 2
1 /4 4
1
Use of a maximum flow computation algorithm [Goldberg and Tarjan, 1988] to separate aseedof users from the remaining of the graph
ComplexityO(n2m)(n: vertices,m: edges)
7 July 2011
Markov Cluster Algorithm (MCL) [van Don- gen, 2000]
Graphclusteringalgorithm
Based as well on maximum flow simulation, in the whole graph Iteration of a matrix computation alternating:
Expansion(matrix multiplication, corresponding to flow propagation)
Inflation(non-linear operation to increase heterogeneity) Complexity: O(n3)for an exact computation,O(n)for an approximate one
[van Dongen, 2000]
7 July 2011
Markov Cluster Algorithm (MCL) [van Don- gen, 2000]
Graphclusteringalgorithm
Based as well on maximum flow simulation, in the whole graph Iteration of a matrix computation alternating:
Expansion(matrix multiplication, corresponding to flow propagation)
Inflation(non-linear operation to increase heterogeneity) Complexity: O(n3)for an exact computation,O(n)for an approximate one
[van Dongen, 2000]
7 July 2011
Deletion of the edges with the highest be- twenness [Newman and Girvan, 2004]
Top-downgraph clustering algorithm
Betwennessof an edge: number of minimal paths between two arbitrary vertices going through this edge
General principle:
1. Compute thebetweennessof each edge in the graph 2. Removethe edge with the highest betweenness
3. Redo the whole process, betweenness computation included Complexity: O(n3)for a sparse graph
[Newman and Girvan, 2004]
7 July 2011
Outline
Basics of Graph Theory The Web as a graph Social Networking Sites
Typology
Social Network Graphs Algorithms on social networks Conclusion
7 July 2011
Graph Mining on Other Graphs
Graph miningtools used for the Web: also applicable to a variety of different graphs:
dictionary encyclopedia citation graph . . .
Among these: social networking Web sites
Attention: on undirected graphs, PageRank is proportional to degree! Not very useful.
7 July 2011
Outline
Basics of Graph Theory The Web as a graph Social Networking Sites
Typology
Social Network Graphs Algorithms on social networks Conclusion
7 July 2011
Most popular social networks
Social networking sites the most popular in the world and in France (Web site rank with the most traffic, according to Alexa)
World France
SkyRock 51 3
YouTube 3 4
MySpace 17 7
Facebook 5 8
Dailymotion 61 11
EBay 18 12
Wikipedia 8 13
Meetic 565 27
ImageShack 47 53
hi5 15 59
Megavideo 133 80
Adult Friendfinder 55 82
Wat.tv 1568 88
Flickr 33 94
Orkut 19 >100
V Kontakte 28 >100
7 July 2011
Typology of social networking sites
Content-oriented
Cataloging (Books, Music, Links, Movies, Publications, Games) Sharing (Images, Videos)
Edition Sales Discussion User-oriented
Pure social networks (Personal, Professional, Mixed) Blogs
Dating
7 July 2011
Outline
Basics of Graph Theory The Web as a graph Social Networking Sites
Typology
Social Network Graphs Algorithms on social networks Conclusion
7 July 2011
Graphs of Social Networks
Natural model: social network =graph Entities =nodes, Relations =edges Depending on the specific case:
mostly undirected graphs bipartite,n-partite
annotated edges, weighted edges
7 July 2011
Undirected graph
Adapted forpuresocial networking sites with symmetrical relations (e.g., LinkedIn)
7 July 2011
Multipartite graph
Adapted to most social networks forsharingcontent, with annotations, users, content, etc. (e.g., Flickr)
mason.flickr manufrakass
france chateau
7 July 2011
Six degrees of separation
Idea that any two persons on Earth are separated bya chain of six personssuch that any two successive links know each other Demonstrated by an experiment byStanley Milgram[Travers and Milgram, 1969] (package to transmit to an unknown person, using acquaintances)
Popularized in numerous media
Number6is not too be taken too seriously! But main idea validatedin more recent experiments
In other fields:
Erd ˝os numberfor scientific publications Kevin Baconfor Hollywood movies
Common Characteristicsof (most) social networks !
7 July 2011
Characteristics of Social Networking Sites
Four important characteristics [Newman et al., 2006]:
Sparse graphs: much more edges than a complete graph
Small typical distance: shortest path between two nodes logarithmic with respect to the size of the graph
High clustering: ifais connected tob andb toc, thenbhas good probability of being linked toc
Power-law degree distribution: the number of nodes with degreek of the order ofk−𝛾 (𝛾 constant)
𝑘 𝑃
More on this in Athens course on Collective Intelligence (TPT09)
7 July 2011
Outline
Basics of Graph Theory The Web as a graph Social Networking Sites
Typology
Social Network Graphs Algorithms on social networks Conclusion
7 July 2011
Information retrieval with social score [Schenkel et al., 2008]
Context: Multipartite graph, e.g., Flickr
Purpose: bias results with one’s social network Social weighting:
Given a friendship relationF(u,u′)(explicit or implicit) between two users, anextended friendshiprelation can be computed
F˜(u,u′) = 𝛼
|U| + (1−𝛼) max
cheminu=u0. . .uk=u′ k−1
∏︁
i=0
F(ui,ui+1)
(0< 𝛼 <1 constant;|U|: number of users) Instead of aglobal weighting
tf-idf(t,d) =tf(t,d)×idf(t,d) we choose asocial weightingdependent ofu:
tf-idfu(t,d) = (︃
∑︁F(u,u′)·tfu′(t,d) )︃
×idf(t,d)
7 July 2011
Top-k with social score [Benedikt et al., 2008]
Possible to adapt Fagin’strheshold algorithm. . .
. . . butimpossible to precomputethe scores tf-idfu(t,d)for each user
To avoid too much computation time:
1. Cluster the graph of users intostrongly similarcomponents 2. Use the scoresinside these componentsas estimates of the
threshold in Fagin’s algorithm
3. ⇒givesapproximateresults, but good quality
7 July 2011
Outline
Basics of Graph Theory The Web as a graph Social Networking Sites Conclusion
7 July 2011
Conclusion
What you should remember The Web seen as a graph
PageRank, and its iterative computation
Graph mining techniques applicable to a wide range of graphs Social networks share important common characteristics with the Web graph
To go further
A good textbook [Chakrabarti, 2003]
Two accessible influential research papers:
On HITS [Kleinberg, 1999]
On PageRank [Brin and Page, 1998]
7 July 2011
Bibliography I
Serge Abiteboul, Mihai Preda, and Gregory Cobena. Adaptive on-line page importance computation. InProc. WWW, May 2003.
Michael Benedikt, Sihem Amer Yahia, Laks Lakshmanan, and Julia Stoyanovich. Efficient network-aware search in collaborative tagging sites. InProc. VLDB, Auckland, New Zealand, August 2008.
Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks, 30(1–7):
107–117, April 1998.
Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. Graph structure in the web.Computer Networks, 33(1-6):
309–320, 2000.
Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, San Fransisco, USA, 2003.
7 July 2011
Bibliography II
Andrew V. Goldberg and Robert E. Tarjan. A new approach to the maximum-flow problem. Journal of the ACM, 35(4):921–940, October 1988.
Jon M. Kleinberg. Authoritative Sources in a Hyperlinked Environment.
Journal of the ACM, 46(5):604–632, 1999.
M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks.Physical Review E, 69(2), 2004.
Mark Newman, Albert-László Barabási, and Duncan J. Watts. The Structure and Dynamics of Networks. Princeton University Press, 2006.
Yann Ollivier and Pierre Senellart. Finding related pages using Green measures: An illustration with Wikipedia. InProc. AAAI, Vancouver, Canada, July 2007.
7 July 2011
Bibliography III
Ralf Schenkel, Tom Crecelius, Mouna Kacimi, Sebastian Michel, Thomas Neumann, Josiane X. Parreira, and Gerhard Weikum.
Efficient top-k querying over social-tagging networks. InProc.
SIGIR, Singapore, Singapore, July 2008.
Jeffrey Travers and Stanley Milgram. An experimental study of the small world problem. Sociometry, 34(4), December 1969.
Stijn Marinus van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.
7 July 2011
Licence de droits d’usage
Contexte public} avec modifications
Par le téléchargement ou la consultation de ce document, l’utilisateur accepte la licence d’utilisation qui y est attachée, telle que détaillée dans les dispositions suivantes, et s’engage à la respecter intégralement.
La licence confère à l’utilisateur un droit d’usage sur le document consulté ou téléchargé, totalement ou en partie, dans les conditions définies ci-après et à l’exclusion expresse de toute utilisation commerciale.
Le droit d’usage défini par la licence autorise un usage à destination de tout public qui comprend : – le droit de reproduire tout ou partie du document sur support informatique ou papier,
– le droit de diffuser tout ou partie du document au public sur support papier ou informatique, y compris par la mise à la disposition du public sur un réseau numérique,
– le droit de modifier la forme ou la présentation du document,
– le droit d’intégrer tout ou partie du document dans un document composite et de le diffuser dans ce nouveau document, à condition que : – L’auteur soit informé.
Les mentions relatives à la source du document et/ou à son auteur doivent être conservées dans leur intégralité.
Le droit d’usage défini par la licence est personnel et non exclusif.
Tout autre usage que ceux prévus par la licence est soumis à autorisation préalable et expresse de l’auteur :[email protected]