• Aucun résultat trouvé

Web Search Graph Mining Pierre Senellart (

N/A
N/A
Protected

Academic year: 2022

Partager "Web Search Graph Mining Pierre Senellart ("

Copied!
93
0
0

Texte intégral

(1)

Basics of Graph Theory

Web Search

Graph Mining

Pierre Senellart

(pierre.senellart@telecom-paristech.fr)

18 March 2009

(2)

Basics of Graph Theory

Outline

1 Basics of Graph Theory

2 The Web as a graph

3 Social Networking Sites

4 Conclusion

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 1 / 46

(3)

Basics of Graph Theory

Graphs

1 2 3

4 5 6

Definition

A directed graphis a pair(S,A) where:

S is a finite set ofvertices(or nodes)

Ais a subset of S2 defining theedges (orarcs)

1 2 3

4 5 6

Definition

An undirected graphis a pair (S,A)where:

S is a finite set ofvertices(or nodes)

Ais a set of (unordered) pairs of elements of S defining theedges (orarcs)

Remark

Graph is the mathematical term,network is used to describe real-world graphs.

(4)

Basics of Graph Theory

Paths and Connectedness

Definition

A pathis a sequence of verticesv1. . .vn such thatvk is connected by an edge to vk+1 for 1≤k ≤n−1.

Definition

The underlying undirected graphof a directed graphG is the graph obtained by adding all reverse edges.

Definition

An undirected graph isconnectedif for every two verticesu andv, there exists a path starting from u and ending in v.

A directed graph is strongly connectedif it is connected, and is weakly connectedif the underlying undirected graph is connected.

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 3 / 46

(5)

Basics of Graph Theory

Connected Components

Definition

(S0,A0)is a subgraph of(S,A) ifS0⊆S andA0 is the restriction ofA to edges whose vertices are in S0.

Connected component: maximal connected subgraph

Strongly connected component: maximal strongly connected subgraph Weakly connected component: maximal weakly connected subgraph

1 2

4 5

3 6

(6)

Basics of Graph Theory

Connected Components

Definition

(S0,A0)is a subgraph of(S,A) ifS0⊆S andA0 is the restriction ofA to edges whose vertices are in S0.

Connected component: maximal connected subgraph

Strongly connected component: maximal strongly connected subgraph Weakly connected component: maximal weakly connected subgraph

1 2

4 5

3 6

Strongly connected components

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 4 / 46

(7)

Basics of Graph Theory

Connected Components

Definition

(S0,A0)is a subgraph of(S,A) ifS0⊆S andA0 is the restriction ofA to edges whose vertices are in S0.

Connected component: maximal connected subgraph

Strongly connected component: maximal strongly connected subgraph Weakly connected component: maximal weakly connected subgraph

1 2

4 5

3 6

Weakly connected components

(8)

Basics of Graph Theory

Vocabulary

Incident: an edge is said to beincidentto a vertex if it it hasthis vertex for endpoint

Degree (of a vertex): number of edgesincident to a vertex, in an undirected graph

Indegree (of a vertex): number of edges arriving to a vertex, in a directed graph

Outdegree (of a vertex): number of edgesleaving from a vertex, in a directed graph

Cycle: Path whose start and end vertex is thesame Distance: Length of theshortest pathbetween two vertices

Sparse: a graph(S,A) is sparse if|A| |S|2

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 5 / 46

(9)

Basics of Graph Theory

Bipartite graphs

Definition

A bipartitegraph is an undirected graph (S,A) such thatS =S1∪S2

(with S1∩S2 =∅), and no edge of Ais incident to two vertices inS1 or two vertices inS2.

Paths of length 2 in a bipartite graph define two regular undirected graphs.

1 2 3 4

5 6 7

1 2

3 4

5 6

7

(10)

Basics of Graph Theory

Bipartite graphs

Definition

A bipartitegraph is an undirected graph (S,A) such thatS =S1∪S2

(with S1∩S2 =∅), and no edge of Ais incident to two vertices inS1 or two vertices inS2.

Paths of length 2 in a bipartite graph define two regular undirected graphs.

1 2 3 4

5 6 7

1 2

3 4

5 6

7

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 6 / 46

(11)

Basics of Graph Theory

Bipartite graphs

Definition

A bipartitegraph is an undirected graph (S,A) such thatS =S1∪S2

(with S1∩S2 =∅), and no edge of Ais incident to two vertices inS1 or two vertices inS2.

Paths of length 2 in a bipartite graph define two regular undirected graphs.

1 2 3 4

5 6 7

1 2

3 4

5 6

7

(12)

Basics of Graph Theory

Matrix Representation of a Graph

A graph G can be represented by itsadjacency matrixM:

1 2 3

4 5 6

0 1 0 0 0 0

0 1 0 1 1 0

0 0 0 0 0 1

1 0 0 0 1 0

0 0 0 1 0 0

0 0 0 0 0 0

Adjacency matrices of undirectedgraphs aresymmetric.

MTis the adjacency matrix obtained fromG byreversing the arrows.

Mn is the matrix of the graph of allpaths of length n in G.

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 7 / 46

(13)

Basics of Graph Theory

Concrete Representation of Graphs

In programming, graphs are usually represented as:

its adjacency matrix(stored as a multidimensional array), for non-sparsegraphs

the list of alledges incident to each node, for sparse graphs

1 2 3

4 5 6

1 2 2 2,4,5 3 6 4 1,5 5 4 6

(14)

The Web as a graph

Outline

1 Basics of Graph Theory

2 The Web as a graph Introduction

Computing the importance of a page Spamdexing

Discovering communities

3 Social Networking Sites

4 Conclusion

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 9 / 46

(15)

The Web as a graph Introduction

The Web Graph

The World Wide Web seen as a (directed) graph:

Vertices: Web pages Edges: hyperlinks

Same for other interlinkedenvironments:

dictionaries encyclopedias

scientific publications social networks

(16)

The Web as a graph Introduction

The shape of the Web graph

[Broder et al., 2000]

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 11 / 46

(17)

The Web as a graph Introduction

Characteristics of Interest of a Graph

Sparsity. Is the graph sparse (|A| |S|2)?

Yes, the World Wide Web is sparse

Typical distance. What is the mean distancebetween any pairs of vertices? Logarithmic typical distance

Local clustering. Ifa is connected to bothb andc, is the probability that b is connected toc significantly greater than the probability any two nodes are connected? Strong local clustering Degree distribution. What is the distribution of the degree of vertices?

Power-lawindegree and outdegree distribution (2≤γ 3) [Broder et al., 2000]

k P

(18)

The Web as a graph Introduction

Characteristics of Interest of a Graph

Sparsity. Is the graph sparse (|A| |S|2)?

Yes, the World Wide Web is sparse

Typical distance. What is the mean distancebetween any pairs of vertices? Logarithmic typical distance

Local clustering. Ifa is connected to bothb andc, is the probability that b is connected toc significantly greater than the probability any two nodes are connected? Strong local clustering Degree distribution. What is the distribution of the degree of vertices?

Power-lawindegree and outdegree distribution (2≤γ 3) [Broder et al., 2000]

k P

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 12 / 46

(19)

The Web as a graph Introduction

Characteristics of Interest of a Graph

Sparsity. Is the graph sparse (|A| |S|2)?

Yes, the World Wide Web is sparse

Typical distance. What is the mean distancebetween any pairs of vertices?

Logarithmic typical distance

Local clustering. Ifa is connected to bothb andc, is the probability that b is connected toc significantly greater than the probability any two nodes are connected? Strong local clustering Degree distribution. What is the distribution of the degree of vertices?

Power-lawindegree and outdegree distribution (2≤γ 3) [Broder et al., 2000]

k P

(20)

The Web as a graph Introduction

Characteristics of Interest of a Graph

Sparsity. Is the graph sparse (|A| |S|2)?

Yes, the World Wide Web is sparse

Typical distance. What is the mean distancebetween any pairs of vertices? Logarithmic typical distance

Local clustering. Ifa is connected to bothb andc, is the probability that b is connected toc significantly greater than the probability any two nodes are connected? Strong local clustering Degree distribution. What is the distribution of the degree of vertices?

Power-lawindegree and outdegree distribution (2≤γ 3) [Broder et al., 2000]

k P

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 12 / 46

(21)

The Web as a graph Introduction

Characteristics of Interest of a Graph

Sparsity. Is the graph sparse (|A| |S|2)?

Yes, the World Wide Web is sparse

Typical distance. What is the mean distancebetween any pairs of vertices? Logarithmic typical distance

Local clustering. Ifa is connected to bothb andc, is the probability that b is connected toc significantly greater than the probability any two nodes are connected?

Strong local clustering Degree distribution. What is the distribution of the degree of vertices?

Power-lawindegree and outdegree distribution (2≤γ 3) [Broder et al., 2000]

k P

(22)

The Web as a graph Introduction

Characteristics of Interest of a Graph

Sparsity. Is the graph sparse (|A| |S|2)?

Yes, the World Wide Web is sparse

Typical distance. What is the mean distancebetween any pairs of vertices? Logarithmic typical distance

Local clustering. Ifa is connected to bothb andc, is the probability that b is connected toc significantly greater than the probability any two nodes are connected? Strong local clustering

Degree distribution. What is the distribution of the degree of vertices? Power-lawindegree and outdegree distribution (2≤γ 3) [Broder et al., 2000]

k P

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 12 / 46

(23)

The Web as a graph Introduction

Characteristics of Interest of a Graph

Sparsity. Is the graph sparse (|A| |S|2)?

Yes, the World Wide Web is sparse

Typical distance. What is the mean distancebetween any pairs of vertices? Logarithmic typical distance

Local clustering. Ifa is connected to bothb andc, is the probability that b is connected toc significantly greater than the probability any two nodes are connected? Strong local clustering Degree distribution. What is the distribution of the degree of vertices?

Power-lawindegree and outdegree distribution (2≤γ 3) [Broder et al., 2000]

k P

(24)

The Web as a graph Introduction

Characteristics of Interest of a Graph

Sparsity. Is the graph sparse (|A| |S|2)?

Yes, the World Wide Web is sparse

Typical distance. What is the mean distancebetween any pairs of vertices? Logarithmic typical distance

Local clustering. Ifa is connected to bothb andc, is the probability that b is connected toc significantly greater than the probability any two nodes are connected? Strong local clustering Degree distribution. What is the distribution of the degree of vertices?

Power-lawindegree and outdegree distribution (2≤γ 3) [Broder et al., 2000]

k P

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 12 / 46

(25)

The Web as a graph Computing the importance of a page

PageRank (Google’s Ranking [Brin and Page, 1998])

Idea

Important pages are pages pointed to by importantpages.

(gij =0 if there is no link between page i and j;

gij = n1

i otherwise, with ni the number of outgoing links of page i.

Definition (Tentative)

Probability that the surfer following therandom walkin G has arrived on page i at some distant given point in the future.

pr(i) =

k→+∞lim (GT)kv

i

where v is some initial column vector.

(26)

The Web as a graph Computing the importance of a page

Illustrating PageRank Computation

0.100 0.100

0.100

0.100

0.100 0.100

0.100

0.100

0.100

0.100

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 14 / 46

(27)

The Web as a graph Computing the importance of a page

Illustrating PageRank Computation

0.033 0.317

0.075

0.108

0.025 0.058

0.083

0.150

0.117

0.033

(28)

The Web as a graph Computing the importance of a page

Illustrating PageRank Computation

0.036 0.193

0.108

0.163

0.079 0.090

0.074

0.154

0.094

0.008

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 14 / 46

(29)

The Web as a graph Computing the importance of a page

Illustrating PageRank Computation

0.054 0.212

0.093

0.152

0.048 0.051

0.108

0.149

0.106

0.026

(30)

The Web as a graph Computing the importance of a page

Illustrating PageRank Computation

0.051 0.247

0.078

0.143

0.053 0.062

0.097

0.153

0.099

0.016

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 14 / 46

(31)

The Web as a graph Computing the importance of a page

Illustrating PageRank Computation

0.048 0.232

0.093

0.156

0.062 0.067

0.087

0.138

0.099

0.018

(32)

The Web as a graph Computing the importance of a page

Illustrating PageRank Computation

0.052 0.226

0.092

0.148

0.058 0.064

0.098

0.146

0.096

0.021

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 14 / 46

(33)

The Web as a graph Computing the importance of a page

Illustrating PageRank Computation

0.049 0.238

0.088

0.149

0.057 0.063

0.095

0.141

0.099

0.019

(34)

The Web as a graph Computing the importance of a page

Illustrating PageRank Computation

0.050 0.232

0.091

0.149

0.060 0.066

0.094

0.143

0.096

0.019

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 14 / 46

(35)

The Web as a graph Computing the importance of a page

Illustrating PageRank Computation

0.050 0.233

0.091

0.150

0.058 0.064

0.095

0.142

0.098

0.020

(36)

The Web as a graph Computing the importance of a page

Illustrating PageRank Computation

0.050 0.234

0.090

0.148

0.058 0.065

0.095

0.143

0.097

0.019

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 14 / 46

(37)

The Web as a graph Computing the importance of a page

Illustrating PageRank Computation

0.049 0.233

0.091

0.149

0.058 0.065

0.095

0.142

0.098

0.019

(38)

The Web as a graph Computing the importance of a page

Illustrating PageRank Computation

0.050 0.233

0.091

0.149

0.058 0.065

0.095

0.143

0.097

0.019

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 14 / 46

(39)

The Web as a graph Computing the importance of a page

Illustrating PageRank Computation

0.050 0.234

0.091

0.149

0.058 0.065

0.095

0.142

0.097

0.019

(40)

The Web as a graph Computing the importance of a page

PageRank with Damping

May not always converge, or convergence may not be unique.

To fix this, the random surfer can at each step randomly jumpto any page of the Web with some probability d (1−d: damping factor).

pr(i) =

k→+∞lim ((1−d)GT +dU)kv

i

where U is the matrix with all N1 values withN the number of vertices.

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 15 / 46

(41)

The Web as a graph Computing the importance of a page

Using PageRank to Score Query Results

PageRank: globalscore, independent of the query Can be used to raise the weight ofimportant pages:

weight(t,d) =tfidf(t,d)×pr(d), This can be directly incorporatedin the index.

(42)

The Web as a graph Computing the importance of a page

Iterative Computation of PageRank

1 ComputeG (often stored as its adjacency list). Make sure lines sum to 1.

2 Let u be the uniform vector of sum 1,v =u,w thezero vector.

3 While v is different enoughfromw:

I Setw =v.

I Setv = (1d)GTv+du.

Exercise

1 2

4 3

Run the first iteration of the PageRank computation.

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 17 / 46

(43)

The Web as a graph Computing the importance of a page

Another take on PageRank: OPIC [Abiteboul et al., 2003]

Online Page Importance Computation

Allows computing PageRankwithout storingthe Web graph Remember for each page itscash and itshistory

Initially, each page has some initial cash

Each time a page is accessed, distribute uniformly its cash to outgoing links

PageRank is approximated as its cashflow: history divided by number of visits

Convergewhatever the strategy for crawling the Web is (as soon as each page is accessed repeatedly)

Good strategy: crawl page with largest cash first

(44)

The Web as a graph Computing the importance of a page

HITS (Kleinberg, [Kleinberg, 1999])

Idea

Two kinds of important pages: hubsand authorities. Hubs are pages that point to good authorities, whereas authorities are pages that are pointed to by good hubs.

G0 transition matrix (with 0 and 1 values) of a subgraph of the Web. We use the following iterative process (starting with aandh vectors of norm 1):

a:= kG0T1hk G0Th h:= kG10ak G0a

Converges under some technical assumptions to authorityandhub scores.

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 19 / 46

(45)

The Web as a graph Computing the importance of a page

Using HITS to Order Web Query Results

1 Retrieve the set D of Web pagesmatching a keyword query.

2 Retrieve the set D of Web pages obtained from D by adding all linked pages, as well as all pages linking topages ofD.

3 Build fromD the correspondingsubgraph G0 of the Web graph.

4 Computeiterativelyhubs and authority scores.

5 Sort documents fromD byauthority scores.

Less efficient than PageRank, becauselocal scores.

(46)

The Web as a graph Computing the importance of a page

Biasing PageRank: Green Measures

Equivalent definitions of Green measure centered at node i

PageRank with source ati: standard PageRank computation while, at each iteration, adding 1 to the measure ofi, and subtractingpr(j) to every nodej.

Time spent at a node knowing the initial node is i.

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 21 / 46

(47)

The Web as a graph Computing the importance of a page

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.050 -0.233

-0.091

-0.149

-0.058 0.935

-0.095

-0.143

-0.097

-0.019

(48)

The Web as a graph Computing the importance of a page

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.099 0.034

0.319

-0.298

-0.117 0.870

-0.190

-0.285

-0.194

-0.039

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 22 / 46

(49)

The Web as a graph Computing the importance of a page

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.149 -0.199

0.353

-0.322

-0.050 0.930

-0.035

-0.178

-0.291

-0.058

(50)

The Web as a graph Computing the importance of a page

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.157 -0.078

0.325

-0.304

-0.109 0.865

-0.026

-0.258

-0.222

-0.036

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 22 / 46

(51)

The Web as a graph Computing the importance of a page

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.151 -0.096

0.322

-0.333

-0.078 0.903

-0.034

-0.203

-0.273

-0.055

(52)

The Web as a graph Computing the importance of a page

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.161 -0.096

0.337

-0.300

-0.083 0.892

-0.045

-0.256

-0.242

-0.045

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 22 / 46

(53)

The Web as a graph Computing the importance of a page

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.150 -0.108

0.331

-0.328

-0.083 0.895

-0.027

-0.218

-0.267

-0.047

(54)

The Web as a graph Computing the importance of a page

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.159 -0.086

0.330

-0.312

-0.086 0.892

-0.039

-0.246

-0.248

-0.047

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 22 / 46

(55)

The Web as a graph Computing the importance of a page

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.154 -0.104

0.334

-0.321

-0.081 0.897

-0.034

-0.228

-0.262

-0.048

(56)

The Web as a graph Computing the importance of a page

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.157 -0.094

0.332

-0.314

-0.085 0.892

-0.035

-0.241

-0.252

-0.046

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 22 / 46

(57)

The Web as a graph Computing the importance of a page

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.155 -0.098

0.332

-0.320

-0.083 0.895

-0.034

-0.231

-0.259

-0.047

(58)

The Web as a graph Computing the importance of a page

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.157 -0.095

0.332

-0.315

-0.084 0.893

-0.036

-0.239

-0.253

-0.046

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 22 / 46

(59)

The Web as a graph Computing the importance of a page

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.155 -0.098

0.332

-0.318

-0.084 0.894

-0.034

-0.234

-0.257

-0.047

(60)

The Web as a graph Computing the importance of a page

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.156 -0.095

0.332

-0.316

-0.084 0.893

-0.035

-0.238

-0.254

-0.046

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 22 / 46

(61)

The Web as a graph Computing the importance of a page

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.156 -0.097

0.332

-0.318

-0.084 0.894

-0.034

-0.235

-0.256

-0.047

(62)

The Web as a graph Computing the importance of a page

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.156 -0.096

0.332

-0.316

-0.084 0.893

-0.035

-0.237

-0.254

-0.046

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 22 / 46

(63)

The Web as a graph Computing the importance of a page

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.156 -0.097

0.332

-0.317

-0.084 0.894

-0.034

-0.236

-0.255

-0.046

(64)

The Web as a graph Computing the importance of a page

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.156 -0.096

0.332

-0.316

-0.084 0.893

-0.034

-0.237

-0.254

-0.046

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 22 / 46

(65)

The Web as a graph Computing the importance of a page

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.156 -0.096

0.332

-0.317

-0.084 0.893

-0.034

-0.236

-0.255

-0.046

(66)

The Web as a graph Computing the importance of a page

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.156 -0.096

0.332

-0.316

-0.085 0.893

-0.034

-0.237

-0.254

-0.046

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 22 / 46

(67)

The Web as a graph Spamdexing

Spamdexing

Definition

Fraudulent techniques that are used by unscrupulous webmasters to artificially raise the visibility of their website to users of search engines Purpose: attracting visitors to websites to make profit.

Unceasing war between spamdexersandsearch engines

(68)

The Web as a graph Spamdexing

Spamdexing: Lying about the Content

Technique

Put unrelatedterms in:

meta-information (<meta name="description">,

<meta name="keywords">)

text content hidden to the user with JavaScript, CSS, or HTML presentational elements

Countertechnique

Ignoremeta-information Try and detectinvisible text

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 24 / 46

(69)

The Web as a graph Spamdexing

Link Farm Attacks

Technique

Huge number of hosts on the Internet used for the sole purpose ofreferencing each other, without any content in themselves, to raise the importance of a given website or set of websites.

Countertechnique

Detection of websites with emptyor duplicatecontent

Use of heuristics to discoversubgraphs that look like link farms

(70)

The Web as a graph Spamdexing

Link Pollution

Technique

Pollute user-editablewebsites (blogs, wikis) or exploit security bugs to add artificiallinks to websites, in order to raise its importance.

Countertechnique

rel="nofollow" attribute to<a> links not validated by a page’s owner

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 26 / 46

(71)

The Web as a graph Discovering communities

Discovery of communities

Identifyingcommunitiesof Web pages using thegraph structure: can be used for clustering the Web, or for finding boundaries oflogical Web sites

Two subproblems:

1 Given some initial vertex or vertex set, finding the corresponding community

2 Given the graph as a whole, finding a partition in communities

(72)

The Web as a graph Discovering communities

Maximum Flow / Minimum Cut

/6 /2

/1 /5 /2

/3

sink source

/4

Use of a maximum flow computation algorithm [Goldberg and Tarjan, 1988] to separate aseed of users from the remaining of the graph ComplexityO(n2m) (n: vertices,m: edges)

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 28 / 46

(73)

The Web as a graph Discovering communities

Maximum Flow / Minimum Cut

/6 /2

/1 /5 /2

/3 source

4 0

3 2

1 /4 4

1 sink

Use of a maximum flow computation algorithm [Goldberg and Tarjan, 1988] to separate aseed of users from the remaining of the graph ComplexityO(n2m)(n: vertices,m: edges)

(74)

The Web as a graph Discovering communities

Maximum Flow / Minimum Cut

/6 /2

/1 /5 /2

/3

sink source

4 0

3 2

1 /4 4

1

Use of a maximum flow computation algorithm [Goldberg and Tarjan, 1988] to separate aseed of users from the remaining of the graph ComplexityO(n2m)(n: vertices,m: edges)

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 28 / 46

(75)

The Web as a graph Discovering communities

Markov Cluster Algorithm (MCL) [van Dongen, 2000]

Graph clusteringalgorithm

Based as well on maximum flow simulation, in the whole graph Iteration of a matrix computation alternating:

I Expansion(matrix multiplication, corresponding to flow propagation)

I Inflation(non-linear operation to increase heterogeneity) Complexity: O(n3) for an exact computation,O(n) for an approximate one

[van Dongen, 2000]

(76)

The Web as a graph Discovering communities

Markov Cluster Algorithm (MCL) [van Dongen, 2000]

Graph clusteringalgorithm

Based as well on maximum flow simulation, in the whole graph Iteration of a matrix computation alternating:

I Expansion(matrix multiplication, corresponding to flow propagation)

I Inflation(non-linear operation to increase heterogeneity) Complexity: O(n3) for an exact computation,O(n) for an approximate one

[van Dongen, 2000]

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 29 / 46

(77)

The Web as a graph Discovering communities

Deletion of the edges with the highest betwenness [Newman and Girvan, 2004]

Top-downgraph clustering algorithm

Betwenness of an edge: number of minimal paths between two arbitrary vertices going through this edge

General principle:

1 Compute thebetweennessof each edge in the graph

2 Removethe edge with the highest betweenness

3 Redo the whole process, betweenness computation included Complexity: O(n3) for a sparse graph

[Newman and Girvan, 2004]

(78)

Social Networking Sites

Outline

1 Basics of Graph Theory

2 The Web as a graph

3 Social Networking Sites Typology

Social Network Graphs Algorithms on social networks

4 Conclusion

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 31 / 46

(79)

Social Networking Sites

Graph Mining on Other Graphs

Graph miningtools used for the Web: also applicable to a variety of different graphs:

I dictionary

I encyclopedia

I citation graph

I . . .

Among these: social networking Web sites

Attention: on undirected graphs, PageRank is proportional to degree!

Not very useful.

(80)

Social Networking Sites Typology

Most popular social networks

Social networking sites the most popular in the world and in France (Web site rank with the most traffic, according to Alexa)

Monde France

SkyRock 51 3

YouTube 3 4

MySpace 17 7

Facebook 5 8

Dailymotion 61 11

EBay 18 12

Wikipedia 8 13

Meetic 565 27

ImageShack 47 53

hi5 15 59

Megavideo 133 80

Adult Friendfinder 55 82

Wat.tv 1568 88

Flickr 33 94

Orkut 19 >100

V Kontakte 28 >100

Friendster 39 >100

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 33 / 46

(81)

Social Networking Sites Typology

Typology of social networking sites

Site de réseau social

Orienté contenu Orienté utilisateur

Catalogue Partage Édition Vente Discussion Pur Communautés de blogs Rencontre

Livres Musique Liens Films Publications Jeux Images Vidéos Adulte everything2 Wikipedia EBay Yahoo! Answers

Flickr (Yahoo!) Photobucket (Fox) YouTube (Google) Dailymotion Megavideo Wat (TF1)

Personnel Professionnel Mélangé SkyRock Twitter FriendFinder Meetic

MySpace (Fox) hi5 Friendster LinkedIn Facebook Orkut (Google) V Kontakte LibraryThing Shelfari (Amazon) Last.fm (CBS) Delicious (Yahoo!) Flixster Yahoo! Movies CiteULike MobyGames

(82)

Social Networking Sites Social Network Graphs

Graphs of Social Networks

Natural model: social network = graph Entities =nodes, Relations =edges Depending on the specific case:

I mostly undirected graphs

I bipartite,n-partite

I annotated edges, weighted edges

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 35 / 46

(83)

Social Networking Sites Social Network Graphs

Undirected graph

Adapted forpure social networking sites with symmetrical relations (e.g., LinkedIn)

(84)

Social Networking Sites Social Network Graphs

Multipartite graph

Adapted to most social networks for sharing content, with annotations, users, content, etc. (e.g., Flickr)

mason.flickr manufrakass

france chateau

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 37 / 46

(85)

Social Networking Sites Social Network Graphs

Six degrees of separation

Idea that any two persons on Earth are separated bya chain of six personssuch that any two successive links know each other Demonstrated by an experiment by Stanley Milgram [Travers and Milgram, 1969] (package to transmit to an unknown person, using acquaintances)

Popularized in numerous media

Number6 is not too be taken too seriously! But main ideavalidated in more recent experiments

In other fields:

I Erdős numberfor scientific publications

I Kevin Baconfor Hollywood movies

Common Characteristicsof (most) social networks !

(86)

Social Networking Sites Social Network Graphs

Characteristics of Social Networking Sites

Four important characteristics [Newman et al., 2006]:

Sparse graphs: much more edges than a complete graph

Small typical distance: shortest path between two nodes logarithmic with respect to the size of the graph

High clustering: ifa is connected tob andb to c, thenb has good probability of being linked toc

Power-law degree distribution: the number of nodes with degree k of the order ofk−γ (γ constant)

k P

More on this in Athens course on Collective Intelligence (TPT09)

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 39 / 46

(87)

Social Networking Sites Algorithms on social networks

Information retrieval with social score [Schenkel et al., 2008]

Context: Multipartite graph, e.g., Flickr Purpose: bias results with one’s social network Social weighting:

I Given a friendship relationF(u,u0)(explicit or implicit) between two users, anextended friendshiprelation can be computed

F(u,˜ u0) = α

|U|+ (1α) max

cheminu=u0. . .uk=u0 k−1

Y

i=0

F(ui,ui+1) (0< α <1 constant;|U|: number of users)

I Instead of aglobal weighting

tf-idf(t,d) =tf(t,d)×idf(t,d) we choose asocial weightingdependent of u:

tf-idfu(t,d) = X

u0∈U

F(u,u0)·tfu0(t,d)

!

×idf(t,d)

(88)

Social Networking Sites Algorithms on social networks

Top-k with social score [Benedikt et al., 2008]

Possible to adapt Fagin’s trheshold algorithm. . .

. . . butimpossible to precomputethe scores tf-idfu(t,d) for each user To avoid too much computation time:

1 Cluster the graph of users intostrongly similarcomponents

2 Use the scoresinside these componentsas estimates of the threshold in Fagin’s algorithm

3 givesapproximateresults, but good quality

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 41 / 46

(89)

Conclusion

Outline

1 Basics of Graph Theory

2 The Web as a graph

3 Social Networking Sites

4 Conclusion

(90)

Conclusion

Conclusion

What you should remember The Web seen as a graph

PageRank, and its iterative computation

Graph mining techniques applicable to a wide range of graphs Social networks share important common characteristics with the Web graph

To go further

A good textbook [Chakrabarti, 2003]

Two accessible influential research papers:

I On HITS [Kleinberg, 1999]

I On PageRank [Brin and Page, 1998]

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 43 / 46

(91)

Bibliography I

Serge Abiteboul, Mihai Preda, and Gregory Cobena. Adaptive on-line page importance computation. InProc. WWW, May 2003.

Michael Benedikt, Sihem Amer Yahia, Laks Lakshmanan, and Julia Stoyanovich. Efficient network-aware search in collaborative tagging sites. In Proc. VLDB, Auckland, New Zealand, August 2008.

Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks, 30(1–7):107–117, April 1998.

Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. Graph structure in the web. Computer Networks, 33(1-6):309–320, 2000.

Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, San Fransisco, USA, 2003.

(92)

Bibliography II

Andrew V. Goldberg and Robert E. Tarjan. A new approach to the maximum-flow problem. Journal of the ACM, 35(4):921–940, October 1988.

Jon M. Kleinberg. Authoritative Sources in a Hyperlinked Environment.

Journal of the ACM, 46(5):604–632, 1999.

M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2), 2004.

Mark Newman, Albert-László Barabási, and Duncan J. Watts. The Structure and Dynamics of Networks. Princeton University Press, 2006.

Yann Ollivier and Pierre Senellart. Finding related pages using Green measures: An illustration with Wikipedia. InProc. AAAI, Vancouver, Canada, July 2007.

P. Senellart (TELECOM ParisTech) Graph Mining 2009/03/18 45 / 46

(93)

Bibliography III

Ralf Schenkel, Tom Crecelius, Mouna Kacimi, Sebastian Michel, Thomas Neumann, Josiane X. Parreira, and Gerhard Weikum. Efficient top-k querying over social-tagging networks. InProc. SIGIR, Singapore, Singapore, July 2008.

Jeffrey Travers and Stanley Milgram. An experimental study of the small world problem. Sociometry, 34(4), December 1969.

Stijn Marinus van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.

Références

Documents relatifs

A user can put, within an HTTP parameter that will be displayed in the page, HTML code (and therefore CSS styling or JavaScript code). He thus modifies the code of the produced

Sécurité côté client Introduction Failles de sécurité Abus de confiance Capture d’informations Sécurité côté serveur Références... 29

For a HTML page, or for text, the browser must also know what character set is used (this has precedence over the information contained in the document itself). Also returned:

• Web document representation models: Vector space model, ontology based ones.. • Web document clustering: Similarity measures, Web document

Among the tag-based tag predictors, Majority Rule method predicts the best tags for unan- notated web pages which means Tag Similarity Assumption dominates Tag Col-

We will show live demos to nd relations in the Wikipedia or to improve image search as well as our current research in the topic. Our nal goal is to produce a virtuous data

From our point of view, the Web page is entity carrying information about these communities and this paper describes techniques, which can be used to extract mentioned informa- tion

This integrated object schema is called the Business Information Model (BIM) and can be used as the basis for a database schema from which an underlying database can be created.