• Aucun résultat trouvé

Web search Graph Mining

N/A
N/A
Protected

Academic year: 2022

Partager "Web search Graph Mining"

Copied!
101
0
0

Texte intégral

(1)

7 July 2011

Web search

Graph Mining

(2)

7 July 2011

Outline

Basics of Graph Theory

The Web as a graph Social Networking Sites Conclusion

(3)

7 July 2011

Graphs

1 2 3

4 5 6

Definition

Adirected graphis a pair(S,A)where:

Sis a finite set ofvertices(ornodes) Ais a subset ofS2defining theedges(or arcs)

1 2 3

4 5 6

Definition

Anundirected graphis a pair(S,A)where:

Sis a finite set ofvertices(ornodes)

Ais a set of (unordered) pairs of elements of Sdefining theedges(orarcs)

Remark

Graphis the mathematical term,networkis used to describe real-world graphs.

(4)

7 July 2011

Paths and Connectedness

Definition

Apathis a sequence of verticesv1. . .vnsuch thatvk is connected by an edge tovk+1for 1≤k ≤n−1.

Definition

Theunderlying undirected graphof a directed graphGis the graph obtained by adding all reverse edges.

Definition

An undirected graph isconnectedif for every two verticesu andv, there exists a path starting fromu and ending inv.

A directed graph isstrongly connectedif it is connected, and isweakly connectedif the underlying undirected graph is connected.

(5)

7 July 2011

Connected Components

Definition

(S,A)is asubgraphof(S,A)ifS ⊆SandA is the restriction ofAto edges whose vertices are inS.

Connected component: maximal connected subgraph

Strongly connected component: maximal strongly connected subgraph

Weakly connected component: maximal weakly connected subgraph

1 2

4 5

3 6

(6)

7 July 2011

Connected Components

Definition

(S,A)is asubgraphof(S,A)ifS ⊆SandA is the restriction ofAto edges whose vertices are inS.

Connected component: maximal connected subgraph

Strongly connected component: maximal strongly connected subgraph

Weakly connected component: maximal weakly connected subgraph

1 2

4 5

3 6

Strongly connected components

(7)

7 July 2011

Connected Components

Definition

(S,A)is asubgraphof(S,A)ifS ⊆SandA is the restriction ofAto edges whose vertices are inS.

Connected component: maximal connected subgraph

Strongly connected component: maximal strongly connected subgraph

Weakly connected component: maximal weakly connected subgraph

1 2

4 5

3 6

Weakly connected components

(8)

7 July 2011

Vocabulary

Incident: an edge is said to beincidentto a vertex if it it hasthis vertex for endpoint

Degree (of a vertex): number of edgesincident toa vertex, in an undirected graph

Indegree (of a vertex): number of edgesarriving toa vertex, in a directed graph

Outdegree (of a vertex): number of edgesleaving froma vertex, in a directed graph

Cycle: Path whose start and end vertex is thesame Distance: Length of theshortest pathbetween two vertices

Sparse: a graph(S,A)is sparse if|A| ≪ |S|2

(9)

7 July 2011

Bipartite graphs

Definition

Abipartitegraph is an undirected graph(S,A)such thatS =S1∪S2 (withS1∩S2=∅), and no edge ofAis incident to two vertices inS1or two vertices inS2.

Paths of length 2 in a bipartite graph define two regular undirected graphs.

1 2 3 4

5 6 7

1 2

3 4

5 6

7

(10)

7 July 2011

Bipartite graphs

Definition

Abipartitegraph is an undirected graph(S,A)such thatS =S1∪S2 (withS1∩S2=∅), and no edge ofAis incident to two vertices inS1or two vertices inS2.

Paths of length 2 in a bipartite graph define two regular undirected graphs.

1 2 3 4

5 6 7

1 2

3 4

5 6

7

(11)

7 July 2011

Bipartite graphs

Definition

Abipartitegraph is an undirected graph(S,A)such thatS =S1∪S2 (withS1∩S2=∅), and no edge ofAis incident to two vertices inS1or two vertices inS2.

Paths of length 2 in a bipartite graph define two regular undirected graphs.

1 2 3 4

5 6 7

1 2

3 4

5 6

7

(12)

7 July 2011

Matrix Representation of a Graph

A graphGcan be represented by itsadjacency matrixM:

1 2 3

4 5 6

0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0

Adjacency matrices ofundirectedgraphs aresymmetric.

MT is the adjacency matrix obtained fromGbyreversingthe arrows.

Mnis the matrix of the graph of allpaths of lengthninG.

(13)

7 July 2011

Concrete Representation of Graphs

In programming, graphs are usually represented as:

itsadjacency matrix(stored as a multidimensional array), for non-sparsegraphs

the list of alledgesincident to each node, forsparsegraphs

1 2 3

4 5 6

1 2 2 2,4,5 3 6 4 1,5 5 4 6

(14)

7 July 2011

Outline

Basics of Graph Theory The Web as a graph

Introduction

Computing the importance of a page Spamdexing

Discovering communities Social Networking Sites Conclusion

(15)

7 July 2011

Outline

Basics of Graph Theory The Web as a graph

Introduction

Computing the importance of a page Spamdexing

Discovering communities Social Networking Sites Conclusion

(16)

7 July 2011

The Web Graph

The World Wide Web seenas a (directed) graph:

Vertices: Web pages Edges: hyperlinks

Same for otherinterlinkedenvironments:

dictionaries encyclopedias scientific publications social networks

(17)

7 July 2011

The shape of the Web graph

[Broder et al., 2000]

(18)

7 July 2011

Characteristics of Interest of a Graph

Sparsity. Is the graph sparse (|A| ≪ |S|2)?

Yes, the World Wide Web is sparse

Typical distance. What is themean distancebetween any pairs of vertices?Logarithmic typical distance

Local clustering. Ifais connected to bothbandc, is the probability thatbis connected toc significantly greater than the probability any two nodes are connected?Strong local clustering

Degree distribution. What is the distribution of the degree of vertices? Power-lawindegree and outdegree distribution

(2≤𝛾 ≤3) [Broder et al., 2000]

𝑘 𝑃

(19)

7 July 2011

Characteristics of Interest of a Graph

Sparsity. Is the graph sparse (|A| ≪ |S|2)?

Yes, the World Wide Web is sparse

Typical distance. What is themean distancebetween any pairs of vertices?Logarithmic typical distance

Local clustering. Ifais connected to bothbandc, is the probability thatbis connected toc significantly greater than the probability any two nodes are connected?Strong local clustering

Degree distribution. What is the distribution of the degree of vertices? Power-lawindegree and outdegree distribution

(2≤𝛾 ≤3) [Broder et al., 2000]

𝑘 𝑃

(20)

7 July 2011

Characteristics of Interest of a Graph

Sparsity. Is the graph sparse (|A| ≪ |S|2)?

Yes, the World Wide Web is sparse

Typical distance. What is themean distancebetween any pairs of vertices?

Logarithmic typical distance

Local clustering. Ifais connected to bothbandc, is the probability thatbis connected toc significantly greater than the probability any two nodes are connected?Strong local clustering

Degree distribution. What is the distribution of the degree of vertices? Power-lawindegree and outdegree distribution

(2≤𝛾 ≤3) [Broder et al., 2000]

𝑘 𝑃

(21)

7 July 2011

Characteristics of Interest of a Graph

Sparsity. Is the graph sparse (|A| ≪ |S|2)?

Yes, the World Wide Web is sparse

Typical distance. What is themean distancebetween any pairs of vertices? Logarithmic typical distance

Local clustering. Ifais connected to bothbandc, is the probability thatbis connected toc significantly greater than the probability any two nodes are connected?Strong local clustering

Degree distribution. What is the distribution of the degree of vertices? Power-lawindegree and outdegree distribution

(2≤𝛾 ≤3) [Broder et al., 2000]

𝑘 𝑃

(22)

7 July 2011

Characteristics of Interest of a Graph

Sparsity. Is the graph sparse (|A| ≪ |S|2)?

Yes, the World Wide Web is sparse

Typical distance. What is themean distancebetween any pairs of vertices? Logarithmic typical distance

Local clustering. Ifais connected to bothbandc, is the probability thatbis connected toc significantly greater than the probability any two nodes are connected?

Strong local clustering

Degree distribution. What is the distribution of the degree of vertices? Power-lawindegree and outdegree distribution

(2≤𝛾 ≤3) [Broder et al., 2000]

𝑘 𝑃

(23)

7 July 2011

Characteristics of Interest of a Graph

Sparsity. Is the graph sparse (|A| ≪ |S|2)?

Yes, the World Wide Web is sparse

Typical distance. What is themean distancebetween any pairs of vertices? Logarithmic typical distance

Local clustering. Ifais connected to bothbandc, is the probability thatbis connected toc significantly greater than the probability any two nodes are connected? Strong local clustering

Degree distribution. What is the distribution of the degree of vertices? Power-lawindegree and outdegree distribution

(2≤𝛾 ≤3) [Broder et al., 2000]

𝑘 𝑃

(24)

7 July 2011

Characteristics of Interest of a Graph

Sparsity. Is the graph sparse (|A| ≪ |S|2)?

Yes, the World Wide Web is sparse

Typical distance. What is themean distancebetween any pairs of vertices? Logarithmic typical distance

Local clustering. Ifais connected to bothbandc, is the probability thatbis connected toc significantly greater than the probability any two nodes are connected? Strong local clustering

Degree distribution. What is the distribution of the degree of vertices?

Power-lawindegree and outdegree distribution (2≤𝛾 ≤3) [Broder et al., 2000]

𝑘 𝑃

(25)

7 July 2011

Characteristics of Interest of a Graph

Sparsity. Is the graph sparse (|A| ≪ |S|2)?

Yes, the World Wide Web is sparse

Typical distance. What is themean distancebetween any pairs of vertices? Logarithmic typical distance

Local clustering. Ifais connected to bothbandc, is the probability thatbis connected toc significantly greater than the probability any two nodes are connected? Strong local clustering

Degree distribution. What is the distribution of the degree of vertices?

Power-lawindegree and outdegree distribution (2≤𝛾 ≤3) [Broder et al., 2000]

𝑘 𝑃

(26)

7 July 2011

Outline

Basics of Graph Theory The Web as a graph

Introduction

Computing the importance of a page Spamdexing

Discovering communities Social Networking Sites Conclusion

(27)

7 July 2011

PageRank (Google’s Ranking [Brin and Page, 1998])

Idea

Importantpages are pages pointed to byimportantpages.

{︃gij =0 if there is no link between pageiandj;

gij = n1

i otherwise, withni the number of outgoing links of pagei.

Definition (Tentative)

Probabilitythat the surfer following therandom walkinGhas arrived on pagei at some distant given point in the future.

pr(i) = (︂

k→+∞lim (GT)kv )︂

i

wherev is some initial column vector.

(28)

7 July 2011

Illustrating PageRank Computation

0.100 0.100

0.100

0.100

0.100 0.100

0.100

0.100

0.100

0.100

(29)

7 July 2011

Illustrating PageRank Computation

0.033 0.317

0.075

0.108

0.025 0.058

0.083

0.150

0.117

0.033

(30)

7 July 2011

Illustrating PageRank Computation

0.036 0.193

0.108

0.163

0.079 0.090

0.074

0.154

0.094

0.008

(31)

7 July 2011

Illustrating PageRank Computation

0.054 0.212

0.093

0.152

0.048 0.051

0.108

0.149

0.106

0.026

(32)

7 July 2011

Illustrating PageRank Computation

0.051 0.247

0.078

0.143

0.053 0.062

0.097

0.153

0.099

0.016

(33)

7 July 2011

Illustrating PageRank Computation

0.048 0.232

0.093

0.156

0.062 0.067

0.087

0.138

0.099

0.018

(34)

7 July 2011

Illustrating PageRank Computation

0.052 0.226

0.092

0.148

0.058 0.064

0.098

0.146

0.096

0.021

(35)

7 July 2011

Illustrating PageRank Computation

0.049 0.238

0.088

0.149

0.057 0.063

0.095

0.141

0.099

0.019

(36)

7 July 2011

Illustrating PageRank Computation

0.050 0.232

0.091

0.149

0.060 0.066

0.094

0.143

0.096

0.019

(37)

7 July 2011

Illustrating PageRank Computation

0.050 0.233

0.091

0.150

0.058 0.064

0.095

0.142

0.098

0.020

(38)

7 July 2011

Illustrating PageRank Computation

0.050 0.234

0.090

0.148

0.058 0.065

0.095

0.143

0.097

0.019

(39)

7 July 2011

Illustrating PageRank Computation

0.049 0.233

0.091

0.149

0.058 0.065

0.095

0.142

0.098

0.019

(40)

7 July 2011

Illustrating PageRank Computation

0.050 0.233

0.091

0.149

0.058 0.065

0.095

0.143

0.097

0.019

(41)

7 July 2011

Illustrating PageRank Computation

0.050 0.234

0.091

0.149

0.058 0.065

0.095

0.142

0.097

0.019

(42)

7 July 2011

PageRank with Damping

May not always converge, or convergence may not be unique.

To fix this, the random surfer can at each steprandomly jumpto any page of the Web with some probabilityd (1−d:damping factor).

pr(i) = (︂

k→+∞lim ((1−d)GT +dU)kv )︂

i

whereU is the matrix with all N1 values withN the number of vertices.

(43)

7 July 2011

Using PageRank to Score Query Results

PageRank: globalscore, independent of the query Can be used to raise the weight ofimportantpages:

weight(t,d) =tfidf(t,d)×pr(d), This can be directly incorporatedin the index.

(44)

7 July 2011

Iterative Computation of PageRank

1. ComputeG(often stored as its adjacency list). Make sure lines sum to 1.

2. Letu be the uniform vector of sum 1,v =u,w thezerovector.

3. Whilev isdifferent enoughfromw:

Setw =v.

Setv = (1d)GTv+du.

Exercise

1 2

3 4

Run the first iteration of the PageRank computation.

(45)

7 July 2011

Another take on PageRank: OPIC [Abite- boul et al., 2003]

Online Page Importance Computation

Allows computing PageRankwithout storingthe Web graph Remember for each page itscashand itshistory

Initially, each page has some initial cash

Each time a page is accessed, distribute uniformly its cash to outgoing links

PageRank is approximated as itscashflow: history divided by number of visits

Convergewhatever the strategy for crawling the Web is (as soon as each page is accessed repeatedly)

Good strategy: crawl page with largest cash first

(46)

7 July 2011

HITS (Kleinberg, [Kleinberg, 1999])

Idea

Two kinds of important pages: hubs and authorities. Hubs are pages that point to good authorities, whereas authorities are pages that are pointed to by good hubs.

G transition matrix (with 0 and 1 values) of a subgraph of the Web. We use the following iterative process (starting withaandhvectors of norm 1):

a:= ‖G′T1h‖ G′Th h:= ‖G1a‖ Ga

Convergesunder some technical assumptions toauthorityandhub scores.

(47)

7 July 2011

Using HITS to Order Web Query Results

1. Retrieve the setD of Web pagesmatchinga keyword query.

2. Retrieve the setD* of Web pages obtained fromDby addingall linked pages, as well as allpages linking topages ofD.

3. Build fromD*the correspondingsubgraphG of the Web graph.

4. Computeiterativelyhubs and authority scores.

5. Sort documents fromDbyauthority scores.

Less efficient than PageRank, becauselocalscores.

(48)

7 July 2011

Biasing PageRank: Green Measures

Equivalent definitions of Green measure centered at nodei PageRank with sourceati: standard PageRank computation while, at each iteration, adding 1 to the measure ofi, and subtractingpr(j)to every nodej.

Time spent at a nodeknowing the initial node isi.

(49)

7 July 2011

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.050 -0.233

-0.091

-0.149

-0.058 0.935

-0.095

-0.143

-0.097

-0.019

(50)

7 July 2011

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.099 0.034

0.319

-0.298

-0.117 0.870

-0.190

-0.285

-0.194

-0.039

(51)

7 July 2011

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.149 -0.199

0.353

-0.322

-0.050 0.930

-0.035

-0.178

-0.291

-0.058

(52)

7 July 2011

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.157 -0.078

0.325

-0.304

-0.109 0.865

-0.026

-0.258

-0.222

-0.036

(53)

7 July 2011

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.151 -0.096

0.322

-0.333

-0.078 0.903

-0.034

-0.203

-0.273

-0.055

(54)

7 July 2011

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.161 -0.096

0.337

-0.300

-0.083 0.892

-0.045

-0.256

-0.242

-0.045

(55)

7 July 2011

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.150 -0.108

0.331

-0.328

-0.083 0.895

-0.027

-0.218

-0.267

-0.047

(56)

7 July 2011

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.159 -0.086

0.330

-0.312

-0.086 0.892

-0.039

-0.246

-0.248

-0.047

(57)

7 July 2011

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.154 -0.104

0.334

-0.321

-0.081 0.897

-0.034

-0.228

-0.262

-0.048

(58)

7 July 2011

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.157 -0.094

0.332

-0.314

-0.085 0.892

-0.035

-0.241

-0.252

-0.046

(59)

7 July 2011

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.155 -0.098

0.332

-0.320

-0.083 0.895

-0.034

-0.231

-0.259

-0.047

(60)

7 July 2011

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.157 -0.095

0.332

-0.315

-0.084 0.893

-0.036

-0.239

-0.253

-0.046

(61)

7 July 2011

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.155 -0.098

0.332

-0.318

-0.084 0.894

-0.034

-0.234

-0.257

-0.047

(62)

7 July 2011

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.156 -0.095

0.332

-0.316

-0.084 0.893

-0.035

-0.238

-0.254

-0.046

(63)

7 July 2011

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.156 -0.097

0.332

-0.318

-0.084 0.894

-0.034

-0.235

-0.256

-0.047

(64)

7 July 2011

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.156 -0.096

0.332

-0.316

-0.084 0.893

-0.035

-0.237

-0.254

-0.046

(65)

7 July 2011

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.156 -0.097

0.332

-0.317

-0.084 0.894

-0.034

-0.236

-0.255

-0.046

(66)

7 July 2011

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.156 -0.096

0.332

-0.316

-0.084 0.893

-0.034

-0.237

-0.254

-0.046

(67)

7 July 2011

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.156 -0.096

0.332

-0.317

-0.084 0.893

-0.034

-0.236

-0.255

-0.046

(68)

7 July 2011

Illustrating Green Measures [Ollivier and Senellart, 2007]

-0.156 -0.096

0.332

-0.316

-0.085 0.893

-0.034

-0.237

-0.254

-0.046

(69)

7 July 2011

Outline

Basics of Graph Theory The Web as a graph

Introduction

Computing the importance of a page Spamdexing

Discovering communities Social Networking Sites Conclusion

(70)

7 July 2011

Spamdexing

Definition

Fraudulent techniques that are used by unscrupulous webmasters to artificially raise the visibility of their website to users of search engines Purpose: attracting visitors to websites to make profit.

Unceasing war betweenspamdexersandsearch engines

(71)

7 July 2011

Spamdexing: Lying about the Content

Technique

Putunrelatedterms in:

meta-information (<meta name="description">,

<meta name="keywords">)

text content hidden to the user with JavaScript, CSS, or HTML presentational elements

Countertechnique

Ignoremeta-information Try anddetectinvisible text

(72)

7 July 2011

Link Farm Attacks

Technique

Huge number of hosts on the Internet used for the sole purpose ofref- erencing each other, without any content in themselves, to raise the importanceof a given website or set of websites.

Countertechnique

Detection of websites withemptyorduplicatecontent

Use of heuristics to discoversubgraphsthat look like link farms

(73)

7 July 2011

Link Pollution

Technique

Pollute user-editablewebsites (blogs, wikis) or exploit security bugs to addartificiallinks to websites, in order to raise its importance.

Countertechnique

rel="nofollow"attribute to<a>links not validated by a page’s owner

(74)

7 July 2011

Outline

Basics of Graph Theory The Web as a graph

Introduction

Computing the importance of a page Spamdexing

Discovering communities Social Networking Sites Conclusion

(75)

7 July 2011

Discovery of communities

Identifyingcommunitiesof Web pages using thegraph structure:

can be used for clustering the Web, or for finding boundaries of logical Web sites

Two subproblems:

1. Given some initial vertex or vertex set, finding the corresponding community

2. Given the graph as a whole, finding a partition in communities

(76)

7 July 2011

Maximum Flow / Minimum Cut

/6 /2

/1 /5 /2

/3

sink source

/4

Use of a maximum flow computation algorithm [Goldberg and Tarjan, 1988] to separate aseedof users from the remaining of the graph

ComplexityO(n2m)(n: vertices,m: edges)

(77)

7 July 2011

Maximum Flow / Minimum Cut

/6 /2

/1 /5 /2

/3 source

4 0

3 2

1 /4 4

1 sink

Use of a maximum flow computation algorithm [Goldberg and Tarjan, 1988] to separate aseedof users from the remaining of the graph

ComplexityO(n2m)(n: vertices,m: edges)

(78)

7 July 2011

Maximum Flow / Minimum Cut

/6 /2

/1 /5 /2

/3

sink source

4 0

3 2

1 /4 4

1

Use of a maximum flow computation algorithm [Goldberg and Tarjan, 1988] to separate aseedof users from the remaining of the graph

ComplexityO(n2m)(n: vertices,m: edges)

(79)

7 July 2011

Markov Cluster Algorithm (MCL) [van Don- gen, 2000]

Graphclusteringalgorithm

Based as well on maximum flow simulation, in the whole graph Iteration of a matrix computation alternating:

Expansion(matrix multiplication, corresponding to flow propagation)

Inflation(non-linear operation to increase heterogeneity) Complexity: O(n3)for an exact computation,O(n)for an approximate one

[van Dongen, 2000]

(80)

7 July 2011

Markov Cluster Algorithm (MCL) [van Don- gen, 2000]

Graphclusteringalgorithm

Based as well on maximum flow simulation, in the whole graph Iteration of a matrix computation alternating:

Expansion(matrix multiplication, corresponding to flow propagation)

Inflation(non-linear operation to increase heterogeneity) Complexity: O(n3)for an exact computation,O(n)for an approximate one

[van Dongen, 2000]

(81)

7 July 2011

Deletion of the edges with the highest be- twenness [Newman and Girvan, 2004]

Top-downgraph clustering algorithm

Betwennessof an edge: number of minimal paths between two arbitrary vertices going through this edge

General principle:

1. Compute thebetweennessof each edge in the graph 2. Removethe edge with the highest betweenness

3. Redo the whole process, betweenness computation included Complexity: O(n3)for a sparse graph

[Newman and Girvan, 2004]

(82)

7 July 2011

Outline

Basics of Graph Theory The Web as a graph Social Networking Sites

Typology

Social Network Graphs Algorithms on social networks Conclusion

(83)

7 July 2011

Graph Mining on Other Graphs

Graph miningtools used for the Web: also applicable to a variety of different graphs:

dictionary encyclopedia citation graph . . .

Among these: social networking Web sites

Attention: on undirected graphs, PageRank is proportional to degree! Not very useful.

(84)

7 July 2011

Outline

Basics of Graph Theory The Web as a graph Social Networking Sites

Typology

Social Network Graphs Algorithms on social networks Conclusion

(85)

7 July 2011

Most popular social networks

Social networking sites the most popular in the world and in France (Web site rank with the most traffic, according to Alexa)

World France

SkyRock 51 3

YouTube 3 4

MySpace 17 7

Facebook 5 8

Dailymotion 61 11

EBay 18 12

Wikipedia 8 13

Meetic 565 27

ImageShack 47 53

hi5 15 59

Megavideo 133 80

Adult Friendfinder 55 82

Wat.tv 1568 88

Flickr 33 94

Orkut 19 >100

V Kontakte 28 >100

(86)

7 July 2011

Typology of social networking sites

Content-oriented

Cataloging (Books, Music, Links, Movies, Publications, Games) Sharing (Images, Videos)

Edition Sales Discussion User-oriented

Pure social networks (Personal, Professional, Mixed) Blogs

Dating

(87)

7 July 2011

Outline

Basics of Graph Theory The Web as a graph Social Networking Sites

Typology

Social Network Graphs Algorithms on social networks Conclusion

(88)

7 July 2011

Graphs of Social Networks

Natural model: social network =graph Entities =nodes, Relations =edges Depending on the specific case:

mostly undirected graphs bipartite,n-partite

annotated edges, weighted edges

(89)

7 July 2011

Undirected graph

Adapted forpuresocial networking sites with symmetrical relations (e.g., LinkedIn)

(90)

7 July 2011

Multipartite graph

Adapted to most social networks forsharingcontent, with annotations, users, content, etc. (e.g., Flickr)

mason.flickr manufrakass

france chateau

(91)

7 July 2011

Six degrees of separation

Idea that any two persons on Earth are separated bya chain of six personssuch that any two successive links know each other Demonstrated by an experiment byStanley Milgram[Travers and Milgram, 1969] (package to transmit to an unknown person, using acquaintances)

Popularized in numerous media

Number6is not too be taken too seriously! But main idea validatedin more recent experiments

In other fields:

Erd ˝os numberfor scientific publications Kevin Baconfor Hollywood movies

Common Characteristicsof (most) social networks !

(92)

7 July 2011

Characteristics of Social Networking Sites

Four important characteristics [Newman et al., 2006]:

Sparse graphs: much more edges than a complete graph

Small typical distance: shortest path between two nodes logarithmic with respect to the size of the graph

High clustering: ifais connected tob andb toc, thenbhas good probability of being linked toc

Power-law degree distribution: the number of nodes with degreek of the order ofk−𝛾 (𝛾 constant)

𝑘 𝑃

More on this in Athens course on Collective Intelligence (TPT09)

(93)

7 July 2011

Outline

Basics of Graph Theory The Web as a graph Social Networking Sites

Typology

Social Network Graphs Algorithms on social networks Conclusion

(94)

7 July 2011

Information retrieval with social score [Schenkel et al., 2008]

Context: Multipartite graph, e.g., Flickr

Purpose: bias results with one’s social network Social weighting:

Given a friendship relationF(u,u)(explicit or implicit) between two users, anextended friendshiprelation can be computed

F˜(u,u) = 𝛼

|U| + (1𝛼) max

cheminu=u0. . .uk=u k−1

∏︁

i=0

F(ui,ui+1)

(0< 𝛼 <1 constant;|U|: number of users) Instead of aglobal weighting

tf-idf(t,d) =tf(t,d)×idf(t,d) we choose asocial weightingdependent ofu:

tf-idfu(t,d) = (︃

∑︁F(u,u)·tfu(t,d) )︃

×idf(t,d)

(95)

7 July 2011

Top-k with social score [Benedikt et al., 2008]

Possible to adapt Fagin’strheshold algorithm. . .

. . . butimpossible to precomputethe scores tf-idfu(t,d)for each user

To avoid too much computation time:

1. Cluster the graph of users intostrongly similarcomponents 2. Use the scoresinside these componentsas estimates of the

threshold in Fagin’s algorithm

3. givesapproximateresults, but good quality

(96)

7 July 2011

Outline

Basics of Graph Theory The Web as a graph Social Networking Sites Conclusion

(97)

7 July 2011

Conclusion

What you should remember The Web seen as a graph

PageRank, and its iterative computation

Graph mining techniques applicable to a wide range of graphs Social networks share important common characteristics with the Web graph

To go further

A good textbook [Chakrabarti, 2003]

Two accessible influential research papers:

On HITS [Kleinberg, 1999]

On PageRank [Brin and Page, 1998]

(98)

7 July 2011

Bibliography I

Serge Abiteboul, Mihai Preda, and Gregory Cobena. Adaptive on-line page importance computation. InProc. WWW, May 2003.

Michael Benedikt, Sihem Amer Yahia, Laks Lakshmanan, and Julia Stoyanovich. Efficient network-aware search in collaborative tagging sites. InProc. VLDB, Auckland, New Zealand, August 2008.

Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks, 30(1–7):

107–117, April 1998.

Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. Graph structure in the web.Computer Networks, 33(1-6):

309–320, 2000.

Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, San Fransisco, USA, 2003.

(99)

7 July 2011

Bibliography II

Andrew V. Goldberg and Robert E. Tarjan. A new approach to the maximum-flow problem. Journal of the ACM, 35(4):921–940, October 1988.

Jon M. Kleinberg. Authoritative Sources in a Hyperlinked Environment.

Journal of the ACM, 46(5):604–632, 1999.

M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks.Physical Review E, 69(2), 2004.

Mark Newman, Albert-László Barabási, and Duncan J. Watts. The Structure and Dynamics of Networks. Princeton University Press, 2006.

Yann Ollivier and Pierre Senellart. Finding related pages using Green measures: An illustration with Wikipedia. InProc. AAAI, Vancouver, Canada, July 2007.

(100)

7 July 2011

Bibliography III

Ralf Schenkel, Tom Crecelius, Mouna Kacimi, Sebastian Michel, Thomas Neumann, Josiane X. Parreira, and Gerhard Weikum.

Efficient top-k querying over social-tagging networks. InProc.

SIGIR, Singapore, Singapore, July 2008.

Jeffrey Travers and Stanley Milgram. An experimental study of the small world problem. Sociometry, 34(4), December 1969.

Stijn Marinus van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.

(101)

7 July 2011

Licence de droits d’usage

Contexte public} avec modifications

Par le téléchargement ou la consultation de ce document, l’utilisateur accepte la licence d’utilisation qui y est attachée, telle que détaillée dans les dispositions suivantes, et s’engage à la respecter intégralement.

La licence confère à l’utilisateur un droit d’usage sur le document consulté ou téléchargé, totalement ou en partie, dans les conditions définies ci-après et à l’exclusion expresse de toute utilisation commerciale.

Le droit d’usage défini par la licence autorise un usage à destination de tout public qui comprend : – le droit de reproduire tout ou partie du document sur support informatique ou papier,

– le droit de diffuser tout ou partie du document au public sur support papier ou informatique, y compris par la mise à la disposition du public sur un réseau numérique,

– le droit de modifier la forme ou la présentation du document,

– le droit d’intégrer tout ou partie du document dans un document composite et de le diffuser dans ce nouveau document, à condition que : – L’auteur soit informé.

Les mentions relatives à la source du document et/ou à son auteur doivent être conservées dans leur intégralité.

Le droit d’usage défini par la licence est personnel et non exclusif.

Tout autre usage que ceux prévus par la licence est soumis à autorisation préalable et expresse de l’auteur :[email protected]

Références

Documents relatifs

A web graph of Web fragment is a directed graph without loops and multiple arcs, whose vertex set is represented by the target site set, and whose arc set is constructed as

Thus, on the basis of research of existing software tools and existing mechanisms for obtaining information for a software complex that will implement the process of ob- taining

This data was collected on a version of Snag’em permitting multiple missions and allowing players to connect with the same user multiple times.We classified players as active if

The Triangle-finding SPARQL query finds all triangles in the queried graph and each solution is

We then provide the results of an extensive experimental study of the distributional properties of relevant measures on graphs generated by several stochastic models, including

Among the tag-based tag predictors, Majority Rule method predicts the best tags for unan- notated web pages which means Tag Similarity Assumption dominates Tag Col-

To solve the problem, we present a method for performing large-scale object consolidation to merge identifiers of equiv- alent instances occurring across data sources; the algorithm

Linkedin’s graph and search systems help our users discover other users, jobs, companies, schools, and relevant professional informa- tion. I will present the evolution of