Finding Related Pages Using Green Measures: The Example of Wikipedia

Yann Ollivier, Pierre Senellart

Gemo Seminar

November 10th, 2006

Introduction Related nodes in a graph

Related nodes in a graph

Given a hyperlinked environment (= a graph)...

[Figure: example graph with ten nodes, numbered 1 to 10]

Problem

Finding nodes semantically related to some given node.

Example of related nodes

Example (World Wide Web)

Nodes: Web pages

Edges: hyperlinks

Related nodes: similar/related pages (cf. Google)

Example (Publications)

Nodes: publications

Edges: citations

Related nodes: related articles (cf. Google Scholar)

Example (Wikipedia)

Nodes: articles

Edges: hyperlinks

Related nodes: related articles (= articles on semantically related topics)

Introduction Methods

Classical approaches

Classical approaches for finding related nodes (e.g. on the World Wide Web):

Based on the use of variants of PageRank on local subgraphs.

Text Mining techniques: cocitations, vector-space model...

Our approach

Use of a classical Markov chain tool: Green measures.

Introduction Contributions

Contributions

Our contributions:

1. A novel use of Green measures for extracting semantic information in a graph.

2. An extensive comparative study with classical approaches, on the English version of Wikipedia.

Remark

Only pure mathematical methods, no Wikipedia-specific tricks included.

Green measures

Outline

1 Introduction

2 Green measures

    Graphs as Markov chains

    Green measures

3 Methods Compared

4 Experiment on Wikipedia

5 Conclusion

Green measures Graphs as Markov chains

Graph = Markov chain

Definition (Markov chain)

Probabilistic process on a state space, defined by transition probabilities $p_{ij}$ from each state $i$ to each state $j$.

For a directed graph:

State space: set of nodes

Transition probabilities: based on existence (and weight) of edges

Remark

All graphs are assumed to be strongly connected, with the gcd of the lengths of all cycles equal to 1 (i.e. aperiodic).
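
A minimal sketch of how a weighted directed graph can be read as a Markov chain. It assumes that the transition probability from i to j is the weight of the edge i → j divided by the total weight of the edges leaving i; the slide only says that probabilities are "based on" edge existence and weight, so this normalization, the function name and the toy graph are illustrative assumptions.

```python
from collections import defaultdict


def transition_probabilities(edges):
    """Map weighted directed edges (i, j, weight) to probabilities p[i][j]."""
    out_weight = defaultdict(float)
    for i, _j, w in edges:
        out_weight[i] += w
    p = defaultdict(dict)
    for i, j, w in edges:
        # Parallel edges accumulate, so link multiplicity acts as a weight.
        p[i][j] = p[i].get(j, 0.0) + w / out_weight[i]
    return dict(p)


# Hypothetical toy graph: strongly connected, with cycles of length 2
# (1 -> 2 -> 1) and 3 (1 -> 2 -> 3 -> 1), so the gcd of cycle lengths is 1.
EDGES = [(1, 2, 1), (2, 1, 1), (2, 3, 2), (3, 1, 1)]
P = transition_probabilities(EDGES)
```

Each row of `P` sums to 1, which is what makes it a valid set of transition probabilities.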

Equilibrium measure

Definition (Measure)

Assignment of a real number to each state.

Definition (Propagation operator)

Operator which maps a measure $\mu$ to a measure $\mu'$ computed as:

$$\mu'_j = \sum_i \mu_i \, p_{ij}$$

Result

If we iterate the propagation operator from any measure summing to 1, we converge to a unique equilibrium measure (PageRank with no random jumps).
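
The result above can be sketched as a plain power iteration. The transition probabilities `P` below belong to a hypothetical three-node graph (strongly connected and aperiodic), and the tolerance and iteration cap are arbitrary choices, not values from the talk.

```python
def propagate(mu, p):
    """One application of the propagation operator: mu'_j = sum_i mu_i * p_ij."""
    new = {j: 0.0 for j in p}
    for i, row in p.items():
        for j, pij in row.items():
            new[j] += mu[i] * pij
    return new


def equilibrium(p, tol=1e-12, max_iter=10_000):
    """Iterate the propagation operator from the uniform measure."""
    mu = {i: 1.0 / len(p) for i in p}
    for _ in range(max_iter):
        nxt = propagate(mu, p)
        if max(abs(nxt[i] - mu[i]) for i in p) < tol:
            return nxt
        mu = nxt
    return mu


# Hypothetical toy chain; every node must appear as a key of P,
# which holds as soon as the graph is strongly connected.
P = {1: {2: 1.0}, 2: {1: 1 / 3, 3: 2 / 3}, 3: {1: 1.0}}
nu = equilibrium(P)
```

For this toy chain the equilibrium measure can be checked by hand: it is (3/8, 3/8, 1/4) on nodes 1, 2, 3.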

[Figure: PageRank on the 10-node example graph, iterations 1 to 14. The measure starts uniform (0.100 on every node) and converges to the equilibrium measure, with values ranging from 0.019 to 0.234.]

Green measures Green measures

Background on Green measures

Green functions

Come from electrostatic theory (potential created by a charge distribution).

Analogy between electrostatic potential theory and Markov chains.

Green measures: discrete analogues of Green functions.

Definition of Green measures

Definition (Green measure centered at node i)

Unique fixed point of the operator on measures defined by:

$$\mu_j \mapsto \sum_k \mu_k \, p_{kj} + (\delta_{ij} - \nu_j), \qquad \text{where } \delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$

and $\nu$ is the equilibrium measure.

Interpretations

PageRank with source at $i$: standard PageRank computation while, at each iteration, adding 1 to the measure of $i$ and subtracting $\nu_j$ from the measure of every node $j$.

Time spent at a node, knowing that the initial node is $i$.
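
The "PageRank with source" reading can be sketched directly: start from the zero measure and repeatedly propagate, add 1 at the center node and subtract the equilibrium measure. The toy chain, the fixed iteration counts and the function names are assumptions made for illustration only.

```python
def propagate(mu, p):
    """mu'_j = sum_k mu_k * p_kj."""
    new = {j: 0.0 for j in p}
    for k, row in p.items():
        for j, pkj in row.items():
            new[j] += mu[k] * pkj
    return new


def equilibrium(p, iters=500):
    """Equilibrium measure by iterating propagation from the uniform measure."""
    mu = {j: 1.0 / len(p) for j in p}
    for _ in range(iters):
        mu = propagate(mu, p)
    return mu


def green_measure(p, nu, i, iters=500):
    """Iterate mu <- propagate(mu) + (delta_i - nu), starting from zero."""
    mu = {j: 0.0 for j in p}
    for _ in range(iters):
        mu = propagate(mu, p)
        for j in p:
            mu[j] += (1.0 if j == i else 0.0) - nu[j]
    return mu


# Hypothetical toy chain (strongly connected, aperiodic).
P = {1: {2: 1.0}, 2: {1: 1 / 3, 3: 2 / 3}, 3: {1: 1.0}}
NU = equilibrium(P)
G1 = green_measure(P, NU, i=1)
```

Each added term sums to zero, so every iterate sums to zero as well. On this toy chain, solving the fixed-point equations by hand gives 21/64, -3/64 and -18/64 for nodes 1, 2 and 3, and the center node receives the largest value.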

Methods Compared

Outline

1 Introduction

2 Green measures

3 Methods Compared

    Green

    SymGreen

    PageRankOfLinks

    LocalPageRank

    Cosine

    Cocitations

4 Experiment on Wikipedia

5 Conclusion

Purpose

Finding nodes in the graph related to i.

For each method, output an ordered list of nodes related to i.

Each method provides a similarity score to i.

Methods Compared Green

Green — Method Description

Method Description

Direct application of the theory of Green measures.

Improvement: multiplication by a term favoring uncommon nodes, $\log(1/\nu_j)$ (quantity of information).

Iteration until reasonable convergence on the top results.
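
A possible sketch of the three points above, on a hypothetical toy chain. The stopping rule used here (the top-k order staying unchanged for a few consecutive iterations) is only one way to read "reasonable convergence on the top results"; `k`, `stable_rounds` and the other names are illustrative assumptions.

```python
import math


def propagate(mu, p):
    """mu'_j = sum_k mu_k * p_kj."""
    new = {j: 0.0 for j in p}
    for k, row in p.items():
        for j, pkj in row.items():
            new[j] += mu[k] * pkj
    return new


def green_ranking(p, nu, center, k=3, stable_rounds=5, max_iter=1000):
    """Rank nodes by Green measure times log(1/nu_j).

    Stops once the top-k order has stayed the same for `stable_rounds`
    consecutive iterations (an assumed convergence criterion).
    """
    mu = {j: 0.0 for j in p}
    last_top, stable = None, 0
    top, scores = [], {}
    for _ in range(max_iter):
        mu = propagate(mu, p)
        for j in p:
            mu[j] += (1.0 if j == center else 0.0) - nu[j]
        scores = {j: mu[j] * math.log(1.0 / nu[j]) for j in p}
        top = sorted(scores, key=scores.get, reverse=True)[:k]
        stable = stable + 1 if top == last_top else 0
        if stable >= stable_rounds:
            break
        last_top = top
    return top, scores


# Hypothetical toy chain and its equilibrium measure (3/8, 3/8, 1/4).
P = {1: {2: 1.0}, 2: {1: 1 / 3, 3: 2 / 3}, 3: {1: 1.0}}
NU = {1: 0.375, 2: 0.375, 3: 0.25}
top, scores = green_ranking(P, NU, center=1)
```

Since every $\nu_j$ is below 1, the factor $\log(1/\nu_j)$ is positive and larger for nodes with a small equilibrium measure.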

[Figure: Green on the example graph, iterations 1 to 20. The first iterate is $\delta_{ij} - \nu_j$ (0.935 on the center node, negative elsewhere); after 20 iterations the center node scores 0.893, one other node scores 0.332, and all remaining nodes are negative.]

Methods Compared SymGreen

SymGreen — Method Description

Method Description

Green only follows links forward, which may be a limitation.

Symmetrize the graph, in a canonical way with respect to the equilibrium measure:

$$\tilde{p}_{ij} = \frac{1}{2} \left( p_{ij} + p_{ji} \, \frac{\nu_j}{\nu_i} \right)$$

The resulting graph has the same equilibrium measure.

Same as Green on this symmetrized graph.
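
The symmetrization can be sketched as follows, assuming the transition probabilities and the equilibrium measure are already available. The toy chain `P` and its equilibrium measure `NU` are hypothetical values chosen for illustration.

```python
def symmetrize(p, nu):
    """Return p~ with p~_ij = (p_ij + p_ji * nu_j / nu_i) / 2."""
    sym = {i: {} for i in p}
    for i, row in p.items():
        for j, pij in row.items():
            # Forward half of the edge i -> j.
            sym[i][j] = sym[i].get(j, 0.0) + pij / 2.0
            # The reverse direction j -> i receives p_ij * nu_i / nu_j.
            sym[j][i] = sym[j].get(i, 0.0) + pij * nu[i] / nu[j] / 2.0
    return sym


# Hypothetical toy chain and its equilibrium measure (3/8, 3/8, 1/4).
P = {1: {2: 1.0}, 2: {1: 1 / 3, 3: 2 / 3}, 3: {1: 1.0}}
NU = {1: 0.375, 2: 0.375, 3: 0.25}
SYM = symmetrize(P, NU)
```

Because `NU` is the equilibrium measure of `P`, each row of `SYM` still sums to 1, and `NU[i] * SYM[i][j] == NU[j] * SYM[j][i]` up to rounding, which is why the equilibrium measure is preserved.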

[Figure: SymGreen, the original example graph and its symmetrized version]

[Figure: SymGreen on the symmetrized graph, iterations 1 to 22. The first iterate is $\delta_{ij} - \nu_j$ (0.935 on the center node); after 22 iterations the center node scores 1.016 and the next highest values are 0.321 and 0.101.]

Methods Compared PageRankOfLinks

PageRankOfLinks — Method Description

Method Description

Related nodes: nodes pointed to by i.

Sort the nodes pointed to by i by their equilibrium measure value.
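
As a small illustration, assuming the equilibrium measure has already been computed and that "sort" means decreasing order (the slide does not say which), the method reduces to one line. The toy data below are hypothetical.

```python
def pagerank_of_links(p, nu, i):
    """Nodes pointed to by i, sorted by decreasing equilibrium measure."""
    return sorted(p[i], key=lambda j: nu[j], reverse=True)


# Hypothetical toy chain and its equilibrium measure.
P = {1: {2: 1.0}, 2: {1: 1 / 3, 3: 2 / 3}, 3: {1: 1.0}}
NU = {1: 0.375, 2: 0.375, 3: 0.25}
related = pagerank_of_links(P, NU, i=2)
```

Note that the ranking ignores how strongly i links to each target: only the global equilibrium measure of the targets matters.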

[Figure: PageRankOfLinks on the example graph; each node is labeled with its equilibrium measure value (from 0.019 to 0.233).]

Methods Compared LocalPageRank

LocalPageRank — Method Description

Method Description

Compute a neighborhood subgraph of i.

Compute the strongly connected component of this subgraph containing i.

Order the nodes by their equilibrium measure in this component.

(Approach widely used on the World Wide Web).
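
The three steps above can be sketched as follows. The slide does not define the neighborhood, so this sketch assumes it is the set of nodes within `radius` hops of i, following links in either direction; the toy graph, the parameter names and the fixed iteration count are also illustrative assumptions. The power iteration only converges if the component is aperiodic.

```python
from collections import deque


def reachable(adj, start, allowed, max_depth=None):
    """Breadth-first search from `start`, restricted to `allowed` nodes."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth == max_depth:
            continue
        for nxt in adj.get(node, ()):
            if nxt in allowed and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen


def local_pagerank(weights, i, radius=2, iters=500):
    nodes = set(weights) | {v for row in weights.values() for v in row}
    succ = {u: set(weights.get(u, {})) for u in nodes}
    pred = {u: set() for u in nodes}
    for u, row in weights.items():
        for v in row:
            pred[v].add(u)
    both = {u: succ[u] | pred[u] for u in nodes}
    # Step 1: neighborhood subgraph (assumed: within `radius` hops).
    hood = reachable(both, i, nodes, max_depth=radius)
    # Step 2: strongly connected component of i inside the neighborhood.
    comp = reachable(succ, i, hood) & reachable(pred, i, hood)
    if len(comp) == 1:
        return [i], {i: 1.0}
    # Step 3: equilibrium measure of the chain restricted to the component.
    p = {}
    for u in comp:
        row = {v: w for v, w in weights[u].items() if v in comp}
        total = sum(row.values())
        p[u] = {v: w / total for v, w in row.items()}
    mu = {u: 1.0 / len(comp) for u in comp}
    for _ in range(iters):
        nxt = {u: 0.0 for u in comp}
        for u, row in p.items():
            for v, puv in row.items():
                nxt[v] += mu[u] * puv
        mu = nxt
    return sorted(comp, key=mu.get, reverse=True), mu


# Hypothetical weighted graph: node -> {target: link multiplicity}.
WEIGHTS = {1: {2: 1}, 2: {1: 1, 3: 2}, 3: {1: 2, 2: 1}, 4: {1: 1}, 5: {4: 1}}
ranking, measure = local_pagerank(WEIGHTS, i=1, radius=1)
```

In this toy run, node 5 is outside the radius-1 neighborhood and node 4 is dropped because no node of the neighborhood links back to it, so the ranking only covers nodes 1, 2 and 3.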

[Figure: LocalPageRank, from the original example graph to a local subgraph, then to its strongly connected component containing the chosen node]

[Figure: LocalPageRank on the 8-node local strongly connected subgraph, iterations 1 to 21. The measure starts uniform (0.125 on every node) and converges to values between 0.040 and 0.320.]

Methods Compared Cosine

Cosine — Method Description

Method Description

Vector-space model: nodes are both documents and dimensions.

Coordinate of node $i$ along dimension $j$: term frequency/inverse document frequency (tf-idf) weight of the edge between $i$ and $j$.

Score of similarity between $i$ and $j$: cosine of the corresponding vectors:

$$\cos(x^{(i)}, x^{(j)}) = \frac{\sum_k x^{(i)}_k \, x^{(j)}_k}{\sqrt{\sum_k \bigl(x^{(i)}_k\bigr)^2} \, \sqrt{\sum_k \bigl(x^{(j)}_k\bigr)^2}}$$

Classical Text Mining technique, more often used with textual words as dimensions.
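
A minimal sketch of this method. The slide does not spell out the tf-idf weighting, so the code assumes one common variant: term frequency is the multiplicity of the link from i to j, and inverse document frequency is log(N / number of nodes linking to j). The toy graph and function names are hypothetical.

```python
import math


def tfidf_vectors(weights):
    """Sparse tf-idf vector of each node over the nodes it links to."""
    nodes = set(weights) | {j for row in weights.values() for j in row}
    n = len(nodes)
    df = {}  # number of distinct nodes linking to each target
    for row in weights.values():
        for j in row:
            df[j] = df.get(j, 0) + 1
    return {
        i: {j: w * math.log(n / df[j]) for j, w in row.items()}
        for i, row in weights.items()
    }


def cosine(x, y):
    """Cosine of two sparse vectors stored as dicts."""
    dot = sum(x[k] * y[k] for k in x.keys() & y.keys())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


# Hypothetical graph: node -> {target: link multiplicity}.
WEIGHTS = {1: {3: 1, 4: 1}, 2: {3: 1, 4: 1}, 5: {3: 1}}
VEC = tfidf_vectors(WEIGHTS)
same = cosine(VEC[1], VEC[2])
partial = cosine(VEC[1], VEC[5])
```

Nodes 1 and 2 link to exactly the same targets, so their cosine is 1 up to rounding, while node 5 shares only one target with node 1 and gets a score strictly between 0 and 1.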

[Figure: Cosine, illustrated on the example graph]

Cosine

Dimensions: 1, 2, 3, 4, 6, 7, 9, 10

Document   Marked dimensions (number of X)   Cosine with 9
1          1                                 0.40
2          4                                 0.43
4          1                                 0.40
6          3                                 0.09
8          3                                 0.13
9          2                                 1.00

Methods Compared Cocitations

Cocitations — Method Description

Method Description

Related nodes: nodes that are pointed to by the same nodes.

Score of similarity between i and j: number of nodes pointing to both.

Classically used in bibliometrics.
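
The cocitation score can be sketched in a few lines. The link structure below is a hypothetical example, with each node mapped to the set of nodes it points to.

```python
def cocitation_scores(links, i):
    """For each node j != i, count the nodes that point to both i and j."""
    scores = {}
    for _source, targets in links.items():
        if i in targets:
            for j in targets:
                if j != i:
                    scores[j] = scores.get(j, 0) + 1
    return scores


# Hypothetical graph: node -> set of nodes it points to.
LINKS = {1: {3, 4}, 2: {3, 4, 5}, 6: {4, 5}}
scores = cocitation_scores(LINKS, i=3)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here nodes 1 and 2 both point to 3 and 4, and only node 2 points to 3 and 5, so 4 is ranked before 5.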

[Figure: Cocitations, illustrated on the example graph]

Experiment on Wikipedia

Outline

1 Introduction

2 Green measures

3 Methods Compared

4 Experiment on Wikipedia

    Wikipedia graph

    Evaluation

    Results

5 Conclusion

Experiment on Wikipedia Wikipedia graph

The graph of Wikipedia

Extraction

Extraction of the graph from a full XML dump of Wikipedia.

Multiplicity of links used as weight of edges.

Redirections and categories handled consistently with the Web interface.

Only the most used templates are expanded, in order not to “pollute” the graph too much.

Statistics

1,606,896 nodes (as of September 25th, 2006).

38,896,462 edges.

95% of the nodes belong to the largest strongly connected component.
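
To illustrate how link multiplicity can serve as an edge weight, here is a rough sketch that counts internal links in a piece of wikitext. It is not the extraction pipeline used for the experiment: it ignores redirects, categories, templates and namespace prefixes, and the regular expression and sample text are assumptions made for this example.

```python
import re
from collections import Counter

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]],
# capturing only the target title.
LINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def link_weights(wikitext):
    """Count how many times each target is linked from a page's wikitext."""
    return Counter(m.group(1).strip() for m in LINK.finditer(wikitext))


# Hypothetical page content.
TEXT = (
    "[[Toulouse]] is a city in [[France]]. See also "
    "[[France|the French Republic]] and [[Toulouse#History]]."
)
weights = link_weights(TEXT)
```

Each count can then be used as the weight of the edge from the page to the linked article.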

Experiment on Wikipedia Evaluation

Evaluation methodology

Blind evaluation of the methods.

Articles selected for their diversity:

Clique (graph theory)

Germany

Hungarian language

Pierre de Fermat

Star Wars

Theory of relativity

1989

66 evaluators asked to give a mark to each list of words.

(161)

Experiment on Wikipedia Evaluation

Output on Pierre de Fermat

Green: 1. Pierre de Fermat; 2. Toulouse; 3. Fermat’s Last Theorem; 4. Diophantine equation; 5. Fermat’s little theorem; 6. Fermat number; 7. Grandes écoles; 8. Blaise Pascal; 9. France; 10. Pseudoprime

SymGreen: 1. Pierre de Fermat; 2. Mathematics; 3. Probability theory; 4. Fermat’s Last Theorem; 5. Number theory; 6. Toulouse; 7. Diophantine equation; 8. Blaise Pascal; 9. Fermat’s little theorem; 10. Calculus

PageRankOfLinks: 1. France; 2. 17th century; 3. March 4; 4. January 12; 5. August 17; 6. Calculus; 7. Lawyer; 8. 1660; 9. Number theory; 10. René Descartes

Cosine: 1. Pierre de Fermat; 2. ENSICA; 3. Fermat’s theorem; 4. International School of Toulouse; 5. École Nationale Supérieure d’Électronique, d’Électrotechnique, d’Informatique, d’Hydraulique, et de Télécommunications; 6. Languedoc; 7. Hélène Pince; 8. Community of Agglomeration of Greater Toulouse; 9. Lilhac; 10. Institut d’études politiques de Toulouse

Cocitations: 1. Pierre de Fermat; 2. Leonhard Euler; 3. Mathematics; 4. René Descartes; 5. Mathematician; 6. Gottfried Leibniz; 7. Calculus; 8. Isaac Newton; 9. Blaise Pascal; 10. Carl Friedrich Gauss

(162)

Experiment on Wikipedia Evaluation

Output on Germany

Green: 1. Germany; 2. Berlin; 3. German language; 4. Christian Democratic Union (Germany); 5. Austria; 6. Hamburg; 7. German reunification; 8. Social Democratic Party of Germany; 9. German Empire; 10. German Democratic Republic

SymGreen: 1. Germany; 2. Berlin; 3. France; 4. Austria; 5. German language; 6. Bavaria; 7. World War II; 8. German Democratic Republic; 9. European Union; 10. Hamburg

PageRankOfLinks: 1. United States; 2. United Kingdom; 3. France; 4. 2005; 5. Germany; 6. World War II; 7. Canada; 8. English language; 9. Japan; 10. Italy

Cosine: 1. Germany; 2. History of Germany since 1945; 3. History of Germany; 4. Timeline of German history; 5. States of Germany; 6. Politics of Germany; 7. List of Germany-related topics; 8. Hildesheimer Rabbinical Seminary; 9. Pleasure Victim; 10. German Unity Day

Cocitations: 1. Germany; 2. United States; 3. France; 4. United Kingdom; 5. World War II; 6. Italy; 7. Netherlands; 8. Japan; 9. 2005; 10. Category:Living people

(163)

Experiment on Wikipedia Results

Results

(164)

Experiment on Wikipedia Results

Discussion of the results

PageRankOfLinks (like LocalPageRank): usually very bad.

Cosine: highly variable, good performance on some articles, poor on others, many false positives in all cases.

Green & SymGreen: similar performance, with a notable advantage for Green on these examples (may change).

Green: good robustness, very few false positives.

Green & SymGreen: the only methods to find some important semantic relations: Finnish language for Hungarian language, Berlin Wall or The Satanic Verses (novel) for 1989...

(169)

Conclusion

Outline

1 Introduction

2 Green measures

3 Methods Compared

4 Experiment on Wikipedia

5 Conclusion

Summary

Perspectives

(170)

Conclusion Summary

Summary

Green measures: a tool for extracting semantic information in a graph.

In comparison to other methods, in the case of Wikipedia:

Better overall performance.

Robustness.

Discovery of relevant semantic relations.
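
As a reminder of what such a measure computes, here is a minimal sketch. It assumes the usual definition G_ij = sum over t >= 0 of ((P^t)_ij - nu_j), where P is the transition matrix of the random walk on the graph and nu its equilibrium measure. The function names, the toy chain and the truncation of the sum after a fixed number of steps are illustrative assumptions, not the implementation used in the experiment.

```python
def step(mu, P):
    """One step of the Markov chain: returns the row vector mu * P."""
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]


def equilibrium(P, iters=1000):
    """Approximate the equilibrium measure by power iteration.

    Assumes the chain is irreducible and aperiodic, so iteration converges.
    """
    nu = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        nu = step(nu, P)
    return nu


def green_measure(P, source, steps=200):
    """Truncated Green measure centred at `source`.

    Accumulates (distribution after t steps started at source) - nu
    for t = 0 .. steps - 1.
    """
    n = len(P)
    nu = equilibrium(P)
    mu = [0.0] * n
    mu[source] = 1.0  # Dirac mass at the source node
    green = [0.0] * n
    for _ in range(steps):
        for j in range(n):
            green[j] += mu[j] - nu[j]
        mu = step(mu, P)
    return green


# Toy 3-state chain (made up for illustration), rows sum to 1.
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
g = green_measure(P, source=0)
# One plausible way to rank candidates: nodes with the highest value first.
ranking = sorted(range(len(P)), key=lambda j: g[j], reverse=True)
```

Because every term of the sum is a difference of two probability measures, the entries of the result sum to zero (up to rounding), which is a cheap sanity check for an implementation.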

(175)

Conclusion Perspectives

Perspectives

Application to the Web graph.

Interpolation between Green and SymGreen.

Clustering using Green measures: currently impractical because of computation times.

Use of Green measures on other Markov chains, e.g. for computing authority scores.
