Finding Related Pages Using Green Measures:
The Example of Wikipedia
Yann Ollivier 1,2 Pierre Senellart 3,4
1 2 3 4
Gemo Seminar
November 10th, 2006
Introduction Related nodes in a graph
Related nodes in a graph
Given a hyperlinked environment (= a graph)...
2 1 3
6
7 9
4
5
10
8
Problem
Finding nodes semantically related to some given node.
Introduction Related nodes in a graph
Related nodes in a graph
Given a hyperlinked environment (= a graph)...
2 1 3
6
7 9
4
5
10
8
Problem
Finding nodes semantically related to some given node.
Introduction Related nodes in a graph
Example of related nodes
Example (World Wide Web) Nodes: Web pages
Edges: hyperlinks
Related nodes: similar/related pages (cf Google)
Example (Publications) Nodes: publications
Edges: citations
Related nodes: related articles (cf Google Scholar)
Example (Wikipedia) Nodes: articles
Edges: hyperlinks
Related nodes: related articles (= articles on semantically related topics)
Introduction Related nodes in a graph
Example of related nodes
Example (World Wide Web) Nodes: Web pages
Edges: hyperlinks
Related nodes: similar/related pages (cf Google)
Example (Publications) Nodes: publications
Edges: citations
Related nodes: related articles (cf Google Scholar)
Example (Wikipedia) Nodes: articles
Edges: hyperlinks
Related nodes: related articles (= articles on semantically related topics)
Introduction Related nodes in a graph
Example of related nodes
Example (World Wide Web) Nodes: Web pages
Edges: hyperlinks
Related nodes: similar/related pages (cf Google)
Example (Publications) Nodes: publications
Edges: citations
Related nodes: related articles (cf Google Scholar)
Example (Wikipedia) Nodes: articles
Edges: hyperlinks
Related nodes: related articles (= articles on semantically related topics)
Introduction Related nodes in a graph
Example of related nodes
Example (World Wide Web) Nodes: Web pages
Edges: hyperlinks
Related nodes: similar/related pages (cf Google)
Example (Publications) Nodes: publications
Edges: citations
Related nodes: related articles (cf Google Scholar)
Example (Wikipedia) Nodes: articles
Edges: hyperlinks
Related nodes: related articles (= articles on semantically related topics)
Introduction Methods
Classical approaches
Classical approached for finding related nodes (e.g. on the World Wide Web):
Based on the use of variants of PageRank on local subgraphs. Text Mining techniques : cocitations, vector-space model...
Our approach
Use of a classical Markov chain tool: Green measures.
Introduction Methods
Classical approaches
Classical approached for finding related nodes (e.g. on the World Wide Web):
Based on the use of variants of PageRank on local subgraphs.
Text Mining techniques : cocitations, vector-space model...
Our approach
Use of a classical Markov chain tool: Green measures.
Introduction Methods
Classical approaches
Classical approached for finding related nodes (e.g. on the World Wide Web):
Based on the use of variants of PageRank on local subgraphs.
Text Mining techniques : cocitations, vector-space model...
Our approach
Use of a classical Markov chain tool: Green measures.
Introduction Methods
Classical approaches
Classical approached for finding related nodes (e.g. on the World Wide Web):
Based on the use of variants of PageRank on local subgraphs.
Text Mining techniques : cocitations, vector-space model...
Our approach
Use of a classical Markov chain tool: Green measures.
Introduction Contributions
Contributions
Our contributions:
1
A novel use of Green measures for extracting semantic information in a graph.
2
An extensive comparative study with classical approaches, on the English version of Wikipedia.
Remark
Only pure mathematical methods, no Wikipedia-specific tricks included.
Introduction Contributions
Contributions
Our contributions:
1
A novel use of Green measures for extracting semantic information in a graph.
2
An extensive comparative study with classical approaches, on the English version of Wikipedia.
Remark
Only pure mathematical methods, no Wikipedia-specific tricks included.
Introduction Contributions
Contributions
Our contributions:
1
A novel use of Green measures for extracting semantic information in a graph.
2
An extensive comparative study with classical approaches, on the English version of Wikipedia.
Remark
Only pure mathematical methods, no Wikipedia-specific tricks included.
Introduction Contributions
Contributions
Our contributions:
1
A novel use of Green measures for extracting semantic information in a graph.
2
An extensive comparative study with classical approaches, on the English version of Wikipedia.
Remark
Only pure mathematical methods, no Wikipedia-specific tricks included.
Green measures
Outline
1 Introduction
2 Green measures
Graphs as Markov chains Green measures
3 Methods Compared
4 Experiment on Wikipedia
5 Conclusion
Green measures Graphs as Markov chains
Graph = Markov chain
Definition (Markov chain)
Probabilistic process on a state space, defined by transition probabilities p ij from each state i to each state j .
For a directed graph: State space: set of nodes
Transition probabilities: based on existence (and weight) of edges
Remark
All graphs will be supposed strongly connected and with gcd of length of
all cycles equal to 1.
Green measures Graphs as Markov chains
Graph = Markov chain
Definition (Markov chain)
Probabilistic process on a state space, defined by transition probabilities p ij from each state i to each state j .
For a directed graph:
State space: set of nodes
Transition probabilities: based on existence (and weight) of edges
Remark
All graphs will be supposed strongly connected and with gcd of length of
all cycles equal to 1.
Green measures Graphs as Markov chains
Graph = Markov chain
Definition (Markov chain)
Probabilistic process on a state space, defined by transition probabilities p ij from each state i to each state j .
For a directed graph:
State space: set of nodes
Transition probabilities: based on existence (and weight) of edges
Remark
All graphs will be supposed strongly connected and with gcd of length of
all cycles equal to 1.
Green measures Graphs as Markov chains
Graph = Markov chain
Definition (Markov chain)
Probabilistic process on a state space, defined by transition probabilities p ij from each state i to each state j .
For a directed graph:
State space: set of nodes
Transition probabilities: based on existence (and weight) of edges
Remark
All graphs will be supposed strongly connected and with gcd of length of
all cycles equal to 1.
Green measures Graphs as Markov chains
Equilibrium measure
Definition (Measure)
Assignments of real numbers to the state set.
Definition (Propagation operator)
Operator which maps a measure µ to a measure µ 0 computed as:
µ 0 j = X
i
(µ i p ij )
Result
If we iterate the propagation operator from any measure summing to 1, we
converge to a unique equilibrium measure. (PageRank with no random
jumps).
Green measures Graphs as Markov chains
Equilibrium measure
Definition (Measure)
Assignments of real numbers to the state set.
Definition (Propagation operator)
Operator which maps a measure µ to a measure µ 0 computed as:
µ 0 j = X
i
(µ i p ij )
Result
If we iterate the propagation operator from any measure summing to 1, we
converge to a unique equilibrium measure. (PageRank with no random
jumps).
Green measures Graphs as Markov chains
Equilibrium measure
Definition (Measure)
Assignments of real numbers to the state set.
Definition (Propagation operator)
Operator which maps a measure µ to a measure µ 0 computed as:
µ 0 j = X
i
(µ i p ij )
Result
If we iterate the propagation operator from any measure summing to 1, we
converge to a unique equilibrium measure. (PageRank with no random
jumps).
Green measures Graphs as Markov chains
PageRank — Iteration ]1
0.100 0.100
0.100
0.100
0.100 0.100
0.100
0.100
0.100
Green measures Graphs as Markov chains
PageRank — Iteration ]2
0.033 0.317
0.075
0.108
0.025 0.058
0.083
0.150
0.117
0.033
Green measures Graphs as Markov chains
PageRank — Iteration ]3
0.036 0.193
0.108
0.163
0.079 0.090
0.074
0.154
0.008
Green measures Graphs as Markov chains
PageRank — Iteration ]4
0.054 0.212
0.093
0.152
0.048 0.051
0.108
0.149
0.106
0.026
Green measures Graphs as Markov chains
PageRank — Iteration ]5
0.051 0.247
0.078
0.143
0.053 0.062
0.097
0.153
0.016
Green measures Graphs as Markov chains
PageRank — Iteration ]6
0.048 0.232
0.093
0.156
0.062 0.067
0.087
0.138
0.099
0.018
Green measures Graphs as Markov chains
PageRank — Iteration ]7
0.052 0.226
0.092
0.148
0.058 0.064
0.098
0.146
0.021
Green measures Graphs as Markov chains
PageRank — Iteration ]8
0.049 0.238
0.088
0.149
0.057 0.063
0.095
0.141
0.099
0.019
Green measures Graphs as Markov chains
PageRank — Iteration ]9
0.050 0.232
0.091
0.149
0.060 0.066
0.094
0.143
0.019
Green measures Graphs as Markov chains
PageRank — Iteration ]10
0.050 0.233
0.091
0.150
0.058 0.064
0.095
0.142
0.098
0.020
Green measures Graphs as Markov chains
PageRank — Iteration ]11
0.050 0.234
0.090
0.148
0.058 0.065
0.095
0.143
0.019
Green measures Graphs as Markov chains
PageRank — Iteration ]12
0.049 0.233
0.091
0.149
0.058 0.065
0.095
0.142
0.098
0.019
Green measures Graphs as Markov chains
PageRank — Iteration ]13
0.050 0.233
0.091
0.149
0.058 0.065
0.095
0.143
0.019
Green measures Graphs as Markov chains
PageRank — Iteration ]14
0.050 0.234
0.091
0.149
0.058 0.065
0.095
0.142
0.097
0.019
Green measures Green measures
Background on Green measures
Green functions
Come from electrostatic theory (potential created by a charge distribution).
Analogy between electrostatic potential theory and Markov chains.
Green measures: discrete analogues of Green functions.
Green measures Green measures
Background on Green measures
Green functions
Come from electrostatic theory (potential created by a charge distribution).
Analogy between electrostatic potential theory and Markov chains.
Green measures: discrete analogues of Green functions.
Green measures Green measures
Background on Green measures
Green functions
Come from electrostatic theory (potential created by a charge distribution).
Analogy between electrostatic potential theory and Markov chains.
Green measures: discrete analogues of Green functions.
Green measures Green measures
Definition of Green measures
Definition (Green measure centered at node i)
Only fixed point of the operator on measures defined by:
µ j 7→ P
k (µ k p kj ) + (δ ij − ν j ) where δ ij =
1 if i = j 0 otherwise
Interpretations
PageRank with source at i: standard PageRank computation while, at each iteration, adding 1 to the measure of i, and subtracting ν j to every node j .
Time spent at a node knowing the initial node is i.
Green measures Green measures
Definition of Green measures
Definition (Green measure centered at node i)
Only fixed point of the operator on measures defined by:
µ j 7→ P
k (µ k p kj ) + (δ ij − ν j ) where δ ij =
1 if i = j 0 otherwise
Interpretations
PageRank with source at i: standard PageRank computation while, at each iteration, adding 1 to the measure of i, and subtracting ν j to every node j .
Time spent at a node knowing the initial node is i.
Green measures Green measures
Definition of Green measures
Definition (Green measure centered at node i)
Only fixed point of the operator on measures defined by:
µ j 7→ P
k (µ k p kj ) + (δ ij − ν j ) where δ ij =
1 if i = j 0 otherwise
Interpretations
PageRank with source at i: standard PageRank computation while, at each iteration, adding 1 to the measure of i, and subtracting ν j to every node j .
Time spent at a node knowing the initial node is i.
Green measures Green measures
Definition of Green measures
Definition (Green measure centered at node i)
Only fixed point of the operator on measures defined by:
µ j 7→ P
k (µ k p kj ) + (δ ij − ν j ) where δ ij =
1 if i = j 0 otherwise
Interpretations
PageRank with source at i: standard PageRank computation while, at each iteration, adding 1 to the measure of i, and subtracting ν j to every node j .
Time spent at a node knowing the initial node is i.
Methods Compared
Outline
1 Introduction
2 Green measures
3 Methods Compared Green
SymGreen
PageRankOfLinks LocalPageRank Cosine
Cocitations
4 Experiment on Wikipedia
5 Conclusion
Methods Compared
Purpose
Finding nodes in the graph related to i.
For each method, output an ordered list of nodes related to i.
Each method provides a similarity score to i.
Methods Compared
Purpose
Finding nodes in the graph related to i.
For each method, output an ordered list of nodes related to i.
Each method provides a similarity score to i.
Methods Compared
Purpose
Finding nodes in the graph related to i.
For each method, output an ordered list of nodes related to i.
Each method provides a similarity score to i.
Methods Compared Green
Green — Method Description
Method Description
Direct application of the theory of Green measures.
Improvement: multiplication by a term favoring uncommon nodes log(1/ν j ) (quantity of information).
Iteration until reasonable convergence on the top results.
Methods Compared Green
Green — Method Description
Method Description
Direct application of the theory of Green measures.
Improvement: multiplication by a term favoring uncommon nodes log(1/ν j ) (quantity of information).
Iteration until reasonable convergence on the top results.
Methods Compared Green
Green — Method Description
Method Description
Direct application of the theory of Green measures.
Improvement: multiplication by a term favoring uncommon nodes log(1/ν j ) (quantity of information).
Iteration until reasonable convergence on the top results.
Methods Compared Green
Green — Iteration ]1
-0.050 -0.233
-0.091
-0.149
-0.058 0.935
-0.095
-0.143
-0.019
Methods Compared Green
Green — Iteration ]2
-0.099 0.034
0.319
-0.298
-0.117 0.870
-0.190
-0.285
-0.194
-0.039
Methods Compared Green
Green — Iteration ]3
-0.149 -0.199
0.353
-0.322
-0.050 0.930
-0.035
-0.178
-0.058
Methods Compared Green
Green — Iteration ]4
-0.157 -0.078
0.325
-0.304
-0.109 0.865
-0.026
-0.258
-0.222
-0.036
Methods Compared Green
Green — Iteration ]5
-0.151 -0.096
0.322
-0.333
-0.078 0.903
-0.034
-0.203
-0.055
Methods Compared Green
Green — Iteration ]6
-0.161 -0.096
0.337
-0.300
-0.083 0.892
-0.045
-0.256
-0.242
-0.045
Methods Compared Green
Green — Iteration ]7
-0.150 -0.108
0.331
-0.328
-0.083 0.895
-0.027
-0.218
-0.047
Methods Compared Green
Green — Iteration ]8
-0.159 -0.086
0.330
-0.312
-0.086 0.892
-0.039
-0.246
-0.248
-0.047
Methods Compared Green
Green — Iteration ]9
-0.154 -0.104
0.334
-0.321
-0.081 0.897
-0.034
-0.228
-0.048
Methods Compared Green
Green — Iteration ]10
-0.157 -0.094
0.332
-0.314
-0.085 0.892
-0.035
-0.241
-0.252
-0.046
Methods Compared Green
Green — Iteration ]11
-0.155 -0.098
0.332
-0.320
-0.083 0.895
-0.034
-0.231
-0.047
Methods Compared Green
Green — Iteration ]12
-0.157 -0.095
0.332
-0.315
-0.084 0.893
-0.036
-0.239
-0.253
-0.046
Methods Compared Green
Green — Iteration ]13
-0.155 -0.098
0.332
-0.318
-0.084 0.894
-0.034
-0.234
-0.047
Methods Compared Green
Green — Iteration ]14
-0.156 -0.095
0.332
-0.316
-0.084 0.893
-0.035
-0.238
-0.254
-0.046
Methods Compared Green
Green — Iteration ]15
-0.156 -0.097
0.332
-0.318
-0.084 0.894
-0.034
-0.235
-0.047
Methods Compared Green
Green — Iteration ]16
-0.156 -0.096
0.332
-0.316
-0.084 0.893
-0.035
-0.237
-0.254
-0.046
Methods Compared Green
Green — Iteration ]17
-0.156 -0.097
0.332
-0.317
-0.084 0.894
-0.034
-0.236
-0.046
Methods Compared Green
Green — Iteration ]18
-0.156 -0.096
0.332
-0.316
-0.084 0.893
-0.034
-0.237
-0.254
-0.046
Methods Compared Green
Green — Iteration ]19
-0.156 -0.096
0.332
-0.317
-0.084 0.893
-0.034
-0.236
-0.046
Methods Compared Green
Green — Iteration ]20
-0.156 -0.096
0.332
-0.316
-0.085 0.893
-0.034
-0.237
-0.254
-0.046
Methods Compared SymGreen
SymGreen — Method Description
Method Description
Green goes only forward, may be a limitation.
Symmetrize the graph, in a canonical sense in relation to the equilibrium measure:
˜ p ij = 1
2 p ij + p ji ν j
ν i
The resulting graph has the same equilibrium measure.
Same as Green on this symmetrized graph.
Methods Compared SymGreen
SymGreen — Method Description
Method Description
Green goes only forward, may be a limitation.
Symmetrize the graph, in a canonical sense in relation to the equilibrium measure:
˜ p ij = 1
2 p ij + p ji ν j
ν i
The resulting graph has the same equilibrium measure.
Same as Green on this symmetrized graph.
Methods Compared SymGreen
SymGreen — Method Description
Method Description
Green goes only forward, may be a limitation.
Symmetrize the graph, in a canonical sense in relation to the equilibrium measure:
˜ p ij = 1
2 p ij + p ji ν j
ν i
The resulting graph has the same equilibrium measure.
Same as Green on this symmetrized graph.
Methods Compared SymGreen
SymGreen — Original Graph
2 1 3
6
7 9
4
5
10
8
Methods Compared SymGreen
SymGreen — Symmetrized Graph
2 1
6 3
4
7
8 9
5
Methods Compared SymGreen
SymGreen — Iteration ]1
-0.050 -0.233
-0.149 -0.091
-0.095
-0.058
-0.019 0.935
-0.143
-0.097
Methods Compared SymGreen
SymGreen — Iteration ]2
-0.099 0.233
-0.298 0.069
-0.190
-0.117
0.011 0.870
-0.285
Methods Compared SymGreen
SymGreen — Iteration ]3
-0.075 0.088
-0.285 0.065
-0.080
-0.063
0.001 0.994
-0.366
-0.283
Methods Compared SymGreen
SymGreen — Iteration ]4
-0.088 0.271
-0.288 0.092
-0.108
-0.094
0.012 0.964
-0.441
Methods Compared SymGreen
SymGreen — Iteration ]5
-0.069 0.223
-0.284 0.088
-0.065
-0.070
0.006 1.006
-0.469
-0.370
Methods Compared SymGreen
SymGreen — Iteration ]6
-0.073 0.295
-0.277 0.099
-0.075
-0.083
0.010 0.995
-0.511
Methods Compared SymGreen
SymGreen — Iteration ]7
-0.065 0.280
-0.278 0.096
-0.056
-0.073
0.008 1.011
-0.519
-0.410
Methods Compared SymGreen
SymGreen — Iteration ]8
-0.066 0.309
-0.273 0.101
-0.060
-0.079
0.009 1.008
-0.542
Methods Compared SymGreen
SymGreen — Iteration ]9
-0.062 0.304
-0.274 0.099
-0.052
-0.075
0.008 1.014
-0.542
-0.427
Methods Compared SymGreen
SymGreen — Iteration ]10
-0.063 0.316
-0.271 0.102
-0.054
-0.077
0.009 1.013
-0.556
Methods Compared SymGreen
SymGreen — Iteration ]11
-0.061 0.315
-0.273 0.101
-0.050
-0.075
0.009 1.016
-0.554
-0.435
Methods Compared SymGreen
SymGreen — Iteration ]12
-0.062 0.319
-0.270 0.102
-0.051
-0.077
0.009 1.015
-0.562
Methods Compared SymGreen
SymGreen — Iteration ]13
-0.061 0.319
-0.272 0.101
-0.049
-0.076
0.009 1.016
-0.560
-0.439
Methods Compared SymGreen
SymGreen — Iteration ]14
-0.061 0.321
-0.270 0.102
-0.050
-0.077
0.009 1.016
-0.565
Methods Compared SymGreen
SymGreen — Iteration ]15
-0.061 0.321
-0.271 0.102
-0.049
-0.076
0.009 1.016
-0.563
-0.440
Methods Compared SymGreen
SymGreen — Iteration ]16
-0.061 0.321
-0.270 0.102
-0.049
-0.077
0.009 1.016
-0.566
Methods Compared SymGreen
SymGreen — Iteration ]17
-0.061 0.321
-0.271 0.102
-0.049
-0.076
0.009 1.016
-0.565
-0.440
Methods Compared SymGreen
SymGreen — Iteration ]18
-0.061 0.321
-0.270 0.102
-0.049
-0.077
0.009 1.016
-0.567
Methods Compared SymGreen
SymGreen — Iteration ]19
-0.061 0.321
-0.271 0.101
-0.049
-0.077
0.009 1.016
-0.566
-0.441
Methods Compared SymGreen
SymGreen — Iteration ]20
-0.061 0.321
-0.271 0.102
-0.049
-0.077
0.009 1.016
-0.567
Methods Compared SymGreen
SymGreen — Iteration ]21
-0.061 0.321
-0.271 0.101
-0.049
-0.077
0.009 1.016
-0.567
-0.441
Methods Compared SymGreen
SymGreen — Iteration ]22
-0.061 0.321
-0.271 0.101
-0.049
-0.077
0.009 1.016
-0.568
Methods Compared PageRankOfLinks
PageRankOfLinks — Method Description
Method Description
Related nodes: nodes pointed by i.
Sort the nodes pointed by i by their equilibrium measure value.
Methods Compared PageRankOfLinks
PageRankOfLinks — Method Description
Method Description
Related nodes: nodes pointed by i.
Sort the nodes pointed by i by their equilibrium measure value.
Methods Compared PageRankOfLinks
PageRankOfLinks
0.050 0.233
0.091
0.149
0.058 0.065
0.095
0.143
0.097
0.019
Methods Compared PageRankOfLinks
PageRankOfLinks
0.050 0.233
0.091
0.149
0.058 0.065
0.095
0.143
0.019
Methods Compared PageRankOfLinks
PageRankOfLinks
0.050 0.233
0.091
0.149
0.058 0.065
0.095
0.143
0.097
0.019
Methods Compared LocalPageRank
LocalPageRank — Method Description
Method Description
Compute a neighborhood subgraph of i .
Compute the strongly connected component of this subgraph containing i .
Order the nodes by their equilibrium measure in this component.
(Approach widely used on the World Wide Web).
Methods Compared LocalPageRank
LocalPageRank — Method Description
Method Description
Compute a neighborhood subgraph of i .
Compute the strongly connected component of this subgraph containing i .
Order the nodes by their equilibrium measure in this component.
(Approach widely used on the World Wide Web).
Methods Compared LocalPageRank
LocalPageRank — Method Description
Method Description
Compute a neighborhood subgraph of i .
Compute the strongly connected component of this subgraph containing i .
Order the nodes by their equilibrium measure in this component.
(Approach widely used on the World Wide Web).
Methods Compared LocalPageRank
LocalPageRank — Method Description
Method Description
Compute a neighborhood subgraph of i .
Compute the strongly connected component of this subgraph containing i .
Order the nodes by their equilibrium measure in this component.
(Approach widely used on the World Wide Web).
Methods Compared LocalPageRank
LocalPageRank — Original Graph
2 1 3
6
7 9
4
5
8
Methods Compared LocalPageRank
LocalPageRank — Local Subgraph
2 1 3
6
7 9
4
5
10
8
Methods Compared LocalPageRank
LocalPageRank — Local Subgraph
2 1 3
6
7 9
4
5
8
Methods Compared LocalPageRank
LocalPageRank — Local Subgraph
2 1 3
6
7 9
4
5
10
8
Methods Compared LocalPageRank
LocalPageRank — Local Subgraph
2 1 3
6
7 9
4
5
8
Methods Compared LocalPageRank
LocalPageRank — Local Subgraph
2 1 3
6
7 9
4
5
10
8
Methods Compared LocalPageRank
LocalPageRank — Local Subgraph
2 1 3
6
7 9
4
5
8
Methods Compared LocalPageRank
LocalPageRank — Local Strongly Connected Subgraph
2 1 3
6
7 9
4
5
10
8
Methods Compared LocalPageRank
LocalPageRank — Local Strongly Connected Subgraph
2 1 3
6
7 9
4
5
8
Methods Compared LocalPageRank
LocalPageRank — Iteration ]1
0.125 0.125
0.125
0.125
0.125 0.125
0.125
0.125
Methods Compared LocalPageRank
LocalPageRank — Iteration ]2
0.042 0.417
0.094
0.094
0.031 0.094
0.167
0.063
Methods Compared LocalPageRank
LocalPageRank — Iteration ]3
0.031 0.318
0.151
0.120
0.104 0.135
0.125
0.016
Methods Compared LocalPageRank
LocalPageRank — Iteration ]4
0.040 0.272
0.147
0.132
0.079 0.087
0.191
0.052
Methods Compared LocalPageRank
LocalPageRank — Iteration ]5
0.044 0.344
0.112
0.108
0.068 0.094
0.191
0.040
Methods Compared LocalPageRank
LocalPageRank — Iteration ]6
0.036 0.338
0.133
0.120
0.086 0.106
0.147
0.034
Methods Compared LocalPageRank
LocalPageRank — Iteration ]7
0.040 0.293
0.137
0.127
0.084 0.101
0.173
0.043
Methods Compared LocalPageRank
LocalPageRank — Iteration ]8
0.042 0.328
0.124
0.116
0.073 0.095
0.180
0.042
Methods Compared LocalPageRank
LocalPageRank — Iteration ]9
0.039 0.329
0.129
0.119
0.082 0.103
0.163
0.037
Methods Compared LocalPageRank
LocalPageRank — Iteration ]10
0.040 0.310
0.134
0.123
0.082 0.101
0.169
0.041
Methods Compared LocalPageRank
LocalPageRank — Iteration ]11
0.041 0.320
0.128
0.119
0.078 0.098
0.175
0.041
Methods Compared LocalPageRank
LocalPageRank — Iteration ]12
0.040 0.325
0.129
0.119
0.080 0.101
0.168
0.039
Methods Compared LocalPageRank
LocalPageRank — Iteration ]13
0.040 0.316
0.132
0.121
0.081 0.101
0.169
0.040
Methods Compared LocalPageRank
LocalPageRank — Iteration ]14
0.040 0.319
0.129
0.120
0.079 0.099
0.172
0.041
Methods Compared LocalPageRank
LocalPageRank — Iteration ]15
0.040 0.322
0.129
0.119
0.080 0.100
0.169
0.040
Methods Compared LocalPageRank
LocalPageRank — Iteration ]16
0.040 0.319
0.131
0.121
0.081 0.100
0.169
0.040
Methods Compared LocalPageRank
LocalPageRank — Iteration ]17
0.040 0.319
0.130
0.120
0.080 0.100
0.171
0.040
Methods Compared LocalPageRank
LocalPageRank — Iteration ]18
0.040 0.321
0.130
0.120
0.080 0.100
0.170
0.040
Methods Compared LocalPageRank
LocalPageRank — Iteration ]19
0.040 0.320
0.130
0.120
0.080 0.100
0.170
0.040
Methods Compared LocalPageRank
LocalPageRank — Iteration ]20
0.040 0.320
0.130
0.120
0.080 0.100
0.170
0.040
Methods Compared LocalPageRank
LocalPageRank — Iteration ]21
0.040 0.320
0.130
0.120
0.080 0.100
0.170
0.040
Methods Compared Cosine
Cosine — Method Description
Method Description
Vector-space model: nodes are both documents and dimensions.
Coordinate of node i along dimension j : term frequency/inverse document frequency weight of the edge between i and j . Score of similarity between i and j : cosine of the corresponding vectors:
cos(x (i) , x (j ) ) =
P x k (i) x k (j) q
P x k (i) 2 q
P x k (j) 2
Classical Text Mining technique, more often used with textual words
as dimensions.
Methods Compared Cosine
Cosine — Method Description
Method Description
Vector-space model: nodes are both documents and dimensions.
Coordinate of node i along dimension j : term frequency/inverse document frequency weight of the edge between i and j .
Score of similarity between i and j : cosine of the corresponding vectors:
cos(x (i) , x (j ) ) =
P x k (i) x k (j) q
P x k (i) 2 q
P x k (j) 2
Classical Text Mining technique, more often used with textual words
as dimensions.
Methods Compared Cosine
Cosine — Method Description
Method Description
Vector-space model: nodes are both documents and dimensions.
Coordinate of node i along dimension j : term frequency/inverse document frequency weight of the edge between i and j . Score of similarity between i and j : cosine of the corresponding vectors:
cos(x (i) , x (j ) ) =
P x k (i) x k (j) q
P x k (i) 2 q
P x k (j) 2
Classical Text Mining technique, more often used with textual words
as dimensions.
Methods Compared Cosine
Cosine — Method Description
Method Description
Vector-space model: nodes are both documents and dimensions.
Coordinate of node i along dimension j : term frequency/inverse document frequency weight of the edge between i and j . Score of similarity between i and j : cosine of the corresponding vectors:
cos(x (i) , x (j ) ) =
P x k (i) x k (j) q
P x k (i) 2 q
P x k (j) 2
Classical Text Mining technique, more often used with textual words
as dimensions.
Methods Compared Cosine
Cosine
2 1 3
6
7 9
4
5
8
Methods Compared Cosine
Cosine
2 1 3
6
7 9
4
5
10
8
Methods Compared Cosine
Cosine
2 1 3
6
7 9
4
5
8
Methods Compared Cosine
Cosine
Dimensions
1 2 3 4 6 7 9 10 Cosine with 9
Do cuments
1 X 0.40
2 X X X X 0.43
4 X 0.40
6 X X X 0.09
8 X X X 0.13
9 X X 1.00
Methods Compared Cocitations
Cocitations — Method Description
Method Description
Related nodes: nodes that are pointed to by the same nodes.
Score of similarity between i and j : number of nodes pointing to both.
Classically used in bibliometrics.
Methods Compared Cocitations
Cocitations — Method Description
Method Description
Related nodes: nodes that are pointed to by the same nodes.
Score of similarity between i and j : number of nodes pointing to both.
Classically used in bibliometrics.
Methods Compared Cocitations
Cocitations — Method Description
Method Description
Related nodes: nodes that are pointed to by the same nodes.
Score of similarity between i and j : number of nodes pointing to both.
Classically used in bibliometrics.
Methods Compared Cocitations
Cocitations
2 1 3
6
7 9
4
5
10
8
Methods Compared Cocitations
Cocitations
2 1 3
6
7 9
4
5
8
Methods Compared Cocitations
Cocitations
2 1 3
6
7 9
4
5
10
8
Experiment on Wikipedia
Outline
1 Introduction
2 Green measures
3 Methods Compared
4 Experiment on Wikipedia Wikipedia graph Evaluation Results
5 Conclusion
Experiment on Wikipedia Wikipedia graph
The graph of Wikipedia
Extraction
Extraction of the graph from a full XML dump of Wikipedia.
Multiplicity of links used as weight of edges.
Redirections and Categories dealt with consistently with the Web interface.
Only most used Templates are expanded, in order not to “pollute” too much the graph.
Statistics
1, 606, 896 nodes (as of September 25th, 2006). 38, 896, 462 edges.
95% of the nodes belong to the largest strongly connected component.
Experiment on Wikipedia Wikipedia graph
The graph of Wikipedia
Extraction
Extraction of the graph from a full XML dump of Wikipedia.
Multiplicity of links used as weight of edges.
Redirections and Categories dealt with consistently with the Web interface.
Only most used Templates are expanded, in order not to “pollute” too much the graph.
Statistics
1, 606, 896 nodes (as of September 25th, 2006). 38, 896, 462 edges.
95% of the nodes belong to the largest strongly connected component.
Experiment on Wikipedia Wikipedia graph
The graph of Wikipedia
Extraction
Extraction of the graph from a full XML dump of Wikipedia.
Multiplicity of links used as weight of edges.
Redirections and Categories dealt with consistently with the Web interface.
Only most used Templates are expanded, in order not to “pollute” too much the graph.
Statistics
1, 606, 896 nodes (as of September 25th, 2006). 38, 896, 462 edges.
95% of the nodes belong to the largest strongly connected component.
Experiment on Wikipedia Wikipedia graph
The graph of Wikipedia
Extraction
Extraction of the graph from a full XML dump of Wikipedia.
Multiplicity of links used as weight of edges.
Redirections and Categories dealt with consistently with the Web interface.
Only most used Templates are expanded, in order not to “pollute” too much the graph.
Statistics
1, 606, 896 nodes (as of September 25th, 2006). 38, 896, 462 edges.
95% of the nodes belong to the largest strongly connected component.
Experiment on Wikipedia Wikipedia graph
The graph of Wikipedia
Extraction
Extraction of the graph from a full XML dump of Wikipedia.
Multiplicity of links used as weight of edges.
Redirections and Categories dealt with consistently with the Web interface.
Only most used Templates are expanded, in order not to “pollute” too much the graph.
Statistics
1, 606, 896 nodes (as of September 25th, 2006).
38, 896, 462 edges.
95% of the nodes belong to the largest strongly connected component.
Experiment on Wikipedia Evaluation
Evaluation methodology
Blind evaluation of the methods.
Articles selected for their diversity: Clique (graph theory)
Germany
Hungarian language Pierre de Fermat Star Wars
Theory of relativity 1989
66 evaluators asked to give a mark to each list of words.
Experiment on Wikipedia Evaluation
Evaluation methodology
Blind evaluation of the methods.
Articles selected for their diversity:
Clique (graph theory) Germany
Hungarian language Pierre de Fermat Star Wars
Theory of relativity 1989
66 evaluators asked to give a mark to each list of words.
Experiment on Wikipedia Evaluation
Evaluation methodology
Blind evaluation of the methods.
Articles selected for their diversity:
Clique (graph theory) Germany
Hungarian language Pierre de Fermat Star Wars
Theory of relativity 1989
66 evaluators asked to give a mark to each list of words.
Experiment on Wikipedia Evaluation
Output on Pierre de Fermat
Green SymGreen PageRankOfLinks Cosine Cocitations
1.Pierre de Fermat 2.Toulouse 3.Fermat’s Last
Theorem 4.Diophantine
equation 5.Fermat’s little
theorem 6.Fermat number 7.Grandes écoles 8.Blaise Pascal 9.France 10.Pseudoprime
1.Pierre de Fermat 2.Mathematics 3.Probability
theory 4.Fermat’s Last
Theorem 5.Number theory 6.Toulouse 7.Diophantine
equation 8.Blaise Pascal 9.Fermat’s little
theorem 10.Calculus
1.France 2.17th century 3.March 4 4.January 12 5.August 17 6.Calculus 7.Lawyer 8.1660 9.Number theory 10.René Descartes
1.Pierre de Fermat 2.ENSICA 3.Fermat’s theorem 4.International
School of Toulouse 5.École Nationale
Supérieure d’Électronique, d’Électrotechnique, d’Informatique, d’Hydraulique, et de Télécommuni- cations 6.Languedoc 7.Hélène Pince 8.Community of
Agglomeration of Greater Toulouse 9.Lilhac 10.Institut
d’études politiques de Toulouse
1.Pierre de Fermat 2.Leonhard Euler 3.Mathematics 4.René Descartes 5.Mathematician 6.Gottfried
Leibniz 7.Calculus 8.Isaac Newton 9.Blaise Pascal 10.Carl Friedrich
Gauss
Experiment on Wikipedia Evaluation
Output on Germany
Green SymGreen PageRankOfLinks Cosine Cocitations
1.Germany 2.Berlin 3.German
language 4.Christian
Democratic Union (Germany) 5.Austria 6.Hamburg 7.German
reunification 8.Social
Democratic Party of Germany 9.German Empire 10.German
Democratic Republic
1.Germany 2.Berlin 3.France 4.Austria 5.German
language 6.Bavaria 7.World War II 8.German
Democratic Republic 9.European Union 10.Hamburg
1.United States 2.United Kingdom 3.France 4.2005 5.Germany 6.World War II 7.Canada 8.English language 9.Japan 10.Italy
1.Germany 2.History of
Germany since 1945 3.History of
Germany 4.Timeline of
German history 5.States of
Germany 6.Politics of
Germany 7.List of
Germany-related topics 8.Hildesheimer
Rabbinical Seminary 9.Pleasure Victim 10.German Unity
Day
1.Germany 2.United States 3.France 4.United Kingdom 5.World War II 6.Italy 7.Netherlands 8.Japan 9.2005 10.Cate-
gory:Living people
Experiment on Wikipedia Results
Results
Experiment on Wikipedia Results
Discussion of the results
PageRankOfLinks (like LocalPageRank): usually very bad.
Cosine: very variable, good performance on some articles, poor on other, a lot of false positives in all cases.
Green & SymGreen: similar performances, a notable advantage for Green on these examples (may change).
Green: good robustness, very few false positives.
Green & SymGreen: only methods to find some important semantic
relations: Finnish language for Hungarian language, Berlin Wall or The
Satanic Verses (novel) for 1989. . .
Experiment on Wikipedia Results
Discussion of the results
PageRankOfLinks (like LocalPageRank): usually very bad.
Cosine: very variable, good performance on some articles, poor on other, a lot of false positives in all cases.
Green & SymGreen: similar performances, a notable advantage for Green on these examples (may change).
Green: good robustness, very few false positives.
Green & SymGreen: only methods to find some important semantic
relations: Finnish language for Hungarian language, Berlin Wall or The
Satanic Verses (novel) for 1989. . .
Experiment on Wikipedia Results
Discussion of the results
PageRankOfLinks (like LocalPageRank): usually very bad.
Cosine: very variable, good performance on some articles, poor on other, a lot of false positives in all cases.
Green & SymGreen: similar performances, a notable advantage for Green on these examples (may change).
Green: good robustness, very few false positives.
Green & SymGreen: only methods to find some important semantic
relations: Finnish language for Hungarian language, Berlin Wall or The
Satanic Verses (novel) for 1989. . .
Experiment on Wikipedia Results
Discussion of the results
PageRankOfLinks (like LocalPageRank): usually very bad.
Cosine: very variable, good performance on some articles, poor on other, a lot of false positives in all cases.
Green & SymGreen: similar performances, a notable advantage for Green on these examples (may change).
Green: good robustness, very few false positives.
Green & SymGreen: only methods to find some important semantic
relations: Finnish language for Hungarian language, Berlin Wall or The
Satanic Verses (novel) for 1989. . .
Experiment on Wikipedia Results
Discussion of the results
PageRankOfLinks (like LocalPageRank): usually very bad.
Cosine: very variable, good performance on some articles, poor on other, a lot of false positives in all cases.
Green & SymGreen: similar performances, a notable advantage for Green on these examples (may change).
Green: good robustness, very few false positives.
Green & SymGreen: only methods to find some important semantic
relations: Finnish language for Hungarian language, Berlin Wall or The
Satanic Verses (novel) for 1989. . .
Conclusion
Outline
1 Introduction
2 Green measures
3 Methods Compared
4 Experiment on Wikipedia
5 Conclusion
Summary
Perspectives
Conclusion Summary
Summary
Green measures: a tool for extracting semantic information in a graph.
In comparison to other methods, in the case of Wikipedia: Better overall performance.
Robustness.
Discovery of relevant semantic relations.
Conclusion Summary
Summary
Green measures: a tool for extracting semantic information in a graph.
In comparison to other methods, in the case of Wikipedia:
Better overall performance. Robustness.
Discovery of relevant semantic relations.
Conclusion Summary
Summary
Green measures: a tool for extracting semantic information in a graph.
In comparison to other methods, in the case of Wikipedia:
Better overall performance.
Robustness.
Discovery of relevant semantic relations.
Conclusion Summary
Summary
Green measures: a tool for extracting semantic information in a graph.
In comparison to other methods, in the case of Wikipedia:
Better overall performance.
Robustness.
Discovery of relevant semantic relations.
Conclusion Summary
Summary
Green measures: a tool for extracting semantic information in a graph.
In comparison to other methods, in the case of Wikipedia:
Better overall performance.
Robustness.
Discovery of relevant semantic relations.
Conclusion Perspectives
Perspectives
Application to the Web graph.
Interpolation between Green and SymGreen.
Clustering using Green measures: unpractical now because of computation times.
Use of Green measures on other Markov
chains, e.g. for computing authority
scores.
Conclusion Perspectives
Perspectives
Application to the Web graph.
Interpolation between Green and SymGreen.
Clustering using Green measures: unpractical now because of computation times.
Use of Green measures on other Markov
chains, e.g. for computing authority
scores.
Conclusion Perspectives
Perspectives
Application to the Web graph.
Interpolation between Green and SymGreen.
Clustering using Green measures:
unpractical now because of computation times.
Use of Green measures on other Markov
chains, e.g. for computing authority
scores.
Conclusion Perspectives