Identifying Websites with Flow Simulation

Pierre Senellart

École normale supérieure (Paris, France) and INRIA Futurs (Orsay, France)

International Conference on Web Engineering

28 July 2005

What is a website?

Simple idea: website = webserver. But:

Some websites span several webservers, or even several DNS domains.

Some webservers host several different websites.

The limits of a website are a subjective notion.

Why is it important?

Several applications:

Automatic archiving of websites (without asking content providers for the list of webpages belonging to their site).

SiteRank: a ranking measure for websites, as PageRank is for webpages.

Outline

1 Introduction

2 Flow Simulation

3 Seed Extension

4 Experiment

5 Conclusion

Maximum flow/Minimum cut

Traffic network.

Maximum flow ≡ Minimum cut.

[Figure: example traffic network from source to sink, with edge capacities 6, 2, 1, 5, 2, 3 and 4.]
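The equivalence stated above is the max-flow min-cut theorem. As a reminder, it can be written as follows; the symbols $f$, $c$, $V$, $S$ and $T$ do not appear in the slides and are introduced here only for illustration.

```latex
\max_{f \text{ feasible}} |f|
  \;=\;
\min_{\substack{S \subseteq V,\ \text{source} \in S \\ T = V \setminus S,\ \text{sink} \in T}}
  \;\sum_{u \in S,\ v \in T} c(u, v)
```

Here $|f|$ is the net flow leaving the source and $c(u, v)$ is the capacity of the edge from $u$ to $v$ (zero if there is no such edge).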

Preflow-Push algorithm

1. All nodes are assigned a height h: h(source) = N and, for every node k ≠ source, h(k) = 0 (N is the number of nodes).

2. Nodes with an overflow are visited, in any order. If possible, the flow is pushed toward a lower node, respecting the capacities of the edges. Otherwise, the height of the node is increased.

Theorem

The process converges, whatever the sequence of visited nodes may be. The maximum flow is obtained at the limit.
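The two steps above can be sketched in Python. This is a minimal illustration of the generic textbook preflow-push (push-relabel) method, not necessarily the exact variant used in the talk. The function name `preflow_push`, the dictionary-based graph encoding, the initial saturation of the edges leaving the source, the rule "push only to a neighbour exactly one level lower" and the relabelling rule are assumptions taken from the standard formulation; the slides say only that flow goes to a lower node and that the node is raised otherwise.

```python
from collections import defaultdict


def preflow_push(capacity, source, sink):
    """Generic preflow-push maximum flow (illustrative sketch).

    capacity: dict mapping (u, v) -> finite non-negative capacity of edge u -> v.
    Returns the value of a maximum flow from source to sink.
    """
    nodes = {source, sink}
    residual = defaultdict(int)   # residual[(u, v)]: capacity still usable on u -> v
    neighbours = defaultdict(set)
    for (u, v), c in capacity.items():
        nodes.update((u, v))
        residual[(u, v)] += c
        neighbours[u].add(v)
        neighbours[v].add(u)      # reverse direction, used to send flow back

    # Step 1: h(source) = N, h(k) = 0 for every other node.
    height = {k: 0 for k in nodes}
    height[source] = len(nodes)
    excess = {k: 0 for k in nodes}

    # Assumed initialisation: saturate every edge leaving the source.
    for v in neighbours[source]:
        c = residual[(source, v)]
        if c > 0:
            residual[(source, v)] -= c
            residual[(v, source)] += c
            excess[v] += c
            excess[source] -= c

    def overflowing():
        return [k for k in nodes if k not in (source, sink) and excess[k] > 0]

    # Step 2: visit overflowing nodes in any order.
    active = overflowing()
    while active:
        u = active[0]
        pushed = False
        for v in neighbours[u]:
            # Push toward a lower node, respecting the residual capacity.
            if residual[(u, v)] > 0 and height[u] == height[v] + 1:
                delta = min(excess[u], residual[(u, v)])
                residual[(u, v)] -= delta
                residual[(v, u)] += delta
                excess[u] -= delta
                excess[v] += delta
                pushed = True
                break
        if not pushed:
            # No admissible edge: raise u just above its lowest usable neighbour.
            height[u] = 1 + min(height[v] for v in neighbours[u]
                                if residual[(u, v)] > 0)
        active = overflowing()
    return excess[sink]
```

For example, on a hypothetical network with edges s→a (3), s→b (2), a→b (1), a→t (2) and b→t (3), `preflow_push` returns 5, which matches the capacity of the cut separating s from the other nodes. Recomputing the list of overflowing nodes at every step keeps the sketch short at the cost of efficiency.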

Adaptation to the World Wide Web

Website

Nodes of a traffic network delimited by a MaxFlow / MinCut.

Nodes: webpages, progressively crawled.

Edges: hyperlinks.

Capacities: edit distance between URLs.

A virtual source, linked to a seed set of pages by infinite-capacity edges.

A virtual sink, to which every node is linked by a very low capacity edge.
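The construction above can be sketched as follows. The slides state only that capacities come from the edit distance between URLs; how a distance is turned into a capacity is not given, so `capacity_of` is a hypothetical placeholder, as are `sink_capacity`, the node names `<source>` and `<sink>`, and the function names. The edit distance itself is the usual Levenshtein distance.

```python
SOURCE, SINK = "<source>", "<sink>"   # hypothetical names for the virtual nodes


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programme)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def build_network(links, seeds, sink_capacity=0.01,
                  capacity_of=lambda d: 1.0 / (1.0 + d)):
    """Return {(u, v): capacity} for the flow network described above.

    links: iterable of (from_url, to_url) hyperlinks found by the crawler.
    seeds: URLs attached to the virtual source.
    capacity_of: ASSUMPTION, maps an edit distance to a capacity.
    sink_capacity: ASSUMPTION, the "very low" capacity toward the sink.
    """
    capacity = {}
    pages = set(seeds)
    for u, v in links:
        pages.update((u, v))
        capacity[(u, v)] = capacity_of(edit_distance(u, v))
    for seed in seeds:
        capacity[(SOURCE, seed)] = float("inf")
    for page in pages:
        capacity[(page, SINK)] = sink_capacity
    return capacity
```

With a single link between two hypothetical URLs that differ by one character, the network contains four edges: the hyperlink, one edge from the virtual source to the seed, and one edge from each page to the virtual sink. In a progressive crawl the dictionary would be extended as new pages are fetched.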

Markov Clustering algorithm (MCL)

MCL: an off-line graph clustering algorithm.
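The slides do not describe MCL beyond this one line. For orientation, here is a minimal sketch of the algorithm as it is usually presented: a random-walk matrix is alternately squared (expansion) and raised entrywise to a power then renormalised (inflation), and clusters are read off the rows that keep non-zero entries. The function name, the dense list-of-lists representation and the parameter values (`inflation`, `iterations`, `threshold`) are assumptions made for illustration.

```python
def mcl(adjacency, inflation=2.0, iterations=50, threshold=1e-6):
    """Textbook Markov Clustering on a small dense matrix (illustrative sketch).

    adjacency: square list of lists of non-negative edge weights.
    Returns a sorted list of clusters, each a list of node indices.
    """
    n = len(adjacency)

    def normalise(matrix):
        # Make every column sum to 1 (column-stochastic matrix).
        for j in range(n):
            total = sum(matrix[i][j] for i in range(n))
            for i in range(n):
                matrix[i][j] /= total
        return matrix

    # Add self-loops, then normalise the columns.
    m = normalise([[adjacency[i][j] + (1.0 if i == j else 0.0)
                    for j in range(n)] for i in range(n)])

    for _ in range(iterations):
        # Expansion: square the matrix (two steps of the random walk).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: entrywise power, then renormalise the columns.
        m = normalise([[x ** inflation for x in row] for row in m])

    # Rows that keep non-zero entries act as attractors; the columns they
    # attract form one cluster.
    clusters = set()
    for i in range(n):
        members = tuple(j for j in range(n) if m[i][j] > threshold)
        if members:
            clusters.add(members)
    return sorted(list(c) for c in clusters)
```

A fixed number of iterations replaces a proper convergence test to keep the sketch short, and the dense matrix product limits it to small graphs.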

Flow simulation from MCL clusters

Process

1. MCL clustering of a large, a priori relevant, portion of the Web graph.

2. Identification of the most relevant cluster(s).

3. Flow simulation starting from this cluster.

Advantages over MCL alone

Dynamic discovery of clusters.

Uses the fact that the graph is directed.
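One way to read the last step: the pages of the chosen cluster are attached to the virtual source as seeds, a maximum flow is computed, and the website is taken to be the set of nodes on the source side of the resulting minimum cut. The sketch below illustrates that reading under stated assumptions. For brevity it swaps in the Edmonds-Karp augmenting-path method instead of preflow-push, and the function name and graph encoding are hypothetical.

```python
from collections import defaultdict, deque


def source_side_of_min_cut(capacity, source, sink):
    """Return the nodes on the source side of a minimum cut (source excluded).

    capacity: dict mapping (u, v) -> capacity of edge u -> v.
    Uses Edmonds-Karp (shortest augmenting paths), not preflow-push.
    """
    residual = defaultdict(float)
    neighbours = defaultdict(set)
    for (u, v), c in capacity.items():
        residual[(u, v)] += c
        neighbours[u].add(v)
        neighbours[v].add(u)

    def reachable():
        """Breadth-first search in the residual graph; returns {node: parent}."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbours[u]:
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    parent = reachable()
    while sink in parent:
        # Walk back from the sink to collect the augmenting path.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[edge] for edge in path)
        for u, v in path:
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
        parent = reachable()

    # Once the sink is unreachable, the reachable nodes form the source side.
    return set(parent) - {source}
```

On a toy network where the seed page a links strongly to b (capacity 5), b links weakly to c (capacity 0.5) and every page has a capacity-1 edge to the sink, the function returns {a, b}: the weak link is saturated, so c falls outside the cut.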

Experiment description

GEMO website identification

1. Crawl of a large part of *.inria.fr/*

2. MCL clustering of the obtained graph

3. Identification of the GEMO cluster

4. Flow simulation from this cluster

Results

Method / URL pattern                 Pages   Precision   Recall
Flow Simulation                          8       87.5%     1.3%
MCL                                    320       99.7%    33.0%
MCL + Flow Sim.                        788       90.4%    86.4%
http://www-rocq.inria.fr/verso/*       221      100%      44.4%
http://*.inria.fr/verso/*              683      100%      68.6%
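The slides do not define precision and recall. Assuming the usual definitions (the fraction of returned pages that belong to the site, and the fraction of the site's pages that are returned), they can be computed as in the sketch below. The function name and the page identifiers in the usage note are hypothetical.

```python
def precision_recall(found, relevant):
    """Precision and recall of a set of returned pages (usual definitions).

    found: pages returned by the method.
    relevant: pages that actually belong to the website (reference set).
    """
    found, relevant = set(found), set(relevant)
    correct = len(found & relevant)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall
```

For instance, returning 8 pages of which 7 belong to a 10-page reference set gives a precision of 0.875 and a recall of 0.7. The 87.5% precision reported for 8 pages in the first row is arithmetically consistent with 7 correct pages out of 8.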

Summary

Website: subjective, non-obvious but important notion.

Flow simulation used to discover the boundaries of a website.

Best results obtained by combining off-line graph clustering and on-line flow simulation.

Perspectives

On-line MCL computation.

Efficient crawling strategy.

Combination with semantic methods.
