• Aucun résultat trouvé

Scalable, Generic, and Adaptive Systems for Focused Crawling

N/A
N/A
Protected

Academic year: 2022

Partager "Scalable, Generic, and Adaptive Systems for Focused Crawling"

Copied!
68
0
0

Texte intégral

(1)

Scalable, Generic, and Adaptive Systems for Focused Crawling

Georges Gouriten* - [email protected] Silviu Maniu°

(2)

What is focused crawling?

(3)

A directed graph

(4)

Web

Social network P2P

etc.

(5)

Weighted

3

5 0

4

0

0

3 5

3

4 2

2 3

(6)

Let u be a node,

β(u) = count of the word Bhutan in all the tweets of u

(7)

Even more weighted

2 3

0

0 0

0 1

0 1

0 1

3

0 0

0

0

(8)

Let (u, v) be an edge,

α(u) = count of the word Bhutan in all the tweets of u mentioning v

(9)

3

5 0

4

0

0

3 5

3

4 2

2 3

3

5 0

4

0

0

3 5

3

4 2

2 3

2 3

0

0 0

0 1

0 1

0 1

3

0 0

0

0

The total graph

(10)

3

5 0

4

0

0

3 5

3

4 2

2 3

3

5 0

4

0

0

3 5

3

4 2

2 3

2 3

0

0 0

0 1

0 1

0 1

3

0 0

0

0

A seed list

(11)

The frontier

3

5 0

4

0

0

3 5

3

4 2

2 3

3

5 0

4

0

0

3 5

3

4 2

2 3

2 3

0

0 0

0 1

0 1

0 1

3

0 0

0

0

(12)

Crawling one node

3

5 0

4

0

0

3 5

3

4 2

2 3

3

5 0

4

0

0

3 5

3

4 2

2 3

2 3

0

0 0

0 1

0 1

0 1

3

0 0

0

0

(13)

A crawl sequence

Let V0 be the seed list, a set of nodes, a crawl sequence, starting from V0, is

{ vi, vi in frontier(V0 U {v0, v1, .. , vi-1}) }

(14)

Goal of a focused crawler

Produce crawl sequences with

global scores (sum) as high as possible

(15)

A high-level algorithm

Estimate scores at the frontier Pick a node from the frontier

Crawl the node

(16)

Supposing a perfect estimator

(17)

Finding an optimal crawl sequence offline:

NP-hard

Greedy wins for a crawled graph > 1000 nodes Refresh rate of 1 is better

(18)

Estimation in practice

(19)

Different kinds of estimators

(20)

bfs

3

5 0

4

0

0

3 5

3

4 2

2 3

(21)

bfs

3

5 0

4

0

0

3 5

3

4 2

2 3

(22)

bfs

(23)

nr

navigational rank

score propagation from the ancestors of a node

(24)

nr

(25)

opic

online page importance computation

~ online pageRank computation

(26)

opic

2. ->

(27)

Open spaces in the state-of-the-art

nr has a quadratic complexity opic focus on popularity

the rest is about how to score

(28)

First-level neighboorhood

(29)

Second-level neighboorhood

(30)

Neighborhood-based estimators

(31)

deg, e, n, ne

deg: number of neighbors e: sum of incoming edges n: sum of incoming nodes

ne: sum of incoming (node*edge)s

(32)

Linear regressions

(33)

Multi-armed bandits (1)

slot machine

1

slot machine

2

slot machine

3

slot machine

4 ...

(34)

Multi-armed bandits (2)

Budget n, how to maximize the reward?

Balance exploration and exploitation

(35)

Applied to focused crawling

Slot machines: estimators

Reward: score of the top node

(36)

mab_ε

probability 1-ε: slot machine with the highest average reward

probability ε: random slot machine

(37)

mab_ε-first

steps [0, ε x N]: random slot machine

steps [ε x N+1, N]: slot machine with the

highest average reward

(38)

mab_var

Succession of ε-first strategies, with a reset every r steps, r varying with the context

(39)

Their running times

(40)

Expected running times

Twitter API for one week:

- 3s

- 200,000 nodes

One domain website for one week:

- 1s

(41)

Experimental framework (1)

(42)

Experimental framework (2)

─ Graph score 10 seed graphs 1 seed graph:

50 seeds picked randomly among non-zero β Arithmetic average of the crawl scores (sum)

(43)

Datasets and code are online

http://netiru.fr/research/14fc

(44)

To measure the running times

Same crawl sequence: the oracle Storage in RAM (20G)

3.6 GHz

(45)

The running times (ms)

(46)

nr

(47)

Their precision

(48)

The precision

Same crawl sequence: the oracle

Precision: distance of the top node to the actual top node

Arithmetically averaged over a window of 1000

(49)

For bretagne

(50)

Their ability to lead crawls

(51)

Leading the crawl

Different crawl sequences:

defined by the top estimated nodes

(52)

Average graph scores for France

(53)

The multi armed-bandits

(54)

All the estimators

(55)

Conclusion

(56)

What we learnt

Generic model

NP-hardness offline Refresh rate of 1

Greedy

(57)

Future work

Approximation of the optimal score Distributed crawl

Recrawling nodes

(58)

Thank you.

[email protected]

(59)

Finding the optimal crawl sequences

in a known graph

(60)

PTime many-one reduction from the LST-Graph problem

Problem remains hard if nodes, not edges, are weighted

A subtree rooted at r is seen as a crawl sequence starting from r

Free edges are added to the graph to allow free

(61)

Rich friends will make you richer

(62)

The greedy strategy

Node picked = argmax(β(v)), v in frontier

(63)

Is not always optimal

2

1 2

3

4 20

12

(64)

The altered greedy strategy

Node picked = probability q: argmax(β(v))

probability 1-q: random v so that,

max(β(u)) - β(v) <= ζ x max(β(u))

(65)

Altered greedy vs greedy for jazz

(66)

The refresh rate disadvantage

(67)

When estimation takes too long

(68)

The score degradation (%) at different steps

Références

Documents relatifs

For the learning phase membership queries are used to conjecture models of a software system, whereas equivalence queries are used for testing, by comparing respective conjectured

Section 6 describes two proposals of DS strate- gies with respect to time for GSP NLMS, LMS, and RLS algorithms; constraint parameters related to these DS strategies are also

In order to avoid the implementation and configuration of an omniscient organisation component, the software objects, representing hardware components, are organized within

The transcoding from multi-view to single view video would afford the decoding of compressed 3D video bitstreams on 2D video decoders such as the laptop and personal

Motivated by the iterative power method, used in Numerical Analysis for estimating singular values and singular vectors, we develop RLS and LMS subspace based adaptive algorithms

fusion boils down to parallel composition for contracts attached to two different sub-components of a same compound component, whereas contracts attached to a same component

(iii) Finally, to propose a blind hill climbing optimizer for adaptive partitioning, in sub-Section III-B. This module implements arguably the simplest heuristic solution to

In particular, the Flexcode method originally proposes to transfer successful procedures for high dimensional regression to the conditional density estimation setting by