• Aucun résultat trouvé

Crawling the Web

N/A
N/A
Protected

Academic year: 2022

Partager "Crawling the Web"

Copied!
57
0
0

Texte intégral

(1)

1/44

Timely Crawling of the Structured Web

Pierre Senellart

É C O L E N O R M A L E S U P É R I E U R E

RESEARCH UNIVERSITY PARIS

23 April 2018,TempWeb 2018

(2)

2/44

Crawling the Web

The World Wide Web: largest information repository

By nature (mostly) publicand freely accessibleinformation

Numerous applications require navigatingthe Web and retrieving Web contentin bulk:

Web search engines(e.g., Google, Bing), to create an index of Web content

Shopping aggregators(e.g., Google Shopping, Kelkoo), to build a database of products and their prices on different sites

Web archiving(e.g., Internet Archive), to preserve parts of the Web for future generations

(3)

3/44

Crawling ethics

Standard for robot exclusion: robots.txtat the root of a Web server [Koster, 1994]

User-agent: *

Allow: /searchhistory/

Disallow: /search

Per-page exclusion

<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">

Per-link exclusion

<a href="toto.html" rel="nofollow">Toto</a>

AvoidDenial Of Service (DOS), wait 1s between two repeated requests to the same Web server

(4)

4/44

Crawling Social Networks

Theoretically possible to crawl social networking sites using a regular Web crawler

Often not possible:

https://www.facebook.com/robots.txt

Oftenvery inefficient, considering politeness constraints

Better solution: Use provided social networkingAPIs https://developer.twitter.com/en/docs/

api-reference-index

https://developers.facebook.com/docs/graph-api/

reference/v2.1/

https://developer.linkedin.com/docs/rest-api https://developers.google.com/youtube/v3/

(5)

5/44

Social Networking APIs

Most social networking Web sites (and some other kinds of Web sites) provide APIs to effectively access their content

Usually a RESTfulAPI, occasionally SOAP-baed

Usually require atoken identifying the application using the API, sometimes a cryptographic signature as well

May access the API as an authenticated user of the social network, or as an external party

APIs seriously limit the rate of requests: https:

//dev.twitter.com/docs/api/1.1/get/search/tweets

(6)

6/44

Scope of a crawl

Web-scale

The Web is infinite! Avoid robot traps by putting depth or page numberlimitson each Web server

Focus onimportantpages [Abiteboul et al., 2003]

Web servers under a list of DNS domains: easy filtering of URLs

A given topic: focused crawling techniques[Chakrabarti et al., 1999, Diligenti et al., 2000]based on classifiers of Web page content and predictors of the interest of a link.

The national Web (cf.public deposit, national libraries):

what is this? [Abiteboul et al., 2002]

A given Web site: what is a Web site? [Senellart, 2005]

(7)

7/44 Introduction Crawling of Structured Websites Learning as You Crawl Conclusion

Importance of Timely Crawling

The Web is veryvolatile, with a typical half-life of URLs of a few years[Koehler, 2003]

For many purposes (archiving, analytics), a crawl quality can be measured by itstemporal coherence [Spaniol et al., 2009]

Ideally, Web pages pointed to by a Web page should be crawled at the same time. Unrealistic in practice.

Crawling takes timeand consumes resources:

Limited bandwidth, limiting computing power on the crawling side

Because of crawling ethics, crawling a 5 million page site takes around2 months!

Limitations of social networking APIsdrastic: on Twitter, using the Search API, at most 3 000 tweets per minute;

using the Timeline API, at most 20 000 tweets per minute. . .

(8)

7/44

Importance of Timely Crawling

The Web is veryvolatile, with a typical half-life of URLs of a few years[Koehler, 2003]

For many purposes (archiving, analytics), a crawl quality can be measured by itstemporal coherence [Spaniol et al., 2009]

Ideally, Web pages pointed to by a Web page should be crawled at the same time. Unrealistic in practice.

Crawling takes timeand consumes resources:

Limited bandwidth, limiting computing power on the crawling side

Because of crawling ethics, crawling a 5 million page site takes around2 months!

Limitations of social networking APIsdrastic: on Twitter, using the Search API, at most 3 000 tweets per minute;

using the Timeline API, at most 20 000 tweets per minute. . . 350 000new tweets per minute on average!

(9)

8/44

This Talk

Focus onlimiting the number of HTTP requests: avoiding crawling useless or redundant content

Two different scopes:

Crawling a givenstructuredWeb site (URLs under a given domain)

Focusedcrawling

Also applies in a large way to the crawl of non-Web graphs, such as social networking sites through APIs[Gouriten and Senellart, 2012]

(10)

9/44

Outline

Introduction

Crawling of Structured Websites The Structured Web

Adaptive Crawler Bot for Data Extraction (ACEBot) Experiments

Learning as You Crawl Conclusion

(11)

10/44

Outline

Introduction

Crawling of Structured Websites The Structured Web

Adaptive Crawler Bot for Data Extraction (ACEBot) Experiments

Learning as You Crawl Conclusion

(12)

11/44

The CMS-Based Web

40–50% of Web content may rely on template-basedcontent management systems(CMSs) [Gibson et al., 2005]

(13)

11/44

The CMS-Based Web

40–50% of Web content may rely on template-basedcontent management systems(CMSs) [Gibson et al., 2005]

(14)

12/44

Traditional crawling approach

A traditional Web crawler (such asHeritrix) crawls the Web in a conceptually very simple way.

Queue Man- agement

Page

Fetching Link Extraction

URL Selection

This approach does not take into account the nature of the Web site crawled, its structure, its template.

Unoptimal! Most Web sites have lots of redundancies:

several URLs host the same content, different ways to navigate within a Web site to the same content, etc.

(15)

13/44

Application-Aware Helper (AAH)

[Faheem and Senellart, 2013b,a]

with

Archivist Interface

Web application

detection module Indexing

module

Crawling module Web applications

to crawl

Web application adaptation

Module

Content extraction and annotation

module Crawled Web

pages with annotated Contents

3 2 1

5 6

8

9 12

4

RDF store Stored

WARC files

7 11 10

XML knowledge

base

What is crawled depends on the CMS the Web site is based on, but relies on ahand-written knowledge base of existing CMSs

(16)

14/44

AAH: Web application detection module

with

One main challenge in intelligent crawling and content extraction is to identify the Web application and then perform the best crawling strategyaccordingly.

Detecting Web application using:

1. URL patterns, 2. HTTP metadata, 3. textual content, 4. XPath patterns, etc.

For instance thevBulletin Web forum content management system, that can be identified by searching for a reference to a vbulletin_global.jsJavaScript script by using a simple //script/@srcXPath expression.

(17)

15/44

AAH: Adaptation to template change

with

Determine when a change has occurred wrt the template described in the knowledge base and adapting actions accordingly

Two cases:

Recrawl of a Web application Structural changes are detected by looking for the content in the archive. When structural change is detected, the system first marks the failed crawling actions and then aligns them using content from the archive.

Crawl of a new Web application If actions in the knowledge base do not match, the system collects all candidate attributes, values, tag names from the knowledge base, creates all possible combinations ofrelaxed expressions, and then evaluates them.

(18)

16/44

Desiderata

Fully automated systemfor Web crawling

Crawls the content of template-based Web sites with fewer requests than traditional Web crawlers

Automatically identifies interesting navigation patternsin a Web site

Adapts to the structure of new Web sites, based on unknown CMSs

(19)

17/44

Outline

Introduction

Crawling of Structured Websites The Structured Web

Adaptive Crawler Bot for Data Extraction (ACEBot) Experiments

Learning as You Crawl Conclusion

(20)

18/44

Structure-driven crawler

[Faheem and Senellart, 2015]

with

ACEBot (AdaptiveCrawler Botfor dataExtraction) is a fully automatic, structure-driven crawler.

ACEBotutilizes the inner structure of the Web pages and guides the crawling process based on the importance of their content.

ACEBot has two phases:

Offline phase: constructs a dynamic sitemap (limiting the number of URLs retrieved), learns the best traversal strategy based on importance of navigation patterns (selecting those leading to valuable content).

Online phase: performsmassivecrawling following the chosen navigation patterns.

(21)

19/44

Simple example: structure of a Web page

with

html

div div a table

thead table thead

tbody a tbody

a tbody

a

`1 html-table-thead-table- thead-tbody-a

`2 html-div-div-a

`3 html-table-thead-tbody-a

(22)

20/44

Simple example: structure of a Web site

with

p0 p4

2039 p1

239

p2 754

p3 3227

p5

`2 2600

`3

`1

`1 `4 p0 p3 p4

5107 p1

239

p2

754

p5

`2 2600

`3

`1 `4

(23)

21/44 Introduction Crawling of Structured Websites Learning as You Crawl Conclusion

Objective

with

Determine the navigation patternson a Web site that lead to most content items

Content items: can be anything, e.g., k-grams (succession of kwords)

Score of a navigation pattern: number of distinct items number of requests

Finding a set of navigation patterns that optimizes the score is coNP-hard.

)Greedyapproach, until sufficiently many content items have been retrieved

(24)

21/44 Introduction Crawling of Structured Websites Learning as You Crawl Conclusion

Objective

with

Determine the navigation patternson a Web site that lead to most content items

Content items: can be anything, e.g., k-grams (succession of kwords)

Score of a navigation pattern: number of distinct items number of requests Proposition

Finding a set of navigation patterns that optimizes the score is coNP-hard.

been retrieved

(25)

21/44

Objective

with

Determine the navigation patternson a Web site that lead to most content items

Content items: can be anything, e.g., k-grams (succession of kwords)

Score of a navigation pattern: number of distinct items number of requests Proposition

Finding a set of navigation patterns that optimizes the score is coNP-hard.

)Greedyapproach, until sufficiently many content items have been retrieved

(26)

22/44

Language of navigation patterns

with

Non-disjunctiveregular expressions over link paths, expressed asOXPath expressions [Furche et al., 2011] (language for Web navigation)

hexpri ::= "doc" "("hurli ")" (hestepi)+

hestepi ::= hstepi |hkleenei

hstepi ::= "/" "("hactioni ")"

hkleenei ::= "/" "(" hactioni ")" ( "∗" |hnumberi ) hactioni ::= ("/"hnodetesti)+ "{click/}"

hnodetesti ::= tag |"@" tag

(27)

23/44

Simple example: best navigation pattern

with

p0 p3 p4

5107 p1

239

p2

754 p5 2600

(`2,239)

(`3,754)

(`1,2553.5) (`,2600)4 NP total distinct score

2-grams 2-grams

`1 5266 5107 2553.5

`1; `4 7866 7214 2404.7

`3 754 754 754

`2 239 239 239

(28)

23/44

Simple example: best navigation pattern

with

p0 p3 p4

5107 p1

239

p2

754 p5 2600

(`2,239)

(`3,754)

(`1,2553.5) (`,2600)4 NP total distinct score

2-grams 2-grams

`1 5266 5107 2553.5

`1; `4 7866 7214 2404.7

`3 754 754 754

`2 239 239 239

(29)

24/44

ACEBot architecture

with

I

II I

Navigation Patterns Construction Sitemap

Construction

Navigation Patterns Selection

Crawler Repository

XML Knowledge

base

(30)

25/44

Outline

Introduction

Crawling of Structured Websites The Structured Web

Adaptive Crawler Bot for Data Extraction (ACEBot) Experiments

Learning as You Crawl Conclusion

(31)

26/44

Experiment setup

with

The experiments are performed with ACEBot,AAHand GNU wget

Crawled 50 Web sites (totaling nearly 2 million Web pages) with both ACEBotandwget

Crawled 10 Web sites (totaling nearly 0.5 million Web pages) with both ACEBotandAAH

Dynamic site map has been constructed with randomly chosen 3 000 Web pages

(32)

27/44

Performance metrics

with

The number of HTTP requests made by both systems vs the amount of useful contentretrieved

Coverage of useful content is calculated by comparing the proportion of:

distinct 2-grams

external links

in the crawl result of different systems.

(33)

28/44

Crawl efficiency

with

WordPress vBulletin phpBB other 0

10 20 30

Proportionof HTTP requests(%) ACE

AAH

(34)

29/44

Navigation pattern efficiency

with

0 2 4 6 8 10 12

0 2 4 6 8 10

Number of selected navigation patterns

NumberofHTTPrequests(1;000)

requests 0 20 40 60 80 100

Proportionofseen2-grams(%)

2-grams

(35)

30/44

Crawling a particular Web site

with

0 2;000 4;000 6;000

0 100 200 300

Number of HTTP requests

Numberofdistinct2-grams(1;000)

ACEAAH wget

(36)

31/44

Impact of the sitemap size

with

500 1;000 1;500 2;000 2;500 3;000 20

40 60 80 100

6 6

7 9

10 10

Size of site map

Proportionofseenn-grams(%)

(Labels are the number of navigation patterns learned)

(37)

32/44

Lessons learned

In brief:

It is possible to retrieve the significant content of a Web sitemuch more efficientlythan by crawling it whole

Comparable resultsbetween fully automatic approach and approach relying on hand-written knowledge base

Sample, learn, apply to whole Web site

Main limitation:

Requires a separate first crawl of a sitemap of fixed size

Is it possible:

Toadaptivelyselect the size of the sitemap?

To learn the site structureas it is being crawled?

(38)

33/44

Outline

Introduction

Crawling of Structured Websites Learning as You Crawl

Reinforcement learning RL for Focused Crawling Conclusion

(39)

34/44

Outline

Introduction

Crawling of Structured Websites Learning as You Crawl

Reinforcement learning RL for Focused Crawling Conclusion

(40)

35/44

Reinforcement learning

Deals with agents learning to interact with a partially unknown environment

Agents can perform actions, resulting inrewards (or penalties) and in a possiblestate change

Agents learn about the world by interacting with it

Goal: find the right sequence of actions that maximizes overall reward, or minimizes overall penalty

Classic tradeoff: exploration vs exploitation

(41)

36/44

Multi-armed bandits

Stateless model

k > 1 different actions, with unknown rewards

often assumed that the rewards are from a

parametrized probabilistic distribution (Bernoulli, Gaussian, exponential, etc.)

(42)

37/44

Markov decision process (MDP)

Finite set of states

In each state, actions lead:

to astate change

to areward

Depending on cases:

state changes may be deterministic or probabilistic

state changes may be known or unknown (to be learned)

rewards may be known or unknown (to be learned)

current state may even be unknown! (poMDP)

(43)

38/44

Outline

Introduction

Crawling of Structured Websites Learning as You Crawl

Reinforcement learning RL for Focused Crawling Conclusion

(44)

39/44

Abstracting Out Focused Crawling

As usual, the Web is seen as a graph

Nodes areweighted with scores that correspond to relevance to the topic of the crawl

Edges are weighted with scores indicating clues of the edge being relevant to the crawl

Scores are unknowntill the page is crawled

Goal: Maximizetotal score of Web pages crawledwithin a fixed budget

(45)

40/44 Introduction Crawling of Structured Websites Learning as You Crawl Conclusion

Predicting the Score of an Unknown Node

Variouspredictors can be used:

Distance with respect to a seed node

Average score of pages pointing to a node

Average score of pages pointed to by pages pointing to a node

More advanced classifiers using various features of the Web page

etc.

(46)

40/44

Predicting the Score of an Unknown Node

Variouspredictors can be used:

Distance with respect to a seed node

Average score of pages pointing to a node

Average score of pages pointed to by pages pointing to a node

More advanced classifiers using various features of the Web page

etc.

Which are the best predictorsfor a given crawling task?

(47)

41/44

Adaptive focused crawling (stateless)

[Gouriten et al., 2014]

with 3

5 0

4 0

0

3 5

3

4 2

2 3

3

5 0

4 0

0

3 5

3

4 2

2 3

2 3

0

0 0

0 1 0 1

0 1

3

0 0

0

0

Problem: Efficiently crawl Web pages such thattotal score is high

Challenge: The score of a node is unknown till it is crawled

Methodology: Use various predictors of node scores, and adaptively select the best one so far with multi-armed bandits

(48)

41/44

Adaptive focused crawling (stateless)

[Gouriten et al., 2014]

with 3

5 0

4 0

0

3 5

3

4 2

2 3

3

5 0

4 0

0

3 5

3

4 2

2 3

2 3

0

0 0

0 1 0 1

0 1

3

0 0

0

0

Problem: Efficiently crawl Web pages such thattotal score is high

Challenge: The score of a node is unknown till it is crawled

Methodology: Use various predictors of node scores, and adaptively select the best one so far with multi-armed bandits

(49)

41/44

Adaptive focused crawling (stateless)

[Gouriten et al., 2014]

with 3

5 0

4 0

0

3 5

3

4 2

2 3

3

5 0

4 0

0

3 5

3

4 2

2 3

2 3

0

0 0

0 1 0 1

0 1

3

0 0

0

0

Problem: Efficiently crawl Web pages such thattotal score is high

Challenge: The score of a node is unknown till it is crawled

Methodology: Use various predictors of node scores, and adaptively select the best one so far with multi-armed bandits

(50)

42/44

Adaptive focused crawling (stateful)

[Han et al., 2018]

with

Problem: Efficiently crawl nodes in a graph such that total score is high, taking into account

currently crawled graph

Challenge: Huge state space

Methodology: MDP,clusteringof state space, together withlinear approximation of value functions

(51)

43/44

Outline

Introduction

Crawling of Structured Websites Learning as You Crawl

Conclusion

(52)

44/44 Introduction Crawling of Structured Websites Learning as You Crawl Conclusion

Conclusion

Web, SNS crawling is a daunting task, especially under temporal constraints

Exploit the structureof the Web:

Web sites have specific structures that make them easier to crawl

Pages relevant to a specific topic are clustered together on the Web

Adaptivetechniques are possible, that learn how best to crawl a Web site or a particular keyword

(53)

44/44

Conclusion

Web, SNS crawling is a daunting task, especially under temporal constraints

Exploit the structureof the Web:

Web sites have specific structures that make them easier to crawl

Pages relevant to a specific topic are clustered together on the Web

Adaptivetechniques are possible, that learn how best to crawl a Web site or a particular keyword

Merci !

(54)

Serge Abiteboul, Grégory Cobena, Julien Masanès, and Gerald Sedrati. A first experience in archiving the French Web. In Proc. ECDL, Roma, Italie, September 2002.

Serge Abiteboul, Mihai Preda, and Gregory Cobena. Adaptive on-line page importance computation. In Proc. WWW, May 2003.

Soumen Chakrabarti, Martin van den Berg, and Byron Dom.

Focused crawling: A new approach to topic-specific Web resource discovery. Computer Networks, 31(11–16):

1623–1640, 1999.

Michelangelo Diligenti, Frans Coetzee, Steve Lawrence, C. Lee Giles, and Marco Gori. Focused crawling using context graphs. InProc. VLDB, Cairo, Egypt, September 2000.

(55)

Muhammad Faheem and Pierre Senellart. Demonstrating intelligent crawling and archiving of web applications. In Proc. CIKM, pages 2481–2484, San Francisco, USA, October 2013a. Demonstration.

Muhammad Faheem and Pierre Senellart. Intelligent and adaptive crawling of Web applications for Web archiving. In Proc. ICWE, pages 306–322, Aalborg, Denmark, July 2013b.

Muhammad Faheem and Pierre Senellart. Adaptive web crawling through structure-based link classification. InProc.

ICADL, pages 39–51, Seoul, South Korea, December 2015.

Tim Furche, Georg Gottlob, Giovanni Grasso, Christian

Schallhart, and Andrew Jon Sellers. OXPath: A language for scalable, memory-efficient data extraction from Web

applications. PVLDB, 4(11), 2011.

(56)

David Gibson, Kunal Punera, and Andrew Tomkins. The volume and evolution of Web page templates. In WWW, 2005.

Georges Gouriten and Pierre Senellart. API Blender: A uniform interface to social platform APIs. In Proc. WWW, Lyon, France, April 2012. Developer track.

Georges Gouriten, Silviu Maniu, and Pierre Senellart. Scalable, generic, and adaptive systems for focused crawling. In Proc.

Hypertext, pages 35–45, Santiago, Chile, September 2014.

Douglas Engelbart Best Paper Award.

Miyoung Han, Pierre Senellart, and Pierre-Henri Wuillemin.

Focused crawling through reinforcement learning. In Proc.

ICWE, June 2018.

(57)

Wallace Koehler. A longitudinal study of web pages continued:

a consideration of document persistence. Inf. Res., 9(2), 2003.

Martijn Koster. A standard for robot exclusion.

http://www.robotstxt.org/orig.html, June 1994.

Pierre Senellart. Identifying Websites with flow simulation. In Proc. ICWE, pages 124–129, Sydney, Australia, July 2005.

Marc Spaniol, Dimitar Denev, Arturas Mazeika, Gerhard Weikum, and Pierre Senellart. Data quality in web archiving.

InProceedings of the 3rd Workshop on Information Credibility on the Web, 2009.

Références

Documents relatifs

By comparison, columns 4-5 report the exact same information about sitemaps, namely the number of URLs and the number of Web pages that contain structured product offer data

Central and unique to a nc 2 crawler is uniform access to both ob- ject data (such as Web documents or data already extracted from previously crawled Web pages) and metadata about

• Detect the type of Web application, kind of Web pages inside this Web application, and decide crawling actions accordingly. • Directly targets useful content-rich areas,

The AAH deals with two different cases of adaptation: first, when (part of) a Web application has been crawled before the template change and a recrawl is carried out after that (a

Discovering new URLs Identifying duplicates Crawling architecture Crawling Complex Content Focused

We use the following scheme for class hierarchies: An abstract subclass of Robot redefines the executionLoop method in order to indicate how the function processURL should be called

The World Wide Web Discovering new URLs Identifying duplicates Crawling architecture Crawling Complex Content Conclusion.. 23

Le droit d’usage d ´efini par la licence autorise un usage `a destination de tout public qui comprend : – le droit de reproduire tout ou partie du document sur support informatique