Crawling the Web

(1)

1/44

Timely Crawling of the Structured Web

Pierre Senellart

É C O L E N O R M A L E S U P É R I E U R E

RESEARCH UNIVERSITY PARIS

23 April 2018,TempWeb 2018

(2)

2/44

Crawling the Web

• The World Wide Web: largest information repository

• By nature (mostly) publicand freely accessibleinformation

• Numerous applications require navigatingthe Web and retrieving Web contentin bulk:

• Web search engines(e.g., Google, Bing), to create an index of Web content

• Shopping aggregators(e.g., Google Shopping, Kelkoo), to build a database of products and their prices on different sites

• Web archiving(e.g., Internet Archive), to preserve parts of the Web for future generations

(3)

3/44

Crawling ethics

• Standard for robot exclusion: robots.txtat the root of a Web server [Koster, 1994]

User-agent: *

Allow: /searchhistory/

Disallow: /search

• Per-page exclusion

• Per-link exclusion

• AvoidDenial Of Service (DOS), wait 1s between two repeated requests to the same Web server

(4)

4/44

Crawling Social Networks

• Theoretically possible to crawl social networking sites using a regular Web crawler

• Often not possible:

https://www.facebook.com/robots.txt

• Oftenvery inefficient, considering politeness constraints

• Better solution: Use provided social networkingAPIs https://developer.twitter.com/en/docs/

api-reference-index

https://developers.facebook.com/docs/graph-api/

reference/v2.1/

https://developer.linkedin.com/docs/rest-api https://developers.google.com/youtube/v3/

(5)

5/44

Social Networking APIs

• Most social networking Web sites (and some other kinds of Web sites) provide APIs to effectively access their content

• Usually a RESTfulAPI, occasionally SOAP-baed

• Usually require atoken identifying the application using the API, sometimes a cryptographic signature as well

• May access the API as an authenticated user of the social network, or as an external party

• APIs seriously limit the rate of requests: https:

//dev.twitter.com/docs/api/1.1/get/search/tweets

(6)

6/44

Scope of a crawl

• Web-scale

• The Web is infinite! Avoid robot traps by putting depth or page numberlimitson each Web server

• Focus onimportantpages [Abiteboul et al., 2003]

• Web servers under a list of DNS domains: easy filtering of URLs

• A given topic: focused crawling techniques[Chakrabarti et al., 1999, Diligenti et al., 2000]based on classifiers of Web page content and predictors of the interest of a link.

• The national Web (cf.public deposit, national libraries):

what is this? [Abiteboul et al., 2002]

• A given Web site: what is a Web site? [Senellart, 2005]

(7)

7/44 Introduction Crawling of Structured Websites Learning as You Crawl Conclusion

Importance of Timely Crawling

• The Web is veryvolatile, with a typical half-life of URLs of a few years[Koehler, 2003]

• For many purposes (archiving, analytics), a crawl quality can be measured by itstemporal coherence [Spaniol et al., 2009]

• Ideally, Web pages pointed to by a Web page should be crawled at the same time. Unrealistic in practice.

• Crawling takes timeand consumes resources:

• Limited bandwidth, limiting computing power on the crawling side

• Because of crawling ethics, crawling a 5 million page site takes around2 months!

• Limitations of social networking APIsdrastic: on Twitter, using the Search API, at most 3 000 tweets per minute;

using the Timeline API, at most 20 000 tweets per minute. . .

(8)

7/44

Importance of Timely Crawling

• The Web is veryvolatile, with a typical half-life of URLs of a few years[Koehler, 2003]

• For many purposes (archiving, analytics), a crawl quality can be measured by itstemporal coherence [Spaniol et al., 2009]

• Ideally, Web pages pointed to by a Web page should be crawled at the same time. Unrealistic in practice.

• Crawling takes timeand consumes resources:

• Limited bandwidth, limiting computing power on the crawling side

• Because of crawling ethics, crawling a 5 million page site takes around2 months!

• Limitations of social networking APIsdrastic: on Twitter, using the Search API, at most 3 000 tweets per minute;

using the Timeline API, at most 20 000 tweets per minute. . . 350 000new tweets per minute on average!

(9)

8/44

This Talk

• Focus onlimiting the number of HTTP requests: avoiding crawling useless or redundant content

• Two different scopes:

• Crawling a givenstructuredWeb site (URLs under a given domain)

• Focusedcrawling

• Also applies in a large way to the crawl of non-Web graphs, such as social networking sites through APIs[Gouriten and Senellart, 2012]

(10)

9/44

Outline

Introduction

Crawling of Structured Websites The Structured Web

Adaptive Crawler Bot for Data Extraction (ACEBot) Experiments

Learning as You Crawl Conclusion

(11)

10/44

Outline

Introduction

(12)

11/44

The CMS-Based Web

40–50% of Web content may rely on template-basedcontent management systems(CMSs) [Gibson et al., 2005]

(13)

11/44

The CMS-Based Web

40–50% of Web content may rely on template-basedcontent management systems(CMSs) [Gibson et al., 2005]

(14)

12/44

Traditional crawling approach

• A traditional Web crawler (such asHeritrix) crawls the Web in a conceptually very simple way.

Queue Man- agement

Page

Fetching Link Extraction

URL Selection

• This approach does not take into account the nature of the Web site crawled, its structure, its template.

• Unoptimal! Most Web sites have lots of redundancies:

several URLs host the same content, different ways to navigate within a Web site to the same content, etc.

(15)

13/44

Application-Aware Helper (AAH)

[Faheem and Senellart, 2013b,a]

with

Archivist Interface

Web application

detection module Indexing

module

Crawling module Web applications

to crawl

Web application adaptation

Module

Content extraction and annotation

module Crawled Web

pages with annotated Contents

3 2 1

5 6

8

9 12

4

RDF store Stored

WARC files

7 11 10

XML knowledge

base

What is crawled depends on the CMS the Web site is based on, but relies on ahand-written knowledge base of existing CMSs

(16)

14/44

AAH: Web application detection module

with

• One main challenge in intelligent crawling and content extraction is to identify the Web application and then perform the best crawling strategyaccordingly.

• Detecting Web application using:

1. URL patterns, 2. HTTP metadata, 3. textual content, 4. XPath patterns, etc.

• For instance thevBulletin Web forum content management system, that can be identified by searching for a reference to a vbulletin_global.jsJavaScript script by using a simple //script/@srcXPath expression.

(17)

15/44

AAH: Adaptation to template change

with

• Determine when a change has occurred wrt the template described in the knowledge base and adapting actions accordingly

• Two cases:

Recrawl of a Web application Structural changes are detected by looking for the content in the archive. When structural change is detected, the system first marks the failed crawling actions and then aligns them using content from the archive.

Crawl of a new Web application If actions in the knowledge base do not match, the system collects all candidate attributes, values, tag names from the knowledge base, creates all possible combinations ofrelaxed expressions, and then evaluates them.

(18)

16/44

Desiderata

• Fully automated systemfor Web crawling

• Crawls the content of template-based Web sites with fewer requests than traditional Web crawlers

• Automatically identifies interesting navigation patternsin a Web site

• Adapts to the structure of new Web sites, based on unknown CMSs

(19)

17/44

Outline

Introduction

(20)

18/44

Structure-driven crawler

[Faheem and Senellart, 2015]

with

• ACEBot (AdaptiveCrawler Botfor dataExtraction) is a fully automatic, structure-driven crawler.

• ACEBotutilizes the inner structure of the Web pages and guides the crawling process based on the importance of their content.

• ACEBot has two phases:

Offline phase: constructs a dynamic sitemap (limiting the number of URLs retrieved), learns the best traversal strategy based on importance of navigation patterns (selecting those leading to valuable content).

Online phase: performsmassivecrawling following the chosen navigation patterns.

(21)

19/44

Simple example: structure of a Web page

with

html

div div a table

thead table thead

tbody a tbody

a tbody

a

`₁ html-table-thead-table- thead-tbody-a

`₂ html-div-div-a

`₃ html-table-thead-tbody-a

(22)

20/44

Simple example: structure of a Web site

with

p₀ p₄

2039 p₁

239

p₂ 754

p₃ 3227

p₅

`2 2600

`₃

`¹

`₁ `₄ p₀ p₃ p₄

5107 p₁

239

p2

754

p₅

`2 2600

`₃

`₁ `₄

(23)

Objective

with

• Determine the navigation patternson a Web site that lead to most content items

• Content items: can be anything, e.g., k-grams (succession of kwords)

• Score of a navigation pattern: number of distinct items number of requests

Finding a set of navigation patterns that optimizes the score is coNP-hard.

)Greedyapproach, until sufficiently many content items have been retrieved

(24)

Objective

with

• Score of a navigation pattern: number of distinct items number of requests Proposition

been retrieved

(25)

21/44

Objective

with

• Score of a navigation pattern: number of distinct items number of requests Proposition

)Greedyapproach, until sufficiently many content items have been retrieved

(26)

22/44

Language of navigation patterns

with

Non-disjunctiveregular expressions over link paths, expressed asOXPath expressions [Furche et al., 2011] (language for Web navigation)

hexpri ::= "doc" "("hurli ")" (hestepi)+

hestepi ::= hstepi |hkleenei

hstepi ::= "/" "("hactioni ")"

hkleenei ::= "/" "(" hactioni ")" ( "∗" |hnumberi ) hactioni ::= ("/"hnodetesti)+ "{click/}"

hnodetesti ::= tag |"@" tag

(27)

23/44

Simple example: best navigation pattern

with

p0 p3 p4

5107 p₁

239

p2

754 p₅ 2600

(`2,239)

(`₃,754)

(`1,2553.5) (`,2600)4 NP total distinct score

2-grams 2-grams

`₁ 5266 5107 2553.5

`₁; `₄ 7866 7214 2404.7

`3 754 754 754

`₂ 239 239 239

(28)

23/44

Simple example: best navigation pattern

with

p0 p3 p4

5107 p₁

239

p2

754 p₅ 2600

(`2,239)

(`₃,754)

(`1,2553.5) (`,2600)4 NP total distinct score

2-grams 2-grams

`₁ 5266 5107 2553.5

`₁; `₄ 7866 7214 2404.7

`3 754 754 754

`₂ 239 239 239

(29)

24/44

ACEBot architecture

with

I

II I

Navigation Patterns Construction Sitemap

Construction

Navigation Patterns Selection

Crawler Repository

XML Knowledge

base

(30)

25/44

Outline

Introduction

(31)

26/44

Experiment setup

with

• The experiments are performed with ACEBot,AAHand GNU wget

• Crawled 50 Web sites (totaling nearly 2 million Web pages) with both ACEBotandwget

• Crawled 10 Web sites (totaling nearly 0.5 million Web pages) with both ACEBotandAAH

• Dynamic site map has been constructed with randomly chosen 3 000 Web pages

(32)

27/44

Performance metrics

with

• The number of HTTP requests made by both systems vs the amount of useful contentretrieved

• Coverage of useful content is calculated by comparing the proportion of:

• distinct 2-grams

• external links

in the crawl result of different systems.

(33)

28/44

Crawl efficiency

with

WordPress vBulletin phpBB other 0

10 20 30

Proportionof HTTP requests(%) ACE

AAH

(34)

29/44

Navigation pattern efficiency

with

0 2 4 6 8 10 12

0 2 4 6 8 10

Number of selected navigation patterns

NumberofHTTPrequests(1;000)

requests 0 20 40 60 80 100

Proportionofseen2-grams(%)

2-grams

(35)

30/44

Crawling a particular Web site

with

0 2;000 4;000 6;000

0 100 200 300

Number of HTTP requests

Numberofdistinct2-grams(1;000)

ACEAAH wget

(36)

31/44

Impact of the sitemap size

with

500 1;000 1;500 2;000 2;500 3;000 20

40 60 80 100

6 6

7 9

10 10

Size of site map

Proportionofseenn-grams(%)

(Labels are the number of navigation patterns learned)

(37)

32/44

Lessons learned

In brief:

• It is possible to retrieve the significant content of a Web sitemuch more efficientlythan by crawling it whole

• Comparable resultsbetween fully automatic approach and approach relying on hand-written knowledge base

• Sample, learn, apply to whole Web site

Main limitation:

• Requires a separate first crawl of a sitemap of fixed size

• Is it possible:

• Toadaptivelyselect the size of the sitemap?

• To learn the site structureas it is being crawled?

(38)

33/44

Outline

Introduction

Crawling of Structured Websites Learning as You Crawl

Reinforcement learning RL for Focused Crawling Conclusion

(39)

34/44

Outline

Introduction

(40)

35/44

Reinforcement learning

• Deals with agents learning to interact with a partially unknown environment

• Agents can perform actions, resulting inrewards (or penalties) and in a possiblestate change

• Agents learn about the world by interacting with it

• Goal: find the right sequence of actions that maximizes overall reward, or minimizes overall penalty

• Classic tradeoff: exploration vs exploitation

(41)

36/44

Multi-armed bandits

• Stateless model

• k > 1 different actions, with unknown rewards

• often assumed that the rewards are from a

parametrized probabilistic distribution (Bernoulli, Gaussian, exponential, etc.)

(42)

37/44

Markov decision process (MDP)

• Finite set of states

• In each state, actions lead:

• to astate change

• to areward

• Depending on cases:

• state changes may be deterministic or probabilistic

• state changes may be known or unknown (to be learned)

• rewards may be known or unknown (to be learned)

• current state may even be unknown! (poMDP)

(43)

38/44

Outline

Introduction

(44)

39/44

Abstracting Out Focused Crawling

• As usual, the Web is seen as a graph

• Nodes areweighted with scores that correspond to relevance to the topic of the crawl

• Edges are weighted with scores indicating clues of the edge being relevant to the crawl

• Scores are unknowntill the page is crawled

• Goal: Maximizetotal score of Web pages crawledwithin a fixed budget

(45)

Predicting the Score of an Unknown Node

• Variouspredictors can be used:

• Distance with respect to a seed node

• Average score of pages pointing to a node

• Average score of pages pointed to by pages pointing to a node

• More advanced classifiers using various features of the Web page

• etc.

(46)

40/44

Predicting the Score of an Unknown Node

• Variouspredictors can be used:

• Distance with respect to a seed node

• Average score of pages pointing to a node

• Average score of pages pointed to by pages pointing to a node

• More advanced classifiers using various features of the Web page

• etc.

• Which are the best predictorsfor a given crawling task?

(47)

41/44

Adaptive focused crawling (stateless)

[Gouriten et al., 2014]

with 3

5 0

4 0

0

3 5

3

4 2

2 3

3

5 0

4 0

0

3 5

3

4 2

2 3

0

0 0

0 1 0 1

0 1

3

0 0

0

• Problem: Efficiently crawl Web pages such thattotal score is high

• Challenge: The score of a node is unknown till it is crawled

• Methodology: Use various predictors of node scores, and adaptively select the best one so far with multi-armed bandits

(48)

41/44

Adaptive focused crawling (stateless)

with 3

5 0

4 0

0

3 5

3

4 2

2 3

3

5 0

4 0

0

3 5

3

4 2

2 3

0

0 0

0 1 0 1

0 1

3

0 0

0

(49)

41/44

Adaptive focused crawling (stateless)

with 3

5 0

4 0

0

3 5

3

4 2

2 3

3

5 0

4 0

0

3 5

3

4 2

2 3

0

0 0

0 1 0 1

0 1

3

0 0

0

(50)

42/44

Adaptive focused crawling (stateful)

[Han et al., 2018]

with

• Problem: Efficiently crawl nodes in a graph such that total score is high, taking into account

currently crawled graph

• Challenge: Huge state space

• Methodology: MDP,clusteringof state space, together withlinear approximation of value functions

(51)

43/44

Outline

Introduction

Conclusion

(52)

Conclusion

• Web, SNS crawling is a daunting task, especially under temporal constraints

• Exploit the structureof the Web:

• Web sites have specific structures that make them easier to crawl

• Pages relevant to a specific topic are clustered together on the Web

• Adaptivetechniques are possible, that learn how best to crawl a Web site or a particular keyword

(53)

44/44

Conclusion

• Web, SNS crawling is a daunting task, especially under temporal constraints

• Exploit the structureof the Web:

• Web sites have specific structures that make them easier to crawl

• Pages relevant to a specific topic are clustered together on the Web

• Adaptivetechniques are possible, that learn how best to crawl a Web site or a particular keyword

Merci !

(54)

Serge Abiteboul, Grégory Cobena, Julien Masanès, and Gerald Sedrati. A first experience in archiving the French Web. In Proc. ECDL, Roma, Italie, September 2002.

Serge Abiteboul, Mihai Preda, and Gregory Cobena. Adaptive on-line page importance computation. In Proc. WWW, May 2003.

Soumen Chakrabarti, Martin van den Berg, and Byron Dom.

Focused crawling: A new approach to topic-specific Web resource discovery. Computer Networks, 31(11–16):

1623–1640, 1999.

Michelangelo Diligenti, Frans Coetzee, Steve Lawrence, C. Lee Giles, and Marco Gori. Focused crawling using context graphs. InProc. VLDB, Cairo, Egypt, September 2000.

(55)

Muhammad Faheem and Pierre Senellart. Demonstrating intelligent crawling and archiving of web applications. In Proc. CIKM, pages 2481–2484, San Francisco, USA, October 2013a. Demonstration.

Muhammad Faheem and Pierre Senellart. Intelligent and adaptive crawling of Web applications for Web archiving. In Proc. ICWE, pages 306–322, Aalborg, Denmark, July 2013b.

Muhammad Faheem and Pierre Senellart. Adaptive web crawling through structure-based link classification. InProc.

ICADL, pages 39–51, Seoul, South Korea, December 2015.

Tim Furche, Georg Gottlob, Giovanni Grasso, Christian

Schallhart, and Andrew Jon Sellers. OXPath: A language for scalable, memory-efficient data extraction from Web

applications. PVLDB, 4(11), 2011.

(56)

David Gibson, Kunal Punera, and Andrew Tomkins. The volume and evolution of Web page templates. In WWW, 2005.

Georges Gouriten and Pierre Senellart. API Blender: A uniform interface to social platform APIs. In Proc. WWW, Lyon, France, April 2012. Developer track.

Georges Gouriten, Silviu Maniu, and Pierre Senellart. Scalable, generic, and adaptive systems for focused crawling. In Proc.

Hypertext, pages 35–45, Santiago, Chile, September 2014.

Douglas Engelbart Best Paper Award.

Miyoung Han, Pierre Senellart, and Pierre-Henri Wuillemin.

Focused crawling through reinforcement learning. In Proc.

ICWE, June 2018.

(57)

Wallace Koehler. A longitudinal study of web pages continued:

a consideration of document persistence. Inf. Res., 9(2), 2003.

Martijn Koster. A standard for robot exclusion.

http://www.robotstxt.org/orig.html, June 1994.

Pierre Senellart. Identifying Websites with flow simulation. In Proc. ICWE, pages 124–129, Sydney, Australia, July 2005.

Marc Spaniol, Dimitar Denev, Arturas Mazeika, Gerhard Weikum, and Pierre Senellart. Data quality in web archiving.

InProceedings of the 3rd Workshop on Information Credibility on the Web, 2009.