1/44
Timely Crawling of the Structured Web
Pierre Senellart
É C O L E N O R M A L E S U P É R I E U R E
RESEARCH UNIVERSITY PARIS
23 April 2018,TempWeb 2018
2/44
Crawling the Web
• The World Wide Web: largest information repository
• By nature (mostly) publicand freely accessibleinformation
• Numerous applications require navigatingthe Web and retrieving Web contentin bulk:
• Web search engines(e.g., Google, Bing), to create an index of Web content
• Shopping aggregators(e.g., Google Shopping, Kelkoo), to build a database of products and their prices on different sites
• Web archiving(e.g., Internet Archive), to preserve parts of the Web for future generations
3/44
Crawling ethics
• Standard for robot exclusion: robots.txtat the root of a Web server [Koster, 1994]
User-agent: *
Allow: /searchhistory/
Disallow: /search
• Per-page exclusion
<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">
• Per-link exclusion
<a href="toto.html" rel="nofollow">Toto</a>
• AvoidDenial Of Service (DOS), wait 1s between two repeated requests to the same Web server
4/44
Crawling Social Networks
• Theoretically possible to crawl social networking sites using a regular Web crawler
• Often not possible:
https://www.facebook.com/robots.txt
• Oftenvery inefficient, considering politeness constraints
• Better solution: Use provided social networkingAPIs https://developer.twitter.com/en/docs/
api-reference-index
https://developers.facebook.com/docs/graph-api/
reference/v2.1/
https://developer.linkedin.com/docs/rest-api https://developers.google.com/youtube/v3/
5/44
Social Networking APIs
• Most social networking Web sites (and some other kinds of Web sites) provide APIs to effectively access their content
• Usually a RESTfulAPI, occasionally SOAP-baed
• Usually require atoken identifying the application using the API, sometimes a cryptographic signature as well
• May access the API as an authenticated user of the social network, or as an external party
• APIs seriously limit the rate of requests: https:
//dev.twitter.com/docs/api/1.1/get/search/tweets
6/44
Scope of a crawl
• Web-scale
• The Web is infinite! Avoid robot traps by putting depth or page numberlimitson each Web server
• Focus onimportantpages [Abiteboul et al., 2003]
• Web servers under a list of DNS domains: easy filtering of URLs
• A given topic: focused crawling techniques[Chakrabarti et al., 1999, Diligenti et al., 2000]based on classifiers of Web page content and predictors of the interest of a link.
• The national Web (cf.public deposit, national libraries):
what is this? [Abiteboul et al., 2002]
• A given Web site: what is a Web site? [Senellart, 2005]
7/44 Introduction Crawling of Structured Websites Learning as You Crawl Conclusion
Importance of Timely Crawling
• The Web is veryvolatile, with a typical half-life of URLs of a few years[Koehler, 2003]
• For many purposes (archiving, analytics), a crawl quality can be measured by itstemporal coherence [Spaniol et al., 2009]
• Ideally, Web pages pointed to by a Web page should be crawled at the same time. Unrealistic in practice.
• Crawling takes timeand consumes resources:
• Limited bandwidth, limiting computing power on the crawling side
• Because of crawling ethics, crawling a 5 million page site takes around2 months!
• Limitations of social networking APIsdrastic: on Twitter, using the Search API, at most 3 000 tweets per minute;
using the Timeline API, at most 20 000 tweets per minute. . .
7/44
Importance of Timely Crawling
• The Web is veryvolatile, with a typical half-life of URLs of a few years[Koehler, 2003]
• For many purposes (archiving, analytics), a crawl quality can be measured by itstemporal coherence [Spaniol et al., 2009]
• Ideally, Web pages pointed to by a Web page should be crawled at the same time. Unrealistic in practice.
• Crawling takes timeand consumes resources:
• Limited bandwidth, limiting computing power on the crawling side
• Because of crawling ethics, crawling a 5 million page site takes around2 months!
• Limitations of social networking APIsdrastic: on Twitter, using the Search API, at most 3 000 tweets per minute;
using the Timeline API, at most 20 000 tweets per minute. . . 350 000new tweets per minute on average!
8/44
This Talk
• Focus onlimiting the number of HTTP requests: avoiding crawling useless or redundant content
• Two different scopes:
• Crawling a givenstructuredWeb site (URLs under a given domain)
• Focusedcrawling
• Also applies in a large way to the crawl of non-Web graphs, such as social networking sites through APIs[Gouriten and Senellart, 2012]
9/44
Outline
Introduction
Crawling of Structured Websites The Structured Web
Adaptive Crawler Bot for Data Extraction (ACEBot) Experiments
Learning as You Crawl Conclusion
10/44
Outline
Introduction
Crawling of Structured Websites The Structured Web
Adaptive Crawler Bot for Data Extraction (ACEBot) Experiments
Learning as You Crawl Conclusion
11/44
The CMS-Based Web
40–50% of Web content may rely on template-basedcontent management systems(CMSs) [Gibson et al., 2005]
11/44
The CMS-Based Web
40–50% of Web content may rely on template-basedcontent management systems(CMSs) [Gibson et al., 2005]
12/44
Traditional crawling approach
• A traditional Web crawler (such asHeritrix) crawls the Web in a conceptually very simple way.
Queue Man- agement
Page
Fetching Link Extraction
URL Selection
• This approach does not take into account the nature of the Web site crawled, its structure, its template.
• Unoptimal! Most Web sites have lots of redundancies:
several URLs host the same content, different ways to navigate within a Web site to the same content, etc.
13/44
Application-Aware Helper (AAH)
[Faheem and Senellart, 2013b,a]
with
Archivist Interface
Web application
detection module Indexing
module
Crawling module Web applications
to crawl
Web application adaptation
Module
Content extraction and annotation
module Crawled Web
pages with annotated Contents
3 2 1
5 6
8
9 12
4
RDF store Stored
WARC files
7 11 10
XML knowledge
base
What is crawled depends on the CMS the Web site is based on, but relies on ahand-written knowledge base of existing CMSs
14/44
AAH: Web application detection module
with
• One main challenge in intelligent crawling and content extraction is to identify the Web application and then perform the best crawling strategyaccordingly.
• Detecting Web application using:
1. URL patterns, 2. HTTP metadata, 3. textual content, 4. XPath patterns, etc.
• For instance thevBulletin Web forum content management system, that can be identified by searching for a reference to a vbulletin_global.jsJavaScript script by using a simple //script/@srcXPath expression.
15/44
AAH: Adaptation to template change
with
• Determine when a change has occurred wrt the template described in the knowledge base and adapting actions accordingly
• Two cases:
Recrawl of a Web application Structural changes are detected by looking for the content in the archive. When structural change is detected, the system first marks the failed crawling actions and then aligns them using content from the archive.
Crawl of a new Web application If actions in the knowledge base do not match, the system collects all candidate attributes, values, tag names from the knowledge base, creates all possible combinations ofrelaxed expressions, and then evaluates them.
16/44
Desiderata
• Fully automated systemfor Web crawling
• Crawls the content of template-based Web sites with fewer requests than traditional Web crawlers
• Automatically identifies interesting navigation patternsin a Web site
• Adapts to the structure of new Web sites, based on unknown CMSs
17/44
Outline
Introduction
Crawling of Structured Websites The Structured Web
Adaptive Crawler Bot for Data Extraction (ACEBot) Experiments
Learning as You Crawl Conclusion
18/44
Structure-driven crawler
[Faheem and Senellart, 2015]
with
• ACEBot (AdaptiveCrawler Botfor dataExtraction) is a fully automatic, structure-driven crawler.
• ACEBotutilizes the inner structure of the Web pages and guides the crawling process based on the importance of their content.
• ACEBot has two phases:
Offline phase: constructs a dynamic sitemap (limiting the number of URLs retrieved), learns the best traversal strategy based on importance of navigation patterns (selecting those leading to valuable content).
Online phase: performsmassivecrawling following the chosen navigation patterns.
19/44
Simple example: structure of a Web page
with
html
div div a table
thead table thead
tbody a tbody
a tbody
a
`1 html-table-thead-table- thead-tbody-a
`2 html-div-div-a
`3 html-table-thead-tbody-a
20/44
Simple example: structure of a Web site
with
p0 p4
2039 p1
239
p2 754
p3 3227
p5
`2 2600
`3
`1
`1 `4 p0 p3 p4
5107 p1
239
p2
754
p5
`2 2600
`3
`1 `4
21/44 Introduction Crawling of Structured Websites Learning as You Crawl Conclusion
Objective
with
• Determine the navigation patternson a Web site that lead to most content items
• Content items: can be anything, e.g., k-grams (succession of kwords)
• Score of a navigation pattern: number of distinct items number of requests
Finding a set of navigation patterns that optimizes the score is coNP-hard.
)Greedyapproach, until sufficiently many content items have been retrieved
21/44 Introduction Crawling of Structured Websites Learning as You Crawl Conclusion
Objective
with
• Determine the navigation patternson a Web site that lead to most content items
• Content items: can be anything, e.g., k-grams (succession of kwords)
• Score of a navigation pattern: number of distinct items number of requests Proposition
Finding a set of navigation patterns that optimizes the score is coNP-hard.
been retrieved
21/44
Objective
with
• Determine the navigation patternson a Web site that lead to most content items
• Content items: can be anything, e.g., k-grams (succession of kwords)
• Score of a navigation pattern: number of distinct items number of requests Proposition
Finding a set of navigation patterns that optimizes the score is coNP-hard.
)Greedyapproach, until sufficiently many content items have been retrieved
22/44
Language of navigation patterns
with
Non-disjunctiveregular expressions over link paths, expressed asOXPath expressions [Furche et al., 2011] (language for Web navigation)
hexpri ::= "doc" "("hurli ")" (hestepi)+
hestepi ::= hstepi |hkleenei
hstepi ::= "/" "("hactioni ")"
hkleenei ::= "/" "(" hactioni ")" ( "∗" |hnumberi ) hactioni ::= ("/"hnodetesti)+ "{click/}"
hnodetesti ::= tag |"@" tag
23/44
Simple example: best navigation pattern
with
p0 p3 p4
5107 p1
239
p2
754 p5 2600
(`2,239)
(`3,754)
(`1,2553.5) (`,2600)4 NP total distinct score
2-grams 2-grams
`1 5266 5107 2553.5
`1; `4 7866 7214 2404.7
`3 754 754 754
`2 239 239 239
23/44
Simple example: best navigation pattern
with
p0 p3 p4
5107 p1
239
p2
754 p5 2600
(`2,239)
(`3,754)
(`1,2553.5) (`,2600)4 NP total distinct score
2-grams 2-grams
`1 5266 5107 2553.5
`1; `4 7866 7214 2404.7
`3 754 754 754
`2 239 239 239
24/44
ACEBot architecture
with
I
II I
Navigation Patterns Construction Sitemap
Construction
Navigation Patterns Selection
Crawler Repository
XML Knowledge
base
25/44
Outline
Introduction
Crawling of Structured Websites The Structured Web
Adaptive Crawler Bot for Data Extraction (ACEBot) Experiments
Learning as You Crawl Conclusion
26/44
Experiment setup
with
• The experiments are performed with ACEBot,AAHand GNU wget
• Crawled 50 Web sites (totaling nearly 2 million Web pages) with both ACEBotandwget
• Crawled 10 Web sites (totaling nearly 0.5 million Web pages) with both ACEBotandAAH
• Dynamic site map has been constructed with randomly chosen 3 000 Web pages
27/44
Performance metrics
with
• The number of HTTP requests made by both systems vs the amount of useful contentretrieved
• Coverage of useful content is calculated by comparing the proportion of:
• distinct 2-grams
• external links
in the crawl result of different systems.
28/44
Crawl efficiency
with
WordPress vBulletin phpBB other 0
10 20 30
Proportionof HTTP requests(%) ACE
AAH
29/44
Navigation pattern efficiency
with
0 2 4 6 8 10 12
0 2 4 6 8 10
Number of selected navigation patterns
NumberofHTTPrequests(1;000)
requests 0 20 40 60 80 100
Proportionofseen2-grams(%)
2-grams
30/44
Crawling a particular Web site
with
0 2;000 4;000 6;000
0 100 200 300
Number of HTTP requests
Numberofdistinct2-grams(1;000)
ACEAAH wget
31/44
Impact of the sitemap size
with
500 1;000 1;500 2;000 2;500 3;000 20
40 60 80 100
6 6
7 9
10 10
Size of site map
Proportionofseenn-grams(%)
(Labels are the number of navigation patterns learned)
32/44
Lessons learned
In brief:
• It is possible to retrieve the significant content of a Web sitemuch more efficientlythan by crawling it whole
• Comparable resultsbetween fully automatic approach and approach relying on hand-written knowledge base
• Sample, learn, apply to whole Web site
Main limitation:
• Requires a separate first crawl of a sitemap of fixed size
• Is it possible:
• Toadaptivelyselect the size of the sitemap?
• To learn the site structureas it is being crawled?
33/44
Outline
Introduction
Crawling of Structured Websites Learning as You Crawl
Reinforcement learning RL for Focused Crawling Conclusion
34/44
Outline
Introduction
Crawling of Structured Websites Learning as You Crawl
Reinforcement learning RL for Focused Crawling Conclusion
35/44
Reinforcement learning
• Deals with agents learning to interact with a partially unknown environment
• Agents can perform actions, resulting inrewards (or penalties) and in a possiblestate change
• Agents learn about the world by interacting with it
• Goal: find the right sequence of actions that maximizes overall reward, or minimizes overall penalty
• Classic tradeoff: exploration vs exploitation
36/44
Multi-armed bandits
• Stateless model
• k > 1 different actions, with unknown rewards
• often assumed that the rewards are from a
parametrized probabilistic distribution (Bernoulli, Gaussian, exponential, etc.)
37/44
Markov decision process (MDP)
• Finite set of states
• In each state, actions lead:
• to astate change
• to areward
• Depending on cases:
• state changes may be deterministic or probabilistic
• state changes may be known or unknown (to be learned)
• rewards may be known or unknown (to be learned)
• current state may even be unknown! (poMDP)
38/44
Outline
Introduction
Crawling of Structured Websites Learning as You Crawl
Reinforcement learning RL for Focused Crawling Conclusion
39/44
Abstracting Out Focused Crawling
• As usual, the Web is seen as a graph
• Nodes areweighted with scores that correspond to relevance to the topic of the crawl
• Edges are weighted with scores indicating clues of the edge being relevant to the crawl
• Scores are unknowntill the page is crawled
• Goal: Maximizetotal score of Web pages crawledwithin a fixed budget
40/44 Introduction Crawling of Structured Websites Learning as You Crawl Conclusion
Predicting the Score of an Unknown Node
• Variouspredictors can be used:
• Distance with respect to a seed node
• Average score of pages pointing to a node
• Average score of pages pointed to by pages pointing to a node
• More advanced classifiers using various features of the Web page
• etc.
40/44
Predicting the Score of an Unknown Node
• Variouspredictors can be used:
• Distance with respect to a seed node
• Average score of pages pointing to a node
• Average score of pages pointed to by pages pointing to a node
• More advanced classifiers using various features of the Web page
• etc.
• Which are the best predictorsfor a given crawling task?
41/44
Adaptive focused crawling (stateless)
[Gouriten et al., 2014]
with 3
5 0
4 0
0
3 5
3
4 2
2 3
3
5 0
4 0
0
3 5
3
4 2
2 3
2 3
0
0 0
0 1 0 1
0 1
3
0 0
0
0
• Problem: Efficiently crawl Web pages such thattotal score is high
• Challenge: The score of a node is unknown till it is crawled
• Methodology: Use various predictors of node scores, and adaptively select the best one so far with multi-armed bandits
41/44
Adaptive focused crawling (stateless)
[Gouriten et al., 2014]
with 3
5 0
4 0
0
3 5
3
4 2
2 3
3
5 0
4 0
0
3 5
3
4 2
2 3
2 3
0
0 0
0 1 0 1
0 1
3
0 0
0
0
• Problem: Efficiently crawl Web pages such thattotal score is high
• Challenge: The score of a node is unknown till it is crawled
• Methodology: Use various predictors of node scores, and adaptively select the best one so far with multi-armed bandits
41/44
Adaptive focused crawling (stateless)
[Gouriten et al., 2014]
with 3
5 0
4 0
0
3 5
3
4 2
2 3
3
5 0
4 0
0
3 5
3
4 2
2 3
2 3
0
0 0
0 1 0 1
0 1
3
0 0
0
0
• Problem: Efficiently crawl Web pages such thattotal score is high
• Challenge: The score of a node is unknown till it is crawled
• Methodology: Use various predictors of node scores, and adaptively select the best one so far with multi-armed bandits
42/44
Adaptive focused crawling (stateful)
[Han et al., 2018]
with
• Problem: Efficiently crawl nodes in a graph such that total score is high, taking into account
currently crawled graph
• Challenge: Huge state space
• Methodology: MDP,clusteringof state space, together withlinear approximation of value functions
43/44
Outline
Introduction
Crawling of Structured Websites Learning as You Crawl
Conclusion
44/44 Introduction Crawling of Structured Websites Learning as You Crawl Conclusion
Conclusion
• Web, SNS crawling is a daunting task, especially under temporal constraints
• Exploit the structureof the Web:
• Web sites have specific structures that make them easier to crawl
• Pages relevant to a specific topic are clustered together on the Web
• Adaptivetechniques are possible, that learn how best to crawl a Web site or a particular keyword
44/44
Conclusion
• Web, SNS crawling is a daunting task, especially under temporal constraints
• Exploit the structureof the Web:
• Web sites have specific structures that make them easier to crawl
• Pages relevant to a specific topic are clustered together on the Web
• Adaptivetechniques are possible, that learn how best to crawl a Web site or a particular keyword
Merci !
Serge Abiteboul, Grégory Cobena, Julien Masanès, and Gerald Sedrati. A first experience in archiving the French Web. In Proc. ECDL, Roma, Italie, September 2002.
Serge Abiteboul, Mihai Preda, and Gregory Cobena. Adaptive on-line page importance computation. In Proc. WWW, May 2003.
Soumen Chakrabarti, Martin van den Berg, and Byron Dom.
Focused crawling: A new approach to topic-specific Web resource discovery. Computer Networks, 31(11–16):
1623–1640, 1999.
Michelangelo Diligenti, Frans Coetzee, Steve Lawrence, C. Lee Giles, and Marco Gori. Focused crawling using context graphs. InProc. VLDB, Cairo, Egypt, September 2000.
Muhammad Faheem and Pierre Senellart. Demonstrating intelligent crawling and archiving of web applications. In Proc. CIKM, pages 2481–2484, San Francisco, USA, October 2013a. Demonstration.
Muhammad Faheem and Pierre Senellart. Intelligent and adaptive crawling of Web applications for Web archiving. In Proc. ICWE, pages 306–322, Aalborg, Denmark, July 2013b.
Muhammad Faheem and Pierre Senellart. Adaptive web crawling through structure-based link classification. InProc.
ICADL, pages 39–51, Seoul, South Korea, December 2015.
Tim Furche, Georg Gottlob, Giovanni Grasso, Christian
Schallhart, and Andrew Jon Sellers. OXPath: A language for scalable, memory-efficient data extraction from Web
applications. PVLDB, 4(11), 2011.
David Gibson, Kunal Punera, and Andrew Tomkins. The volume and evolution of Web page templates. In WWW, 2005.
Georges Gouriten and Pierre Senellart. API Blender: A uniform interface to social platform APIs. In Proc. WWW, Lyon, France, April 2012. Developer track.
Georges Gouriten, Silviu Maniu, and Pierre Senellart. Scalable, generic, and adaptive systems for focused crawling. In Proc.
Hypertext, pages 35–45, Santiago, Chile, September 2014.
Douglas Engelbart Best Paper Award.
Miyoung Han, Pierre Senellart, and Pierre-Henri Wuillemin.
Focused crawling through reinforcement learning. In Proc.
ICWE, June 2018.
Wallace Koehler. A longitudinal study of web pages continued:
a consideration of document persistence. Inf. Res., 9(2), 2003.
Martijn Koster. A standard for robot exclusion.
http://www.robotstxt.org/orig.html, June 1994.
Pierre Senellart. Identifying Websites with flow simulation. In Proc. ICWE, pages 124–129, Sydney, Australia, July 2005.
Marc Spaniol, Dimitar Denev, Arturas Mazeika, Gerhard Weikum, and Pierre Senellart. Data quality in web archiving.
InProceedings of the 3rd Workshop on Information Credibility on the Web, 2009.