Adaptive Web Crawling through Stucture-Based Link Classiﬁcation

(1)

Adaptive Web Crawling through Stucture-Based Link Classification

Muhammad Faheem Pierre Senellart

(2)

Web archiving

(3)

The CMS-Based Web

40–50% of Web content may rely on template-based content management systems (CMSs)

(4)

The CMS-Based Web

40–50% of Web content may rely on template-based content management systems (CMSs)

(5)

Traditional crawling approach

A traditional Web crawler (such asHeritrix) crawls the Web in a conceptuallyvery simple way.

Queue Man- agement

Page Fetching

Link Extraction

URL Selection

This approach does not take into account the natureof the Web site crawled, its structure, itstemplate.

Veryinefficient! Most Web sites have lots ofredundancies: several URLs host the same content, different ways to navigate within a Web site to the same content, etc.

(6)

Application-Aware Helper (AAH)

[ICWE 2013, CIKM 2013]

Archivist Interface

Web application detection module

Indexing module

Crawling module Web applications

to crawl

Web application adaptation

Module

Content extraction and annotation

module Crawled Web

pages with annotated Contents

3 2 1

5 6

8

9 12

4

RDF store Stored

WARC files

7 11 10

XML knowledge

base

What is crawled depends on the CMS the Web site is based on, but relies on ahand-written knowledge base of existing CMSs

(7)

AAH: Web application detection module

One main challenge in intelligent crawling and content extraction is to identify the Web application and then perform the best crawling strategy accordingly.

Detecting Web application using:

1. URL patterns, 2. HTTP metadata, 3. textual content, 4. XPath patterns, etc.

For instance the vBulletinWeb forum content management system, that can be identified by searching for a reference to a

vbulletin_global.jsJavaScript script by using a simple //script/@srcXPath expression.

(8)

Adaptation to template change

Determine when a change has occurred wrt the template described in the knowledge base and adapting actions accordingly

Two cases:

Recrawl of a Web application Structural changes are detected by looking for the content in the archive. When structural change is detected, the system first marks the failed crawling actions and then aligns themusing content from the archive.

Crawl of a new Web application If actions in the knowledge base do not match, the system collects all candidate attributes, values, tag names from the knowledge base, creates all possible combinations ofrelaxed expressions, and then evaluates them.

(9)

Desiderata

Fully automated system for Web crawling

Crawls the content of template-based Web sites with fewer requests than traditional Web crawlers

Automatically identifies interestingnavigation patterns in a Web site Adaptsto the structure of new Web sites, based on unknown CMSs

(10)

Outline

Introduction

Adaptive Crawler Bot for Data Extraction (ACEBot)

Experiments

Conclusion

(11)

Structure-driven crawler

[ICADL 2015]

ACEBot (AdaptiveCrawlerBot for dataExtraction) is a fully automatic, structure-driven crawler.

ACEBot utilizesthe inner structure of the Web pages and guides the crawling process based on the importance of their content.

ACEBot has two phases:

Offline phase: constructs a dynamic site map (limiting the number of URLs retrieved), learns the best traversal strategy based on importance of navigation patterns (selecting those leading to valuable content).

Online phase: performsmassivecrawling following the chosen navigation patterns.

(12)

Simple example: structure of a Web page

html div div a table

thead table thead

tbody a tbody

a tbody

a

`1 html-table-thead-table-thead- tbody-a

`₂ html-div-div-a

`3 html-table-thead-tbody-a

(13)

Simple example: structure of a Web site

p0 p4

2039 p1

239

p2

754 p3

3227

p5

`2 2600

`₃

`¹

`1 `4

p0 p3 p4

5107 p1

239

p₂ 754

p5

2600

`2

`₃

`1 `4

(14)

Objective

Determine thenavigation patterns on a Web site that lead to most content items

Content items: can be anything, e.g., k-grams(succession of k words) Score of a navigation pattern: number of distinct items

number of requests Proposition

Finding a set of navigation patterns that optimizes the score is coNP-hard.

⇒ Greedyapproach, until sufficiently many content items have been retrieved

(15)

Objective

(16)

Objective

(17)

Language of navigation patterns

Non-disjunctive regular expressionsover link paths, expressed asOXPath expressions (language for Web navigation)

hexpri ::= "doc" "("hurli ")" (hestepi)+

hestepi ::= hstepi |hkleenei hstepi ::= "/" "("hactioni ")"

hkleenei ::= "/" "("hactioni ")" ( "∗" |hnumberi ) hactioni ::= ("/"hnodetesti)+ "{click/}"

hnodetesti ::= tag |"@" tag

(18)

Simple example: best navigation pattern

p₀ p₃ p₄

5107 p1

239

p2

754 p5

2600

(`2,239)

(`₃ ,754) (`₁,2553.5)

(`4,2600) NP total distinct score 2-grams 2-grams

`₁ 5266 5107 2553.5

`1, `4 7866 7214 2404.7

`3 754 754 754

`₂ 239 239 239

(19)

Simple example: best navigation pattern

p₀ p₃ p₄

5107 p1

239

p2

754 p5

2600

(`2,239)

(`₃ ,754) (`₁,2553.5)

(`4,2600) NP total distinct score 2-grams 2-grams

`₁ 5266 5107 2553.5

`1, `4 7866 7214 2404.7

`3 754 754 754

`₂ 239 239 239

(20)

ACEBot architecture

I

II I

Navigation Patterns Construction Sitemap

Construction

Navigation Patterns Selection

Crawler Repository

XML Knowledge

base

(21)

Outline

Introduction

Experiments

Conclusion

(22)

Experiment setup

The experiments are performed with ACEBot,AHHandGNU wget Crawled 50 Web sites (totaling nearly 2 million Web pages) with both ACEBot andwget

Crawled 10 Web sites (totaling nearly 0.5 million Web pages) with both ACEBotandAAH

Dynamic site map has been constructed with randomly chosen 3000 Web pages

(23)

Performance metrics

The number of HTTP requests made by both systems vs the amount of useful content retrieved

Coverage of useful content is calculated by comparing the proportion of 2-gramsin the crawl result of both systems for every WA and by counting the number of external links

(24)

Crawl efficiency

WordPress vBulletin phpBB other

0 10 20 30

Proportionof HTTP requests(%)

ACE AAH

(25)

Navigation pattern efficiency

0 2 4 6 8 10 12

0 2 4 6 8 10

Number of selected navigation patterns

NumberofHTTPrequests(×1,000)

requests 0 20 40 60 80 100

Proportionofseen2-grams(%)

2-grams

(26)

Crawling a particular Web site

0 2,000 4,000 6,000

0 100 200 300

Number of HTTP requests

Numberofdistinct2-grams(×1,000)

ACE AAH wget

(27)

Impact of the sitemap size

500 1,000 1,500 2,000 2,500 3,000 20

40 60 80 100

6 6

7 9

10 10

Size of site map

Proportionofseenn-grams(%)

(28)

Outline

Introduction

Experiments

Conclusion

(29)

Conclusion

In brief:

It is possible to retrieve the significant content of a Web sitemuch more efficientlythan by crawling it whole

Comparable resultsbetween fully automatic approach and approach relying on hand-written knowledge base

Sample, learn, apply to whole Web site

Main open direction:

More adaptive selection of sitemap size – adapt to the size of the Web site

Merci !

(30)

Conclusion

In brief:

It is possible to retrieve the significant content of a Web sitemuch more efficientlythan by crawling it whole

Comparable resultsbetween fully automatic approach and approach relying on hand-written knowledge base

Sample, learn, apply to whole Web site

Main open direction:

More adaptive selection of sitemap size – adapt to the size of the Web site

Merci !