INF344: Donn´ees et algorithmes du Web Web Crawling

(1)

INF344: Donn ´ees et algorithmes du Web

Web Crawling

(2)

2 / 35

14 May 2014

Licence de droits d’usage

Plan

Discovering new URLs

Identifying duplicates Crawling architecture Recrawling URLs?

Crawling the Hidden Web Crawling the Social Web Conclusion

(3)

Web Crawlers

crawlers,(Web) spiders,(Web) robots: autonomous user agents that retrieve pages from the Web

Basics of crawling:

1. Start from a given URL or set of URLs 2. Retrieve and process the corresponding page 3. Discover new URLs (cf. next slide)

4. Repeat on each found URL

No real termination condition (virtual unlimited number of Web pages!)

Graph-browsingproblem

deep-first: not very adapted, possibility of being lost inrobot traps

breadth-first

combination of both: breadth-first with limited-depth deep-first on each discovered website

(4)

4 / 35

14 May 2014

Sources of new URLs

From HTML pages:

hyperlinks <a href="...">...</a>

media <img src="..."> <embed src="...">

frames <frame src="..."> <iframe src="...">

JavaScript links window.open("...") etc.

Other hyperlinked content (e.g., PDF files)

Non-hyperlinked URLs that appear anywhere on the Web (in HTML text, text files, etc.): use regular expressions to extract them Referrer URLs

Sitemaps [sitemaps.org, 2008]

(5)

Scope of a crawler

Web-scale

The Web is infinite! Avoid robot traps by putting depth or page numberlimitson each Web server

Focus onimportantpages [Abiteboul et al., 2003]

Web servers under a list ofDNS domains: easy filtering of URLs A given topic: focused crawlingtechniques [Chakrabarti et al., 1999, Diligenti et al., 2000] based on classifiers of Web page content and predictors of the interest of a link.

The national Web (cf.public deposit, national libraries): what is this? [Abiteboul et al., 2002]

A given Web site: what is a Web site? [Senellart, 2005]

(6)

6 / 35

14 May 2014

Plan

Discovering new URLs Identifying duplicates

Crawling architecture Recrawling URLs?

(7)

A word about hashing

Definition

Ahash functionis a deterministic mathematical function transforming objects (numbers, character strings, binary. . . ) into fixed-size,

seemingly random, numbers. The more random the transformation is, the better.

Example

Java hash function for theStringclass:

n−1

X

i=0

s_i×31ⁿ⁻ⁱ⁻¹mod 2³²

wheres_i is the (Unicode) code of characteriof a strings.

(8)

8 / 35

14 May 2014

Identification of duplicate Web pages

Problem

Identifying duplicates or near-duplicates on the Web to prevent multiple indexing

trivial duplicates: same resource at the samecanonizedURL:

http://example.com:80/toto http://example.com/titi/../toto exact duplicates: identification byhashing

near-duplicates: (timestamps, tip of the day, etc.) more complex!

(9)

Near-duplicate detection

Edit distance. Count theminimum number of basic modifications (additions or deletions of characters or words, etc.) to obtain a document from another one. Good measure of similarity, and can be computed inO(mn)wheremandn are the size of the documents. But: does not scaleto a large collection of documents (unreasonable to compute the edit distance for every pair!).

Shingles. Idea: two documents similar if they mostly share the same succession ofk-grams(succession of tokens of lengthk).

Example

I like to watch the sun set with my friend.

My friend and I like to watch the sun set.

S={i like, like to, my friend, set with, sun set, the sun, to watch, watch the, with my}

T ={and i, friend and, i like, like to, my friend, sun set, the sun, to watch, watch the}

(10)

10 / 35

14 May 2014

Hashing shingles to detect dupli- cates [Broder et al., 1997]

Similarity: Jaccard coefficienton the set of shingles:

J(S,T) = |S∩T|

|S∪T|

Stillcostly to compute! But can be approximated as follows:

1. ChooseN different hash functions

2. For each hash functionh_i and each set of shingles Sk ={s_k1. . .skn}, storeφ_ik =minjhi(skj)

3. ApproximateJ(Sk,Sl)as theproportionofφ_ikandφ_il that are equal Possibly to repeat in a hierarchical way withsuper-shingles(we are only interested inverysimilar documents)

(11)

Plan

Discovering new URLs Identifying duplicates Crawling architecture

Recrawling URLs?

(12)

12 / 35

14 May 2014

Crawling ethics

Standard for robot exclusion: robots.txtat the root of a Web server [Koster, 1994].

User-agent: *

Allow: /searchhistory/

Disallow: /search Per-page exclusion.

Per-link exclusion.

AvoidDenial Of Service(DOS), wait≈1s between two repeated requests to the same Web server

(13)

Parallel processing

Network delays, waits between requests:

Per-server queueof URLs

Parallel processing of requests to different hosts:

multi-threadedprogramming

asynchronousinputs and outputs (select, classes from java.util.concurrent): less overhead

Use ofkeep-aliveto reduce connexion overheads

(14)

General Architecture [Chakrabarti, 2003]

(15)

Plan

Discovering new URLs Identifying duplicates Crawling architecture Recrawling URLs?

(16)

16 / 35

14 May 2014

Refreshing URLs

Content on the Webchanges Differentchange rates:

online newspaper main page: every hour or so published article: virtually no change

Continuouscrawling, and identification of change rates for adaptivecrawling: how to know thetime of last modificationof a Web page?

(17)

Estimating the Freshness of a Page

1. Check HTTP timestamp.

2. Check content timestamp.

3. Compare a hash of the page with a stored hash.

4. Non-significant differences (ads, fortunes, request timestamp):

only hash text content, or “useful” text content;

compare distribution ofn-grams (shingling);

or even compute edit distance with previous version.

Adapting strategy to each different archived website?

(18)

18 / 35

14 May 2014

Plan

(19)

The Deep Web

Definition (Deep Web, Hidden Web, Invisible Web)

All the content on the Web that is not directly accessible through hyperlinks. In particular: HTML forms, Web services.

Size estimate: 500 times more content than on thesurface Web!

[BrightPlanet, 2000]. Hundreds of thousands of deep Web databases [Chang et al., 2004]

(20)

20 / 35

14 May 2014

Sources of the Deep Web

Example

Yellow Pagesand other directories;

Library catalogs;

Weather services;

US Census Bureau data;

etc.

(21)

Discovering Knowledge from the Deep Web [Varde et al., 2009]

Content of the deep Web hidden to classical Web search engines (they just follow links)

But very valuable and high quality!

Even services allowing access through the surface Web (e.g., e-commerce) have more semantics when accessed from the deep Web

How tobenefitfrom this information?

Focus here: Automatic, unsupervised, methods

(22)

22 / 35

14 May 2014

Extensional Approach

WWW discovery

siphoning

bootstrap Index

indexing

(23)

Notes on the Extensional Approach

Main issues:

Discovering services

Choosing appropriate data to submit forms

Use of data found in result pages to bootstrap the siphoning process

Ensure good coverage of the database

Approachfavored by Google, used in production [Madhavan et al., 2006]

Not always feasible (huge load on Web servers)

(24)

Intensional Approach

WWW discovery

probing analyzing

Form wrapped as a Web service

query

(25)

Notes on the Intensional Approach

Moreambitious[Chang et al., 2005, Senellart et al., 2008]

Main issues:

Discovering services

Understanding the structure and semantics of a form Understanding the structure and semantics of result pages Semantic analysis of the service as a whole

No significant load imposed on Web servers

(26)

26 / 35

14 May 2014

Forms

Analyzing thestructureof HTML forms.

Authors

Title Year Page

Conference ID

Journal Volume Number

Maximum of matches

Goal

Associating to each form field the appropriatedomain concept.

(27)

1

^st

Step: Structural Analysis

1. Build acontextfor each field:

labeltag;

idandnameattributes;

text immediately before the field.

2. Removestop words,stem(see lecture on IR).

3. Matchthis context with the concept names, extended with WordNet.

4. Obtain in this waycandidate annotations.

(28)

27 / 35

14 May 2014

1

^st

Step: Structural Analysis

labeltag;

(29)

1

^st

Step: Structural Analysis

labeltag;

(30)

27 / 35

14 May 2014

1

^st

Step: Structural Analysis

labeltag;

(31)

2

^nd

Step: Confirm Annotations w/ Probing

For each field annotated with a conceptc:

1. Probe the field with nonsense word to get anerror page.

2. Probethe field with instances ofc(chosen representatively of the frequency distribution ofc).

3. Compare pages obtained by probing with the error page (by clustering along the DOM tree structure of the pages), to distinguish error pages andresult pages.

4. Confirmthe annotation if enough result pages are obtained.

(32)

28 / 35

14 May 2014

2

^nd

Step: Confirm Annotations w/ Probing

(33)

2

^nd

Step: Confirm Annotations w/ Probing

(34)

28 / 35

14 May 2014

2

^nd

Step: Confirm Annotations w/ Probing

(35)

How well does this work?

Good resultsin practice [Senellart et al., 2008]

Probing raises precisionwithout hurting recall

But some critical assumptions:

It is possible to query a field with asubword All form fields areindependent

No field isrequired

(36)

29 / 35

14 May 2014

How well does this work?

Good resultsin practice [Senellart et al., 2008]

Probing raises precisionwithout hurting recall But some critical assumptions:

It is possible to query a field with asubword All form fields areindependent

No field isrequired

(37)

Plan

(38)

31 / 35

14 May 2014

The Social Web

Social Web sites, with user-contributed content: Facebook, Twitter, YouTube, Google+, Foursquare. . .

They are part of the Web (private or public) Can be crawled as regular Web content . . . but sometimesimpossible(see

http://www.facebook.com/robots.txt) orinefficient(lots of Web pages crawled with little relevant information)

Better: usingsocial networking APIs https://dev.twitter.com/docs/api https://developers.facebook.com/

(39)

Challenges on using SN APIs

Strict policyrate limitations(e.g.,

https://dev.twitter.com/docs/api/1.1/get/search/tweets) Complexauthenticationschemes (typically based on the OAuth protocol)

Some information visible to the user may not be available in the API

Constrainedby the existing API access methods (but often still better than blindly following hyperlinks)

How to integrate aregular Web crawlerwith aSN crawler?

Largely open.

Numerous libraries facilitating SN API accesses (twipy,

Facebook4J, FourSquare VP C++ API. . . ) incompatible with each other. . . Some efforts at generic APIs (OneAll, APIBlender).

(40)

33 / 35

14 May 2014

Plan

(41)

Conclusion

What you should remember

Crawling as agraph-browsingproblem.

Shinglingfor identifying duplicates.

Numerousengineering issuesin building a Web-scale crawler.

Harder to crawl non-surface Web content.

(42)

35 / 35

14 May 2014

References

Software

Wget, a simple yet effective Web spider (free software)

Heritrix, a Web-scale highly configurable Web crawler, used by the Internet Archive (free software)

HTML Parser, TagSoup: Java libraries for parsing real-world Web pages

To go further

A good textbook [Chakrabarti, 2003]

Main references:

HTML 4.01 recommendation [W3C, 1999]

HTTP/1.1 RFC [IETF, 1999]

(43)

Bibliography I

Serge Abiteboul, Gr ´egory Cobena, Julien Masan `es, and Gerald Sedrati. A first experience in archiving the French Web. InProc.

ECDL, Roma, Italie, September 2002.

Serge Abiteboul, Mihai Preda, and Gregory Cobena. Adaptive on-line page importance computation. InProc. WWW, May 2003.

BrightPlanet. The deep Web: Surfacing hidden value. White Paper, July 2000.

Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic clustering of the Web. Computer Networks, 29(8-13):1157–1166, 1997.

Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, San Fransisco, USA, 2003.

Soumen Chakrabarti, Martin van den Berg, and Byron Dom. Focused crawling: A new approach to topic-specific Web resource discovery.

(44)

Bibliography II

Kevin Chen-Chuan Chang, Bin He, Chengkai Li, Mitesh Patel, and Zhen Zhang. Structured databases on the Web: Observations and implications. SIGMOD Record, 33(3):61–70, September 2004.

Kevin Chen-Chuan Chang, Bin He, and Zhen Zhang. Toward large scale integration: Building a metaquerier over databases on the Web. InProc. CIDR, Asilomar, USA, January 2005.

Michelangelo Diligenti, Frans Coetzee, Steve Lawrence, C. Lee Giles, and Marco Gori. Focused crawling using context graphs. InProc.

VLDB, Cairo, Egypt, September 2000.

IETF. Request For Comments 2616. Hypertext transfer

protocol—HTTP/1.1. http://www.ietf.org/rfc/rfc2616.txt, June 1999.

Martijn Koster. A standard for robot exclusion.

http://www.robotstxt.org/orig.html, June 1994.

(45)

Bibliography III

Jayant Madhavan, Alon Y. Halevy, Shirley Cohen, Xin Dong, Shawn R.

Jeffery, David Ko, and Cong Yu. Structured data meets the Web: A few observations. IEEE Data Engineering Bulletin, 29(4):19–26, December 2006.

Pierre Senellart. Identifying Websites with flow simulation. InProc.

ICWE, pages 124–129, Sydney, Australia, July 2005.

Pierre Senellart, Avin Mittal, Daniel Muschick, R ´emi Gilleron, and Marc Tommasi. Automatic wrapper induction from hidden-Web sources with domain knowledge. InProc. WIDM, pages 9–16, Napa, USA, October 2008.

sitemaps.org. Sitemaps XML format.

http://www.sitemaps.org/protocol.php, February 2008.

Aparna Varde, Fabian M. Suchanek, Richi Nayak, and Pierre Senellart.

Knowledge discovery over the deep Web, semantic Web and XML.

InProc. DASFAA, pages 784–788, Brisbane, Australia, April 2009.

(46)

Bibliography IV

W3C. HTML 4.01 specification, September 1999.

http://www.w3.org/TR/REC-html40/.

(47)

Licence de droits d’usage

Contexte public}avec modifications

Par le t él échargement ou la consultation de ce document, l’utilisateur accepte la licence d’utilisation qui y est attach ée, telle que d étaill ée dans les dispositions suivantes, et s’engage à la respecter int égralement.

La licence conf ère à l’utilisateur un droit d’usage sur le document consult é ou t él écharg é, totalement ou en partie, dans les conditions d éfinies ci-apr ès et à l’exclusion expresse de toute utilisation commerciale.

Le droit d’usage d ´efini par la licence autorise un usage `a destination de tout public qui comprend : – le droit de reproduire tout ou partie du document sur support informatique ou papier,

– le droit de diffuser tout ou partie du document au public sur support papier ou informatique, y compris par la mise à la disposition du public sur un r éseau num érique,

– le droit de modifier la forme ou la pr ´esentation du document,

– le droit d’int égrer tout ou partie du document dans un document composite et de le diffuser dans ce nouveau document, à condition que : – L’auteur soit inform é.

Les mentions relatives à la source du document et/ou à son auteur doivent être conserv ées dans leur int égralit é.

Le droit d’usage d ´efini par la licence est personnel et non exclusif.

Tout autre usage que ceux pr évus par la licence est soumis à autorisation pr éalable et expresse de l’auteur :[email protected]