Web search
Web data management and distribution
Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, Pierre Senellart
Web Data Management and Distribution http://webdam.inria.fr/textbook
September 18, 2013
Outline
1 Web crawling
Discovering new URLs
Identifying duplicates
Crawling architecture
2 Web Information Retrieval
3 Web Graph Mining
4 Conclusion
Web crawling Discovering new URLs
Web Crawlers
Crawlers, (Web) spiders, (Web) robots: autonomous user agents that retrieve pages from the Web
Basics of crawling:
1 Start from a given URL or set of URLs
2 Retrieve and process the corresponding page
3 Discover new URLs (cf. next slide)
4 Repeat on each found URL
No real termination condition (virtually unlimited number of Web pages!)
Graph-browsing problem:
depth-first: not well adapted, risk of getting lost in robot traps
breadth-first
combination of both: breadth-first crawl, with limited-depth depth-first traversal of each discovered website
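The breadth-first strategy with a depth limit can be sketched in a few lines. This is a toy sketch: `get_links` abstracts away page retrieval and URL extraction, which a real crawler would do over HTTP; the function names and default limits are illustrative.

```python
from collections import deque

def crawl_bfs(seed_urls, get_links, max_pages=100, max_depth=3):
    """Breadth-first crawl with a depth limit (a simple guard against robot traps).

    get_links(url) abstracts page retrieval and URL discovery; a real
    crawler would fetch the page over HTTP and parse it.
    """
    seen = set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)
    crawled = []
    while queue and len(crawled) < max_pages:
        url, depth = queue.popleft()
        crawled.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: do not expand this page further
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return crawled

# A tiny in-memory "Web" for illustration
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["a"]}
order = crawl_bfs(["a"], lambda u: graph.get(u, []))
```

The per-site depth limit would, in a real crawler, be tracked per Web server rather than globally as here.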
Sources of new URLs
From HTML pages:
- hyperlinks <a href="...">...</a>
- media <img src="..."> <embed src="..."> <object data="...">
- frames <frame src="..."> <iframe src="...">
- JavaScript links window.open("...")
- etc.
Other hyperlinked content (e.g., PDF files)
Non-hyperlinked URLs that appear anywhere on the Web (in HTML text, text files, etc.): use regular expressions to extract them
Referrer URLs
Sitemaps [sit08]
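Extracting non-hyperlinked URLs with a regular expression might look as follows. The pattern and helper are illustrative only; a production crawler would follow the URI grammar (RFC 3986) more carefully and also resolve relative links against a base URL.

```python
import re

# Deliberately simple pattern for absolute http(s) URLs in free text
URL_RE = re.compile(r'https?://[^\s"\'<>]+')

def extract_urls(text):
    """Find absolute http(s) URLs in text, trimming trailing punctuation."""
    return [u.rstrip('.,;)') for u in URL_RE.findall(text)]
```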
Scope of a crawler
Web-scale
- The Web is infinite! Avoid robot traps by putting depth or page-number limits on each Web server
- Focus on important pages [APC03] (cf. lecture on the Web graph)
Web servers under a list of DNS domains: easy filtering of URLs
A given topic: focused crawling techniques [CvdBD99, DCL+00] based on classifiers of Web page content and predictors of the interest of a link
The national Web (cf. public deposit, national libraries): what is this? [ACMS02]
A given Web site: what is a Web site? [Sen05]
A word about hashing
Definition
A hash function is a deterministic mathematical function transforming objects (numbers, character strings, binary data, etc.) into fixed-size, seemingly random numbers. The more random the transformation, the better.
Example
Java hash function for the String class:

∑_{i=0}^{n−1} s_i × 31^{n−i−1} mod 2^32

where s_i is the (Unicode) code of character i of a string s.
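The same formula, written Horner-style in Python (a sketch: Java's actual String.hashCode wraps to a signed 32-bit integer, while the formula on this slide keeps the value unsigned mod 2^32):

```python
def java_string_hash(s):
    """Sum of s_i * 31^(n-i-1) mod 2^32, computed Horner-style:
    ((s_0 * 31 + s_1) * 31 + ...) mod 2^32."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) % 2**32
    return h
```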
Identification of duplicate Web pages
Problem
Identifying duplicates or near-duplicates on the Web to prevent multiple indexing

trivial duplicates: same resource at the same canonized URL:
http://example.com:80/toto http://example.com/titi/../toto

exact duplicates: identification by hashing
near-duplicates (timestamps, tip of the day, etc.): more complex!
Near-duplicate detection
Edit distance. Count the minimum number of basic modifications (additions or deletions of characters or words, etc.) needed to obtain one document from another. A good measure of similarity, computable in O(mn) where m and n are the sizes of the documents. But: does not scale to a large collection of documents (unreasonable to compute the edit distance for every pair!).

Shingles. Idea: two documents are similar if they mostly share the same succession of k-grams (successions of tokens of length k).
Example
I like to watch the sun set with my friend.
My friend and I like to watch the sun set.
S = {i like, like to, my friend, set with, sun set, the sun, to watch, watch the, with my}
T = {and i, friend and, i like, like to, my friend, sun set, the sun, to watch, watch the}
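The shingle sets above can be computed in a few lines. A sketch (the `jaccard` helper anticipates the similarity measure defined on the next slide):

```python
def shingles(text, k=2):
    """Set of word k-grams of a text (k = 2 in the example above)."""
    tokens = text.lower().replace(".", "").split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(s, t):
    """Jaccard coefficient |S ∩ T| / |S ∪ T| of two shingle sets."""
    return len(s & t) / len(s | t)

S = shingles("I like to watch the sun set with my friend.")
T = shingles("My friend and I like to watch the sun set.")
```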
Web crawling Identifying duplicates
Hashing shingles to detect duplicates [BGMZ97]
Similarity: Jaccard coefficient on the sets of shingles:

J(S,T) = |S ∩ T| / |S ∪ T|

Still costly to compute! But can be approximated as follows:
1 Choose N different hash functions h_i
2 For each hash function h_i and each set of shingles S_k = {s_k1 ... s_kn}, store φ_ik = min_j h_i(s_kj)
3 Approximate J(S_k, S_l) as the proportion of the φ_ik and φ_il that are equal
Possibly repeated in a hierarchical way with super-shingles (we are only interested in very similar documents)
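A minimal sketch of this minhash approximation. The linear hash family and the `str_to_int` encoding below are toy stand-ins: a real implementation would use a properly independent family of hash functions.

```python
def str_to_int(s):
    """Deterministic string-to-integer encoding (toy stand-in for a real hash)."""
    n = 0
    for ch in s:
        n = n * 256 + ord(ch)
    return n

def make_hash(a):
    """A simple linear hash function modulo a large prime (illustrative only)."""
    return lambda s: (a * str_to_int(s) + 7) % 2147483647

def signature(shingle_set, hash_funcs):
    """phi_i = min_j h_i(s_j): one minimum per hash function."""
    return [min(h(s) for s in shingle_set) for h in hash_funcs]

def estimate_jaccard(sig1, sig2):
    """Proportion of signature positions on which the two sets agree."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

hashes = [make_hash(a) for a in range(3, 403, 4)]  # N = 100 hash functions
```

With N = 100 functions the estimate is typically within a few percentage points of the true Jaccard coefficient.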
Crawling ethics
Standard for robot exclusion: robots.txt at the root of a Web server [Kos94].
User-agent: *
Allow: /searchhistory/
Disallow: /search
Per-page exclusion (de facto standard).
<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">
Per-link exclusion (de facto standard).
<a href="toto.html" rel="nofollow">Toto</a>
Avoid Denial of Service (DoS): wait 100 ms to 1 s between two repeated requests to the same Web server
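Python's standard library can interpret such rules. A sketch using `urllib.robotparser` on the example above (the rules are parsed directly here rather than fetched from a server; the crawler name is illustrative):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Feed the rules directly; a real crawler would fetch http://host/robots.txt
rp.parse("""\
User-agent: *
Allow: /searchhistory/
Disallow: /search
""".splitlines())

# The first matching rule wins; unmatched paths are allowed by default
rp.can_fetch("MyCrawler", "http://example.com/search")  # disallowed
```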
Web crawling Crawling architecture
Parallel processing
Network delays, waits between requests:
Per-server queue of URLs
Parallel processing of requests to different hosts:
- multi-threaded programming
- asynchronous inputs and outputs (select, classes from java.util.concurrent): less overhead
Use of keep-alive to reduce connection overheads
Refreshing URLs
Content on the Web changes. Different change rates:
online newspaper main page: every hour or so
published article: virtually no change
Continuous crawling, and identification of change rates for adaptive crawling:
- If-Modified-Since HTTP feature (not reliable)
- Identification of duplicates in successive requests
Outline
1 Web crawling
2 Web Information Retrieval
Text Preprocessing
Inverted Index
Answering Keyword Queries
3 Web Graph Mining
4 Conclusion
Web Information Retrieval
Information Retrieval, Search
Problem
How to index Web content so as to answer (keyword-based) queries efficiently?
Context: a set of text documents
d1 The jaguar is a New World mammal of the Felidae family.
d2 Jaguar has designed four new engines.
d3 For Jaguar, Atari was keen to use a 68K family device.
d4 The Jacksonville Jaguars are a professional US football team.
d5 Mac OS X Jaguar is available at a price of US $199 for Apple’s new “family pack”.
d6 One such ruling family to incorporate the jaguar into their name is Jaguar Paw.
d7 It is a big cat.
Text Preprocessing
Initial text preprocessing steps
A number of optional steps
Highly depends on the application
Highly depends on the document language (illustrated here with English)
Web Information Retrieval Text Preprocessing
Language Identification
How to find the language used in a document?
Meta-information about the document: often not reliable!
Unambiguous scripts or letters: not very common!
한글 カタカナ
Għarbi þorn
Respectively: Korean Hangul, Japanese Katakana, Maldivian Dhivehi, Maltese, Icelandic
Extension of this: frequent characters, or, better, frequent k-grams
Use standard machine learning techniques (classifiers)
Tokenization
Principle
Separate text into tokens (words). Not so easy!
In some languages (Chinese, Japanese), words are not separated by whitespace
Deal consistently with acronyms, elisions, numbers, units, URLs, emails, etc.
Compound words: hostname, host-name and host name. Break into two tokens or regroup them as one token? In any case, a lexicon and linguistic analysis are needed! Even more so in other languages such as German.
Usually, punctuation is removed and case is normalized at this point
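A minimal English tokenizer along these lines (a sketch: real tokenizers handle acronyms, elisions, units, URLs, etc. far more carefully than this single regular expression):

```python
import re

def tokenize(text):
    """Lowercase, drop punctuation, split into tokens.
    Returns 1-based (position, token) pairs, as in the examples."""
    tokens = re.findall(r"[a-z0-9$']+", text.lower())
    return list(enumerate(tokens, start=1))
```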
Tokenization: Example
d1 the1 jaguar2 is3 a4 new5 world6 mammal7 of8 the9 felidae10 family11
d2 jaguar1 has2 designed3 four4 new5 engines6
d3 for1 jaguar2 atari3 was4 keen5 to6 use7 a8 68k9 family10 device11
d4 the1 jacksonville2 jaguars3 are4 a5 professional6 us7 football8 team9
d5 mac1 os2 x3 jaguar4 is5 available6 at7 a8 price9 of10 us11 $19912 for13 apple's14 new15 family16 pack17
d6 one1 such2 ruling3 family4 to5 incorporate6 the7 jaguar8 into9 their10 name11 is12 jaguar13 paw14
d7 it1 is2 a3 big4 cat5
Stemming
Principle
Merge different forms of the same word, or of closely related words, into a single stem
Not in all applications!
Useful for retrieving documents containing geese when searching for goose
Various degrees of stemming
Possibility of building different indexes, with different stemmings
Stemming schemes (1/2)
Morphological stemming.
Remove bound morphemes from words:
- plural markers
- gender markers
- tense or mood inflections
- etc.
Can be linguistically very complex, cf.:
Les poules du couvent couvent.
[The hens of the convent brood.]
In English, somewhat easy:
- Remove final -s, -'s, -ed, -ing, -er, -est
- Take care of semiregular forms (e.g., -y/-ies)
- Take care of irregular forms (mouse/mice)
But still some ambiguities: cf. stocking, rose
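A deliberately crude suffix-stripping stemmer illustrating the English case (a sketch only: it handles none of the irregular forms and introduces its own ambiguities, which is exactly why algorithms such as Porter's are more involved):

```python
def naive_stem(word):
    """Very crude English morphological stemming: strip a few common suffixes."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"  # semiregular -y/-ies forms
    for suffix in ("'s", "ing", "ed", "er", "est", "s"):
        # only strip when a plausible stem (>= 3 characters) remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word
```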
Stemming schemes (2/2)
Lexical stemming.
Merge lexically related terms of various parts of speech, such as policy, politics, political or politician
For English, Porter's stemming [Por80]; stems university and universal to univers: not perfect!
Possibility of coupling this with lexicons to merge (near-)synonyms
Phonetic stemming.
Merge phonetically related words: search despite spelling errors!
For English, Soundex [US 07] stems Robert and Rupert to R163. Very coarse!
Stemming Example
d1 the1 jaguar2 be3 a4 new5 world6 mammal7 of8 the9 felidae10 family11
d2 jaguar1 have2 design3 four4 new5 engine6
d3 for1 jaguar2 atari3 be4 keen5 to6 use7 a8 68k9 family10 device11
d4 the1 jacksonville2 jaguar3 be4 a5 professional6 us7 football8 team9
d5 mac1 os2 x3 jaguar4 be5 available6 at7 a8 price9 of10 us11 $19912 for13 apple14 new15 family16 pack17
d6 one1 such2 rule3 family4 to5 incorporate6 the7 jaguar8 into9 their10 name11 be12 jaguar13 paw14
d7 it1 be2 a3 big4 cat5
Stop Word Removal
Principle
Remove uninformative words from documents, in particular to lower the cost of storing the index
determiners: a, the, this, etc.
function verbs: be, have, make, etc.
conjunctions: that, and, etc.
etc.
Stop Word Removal Example
d1 jaguar2 new5 world6 mammal7 felidae10 family11
d2 jaguar1 design3 four4 new5 engine6
d3 jaguar2 atari3 keen5 68k9 family10 device11
d4 jacksonville2 jaguar3 professional6 us7 football8 team9
d5 mac1 os2 x3 jaguar4 available6 price9 us11 $19912 apple14 new15 family16 pack17
d6 one1 such2 rule3 family4 incorporate6 jaguar8 their10 name11 jaguar13 paw14
d7 big4 cat5
Web Information Retrieval Inverted Index
Inverted Index construction
After all preprocessing, construction of an inverted index:
Index of all terms, with the list of documents where each term occurs
Small scale: disk storage, with memory-mapping (cf. mmap) techniques; secondary index for the offset of each term in the main index
Large scale: distributed on a cluster of machines; hashing gives the machine responsible for a given term
Updating the index is costly, so only batch operations (not one-by-one addition of term occurrences)
Inverted Index Example
family d1, d3, d5, d6
football d4
jaguar d1, d2, d3, d4, d5, d6
new d1, d2, d5
rule d6
us d4, d5
world d1
...

Note:
the length of an inverted (posting) list is highly variable; scanning short lists first is an important optimization.
entries are homogeneous: this gives much room for compression.
Storing positions in the index
phrase queries, NEAR operator: need to keep position information in the index
just add it to the document list!

family d1/11, d3/10, d5/16, d6/4
football d4/8
jaguar d1/2, d2/1, d3/2, d4/3, d5/4, d6/8+13
new d1/5, d2/5, d5/15
rule d6/3
us d4/7, d5/11
world d1/6
...

⇒ so far, ok for Boolean queries: find the documents that contain a set of keywords; reject the others.

TF-IDF Weighting
The inverted index is extended by adding Term Frequency-Inverse Document Frequency weighting:

tfidf(t,d) = ( n_{t,d} / ∑_{t′} n_{t′,d} ) · log( |D| / |{d′ ∈ D | n_{t,d′} > 0}| )

n_{t,d}: number of occurrences of t in d
D: set of all documents

Documents (along with their weight) are stored in decreasing weight order in the index
TF-IDF Weighting Example
family d1/11/.13, d3/10/.13, d6/4/.08, d5/16/.07
football d4/8/.47
jaguar d1/2/.04, d2/1/.04, d3/2/.04, d4/3/.04, d6/8+13/.04, d5/4/.02
new d2/5/.24, d1/5/.20, d5/15/.10
rule d6/3/.28
us d4/7/.30, d5/11/.15
world d1/6/.47
...

Exercise: take an entry, and check that the tf-idf value is indeed correct (take documents after stop-word removal).
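The exercise can be checked mechanically. Below is a sketch of the tf-idf formula applied to the documents after stemming and stop-word removal; the weights in the example are consistent with a base-2 logarithm:

```python
import math

def tfidf(term, doc, docs):
    """tf-idf as defined above, with a base-2 logarithm."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)  # number of documents containing term
    return tf * math.log2(len(docs) / df)

# Documents after stemming and stop-word removal (running example)
docs = [
    ["jaguar", "new", "world", "mammal", "felidae", "family"],            # d1
    ["jaguar", "design", "four", "new", "engine"],                         # d2
    ["jaguar", "atari", "keen", "68k", "family", "device"],                # d3
    ["jacksonville", "jaguar", "professional", "us", "football", "team"],  # d4
    ["mac", "os", "x", "jaguar", "available", "price", "us", "$199",
     "apple", "new", "family", "pack"],                                    # d5
    ["one", "such", "rule", "family", "incorporate", "jaguar", "their",
     "name", "jaguar", "paw"],                                             # d6
    ["big", "cat"],                                                        # d7
]
round(tfidf("football", docs[3], docs), 2)  # 0.47, as in the example
```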
Answering Boolean Queries
Single-keyword query: just consult the index and return the documents in index order.
Boolean multi-keyword query:
(jaguar AND new AND NOT family) OR cat
Same way! Retrieve the document lists of all keywords and apply the adequate set operations:
AND: intersection
OR: union
AND NOT: difference
Global score: some function of the individual weights (e.g., addition for conjunctive queries)
Position queries: consult the index, and filter by the appropriate condition
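With posting lists represented as sets, the example query maps directly onto set operations (lists taken from the running example; the cat list is derived from d7):

```python
# Posting lists from the running example (documents only, no positions)
index = {
    "jaguar": {"d1", "d2", "d3", "d4", "d5", "d6"},
    "new":    {"d1", "d2", "d5"},
    "family": {"d1", "d3", "d5", "d6"},
    "cat":    {"d7"},
}

# (jaguar AND new AND NOT family) OR cat
result = ((index["jaguar"] & index["new"]) - index["family"]) | index["cat"]
```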
Web Information Retrieval Answering Keyword Queries
Answering Top-k Queries
t1 AND ... AND tn

Problem
Find the top-k results (for some given k) to the query, without retrieving all documents matching it.

Notations:
s(t,d): weight of t in d (e.g., tfidf)
g(s1, ..., sn): monotonous function that computes the global score (e.g., addition)
The Threshold Algorithm
1 Let R be the empty list, and m = +∞.
2 For each 1 ≤ i ≤ n:
  1 Retrieve the document d(i) containing term ti that has the next largest s(ti, d(i)).
  2 Compute its global score g_{d(i)} = g(s(t1, d(i)), ..., s(tn, d(i))) by retrieving all s(tj, d(i)) with j ≠ i.
  3 If R contains less than k documents, or if g_{d(i)} is greater than the minimum of the scores of the documents in R, add d(i) to R.
3 Let m = g(s(t1, d(1)), s(t2, d(2)), ..., s(tn, d(n))).
4 If R contains at least k documents, and the minimum of the scores of the documents in R is greater than or equal to m, return R.
5 Redo step 2.
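A sketch of the algorithm, with addition as the global score g and missing weights counted as 0 (which matches the OR semantics of the example on the next slide); the data layout and names are illustrative:

```python
def threshold_algorithm(lists, k):
    """lists: {term: [(doc, weight), ...]}, each sorted by decreasing weight.
    Returns the top-k (doc, score) pairs, score = sum of per-term weights."""
    random_access = {t: dict(lst) for t, lst in lists.items()}
    cursors = {t: 0 for t in lists}
    scores = {}  # doc -> global score, computed once the doc is first seen
    while True:
        threshold = 0.0
        for t, lst in lists.items():
            i = cursors[t]
            if i >= len(lst):
                continue  # this posting list is exhausted
            doc, w = lst[i]
            cursors[t] = i + 1
            threshold += w
            # global score via random access to the other lists
            scores[doc] = sum(random_access[u].get(doc, 0.0) for u in lists)
        top = sorted(scores.items(), key=lambda kv: -kv[1])[:k]
        done = all(cursors[t] >= len(lists[t]) for t in lists)
        if done or (len(top) == k and top[-1][1] >= threshold):
            return top
```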
The TA, by example
q = "new OR family", and k = 3. We use inverted lists sorted on the weight:
family d1/11/.13, d3/10/.13, d6/4/.08, d5/16/.07
new d2/5/.24, d1/5/.20, d5/15/.10
...

Initially, R = ∅ and τ = +∞.
1 d(1) is the first entry in L_family; one finds s(new, d1) = .20; the global score for d1 is .13 + .20 = .33.
2 Next, i = 2, and one finds that the global score for d2 is .24.
3 The algorithm quits the loop on i with R = ⟨[d1, .33], [d2, .24]⟩ and τ = .13 + .24 = .37.
4 We proceed with the loop again, taking d3 with score .13 and d5 with score .17. [d5, .17] is added to R (at the end) and τ is now .10 + .13 = .23.
5 A last loop concludes that the next candidate is d6, with a global score of .08. Then we are done.
Outline
1 Web crawling
2 Web Information Retrieval
3 Web Graph Mining
PageRank
HITS
Spamdexing
4 Conclusion
Web Graph Mining
The Web Graph
The World Wide Web seen as a (directed) graph:
Vertices: Web pages
Edges: hyperlinks
Same for other interlinked environments:
dictionaries, encyclopedias, scientific publications, social networks
The transition matrix
g_ij = 0 if there is no link from page i to page j; g_ij = 1/n_i otherwise, with n_i the number of outgoing links of page i.

G =
     0    1    0    0    0    0    0    0    0    0
     0    0   1/4   0    0   1/4  1/4   0   1/4   0
     0    0    0   1/2  1/2   0    0    0    0    0
     0    1    0    0    0    0    0    0    0    0
     0    0    0    0    0   1/2   0    0    0   1/2
    1/3  1/3   0   1/3   0    0    0    0    0    0
     0    0    0    0    0   1/3   0   1/3   0   1/3
     0   1/3   0    0    0    0    0    0   1/3  1/3
     0   1/2  1/2   0    0    0    0    0    0    0
     0    0    0    0    1    0    0    0    0    0
Web Graph Mining PageRank
PageRank (Google’s Ranking [BP98])
Idea
Important pages are pages pointed to by important pages.

PageRank simulates a random walk by iteratively computing the PR of each page, represented as a vector v.
Initially, v is set using a uniform distribution (v[i] = 1/|v|).

Definition (Tentative)
Probability that the surfer following the random walk in G has arrived on page i at some distant given point in the future:

pr(i) = ( lim_{k→+∞} (Gᵀ)ᵏ v )_i

where v is some initial column vector.
PageRank Iterative Computation

[Figure series: iterative computation of PageRank on the ten-page example graph. Starting from the uniform vector (0.100 on every page), the scores evolve quickly over the first few iterations and converge after roughly a dozen iterations to a fixed point, with page scores ranging from about 0.019 to 0.234.]
PageRank With Damping
May not always converge, or convergence may not be unique.
To fix this, the random surfer can at each step randomly jump to any page of the Web with some probability d (1 − d: damping factor):

pr(i) = ( lim_{k→+∞} ((1 − d) Gᵀ + d U)ᵏ v )_i

where U is the matrix with all values equal to 1/N, N being the number of vertices.
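Power iteration with damping can be sketched as follows. The graph is the ten-page example (0-indexed); d = 0.15 is an illustrative jump probability, and every page is assumed to have at least one outgoing link (true of the example graph; dangling pages would need special handling):

```python
def pagerank(out_links, d=0.15, iterations=100):
    """Power iteration for v <- ((1-d) G^T + d U) v.
    out_links[i]: list of pages that page i points to."""
    n = len(out_links)
    v = [1.0 / n] * n  # uniform initial distribution
    for _ in range(iterations):
        new = [d / n] * n                      # the dU part: random jump
        for i, out in enumerate(out_links):
            share = (1 - d) * v[i] / len(out)  # the (1-d)G^T part
            for j in out:
                new[j] += share
        v = new
    return v

# The ten-page example graph, pages numbered from 0
graph = [
    [1], [2, 5, 6, 8], [3, 4], [1], [5, 9],
    [0, 1, 3], [5, 7, 9], [1, 8, 9], [1, 2], [4],
]
scores = pagerank(graph)
```

As in the undamped iteration pictured above, page 2 of the example (index 1 here) ends up with the highest score.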
Using PageRank to Score Query Results
PageRank: global score, independent of the query
Can be used to raise the weight of important pages:

weight(t,d) = tfidf(t,d) × pr(d)

This can be directly incorporated in the index.

HITS (Kleinberg, [Kle99])
Idea
Two kinds of important pages: hubs and authorities. Hubs are pages that point to good authorities, whereas authorities are pages that are pointed to by good hubs.

G′: transition matrix (with 0 and 1 values) of a subgraph of the Web. We use the following iterative process (starting with a and h vectors of norm 1):

a := G′ᵀh / ‖G′ᵀh‖
h := G′a / ‖G′a‖

Converges under some technical assumptions to authority and hub scores.
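The iteration can be sketched directly from the two update rules above (a toy sketch on a 0/1 adjacency matrix; the matrix layout and iteration count are illustrative):

```python
import math

def hits(adj, iterations=50):
    """HITS on a 0/1 adjacency matrix (adj[i][j] = 1 iff page i links to page j).
    Returns (authority, hub) score vectors, normalized to Euclidean norm 1."""
    n = len(adj)
    a = [1.0 / math.sqrt(n)] * n
    h = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        # a := G'^T h / ||G'^T h||
        a = [sum(adj[i][j] * h[i] for i in range(n)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in a)) or 1.0
        a = [x / norm for x in a]
        # h := G' a / ||G' a||
        h = [sum(adj[i][j] * a[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in h)) or 1.0
        h = [x / norm for x in h]
    return a, h
```

For instance, on a three-page graph where pages 0 and 1 both point to page 2, page 2 gets the whole authority score and pages 0 and 1 equal hub scores.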
Web Graph Mining HITS
Using HITS to Order Web Query Results
1 Retrieve the set D of Web pages matching a keyword query.
2 Retrieve the set D* of Web pages obtained from D by adding all linked pages, as well as all pages linking to pages of D.
3 Build from D* the corresponding subgraph G′ of the Web graph.
4 Compute iteratively hub and authority scores.
5 Sort documents from D by authority scores.

Less efficient than PageRank, because the scores are local (query-dependent).
Spamdexing
Definition
Fraudulent techniques used by unscrupulous webmasters to artificially raise the visibility of their website to users of search engines

Purpose: attracting visitors to websites to make profit.
Unceasing war between spamdexers and search engines
Web Graph Mining Spamdexing
Spamdexing: Lying about the Content
Technique
Put unrelated terms in:
- meta-information (<meta name="description">, <meta name="keywords">)
- text content hidden to the user with JavaScript, CSS, or HTML presentational elements

Countertechnique
Ignore meta-information
Try and detect invisible text
Link Farm Attacks
Technique
A huge number of hosts on the Internet used for the sole purpose of referencing each other, without any content of their own, to raise the importance of a given website or set of websites.

Countertechnique
Detection of websites with empty or duplicate content
Use of heuristics to discover subgraphs that look like link farms
Link Pollution
Technique
Pollute user-editable websites (blogs, wikis) or exploit security bugs to add artificial links to websites, in order to raise their importance.
Countertechnique
rel="nofollow" attribute on <a> links not validated by a page's owner

Outline
1 Web crawling
2 Web Information Retrieval
3 Web Graph Mining
4 Conclusion
Conclusion
What you should remember
The inverted index model for efficient answering of keyword-based queries.
The threshold algorithm for retrieving top-k results.
PageRank and its iterative computation.
References
Specifications
- HTML 4.01, http://www.w3.org/TR/REC-html40/
- HTTP/1.1, http://tools.ietf.org/html/rfc2616
- Robot Exclusion Protocol, http://www.robotstxt.org/orig.html

A book
Mining the Web: Discovering Knowledge from Hypertext Data, Soumen Chakrabarti, Morgan Kaufmann
Bibliography
Serge Abiteboul, Grégory Cobena, Julien Masanès, and Gerald Sedrati.
A first experience in archiving the French Web.
In Proc. ECDL, Rome, Italy, September 2002.
Serge Abiteboul, Mihai Preda, and Grégory Cobena.
Adaptive on-line page importance computation.
In Proc. Intl. World Wide Web Conference (WWW), 2003.
Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig.
Syntactic clustering of the Web.
Computer Networks, 29(8-13):1157–1166, 1997.
Sergey Brin and Lawrence Page.
The anatomy of a large-scale hypertextual Web search engine.
Computer Networks, 30(1–7):107–117, April 1998.
Soumen Chakrabarti.
Mining the Web: Discovering Knowledge from Hypertext Data.
Morgan Kaufmann, 2003.
Soumen Chakrabarti, Martin van den Berg, and Byron Dom.
Focused crawling: A new approach to topic-specific Web resource discovery.
Computer Networks, 31(11–16):1623–1640, 1999.
Michelangelo Diligenti, Frans Coetzee, Steve Lawrence, C. Lee Giles, and Marco Gori.
Focused crawling using context graphs.
In Proc. VLDB, Cairo, Egypt, September 2000.
Jon M. Kleinberg.
Authoritative Sources in a Hyperlinked Environment.
Journal of the ACM, 46(5):604–632, 1999.
Martijn Koster.
A standard for robot exclusion.
http://www.robotstxt.org/orig.html, June 1994.
Martin F. Porter.
An algorithm for suffix stripping.
Program, 14(3):130–137, July 1980.
Pierre Senellart.
Identifying Websites with flow simulation.
In Proc. ICWE, pages 124-129, Sydney, Australia, July 2005.
sitemaps.org.
Sitemaps XML format.
http://www.sitemaps.org/protocol.php, February 2008.
US National Archives and Records Administration.
The Soundex indexing system.
http://www.archives.gov/genealogy/census/soundex.html, May 2007.