• Aucun résultat trouvé

Web search engines

N/A
N/A
Protected

Academic year: 2022

Partager "Web search engines"

Copied!
48
0
0

Texte intégral

(1)

Web search engines

(2)

Outline

Information Retrieval Text Preprocessing Inverted Index

Answering Keyword Queries PageRank

Conclusion

(3)

Information Retrieval, Search

Problem

How toindexWeb content so as to answer (keyword-based) queries efficiently?

Context: set oftext documents

d1 The jaguar is a New World mammal of the Felidae family.

d2 Jaguar has designed four new engines.

d3 For Jaguar, Atari was keen to use a 68K family device.

d4 The Jacksonville Jaguars are a professional US football team.

d5 Mac OS X Jaguar is available at a price of US $199 for Apple’s new “family pack”.

d6 One such ruling family to incorporate the jaguar into their name is Jaguar Paw.

d It is a big cat.

(4)

Outline

Information Retrieval Text Preprocessing Inverted Index

Answering Keyword Queries PageRank

Conclusion

(5)

Text Preprocessing

Initial textpreprocessingsteps Number of optional steps

Highly depends on theapplication

Highly depends on thedocument language(illustrated with English)

(6)

Tokenization

Principle

Separate text intotokens(words) Not so easy!

In some languages (Chinese, Japanese), wordsnot separated by whitespace

Dealconsistentlywith acronyms, elisions, numbers, units, URLs, emails, etc.

Compound words:hostname,host-nameandhost name. Break into two tokens or regroup them as one token? In any case, lexicon and linguistic analysis needed! Even more so in other

(7)

Tokenization: Example

d1 the1jaguar2is3a4new5world6mammal7of8the9felidae10 family11 d2 jaguar1has2designed3four4new5engines6

d3 for1jaguar2atari3was4keen5to6use7a868k9family10 device11 d4 the1jacksonville2jaguars3are4a5professional6us7football8team9 d5 mac1os2x3jaguar4is5available6at7a8price9of10 us11 $19912

for13 apple’s14new15 family16 pack17

d6 one1such2ruling3family4to5incorporate6the7jaguar8into9 their10 name11 is12 jaguar13 paw14

d7 it1is2a3big4cat5

(8)

Stemming

Principle

Mergedifferent forms of the same word, or of closely related words, into a singlestem

Not in all applications!

Useful for retrieving documents containinggeesewhen searching forgoose

Various degreesof stemming

Possibility of building different indexes, with different stemming

(9)

Stemming schemes (1/2)

Morphological stemming (lemmatization).

Removebound morphemesfrom words:

plural markers gender markers

tense or mood inflections etc.

Can be linguisticallyvery complex, cf:

Les poules du couvent couvent.

[The hens of the monastery brood.]

In English, somewhateasy:

Remove final -s, -’s, -ed, -ing, -er, -est Take care of semiregular forms (e.g., -y/-ies) Take care of irregular forms (mouse/mice) But still someambiguities: cf rose

(10)

Stemming schemes (2/2)

Lexical stemming.

Mergelexically relatedterms of various parts of speech, such aspolicy,politics,politicalorpolitician For English,Porter’s stemming[Porter, 1980]; stem universityanduniversaltounivers: not perfect!

Possibility of coupling this withlexiconsto merge (near-)synonyms

Phonetic stemming.

Mergephonetically relatedwords: search proper names with different spellings!

For English,Soundex[US National Archives and

(11)

Stemming Example

d1 the1jaguar2be3a4new5world6mammal7of8the9felidae10 family11 d2 jaguar1have2design3four4new5engine6

d3 for1jaguar2atari3be4keen5to6use7a868k9family10 device11 d4 the1jacksonville2jaguar3be4a5professional6us7football8team9 d5 mac1os2x3jaguar4be5available6at7a8price9of10us11$19912

for13 apple14new15 family16 pack17

d6 one1such2rule3family4to5incorporate6the7jaguar8into9 their10 name11 be12jaguar13paw14

d7 it1be2a3big4cat5

(12)

Stop Word Removal

Principle

Removeuninformativewords from documents, in particular to lower the cost of storing the index

determiners: a,the,this, etc.

function verbs: be,have,make, etc.

conjunctions: that,and, etc.

etc.

(13)

Stop Word Removal Example

d1 jaguar2new5world6mammal7felidae10 family11 d2 jaguar1design3four4new5engine6

d3 jaguar2atari3keen568k9family10device11

d4 jacksonville2jaguar3professional6us7football8team9 d5 mac1os2x3jaguar4available6price9us11$19912apple14

new15 family16 pack17

d6 one1such2rule3family4incorporate6jaguar8their10 name11 jaguar13 paw14

d7 big4cat5

(14)

Outline

Information Retrieval Text Preprocessing Inverted Index

Answering Keyword Queries PageRank

Conclusion

(15)

Inverted Index

After all preprocessing, construction of aninverted index:

Index ofall terms, with the list of documents where this term occurs

Small scale: disk storage, withmemory mapping(cf.mmap) techniques; secondary index for offset of each term in main index Large scale: distributed on acluster of machines; hashing gives the machine responsible for a given term

Updating the index costly, so onlybatch operations(not one-by-one addition of term occurrences)

(16)

Inverted Index Example

family d1,d3,d5,d6 football d4

jaguar d1,d2,d3,d4,d5,d6 new d1,d2,d5

rule d6 us d4,d5 world d1 . . .

(17)

Storing positions in the index

phrase queries, NEAR operator: need to keepposition information in the index

just add it in the document list!

family d1/11,d3/10,d5/16,d6/4 football d4/8

jaguar d1/2,d2/1,d3/2,d4/3,d5/4,d6/8+13 new d1/5,d2/5,d5/15

rule d6/3

us d4/7,d5/11 world d1/6

. . .

(18)

TF-IDF Weighting

Some term occurrences have moreweightthan others:

Terms occurringfrequentlyin agiven document: morerelevant Terms occurringrarelyin thedocument collectionas a whole: more informative

AddTerm Frequency—Inverse Document Frequencyweighting to occurrences;

tfidf(t,d) = nt,d

∑︀

tnt,d ·log |D|

{︀d ∈D|nt,d >0}︀⃒

nt,d number of occurrences oft ind D set of all documents

(19)

TF-IDF Weighting Example

family d1/11/.13,d3/10/.13,d6/4/.08,d5/16/.07 football d4/8/.47

jaguar d1/2/.04,d2/1/.04,d3/2/.04,d4/3/.04,d6/8+13/.04,d5/4/.02 new d2/5/.24,d1/5/.20,d5/15/.10

rule d6/3/.28

us d4/7/.30,d5/11/.15 world d1/6/.47

. . .

(20)

Outline

Information Retrieval Text Preprocessing Inverted Index

Answering Keyword Queries PageRank

Conclusion

(21)

Answering Boolean Queries

Single keyword query: just consult the index and return the documents in index order.

Boolean multi-keyword query

(jaguarANDnewAND NOTfamily) ORcat

Same way! Retrieve document lists from all keywords and apply adequate set operations:

AND intersection OR union AND NOT difference

Global score: some function of the individual weight (e.g., addition for conjunctive queries)

Position queries: consult the index, and filter by appropriate

(22)

Answering Top-k Queries

t1AND . . . ANDtn

t1OR . . . ORtn

Problem

Find thetop-k results(for some givenk) to the query, without retrieving all documents matching it.

Notations:

s(t,d) weight oft ind (e.g., tfidf)

g(s , . . . ,s ) monotonous function that computes the global score

(23)

Fagin’s Threshold Algorithm [Fagin et al., 2001]

(with an additional direct index givings(t,d)) 1. LetRbe the empty list andm= +∞.

2. For each 1≤i≤n:

2.1 Retrieve the documentd(i)containing termti that has thenext largests(ti,d(i)).

2.2 Compute its global scoregd(i) =g(s(t1,d(i)), . . . ,s(tn,d(i)))by retrieving alls(tj,d(i))withj̸=i.

2.3 IfRcontains less thank documents, or ifgd(i)is greater than the minimum of the score of documentsinR, addd(i)toR(and remove the worst element inRif it is full).

3. Letm=g(s(t1,d(1)),s(t2,d(2)), . . . ,s(tn,d(n))).

4. IfRcontainsk documents, and the minimum of the score of the documents inRisgreater than or equaltom, returnR.

5. Redo step 2.

(24)

Outline

Information Retrieval PageRank

Conclusion

(25)

The Web Graph

The World Wide Web seenas a (directed) graph:

Vertices: Web pages Edges: hyperlinks

Same for otherinterlinkedenvironments:

dictionaries encyclopedias scientific publications social networks

(26)

PageRank (Google’s Ranking [Brin and Page, 1998])

Idea

Importantpages are pages pointed to byimportantpages.

{︃gij =0 if there is no link between pageiandj;

gij = n1

i otherwise, withni the number of outgoing links of pagei.

Definition (Tentative)

Probabilitythat the surfer following therandom walkinGhas arrived on pagei at some distant given point in the future.

(︂ )︂

(27)

Illustrating PageRank Computation

0.100 0.100

0.100

0.100

0.100 0.100

0.100

0.100

0.100

0.100

(28)

Illustrating PageRank Computation

0.033 0.317

0.075

0.108

0.058

0.083

0.150

0.033

(29)

Illustrating PageRank Computation

0.036 0.193

0.108

0.163

0.079 0.090

0.074

0.154

0.094

0.008

(30)

Illustrating PageRank Computation

0.054 0.212

0.093

0.152

0.051

0.108

0.149

0.026

(31)

Illustrating PageRank Computation

0.051 0.247

0.078

0.143

0.053 0.062

0.097

0.153

0.099

0.016

(32)

Illustrating PageRank Computation

0.048 0.232

0.093

0.156

0.067

0.087

0.138

0.018

(33)

Illustrating PageRank Computation

0.052 0.226

0.092

0.148

0.058 0.064

0.098

0.146

0.096

0.021

(34)

Illustrating PageRank Computation

0.049 0.238

0.088

0.149

0.063

0.095

0.141

0.019

(35)

Illustrating PageRank Computation

0.050 0.232

0.091

0.149

0.060 0.066

0.094

0.143

0.096

0.019

(36)

Illustrating PageRank Computation

0.050 0.233

0.091

0.150

0.064

0.095

0.142

0.020

(37)

Illustrating PageRank Computation

0.050 0.234

0.090

0.148

0.058 0.065

0.095

0.143

0.097

0.019

(38)

Illustrating PageRank Computation

0.049 0.233

0.091

0.149

0.065

0.095

0.142

0.019

(39)

Illustrating PageRank Computation

0.050 0.233

0.091

0.149

0.058 0.065

0.095

0.143

0.097

0.019

(40)

Illustrating PageRank Computation

0.050 0.234

0.091

0.149

0.065

0.095

0.142

0.019

(41)

PageRank with Damping

May not always converge, or convergence may not be unique.

To fix this, the random surfer can at each steprandomly jumpto any page of the Web with some probabilityd (1−d:damping factor).

pr(i) = (︂

k→+∞lim ((1−d)GT +dU)kv )︂

i

whereU is the matrix with all N1 values withN the number of vertices.

(42)

Using PageRank to Score Query Results

PageRank: globalscore, independent of the query Can be used to raise the weight ofimportantpages:

weight(t,d) =tfidf(t,d)×pr(d), This can be directly incorporatedin the index.

(43)

Iterative Computation of PageRank

1. ComputeG(often stored as its adjacency list). Make sure lines sum to 1.

2. Letu be the uniform vector of sum 1,v =u,w thezerovector.

3. Whilev isdifferent enoughfromw:

Setw =v.

Setv = (1d)GTv+du.

(44)

Outline

Information Retrieval PageRank

Conclusion

(45)

Conclusion

What you should remember

The inverted index model, associated tools and techniques Main ideas behind Fagin’s TA

PageRank, and its iterative computation Software

Most DBMS have text indexing capabilities (e.g., MySQL’s FULLTEXTindexes)

Apache Lucene, free software library for information retrieval To go further

Good textbooks [Manning et al., 2008, Chakrabarti, 2003]

Very influential papers on top-k algorithms and PageRank: [Fagin et al., 2001, Brin and Page, 1998]

(46)

Bibliography I

Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks, 30(1–7):

107–117, April 1998.

Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, San Fransisco, USA, 2003.

Ronald Fagin, Amnon Lotem, and Moni Naor. Optimal aggregation algorithms for middleware. InProc. PODS, Santa Barbara, USA, May 2001.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich SchÃijtze.

Introduction to Information Retrieval. Cambridge University Press, Cambridge, United Kingdom, 2008. Available online at

http://informationretrieval.org.

(47)

Bibliography II

US National Archives and Records Administration. The Soundex indexing system.

http://www.archives.gov/genealogy/census/soundex.html, May 2007.

(48)

Licence de droits d’usage

Contexte public} avec modifications

Par le téléchargement ou la consultation de ce document, l’utilisateur accepte la licence d’utilisation qui y est attachée, telle que détaillée dans les dispositions suivantes, et s’engage à la respecter intégralement.

La licence confère à l’utilisateur un droit d’usage sur le document consulté ou téléchargé, totalement ou en partie, dans les conditions définies ci-après et à l’exclusion expresse de toute utilisation commerciale.

Le droit d’usage défini par la licence autorise un usage à destination de tout public qui comprend : – le droit de reproduire tout ou partie du document sur support informatique ou papier,

– le droit de diffuser tout ou partie du document au public sur support papier ou informatique, y compris par la mise à la disposition du public sur un réseau numérique,

– le droit de modifier la forme ou la présentation du document,

– le droit d’intégrer tout ou partie du document dans un document composite et de le diffuser dans ce nouveau document, à condition que : – L’auteur soit informé.

Les mentions relatives à la source du document et/ou à son auteur doivent être conservées dans leur intégralité.

Le droit d’usage défini par la licence est personnel et non exclusif.

Tout autre usage que ceux prévus par la licence est soumis à autorisation préalable et expresse de l’auteur :sitepedago@telecom-paristech.fr

Références

Documents relatifs

Unnecessary derivatization (blocking groups, protecting/deprotecting 
 grows temporary modification of chemical processes) should be minimized or avoided, 
 if

Finally, the set of grouped candidate terms with the highest similarity to the neighboring concepts is selected and next used when creating the fv of the corresponding

Examen du 6 janvier 2015, durée 3h L’ USAGE DE TOUT DISPOSITIF ÉLECTRONIQUE AUTRE QUE LA MONTRE ( ET ENCORE ) EST INTERDIT.. I L EN EST DE MÊME DE

Examen du 24 juin 2015, durée 2h L’ USAGE DE TOUT DISPOSITIF ÉLECTRONIQUE AUTRE QUE LA MONTRE EST INTERDIT.. I L EN EST DE MÊME DE

La modification proposée est fondée sur l’hypothèse selon laquelle un titulaire de droit d’auteur qui rend la totalité ou une partie d’une œuvre ou de tout autre objet de

Analyse tout en mémoire Page 16 Plus d’outils pour la récupération d’informations sur le réseau sont disponibles ici : Informations sur la

Sélectionner le texte à copier choisir la commande copier de l’onglet Accueil (CTRL+C) Mettre le curseur à l’endroit du copie choisir la commande coller de l’onglet Accueil

Quels sont les atouts et les contraintes du détroit de Malacca dans le commerce maritime mondial. 1° Décrire les caractéristiques géographiques du détroit de Malacca (Docs. 1