
9.3 The anatomy of a search engine

9.3.3 Page relevance and ranking

The tables of URLs and pages filled in during the crawling phase are the initial data used for building the inverted file structure of Figure 9.3 in what is called the indexing process. Other fundamental information is needed, related to the expected importance that each page will assume for an Internet user among the thousands of pages containing the required keywords. This leads to a ranking of the retrieved documents, a feature that differentiates search engines from one another. Many of the criteria used for ranking are kept secret and evolve continuously: what we report here is public knowledge of methods used by all major engines.

The first feature to take into account is the relevance R(p, t) of a page p as a function of any particular term t occurring in it. For this the positions of the term in the page are important, with the occurrences of t in the URL, in the title, or in the first lines of p being assigned a greater importance than the other occurrences. But, perhaps unexpectedly, particular occurrences of t in other pages pointing to p are also very relevant for p. In fact a page p′ may point to p through a clickable sequence of words called an anchor text for p. For example a Web page p′ on songs of the 1960s may point to the page p on remastered Beatles' songs through an anchor text Beatles' compilations associated with the URL thebeatles.com (this is easily done by the designer of p′ in the HTML description of this page). So even the word t = compilations (and its prefix compilation that, together with the plural form, is certainly present in the term table T) becomes relevant for p, besides being relevant for p′. Both R(p, t) and R(p′, t) are then affected.

All search engines treat anchor text terms as very relevant both for the pointing and the pointed page, presuming that they have been chosen by the page designers as particularly descriptive of the situation at hand. If many pages point to p using the same term t in their anchor texts, t becomes very relevant for p and the value of R(p, t) increases substantially. This is the first example of relevance decided by "popularity" instead of relying on the intrinsic value of a term (whatever that means).

Besides considering the positions of a term t on a page p, relevance is also affected by the number of times t occurs, and by the significance that t may have in the given context. For example common linguistic elements such as articles and prepositions trivially occur very often and count for practically nothing as distinguishing elements of a page. More interestingly, the term "song" in the page thebeatles.com must be treated as much less relevant than the term "crawling" if the latter occurred in the page, because the appearance of an unexpected term is likely to indicate an important feature of the document at hand.

These concepts were well known in information retrieval long before search engines existed. A commonly accepted metric, called TFIDF for Term Frequency combined with Inverse Document Frequency, is based on the score:

S(p, t) = TF(p, t) × IDF(t) (9.1)

where the term frequency TF(p, t) is the number of occurrences of t divided by the total number of terms in p, and the inverse document frequency IDF(t) is related to the unexpectedness of t in p. Actually IDF(t) has a more flexible definition because it is measured with respect to a collection C of pages on topics close to the topic of p, that may be chosen with a certain freedom (for example C is a collection of pages about songs, and t is "crawling"). For a given collection C, denoting by C_t the elements of C containing the term t, and by |C| and |C_t| the number of elements in the two collections, the standard definition is then:

IDF(t) = log_2(|C|/|C_t|). (9.2)

Note that the value |C|/|C_t| gives an indication of how "strange" t is for the group of pages in C (in fact, infinitely strange if t never occurs), while the logarithm makes the function much smoother.
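As an illustration, here is a minimal sketch in Python of formulas (9.1) and (9.2); the toy collection of pages about songs and the representation of a page as a list of terms are assumptions made for the example, not part of any engine's actual machinery.

import math

def tf(page, t):
    # TF(p, t): occurrences of t divided by the total number of terms in p
    return page.count(t) / len(page)

def idf(collection, t):
    # IDF(t) = log_2(|C| / |C_t|), formula (9.2); undefined if t never occurs in C
    c_t = sum(1 for page in collection if t in page)
    return math.log2(len(collection) / c_t)

def tfidf(page, collection, t):
    # S(p, t) = TF(p, t) x IDF(t), formula (9.1)
    return tf(page, t) * idf(collection, t)

# toy collection C of pages about songs, each page given as a list of terms
C = [["song", "beatles", "remastered", "song"],
     ["song", "lyrics", "chart"],
     ["song", "crawling", "spider"]]
p = C[2]
print(tfidf(p, C, "song"))      # IDF = log_2(3/3) = 0: a common term scores 0
print(tfidf(p, C, "crawling"))  # a rare, "strange" term scores much higher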

A combination of the effect of the occurrence of t in particular positions of p as explained before, of the anchor texts, and of the TFIDF score S(p, t), determines the relevance value R(p, t) according to the engine's particular policy.

Page relevance with the TFIDF metric is one of the two major criteria used for determining the importance of a page in answering user queries. The second criterion is the popularity of a page as a function of its location in the Web graph. The great success that Google has experienced since its appearance was undoubtedly connected to the introduction of a new ranking method based on counting the incoming links to each page as a measure of their popularity. The quality of the answers was spectacularly improved over that of the existing engines.

As previously mentioned, the basic idea was to apply the mathematical concept of a Markov chain to the Web graph to compute the probability of reaching a certain page by a random walk: the more incoming links to a page there are, the greater the probability of visiting it; and so the higher the rank to be assigned to the page for answering users' queries. The proposed algorithm was called Page Rank, and is just one of the ingredients for ranking.

According to it, ranking is completely independent of the actual queries. As we also mentioned, around the same time another method called HITS was proposed, also making use of the structure of the Web graph but deciding ranking dynamically as a function of user queries. Both methods are of paramount importance for the development of search engines and will be briefly described here. In fact, in the introduction of this book we have speculated on how to wander through the city of Königsberg in search of art work, following very naively the rules of the two methods.

Figure 9.6 shows the basic structure of Page Rank. In the words of the Google Web site, Page Rank interprets a link from one page to another as a vote by the former for the latter, but "looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily ... ."

FIGURE 9.6: The basic recursive formula for Page Rank: R(A) = R(B)/4 + R(C)/2 + R(D)/3. Note that R(D) is divided by three because two of its four outgoing links point to the same page.

Looking at the figure we can see that a page A with incoming links from B, C, D has its rank R(A) computed as the sum of the ranks of the pages pointing to it, each divided by the number of outgoing links from those pages.

For example the contribution of page B to the rank of A is equal to the rank of B proportionally shared by all the pages to which B gives its vote (out-links of B). If more links go from one source to the same destination, as happens for page D, they count as one in the rank computation. So the general formula for the Page Rank R(p) of a page p is as follows:

R(p) = Σ_{q ∈ Q} R(q)/L(q) (9.3)

where Q is the set of all pages pointing to p, and L(q) is the number of distinct pages pointed to by q.

Formula (9.3) is recursive and can be computed starting from an arbitrary assignment of ranks to all the pages (typically, all equal ranks) and applying linear algebra as we shall see below. It can be proved that this is equivalent to making a random tour through the Web graph, taking all outgoing links from any node with equal probability. As a consequence of Markov theory it can also be proved that the initial assignment of rank values does not affect the final result if the number of steps in the random tour (i.e., the number of rank re-calculations) goes to infinity, or in practice is very large. At this point the rank of a page is the probability that the tour terminates there: the implication is that pages with high rank are more likely to be required by users.
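A minimal Python sketch of this rank re-calculation follows; the four-page graph is a made-up example in the spirit of Figure 9.6 (only the fragment around page A is given in the text), with duplicate links already collapsed so that each page q distributes its rank among its L(q) distinct out-neighbors.

def page_rank(links, iterations=100):
    # links: each page mapped to the set of distinct pages it points to
    pages = list(links)
    rank = {p: 1 / len(pages) for p in pages}  # arbitrary initial assignment
    for _ in range(iterations):
        # formula (9.3): R(p) is the sum of R(q)/L(q) over all q pointing to p
        rank = {p: sum(rank[q] / len(links[q]) for q in pages if p in links[q])
                for p in pages}
    return rank

# hypothetical toy graph: every page has at least one outgoing link
links = {"A": {"B"}, "B": {"A", "C"}, "C": {"A"}, "D": {"A", "B", "C"}}
print(page_rank(links))  # D, with no incoming links, converges to rank 0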

Things, however, are not that simple. Users may not always follow the clickable links, and in fact all available statistics show that the average Internet user follows up to three links from a page before getting bored and changing search strategy. So the random tour through the graph foreseen in mathematical terms might actually be shorter than expected. Moreover Page Rank, as defined with formula (9.3), tends to favor older pages because new ones generally have only a few incoming links even if they are actually important, pretty much as happens with the mechanisms of network growth discussed in Chapter 6. Therefore a damping factor d was added to the formula from the very beginning, accounting for the possibility of jumping from one node to any other node chosen at random. The new ranking formula then becomes:

R(p) = d · Σ_{q ∈ Q} R(q)/L(q) + (1 − d)/N (9.4)

where N is the total number of pages in the considered collection, and d is generally taken as 0.85. In the limit, for d = 1 formula (9.3) holds, while for d = 0 the links have no influence on the ranks and all vertices have equal rank 1/N.

At this point the computation of Page Rank is a mere application of matrix multiplication as explained in Chapter 6 (Section 6.1). Given the adjacency matrix M of the Web graph, we have seen that any of its powers M^k gives the number of paths of length k inside the graph. To take care of damping, we extend the definition of M to a new matrix S whose elements are: S[i, j] = d · M[i, j] + (1 − d)/N. Numbering the pages from 1 to N, the values of Page Rank can be stored in a vector R where R[i] is the rank of page i. So starting from an initial configuration of values R_0 for R, the computation is iterated as:

R_1 = S × R_0,
R_2 = S × R_1 (i.e., R_2 = S^2 × R_0),
... R_i = S × R_{i−1} ... (i.e., R_i = S^i × R_0) (9.5)

An important point is that we do not actually need to compute the limit rank values, so the chain of computations can be interrupted when the relative standings of the elements of R_i are the same as the ones in R_{i−1}. In other words, it is not necessary to compute the probability of ending at a page A, but just to know whether R(A) is greater or smaller than the ranks of the other pages. We will return to this point below.
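The sketch below, again in Python, follows this scheme under one assumption worth making explicit: for the product S × R to reproduce formula (9.4), the entry M[i][j] is taken here as 1/L(j) when page j points to page i and 0 otherwise, i.e., the adjacency information normalized by out-degree. The stopping test is the one just described, on relative standings rather than on the limit values.

def damped_page_rank(links, d=0.85, max_iters=1000):
    pages = sorted(links)
    n = len(pages)
    idx = {p: i for i, p in enumerate(pages)}
    # M[i][j] = 1/L(j) if page j points to page i, else 0
    m = [[0.0] * n for _ in range(n)]
    for q, outs in links.items():
        for p in outs:
            m[idx[p]][idx[q]] = 1 / len(outs)
    r = [1 / n] * n  # initial values; the final standings do not depend on them
    for _ in range(max_iters):
        # R_i = S x R_{i-1}, with S[i][j] = d * M[i][j] + (1 - d)/N
        new_r = [d * sum(m[i][j] * r[j] for j in range(n)) + (1 - d) / n
                 for i in range(n)]
        # stop as soon as the relative standings of the pages stop changing
        if sorted(range(n), key=r.__getitem__) == sorted(range(n), key=new_r.__getitem__):
            return dict(zip(pages, new_r))
        r = new_r
    return dict(zip(pages, r))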

Compared to Page Rank, Hyperlink Induced Topic Search (HITS) has the computational advantage of working on much smaller arrays and the logical advantage of exploiting the page incoming links as a function of the user query, although this last property slows down the phase of query answering.

The idea behind the method is sorting out the pages that are really "authoritative" for a certain query q from among the many pages with high relevance for the keywords of q and high in-degree. To this end, the set Q of pages containing the keywords is selected together with all the pages pointing to Q or pointed to by Q. The whole set is called the "base" of the query; see Figure 9.7.

FIGURE 9.7: The "base" of a query q for HITS. Q is the set of pages containing the keywords of q. The base contains Q together with all the pages pointing to Q or pointed to by Q. Grey nodes indicate pages linked to Q in both directions.

The two concepts of authority and hub can be formalized at this point.

Restricting our graph to the base of q, a page p has a non-negative authority weight A(p) and a non-negative hub weight H(p), mutually reinforcing according to the relations:

A(p) = Σ_{s ∈ S} H(s),   H(p) = Σ_{t ∈ T} A(t), (9.6)

where S is the set of pages pointing to p, and T is the set of pages pointed to by p. An important authority is a page pointed to by many important hubs; an important hub is a page that points to many important authorities.

Starting with equal values of A and H for all the pages of the base, relations (9.6) are iterated until an equilibrium is reached. The computation is similar to that of Page Rank, using the much smaller adjacency matrix of the base.
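To conclude, here is a minimal Python sketch of the HITS iteration on the base of a query; the representation of the base as adjacency sets and the normalization of the weights at every round (a standard device to keep the mutually reinforcing values from growing without bound) are assumptions of this example.

def hits(base, iterations=50):
    # base: each page of the base mapped to the pages of the base it points to
    pages = list(base)
    auth = {p: 1.0 for p in pages}  # equal starting values, as in the text
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A(p) = sum of H(s) over the pages s pointing to p, formula (9.6)
        auth = {p: sum(hub[s] for s in pages if p in base[s]) for p in pages}
        # H(p) = sum of A(t) over the pages t pointed to by p, formula (9.6)
        hub = {p: sum(auth[t] for t in base[p]) for p in pages}
        # normalize both weight vectors to unit length
        na = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        nh = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        auth = {p: a / na for p, a in auth.items()}
        hub = {p: h / nh for p, h in hub.items()}
    return auth, hub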
