ENHANCED TECHNIQUES FOR PAGE RANKING

Both PageRank and HITS, along with the improvements discussed so far, rely on the basic assumption that linked pages belong to the same or a similar topic. However, as mentioned earlier, the topic changes quickly as the number of links between pages increases. Thus, starting from a set of pages on a single topic, expanding the set by one or more links may bring in pages from other topics. This process is called topic generalization. Topic generalization by a single link is used in HITS to form the base set and is the maximum that is feasible.

Expansion by more than one link usually brings many unrelated pages and has to be avoided.

Another undesired situation arises when a page from a single-topic set of pages points to a large set of pages on another topic. Expanding the former set would then pull in pages from the larger set, changing the original topic. This process, called topic drift, poses problems for both HITS and PageRank: in HITS the top-ranked hubs and authorities may appear unrelated to the query, and PageRank may assign high scores to pages with low relevance. This effect may also be exploited intentionally to raise the rank of a page by linking it to a large, densely connected web subgraph.

Problems with topic generalization and drift stem largely from relying on a single ranking system based predominantly on the web graph structure. A general solution to these and other problems is to use several ranking systems and to weight their scores when computing the final page rank. We have already discussed two: content-based relevance, which uses the vector space metric, and the link-based ranking of PageRank and HITS.
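As a concrete illustration, the combination step can be as simple as a weighted sum of scores that have been normalized to a common scale. The following minimal Python sketch assumes two hypothetical score tables, one content-based and one link-based, with designer-chosen weights; production engines combine many more signals, often with learned weights.

# Minimal sketch of combining two ranking systems by a weighted sum.
# The scores and weights below are hypothetical illustrations.

content_score = {"pageA": 0.82, "pageB": 0.41, "pageC": 0.65}  # vector space relevance
link_score = {"pageA": 0.10, "pageB": 0.55, "pageC": 0.30}     # e.g., PageRank

W_CONTENT, W_LINK = 0.6, 0.4   # designer-chosen weights summing to 1

final_score = {
    page: W_CONTENT * content_score[page] + W_LINK * link_score[page]
    for page in content_score
}

# Rank pages by the combined score, best first.
ranking = sorted(final_score, key=final_score.get, reverse=True)
print(ranking)   # ['pageA', 'pageC', 'pageB'] with these illustrative numbers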

It is important to note that successful web search engines (e.g., Google) use these and other ranking schemes and sophisticated weighting techniques to combine them.

Other problems with link-based ranking include nepotism and outliers. Densely linked pages located on a single server cause problems for purely link-based ranking. Such links are called nepotistic links because they increase the page rank but indicate hardly any authority; they may also be used for commercial manipulation.

Two-party and multiparty nepotism are also possible, due largely to navigation links or links between different sites belonging to the same business. For example, Google has more than 20 sites, all linked together: http://froogle.google.com/, http://groups.google.com/, http://images.google.com/, and others.

Such sites may be completely unrelated with respect to page content. One simple approach to avoiding nepotism is to assign a weight of 1/k to the inlinks from pages belonging to a site with k pages.
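A minimal sketch of this down-weighting, assuming a hypothetical site_of mapping from page indices to sites and a small illustrative adjacency matrix:

from collections import Counter

# Minimal sketch of the 1/k nepotism down-weighting described above.
# The site_of mapping and the adjacency matrix A are illustrative
# assumptions; A[i][j] = 1 means page i links to page j.

site_of = {0: "siteX", 1: "siteX", 2: "siteX", 3: "siteY"}
A = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 0]]

pages_per_site = Counter(site_of.values())   # siteX has k = 3 pages

# An inlink from a page on a k-page site gets weight 1/k instead of 1.
W = [[A[i][j] / pages_per_site[site_of[i]] for j in range(len(A))]
     for i in range(len(A))]
# Rows 0-2 (siteX) now contribute 1/3 per link; row 3 (siteY) keeps weight 1.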

Outliers are web pages that are retrieved by keyword search, and are thus relevant to the query, but are somehow far from its central topic. Such pages may link to web subgraphs outside the topic and thus increase the probability of topic drift when the root set is expanded. Therefore, outlier elimination can stabilize the ranking algorithm and avoid undesired topic generalization and drift. Outliers can be detected by clustering because they are far from the cluster centroids. Following are the basic steps of a simple approach that uses the idea of centroids and is designed to stabilize the HITS algorithm (a code sketch is given after the list).

1. Create vector space representations for the pages in the root set.

2. Find the centroid of the root set. This is the page that maximizes the sum (or the average) of its cosine similarity to all pages in the set.

3. When expanding the root set, discard pages (from the base set) that are too far from the centroid page (their cosine similarity to the centroid is below a given threshold).
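The following is a minimal sketch of these three steps, assuming the vector space representations (e.g., TF-IDF vectors) have already been built; the toy vectors and the similarity threshold are illustrative assumptions.

import numpy as np

def cosine(u, v):
    # Cosine similarity between two term vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def centroid_page(vectors):
    # Step 2: the page that maximizes its total cosine similarity
    # to all pages in the root set.
    pages = list(vectors)
    return max(pages, key=lambda p: sum(cosine(vectors[p], vectors[q])
                                        for q in pages))

def filter_base_set(base_vectors, centroid_vector, threshold):
    # Step 3: keep only base-set pages close enough to the centroid page.
    return {p: v for p, v in base_vectors.items()
            if cosine(v, centroid_vector) >= threshold}

# Illustrative usage with toy 3-term vectors (step 1 assumed done).
root = {"p1": np.array([1.0, 2.0, 0.0]),
        "p2": np.array([2.0, 1.0, 0.0]),
        "p3": np.array([1.0, 1.0, 1.0])}
base = {"q1": np.array([1.5, 1.5, 0.5]),
        "q2": np.array([0.0, 0.0, 3.0])}      # likely off-topic

c = centroid_page(root)                       # step 2
kept = filter_base_set(base, root[c], 0.5)    # step 3: q2 is discarded
print(list(kept))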

There are also other approaches to enhance page ranking that are based on the structure of the web graph as well as on the HTML structure of web documents.

Chakrabarti [3] provides an in-depth discussion of all these methods, including the basic PageRank and HITS algorithms and their improvements.

REFERENCES

1. Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Stanford Digital Library Technology Project, Stanford University, Stanford, CA, 1998, http://dbpubs.stanford.edu/pub/1999-66.

2. Jon M. Kleinberg, Authoritative sources in a hyperlinked environment, J. ACM, 46(5):604–632, 1999, http://www.cs.cornell.edu/home/kleinber/auth.ps.

3. Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, San Francisco, CA, 2003.

EXERCISES

1. Use the WebSPHINX crawler (http://www.cs.cmu.edu/~rcm/websphinx/, also available from the book series Web site, www.dataminingconsultant.com) to create a document citation matrix for a small set of web pages. For example, use the domain http://www.artsci.ccsu.edu/.

a. To limit the web domain, crawl the server pages only (set Crawl: the server). As only the immediate citations are needed, the depth of crawling should be set to 1 hop.

b. Create a square matrix for the URLs collected from the pages crawled. For this purpose, crawl the server using each of these pages as a starting URL (again with a depth of 1 hop). The results of each crawl will include the pages cited by the page used as a starting URL, and thus will provide the information needed to fill the corresponding row of the adjacency matrix with 0's or 1's.

c. (Advanced project) The entire process of creating the citation matrix may be automated by writing a program that uses the source code of the WebSPHINX crawler, the W3C Protocol Library (http://www.w3.org/Library/), or another package providing tools to manipulate web page links; a minimal sketch of such a program is given below.
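As a starting point for part (c), here is a minimal sketch using only the Python standard library. It illustrates the idea rather than replacing the WebSPHINX-based procedure; the seed URL is taken from the exercise, while the page cap and timeout are assumptions, and a real crawler should also honor robots.txt and pause between requests.

import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    # Collect href targets of anchor tags from one HTML page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def outlinks(url, server):
    # Return the same-server pages linked from url (depth 1).
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
    except OSError:
        return []
    parser = LinkExtractor()
    parser.feed(html)
    absolute = (urljoin(url, href) for href in parser.links)
    return [u for u in absolute if urlparse(u).netloc == server]

seed = "http://www.artsci.ccsu.edu/"                   # from the exercise
server = urlparse(seed).netloc
pages = list(dict.fromkeys([seed] + outlinks(seed, server)))[:20]  # capped

# citation[i][j] = 1 iff pages[i] links to pages[j]
links = {p: set(outlinks(p, server)) for p in pages}
citation = [[1 if q in links[p] else 0 for q in pages] for p in pages]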

2. Compute the prestige score of the pages in the collection by finding the eigenvector associated with the largest eigenvalue of the citation matrix. Use a math package such as MATLAB (MathWorks), or implement the power iteration algorithm described in this chapter.
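If a numerical package is preferred, NumPy's eigensolver does this directly. The following is a minimal sketch; the small citation matrix is an illustrative assumption, and the transpose reflects the convention that prestige flows along inlinks (adjust if your matrix is oriented the other way).

import numpy as np

# Minimal sketch: prestige as the dominant eigenvector. Here
# A[i, j] = 1 iff page i cites page j; under that orientation the
# prestige vector satisfies p = A.T @ p, so we take the dominant
# eigenvector of the transpose.

A = np.array([[0, 1, 1],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

eigenvalues, eigenvectors = np.linalg.eig(A.T)
dominant = np.argmax(eigenvalues.real)          # largest eigenvalue
p = np.abs(eigenvectors[:, dominant].real)      # its eigenvector
p /= p.sum()                                    # normalize the scores
print(p)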

3. Include weights in the adjacency matrix as explained in the section “PageRank.” For this purpose, use the citation matrix created in Exercise 1. Analyze the structure of the web graph described with this matrix and determine whether or not it contains rank sinks. Use the eigenvector approach to compute the PageRank score of the web pages (see Exercise 2).
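A quick structural check for the simplest kind of rank sink, a page with no out-links, can be made directly on the matrix. The sketch below, with an illustrative matrix, flags such dangling pages and builds the weighted matrix; note that closed groups of pages linking only among themselves are also sinks and would additionally require a strongly-connected-component test.

import numpy as np

# Minimal sketch: flag dangling pages (all-zero rows, the simplest rank
# sinks) and build the weighted matrix W, in which each out-link of
# page i carries weight 1/outdegree(i). The matrix is illustrative.

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 0, 0]], dtype=float)       # page 2 has no out-links

dangling = np.where(A.sum(axis=1) == 0)[0]
print("dangling pages (rank sinks):", dangling)   # -> [2]

out = A.sum(axis=1, keepdims=True)
W = np.divide(A, out, out=np.zeros_like(A), where=out > 0)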

4. Investigate how a rank sink may affect page scores based on the simplified PageRank algorithm (without rank source). If the pages collected do not include a rank sink, modify the matrix to create one.

5. Implement the power iteration method for computing PageRank. Investigate how it deals with the rank sink situation. Use a uniform rank source with a small norm. Also experiment with different rank source vectors.
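A minimal sketch of such an implementation follows; the link matrix, damping factor, and uniform rank source are illustrative choices. Because a fraction (1 - d) of the rank is injected everywhere at each step, the sink can no longer absorb the entire score.

import numpy as np

# Minimal power iteration sketch for PageRank with a rank source.
# W is the row-normalized link matrix (illustrative; page 2 is a sink);
# d is the damping factor and e the rank source vector -- both are
# parameters worth experimenting with, as the exercise suggests.

W = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.0, 0.0]])   # page 2 is a rank sink

n = W.shape[0]
d = 0.85
e = np.full(n, 1.0 / n)           # uniform rank source with small norm

r = np.full(n, 1.0 / n)
for _ in range(100):
    r_next = d * (W.T @ r) + (1 - d) * e
    if np.abs(r_next - r).sum() < 1e-10:   # L1 convergence test
        break
    r = r_next

print(r / r.sum())   # renormalize: the sink leaks some rank mass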

6. Find statistics for web page visits and try to match the PageRank scores with the frequency of visits for each page. How good is the PageRank estimate of web traffic? Change the rank source vector (e.g., use the pages visited most often) or extend the web subgraph (include more pages) and see how this changes the PageRank scores.

7. Rewrite the PageRank equation with the rank source in matrix notation.
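As a hint, under one common convention (cf. [1]), with R the rank vector, W the row-normalized link matrix, E the rank source vector, and d the damping factor, the scalar update can be collected into the matrix form

R = d W^T R + (1 - d) E

where the rank source term (1 - d)E guarantees a unique fixed point for the iteration of Exercise 5.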

PART II

WEB CONTENT
