Web Search, Télécom ParisTech

Academic year: 2022


PageRank

Pierre Senellart (pierre.senellart@telecom-paristech.fr)

 March 

The purpose of this lab session is to implement the iterative computation of PageRank, which measures the importance of nodes in a graph.

Dataset

The graph of the Simple English Wikipedia is provided as the dataset. You can find it either on the course website, or as ~senellar/inf396/labels and ~senellar/inf396/edge_list on the computers used during the labs.

The set of articles is slightly different from the one used yesterday, and Category: articles are kept. The format of the two files is as follows: labels contains the titles of all articles, one per line, sorted in lexicographical order. The article appearing on line n is implicitly given the index n−1. The first line of edge_list is the total number of articles (that is, the number of lines of the labels file). Each line that follows has this form:

A B1,C1 B2,C2 ... Bn,Cn

A is the index of a given article (the edge_list file is sorted by such indices). B1, …, Bn are all the articles pointed to by A, and C1, …, Cn are the numbers of links from A to the respective articles.
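The line format above can be parsed with a few string splits. The following is a minimal sketch only: the class EdgeLineParser and its methods are hypothetical names, and it assumes that fields are separated by whitespace and that each pair is written exactly as B,C.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: parses one line of edge_list ("A B1,C1 B2,C2 ..."). */
final class EdgeLineParser {
    /** Returns the index A of the source article on the given line. */
    static int source(String line) {
        String[] fields = line.trim().split("\\s+");
        return Integer.parseInt(fields[0]);
    }

    /** Maps each target index Bi to its link count Ci, in file order. */
    static Map<Integer, Integer> targets(String line) {
        String[] fields = line.trim().split("\\s+");
        Map<Integer, Integer> out = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            String[] pair = fields[i].split(",");
            // Sum the counts if the same target were listed twice (defensive choice).
            out.merge(Integer.parseInt(pair[0]), Integer.parseInt(pair[1]), Integer::sum);
        }
        return out;
    }
}
```

For instance, on the line "3 0,2 7,1", source returns 3 and targets maps article 0 to 2 links and article 7 to 1 link.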

 PageRank

1. Create a class MemoryGraph that will store in memory the content of a (weighted) graph, as an adjacency list for each node of the graph. Implement a constructor that takes as arguments two files in the format described above.

2. Implement the PageRank iterative algorithm on such a graph. Do not forget to normalize the adjacency lists so that the sum of the weights of all outgoing edges of a given node is one. You can use a damping factor of . and stop the iterative computation when the relative difference between two successive vectors is less than .

3. Test your implementation. What are the nodes of the Simple English graph that are the most important? Does that sound reasonable?
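As a starting point for the iterative computation, here is a minimal sketch rather than a reference solution. It rests on several assumptions that the handout does not state: the graph is passed as plain arrays (a hypothetical stand-in for MemoryGraph), damping is taken to be the probability of following a link rather than jumping to a random node, the rank of nodes without outgoing edges is redistributed uniformly, and the relative difference is measured with the L1 norm. The damping factor and the threshold are left as parameters.

```java
import java.util.Arrays;

/** Hypothetical sketch of PageRank by power iteration on weighted adjacency lists. */
final class PageRankSketch {
    /**
     * targets[i] lists the successors of node i, weights[i] the matching link counts.
     * damping is assumed to be the probability of following a link.
     */
    static double[] pageRank(int[][] targets, double[][] weights,
                             double damping, double tolerance, int maxIterations) {
        int n = targets.length;
        // Normalize each adjacency list so that its outgoing weights sum to one.
        double[][] prob = new double[n][];
        for (int i = 0; i < n; i++) {
            double total = Arrays.stream(weights[i]).sum();
            prob[i] = new double[weights[i].length];
            for (int k = 0; k < prob[i].length; k++) {
                prob[i][k] = weights[i][k] / total;
            }
        }
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] next = new double[n];
            double dangling = 0.0; // rank held by nodes without outgoing edges
            for (int i = 0; i < n; i++) {
                if (targets[i].length == 0) {
                    dangling += rank[i];
                }
                for (int k = 0; k < targets[i].length; k++) {
                    next[targets[i][k]] += damping * rank[i] * prob[i][k];
                }
            }
            // Random jump plus uniform redistribution of the dangling mass (assumption).
            double base = (1.0 - damping) / n + damping * dangling / n;
            double diff = 0.0;
            double norm = 0.0;
            for (int j = 0; j < n; j++) {
                next[j] += base;
                diff += Math.abs(next[j] - rank[j]);
                norm += Math.abs(rank[j]);
            }
            rank = next;
            // Stop when the relative L1 difference between successive vectors is small.
            if (diff / norm < tolerance) {
                break;
            }
        }
        return rank;
    }
}
```

On a directed cycle of three nodes, this sketch returns one third for every node whatever the link counts, since normalization makes each single outgoing edge carry weight one. Small cases like this are a convenient sanity check before running on the full graph.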

 To go further

1. Combine this with the inverted index you built yesterday for the Simple English dataset, so that queries over this dataset use a combination of tf-idf and PageRank. Test it.

2. Combine this with your Web crawler to compute the importance of crawled Web pages in the subgraph of the Web that you retrieved.


3. It is unreasonable to assume that the whole graph can be kept in memory. Look at how you could use the FileChannel class of the java.nio.channels package to work directly on a disk-based graph. Propose a way to store this graph.
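One possible direction for the disk-based graph is sketched below. The binary layout is an assumption made for illustration and is not prescribed by the lab: a node count, then a table of byte offsets, then fixed-size edge records. The class DiskGraphSketch and its methods are hypothetical names. The only FileChannel features used are FileChannel.open, write, and the positional read(ByteBuffer, long), which lets the successors of one node be fetched without loading the rest of the file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Hypothetical on-disk layout (an assumption, not part of the lab statement):
 *   int n | (n + 1) long offsets | edges as (int target, double weight) records.
 * offsets[i] and offsets[i + 1] delimit the edge records of node i.
 */
final class DiskGraphSketch implements AutoCloseable {
    private static final int EDGE_BYTES = Integer.BYTES + Double.BYTES;
    private final FileChannel channel;
    private final int nodeCount;

    DiskGraphSketch(Path file) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.READ);
        nodeCount = read(0, Integer.BYTES).getInt();
    }

    /** Writes adjacency lists with the layout above (buffers everything; fine for a demo). */
    static void write(Path file, int[][] targets, double[][] weights) throws IOException {
        int n = targets.length;
        long edges = 0;
        for (int[] t : targets) edges += t.length;
        long headerBytes = Integer.BYTES + (long) (n + 1) * Long.BYTES;
        ByteBuffer buf = ByteBuffer.allocate((int) (headerBytes + edges * EDGE_BYTES));
        buf.putInt(n);
        long offset = headerBytes;
        for (int i = 0; i < n; i++) {
            buf.putLong(offset);
            offset += (long) targets[i].length * EDGE_BYTES;
        }
        buf.putLong(offset); // end of the last adjacency list
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < targets[i].length; k++) {
                buf.putInt(targets[i][k]).putDouble(weights[i][k]);
            }
        }
        buf.flip();
        try (FileChannel out = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            while (buf.hasRemaining()) out.write(buf);
        }
    }

    int nodeCount() { return nodeCount; }

    /** Reads only the edge records of one node and returns its successor indices. */
    int[] successors(int node) throws IOException {
        ByteBuffer bounds = read(Integer.BYTES + (long) node * Long.BYTES, 2 * Long.BYTES);
        long start = bounds.getLong();
        long end = bounds.getLong();
        ByteBuffer edges = read(start, (int) (end - start));
        int[] result = new int[(int) ((end - start) / EDGE_BYTES)];
        for (int k = 0; k < result.length; k++) {
            result[k] = edges.getInt();
            edges.getDouble(); // the weight is skipped in this sketch
        }
        return result;
    }

    /** Positional read of exactly length bytes starting at the given file position. */
    private ByteBuffer read(long position, int length) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length);
        while (buf.hasRemaining()) {
            int count = channel.read(buf, position + buf.position());
            if (count < 0) throw new IOException("unexpected end of file");
        }
        buf.flip();
        return buf;
    }

    @Override
    public void close() throws IOException { channel.close(); }
}
```

Because the edge_list file is sorted by source index, such a file could in principle be produced in a single sequential pass. A PageRank iteration would then scan the nodes in order and keep only the rank vectors in memory.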
