Academic year: 2022

Web Search, Renmin University of China

Lab : programmers

Pierre Senellart (pierre@senellart.com)

July

Labs can be done individually or in groups of two. Make sure to send an informal report about what you did in the lab, and the results you obtained, by email to pierre@senellart.com, either at the end of the lab session or at the latest by midnight of the same day. You do not have to finish all exercises: if you advance reasonably during the lab session but do not reach the end of the assignment, you will still get a passing grade.

The purpose of this lab session is to implement the iterative computation of the PageRank algorithm, to compute the importance of nodes in a graph.

Dataset

The graph of the Simple English Wikipedia is provided as the dataset. You can find it on the course website. The set of articles is slightly different from the one that you used for constructing the inverted index, and Category: articles are kept. The format of the two files is as follows: labels contains the titles of all articles, one per line, sorted in lexicographical order. The article appearing on line n is implicitly given the index n−1. The first line of edge_list is the total number of articles (that is, the number of lines of the labels file). Each line that follows has this form:

A B1,C1 B2,C2 ... Bn,Cn

A is the index of a given article (the edge_list file is sorted by such indices). B1, …, Bn are all articles pointed to by A, and C1, …, Cn are the number of links from A to the respective article.
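The two-file format described above can be read along the following lines. This is only a sketch: the class name GraphFiles and its fields are hypothetical, UTF-8 encoding is assumed, and an article without a line of its own in edge_list is assumed to have no outgoing links.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical holder for the content of the labels and edge_list files. */
final class GraphFiles {
    final String[] labels;  // labels[i] is the title of the article with index i
    final int[][] targets;  // targets[a] holds B1..Bn for article a
    final int[][] counts;   // counts[a][k] is the number of links from a to targets[a][k]

    private GraphFiles(String[] labels, int[][] targets, int[][] counts) {
        this.labels = labels;
        this.targets = targets;
        this.counts = counts;
    }

    /** Reads both files; assumes UTF-8 and well-formed lines. */
    static GraphFiles load(Path labelsFile, Path edgeListFile) throws IOException {
        List<String> titles = Files.readAllLines(labelsFile, StandardCharsets.UTF_8);
        try (BufferedReader in = Files.newBufferedReader(edgeListFile, StandardCharsets.UTF_8)) {
            // First line of edge_list: total number of articles.
            int n = Integer.parseInt(in.readLine().trim());
            int[][] targets = new int[n][0];
            int[][] counts = new int[n][0];
            String line;
            while ((line = in.readLine()) != null) {
                String[] tokens = line.trim().split("\\s+");
                if (tokens[0].isEmpty()) continue; // skip blank lines
                int a = Integer.parseInt(tokens[0]);
                targets[a] = new int[tokens.length - 1];
                counts[a] = new int[tokens.length - 1];
                for (int k = 1; k < tokens.length; k++) {
                    String[] pair = tokens[k].split(","); // "B,C"
                    targets[a][k - 1] = Integer.parseInt(pair[0]);
                    counts[a][k - 1] = Integer.parseInt(pair[1]);
                }
            }
            return new GraphFiles(titles.toArray(new String[0]), targets, counts);
        }
    }
}
```

Storing each adjacency list as a pair of parallel arrays keeps the memory footprint low compared with a list of objects, which may matter for a graph of this size.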

 PageRank

1. Create a class MemoryGraph that will store in memory the content of a (weighted) graph, as an adjacency list for each node of the graph. Implement a constructor that takes as arguments two files in the format described above.

2. Implement the PageRank iterative algorithm on such a graph. Do not forget to normalize the adjacency lists so that the sum of the weights of all outgoing edges of a given node is one. You can use a damping factor (d = 0.15) and stop the iterative computation when the relative difference between two successive vectors is less than a small threshold.

3. Test your implementation. What are the nodes of the Simple English graph that are the most important? Does that sound reasonable?
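One possible shape for the iterative computation is sketched below. Several points are assumptions rather than requirements of the assignment: d is read as the probability of jumping to a random node, so that each score is d/N plus (1 − d) times the weighted sum of the scores of its predecessors; the score of nodes without outgoing edges is spread uniformly over all nodes; the relative difference is measured with the L1 norm; and the class name PageRank, its method names, and the epsilon parameter are hypothetical.

```java
import java.util.Arrays;

/** Sketch of iterative PageRank over adjacency lists stored as parallel arrays. */
final class PageRank {
    /** Scales each adjacency list in place so that its weights sum to one. */
    static void normalize(double[][] weights) {
        for (double[] row : weights) {
            double sum = 0;
            for (double w : row) sum += w;
            for (int k = 0; k < row.length; k++) row[k] /= sum;
        }
    }

    /** Iterates until the relative L1 difference between two vectors is below epsilon. */
    static double[] compute(int[][] targets, double[][] weights, double d, double epsilon) {
        int n = targets.length;
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0 / n); // start from the uniform vector
        while (true) {
            double[] next = new double[n];
            double dangling = 0; // total score of nodes without outgoing edges
            for (int i = 0; i < n; i++) {
                if (targets[i].length == 0) {
                    dangling += pr[i];
                    continue;
                }
                for (int k = 0; k < targets[i].length; k++) {
                    next[targets[i][k]] += pr[i] * weights[i][k];
                }
            }
            double diff = 0, norm = 0;
            for (int j = 0; j < n; j++) {
                // Assumption: d is the random-jump probability.
                next[j] = d / n + (1 - d) * (next[j] + dangling / n);
                diff += Math.abs(next[j] - pr[j]);
                norm += Math.abs(pr[j]);
            }
            pr = next;
            if (diff / norm < epsilon) return pr;
        }
    }

    /** Returns the indices of the k highest-scoring nodes, best first. */
    static Integer[] top(double[] pr, int k) {
        Integer[] order = new Integer[pr.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(pr[b], pr[a]));
        return Arrays.copyOf(order, Math.min(k, order.length));
    }
}
```

With this formulation the scores keep summing to one at every iteration, which gives a cheap sanity check while testing. The indices returned by top can be mapped back to article titles through the labels file.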


 To go further

1. Combine this with the inverted index you built yesterday for the Simple English dataset, so that queries over this dataset use a combination of tf-idf and PageRank. Test it.

2. It is unreasonable to assume that the whole graph can be kept in memory. Look at how you could use the FileChannel class of the java.nio.channels package to work directly on a disk-based graph. Propose a way to store this graph.
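As one illustration of a disk-based layout, the sketch below stores the number of nodes, then a table of N+1 edge offsets, then fixed-size (target, weight) records, so that the adjacency list of any node can be fetched with a few positional reads. The layout, the class name DiskGraph, and its methods are all assumptions made for this example, and the writer builds the file in a single in-memory buffer for brevity, which only suits small graphs.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Sketch of a binary graph file: [n][n+1 offsets][(target, weight) records]. */
final class DiskGraph {
    private static final int EDGE_BYTES = Integer.BYTES + Double.BYTES;

    /** Writes the graph; simplified to use one in-memory buffer (small graphs only). */
    static void write(Path file, int[][] targets, double[][] weights) throws IOException {
        int n = targets.length;
        long edges = 0;
        for (int[] t : targets) edges += t.length;
        ByteBuffer buf = ByteBuffer.allocate(
                (int) (Integer.BYTES + (n + 1L) * Long.BYTES + edges * EDGE_BYTES));
        buf.putInt(n);
        long offset = 0; // offsets are counted in edge records, not in bytes
        for (int i = 0; i < n; i++) {
            buf.putLong(offset);
            offset += targets[i].length;
        }
        buf.putLong(offset);
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < targets[i].length; k++) {
                buf.putInt(targets[i][k]);
                buf.putDouble(weights[i][k]);
            }
        }
        buf.flip();
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            while (buf.hasRemaining()) ch.write(buf);
        }
    }

    /** Reads the targets of node i with positional reads, without loading the whole file. */
    static int[] targetsOf(FileChannel ch, int i) throws IOException {
        ByteBuffer head = ByteBuffer.allocate(Integer.BYTES);
        readFully(ch, head, 0);
        int n = head.getInt();
        ByteBuffer range = ByteBuffer.allocate(2 * Long.BYTES);
        readFully(ch, range, Integer.BYTES + (long) i * Long.BYTES);
        long first = range.getLong();
        long last = range.getLong();
        long base = Integer.BYTES + (n + 1L) * Long.BYTES; // start of the edge records
        int degree = (int) (last - first);
        ByteBuffer records = ByteBuffer.allocate(degree * EDGE_BYTES);
        readFully(ch, records, base + first * EDGE_BYTES);
        int[] result = new int[degree];
        for (int k = 0; k < degree; k++) {
            result[k] = records.getInt();
            records.getDouble(); // the weight is skipped in this sketch
        }
        return result;
    }

    /** Fills buf from the given file position, then flips it for reading. */
    private static void readFully(FileChannel ch, ByteBuffer buf, long pos) throws IOException {
        while (buf.hasRemaining()) {
            if (ch.read(buf, pos + buf.position()) < 0) {
                throw new IOException("unexpected end of file");
            }
        }
        buf.flip();
    }
}
```

Fixed-size records and an offset table trade some disk space for constant-time access to the start of any adjacency list. A PageRank iteration that scans nodes in index order would then read the edge records sequentially.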
