Efficient and Simple Encodings for the Web Graph


HAL Id: inria-00471704

https://hal.inria.fr/inria-00471704

Submitted on 8 Apr 2010

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Jean-Loup Guillaume, Matthieu Latapy, Laurent Viennot

To cite this version:

Jean-Loup Guillaume, Matthieu Latapy, Laurent Viennot. Efficient and Simple Encodings for the Web Graph. The Third International Conference on Web-Age Information Management (WAIM), Aug 2002, Beijing, China. 10.1007/3-540-45703-8_30. inria-00471704.


Jean-Loup Guillaume 1, Matthieu Latapy 1,2 and Laurent Viennot 2

Abstract: In this paper, we propose a set of simple and efficient methods, based on standard, free and widely available tools, to store and manipulate large sets of URLs and large parts of the Web graph. Our aim is both to store the URLs list and the graph efficiently, in order to manage all the computations in a computer's central memory. We also want to make the conversion between URLs and their identifiers as fast as possible, and to obtain all the successors of a URL in the Web graph efficiently. The methods we propose make it possible to obtain a good compromise between these two challenges, and to manipulate large parts of the Web graph.

Keywords: Web graph, Web links, URLs, Compression.

Approximate word count: 3500.

1. Introduction.

One can view the Web as a graph whose vertices are Web pages, and whose edges are hyperlinks from one page to another. Understanding the structure of this graph is a key challenge for many important present and future applications. Information retrieval, optimized crawling and enhanced browsing are some of them. The first step to study the Web graph is to be able to store and manipulate it efficiently, both in terms of space and of time. The key element of this encoding is to associate a unique identifier to each URL, which will then be used to encode the graph.

URLs are more than 70 bytes long on average, and each vertex has an average outdegree of at least seven, depending on the considered domain (from 7.2 in [8] to 11.2

1 LIAFA, Université Paris 7, 2, place Jussieu, 75005 Paris, France. (guillaume,latapy)@liafa.jussieu.fr, +33(0)144272837.
2 Projet Hipercom, INRIA Rocquencourt, F-78153 Le Chesnay (France).


in [1] and 11.57 for the data we used in our experiments). Encoding a one million vertices subgraph of the Web graph without any compression would therefore need more than 100 MB of memory. When one is concerned with the Web graph, it is important to deal with much bigger graphs, classically several hundreds of millions of vertices. Therefore, the efficient encoding of the graph becomes a crucial issue. The challenge is then to find a good balance between space and time requirements.

Until now, the main work concerning graph encoding is the Connectivity Server presented in [2]. This server maintains the graph in memory and is able to compute the neighborhood of one or more vertices. In the first version of the server, the graph is stored as an array of adjacency lists, describing the successors and predecessors of each vertex. The URLs are compressed using a delta compressor: each URL is stored using only the differences from the previous one in the list. The second [3] and present version [10] of the Connectivity Server have significantly improved the compression rate for both links and URLs. The space needed to store a link has been reduced from 8 to 1.7 bytes on average, and the space needed to store a URL has been reduced from 16 to 10 bytes on average. Notice however that a full description of the method is available only for the first version of the server [2], the newer (and more efficient) ones being only shortly described in [3,10].

Our aim is to provide an efficient and simple solution to the problem of encoding large sets of URLs and large parts of the Web graph using only standard, free and widely available tools, namely sort, gzip and bzip. The gzip tool is described in [5,6] and the bzip algorithm in [4]. We tested our methods on an 8 million vertices and 55.5 million links crawl performed inside the ".fr" domain in June 2001. We used the crawler designed by Sébastien Ailleret, available at the following URL:

http://pauillac.inria.fr/~ailleret/prog/larbin/index-eng.html

Our set of data itself is available at:

http://hipercom.inria.fr/~viennot/webgraph/

It has been obtained by a breadth-first crawl from a significant set of URLs. See the URL above for more details on these data. Although it may be considered as relatively


small, this set of data is representative of the Web graph since it is consistent with the known statistics (in particular in terms of in- and out-degree distribution [1, 3], and of the average length of URLs, which are the most important parameters for our study).

All the experiments have been made on a Compaq™ Workstation AP550, with an 800 MHz Pentium™ III processor, 1 GB of memory and a Linux 2.4.9 kernel. We obtained an encoding of each URL in 6.54 bytes on average, with a conversion between URLs and identifiers (in both directions) in about 2 ms. One-way links can also be compressed to 1.6 bytes on average with immediate access (around 20 µs), which can be improved to 1 byte if one allows slower access.

We describe in Section 2 our method to associate a unique identifier to each URL, based on the lexicographical order. We show how to compress the URLs set and how to obtain fast conversion between URLs and identifiers. In Section 3, we notice some properties of the graph itself, concerning a notion of distance between vertices and their successors. These properties explain the good results obtained when we compress the graph. Two different and opposite approaches are discussed concerning the compression: one of them optimizes space use, and the other one optimizes access time.

2. URLs Encoding.

Given a large set of URLs, we want to associate a unique identifier (an integer) to each URL, and to provide a function which can make the mapping between identifiers and URLs. A simple idea consists in sorting all the URLs lexicographically. Then a URL identifier is its position in the set of sorted URLs. We will see that this choice for an identifier makes it possible to obtain an efficient encoding.

Let us consider a file containing a (large) set of URLs obtained from a crawl. First notice that sorting this file improves its compression since it increases the local redundancy of the data: we obtained an average of 7.27 bytes by URL before sorting


very low, and it may be considered as a lower bound. Indeed, using this compression method is very inefficient in terms of lookup time, since when one converts a URL into its identifier or conversely, one has to uncompress the entire file. On the other hand, random access compression schemes exist [7,9], but their compression rates are much lower, too much for our problem. Notice that one can also use bzip [4] instead of gzip to obtain better compression rates (but paying for it by a compression and expansion slowdown). However, we used gzip in our experiments because it provides faster compression and expansion routines, and is more easily usable, through the zlib library for instance.
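This trade-off can be sketched with Python's standard gzip and bz2 modules; the handful of URLs below is a toy stand-in for the paper's multi-million-URL crawl:

```python
import bz2
import gzip

# Toy stand-in for a crawled URL list; sorting groups shared prefixes together.
urls = sorted([
    "http://www.inria.fr/recherche/",
    "http://www.inria.fr/",
    "http://www.liafa.jussieu.fr/",
    "http://www.inria.fr/recherche/equipes/",
])
data = "\n".join(urls).encode()

gz = gzip.compress(data)  # faster to compress and expand, easy to use via zlib
bz = bz2.compress(data)   # usually denser, but slower in both directions

# Both are lossless: the sorted list comes back unchanged.
assert gzip.decompress(gz) == data
assert bz2.decompress(bz) == data
```

On realistic inputs the relative sizes match Table 1: bzip wins on density (4.45 vs 5.55 bytes per URL here), gzip on speed.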

2.1. Encoding by gzipped blocks. To avoid the need of uncompressing the entire list of URLs, we split the file into blocks and compress each of them independently. We also know the first URL of each block, together with its identifier. We save this way a large amount of time since only one block has to be uncompressed to achieve the mapping. Moreover, since the URLs are sorted, the ones which share long common prefixes are in the same block, and so we do not damage the compression rate too much (in some cases, we even obtain a better compression rate than when one compresses the entire file).

Experimentally, the average size for a compressed URL does not significantly increase as long as the blocks length stays over one thousand URLs. In this case, the URL average size is 5.62 bytes. With blocks of one hundred URLs, the average size grows up to 6.43 bytes. Notice that the method can be improved by taking blocks of different sizes, depending on the local redundancy of the URLs list. We did not use this improvement in the results presented here, which have therefore been realized with blocks of constant length.
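A minimal sketch of this block construction, using Python's zlib (the block length, data and helper names are illustrative, not from the paper):

```python
import zlib

def build_blocks(sorted_urls, block_len):
    """Split a sorted URL list into fixed-size blocks, compress each block
    independently, and keep the first URL of every block (its identifier
    is simply block_number * block_len)."""
    blocks, first_urls = [], []
    for i in range(0, len(sorted_urls), block_len):
        chunk = sorted_urls[i:i + block_len]
        first_urls.append(chunk[0])
        blocks.append(zlib.compress("\n".join(chunk).encode()))
    return blocks, first_urls

blocks, first_urls = build_blocks(
    ["a.org/", "a.org/x", "b.net/", "b.net/y"], block_len=2)
```

Only one of these blocks ever needs to be expanded per lookup, which is where the time saving comes from.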

One can then convert a URL into an identifier as follows:

1. Find the block which contains the URL to convert: use a dichotomic search based on the knowledge of the first URL of each block (either because we kept a list of those URLs, or by uncompressing the first line of each concerned block).

2. Uncompress the block.

3. Find the identifier of the URL inside the (uncompressed) block: use a linear search in the list (we cannot avoid this linear search since all the URLs do not have the same length).

Encoding   Total size (8 million URLs)   Average size/URL
Text       568733818 bytes               69.24 bytes
bzip       36605478 bytes                4.45 bytes
gzip       45263569 bytes                5.55 bytes

Table 1. Average URL size according to coding format.

This conversion scheme is summarized in Table 2.
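The three steps above can be sketched as follows (Python, with bisect for the dichotomic search; the toy URLs and block length are illustrative):

```python
import bisect
import zlib

BLOCK_LEN = 2
urls = ["a.org/", "a.org/x", "b.net/", "b.net/y"]  # already sorted
blocks = [zlib.compress("\n".join(urls[i:i + BLOCK_LEN]).encode())
          for i in range(0, len(urls), BLOCK_LEN)]
first_urls = urls[::BLOCK_LEN]  # first URL of each block, kept uncompressed

def url_to_id(url):
    # Step 1: dichotomic search on the first URLs to locate the block.
    b = bisect.bisect_right(first_urls, url) - 1
    # Step 2: uncompress that single block.
    lines = zlib.decompress(blocks[b]).decode().split("\n")
    # Step 3: linear search inside the block (URLs have variable length).
    return b * BLOCK_LEN + lines.index(url)
```

For example, `url_to_id("b.net/y")` locates block 1, expands it, and returns identifier 3.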

Conversely, one can convert an identifier to a URL as follows:

1. Find the block which contains the identifier to convert: since all the blocks contain the same number of URLs, the block number is given by ⌊Identifier / BlocksLength⌋.

2. Uncompress the block.

3. Find the URL in the (uncompressed) block: it is nothing but the line number Identifier − BlocksLength × BlockNumber in the block. Again, we need to use a linear search in the list.

This conversion is summarized in Table 2.

              URL to identifier           identifier to URL
First step    O(log(number of blocks))    O(1)
Second step   O(blocks length)            O(blocks length)
Third step    O(blocks length)            O(blocks length)

Table 2. URL to identifier and identifier to URL mapping costs, when all the URLs do not have the same length inside a block.
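The reverse direction can be sketched the same way; this stand-alone fragment repeats the toy setup so that the sketch is self-contained:

```python
import zlib

BLOCK_LEN = 2
urls = ["a.org/", "a.org/x", "b.net/", "b.net/y"]
blocks = [zlib.compress("\n".join(urls[i:i + BLOCK_LEN]).encode())
          for i in range(0, len(urls), BLOCK_LEN)]

def id_to_url(ident):
    b = ident // BLOCK_LEN  # Step 1: block number, O(1)
    lines = zlib.decompress(blocks[b]).decode().split("\n")  # Step 2
    # Step 3: pick the right line inside the block.
    return lines[ident - b * BLOCK_LEN]
```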

Notice that, because of the linear search in a block (Step 3 of each conversion), it is worth making use of a fixed length for all the URLs in each block. This is what we will present in the following subsection.

2.2. Fixed URLs length. To improve the lookup time, we add at the end of all the URLs in a given block as many occurrences of a special character as necessary to make each one as long as the longest URL in the block. In each block, the fixed length is then the length of the longest URL. Therefore, the third point of the URL to identifier conversion becomes a dichotomic search in the block, and the third point of the identifier to URL conversion can be done in constant time, since the URL is at position UrlsLength × (Identifier − BlocksLength × BlockNumber) in the block. This improvement is summarized in Table 3.

              URL to identifier           identifier to URL
First step    O(log(number of blocks))    O(1)
Second step   O(blocks length)            O(blocks length)
Third step    O(log(blocks length))       O(1)

Table 3. URL to identifier and identifier to URL mapping costs, when all the URLs have the same length inside a block.
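The padding trick behind Table 3 can be sketched as follows (the padding character and helper names are ours, not from the paper):

```python
import zlib

PAD = "\0"  # special character appended to equalize URL lengths in a block

def pack_block(chunk):
    """Pad every URL in the block to the length of its longest URL, so the
    i-th URL starts at offset i * width once the block is uncompressed."""
    width = max(len(u) for u in chunk)
    blob = zlib.compress("".join(u.ljust(width, PAD) for u in chunk).encode())
    return width, blob

def url_at(width, blob, i):
    text = zlib.decompress(blob).decode()
    # Constant-time third step: jump straight to the i-th fixed-size slot.
    return text[i * width:(i + 1) * width].rstrip(PAD)

width, blob = pack_block(["a.org/", "a.org/very-long-page"])
```

The cost of the trick is visible in the example: the short URL carries unused padding, which is exactly the space/time tension discussed next.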

Notice that this optimization must be done carefully to ensure both a good compression of the URLs and a fast expansion (Step 2). If the blocks size is too low, the compression rate will naturally be low. On the opposite, if the size is too large, the probability that a very long URL lies in the block will increase, adding a lot of unused characters, which are going to increase the average URL size. Expansion time is linear with respect to the blocks length, so we must use blocks as small as possible to get a fast mapping. Using a median blocks length will result in a very good compression rate but median expansion speed. Results showing these phenomena can be found in Figure 1.


Figure 1. Average URL size and conversion times with respect to the size of the considered blocks, using fixed-length URLs. (The plot shows the URL-to-ID and ID-to-URL times, in ms, and the average URL size, in bytes, as functions of the block size.)

In conclusion, we obtained a coding of the URLs in 6.54 bytes on average, with conversion between URLs and their identifiers in about 2 ms (in both directions), using only simple, free and widely available tools (sort and gzip). This coding associates to each URL its position in the entire list with respect to the lexicographic order, and we showed how one can compute the correspondence efficiently. We will now see how this encoding can be used to represent large parts of the Web graph.

3. Graph Encodings.

As soon as the mapping between URLs and identifiers is defined, we can try to compress all links as much as possible. A link is defined by a couple of integers, each of them being the identifier of a URL as defined in Section 2. The graph is then stored in a file such that line number k contains the identifiers of all the successors of vertex k (in a textual form). Using bzip to compress this file, we obtain a very compact encoding: 0.8 byte by link on average. If one uses gzip instead of bzip, the average size of each link grows up to 0.83 byte on average. Again, these values may be considered as lower bounds.


In this section, we will propose two methods to encode the links of the Web graph. The first one is a simple extension of the gzipped blocks method used in the previous section. It gives high compression rates, which can be understood as a consequence of a strong locality of the links, which we will discuss. In order to improve the access time to the successors of a vertex, which is very important to be able to make statistics and run algorithms on the graph, we propose a second method which achieves this goal but still allows high compression rates. Notice that the techniques we present in this section can be used to encode the reverse links (given a URL, which pages do contain a link to this URL). The performances would be similar.

3.1. Encoding by gzipped blocks. Using the same method as in Section 2.1, we can split the file representing the graph into blocks and then compress the blocks. In order to find the successors of a vertex, one has to uncompress the block containing the vertex in concern. Once this has been done, the vertex successors have to be found. Depending on how the successors are coded, two different searching methods can be used. If the successors lists have variable length, one has to read the block linearly from the beginning to the right successors list. On the other hand, if successors have fixed length (this can be done in the same way as for the URLs) then the successors list can be found directly. Notice that in both cases, since most of the lookup time is spent in the block expansion, there is no real time difference between getting one successor of a vertex and getting the entire list of its successors. Average lookup times and average link sizes can be found in Figure 2. One can obtain an encoding of each link in 1.24 bytes on average with a lookup time of 0.45 ms, using 32-lines blocks. Table 4 presents the results when the block size changes.
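The same gzipped-blocks machinery applies directly to the graph file, where line k lists the successors of vertex k; a self-contained sketch with toy adjacency data:

```python
import zlib

BLOCK_LEN = 2  # vertices (lines) per block; the paper tries 8 to 256
adjacency = ["1 2", "2", "0 3", ""]  # line k lists the successors of vertex k
blocks = [zlib.compress("\n".join(adjacency[i:i + BLOCK_LEN]).encode())
          for i in range(0, len(adjacency), BLOCK_LEN)]

def successors(k):
    """Uncompress only the block containing vertex k, then read its line."""
    b = k // BLOCK_LEN
    lines = zlib.decompress(blocks[b]).decode().split("\n")
    return [int(s) for s in lines[k - b * BLOCK_LEN].split()]
```

As the text notes, almost all the cost is in the single `zlib.decompress` call, so fetching one successor or the whole list takes about the same time.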

However, most of the operations made on the graph concern the exploration of successors or predecessors of vertices (during breadth-first search for instance). In this case, the successors lookup time becomes a crucial parameter, and the block compression method should be improved in terms of time. We are going to present another compression method which uses a strong property of the Web graph, the locality, to achieve this.


Figure 2. Average link size and average lookup time with respect to the size of the considered blocks. (The plot shows the average lookup time, in ms, and the average link size, in bytes, as functions of the block size.)

3.2. Locality. The high compression rates we obtained when we encoded the graph using gzip can be understood as a consequence of a strong property of the links. Let us define the distance between two URLs as the (signed) difference between their identifiers, and the length of a link between two URLs as the distance between these two URLs. Now, let us consider the distances distribution. This distribution follows a power law: the probability for the distance between two given vertices to be i is proportional to i^(−α), where the exponent α is about 1.16 in our case. See Figure 3.

One may want to use this locality to improve both compression rate and access time by encoding the graph in a file as follows: the k-th line of the file contains the successors of URL number k, encoded by their distance to k. We can then use the same technique of gzipped blocks encoding to manipulate the graph. We tried this method, but we obtained lower compression rates than the ones presented in the previous subsection. However, this encoding may be used to improve lookup time, without damaging the compression rate too much, as explained in the following subsection.
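The distance encoding itself is a one-line transformation each way; a sketch with illustrative vertex numbers:

```python
def to_distances(k, succs):
    """Replace each successor identifier by its signed distance to vertex k."""
    return [s - k for s in succs]

def from_distances(k, dists):
    """Recover the successor identifiers from the stored distances."""
    return [k + d for d in dists]

# Locality makes most stored values small: a vertex linking to nearby pages.
dists = to_distances(1000, [998, 1001, 1250])
```

Because most successors are lexicographically close to their source, the stored values cluster near zero, which is what the short/medium/long scheme of the next subsection exploits.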


Figure 3. Distance distribution between vertices and their successors (log-log scale), fitted by a power law of exponent 1.16.

3.3. Access time improvement. Our experiments show that 68 percent of the URLs which are linked together are at a distance between -255 and 255. We call these links short links. They can be encoded on 1 byte, plus 1 bit for the sign of the difference. Moreover, we need one more bit to distinguish short links from long ones (the long links are encoded using 3 bytes, since we are considering an 8 million vertices graph). This scheme allows us to encode a link using 1.89 byte on average. Going further, one can distinguish short (68 percent of the links, each encoded on 1 byte), medium (26.75 percent of the links, encoded on 2 bytes) and long (5.25 percent of the links, encoded on 3 bytes) links. We then use one bit per link to give the sign of the distance, and a prefix to know the type of the link (0 for short links, 10 for medium links and 11 for long links). This way, a link can be stored using 1.66 byte on average.
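The paper does not spell out the exact bit packing, so the sketch below renders the 0 / 10 / 11 scheme as a tuple (type prefix, sign bit, magnitude bytes) rather than a packed bit stream:

```python
def encode_distance(d):
    """Classify a signed distance as short / medium / long and return
    (type_prefix, sign_bit, magnitude_bytes) - an illustrative layout."""
    sign = 1 if d < 0 else 0
    mag = abs(d)
    if mag <= 255:                        # short link: 1 byte of magnitude
        return "0", sign, mag.to_bytes(1, "big")
    if mag <= 65535:                      # medium link: 2 bytes
        return "10", sign, mag.to_bytes(2, "big")
    return "11", sign, mag.to_bytes(3, "big")  # long link: 3 bytes
```

Three magnitude bytes suffice for the long class because the graph has fewer than 2^24 vertices.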

Moreover, the distance distribution encourages us to use Huffman compression of the distances. However, our experiments show that it is better not to compress the long and medium links; Huffman coding the short links brings an improvement of 1 bit on average, which brings us to 1.54 bytes by link. Our results are summarized in Table 4.

4. Conclusion.

We described in this paper a simple and efficient method to encode large sets of URLs and large parts of the Web graph. We gave a way to compute the position of a URL in the sorted list of all the considered URLs, and conversely, which makes it possible to manipulate large data sets in RAM, avoiding disk usage. Our gzipped blocks method makes it possible to store 400 million URLs and the 4.6 billion links between them in 8 GB of memory space. Using this encoding, the conversion between identifiers and URLs takes around 2 ms on our computer, in both directions, and finding all the successors of a given URL takes around 0.5 ms. We can improve the link lookup to around 20 µs by using the second method we proposed, but with an increase of the space requirements.

We therefore obtained results which are comparable to the best results known in the literature, using only standard, free, and widely available tools like sort, gzip and bzip. Notice that the good performances of our method rely on the performances of these tools, which have the advantage of being strongly optimized.

Our work can be improved in many directions. We discussed some of them in the paper, for example the use of pieces of files of different sizes (depending on the local redundancy of the URLs list). Another idea is to try to increase the locality and the redundancy of the URLs, for example by reversing the sites names. This may reduce the distances between pages of sites which belong to a same sub-domain. There are also many parameters which depend on the priority of time or space saving, itself depending on the application. However, the optimization of memory requirements makes it possible to store the entire data in RAM, reducing disk access, and is therefore also important to improve computing time. This is why we gave priority to the optimization of space requirements, except when a big improvement in speed can be obtained.


                                           Average link size   Average lookup time
                                                               for all the successors
identifiers                                8 bytes             -
gzipped identifiers                        0.83 byte           -
distances                                  4.16 bytes          -
gzipped distances                          1.1 byte            -
gzipped identifiers, blocks of 8 lines     1.61 byte           0.44 ms
gzipped identifiers, blocks of 16 lines    1.36 byte           0.44 ms
gzipped identifiers, blocks of 32 lines    1.24 byte           0.45 ms
gzipped identifiers, blocks of 64 lines    1.20 byte           2.395 ms
gzipped identifiers, blocks of 128 lines   1.21 byte           5.694 ms
gzipped identifiers, blocks of 256 lines   1.26 byte           16.866 ms
short, long links                          1.89 byte           20 µs
short, medium, long links                  1.66 byte           20 µs
short (Huffman), medium, long links        1.54 byte           20 µs

Table 4. The average space needed to store one link, depending on the method used. The first four lines only serve as references, since they imply either a very low compression ratio or very slow elementary operations.
