HAL Id: inria-00471704
https://hal.inria.fr/inria-00471704
Submitted on 8 Apr 2010
HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Jean-Loup Guillaume, Matthieu Latapy, Laurent Viennot
To cite this version:
Jean-Loup Guillaume, Matthieu Latapy, Laurent Viennot. Efficient and Simple Encodings for the Web Graph. The Third International Conference on Web-Age Information Management (WAIM), Aug 2002, Beijing, China. �10.1007/3-540-45703-8_30�. �inria-00471704�
Efficient and Simple Encodings for the Web Graph

Jean-Loup Guillaume (1), Matthieu Latapy (1,2) and Laurent Viennot (2)
Abstract: In this paper, we propose a set of simple and efficient methods, based on standard, free and widely available tools, to store and manipulate large sets of URLs and large parts of the Web graph. Our aim is both to store the URL list and the graph efficiently, in order to carry out all the computations in a computer's central memory. We also want to make the conversion between URLs and their identifiers as fast as possible, and to obtain all the successors of a URL in the Web graph efficiently. The methods we propose achieve a good compromise between these two challenges, and make it possible to manipulate large parts of the Web graph.
Keywords: Web graph, Web links, URLs, Compression.
Approximate word count: 3500.
1. Introduction.
One can view the Web as a graph whose vertices are Web pages, and whose edges are hyperlinks from one page to another. Understanding the structure of this graph is a key challenge for many important present and future applications. Information retrieval, optimized crawling and enhanced browsing are some of them. The first step in studying the Web graph is to be able to store and manipulate it efficiently, both in terms of space and of time. The key element of this encoding is to associate a unique identifier to each URL, which is then used to encode the graph.
URLs are more than 70 bytes long on average, and each vertex has an average outdegree of at least seven, depending on the considered domain (from 7.2 in [8] to 11.2
(1) LIAFA, Université Paris 7, 2, place Jussieu, 75005 Paris, France. (guillaume,latapy)@liafa.jussieu.fr, +33 (0)1 44 27 28 37.
(2) Projet Hipercom, INRIA Rocquencourt, F-78153 Le Chesnay (France).
in [1], and 11.57 for the data we used in our experiments). Encoding a one-million-vertex subgraph of the Web graph without any compression would therefore need more than 100 MB of memory. When one is concerned with the Web graph, it is important to deal with much bigger graphs, classically of several hundreds of millions of vertices. The efficient encoding of the graph therefore becomes a crucial issue. The challenge is then to find a good balance between space and time requirements.
Until now, the main work concerning graph encoding is the Connectivity Server presented in [2]. This server maintains the graph in memory and is able to compute the neighborhood of one or more vertices. In the first version of the server, the graph is stored as an array of adjacency lists, describing the successors and predecessors of each vertex. The URLs are compressed using a delta compressor: each URL is stored using only its differences from the previous one in the list. The second [3] and present version [10] of the Connectivity Server have significantly improved the compression rate for both links and URLs. The space needed to store a link has been reduced from 8 to 1.7 bytes on average, and the space needed to store a URL has been reduced from 16 to 10 bytes on average. Notice however that a full description of the method is available only for the first version of the server [2], the newer (and more efficient) ones being only briefly described in [3, 10].
Our aim is to provide an efficient and simple solution to the problem of encoding large sets of URLs and large parts of the Web graph using only standard, free and widely available tools, namely sort, gzip and bzip. The gzip tool is described in [5, 6] and the bzip algorithm in [4]. We tested our methods on an 8 million vertices and 55.5 million links crawl performed inside the ".fr" domain in June 2001. We used the crawler designed by Sébastien Ailleret, available at the following URL:
http://pauillac.inria.fr/~ailleret/prog/larbin/index-eng.html
Our set of data itself is available at:
http://hipercom.inria.fr/~viennot/webgraph/
It has been obtained by a breadth-first crawl from a significant set of URLs. See the URL above for more details on these data. Although it may be considered relatively small, this set of data is representative of the Web graph, since it is consistent with the known statistics (in particular in terms of in- and out-degree distribution [1, 3], and of the average length of URLs, which are the most important parameters for our study).
All the experiments have been made on a Compaq(TM) Workstation AP550, with an 800 MHz Pentium(TM) III processor, 1 GB of memory and a Linux 2.4.9 kernel. We obtained an encoding of each URL in 6.54 bytes on average, with a conversion between URLs and identifiers (in both directions) in about 2 ms. One-way links can also be compressed to 1.6 byte on average with immediate access (around 20 µs), which can be improved to 1 byte if one allows slower access.
We describe in Section 2 our method to associate a unique identifier to each URL, based on the lexicographical order. We show how to compress the URL set and how to obtain fast conversion between URLs and identifiers. In Section 3, we point out some properties of the graph itself, concerning a notion of distance between vertices and their successors. These properties explain the good results obtained when we compress the graph. Two different and opposite approaches are discussed concerning the compression: one of them optimizes space use, and the other one optimizes access time.
2. URLs Encoding.
Given a large set of URLs, we want to associate a unique identifier (an integer) to each URL, and to provide a function which performs the mapping between identifiers and URLs. A simple idea consists in sorting all the URLs lexicographically. The identifier of a URL is then its position in the set of sorted URLs. We will see that this choice of identifier makes it possible to obtain an efficient encoding.
Let us consider a file containing a (large) set of URLs obtained from a crawl. First notice that sorting this file improves its compression, since it increases the local redundancy of the data: we obtained an average of 7.27 bytes per URL before sorting, against 5.55 bytes after sorting (see Table 1). This average size is very low, and it may be considered as a lower bound. Indeed, using this compression method is very inefficient in terms of lookup time, since to convert a URL into its identifier, and conversely, one has to uncompress the entire file. On the other hand, random access compression schemes exist [7, 9], but their compression rates are much lower, too much for our problem. Notice that one can also use bzip [4] instead of gzip to obtain better compression rates (but paying for it with a compression and expansion slowdown). However, we used gzip in our experiments because it provides faster compression and expansion routines, and is more easily usable, through the zlib library for instance.
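The effect of sorting on compression can be checked with a few lines of Python (an illustrative experiment of ours, using the zlib library mentioned above; the URL shapes are made up):

```python
import random
import zlib

# Synthetic URLs sharing long prefixes, as in a real crawl (made-up names).
urls = ["http://www.example.fr/%s/section%02d/page%04d.html" % (site, s, p)
        for site in ("alpha", "beta", "gamma")
        for s in range(10) for p in range(200)]
random.seed(0)
random.shuffle(urls)

unsorted_size = len(zlib.compress("\n".join(urls).encode()))
sorted_size = len(zlib.compress("\n".join(sorted(urls)).encode()))

# Sorting groups common prefixes together, increasing local redundancy,
# so the sorted file compresses better.
assert sorted_size < unsorted_size
```

The gain comes from the fact that LZ77-style compressors such as gzip only find repetitions within a bounded window, so moving similar URLs next to each other makes the repetitions visible.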
2.1. Encoding by gzipped blocks. To avoid having to uncompress the entire list of URLs, we split the file into blocks and compress each of them independently. We also record the first URL of each block, together with its identifier. We save this way a large amount of time, since only one block has to be uncompressed to achieve the mapping. Moreover, since the URLs are sorted, the ones which share long common prefixes are in the same block, and so we do not damage the compression rate too much (in some cases, we even obtain a better compression rate than when one compresses the entire file).
Experimentally, the average size of a compressed URL does not significantly increase as long as the block length stays over one thousand URLs. In this case, the average URL size is 5.62 bytes. With blocks of one hundred URLs, the average size grows up to 6.43 bytes. Notice that the method can be improved by taking blocks of different sizes, depending on the local redundancy of the URLs list. We did not use this improvement in the results presented here, which have therefore been obtained with blocks of constant length.
One can then convert a URL into an identifier as follows:
1. Find the block which contains the URL to convert: use a dichotomic search based on the knowledge of the first URL of each block (either because we kept a list of those URLs, or by uncompressing the first line of each concerned block).
2. Uncompress the block.
3. Find the identifier of the URL inside the (uncompressed) block: use a linear search in the list (we cannot avoid this linear search since the URLs do not all have the same length).

Encoding   Total size (8 million URLs)   Average size/URL
Text       568733818 bytes               69.24 bytes
bzip       36605478 bytes                4.45 bytes
gzip       45263569 bytes                5.55 bytes
Table 1. Average URL size according to coding format.
This conversion scheme is summarized in Table 2.
Conversely, one can convert an identifier into a URL as follows:
1. Find the block which contains the identifier to convert: since all the blocks contain the same number of URLs, the block number is given by Identifier / BlocksLength.
2. Uncompress the block.
3. Find the URL in the (uncompressed) block: it is nothing but the line number Identifier - BlocksLength * BlockNumber in the block. Again, we need to use a linear search in the list.
This conversion is summarized in Table 2.
              URL to identifier          identifier to URL
First step    O(log(number of blocks))   O(1)
Second step   O(block length)            O(block length)
Third step    O(block length)            O(block length)
Table 2. URL to identifier and identifier to URL mapping costs, when the URLs do not all have the same length inside a block.
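The two conversion procedures can be sketched as follows (a minimal Python illustration, assuming fixed-size blocks compressed with zlib; the names build_blocks, url_to_id and id_to_url are ours, not the paper's):

```python
import bisect
import zlib

BLOCK_LEN = 1000  # number of URLs per block

def build_blocks(sorted_urls):
    """Split the sorted URL list into compressed blocks; keep each block's first URL."""
    blocks, first_urls = [], []
    for i in range(0, len(sorted_urls), BLOCK_LEN):
        chunk = sorted_urls[i:i + BLOCK_LEN]
        first_urls.append(chunk[0])
        blocks.append(zlib.compress("\n".join(chunk).encode()))
    return blocks, first_urls

def url_to_id(url, blocks, first_urls):
    """Dichotomic search over the first URLs, then linear search in one block."""
    b = bisect.bisect_right(first_urls, url) - 1             # step 1
    lines = zlib.decompress(blocks[b]).decode().split("\n")  # step 2
    return b * BLOCK_LEN + lines.index(url)                  # step 3 (linear)

def id_to_url(ident, blocks):
    """Block number by integer division, then line number inside the block."""
    b = ident // BLOCK_LEN                                   # step 1
    lines = zlib.decompress(blocks[b]).decode().split("\n")  # step 2
    return lines[ident - b * BLOCK_LEN]                      # step 3
```

Only one block is ever uncompressed per lookup, which is where the time saving over whole-file compression comes from.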
Notice that, because of the linear search in a block (Step 3 of each conversion), the lookup time can be reduced by the use of a fixed length for all the URLs in each block. This is what we present in the following subsection.
2.2. Fixed URL length. To improve the lookup time, we append at the end of each URL in a given block as many occurrences of a special character as necessary to make it as long as the longest URL in the block. In each block, the fixed length is then the length of the longest URL. Therefore, the third step of the URL to identifier conversion becomes a dichotomic search in the block, and the third step of the identifier to URL conversion can be done in constant time, since the URL is at position UrlsLength * (Identifier - BlocksLength * BlockNumber) in the block. This improvement is summarized in Table 3.
              URL to identifier          identifier to URL
First step    O(log(number of blocks))   O(1)
Second step   O(block length)            O(block length)
Third step    O(log(block length))       O(1)
Table 3. URL to identifier and identifier to URL mapping costs, when all the URLs have the same length inside a block.
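The fixed-length variant can be sketched like this (illustrative Python; we pad with a null character, which sorts before every character that can appear in a URL — the paper does not specify the padding character):

```python
import bisect
import zlib

BLOCK_LEN = 1000
PAD = "\0"  # padding character (our choice; must sort before any URL character)

def build_fixed_blocks(sorted_urls):
    """Compress blocks in which every URL is padded to the block's longest URL."""
    blocks, first_urls, widths = [], [], []
    for i in range(0, len(sorted_urls), BLOCK_LEN):
        chunk = sorted_urls[i:i + BLOCK_LEN]
        width = max(len(u) for u in chunk)   # fixed length for this block
        first_urls.append(chunk[0])
        widths.append(width)
        padded = "".join(u.ljust(width, PAD) for u in chunk)
        blocks.append(zlib.compress(padded.encode()))
    return blocks, first_urls, widths

def url_to_id(url, blocks, first_urls, widths):
    """Step 3 is now a dichotomic search over equal-width rows."""
    b = bisect.bisect_right(first_urls, url) - 1
    data = zlib.decompress(blocks[b]).decode()
    w = widths[b]
    rows = [data[j:j + w] for j in range(0, len(data), w)]
    return b * BLOCK_LEN + bisect.bisect_left(rows, url)

def id_to_url(ident, blocks, widths):
    """Constant-time extraction at offset width * (identifier mod block length)."""
    b = ident // BLOCK_LEN
    data = zlib.decompress(blocks[b]).decode()
    w = widths[b]
    start = w * (ident - b * BLOCK_LEN)
    return data[start:start + w].rstrip(PAD)
```

Because "\0" sorts before every printable character, the padded rows keep the lexicographic order of the URLs, so the dichotomic search in url_to_id remains correct.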
Notice that this optimization must be done carefully to ensure both a good compression of the URLs and a fast expansion (Step 2). If the block size is too low, the compression rate will naturally be low. On the opposite, if the size is too large, the probability that a very long URL lies in the block increases, adding many unused characters which increase the average URL size. Expansion time is linear with respect to the block length, so we must use blocks as small as possible to get a fast mapping. Using a medium block length results in a very good compression rate but a medium expansion speed. Results showing these phenomena can be found in Figure 1.
[Figure: log-log plots of URL-to-ID time (ms), ID-to-URL time (ms), and average URL size (bytes) as functions of block size.]
Figure 1. Average URL size and conversion times with respect to the size of the considered blocks, using fixed-length URLs.
In conclusion, we obtained an encoding of the URLs in 6.54 bytes on average, with conversion between URLs and their identifiers in about 2 ms (in both directions), using only simple, free and widely available tools (sort and gzip). This encoding associates to each URL its position in the entire list with respect to the lexicographic order, and we have shown how one can compute the correspondence efficiently. We will now see how this encoding can be used to represent large parts of the Web graph.
3. Graph Encodings.
As soon as the mapping between URLs and identifiers is defined, we can try to compress all the links as much as possible. A link is defined by a couple of integers, each of them being the identifier of a URL as defined in Section 2. The graph is then stored in a file such that line number k contains the identifiers of all the successors of vertex k (in textual form). Using bzip to compress this file, we obtain a very compact encoding: 0.8 byte per link on average. If one uses gzip instead of bzip, the average size of each link grows up to 0.83 byte. Again, these values may be considered as lower bounds, since with this encoding one has to uncompress the entire file to access a single link.
In this section, we propose two methods to encode the links of the Web graph. The first one is a simple extension of the gzipped blocks method used in the previous section. It gives high compression rates, which can be understood as a consequence of a strong locality of the links, which we will discuss. In order to improve the access time to the successors of a vertex, which is very important to be able to compute statistics and run algorithms on the graph, we propose a second method which achieves this goal but still allows high compression rates. Notice that the techniques we present in this section can be used to encode the reverse links (given a URL, which pages contain a link to this URL). The performances would be similar.
3.1. Encoding by gzipped blocks. Using the same method as in Section 2.1, we can split the file representing the graph into blocks and then compress the blocks. In order to find the successors of a vertex, one has to uncompress the block containing the vertex in concern. Once this has been done, the successors of the vertex have to be found. Depending on how the successors are coded, two different searching methods can be used. If the successor lists have variable length, one has to read the block linearly from the beginning to the right successor list. On the other hand, if the successors have fixed length (this can be done in the same way as for the URLs), then the successor list can be found directly. Notice that in both cases, since most of the lookup time is spent in the block expansion, there is no real time difference between getting one successor of a vertex and getting the entire list of its successors. Average lookup time and average link size can be found in Figure 2. One can obtain an encoding of each link in 1.24 byte on average with a lookup time of 0.45 ms, using 32-line blocks. Table 4 presents the results when the block size changes.
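This block scheme for adjacency lists can be sketched as follows (an illustrative Python fragment with variable-length successor lists, one per line, as in the text; the function names are ours):

```python
import zlib

BLOCK_LINES = 32  # successor lists per block

def build_graph_blocks(successor_lists):
    """Compress the textual adjacency file by blocks of BLOCK_LINES lines."""
    blocks = []
    for i in range(0, len(successor_lists), BLOCK_LINES):
        chunk = successor_lists[i:i + BLOCK_LINES]
        text = "\n".join(" ".join(str(s) for s in succ) for succ in chunk)
        blocks.append(zlib.compress(text.encode()))
    return blocks

def successors(vertex, blocks):
    """Uncompress the block containing the vertex, then read its line."""
    block_number, line = divmod(vertex, BLOCK_LINES)
    lines = zlib.decompress(blocks[block_number]).decode().split("\n")
    return [int(x) for x in lines[line].split()]
```

Since nearly all of the lookup time is spent in the zlib decompression, fetching one successor or the whole list costs about the same, as observed above.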
However, most of the operations made on the graph concern the exploration of the successors or predecessors of vertices (during a breadth-first search, for instance). In this case, the successors lookup time becomes a crucial parameter, and the block compression method should be improved in terms of time. We are going to present another compression method which uses a strong property of the Web graph, the locality, to achieve this.
[Figure: log-log plots of average lookup time (ms) and average link size (bytes) as functions of block size.]
Figure 2. Average link size and average lookup time with respect to the size of the considered blocks.
3.2. Locality. The high compression rates we obtained when we encoded the graph using gzip can be understood as a consequence of a strong property of the links. Let us define the distance between two URLs as the (signed) difference between their identifiers, and the length of a link between two URLs as the distance between these two URLs. Now, let us consider the distribution of these distances. This distribution follows a power law: the probability for the distance between two given vertices to be i is proportional to 1/i^alpha. In our case, the exponent alpha is about 1.16. See Figure 3.
One may want to use this locality to improve both compression rate and access time by encoding the graph in a file as follows: the k-th line of the file contains the successors of URL number k, encoded by their distance to k. We can then use the same technique of gzipped blocks encoding to manipulate the graph. We tried this method, but we obtained lower compression rates than the ones presented in the previous subsection. However, this encoding may be used to improve the lookup time, without damaging the compression rate too much, as explained in the following subsection.
[Figure: log-log plot of the number of links as a function of distance, fitted by a power law of exponent 1.16.]
Figure 3. Distance distribution between vertices and their successors.
3.3. Access time improvement. Our experiments show that 68 percent of the URLs which are linked together are at a distance between -255 and 255. We call these links short links. They can be encoded on 1 byte, plus 1 bit for the sign of the difference. Moreover, we need one more bit to distinguish short links from long ones (the long links are encoded using 3 bytes, since we are considering an 8 million vertices graph). This scheme allows us to encode a link using 1.89 byte on average. Going further, one can distinguish short (68 percent of the links, each encoded on 1 byte), medium (26.75 percent of the links, encoded on 2 bytes) and long (5.25 percent of the links, encoded on 3 bytes) links. We then use one bit per link to give the sign of the distance, and a prefix to give the type of the link (0 for short links, 10 for medium links and 11 for long links). This way, a link can be stored using 1.66 byte on average.
Moreover, the distance distribution encourages us to use Huffman compression of the distances. However, our experiments show that it is better not to compress the long and medium links; Huffman coding of the short links brings an improvement of 1 bit per link on average, which brings us to 1.54 byte per link. Our results are summarized in Table 4.
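The short/medium/long scheme can be illustrated at the bit level as follows (a Python sketch; the paper only outlines the code layout, so the exact ordering of prefix, sign bit and magnitude below is our assumption):

```python
def encode_distance(d):
    """Encode a signed link distance as a bit string.

    Prefix 0 = short link (magnitude on 8 bits), 10 = medium (16 bits),
    11 = long (24 bits); one sign bit follows the prefix (assumed layout).
    """
    sign = "1" if d < 0 else "0"
    m = abs(d)
    if m <= 0xFF:
        return "0" + sign + format(m, "08b")
    if m <= 0xFFFF:
        return "10" + sign + format(m, "016b")
    return "11" + sign + format(m, "024b")

def decode_distance(bits, pos):
    """Decode one distance from a bit stream; return (distance, next position)."""
    if bits[pos] == "0":
        pos, nbits = pos + 1, 8        # short link
    elif bits[pos + 1] == "0":
        pos, nbits = pos + 2, 16       # medium link
    else:
        pos, nbits = pos + 2, 24       # long link
    sign = -1 if bits[pos] == "1" else 1
    magnitude = int(bits[pos + 1:pos + 1 + nbits], 2)
    return sign * magnitude, pos + 1 + nbits
```

With the proportions reported above, this layout costs 0.68 * 10 + 0.2675 * 19 + 0.0525 * 27, i.e. about 13.3 bits or 1.66 byte per link, matching the figure in the text.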
4. Conclusion.
We described in this paper a simple and efficient method to encode large sets of URLs and large parts of the Web graph. We gave a way to compute the position of a URL in the sorted list of all the considered URLs, and conversely, which makes it possible to manipulate large data sets in RAM, avoiding disk usage. Our gzipped blocks method makes it possible to store 400 million URLs and the 4.6 billion links between them in 8 GB of memory space. Using this encoding, the conversion between identifiers and URLs takes around 2 ms on our computer, in both directions, and finding all the successors of a given URL takes around 0.5 ms. We can improve the link lookup to around 20 µs by using the second method we proposed, but with an increase of the space requirements.
We therefore obtained results which are comparable to the best results known in the literature, using only standard, free and widely available tools like sort, gzip and bzip. Notice that the good performances of our method rely on the performances of these tools, which have the advantage of being strongly optimized.
Our work can be improved in many directions. We discussed some of them in the paper, for example the use of pieces of files of different sizes (depending on the local redundancy of the URLs list). Another idea is to try to increase the locality and the redundancy of the URLs, for example by reversing the site names. This may reduce the distances between pages of sites which belong to a same sub-domain. There are also many parameters which depend on the priority given to time or space saving, itself depending on the application. However, the optimization of memory requirements makes it possible to store the entire data in RAM, reducing disk access, and is therefore also important to improve computing time. This is why we gave priority to the optimization of space requirements, except when a big improvement in speed can be obtained.
Encoding                                  Average link size   Average lookup time for all the successors
identifiers                               8 bytes             -
gzipped identifiers                       0.83 byte           -
distances                                 4.16 bytes          -
gzipped distances                         1.1 byte            -
gzipped identifiers, blocks of 8 lines    1.61 byte           0.44 ms
gzipped identifiers, blocks of 16 lines   1.36 byte           0.44 ms
gzipped identifiers, blocks of 32 lines   1.24 byte           0.45 ms
gzipped identifiers, blocks of 64 lines   1.20 byte           2.395 ms
gzipped identifiers, blocks of 128 lines  1.21 byte           5.694 ms
gzipped identifiers, blocks of 256 lines  1.26 byte           16.866 ms
short, long links                         1.89 byte           20 µs
short, medium, long links                 1.66 byte           20 µs
short (Huffman), medium, long links       1.54 byte           20 µs
Table 4. The average space needed to store one link, depending on the method used. The first four lines are just here to serve as references, since they imply either a very low compression ratio or very slow elementary operations.