
The Web graph


We have represented the Internet as a graph with autonomous systems or routers at the nodes to show the web of connections between sites, disregarding the huge number of other computers and storage devices that are connected to the network for performing other services, including our own terminals.

As we all know, a common way of organizing the information exchanged through the network is the World Wide Web (or simply the Web), a system of interlinked documents that may reside in any computer or storage device connected to the Internet, often replicated in many copies spread around the network for faster access. While in the next chapter we will explain the mathematical tools that allow us to “browse” and “search” through these documents, we recall here how and why the Web was conceived, how it is represented in graph form, and some basic properties of this graph.

In its early years the Internet was very difficult for non-specialists to use. A drastic change took place when a system originally developed for coordinating scientific projects at CERN, the major European center for nuclear physics, was made available over the Internet in the early 1990s. The Web was born.

The intent of the inventor, the British computer scientist Tim Berners-Lee, was to allow independent actors to conduct their own transactions over the network without centralized control. The well-known concept of hypertext, which we will discuss again later, was adopted for this purpose. All documents are represented in a common format using a special language, HTML (HyperText Markup Language) or its extension XML, that supports links to other documents. These can be reached by clicking on hot spots.

The integration of hypertext with the Internet had spectacular consequences. Web pages, rendered attractively on a computer monitor, allowed one to jump from one page to another by following the links, and to come back, through a unified system of identifiers called URLs (Uniform Resource Locators). Since links are unidirectional, one can point to another page without asking permission, as long as that page is on the Web. In principle it is possible to develop entire Web servers without permission.

It is natural, then, to represent the Web as a directed graph with pages as nodes and links as arcs. This graph is somehow “hosted” by the Internet graph, but bears no resemblance to it. Figure 7.7 repeats the four autonomous systems K, H, E, L of Figure 7.4, with the addition of three new computers (encircled) in K and L connected to the local servers. The three rectangles are Web pages hosted in these computers. Pages 1 and 2 are connected by a directed arc (dashed arrow); however, they reside in two computers of the same autonomous system and are thus invisible to the Internet graph at both the AS and the router level. Pages 3 and 2 are also connected by an arc, but their physical connection in the Internet graph passes through several ASs or routers.
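Concretely, such a graph can be captured with nothing more than two adjacency maps. The following is a minimal sketch (the class and method names are illustrative, not from the book); it records both directions of each arc so that in-degrees, needed later in this chapter, come for free:

```python
from collections import defaultdict

class WebGraph:
    """A directed graph: pages are nodes, hyperlinks are arcs."""
    def __init__(self):
        self.out_links = defaultdict(set)   # page -> pages it points to
        self.in_links = defaultdict(set)    # page -> pages pointing to it

    def add_link(self, src, dst):
        """Record the directed arc src -> dst (a hyperlink on page src)."""
        self.out_links[src].add(dst)
        self.in_links[dst].add(src)

    def in_degree(self, page):
        return len(self.in_links[page])

g = WebGraph()
g.add_link(1, 2)        # the dashed arrow of Figure 7.7
g.add_link(3, 2)        # page 3 also points to page 2
print(g.in_degree(2))   # 2 -- both arcs exist in the Web graph, no matter
                        # where the pages physically reside in the Internet
```

Note that the physical route an arc traverses (within one AS, or across several) never appears in this structure, which is precisely why the Web graph bears no resemblance to the Internet graph hosting it.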

Over the years many articles have been published on the size of the Web graph. Search engines declare the number of pages that they can reach, thus establishing a lower bound on the Web size.13 In a blog post written by Google engineers, a total of one trillion pages reached by 2008 was claimed, but a definitive answer is not obvious, as many (possibly most) pages belong to the so-called deep Web that is not directly addressed by the search engines.

13. The sets of pages reached by different engines overlap partially, but the size of their union is not known. Strictly speaking the lower bound is given by the greatest of the numbers declared.

FIGURE 7.7: Web pages 1, 2, 3 and their links.

While this point will be made clear in Chapter 9, it is sufficient to say now that the Web graph is by far the largest discrete mathematical entity humans have ever directed their attention to.

From what we have seen in Section 7.1, it is not surprising that the Web graph contains a giant component like the one in Figure 7.2. Several experiments show that this is indeed the case, with the subgraphs GI, GS, and GO containing the vast majority of the nodes, and with the other nodes shared among tendrils and small independent components. Recent studies suggest that about two thirds of all Web pages belong to GS. What is perhaps even more interesting is that the GS topology scales: that is, the subgraphs of GS also tend to have the structure of Figure 7.2.
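This structure can be probed empirically starting from any single page. The sketch below, under the assumption that the chosen page v lies in the giant strongly connected component, recovers GS as the set of nodes both reachable from v and able to reach v, with GI and GO falling out as the two differences (graph and rgraph are assumed to be adjacency dictionaries for the arcs and their reversals):

```python
from collections import deque

def reachable(adj, start):
    """All nodes reachable from start by BFS over the adjacency dict adj."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def bow_tie(graph, rgraph, v):
    forward = reachable(graph, v)     # v and everything v can reach
    backward = reachable(rgraph, v)   # v and everything that can reach v
    gs = forward & backward           # the strongly connected component of v
    gi = backward - gs                # GI: reaches GS but is not reached back
    go = forward - gs                 # GO: reached from GS with no way back
    return gi, gs, go
```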

Furthermore, it is generally said that the Web graph, or better its giant component, is a small world, although such a statement must be taken with some caution. It has been observed that, if the direction of the arcs were ignored, the mean distance between any two nodes would be about six, the magic number of Milgram. But this does not mean much. The mean distance in the actual (directed) graph was reported to be approximately sixteen at the beginning of the 2000s, and it grows slowly with the ever increasing number of nodes.
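An exact mean distance would require an all-pairs shortest-path computation, hopeless at Web scale, so in practice the figure is estimated by sampling: run a directed breadth-first search from a few hundred random pages and average the finite distances found. A hedged sketch (the sample size is illustrative; unreachable pairs are simply skipped):

```python
import random
from collections import deque

def mean_distance(graph, samples=100, seed=0):
    """Average directed BFS distance over reachable pairs, sampled sources."""
    rng = random.Random(seed)
    nodes = list(graph)
    total = count = 0
    for src in rng.sample(nodes, min(samples, len(nodes))):
        dist = {src: 0}
        queue = deque([src])
        while queue:                       # standard BFS from src
            u = queue.popleft()
            for w in graph.get(u, ()):
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())        # distances to all reached pages
        count += len(dist) - 1             # excluding the source itself
    return total / count if count else float('inf')
```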

The degree distribution of the Web graph is studied separately for the in-degree and the out-degree, which in principle are independent of one another.

In fact the outgoing links of each node are determined by the needs of the page owner and are rarely changed after the page is created, while the in-degree is completely out of the page owner’s control.14 All experiments have shown that both the in-degree and the out-degree of the Web nodes exhibit a power law distribution with exponent γ between 2 and 3. Of the two degrees the former is much more interesting, since the number of incoming links is a measure of node “popularity.”

14. A certain dependence between in-degrees and out-degrees is due to the presence of “mutual references”: that is, two pages may be deliberately built to point to each other.
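How such an exponent is read off from crawl data: tabulate the empirical degrees and fit γ, for instance with the standard maximum-likelihood estimator for power-law tails. A minimal sketch, assuming in_links maps each page to the set of pages pointing to it (as in the earlier graph sketch) and using an illustrative cutoff k_min:

```python
import math

def in_degrees(in_links):
    """List of in-degrees; in_links maps page -> set of pages linking to it."""
    return [len(srcs) for srcs in in_links.values()]

def estimate_gamma(degrees, k_min=1):
    """Approximate MLE for gamma in p(k) ~ k^(-gamma), over degrees k >= k_min."""
    ks = [k for k in degrees if k >= k_min]
    return 1.0 + len(ks) / sum(math.log(k / (k_min - 0.5)) for k in ks)
```

For Web crawls this estimate falls between 2 and 3 for both degree sequences, in line with the experiments cited above.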

To better understand all this, let us consider empirically how the Web grows: in particular, how a new page appears and is linked to the existing Web. Unless the author of the new page is particularly expert, for example a professional designer of Web sites, more than likely he or she will get inspiration from an existing page. This is easily done thanks to the availability of the source code of most Web pages, which is explicitly made public: even some outgoing links may be copied, if the inspiring and the inspired pages deal with similar business (see below). In any case the major attachment process is preferential, with important or popular pages being pointed to with higher probability. Note also that, for the new page to be found by search engines, at least one link to it must be provided, either from the site of the organization to which the page owner belongs or from some other source, for example a blog.

Several Web growth models have been proposed in mathematical terms, built on the scheme of the preferential and random process 4 presented in Section 6.3 of the previous chapter, extended to account for some real-world features such as adding, and possibly deleting, several links at each step with different probabilities.
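A minimal sketch of one such model, with illustrative parameter values: each new page emits m links, each chosen preferentially (proportionally to current in-degree) with probability p and uniformly at random otherwise; the deletion of links, present in the richer models, is omitted:

```python
import random

def grow(steps, m=3, p=0.7, seed=0):
    """Grow a directed graph by mixed preferential/random attachment."""
    rng = random.Random(seed)
    in_degree = {0: 0}          # seed graph: a single page, no links yet
    links = {0: []}
    for new in range(1, steps + 1):
        nodes = list(in_degree)
        chosen = []
        for _ in range(min(m, len(nodes))):
            if rng.random() < p:
                # preferential step: weight in-degree + 1 so that pages
                # with no incoming links can still be chosen
                weights = [in_degree[v] + 1 for v in nodes]
                chosen.append(rng.choices(nodes, weights=weights)[0])
            else:
                # random step: uniform over all existing pages
                chosen.append(rng.choice(nodes))
        links[new], in_degree[new] = chosen, 0
        for t in chosen:
            in_degree[t] += 1
    return links, in_degree
```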

On practical grounds, a drawback of most preferential linking models is that the choice of links requires knowing the degrees of all nodes of the graph in order to establish the preference. Partly for this reason an interesting copying model has been proposed, aimed at capturing the tendency to imitate existing pages. Figure 7.8 shows the difference between preferential linking and copying.
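The copying rule of Figure 7.8(b) is simple to state in code. A hedged sketch, with an illustrative rewiring probability alpha: the new page picks an existing prototype page (the grey node) and copies each of its links, replacing each one by a uniformly random target with small probability:

```python
import random

def copying_step(links, rng, alpha=0.2):
    """Add one page whose links mostly copy a random prototype's links."""
    nodes = list(links)
    prototype = rng.choice(nodes)         # the grey node of Figure 7.8(b)
    new_links = []
    for t in links[prototype]:
        if rng.random() < alpha:
            new_links.append(rng.choice(nodes))   # occasional random rewiring
        else:
            new_links.append(t)                   # copied link
    links[len(nodes)] = new_links         # the white node joins the graph
    return new_links

rng = random.Random(0)
copying_step({0: [], 1: [0], 2: [0, 1]}, rng)   # toy graph: pages 0, 1, 2
```

Notice that no degree is ever consulted; yet copying implicitly favors pages that many prototypes point to, which is why this rule too produces a power law.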

All models, including the copying one, lead to a power law for the vertex in-degree and out-degree distributions, with exponents close to the experimental values already mentioned. This does not imply that such models capture all the features of the Web’s evolution, but it at least shows that they are not unreasonable. The mathematical analysis is quite complicated, so we refer to the bibliographical notes for that. It is worth noting, however, that the fat tail of the in-degree power law is particularly scattered, because there are many popular pages with very high and different in-degrees (the Yahoo! home page, with hundreds of thousands of incoming links, is often mentioned). So the cumulative distribution must be plotted for clarity, as done for example in Figure 7.6 for the Internet graph.15
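The cumulative plot is easy to produce: compute the complementary cumulative distribution P(K ≥ k), which for a power law p(k) ~ k^(-γ) decays smoothly as k^(-(γ-1)). A minimal sketch:

```python
from collections import Counter

def ccdf(degrees):
    """Sorted (k, P(K >= k)) pairs for a list of degree values."""
    n = len(degrees)
    hist = Counter(degrees)
    points, survivors = [], n
    for k in sorted(hist):
        points.append((k, survivors / n))  # fraction of nodes of degree >= k
        survivors -= hist[k]
    return points
```

Plotted on log-log axes, these points fall on a straight line of slope -(γ - 1) where the raw histogram would show only a scattered cloud in the tail.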

Unlike in the Internet graph, betweenness is not a particularly important parameter for Web search, because users tend to extend search paths for only a few links, relying more often on the direct answers of a search engine than

15. This point is often not sufficiently underlined in the literature.

FIGURE 7.8: A new page appears in the Web as a white node. (a) Effect of preferential linking. (b) Effect of copying the grey node: the new page copies most of the links of the copied page.

on the possible successive hops through a chain of links. Instead, betweenness is related to the presence of particular subgraphs called communities.
