9.3 The anatomy of a search engine

9.3.2 Crawling the Web

The next aspect of search engines to consider is how data are extracted from the Web. This is the role of crawlers, that is, computer programs charged with retrieving as many Web pages as possible.⁵ Due to the size of the network, each engine actually uses a very large number of crawlers that work in parallel and interrogate the Web relentlessly.

First note the essential difference between a browser and a crawler. The former resides in the user’s computer and is designed to retrieve Web pages whose URL is known. The latter resides in a search engine computer and is designed to collect “all” pages available, with a limitation imposed by the engine policy that demands a minimum threshold of presumed significance.

In fact the effort needed for crawling the Web and managing the search engine data structures requires a trade-off between the amount of information that can be made available to the users and the speed of operation of the whole system. Let us consider this aspect first.

The number of Web pages is huge and continuously changing, not only because sites are born and die with incredible frequency, but also because many sites change their contents continuously. While it may be important to visit weather report sites rather frequently, so that Internet users may have an updated forecast for their home town, it may not be that interesting to collect high-atmosphere wind speeds every hour. Even more importantly, Web sites are often organized as whole subgraphs of pages of different “levels,” going down from level to level with the addition of a slash to the page URL to reach more and more specific information (for example, en.wikipedia.org is at the first level, en.wikipedia.org/wiki is one level down, etc.). Pages of very low level are said to constitute the deep Web and are not collected by the crawlers, as they can be reached anyway if the engine returns a page of higher level pointing to them. All the non-deep pages form the indexed Web.
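As a rough illustration (not from the text), a crawler might approximate the “level” of a page by counting the path segments of its URL; the Python sketch below does exactly that. The cutoff of 2 is an arbitrary illustrative value, not a figure given by the book.

    from urllib.parse import urlparse

    def url_level(url):
        # Number of non-empty path segments after the host name.
        # Level 0 here corresponds to the text's "first level":
        # "http://en.wikipedia.org" gives 0, ".../wiki" gives 1, and so on.
        path = urlparse(url).path
        return len([seg for seg in path.split("/") if seg])

    def is_deep(url, max_level=2):
        # Pages below the cutoff are treated as part of the "deep" Web
        # and skipped; the cutoff itself is a policy choice of the engine.
        return url_level(url) > max_level

    print(url_level("http://en.wikipedia.org"))              # 0
    print(url_level("http://en.wikipedia.org/wiki"))          # 1
    print(is_deep("http://en.wikipedia.org/wiki/Main_Page"))  # False with max_level=2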

The basic algorithmic structure of a crawler is indicated in Figure 9.4.

The program makes use of a queue, called QUEUE, and two tables, A and B. By definition the queue keeps its items one above the other, outputs the top item upon request (QUEUE → x, where the variable x takes the value of the output element), and accepts new elements at the bottom (x → QUEUE). The two tables can be implemented at will, provided fast lookup and insertion operations are possible.

⁵ Crawlers are also called spiders or robots, although the first term is more commonly used.

algorithm CRAWLER 1
    starting condition: URL1, ..., URLs are in QUEUE;
    while QUEUE ≠ Ø {
        QUEUE → URL;
        if (URL ∉ A) {
            request TEXT(URL);
            URL → A; TEXT(URL) → B;
            for any link L in TEXT(URL) {
                let L point to URL';
                if (URL' ∉ A) URL' → QUEUE;
            }
        }
    }

FIGURE 9.4: Basic algorithmic structure of a crawler.

As the figure shows, a group of s URLs of potentially important sites is initially loaded into the queue as a seed. The crawler asks for the pages whose URLs are in the queue and, for each one not already present in the tables, stores the URL into A and the page text into B. It then scans the page just found for the links contained in it and, if the URLs they point to have not been encountered yet, loads them into the queue. The algorithm is very simple and goes on until there are no URLs left in the queue (the termination condition while QUEUE ≠ Ø checks for an empty queue). In principle a whole connected subgraph of the Web is visited, and the algorithm may continue forever if sites keep on changing. In practice the story is different.
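The following Python sketch mirrors the structure of CRAWLER 1. The helpers fetch_page and extract_links are hypothetical placeholders for the HTTP request and the link-extraction step, which the figure leaves abstract; a single dictionary plays the role of both tables, its keys acting as A and its values as B.

    from collections import deque

    def crawler1(seed_urls, fetch_page, extract_links):
        queue = deque(seed_urls)          # starting condition: seeds in QUEUE
        pages = {}                        # keys = table A, values = table B
        while queue:                      # while QUEUE ≠ Ø
            url = queue.popleft()         # QUEUE → URL
            if url not in pages:          # if URL ∉ A
                text = fetch_page(url)    # request TEXT(URL)
                pages[url] = text         # URL → A; TEXT(URL) → B
                for linked_url in extract_links(text):   # for any link L in TEXT(URL)
                    if linked_url not in pages:          # if URL' ∉ A
                        queue.append(linked_url)         # URL' → QUEUE
        return pages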

As we have seen, crawlers are not designed to retrieve all Web pages. Moreover, they have to return frequently to a page that gets updated continuously, as happens for example with weather conditions and news, but might visit only once in a while corporate pages that tend to remain stable. Therefore the queue is replaced by a priority queue, where each element has an associated priority (in fact, an integer). This structure keeps the element of highest priority at the top and returns it upon request. The construction of such a queue can be found in any textbook on data structures; we only note that the elements must be stored so that the next top element can be readily identified after the extraction of the current one, hence new entries must be placed in their proper positions, which can be done in short (logarithmic) time. Items with the same priority are kept in arrival order.
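A minimal sketch of such a priority queue in Python, built on the standard heapq module: an arrival counter breaks ties so that items with equal priority come out in arrival order, and both insertion and extraction of the top element take logarithmic time. Since heapq keeps the smallest key on top, priorities are negated.

    import heapq
    from itertools import count

    class PriorityQueue:
        # Highest priority on top; equal priorities are served in arrival order.
        def __init__(self):
            self._heap = []
            self._counter = count()       # arrival stamp used as tie-breaker

        def push(self, item, priority):
            # heapq is a min-heap, so the priority is negated to obtain a max-heap.
            heapq.heappush(self._heap, (-priority, next(self._counter), item))

        def pop(self):
            # Return (item, priority) of the highest-priority element.
            neg_priority, _, item = heapq.heappop(self._heap)
            return item, -neg_priority

        def __bool__(self):
            return bool(self._heap)

    pq = PriorityQueue()
    pq.push("http://example.org/news", 9)
    pq.push("http://example.org/static", 2)
    pq.push("http://example.org/weather", 9)
    print(pq.pop())   # ('http://example.org/news', 9): same priority, arrived first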

A new version of the crawler that exploits URL priorities is given in Figure 9.5. The algorithm is not trivial and is given for the courageous reader to study.

algorithm CRAWLER 2
    starting condition: (URL1, P1), ..., (URLs, Ps) are in PQUEUE;
    while PQUEUE ≠ Ø {
        PQUEUE → (URL, P);
        if (URL ∈ A and P = refresh) {
            request TEXT(URL);
            TEXT(URL) → B;   [replace old occurrence]
            after delay ∆: (URL, P) → PQUEUE;
        }
        if (URL ∉ A) {
            request TEXT(URL);
            URL → A; TEXT(URL) → B;
            for any link L in TEXT(URL) {
                let L point to URL';
                if (URL' ∉ A) {
                    compute P' for URL';
                    if (P' ≥ threshold) (URL', P') → PQUEUE;
                }
            }
        }
    }

FIGURE 9.5: The algorithmic structure of a crawler with priorities.

PQUEUE is the priority queue; the way priority is computed depends on the engine policy. The value “refresh” given to the priority indicates that the page must be read again, so the page is returned to the queue after a given delay ∆. A “threshold” is also specified, below which a new page is not fetched.
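A Python sketch of the same logic, reusing the PriorityQueue sketched earlier. Here compute_priority, fetch_page and extract_links are hypothetical helpers; the “refresh” value is modelled as an infinite priority so that pages to be refreshed surface first; and the delay ∆ is only recorded in a to_refresh list, on the assumption that an outer scheduler would put these pairs back into the queue at the right time.

    import math

    REFRESH = math.inf   # sentinel priority: "read this page again"

    def crawler2(seeds, pq, compute_priority, fetch_page, extract_links, threshold):
        # seeds: iterable of (url, priority) pairs initially loaded into PQUEUE.
        pages = {}            # the two tables merged: keys play A, values play B
        to_refresh = []       # pages to be re-inserted after the delay ∆
                              # (the actual timing is left to an outer scheduler)
        for url, p in seeds:
            pq.push(url, p)
        while pq:                              # while PQUEUE ≠ Ø
            url, p = pq.pop()                  # PQUEUE → (URL, P)
            if url in pages and p == REFRESH:
                pages[url] = fetch_page(url)   # replace the old occurrence of TEXT(URL)
                to_refresh.append((url, p))    # after delay ∆: back into PQUEUE
            if url not in pages:
                text = fetch_page(url)
                pages[url] = text              # URL → A; TEXT(URL) → B
                for new_url in extract_links(text):
                    if new_url not in pages:
                        new_p = compute_priority(new_url)
                        if new_p >= threshold: # low-priority pages are not fetched
                            pq.push(new_url, new_p)
        return pages, to_refresh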

To make our algorithm even closer to reality we must add some further considerations. First, Web site administrators may decide not to allow some of their pages to be fetched by search engines. This is specified in a special file stored at the site (commonly known as robots.txt) that the crawler must interpret in order to decide which pages to take. Furthermore, crawlers should try to avoid loading different copies of the same page that appear at different URLs (as occurs quite often), or pages that are very similar. Hashing whole pages for comparison is a helpful technique for discarding exact duplicates. Finally, crawlers are designed not to pass over the same pages too many times, to avoid overloading a site. All these features must be implemented in the crawler algorithm.
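Two of these refinements can be sketched in a few lines of Python: the standard urllib.robotparser module reads the exclusion file mentioned above, and hashing the full page text detects exact duplicates (near-duplicate detection needs more refined techniques that the text does not cover). The user-agent name below is an arbitrary placeholder.

    import hashlib
    from urllib import robotparser
    from urllib.parse import urlparse

    def allowed_by_robots(url, user_agent="ExampleCrawler"):
        # Fetch and interpret the site's robots.txt file to decide
        # whether this page may be crawled at all.
        parts = urlparse(url)
        rp = robotparser.RobotFileParser()
        rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()
        return rp.can_fetch(user_agent, url)

    seen_fingerprints = set()

    def is_duplicate(text):
        # Hash the whole page: identical copies served at different URLs
        # yield the same fingerprint and can be discarded.
        fingerprint = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if fingerprint in seen_fingerprints:
            return True
        seen_fingerprints.add(fingerprint)
        return False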

A natural question to ask at this point is how one crawler can visit the entire Web in a reasonable amount of time. The answer is that this task is carried out by a large number of crawlers working in a distributed fashion as explained below.
