
Approaches to Content Distribution

One of the first approaches to content distribution was the introduction of client caches, implemented either on host-local disks or as in-memory caches. Although client caches can achieve reasonable hit rates, i.e., the percentage of requests satisfied from the cache, it soon became clear that caches shared by many users with similar interests would yield higher hit rates. The first widely available software providing this functionality was the CERN proxy [60], created in early 1994 at the birthplace of the World Wide Web, the CERN laboratories in Geneva.

This software gained widespread acceptance and served as a reference implementation for many caching proxies¹ developed later on. However, the CERN cache did not provide cooperative caching schemes: it could only be configured to use a single parent cache, and it could not gracefully handle the failure of that parent cache.

HENSA [84] was one of the first and best documented approaches to building a coherent, large caching infrastructure. The HENSA cache was started in late 1993 at the University of Kent, with the goal of providing an efficient national cache infrastructure for the UK. A single, centrally administered cache was created to handle over one million cache requests per day. This centrally administered caching service subsequently faced several barriers inherent to a non-distributed approach. A much more sophisticated caching system, the Harvest cache, was developed in 1995 as a result of the Harvest information discovery project [18].

The Harvest approach contrasts sharply with the HENSA approach by enabling administrators to span a tree of cooperating caches over wide-area distances. The technical improvements implemented in the Harvest cache aim at optimizing nearly every component on the critical path from user request to delivery of the data. The public descendant of the Harvest cache, Squid [88], soon became the most popular publicly available caching software.
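To make the idea of a tree of cooperating caches concrete, the following sketch (in Python, with illustrative class and cache names; it is a conceptual model, not Harvest's or Squid's actual implementation) shows how a child cache can resolve a miss through its parent before falling back to the origin server:

    # Conceptual sketch of hierarchical (parent/child) Web caching.
    # Class, variable, and cache names are illustrative only.

    import urllib.request

    class CacheNode:
        def __init__(self, name, parent=None):
            self.name = name        # e.g. "department-cache" or "national-cache"
            self.parent = parent    # parent cache in the tree, or None at the root
            self.store = {}         # URL -> cached response body

        def fetch(self, url):
            # 1. Local hit: serve the object directly from this cache.
            if url in self.store:
                return self.store[url]
            # 2. Local miss: ask the parent cache, if one is configured.
            if self.parent is not None:
                body = self.parent.fetch(url)
            else:
                # 3. Root of the tree: fetch the object from the origin server.
                with urllib.request.urlopen(url) as resp:
                    body = resp.read()
            # Store the object on the way back down so later requests hit locally.
            self.store[url] = body
            return body

    # A two-level hierarchy: a departmental cache whose parent is a national cache.
    national = CacheNode("national-cache")
    department = CacheNode("department-cache", parent=national)
    # page = department.fetch("http://example.com/")  # miss -> parent -> origin

Objects are stored at every level on the way back down, so a document popular within one community is served from the nearest cache on subsequent requests.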

Additionally, the Harvest cache took caching one step closer to the origin server by providing a mode in which the cache could be run as an accelerator (reverse proxy) for origin servers.

In recent years there has been a great deal of research on improving the performance of Web caches and on providing new, more efficient cache sharing schemes [37, 64, 72, 80, 86].

However, traditional Web caching was designed from a strongly ISP-oriented point of view. Web caches were installed by ISPs to save bandwidth and to postpone upgrades of their access links to other ISPs. In the USA, bandwidth was very cheap and ISPs had little incentive to deploy caches. In Europe, on the other hand, bandwidth was much more expensive and many ISPs rapidly started deploying caches. However, Web caches administered by ISPs apply their own time-to-live heuristics to the documents stored in the cache and may therefore provide clients with stale content. Popular content providers were very concerned that their content was delivered to clients without them having any control over its freshness.
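To illustrate such time-to-live heuristics, the sketch below implements a last-modified-factor rule of the kind commonly used when a response carries no explicit expiry information; the 10% factor and the bounds are illustrative values rather than those of any particular cache:

    # Sketch of a time-to-live heuristic for documents cached without explicit
    # expiry information.  The 10% factor and the bounds are illustrative values.

    import time

    LM_FACTOR = 0.10      # fraction of the document's age granted as freshness lifetime
    MIN_TTL = 60          # never consider a copy fresh for less than one minute
    MAX_TTL = 24 * 3600   # never consider a copy fresh for more than one day

    def heuristic_ttl(last_modified, fetched_at):
        """Seconds for which a cached copy may be served without revalidation."""
        age_at_fetch = max(0.0, fetched_at - last_modified)
        return min(MAX_TTL, max(MIN_TTL, LM_FACTOR * age_at_fetch))

    def is_fresh(last_modified, fetched_at, now=None):
        """True if the cached copy is still considered fresh by the heuristic."""
        now = time.time() if now is None else now
        return (now - fetched_at) < heuristic_ttl(last_modified, fetched_at)

    # Example: a document last modified ten days before it was cached is served
    # from the cache for a full day before the origin server is contacted again.

Because the origin server is not consulted while the cached copy is considered fresh, an update published within that window is invisible to clients, which is exactly the staleness problem that worried content providers.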

¹ In the rest of the thesis we use the terms proxy, cache, and proxy cache interchangeably, since the proxying and caching functions are co-located in a single entity.

This situation led the European Commission to consider banning Web caching in early 1999. In response to the ISPs' control over caching, content providers often misuse or abuse features of the HTTP protocol [41] to make their content uncacheable.
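The sketch below shows the kind of response headers used for this purpose; the directives are standard HTTP/1.1 and HTTP/1.0 cache controls, and their combination here is only an illustration of the practice, not a prescription from [41]:

    # Response headers commonly used by content providers to defeat caching.
    # The directives are standard HTTP/1.1 (and HTTP/1.0) cache controls; their
    # combination here is only an illustration of the practice.

    NO_CACHE_HEADERS = {
        "Cache-Control": "no-store, no-cache, must-revalidate, max-age=0",
        "Pragma": "no-cache",   # understood by HTTP/1.0 caches
        "Expires": "0",         # the response is already expired
    }

    def add_no_cache_headers(response_headers):
        """Merge cache-defeating directives into an outgoing response's headers."""
        response_headers.update(NO_CACHE_HEADERS)
        return response_headers

    # Example: headers = add_no_cache_headers({"Content-Type": "text/html"})

A compliant shared cache receiving such a response either refuses to store it or revalidates it on every request, so each request effectively travels back to the origin server.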

Since many ISPs, especially in the USA, were not deploying caches, and popular content providers were defeating caching where it was deployed, popular sites started experiencing high loads and flash crowds. Thus, some large content providers decided to deploy their own networks of mirror servers/reverse proxy caches that delivered only their content. Since mirror servers are directly administered by content providers, the providers fully control what is delivered to the clients. Given a network of mirror servers, clients should be automatically and transparently redirected to the optimal server. The most popular techniques for redirecting clients to a server are application-level redirection using the HTTP protocol and DNS redirection. Combining efficient redirection techniques with a network of mirror servers provided a very effective solution for many content providers (e.g., CNN, Microsoft, the WWW Consortium). However, building and running such an infrastructure requires a considerable amount of effort at the content provider's site. Content providers would rather outsource the operation of the content distribution infrastructure and focus on content production.
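As a sketch of application-level (HTTP) redirection, the handler below answers every request with a 302 redirect to a mirror chosen by a deliberately simplistic selection function; the mirror hostnames and the selection criterion are hypothetical placeholders:

    # Minimal sketch of application-level (HTTP) redirection to a mirror server.
    # Mirror hostnames and the selection heuristic are hypothetical placeholders.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    MIRRORS = [
        "http://mirror-eu.example.com",
        "http://mirror-us.example.com",
    ]

    def pick_mirror(client_ip):
        """Choose a mirror for this client.  A real deployment would use network
        proximity or server load; here the client address is simply hashed."""
        return MIRRORS[hash(client_ip) % len(MIRRORS)]

    class RedirectHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            target = pick_mirror(self.client_address[0]) + self.path
            self.send_response(302)              # temporary redirect to the mirror
            self.send_header("Location", target)
            self.end_headers()

    if __name__ == "__main__":
        # The redirector never serves content itself; it only points each client
        # at the mirror that should serve it.
        HTTPServer(("", 8080), RedirectHandler).serve_forever()

DNS redirection achieves the same effect one step earlier: the authoritative name server returns the address of a suitable mirror instead of a single fixed server, so the client contacts that mirror directly.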

Recently, several companies have started providing content distribution infrastructures as a service for content providers. Content Distribution Networks (CDNs) such as Akamai [2] started deploying private overlay networks of caches around the world to circumvent much of the public Internet and provide fast and reliable access for content providers' clients. Such a content distribution network can be shared by multiple content providers, thus reducing the cost of running the network. Akamai Technologies, founded by two MIT scientists in August 1998, rapidly established itself as the market leader in content distribution. Akamai has deployed a broad global network for static content and streaming media, with several hundred nodes covering over 40 countries. Since the creation of Akamai, the content distribution market has experienced tremendous growth, which is expected to continue during the coming years, integrating new technologies such as multicast or satellite broadcast.