• Aucun résultat trouvé

Bringing the web to the network edge: large caches and satellite distribution

N/A
N/A
Protected

Academic year: 2022

Partager "Bringing the web to the network edge: large caches and satellite distribution"

Copied!
10
0
0

Texte intégral

(1)

Bringing the Web to the Network Edge:

Large Caches and Satellite Distribution

Pablo Rodriguez Ernst W. Biersack Institut EURECOM

2229, route des Crˆetes.

06904, Sophia Antipolis Cedex, FRANCE

f

rodrigue, erbi

g

@eurecom.fr September 30, 1998

Abstract

The World Wide Web is growing exponentially and al- ready accounts for a big percentage of the traffic in the Internet. Popular Web servers are overloaded, hot docu- ments travel many times across the same congested links, and receivers experience slow response times. Cache hit rates can be significantly increased by having caches cooperate. In this paper we analyze a scenario where caches cooperate via a push-satellite distribution [16] [6].

When a cache fetches a new document, the document is immediately broadcast to all other caches via the satel- lite. Caches connected to the satellite distribution end up containing all documents requested by a huge com- munity of clients. The probability that a client is the first to request a document is very small. An ISP with a large cache connected to a push-satellite distribution can achieve bandwidth savings up to 90 %. Clients with local caches connected to the satellite distribution can allevi- ate the last mile problem and virtually browse the whole Web locally. We evaluate the feasibility of a satellite dis- tribution to local caches and perform a quantitative anal- ysis in terms of hit rate, latency, satellite bandwidth, and storage requirements. We also discuss how to schedule the push operation and present several techniques to fil- ter non-desired documents.

Keywords: WWW, Caching, Push, Satellites

1 Introduction

The growth of the World Wide Web is overloading popular servers and increasing the traffic in the network and the response times to the clients. Web caching is being extensively used in the Internet to alleviate these problems. Web caching works as follows: when a client issues a request, the request is first directed to the cache.

In Proc. ACM/IEEE MobiCom’98 Workshop on Satellite-based Information Services (WOSBIS’98), Dallas

If the document is found in the cache the document is delivered directly to the client. If the document is not hit in the cache, the document is fetched from the origin server and a copy is placed in the cache. Further requests for the same document are satisfied from the cache. Re- quests not satisfied in the cache are called misses. Misses can be classified into:

First-Access: Misses occurring when accessing documents the first time.

Storage Capacity: Misses occurring when access- ing documents previously requested but discarded from the cache to make space for other documents.

Updates: Misses occurring when accessing docu- ments previously requested but already expired.

Uncacheable: Misses occurring when accessing documents that need to be delivered from the fi- nal server (e.g. dynamic documents generated from cgi-bin scripts or fast changing documents)

First-access misses are much higher than any other kind of misses [17] and may account for the20% of all re- quests. In this paper we will focus on how to reduce the first-access misses as well as update misses. One way to reduce the number of misses is by sharing a cache with a large client population. The more users share the cache, the greater the chance that several clients are in- terested in the same document. One popular scheme to make caches cooperate is via a caching hierarchy [8][18].

In a caching hierarchy caches are placed at several lev- els of the network, including caches at the client side, at the metropolitan networks, at the regional networks, and at the national backbones. Caches cooperate at the same level of the caching hierarchy and at different lev- els of the caching hierarchy, sharing all the documents requested by their clients.

(2)

Another way to reduce the number of misses is by pre- populating the cache. Documents are pushed into the cache even if the cache has never fetched them. The idea is to get documents in the cache expecting that clients will likely request them later. Prepopulating caches has two main inconveniences: disk space and network band- width. If the cache keeps many documents that are never requested by its users a lot of disk space is wasted with- out any additional benefit. Disk space may not be such a problem because disk capacity is increasing at rate of 60% per year and large disks are becoming cheaper and cheaper [3][9]. However, if document sizes keep in- creasing and more multimedia documents are incorpo- rated into Web documents, disk capacity may become limited. A more serious problem is network bandwidth, especially when caches are prepopulated via congested terrestrial links.

To scale a prepopulating distribution to many clients a multicast distribution should be used. Nevertheless, a multicast distribution in the Internet is not generally available and still needs further development. An alter- native way to prefetch documents in the Internet is via a Caching hierarchy [15]. Documents are prefetched from one cache level to another cache level resembling a mul- ticast distribution. The main drawback of any prefetch distribution over the Internet is the limited achieved throughput. If many documents are prefetched through congested links and no client requests them the band- width wasted is considerable.

An alternative way to feed caches with documents and therefore reduce the number of first-access misses is via a satellite distribution. A satellite distribution is being ex- tensively deployed and bandwidth is plentiful. Compa- nies willing to offer multicast applications are frustrated with the lack of multicast in the terrestrial network and are using satellite infrastructures which by default sup- port multicast. A satellite distribution has fewer losses and congestion problems than a distribution in the Inter- net. Also, a satellite distribution can reach a very large population with relatively little effort; adding a new ad- ditional client does not increase the cost of transmission.

A satellite-push distribution to Web caches could work as follows. When a cache requests a document, the doc- ument is fetched at the origin server and kept locally.

Additionally a small message is sent to a master distri- bution center that will get the document itself. Then, the master sends the document via satellite broadcast to all the other caches. Any other client that requests the same document, will hit the document in its cache. With this scheme a cache stores the documents requested by a much larger community, approximately the total number of users in all caches connected to the satellite distribu- tion. Therefore, the probability that a single client is the first client asking for a document is very small.

There can be caches connected to the satellite distribu- tion at any level of the network. As more caches join the satellite distribution the aggregation of clients is higher and the system works better. ISPs with large caches con- nected to the satellite distribution can achieve very high hit rates, obtain great bandwidth savings, and addition- ally increase their clients’ satisfaction. Recently sev- eral companies like Sky Cache[16] and isp-sat [6] have started to offer this satellite distribution service.

A situation that deserves a special attention is the case where clients with local caches are also connected to the satellite distribution. A client receives all documents that are broadcasted by the satellite. If these clients do not have very large caches, they need to apply filtering poli- cies to remove some of the documents. Clients benefit from local copies of the Web documents that best match their preferences achieving very high hit rates. They can surf these documents locally avoiding high response times due to congested links and slow modem connec- tions.

In this paper we investigate the feasibility of a satellite-push scheme. We develop analytical models to quantify the achieved hit rate and the bandwidth needed to broadcast all documents through the satellite. We dis- cuss several filtering policies at the master side as well as the at the client side. We describe how to schedule the push operation and compare a satellite-push distribution with an Internet distribution. We also present the eco- nomical implications of this scheme as well as introduce some possible future scenarios.

2 The Model

In this section we describe the main components of the system that will be the base of our analysis.

2.1 Hierarchical Caching in the Internet As shown in Figure 1, the Internet connecting the server and the clients can be modeled as a hierarchy of ISPs, each ISP with its own autonomous administration.

We shall make the reasonable assumption that the Inter- net hierarchy consists of three levels of ISPs: metropoli- tan networks, regional networks, and national back- bones. All of the clients are connected to the metropoli- tan networks; the metropolitan networks are connected to the regional networks; the regional networks are con- nected to the national networks.

Caches are usually placed at the access points between two different networks to reduce the cost of transmitting across a new network. As shown in Figure 1, we make this assumption for all three levels. Every metropolitan network has a metropolitan cache, every regional net- work has a regional cache, and every national network has a national cache. Additionally clients may also have local caches.

(3)

National Cache Regional Cache Metropolitan Cache

Client Cache

00 0 11 1

0000 1111 000000

1111 11 00 11 00

0000 11 1111

Clients National Network

Regional Network Regional Network

Metropolitan Network Metropolitan Network Metropolitan Network Metropolitan Network Path

International Origin Servers

Figure 1: Internet topology

Hierarchical caching works as follows. At the bottom level of the hierarchy, there are the client caches. When a request is not satisfied by the client cache, the request is redirected to the metropolitan cache. At the metropoli- tan level several caches may cooperate to increase the hit rate and distribute the load. If the document is not found at the metropolitan level the request is then forwarded to the regional cache which in turn forwards unsatisfied requests to the national cache. If the document is not found at any cache level, the national cache contacts di- rectly the origin server. When the document is found, either at a cache or at the origin server, it travels down the hierarchy, leaving a copy at each of the intermediate caches.

2.2 Satellite Distribution

Given the model of the Internet described on Figure 1, we now consider the situation where a satellite distribu- tion is in place (Figure 2). The satellite receives doc- uments from a master distribution center and forwards them to all caches. At any level of the network there can be caches connected to the satellite distribution. Caches receive all the documents requested by any client in the Web. After a certain period of time caches with high ca- pacity could contain most of the documents in the Web.

Only expired documents or new documents are not found in the cache. In the case that a cache is down when the satellite is broadcasting a certain document, the cache will miss this document. If the cache comes back and one of its clients requests that specific document the master will re-broadcast the document to all caches via the satel- lite. Caches receive the document twice and the band- width in the satellite is not used efficiently. One possible solution is for the master to check the document consis- tency before re-sending the document. If the document has not changed it is not retransmitted. If the document has changed it is retransmitted. Thus, the satellite sends a document at most once during the period where the doc-

Master

Origin Server

Clients

Internet

Figure 2: Push distribution from the servers to caches.

ument is up-to-date. To keep documents consistent, the master and the caches can use any of the current policies (adaptive TTL, poll every time ) [11].

If a new cache joins the satellite distribution it may take several weeks for the cache to be filled. The cache will only receive new appearing documents and new ver- sions of existing documents but not the rest of the doc- uments that were already pushed. One way to solve this problem could be the following. The master distribution keeps a copy of all documents that were broadcasted and that are still up-to-date. When a new cache joins the sys- tem a copy of these documents can be shipped from the master center to the joining cache.

Satellite distribution of Web documents can be com- plementary to the current caching hierarchy using ICP [19]. If some of the caches in the caching hierarchy are not connected to the satellite distribution, they still can benefit from the documents pushed via the satellite into the neighbor caches. Additionally if a cache receives a request for a previously discarded document, it is very likely that the document is hit in the caching hierarchy before querying the origin server.

2.3 Web Documents

We denote the total number of documents in the WWW in the yearj-th asNj. Letxbe the annual rate at which the number of documents in the WWW is increas- ing.

N

j

=N

j;1 x

DenoteSi, the size in bytes for documenti. Document

iexpires after an elapsed timeTifollowing an exponen- tial distribution with average update intervali.

P(T

i

=t)= 1

i e

;t=

i

(4)

Requests for documentiare Poisson distributed with av- erage request ratei. Let P(Ri = r jT = t) be the probability that the number of requestsRifor document

iin an intervalT =tis equal tor.

P(R

i

=r jT =t)= e

;

i t

(

i t)

r

r !

The total number of requests for allNjdocuments is also Poisson distributed with average request rate

= Nj

X

i=1

i

Recent studies show that the number of requests or all documents in the Web follows a Zipf distribu- tion [5] [20]. Let allNjdocuments in the Web be ranked in order of their popularity where documenti is thei- th most popular document. The percentage of the total request rate that documentiaccounts for is given by:

i

=

i

whereis a parameter that usually takes values between

0:6and0:8[5], andis given by

=( Nj

X

i=1 1

i

)

;1

We assume that each document is requested indepen- dently from other documents, so we are neglecting any source of correlation between requests of different doc- uments. We consider that newly appearing documents will merge with existing documents into a new Zipf dis- tribution. That is, there will be some new appearing doc- uments that will be very popular but there will also be many other new documents that will be requested by few clients.

Let be the percentage of requests that are for non- cacheable documents (fast-changing documents, cgi-bin, etc).

3 Performance Analysis

In this section we present analytical models to calcu- late the hit rate in any cache connected to the satellite distribution, and the bandwidth required at the satellite.

3.1 Hit Rate

The hit rate is the percentage of requests that meet an up-to-date document in the cache. In a push-satellite dis- tribution only the first request for a document will see a document miss. The rest of the requests for an up-to- date document see a document hit. For popular docu- ments most of the requests will see a hit. However, for changing documents that are rarely requested most of the requests see a miss.

We analyze the hit rate in one cache depending on how fast documents change and depending on the number of documents in the Web. We assume that the caches have an infinite capacity and therefore documents are not re- moved from the cache due to storage limitations.

LetHitjbe the hit rate given that there areNj docu- ments in the Web.

Hit

j

= Nj

X

i=1

i

hit

i

(1; ) (1) wherehitiis the hit rate for documenti.

hit

i

= Z

1

0 hit(T

i

=t)P(T

i

=t)dt (2)

hit(T

i

=t) = 1

t

Z

=t

=0 P(R

i

>0jT =)d

where the hit ratehit(Ti =t)in an update intervalT=t is calculated as the probability1=tthat a request for doc- umentiarrives at timeT = , 2[0;t], which is uni- formly distributed, and the probabilityP(Ri > 0jT =

)that there has been at least one request for documenti before timeT =. Evaluatinghit(Ti=t)we obtain

hit

i (T

i

=t)=1+ e

;

i t

;1

i t

(3) Combining equations( 2) and ( 3) we obtain

hit

i

= 1+ 1

i

i l n(

1

1+

i

i )

3.2 Bandwidth

If Web documents would not change and new docu- ments did not appear, after a certain period the satellite would not need to send any more documents and caches would contain all Web documents. However, Web doc- uments change and new documents appear at very high rates. Thus, the satellite needs to keep transmitting all document updates and newly appearing documents. We calculate the necessary bandwidth at the satellite to dis- tribute and to broadcast all new documents depending on how fast new documents are produced and how fast doc- uments change.

The necessary bandwidthBWjto broadcast all docu- ment updates, and new appearing documents during the yearjis given by:

BW

j

= Nj

X

i=1 bw

i (4)

wherebwiis the bandwidth to broadcast updates for doc- umenti.

bw

i

= Z

1

0 bw (T

i

=t)P(T

i

=t)dt (5)

(5)

bw (T

i

=t)= S

i

t

(1;exp(;

i

t)) (6) where(1;exp(;it))is the probability that document

iis requested at least once in the periodt.

4 Numerical Results

In this section we pick some reasonable values for the different parameters in the model to obtain some quanti- tative results.

The number of documentsNj in the Web at the begin- ning of the yearj=1998 is about 250 millions [4]. The Web is increasing at a rate such that the number of docu- ments get duplicated every year (x=2) [4]. We consider that the HTTP traffic in the Internet backbone is equal to1000 document requests per second [3]. This traffic will grow by a factor of2:8per year due to the increas- ing number of clients, the increasing period of time that clients are connected, and the increasing bandwidth of clients’ connections [3].

We consider that there is no relationship between the update period of a document and its popularity [5]. Thus, we assume that all Web documents expire randomly with the same average update periodi = . We take the percentage of all document requests that are for non- cacheable objects as10%, i.e.=0:1[12] [17].

4.1 Caches Capacity

The price of disks is decreasing faster than the price of the network capacity. Additionally, the capacity of the clients’ disks is increasing at a rate of 60% per year [3]

with a baseline of 9 GBytes in 1998 (Figure 3). In a few years it will be easy to find disks with capacities close to TBytes at current prices [9].

1998 1999 2000 2001 2002 2003 2004 2005 50

100 150 200 250 300 350

year

Disk Capacity (GBytes)

Figure 3: Disk Capacity Trend.

With such large storage capacities it will be feasible to store almost the whole Web on several disks at different points in the network. In figure 4 we show the disk ca- pacity needed to store all the Web documents assuming that any Web document is requested at least once in an

update interval (Ri > 0). The values presented are an an upper bound for the storage requirements in a cache.

Taking an average document size ofS = 100Kbytes, andS = 200Kbytes the total number of documents in the Web can be stored in several hundreds of TBytes (fig- ure 4). The storage capacity to store all newly appearing Web documents needs to be doubled every year, follow- ing the rate at which the number of Web documents is increasing.

1998 1998.5 1999 1999.5 2000 2000.5 2001 50

100 150 200 250

year

Disk Capacity (TBytes)

S=100 KB. R

i>0 S=200 KB. R

i>0 S=100 KB. R

i>1 S=200 KB. R

i>1

Figure 4: Disk capacity needed to store all Web docu- ments requested at least once and disk capacity needed to store all Web documents requested more than once.

S=100Kbytes,S=200Kbytes.

Given the long-tail distribution of Web documents, there will many documents that are requested only once or less in an update period. Documents requested only once in an update period are not shared by several clients and thus it is useless to place these kind of documents in all the caches. A satellite distribution can take the decision to broadcast only those documents that are re- quested more than once (Ri > 1) in an update period.

Given this situation, the necessary storage capacity drops to6Tbytes in year 1998 (Figure 4) and increases at rate equal to rate at which the HTTP Internet backbone traffic increases (x2:8per year).

Usually individual clients can not afford to buy disks with enormous capacities as it can be the case of an ISPs.

When the capacity of the disks may not be enough to store all documents in the Web, clients can perform fil- tering policies to remove those documents that do not match their preferences and thus decrease their storage requirements (see Section 5).

4.2 Latency Analysis

If documents are hit at network levels close to the clients, the latency to retrieve a document is small. The higher levels of the network (national and international networks) are usually very congested. Local networks carry less traffic and the response times to transmit a

(6)

document to the clients is small. Given the hierarchical topology of the Internet with very congested top levels and much less congested low levels, we quantify the la- tency as the number of links needed to hit a document.

Lets give a simple example. Assume that a document expires every day and that receives 1000 requests in this period of time. Lets assume that there are 100 metropoli- tan caches that issue requests for that document with equal probability. If no caching hierarchy is in place the first client asking for the document in every cache will travel to the origin server to fetch the document. The re- maining 9 clients in every cache hit the document locally.

If a caching hierarchy is in place the first client out of the 1000 clients fetches the document from the origin server.

As the document travels to the client a copy is placed in the intermediate caches in the path from the origin server to the client. Next requests hit the document at lower levels of the caching hierarchy.

If a satellite distribution is in place, the document is pushed to all the caches in any level of the network af- ter the first client fetches the document. The probability that the 999 remaining clients hit the document at closer caches is much higher than in the hierarchical caching scheme. As far as more caches get connected to the satel- lite distribution, clients travel fewer number of links to hit the document.

To quantify the latency we calculate the average num- ber of links that a client needs to travel to hit a document.

We model the underlying topology of the Internet as a fullO-ary tree (O = 4). The height of the tree, which is the distance between the clients and the origin server, is equal to10 links. We set the distance between the clients and the metropolitan caches to one link. We also set the distance between the metropolitan caches and the regional caches and between the regional caches and the national cache to3links. Client caches are disabled. Due to space limitations we omit a detailed analysis of how to calculate the number of links to hit the document, which can be found in [15].

We will consider three different situations: i) no satel- lite push is available and all documents are distributed through a caching hierarchy, ii) a caching hierarchy and a satellite push are in place with regional caches connected to the satellite distribution, iii) a caching hierarchy and a satellite push are in place with metropolitan caches con- nected to the satellite distribution.

Figure 5 shows the expected number of links traversed by then-th document request arriving in an update pe- riod. In all three situations, the first client asking for a document always needs to travel10links to hit the doc- ument at the origin server. Next clients hit the document at closer levels while the document is up-to-date. If there are many requests for the same document last requests hit the document at the metropolitan cache. If documents

100 102 104 106

0 2 4 6 8 10

n−th request

links

No satellite push Push to regional caches Push to metropolitan caches

Figure 5: Average number of links traversed before a cache hit occurs for then-th request

are pushed via the satellite to regional or metropolitan caches, the average number of links that a single client traverses to hit the document is lower than when a satel- lite distribution is not used. Thus, a satellite distribution to regional or metropolitan caches reduces the latency significantly for the first requests of a document.

4.3 Hit Rate

The higher the number of documents hit in the cache, the higher the bandwidth savings and the lower the la- tencies. In this section we calculate the hit rate achieved by any cache connected to the satellite distribution as a function of average update period, and we compare it with the hit rate achieved by a metropolitan cache which is not connected to the satellite distribution.

Table 1 shows the hit rate for a cache connected to the satellite distribution and for a metropolitan cache not connected to the satellite distribution. In order to calcu- late the hit rate for a metropolitan cache that is not con- nected to the satellite distribution we model the Internet as described in Section 4.2 and assume that document re- quests are uniformly distributed between all metropolitan caches [15]. Using equation (1) for yearj =1998 we get the hit rates for a cache connected to the satellite distri- bution. From table 1 we observe that caches connected

1 hour 1 day 10 days

Hit rate Satellite 87 % 88 % 90 % Hit rate no Satellite 10 % 33 % 53 %

Table 1: Hit rate for a cache connected to the satellite distribution and for a cache not connected to the satellite distribution. 10% of uncacheable objects

(7)

to the satellite distribution can greatly increase their hit rates from10%;50%to values close to90%. The hit rate for caches connected to the satellite distribution does not significantly vary when the update period of the Web documents changes. Additionally we have observed that the hit rate does not significantly change as new docu- ments appear. Thus, due to the high hit rates achieved with a satellite distribution ISPs save a lot of bandwidth and documents can be fetched from lower level caches of the network at lower latencies.

4.4 Bandwidth

As we discussed in Section 3.2, the satellite needs to keep sending all document updates and newly appear- ing documents in the Web. To calculate the bandwidth needed to broadcast new document updates we need to consider that a Web document is built with several Web objects (images, java applets, text...). When a Web doc- ument is updated not all its objects get modified if not only some of them.

1998 1998.5 1999 1999.5 2000 2000.5 2001 50

100 150 200 250 300

year

BW (Mbps)

=10 days. R

i>0

=20 days. R

i>0

=10 days. R

i>1

=20 days. R

i>1

Figure 6: Required bandwidth at the satellite to trans- mit new documents and document updates: i) all docu- ments requested at least once in an update interval are transmitted across the satellite, ii) only documents re- quested more than once in an update interval are trans- mitted across the satellite. 20% of objects in one docu- ment are updated

Using equation 4 we plot in Figure 6 the necessary bandwidth at the satellite to keep broadcasting new ap- pearing documents and document updates that are re- quested at least once in an update interval (Ri>0). We take the size of a Web document as S = 100Kbytes and consider that only 20%of the document is modi- fied every update period. If 40% of the document is updated every update period, the needed bandwidth gets duplicated. From Figure 6 we can observe that the band- width needed to send documents through the satellite is around30MBps for year 1998 and that keeps increas- ing with a rate close to 1.8 per year. The rate at which

the bandwidth increases depends on the rate at which the HTTP Internet backbone traffic increases and on the rate at which new documents appear in the Web. If all appear- ing documents were requested at least once in an update period, the bandwidth would increase at the same rate that Web documents appear in the Web (x2per year).

However, we can observe that even if the number of Web documents gets duplicated every year, the bandwidth re- quirements increase at a rate of1:8the first years. This is due to the fact that many of the appearing documents will be non-popular documents which receive less than one request in an update interval and therefore are rarely transmitted through the satellite. As the HTTP traffic in- creases, the bandwidth needs to be increased by a factor close to2per year because the number of documents that are not requested in an update interval is very small.

As discussed in section 4.1 there is a large group of documents that are only requested one time or less in an update interval. The satellite distribution can take the decision to broadcast only documents that are requested more than once in one update interval and thus reduce the bandwidth used in the satellite. From Figure 6 we can see the bandwidth required when the satellite only transmits documents requested more than once (Ri >1). For an update period of=10days, we see that sending only those documents requested more than once saves about

30%of the bandwidth at the satellite. When the update interval of the document increases, there are fewer docu- ments that only receive one request in that interval. Thus, the bandwidth savings decrease when the update interval increases.

It is clear that transmitting only those documents that are requested multiple times in an update interval gives good bandwidth savings. The counter part of sending less documents through the satellite is that the hit rate gets reduced. However, we have quantified the reduction in the hit rate and we have seen that the hit rate is reduced by less than two percent.

A satellite distribution of an analog television channel uses a bandwidth of27Mbps. Therefore with the capac- ity needed to broadcast one TV channel a push-satellite distribution can cope with almost all Web documents in the year1998. For the master distribution center to send at a rate of30Mbps towards the satellite, it needs to re- ceive documents at the same rate from the Internet. Thus, the master needs an access link of at least30Mbps.

5 Filtering mechanisms

The filtering mechanisms that we present in this sec- tion are aimed at reducing the number of documents that are broadcasted via the satellite or kept in the cache. A good filtering mechanism is such that when some docu- ments are discarded the performance of the system (hit rate, latency to the clients) is not reduced. In previous

(8)

sections we have already introduced a simple filtering mechanism which avoids sending documents that are re- quested once or less in an update interval.

Both, the master and the cache need to perform some filtering mechanism. The master distribution center may need to filter some documents due to limitations in the bandwidth at the satellite. The cache may need to filter some documents due to space constraints.

5.1 Cache Side

Even if a cache might have enough space to store all documents in the Web, users of a given cache are not in- terested in all Web documents. For example, most caches in Europe could safely filter all documents written in an Asian language. The cache can apply a certain filtering policy to remove the documents that do not match the clients preferences and thus free some space for other multimedia applications.

On the other hand, if document sizes start increasing and the cache size becomes limited, the cache needs to decide which documents to store locally. The cache can apply a number of policies to reduce the amount of doc- uments kept without significantly reducing the hit rate.

First, the cache can apply a broad filtering, i.e. remove all documents *.jp but not *.edu. All expired documents could also be purged. Second, the cache can apply a purge routine that purges documents that do not match a long-list of key words. If after this filtering, the cache still needs to purge more documents the following ac- tions can be taken:

Remove documents from sites that no local user has ever visited

Remove documents from the bottom of document trees.

Re-encode images and videos [10]

5.2 Master Side

The master needs to discard some documents when the transmission rate at the satellite is insufficient. First, the master can apply a very simple policy and send only those documents requested more than once. If a hit me- tering mechanism is in place [13], documents can be shipped to the master with an indication of how popu- lar the document is. Thus, the master can better decide which documents to send. Additionally the master can also re-encode images and videos to reduce the amount of information that is transmitted through the satellite.

Content providers using a satellite distribution will have high bandwidth savings in their Internet access links and lower server’s load. Thus, another way for the mas- ter to reduce the number of documents distributed via the satellite is by charging content providers. Only the doc- uments from those content providers that are subscribed

with the master distribution center will be transmitted to the satellite.

6 Economical Implications and Future Scenarios

Satellite distribution of Web documents has important economic implications for both content providers and ISPs.

The content provider’s Web servers will see fewer re- quests, since many more requests are now serviced by the caches. Thus, content provider’s Web server can signifi- cantly reduce the rate of its access link connection to the Internet and the capacity of its server. A content provider can provide high-speed access to its site at low cost.

A satellite distribution is filling ISP caches with many documents almost for free. Having such a high number of documents increases the hit rates in the caches. Most of the requests get hit in the ISP cache saving expensive bandwidth in the terrestrial links and avoiding the over- head of communicating with neighboring caches. The master distribution center could also start charging ISPs connected to the satellite distribution because ISPs are increasing considerably their hit rates and saving band- width.

If clients with local caches are also connected to the satellite distribution they do not suffer the last mile prob- lem for the Web traffic. Web documents come through the satellite and can be browsed locally. The network is rarely used by the clients. The network is used only to check document’s consistency or to access non Web documents. Due to the high number of documents kept locally by the client local search engines should be de- veloped from to benefit from the possibility to hit docu- ments locally [1] [2].

Transmitting documents through the satellite to local caches is a distribution scheme that will become even more relevant when the Web and the TV get integrated.

The Television is currently the most important mass me- dia. The Web is growing exponentially and becoming another important mass media. The arrival of the digital television will trigger the development of television and Web integration. A satellite distribution can be used to push Web documents related with a certain TV program into clients caches and into a caching hierarchy [14]. As the pushed documents are very related with the actual TV program (i.e. additional weather maps on the weather re- port, information about the players in a football match..), the probability that clients request the proposed pushed- documents is higher. Additionally clients know that the pushed documents can be browsed locally with minimum delay and without using the Internet, which makes these documents even more attractive. The pushed documents come as a value-added service to the TV information.

(9)

7 Conclusions

Caching is being extensively deployed in the Inter- net to alleviate the problems related to the exponential growth of the Web. However, caching has a limited per- formance due to the high number of requests not hit in the cache. Requests not found in a cache are requests for documents that no one has fetched before in that cache and for uncacheable objects. Uncacheable objects may become cacheable if some intelligence is place in the caches to cope with dynamic content (rotating banners for advertisements, cookies, etc) [7]. To reduce the num- ber of requests for documents that no one has fetch be- fore a high community of clients needs to be shared.

A satellite distribution of Web documents into large caches can reduce the number of requests that are first requests for a certain document almost to zero. The satellite fills the caches with all documents requested by a large community of clients. Documents get into the caches almost for free and significantly alleviate the con- gested Internet links. In this paper we have presented analytical models and quantitative results showing that a satellite distribution of Web documents into caches is feasible. The satellite distribution would need a capac- ity equivalent to the bandwidth of a TV satellite channel to transmit all appearing Web documents and document updates during year1998. The hit rates in the caches get close to90%.

When the satellite capacity or the cache storage ca- pacity is limited some documents must be discarded. We have presented several filtering mechanisms to cope with the lack of bandwidth in the satellite and the storage lim- itations in the caches. Additionally we have discussed the economical implications of a satellite distribution and presented some future scenarios. We are currently per- forming trace-driven simulations to better confirm the analytical results presented in this paper.

8 Acknowledgments

The support of P. Rodriguez by the European Com- mission in form of a TMR (Training and Mobility for Researchers in Europe) fellowship (ERBFMBICT97263 9) is gratefully acknowledged. We would like to thank Keith W. Ross for his help and the fruitful discussions.

Eurecom’s research is partially supported by its indus- trial partners: Ascom, Cegetel, France Telecom, Hitachi, IBM France, Motorola, Swisscom, Texas Instruments, and Thomson CSF.

References

[1] “The Microsoft White Paper about Webcasting IE 4.0”.

[2] “The Netscape Netcaster.”.

[3] E. A.Brewer, P. Gauthier, and D. McEvoy, “The Long-Term Viability of Large-Scale Caching”,

In 3rd International WWW Caching Workshop, Manchester, UK, June 1998.

[4] K. Bharat and A. Broder, “A Technique for Mea- suring the Relative Size and Overlap of Public Web Search Engines”, In Seventh International WWW Conference, Brisbane Australia, April 1997.

[5] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker, “On the Implications of Zipf’s Law for Web Caching”, In 3rd International WWW Caching Workshop, June 1998.

[6] Broadcast Satellite Services, “http://www.isp- sat.com”.

[7] P. Cao, J. Zhang, and K. Beach, “Active Cache:

Caching Dynamic Contents on the Web”, , Univer- sity of Wisconsin-Madison, 1998.

[8] A. Chankhunthod et al., “A Hierarchical Internet Object Cache”, In Proc. 1996 USENIX Technical Conference, San Diego, CA, January 1996.

[9] M. L. Gullickson, A. L. Chervenak, and E. W. Ze- gura, “Using experience to Guide Web Server Se- lection”, , Georgia Institute of Technology, 1998.

[10] J. Kangasharju, Y. G. Kwon, and A. Ortega, “De- sign and Implementation of a Soft Caching Proxy”, In 3rd International WWW Caching Workshop, Manchester, UK, June 1998.

[11] C. Liu and P. Cao, “Maintaining Strong Cache Con- sistency in the World-Wide Web”, In Proceedings of ICDCS, may 1997.

[12] S. Manley and M. Seltzer, “Web Facts and Fan- tasy”, In Proceedings of the USENIX Symposium on Internet Technologies and Systems, December 1997.

[13] J. Mogul and P. J. Leach, “Hit-Metering and Usage- Limiting for HTTP”, Internet Draft: draft-ietf-http- hit-metering-02. Work in Progress., Internet Engi- neering Task Force, October 1997.

[14] P. Rodriguez, J. Gafsi, and J. Nonnenmacher, “A more Attractive and Interactive TV”, In W3C Work- shop on Television and the Web, Sophia Antipolis, France, July 1998.

[15] P. Rodriguez, K. W.Ross, and E. W.Biersack, “Dis- tributing Frequently-Changing Documents in the Web: Multicasting or Hierarchical Caching”, To appear in Computer Networks and ISDN systems, 1998.

[16] SkyCache, “http://www.skycache.com”.

(10)

[17] R. Tewari, M. Dahlin, H. M. Vin, and J. S. Kay,

“Beyond Hierarchies: Design Considerations for Distributed Caching on the Internet”, , The Uni- versity of Texas at Austin, Feb 1998.

[18] D. Wessels, “Squid Internet Object Cache:

http://www.nlanr.net/Squid/”, 1996.

[19] D. Wessels and K. Claffy, “Application of In- ternet Cache Protocol (ICP), version 2”, In- ternet Draft:draft-wessels-icp-v2-appl-00. Work in Progress., Internet Engineering Task Force, May 1997.

[20] G. K. Zipf, Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology, Addison-Wesley, Reading, MA, 1949.

Références

Documents relatifs

An obvious embellishment to the simple strategy of setting tthreshold to an estimate of the information's mean or median lifetime would be to let the threshold value

“If no one turns around when we enter, doesn’t answer when we speak, doesn’t pay attention to what we do, if everyone we meet acts with us as if we didn’t exist, we would soon

The multimodal archi- tecture enhances the conventional backhaul by overlaying wide satellite beams and narrower broadcasting cells (e.g. HPHT), which are capable of P2MP (point

In this paper, we have investigated the performance of full- duplex enabled mobile edge caching systems via the delivery time metric. The considered system is analysed under two

the conventional backhaul by overlaying wide satellite beams and narrower broadcasting cells (e.g. HPHT), which are capable of P2MP (point to multi-point) multi/broadcasting. Studies

FEF (1) and LIP (2) population responses in trials in which the first stream was presented within the receptive field (black, mean ⫾ SE) or opposite (gray), aligned on the first

When the number of cooperating caches increases, the connection time decreases up to a minimum.. due to the fact that the probability to hit a document at neighbor caches which

For instance, using the previous day of logs from AYE, the additional hit rate increases by 12:0% , AYE’s cache needs additionally 8:9 Gbytes of disk space, and the database has