
Selective-rank model for a replication system

Data replication in grid environments

3.4 Selective-rank model for a replication system

In a typical data grid, replica optimization is as important as efficient job scheduling for maximizing overall job throughput [Cameron et al., 2004c]. Grid users are generally concerned with the execution time of their submitted jobs and the correctness of the results: they want to minimize the total execution time of their jobs and to increase the reliability of the execution results. Replicating data files at different sites to achieve the highest possible data availability is an important mechanism for reducing data access time and hence total job execution time. Therefore, we focus on how to replicate a set of files so as to optimize file availability.

In this section we introduce the notation and terminology used to formulate the replication problem. First, we describe the assumptions and system metrics on which our model is built; Table 3.1 summarizes the symbols used. Second, we show how to estimate the availability of files. Finally, we formulate the replication problem based on two main factors: file availability and the file's selective rank.

Parameter           Meaning

$M$                 total number of SEs in the system
$N$                 total number of files in the system
$\alpha_i$          availability of SE $i$
$\rho_i$            availability of a replica of file $i$
$P_i$               availability of file $i$
$P(J_i)$            availability of the files for job $i$
$P_{overall}$       overall level of data availability of the system
$s_i$               size of SE $i$
$R_i$               replica set of file $i$
$nr_i$              number of replicas of file $i$, $nr_i = |R_i|$
$R = [r_{i,j}]$     matrix of replica placement
$V_i$               popularity of file $i$
$C(f_i, S_k)$       correlation of file $i$ on site $k$
$D(f_i, S_k)$       average distance of file $i$ on site $k$ to all replicas of other files in the same file set
$sc_i$              storage cost of file $i$

Table 3.1: Parameters and their meanings.

3.4.1 Model assumptions

In our model, we assume: (1) a replication system with a fixed population $S$ of sites connected by a communication network, where the number of sites is $M = |S|$; (2) a set of $N$ files in the whole system; (3) each SE $i$ is described by two parameters, its storage capacity $s_i$ and its online availability $\alpha_i$, where $i \in [1, M]$.

Storage capacity: each SE $i$ is assumed to offer a limited amount of storage space for replication purposes, denoted by $s_i$.

Online availability: the online availability $\alpha_i \in [0, 1]$ represents the expected proportion of time that SE $i$ is online. When an SE is available, all the replicas it stores are assumed to be available and accessible to other sites in the system.

$$\alpha_i = \frac{MTBF_i}{MTBF_i + MTF_i}$$

Here, $MTBF_i$ represents the "mean time between failure" and $MTF_i$ represents the "mean time of failure" of SE $i$.
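The availability ratio above can be illustrated with a short Python sketch. The function name and the sample figures (in hours) are hypothetical and are not taken from the text.

```python
def site_availability(mtbf: float, mtf: float) -> float:
    """Online availability alpha_i = MTBF_i / (MTBF_i + MTF_i).

    mtbf: mean time between failure of the SE (any time unit)
    mtf:  mean time of failure of the SE (same time unit)
    """
    if mtbf < 0 or mtf < 0 or (mtbf + mtf) == 0:
        raise ValueError("mtbf and mtf must be non-negative and not both zero")
    return mtbf / (mtbf + mtf)


# Hypothetical SE: 950 hours between failures, 50 hours per failure.
alpha = site_availability(mtbf=950.0, mtf=50.0)
print(f"alpha = {alpha:.2f}")
```

The result always lies in $[0, 1]$, and it approaches 1 as the mean time of failure becomes small relative to the mean time between failure.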

3.4.2 Estimating the availability of files

Since the replicas stored on an available site are themselves assumed to be available and accessible to other sites in the system, the availability $\rho_i \in [0, 1]$ of a replica of file $i$ on site $j$ is equal to the availability $\alpha_j$ of site $j$. A file is therefore available when at least one of its copies (the file itself or any of its replicas) is online. We can thus estimate the availability $P_i$ of file $i$ according to the binomial distribution (see footnote 1), under the assumption that each replica of file $i$ has availability $\rho_i$, as:

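$$P_i(h, nr, \{\rho_i\}) = \sum_{h=1}^{nr} \binom{nr}{h} \rho^h (1-\rho)^{nr-h} \qquad (3.3)$$

where $nr$ represents the number of replicas of file $i$ in the system and $\rho$ refers to the average availability over the probability set $\{\rho_i\}$:

$$\rho = \frac{1}{nr} \sum_{i=1}^{nr} \rho_i \qquad (3.4)$$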
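As a minimal sketch, equations (3.3) and (3.4) can be evaluated in Python as follows. The function names and sample availabilities are hypothetical. The second function relies on the fact that the binomial terms for $h = 0, \dots, nr$ sum to 1, so the sum from $h = 1$ equals $1 - (1-\rho)^{nr}$.

```python
from math import comb
from typing import Sequence


def mean_replica_availability(rhos: Sequence[float]) -> float:
    """Equation (3.4): average availability over the replicas of one file."""
    return sum(rhos) / len(rhos)


def file_availability(rhos: Sequence[float]) -> float:
    """Equation (3.3): probability that at least one of nr replicas is online."""
    nr = len(rhos)
    if nr == 0:
        return 0.0  # assumption of this sketch: no replica means unavailable
    rho = mean_replica_availability(rhos)
    return sum(comb(nr, h) * rho**h * (1 - rho) ** (nr - h)
               for h in range(1, nr + 1))


def file_availability_closed_form(rhos: Sequence[float]) -> float:
    """Same quantity via 1 - (1 - rho)^nr (complement of the h = 0 term)."""
    nr = len(rhos)
    if nr == 0:
        return 0.0
    return 1.0 - (1.0 - mean_replica_availability(rhos)) ** nr


# Hypothetical file with three replicas on sites of availability 0.9, 0.8, 0.7.
rhos = [0.9, 0.8, 0.7]
print(file_availability(rhos), file_availability_closed_form(rhos))
```

Both functions should agree up to floating-point rounding; the closed form is cheaper when $nr$ is large.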

Suppose that a job $i$ requires access to a file set $F_i$. The availability $P(J_i)$ of the files required for the execution of job $i$ can be computed as:

$$P(J_i) = \prod_{j=1}^{k} P_j, \quad \forall k \in [1, N] \qquad (3.5)$$

where $k = |F_i|$. This approximation estimates the probability that all $k$ files are available at the moment the job is executed.
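Reading equation (3.5) as a product over the $k$ files of the job, which treats the file availabilities as independent, a brief Python sketch could look like this. The function name and the sample values are hypothetical.

```python
from math import prod
from typing import Sequence


def job_availability(file_availabilities: Sequence[float]) -> float:
    """Equation (3.5), read as a product: probability that all k files
    required by a job are available at the same time, assuming the
    individual file availabilities P_j are independent."""
    return prod(file_availabilities)


# Hypothetical job needing three files with availabilities P_1, P_2, P_3.
p_job = job_availability([0.992, 0.99, 0.95])
print(f"P(J) = {p_job:.4f}")
```

Because every factor is at most 1, the job-level availability can never exceed that of its least available file, so jobs with large file sets benefit most from replication.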

3.4.3 Problem definition

The best system data availability results from maximizing equations (3.3) and (3.5). For this purpose, we define the replication matrix $R = [r_{i,j}]_{M \times N}$, where $r_{i,j}$ indicates whether a replica of file $j$ is assigned to site $i$:

$$r_{i,j} = \begin{cases} 1, & \text{if site } i \text{ stores a replica of file } j; \\ 0, & \text{otherwise;} \end{cases}$$

where $i \in [1, M]$ and $j \in [1, N]$. Let $r_j$ refer to the $j$th column of the replica placement matrix $R$, which denotes the subset of sites where file $j$ is replicated. Let $nr_j$ denote the number of replicas of file $j$ stored in the system:

$$nr_j = \sum_{i=1}^{M} r_{i,j}, \quad \forall j \qquad (3.6)$$

Footnote 1. The binomial distribution returns the probability $p(k, n, q)$ of obtaining $k$ successes when performing $n$ independent trials of a certain test, where the probability of success of each single trial is $q$. The binomial distribution is given by the formula:

$$p(k, n, q) = \binom{n}{k} q^k (1-q)^{n-k}, \quad 0 \le k \le n$$

Obviously, the total size of all the replicas stored at site $i$ should not exceed its storage capacity $s_i$:

$$\sum_{j=1}^{N} r_{i,j} \, \mathrm{size}(j) \le s_i, \quad \forall i \in [1, M] \qquad (3.7)$$

where $\mathrm{size}(j)$ is the size of replica $j$. Suppose that each file $j$ is associated with a specific rank $q_j$, which indicates its degree of importance. Applying equation (3.3) and taking the file rank into consideration, we define the overall file availability of the system as follows:

$$P_{overall}(R) = \frac{\sum_{j=1}^{N} q_j \, P_j(h, nr_j, \{\rho_j\})}{\sum_{j=1}^{N} q_j} \qquad (3.8)$$

where $P_j(h, nr_j, \{\rho_j\})$ is the availability of file $j$, $nr_j$ denotes the number of replicas of file $j$ and can be calculated by equation (3.6), and $\{\rho_j\}$ denotes the availabilities of the sites on which file $j$ is replicated (i.e., those with $r_{i,j} = 1$).
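To make equations (3.6) to (3.8) concrete, the following Python sketch evaluates a small placement matrix. All function names and sample values are hypothetical. The file availability uses the closed form $1 - (1-\rho)^{nr}$, which equals the binomial sum of equation (3.3).

```python
from typing import List, Sequence

Matrix = List[List[int]]  # R[i][j] = 1 if site i stores a replica of file j


def replica_counts(R: Matrix) -> List[int]:
    """Equation (3.6): nr_j is the sum of column j of R."""
    return [sum(row[j] for row in R) for j in range(len(R[0]))]


def satisfies_storage(R: Matrix, sizes: Sequence[float],
                      capacities: Sequence[float]) -> bool:
    """Equation (3.7): replicas stored at each site fit its capacity s_i."""
    return all(sum(r * s for r, s in zip(row, sizes)) <= cap
               for row, cap in zip(R, capacities))


def file_availability(rhos: Sequence[float]) -> float:
    """Equation (3.3) in closed form, with rho the mean of rhos (3.4)."""
    if not rhos:
        return 0.0  # assumption of this sketch: no replica means unavailable
    rho = sum(rhos) / len(rhos)
    return 1.0 - (1.0 - rho) ** len(rhos)


def overall_availability(R: Matrix, alphas: Sequence[float],
                         ranks: Sequence[float]) -> float:
    """Equation (3.8): rank-weighted mean of the file availabilities."""
    weighted = 0.0
    for j, q in enumerate(ranks):
        # {rho_j}: availabilities of the sites holding a replica of file j.
        rhos = [alphas[i] for i in range(len(R)) if R[i][j] == 1]
        weighted += q * file_availability(rhos)
    return weighted / sum(ranks)


# Hypothetical system: M = 2 sites, N = 2 files.
R = [[1, 1],
     [1, 0]]
alphas = [0.9, 0.8]      # site availabilities alpha_i
ranks = [2.0, 1.0]       # file ranks q_j
sizes = [10.0, 20.0]     # size(j)
capacities = [30.0, 10.0]

print(replica_counts(R))
print(satisfies_storage(R, sizes, capacities))
print(overall_availability(R, alphas, ranks))
```

In this example file 0 has two replicas and the higher rank, so it contributes most of the weighted sum in the numerator of equation (3.8).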

Given the above equation, we can formulate the data replication problem as finding the assignment of $r_{i,j}$ values in the matrix $R$ that maximizes the availability of files in the system. Our objective in designing a replication algorithm is therefore to maximize the overall system-level availability $P_{overall}$ in equation (3.8), subject to the storage constraint in equation (3.7). With such a replication algorithm, highly ranked files receive higher availability. Because each file replica has a storage cost, the optimal assignment of replicas to the appropriate sites is a typical knapsack problem.
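The text does not specify a solution method at this point. Purely as an illustration of the optimization problem, the sketch below uses a simple greedy heuristic: it repeatedly adds the replica that gives the largest gain in $P_{overall}$ per unit of storage while respecting constraint (3.7). This is an assumed heuristic, not the author's algorithm, and it does not guarantee an optimal placement. All names and sample values are hypothetical, and file sizes are assumed to be positive.

```python
from typing import List, Sequence


def file_availability(rhos: Sequence[float]) -> float:
    """Equation (3.3) in closed form: 1 - (1 - mean(rhos))^nr."""
    if not rhos:
        return 0.0
    rho = sum(rhos) / len(rhos)
    return 1.0 - (1.0 - rho) ** len(rhos)


def overall_availability(R: List[List[int]], alphas: Sequence[float],
                         ranks: Sequence[float]) -> float:
    """Equation (3.8): rank-weighted mean of the file availabilities."""
    total = 0.0
    for j, q in enumerate(ranks):
        rhos = [alphas[i] for i in range(len(R)) if R[i][j] == 1]
        total += q * file_availability(rhos)
    return total / sum(ranks)


def greedy_placement(alphas: Sequence[float], capacities: Sequence[float],
                     sizes: Sequence[float],
                     ranks: Sequence[float]) -> List[List[int]]:
    """Greedy heuristic (illustrative only): add one replica at a time,
    choosing the best availability gain per unit of storage."""
    M, N = len(alphas), len(sizes)
    R = [[0] * N for _ in range(M)]
    free = list(capacities)  # remaining storage at each site
    while True:
        base = overall_availability(R, alphas, ranks)
        best = None  # (gain per storage unit, site i, file j)
        for i in range(M):
            for j in range(N):
                if R[i][j] == 1 or sizes[j] > free[i]:
                    continue  # already placed, or violates (3.7)
                R[i][j] = 1  # tentatively place the replica
                gain = overall_availability(R, alphas, ranks) - base
                R[i][j] = 0  # undo the tentative placement
                score = gain / sizes[j]
                if gain > 0 and (best is None or score > best[0]):
                    best = (score, i, j)
        if best is None:
            return R  # no feasible placement improves P_overall
        _, i, j = best
        R[i][j] = 1
        free[i] -= sizes[j]


# Hypothetical system: two sites that each fit exactly one file.
placement = greedy_placement(alphas=[0.9, 0.8], capacities=[10.0, 10.0],
                             sizes=[10.0, 10.0], ranks=[3.0, 1.0])
print(placement)
```

In this small example the higher-ranked file is placed on the more available site first, which matches the intent that highly ranked files receive higher availability. An exact solution would instead require a knapsack-style search over all feasible matrices $R$.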

82 Fundamentals of Grid Computing