Selective-rank model for a replication system

Data replication in grid environments

3.4 Selective-rank model for a replication system

In a typical data grid, replica optimization is as important as efficient job scheduling for maximizing overall job throughput [Cameron et al., 2004c]. Generally, grid users are concerned with the execution time of their submitted jobs and the correctness of the results. Concretely, they are interested in minimizing the total execution time of their jobs and increasing the reliability of execution results. Replicating data files at different sites to achieve data availability as high as possible is an important mechanism for reducing data access time and hence the total job execution time. Therefore, we focus on how to replicate a set of files so as to optimize file availability.

In this section we introduce the notation and terminology used to formulate the replication problem. We first describe the assumptions and system metrics on which our model is built; Table 3.1 summarizes the symbols used in the model. Second, we show how to estimate the availability of files. Finally, we formulate the replication problem based on two main factors: file availability and the file's selective rank.

Parameter       Meaning
--------------  ------------------------------------------------------------
M               total number of SEs in the system
N               total number of files in the system
α_i             availability of SE i
ρ_i             availability of a replica of file i
P_i             availability of file i
P(J_i)          availability of the files for job i
P_overall       overall data availability level of the system
s_i             size (storage capacity) of SE i
R_i             replica set of file i
nr_i            number of replicas of file i, nr_i = |R_i|
R = [r_i,j]     matrix of replica placement
V_i             popularity of file i
C(f_i, S_k)     correlation of file i on site k
D(f_i, S_k)     average distance of file i on site k to all replicas of
                other files in the same file set
sc_i            storage cost of file i

Table 3.1: Parameters and their meanings.

3.4.1 Model assumptions

In our model, we assume: (1) a replication system with a fixed population S of sites connected by a communication network, where the number of sites is M = |S|; (2) a set of N files in the whole system; (3) each SE i is described by two parameters: storage capacity s_i and online availability α_i, where i ∈ [1, M].

Storage capacity: each SE i is assumed to offer a limited amount of storage space for replication purposes, denoted by s_i.

Online availability: the online availability α_i ∈ [0,1] represents the expected probability that SE i is online at a given time. When an SE is available, all the replicas it stores are assumed to be available and accessible to other sites in the system.

$$\alpha_i = \frac{MTBF_i}{MTBF_i + MTF_i}$$

Here, MTBF_i represents the “mean time between failure” and MTF_i represents the “mean time of failure” of SE i.
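As a minimal sketch of the availability formula above, the following Python function computes α_i from the two mean times. The function name `site_availability` and the sample values are hypothetical and chosen only for illustration; both times are assumed to be expressed in the same unit.

```python
def site_availability(mtbf: float, mtf: float) -> float:
    """Online availability alpha_i = MTBF_i / (MTBF_i + MTF_i).

    mtbf: mean time between failure of the SE
    mtf:  mean time of failure of the SE (same unit as mtbf)
    """
    if mtbf < 0 or mtf < 0 or mtbf + mtf == 0:
        raise ValueError("mtbf and mtf must be non-negative and not both zero")
    return mtbf / (mtbf + mtf)


# Hypothetical SE: 990 hours between failures on average, 10 hours per failure.
alpha = site_availability(990.0, 10.0)
print(alpha)
```

An SE that never fails (MTF_i = 0) gets α_i = 1, and one that is never up (MTBF_i = 0) gets α_i = 0, matching the range [0,1] stated above.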

3.4.2 Estimating the availability of files

As the replicas stored on an available site are themselves assumed to be available and accessible to other sites in the system, the availability ρ_i ∈ [0,1] of a replica of file i on site j is equal to the availability α_j of site j. As a result, a file is available when at least one of its copies (the file itself or any of its replicas) is online. Therefore, we can estimate the availability of file i, P_i, according to the binomial distribution¹, under the assumption that each replica of file i has availability ρ_i, as:

$$P_i(h, nr, \{\rho_i\}) = \sum_{h=1}^{nr} \binom{nr}{h} \rho^{h} (1-\rho)^{nr-h} \tag{3.3}$$

where nr represents the number of replicas of file i in the system and ρ refers to the average availability of the probability set {ρ_i}:

$$\rho = \frac{1}{nr} \sum_{i=1}^{nr} \rho_i \tag{3.4}$$
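Equations (3.3) and (3.4) can be sketched in a few lines of Python. Because the binomial terms for h = 0, ..., nr sum to 1, the sum in equation (3.3) equals 1 − (1 − ρ)^nr, and the example checks both forms against each other. The function names and the replica availabilities are hypothetical, and a file with no replicas is assumed to have availability 0 (the empty sum).

```python
from math import comb


def mean_availability(rhos):
    """Equation (3.4): average availability over the replicas of one file."""
    return sum(rhos) / len(rhos)


def file_availability(rhos):
    """Equation (3.3): probability that at least one of nr replicas is online,
    using the average replica availability rho for every replica."""
    nr = len(rhos)
    if nr == 0:
        return 0.0  # assumption: no replica means the file is unavailable
    rho = mean_availability(rhos)
    return sum(comb(nr, h) * rho**h * (1 - rho) ** (nr - h)
               for h in range(1, nr + 1))


# Hypothetical availabilities of the three sites holding a replica of the file.
rhos = [0.9, 0.8, 0.7]
p_file = file_availability(rhos)
closed_form = 1 - (1 - mean_availability(rhos)) ** len(rhos)
print(p_file, closed_form)  # both are approximately 0.992
```

The closed form is cheaper to evaluate and makes the effect of replication explicit: each additional replica multiplies the probability that the file is unavailable by (1 − ρ).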

Suppose that job i requires access to a file set F_i; the availability P(J_i) of the files required for the execution of job i can be computed as:

$$P(J_i) = \prod_{j=1}^{k} P_j, \quad \forall k \in [1, N] \tag{3.5}$$

where k = |F_i|. This approximation estimates the probability that all k files are available at the moment the job is executed.
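A short sketch of equation (3.5) follows, assuming the per-file availabilities P_j have already been estimated. The function name and the sample values are hypothetical. Note that multiplying the P_j treats the availabilities of the files as independent of each other.

```python
from math import prod


def job_availability(file_availabilities):
    """Equation (3.5): probability that all k files of the job's file set
    are available at the same time."""
    return prod(file_availabilities)


# Hypothetical job that needs three files with these availabilities P_j.
p_job = job_availability([0.992, 0.99, 0.95])
print(p_job)  # approximately 0.933
```

Since every factor is at most 1, a job can never be more available than its least available file, which is why files shared by many jobs benefit most from extra replicas.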

¹ The binomial distribution returns the probability p(k, n, q) of obtaining k successes when performing n independent trials of a certain test, where the probability of success of each single trial is q. The binomial distribution is given by the formula:

$$p(k, n, q) = \binom{n}{k} q^{k} (1-q)^{n-k}, \quad 0 \le k \le n$$

3.4.3 Problem definition

The best system data availability results from maximizing equations (3.3) and (3.5). For this purpose, we define the replication matrix R = [r_i,j]_{M×N}, where r_i,j indicates whether a replica of file j is assigned to site i:

$$r_{i,j} = \begin{cases} 1, & \text{if site } i \text{ stores a replica of file } j; \\ 0, & \text{otherwise;} \end{cases}$$

where i ∈ [1, M] and j ∈ [1, N]. Let r_j refer to the j-th column of the replica placement matrix R, which denotes the subset of sites where file j is replicated. Let nr_j denote the number of replicas of file j stored in the system:

$$nr_j = \sum_{i=1}^{M} r_{i,j}, \quad \forall j \tag{3.6}$$

Obviously, the total size of all the replicas stored at site i should not exceed its storage capacity s_i:

$$\sum_{j=1}^{N} r_{i,j} \, size(j) \le s_i, \quad \forall i \in [1, M] \tag{3.7}$$

where size(j) is the size of a replica of file j. Suppose that each file j is associated with a specific rank q_j, which indicates its degree of importance. Applying equation (3.3) and taking the file rank into consideration, we define the overall file availability of the system as follows:

$$P_{overall}(R) = \frac{\sum_{j=1}^{N} q_j \, P_j(h, nr_j, \{\rho_j\})}{\sum_{j=1}^{N} q_j} \tag{3.8}$$

where P_j(h, nr_j, {ρ_j}) is the availability of file j, nr_j denotes the number of replicas of file j and can be calculated by equation (3.6), and {ρ_j} denotes the availabilities of the sites on which a replica of file j is stored (i.e., r_i,j = 1).
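Equations (3.6) to (3.8) can be put together in a small Python sketch that evaluates a given placement matrix R. All function names and the two-site, two-file instance are hypothetical. The file availability uses the form 1 − (1 − ρ)^nr, which equals the sum in equation (3.3) because the binomial terms for h = 0, ..., nr sum to 1; a file with no replica is assumed to have availability 0.

```python
def replicas_per_file(R):
    """Equation (3.6): nr_j is the sum of column j of R."""
    return [sum(row[j] for row in R) for j in range(len(R[0]))]


def storage_ok(R, sizes, capacities):
    """Equation (3.7): the replicas stored at each site fit in its capacity."""
    return all(sum(r * size for r, size in zip(row, sizes)) <= cap
               for row, cap in zip(R, capacities))


def file_availability(R, alphas, j):
    """Equation (3.3) for file j, with rho averaged over the hosting sites."""
    rhos = [alphas[i] for i, row in enumerate(R) if row[j] == 1]
    if not rhos:
        return 0.0  # assumption: no replica means the file is unavailable
    rho = sum(rhos) / len(rhos)
    return 1.0 - (1.0 - rho) ** len(rhos)


def overall_availability(R, alphas, ranks):
    """Equation (3.8): rank-weighted mean of the file availabilities."""
    weighted = sum(q * file_availability(R, alphas, j)
                   for j, q in enumerate(ranks))
    return weighted / sum(ranks)


# Hypothetical instance: M = 2 sites (rows), N = 2 files (columns).
alphas = [0.9, 0.8]      # site availabilities alpha_i
capacities = [10, 6]     # storage capacities s_i
sizes = [4, 6]           # size(j) of each file
ranks = [3, 1]           # ranks q_j
R = [[1, 1],             # site 1 stores both files
     [1, 0]]             # site 2 stores file 1 only

print(replicas_per_file(R))                    # [2, 1]
print(storage_ok(R, sizes, capacities))        # True
print(overall_availability(R, alphas, ranks))  # approximately 0.958
```

In this instance the higher-ranked file has two replicas, so its availability (about 0.9775) dominates the weighted average.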

Given the above equation, we can formulate the data replication problem as finding the assignment of r_i,j values in the matrix R that maximizes the availability of files in the system. Our objective in the design of a replication algorithm is therefore to maximize the overall system-level availability P_overall in equation (3.8), subject to the storage constraint in equation (3.7).

With such a replication algorithm, highly ranked files will receive higher availability. The optimal assignment of the replicas to the appropriate sites is a typical “Knapsack Problem”, given that each file replica has a storage cost.
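To make the optimization problem concrete, the sketch below enumerates every 0/1 placement matrix for a tiny hypothetical instance and keeps the feasible one with the highest P_overall. This exhaustive search is not the replication algorithm of this chapter; it is only an illustration of the objective (3.8) and the constraint (3.7), and it is practical only for very small M and N because it examines 2^(M·N) matrices. All names and values are assumptions made for the example.

```python
from itertools import product


def overall_availability(R, alphas, ranks):
    """Equation (3.8), with each file availability taken as 1 - (1 - rho)^nr."""
    total = 0.0
    for j, q in enumerate(ranks):
        rhos = [alphas[i] for i, row in enumerate(R) if row[j] == 1]
        if rhos:  # a file without replicas contributes availability 0
            rho = sum(rhos) / len(rhos)
            total += q * (1.0 - (1.0 - rho) ** len(rhos))
    return total / sum(ranks)


def feasible(R, sizes, capacities):
    """Storage constraint of equation (3.7) for every site."""
    return all(sum(r * size for r, size in zip(row, sizes)) <= cap
               for row, cap in zip(R, capacities))


def best_placement(alphas, capacities, sizes, ranks):
    """Exhaustively search all M x N placements (illustration only)."""
    M, N = len(alphas), len(sizes)
    best_R, best_p = None, -1.0
    for bits in product((0, 1), repeat=M * N):
        R = [list(bits[i * N:(i + 1) * N]) for i in range(M)]
        if not feasible(R, sizes, capacities):
            continue
        p = overall_availability(R, alphas, ranks)
        if p > best_p:
            best_R, best_p = R, p
    return best_R, best_p


# Hypothetical instance: 2 sites, 2 files; file 1 has the higher rank.
best_R, best_p = best_placement(alphas=[0.9, 0.8], capacities=[10, 6],
                                sizes=[4, 6], ranks=[3, 1])
print(best_R, best_p)
```

On this instance the second site can hold only one of the two files, and the search gives that slot to the higher-ranked file, which illustrates the remark that highly ranked files receive higher availability.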


From the document Grid Computing Theory, Algorithms and Technologies (pages 103-107)