MaxDAR optimizer algorithm - Selective-rank replication algorithm

Data replication in grid environments

3.5 Selective-rank replication algorithm

3.5.3 MaxDAR optimizer algorithm

We propose an original algorithm called the MaxDAR (maximize data availability with selective rank) optimizer algorithm, shown in Algorithm 3.5.1.

The main task of the MaxDAR optimizer is to determine whether or not to create a new replica, based on the benefit gained in the overall data availability of the system. The goal of MaxDAR is to increase the availability of the most highly ranked files in order to avoid situations in which storage resources are wasted on unimportant files. In this way, the highly ranked, or important, files are more likely to receive greater storage resources and hence achieve a higher availability according to their requirements. This strategy makes sense especially in the context of limited storage resources.

84 Fundamentals of Grid Computing

Due to storage constraints, there must be an efficient mechanism for deleting existing files from the sites to make room for replacements. The replacement strategy in [Ranganathan and Foster, 2001] deletes the least frequently used (LFU) files, i.e., the most unpopular ones, once the storage space of the site is exhausted. The other popular strategy for replica replacement deletes the least recently used (LRU) files [Bell et al., 2003].
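The difference between the two replacement strategies can be illustrated with a minimal sketch. The `StoredFile` record and the two helper functions are hypothetical names introduced here for illustration; they are not taken from the cited works or from OptorSim.

```python
from dataclasses import dataclass


@dataclass
class StoredFile:
    name: str
    access_count: int   # total number of accesses observed so far
    last_access: float  # timestamp of the most recent access


def lfu_victim(files):
    """Return the least frequently used file (fewest accesses)."""
    return min(files, key=lambda f: f.access_count)


def lru_victim(files):
    """Return the least recently used file (oldest last access)."""
    return min(files, key=lambda f: f.last_access)


files = [
    StoredFile("a", access_count=12, last_access=100.0),
    StoredFile("b", access_count=3, last_access=250.0),
    StoredFile("c", access_count=7, last_access=40.0),
]
print(lfu_victim(files).name)  # b: it has the fewest accesses
print(lru_victim(files).name)  # c: it was accessed longest ago
```

The two policies can pick different victims for the same storage element, as file b and file c show here.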

In our algorithm, existing files are deleted to make space for the new replica when there is not enough free space. The candidate files to be replaced are selected based on their storage cost. We define the storage cost of file i as:

$$ sc_i = \frac{nr_i \times size(i)}{P_i \times q_i} \qquad (3.12) $$

where nr_i denotes the number of replicas of file i, size(i) its size, P_i its availability, and q_i its rank.
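As a small worked example, the storage cost can be computed as follows. This sketch assumes that equation (3.12) is read as the ratio of nr_i × size(i) to P_i × q_i; the function name, parameter names, and numeric values are illustrative only.

```python
def storage_cost(num_replicas, size, availability, rank):
    """Storage cost sc_i, assuming (3.12) is the ratio (nr_i * size(i)) / (P_i * q_i)."""
    if availability <= 0 or rank <= 0:
        raise ValueError("availability and rank must be positive")
    return (num_replicas * size) / (availability * rank)


# Two files that differ only in rank: the lowly ranked file has the larger
# storage cost, so it is considered for deletion first.
low_rank = storage_cost(num_replicas=3, size=2.0, availability=0.9, rank=1.0)
high_rank = storage_cost(num_replicas=3, size=2.0, availability=0.9, rank=10.0)
print(low_rank > high_rank)  # True
```

Under this reading, files with many replicas, a large size, or a low rank are the first candidates for replacement.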

As shown in Algorithm 3.5.1, if the required file does not exist locally in the SE, it is fetched from other sites (lines 1-3). Then, if the free storage space is large enough to store the requested file, the file is always replicated (lines 4-5). Otherwise, a set of candidate files C to be deleted for the new replica is selected according to their storage cost (lines 7-12). Since the goal of MaxDAR is to maximize the system-level data availability according to a selective rank, the replication benefit is required to be greater than the replacement loss (lines 13-18) for the replication to take place. The replacement loss is evaluated as the sum of the availability loss of the selected candidate files, weighted by their selective rank (line 13). P_i and P_i^new are the availability of file i before and after the replication, and are calculated by equation (3.3). It should be noted that the algorithm ignores the master files, i.e., primary copies of the data file, and pinned files, i.e., files that are being accessed by jobs, when selecting candidate files for replacement.
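The decision logic described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the OptorSim implementation: the `Replica` record and `maxdar_decide` function are hypothetical names, the fetch step (lines 1-3) is omitted, the storage cost is read as the ratio in equation (3.12), `new_availability` stands in for P_i^new (computing it from equation (3.3) is outside the sketch), and refusing replication when the evictable files cannot free enough space is an assumption for a case the pseudocode leaves open.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    size: float              # size(i)
    availability: float      # P_i
    rank: float              # q_i
    num_replicas: int        # nr_i
    protected: bool = False  # master or pinned file, never a candidate

    def storage_cost(self):
        # Assumed reading of equation (3.12).
        return self.num_replicas * self.size / (self.availability * self.rank)


def maxdar_decide(stored, free_space, new_file, new_availability):
    """Return (replicate, victims) for one replication request."""
    if free_space > new_file.size:                     # lines 4-5
        return True, []
    candidates = sorted(
        (f for f in stored if not f.protected),        # skip master and pinned files
        key=Replica.storage_cost,
        reverse=True,                                  # line 7: descending cost
    )
    victims, freed = [], 0.0
    for f in candidates:                               # lines 9-12
        if freed >= new_file.size:
            break
        victims.append(f)
        freed += f.size
    if freed < new_file.size:
        return False, []                               # assumption: cannot free enough space
    loss = sum(f.availability * f.rank for f in victims)   # line 13
    benefit = new_availability * new_file.rank              # line 14
    if benefit > loss:                                 # lines 15-18
        return True, victims
    return False, []


stored = [
    Replica("a", size=10, availability=0.9, rank=1.0, num_replicas=3),
    Replica("b", size=10, availability=0.8, rank=5.0, num_replicas=1),
    Replica("m", size=20, availability=0.99, rank=2.0, num_replicas=1, protected=True),
]
important = Replica("n", size=10, availability=0.5, rank=4.0, num_replicas=1)
unimportant = Replica("u", size=10, availability=0.5, rank=0.5, num_replicas=1)
accepted = maxdar_decide(stored, free_space=2, new_file=important, new_availability=0.9)
rejected = maxdar_decide(stored, free_space=2, new_file=unimportant, new_availability=0.9)
```

In this example, file a has the highest storage cost and becomes the only candidate. The highly ranked request displaces it, while the lowly ranked request is rejected because its benefit does not exceed the loss.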

Based on the MaxDAR algorithm, we propose three variant optimizers:

• MaxDAR-Pb: Files are ranked according to their popularity. The prediction function for the number of future file accesses is calculated by a binomial distribution, q_i = V_i. This optimizer focuses on replicating the frequently accessed files.

• MaxDAR-Pz: Files are ranked according to their popularity. The prediction function for the number of future file accesses is calculated by a Zipf distribution, q_i = V_i. This optimizer focuses on replicating the frequently accessed files.

Algorithm 3.5.1 MaxDAR (Maximize Data Availability with selective Rank)

 1: if needed file f_i ∄ in the site then
 2:     Get f_i from other sites
 3: end if
 4: if free space in SE > size(i) then
 5:     Store f_i in this SE
 6: else
 7:     Sort files in the SE in descending order of storage cost (equation (3.12))
 8:     C ← {}
 9:     while Σ_{f_k ∈ C} size(k) < size(i) do
10:         Pop the first file f_candidate off the sorted list
11:         C = C ∪ {f_candidate}
12:     end while
13:     loss = Σ_{f_k ∈ C} P_k q_k
14:     benefit = P_i^new q_i
15:     if benefit > loss then
16:         Delete all the files in C
17:         Store f_i
18:     end if
19: end if

• MaxDAR-C: Files are ranked according to their correlation with other files located near the site where the replication is considered, q_i = C(f_i, S_k). The replication decision for file i on site k is evaluated by the value of C(f_i, S_k) (equation (3.9)). As indicated in Section 3.5.2, this optimizer gives priority to replicating files that are likely to be requested by the same job, so that when the job is executed nearby, the cost of file access is reduced. The running times, and hence the costs, of running jobs are also reduced.
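The text does not spell out the prediction functions behind the popularity-based variants, so the following is only one possible illustration of a Zipf-based rank. It assumes that files are ordered by their observed access counts and that an expected total number of future accesses is distributed over them in proportion to 1/r^s, where r is the popularity position. The function name, the `exponent` parameter, and the input values are hypothetical.

```python
def zipf_predicted_accesses(access_counts, total_future_accesses, exponent=1.0):
    """Predict future accesses per file, assuming popularity follows a Zipf law.

    access_counts maps file names to observed access counts; the most accessed
    file gets popularity position 1, the next one position 2, and so on.
    """
    ordered = sorted(access_counts, key=access_counts.get, reverse=True)
    weights = [1.0 / (r ** exponent) for r in range(1, len(ordered) + 1)]
    norm = sum(weights)
    return {
        name: total_future_accesses * w / norm
        for name, w in zip(ordered, weights)
    }


observed = {"a": 50, "b": 20, "c": 5}
predicted = zipf_predicted_accesses(observed, total_future_accesses=110)
# With exponent 1 the weights are 1, 1/2, 1/3, so the 110 accesses are split
# roughly as 60, 30 and 20; these values could then serve as the ranks q_i.
```

Any such prediction can be plugged into the optimizer as the rank q_i without changing the replication decision itself, which is what makes the three variants interchangeable.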

In the next section, we present the performance evaluation of our proposed MaxDAR optimizers with simulation experiments using the grid simulation tool OptorSim [Cameron et al., 2004b].

3.6 Evaluation

We evaluated our proposed MaxDAR optimizers using OptorSim v2.0.1, which was developed by the European DataGrid project [European datagrid, 2009] for evaluating both file access optimization and data replication strategies. OptorSim simulates the architecture shown in Figure 3.2, which consists of several components: computing elements (CEs), storage elements (SEs), a resource broker, and a replica optimizer. Simulated jobs are distributed to the optimal CEs across the grid according to the scheduling algorithm used by the resource broker. As the CE requests the file set for each job, the replica optimizer decides whether or not to replicate each file according to the benefit gained from the replication.

FIGURE 3.3: Grid topology in the simulation.

We have implemented the three MaxDAR optimizers as three new replica optimizers in OptorSim. We first discuss the simulation’s configuration, followed by the results.

Table 3.2: Parameter settings for the simulation.

3.6.1 Grid configuration

The grid topology used in the simulation is adopted from the CMS testbed [GriPhyN, 2009], [Holtman, 2001], which has the resources and network bandwidths between the sites shown in Figure 3.3. It consists of 8 routers and 20 grid sites in Europe and the USA [Cameron et al., 2004c]. We used the default settings of OptorSim and modified the topology to meet our needs.

At the beginning of the simulation, all the master files are placed at the CERN site. Initially, there are 97 files in the grid, and the storage capacity of the SEs ranges from 50 GB to 1000 GB. CERN was allocated large SEs with a capacity of 1000 GB and no CEs. FNAL was allocated 100 GB of storage. Every other site was given 50 GB of storage and a CE with one worker node.
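The storage allocation described above can be written down as a small configuration sketch. This is not OptorSim's configuration format: the function name and the placeholder site names (other than CERN and FNAL) are hypothetical, and only the storage capacities stated in the text are modeled.

```python
def build_storage_capacities(site_names):
    """Map each grid site to its SE capacity in GB, following the text."""
    capacities = {}
    for name in site_names:
        if name == "CERN":
            capacities[name] = 1000  # holds all master files
        elif name == "FNAL":
            capacities[name] = 100
        else:
            capacities[name] = 50
    return capacities


# 20 grid sites: CERN, FNAL and 18 placeholder names for the remaining sites.
sites = ["CERN", "FNAL"] + [f"site-{n:02d}" for n in range(1, 19)]
capacities = build_storage_capacities(sites)
total_capacity_gb = sum(capacities.values())
```

Under these settings only CERN has ample room, so most sites must rely on the replacement strategy once their 50 GB storage element fills up.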

We are interested in how the replication algorithm performs under different parameters, such as the total number of jobs to be executed, the file size, and the data availability at each SE. These parameters are summarized in Table 3.2. It should be noted that our simulation does not take into account the job execution, i.e., the processing of the files by the CEs, but only the file transmission time required by the job.
