4.5.4 Placement, replication, caching

The placement of data is a decision that impacts the performance of a grid, both in terms of access time and robustness. In the DataGrid and now the EGEE project, like in many others, the solution was to adopt a hierarchical data structure. Subsets of data are copied at several levels of a data tree whose nodes are distributed over physically distant locations. For instance, in the LHC experiments of the DataGrid, CERN is the production center of the data: it is Tier 0. It is connected through high-speed networks to several sites acting as Tier 1, with smaller capacities, where data are stored. These are recursively connected to Tier 2 sites with smaller bandwidth and storage capacities.

It is common knowledge that replication of data increases performance: placing data pieces closer to their future usage decreases the client waiting time and balances the load between several potential servers. It also increases the reliability of the system, the data being more likely to remain accessible in case of system failures.

The failures that can show up are very diverse, and each calls for a different way of replicating the data, all of which are easily possible in a heterogeneous grid. Media failures (disk failure, tape failure) can be overcome by replicating on multiple media. To avoid problems with vendor-specific system errors, data should be replicated on products from different vendors. To handle problems with site connections, replication on a second site is necessary, while protecting against natural disasters requires the data to be replicated at a distant site. Finally, replicating data in deep archives (with more robustness and security) decreases the risks due to malicious users or careless administrators. A sketch of failure-aware placement is given below.
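
To make the idea concrete, here is a minimal sketch (in Python, with a hypothetical StorageNode description; not taken from any actual middleware) of a placement policy that greedily spreads the replicas of a data piece over different media, vendors and sites, so that a single failure of any one kind cannot destroy all copies.

from dataclasses import dataclass

@dataclass(frozen=True)
class StorageNode:
    name: str
    media: str    # e.g. "disk" or "tape"
    vendor: str
    site: str

def pick_replica_targets(nodes, count):
    """Greedy placement: each new replica goes to the node that differs
    from the already chosen ones in as many failure domains as possible
    (media type, vendor, site)."""
    chosen = []
    for _ in range(count):
        candidates = [n for n in nodes if n not in chosen]
        if not candidates:
            break
        def diversity(node):
            return (sum(node.media != c.media for c in chosen)
                    + sum(node.vendor != c.vendor for c in chosen)
                    + sum(node.site != c.site for c in chosen))
        chosen.append(max(candidates, key=diversity))
    return chosen

nodes = [
    StorageNode("n1", "disk", "vendorA", "cern"),
    StorageNode("n2", "disk", "vendorB", "lyon"),
    StorageNode("n3", "tape", "vendorA", "bologna"),
]
print([n.name for n in pick_replica_targets(nodes, 3)])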

Works on replica management are numerous. They investigate several sides of the problem, from replica placement, migration, deletion and access to performance in terms of access time and disk capacity used. They try to balance criteria such as the number of replicas, the dynamic locations of replicas, and local and global system performance. Examples include the LCG Replica Catalog [gLite, 2009], the Globus Replica Location Service (RLS) and Data Replication Service [Globus, 2009a]. They allow registration and discovery of replicas using replica catalogs, mapping a logical identifier to the actual physical locations of the replicas. Choosing the best replica is a tedious task, based on acquired knowledge about the status of the infrastructure and the needs of the users: among others, the Network Distance Service [Gossa et al., 2007] helps to take several contradictory parameters into account for data placement, data replication (number of replicas) or replica selection. Being able to optimize the number of replicas and their placement is a key issue, notably for keeping consistency manageable. Since several copies of the same data piece exist in the system, there is a need for consistency between these copies. This also applies to the metadata attached to the data. This aspect will be covered in Section 4.5.6.
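
As an illustration of the catalog idea, the following is a toy replica catalog mapping a logical file name (LFN) to the physical locations (PFNs) of its replicas, together with a simple "best replica" selection driven by a caller-supplied cost function. The names ReplicaCatalog, register and best_replica are illustrative only, not the actual RLS interface.

class ReplicaCatalog:
    """Toy replica catalog: maps a logical file name (LFN) to the set of
    physical replica locations (PFNs) where copies are registered."""

    def __init__(self):
        self._replicas = {}                     # lfn -> set of pfn

    def register(self, lfn, pfn):
        self._replicas.setdefault(lfn, set()).add(pfn)

    def unregister(self, lfn, pfn):
        self._replicas.get(lfn, set()).discard(pfn)

    def lookup(self, lfn):
        return set(self._replicas.get(lfn, set()))

    def best_replica(self, lfn, cost):
        """Pick the replica minimising a caller-supplied cost function,
        e.g. an estimate of the network distance to the client."""
        replicas = self.lookup(lfn)
        return min(replicas, key=cost) if replicas else None

catalog = ReplicaCatalog()
catalog.register("lfn:/lhc/run42/event.dat", "srm://cern.ch/data/event.dat")
catalog.register("lfn:/lhc/run42/event.dat", "srm://in2p3.fr/data/event.dat")
# cost=len is only a placeholder for a real network-distance estimate
print(catalog.best_replica("lfn:/lhc/run42/event.dat", cost=len))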

Differences between replication and caching are subject to discussion. While replication is often done explicitly by a user or a service, caching handles data pieces transparently for the system, mainly in order to increase access performance. Data in caches are stored temporarily and no guarantee is given on the presence of a data piece at a given site: the system can arbitrarily decide to delete data to make space for another piece of data.
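
The following toy cache (a plain LRU sketch, not an actual grid caching service) illustrates this semantics: entries are kept only while space allows, and the system may silently evict a piece to make room for another.

from collections import OrderedDict

class SiteCache:
    """Toy site-level cache: data pieces are kept only as long as space
    allows; the least recently used entry is evicted to make room, so no
    guarantee is given that a piece is still present later."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()           # key -> data piece

    def get(self, key):
        if key not in self._entries:
            return None                         # miss: fetch from a replica
        self._entries.move_to_end(key)          # mark as recently used
        return self._entries[key]

    def put(self, key, data):
        self._entries[key] = data
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # evict the oldest entry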

Few works are interested in caching in grids, since most claim that caching is just a specialized way of replicating. Nevertheless, the coordination of caches in a grid, as in [Cardenas et al., 2007], benefits all the sites of the grid, especially where a community of users shares interests and thus data pieces. It increases the caching capacity and allows for more advanced data management, for instance performance enhancement and data splitting between caches.

4.5.5 Security: transport, authentication, access control, encryption

Data security has to be ensured at different places in the architecture. The first thing is to ensure the security of the sites where the data are stored, then to ensure that the communication of the data is secure. In grids, no specific work has been done concerning the secure transport of data; applications rely on well-known protocols like SSL/TLS.

To secure sites, there are basically three complementary mechanisms: authentication, authorization and encryption. The security problem is particularly difficult in grids because of the lack of a global and centralized authority. Indeed, local administrators of autonomous sites belonging to different organizations do not want access decisions to be taken outside their control.

Authentication is the process of securely identifying the services and users that want to access the sites. It is not related to data management per se, but it is a necessary building block of a grid middleware where the data management services have to interact with distant sites on behalf of the users. To avoid the need for the user to authenticate manually on the different sites, single sign-on procedures have been developed. They are mainly based on certification and delegation of authority, for instance through the use of proxy certificates as in the Globus Security Infrastructure (GSI) [Foster and Kesselman, 2004]. Shibboleth [Shibboleth, 2009] allows for the cooperation of identity servers and for identity federation. Traditionally the authentication of a user connecting to a service is done at the service site; in Shibboleth it is the user's organization that verifies the user's presence in its database and transmits the required attributes to the service. GridShib [Scavo and Welch, 2007] is a grid version integrated in Globus.

Authorization allows for verifying the permissions to access specific resources (for instance data) once authenticated on a site. Several access control mechanisms have been developed for grids. The Community Authorization Service (CAS) [Pearlman et al., 2003] has been proposed in the Globus middleware to control the access to resources in a Virtual Organization (VO). Attribute certificates carry the users' membership in terms of VO. The organizations that are members of a VO delegate the access control of some of their resources to the CAS servers, which verify the certificate memberships and the authorities over the resources. Interconnecting VOs (by collaborative CASs) allows for statically mapping the different profiles implemented in each VO.

In VOMS (Virtual Organization Membership Service) [Alfieri et al., 2005] the authorization rules for accessing the resources stay at the resource sites: thus the owner of the resource, and not the VO administrator, is responsible for its access control (the opposite of CAS).

Permis (PErmission and Role Management Infrastructure Standards) [Chadwick et al., 2008] uses Role Based Access Control (RBAC) to issue attribute certificates based on roles in an organization rather than to individuals. It integrates delegation of authority, where the SOA (Source of Authority) of the resource specifies the trusted entities allowed to issue attribute certificates.
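
The sketch below illustrates the RBAC principle with a hypothetical role/permission table; in a system such as Permis, the subject's roles would come from verified attribute certificates rather than from a local dictionary, and the permission patterns are purely illustrative.

import fnmatch

# Hypothetical role -> permission table; each permission is an
# (action, resource pattern) pair.
ROLE_PERMISSIONS = {
    "physicist": {("read", "lfn:/lhc/run42/*")},
    "operator":  {("read", "lfn:/lhc/*"), ("write", "lfn:/lhc/*")},
}

def is_authorized(roles, action, resource):
    """Grant access if any of the subject's roles carries a permission
    whose action matches and whose resource pattern matches (glob-style)
    the requested resource."""
    for role in roles:
        for allowed_action, pattern in ROLE_PERMISSIONS.get(role, set()):
            if allowed_action == action and fnmatch.fnmatch(resource, pattern):
                return True
    return False

# The roles would be extracted from a verified attribute certificate.
print(is_authorized({"physicist"}, "read", "lfn:/lhc/run42/event.dat"))   # True
print(is_authorized({"physicist"}, "write", "lfn:/lhc/run42/event.dat"))  # False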

In Sygn [Seitz et al., 2005], all permissions to access resources are encoded in attribute certificates that are stored with their owner. Sygn does not involve any extra communication when granting access to a resource. At any time, resource administrators (or entities to which the permission has been delegated) can issue new authorization certificates.

These approaches differ mainly in the granularity of the access control, the location of the access control decision point (centralized, replicated, distributed) and the responsibility left to the resource administrator. They all use attribute certificates (X.509 or ad hoc).

Due to the lack of a central authority, access policies are normally not expressed in the same language. This heterogeneity has to be handled, for instance by means of a standardized language to express the policies, such as XACML (eXtensible Access Control Markup Language). Nevertheless, there is still the need to work at a semantic level to map the different policies onto one another.

Data access also needs to be traceable: in many applications it is useful or mandatory to know the previous read and write accesses to the data.

Encryption secures the content of the data. It is recognized that access control does not protect against all security threats: for instance, a disk containing data might be accessed directly, without going through any middleware, during maintenance sessions or when the site containing the data leaves the grid.

Traditional encryption mechanisms such as RSA or DES can be used, but then the secure management of the encryption keys becomes difficult. Where should they be kept in order to allow access to authorized users or services while not suffering from a single point of failure? Some works [Seitz et al., 2003] distribute parts of the decryption keys among a set of servers in order to decrease the risks related to a compromised key server repository.
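
A very simple way to avoid storing a key on a single server is to split it into shares, for example with XOR, so that every share is needed to rebuild it. The sketch below shows this n-out-of-n variant only; threshold secret-sharing schemes, which also tolerate missing servers, are more elaborate. This is an illustrative sketch, not the scheme of [Seitz et al., 2003].

import os
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split_key(key: bytes, n_servers: int):
    """Split a key into n XOR shares: all shares are needed to rebuild it,
    so the compromise of a single key server reveals nothing."""
    shares = [os.urandom(len(key)) for _ in range(n_servers - 1)]
    shares.append(reduce(xor_bytes, shares, key))
    return shares

def recover_key(shares):
    return reduce(xor_bytes, shares)

key = os.urandom(16)
shares = split_key(key, 3)          # one share per key server
assert recover_key(shares) == key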

Metadata are often attached to data. The security handling of both must be consistent, and this is also the case when data are replicated.

4.5.6 Consistency

The consistency problem appears when data are replicated and accessed with write patterns. Most of the huge amounts of data produced in production grids are mainly read-only. For instance, data produced by the LHC experiments result from the collision of particles and thus are not subject to change. From these data, additional data can be produced or computed, but their volume is usually far smaller. As a result, data consistency did not attract much attention in the grid community in the past.

The part where consistency has made sense is the management of metadata. Some metadata are fixed, like the acquisition date or the description of the experiment that produced the data (in terms of instruments, for instance). Some metadata may evolve with time, such as annotations made by experts after analysis of the data. Security permissions must also be considered as metadata for which consistency should hold: when the permissions to access a data piece are modified, this modification has to be propagated.

Since the support for consistency has long been mainly ignored in grid middleware, the consistency of the data was managed by the users themselves: they replaced the modified data (or metadata) manually. This procedure is highly inefficient:

The users can make errors in the process, forgetting some replicas for instance;

When several users want to update data at the same time, they have to cooperate, leading to difficult user-driven coordination procedures;

It is not robust when some sites are temporarily unavailable: these sites miss the data updates, leading to inconsistencies in the system.


Quorum mechanisms, depending on the application requirements, could be used to ensure that an operation is performed when at least a given number of replicas are available. Synchronization is then performed as soon as the missing replicas become available again.
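
A minimal sketch of such a quorum write is given below, assuming a hypothetical replica object exposing a write method; the replicas that could not be reached are returned so that they can be resynchronized once they come back.

def quorum_write(replicas, key, value, write_quorum):
    """Attempt the update on every replica; declare success only if at
    least write_quorum of them acknowledged.  Replicas that were down
    are remembered so they can be synchronized when they come back."""
    acked, missed = [], []
    for replica in replicas:
        try:
            replica.write(key, value)       # hypothetical replica interface
            acked.append(replica)
        except ConnectionError:
            missed.append(replica)
    if len(acked) < write_quorum:
        raise RuntimeError("update rejected: quorum not reached")
    return missed                           # to be resynchronized later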

Few works can be cited for consistency management in grids, and they clearly depend on the applications' requirements. [Pucciani, 2008] presents a comprehensive introduction, related works and innovative solutions. Solutions mostly rely on the concept of master replicas (up-to-date data are found at a master replica) and on different mechanisms to take the modification of the master replicas (possibly several) into account [Sun and Xu, 2004], [Chang and Chang, 2006]: lazy updates are done only when a slave replica (nonmaster replicas are slaves) is accessed, while push methods update data more aggressively, at the initiative of the master replicas. Other methods [Chen et al., 2007] for databases build on the underlying replication strategies and consistency management of the databases, using the possibility to replay a set of operations (with import and export logs).
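
The sketch below contrasts the two update styles with toy MasterReplica and SlaveReplica classes (hypothetical names, not taken from the cited works): in push mode the master refreshes the slaves at write time, while in lazy mode a stale slave fetches the current version only when it is read.

class MasterReplica:
    """Holds the authoritative copy; every write increments a version counter."""
    def __init__(self):
        self.version, self.data, self.slaves = 0, None, []

    def write(self, data, push=False):
        self.version += 1
        self.data = data
        if push:                                  # push mode: eager update
            for slave in self.slaves:
                slave.refresh(self.data, self.version)

class SlaveReplica:
    def __init__(self, master):
        self.master, self.version, self.data = master, 0, None
        master.slaves.append(self)

    def refresh(self, data, version):
        self.data, self.version = data, version

    def read(self):
        # Lazy mode: contact the master only when the slave is accessed.
        if self.version < self.master.version:
            self.refresh(self.master.data, self.master.version)
        return self.data

master = MasterReplica()
slave = SlaveReplica(master)
master.write("v1")                 # lazy: slave is refreshed at its next read
assert slave.read() == "v1"
master.write("v2", push=True)      # push: slave already holds "v2"
assert slave.data == "v2"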

4.6 Toward pervasive, autonomic and on-demand data