


You are certainly already using cloud services hosted in data centers. In your personal life, your mail is hosted in a data center; you also use online editing services, manage your agenda online and deposit your photos... in the cloud. All these services are the beginning of an industrialization of the researcher's tasks. We can now examine them from a broader angle. Note that the list of major issues discussed in this document is, of course, not exhaustive.

"Data science" is a new discipline which relies upon computer science and mathematics, espe-cially statistics, to extract information from data. The term big data characterizes data, rather than a study object and less scientific methods of knowledge extraction. In data science, researchers rely upon "data mining", statistics, signal processing, learning and data visualization. Each of these disciplines produces and exchanges data. Production sites can be located geographically not in the same place, which implies data traffic flow and hosting between the sites.

2.1.2 Functional view of the data life cycle

From an architectural and functional point of view (that is, for the computer scientist), the high-level view of the computer architecture, describing the main functions of the system required by those working in the e-Sciences, is given, for example, by the NIST model represented in Figure 2.1. We argue here that this model will prevail and be generalized to anyone working in the e-Science area (i.e., all sciences which use digital technologies such as the cloud and big data).

This figure represents a model of the life cycle of the data and thus shows the data traffic flow and hosting in the e-Sciences. As introduced at the very beginning of this document, the life cycle of a datum makes explicit where the data are created, where they are transferred and hosted, and where they vanish. In Figure 2.1 we clearly see that the cycle starts in the Data discovery box and ends in the Data archiving or Data recycling box.


Figure 2.1: Functional view of the data life cycle. According to NIST.

It is important to note that each box corresponds to a set of computer tools which together define the overall technical view. With cloud computing, any scientist can compose his or her own technical view by integrating the tools he or she wishes, on demand, without any system administrator's intervention.
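To make this concrete, here is a minimal sketch, in Python, of what composing such a technical view could look like: a simple mapping from the functional boxes of Figure 2.1 to tools a scientist picks on demand. The box and tool names are illustrative assumptions, not part of the NIST model.

```python
# A minimal sketch (Python 3.9+): mapping each functional box of the life
# cycle to a tool chosen on demand. Box and tool names are illustrative.

technical_view = {
    "data_discovery": "web crawler",
    "data_collection": "sensor gateway",
    "data_processing": "Spark cluster",
    "data_analysis": "Jupyter notebooks",
    "data_archiving": "S3-compatible object storage",
}

def deploy(view: dict[str, str]) -> None:
    """Pretend to provision each box, with no system administrator involved."""
    for box, tool in view.items():
        print(f"provisioning '{tool}' for the '{box}' box")

deploy(technical_view)
```

Swapping a tool is then a one-line change to the mapping, which is precisely the on-demand flexibility described above.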

This possibility offered by cloud computing complicates the overall management and hosting of the data, while offering great flexibility of use. For example, should security be handled at the level of individual boxes or at the level of a group of boxes? Which type of legislation applies? We must also keep in mind that the service offered by one box may be located at a site geographically distinct from the service offered by another box.

The role of the computer scientist in this system is to enable other scientists, working in e-Science, to insert their discipline into the functional view of Figure 2.1. He or she must therefore promote the emergence of the major design patterns, i.e., the arrangements and ways of organizing configurable data management services for a given discipline.

The functional view of Figure 2.1, once implemented in a real system, allows an individual to deploy his or her favorite tools on demand. At the level of scientific teams, this same view is a coherent framework, making it possible to reason about the interactions between the teams and the flows of information. In large projects like the LHC61, this view is already implemented, but not on top of cloud and data center systems. These architectures could nevertheless serve to unify the systems and to pool resources, over which smoother policies could be implemented (supplying the system with external resources during peak activity). In the case of small projects, the view of

61 The Large Hadron Collider (LHC) project

Figure 2.1 also allows a single individual to master all the professional processes (analysis, archiving, curation...) from the moment this individual recognizes the benefits of this data management approach.

To control the data exchanges between the boxes of Figure 2.1, the computer scientist also develops programming models for the management of data life cycles. For the computer scientist Gilles Fedak62, a perfect life-cycle data management system should (a sketch follows the list):

• capture the steps and essential properties of the life cycle: creation, destruction, faults, replication, error checking...;

• enable existing systems to expose their intrinsic data life cycle;

• reason about data sets distributed over sets of infrastructures and heterogeneous systems; and

• simplify the programming of "data life cycle" applications.
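As an illustration of the first two requirements, here is a minimal sketch of a data life cycle captured as a state machine that also exposes its own history. The state names, transitions and API are assumptions made for illustration; they are not the actual ActiveData interface.

```python
# A minimal sketch of a data life cycle as a state machine; states and
# transitions are illustrative assumptions, not the ActiveData model.

from enum import Enum, auto

class State(Enum):
    CREATED = auto()
    REPLICATED = auto()
    FAULTY = auto()      # a fault in the life cycle, e.g. a lost replica
    CHECKED = auto()     # error checking passed
    DESTROYED = auto()

# Legal transitions of the life cycle; reasoning on distributed data sets
# then amounts to tracking which state each replica of a datum is in.
TRANSITIONS = {
    State.CREATED:    {State.REPLICATED, State.CHECKED, State.DESTROYED},
    State.REPLICATED: {State.FAULTY, State.CHECKED, State.DESTROYED},
    State.FAULTY:     {State.REPLICATED},            # repair by re-replication
    State.CHECKED:    {State.REPLICATED, State.DESTROYED},
    State.DESTROYED:  set(),
}

class Datum:
    def __init__(self, name: str):
        self.name = name
        self.state = State.CREATED
        self.history = [State.CREATED]   # exposes the intrinsic life cycle

    def transition(self, new_state: State) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.history.append(new_state)

d = Datum("sensor-run-42")
d.transition(State.REPLICATED)
d.transition(State.CHECKED)
print([s.name for s in d.history])  # ['CREATED', 'REPLICATED', 'CHECKED']
```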

In this work, questions about the incoming data are crucial, in order to estimate the quality of a data set and to keep track of the conditions of data acquisition and transformation.
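A minimal sketch of such tracking could attach a provenance record to each acquisition or transformation step; the field names and operations below are purely illustrative.

```python
# A minimal sketch of provenance tracking: each transformation records the
# conditions under which the data were produced. Fields are illustrative.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    operation: str            # e.g. "acquisition", "calibration", "filtering"
    parameters: dict
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

@dataclass
class DataSet:
    name: str
    provenance: list = field(default_factory=list)

    def record(self, operation: str, **parameters) -> None:
        self.provenance.append(ProvenanceRecord(operation, parameters))

ds = DataSet("telescope-night-113")
ds.record("acquisition", instrument="CCD-3", exposure_s=30)
ds.record("calibration", dark_frame="dark-2015-11-02")
for rec in ds.provenance:
    print(rec.timestamp, rec.operation, rec.parameters)
```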

R For information, the UK Data Archive website63 is another entry point on the questions of the data life cycle. This site also allows you to post your own data on the web. Finally, we recommend the special issue of November 2015 of Computing Edge from the IEEE Computer Society64. This issue discusses big data in various disciplinary domains, including medicine.

2.1.3 Examples of scientific approaches impacted by the cloud and the big data

After examining how the new paradigms of the cloud and the big data affect the day-to-day life of the researcher, let us look at how the researcher impacts these universes through a few examples.

As far as scientific approaches are concerned, the architectural considerations presented in the previous section call for more creativity on the part of the researcher. He or she is now capable of developing new scientific approaches and methods.

The first example concerns hypothesis management. Once stated, a hypothesis can be discussed, for example, in the context of an experimental approach. In the article by Gonçalves and Porto65, hypotheses are treated as data, just like an ordinary database, in order to obtain a unified framework encompassing both the hypotheses and the observed data.

Let us first recall the role of the hypothesis in the context of an experimental scientific approach.

Of course, the study of researchers' practices reveals such a great diversity of approaches and scientific disciplines that the idea of a universal definition of the hypothesis is illusory. The scientific method is recalled in Figure 2.2, and the formulation of the hypothesis appears when it is a question of making a prediction. Here we assume that there is a theory from which we formulate the hypothesis.

In Gonçalves and Porto's article, the aim is to integrate the observed data and the theories (simulated data) into a single framework for reasoning. To this end, the authors assume that we are able to

62 ActiveData: a programming model for managing data life cycles, Simonet et al., 2014

63 Research Data Lifecycle, UK Data Archives

64 Big Data, Computing Edge, IEEE Computer Society, 2015

65 Managing Scientific Hypotheses as Data with Support for Predictive Analytics, B. Gonçalves and F. Porto, 2015


Figure 2.2: Experimental scientific method

extract hypotheses from a computational simulation (to a first approximation, the equations which we inject into a simulator). The outcomes of the simulation are tested and compared with the data observed in the experiment. If they are validated, the researcher may publish new results; otherwise he or she must return to the hypothesis reformulation stage.

One of the key points of this work lies in the management of the simulation data, which by definition are uncertain, unlike raw data such as those of particle physics or astronomy, which are processed on large computing clusters. The uncertainty can come from two sources: incompleteness (missing data) and multiplicity (inconsistent data). Another key point is the way an elementary unit of data is considered. In simulated data management, researchers are more interested in the predictive content of a given datum than in its mere value. The key technologies and tools are then probabilistic databases66,67 and Bayesian statistics68.
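To make the Bayesian side concrete, here is a minimal sketch of treating competing hypotheses as rows of data, each carrying a prior probability, and updating them against an observation with Bayes' rule. The hypotheses and all numbers are invented for illustration; this is not the machinery of Gonçalves and Porto's paper.

```python
# A minimal sketch of Bayesian hypothesis management: each hypothesis is a
# row of data with a prior and a likelihood of the observation. All the
# numbers below are invented.

priors = {"H1": 0.5, "H2": 0.3, "H3": 0.2}          # P(H)
# Likelihood of the observed datum under each hypothesis, P(obs | H),
# e.g. obtained by running each hypothesis through a simulator.
likelihoods = {"H1": 0.10, "H2": 0.40, "H3": 0.25}

evidence = sum(priors[h] * likelihoods[h] for h in priors)         # P(obs)
posteriors = {h: priors[h] * likelihoods[h] / evidence for h in priors}

for h, p in sorted(posteriors.items(), key=lambda kv: -kv[1]):
    print(f"{h}: posterior {p:.3f}")
# H2 now dominates (~0.545): the observation shifts belief between the
# competing hypotheses, which is exactly what treating them as data enables.
```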

Finally, the advantage of Gonçalves and Porto's method also comes from the fact that the number of data items derived from the hypotheses is an order of magnitude smaller than with the traditional method centered on the dimensionality of the raw data. In conclusion to this first example, we can state that the strategy here has been to reduce the volume of data processed in the experimental process.

The second example, which shows the impact of cloud computing on data hosting, concerns energy management in data centers. Various techniques have been developed in recent years to consolidate cloud servers: the idea is to group applications on some machines and turn off the others. Exact or approximate solutions to this problem exist (they belong to the computer science discipline of combinatorial optimization) and they tell you which machines to turn off, and when and where to move the applications which migrate. The solution assumes, moreover, that all servers are equivalent. In fact, it is a global solution which guarantees that the data center as a whole will consume as little as possible.
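Here is a minimal sketch of such a consolidation, using the classic first-fit decreasing heuristic (an approximate method; exact solvers exist but are costlier). The application loads and the server capacity are invented.

```python
# A minimal sketch of server consolidation as bin packing, with the
# first-fit decreasing heuristic. Loads are fractions of one server's
# capacity; the numbers are invented.

def consolidate(loads: list, capacity: float = 1.0) -> list:
    """Pack application loads onto as few servers as the heuristic finds."""
    servers: list = []
    for load in sorted(loads, reverse=True):
        for server in servers:
            if sum(server) + load <= capacity:
                server.append(load)       # migrate the app to this server
                break
        else:
            servers.append([load])        # power on a new server
    return servers

apps = [0.6, 0.5, 0.4, 0.3, 0.2]
placement = consolidate(apps)
print(f"{len(placement)} servers on, the rest can be turned off: {placement}")
```

With these loads, the heuristic keeps two servers powered on instead of five; the machines left empty are the ones the global solution turns off.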

66 Probabilistic databases

67 Probabilistic Databases, D. Suciu et al., 2011

68 Introduction to Bayesian Statistics, 2nd Edition, W. M. Bolstad, 2007

Let us imagine the following situation. In a data hosting center, the servers belong to different clients, so the data center houses a collection of mini data centers, potentially managed by cloud technologies, with physical and software isolation between them. In fact, minimizing the energy consumed by each of the mini data centers separately is, most of the time, not the optimal solution to the energy optimization problem when all servers are considered together.

The question of whether a data center (made up of mini data centers) is efficient in terms of electrical power is therefore debatable, and debated. What is certain is that if you allow the data to migrate to any server, then combinatorial optimization theory explains how to reach the minimum energy needed to run all the hosted applications.
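The following sketch illustrates the gap: packing each tenant's mini data center in isolation versus pooling all applications globally. It redefines the first-fit decreasing heuristic from the previous sketch so it runs standalone; all tenant loads are invented.

```python
# A minimal sketch of why per-tenant ("mini data center") optimization can
# be suboptimal compared to global consolidation. All loads are invented.

def consolidate(loads, capacity=1.0):
    # First-fit decreasing, identical to the previous sketch.
    servers = []
    for load in sorted(loads, reverse=True):
        for server in servers:
            if sum(server) + load <= capacity:
                server.append(load)
                break
        else:
            servers.append([load])
    return servers

tenants = {"tenant_A": [0.5, 0.3], "tenant_B": [0.5, 0.4], "tenant_C": [0.2, 0.1]}

isolated = sum(len(consolidate(loads)) for loads in tenants.values())
pooled = len(consolidate([l for loads in tenants.values() for l in loads]))

print(f"servers with isolated mini data centers: {isolated}")  # 3
print(f"servers with global consolidation:       {pooled}")    # 2
```

Isolation forces each tenant to keep at least one server on, however lightly loaded; pooling lets the same applications share machines, which is the global minimum the paragraph above refers to.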

Perhaps the most energy-efficient cloud and data center would be the cloud of mobile phones, tablets and other home-connected devices, which are known to consume far less than a PC or a laptop. Services could thus be pooled according to a new procedure: your tablet would provide the services you know, but would also host services and data for other people. Researchers call this type of cloud a volunteer cloud69.

Recent work on using lightweight devices instead of large infrastructures has been conducted, involving the deployment of full-text search engines70, running database transactions on mobile phones71, and archiving infrastructures on Raspberry Pi72.

2.2 Hosting and links with ethics and legal commissions
