


Many figures circulate on the volume of data produced, collected and analyzed daily, and everyone agrees that this volume will grow exponentially. For example, the firm IDC [18] estimated as early as 2011 that the amount of data produced and shared on the Internet would reach 8 zettabytes in 2015. Recall that 1 zettabyte = 1000 exabytes = 1 million petabytes = 1 billion terabytes! This flood of information has also been perceptible for more than a decade in the field of scientific research, and it is now clear that the volume of data collected over the next few years will exceed that collected during the previous centuries [19].
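
These orders of magnitude are easy to check mechanically. Below is a minimal Python sketch (ours, not from the source) that verifies the unit conversions stated above and expresses the 2011 IDC estimate in terabytes:

    # Decimal (SI) storage units, expressed in bytes.
    TB = 10**12  # terabyte
    PB = 10**15  # petabyte
    EB = 10**18  # exabyte
    ZB = 10**21  # zettabyte

    # 1 ZB = 1000 EB = 1 million PB = 1 billion TB, as stated above.
    assert ZB == 1000 * EB == 10**6 * PB == 10**9 * TB

    # The 2011 IDC estimate: 8 zettabytes in 2015, i.e. 8 billion terabytes.
    print(8 * ZB // TB)  # 8000000000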

Big data characterizes the research data of many scientific domains (genomics, astronomy, climate science, etc.), which reached terabyte and even petabyte volumes in 2015 for a single experiment.

18 International Data Corporation

19 The data deluge: an e-Science perspective

Big data is defined in many different ways in the literature. Here we refer back to the firm IDC [20]:

Definition 9 Big data is presented by our American colleagues as "a new generation of technologies and architectures designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis". (IDC consulting firm)

It was in 2001 that the scientific community began describing the challenges inherent in the growth of data as three-dimensional, through the rule known as "the 3Vs" (volume, velocity and variety), later extended to "the 5Vs" (volume, velocity, variety, veracity and value) or more, since specialists disagree on the number of Vs to take into account, which depends on the discipline. The common definitions associated with these dimensions are the following (a small illustrative sketch follows the list):

• Volume - Refers to the overall size of the data created by the different participants or equipment of the ecosystem;

• Variety - Refers to the number of individual systems and/or types of data used by the participants; in practice, multiple data formats within the same system are problematic;

• Velocity - Represents the frequency with which data is generated, captured, shared and updated;

• Value - Refers to our ability to transform the created data into value, whether those data are initial (and thus modifiable) or immutable (non-modifiable);

• Veracity (or truthfulness) - Refers to the fact that initial data may be unintentionally modified during their life cycle, and to the way immutable data are managed; both affect the quality of the data, so the reliability of the resulting information is at stake.
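
To make these dimensions concrete, they can be thought of as a descriptor attached to a data set. The sketch below is purely illustrative (the class and field names are ours, not from the literature) and records one measurable attribute per V:

    from dataclasses import dataclass

    @dataclass
    class FiveVProfile:
        """Illustrative descriptor of a data set along the 5Vs."""
        volume_bytes: int          # Volume: overall size of the created data
        formats: list[str]         # Variety: data formats / source systems
        updates_per_second: float  # Velocity: generation and update rate
        value_score: float         # Value: estimated exploitable value, 0..1
        veracity_score: float      # Veracity: confidence in the data, 0..1

    # Example: a fictitious genomics experiment of the petabyte era.
    profile = FiveVProfile(
        volume_bytes=2 * 10**15,   # 2 petabytes for a single experiment
        formats=["FASTQ", "BAM", "VCF"],
        updates_per_second=5000.0,
        value_score=0.7,
        veracity_score=0.9,
    )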

In Figure 1.1, terms other than the five initial ones appear; they specify certain goals related to the stages of the data life cycle. As an example, consider the terms Terabytes, Records/Arch, Transactions, Tables and Files. All these terms refer to issues of storage, representation (Tables, Files) and the access model (Transactions). In such a framework, the goal of big data is to provide powerful means to store, represent and access data.

Figure 1.2 is a refinement of the notions of big data and comes from a report by NIST [21]. In this colored figure, each of the five colors emphasizes a definition, a challenge or a goal of big data. It opens the discussion on the major scientific and technical issues arising in this field, particularly the issues of new data models, of analysis, and of functional and technical architectures and tools. The density of the terms shows that the issues behind the big data terminology must be considered from multiple angles. In what follows, we focus more specifically on the architectural aspects of the systems, as they are directly related to our topic of data hosting.

1.3.1 Data management in the field of research

Today, the dimensions of big data mean that data management in the research sector is increasingly an integral part of the research project itself.

20 Big data Analytics: Future Architectures, Skills and Roadmaps for the CIO, IDC, 2011

21 National Institute of Standards and Technology

Figure 1.1: 5Vs of big data

Figure 1.2: Categorization of the terms and goals of big data (from NIST)


Data management requires rigorous organization, planning and monitoring throughout the life of the project and beyond, in order to ensure the data's sustainability, accessibility and reuse. Data management in research fulfills the following objectives [22]:

• it increases research efficiency by facilitating access to the data and their analysis, whether for the researcher who conducted the research or for any other new researcher;

• it ensures the continuity of the research through the reuse of data, while avoiding duplicated effort;

• it promotes wider distribution and increases impact: the research data are properly formatted, described and identified so as to preserve their long-term value;

• it ensures the integrity of the research and the authentication of the results: accurate and comprehensive research data also allow the reconstruction of the events and processes which led to these results;

• it reduces the risk of loss and enhances data security through the use of robust and resilient storage devices. Note, however, that these problems are not only technical in nature; we will examine later the various participants working to secure information systems, and the challenge is then to make all these participants work together;

• it accompanies the current evolution of publishing: scientific journals increasingly request that the data underlying a publication be shared and deposited in an accessible data facility. As a result, managing research data eases the submission process to scientific journals, which relies on documented data sets;

• it satisfies the funders' conditions for financing the project: funders are increasingly interested in the data produced during a project and often make their funding conditional on the opening of these data, so that they are freely accessible and free of charge;

• it testifies to accountability: by managing your research data and making them available, you demonstrate the responsible use of public research funding.

1.3.2 Data Management Plan and data life cycle

Good data management requires the development of a Data Management Plan (DMP) and must take into account all the stages of the data life cycle.

Definition 10 The data management plan is a formal document explaining how data are obtained, documented, analyzed and used during and after a research project. It describes how the data are produced, described, stored and distributed.
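
As an illustration only, the items of Definition 10 can be sketched as a minimal machine-readable outline; the section names below are ours, and real DMP templates (such as those cited hereafter) are far more detailed:

    # A minimal DMP outline mirroring Definition 10: how data are obtained,
    # documented, analyzed, used, produced, described, stored and distributed.
    dmp = {
        "project": "EXAMPLE-PROJECT",  # hypothetical project name
        "collection": "how the data are obtained and produced",
        "documentation": "how the data are documented and described",
        "analysis": "how the data are analyzed and used",
        "storage": "where and on what media the data are stored",
        "distribution": "how and under what conditions the data are shared",
        "preservation": "what is kept after the project, and for how long",
    }

    for section, description in dmp.items():
        print(f"{section}: {description}")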

Research operators, for example at the European level, ask researchers to build a data management plan and integrate it into their scientific project [23]. Very soon there will be an obligation, even at the national level, to provide this type of document in response to each call for projects.

R You can draw inspiration from pre-existing DMP models, such as the one produced by the Common Documentation Service (SCD) of Paris Descartes and the Bureau of Archives and the Support Directorate for Research and Innovation (DARI) of Paris Diderot [24], or develop your own data management plan using online tools [25].

22 Pourquoi gérer les données de la recherche?, Cirad, 2015 (Why manage research data?, Cirad, 2015)

23 Le libre accès aux publications et aux données de recherche, Portail H2020 (Open Access to Publications and Research Data, H2020 Portal)

24 Réaliser un plan de gestion de données, A. Cartier et al., 2015 (Drawing up a data management plan, A. Cartier et al., 2015)


Several representations of the research data life cycle exist.

Definition 11 Schematically, the life cycle of a datum corresponds to the period extending from the datum's design to its use, and until its destruction or its preservation for historical or scientific purposes.

According to the UK Data Archive centre [26], specialized in social sciences research data, the life cycle of research data comprises six main steps (a minimal illustrative sketch follows the list):

• the creation of the data

– definition of the research technical baseline
– implementation of a data management plan
– data localization
– rendering the data anonymous and description of the data
– data management and storage
– establishment of protections (copyright vs. copyleft)
– promotion of the data via an open internet platform

• the reuse of the data

– monitoring and reviewing the research
– new research from the data
– cross-referencing the data with data from other domains
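
By way of illustration, the six steps named by the UK Data Archive lifecycle (creating, processing, analysing, preserving, giving access to, and re-using data) can be encoded as an ordered enumeration. The sketch below is our own simplification; in particular, the loop from re-use back to creation reflects the idea, stated in the list above, that reused data feed new research:

    from enum import Enum

    class Stage(Enum):
        """The six steps of the UK Data Archive research data lifecycle."""
        CREATING = 1
        PROCESSING = 2
        ANALYSING = 3
        PRESERVING = 4
        GIVING_ACCESS = 5
        REUSING = 6

    def next_stage(current: Stage) -> Stage:
        """Advance to the next step; re-use loops back to creation."""
        return Stage(current.value % 6 + 1)

    # Walk a data set once around the whole cycle.
    stage = Stage.CREATING
    for _ in range(6):
        print(stage.name)
        stage = next_stage(stage)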

In sub-section 2.1.2, we will present a purely functional view of a model of the data life cycle in the e-sciences (sciences relying on digital technology, such as the cloud and big data).

25 DMP Tool

26 Research Data Lifecycle, UK Data Archives
