
Definition 21 For librarians and archivists, digital preservation is a formal process to ensure that authenticated digital information remains accessible and usable. This process includes the planning and allocation of digital resources, the application of preservation methods, and the conservation of the underlying technology.

In the scientific field, data preservation has often been neglected in the past. Indeed, in the mid-twentieth century, the entry into the digital age did not at first fundamentally change the scientific approach. Scientific communities took full advantage of this new tool to move on to ever more precise experiments, which quickly made earlier data batches obsolete. The habit of anticipating data analysis by focusing on the desired result, together with the rapid evolution of technology, did not encourage in-depth reflection on digital preservation in research. With the remarkable exception of astrophysics, scientific communities did not prioritize the preservation and provisioning of digital data when planning their experiments.

The situation changed in the early 2000s for several reasons. The ability to collect data increased dramatically. At the same time, in several disciplines, the complexity of experimental devices also grew rapidly, giving rise to data flows which exceed the capacity for immediate analysis: more data are collected than can be analyzed at once, and these data carry longer-term scientific potential. In addition, scientists increasingly share access to large ("big science") experiments such as the Large Hadron Collider (LHC), which makes these experiments difficult to reproduce, and therefore unique. This is even more true for digital data containing temporal information, such as Earth or ecosystem observation data, or records tracking the position of stellar objects. The loss of such data batches would therefore be irrevocable.

High energy physics has been a pioneer in the collection and massive processing of digital data, as well as in the implementation of complex projects with lifespans beyond ten years. Despite this context, data preservation issues were raised relatively late, at the dawn of the LHC's start-up. The example of astrophysics, organized as a virtual observatory for several decades, became a source of inspiration and an invitation to a broader, interdisciplinary approach.

In the high energy physics community it is customary to operate as described in Figure 2.5, where the same cycle (preparation, data acquisition, analysis) overlaps between different projects of the same experiment. Coordination between the phases is therefore vital.

As pointed out by the CNRS journal in 2016, it is in this context that the interdisciplinary project PREDON, part of the MADICS Research Group, was launched in 2012. Under the leadership of Cristinel Diaconu, director of research at CNRS, the diagnosis is the following: "The explosion of the data volume resulting from the experiments conducted at CERN led us to reflect on their preservation. We first structured our community around an international organization called Data Preservation and Long Term Analysis in High Energy Physics. Subsequently, we realized that many disciplines were facing the same concern. It is from this observation that we came up with the idea of forming an interdisciplinary community around the question of the preservation of scientific data."

Figure 2.5: The high energy physics experimental cycles

Within this forum, the participants exchange questions such as "How can we preserve data in the long term?", "How can we guarantee that we will still be able to read them in ten or twenty years?", or "How can the next generation of researchers understand the archived data?", in order to adapt preservation strategies to their own sector. These questions become all the more important as the presentation of a Data Management Plan (introduced earlier in this document) is already being requested, as part of a pilot project, by the European research and innovation programme Horizon 2020, and this requirement tends to become more widespread.

At the national level, in addition to the high energy physics and astronomy communities, the humanities and social sciences have a very large research infrastructure (TGIR) called Huma-Num to manage the dissemination and preservation of their digital data. This service is offered in partnership with the Centre Informatique National de l'Enseignement Supérieur (CINES), which provides the tools and expertise necessary for archiving. The TGIR Huma-Num develops a technological platform which supports the various stages of the life cycle of digital data. It thus provides a set of services for the storage, processing, exposure, reporting, dissemination and long-term preservation of digital research data in the humanities and social sciences.

We will now discuss, at a general level, the outline and sizing of a preservation service, specifying the relevant vocabulary along the way.

2.3.2 Factors pushing for more preservation

To identify the factors pushing for more preservation, we need to define some notions.

Definition 22 Cyber-infrastructures can be defined as the coordinated set of information technologies and systems (including experts and organizations) enabling work, leisure, research and education in the digital information era. They integrate advanced data acquisition, data storage, data management, data integration, data mining, data visualization, and other computer science or information processing services.

The development of cyber-data infrastructures is strongly shaped by current and projected use cases, and by our need to search, analyze, model, mine and visualize digital data. More broadly, the world of cyber-data infrastructures is influenced by trends in technology, economics, politics and law. Four significant trends characterize the environment in which cyber-data infrastructures are evolving:

• Data volumetry.

More digital data is created than there is storage volume to receive it. Some articles91 highlighted this in 2007: at that crossover point, the volume of data was estimated at around 264 exabytes (264 × 10¹⁸ bytes), which is nearly a million times the amount of digital data hosted in 2008 by the American Library of Congress, considered to be the largest library in the world.

• Policies and regulations.

More and more national policies and regulations require the access, management and preservation of digital data. In the United States, the Sarbanes-Oxley Act of 2002 promotes responsible and appropriate management of records for preservation. A 1996 law specifies the responsibility of public agencies to store financial and other information digitally. The Health Insurance Portability and Accountability Act ensures the confidentiality of digital medical records. Already during this period, on the research front, some National Institutes of Health staff were required to submit digital copies of their publications to PubMed Central92.

• Ever lower storage costs.

SSDs of several terabytes are now available, for example from the manufacturer SanDisk93. Western Digital announced on September 20th, 2016 a 1 TB prototype94 disk in SDXC technology, a format aimed at the general public and introduced in 2009.

• Commercialization of storage and digital data services.

The introduction of Amazon Simple Storage Service in 2006 is just one example. This service, which the general public can use online, is now known as Amazon S3.

Nowadays, there is even a free tier for S3, which includes 5 GB of storage, 20,000 GET requests (retrieving data) and 2,000 PUT requests (depositing data). An interesting aspect of the S3 service is that it can be coupled with many other Amazon services, for example EC2, which allows computation on the stored data. Again, there is a free mode, which includes 750 hours per month of Linux and Windows t2.micro instances for one year. To stay within the free tier, you only need to use micro EC2 instances95. A short illustrative sketch of these GET and PUT operations is given after this list.

• Virtualization of software and services.

A very important factor in the long-term preservation of data is preserving the capacity to read and decode it, and hence the need to preserve the reading software. Indeed, scientific data must meet speed and efficiency requirements which often go beyond a simple file on disk, and require processing by complex software together with the management of ancillary data sets (metadata) necessary for decoding and exploiting these data. The democratization of such software resources (especially via virtual machines) makes it easier to preserve the computing ecosystems typical of experimental science.
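As a purely illustrative sketch of the S3 operations mentioned in the list above (PUT and GET requests, and the coupling with services such as EC2), the following Python fragment uses the boto3 library. The bucket name, file names and locally configured credentials are assumptions introduced for the example, not elements of the services described above.

    # Minimal sketch, assuming the boto3 package, locally configured AWS credentials
    # and a pre-existing bucket named "example-preservation-bucket" (all hypothetical).
    import boto3

    s3 = boto3.client("s3")

    # PUT request: deposit a data file into the bucket
    # (counted against the free tier's 2,000 PUT requests per month).
    s3.upload_file("measurements.csv", "example-preservation-bucket", "raw/measurements.csv")

    # GET request: retrieve the same object later, for example from an EC2 instance
    # (counted against the free tier's 20,000 GET requests per month).
    s3.download_file("example-preservation-bucket", "raw/measurements.csv", "measurements_copy.csv")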

2.3.3 Sizing a preservation service

One key question is who is responsible for facilitating the preservation of digital data. It is noteworthy that the term "value" means different things to different people. It is also noteworthy that facilitating data preservation involves hosting multiple copies of the same data, migrating data

91 Got Data? A Guide to Data Preservation in the Information Age

92 PubMed Central

93 SanDisk

94 Premier disque SDXC de 1 To, Western Digital (First 1 TB SDXC disk, Western Digital)

95 Types d’instances Amazon EC2 (Amazon EC2 instance types)


from one generation of storage media to another to ensure sustainability, and finally protecting its integrity and authenticity.

To build a cyber-data infrastructure, it is necessary to distinguish the different uses of the data as well as the different preservation scenarios. Figure 2.6 depicts a model for the entire spectrum of preservation. The arrow to the left of the pyramid denotes properties of data collections. The arrow to the right of the pyramid denotes properties of the infrastructures.

At the top of the pyramid are the data valuable to society in general, where the operators are mainly institutions of public interest (such as government agencies, libraries, museums and universities).

In the middle of the pyramid are data of specific value to a particular community. Examples include digital records from your local hospital or scientific research data stored in community repositories. At the bottom of the pyramid are everyone's data: photos, text documents and so on. Here there is a need for additional primary sites to store users' individual and private collections.

The creation of an economically feasible data pyramid must also be accompanied by continuous research leading to solutions which meet the challenges of data management and preservation. Such research may then make it possible to use stored data and to create new knowledge from it. For example, mining and exploiting data depends on how the data are organized and on which additional information (metadata) is associated with them. It is therefore important, at this point of the analysis, to work with the communities.

Archival appraisal refers to the process of identifying the records and other documents to be preserved by determining their value. Several factors must generally be taken into account when making this decision. It is a difficult and critical process because the selected records will shape researchers' understanding of the records, or fonds.

Archival appraisal can be performed once or during the various acquisition and processing steps. A macroscopic appraisal, i.e. a high-level functional analysis of the records, can be done even before their acquisition, in order to determine which records to acquire.

Iterative appraisal can also be performed while the records are being processed.

Also noteworthy is the organization and monitoring of large amounts of data, particularly scientific data. Indeed, a rigorous, library-style organization, with professional indexing and safekeeping of the data batches resulting from an experiment, remains an essential factor for the durability of the data. Nevertheless, close connection with the scientific communities and organized availability are essential to decide whether it is useful to store "poor quality" data in the long term.

2.3.4 International community initiatives

To standardize the practice of digital preservation and to provide a set of recommendations for the implementation of a preservation program, the Open Archival Information System96 (OAIS) reference model has been developed. OAIS deals with all the technical aspects of the life cycle of a digital object: ingestion, archival storage, data management, administration, access, and preservation planning. The model also addresses metadata issues and recommends that five types of metadata be attached to a digital object: reference information (identification), provenance (including conservation history), context, fixity (authenticity indicators), and representation (format, file structure).
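As a purely illustrative sketch of the five metadata categories listed above, the following Python fragment groups them around a digital object. The field names and values are hypothetical and do not come from the OAIS specification itself, which defines much richer information packages.

    # Illustrative sketch only: the five OAIS metadata categories grouped around
    # a digital object. Field names and values are hypothetical placeholders.
    from dataclasses import dataclass

    @dataclass
    class OAISMetadata:
        reference: dict       # identification, e.g. a persistent identifier
        provenance: dict      # origin, including conservation history
        context: dict         # relations to other objects and to the environment
        fixity: dict          # authenticity indicators, e.g. checksums
        representation: dict  # format and file-structure information needed to decode

    record = OAISMetadata(
        reference={"identifier": "doi:10.1234/example-dataset"},
        provenance={"producer": "Example experiment", "ingested": "2016-09-20"},
        context={"related_to": ["doi:10.1234/companion-paper"]},
        fixity={"sha256": "placeholder-checksum"},
        representation={"format": "FITS", "specification": "FITS standard 3.0"},
    )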

96 Open Archival Information System

97 International Research on Permanent Authentic Records in Electronic Systems

Figure 2.6: Lexical field of preservation. According to F. Berman, 2008.98

International Research on Permanent Authentic Records in Electronic Systems97 (InterPARES) is a collaborative research initiative led by the University of British Columbia, which focuses on the long-term preservation of authentic digital records. The research is conducted by different groups, from various institutions in North America, Europe, Asia and Australia, with the goal of building the theories and methodologies which will form the basis for the strategies, standards, policies and procedures necessary to ensure the reliability and accuracy of digital records over time.

2.3.5 Specific tools and methodologies

Digital Repository Audit Method Based On Risk Assessment99 (DRAMBORA), presented by the Digital Curation Centre (DCC) and DigitalPreservationEurope (DPE) in 2007, proposes a methodology and a toolkit for digital repository risk assessment. The tool allows the evaluation to be carried out either internally (self-assessment) or by an external party.

The DRAMBORA process is organized in six stages and focuses on the definition of the mandate, the identification of risks, and the assessment of the probability and potential impact of those risks. The auditor is required to describe and document the role, goals, policies and activities of the entity under review, in order to identify and assess the risks associated with these activities and to define appropriate measures to manage them. A small sketch of this probability-and-impact scoring is given below.
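The following Python fragment is only an illustration of scoring risks by probability and potential impact, in the spirit of the assessment described above; the risk names, the 1-to-5 scales and the simple product used as a severity score are assumptions for the example, not the official DRAMBORA toolkit.

    # Illustrative sketch only: scoring hypothetical repository risks by
    # probability and potential impact, then ranking them by severity.
    risks = [
        {"name": "storage media obsolescence",   "probability": 4, "impact": 5},
        {"name": "loss of format documentation", "probability": 3, "impact": 4},
        {"name": "insufficient funding",          "probability": 2, "impact": 5},
    ]

    for risk in sorted(risks, key=lambda r: r["probability"] * r["impact"], reverse=True):
        severity = risk["probability"] * risk["impact"]   # product of two 1-to-5 scales
        print(f'{risk["name"]}: severity {severity}')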

In 2002, the project Preservation and Long-term Access through Networked Services (PLANETS), part of the EU Framework Programme for Research and Technological Development, began addressing the challenges of preservation. The primary goal of PLANETS was to provide services and practical tools to help ensure the long-term preservation of digital cultural and scientific assets. The project was completed in 2010 and continued in the form of the Open Planets Foundation, renamed the Open Preservation Foundation100 from 2014 onwards to reflect the foundation's orientations.

In France, the Centre Informatique National de l'Enseignement Supérieur101 (CINES) is pursuing

98 Got Data? A Guide to Data Preservation in the Information Age, Communications of the ACM, 2008

99 Digital Repository Audit Method Based On Risk Assessment

100 Open Preservation Foundation

101 Centre Informatique National de l’Enseignement Supérieur (the National Computer Center for Higher Education)


lobbying actions with players in the storage market, so that their file formats can still be read several years from now. A team of engineers is engaged in a permanent race against obsolescence, verifying which software and hardware can still access the data over time. The FACILE102 platform lists the range of formats currently supported by CINES.

To meet this need for exchange and preservation of data, the international astronomical community has structured itself around the Virtual Observatory, a set of services which gives researchers access to useful information from all open astronomical data through a directory and shared standards.

The Strasbourg Astronomical Data Center103 (CDS) is a data center dedicated to collecting and distributing astronomical data around the world. It hosts the worldwide reference base for the identification of astronomical objects, and its missions consist of:

• collecting useful information concerning astronomical objects, in computerized form;

• distributing this information to the international astronomical community;

• conducting research using this data.

The CDS was created in 1972 by the National Institute of Astronomy and Geophysics (INAG), which has since become the National Institute of Sciences of the Universe (INSU), in agreement with the University Louis Pasteur, which has since become the University of Strasbourg.

The main services of the CDS are Simbad, the reference database for the identification and bibliography of astronomical objects (outside the solar system); VizieR, which collects the astronomical catalogues and tables published in academic journals; and Aladin, an interactive sky atlas for viewing astronomical images from ground-based and space observatories or supplied by users, together with data coming from the CDS or other databases, such as the NASA/IPAC Extragalactic Database (NED).
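As a hedged illustration of how these services are typically queried programmatically, the following Python sketch interrogates Simbad and VizieR through the astroquery package (assumed to be installed); the object name is only an example.

    # Minimal sketch, assuming the astroquery package is installed.
    from astroquery.simbad import Simbad
    from astroquery.vizier import Vizier

    # Identification and basic data for an object outside the solar system.
    simbad_result = Simbad.query_object("M31")   # the Andromeda galaxy
    print(simbad_result)

    # Published catalogues and tables mentioning the same object.
    vizier_result = Vizier.query_object("M31")
    print(vizier_result)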

Finally, the CDS staff standardize documents into long-lasting formats for their preservation and assign to them the metadata recognized by the discipline. This metadata can be present in the received documents (FITS files, Flexible Image Transport System) or added by CDS reference librarians, illustrating the multidisciplinary nature of the work. These data are then duplicated on eight mirror sites located in different countries.
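To illustrate the metadata embedded in FITS files, the following sketch reads the header of a hypothetical local file with the astropy library; the file name and the keywords shown are assumptions, since FITS headers vary from one observatory to another.

    # Minimal sketch, assuming the astropy package and a local file named "image.fits".
    from astropy.io import fits

    with fits.open("image.fits") as hdul:
        header = hdul[0].header          # metadata of the primary HDU
        print(header.get("TELESCOP"))    # telescope keyword, if present
        print(header.get("DATE-OBS"))    # observation date, if present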

These examples show the need for a structured and ongoing approach within each scientific project and program, as illustrated in Figure 2.5, but above all the need for structuring within communities and research organizations. Indeed, the preservation of scientific data cannot be the sole responsibility of scientific projects, because their temporal structure does not allow a longer-term vision of the fate of the data. The successes mentioned above are based on specialized structures (dedicated data centers, virtual observatories), in close connection with the scientific communities, following policies defined at national and especially international levels.

2.3.6 Complementary readings

Regarding the data generated by the research community, the CNRS (Centre National de la Recherche Scientifique) has structured its thinking on data under the aegis of committees, for example the MICADO committee104, and will soon publish a white paper entitled "Livre Blanc sur les Données au CNRS : État des Lieux et Pratiques – Mission « Calcul et données » (MICADO)". In France, again, a new version of the brochure "Ouverture des données de recherche – Guide d’analyse du

102 Plateforme FACILE (the FACILE platform)

103 Centre de données astronomiques de Strasbourg (Strasbourg Astronomical Data Center)

104 http://orap.irisa.fr/wp-content/uploads/2016/10/ORAP-OCT-mdayde-2016.pdf

cadre juridique en France" is in preparation. The interested reader may find a previous version through an online search. This short guide is dedicated to legal issues.
