
Bioinformatics can be defined as an interdisciplinary field that uses computational approaches as well as statistical methods to store, analyze, and visualize biological data [2]. It lies at the intersection of molecular biology, computer science, information science, and mathematics. The use of bioinformatics resources and tools became a necessity for omics research, whether in the fields of genomics, transcriptomics, proteomics, or glycomics. To be understandable and exploitable, the ever-growing amount of biological data (Figure 1.1) generated by high-throughput technologies requires suitable analytical resources and computer software. The scope of bioinformatics notably includes: (i) data storage, annotation, and integration in databases; (ii) data sharing and dissemination between scientists and resources; and (iii) data processing, interpretation, and visualization by computer software and workflows. In this section, we present some of the core bioinformatics concepts that were directly or indirectly used in this thesis, while mentioning related challenges that are faced nowadays.

Figure 1.1: Data growth at EMBL-EBI by experimental platform. Figure courtesy of C.E. Cook et al. [3].

1.1.1 Data and Databases

As a direct consequence of being generated by a multitude of experimental methods and instruments, biological data distinguishes itself by its high heterogeneity [4]. Distinct high-throughput technologies emerged as references in the different omics fields, such as next-generation sequencing in genomics or mass spectrometry in proteomics and glycomics. To reliably store the raw data generated by biological experiments, distinct data formats were designed. While all formats for a given experimental approach tend to share the same core information, they can differ significantly in terms of metadata. Metadata is the supplementary information that describes the data, encapsulating all the relevant attributes of an experiment: the materials, methods, instrumentation, and additional information used to generate the data. To define the minimum information that data formats should contain, collaborative information standards were successively established for each experimental approach. These specifications


ensure experimental result interpretability by regulating the mandatory metadata information and its formatting. Following the successful example of a guideline regulating the minimum information about a microarray experiment (MIAME) [5], released in 2001, numerous other minimum information standards were introduced to the scientific community. Nowadays, all main high-throughput omics techniques have their own standard, and these standards have been coordinated by the minimum information about a biomedical or biological investigation (MIBBI) [6] project since 2008.

The minimum information about a proteomics experiment (MIAPE) [7] was a notably important initiative, given the high heterogeneity and complexity of mass spectrometry data in the field of proteomics.
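The checklist idea behind these minimum-information standards can be sketched as a simple metadata validation step: a record is compliant only if every mandatory field is present and non-empty. The field names below are illustrative assumptions in the spirit of MIAME/MIAPE, not taken from the actual specifications.

```python
# Hypothetical sketch of a minimum-information checklist for experiment
# metadata. The required fields are illustrative, not the real MIAPE list.

REQUIRED_FIELDS = {
    "instrument",        # e.g. mass spectrometer model
    "sample_source",     # biological material analyzed
    "acquisition_date",  # when the data was generated
    "protocol",          # description of the experimental method
}

def missing_metadata(record: dict) -> set:
    """Return the mandatory fields that are absent or empty in a record."""
    return {field for field in REQUIRED_FIELDS if not record.get(field)}

record = {
    "instrument": "Orbitrap (example)",
    "sample_source": "human plasma",
    "protocol": "",  # present but empty, so still non-compliant
}

print(sorted(missing_metadata(record)))  # ['acquisition_date', 'protocol']
```

A real validator would additionally check field formats (dates, controlled vocabularies from ontologies) rather than mere presence, which is precisely the role the standards' formatting rules play.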

To be easily accessible and exploitable by researchers, the knowledge extracted from biological data needs to be organized and stored in databases. According to their content, biological databases have historically been classified as either primary or secondary databases. Primary databases, also referred to as archival databases (or repositories), consist of sequence or structural data directly extracted from experimental results. Secondary databases, also referred to as curated databases, consist of data derived from the processing of the information stored in primary databases [8]. These databases rely on manual annotation by experts and computational algorithms to reach a high level of data curation. The information they provide is often derived from multiple sources while using ontologies as regulatory frameworks for data representation. Some databases can also be hybrid and present characteristics of both a primary and a secondary database at the same time.

Biological databases are also frequently classified based on their scope. A database is thus deemed universal when it includes many species, or specialized when it focuses on a single one. A multitude of databases emerged through the years to respond to the needs of modern biological research. While some act solely as repositories enabling the dissemination of raw experimental data, others have developed a panel of computer tools to access and further analyze their data. The 27th release of the NAR online molecular biology database collection [9] reported a total of 1,637 databases in 2019.

1.1.2 Experimental Reproducibility Crisis

In a 2016 survey by the journal Nature [10], a majority of the 1,576 interviewed researchers (52%) agreed that science is currently undergoing a significant reproducibility crisis. Almost all of the respondents (90%) admitted that at least a slight crisis is occurring. An alarming number of experiments fail to be reproduced, and scientists can even struggle to replicate their own results. While the existence of a crisis is almost unanimously acknowledged and has sparked a lot of attention in recent years, scientists are more divided on the factors that led to its inception. The reproducibility crisis is a complex, multifaceted problem influenced by many causes, some of which are specific to each field of application. Among the many factors involved, selective reporting and the pressure to publish are deemed to be the biggest contributors to the crisis [10]. More technical causes are also implicated, including poor experimental design and the unavailability of raw data, methods, or software. Poor statistical analysis is also a key factor, which worsened with the significant decrease in the production cost of data and its consequent large-scale generation [11]. Furthermore, the use of misidentified or contaminated cell lines is known to further hinder the reproducibility of experimental results [12].
