
6.2 Technical Discussion

6.2.3 Docker Containerization

Docker is a free and open-source platform providing tools to automate and facilitate the development, deployment, and execution of computer software. Using Docker, applications are packaged into containers bundling all the required code, dependencies and configuration files. This isolates the running processes from their environment, ensuring they will behave consistently regardless of where they are deployed.

Multiple containers can thus be launched in parallel on the same server without the risk of having dependency or configuration conflicts. The deployment is performed using a single command line, without having to consider pre-installation requirements other than the Docker daemon itself. This simplified procedure makes it easy to move any application from a given computing environment to another, which can prove particularly useful when deploying from test to production. The containers are created from Docker images, which are files representing snapshots of the file system once all the necessary system libraries and tools have been installed.

The content of a Docker image is specified by a Dockerfile, a small script file containing all the command instructions required to build the image. Containers are instantiated from images (much as objects are instances of classes in object-oriented programming), and multiple containers can be launched from the same image. This makes it easy to scale an application by running more instances of the same container under the control of a load balancer.
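To illustrate, a minimal Dockerfile for a Java web application deployed on a Tomcat application server, in the spirit of the back ends discussed below, could look as follows. The base image tag and WAR file name are assumptions made for this sketch, not the actual files used in this work.

# Start from an official Tomcat image bundling a Java runtime (tag chosen for illustration)
FROM tomcat:9-jdk11
# Copy the packaged web application into Tomcat's deployment directory
# (the WAR file name is a placeholder)
COPY target/application.war /usr/local/tomcat/webapps/ROOT.war
# Document the port on which Tomcat listens
EXPOSE 8080
# Run Tomcat in the foreground so the container keeps running
CMD ["catalina.sh", "run"]

The corresponding image is built with docker build -t application . and any number of containers can then be launched from it with, for example, docker run -d -p 8080:8080 application.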

Docker is made possible by two features of the Linux kernel called namespaces and cgroups, which respectively enable the isolation of resources per process and the limitation of the amount of resources allocated per process. On Linux, Docker can run without the use of virtual machines, making it lightweight in terms of the hardware resources required to function. On the Windows and Mac operating systems, the Docker engine runs inside a Linux virtual machine in order to access the features provided by the Linux kernel. Of note, Docker does not work on the Home version of Windows 10, as the Microsoft Hyper-V hypervisor handling virtual machines is not available on this platform. This, however, is not much of an issue, as most servers run Linux, and the Enterprise version of the Docker engine supports Windows servers.
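For instance, the cgroup-based limits are exposed directly through the Docker command line; the values below are arbitrary and the image name is a placeholder, serving only to illustrate the mechanism.

# Launch a container restricted to 512 MB of memory and a single CPU core,
# limits that the kernel enforces through cgroups (image name is a placeholder)
docker run -d --memory=512m --cpus=1 --name limited-app application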

The use of Docker in the context of this thesis was prompted by the instability of some applications that are part of the Glycomics@ExPASy initiative [133]. The GlyConnect [114], SugarBind [134] and UniCarb-DB [135] databases all inherited the same common architecture: they were developed using the Play framework, which requires the installation of Scala and its SBT build tool in addition to a Java runtime environment. Each of the three databases needs a specific combination of dependency versions in order to be compiled and launched. As all the database applications were deployed on a single server, the different versions of the same dependencies were seemingly causing conflicts that made them periodically crash. In order to resolve the issue, we ported the applications to Docker by writing a specific Dockerfile for each of them. Using the Docker Compose tool, we made sure that the Docker daemon would automatically restart any container that shut down unexpectedly, in case they were still unstable. With this newly acquired experience, I additionally ported the CLASTR and GlyConnect Compozitor tools to Docker. As they are fairly similar in design, the same Docker architecture (Figure 6.2) was used to deploy them on the SIB ExPASy server [116].

Figure 6.2: Scheme of the Docker implementation shared by CLASTR and GlyConnect Compozitor, as deployed on the ExPASy server. The two blue boxes represent the Docker containers: one for the NGINX web server handling the front end HTML/CSS and JavaScript files, and another for the Tomcat application server handling the back end written in Java. The two containers communicate by sharing the same network in bridge mode, as represented by the green box. Only the NGINX container is directly accessible from the exterior.
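A minimal Docker Compose file reproducing this two-container layout could take the following form. The service names, image tags and published port are assumptions made for this sketch; the restart policy corresponds to the automatic restart behaviour described above.

version: "3"
services:
  # Front end container serving the HTML/CSS and JavaScript files;
  # only this service publishes a port to the outside
  nginx:
    image: nginx:stable
    ports:
      - "80:80"
    restart: always
    networks:
      - app-net
  # Back end container running the Java application, reachable only
  # from the NGINX container through the shared bridge network
  tomcat:
    image: tomcat:9-jdk11
    restart: always
    networks:
      - app-net
networks:
  # User-defined network in bridge mode shared by the two containers
  app-net:
    driver: bridge

Running docker-compose up -d starts both containers, and the restart policy instructs the Docker daemon to relaunch any container that exits unexpectedly.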

The interest of Docker for bioinformatics, and more generally for research in science, is twofold. In addition to the containerization of applications detailed above, Docker can be used to run workflows and pipelines in containers. Thus, this technology can contribute to the improvement of the reproducibility of scientific research [136].

When an analysis workflow featured in a scientific publication uses Docker (or any other containerization system, such as Singularity or OpenVZ), it enables other researchers to easily reproduce the presented results. For instance, all the data and plots presented in the study featured in Chapter 3 can be regenerated by anyone using a single Docker command in a console. The tools and scripts used to generate the results of an article are not always available in the corresponding supplementary material. Additionally, setting up a whole analysis workflow can be time-consuming and require computer skills that wet lab researchers may be lacking. Dockerfiles being significantly smaller than Docker images, they represent good candidates for inclusion in the supplementary information. One downside is that Dockerfiles rely on the availability of their dependencies in order to build images.

There is no strict guarantee that a Dockerfile will still build after a prolonged period of time, even if this is unlikely when using trusted dependency providers (e.g., Google APIs, Cloudflare or Apache Maven). As such, there is an interest in also providing the corresponding image to ensure sustainability, even though images can weigh several gigabytes as a result of being file system snapshots. In addition to the reproduction of experimental results, workflow containerization also makes it possible to apply the same methodology to new data sets with little effort.
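To make these points concrete, the reproduction of a containerized workflow and the distribution of the corresponding image rely on standard Docker commands of the following form; the image, archive and directory names are placeholders.

# Regenerate the published results with a single command, collecting the
# output in a local directory (image name and paths are placeholders)
docker run --rm -v "$(pwd)/results:/output" example/analysis-workflow

# Export the built image to a single archive that can accompany a publication
docker save -o analysis-workflow.tar example/analysis-workflow

# Restore the image on any machine running Docker, without rebuilding it
docker load -i analysis-workflow.tar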

Chapter 7

Conclusion

In the previous chapters, we focused on the work achieved in the course of this thesis. We discussed the problems that were encountered in the different biomolecular disciplines we covered while detailing the computer software solutions that were designed to overcome them. In this final chapter, we conclude by listing some of the tasks that can still be undertaken to further improve what we accomplished, as both omics and computer technologies are constantly evolving.

7.1 CLASTR

Since its public release on the SIB ExPASy server [116] in March 2019, the CLASTR tool has been regularly updated to fix minor issues and implement new functionalities. For example, the Microsoft Excel XLSX format was integrated as an export choice in both the web interface and the RESTful API. While not a standard file format, it remains commonly used by researchers and allows keeping the same color code and comments as those proposed in the application web interface.

As originally planned, the species support was additionally extended to mouse and dog cell lines, which are the only species other than human for which STR markers have been selected. Overall, CLASTR has been particularly stable and provides the scientific community with a reliable service to perform STR similarity searches. As a result, there is little room for improvement in the current tool version (1.4.3). The main possible enhancements lie in the Cellosaurus data itself, with the integration of new STR profiles and the annotation of cell line entries when new cases of misidentification or cross-contamination are reported.

Nonetheless, there is currently no standard data format for the exchange of STR profiles between the different stakeholders, that is, the laboratories carrying out STR profiling, the researchers, the journals and online resources such as the Cellosaurus. The existence of such a standard would strongly benefit and promote the sharing of STR profiles between data producers and consumers. The implementation of standardized file formats to report experimental results has been successfully achieved in numerous omics disciplines [137], and in vitroomics [138] would benefit from such a development. This new standard format should be designed to enable storing and regulating metadata information that is relevant to the generation of an STR profile. It could notably contain annotation information about the data producer identity, the cell line characteristics and culture conditions, the profiling methodology, and the dates of submission and analysis. The data core would consist of one or more STR profiles, indicating the STR loci and the corresponding alleles. In terms of architecture, the file format could be either text or XML based, as is commonly the case in bioinformatics. One advantage of the XML format is the ability to validate that a file is compliant with the standard through the use of XSD files describing the schema. Once the standard specifications have been defined, the file parser of CLASTR would have to be extended to support the extraction of STR profile information from the file and its loading into the input form.
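Purely as an illustration of what such a standard could look like, a hypothetical XML document following the structure sketched above might be organized as follows; all element names and values are invented for this example and do not correspond to an existing specification.

<!-- Hypothetical STR profile exchange document; element names are invented -->
<strProfileSubmission>
  <metadata>
    <producer>Example Cell Bank</producer>
    <cellLine>Example-1</cellLine>
    <cultureConditions>DMEM, 10% fetal bovine serum</cultureConditions>
    <profilingMethod>PCR amplification and capillary electrophoresis</profilingMethod>
    <analysisDate>2021-01-10</analysisDate>
    <submissionDate>2021-01-15</submissionDate>
  </metadata>
  <!-- Data core: one entry per STR locus with its observed alleles -->
  <profile>
    <locus name="TH01">
      <allele>6</allele>
      <allele>9.3</allele>
    </locus>
    <locus name="D5S818">
      <allele>11</allele>
      <allele>12</allele>
    </locus>
  </profile>
</strProfileSubmission>

An accompanying XSD schema would then constrain which elements and attributes are allowed, so that any submitted file could be validated automatically before its content is loaded into CLASTR.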
