
Chapter 1 Introduction

1.3. Software Engineering and Bioinformatics Challenges

1.3.1. Big Data

MS datasets range in size from a few thousand to tens of millions of spectra, and dataset sizes are expected to keep growing as mass spectrometers improve and more of the molecules in a sample are measured. The software engineering challenge this poses is writing software that scales gracefully from thousands to tens of millions of spectra. The difficulty lies in the different storage and computation resources that are required to best facilitate the analysis of datasets of different sizes. For small datasets, single-threaded in-memory processing is sufficient.

As the dataset grows, it becomes necessary to process the data on multiple threads and to manage how data storage is split between memory and disk. For very large datasets, the computation and data storage must be distributed over large computer clusters or cloud infrastructure. This scaling challenge can be tackled with big data frameworks such as Apache Hadoop25 and Apache Spark26. Hadoop has already been used in bioinformatics, primarily for next-generation sequence analysis27 but also for proteomics28–30, while Spark was introduced more recently31,32.

Apache Hadoop. Apache Hadoop25 is a top-level Apache project that aims to develop software for reliable, scalable, distributed computing of large datasets across clusters of computers. Hadoop is designed to scale up from single servers to thousands of machines.

Failures in hardware and the network are handled in the software library. This makes it possible to build clusters using cheap commodity hardware. Hadoop includes four modules: Hadoop Common, Hadoop Distributed File System (HDFS), Hadoop YARN and Hadoop MapReduce.

Hadoop Common is the core of the Hadoop framework. It contains utilities that support the other Hadoop modules, such as abstractions of the underlying operating system and its file system. It also provides support for reading and writing Hadoop-specific file formats such as Sequence Files. Sequence Files are flat files consisting of binary key/value pairs that are used by Hadoop MapReduce as input/output and to store temporary data.
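
As an illustration of the Sequence File format, the following sketch writes and reads a small set of binary key/value pairs. It is a minimal, hypothetical example (the file name and the key and value types are chosen purely for illustration) and assumes the Hadoop 2.x SequenceFile API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("spectra.seq"); // hypothetical output file

        // Write binary key/value pairs (spectrum id -> peak count, purely illustrative).
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            writer.append(new Text("spectrum-1"), new IntWritable(1234));
            writer.append(new Text("spectrum-2"), new IntWritable(987));
        }

        // Read the pairs back in insertion order.
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            Text key = new Text();
            IntWritable value = new IntWritable();
            while (reader.next(key, value)) {
                System.out.println(key + " -> " + value);
            }
        }
    }
}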

The second module is the Hadoop Distributed File System (HDFS), which is based on the Google File System paper33. HDFS provides a distributed storage layer that is schema-less and durable, handles failures automatically and rebalances data automatically to even out disk space consumption across a cluster. HDFS works by splitting each file into 64 MB chunks and storing each chunk on three different nodes. If a component fails, the system automatically notices the defect and re-replicates the chunks that were stored on the failed node using data from the other two healthy replicas.

The third module is Hadoop YARN, a framework for job scheduling and cluster resource management. It provides services for resource management, workflow management, fault tolerance and job scheduling/monitoring. The final module is Hadoop MapReduce, a framework for the parallel processing of large datasets that is based on the Google MapReduce paper34. The three main problems that MapReduce addresses are:

• Parallelization
  - How to parallelize the computation
• Distribution
  - How to distribute the data
• Fault tolerance
  - How to handle component failure

MapReduce hides away most of the complexities of dealing with large-scale distributed systems. It provides a minimal API that consists of just two functions: map and reduce.

A key insight of the Google MapReduce paper was to send code to data, not data to code. This is an important differentiator when compared to traditional data warehouse systems and relational databases. Together with HDFS, MapReduce can handle terabytes and even petabytes of data and it can be used to process data that is too big to be moved27.

To implement a MapReduce program, a map function is specified that takes a key/value pair as input and outputs a set of intermediate key/value pairs, and a reduce function is specified that merges all intermediate values that share the same key. The parallelization of programs written in this functional style is performed automatically by Hadoop and works as follows: the input data is split into independent chunks, each of which is processed in parallel by a map task. The outputs of the map tasks are then sorted by the framework and used as input for the reduce tasks.
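
To make the programming model concrete, the sketch below shows what the two user-supplied functions of a word count job could look like with the Hadoop Java API. It is only a sketch: the class names are hypothetical, and the driver, job configuration, combiner and command-line handling that make the official example considerably longer are omitted.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: turn one line of input into a set of intermediate (word, 1) pairs.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce: merge all intermediate values that share the same key (word).
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}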


By persisting data back to disk after each map and reduce function call, Hadoop can process data sets that are larger than the available memory on the cluster.

While MapReduce is capable of processing huge data sets, the availability of only one transformation (map) and one action (reduce) makes it difficult to use MapReduce for writing machine learning and graph processing algorithms.

Apache Spark. Apache Spark26 is a top-level Apache project that provides a fast and general engine for large-scale data processing. As such, Spark is a direct replacement for, and successor to, Hadoop MapReduce. The main problems that Spark addresses are:

• Ease of use
  - Enables iterative and streaming workloads
  - Supports many more functions for transformations and actions
  - Provides distributed shared variables
• Execution speed
  - Removes the performance penalty incurred from persisting data to disk, by storing data in memory
  - Uses “lazy evaluation”, which allows automatic optimization of data processing

While Spark directly competes with MapReduce, it is designed to integrate with the rest of the Hadoop infrastructure (Figure 6).


Figure 6. The Hadoop architecture.

Spark is much easier to use and more expressive than MapReduce. Spark generalizes the two-stage map/reduce paradigm to support an arbitrary directed acyclic graph (DAG) of tasks. The core concept is an abstraction called a Resilient Distributed Dataset (RDD). An RDD is an immutable collection of objects that is split into partitions which are distributed across a cluster. Computations are performed in parallel by applying transformations or actions to the RDD. RDDs have map and reduce functions like MapReduce, but also add many other functions such as filter, flatMap, groupByKey and aggregateByKey.
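
As an illustration, the following sketch (hypothetical data and names, assuming the Spark Java API) builds an RDD, applies filter, mapToPair and groupByKey transformations and then triggers the distributed computation with a collect action.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class RddTransformationsSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // An immutable, partitioned collection of (toy) peptide strings.
            JavaRDD<String> peptides = sc.parallelize(
                    Arrays.asList("PEPTIDE", "PEPTIDER", "SEQ", "SEQUENCE"));

            // Transformations only describe the computation; nothing runs yet.
            JavaPairRDD<Integer, Iterable<String>> byLength = peptides
                    .filter(p -> p.length() > 3)                 // drop short strings
                    .mapToPair(p -> new Tuple2<>(p.length(), p)) // key each peptide by its length
                    .groupByKey();                               // group peptides of equal length

            // collect() is an action and triggers the distributed execution.
            byLength.collect().forEach(group ->
                    System.out.println("length " + group._1() + ": " + group._2()));
        }
    }
}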

To combine data from two or more RDDs, Spark provides functions such as union, intersection, join and cogroup. To share state across a cluster, Spark provides broadcast and accumulator variables. Broadcast variables keep a read-only variable cached on each machine rather than shipping a copy of it with each task, while accumulators are variables that can only be added to and can be used as distributed counters and sums.
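
A minimal sketch of both kinds of shared variable, assuming the Spark 2.x Java API (the contaminant list and the counter are hypothetical), could look as follows.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.util.LongAccumulator;

public class SharedVariablesSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("shared-vars").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Broadcast variable: a read-only lookup list cached once per worker.
            Broadcast<List<String>> contaminants =
                    sc.broadcast(Arrays.asList("KERATIN", "TRYPSIN"));

            // Accumulator: a distributed counter that tasks can only add to.
            LongAccumulator removed = sc.sc().longAccumulator("removed");

            JavaRDD<String> proteins = sc.parallelize(
                    Arrays.asList("KERATIN", "ALBUMIN", "MYOGLOBIN"));

            long kept = proteins.filter(p -> {
                if (contaminants.value().contains(p)) {
                    removed.add(1); // incremented on the executors
                    return false;
                }
                return true;
            }).count();

            System.out.println("kept " + kept + ", removed " + removed.value());
        }
    }
}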

A good comparison of the relative ease of use of MapReduce and Spark is the word count example provided by each processing engine. The MapReduce word count example (https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v2.0) is over 100 lines of code, and its programming style is very different from the style that would be used to write a non-distributed word count. The Spark equivalent is:


// Word count in Spark; sc is an existing JavaSparkContext.
JavaRDD<String> textFile = sc.textFile("hdfs://...");
JavaPairRDD<String, Integer> counts = textFile
    .flatMap(s -> Arrays.asList(s.split(" ")).iterator()) // split each line into words
    .mapToPair(word -> new Tuple2<>(word, 1))             // emit (word, 1) pairs
    .reduceByKey((a, b) -> a + b);                        // sum the counts per word
counts.saveAsTextFile("hdfs://...");

The Spark code is far more compact, and the programming style is very similar to what would be used to write a non-distributed word count.

Another advantage of Spark is execution speed. Spark can be up to 100 times faster than MapReduce. This is mainly because RDDs can be persisted in memory and only need to spill to disk if the data is too big, which greatly reduces the serialization overhead incurred by Hadoop and consequently speeds up calculations. Even if the data is too large to fit in memory and the RDD is stored on disk, Spark is still up to 10 times faster than MapReduce, because Spark uses “lazy evaluation” of the task DAG, allowing automatic optimization of the execution plan to speed up computation.
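
As a small illustration of both points, the following sketch (assuming, like the word count example above, an existing JavaSparkContext named sc; the input path and filter are placeholders) persists an RDD with a storage level that spills to disk only when memory is exhausted, and shows that work is triggered only by actions:

// Transformations only build the DAG; nothing is executed yet (lazy evaluation).
JavaRDD<String> spectra = sc.textFile("hdfs://...");
JavaRDD<String> identified = spectra.filter(line -> !line.isEmpty());

// Keep the RDD in memory and spill partitions to disk only when they do not fit
// (StorageLevel is org.apache.spark.storage.StorageLevel).
identified.persist(StorageLevel.MEMORY_AND_DISK());

// count() is an action: the first call materializes and caches the RDD,
// the second call reuses the cached partitions instead of re-reading the input.
long firstCount = identified.count();
long secondCount = identified.count();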

Spark code can be written in almost the same way as code for a non-distributed program. This expressiveness, together with Spark’s performance and its ability to spill data to disk for large datasets, makes Spark an excellent fit for processing small, medium and huge MS/MS datasets.
