Unified Fault-Tolerant Parallel Computing over Dynamic Distributed Environments


Duration and stipend: 5 months, 500-600 euros/month

Starting date: February-March 2021

Internship institution: LISN, ParSys team, physically located in LRI (Laboratoire de Recherche en Informatique), Université Paris-Saclay (remote supervision is possible in case of a COVID lockdown)

Supervisors: Janna Burman ([email protected]), Stéphane Vialle ([email protected]) and Nicola Roberto Zema ([email protected])

Keywords: distributed algorithms, high-performance computing, computer networks, fault tolerance, Cloud, edge computing

Expected skills: preliminary experience and knowledge in some of the related domains (distributed algorithms, high-performance computing, computer networks) is an advantage

To apply: contact the supervisors, presenting your motivations and relevant skills (preferably including grade transcripts)

1 Motivations

Recently emerging light-weight distributed systems have allowed parallel computing to shift from expensive massively parallel architectures to clusters of commodity PCs or even mobile devices (robots, drones, etc.), to take advantage of cost and performance benefits. Such highly distributed systems are inherently dynamic, decentralized and frequently large-scale, and their nodes may have extremely restricted resources while still providing performance. However, their dynamism, low cost and large scale imply more frequent failures (e.g., the mean time to failure decreases significantly). Failures may increase the execution time and cost of running applications, or even compromise the whole computation. Hence, reliable fault tolerance is a growing concern in such systems, while standard fault-tolerance mechanisms still rely on redundancy and checkpoint-restart. These mechanisms are time- and resource-consuming, and become extremely inefficient in emerging light-weight parallel computing systems.

New technologies require new methods for addressing new types of failures:

1. Strong failures remain, such as node fail-stop or link failures (detectable or not, and with or without restart).

2. Emerging light-weight systems frequently suffer from soft failures. A machine can stay physically operational while being unable to carry out some computational tasks: its GPU can overheat and shut down, a UAV can lose a sensor.

3. A machine can receive a failure alert some time (delta) before being stopped, and take the opportunity to save or emit its latest results and to signal its upcoming shutdown to other machines: a mobile robot can run out of battery, a low-cost pre-emptible machine in the cloud can be reclaimed to run more urgent jobs.
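The three failure types above can be illustrated by a minimal Python sketch. It is only an assumption-laden toy model, not a method proposed by the project: the names `FailureKind`, `Worker` and `handle` are hypothetical, and peers are plain in-memory objects rather than networked machines.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional, Set


class FailureKind(Enum):
    STRONG = auto()  # node fail-stop or link failure: no chance to react
    SOFT = auto()    # node stays up but loses a capability (GPU, sensor, ...)
    ALERT = auto()   # shutdown announced some time delta in advance


@dataclass
class Worker:
    """Hypothetical worker node holding a partial result."""
    name: str
    capabilities: Set[str]
    partial_result: int = 0
    alive: bool = True
    checkpoint: Optional[int] = None
    notified: List[str] = field(default_factory=list)

    def handle(self, kind: FailureKind, peers: List["Worker"],
               lost_capability: Optional[str] = None) -> None:
        if kind is FailureKind.STRONG:
            # The node stops immediately; its partial result is lost.
            self.alive = False
        elif kind is FailureKind.SOFT:
            # The node keeps running with a reduced set of capabilities.
            self.capabilities.discard(lost_capability)
        elif kind is FailureKind.ALERT:
            # Use the warning period to save the latest result and
            # tell the other machines about the upcoming shutdown.
            self.checkpoint = self.partial_result
            for peer in peers:
                peer.notified.append(self.name)
            self.alive = False


uav1 = Worker("uav1", {"gpu", "camera"}, partial_result=42)
uav2 = Worker("uav2", {"gpu"})
uav1.handle(FailureKind.SOFT, [uav2], lost_capability="camera")
uav1.handle(FailureKind.ALERT, [uav2])
```

In this toy run, `uav1` first loses its camera but keeps computing, then uses a failure alert to checkpoint its result and notify `uav2` before stopping.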

Moreover, previous studies generally assumed a sufficiently long mean time between failures, and sometimes only a single possible failure or failures occurring only at disjoint times [2]. These assumptions no longer hold in the systems we consider here. A whole new class of solutions is needed to address soft failures and failure alerts in (various) distributed computing environments.

This interdisciplinary internship unifies three related and complementary fields of study: high performance computing [5, 3], distributed algorithms [1] and computer networks [4], in order to combine their knowledge and experience and to tackle the project's research questions with the broadest possible view.


2 Objectives

The goal of this internship is to investigate medium to light-weight parallel computation systems in order to identify common failure patterns and the corresponding fault-tolerance needs. As reference systems, we expect to investigate parallel computation over Cloud and multi-robot (Edge computing) architectures.

First, recently acquired communication equipment could be used to construct a multi-robot network for initial testing of the algorithms and protocols. Second, some experimental clusters of CentraleSupélec (Metz campus) can be used to run distributed algorithms while provoking artificial breakdowns, in order to test the designed algorithms under different failures.

This will be accompanied by a bibliographical study of existing fault-tolerance methods and models in the three fields of study relevant to the project (networking, high performance computing and distributed computing).

Based on this, the next step will be to propose new unified fault-tolerant parallel and distributed computation models and methods, under which our goal is to achieve algorithmic efficiency: resilience to different types of failures without significant degradation of computing speed and speedup on the considered architectures.

We expect to validate the proposed algorithms both formally, by mathematical proofs, and practically, using the two available experimental platforms.

References

[1] Joffroy Beauquier, Janna Burman, Julien Clément, and Shay Kutten. On utilizing speed in networks of mobile agents. In Proceedings of the 29th Annual ACM Symposium on Principles of Distributed Computing, PODC 2010, Zurich, Switzerland, July 25-28, 2010, pages 305–314. ACM, 2010.

[2] Ifeanyi P. Egwutuoha, David Levy, Bran Selic, and Shiping Chen. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. J. Supercomput., 65(3):1302–1326, 2013.

[3] Constantinos Makassikis, Virginie Galtier, and Stéphane Vialle. A skeletal-based approach for the development of fault-tolerant SPMD applications. In 2010 International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2010, Wuhan, China, 8-11 December, 2010, pages 239–248. IEEE Computer Society, 2010.

[4] Sabato Manfredi, Enrico Natalizio, Claudio Pascariello, and Nicola Roberto Zema. Stability and convergence of a message-loss-tolerant rendezvous algorithm for wireless networked robot systems. IEEE Trans. Control. Netw. Syst., 7(3):1103–1114, 2020.

[5] Stéphane Vialle, Amelia De Vivo, and Fabrice Sabatier. A grid architecture for comfortable robot control. In Peter M. A. Sloot, Alfons G. Hoekstra, Thierry Priol, Alexander Reinefeld, and Marian Bubak, editors, Advances in Grid Computing - EGC 2005, European Grid Conference, Amsterdam, The Netherlands, February 14-16, 2005, Revised Selected Papers, volume 3470 of Lecture Notes in Computer Science, pages 344–353. Springer, 2005.