A user-centered and autonomic multi-cloud architecture for high performance computing applications


HAL Id: tel-01127070

https://tel.archives-ouvertes.fr/tel-01127070

Submitted on 6 Mar 2015

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Alessandro Ferreira Leite

To cite this version:

Alessandro Ferreira Leite. A user-centered and autonomic multi-cloud architecture for high performance computing applications. Computer Aided Engineering. Université Paris Sud - Paris XI; Universidade de Brasília, 2014. English. NNT : 2014PA112355. tel-01127070.


Ecole Doctorale d’Informatique de Paris-Sud

INRIA SACLAY ÎLE-DE-FRANCE / LABORATOIRE DE RECHERCHE EN INFORMATIQUE

Discipline : Informatique

En Cotutelle Internationale avec

Université de Brasília, Brésil

Thèse de doctorat

Soutenue le 2 décembre 2014 par

Alessandro FERREIRA LEITE

A User-Centered and Autonomic Multi-Cloud Architecture for High Performance Computing Applications

Composition du jury :

Rapporteurs : M Christophe CÉRIN Professeur (Université Paris 13)

M Jean-Louis PAZAT Professeur (INSA Rennes)

Examinateurs :

Mme Christine FROIDEVAUX Professeur (Université Paris-Sud 11)

Mme Célia GHEDINI RALHA Professeur (Université de Brasília)

Mme Christine MORIN Directrice de Recherche (INRIA, IRISA)

Directrices de thèse : Mme Christine EISENBEIS Directrice de Recherche (INRIA, LRI, UPSud)


First, I would like to thank my supervisors for their support and advice. Next, I would like to thank the reviewers of my thesis, Christophe Cérin and Jean-Louis Pazat, and the members of my thesis committee: Christine Froidevaux, Célia Ghedini Ralha, and Christine Morin. I appreciate all your comments and questions.

Additionally, I would like to thank Professors Vander Alves, Genaína Rodrigues, Li Weigang, and Jacques Blanc for their comments, suggestions, and discussions.

Next, I am very thankful to Katia Evrat and Valérie Berthou for kindly helping me with various administrative tasks. In addition, I am grateful to the authors who sent me their papers when I requested them.

I would also like to thank all my team members at INRIA, especially Michael Kruse and Taj Muhammad Khan. I also thank my friends Cícero Roberto, Antônio Junior, Éric Silva, André Ribeiro, Francinaldo Araujo, Emerson Macedo, and the former Logus team. Moreover, I would like to thank the people I collaborated with during the last years.

I further acknowledge the financial assistance of CAPES/CNPq through the Science Without Borders program (grant 237561/2012-3) during 2013, and of Campus France/INRIA (project Pascale) from February to July and from November to December 2014.

Last but not least, I would like to thank my parents and my family for their unconditional support and patience.


Cloud computing has been seen as an option for executing high performance computing (HPC) applications. While traditional HPC platforms such as grids and supercomputers offer a stable environment in terms of failures, performance, and number of resources, cloud computing offers on-demand resources, generally with unpredictable performance, at low financial cost. Furthermore, in a cloud environment, failures are part of normal operation. To overcome the limits of a single cloud, clouds can be combined to form a cloud federation, often with minimal additional costs for the users. A cloud federation can help both cloud providers and cloud users achieve goals such as reducing execution time, minimizing cost, increasing availability, and reducing power consumption, among others. Hence, cloud federation can be an elegant solution to avoid over-provisioning, thus reducing operational costs under average load and removing resources that would otherwise remain idle and waste power. However, cloud federation increases the range of resources available to the users. As a result, cloud or system administration skills may be demanded from the users, as well as considerable time to learn about the available options. In this context, some questions arise, such as: (a) which cloud resource is appropriate for a given application? (b) how can users execute their HPC applications with acceptable performance and financial cost, without needing to re-engineer the applications to fit the clouds' constraints? (c) how can non-cloud specialists maximize the features of the clouds, without being tied to a cloud provider? and (d) how can cloud providers use the federation to reduce the power consumption of the clouds, while still being able to give service-level agreement (SLA) guarantees to the users? Motivated by these questions, this thesis presents an SLA-aware application consolidation solution for cloud federation. Using a multi-agent system (MAS) to negotiate virtual machine (VM) migrations between clouds, simulation results show that our approach can reduce power consumption by up to 46% while trying to meet performance requirements. Using the federation, we developed and evaluated an approach to execute a huge bioinformatics application at zero cost. Moreover, we decreased the execution time by 22.55% compared to the best single-cloud execution. In addition, this thesis presents a cloud architecture called Excalibur to auto-scale cloud-unaware applications. Executing a genomics workflow, Excalibur could seamlessly scale the applications up to 11 virtual machines, reducing the execution time by 63% and the cost by 84% when compared to a user's configuration. Finally, this thesis presents a software product line engineering (SPLE) method to handle the commonality and variability of infrastructure-as-a-service (IaaS) clouds, and an autonomic multi-cloud architecture that uses this method to configure the environment and to deal with failures autonomously. The SPLE method uses an extended feature model (EFM) with attributes to describe the resources and to select them based on the users' objectives. Experiments performed with two different cloud providers show that, using the proposed method, users can execute their applications in a federated cloud environment without needing to know the variability and constraints of the clouds.

Keywords: Autonomic computing; High performance computing (HPC); Cloud federation; Software product line; Feature model


Le cloud computing a été considéré comme une option pour exécuter des applications de calcul haute performance (HPC). Bien que les plateformes traditionnelles de calcul haute performance telles que les grilles et les supercalculateurs offrent un environnement stable du point de vue des défaillances, des performances, et de la taille des ressources, le cloud computing offre des ressources à la demande, généralement avec des performances imprévisibles mais à des coûts financiers abordables. En outre, dans un environnement de cloud, les défaillances sont perçues comme étant ordinaires. Pour surmonter les limites d’un cloud individuel, plusieurs clouds peuvent être combinés pour former une fédération de clouds, souvent avec des coûts supplémentaires légers pour les utilisateurs. Une fédération de clouds peut aider autant les fournisseurs que les utilisateurs à atteindre leurs objectifs tels la réduction du temps d’exécution, la minimisation des coûts, l’augmentation de la disponibilité, la réduction de la consommation d’énergie, pour ne citer que ceux-là. Ainsi, la fédération de clouds peut être une solution élégante pour éviter le sur-approvisionnement, réduisant ainsi les coûts d’exploitation en situation de charge moyenne, et en supprimant des ressources qui, autrement, resteraient inutilisées et gaspilleraient ainsi de énergie. Cependant, la fédération de clouds élargit la gamme des ressources disponibles. En conséquence, pour les utilisateurs, des compétences en cloud computing ou en administration système sont nécessaires, ainsi qu’un temps d’apprentissage considérable pour maîtrises les options disponibles. Dans ce contexte, certaines questions se posent : (a) Quelle ressource du cloud est appropriée pour une application donnée ? (b) Comment les utilisateurs peuvent-ils exécuter leurs applications HPC avec un rendement acceptable et des coûts financiers abordables, sans avoir à reconfigurer les applications pour répondre aux normes et contraintes du cloud ? (c) Comment les non-spécialistes du cloud peuvent-ils maximiser l’usage des caractéristiques du cloud, sans être liés au fournisseur du cloud ? et (d) Comment les fournisseurs de cloud peuvent-ils exploiter la fédération pour réduire la consommation électrique, tout en étant en mesure de fournir un service garantissant les normes de qualité préétablies ? À partir de ces questions, la présente thèse propose une solution de consolidation d’applications pour la fédération de clouds qui garantit le respect des normes de qualité de service. On utilise un système multi-agents (SMA) pour négocier la migration des machines virtuelles entre les clouds. Les résultats de simulations montrent que notre approche pourrait réduire jusqu’à 46% la consommation totale d’énergie, tout en respectant les exigences de performance. En nous basant sur la fédération de clouds, nous avons développé et évalué une approche pour exécuter une énorme application de bioinformatique à coût zéro. En outre, nous avons pu réduire le temps d’exécution de 22,55% par rapport à la meilleure exécution dans un cloud individuel. Cette thèse présente aussi une architecture de cloud baptisée « Excalibur » qui permet l’adaptation automatique des applications standards pour le cloud. Dans l’exécution d’une chaîne de traitements de la génomique, Excalibur a pu parfaitement mettre à l’échelle les applications sur jusqu’à 11 machines virtuelles, ce qui a réduit le temps d’exécution de 63% et le coût de 84% par rapport à la configuration de l’utilisateur. 
Enfin, cette thèse présente un processus d’ingénierie des lignes de produits (PLE) pour gérer la variabilité de l’infrastructure à la demande du cloud, et une architecture multi-cloud autonome qui utilise ce processus pour configurer et faire face aux défaillances de manière indépendante. Le processus PLE utilise le modèle étendu de fonction (EFM) avec des attributs pour décrire les ressources et les sélectionner en fonction des objectifs de l’utilisateur. Les expériences réalisées avec deux fournisseurs de cloud différents montrent qu’en utilisant le modèle proposé, les utilisateurs peuvent exécuter leurs applications dans un environnement de clouds fédérés, sans avoir besoin de connaître les variabilités et contraintes du cloud.

Mots-clés: Calcul Autonomique; Calcul Haute Performance (HPC); Cloud Federation; Ligne de Produits Logiciels; Modèles de Variabilité


A computação em nuvem tem sido considerada como uma opção para executar aplicações de alto desempenho. Entretanto, enquanto as plataformas de alto desempenho tradicionais como grid e su-percomputadores oferecem um ambiente estável quanto à falha, desempenho e número de recursos, a computação em nuvem oferece recursos sob demanda, geralmente com desempenho imprevisível à baixo custo financeiro. Além disso, em ambiente de nuvem, as falhas fazem parte da sua normal operação. No entanto, as nuvens podem ser combinadas, criando uma federação, para superar os limites de uma nuvem muitas vezes com um baixo custo para os usuários. A federação de nuvens pode ajudar tanto os provedores quanto os usuários das nuvens a atingirem diferentes objetivos tais como: reduzir o tempo de execução de uma aplicação, reduzir o custo financeiro, aumentar a disponibilidade do ambiente, reduzir o consumo de energia, entre outros. Por isso, a federação de nuvens pode ser uma solução elegante para evitar o sub-provisionamento de recursos ajudando os provedores a reduzirem os custos operacionais e a reduzir o número de recursos ativos, que outrora ficariam ociosos consumindo energia, por exemplo. No entanto, a federação de nuvens aumenta as opções de recursos disponíveis para os usuários, requerendo, em muito dos casos, conhecimento em administração de sistemas ou em computação em nuvem, bem como um tempo considerável para aprender sobre as opções disponíveis. Neste contexto, surgem algumas questões, tais como: (a) qual dentre os recursos disponíveis é apropriado para uma determinada aplicação? (b) como os usuários podem executar suas aplicações na nuvem e obter um desempenho e um custo financeiro aceitável, sem ter que modificá-las para atender as restrições do ambiente de nuvem? (c) como os usuários não especialistas em nuvem podem maximizar o uso da nuvem, sem ficar dependente de um provedor? (d) como os provedores podem utilizar a federação para reduzir o consumo de energia dos datacenters e ao mesmo tempo atender os acordos de níveis de serviços? A partir destas questões, este trabalho apresenta uma solução para consolidação de aplicações em nuvem federalizadas considerando os acordos de serviços. Nossa solução utiliza um sistema multi-agente para negociar a migração das máquinas virtuais entres as nuvens. Simulações mostram que nossa abordagem pode reduzir em até 46% o consumo de energia e atender os requisitos de qualidade. Nós também desenvolvemos e avaliamos uma solução para executar uma aplicação de bioinformática em nuvens federalizadas, a custo zero. Nesse caso, utilizando a federação, conseguimos diminuir o tempo de execução da aplicação em 22,55%, considerando o seu tempo de execução na melhor nuvem. Além disso, este trabalho apresenta uma arquitetura chamada Excalibur, que possibilita escalar a execução de aplicações comuns em nuvem. Excalibur conseguiu escalar automaticamente a execução de um conjunto de aplicações de bioinformática em até 11 máquinas virtuais, reduzindo o tempo de execução em 63% e o custo financeiro em 84% quando comparado com uma configuração definida pelos usuários. Por fim, este trabalho apresenta um método baseado em linha de produto de software para lidar com as variabilidades dos serviços oferecidos por nuvens de infraestrutura (IaaS), e um sistema que utiliza deste processo para configurar o ambiente e para lidar com falhas de forma automática. 
O nosso método utiliza modelo de feature estendido com atributos para descrever os recursos e para selecioná-los com base nos objetivos dos usuários. Experimentos realizados com dois provedores diferentes mostraram que utilizando o nosso processo, os usuários podem executar as suas aplicações em um ambiente de nuvem federalizada, sem conhecer as variabilidades e limitações das nuvens.

Palavras-chave: Computação Autonômica; Computação de Alto Desempenho; Federação de Nuvens; Linha de Produto de Software; Modelo de Features


List of Figures xvii

List of Tables xxi

Listings xxii

List of Abbreviations xxiii

1 Introduction 1
1.1 Motivation . . . 1
1.2 Objectives . . . 3
1.3 Thesis Statement . . . 4
1.4 Contributions . . . 4
1.5 Publications . . . 6
1.6 Thesis Outline . . . 7

I Background 9

2 Large-Scale Distributed Systems 10
2.1 Evolution . . . 11
2.1.1 The 1960s . . . 11
2.1.2 The 1970s . . . 12
2.1.3 The 1980s . . . 13
2.1.4 The 1990s . . . 13
2.1.5 2000-2014 . . . 15
2.1.6 Timeline . . . 16
2.2 Cluster Computing . . . 17
2.3 Grid Computing . . . 19
2.3.1 Architecture . . . 20
2.4 Peer-to-peer . . . 21
2.4.1 Architecture . . . 23
2.4.2 Unstructured P2P Network . . . 24
2.4.3 Structured P2P Network . . . 24
2.4.4 Hybrid P2P Network . . . 26
2.4.5 Hierarchical P2P Network . . . 27

2.4.6 Comparative View of P2P Structures . . . 28
2.5 Cloud Computing . . . 29
2.5.1 Characteristics . . . 30

2.5.2 Drawbacks . . . 32

2.6 Summary . . . 34

3 A Detailed View of Cloud Computing 36
3.1 Technologies Related to Cloud Computing . . . 37

3.1.1 Virtualization . . . 37

3.1.1.1 Definition . . . 37

3.1.1.2 Techniques . . . 38

3.1.1.3 Live Migration . . . 39

3.1.1.4 Workload and Server Consolidation . . . 39

3.1.2 Service-Level Agreement . . . 41

3.1.3 MapReduce . . . 43

3.1.3.1 Definition . . . 43

3.1.3.2 Characteristics . . . 44

3.2 Cloud Organization . . . 45

3.2.1 Architecture and Service Model . . . 45

3.2.2 Deployment Model . . . 47

3.2.3 Cloud Federation . . . 48

3.2.3.1 Definition . . . 48

3.2.3.2 Classification . . . 49

3.2.3.3 Challenges . . . 51

3.3 Cloud Standards and Metrics . . . 53

3.3.1 Cloud Standards . . . 53

3.3.2 Cloud Metrics . . . 54

3.4 IaaS Cloud Computing Systems . . . 55

3.4.1 Architecture . . . 55

3.4.2 Using an IaaS Cloud Service . . . 57

3.5 Cloud Computing Architectures . . . 60

3.5.1 Centralized Systems . . . 60
3.5.1.1 Claudia . . . 60
3.5.1.2 SciCumulus . . . 60
3.5.1.3 Cloud-TM . . . 61
3.5.1.4 mOSAIC . . . 63
3.5.1.5 TClouds . . . 63
3.5.1.6 FraSCAti . . . 63
3.5.1.7 STRATOS . . . 65
3.5.1.8 COS . . . 66
3.5.1.9 Rafhyc . . . 66
3.5.1.10 JSTaaS . . . 68
3.5.2 Decentralized Systems . . . 68
3.5.2.1 Reservoir . . . 68
3.5.2.2 Open Cirrus . . . 69
3.5.2.3 CometCloud . . . 70
3.5.2.4 Contrail . . . 71


3.6 Summary . . . 73

4 Autonomic Computing 76
4.1 Definition . . . 77

4.2 Properties . . . 77

4.3 Architecture . . . 79

4.4 Autonomic Computing Systems . . . 80

4.4.1 V-MAN . . . 80
4.4.2 Sunflower . . . 80
4.4.3 Market-based . . . 81
4.4.4 Component-Management Approach . . . 82
4.4.5 Snooze . . . 82
4.4.6 Cloudlet . . . 83
4.4.7 Distributed VM Scheduler . . . 83

4.4.8 Thermal Management Framework . . . 85

4.4.9 SmartScale . . . 85
4.4.10 SLA Management . . . 86
4.4.11 Comparative View . . . 87
4.5 Summary . . . 87
5 Green Computing 89
5.1 Energy-Aware Computing . . . 90

5.2 Green Data Centers . . . 92

5.2.1 Green Data Center Benchmarks . . . 92

5.2.1.1 The Green500 Initiative . . . 93

5.2.1.2 The Green Index . . . 93

5.2.1.3 SPECpower . . . 94

5.2.1.4 JouleSort . . . 94

5.2.1.5 Comparative View . . . 94

5.3 Green Performance Indicators . . . 95

5.3.1 The Approach of Stanley, Brill, and Koomey . . . 95

5.3.1.1 Overview . . . 95

5.3.1.2 Metrics . . . 95

5.3.1.3 Final Remarks . . . 97

5.3.2 The Green Grid Approach . . . 97

5.3.2.1 Overview . . . 97

5.3.2.2 Metrics . . . 98

5.3.2.3 Final Remarks . . . 99

5.4 Summary . . . 99

II Contributions 100

6 Power-Aware Server Consolidation for Federated Clouds 101
6.1 Introduction and Motivation . . . 102


6.3.1 Modifications in CloudSim . . . 106

6.3.2 Simulation Environment . . . 107

6.3.3 Scenario 1: workload submission to a single data center under power consumption threshold . . . 107

6.3.4 Scenario 2: distinct workload submission to different overloaded data centers . . . 109

6.4 Related Work . . . 109

6.5 Summary . . . 113

7 Biological Sequence Comparison at Zero-Cost on a Vertical Public Cloud Federation 114
7.1 Introduction and Motivation . . . 115

7.2 Biological Sequence Comparison . . . 116

7.2.1 The Smith-Waterman Algorithm . . . 117

7.3 Design of our Federated Cloud Architecture . . . 118

7.3.1 Task Generation with MapReduce . . . 120

7.3.2 Smith-Waterman Execution . . . 120

7.4 Experimental Results . . . 121

7.5 Related Work . . . 125

7.6 Summary . . . 126

8 Excalibur: A User-Centered Cloud Architecture for Executing Parallel Applications 128
8.1 Introduction and Motivation . . . 129

8.2 Architecture Overview . . . 130

8.2.1 Scaling Cloud-Unaware Applications with Budget Restrictions and Resource Constraints . . . 132

8.2.2 Reducing Data Movement to Reduce Cost and Execution Time . . 133

8.2.3 Reducing Job Makespan with Workload Adjustment . . . 134

8.2.4 Making the Cloud Transparent for the Users . . . 134

8.3 Experimental Results . . . 136

8.3.1 Scenario 1: execution without auto-scaling and based on users’ preferences . . . 139

8.3.2 Scenario 2: execution with auto-scaling . . . 140

8.4 Related Work . . . 141

8.5 Summary . . . 143

9 Resource Selection Using Automated Feature-Based Configuration Management in Federated Clouds 145
9.1 Introduction . . . 146

9.2 Motivation and Challenges . . . 149

9.3 Multi-Objective Optimization . . . 151

9.4 Feature Modeling . . . 154


9.5.1.3 Virtual Machine Image Model . . . 159

9.5.1.4 Instance Model . . . 159

9.5.2 Cost Model . . . 160

9.5.2.1 Networking and Storage Cost . . . 160

9.5.2.2 Instance Cost . . . 160

9.6 Modeling IaaS Clouds Configuration Options with Feature Model . . . 161

9.7 Experimental Results . . . 162

9.7.1 Scenario 1: simple . . . 170

9.7.2 Scenario 2: compute . . . 170

9.7.3 Scenario 3: compute and memory . . . 173

9.8 Related Work . . . 173

9.8.1 Virtual Machine Image Configuration . . . 175

9.8.1.1 SCORCH . . . 175

9.8.1.2 VMI Provisioning . . . 175

9.8.1.3 Typical Virtual Appliances . . . 177

9.8.2 Virtual Machine Image Deployment . . . 177

9.8.2.1 Virtual Appliance Model . . . 177

9.8.2.2 Composite Appliance . . . 177

9.8.3 Deploying PaaS Applications . . . 179

9.8.3.1 HW-CSPL . . . 179

9.8.3.2 SALOON . . . 179

9.8.4 Configuration options of multi-tenant applications . . . 180

9.8.4.1 Multi-Tenant Deployment . . . 180

9.8.4.2 Capturing Functional and Deployment Variability . . . 180

9.8.4.3 Configuration Management Process . . . 182

9.8.4.4 Service Line Engineering Process . . . 183

9.8.5 Infrastructure Configuration . . . 184

9.8.5.1 AWS EC2 Service Provisioning . . . 184

9.8.6 Comparative View . . . 184

9.9 Summary . . . 188

10 Dohko: An Autonomic and Goal-Oriented System for Federated Clouds 189
10.1 Introduction and Motivation . . . 190

10.2 System Architecture . . . 191
10.2.1 Client Layer . . . 191
10.2.2 Core Layer . . . 193
10.2.3 Infrastructure Layer . . . 196
10.2.4 Monitoring Cross-Layer . . . 197
10.2.5 Autonomic Properties . . . 197
10.2.5.1 Self-Configuration . . . 198
10.2.5.2 Self-Healing . . . 198
10.2.5.3 Context-Awareness . . . 199

10.2.6 Executing an Application in the Architecture . . . 200


10.3.3 Scenario 2: application execution . . . 206

10.3.4 Scenario 3: application deployment and execution with failures . . . 206

10.4 Related Work . . . 209

10.5 Summary . . . 211

11 Conclusion 212
11.1 Overview . . . 212

11.2 Summary of the Contributions . . . 213

11.3 Threat to Validity . . . 217

11.4 Perspectives . . . 218

11.5 Summary . . . 219

List of Figures

2.1 Computing landmarks in five decades . . . 17

2.2 Tompouce: an example of a medium-scale cluster . . . 18

2.3 Tianhe-2 architecture and network topology . . . 18

2.4 Foster’s grid computing model . . . 19

2.5 Hourglass grid architecture . . . 20

2.6 Globus Toolkit 3 architecture . . . 21

2.7 Globus Toolkit 4 architecture . . . 22

2.8 Different use of P2P systems . . . 23

2.9 Generic P2P architecture . . . 24

2.10 Example of a lookup operation in Chord . . . 27

2.11 A cloud computing system . . . 30

2.12 A vision of grid, P2P, and cloud computing characteristics overlaps . . . . 31

3.1 Example of virtualization . . . 38

3.2 Example of workload consolidation using virtual machines . . . 40

3.3 Example of an SLA structure . . . 42

3.4 MapReduce execution flow . . . 44

3.5 Cloud computing architecture . . . 45

3.6 Cloud service model considering the customers’ viewpoint . . . 46

3.7 Hybrid cloud scenario . . . 48

3.8 An example of the cloud federation approach . . . 49

3.9 Difference between multi-clouds and federated clouds . . . 50

3.10 Categories of federation of clouds: vertical, horizontal, inter-cloud, cross-cloud, and sky computing . . . 50

3.11 A horizontal cloud federated scenario with three clouds . . . 52

3.12 A generic IaaS architecture . . . 58

3.13 Storage types usually available in IaaS clouds . . . 58

3.14 Claudia architecture . . . 61
3.15 SciCumulus architecture . . . 62
3.16 Cloud-TM architecture . . . 62
3.17 mOSAIC architecture . . . 64
3.18 TClouds architecture . . . 65
3.19 FraSCAti architecture . . . 65
3.20 Stratos architecture . . . 66
3.21 COS architecture . . . 67
3.22 Rafhyc architecture . . . 67


3.25 Open Cirrus architecture . . . 70

3.26 CometCloud architecture . . . 71

3.27 Contrail architecture . . . 72

3.28 OPTIMIS architecture . . . 73

4.1 Architecture of an autonomic element . . . 79

4.2 Sunflower architecture . . . 81

4.3 Control loop of a component management cloud system . . . 82

4.4 Snooze architecture . . . 83

4.5 Cloudlet architecture . . . 84

4.6 DVMS control loop . . . 84

4.7 Thermal-aware autonomic management architecture . . . 85

4.8 SmartScale architecture . . . 86

5.1 Taxonomy of power and energy management techniques . . . 91

5.2 Green metrics categorization of Stanley, Brill, and Koomey . . . 96

5.3 The Green Grid Metrics . . . 97

6.1 Agents of the cloud market . . . 104

6.2 Detailed view of a data center . . . 104

6.3 Case study 1: power consumption with 2 data centers under limited power consumption . . . 108

6.4 Case study 2: power consumption of two overloaded data centers under limited power consumption . . . 110

7.1 Comparing two biological sequences . . . 117

7.2 Example of a Smith-Waterman similarity matrix . . . 118

7.3 Federated cloud architecture to execute MapReduce applications . . . 119

7.4 Type of the messages exchanged in our multi-cloud architecture . . . 119

7.5 Comparing protein sequences with a genomics database on multiple clouds . . . 121
7.6 Smith-Waterman execution following the MapReduce model . . . 121

7.7 Execution time for 24 sequence comparisons with the Uniprot/SwissProt database . . . 123

7.8 Sequential execution time for the longest sequence (Q9UKN1) with SSEARCH compared with the standalone execution time in Amazon EC2 . . . 124

7.9 GCUPS of 24 query sequences comparison with the database UniProtKB/Swiss-Prot using our SW implementation . . . 124

8.1 Excalibur: services and layers . . . 131

8.2 A DAG representing a workflow application with 4 tasks, where one of them, T3, is composed of three independent subtasks . . . 132

8.3 Executing an application using the Excalibur cloud architecture . . . 137

8.4 The Infernal-Segemehl workflow . . . 139

8.5 Infernal’s target hits table . . . 139


8.8 Scaling the Infernal-Segemehl workflow . . . 142

9.1 An engineering method to handle clouds’ variabilities . . . 148

9.2 Average network bandwidth of the Amazon EC2 instance c3.8xlarge when created with the default configuration; and using an internal (private) and an external (public) address for data transfer . . . 150

9.3 Multi-objective optimization problem evaluation mapping . . . 152

9.4 Pareto optimality for two objectives . . . 153

9.5 Example of a feature model with And, Optional, Mandatory, Or, and Alternative features . . . 156

9.6 Abstract extended feature model of IaaS clouds . . . 163

9.7 Our process to select and to deploy the resources in the clouds . . . 164

9.8 Example of the abstract extended feature model instantiated to represent two products of Amazon EC2 . . . 166

9.9 Example of an abstract extended feature model instantiated to represent two products of GCE . . . 167

9.10 UnixBench score for one, two, four, and eight virtual cores for the general instance types of Amazon EC2 and Google Compute Engine (GCE) . . . . 169

9.11 Instance types that offer at least 4 vCPU cores and 4GB of RAM memory with a cost of at most 0.5 USD/hour . . . 171

9.12 Instance types in the Pareto front that offer at least 4 vCPU cores and 4GB of RAM memory with a cost of at most 0.5 USD/hour . . . 171

9.13 Instance types suggested by the system that offer at least 4 vCPU cores and 4GB of RAM memory with a cost of at most 0.5 USD/hour . . . 172

9.14 Instance types that offer at least 16 CPU cores and 8GB of RAM memory with a cost of at most 1.0 USD/hour . . . 172

9.15 The solutions with the minimal cost (best solutions) when requested at least 16 CPU cores and 8GB of RAM memory with a cost of at most 1.0 USD/hour . . . 173

9.16 Instance types that offer at least 16 CPU cores and 90GB of RAM memory with a cost of at most 2 USD/hour . . . 174

9.17 Instance types in the Pareto front that offer at least 16 CPU cores and 90GB of RAM memory, with a cost of at most 2 USD/hour . . . 174

9.18 SCORCH MDE process . . . 176

9.19 MDE approach for VMI configuration . . . 176

9.20 Virtual appliance model . . . 178

9.21 Composite appliance model . . . 178

9.22 HW-CSPL’s feature model . . . 179

9.23 SALOON framework . . . 180

9.24 Feature model showing external and internal variability options of a multi-tenant SaaS application . . . 181

9.25 A model for managing configuration options of SaaS applications . . . 182

9.26 A configuration process model and its stages . . . 183


10.2 Structure of the hierarchical P2P overlay connecting two clouds . . . 197

10.3 The autonomic properties implemented by our architecture . . . 198

10.4 Example of super-peer failure and definition of a new super-peer . . . 199

10.5 Interaction between the architecture's modules when an application is submitted for execution . . . 202

10.6 Workflow to create one virtual machine in the cloud . . . 203

10.7 Configuration time of the virtual machines on the clouds . . . 206

10.8 SSEARCH’s execution time on the clouds to compare 24 genomics query sequences with the UniProtKB/Swiss-Prot database . . . 207

10.9 Execution time of the SSEARCH application to compare 24 genomics query sequences with the UniProtKB/Swiss-Prot database in a multi-cloud scenario . . . 208

10.10 Deployment and execution time of the experiments . . . 208

10.11 Execution time of the application on the clouds with three different types of failures . . . 209

List of Tables

2.1 Comparative view of the P2P overlay networks . . . 28

3.1 A summary of some cloud standards . . . 54

3.2 Some metrics to evaluate an IaaS cloud . . . 56

3.3 Comparative view of some cloud architectures . . . 74

4.1 Autonomic computing systems . . . 88

5.1 Energy efficiency benchmarks and metrics . . . 95

6.1 Comparative view of cloud server consolidation strategies . . . 113

7.1 Configuration of the clouds to execute the SW algorithm . . . 122

7.2 Query sequences compared to the UniProtKB/Swiss-Prot genomics database . . . 122

7.3 Comparative view of the approaches that implement SW in HPC platforms 127

8.1 Resources used to execute the Infernal-Segemehl workflow . . . 139

8.2 Comparative view of user-centered cloud architectures . . . 143

9.1 Using group cardinality in feature model diagrams . . . 156

9.2 Notation of the model . . . 157

9.3 Amazon EC2 instance types and their cost in the region of Virginia . . . . 165

9.4 Google Compute Engine (GCE) instance types and their cost in the US region . . . 168

9.5 Benchmark applications . . . 168

9.6 Three users’ requirements to select the clouds’ resources . . . 170

9.7 Instance types in the Pareto front considering a demand for 16 CPU cores and 8GB of RAM memory with a cost of at most 1.0 USD/hour . . . 171

9.8 Characteristics of the instance types that offer at least 16 CPU cores and 90GB of RAM memory . . . 173

9.9 Comparative view of cloud variability models . . . 187

10.1 Main operations implemented by the key-value store . . . 196

10.2 Setup of the application . . . 203

10.3 Users’ requirements to execute the SSEARCH in the cloud . . . 205

10.4 Instance types that met the users’ requirements to execute the SSEARCH 205

10.5 Financial cost for executing the application on the cloud considering different requirements . . . 207

Listings

3.1 Counting the number of occurrences of each word in a text using MapReduce [98] . . . 43

8.1 Example of a splittable file format (i.e., FASTA file) . . . 133

8.2 Example of a JSON with three genomics sequences . . . 133

8.3 Defining a genomics analysis application . . . 134

8.4 Defining a Twitter analysis application . . . 134

8.5 Excalibur: example of a YAML file with the requirements and one applica-tion to be executed on the cloud . . . 136

8.6 Users’ description of the Infernal-Segemehl workflow . . . 138

10.1 Structure of an application descriptor . . . 193

10.2 An example of a deployment descriptor generated by the provisioning module . . . 194

10.3 Example of one script with three variability points . . . 195

10.4 Application descriptor with the requirements and one application to be executed in one cloud . . . 201

10.5 An application descriptor with one SSEARCH description to be executed in the cloud . . . 204

List of Abbreviations

AE Autonomic Element
AM Autonomic Manager
API Application Programming Interface
AT Allocation Trust
AWS Amazon Web Services
BPEL Business Process Execution Language
CA Cloud Availability
CDMI Cloud Data Management Interface
CE Cost-Effectiveness
CERA Carbon Emission Regulator Agency
CIMI Cloud Infrastructure Management Interface
CK Configuration Knowledge
Clafer Class feature reference
CloVR Cloud Virtual Service
CLSP Cloud Service Provider
CLU Cloud User
CORBA Common Object Request Broker Architecture
CP Constraint Programming
CPE Compute Power Efficiency
CSP Constraint Satisfaction Problem
DAG Directed Acyclic Graph
DAOP Dynamic Aspect-Oriented Programming
DCD Data Center Density
DCeP Data Center Energy Productivity
DCiE Data Center Infrastructure Efficiency
DCOM Distributed Computing Object Model
DCPE Data Center Performance Efficiency
DFS Distributed File System
DSL Domain Specific Language
DSTM Distributed Software Transaction Memory
DTM Distributed Transaction Memory
DVFS Dynamic Voltage and Frequency Scaling
EC2 Elastic Compute Cloud
ECA Event-Condition-Action
ECU Elastic Computing Unit
EFM Extended Feature Model
EPP Electric Power Provider
FAP Federated Application Provisioning
FM Feature Model
FODA Feature-Oriented Domain Analysis
GAE Google App Engine
GCE Google Compute Engine
GCEUs Google Compute Engine Units
GCUP Billions of Cell Updates per Second
GGF Global Grid Forum
GHG Greenhouse Gas
GPI Green Performance Indicator
GUI Graphical User Interface
HPC High Performance Computing
HTTP Hypertext Transfer Protocol
IaaS Infrastructure-as-a-Service
IC Instance Capability
IE Instance Efficiency
IM Infrastructure Manager
IOPS Input/Output Operations per Second
IPT Instance Performance Trust
ISP Instance Sustainable Performance
JSON JavaScript Object Notation
KPI Key Performance Indicator
MAS Multi-Agent System
MDE Model-Driven Engineering
ME Managed Element
MPP Massively Parallel Processor
MQS Message Queue System
MTC Many Task Computing
MTRA Mean Time to Resource Acquisition
NAS Network-Attached Storage
NC Network Capacity
NIST National Institute of Standards and Technology
OCCI Open Cloud Computing Interface
OD Outage Duration
OGF Open Grid Forum
OGSA Open Grid Service Architecture
OGSI Open Grid Service Infrastructure
OS Operating System
OVF Open Virtualization Format
P2P Peer-to-peer
PaaS Platform-as-a-Service
PC Personal Computer
PLE Product Line Engineering
PLR Packet Lost Ratio
PUE Power Usage Effectiveness
PVM Parallel Virtual Machine
QoS Quality of Service
RE Resource Efficiency
REST Representational State Transfer
RMI Java Remote Method Invocation
RPC Remote Procedure Call
RRT Resource Release Time
SaaS Software-as-a-Service
SC Storage Capacity
SCA Service Component Architecture
SCORCH Smart Cloud Optimization for Resource Configuration Handling
SDA Service Discovery Agent
SDAR Storage Data Access Ratio
SGE Sun Grid Engine
SOA Service-Oriented Architecture
SOAP Simple Object Access Protocol
SOOP Single-Objective Optimization Problem
SPEC Standard Performance Evaluation Corporation
SPL Software Product Line
SPLE Software Product Line Engineering
SS Self-Scheduling
SSD Solid-State Disk
SSO Single Sign-On
SW Smith-Waterman
TGI The Green Index
TM Transactional Memory
TVA Typical Virtual Appliance
VA Virtual Appliance
VI Virtual Infrastructure
VM Virtual Machine
VMI Virtual Machine Image
VMM Virtual Machine Monitor
W3C World Wide Web Consortium
WfMS Workflow Management System
WSDL Web Service Definition Language
WSLA Web Service Level Agreement
WSRF Web Service Resource Framework
WWW World-Wide Web
XML Extensible Markup Language
XMPP Extensible Messaging and Presence Protocol
YAML YAML Ain't Markup Language


Introduction

Contents

1.1 Motivation . . . 1
1.2 Objectives . . . 3
1.3 Thesis Statement . . . 4
1.4 Contributions . . . 4
1.5 Publications . . . 6
1.6 Thesis Outline . . . 7

1.1 Motivation

Cloud computing is a recent paradigm for the provisioning of computing infrastructure, platforms, and/or software. It provides computing resources through a virtualized infrastructure, letting applications, computing power, data storage, and network resources be provisioned and remotely managed over private networks or over the Internet [154, 307]. Hence, cloud computing brings collaboration, agility, scalability, and availability to end users and enterprises.

Furthermore, the clouds offer different features, enabling resource sharing (e.g., infrastructure, platform, and/or software) among cloud providers and cloud users in a pay-as-you-go model. These features have been used for many objectives, such as (a) to decrease the cost of ownership, (b) to increase the capacity of dedicated infrastructures when they run out of resources, (c) to reduce power consumption and/or carbon footprint, and (d) to respond effectively to changes in demand. However, access to these features depends on the level of abstraction (i.e., the cloud layer). The levels of abstraction are usually defined as: infrastructure-as-a-service (IaaS), platform-as-a-service (PaaS), and software-as-a-service (SaaS). While IaaS offers low-level access to the infrastructure, including virtual machines (VMs), storage, and network, PaaS adds a layer above the IaaS infrastructure, offering high-level primitives to help users develop native cloud applications. In addition, it also provides services for application deployment, monitoring, and scaling. Finally, SaaS provides the applications and manages the whole computing environment.

In this context, users interested in the cloud face the problem of either choosing low-level services to execute ordinary applications, thus becoming responsible for managing the computing resources (i.e., VMs), or choosing high-level services and developing native cloud applications in order to delegate the management of the computing environment to the cloud providers. These two options exist because clouds often target Web applications, whereas users' applications are usually batch-oriented, performing parameter sweeps. Therefore, deploying and executing an application in the cloud is still a complex task [179, 392]. For instance, to execute an application using the cloud infrastructure, the users must first select a virtual machine (VM), configure all necessary applications, transfer all data, and, finally, execute their applications. Learning this process can take days or even weeks if the users are unfamiliar with system administration, without any guarantee of meeting their objectives. Hence, this can represent a barrier to using the cloud infrastructure.
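
To make this manual process concrete, the sketch below walks through the same steps with the AWS SDK for Python (boto3). It is only an illustration: the AMI identifier, key pair name, and instance type are placeholders, and the configuration, data transfer, and execution steps are left as comments because they usually require additional tooling (e.g., ssh/scp).

```python
# Minimal sketch of the manual steps a user performs today, assuming the AWS SDK
# for Python (boto3) and placeholder identifiers (AMI, key pair, instance type).
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

# Step 1: select and start a virtual machine (instance type chosen by the user).
instances = ec2.create_instances(
    ImageId="ami-xxxxxxxx",     # placeholder machine image
    InstanceType="m3.medium",   # chosen manually by the user
    KeyName="my-key-pair",      # placeholder SSH key pair
    MinCount=1,
    MaxCount=1,
)
vm = instances[0]
vm.wait_until_running()
vm.reload()
print("Instance running at", vm.public_dns_name)

# Step 2: configure the applications (e.g., ssh in and install packages).
# Step 3: transfer the input data (e.g., scp local files to the instance).
# Step 4: execute the application and collect the results.
```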

Moreover, with the number of cloud providers growing, the users face the challenge of selecting an appropriate cloud and/or resource (e.g., a VM) to execute their applications. This represents a difficult task because there is a wide range of resources offered by the clouds. Furthermore, these resources are usually suited for different purposes, and they often have multiple constraints. Since clouds can fail or may have scalability limits, a cloud federation scenario should be considered, which further increases the work and the difficulty of using cloud infrastructures. Some questions that arise in this context are: (a) which resources (VMs) are suitable for executing a given application; (b) how to avoid under- and over-provisioning, taking into account both the resources' characteristics and the users' objectives (e.g., performance at lower cost), without being tied to a cloud provider; and (c) how a non-cloud specialist may maximize the usage of the cloud infrastructure without re-engineering their applications for the cloud environment.

Although some efforts have been made to reduce the cloud's complexity, most of them target software developers and are not straightforward for inexperienced users [179]. Besides that, cloud services usually run in large-scale data centers and demand a huge amount of electricity. Nowadays, the electricity cost can be seen as one of the major concerns of data centers, since it sometimes grows nonlinearly with the capacity of the data centers, and it is also associated with a high amount of carbon emissions (CO2). Projections considering data center energy efficiency [142, 201, 202] show that the total amount of electricity consumed by data centers in the coming years will be extremely high, and that data centers are likely to overtake the airline industry in terms of carbon emissions.

Additionally, depending on the efficiency of the data center infrastructure, the number of watts it requires can be three to thirty times higher than the number of watts needed for the computations themselves [330]. This has a high impact on the total operating costs [31], which can exceed 60% of the peak load. Nevertheless, energy-saving schemes that result in too much degradation of system performance or in violations of service-level agreement (SLA) parameters would eventually cause the users to move to another cloud provider.

Thus, there is a need to reach a balance between the energy savings and the costs incurred by these savings in the execution of the applications.
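
As a rough illustration of this overhead, consider the power usage effectiveness (PUE) metric discussed later in the green computing chapter. The numbers below are hypothetical and simply read the "three to thirty times" figure as the ratio between total facility power and the power drawn by the IT equipment.

```latex
% Power usage effectiveness (PUE): ratio of total facility power to IT power.
\mathrm{PUE} = \frac{P_{\text{facility}}}{P_{\text{IT}}}
% Hypothetical example: if the IT equipment draws 100 kW and the whole facility
% draws 300 kW, then PUE = 300 / 100 = 3; an ideal data center would have PUE = 1.
```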

In this context, we advocate the usage of cloud federation to seamlessly distribute the service workload across different clouds, according to some objectives (e.g., to reduce energy consumption), without incurring too much degradation of the performance requirements defined between cloud providers and cloud users. Moreover, we advocate a declarative approach where the users describe their applications and objectives and submit them to a system that automatically: (a) provisions the resources; (b) sets up the whole computing environment; (c) tries to meet the users' requirements, such as performance at reduced cost; and (d) handles failures in a federated cloud scenario.
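
The sketch below illustrates what such a declarative description could look like. It is a hypothetical example expressed as a plain Python dictionary for readability; the field names and values are not the descriptor format defined later in this thesis.

```python
# Hypothetical, illustrative application description: the user states the application,
# its requirements, and the objectives; a system (not shown) provisions the resources,
# configures the environment, executes the application, and handles failures.
application_request = {
    "application": {
        "name": "sequence-comparison",
        "command": "ssearch36 queries.fasta uniprot_sprot.fasta",
        "input": ["queries.fasta", "uniprot_sprot.fasta"],
    },
    "requirements": {
        "min_cpu_cores": 4,
        "min_memory_gb": 4,
        "max_cost_usd_per_hour": 0.5,
    },
    "objectives": ["minimize execution time", "minimize financial cost"],
    "clouds": ["amazon-ec2", "google-compute-engine"],  # candidate providers
}
```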

1.2 Objectives

Following a goal-oriented strategy, this thesis aims to investigate the usage of federated clouds considering different viewpoints: (a) cloud providers, (b) experienced software developers, (c) ordinary users, and (d) multiple user profiles (i.e., system administrators, unskilled and skilled cloud users, and software developers). To achieve the main goal, this thesis considers the following four sub-goals.

1. reducing power consumption: cloud services usually execute in big data centers that normally contain a large number of computing nodes. Thus, the cost of running these services may have a major impact on the total operational cost [31] due to the amount of energy demanded by such services. In this scenario, we aim to investigate the use of a cloud federation environment to help cloud providers reduce the power consumption of their services, without having a great impact on quality of service (QoS) requirements.

2. execution of a huge application at reduced cost: most of the clouds provide some resources at low financial cost. These resources normally have limited computing capacity and a small amount of RAM memory. Moreover, they are heterogeneous with regard to the cloud layer (i.e., PaaS and IaaS), requiring different strategies to use and to connect them. Thus, we want to investigate whether a federated execution using exclusively the cheapest resources of each cloud can achieve an acceptable execution time.

3. reducing the execution time of cloud-unaware applications: some of the users' applications were designed to execute on a single, dedicated resource with almost predictable performance. Hence, these applications do not fit the cloud model, where resource failures and performance variation are part of normal operation. However, most of these applications have parts that can be executed in parallel; in other words, they comprise sets of independent tasks. In this scenario, we aim to investigate the execution of cloud-unaware applications, taking into account the financial cost and trying to reduce their execution time without user intervention.

4. automating the selection and configuration of cloud resources for different kinds of users: nowadays, users interested in the clouds face two major problems. One is knowing what the available resources are, including their constraints and characteristics. The other is the skill required to select, to configure, and to use these resources taking into account different objectives such as performance and cost. In this context, one of our goals is to investigate how to help users deal with these problems, requiring minimal user intervention to configure a single or a federated cloud environment.

1.3 Thesis Statement

In 1992, Smarr and Catlett [326] introduced the concept of metacomputing. Metacomputing refers to the use of distributed computing resources connected by networks, thus creating a virtual supercomputer, the metacomputer. In this case, they advocated that the users should be unaware of the metacomputer, or even of any computer, since the metacomputer has the capacity to obtain whatever computing resources are necessary.

Based on this vision, our thesis statement is that:

Cloud computing is an interesting model for implementing a metacomputer due to its characteristics such as on-demand, pay-per-usage, and elasticity. Moreover, the cloud model focuses on delivering computing services rather than computing devices; i.e., in the cloud, the users are normally unaware of the computing infrastructure. Altogether, the cloud model can increase resource federation and reduce the efforts to democratize the access to high performance computing (HPC) environments at reduced cost.

Besides that, we ask the following research questions in this thesis:

can the cloud federation be used to reduce the power consumption of data centers, without incurring severe performance penalties for the users?

can software developers use the clouds to speed up the execution of a native-cloud application at reduced cost, without being locked into any cloud provider?

can inexperienced users utilize a cloud environment to execute cloud-unaware applications, without having to deal with low-level technical details, achieving both an acceptable execution time and an acceptable financial cost, and without having to change their applications to meet the clouds' constraints?

is there a method to support automatic resource selection and configuration in a cloud federation that offers a level of abstraction suitable for different user profiles?

1.4 Contributions

To tackle the sub-goals introduced above, this thesis makes the following five contributions:

1. Power-Aware Server Consolidation for Federated Clouds: we propose and evaluate a power- and SLA-aware application consolidation solution for cloud federations [215]. To achieve this goal, we designed a multi-agent system (MAS) for server consolidation, taking into account service-level agreements (SLAs), power consumption, and carbon footprint. Differently from similar solutions available in the literature, in our […] data centers before migrating the workload (i.e., the VM). Simulation results show that our approach can reduce power consumption by up to 46% while trying to meet performance requirements. Furthermore, we show that federated clouds can provide an adequate solution to deal with power consumption in clouds. In this case, cloud providers can use the computing infrastructure of other clouds according to their objectives. This work was published in [215].
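
As a rough illustration of the kind of decision the consolidation agents make, the sketch below shows a hypothetical migration check: a data center that exceeds its power budget asks federated partners for capacity and migrates a VM only if the destination can host it without violating its own constraints. The thresholds and the negotiation protocol are simplified assumptions, not the exact MAS described in Chapter 6.

```python
# Hypothetical, simplified view of the consolidation decision: migrate a VM from an
# overloaded data center to a federated partner only if the partner stays within its
# power budget and has enough spare capacity for the VM's demand.
from dataclasses import dataclass

@dataclass
class DataCenter:
    name: str
    power_watts: float          # current power consumption
    power_budget_watts: float   # maximum allowed power consumption
    free_cpu_cores: int

def can_host(dc: DataCenter, vm_cores: int, vm_power_watts: float) -> bool:
    """A partner accepts a VM only if it keeps power and capacity within limits."""
    return (dc.free_cpu_cores >= vm_cores and
            dc.power_watts + vm_power_watts <= dc.power_budget_watts)

def negotiate_migration(source: DataCenter, partners: list[DataCenter],
                        vm_cores: int, vm_power_watts: float) -> DataCenter | None:
    """If the source exceeds its budget, pick the first partner able to host the VM."""
    if source.power_watts <= source.power_budget_watts:
        return None  # no need to migrate
    for partner in partners:
        if can_host(partner, vm_cores, vm_power_watts):
            return partner
    return None  # keep the VM locally; no partner met the constraints

# Example: dc_a is over budget and dc_b has spare capacity, so the VM moves to dc_b.
dc_a = DataCenter("dc-a", power_watts=950.0, power_budget_watts=900.0, free_cpu_cores=0)
dc_b = DataCenter("dc-b", power_watts=400.0, power_budget_watts=900.0, free_cpu_cores=8)
print(negotiate_migration(dc_a, [dc_b], vm_cores=2, vm_power_watts=50.0))
```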

2. Biological Sequence Comparison at Zero-Cost on a Vertical Public Cloud Federation: we propose and evaluate an approach to execute a huge bioinformatics application on a vertical cloud federation [214]. This approach has two main components: (i) an architecture that can transparently connect and manage multiple clouds, thus creating a multi-cloud environment, and (ii) an implementation of a MapReduce version of the bioinformatics application on this architecture. The architecture and the application were implemented and executed on five public clouds (Amazon EC2, Google App Engine, Heroku, OpenShift, and PiCloud), using only their free quota. In our tests, we executed an application that performed up to 12 million biological comparisons. Experimental results show that (a) our federated approach could reduce the execution time by 22.55% compared to the best stand-alone cloud execution; (b) we could reduce the execution time from 5 hours and 44 minutes (SSEARCH sequential tool) to 13 minutes (our Amazon EC2 execution); and (c) federation can enable the execution of huge applications in clouds at no expense (i.e., using only the free quota). With this work, it became clear that our architecture could federate real clouds and execute a real application. Even though the proposed architecture was very effective, it was application-specific (i.e., MapReduce applications). Moreover, it became clear to us that configuration tasks are complex in a real cloud environment, and they often require advanced computing skills. So, we decided to investigate generic architectures and models that do not impose complex configuration tasks on the users. This work was published in [214].
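
To give a flavor of MapReduce-style task generation, the toy sketch below splits a set of query sequences into independent comparison tasks (the map step) and merges the best score per query (the reduce step). It is a minimal, self-contained illustration; the scoring function is a stub, not the Smith-Waterman implementation described in Chapter 7.

```python
# Toy MapReduce-style sketch: the map step emits one independent comparison task per
# (query, database chunk) pair; the reduce step keeps the best score per query.
# The score function is a stub standing in for the Smith-Waterman computation.
from collections import defaultdict
from itertools import product

def score(query: str, subject: str) -> int:
    """Stub similarity score: length of the common prefix (not Smith-Waterman)."""
    n = 0
    for a, b in zip(query, subject):
        if a != b:
            break
        n += 1
    return n

def map_phase(queries: dict[str, str], db_chunks: dict[str, str]):
    """Emit (query_id, (chunk_id, score)) pairs; each pair is an independent task."""
    for (qid, q), (cid, chunk) in product(queries.items(), db_chunks.items()):
        yield qid, (cid, score(q, chunk))

def reduce_phase(mapped):
    """Keep, for each query, the database chunk with the highest score."""
    best = defaultdict(lambda: ("", -1))
    for qid, (cid, s) in mapped:
        if s > best[qid][1]:
            best[qid] = (cid, s)
    return dict(best)

queries = {"Q1": "ACGTAC", "Q2": "TTGACA"}
db_chunks = {"chunk-1": "ACGTTT", "chunk-2": "TTGACC"}
print(reduce_phase(map_phase(queries, db_chunks)))
```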

3. Excalibur: A User-Centered Cloud Architecture for Executing Parallel Applications: we propose and evaluate a cloud architecture called Excalibur. This architecture has three main objectives [217]: (a) to provide a platform for high performance computing applications in the cloud for users without cloud skills; (b) to dynamically scale the applications without user intervention; and (c) to meet the users' requirements, such as high performance at reduced cost. Excalibur comprises three main components: (i) an architecture that sets up the cloud environment; (ii) an auto-scaling mechanism that tries to reduce the execution time of cloud-unaware applications; in this case, the auto-scaling solution focuses on applications that were developed to be executed sequentially but that have parts that can be executed in parallel; and (iii) a domain-specific language (DSL) that allows the users to describe the dependencies between the applications based on the structure of their data (i.e., input and output). We executed a complex genomics cloud-unaware application in our architecture, which was deployed on Amazon EC2. The experiments showed that the proposed architecture could dynamically scale this application up to 11 instances, reducing the execution time by 63% and the cost by 84% when compared to the execution in a configuration specified by the users. In this case, the execution time was reduced from 8 hours and 41 minutes to 3 hours and 4 minutes, and the cost was reduced from 78 USD to 14 USD. With this work, the advantages of auto-scaling in clouds became clear to us. Furthermore, we showed that it was possible to execute a complex cloud-unaware application in the cloud. This work was published in [217].
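
The core idea of exploiting the independent parts of a sequential application can be illustrated with a very small sketch: independent subtasks are dispatched to a pool of workers, so adding workers (virtual machines in the real setting; threads stand in for them here) shortens the makespan of the parallel portion. This is only an analogy under simplifying assumptions, not the Excalibur auto-scaling mechanism itself.

```python
# Minimal analogy for scaling cloud-unaware applications: the independent parts of a
# workflow step are executed concurrently by a pool of workers. In the real
# architecture the workers are virtual machines; here threads stand in for them.
from concurrent.futures import ThreadPoolExecutor
import time

def process_chunk(chunk_id: int) -> str:
    """Stand-in for an independent subtask (e.g., one slice of a FASTA file)."""
    time.sleep(0.1)  # simulate work
    return f"chunk-{chunk_id} done"

def run_step(num_chunks: int, num_workers: int) -> float:
    """Run all independent subtasks with a given number of workers; return elapsed time."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        list(pool.map(process_chunk, range(num_chunks)))
    return time.time() - start

# More workers shorten the makespan of the parallel portion of the workflow.
print("1 worker :", round(run_step(8, 1), 2), "s")
print("4 workers:", round(run_step(8, 4), 2), "s")
```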

4. Resource Selection Using Automated Feature-Based Configuration Management in Federated Clouds: we propose and evaluate a model to handle the variabilities of IaaS clouds. The model uses an extended feature model (EFM) with attributes to describe the resources and their characteristics and to select appropriate virtual machines based on the users' objectives. We implemented the model in a solver (i.e., Choco [178]), considering the configurations of two different clouds (Amazon EC2 and Google Compute Engine (GCE)). Experimental results showed that, using the proposed model, the users can get an optimal configuration with regard to their objectives without needing to know the constraints and variabilities of each cloud. Moreover, our model enabled application deployment and reconfiguration at runtime in a federated cloud scenario without requiring the use of a virtual machine image (VMI).
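
As a hedged illustration of the kind of selection this model performs, the sketch below filters a few hypothetical instance-type descriptions against user requirements (minimum vCPUs, minimum RAM, maximum hourly cost) and then keeps only the Pareto-optimal candidates when minimizing cost and maximizing compute capacity. The instance names and prices are made up for the example, and the selection in the thesis is done with Choco over an extended feature model, not with this ad-hoc filter.

```python
# Hedged illustration of attribute-based resource selection: filter candidate instance
# types by the user's requirements, then keep the Pareto-optimal ones when minimizing
# hourly cost and maximizing vCPU count. The instance data below is made up.
from dataclasses import dataclass

@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    memory_gb: float
    usd_per_hour: float

CANDIDATES = [
    InstanceType("cloud-a.small", 2, 4.0, 0.10),
    InstanceType("cloud-a.large", 4, 8.0, 0.28),
    InstanceType("cloud-b.std-4", 4, 15.0, 0.40),
    InstanceType("cloud-b.std-8", 8, 30.0, 0.80),
]

def feasible(it: InstanceType, min_vcpus: int, min_mem: float, max_cost: float) -> bool:
    return it.vcpus >= min_vcpus and it.memory_gb >= min_mem and it.usd_per_hour <= max_cost

def pareto_front(items: list[InstanceType]) -> list[InstanceType]:
    """Keep items not dominated by another one that is cheaper-or-equal and has
    at-least-as-many vCPUs, with a strict improvement in at least one objective."""
    front = []
    for a in items:
        dominated = any(
            b is not a and b.usd_per_hour <= a.usd_per_hour and b.vcpus >= a.vcpus
            and (b.usd_per_hour < a.usd_per_hour or b.vcpus > a.vcpus)
            for b in items
        )
        if not dominated:
            front.append(a)
    return front

# Example requirement: at least 4 vCPUs, 4 GB of RAM, and at most 0.5 USD per hour.
ok = [it for it in CANDIDATES if feasible(it, 4, 4.0, 0.5)]
print([it.name for it in pareto_front(ok)])
```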

5. Dohko: An Autonomic and Goal-Oriented System for Federated Clouds: we propose and evaluate an autonomic and goal-oriented system for federated clouds. Our system implements the autonomic properties of self-configuration, self-healing, and context-awareness. Using a declarative strategy, in our system the users specify their applications and requirements (e.g., number of CPU cores, maximal financial cost per hour, among others), and the system automatically (a) selects the resources (i.e., VMs) that meet the constraints, using the model proposed in contribution 4; (b) configures and installs the applications in the clouds; (c) handles resource failures; and (d) executes the applications. We executed a genomics application (i.e., SSEARCH, September 2014) to compare up to 24 biological sequences with the UniProtKB/Swiss-Prot database (September 2014) on two different cloud providers (i.e., Amazon EC2 and GCE), considering different scenarios (e.g., standalone (i.e., single cloud) and multiple clouds). Experimental results show that our system could transparently connect different clouds and configure the whole execution environment, requiring minimal user intervention. Moreover, by employing a hierarchical management organization (i.e., a hierarchical P2P overlay), our system was able to handle failures and to organize the nodes in a way that reduces inter-cloud communication.

1.5 Publications

1. Alessandro Ferreira Leite, Claude Tadonki, Christine Eisenbeis, Tainá Raiol, Maria Emilia M. T. Walter, and Alba Cristina Magalhães Alves de Melo. Excalibur: An autonomic cloud architecture for executing parallel applications. In 4th International Workshop on Cloud Data and Platforms, pages 2:1–2:6, Amsterdam, Netherlands, 2014.

2. Alessandro Ferreira Leite and Alba Cristina Magalhães Alves de Melo. Executing […]. In 19th International Conference on High Performance Computing (HiPC), pages 1-9, Bangalore, India, 2012.

3. Alessandro Ferreira Leite and Alba Cristina Magalhães Alves de Melo. Energy-aware multi-agent server consolidation in federated clouds. In Mazin Yousif and Lutz Schubert, editors, Cloud Computing, volume 112 of Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, pages 72–81. 2013

4. Alessandro Ferreira Leite, Hammurabi Chagas Mendes, Li Weigang, Alba Cristina Magalhães Alves de Melo, and Azzedine Boukerche. An architecture for P2P bag-of-tasks execution with multiple task allocation policies in desktop grids. Cluster Computing, 15(4), pages 351-361, 2012.

1.6 Thesis Outline

This thesis is organized in two parts: background and contributions.

In the first part, we present the concepts and recent developments in the domain of large-scale distributed systems. It comprises the following chapters:

chapter 2: we provide a historical perspective of concepts, mechanisms, and tools that are landmarks in the evolution of large-scale distributed systems. Then, we present the most representative types of these systems: clusters, grids, P2P systems, and clouds.

chapter 3: we discuss the practical aspects related to cloud computing, such as virtualization, service-level agreement (SLA), MapReduce, and cloud computing architectures.

chapter 4: we describe autonomic computing. First, we present the definition of autonomic systems, followed by the autonomic properties. Then, we present the concepts related to the architecture of autonomic systems. Finally, some autonomic systems for large-scale distributed systems are presented and compared.

chapter 5: we present the concept of energy-aware computing, providing information about green data centers, followed by a discussion about green performance indicators.

In the second part, we present the contributions of this thesis; it is organized as follows:

chapter 6: we present the first contribution of this thesis: a server consolidation approach to reduce power consumption in cloud federations. First, we present the proposed multi-agent system (MAS) server consolidation strategy for federated clouds. Then, the experimental results are discussed, followed by the related work in this area. Finally, we present final considerations.


chapter 7: in this chapter, we describe the second contribution of this thesis: our approach to execute the Smith-Waterman (SW) algorithm in a cloud federation at zero cost. First, we provide a brief introduction to biological sequence comparison, followed by a description of the Smith-Waterman algorithm. Next, we present the proposed architecture and the experimental results obtained in a public cloud federation scenario. Finally, we discuss some of the related works that have executed the SW algorithm on different platforms, followed by final considerations.

chapter 8: in this chapter, we describe the third contribution of this thesis: a cloud architecture that helps users reduce the execution time of cloud-unaware applications. First, we present our cloud architecture. Then, experimental results are discussed, followed by a discussion of similar cloud architectures available in the literature. Finally, we present final considerations.

chapter 9: this chapter presents the fourth contribution of this thesis: a model to handle the variabilities of IaaS clouds. First, it presents the motivation and challenges addressed by our model, followed by an overview of the multi-objective optimization problem (MOOP) and feature modeling. Next, the proposed model is presented. Then, it describes the experimental results, followed by the related work. Finally, it presents final considerations.

chapter 10: in this chapter, we present the fifth contribution of this thesis: an autonomic cloud architecture. First, we present the proposed architecture and its main components, followed by a description of its autonomic properties. Then, experimental results are discussed. Next, a comparative view of some important features of cloud architectures is presented. Finally, we present final considerations.

chapter 11: in this chapter, we summarize the overall manuscript, and we present directions for future work.


Large-Scale Distributed Systems

Contents

2.1 Evolution
    2.1.1 The 1960s
    2.1.2 The 1970s
    2.1.3 The 1980s
    2.1.4 The 1990s
    2.1.5 2000-2014
    2.1.6 Timeline
2.2 Cluster Computing
2.3 Grid Computing
    2.3.1 Architecture
2.4 Peer-to-peer
    2.4.1 Architecture
    2.4.2 Unstructured P2P Network
    2.4.3 Structured P2P Network
    2.4.4 Hybrid P2P Network
    2.4.5 Hierarchical P2P Network
    2.4.6 Comparative View of P2P Structures
2.5 Cloud Computing
    2.5.1 Characteristics
    2.5.2 Drawbacks


Over the years, we have observed a considerable increase in the demand for powerful computing infrastructures. This demand has been satisfied by aggregating resources connected through a network, forming large-scale distributed systems. These systems normally appear to their users as a single system or computer [88, 341]. One example is the Internet, where users rely on different services to communicate and to share information, without needing to know about the underlying computing infrastructure.

Large-scale distributed systems can be defined as systems that coordinate a large number of geographically distributed and mostly heterogeneous resources to deliver scalable services without a centralized control. Scalability is an important characteristic of large-scale distributed systems, since it guarantees that even if the number of users, the number of nodes, or the system workload increases, these systems can still deliver their services without a noticeable effect on performance or on administrative complexity [192]. Examples of such systems include SETI@home [370] and the LHC Computing Grid [33], among others.

In this context, this chapter presents the evolution of computing systems from the 1960s until today. Then, the most common large-scale systems are discussed: clusters, grids, P2P systems, and clouds.

2.1 Evolution

Nowadays, we observe an astonishing level of complexity, interoperability, reliability, and scalability of large-scale distributed systems. This is due to concepts, models, and techniques developed in the last 50 years in many research domains, including computer architecture, networking, and parallel/distributed computing. In this section, we present a general landscape of events and the main landmarks that contributed to our current development state. It must be noted that this is not intended to be an exhaustive list. In order to give a historical perspective, this section is organized in subsections that describe the main developments in each decade.

2.1.1 The 1960s

The 1960s was a period of innovative ideas that could not become reality because of several technological barriers. It was also the decade of the first developments in the areas of supercomputing and networking.

In the 1960s, some researchers conceived the idea of the computer as a utility. They envisioned a world where the computer was not an expensive machine restricted to some organizations, but a public utility like the telephone system. This idea was presented by John McCarthy in 1961 in a talk at the MIT Centennial [132]. In 1962, J. C. R. Licklider proposed the Galactic Network concept [226], in which a set of globally interconnected computers could be used by anyone to quickly access data and programs. Even though these ideas became popular in the 1960s, many technological barriers prevented their implementation at that time.

Still in the 1960s, the CDC 6000 series was designed by Seymour Cray as the first supercomputer, composed of multiple functional units and one or two CPUs. This was a multiple instruction multiple data (MIMD) machine, according to Flynn's categorization [118]. In the late 1960s, the CDC 8600 had 4 CPUs, with a shared memory design.

In 1962, the first wide-area network experiment was made, connecting a computer in Massachusetts to a computer in California, using telephone lines. In 1969, the ARPANET was created, aiming to connect computing nodes in several Universities in the US, making use of the recently developed packet switching theory.

In 1965, IBM announced the System/360 Model 67, introducing the usage of a software layer called virtual machine monitor (VMM) that enabled the same computer to be shared among several operating systems and computing environments [138, 146].

The concepts involved in the client/server model were first proposed in 1969, where the server was named server-host and the client was named using-host. In this early model, both client and server were physical machines.

2.1.2 The 1970s

The 1970s can be seen as the decade where some ideas proposed in the 1960s started to be implemented. One of the main landmarks of this decade is the Internet.

In the early 1970s, Seymour Cray left CDC to create his own company, using a different approach to build supercomputers. The idea was to exploit the single instruction multiple data (SIMD) categorization [118], named vector processing at that time, which replicated the execution units instead of the whole CPU. The Cray-1 machine, released in 1976, could operate on blocks of 64 operands (SIMD capability), attaining 250 MFlops (millions of floating-point operations per second).

Still in the 1970s, computers started to be connected by networks, using the ARPANET. In 1972, a large demonstration of the ARPANET took place, using electronic mail as the main application. In addition, the concept of open-architecture networking was proposed, where individual networks might be separately designed and then connected by a standard protocol. At DARPA, a packet radio program that applied the open-architecture concept was called Internetting. In 1973, the first Ethernet network was installed at the Xerox Palo Alto Research Center. The Telnet (remote login) and FTP (file transfer) protocols were proposed in 1972 and 1973, respectively. In 1974, Cerf and Kahn published their paper presenting the TCP protocol [72], and the term Internet was coined to describe a global TCP/IP network.

The combination of the time-sharing vision and the decentralization of computer infrastructures led to the concept of distributed systems in the late 1970s. A distributed system can be defined as a collection of autonomous computers connected by a network that appears to its users as a single system. In such a system, computers communicate with each other using messages sent over the network, and this communication process is hidden from the users [88, 134, 341]. Transparency is a very important point in this definition: it means that users and applications should interact with a distributed system in the same way as they interact with a standalone system. Other important characteristics are scalability, availability, and interoperability, which, respectively, enable distributed systems to be relatively easy to expand or scale; to be continuously available, even when some parts are unavailable; and to establish communication among different hardware and software platforms.


2.1.3 The 1980s

The 1980s saw rapid development in supercomputing, networking, and distributed systems. The World-Wide Web (WWW) was proposed in this decade.

In the 1980s, Cray continued to release supercomputers based on vector processing and, in 1988, the first supercomputer to attain a sustained rate of 1 gigaflop was a Cray Y-MP8, composed of 8 CPUs, each capable of operating simultaneously on 64 operands. Still in the 1980s, several companies focused on shared-memory MIMD supercomputers. The company Thinking Machines built the CM-2 (Connection Machine) hypercube supercomputer, one of the first so-called massively parallel processors (MPPs), composed of 65,536 one-bit processing elements with local memory. In 1989, the CM-2 with 65,536 processing elements attained 2.5 gigaflops.

In 1983, the TCP/IP protocol was adopted by the ARPANET, replacing the previous NCP protocol. In 1985, the TCP/IP protocol was used by several networks, including the USENET, BITNET and NSFNET. In 1989, Berners-Lee [39, 40] proposed a new protocol based on hypertext. This protocol would become the World-Wide Web (WWW) in 1991.

With the advent of personal computers (PCs) and workstations, the client-server model gained popularity in the late 1980s, being mainly used for file sharing. Still in the 1980s, important concepts in distributed systems such as event ordering, logical clocks, global states, Byzantine faults and leader election were proposed [88].

In the 1980s, it became clear that a distributed system must comprise different providers and that it should be independent of the communication technology. Initially, this was a difficult objective to achieve, since communication in a distributed system was mostly implemented using low-level socket primitives. Socket programming is complex and requires a deep understanding of the underlying network protocols. To overcome these difficulties, the remote procedure call (RPC) was proposed in 1983 [47], enabling functions that belong to the same program to be performed by remote computers, as if they were running locally. This represented a great advancement, and RPC was the basis for CORBA and Web services.
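To illustrate the RPC idea, the minimal sketch below uses Python's standard xmlrpc module (chosen here only for brevity; it is, of course, not the mechanism of the original 1983 proposal): the client invokes add() as if it were a local function, while the call is actually executed by a server that could reside on a remote machine.

import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

def add(a, b):
    # procedure exported by the "remote" side
    return a + b

# Server side: expose add() over the network.
server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(add, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the remote procedure is invoked as if it were local.
proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
print(proxy.add(2, 3))  # prints 5, computed by the server

For brevity, both sides run in the same process here; in a real deployment, the server and the client would run on different hosts.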

2.1.4 The 1990s

In the 1990s, great technological advances and intelligent design choices made it possible to break the teraflop performance barrier. It was also the decade of the widespread adoption of the Internet. Metacomputing, grid computing, and cloud computing were proposed in the 1990s. In this decade, researchers also began to consider the use of virtual machines to overcome some limitations of the x86 architecture and operating systems [301].

The idea of cluster computing became very popular in the 1990s. Clusters consisted of commodity components connected in order to build a supercomputer. In 1993, the Beowulf project was able to connect 8 PCs (DX4 processors) with 10 Mbit/s Ethernet, using the Linux operating system. The number of PCs connected grew quickly, and the 10 Mbit/s Ethernet was replaced by Fast Ethernet. One of the reasons for the success of clusters was the development of message-passing programming environments such as the parallel virtual machine (PVM) and the message passing interface (MPI). In 1997, clusters were included among the fastest 500 machines in the Top 500 list (top500.org), with a sustained performance of 10 gigaflops.
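As an illustration of the message-passing style enabled by such environments, the short sketch below shows an MPI-style exchange in which every process sends a message to process 0. It is written with the mpi4py Python binding purely for readability (an assumption made here; the classic PVM/MPI programs of the 1990s were written in C or Fortran), and would typically be launched with a command such as mpiexec -n 4 python hello.py.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # identifier of this process
size = comm.Get_size()   # total number of processes in the job

if rank == 0:
    # process 0 collects one message from every other process
    for source in range(1, size):
        print(comm.recv(source=source, tag=0))
else:
    comm.send("hello from rank %d of %d" % (rank, size), dest=0, tag=0)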
