
Kaapi / Charm++ preliminary comparison

Xavier Besseron (1), Thierry Gautier (1)

Gengbin Zheng (2), Laxmikant Kalé (2)

(1) MOAIS Project – LIG and INRIA

(2) University of Illinois at Urbana-Champaign

22-24 June 2010, Bordeaux

(2)

Outline

1 Qualitative comparison

2 NUMA execution

3 Distributed execution

4 Fault-tolerance

5 Preliminary Conclusion

(3)

Objective: efficiency

90% efficiency on 10,000 cores ≡ 1,000 cores unused

strong scalability (fixed problem size)
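The arithmetic behind the efficiency objective can be sketched as follows. This is a minimal illustration; the function names are hypothetical, and efficiency is taken in the usual strong-scaling sense (speedup divided by core count), which is an assumption about the definition the slides use.

```python
def parallel_efficiency(t_seq: float, t_par: float, cores: int) -> float:
    """Strong-scaling efficiency: speedup (t_seq / t_par) divided by the core count."""
    return t_seq / (t_par * cores)


def unused_core_equivalent(cores: int, efficiency: float) -> float:
    """Compute capacity lost to inefficiency, expressed as a number of cores."""
    return cores * (1.0 - efficiency)


# 90% efficiency on 10,000 cores wastes the equivalent of about 1,000 cores.
print(round(unused_core_equivalent(10_000, 0.90)))
```

The same relation explains why a small per-task overhead matters at scale: the lost capacity grows linearly with the number of cores.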

Comparison of middlewares for exascale architecture

Overhead of parallel programming API

overhead with respect to the execution time of sequential program

Scheduling

quality of scheduling algorithm

Fault tolerance

cost of fault tolerance protocol (coordinated checkpoint/rollback)

Cost of developing a parallel application

Parallel programming complexity

Adding fault tolerance to the application

Other criteria (smallest timestep, memory) for future comparisons

(4)

Overview

Charm++

Library/Compiler (C, C++, Fortran, ...)

Low level interface: Object based (Chares), message driven

High level interface: Structured dagger, Charisma, ...

Production tool

Kaapi

C++ library, template based, Fortran interface

High level interface: Data flow graph (Athapascan)

Runtime based on active messages + threads + scheduling algorithms ...

Research software

(5)

Which comparison?

Key points

Programming model

Preliminary work: skewed, because Charm++ is used through its low-level interface and Kaapi through its high-level interface

Future work: Charisma versus Kaapi

Performance of execution

Overhead with respect to execution time of a sequential program

Multi-core

Cluster and grid

Fault Tolerance capability

Checkpoint / Restart protocols: experimental costs

Methodology

"Same benchmark" using both frameworks + sequential version

Experimentation and Analysis

(6)

Programming models comparison

Charm++ (Low level interface: object based, message driven)

Simple: model and syntax close to the classic object model

++ Good documentation, many examples, stable

Easy to understand, easy to use

++ Several "production quality" applications

Large-scale experiments, including the FT protocol

++ Automatic dynamic load balancing

Required for irregular computation

+/– Non-deterministic execution

Results can depend on the reception order

Explicit communications, requiring some synchronizations

Source of bugs (race conditions)

Non-transparent fault tolerance

Requires source code modification
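Why results can depend on the reception order is easy to show with floating-point accumulation. The sketch below is a generic illustration in plain Python, not Charm++ code: it adds the same three contributions in every possible arrival order and collects the distinct totals.

```python
from itertools import permutations


def reduce_in_order(contributions):
    """Accumulate contributions in the order they are 'received'."""
    total = 0.0
    for value in contributions:
        total += value
    return total


messages = [0.1, 0.2, 0.3]  # hypothetical message payloads
results = {reduce_in_order(order) for order in permutations(messages)}

# Floating-point addition is not associative, so more than one total appears.
print(sorted(results))
```

A message-driven program that accumulates values as messages arrive therefore needs either a fixed reduction order or a tolerance when results are compared between runs.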

(7)

Programming models comparison

Kaapi (High level interface: macro data flow)

Complex/unusual programming model: macro data flow

++ Sequential semantics / implicit communications and synchronizations

Once the (sequential) program is written and runs on 1 (or 2) processors, very few bugs remain (in the application source code)

++ Automatic dynamic load balancing

++ Transparent fault tolerance

No modification of the source code

+/– Deterministic execution

Always the same results

Simple documentation, research software

Harder to understand and to start with

? Not a language: C++ template based

Compilation error messages are very ugly
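The idea of a macro data flow model with sequential semantics can be sketched generically: tasks are listed in sequential program order together with the data they read and write, and the dependencies are derived from those accesses. This is not the Kaapi or Athapascan API; the function, the task names, and the variable names are all hypothetical.

```python
def build_dependencies(tasks):
    """tasks: list of (name, reads, writes) in sequential program order.

    Returns {task name: set of earlier tasks it must wait for}.
    """
    last_writer = {}          # variable -> task that last wrote it
    readers_since_write = {}  # variable -> tasks that read it since that write
    deps = {}
    for name, reads, writes in tasks:
        wait_for = set()
        for var in reads:
            if var in last_writer:
                wait_for.add(last_writer[var])                 # read after write
        for var in writes:
            if var in last_writer:
                wait_for.add(last_writer[var])                 # write after write
            wait_for.update(readers_since_write.get(var, ()))  # write after read
        deps[name] = wait_for
        for var in reads:
            readers_since_write.setdefault(var, set()).add(name)
        for var in writes:
            last_writer[var] = name
            readers_since_write[var] = set()
    return deps


tasks = [
    ("init_x", [], ["x"]),
    ("init_b", [], ["b"]),
    ("sweep", ["x", "b"], ["x_new"]),
    ("residual", ["x_new", "b"], ["r"]),
    ("swap", ["x_new"], ["x"]),
]
deps = build_dependencies(tasks)
print(deps["sweep"], deps["swap"])
```

Tasks with no path between them in this graph (here `init_x` and `init_b`, or `residual` and `swap`) may run in parallel, while any execution that respects the edges produces the same result as the sequential program.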

(8)

Scheduling of Charm++ program

A Charm++ program ⇒ a set of Chares

Scheduler of message queue (priority, ...)

Dynamic Load balancing algorithm

GreedyLB, MetisLB, RecBisectBfLB, NeighborLB, ...

Based on automatic instrumentation of the application by the runtime
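As an illustration of measurement-based load balancing, here is a minimal sketch of a generic greedy heuristic: objects are sorted by measured load and each one is placed on the currently least-loaded processing element (PE). It is not the Charm++ GreedyLB implementation; the function name and the sample loads are assumptions.

```python
import heapq


def greedy_assign(loads, num_pes):
    """Place objects on PEs, heaviest first, each on the least-loaded PE so far.

    loads: {object name: measured load}. Returns (placement, per-PE load).
    """
    heap = [(0.0, pe) for pe in range(num_pes)]  # (current load, PE id)
    heapq.heapify(heap)
    placement = {}
    for obj, load in sorted(loads.items(), key=lambda kv: (-kv[1], kv[0])):
        pe_load, pe = heapq.heappop(heap)
        placement[obj] = pe
        heapq.heappush(heap, (pe_load + load, pe))
    pe_loads = [0.0] * num_pes
    for obj, pe in placement.items():
        pe_loads[pe] += loads[obj]
    return placement, pe_loads


# Hypothetical per-object loads, as a runtime might collect by instrumentation.
measured = {"a": 5.0, "b": 4.0, "c": 3.0, "d": 3.0, "e": 1.0}
placement, pe_loads = greedy_assign(measured, num_pes=2)
print(placement, pe_loads)
```

A heuristic of this kind ignores communication between objects, which is one reason a runtime may offer several strategies.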

(9)

Scheduling of Kaapi program

A Kaapi program ⇒ a data flow graph

KAAPI scheduler: work stealing + affinity

[Figure: Kaapi runtime diagram. Labels: data flow graph (abstract representation), graph partitioning, K-Thread, K-Processor, processes, OS scheduler, CPUs, active messages over TCP/IP and Myrinet.]
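Work stealing can be illustrated with a small single-threaded simulation. This is a sketch of the general technique only, not the KAAPI scheduler: affinity is ignored, tasks are flat (they spawn no children), and a thief runs the stolen task immediately. All names are hypothetical.

```python
from collections import deque
import random


def simulate_work_stealing(tasks, num_workers, seed=0):
    """Round-based simulation: each worker runs one task per round."""
    rng = random.Random(seed)
    queues = [deque() for _ in range(num_workers)]
    queues[0].extend(tasks)                   # all work initially sits on worker 0
    executed = [[] for _ in range(num_workers)]
    steals = 0
    while any(queues):                        # an empty deque is falsy
        for worker in range(num_workers):
            if queues[worker]:
                task = queues[worker].pop()            # owner takes its newest task
            else:
                victims = [v for v in range(num_workers) if queues[v]]
                if not victims:
                    continue                           # nothing left to steal
                victim = rng.choice(victims)
                task = queues[victim].popleft()        # thief takes the oldest task
                steals += 1
            executed[worker].append(task)
    return executed, steals


executed, steals = simulate_work_stealing(list(range(16)), num_workers=4)
print([len(done) for done in executed], "steals:", steals)
```

The owner and the thieves work on opposite ends of the queue, which is the usual way to limit contention between them.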

(10)

The benchmark: Poisson3D

Problem

Solve the Poisson equation on a 3D domain: A × X = B

Parallelized by domain decomposition

Jacobi iterative method

$$x_i^{k+1} = \frac{1}{a_{ii}} \left( b_i - \sum_{j=1,\, j \neq i}^{N} a_{ij}\, x_j^{k} \right) \quad \text{for } i \text{ from } 1 \text{ to } N$$

Benchmark data

X, X’: computation domain & temporary computation domain

A: Jacobi stencil (sparse matrix)

B: right-hand side

S: solution (here the solution is known to check the result)
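The Jacobi update, in which each new x_i is (b_i minus the off-diagonal sum of a_ij x_j) divided by a_ii, can be sketched on a small dense system. This is a minimal illustration, not the benchmark kernel: the benchmark uses a sparse 3D stencil, and the 3×3 matrix below is an assumption chosen to be diagonally dominant so that the iteration converges.

```python
def jacobi_step(a, b, x):
    """One Jacobi sweep: build X' from X without overwriting X."""
    n = len(b)
    x_next = [0.0] * n
    for i in range(n):
        off_diag = sum(a[i][j] * x[j] for j in range(n) if j != i)
        x_next[i] = (b[i] - off_diag) / a[i][i]
    return x_next


def jacobi_solve(a, b, iterations):
    x = [0.0] * len(b)
    for _ in range(iterations):
        x = jacobi_step(a, b, x)  # X' becomes X for the next sweep
    return x


# Hypothetical test system with a known solution S, so the result can be checked.
a = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
s = [1.0, 2.0, 3.0]
b = [sum(a[i][j] * s[j] for j in range(3)) for i in range(3)]

x = jacobi_solve(a, b, iterations=50)
error = max(abs(xi - si) for xi, si in zip(x, s))
print(error)
```

Because every x_i of a sweep depends only on values from the previous sweep, the updates within one sweep are independent, which is what makes the method straightforward to parallelize by domain decomposition.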

(11)

Source code

Common code for "sequential" kernel: 671 lines

Same data structure and allocation

Lines of code (raw values)

                        Charm++   Kaapi
(De)serialization         105      125
Task encapsulation          -      160
Interface file (.ci)       50        -
Computation kernel        140       50
FT/LB                      50        -
Total                     483      481

Sequential program without decomposition: 50 lines

Sequential program with decomposition: 112 lines
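What "with decomposition" means can be sketched as splitting a 3D index space into an i×j×k grid of contiguous blocks. This is a hypothetical helper, not the benchmark's code; the 100×200×200 shape and the 4×8×8 grid are the values used in the experiments.

```python
def split_axis(length, parts):
    """Split range(length) into `parts` contiguous [start, stop) blocks of near-equal size."""
    base, extra = divmod(length, parts)
    bounds, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def decompose(shape, grid):
    """Return subdomains as ((x0, x1), (y0, y1), (z0, z1)) for an i x j x k grid."""
    xs, ys, zs = (split_axis(n, p) for n, p in zip(shape, grid))
    return [(bx, by, bz) for bx in xs for by in ys for bz in zs]


subdomains = decompose((100, 200, 200), (4, 8, 8))
volume = sum((x1 - x0) * (y1 - y0) * (z1 - z0)
             for (x0, x1), (y0, y1), (z0, z1) in subdomains)
print(len(subdomains), volume)
```

Each subdomain then becomes one unit of work (a chare or a task), and neighbouring subdomains exchange their boundary values between sweeps.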

(12)

Experiments

NUMA machine "idkoiff"

8 dual-core Opteron 875 processors, 2.2 GHz, 30 GB of memory, 8 memory banks

gcc/g++ 4.5. All programs compiled with -O4.

Charm++ 6.2.0 (multicore-linux64.tar.gz)

+p 1: one process

+ppn <n> : number of PEs per process

+setcpuaffinity : use CPU affinity

+excludecore <coreid> : to exclude some core

Kaapi 2.4 (http://kaapi.gforge.inria.fr)

Same mapping of threads onto cores for Charm++/Kaapi

using util.thread.cpuset

XKaapi 0.1beta

New C kernel for Kaapi. Currently only for multi-core machine.

Timings

Domain size: 4×10^6 doubles (100×200×200), split into i×j×k subdomains

1 run = average time of one iteration, measured over 50 iterations

The best average over 10 runs is reported
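The timing methodology (average over 50 iterations per run, best of 10 runs) can be sketched as a small harness. This is an assumed reconstruction, not the authors' measurement code; the `iterate` callback and the injectable `clock` parameter are hypothetical.

```python
import time


def best_average_iteration_time(iterate, iterations=50, runs=10,
                                clock=time.perf_counter):
    """Return the best (smallest) per-iteration average over several runs."""
    best = float("inf")
    for _ in range(runs):
        start = clock()
        for _ in range(iterations):
            iterate()
        average = (clock() - start) / iterations
        best = min(best, average)
    return best


# Usage with a placeholder workload standing in for one solver iteration.
elapsed = best_average_iteration_time(lambda: sum(range(1000)), iterations=5, runs=3)
print(elapsed > 0.0)
```

Taking the best run rather than the mean filters out interference from other activity on the machine, at the cost of reporting an optimistic figure.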

(13)

Execution time: 1 core (among 16 cores)

[Figure: one-iteration time (s) versus decomposition (1x1x1, 1x2x2, 2x2x2, 1x4x4, 2x4x4, 4x4x4, 4x8x8) for Charm++ SMP, Kaapi, X-Kaapi, sequential with decomposition, and sequential.]

(14)

Execution time: 8 cores (among 16 cores)

[Figure: one-iteration time (s) versus decomposition (1x1x1, 1x2x2, 2x2x2, 1x4x4, 2x4x4, 4x4x4, 4x8x8) for Charm++ SMP, Kaapi, X-Kaapi, and sequential.]

Little parallelism ⇒ slowdown due to the activity of the threads

(15)

Execution time: 16 cores (among 16 cores)

[Figure: one-iteration time (s) versus decomposition (1x1x1, 1x2x2, 2x2x2, 1x4x4, 2x4x4, 4x4x4, 4x8x8) for Charm++ SMP, Kaapi, X-Kaapi, and sequential.]

Same slowdown

(16)

Distributed execution: experiments

Execution platform

Two experiments on Grid’5000 clusters:

Cluster Griffon at Nancy: 64 nodes with 8 cores, i.e. 512 cores

Cluster GDX at Orsay: 283 nodes with 2 cores, i.e. 566 cores

Gigabit Ethernet network

Same Poisson 3D benchmark

Weak scalability: domain size = 4×10^6 doubles per core

Two cases for Kaapi and Charm++:

1 process per node (named SMP)

1 process per core

Charm++ 6.1.3, net-linux-x86_64[-smp]

Kaapi (git version 03/01/2010, branch besseron/master)
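Under weak scaling the global problem grows with the machine. The sketch below works out the global domain size for both clusters from the per-core size given above; the 8-byte size of a double and the helper names are assumptions, and the efficiency definition is the usual one for weak scaling (ideal behaviour is constant time).

```python
DOUBLES_PER_CORE = 4 * 10**6
BYTES_PER_DOUBLE = 8  # assumption: IEEE 754 double precision


def global_domain(cores):
    """Return (number of doubles, number of bytes) for one copy of the domain."""
    doubles = DOUBLES_PER_CORE * cores
    return doubles, doubles * BYTES_PER_DOUBLE


def weak_scaling_efficiency(t_one_core, t_p_cores):
    """Ideal weak scaling keeps the iteration time constant, giving 1.0."""
    return t_one_core / t_p_cores


for name, cores in (("Griffon", 512), ("GDX", 566)):
    doubles, size = global_domain(cores)
    print(name, doubles, "doubles,", size / 1e9, "GB per array")
```

On the plots, ideal weak scaling would appear as a flat line at the single-core iteration time.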

(17)

Distributed execution: cluster Griffon

Weak scalability: 4×10^6 doubles per core, 4 subdomains per core

Up to 512 cores (64 nodes of cluster Griffon at Nancy)

[Figure: one-iteration execution time (s) versus number of cores (0 to 600) for Kaapi, Kaapi SMP, Charm, Charm SMP, and sequential.]

(18)

Distributed execution: cluster GDX

Weak scalability: 4×10^6 doubles per core, 4 subdomains per core

Up to 566 cores (283 nodes of cluster GDX at Orsay)

[Figure: one-iteration execution time (s) versus number of cores (0 to 600) for Kaapi, Kaapi SMP, Charm, Charm SMP, and sequential.]

(19)

Checkpoint protocols in Charm++ and Kaapi

Charm++

Double blocking coordinated checkpoint on disk

double disk ≡ 2 copies: local and on a remote buddy processor

⇒ tolerates only 1 failure in the worst case

global consistency requires an explicit barrier in the code

Kaapi

Blocking coordinated checkpoint on disk

1 copy per process on a Kaapi checkpoint server

⇒ assumes that checkpoint servers are stable

global consistency is ensured by the checkpoint protocol

(20)

Experimental conditions

3D Poisson Benchmark

Cluster Griffon at Nancy: 80 nodes with 8 cores, i.e. 640 cores

Domain size = 10^6 doubles per core

Charm++ net-linux-x86_64-syncft (git version of 04/12/2010)

with three different load balancers: NullLB, GreedyLB, OrbLB

Kaapi with FT support (git version of 04/12/2010, branch besseron/master)

Remote buddy processors / Kaapi checkpoint servers

Located on the next node (according to the node list)

Charm++ PEs / Kaapi processors        Remote buddy processor / Checkpoint server
griffon-01                       →    griffon-02
...                                   ...
griffon-79                       →    griffon-80
griffon-80                       →    griffon-01
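The placement rule "next node according to the node list", wrapping around at the end, can be written in a few lines. This is a minimal sketch; the function name is hypothetical and the host names follow the pattern shown in the table.

```python
def buddy_map(nodes):
    """Map each node to the next one in the list, wrapping around at the end."""
    return {node: nodes[(i + 1) % len(nodes)] for i, node in enumerate(nodes)}


nodes = [f"griffon-{i:02d}" for i in range(1, 81)]
buddies = buddy_map(nodes)
print(buddies["griffon-01"], buddies["griffon-80"])
```

With this ring layout every node stores exactly one remote checkpoint, so the checkpoint traffic and storage are spread evenly across the nodes.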

(21)

Checkpoint time

80 nodes with 8 cores, i.e. 640 cores

Domain size = 10^6 doubles per core

[Figure: bar chart of checkpoint time (s) for Kaapi (202 MB/node), Charm++ NullLB (455 MB/node), and Charm++ GreedyLB (455 MB/node). Legend: send data, wait for acknowledgment, checkpoint to disk, checkpoint to memory. Totals shown on the bars: 16.0 s, 27.1 s, 5.4 s, 3.9 s, and 14.6 s.]

(22)

Restart protocols in Charm++ and Kaapi

Scenario

Domain size = 10^6 doubles per core

1 failure ⇒ restart on 79 nodes = 632 cores ⇒ load balancing

Restart protocols

Charm++: global restart + load balancing

Failed processes restart from their remote checkpoint

Alive processes restart from their local copy

Kaapi: global restart + rescheduling

All processes restart from their remote checkpoint

Rescheduling: static partitioning + data redistribution
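Why a restart on fewer nodes needs repartitioning can be sketched with a simple contiguous block partition. This is not the partitioner used by either runtime (the Kaapi slides mention graph partitioning); the count of 640 subdomains, the identity of the failed node, and the helper name are all assumptions made for illustration.

```python
from collections import Counter


def block_owner(index, num_items, num_parts):
    """Owner of item `index` in a contiguous block partition with near-equal blocks."""
    base, extra = divmod(num_items, num_parts)
    threshold = extra * (base + 1)  # items held by the `extra` larger blocks
    if index < threshold:
        return index // (base + 1)
    return extra + (index - threshold) // base


NUM_SUBDOMAINS = 640  # assumption: one subdomain per original core
FAILED = 42           # hypothetical failed node
survivors = [f"griffon-{i:02d}" for i in range(1, 81) if i != FAILED]

new_owner = {s: survivors[block_owner(s, NUM_SUBDOMAINS, len(survivors))]
             for s in range(NUM_SUBDOMAINS)}
load = Counter(new_owner.values())
print(len(load), min(load.values()), max(load.values()))
```

Since 640 does not divide evenly by 79, some survivors must take one extra subdomain, and the subdomains whose owner changes have to be moved, which corresponds to the data redistribution step.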

(23)

Restart and load balancing time

1 failure ⇒ restart on 79 nodes with 8 cores, i.e. 632 cores

[Figure: bar chart of restart and load-balancing time (s) for Kaapi, Charm++ OrbLB, Charm++ GreedyLB, and Charm++ NullLB. Legend: global restart, load balancing, static partitioning, data redistribution. Totals shown on the bars: 7.0 s, 6.8 s, 8.4 s, and 1.0 s.]

(24)

Conclusion of this preliminary work (1/2)

Two different levels

Charm++: low level API

Charisma will be the next candidate

Kaapi: high level API

Also recursive applications + work-stealing

Heterogeneous architecture (CPUs + GPUs)

Execution performance

NUMA: new XKaapi implementation performs well

comparison with NUMA features [Pousa, Méhaut (INRIA), Gioachin, Mei, Kalé (UIUC)] for Charm++

Distributed execution: SMP Charm++ execution is better

(25)

Conclusion of this preliminary work (2/2)

Fault tolerance

Checkpoint is the most costly part of the protocols

Charm++ tolerates only 1 failure in worst case without stable servers

Kaapi tolerates the failure of one cluster but requires "stable servers"

⇒ Stable memory issues

diskless checkpointing: k-duplication, erasure encoding, locality

Restart

Global restart is simple, but lost work can be large

Partial restart in Kaapi (based on the data flow graph)
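The erasure-encoding idea listed under diskless checkpointing can be illustrated with its simplest form, a single XOR parity block. This is a generic sketch, not a feature of Charm++ or Kaapi: it assumes equal-length checkpoints and tolerates the loss of exactly one of them. All names and payloads are hypothetical.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def recover(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Hypothetical in-memory checkpoints of four processes (equal length).
checkpoints = [b"state-A0", b"state-B1", b"state-C2", b"state-D3"]
parity = xor_blocks(checkpoints)

lost = 1
survivors = [c for i, c in enumerate(checkpoints) if i != lost]
print(recover(survivors, parity) == checkpoints[lost])
```

Compared with k-duplication, one parity block for a group of processes uses far less memory than a full copy of every checkpoint, at the price of extra computation and of tolerating fewer simultaneous failures per group.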
