HAL Id: hal-01283949
https://hal.inria.fr/hal-01283949
Submitted on 7 Mar 2016
Harnessing clusters of hybrid nodes with a sequential
task-based programming model
Emmanuel Agullo, Olivier Aumage, Mathieu Faverge, Nathalie Furmento,
Florent Pruvost, Marc Sergent, Samuel Thibault
To cite this version:
Emmanuel Agullo, Olivier Aumage, Mathieu Faverge, Nathalie Furmento, Florent Pruvost, et al. Harnessing clusters of hybrid nodes with a sequential task-based programming model. International Workshop on Parallel Matrix Algorithms and Applications (PMAA 2014), Jul 2014, Lugano, Switzerland. hal-01283949
Harnessing clusters of hybrid nodes with a sequential task-based programming model
Emmanuel Agullo, Olivier Aumage, Mathieu Faverge, Nathalie Furmento, Florent Pruvost, Marc Sergent, Samuel Thibault
PMAA Workshop, Università della Svizzera italiana, Lugano, Switzerland, July 3rd, 2014
Outline
1. Introduction
2. Sequential task-based paradigm on a single node
3. A new programming paradigm for clusters?
4. Distributed Data Management
5. Comparison against state-of-the-art approaches
6. Conclusion and future work
Introduction
Runtime systems usually abstract a single node
- Plasma/Quark, Flame/SuperMatrix, Morse/StarPU, Dplasma/PaRSEC, ...
How should nodes communicate?
- Using explicit MPI user calls
- Using a specific paradigm: Dplasma
Can we keep the same paradigm and let the runtime handle data transfers?
- Master-slave model (e.g. ClusterSs)
- Replicated unroll model (e.g. Quarkd)
Example: dense Cholesky factorization (DPOTRF)
Sequential task-based Cholesky on a single node

for (j = 0; j < N; j++) {
    POTRF (RW,A[j][j]);
    for (i = j+1; i < N; i++)
        TRSM (RW,A[i][j], R,A[j][j]);
    for (i = j+1; i < N; i++) {
        SYRK (RW,A[i][i], R,A[i][j]);
        for (k = j+1; k < i; k++)
            GEMM (RW,A[i][k], R,A[i][j], R,A[k][j]);
    }
}
task_wait_for_all();

[Task graph built from these submissions, with POTRF, TRSM, SYRK and GEMM tasks]
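As an aside not on the original slides, here is a minimal sketch of how one such pseudocall (POTRF) can be expressed with StarPU's C task-insertion API; the wrapper name potrf_cpu_func and the elided LAPACK call are assumptions:

#include <starpu.h>

/* CPU implementation wrapper: the runtime calls this on a CPU worker. */
void potrf_cpu_func(void *buffers[], void *cl_arg)
{
    double *tile = (double *)STARPU_MATRIX_GET_PTR(buffers[0]);
    /* ... factorize the tile, e.g. with LAPACKE_dpotrf ... */
    (void)tile; (void)cl_arg;
}

static struct starpu_codelet potrf_cl = {
    .cpu_funcs = { potrf_cpu_func },
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

/* Submitting one POTRF task on tile handle A_jj: the STARPU_RW access
   mode is what lets the runtime infer the dependencies shown above. */
void submit_potrf(starpu_data_handle_t A_jj)
{
    starpu_task_insert(&potrf_cl, STARPU_RW, A_jj, 0);
}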
Runtime parallel execution on a heterogeneous node
- Handles dependencies
- Handles scheduling (e.g. HEFT)
- Handles data consistency (MSI protocol)
[Execution diagram: POTRF, TRSM, SYRK and GEMM tasks dispatched on two CPU workers and two GPUs; tiles migrate between RAM and GPU memory]
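To make the dependency handling concrete, here is a minimal sketch, under our own assumptions rather than StarPU's actual internals, of how a runtime can infer dependencies from R/RW access modes at submission time; task_t, add_dependency and the fixed-size reader list are hypothetical:

/* Each piece of data remembers its last writer and the readers since
   then; a new reader depends on the last writer (RAW), and a new
   writer depends on the last writer (WAW) and on all readers (WAR). */
typedef struct task task_t;
void add_dependency(task_t *after, task_t *before); /* hypothetical */

struct data_state {
    task_t *last_writer;
    task_t *readers[64];
    int     nreaders;
};

void register_access(struct data_state *d, task_t *t, int is_write)
{
    if (d->last_writer)
        add_dependency(t, d->last_writer);     /* RAW or WAW */
    if (is_write) {
        for (int r = 0; r < d->nreaders; r++)
            add_dependency(t, d->readers[r]);  /* WAR */
        d->last_writer = t;
        d->nreaders = 0;
    } else {
        d->readers[d->nreaders++] = t;
    }
}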
A new programming paradigm for clusters?
- Inferring communications from the task graph
- How to establish the mapping?
- How to initiate communications?
[Diagram: task graph (POTRF, TRSM, SYRK, GEMM) split between node0 and node1, with Isend/Irecv on the cut edges]
Mapping: Which node executes which tasks?
The application provides the mapping.
[Diagram: matrix tiles distributed over node0, node1, node2 and node3]
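As a worked illustration of our own (though it matches the getnode function used later), a 2D block-cyclic mapping over p x q = 2 x 2 nodes assigns tile (i,j) to node (i%p)*q + j%q:

int p = 2, q = 2;
int getnode(int i, int j) { return (i % p) * q + j % q; }
/* getnode(0,0)=0  getnode(0,1)=1
   getnode(1,0)=2  getnode(1,1)=3
   getnode(2,0)=0  ... the pattern repeats every p rows and q columns */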
Data transfers between nodes
- All nodes unroll the whole task graph
- They determine the tasks they will execute
- They can infer the required communications
- No negotiation between nodes (not master-slave)
- Unrolling can be pruned
A sketch of this per-node decision logic is given below.
[Diagram: for the same tile, node 0's execution posts an Irecv while node 1's execution posts the matching Isend]
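Here is a minimal sketch of that logic, an assumption on our part rather than the runtime's actual code; owner(), my_rank, post_irecv, post_isend and submit_locally are hypothetical helpers:

/* Every node runs this same routine for every task while unrolling
   the graph; all decisions are local, so no negotiation is needed. */
typedef struct {
    int output_tile;
    int ninputs;
    int input_tile[3];
} task_t;

extern int my_rank;
int  owner(int tile);                    /* hypothetical: the mapping */
void post_irecv(int tile, int from);     /* hypothetical */
void post_isend(int tile, int to);       /* hypothetical */
void submit_locally(task_t *t);          /* hypothetical */

void consider_task(task_t *t)
{
    int exec_node = owner(t->output_tile);

    if (exec_node == my_rank) {
        /* This node executes the task: receive every remote input. */
        for (int i = 0; i < t->ninputs; i++)
            if (owner(t->input_tile[i]) != my_rank)
                post_irecv(t->input_tile[i], owner(t->input_tile[i]));
        submit_locally(t);
    } else {
        /* Another node executes it: send the inputs this node owns. */
        for (int i = 0; i < t->ninputs; i++)
            if (owner(t->input_tile[i]) == my_rank)
                post_isend(t->input_tile[i], exec_node);
    }
}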
Same paradigm for clusters (vs single node)
Almost the same code: an MPI communicator and a mapping function are added.

Passing the communicator to each task:

for (j = 0; j < N; j++) {
    POTRF (RW,A[j][j], WORLD);
    for (i = j+1; i < N; i++)
        TRSM (RW,A[i][j], R,A[j][j], WORLD);
    for (i = j+1; i < N; i++) {
        SYRK (RW,A[i][i], R,A[i][j], WORLD);
        for (k = j+1; k < i; k++)
            GEMM (RW,A[i][k], R,A[i][j], R,A[k][j], WORLD);
    }
}
task_wait_for_all();

Adding an explicit per-task mapping:

int getnode(int i, int j) { return (i%p)*q + j%q; }

for (j = 0; j < N; j++) {
    POTRF (RW,A[j][j], WORLD, getnode(j,j));
    for (i = j+1; i < N; i++)
        TRSM (RW,A[i][j], R,A[j][j], WORLD, getnode(i,j));
    for (i = j+1; i < N; i++) {
        SYRK (RW,A[i][i], R,A[i][j], WORLD, getnode(i,i));
        for (k = j+1; k < i; k++)
            GEMM (RW,A[i][k], R,A[i][j], R,A[k][j], WORLD, getnode(i,k));
    }
}
task_wait_for_all();

Or attaching the mapping to the data once:

int getnode(int i, int j) { return (i%p)*q + j%q; }
set_rank(A, getnode);

for (j = 0; j < N; j++) {
    POTRF (RW,A[j][j], WORLD);
    for (i = j+1; i < N; i++)
        TRSM (RW,A[i][j], R,A[j][j], WORLD);
    for (i = j+1; i < N; i++) {
        SYRK (RW,A[i][i], R,A[i][j], WORLD);
        for (k = j+1; k < i; k++)
            GEMM (RW,A[i][k], R,A[i][j], R,A[k][j], WORLD);
    }
}
task_wait_for_all();
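In StarPU-MPI terms, this pattern maps onto real entry points, starpu_mpi_data_register and starpu_mpi_task_insert; in the sketch below the handle array A_h, the tag_of helper and the reuse of potrf_cl from the earlier sketch are our assumptions:

#include <starpu_mpi.h>

int tag_of(int i, int j);                /* hypothetical tag scheme */
extern struct starpu_codelet potrf_cl;   /* from the earlier sketch */

/* Register each tile with a unique MPI tag and its owner rank; the
   runtime then infers, per task, which node executes it and which
   transfers are required. */
void register_tiles(int N, int p, int q,
                    starpu_data_handle_t A_h[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            starpu_mpi_data_register(A_h[i][j],
                                     tag_of(i, j),
                                     (i % p) * q + j % q);
}

void submit_potrf_mpi(starpu_data_handle_t A_jj)
{
    starpu_mpi_task_insert(MPI_COMM_WORLD, &potrf_cl,
                           STARPU_RW, A_jj, 0);
}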
Data transfers and cache system
Duplicate data transfers: let us avoid them.
- This means caching the received data
- And dropping it later:
  - When it is written to
  - When advised by the application
[Diagram: duplicate transfers of the same tile between node0, node1, node2 and node3 avoided by caching]
Cache flush

int getnode(int i, int j) { return (i%p)*q + j%q; }
set_rank(A, getnode);

for (j = 0; j < N; j++) {
    POTRF (RW,A[j][j], WORLD);
    for (i = j+1; i < N; i++)
        TRSM (RW,A[i][j], R,A[j][j], WORLD);
    flush(A[j][j]);
    for (i = j+1; i < N; i++) {
        SYRK (RW,A[i][i], R,A[i][j], WORLD);
        for (k = j+1; k < i; k++)
            GEMM (RW,A[i][k], R,A[i][j], R,A[k][j], WORLD);
        flush(A[i][j]);
    }
}
flush_all();
task_wait_for_all();
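For reference, StarPU-MPI exposes these operations through an actual cache API; the handle names below are assumptions:

/* Inside the loop, corresponding to flush() and flush_all() above: */
starpu_mpi_cache_flush(MPI_COMM_WORLD, A_h[j][j]);  /* drop cached copies of one tile */
starpu_mpi_cache_flush_all_data(MPI_COMM_WORLD);    /* drop all cached copies */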
Cache effect, Cholesky factorization, tile size 320
[Plot: GFlop/s vs. matrix size N (up to 100000), comparing 16 nodes with cache, 16 nodes without cache, 16 nodes with cache + flush, and 1 node]

Interaction with StarPU
StarPU-MPI is actually a separate library on top of StarPU
MPI communications are technically very much like tasks running on a CPU
- StarPU automatically fetches data from the GPU as needed
- StarPU automatically pushes data to the GPU as needed
A sketch of this chaining is given below.
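Here is a minimal sketch, under our assumptions, of why a send behaves like a CPU task: it is chained on data availability in main memory via starpu_data_acquire_cb (a real StarPU call), and StarPU fetches the tile back from GPU memory if needed before the callback runs; send_ctx and its fields are hypothetical:

#include <starpu.h>
#include <mpi.h>

struct send_ctx {
    starpu_data_handle_t handle;
    int size, dst, tag;
    MPI_Comm comm;
    MPI_Request req;
};

/* Fires once the data is valid in RAM; StarPU has already done any
   GPU-to-RAM transfer by this point. */
static void send_when_ready(void *arg)
{
    struct send_ctx *c = arg;
    void *ptr = starpu_data_get_local_ptr(c->handle);
    MPI_Isend(ptr, c->size, MPI_BYTE, c->dst, c->tag, c->comm, &c->req);
}

void send_tile(struct send_ctx *c)
{
    /* Acquire in read mode; the callback runs when the data is ready. */
    starpu_data_acquire_cb(c->handle, STARPU_R, send_when_ready, c);
}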
Communication engine on top of MPI
Post all MPI receptions as soon as possible
- Using MPI tags to sort out data
- Can lead to thousands of MPI posts
  - Some MPI implementations are not ready for that, and can even deadlock
Multiplex the tags ourselves
- Implement data tags ourselves
- Only one MPI reception posted at a time
- Still in progress
Could use some active-message library (a sketch of the multiplexing idea follows).
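The sketch below is our assumption of such a scheme, not StarPU-MPI's actual code: a single envelope reception is kept posted on a fixed MPI tag, the envelope carries the application-level data tag, and the payload is received afterwards; buffer_for is a hypothetical helper:

#include <mpi.h>

#define ENVELOPE_TAG 0
#define PAYLOAD_TAG  1

struct envelope { long data_tag; long size; };

void *buffer_for(long data_tag, long size); /* hypothetical */

void progress_loop(MPI_Comm comm)
{
    struct envelope env;
    MPI_Request req;
    /* Only one reception posted at a time, whatever the data tag. */
    MPI_Irecv(&env, sizeof env, MPI_BYTE, MPI_ANY_SOURCE,
              ENVELOPE_TAG, comm, &req);
    for (;;) {
        int done; MPI_Status st;
        MPI_Test(&req, &done, &st);
        if (done) {
            void *buf = buffer_for(env.data_tag, env.size);
            MPI_Recv(buf, env.size, MPI_BYTE, st.MPI_SOURCE,
                     PAYLOAD_TAG, comm, MPI_STATUS_IGNORE);
            /* Repost the single envelope reception. */
            MPI_Irecv(&env, sizeof env, MPI_BYTE, MPI_ANY_SOURCE,
                      ENVELOPE_TAG, comm, &req);
        }
        /* ... interleave with other runtime progress ... */
    }
}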
Experimental Setup on TGCC CEA Curie
Double-precision Cholesky
- Scalapack
- Dplasma/PaRSEC
- Magma-morse/StarPU
64 nodes
- 2 Intel Westmere @ 2.66 GHz (8 cores per node)
- 2 Nvidia Tesla M2090 (2 GPUs per node)
Homogeneous tile size: 192x192
Heterogeneous tile sizes: 320x320 / 960x960
[Diagram: tilings with nb = 192, 320, 960, ...]