Harnessing clusters of hybrid nodes with a sequential task-based programming model

HAL Id: hal-01283949

https://hal.inria.fr/hal-01283949

Submitted on 7 Mar 2016


Harnessing clusters of hybrid nodes with a sequential task-based programming model

Emmanuel Agullo, Olivier Aumage, Mathieu Faverge, Nathalie Furmento, Florent Pruvost, Marc Sergent, Samuel Thibault

To cite this version:

Emmanuel Agullo, Olivier Aumage, Mathieu Faverge, Nathalie Furmento, Florent Pruvost, et al. Harnessing clusters of hybrid nodes with a sequential task-based programming model. International Workshop on Parallel Matrix Algorithms and Applications (PMAA 2014), Jul 2014, Lugano, Switzerland. ⟨hal-01283949⟩


Harnessing clusters of hybrid nodes with a sequential task-based programming model

Emmanuel Agullo, Olivier Aumage, Mathieu Faverge, Nathalie Furmento, Florent Pruvost, Marc Sergent, Samuel Thibault

PMAA Workshop, Università della Svizzera italiana, Lugano, Switzerland, July 3rd, 2014


Outline

1. Introduction

2. Sequential task-based paradigm on a single node

3. A new programming paradigm for clusters?

4. Distributed Data Management

5. Comparison against state-of-the-art approaches

6. Conclusion and future work

Introduction

Runtime systems usually abstract a single node
- Plasma/Quark, Flame/SuperMatrix, Morse/StarPU, Dplasma/PaRSEC, ...

How should nodes communicate?
- Using explicit MPI user calls
- Using a specific paradigm: Dplasma

Can we keep the same paradigm and let the runtime handle data transfers?
- Master-slave model (e.g. ClusterSs)
- Replicated unroll model (e.g. Quarkd)

Example: dense Cholesky factorization (DPOTRF)

Sequential task-based Cholesky on a single node

for (j = 0; j < N; j++) {
    POTRF (RW, A[j][j]);
    for (i = j+1; i < N; i++)
        TRSM (RW, A[i][j], R, A[j][j]);
    for (i = j+1; i < N; i++) {
        SYRK (RW, A[i][i], R, A[i][j]);
        for (k = j+1; k < i; k++)
            GEMM (RW, A[i][k], R, A[i][j], R, A[k][j]);
    }
}
task_wait_for_all();

[Figure: the task graph built from the submitted POTRF, TRSM, SYRK and GEMM tasks]
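For concreteness, here is a minimal sketch of how one of the submissions above could look with StarPU's task-insertion API, on which this work builds; the potrf_cl codelet and the A_handle() accessor are assumed helpers, not part of the original code.

    #include <starpu.h>

    extern struct starpu_codelet potrf_cl;              /* wraps the POTRF kernel */
    extern starpu_data_handle_t A_handle(int i, int j); /* registered handle of tile A[i][j] */

    /* Counterpart of the abstract POTRF(RW, A[j][j]) call: the runtime infers
     * dependencies from the (access mode, data handle) pairs it is given. */
    void submit_potrf(int j)
    {
        starpu_task_insert(&potrf_cl,
                           STARPU_RW, A_handle(j, j),
                           0);
    }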

Runtime parallel execution on a heterogeneous node

The runtime:
- handles dependencies
- handles scheduling (e.g. HEFT)
- handles data consistency (MSI protocol)

[Figure: timelines of two CPU workers and two GPU workers (CPU, CPU, GPU0, GPU1) executing the POTRF/TRSM/SYRK/GEMM task graph up to task_wait_for_all(), with tiles transferred between RAM and GPU0 memory]
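A brief sketch of how a kernel is typically exposed to the runtime so that the scheduler can choose between CPU cores and GPUs; this follows StarPU's codelet structure, with the potrf_cpu and potrf_cuda wrappers assumed for illustration.

    #include <starpu.h>

    void potrf_cpu(void *buffers[], void *cl_arg);   /* e.g. LAPACK dpotrf on the tile */
    void potrf_cuda(void *buffers[], void *cl_arg);  /* e.g. a GPU dpotrf on the tile  */

    /* One codelet, several implementations: the scheduler (e.g. HEFT) picks a
     * CPU core or a GPU for each task, and the runtime moves the tile accordingly. */
    struct starpu_codelet potrf_cl = {
        .cpu_funcs  = { potrf_cpu },
        .cuda_funcs = { potrf_cuda },
        .nbuffers   = 1,             /* one tile accessed per task */
        .modes      = { STARPU_RW }, /* in read-write mode         */
    };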

A new programming paradigm for clusters?

Inferring communications from the task graph:
- How to establish the mapping?
- How to initiate communications?

[Figure: the task graph split between node0 and node1, with matching Isend/Irecv operations on the edges that cross the node boundary]

Mapping: Which node executes which tasks?

The application provides the mapping.

[Figure: the tiled matrix and its task graph distributed over node0, node1, node2 and node3]
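As a concrete illustration of such a mapping, here is the 2D block-cyclic owner function used later in the talk, written with the p x q process-grid dimensions as explicit parameters (an assumption made for this sketch), together with a worked example.

    /* Owner of tile (i, j) on a p x q process grid (2D block-cyclic).
     * Example: with p = q = 2, tile (3, 5) goes to node (3%2)*2 + 5%2 = 3. */
    static int getnode(int i, int j, int p, int q)
    {
        return (i % p) * q + (j % q);
    }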

Data transfers between nodes

- All nodes unroll the whole task graph
- They determine the tasks they will execute
- They can infer the required communications (a sketch of this rule follows below)
- No negotiation between nodes (not master-slave)
- Unrolling can be pruned

[Figure: node 0's execution posts an Irecv for a remotely owned tile, while node 1, unrolling the same graph, posts the matching Isend]
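A minimal sketch of the inference rule just described, using purely hypothetical helper names (my_rank, owner_of, post_irecv, post_isend, enqueue_locally): each node walks the same graph, keeps only its own tasks, and turns every dependency that crosses a node boundary into a non-blocking send or receive, without any negotiation.

    /* All helper names are placeholders for illustration, not runtime entry points. */
    void unroll_task(struct task *t)
    {
        int me = my_rank();
        for (int d = 0; d < t->ninputs; d++) {
            int owner = owner_of(t->input[d]);
            if (t->node == me && owner != me)
                post_irecv(t->input[d], owner);   /* I run the task but miss the data */
            else if (t->node != me && owner == me)
                post_isend(t->input[d], t->node); /* I own data needed remotely       */
        }
        if (t->node == me)
            enqueue_locally(t);                   /* otherwise the task itself is skipped */
    }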

Same paradigm for clusters (vs. single node)

Almost the same code, plus an MPI communicator and a mapping function. Starting from the single-node version:

for (j = 0; j < N; j++) {
    POTRF (RW, A[j][j]);
    for (i = j+1; i < N; i++)
        TRSM (RW, A[i][j], R, A[j][j]);
    for (i = j+1; i < N; i++) {
        SYRK (RW, A[i][i], R, A[i][j]);
        for (k = j+1; k < i; k++)
            GEMM (RW, A[i][k], R, A[i][j], R, A[k][j]);
    }
}
task_wait_for_all();

Same paradigm for clusters (vs. single node)

First addition: the MPI communicator (WORLD) passed to each task submission:

for (j = 0; j < N; j++) {
    POTRF (RW, A[j][j], WORLD);
    for (i = j+1; i < N; i++)
        TRSM (RW, A[i][j], R, A[j][j], WORLD);
    for (i = j+1; i < N; i++) {
        SYRK (RW, A[i][i], R, A[i][j], WORLD);
        for (k = j+1; k < i; k++)
            GEMM (RW, A[i][k], R, A[i][j], R, A[k][j], WORLD);
    }
}
task_wait_for_all();

Same paradigm for clusters (vs. single node)

Second addition: the mapping function (here a 2D block-cyclic distribution over a p x q grid), with the executing node passed explicitly to each task:

int getnode(int i, int j) { return (i%p)*q + j%q; }

for (j = 0; j < N; j++) {
    POTRF (RW, A[j][j], WORLD, getnode(j,j));
    for (i = j+1; i < N; i++)
        TRSM (RW, A[i][j], R, A[j][j], WORLD, getnode(i,j));
    for (i = j+1; i < N; i++) {
        SYRK (RW, A[i][i], R, A[i][j], WORLD, getnode(i,i));
        for (k = j+1; k < i; k++)
            GEMM (RW, A[i][k], R, A[i][j], R, A[k][j], WORLD, getnode(i,k));
    }
}
task_wait_for_all();

Same paradigm for clusters (vs. single node)

Alternatively, the mapping function is attached to the data once with set_rank(), and task placement is then deduced from the mapping of the data each task accesses:

int getnode(int i, int j) { return (i%p)*q + j%q; }
set_rank(A, getnode);

for (j = 0; j < N; j++) {
    POTRF (RW, A[j][j], WORLD);
    for (i = j+1; i < N; i++)
        TRSM (RW, A[i][j], R, A[j][j], WORLD);
    for (i = j+1; i < N; i++) {
        SYRK (RW, A[i][i], R, A[i][j], WORLD);
        for (k = j+1; k < i; k++)
            GEMM (RW, A[i][k], R, A[i][j], R, A[k][j], WORLD);
    }
}
task_wait_for_all();
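To the best of my knowledge of StarPU-MPI, the abstract calls above correspond roughly to the following; the getnode(), tile_tag() and A_handle() helpers and the potrf_cl codelet are assumptions for illustration, not the authors' actual code.

    #include <starpu_mpi.h>

    extern struct starpu_codelet potrf_cl;
    extern starpu_data_handle_t A_handle(int i, int j);
    extern int getnode(int i, int j);  /* the 2D block-cyclic mapping above   */
    extern int tile_tag(int i, int j); /* a unique MPI data tag per tile      */

    /* Role of set_rank(A, getnode): attach an owner (and a tag) to every tile. */
    void register_tile(int i, int j)
    {
        starpu_mpi_data_register(A_handle(i, j), tile_tag(i, j), getnode(i, j));
    }

    /* Role of POTRF(RW, A[j][j], WORLD): the task runs on the node owning the
     * tile it writes to, and the required transfers are inferred automatically. */
    void submit_potrf(int j)
    {
        starpu_mpi_task_insert(MPI_COMM_WORLD, &potrf_cl,
                               STARPU_RW, A_handle(j, j),
                               0);
    }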

Data transfers and cache system

The same tile may otherwise be transferred several times: let us avoid these duplicate transfers.
This means caching the received data and dropping it later:
- when it is written to
- as advised by the application

[Figure: tiles exchanged among node0, node1, node2 and node3, with repeated transfers of the same tile avoided by the cache]

Cache flush

Flush calls advise the runtime that a cached tile will no longer be reused:

int getnode(int i, int j) { return (i%p)*q + j%q; }
set_rank(A, getnode);

for (j = 0; j < N; j++) {
    POTRF (RW, A[j][j], WORLD);
    for (i = j+1; i < N; i++)
        TRSM (RW, A[i][j], R, A[j][j], WORLD);
    flush(A[j][j]);
    for (i = j+1; i < N; i++) {
        SYRK (RW, A[i][i], R, A[i][j], WORLD);
        for (k = j+1; k < i; k++)
            GEMM (RW, A[i][k], R, A[i][j], R, A[k][j], WORLD);
        flush(A[i][j]);
    }
}
flush_all();
task_wait_for_all();
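A minimal sketch of what flush() means for the cache described on the previous slide, with purely hypothetical names (owner_of, my_rank, cached, drop_replica): nodes that merely cached a received tile drop their replica once it will not be read again, while the owning node keeps the valid copy.

    /* Illustrative only; these are not actual runtime entry points. */
    void flush_tile(struct tile *t)
    {
        if (owner_of(t) != my_rank() && cached(t))
            drop_replica(t);  /* frees memory; a later reuse would trigger a new transfer */
    }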

Cache effect, Cholesky factorization, tile size 320

[Figure: GFlop/s vs. matrix size N (up to 100000) on 16 nodes, comparing the cache, no-cache and cache + flush variants against a single node]

Interaction with StarPU

StarPU-MPI is actually a separate library on top of StarPU

MPI communications are technically very much like tasks running on CPU
- StarPU automatically fetches data from the GPU as needed
- StarPU automatically pushes data to the GPU as needed
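A sketch of the "communications behave like CPU tasks" point, assuming StarPU's starpu_data_acquire_cb() callback mechanism; the do_send() body is only outlined, and the request struct is an illustration rather than StarPU-MPI's actual implementation.

    #include <stdlib.h>
    #include <starpu.h>

    struct send_req { starpu_data_handle_t handle; int dst; };

    static void do_send(void *arg)   /* fires once the data sits in main memory */
    {
        struct send_req *r = arg;
        /* ... post an MPI_Isend of the buffer behind r->handle to rank r->dst,
         *     then starpu_data_release(r->handle) once the send completes ... */
        free(r);
    }

    /* Acquiring in read mode makes StarPU fetch the tile from a GPU if needed,
     * exactly like satisfying the dependencies of a CPU task. */
    void send_when_ready(starpu_data_handle_t handle, int dst)
    {
        struct send_req *r = malloc(sizeof *r);
        r->handle = handle;
        r->dst = dst;
        starpu_data_acquire_cb(handle, STARPU_R, do_send, r);
    }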

Communication engine on top of MPI

Post all MPI receptions as soon as possible
- using MPI tags to sort out the data
- can lead to thousands of posted MPI requests

Some MPI implementations are not ready for that, and can even deadlock

Multiplex the tags ourselves (sketched below)
- implement data tags ourselves
- only one MPI reception at a time
- still in progress

Could use an active-message library.
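A rough sketch of the tag-multiplexing idea above; since the approach is described as still in progress, this is only an illustration under stated assumptions (the ENVELOPE_TAG and PAYLOAD_TAG constants and the buffer_for() helper are hypothetical): a single envelope reception at a time identifies which data comes next, and only then is the matching payload reception posted.

    #include <mpi.h>

    #define ENVELOPE_TAG 0
    #define PAYLOAD_TAG  1

    struct envelope { long data_tag; int size; };       /* application-level data tag */

    extern void *buffer_for(long data_tag, int size);   /* assumed lookup/allocation  */

    void progress_one_reception(MPI_Comm comm)
    {
        struct envelope env;
        MPI_Status st;

        /* One outstanding reception instead of thousands of posted MPI_Irecv. */
        MPI_Recv(&env, (int)sizeof env, MPI_BYTE, MPI_ANY_SOURCE, ENVELOPE_TAG,
                 comm, &st);
        MPI_Recv(buffer_for(env.data_tag, env.size), env.size, MPI_BYTE,
                 st.MPI_SOURCE, PAYLOAD_TAG, comm, MPI_STATUS_IGNORE);
    }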

Experimental setup on TGCC CEA Curie

Double-precision Cholesky
- Scalapack
- Dplasma/PaRSEC
- Magma-morse/StarPU

64 nodes
- 2 Intel Westmere @ 2.66 GHz (8 cores per node)
- 2 Nvidia Tesla M2090 (2 GPUs per node)

Homogeneous tile size: 192x192
Heterogeneous tile sizes: 320x320 / 960x960

[Figure: tiled matrix layout with tile size nb = 192, 320, 960, ...]

64 homogeneous nodes (8 cores per node)

[Figure: GFlop/s vs. matrix size N (up to 220000) for Scalapack (nb=192), Dplasma (nb=192) and Magma-morse (nb=192, 320, 960) on 64 nodes]

64 heterogeneous nodes (8 cores + 2 GPUs per node)

[Figure: TFlop/s vs. matrix size N (up to 320000) for Dplasma (nb=320 and nb=960) and Magma-Morse (nb=960) on 64 nodes, P=8]

Conclusion

Contribution
- harnessing clusters of hybrid nodes
- sequential task-based paradigm
- almost no code changes vs. a single node
- competitive performance

Future work
- pruning
- sparse solvers
- dynamic inter-node load balancing

MORSE: http://icl.cs.utk.edu/morse/
StarPU: http://runtime.bordeaux.inria.fr/StarPU/


Pruning the task graph traversal
