HAL Id: hal-01380126
https://hal.inria.fr/hal-01380126
Submitted on 12 Oct 2016
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Controlling the Memory Subscription of Distributed
Applications with a Task-Based Runtime System
Marc Sergent, David Goudin, Samuel Thibault, Olivier Aumage
To cite this version:
Marc Sergent, David Goudin, Samuel Thibault, Olivier Aumage. Controlling the Memory Subscription of Distributed Applications with a Task-Based Runtime System. SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP 2016), Apr 2016, Paris, France. pp. 318-327.
�hal-01380126�
Controlling the Memory Subscription of
Distributed Applications with a Task-Based
Runtime System
Marc Sergent, David Goudin, Samuel Thibault, Olivier Aumage
SIAM-PP Conference, Paris, April 12th, 2016
Outline
1. Topic
   Distributed task-based applications and memory issues
2. Memory control with the sequential-task flow model
   Distributed sequential-task flow model
   Working principle of memory control
3. What if the size of data grows during the execution?
   An example: block low-rank matrices
   Experimental results
4. Conclusion and future work
Memory and distributed task-based applications
[Figure, animated over several slides: two nodes run compute (C), write (W)
and read (R) tasks and exchange tiles A and B through MPI; when a node's
memory is full, an incoming transfer must wait for memory space.]
Data A received before data B: works fine
Data B received before data A: deadlock (each transfer waits for memory space)
Memory and distributed task-based applications
Issue
Bad memory allocation order for remote data can lead to deadlocks
  - Need to constrain the order of memory allocations
Problem
To what extent should we sequentialize memory allocations for receiving remote data?
  - Without hindering communication/computation overlapping
  - Without introducing other deadlocks
Memory and distributed task-based applications
Proposed solution
Decouple the task submission and task execution steps
Introduce a sequential ordering of tasks at submission time, not at execution time
⇒ Sequential Task Flow (STF) model
Implemented by the StarPU runtime system (Inria STORM team)
Working principle
  - Memory allocations are performed at submission time
  - When not enough memory is available, task submission blocks
  - When enough memory has been freed, task submission unblocks
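The working principle above can be sketched as a watermark-based accounting structure. This is a minimal illustration of the block/unblock idea, not the StarPU implementation; all names (memctrl_*) are hypothetical.

```c
#include <stddef.h>

/* Illustrative sketch of the block/unblock principle: a counter of
   subscribed memory with a high and a low watermark (hypothetical names,
   not the StarPU API). */
typedef struct {
    size_t used;     /* memory currently subscribed, in bytes */
    size_t high_wm;  /* submission blocks above this threshold  */
    size_t low_wm;   /* submission resumes below this threshold */
    int blocked;     /* is task submission currently blocked?   */
} memctrl;

/* Called at task *submission* time, before the task is queued.
   Returns 1 if the allocation was accounted, 0 if submission must block. */
static int memctrl_try_reserve(memctrl *mc, size_t size) {
    if (mc->blocked || mc->used + size > mc->high_wm) {
        mc->blocked = 1;
        return 0;
    }
    mc->used += size;
    return 1;
}

/* Called when a task completes and its data is freed.
   Unblocks submission once usage falls back under the low watermark. */
static void memctrl_release(memctrl *mc, size_t size) {
    mc->used -= size;
    if (mc->blocked && mc->used <= mc->low_wm)
        mc->blocked = 0;
}
```

The two thresholds provide hysteresis, as in the 20GB/18GB setting shown later: submission does not resume immediately at the first free, only once enough memory has been reclaimed.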
Sequential task-based Cholesky on a single node
for (j = 0; j < N; j++) {
    POTRF (RW,A[j][j]);
    for (i = j+1; i < N; i++)
        TRSM (RW,A[i][j], R,A[j][j]);
    for (i = j+1; i < N; i++) {
        SYRK (RW,A[i][i], R,A[i][j]);
        for (k = j+1; k < i; k++)
            GEMM (RW,A[i][k], R,A[i][j], R,A[k][j]);
    }
}
task_wait_for_all();
[Animation over several slides: the task graph unfolds as tasks are
submitted; legend: GEMM, SYRK, TRSM, POTRF]
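The RW/R annotations in the code above are what lets the runtime build the task graph. The following is a simplified sketch of the idea (not StarPU's actual implementation): each piece of data remembers its last writer, and any later access depends on it. Write-after-read dependencies are ignored here for brevity.

```c
/* Simplified STF dependency inference at submission time: each data
   tracks the last task that wrote it; a later R or RW access on the same
   data depends on that writer. (Illustrative only: a real runtime also
   tracks readers to handle write-after-read dependencies.) */
enum mode { R, RW };

typedef struct { int last_writer; } data_state;   /* -1: no writer yet */

/* Registers one access of task `task` on `d`; returns the task it must
   wait for (-1 if none). A RW access becomes the new last writer. */
static int stf_access(data_state *d, int task, enum mode m) {
    int dep = d->last_writer;
    if (m == RW)
        d->last_writer = task;
    return dep;
}
```

Because dependencies are derived purely from the submission order and the access modes, the parallel program keeps the semantics of the sequential loop nest.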
Distributed case: which node executes which tasks?
Two mappings are needed for distributed computing:
  - Where is the data? → data mapping
  - Where are the computations? → task mapping
The application must provide the initial data mapping
Two approaches for task mapping:
  - Automatically inferred from the data mapping
    • By default, a task runs where the data it writes to resides
  - Specified by the application
[Figure: a tiled matrix distributed over node0, node1, node2, node3]
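As an example of a data mapping the application could provide, here is a standard 2D block-cyclic distribution of tiles over a P x Q grid of nodes, together with the default task-mapping rule stated above. The helper name is hypothetical, not part of any particular API.

```c
/* 2D block-cyclic data mapping: tile (i, j) belongs to node
   (i mod P) * Q + (j mod Q) on a P x Q node grid. */
static int tile_owner(int i, int j, int P, int Q) {
    return (i % P) * Q + (j % Q);
}

/* Default task mapping inferred from the data mapping: a task executes
   on the node owning the data it writes. For the Cholesky code above,
   GEMM(RW A[i][k], ...) would thus run on tile_owner(i, k, P, Q). */
static int task_node(int wi, int wj, int P, int Q) {
    return tile_owner(wi, wj, P, Q);
}
```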
Distributed programming paradigm
Data communications are inferred from the task graph
  - Edges = data communications
  - If dependent tasks are executed on different processes, an MPI communication is inferred
Automatically inferred by the runtime system at submission time
  - Replicated unrolling of the task graph
    • Same task submission order on all processes
    • Communications inferred in the same order on all processes
[Figure: a task graph (POTRF, TRSM, SYRK, GEMM) split between node0 and node1, with matching Isend/Irecv pairs on the cut edges]
Data transfers between nodes
[Figure: node 0 execution; node 0 posts Irecv operations for the remote tiles (owned by node1, node2, node3) that its tasks read]
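The communication inference described above can be sketched as follows (hypothetical names, not the StarPU API): whenever the node owning a tile differs from the node executing a task that reads it, a send/receive pair is recorded. Since all processes unroll the same graph in the same order, they all infer the same pairs in the same order, which is what makes the matching deterministic.

```c
/* Submission-time communication inference sketch. Every process runs this
   same code on the same task sequence, so the inferred Isend/Irecv pairs
   match across processes without any negotiation. */
typedef struct { int src, dst, tag; } comm;

/* Appends the inferred communication (if any) to `out` and returns the
   new count. A local access infers nothing. */
static int infer_comm(comm *out, int n, int owner, int exec_node, int tag) {
    if (owner != exec_node) {
        out[n].src = owner;
        out[n].dst = exec_node;
        out[n].tag = tag;
        n++;
    }
    return n;
}
```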
Memory control and sequential-task flow
for (j = 0; j < N; j++) {
    POTRF (RW,A[j][j]);
    for (i = j+1; i < N; i++)
        TRSM (RW,A[i][j], R,A[j][j]);
    flush(A[j][j]);
    for (i = j+1; i < N; i++) {
        SYRK (RW,A[i][i], R,A[i][j]);
        for (k = j+1; k < i; k++)
            GEMM (RW,A[i][k], R,A[i][j], R,A[k][j]);
        flush(A[i][j]);
    }
}
task_wait_for_all();
[Animation over several slides: node 0 execution trace with Irecv
operations for the tiles owned by node1, node2, node3; task submission
alternates between the BLOCK, UNBLOCK and MEMORY AVAILABLE states]
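The flush() calls in the code above tell the runtime that no further tasks will be submitted on a tile, so its memory can be reclaimed as soon as the already submitted tasks using it complete. A minimal reference-counting sketch of this semantics (illustrative, not StarPU's implementation):

```c
/* Sketch of flush() semantics: a tile is reclaimable once it has been
   flushed AND all tasks submitted on it have completed. */
typedef struct {
    int pending;   /* tasks submitted on this tile, not yet completed */
    int flushed;   /* no more tasks will be submitted on it           */
    int freed;     /* memory has been reclaimed                       */
} tile;

static void tile_on_submit(tile *t) { t->pending++; }

static void tile_flush(tile *t) {
    t->flushed = 1;
    if (t->pending == 0)
        t->freed = 1;
}

static void tile_on_complete(tile *t) {
    if (--t->pending == 0 && t->flushed)
        t->freed = 1;
}
```

This is what allows the memory-control mechanism to make progress: flushed tiles are the ones whose release can unblock task submission.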
Example of memory behaviour on a 24GB machine
[Plot: memory footprint (MB, log scale, 500 to 200000) over time (s, 0 to 1200).
Curves: total memory, StarPU's view of memory, allocated physical memory,
current matrix memory, initial matrix memory]
Memory thresholds for blocking/unblocking task submission: 20GB/18GB
What if the size of data grows during the execution?
Goal: overestimate the real memory subscription
Need an upper bound of data growth for each submitted task
If it cannot be known:
  - Need a maximum value of data growth for each piece of data
Example: linear algebra solver for Block Low-Rank (BLR) matrices
Block low-rank (BLR) matrices
Compressed tiles
An m×n tile A_ij is approximated by the product of two thin factors:
    A_ij ≈ U_ij · V_ij^t,   with U_ij of size m×k and V_ij of size n×k
k is the rank of the compressed tile
Typical BEM matrix with compressed tiles
19 26 16 6 24 Dense Dense Dense Dense Dense 512 512The rank of each compressed tile of the matrix grows during
the execution
I
Filling phenomenon of the matrix
IUnpredictable prior to the execution
A compressed tile
never grows until it becomes dense again
IUpper-bound of data growth : size of the dense tile
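The storage costs behind this upper bound follow directly from the factorization A_ij ≈ U_ij V_ij^t shown above: a compressed tile stores k(m+n) elements instead of mn, and can never need more than the dense mn elements.

```c
#include <stddef.h>

/* Element counts for one tile of the BLR format described above. */
static size_t dense_elems(size_t m, size_t n) {
    return m * n;                 /* upper bound of data growth */
}

static size_t compressed_elems(size_t m, size_t n, size_t k) {
    return k * (m + n);           /* U_ij (m x k) plus V_ij (n x k) */
}
```

For the 512x512 tiles of the experiments, a rank-19 tile stores 19x1024 = 19456 elements versus 262144 dense, which is why reserving the dense size for every tile would waste almost all of the memory (the ~97% compression rate quoted on the next slide).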
Memory control with upper-bound and BLR solver
Upper bound of data growth: size of the dense tile
Cannot overestimate the size of all tiles with dense tiles
  - Average overestimation would equal the compression rate (over 97%)
Pragmatic solution
Do not overestimate the size of local tiles
  - A memory margin is needed to compensate for the local growth of data
Overestimate the memory space required for remote data
  - And correct it when the data is received
Works in practice
  - The memory overestimation of remote data overlaps with the local filling of the matrix
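The reserve-then-correct policy for remote tiles can be sketched as follows (hypothetical helpers): the dense upper bound is subscribed at submission time, and the overestimation is given back once the actual compressed size is known at reception.

```c
#include <stddef.h>

/* Pessimistic reservation for a remote tile at submission time:
   subscribe the dense upper bound, since the received rank is unknown. */
static size_t reserve_remote(size_t *used, size_t dense_bytes) {
    *used += dense_bytes;
    return dense_bytes;           /* remember what was reserved */
}

/* On reception, the actual compressed size is known:
   release the difference between the reservation and reality. */
static void correct_on_receive(size_t *used, size_t reserved, size_t actual) {
    *used -= reserved - actual;
}
```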
Experimental setup
Application settings
  - Complex double-precision distributed compressed Cholesky factorization
  - Tile size: 512x512
  - Matrix order: 450,000 unknowns
  - Chameleon solver (Inria HiePACS team) on top of the StarPU runtime
Experimental cluster
  - From 16 to 81 nodes of the TERA100 Hybrid cluster
  - 2 Intel Xeon X5620 @ 2.40 GHz (8 cores per node)
  - 24 GB RAM per node
Memory control settings
  - Task submission blocks when memory occupancy reaches 18GB
Without memory control (25 nodes, 450k unknowns)
[Plot: memory footprint (MB, log scale, 500 to 600000) over time (s, 0 to 3000).
Curves: total memory, StarPU's view of memory, allocated physical memory,
current matrix memory, initial matrix memory; the run ends with an OOM]
Memory oversubscription for remote data + filling of the matrix = out-of-memory
With memory control (25 nodes, 450k unknowns)
[Plot: memory footprint (MB, log scale, 500 to 600000) over time (s, 0 to 3000).
Curves: total memory, StarPU's view of memory, allocated physical memory,
current matrix memory, initial matrix memory; the footprint stays within the
machine's 24GB]
The memory overestimation of remote data overlaps with the local filling of the matrix
Performance results (strong scaling, 450k unknowns)
[Plot: CPU hours (0 to 180) over the number of computing nodes
(16, 25, 36, 49, 64, 81), with and without memory control]
The memory control does not impact the performance of the application, and makes it possible to use fewer computing nodes
Conclusion
Memory control with the distributed STF model
  - Memory allocations are performed at submission time
  - When not enough memory is available, task submission blocks
  - When enough memory has been freed, task submission unblocks
What if the size of data grows during the execution?
  - Need an upper bound of data growth for each task
  - If it cannot be known, need at least a maximum value of data growth
  - Works in practice
    • A distributed BLR solver was successfully implemented with StarPU
Future work
Task submission order can provoke out-of-memory scenarios
  - Example: submitting all tasks that allocate memory first, then all tasks that free memory
  - Application-level solution: change the task submission order
    • Group the submission of tasks which allocate and deallocate a given piece of data
  - Runtime-level solution: memory-driven activation level for tasks
    • A task becomes ready when there is enough memory available to allocate its data
    • Non-blocking task submission
Task execution order can also provoke out-of-memory scenarios
  - Memory-aware task scheduling heuristics
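The effect of submission order on footprint can be illustrated with a tiny simulation (hypothetical, for intuition only): replay a sequence of allocation (+) and free (-) events and record the peak. Submitting all allocating tasks before any freeing task maximizes the peak; interleaving them keeps it low.

```c
/* Peak memory of a sequence of +alloc / -free events (bytes; frees are
   negative). Demonstrates why grouping allocating and freeing tasks per
   piece of data reduces the footprint that memory control must handle. */
static long peak_memory(const long *events, int n) {
    long used = 0, peak = 0;
    for (int i = 0; i < n; i++) {
        used += events[i];
        if (used > peak)
            peak = used;
    }
    return peak;
}
```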
Thank you!
Questions?
Research context
The MORSE research project (2011-2013, 2014-2016)
People
  - ICL
  - INRIA Bordeaux Sud-Ouest (HiePACS, RealOpt, STORM)
  - KAUST
  - UCD
Solvers
  - Sparse direct (PaStiX, qr_mumps)
  - H-matrix (EADS)
  - FMM (ScalFMM)
  - Dense (Chameleon)
Task-based runtime systems