Controlling the Memory Subscription of Distributed Applications with a Task-Based Runtime System


HAL Id: hal-01380126

https://hal.inria.fr/hal-01380126

Submitted on 12 Oct 2016


Controlling the Memory Subscription of Distributed Applications with a Task-Based Runtime System

To cite this version:

Marc Sergent, David Goudin, Samuel Thibault, Olivier Aumage. Controlling the Memory Subscription of Distributed Applications with a Task-Based Runtime System. SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP 2016), Apr 2016, Paris, France. pp. 318-327. hal-01380126

Controlling the Memory Subscription of Distributed Applications with a Task-Based Runtime System

Marc Sergent, David Goudin, Samuel Thibault, Olivier Aumage

SIAM-PP Conference, Paris, April 12th, 2016

Outline

1. Topic
   Distributed task-based applications and memory issues
2. Memory control with the sequential-task flow model
   Distributed sequential-task flow model
   Working principle of memory control
3. What if the size of data grows during the execution?
   An example: block low-rank matrices
   Experimental results
4. Conclusion and future work


Memory and distributed task-based applications

[Figure, shown over several animation steps: two nodes exchange data A and B over MPI, each allocating receive buffers in a bounded memory. Data A received before data B: works fine. Data B received before data A: both nodes wait for memory space, leading to a deadlock.]

Memory and distributed task-based applications

Issue

Bad memory allocation order for remote data can lead to deadlocks
- Need to constrain the order of memory allocations

Problem

To what extent should we sequentialize memory allocations for receiving remote data?
- Without hindering communication/computation overlap
- Without introducing other deadlocks

Memory and distributed task-based applications

Proposed solution

Decouple the task submission and task execution steps: introduce a sequential ordering of tasks at submission time, not at execution time
⇒ Sequential Task Flow (STF) model, implemented by the StarPU runtime system (Inria STORM team)

Working principle

- Memory allocations are performed at submission time
- When not enough memory is available, task submission is blocked
- When enough memory has been freed, task submission is unblocked
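A minimal sketch of this submission-time throttle, assuming hypothetical memory_reserve/memory_release helpers called around task submission and task completion (the real bookkeeping is internal to the runtime; names and the limit value are illustrative):

#include <pthread.h>
#include <stddef.h>

/* Hypothetical illustration of submission-time memory control: the
 * submitting thread reserves the memory a task will need and blocks
 * when the limit is reached; completion callbacks release memory and
 * wake it up. */

static pthread_mutex_t mem_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  mem_cond  = PTHREAD_COND_INITIALIZER;
static size_t          mem_used  = 0;
static size_t          mem_limit = (size_t)18 << 30;   /* e.g. block above 18 GB */

void memory_reserve(size_t size)          /* called before submitting a task */
{
    pthread_mutex_lock(&mem_lock);
    while (mem_used + size > mem_limit)   /* block task submission */
        pthread_cond_wait(&mem_cond, &mem_lock);
    mem_used += size;
    pthread_mutex_unlock(&mem_lock);
}

void memory_release(size_t size)          /* called when a task frees its buffers */
{
    pthread_mutex_lock(&mem_lock);
    mem_used -= size;
    pthread_cond_broadcast(&mem_cond);    /* unblock task submission */
    pthread_mutex_unlock(&mem_lock);
}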

Sequential task-based Cholesky on a single node

for (j = 0; j < N; j++) {
    POTRF (RW,A[j][j]);
    for (i = j+1; i < N; i++)
        TRSM (RW,A[i][j], R,A[j][j]);
    for (i = j+1; i < N; i++) {
        SYRK (RW,A[i][i], R,A[i][j]);
        for (k = j+1; k < i; k++)
            GEMM (RW,A[i][k], R,A[i][j], R,A[k][j]);
    }
}
task_wait_for_all();

[Figure, unrolled over several animation steps: the task graph built from this submission loop, with POTRF, TRSM, SYRK and GEMM tasks linked by their data dependencies.]
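As a rough illustration of what each pseudo-call stands for, here is how the TRSM line could map onto StarPU's insert-task API; trsm_cl and the tile handles are assumed to have been defined and registered elsewhere (a sketch, not the solver's actual code):

#include <starpu.h>

extern struct starpu_codelet trsm_cl;   /* assumed codelet wrapping the TRSM kernel */

static void submit_trsm(starpu_data_handle_t Aij, starpu_data_handle_t Ajj)
{
    /* Access modes declare the dependencies the runtime uses to build
     * the task graph at submission time. */
    starpu_task_insert(&trsm_cl,
                       STARPU_RW, Aij,   /* tile being updated       */
                       STARPU_R,  Ajj,   /* factorized diagonal tile */
                       0);
}

The surrounding loops and the final task_wait_for_all() are unchanged from the pseudocode above.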


Distributed case: which node executes which tasks?

Two mappings are needed for distributed computing:
- Where is the data? (data mapping)
- Where are the computations? (task mapping)

The application must provide the initial data mapping.

Two approaches for task mapping:
- Automatically inferred from the data mapping (by default, a task runs where the data it writes to resides)
- Specified by the application

[Figure: example data mapping of the tiled matrix over node0, node1, node2 and node3.]
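With StarPU-MPI, the initial data mapping is typically expressed when registering each tile, by declaring its owning rank. A minimal sketch, where the tile pointer, MPI tag and owner are illustrative parameters chosen by the application:

#include <stdint.h>
#include <starpu.h>
#include <starpu_mpi.h>

/* Sketch: register one NB x NB tile and declare which MPI rank owns it;
 * the set of such declarations defines the initial data mapping. */
static starpu_data_handle_t register_tile(double *tile, unsigned NB,
                                          int64_t tag, int owner)
{
    starpu_data_handle_t handle;
    starpu_matrix_data_register(&handle, STARPU_MAIN_RAM,
                                (uintptr_t)tile, NB, NB, NB,
                                sizeof(double));
    starpu_mpi_data_register(handle, tag, owner);
    return handle;
}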

Distributed programming paradigm

Data communications are inferred from the task graph:
- Edges = data communications
- If the two tasks of an edge are executed on different processes, an MPI communication is inferred

Communications are inferred automatically by the runtime system at submission time:
- Replicated unrolling of the task graph: the same task submission order on all processes
- Communications are therefore inferred in the same order on all processes

[Figure: task graph of POTRF, TRSM, SYRK and GEMM tasks split between node0 and node1, with Isend/Irecv pairs on the edges that cross the process boundary.]
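This replicated unrolling is what StarPU-MPI's starpu_mpi_task_insert provides: every process issues the same sequence of calls, and the runtime decides locally whether to run the task, send a tile or post a receive. A sketch for the GEMM line of the Cholesky loop (codelet and handle names are illustrative):

#include <starpu_mpi.h>

extern struct starpu_codelet gemm_cl;   /* assumed codelet wrapping the GEMM kernel */

static void submit_gemm(starpu_data_handle_t Aik,
                        starpu_data_handle_t Aij,
                        starpu_data_handle_t Akj)
{
    /* Issued identically on all processes; Isend/Irecv pairs are
     * inserted automatically for data owned by another rank. */
    starpu_mpi_task_insert(MPI_COMM_WORLD, &gemm_cl,
                           STARPU_RW, Aik,
                           STARPU_R,  Aij,
                           STARPU_R,  Akj,
                           0);
}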

Data transfers between nodes

[Figure: node 0's view of the execution of the tiled Cholesky; tiles owned by node1, node2 and node3 that node 0 needs are fetched with Irecv into node 0's memory.]


Memory control and sequential-task flow

for (j = 0; j < N; j++) {
    POTRF (RW,A[j][j]);
    for (i = j+1; i < N; i++)
        TRSM (RW,A[i][j], R,A[j][j]);
    flush(A[j][j]);
    for (i = j+1; i < N; i++) {
        SYRK (RW,A[i][i], R,A[i][j]);
        for (k = j+1; k < i; k++)
            GEMM (RW,A[i][k], R,A[i][j], R,A[k][j]);
        flush(A[i][j]);
    }
}
task_wait_for_all();

[Figure, unrolled over several animation steps: node 0's memory during submission. Irecv buffers for remote tiles are allocated at submission time; task submission BLOCKs when the memory limit is reached and UNBLOCKs once flushed data makes enough MEMORY AVAILABLE again.]
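The flush() calls tell the runtime that the application will no longer access a tile, so its buffers can be released as soon as the already submitted tasks using it complete; this is what eventually unblocks submission. With StarPU-MPI this roughly corresponds to the following sketch:

#include <starpu.h>
#include <starpu_mpi.h>

/* Sketch: roughly what flush(A[j][j]) stands for.  Once all already
 * submitted tasks using the handle have completed, its buffers and
 * cached remote copies are released, which can unblock submission. */
static void flush_tile(starpu_data_handle_t handle)
{
    starpu_mpi_cache_flush(MPI_COMM_WORLD, handle);  /* drop cached remote copies */
    starpu_data_unregister_submit(handle);           /* free once no task uses it */
}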


Example of memory behaviour on a 24GB machine

[Figure: memory footprint (MB, logarithmic scale) over time (s) on one node, with curves for the total memory, StarPU's view of memory, the allocated physical memory, the current matrix memory and the initial matrix memory.]

Memory thresholds for blocking/unblocking task submission: 20GB/18GB
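The slides do not show how these thresholds are configured; one plausible way with StarPU (an assumption, not stated in the talk) is to bound the memory the runtime manages on each node through the STARPU_LIMIT_CPU_MEM environment variable, whose value is given in MB, for example:

#include <stdlib.h>
#include <starpu.h>

int main(void)
{
    /* Assumption: cap the memory StarPU manages on this node so that
     * submission throttling kicks in below the physical 24 GB. */
    setenv("STARPU_LIMIT_CPU_MEM", "20480", 1);   /* ~20 GB, in MB */
    if (starpu_init(NULL) != 0)
        return 1;
    /* ... register tiles, submit the factorization ... */
    starpu_shutdown();
    return 0;
}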


What if the size of data grows during the execution?

Goal: overestimate the real memory subscription
- Need an upper bound on data growth for each submitted task
- If it cannot be known, need a maximum value of data growth for each piece of data

Example: a linear algebra solver for Block Low-Rank (BLR) matrices

Block low-rank (BLR) matrices

Compressed tiles: an m x n tile A_ij is approximated by a low-rank product

    A_{ij} \simeq U_{ij} V_{ij}^{t}

where U_{ij} is of size m x k, V_{ij} is of size n x k, and k is the rank of the compressed tile.
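A compressed tile therefore stores k(m + n) entries instead of the m·n entries of the dense tile, which makes the dense size a natural upper bound on how much the tile can grow. A small accounting sketch (double precision assumed for simplicity; the experiments below use complex double precision):

#include <stddef.h>

/* Sketch: storage of an m x n tile of rank k in compressed form
 * (U of size m x k and V of size n x k) versus its dense upper bound. */
static size_t tile_bytes_compressed(size_t m, size_t n, size_t k)
{
    return (m + n) * k * sizeof(double);
}

static size_t tile_bytes_upper_bound(size_t m, size_t n)
{
    return m * n * sizeof(double);   /* a tile never grows past dense */
}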

Typical BEM matrix with compressed tiles

[Figure: BEM matrix partitioned into 512 x 512 tiles; a few tiles are labelled Dense while the others are compressed, with ranks such as 6, 16, 19, 24 and 26.]

The rank of each compressed tile of the matrix grows during the execution:
- Filling phenomenon of the matrix
- Unpredictable prior to the execution

A compressed tile can grow at most until it becomes dense again:
- Upper bound on data growth: the size of the dense tile

Memory control with the upper bound and the BLR solver

Upper bound on data growth: the size of the dense tile
- We cannot overestimate the size of all tiles with dense tiles: the average overestimation would equal the compression rate (over 97%)

Pragmatic solution:
- Do not overestimate the size of local tiles; keep a memory margin to compensate for the local growth of data
- Overestimate the memory space required for remote data, and correct it when the data is received

Works in practice:
- The memory overestimation of remote data overlaps with the local filling of the matrix
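A minimal sketch of this accounting, reusing the hypothetical memory_reserve/memory_release helpers from the earlier throttle sketch (the actual counter lives inside the runtime):

#include <stddef.h>

void memory_reserve(size_t size);   /* hypothetical, see earlier sketch */
void memory_release(size_t size);

/* At submission time, charge the dense worst case for a remote m x n
 * tile, since its current rank is unknown on this node. */
static void account_remote_tile_submitted(size_t m, size_t n)
{
    memory_reserve(m * n * sizeof(double));
}

/* When the tile arrives with rank k, give back the difference between
 * the dense upper bound and its actual compressed size. */
static void account_remote_tile_received(size_t m, size_t n, size_t k)
{
    memory_release(m * n * sizeof(double) - (m + n) * k * sizeof(double));
}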


Experimental setup

Application settings:
- Complex double-precision distributed compressed Cholesky factorization
- Tile size: 512 x 512
- Matrix order: 450,000 unknowns
- Chameleon solver (Inria HiePACS team) on top of the StarPU runtime

Experimental cluster:
- 16 to 81 nodes of the TERA100 Hybrid cluster
- 2 Intel Xeon X5620 @ 2.40 GHz per node (8 cores per node)
- 24 GB RAM per node

Memory control settings:
- Task submission blocks when memory occupancy reaches 18 GB

Without memory control (25 nodes, 450k unknowns)

[Figure: memory footprint (MB, logarithmic scale) over time (s), with curves for the total memory, StarPU's view of memory, the allocated physical memory, the current matrix memory and the initial matrix memory; the run ends with an out-of-memory (OOM) failure.]

Memory oversubscription for remote data + filling of the matrix = out-of-memory

With memory control (25 nodes, 450k unknowns)

[Figure: same memory footprint plot with memory control enabled; no out-of-memory failure occurs.]

The memory overestimation of remote data overlaps with the local filling of the matrix

Performance results (strong scaling, 450k unknowns)

[Figure: CPU hours as a function of the number of computing nodes (16, 25, 36, 49, 64, 81), with and without memory control.]

Memory control does not impact the performance of the application, and makes it possible to use fewer computing nodes


Conclusion

Memory control with the distributed STF model:
- Memory allocations are performed at submission time
- When not enough memory is available, task submission is blocked
- When enough memory has been freed, task submission is unblocked

What if the size of data grows during the execution?
- Need an upper bound on data growth for each task
- If it cannot be known, need at least a maximum value of data growth per piece of data
- Works in practice

A distributed BLR solver has been successfully implemented on top of StarPU

Future work

The task submission order can provoke out-of-memory scenarios (for example, submitting all tasks that allocate memory first, then all tasks that free memory)
- Application-level solution: change the task submission order by grouping the submission of the tasks that allocate and deallocate a given piece of data
- Runtime-level solution: a memory-driven activation level for tasks, where a task becomes ready only when enough memory is available to allocate its data (non-blocking task submission)

The task execution order can also provoke out-of-memory scenarios
- Memory-aware task scheduling heuristics

Thank you!

Questions?

Research context

The MORSE research project (2011-2013, 2014-2016)

People:
- ICL
- Inria Bordeaux Sud-Ouest (HiePACS, RealOpt, STORM)
- KAUST
- UCD

Solvers:
- Sparse direct (PaStiX, qr_mumps)
- H-matrix (EADS)
- FMM (ScalFMM)
- Dense (Chameleon)

Task-based runtime systems:
- Insert-task paradigm: StarPU, Quark
- JDF: PaRSEC
