Response Time Analysis of Dataflow Applications on a Many-Core Processor with Shared-Memory and Network-on-Chip

(1)

Response Time Analysis of Dataflow Applications on a Many-Core Processor with Shared-Memory and Network-on-Chip

Matthieu Moy

LIP (Univ. Lyon 1)

November 2018

(2)

CASH: Topics - People

Optimized (software/hardware) compilation for HPC software with data-intensive computations.

Means: dataflow IR, static analyses, optimisations, simulation.

[Figure: CASH research flow. Boxes: Sequential Program, Parallel Program, HPC Applications, Parallelism Extraction, Intermediate Parallel Representation, Code Generation, Hardware (FPGA), Software (CPU & accelerators). Techniques: Optimization, Dataflow Semantics, Analysis, Abstract Interpretation, Simulation, Polyhedral Model.]

Christophe Alias, Laure Gonnord, Ludovic Henrio, Matthieu Moy

http://www.ens-lyon.fr/LIP/CASH/

(3)

Outline

1 Critical, Real-Time and Many-Core

2 Parallel code generation and analysis

3 Models Definition

4 Interferences and NoC Communications

5 Evaluation

6 Conclusion and Future Work


(5)

Time-critical, compute intensive applications

◦ Hard Real-Time: we must guarantee that task execution completes before its deadline

◦ Compute-intensive

(6)

Performance Vs Predictability

[Figure: processors positioned between "Predictable" and "Fast": 68000, PowerPC, i7, GPU, Many Core.]

(7)

Many-core

Lots of simple cores

Kalray MPPA (Massively Parallel Processor Array):

◦ 256 cores

◦ No cache consistency

◦ No out-of-order execution

◦ No branch prediction

◦ No timing anomaly

◦ Predictable NoC

⇒ good fit for real-time?


(9)

Kalray's business model

(10)

Hard Real-Time on Many-Core

◦ High-level Data-Flow Application Model: synchronous hypothesis (computation/communication in 0-time)

◦ Network on Chip: communication takes time

◦ Shared Memory within Cluster: interferences between tasks

◦ Individual Cores: cache, pipeline, ...

[Figure: dataflow graph with inputs I1, I2, tasks T1, T2, T3, and outputs O1, O2.]

Take into account all levels in Worst-Case Execution Time (WCET) analysis and programming model



(16)

Execution of Synchronous Data Flow Programs

[Figure: dataflow task graph with nodes NA, NB, NC, ND, NE, NF (tasks τ0 to τ5), inputs i0 and i1, and output o.]

High level representation

✓ Respect the dependency constraints

✓ Set the release dates to get precise upper bounds on the interference

Single-core code generation: static non-preemptive scheduling

Industrialized as SCADE (1993), heavily used in avionics and nuclear

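    /* Single-core code: one call per node, in an order that respects
       the dependencies. Value types are assumed to be int. */
    int main_app(int i1, int i2)
    {
        int na = NA(i1);
        int ne = NE(i2);
        int nb = NB(na);
        int nd = ND(na);
        int nf = NF(ne);
        int o = NC(nb, nd, nf);
        return o;
    }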

(17)

Execution of Synchronous Data Flow Programs


Multi/Many-core code generation: static non-preemptive scheduling

    int NF(...) { // task τ6
        return (...); }

    int NE(...) { // task τ5
        return (...); }

    int ND(...) { // task τ4
        return (...); }

    int NC(...) { // task τ3
        return (...); }

    int NB(...) { // task τ2
        return (...); }

    int NA(...) { // task τ1
        return (...); }

[Figure: time-triggered schedule of tasks τ0 to τ5 on processing elements PE0, PE1, PE2; each task is given a slot of length wcrt0 to wcrt5.]

(18)

Parallel code generation from Lustre/SCADE (pseudo-code)


    // Generated by SCADE KCG
    void NA(ctx_a *ctx) {
        // ... computation ...
    }

    void NA_wrapper(ctx_a *ctx) {
        RECV_NA(i0);
        NA(ctx);
        SEND_NA_NB(...);
    }


    // Generated by us
    void worker_PE0(void) {
        ctx_a ctxa;
        ctx_b ctxb;
        while (1) {
            NA_wrapper(&ctxa);
            wait(release_t2);
            NB_wrapper(&ctxb);
            wait(end_of_period);
        }
    }

    #define RECV_NA(data) ...
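The worker loop relies on a `wait` primitive that blocks until a release date computed offline. The following is a minimal sketch of such a time-triggered dispatcher, not the generated code itself. The names `slot_t`, `run_schedule`, `wait_until`, `now`, `task_a` and `task_b` are hypothetical, and the clock is simulated by a plain counter so that the example runs anywhere; on real hardware `now` would read a cycle counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simulated clock (assumption: stands in for a hardware
 * cycle counter shared by the cores of a cluster). */
uint64_t sim_clock = 0;
uint64_t now(void) { return sim_clock; }

/* Busy-wait until the release date; the simulation advances time itself. */
void wait_until(uint64_t release) {
    while (now() < release) {
        sim_clock++;
    }
}

/* One entry of the static, non-preemptive schedule of a core. */
typedef struct {
    uint64_t release;      /* release date, computed offline */
    uint64_t wcrt;         /* worst-case response time bound */
    void (*task)(void);    /* wrapper: receive, compute, send */
} slot_t;

/* Run one period. Returns 0 if every task started exactly at its release
 * date and finished within its WCRT bound, -1 otherwise. */
int run_schedule(const slot_t *slots, size_t n) {
    assert(slots != NULL);
    for (size_t i = 0; i < n; i++) {
        wait_until(slots[i].release);
        if (now() != slots[i].release) return -1;   /* previous task overran */
        slots[i].task();
        if (now() > slots[i].release + slots[i].wcrt) return -1;
    }
    return 0;
}

/* Hypothetical tasks that consume simulated time. */
void task_a(void) { sim_clock += 30; }
void task_b(void) { sim_clock += 50; }
```

With slots `{0, 40, task_a}` and `{40, 60, task_b}`, the second task is released at 40 even though the first one finished at 30: release dates, not completion events, drive the execution, which is what keeps the interference bounds valid.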

(19)

Contribution

◦ Previous work:

  ◦ Predictable execution model within each cluster

  ◦ Mathematical model of arbitration for memory accesses

  ◦ Algorithm to compute a time-triggered schedule (fixed-point resolution)

◦ This talk:

  ◦ Multi-cluster application

  ◦ Time-triggered schedule taking Network on Chip (NoC) accesses into account


(21)

Architecture Model

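[Figure: chip overview with I/O clusters I/O Ethernet 0, I/O Ethernet 1, I/O DDR 0, I/O DDR 1.]

◦ Kalray MPPA-256 Bostan

◦ 16 compute clusters + 4 I/O clusters

◦ Dual NoC (Network on Chip)

[Figure: multi-level bus arbiter of a shared memory bank. Labels: Rx, Tx, DSU, RM, P0 to P15; round-robin stages 16 RR→1, 2 RR→1, 3 RR→1; fixed-priority (FP) stage with a high-priority input; groups G1, G2, G3.]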

(22)

Architecture Model

[Figure: compute cluster layout with cores P0 to P15, RM, DSU, NoC Rx, NoC Tx, and two groups of 8 shared memory banks.]

Per cluster:

◦ 16 cores + 1 Resource Manager

◦ NoC Tx, NoC Rx, Debug Unit

◦ 16 shared memory banks (total size: 2 MB)

◦ Multi-level bus arbiter per memory bank


(24)

Execution Model: Within a Cluster

[Figure: the task graph (NA to NF) mapped onto a cluster; cores P0, P1, P2 reach memory banks b0, b1, b2 (128 KB per bank) through one arbiter per bank.]

◦ Tasks mapping on cores

◦ Static non-preemptive scheduling

◦ Spatial isolation: different tasks go to different memory banks

◦ Interference from communications

◦ Execution model:

  ◦ execute in a local bank

  ◦ write to a remote bank


(28)

NoC Communications

[Figure: two clusters, each with memory banks b0, b1, cores P0, P1, and NoC RX/TX interfaces, connected through the NoC; the numbers 1 to 5 mark the steps below.]

Steps:

1 Read from memory

2 Write to TX's buffer

3 Start NoC transfer

4 Data transmission through the NoC

5 Write to memory

Interference:

1 Same as other reads

2, 3 One TX channel per sender ⇒ independent accesses.

4 Interferences in each router → network calculus

5 High-priority interference ⇒ B


(31)

Application Model and Interferences

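[Figure: dataflow task graph with nodes NA to NF (tasks τ0 to τ5), inputs i0 and i1, and output o.]

◦ Directed Acyclic Task Graph

◦ Mono-rate

◦ Fixed mapping and execution order

◦ For each task τi: $\mathrm{WCRT}_i = \mathrm{WCET}_i + \sum_{j \neq i} \mathrm{interference}_{i,j}$

[Figure: timeline (t from 0 to 160) of a task released at $rel_i$, showing processor demand and memory/TX access time, first in isolation and then with interference, up to its response time $R_i$.]

Find $R_i$ (including the interference)

Find $rel_i$ respecting precedence constraints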
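Finding release dates under precedence constraints can be illustrated by a longest-path pass over the task graph. The sketch below is a simplification under stated assumptions: each task's WCRT is taken as already known, whereas in the actual analysis the interference (and hence the WCRT) depends on which tasks overlap, so this pass would be one step inside a fixed-point iteration. The names `task_t`, `MAX_PREDS` and `compute_release_dates` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PREDS 4

/* A task of the directed acyclic task graph. Tasks are assumed to be
 * stored in topological order, so predecessors have smaller indices. */
typedef struct {
    uint64_t wcrt;               /* WCET + interference, assumed known here */
    size_t   npreds;             /* number of predecessors */
    size_t   preds[MAX_PREDS];   /* data or same-core ordering predecessors */
} task_t;

/* rel[i] = max over predecessors j of (rel[j] + wcrt[j]); 0 for sources.
 * Returns the end-to-end bound max_i (rel[i] + wcrt[i]). */
uint64_t compute_release_dates(const task_t *tasks, size_t n, uint64_t *rel) {
    uint64_t makespan = 0;
    for (size_t i = 0; i < n; i++) {
        rel[i] = 0;
        for (size_t k = 0; k < tasks[i].npreds; k++) {
            size_t j = tasks[i].preds[k];
            assert(j < i);       /* topological order */
            uint64_t ready = rel[j] + tasks[j].wcrt;
            if (ready > rel[i]) rel[i] = ready;
        }
        uint64_t end = rel[i] + tasks[i].wcrt;
        if (end > makespan) makespan = end;
    }
    return makespan;
}
```

For an illustrative chain of three tasks with WCRTs 110, 110 and 20, the pass yields release dates 0, 110 and 220 and an end-to-end bound of 240.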

Previous work:


(38)

Reminder: NoC Communications

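[Figure: two clusters, each with memory banks b0, b1, cores P0, P1, and NoC RX/TX interfaces, connected through the NoC; the numbers 1 to 5 mark the transfer steps.]

Interference:

1 Same as other reads

2, 3 One TX channel per sender ⇒ independent accesses.

4 Interferences in each router → network calculus

5 High-priority interference ⇒ B

Issue:

Predict the possible execution time of step 5 as precisely as possible.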


(40)

Example Tasks with NoC Transmission

Issue 1: overapproximation of RX execution interval

[Figure: timeline of Cluster 0 and Cluster 1 (rows P0 and RX in each): Task 1, then the NoC reception on the receiving side, then Task 2.]

◦ Issue:

  ◦ NoC reception starts after BCET(computation before sending) + BCET(transmission of first flit), where BCET is the Best Case Execution Time

  ◦ We don't have the BCET ⇒ large overapproximation for RX tasks

◦ Solution: Split sending task

  ◦ Compute: no NoC access

  ◦ Copy to TX (Cpy): write to the TX's buffer

  ◦ Start NoC transfer (EOT): write to TX's control register


(44)

Example Tasks with NoC Transmission

Issue 2: circular dependency

[Figure: timeline of Cluster 0 and Cluster 1 (rows P0 and RX in each): Task 1, Cpy, EOT1, then NoC reception (RX1); Task 2, Cpy, EOT2, then NoC reception (RX2).]

◦ Issue:

  ◦ WCRT(RX1) = WCRT(EOT1) + WCTT(1→0)

  ◦ WCRT(EOT1) depends on WCRT(RX2), which depends on WCRT(EOT2), which depends on WCRT(RX1)

  ◦ ⇒ fixed-point

◦ Solution: get rid of interference on EOT

  ◦ EOT = only one control register access

  ◦ Preload code to avoid instruction cache miss


(46)

3-phase Tasks Analysis

◦ Compute:

  ◦ Fits in previous work model

◦ Copy to TX:

  ◦ Force non-interfering schedule (add artificial dependencies if needed)

◦ Start NoC transfer (EOT):

  ◦ No interference

◦ On the RX side:

  ◦ RX can only start after Start NoC transfer has started

  ⇒ edge from Copy to TX to RX in the task dependency graph.
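The three phases of a sending task can be sketched as three separate functions, each of which would be scheduled as its own task with its own release date. This is only an illustration: `noc_tx_t` is a hypothetical software model of a TX interface (a buffer plus one control register), not the MPPA programming interface, and the doubling in `phase_compute` is a placeholder computation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TX_BUF_WORDS 8

/* Hypothetical model of a cluster's NoC TX interface. */
typedef struct {
    uint32_t buffer[TX_BUF_WORDS];   /* TX buffer, written during Cpy */
    volatile uint32_t ctrl;          /* control register, written during EOT */
    size_t   len;                    /* number of valid words in the buffer */
} noc_tx_t;

/* Phase 1, Compute: works on the local memory bank only, no NoC access. */
void phase_compute(const uint32_t *in, uint32_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * 2u;         /* placeholder computation */
    }
}

/* Phase 2, Copy to TX (Cpy): write the result into the TX buffer. */
void phase_cpy(noc_tx_t *tx, const uint32_t *data, size_t n) {
    assert(n <= TX_BUF_WORDS);
    memcpy(tx->buffer, data, n * sizeof data[0]);
    tx->len = n;
}

/* Phase 3, Start NoC transfer (EOT): a single control-register write. */
void phase_eot(noc_tx_t *tx) {
    tx->ctrl = 1u;
}
```

Keeping EOT down to one register write is what makes it plausible to treat that phase as free of interference, so the start of the reception can be tied to a precisely known instant.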


(48)

Example Application: Naive Schedule

[Figure: naive schedule on Cluster 0 and Cluster 1 (rows P0, P1, RX in each) of tasks n1 to n10, Cpy1, Cpy2, EOT1, EOT2, RX1 and RX2.]

(49)

Example Application: Improved Schedule

[Figure: improved schedule on Cluster 0 and Cluster 1 (rows P0, P1, RX in each) of tasks n1 to n10, Cpy1, Cpy2, EOT1, EOT2, RX1 and RX2.]

(50)

Recap

Task         WCRT             Release date
             Naive  Improved  Naive  Improved
Application  680    650
RX1          230    120       0      220
RX2          580    350       0      200
n3           180    160       0      0
n4           120    120       180    160
n2           100    100       10     0
n5           100    100       580    550
n8           100    100       330    340
n1           110    110       0      0
n6           100    100       0      0
n7           100    100       100    100
n9           100    100       220    220
n10          100    100       240    240
Cpy1         110    110       110    110
Cpy2         100    100       100    100
EOT1         20     20        220    220
EOT2         20     20        200    200



(52)

Conclusion and Future Work

◦ Code generation and real-time analysis for many-core (Kalray MPPA-256) = major challenge for industry and research

◦ Hard Real-Time ⇒ simplicity, predictability ⇒ static, time-driven schedule

◦ Critical ⇒ traceability ⇒ no aggressive optimization

◦ Our work:

  ◦ Understand and model the precise architecture of MPPA

  ◦ Extension of Multi-Core Response Time Analysis framework

  ◦ Integration analysis ↔ code generation

◦ Future Work:

  ◦ Model Kalray MPPA3 chip (new NoC, new arbiters)

  ◦ Improve the static scheduling algorithm: $O(n^4)$ currently, we can do better.

  ◦ Integration with RTOS?

(53)

BACKUP

(54)

Multicore Response Time Analysis

Example: Fixed Priority bus arbiter, PE1 > PE0. Bus access delay = 10

[Figure: timeline (t from 0 to 160) for PE0 and PE1: successive jobs of T1 on PE1 with 2 accesses each, T0 on PE0 with 3 accesses, and the resulting response time of T0.]

Reference: Altmeyer et al., RTNS 2015


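◦ Task of interest running on PE0:

$R_0 = 10 + 3 \times 10 = 40$ (response time in isolation)

$R_1 = 10 + 3 \times 10 + 2 \times 10 = 60$

$R_2 = 10 + 3 \times 10 + 2 \times 10 + 2 \times 10 = 80$

$R_3 = 10 + 3 \times 10 + 2 \times 10 + 2 \times 10 + 0 = 80$ (fixed-point)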
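The iteration above can be reproduced with a short fixed-point loop. This is a sketch under assumptions, not the analysis of the cited framework: it supposes that T1 is released every 40 time units on PE1 (a value read off the timeline, not stated explicitly) and that each T1 job released before the current bound delays the task of interest by all of its bus accesses. The names `rta_params_t`, `jobs_in_window` and `response_time` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Parameters of the example. */
typedef struct {
    uint64_t cpu;           /* processor demand of the task of interest */
    uint64_t own_accesses;  /* its own bus accesses */
    uint64_t delay;         /* bus access delay */
    uint64_t hp_period;     /* period of the higher-priority task (assumed) */
    uint64_t hp_accesses;   /* bus accesses per job of that task */
} rta_params_t;

/* Number of higher-priority jobs released in [0, r): ceil(r / period). */
static uint64_t jobs_in_window(uint64_t r, uint64_t period) {
    return (r + period - 1) / period;
}

/* Iterate R_{k+1} = isolation + jobs(R_k) * hp_accesses * delay until the
 * value stops changing. Returns the fixed point, or 0 if the bound exceeds
 * the deadline before converging. */
uint64_t response_time(const rta_params_t *p, uint64_t deadline) {
    assert(p->hp_period > 0);
    uint64_t isolation = p->cpu + p->own_accesses * p->delay;
    uint64_t r = isolation;
    for (;;) {
        uint64_t next = isolation
            + jobs_in_window(r, p->hp_period) * p->hp_accesses * p->delay;
        if (next == r) return r;          /* fixed point reached */
        if (next > deadline) return 0;    /* bound exceeds the deadline */
        r = next;
    }
}
```

With `cpu = 10`, `own_accesses = 3`, `delay = 10`, `hp_period = 40` and `hp_accesses = 2`, the loop visits 40, 60 and 80 and then confirms 80 as the fixed point, matching the values listed for R0 to R3.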


(59)

The Global Picture

[Figure: overall tool flow. Boxes: High-level Program, Code Generation, Static Mapping/Scheduling, Local WCRT Analysis, WCRT with Interferences, Binary Generation, Timing models (static analysis), Probabilistic Models. Data exchanged: Tasks, Dependencies, Mapping, Execution Order, Release Dates, Tasks WCRT, WC Access.]