Many-Core Timing Analysis of Real-Time Systems and its application to an industrial processor Hamza Rihani

Texte intégral

(1)Many-Core Timing Analysis of Real-Time Systems and its application to an industrial processor. Hamza Rihani Université Grenoble Alpes / Verimag. December 1, 2017. Jury: Pr. Jan Reineke Pr. Christine Rochange Dr. Robert I. Davis Dr. Benoît de Dinechin Dr. Claire Maïza Dr. Matthieu Moy. Saarland University Université de Toulouse University of York Kalray SA Université Grenoble Alpes Université Claude Bernard - Lyon 1. Reviewer Reviewer Examiner Examiner Supervisor Advisor.

(2) Introduction: Real-Time Systems Many-Core Timing Analysis of Real-Time Systems. Definition (Real-Time Systems) A system that must produce valid outputs before a deadline.. 2. /. 52.

(3) Introduction: Real-Time Systems Many-Core Timing Analysis of Real-Time Systems. Definition (Real-Time Systems) A system that must produce valid outputs before a deadline. ◦ Soft Real-Time ◦ Global Positioning System device ◦ Smartphones ◦ Hard Real-Time ◦ Automatic Braking System ◦ Flight Management System. 2. /. 52.

(4) Introduction: Real-Time Systems Many-Core Timing Analysis of Real-Time Systems. Definition (Real-Time Systems) A system that must produce valid outputs before a deadline. ◦ Soft Real-Time ◦ Global Positioning System device ◦ Smartphones. ?. ◦ Hard Real-Time ◦ Automatic Braking System ◦ Flight Management System. 2. /. 52.

(5) Introduction: Timing Analysis of Arbitration Policies Many-Core Timing Analysis of Real-Time Systems How long will the truck wait to cross the road?.   . . 3. /. 52.

(6) Introduction: Timing Analysis of Arbitration Policies Many-Core Timing Analysis of Real-Time Systems How long will the truck wait to cross the road?.   . waits for the green light. . 3. /. 52.

(7) Introduction: Timing Analysis of Arbitration Policies Many-Core Timing Analysis of Real-Time Systems How long will the truck wait to cross the road?.   . . waits for the green light. ♂. grants each direction at a time. ♂. 3. /. 52.

(8) Introduction: Timing Analysis of Arbitration Policies Many-Core Timing Analysis of Real-Time Systems How long will the truck wait to cross the road?.   . . waits for the green light. ♂. grants each direction at a time gives a priority to the cars. 3. /. 52.

(9) Introduction: Timing Analysis of Arbitration Policies Many-Core Timing Analysis of Real-Time Systems How long will the truck wait to cross the road?.   . . ◦ Crossroad is a shared resource ◦ Vehicles request accesses to pass ◦ Arbitration Policies:. Time Division Multiple Access. ♂. Round Robin Fixed Priority. 3. /. 52.

(10) Introduction: Many-Cores in Real Time Systems Many-Core Timing Analysis of Real-Time Systems. I/O device. …. Cluster m. Cluster 1. router. router. router. router. Network-on-Chip. Cluster 0 4. /. 52.

(11) Introduction: Many-Cores in Real Time Systems Many-Core Timing Analysis of Real-Time Systems P0 I. P1 D. I. …. P2 D. I. D. Pn I. D. I/O device. bus. Timing analysis of cores. shared memory. …. ◦ Existing tools for pipeline and cache. P0 I. P1 D. I. …. P2 D. I. D. Pn I. router. router. router. router. analyses. D. bus shared memory. Network-on-Chip P0 I. P1 D. I. …. P2 D. I. D. Pn I. D. bus shared memory. 4. /. 52.

(12) Introduction: Many-Cores in Real Time Systems Many-Core Timing Analysis of Real-Time Systems P0 I. P1 D. I. …. P2 D. I. D. Pn I. D. I/O device. bus. 3. shared memory. ◦ Existing tools for pipeline and cache. … P0 I. P1 D. I. …. P2 D. I. D. router. Pn I. Where is the potential interference? router. shared memory. router. 2. Network-on-Chip. I. P1 D. I. …. P2 D. I. D. bus. Pn I. analyses. router. D. bus. P0. Timing analysis of cores. 1. Shared buses and memory?. 2. NoC routing. 3. Shared I/O controllers. D. 1. shared memory. 4. /. 52.

(13) Contributions Contribution 1.   . Analysis of Time Division Multiple Access policy ◦ Approach based on Satisfiability Modulo Theory. . 5. /. 52.

(14) Contributions Contribution 1.   . Analysis of Time Division Multiple Access policy ◦ Approach based on Satisfiability Modulo Theory. . Contribution 2 Response time analysis of a many-core processor ◦ Synchronous Data Flow programs ◦ Model of the shared bus arbiter The High Five, Dallas, Texas, USA. 5. /. 52.

(15) Outline. I TDMA Bus Timing Analysis. II Many-Core Response Time Analysis. III Conclusion. 6. /. 52.

(16) TDMA Bus Timing Analysis.

(17) Definition: Time Division Multiple Access. time 00. 20. 40. 60. 80. 100. 120. 140. 160. 180. 8. /. 52.

(18) Definition: Time Division Multiple Access. Core A 00. 20. Core B 40. Core C 60. 80. Core A 100. Core B 120. 140. Core C 160. time 180. slot length σ TDMA period π. 8. /. 52.

(19) Definition: Time Division Multiple Access Core A viewpoint:. Core A 00. 20. Core A 40. 60. 80. 100. time 120. 140. 160. 180. slot length σ TDMA period π. 8. /. 52.

(20) Definition: Time Division Multiple Access Core A viewpoint:. ack. req1 acc. time 00. 20. 40. 60. 80. 100. 120. 140. 160. 180. slot length σ TDMA period π. 8. /. 52.

(21) Definition: Time Division Multiple Access Core A viewpoint:. ack. req1 acc. req2. stall time. acc. ack time. 00. 20. 40. 60. 80. 100. 120. 140. 160. 180. slot length σ TDMA period π. 8. /. 52.

(22) Definition: Time Division Multiple Access Core A viewpoint:. ack. req1 acc. 00. 20. req2. 40. stall time. 60. acc. 80. ack req3. 100. acc 120. worst-case stall time time 140. 160. 180. slot length σ TDMA period π. Worst-Case Stall Time = π - (σ - acc). 8. /. 52.

(23) Definition: Time Division Multiple Access Core A viewpoint: off 1. ack. req1 acc. 00. off 3. off 2. 20. req2. 40. stall time. 60. acc. 80. ack req3 acc. 100. 120. worst-case stall time time 140. 160. 180. slot length σ TDMA period π. Worst-Case Stall Time = π - (σ - acc) ◦ Offsets off 1 , off 2 , off 3 relative to the TDMA period:. off{1,2,3} = time_instant(req{1,2,3} ) mod π 8. /. 52.

(24) Outline: TDMA Bus Timing Analysis 1 Approaches in WCET Analysis of TDMA 2 WCET Analysis by SMT Encoding. Naive SMT Approach Offset-based SMT Encoding. 3 Experimental Evaluation 4 Summary and Future Work of Part I. I. TDMA Bus Timing Analysis. II Many-Core Response Time Analysis. III Conclusion 9. /. 52.


(26) Worst-Case Execution Time (WCET) Analysis of TDMA int f ( int x){ /* 3 c y c l e s */ i f ( cond ) { /* 10 c y c l e s */ } i f ( ! cond ) { /*bus access */ } return ; } . Goal: Estimate the WCET. B1: /*3 cycles*/ if (cond). . True False. B2: /*10 cycles*/. B3: if (!cond). False. True. B4: /* bus access */. B5: return. 11. /. 52.

(27) Worst-Case Execution Time (WCET) Analysis of TDMA int f ( int x){ /* 3 c y c l e s */ i f ( cond ) { /* 10 c y c l e s */ } i f ( ! cond ) { /*bus access */ } return ; } . Goal: Estimate the WCET. B1: /*3 cycles*/ if (cond). . ,→ Existing approaches:. 1 Worst-case everywhere. True False. B2: /*10 cycles*/. B3: if (!cond). False. . [Altmeyer et al., 2015; Rosèn et al., 2007…]. True. B4: /* bus access */ cost=π − σ + 2 acc B5: return. 11. /. 52.

(28) Worst-Case Execution Time (WCET) Analysis of TDMA int f ( int x){ /* 3 c y c l e s */ i f ( cond ) { /* 10 c y c l e s */ } i f ( ! cond ) { /*bus access */ } return ; } . Goal: Estimate the WCET. B1: /*3 cycles*/ if (cond). . ,→ Existing approaches: True. False off1. B2: /*10 cycles*/ off2 B3: if (!cond). False {off10 , off20 }. 1 Worst-case everywhere [Altmeyer et al., 2015; Rosèn et al., 2007…]. 2 Capture all possible offsets [Chattopadhyay et al., 2010; Kelter et al., 2014…]. True {off10 , off20 }. B4: /* bus access */. . off. B5: return. req acc. ack time. slot length σ TDMA period π. 11. /. 52.

(29) Worst-Case Execution Time (WCET) Analysis of TDMA int f ( int x){ /* 3 c y c l e s */ i f ( cond ) { /* 10 c y c l e s */ } i f ( ! cond ) { /*bus access */ } return ; } . Goal: Estimate the WCET. B1: /*3 cycles*/ if (cond). . ,→ Existing approaches: True. False off1. B2: /*10 cycles*/ off2 B3: if (!cond). False {off10 , off20 }. True {off10 , off20 }. 1 Worst-case everywhere [Altmeyer et al., 2015; Rosèn et al., 2007…]. 2 Capture all possible offsets [Chattopadhyay et al., 2010; Kelter et al., 2014…]. ,→ Combined with:. 3 Feasible Path Analysis. B4: /* bus access */. . off. B5: return. req acc. ack time. slot length σ TDMA period π. 11. /. 52.

(30) Approaches in WCET Analysis of TDMA 2 Capture all possible offsets [Kelter et al., 2014] [Chattopadhyay et al., 2010]. 3 Feasible Path Analysis with SMT [Henry et al., 2014]. 12. /. 52.

(31) Approaches in WCET Analysis of TDMA 2 Capture all possible offsets [Kelter et al., 2014] [Chattopadhyay et al., 2010]. 3 Feasible Path Analysis with SMT [Henry et al., 2014]. Contribution (in RTNS 2015): Compute WCET by encoding the semantics and shared resource accesses into an optimization problem (SMT). 12. /. 52.


(33) WCET Analysis by SMT Encoding ◦ Bounded Model Checking ◦ Encode the semantics into a Satisfiability Modulo Theory problem. 14. /. 52.

(34) WCET Analysis by SMT Encoding ◦ Bounded Model Checking ◦ Encode the semantics into a Satisfiability Modulo Theory problem. SMT query = “Is there a trace with a feasible path?” assert(V expr) |. {z. }. ◦ SMT-solver response: ◦ SAT: There is a feasible execution path ◦ UNSAT: There is no feasible execution path. 14. /. 52.

(35) WCET Analysis by SMT Encoding ◦ Bounded Model Checking ◦ Encode the semantics into a Satisfiability Modulo Theory problem ◦ Add execution times on the paths. SMT query = “Is there a trace with a feasible path assert(V expr) ...such that the execution time is greater than X ?” |. {z. }. ◦ SMT-solver response: ◦ SAT: There is a feasible path with an execution time > X ◦ UNSAT: X is an upper-bound on WCET. 14. /. 52.

(36) WCET Analysis by SMT Encoding ◦ Bounded Model Checking ◦ Encode the semantics into a Satisfiability Modulo Theory problem ◦ Add execution times on the paths. SMT query = “Is there a trace with a feasible path assert(V expr) ...such that the execution time is greater than X ?” |. {z. }. ◦ SMT-solver response: ◦ SAT: There is a feasible path with an execution time > X ◦ UNSAT: X is an upper-bound on WCET. Goal Find the smallest X , such that Execution Time > X is UNSAT 14. /. 52.

(37) Example: Semantics and Timing Encoding pred= (y < 0) t1,3 = b1 ∧ !pred. B1 : y =read_value(); if (y < 0). t1,2 = b1 ∧ pred True. B2 : /*10 cycles*/. False b3 = t1,3 ∧ t2,3. b2 = t1,2. Previous work in [Henry et al., 2014] def. Ï bi ”true” ⇐⇒ Bi executed def. Ï ti ,j ”true” ⇐⇒ Bi → Bj taken. B3 : if (y ≥ 0) True. False t 3, 5. B4 : /*bus access*/ B5 : return 15. /. 52.

(38) Example: Semantics and Timing Encoding pred= (y < 0) t1,3 = b1 ∧ !pred. start=0 B1 : y =read_value(); if (y < 0). t1,2 = b1 ∧ pred e1,2 = start + wcet(B1 ). True. B2 : /*10 cycles*/. False. b2 = t1,2. e2,3 = e1,2 + 10 b3 = t1,3 ∧ t2,3. e 3, 5 False t 3, 5. B3 : if (y ≥ 0) e3,4. Previous work in [Henry et al., 2014] def. Ï bi ”true” ⇐⇒ Bi executed def. Ï ti ,j ”true” ⇐⇒ Bi → Bj taken. Ï ei ,j execution time at transition Bi → Bj. True. B4 : /*bus access*/ e4,5 = e3,4 + wcet(B4 ). B5 : return execution time = if t3,5 then e3,5 else e4,5. 15. /. 52.

(39) Example: Semantics and Timing Encoding pred= (y < 0) t1,3 = b1 ∧ !pred. start=0 B1 : y =read_value(); if (y < 0). t1,2 = b1 ∧ pred e1,2 = start + wcet(B1 ). True. B2 : /*10 cycles*/. False. b2 = t1,2. e2,3 = e1,2 + 10 b3 = t1,3 ∧ t2,3. e 3, 5 False t 3, 5. B3 : if (y ≥ 0) e3,4. True. B4 : /*bus access*/. Previous work in [Henry et al., 2014] def. Ï bi ”true” ⇐⇒ Bi executed def. Ï ti ,j ”true” ⇐⇒ Bi → Bj taken. Ï ei ,j execution time at transition Bi → Bj. Contribution. Ï tdma_cost() execution time of a bus. access. e4,5 = e3,4 + tdma_cost(e3,4 ). B5 : return execution time = if t3,5 then e3,5 else e4,5. 15. /. 52.

(40) Naive SMT Encoding 1 reqA 1. ack. acc. 2 reqA 2. π − offset(reqA 2). ack. acc. time. slot length σ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. TDMA period π. tdma_cost(eentry ): returns the execution time of a bus access. eentry off entry ← eentry mod π /* is off entry inside the slot? */ if off entry ∈ [0, σ − acc[ 1 Yes. No 2. return acc. return (π − off entry ) + acc 16. /. 52.

(41) Performance of the Naive Encoding. bus access. N. bus access. bus access. 17. /. 52.

(42) Performance of the Naive Encoding. N. bus access. bus access. time(s) (log scale). bus access. ●. 10000. 100% access (naive). ●. ●. 100 ●. ●. 1. ●. 5. 10. 15. N blocks. 20. 25. 17. /. 52.

(43) Naive SMT Encoding 1 reqA 1. ack. acc. 2 reqA 2. π − offset(reqA 2). ack. acc. time. slot length σ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. TDMA period π. tdma_cost: returns the execution time of a bus access. eentry off entry ← eentry mod π /* is off entry inside the slot? */ if off entry ∈ [0, σ − acc[ 1 Yes. No 2. return acc. return (π − off entry ) + acc 18. /. 52.

(44) Offset-based SMT Encoding start=0 B1 : y =read_value(); if (y < 0). e1,2 = start + wcet(B1 ). ◦ offi ,j = ei ,j mod π. True False. B2 : /*10 cycles*/ e2,3 = e1,2 + 10 B3 : if (y ≥ 0) True. False. B4 : /*bus access*/ e4,5 = e3,4 + tdma_cost(e3,4 ). B5 : return execution time = if t3,5 then e3,5 else e4,5. 19. /. 52.

(45) Offset-based SMT Encoding offs ∈ [0, π[ start=0 B1 : y =read_value(); if (y < 0). e1,2 = start + wcet(B1 ) off1,2 = (offs + wcet(B1 )) mod π True. False. ◦ offi ,j = ei ,j mod π. ◦ offi ,j offset at transition Bi → Bj. B2 : /*10 cycles*/. B3 : if (y ≥ 0). e2,3 = e1,2 + 10 off2,3 = (off1,2 + 10) mod π. True False. B4 : /*bus access*/ e4,5 = e3,4 + tdma_cost(off3,4 ) off4,5 = tdma_offset(off3,4 ). B5 : return execution time = if t3,5 then e3,5 else e4,5. 19. /. 52.

(46) Offset-based SMT Encoding 1 reqA 1. ack. acc. 2 reqA 2. π − offset(reqA 2). ack. acc. time. slot length σ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. TDMA period π. tdma_cost: returns the time after a bus access. tdma_offset: returns the offset after a bus access. offentry. offentry. /* is off entry inside the slot? */ if off entry ∈ [0, σ − acc[ 1 Yes return acc. No 2 return (π − off entry ) + acc. /* is off entry inside the slot? */ if off entry ∈ [0, σ − acc[ 1 Yes return off entry + acc. No 2 return acc. 20. /. 52.

(47) Performance of the Offset-based Encoding. ●. N. bus access. bus access. 1e+03. time(s) (log scale). bus access. ●. naive offset based. ●. ●. 1e+01. 1e−01. ●. ●. 10. 100. #basic blocks (log scale). 1000. 21. /. 52.


(49) Proof-of-Concept Implementation. C code. LLVM compiler. Timing model. WCET. SMT-solving. LLVM bitcode. PAGAI. SMT clauses 23. /. 52.

(50) Proof-of-Concept Implementation. C code. LLVM compiler. LLVM bitcode:. ◦ Unrolled loops. Timing model: ◦ 1 instruction = 1 cycle ◦ Each load and store. PAGAI. requests a bus access. ◦ Timing Compositional. WCET. SMT-solving. SMT clauses 23. /. 52.

(51) Evaluation: Benchmark Descriptions Benchmark from TACLEBench suite Name bs insertsort jfdctint fdct. compressdata fly-by-wire. Description Binary search Insertion sort on a reversed array Discrete Cosine Transformation Fast Discrete Cosine Transform Data compression program adopted from SPEC95 UAV fly-by-wire software. 1 https://github.com/tacle/tacle-bench. 1. #LLVM instr. 231. #bus access 12. 493. 65. 2334. 448. 2502. 385. 674. 131. 2815. 515. 24. /. 52.

(52) Evaluation: Experiments Comparison between estimated WCET and pessimistic WCET π = 40, σ = 20, acc = 10. 0. 6 0. 4 0. 2. 0. 8 0. 6 0. 4 0. 2 wi re. ta. -b y-. fly. da. t. ss. fd c. pr e co m. in t ct. or t. jfd. er ts. bs. in s. co m. fd ct pr es sd at a fly -b ywi re. in t ct. jfd. er. ts or t. 0 bs. 0. π = 400, σ = 200, acc = 40. 1 WCETestimated WCETpess. 0. 8. in s. WCETestimated WCETpess. 1. Best-case of gain: All requests are within the TDMA slots Worst-case of gain: All requests have worst-case delay 25. /. 52.


(54) Summary of Part I. ◦ SMT encodings for TDMA access ◦ Feasible path analysis combined with the WCET computation ◦ Comparison between different encodings ◦ Validation with small but relevant benchmarks. Find in the manuscript: ◦ Linearization of SMT encoding (modulo operators) ◦ Other possible SMT encodings. 27. /. 52.

(55) Future Work of Part I Research perspectives C code:. ◦ Compositional analysis. LLVM compiler. ◦ Partial loop unrolling. LLVM bitcode. Implementation perspectives Executable binary. WCET. Timing model: ◦ Realistic timing values. (e.g. Otawa). SMT-solving. PAGAI. SMT clauses. 28. /. 52.

(56) Future Work of Part I Research perspectives C code:. ◦ Compositional analysis. LLVM compiler. ◦ Partial loop unrolling. LLVM bitcode. Implementation perspectives Executable binary. WCET. Timing model: ◦ Realistic timing values. (e.g. Otawa). SMT-solving. PAGAI. SMT clauses. SMT is an interesting research direction for WCET Analysis 28. /. 52.

(57) From TDMA to Other Arbitration Policy Full isolation with TDMA Model bus. . task. Model bus :. Model bus. . task. task. :. . . task. . task. :. . task. :. task. . :. task. :. . task. . task. :. Model bus. Model bus :. Model bus :. . Model bus. Model bus. Model bus :. Model bus task. :. Model bus. Model bus. . . task. Model bus. . task. :. Model bus :. . task. :. …. Linear complexity in the number of tasks. 29. /. 52.

(58) From TDMA to Other Arbitration Policy No isolation. Full isolation with TDMA Model bus. . task. Model bus :. Model bus. . task. . :. . . task. :. . task. :. . task. :. task. . :. task. :. :. . task. . task. Model :. Model bus. Model bus. Model bus :. . Model bus. Model bus. Model bus. Model bus task. task. :. Model bus. Model bus task. . Model bus. . task. :. Model bus :. . task. …. Linear complexity in the number of tasks. ♂. :. bus  task. . task. . task. . task. . task. . task. . task. . task. . task. . task. . task. . task. . task. . task. . ….  States explosion!. 29. /. 52.

(59) Analysis of Large Multi and Many-Cores B1. 1•. B2. . B3. B3. Complexity. 2 Account for any interference globally during the task’s execution. B4 B5. B4 B5. 1 Exact analysis. B1 B2. B6. P1 P0 t. •2. Pessimism 30. /. 52.

(60) Analysis of Large Multi and Many-Cores B1. 1•. B2. . B3. B3. Complexity. 2 Account for any interference globally during the task’s execution. B4 B5. B4 B5. 1 Exact analysis. B1 B2. B6. 3 Exploit any information about: Kalray MPPA2. ◦ The target architecture ◦ Reduce the interference ◦ Model precisely the shared resources ◦ The target application model. P1. 3. P0 t. •. •2. Synchronous Data Flow. Pessimism 30. /. 52.

(61) Many-Core Response Time Analysis.

(62) Outline: Many-Core Response Time Analysis 5 Implementation Choices of Synchronous Data Flow Programs 6 Multicore Response Time Analysis of SDF Programs 7 Target Many-Core: Kalray MPPA2 8 Evaluation 9 Summary and Future Work of Part II. I. TDMA Bus Timing Analysis. II Many-Core Response Time Analysis. III Conclusion 32. /. 52.


(64) Implementation Choices: SDF on Multi/Many-Cores. i n t NA ( . . . ) i n t NB ( . . . ) i n t NC ( . . . ) i n t ND ( . .{. ) { i n t NE ( . . . ) // t a s k τ1 i n t NF ( . .{. ) { // t a s k τ2 { // t a s k τ3 return ( . . . ) ; { // t a s k τ4 r eturn ( . . . ) ; // t a s k τ5 return ( . . . ) ; // t a s k τ6 r e t u r }n ( . }. . ) ; return ( . . . ) ; r e t u r }n ( . }. . ) ; } }. High-level representation. i1. τ0. τ1. τ2. o. Multi/Many-Core code generation. τ3. i2. τ4. τ5. Static, time triggered, non-preemptive scheduling τ5. τ4. PE2 τ3. τ2. PE1 τ0. τ1. PE0. 34. /. 52.

(65) Implementation Choices: SDF on Multi/Many-Cores. i n t NA ( . . . ) i n t NB ( . . . ) i n t NC ( . . . ) i n t ND ( . .{. ) { i n t NE ( . . . ) // t a s k τ1 i n t NF ( . .{. ) { // t a s k τ2 { // t a s k τ3 return ( . . . ) ; { // t a s k τ4 r eturn ( . . . ) ; // t a s k τ5 return ( . . . ) ; // t a s k τ6 r e t u r }n ( . }. . ) ; return ( . . . ) ; r e t u r }n ( . }. . ) ; } }. High-level representation. i1. τ0. τ1. τ2. o. Multi/Many-Core code generation. τ3. i2. τ4. τ5. Static, time triggered, non-preemptive scheduling 3 Respect the dependency constraints. τ5. τ4. PE2 τ3. τ2. PE1 τ0. τ1. PE0. 34. /. 52.

(66) Implementation Choices: SDF on Multi/Many-Cores. i n t NA ( . . . ) i n t NB ( . . . ) i n t NC ( . . . ) i n t ND ( . .{. ) { i n t NE ( . . . ) // t a s k τ1 i n t NF ( . .{. ) { // t a s k τ2 { // t a s k τ3 return ( . . . ) ; { // t a s k τ4 r eturn ( . . . ) ; // t a s k τ5 return ( . . . ) ; // t a s k τ6 r e t u r }n ( . }. . ) ; return ( . . . ) ; r e t u r }n ( . }. . ) ; } }. High-level representation. i1. τ0. τ1. τ2. o. Multi/Many-Core code generation. τ3. i2. τ4. τ5. Static, time triggered, non-preemptive scheduling 3 Respect the dependency constraints. τ5. τ4. PE2 τ3. τ2. PE1. 3 account for precise upper bounds on the interference. τ0. τ1. PE0. 34. /. 52.

(67) Model of the Application i1. τ0. τ1. τ2. τ4. An execution instance is: ◦ Direct Acyclic Task Graph. τ3. i2. o. ◦ Mono-rate (or at least harmonic rates) τ5. ◦ Fixed mapping and execution order. 35. /. 52.

(68) Model of the Application i1. τ0. τ1. τ2. τ4. An execution instance is: ◦ Direct Acyclic Task Graph. τ3. i2. o. ◦ Mono-rate (or at least harmonic rates) τ5. ◦ Fixed mapping and execution order. Each task τi :. t 35. /. 52.

(69) Model of the Application i1. τ0. τ1. τ2. τ4. An execution instance is: ◦ Direct Acyclic Task Graph. τ3. i2. o. ◦ Mono-rate (or at least harmonic rates) τ5. ◦ Fixed mapping and execution order. Each task τi :. memory accesses. t 35. /. 52.

(70) Model of the Application i1. τ0. τ1. τ2. τ4. An execution instance is: ◦ Direct Acyclic Task Graph. τ3. i2. o. ◦ Mono-rate (or at least harmonic rates) τ5. ◦ Fixed mapping and execution order. Each task τi :. ◦ Release date (rel i ). Response time (Ri ) Ri rel i. t 35. /. 52.

(71) Model of the Application i1. τ0. τ1. τ2. τ4. An execution instance is: ◦ Direct Acyclic Task Graph. τ3. i2. o. ◦ Mono-rate (or at least harmonic rates) τ5. ◦ Fixed mapping and execution order. Each task τi :. ◦ Release date (rel i ). Response time (Ri ) Ri rel i. t 35. /. 52.

(72) Model of the Application τ0. τ1. τ2. τ4. An execution instance is: ◦ Direct Acyclic Task Graph. τ3. i2. o. ◦ Mono-rate (or at least harmonic rates) τ5. ◦ Fixed mapping and execution order. Each task τi :. ◦ Release date (rel i ). Response time (Ri ) Ri rel i. E0. E. i1. Interference. t 35. /. 52.

(73) Model of the Application i1. τ0. τ1. τ2. o. ◦ Direct Acyclic Task Graph. τ3. i2. τ4. An execution instance is: ◦ Mono-rate (or at least harmonic rates). τ5. ◦ Fixed mapping and execution order. Each task τi :. ◦ Release date (rel i ). Response time (Ri ) Ri rel i.  Find Ri including interference  Find rel i respecting dependencies. E0. E. Static Non-Preemptive Scheduling. Interference. t 35. /. 52.


(75) Response Time Analysis Existing work [Altmeyer et al., 2015] ∀i : Ri = WCETi + I BUS (Ri ) + . . . For each task i: ◦ Response Time ◦ WCET in isolation ◦ Bus Interference. Independent tasks T1. T1. T1. T1. PE1 T0 PE0 t 00. 40. 80. 120. 160. 37. /. 52.

(76) Response Time Analysis Existing work [Altmeyer et al., 2015] ∀i : Ri = WCETi + I BUS (Ri ) + . . .. Contribution (in RTNS 2016) ∀i : Ri = WCETi + I BUS (Ri , Θ ) + . . . ◦ WCET in isolation. For each task i:. ◦ Set of release dates of all tasks. ◦ Response Time. ◦ Bounded interference. ◦ WCET in isolation. Dependent tasks. ◦ Bus Interference. τ5. τ4. PE2. Independent tasks. τ3 T1. T1. T1. τ2. PE1. T1. PE1. τ0. T0. τ1. PE0. PE0 t 00. 40. 80. 120. 160. 37. /. 52.

(77) Response Time Analysis Existing work [Altmeyer et al., 2015] ∀i : Ri = WCETi + I BUS (Ri ) + . . .. Contribution (in RTNS 2016) ∀i : Ri = WCETi + I BUS (Ri , Θ ) + . . . ◦ WCET in isolation. For each task i:. ◦ Set of release dates of all tasks. ◦ Response Time. ◦ Bounded interference. ◦ WCET in isolation. Dependent tasks. ◦ Bus Interference. τ5. τ4. PE2. Independent tasks. τ3 T1. T1. T1. τ2. PE1. T1. PE1. τ0. T0. τ1. PE0. PE0 t 00. 40. 80. 120. 160.  Recursive formula ⇒ iterative algorithm. 37. /. 52.

(78) Response Time Analysis with Dependencies. PE2. 1 initial rel i / iteration l = 0. τ5. τ4. PE1 PE0. 1. τ3 τ0. τ1. τ2. WCRT analysis for all i do Ril +1 ← WCETi + I BUS (Ril , Θ) end for. Start with initial release dates. 38. /. 52.

(79) Response Time Analysis with Dependencies. PE2. initial rel i 0 / iteration l = 0. τ5. τ4. PE1 PE0. 1 2. τ3 τ0. τ1. τ2. Ril +1 6= Ril / 2 l = l +1. WCRT analysis for all i do Ril +1 ← WCETi + I BUS (Ril , Θ) end for. Start with initial release dates Compute response times .... 38. /. 52.

(80) Response Time Analysis with Dependencies. PE2. initial rel i 0 / iteration l = 0. τ5. τ4. PE1 PE0. 1 2. τ3 τ0. τ1. τ2. Ril +1 6= Ril / 2 l = l +1. WCRT analysis for all i do Ril +1 ← WCETi + I BUS (Ril , Θ) end for. Start with initial release dates Compute response times ... .... 38. /. 52.

(81) Response Time Analysis with Dependencies. PE2. initial rel i 0 / iteration l = 0. τ5. τ4. PE1 PE0. 1 2. τ3 τ0. τ1. τ2. Start with initial release dates Compute response times. Ril +1 6= Ril / l = l +1. WCRT analysis for all i do Ril +1 ← WCETi + I BUS (Ril , Θ) end for. Ri 2. ... ... ... a fixed-point is reached!. 38. /. 52.

(82) Response Time Analysis with Dependencies. PE2. initial rel i 0 / iteration l = 0. τ5. τ4. PE1 PE0. 1 2. τ3 τ0. τ1. τ2. Start with initial release dates Compute response times. ... ... ... a fixed-point is reached!. 3. Update the release dates. Ril +1 6= Ril / l = l +1. WCRT analysis for all i do Ril +1 ← WCETi + I BUS (Ril , Θ) end for. Ri 3 Update release dates for all i do rel i ← latest finish time of all the dependencies end for. 38. /. 52.

(83) Response Time Analysis with Dependencies. PE2. initial rel i 0 / iteration l = 0. τ5. τ4. PE1 PE0. τ3 τ0. τ1. τ2. Ril +1 6= Ril / l = l +1. WCRT analysis for all i do Ril +1 ← WCETi + I BUS (Ril , Θ) end for. 4. 1 2. Start with initial release dates Compute response times. ... ... ... a fixed-point is reached!. 3 4. Update the release dates Repeat. Ri. new rel i repeat. Update release dates for all i do rel i ← latest finish time of all the dependencies end for. 38. /. 52.

(84) Response Time Analysis with Dependencies. PE2. initial rel i 0 / iteration l = 0. τ5. τ4. PE1 PE0. 1 2. τ3 τ0. τ1. τ2. Start with initial release dates Compute response times. ... ... ... a fixed-point is reached!. 3 4. Update the release dates Repeat until no release date changes (another fixed-point iteration).. Ril +1 6= Ril / l = l +1. WCRT analysis for all i do Ril +1 ← WCETi + I BUS (Ril , Θ) end for new rel i repeat. Ri. Update release dates for all i do rel i ← latest finish time of all the dependencies end for. 4 rel i did not change Return: (rel i , Ri ). 38. /. 52.

(85) Convergence Toward a Fixed-point. P2. τ4. τ5. P0. initial rel i 0. τ3. P1. τ0. τ1. τ2. ◦ Convergence of the 1st fixed-point iteration:. Ril +1 6= Ril. WCRT analysis for all i do Ril +1 ← PDi + I BUS (Ril , Θ) end for. Ri. new rel i repeat. Update release dates for all i do rel i ← latest finish time of all the dependencies end for. rel i did not change Return: (rel i , Ri ). 39. /. 52.

(86) Convergence Toward a Fixed-point. P2. τ4. τ5. P0. initial rel i 0. τ3. P1. τ0. τ1. τ2. ◦ Convergence of the 1st fixed-point iteration: ◦ Monotonic and bounded. 3. Ril +1 6= Ril. WCRT analysis for all i do Ril +1 ← PDi + I BUS (Ril , Θ) end for. Ri. new rel i repeat. Update release dates for all i do rel i ← latest finish time of all the dependencies end for. rel i did not change Return: (rel i , Ri ). 39. /. 52.

(87) Convergence Toward a Fixed-point. P2. τ4. τ5. P0. initial rel i 0. τ3. P1. τ0. τ1. τ2. ◦ Convergence of the 1st fixed-point iteration: ◦ Monotonic and bounded nd. ◦ Convergence of the 2. 3. fixed-point iteration:. Ril +1 6= Ril. WCRT analysis for all i do Ril +1 ← PDi + I BUS (Ril , Θ) end for. Ri. new rel i repeat. Update release dates for all i do rel i ← latest finish time of all the dependencies end for. rel i did not change Return: (rel i , Ri ). 39. /. 52.

(88) Convergence Toward a Fixed-point. P2. τ4. τ5. P0. initial rel i 0. τ3. P1. τ0. τ1. τ2. ◦ Convergence of the 1st fixed-point iteration: ◦ Monotonic and bounded nd. 3. ◦ Convergence of the 2 fixed-point iteration: ◦ No monotonicity: Ri and rel i may grow or shrink at each iteration. ?. Ril +1 6= Ril. WCRT analysis for all i do Ril +1 ← PDi + I BUS (Ril , Θ) end for. Ri. new rel i repeat. Update release dates for all i do rel i ← latest finish time of all the dependencies end for. rel i did not change Return: (rel i , Ri ). 39. /. 52.

(89) Convergence Toward a Fixed-point t P2. τ4. τ5. initial rel i 0. τ3. P1. τ0. P0. τ1. τ2. ◦ Convergence of the 1st fixed-point iteration: ◦ Monotonic and bounded nd. 3. ◦ Convergence of the 2 fixed-point iteration: ◦ No monotonicity: Ri and rel i may grow or shrink at each iteration. ?. Theorem At each iteration, at least one task finds its final release date. Full proof in the manuscript.. Ril +1 6= Ril. WCRT analysis for all i do Ril +1 ← PDi + I BUS (Ril , Θ) end for. Ri. new rel i repeat. Update release dates for all i do rel i ← latest finish time of all the dependencies end for. rel i did not change Return: (rel i , Ri ). 39. /. 52.

(90) Convergence Toward a Fixed-point t P2. τ4 τ5. P0. initial rel i 0. τ3. P1. τ0. τ1. τ2. ◦ Convergence of the 1st fixed-point iteration: ◦ Monotonic and bounded nd. 3. ◦ Convergence of the 2 fixed-point iteration: ◦ No monotonicity: Ri and rel i may grow or shrink at each iteration. ?. Theorem At each iteration, at least one task finds its final release date. Full proof in the manuscript.. Ril +1 6= Ril. WCRT analysis for all i do Ril +1 ← PDi + I BUS (Ril , Θ) end for. Ri. new rel i repeat. Update release dates for all i do rel i ← latest finish time of all the dependencies end for. rel i did not change Return: (rel i , Ri ). 39. /. 52.

(91) Convergence Toward a Fixed-point t P2. τ4. τ5. P0. initial rel i 0. τ3. P1. τ0. τ1. τ2. ◦ Convergence of the 1st fixed-point iteration: ◦ Monotonic and bounded nd. 3. ◦ Convergence of the 2 fixed-point iteration: ◦ No monotonicity: Ri and rel i may grow or shrink at each iteration. ?. Theorem At each iteration, at least one task finds its final release date. Full proof in the manuscript.. Ril +1 6= Ril. WCRT analysis for all i do Ril +1 ← PDi + I BUS (Ril , Θ) end for. Ri. new rel i repeat. Update release dates for all i do rel i ← latest finish time of all the dependencies end for. rel i did not change Return: (rel i , Ri ). 39. /. 52.

(92) Convergence Toward a Fixed-point t P2. τ4. τ5. P0. initial rel i 0. τ3. P1. τ0. τ1. τ2. ◦ Convergence of the 1st fixed-point iteration: ◦ Monotonic and bounded nd. 3. ◦ Convergence of the 2 fixed-point iteration: ◦ No monotonicity: Ri and rel i may grow or shrink at each iteration. ?. Theorem At each iteration, at least one task finds its final release date. Full proof in the manuscript.. Ril +1 6= Ril. WCRT analysis for all i do Ril +1 ← PDi + I BUS (Ril , Θ) end for. Ri. new rel i repeat. Update release dates for all i do rel i ← latest finish time of all the dependencies end for. rel i did not change Return: (rel i , Ri ). 39. /. 52.


(94) Target Many-Core: Kalray MPPA2. I/O Ethernet 1. I/O Ethernet 0. I/O DDR 0. ◦ Kalray MPPA2 (codenamed Bostan) ◦ 16 compute clusters + 4 I/O clusters ◦ Dual NoC. I/O DDR 1. 41. /. 52.

(95) I/O Ethernet 0. I/O Ethernet 1. I/O DDR 1. NoC Rx. NoC Tx. RM. DSU. P6. P7. P14 P15. P4. P5. P12 P13. P2. P3. P10 P11. P0. P1. P8. P9. Per cluster: 8 shared memory banks. I/O DDR 0. 8 shared memory banks. Target Many-Core: Kalray MPPA2 ◦ 16 cores + 1 Resource Manager ◦ NoC Tx, NoC Rx, Debug Unit ◦ 16 shared memory banks (total: 2 MB). 41. /. 52.

(96) I/O Ethernet 1. I/O Ethernet 0. I/O DDR 0. I/O DDR 1. DSU. P6. P7. P14 P15. P4. P5. P12 P13. P2. P3. P10 P11. P0. P1. P8. P9. Per cluster: ◦ 16 cores + 1 Resource Manager ◦ NoC Tx, NoC Rx, Debug Unit ◦ 16 shared memory banks (total: 2 MB) ◦ 1 bus arbiter per memory bank. …. memory bank b15. DSU RM. bus arbiter Rx Tx. P0. RM. bus arbiter Rx Tx. P1. NoC Tx. DSU RM. Rx Tx. P15. NoC Rx. 8 shared memory banks. 8 shared memory banks. Target Many-Core: Kalray MPPA2. memory bank b1. DSU RM. bus arbiter. memory bank b0 41. /. 52.

(97) I/O Ethernet 1. I/O Ethernet 0. I/O DDR 0. I/O DDR 1. DSU. P6. P7. P14 P15. P4. P5. P12 P13. P2. P3. P10 P11. P0. P1. P8. P9. Per cluster: ◦ 16 cores + 1 Resource Manager ◦ NoC Tx, NoC Rx, Debug Unit ◦ 16 shared memory banks (total: 2 MB) ◦ 1 bus arbiter per memory bank. …. memory bank b15. ◦ Possible spatial isolation. DSU RM. bus arbiter Rx Tx. P0. RM. bus arbiter Rx Tx. P1. NoC Tx. DSU RM. Rx Tx. P15. NoC Rx. 8 shared memory banks. 8 shared memory banks. Target Many-Core: Kalray MPPA2. memory bank b1. DSU RM. bus arbiter. memory bank b0. assigning memory banks to cores. ◦ Task execution model: ◦ execute in a “local” bank ◦ write to a “remote” bank ◦ Interference from communications 41. /. 52.

(98) Model of the MPPA2 Bus bus arbiter Rx. . fixed priority. P0. round robin. P1. X. IbBUS (Ri , Θ). b ∈B where B : a set of memory banks. round robin. known upper-bound on the number of RM memory accesses DSU. I BUS (Ri , Θ) =. round robin. Tx. Shared Bus Model. high priority. memory bank b. Bus interference IbBUS (Ri , Θ):. ♂ ♂. 42. /. 52.

(99) Model of the MPPA2 Bus bus arbiter Rx. P0 . IbBUS (Ri , Θ). memory bank b. Bus interference IbBUS (Ri , Θ): 1. E(P0)=. .... : : : :. P1. round robin. 1. round robin. DSU. X. b ∈B where B : a set of memory banks. fixed priority. RM. I BUS (Ri , Θ) =. round robin. Tx. Shared Bus Model. high priority. ♂ ♂. 42. /. 52.

(100) Model of the MPPA2 Bus bus arbiter Rx. . memory bank b. Bus interference IbBUS (Ri , Θ): 1 2. E(P0)= E(P1)=. ... .... : : : :. P0. IbBUS (Ri , Θ). : : : :. 2. round robin. P1. round robin. DSU. X. b ∈B where B : a set of memory banks. fixed priority. RM. I BUS (Ri , Θ) =. round robin. Tx. Shared Bus Model. high priority. ♂ ♂. 42. /. 52.

(101) Model of the MPPA2 Bus bus arbiter Rx. memory bank b. Bus interference IbBUS (Ri , Θ): 1 2. E(P0)= E(P1)=. ... .... : : : :. . IbBUS (Ri , Θ). : : : :. P0. round robin. P1. round robin. DSU. X. b ∈B where B : a set of memory banks. fixed priority. RM. I BUS (Ri , Θ) =. round robin. Tx. Shared Bus Model. high priority. ♂ ♂. min(E(P0 ),. E(P1)) 42. /. 52.

(102) Model of the MPPA2 Bus bus arbiter Rx. memory bank b. Bus interference IbBUS (Ri , Θ): 1 2. E(P0)= E(P1)=. ... .... : : : :. . Y. IbBUS (Ri , Θ). : : : :. P0. round robin. P1. X. b ∈B where B : a set of memory banks. X round robin. DSU. I BUS (Ri , Θ) = fixed priority. RM. round robin. Tx. Shared Bus Model. high priority. ♂ ♂. min(E(P0 ),. E(P1)). + min(E(Y ),. E(X) ) 42. /. 52.

(103) Model of the MPPA2 Bus bus arbiter Rx. memory bank b. Bus interference IbBUS (Ri , Θ): 1 2. E(P0)= E(P1)=. ... .... : : : :. . Y. IbBUS (Ri , Θ). : : : :. P0. round robin. P1. X. b ∈B where B : a set of memory banks. X round robin. DSU. I BUS (Ri , Θ) = fixed priority. RM. round robin. Tx. Shared Bus Model. high priority. ♂ ♂. min(E(P0 ),. E(P1)). + min(E(Y ),. E(X) ) + E(Rx) 42. /. 52.

(104) Model of the MPPA2 Bus bus arbiter Rx. RM. …. …. 1≤i ≤15. memory bank b. Bus interference IbBUS (Ri , Θ): 1 2. E(P0)= E(P1)=. ... .... : : : :. X. Y. IbBUS (Ri , Θ). : : : :. . round robin. P15. X. b ∈B where B : a set of memory banks. X round robin. DSU. I BUS (Ri , Θ) = fixed priority. round robin. Tx. P0. Shared Bus Model. high priority. ♂ ♂. min(E(P0 ),. E(Pi )). + min(E(Y ),. E(X) ) + E(Rx) 42. /. 52.

(105) Model of the MPPA2 Bus bus arbiter Rx. …. …. Y. memory bank b. Bus interference IbBUS (Ri , Θ): 1 2. E(P0)= E(P1)=. ... .... : : : :. 1≤i ≤15. IbBUS (Ri , Θ). : : : :. X. round robin. P15. round robin. DSU. X. b ∈B where B : a set of memory banks. fixed priority. RM. . I BUS (Ri , Θ) =. round robin. Tx. P0. Shared Bus Model. high priority. ♂ ♂. min(E(P0 ),. E(Pi )). + min(E(Y ), X=. E(X) ) + E(Rx) X. {Tx , RM , DSU } 42. /. 52.


(107) Case Study: ROSACE, a Flight Management System Controller. 2. High level representation h (200Hz). h_filter (100Hz). az (200Hz). az_filter (100Hz). vz (200Hz). vz_filter (100Hz). q (200Hz). q_filter (100Hz). va (200Hz). va_filter (100Hz). 2 Pagetti et al., 2014. altitude (50Hz). vz_control δec (50Hz). va_control δthe (50Hz). 44. /. 52.

(108) Case Study: ROSACE, a Flight Management System Controller. High level representation h (200Hz). h_filter (100Hz). az (200Hz). az_filter (100Hz). vz (200Hz). vz_filter (100Hz). q (200Hz) va (200Hz). q_filter (100Hz) va_filter (100Hz). altitude (50Hz). vz_control δec (50Hz). va_control δthe (50Hz). Unrolled execution Hyper-period Rx. receive 200 Hz. Tx. transmit 50 Hz. P4. h_filter 100 Hz. P3 P2 P1 P0. receive 200 Hz. altitude 50 Hz. receive 200 Hz. vz_control. h_filter. 50 Hz. 100 Hz. az_filter. az_filter. 100 Hz. 100 Hz. vz_filter. vz_filter. 100 Hz. 100 Hz. q_filter. q_filter. 100 Hz. 100 Hz. va_filter 100 Hz. 2 Pagetti et al., 2014. 2. va_control 50 Hz. receive 200 Hz. va_filter 100 Hz. 44. /. 52.

(109) Evaluation: ROSACE Case Study Function altitude az_filter h_filter q_filter va_control va_filter vz_control vz_filter. WCET (cycles) 275 274 326 338 303 301 320 334. Memory accesses 22 22 24 24 24 23 25 25. ◦ Values obtained from measurements ◦ Memory accesses from data and instruction cache misses + communications ◦ Moreover: ◦ NoC Rx: writes 5 words ◦ NoC Tx: reads 2 words 45. /. 52.

(110) Evaluation: ROSACE Case Study Function altitude az_filter h_filter q_filter va_control va_filter vz_control vz_filter. WCET (cycles) 275 274 326 338 303 301 320 334. Memory accesses 22 22 24 24 24 23 25 25. ◦ Values obtained from measurements ◦ Memory accesses from data and instruction cache misses + communications ◦ Moreover: ◦ NoC Rx: writes 5 words ◦ NoC Tx: reads 2 words.  Experiments: Find the smallest schedulable hyper-period. 45. /. 52.

(111) Evaluation: Experiments Smallest schedulable hyper-period 1 bank. Processor cycles. 15000. 5 banks Naive Ignoring overlaps Our approach Optimistic. +709%. +709%. 10000. 5000 +110.57% +42.62%. 0. MPPA bus. +27.24% +11.21%. +0%. Bus policy. +0%. MPPA bus. 46. /. 52.

(112) Evaluation: Experiments Smallest schedulable hyper-period 1 bank. Processor cycles. 15000. 5 banks Naive Ignoring overlaps Our approach Optimistic. +709%. +709%. 10000. 5000 +110.57% +42.62%. 0. MPPA bus. +27.24% +11.21%. +0%. Bus policy. +0%. MPPA bus. Naive: - All accesses interfere - Rx bounded by 1 access per bank 46. /. 52.

(113) Evaluation: Experiments Smallest schedulable hyper-period 1 bank. Processor cycles. 15000. 5 banks Naive Ignoring overlaps Our approach Optimistic. +709%. +709%. 10000. 5000 +110.57% +42.62%. 0. Naive: - All accesses interfere - Rx bounded by 1 access per bank. MPPA bus. +27.24% +11.21%. +0%. Bus policy. +0%. MPPA bus. Ignoring overlap: We don’t use the release dates 46. /. 52.

(114) Evaluation: Experiments Smallest schedulable hyper-period 1 bank. Processor cycles. 15000. 5 banks Naive Ignoring overlaps Our approach Optimistic. +709%. +709%. 10000. 5000 +110.57% +42.62%. 0. Naive: - All accesses interfere - Rx bounded by 1 access per bank. MPPA bus. Ignoring overlap: We don’t use the release dates. +27.24% +11.21%. +0%. Bus policy. +0%. MPPA bus. Our approach: We use the release dates 46. /. 52.

(115) Evaluation: Experiments Smallest schedulable hyper-period 1 bank. Processor cycles. 15000. 5 banks Naive Ignoring overlaps Our approach Optimistic. +709%. +709%. 10000. 5000 +110.57% +42.62%. 0. Naive: - All accesses interfere - Rx bounded by 1 access per bank. MPPA bus. Ignoring overlap: We don’t use the release dates. +27.24% +11.21%. +0%. Bus policy. +0%. MPPA bus. Our approach: We use the release dates. Optimistic: No bus interference 46. /. 52.

(116) Evaluation: Experiments Smallest schedulable hyper-period 1 bank. Processor cycles. 15000. 5 banks Naive Ignoring overlaps Our approach Optimistic. +709%. +709%. 10000. 5000 +110.57% +42.62%. 0. Naive: - All accesses interfere - Rx bounded by 1 access per bank. MPPA bus. Ignoring overlap: We don’t use the release dates. +27.24% +11.21%. +0%. Bus policy. +0%. MPPA bus. Our approach: We use the release dates. Optimistic: No bus interference 46. /. 52.

(117) Evaluation: CAPACITES Project Traditional steps for single-cores Input SCADE/Lustre application. Executable binary for the Kalray MPPA Bostan Output. 47. /. 52.

(118) Evaluation: CAPACITES Project Traditional steps for single-cores Input SCADE/Lustre application. Code generation 2: Add the scheduling/mapping in the binary. task mapping + execution order. pre-scheduling WCET analysis initial WCETs. binary with mappings. Code generation 1. binary + task graph. Task mapping and scheduling. Executable binary for the Kalray MPPA Bostan Output. 47. /. 52.

(119) Evaluation: CAPACITES Project Traditional steps for single-cores Input SCADE/Lustre application. Code generation 2: Add the scheduling/mapping in the binary. task mapping + execution order. pre-scheduling WCET analysis initial WCETs. binary with mappings. Code generation 1. binary + task graph. Task mapping and scheduling. Executable binary for the Kalray MPPA Bostan Output. 47. /. 52.

(120) Evaluation: CAPACITES Project Traditional steps for single-cores Input SCADE/Lustre application. Code generation 2: Add the scheduling/mapping in the binary. task mapping + execution order. pre-scheduling WCET analysis. Task mapping and scheduling. m ex app ec in . o gs rd an er d. Task profiles. Response times and release dates computation. binary with mappings. Code generation 1. initial WCETs. post-scheduling WCET analysis and Shared Resource analysis. binary + task graph. Response times, Release dates. Code generation 3: Add the response time and release dates in the binaries. Executable binary for the Kalray MPPA Bostan Output. 47. /. 52.


(122) Summary of Part II ◦ A response time analysis of synchronous data flow programs on the Kalray MPPA2. 49. /. 52.

(123) Summary of Part II ◦ A response time analysis of synchronous data flow programs on the Kalray MPPA2 ◦ Given: ◦ Task profiles: WCET in isolation and number of accesses ◦ Mapping of Tasks ◦ Execution Order. 49. /. 52.

(124) Summary of Part II ◦ A response time analysis of synchronous data flow programs on the Kalray MPPA2 ◦ Given: ◦ Task profiles: WCET in isolation and number of accesses ◦ Mapping of Tasks ◦ Execution Order ◦ We compute: ◦ Tight response times taking into account the interference ◦ Release dates respecting the dependency constraints. 49. /. 52.

(125) Summary of Part II ◦ A response time analysis of synchronous data flow programs on the Kalray MPPA2 ◦ Given: ◦ Task profiles: WCET in isolation and number of accesses ◦ Mapping of Tasks ◦ Execution Order ◦ We compute: ◦ Tight response times taking into account the interference ◦ Release dates respecting the dependency constraints. model of the multi-level arbiter. 49. /. 52.

(126) Summary of Part II ◦ A response time analysis of synchronous data flow programs on the Kalray MPPA2 ◦ Given: ◦ Task profiles: WCET in isolation and number of accesses ◦ Mapping of Tasks ◦ Execution Order ◦ We compute: ◦ Tight response times taking into account the interference ◦ Release dates respecting the dependency constraints. model of the multi-level arbiter. double fixed-point algorithm. 49. /. 52.

(127) Summary of Part II ◦ A response time analysis of synchronous data flow programs on the Kalray MPPA2 ◦ Given: ◦ Task profiles: WCET in isolation and number of accesses ◦ Mapping of Tasks ◦ Execution Order ◦ We compute: ◦ Tight response times taking into account the interference ◦ Release dates respecting the dependency constraints. model of the multi-level arbiter. double fixed-point algorithm. ◦ Find in the manuscript: ◦ Execution phases: execution phase + communication phase ◦ Support of: accesses pipelining, blocking and non-blocking accesses, bursts of accesses ◦ More experiments with randomly generated task graphs 49. /. 52.

(128) Future Work of Part II. NoC Rx RM. NoC Tx DSU. P6. P7. P14 P15. P4. P5. P12 P13. P2. P3. P10 P11. P0. P1. P8. P9. 8 shared memory banks. ◦ Analysis with a Real-Time Operating System. 8 shared memory banks. ◦ Model of the Resource Manager. 50. /. 52.

(129) Future Work of Part II. tighter estimation of context switches and other interrupts NoC Rx RM. NoC Tx DSU. P6. P7. P14 P15. P4. P5. P12 P13. P2. P3. P10 P11. P0. P1. P8. P9. 8 shared memory banks. ◦ Analysis with a Real-Time Operating System. 8 shared memory banks. ◦ Model of the Resource Manager. 50. /. 52.

(130) Future Work of Part II. tighter estimation of context switches and other interrupts. ◦ Model of the NoC accesses.. NoC Rx RM. NoC Tx DSU. P6. P7. P14 P15. P4. P5. P12 P13. P2. P3. P10 P11. P0. P1. P8. P9. 8 shared memory banks. ◦ Analysis with a Real-Time Operating System. 8 shared memory banks. ◦ Model of the Resource Manager. 50. /. 52.

(131) Future Work of Part II. tighter estimation of context switches and other interrupts. ◦ Model of the NoC accesses.. use the output of any NoC analysis. ◦ Comparison between estimated and measured response times. NoC Rx RM. NoC Tx DSU. P6. P7. P14 P15. P4. P5. P12 P13. P2. P3. P10 P11. P0. P1. P8. P9. 8 shared memory banks. ◦ Analysis with a Real-Time Operating System. 8 shared memory banks. ◦ Model of the Resource Manager. 50. /. 52.

(132) Conclusion.

(133) Conclusion ◦ Multi/Many-cores in Real-Time Systems ◦ Full isolation: for example TDMA ◦ Bounded interference. 52. /. 52.

(134) Conclusion ◦ Multi/Many-cores in Real-Time Systems ◦ Full isolation: for example TDMA ◦ Bounded interference ◦ Precise modeling of hardware components. 52. /. 52.

(135) Conclusion ◦ Multi/Many-cores in Real-Time Systems ◦ Full isolation: for example TDMA ◦ Bounded interference ◦ Precise modeling of hardware components ◦ Research directions for multi-core analysis: Multi-core scheduling and timing analysis framework (in RTSOPS 2016) Input SCADE/Lustre application. m ex app ec in . or gs a de n r d. Task profiles. Response times and release dates computation. binary with mappings. Code generation 1. Code generation 2: Add the scheduling/mapping in the binary. task mapping + execution order. WCET analysis initial WCETs. WCET analysis and Shared Resource analysis. binary + task graph. Task mapping and scheduling. Timing information. Response times, Release dates. Code generation 3: Add the response time and release dates in the binaries. Executable binary for the Kalray MPPA Bostan Output. 52. /. 52.

(136) Many-Core Timing Analysis of Real-Time Systems and its application to an industrial processor. Hamza Rihani Université Grenoble Alpes / Verimag. Publications: Rihani, Hamza et al. (2015). “WCET analysis in shared resources real-time systems with TDMA buses”. In: Proceedings of the 23rd International Conference on Real-Time Networks and Systems. Rihani, Hamza, Claire Maiza, and Matthieu Moy (2016a). “Efficient Execution of Dependent Tasks on Many-Core Processors”. In: RTSOPS 2016. 7th International Real-Time Scheduling Open Problems Seminar. Toulouse, France. Rihani, Hamza et al. (2016b). “Response Time Analysis of Synchronous Data Flow Programs on a Many-Core Processor”. In: Proceedings of the 24th International Conference on Real-Time Networks and Systems (RTNS). This work is funded by grant CAPACITES (PIA-FSN2 n◦ P3425-146798) from the French Ministère de l’économie, des finances et de l’industrie..

(137) References I Altmeyer, Sebastian et al. (2015). “A Generic and Compositional Framework for Multicore Response Time Analysis”. In: Proceedings of the 23rd International Conference on Real Time and Networks Systems (RTNS), pp. 129–138. Chattopadhyay, Sudipta, Abhik Roychoudhury, and Tulika Mitra (2010). “Modeling Shared Cache and Bus in Multicores for Timing Analysis”. In: Proceedings of the 13th International Workshop on Software and Compilers for Embedded Systems. SCOPES ’10. St. Goar, Germany: ACM, 6:1–6:10. Henry, Julien et al. (2014). “How to Compute Worst-case Execution Time by Optimization Modulo Theory and a Clever Encoding of Program Semantics”. In: Proceedings of the 2014 SIGPLAN/SIGBED Conference on Languages, Compilers and Tools for Embedded Systems (LCTES), pp. 43–52. Kelter, Timon et al. (2014). “Static Analysis of Multi-core TDMA Resource Arbitration Delays”. In: Real-Time Syst. 50.2, pp. 185–229. Rosèn, Jacob et al. (2007). “Bus Access Optimization for Predictable Implementation of Real-Time Applications on Multiprocessor Systems-on-Chip”. In: RTSS 2007..

(138) BACKUP.

(139) Multicore Response Time Analysis Example: Fixed Priority bus arbiter, PE1 > PE0 Bus access delay = 10 T1 PE1. 2 accesses. T1. 2 accesses. T1. 2 accesses. T1. 2 accesses. T0 PE0 t 00. 40. 1 Altmeyer et al., RTNS 2015. 80. 120. 160.

(140) Multicore Response Time Analysis Example: Fixed Priority bus arbiter, PE1 > PE0 Bus access delay = 10 T1 PE1. 2 accesses. T1. 2 accesses. T1. 2 accesses. T1. 2 accesses. T0 PE0. 3 accesses t. 00. 40. 80. 120. ◦ Task of interest running on PE 0:. R0 = 10 + 3 × 10 (response time in isolation). 1 Altmeyer et al., RTNS 2015. 160.

(141) Multicore Response Time Analysis Example: Fixed Priority bus arbiter, PE1 > PE0 Bus access delay = 10 T1 PE1. 2 accesses. T1. 2 accesses. T1. 2 accesses. T1. 2 accesses. T0 PE0. 3 accesses t. 00. 40. 80. 120. ◦ Task of interest running on PE 0:. R0 = 10 + 3 × 10 (response time in isolation) R1 = 10 + 3 × 10 + 2 × 10 = 60. 1 Altmeyer et al., RTNS 2015. 160.

(142) Multicore Response Time Analysis Example: Fixed Priority bus arbiter, PE1 > PE0 Bus access delay = 10 T1 PE1. 2 accesses. T1. 2 accesses. T1. 2 accesses. T1. 2 accesses. T0 PE0. 3 accesses t. 00. 40. 80. 120. ◦ Task of interest running on PE 0:. R0 = 10 + 3 × 10 (response time in isolation) R1 = 10 + 3 × 10 + 2 × 10 = 60. R2 = 10 + 3 × 10 + 2 × 10 + 2 × 10 = 80 1 Altmeyer et al., RTNS 2015. 160.

(143) Multicore Response Time Analysis Example: Fixed Priority bus arbiter, PE1 > PE0 Bus access delay = 10 T1 PE1. 2 accesses. T1. 2 accesses. 2 accesses. Response time. PE0. T1. T1. 2 accesses. T0. 3 accesses t. 00. 40. 80. 120. ◦ Task of interest running on PE 0:. R0 = 10 + 3 × 10 (response time in isolation) R1 = 10 + 3 × 10 + 2 × 10 = 60. R2 = 10 + 3 × 10 + 2 × 10 + 2 × 10 = 80. R3 = 10 + 3 × 10 + 2 × 10 + 2 × 10 + 0 = 80 (fixed-point) 1 Altmeyer et al., RTNS 2015. 160.

(144) Evaluation: Runtime Perfomance Analysis time of randomly generated task graphs in log-log scale 20000. time (second). 2000 200 20 O(n. ). O(n. ). 4. 2. 3. 800. 700. 600. 500. 400. 350. 300. 250. 200. 150. 100. 50. number of nodes n. Theoretical complexity O (n4 ). Experimental complexity O (n3.87 )..

(145) Randomly Generated Task Graphs. Fat Task Graph. 75. 50. 25. 0 0.0. 0.2. 0.4. 0.6. Utilization. 0.8. 1.0. Multi−bank Multi−bank (w/o release) Pessimistic Single−bank Single−bank (w/o release). 100. Schedulable Tasks. Schedulable Tasks. Long Task Graph. Multi−bank Multi−bank (w/o release) Pessimistic Single−bank Single−bank (w/o release). 100. 75. 50. 25. 0 0.0. 0.2. 0.4. 0.6. Utilization. 0.8. 1.0.

(146) Cached Memory Operations D-Cache. I-Cache. Write Buffer 8KB 8 entries 2-way 64 bits 32B lines mem. operation. 32KB 8-way. Hit. store. load Miss. write in D-Cache. write in WB. data not in WB. data in WB. check size of WB S S =8. Static analysis tool Our response time analysis tool. 1/evict 2/store. check WB data in WB. (SMEM). 1/ flush WB 2/ load data. 8 SMEM read accesses (cache line refill). x SMEM write accesses (x ∈ [1, 8]) + 8 SMEM read accesses. load data. S <7 store. 1/store 2/evict. 1 SMEM write access (blocking). Miss. load data not in WB. update. S =7. Hit. (WB). 1 SMEM write access (non-blocking). 1 access = 10 cycles. 0 SMEM access. In isolation: / 1 cache line refill = 17 cycles.

(147) Offset-based SMT Encoding start = 0 off_s ∈ [0, π[. c1,2 = ite t1,2 wcet(B1 ) 0. B1 y = read_value() if (y < 0). Encode the costs of the basic blocks B2. /* 10 cycles*/ c2,3 = ite t2,3 10 0 off2,3 = get_offset(off1,2 ,10). B3 if (y ≥ 0). /* bus access */ c4,5 = ite t4,5 tdma_cost(off3,4 ) 0 off4,5 = tdma_offset(off3,4 ). B5. execution time =. X. ei ,j (absolute time) −−−−→ ci ,j (cost) ci ,j = ite ti ,j cost 0 “ite C A B” ⇔ “if C then A else B”. B4. c3,5 off3,5. ◦ offi ,j = ei ,j mod π. ci , j.

(148) Offset-based SMT Encoding B1: y =read_value(); if (y < 0) True False. B2: /*10 cycles*/. B3: if (y ≥ 0). off2,3 = (off1,2 + 10) mod π. True False. B4: /*bus access*/ B5: return.

(149) Offset-based SMT Encoding B1: y =read_value(); if (y < 0). <π. (offi ,j +c) mod π z }| {. True False. B2: /*10 cycles*/. B3: if (y ≥ 0). off2,3 = (off1,2 + 10) mod π. B4: /*bus access*/ B5: return. def. = α < 2π. (offi ,j + c mod π) mod π z. }|. {. ↓. if α < π then α else α − π True. False. m.

(150) Offset-based SMT Encoding offs ∈ [0, π[ start=0 B1: y =read_value(); if (y < 0). e1,2 = start + wcet(B1 ) off1,2 = (offs + wcet(B1 )) mod π True. False. B2: /*10 cycles*/. B3: if (y ≥ 0). e2,3 = e1,2 + 10 off2,3 = get_offset(off1,2 , 10). True False. B4: /*bus access*/ e4,5 = e3,4 + tdma_cost(off3,4 ) off4,5 = tdma_offset(off3,4 ). B5: return execution time = if t3,5 then e3,5 else e4,5.

(151) Using TDMA functions. BB 1 y = read_value() if (y < 0) BB 2. ◦ if..then..else encoding. off = ite t13 off13 off23 off35 = off off34 = off. /* 10 cycles*/ BB 3 if (y ≥ 0). ◦ sum encoding. off = off13 + off23 off35 = ite t35 off 0 off34 = ite t34 off 0. BB 4 /* bus access */ BB 5 write_value(y) return().

(152) Using TDMA functions K. block. if..then..else (ite) encoding: offi ,j = ( if t1,i then off1,i else if t2,i then off2,i else ... else if tK ,i then offK ,i else 0). sum encoding: offi =. kX =K k =1. offk ,i. offi ,j = if ti ,j then offi else 0.

(153) Performance 3. 100 ●. ●. sum ite ●. block. ×N. time(s) (log scale). 10000 ●. ●. 1000 ● ● ● ● ●. 100. ●. 1. 2. 3. 4. 5. N. 6. 7. 8. 9. 10.

(154) How it works? ◦ Example with binary search: Testing wcet >= 0... SAT (value found = 18). New interval = [18, 73]. Testing wcet >= 46... UNSAT. New interval = [18, 45]. Testing wcet >= 32... UNSAT. New interval = [18, 31]. Testing wcet >= 25... UNSAT. New interval = [18, 24]. Testing wcet >= 21... UNSAT. New interval = [18, 20]. Testing wcet >= 19... UNSAT. New interval = [18, 18]. The maximum value of wcet is 18 . Computation time is 0.010000s.

(155) Evaluation: Analysis Time. Analysis time. Name bs insertsort jfdctint fdct compressdata fly-by-wire. π = 40, σ = 20, acc = 10. 0.45s 1.37s 44.10s 41.36s 4.66s 28.78s. π = 400, σ = 200, acc = 40. 0.80s 7.19s 55.47s 34.42s 3.44s 109.37s.

(156)