Modulo Scheduling Loops onto Coarse-Grained Reconfigurable Architectures
by Rani Gnanaolivu
© Rani Gnanaolivu
A thesis submitted to the School of Graduate Studies
in partial fulfilment of the requirements for the degree of
Doctor of Philosophy
Faculty of Engineering and Applied Science
Memorial University of Newfoundland
January 2013
St. John's, Newfoundland
Abstract
Reconfigurable systems have drawn increasing attention from both academic researchers and creators of commercial applications in the past few years because they can combine flexibility with efficiency. There are two main types of reconfigurable architectures: fine-grained and coarse-grained. The functionality of fine-grained architecture hardware is specified at the bit level, while the functionality of coarse-grained architecture hardware is specified at the word level. Coarse-grained reconfigurable architectures (CGRAs) have gained currency in recent years due to their abundant parallelism, high computational intensity and flexibility. A CGRA is normally comprised of an array of basic computational and storage resources, which are capable of processing a large volume of applications simultaneously. To exploit the inherent parallelism in applications and thus enhance performance, CGRAs have been structured to accelerate computation-intensive parts, such as loops, that require large amounts of execution time. The loop body is essentially drawn onto the CGRA mesh, subject to modulo resource usage constraints. Much research has been done to exploit the potential parallelism of CGRAs to increase the performance of time-consuming loops. However, sparse connectivity and distributed register files present difficult challenges to the scheduling phase of the CGRA compilation framework. While traditional schedulers do not take routability into consideration, software pipelining can improve the scheduling of instructions in loops by overlapping instructions from different iterations. Modulo scheduling is an approach for constructing software pipelines that focuses on minimizing the time between the initiations of iterations, the so-called initiation interval (II). If a new iteration is started every II cycles, the time to complete n iterations will approach II × n for large n, so minimizing II maximizes performance.
The problems of scheduling (deciding when an operation should happen), placing (deciding where an operation should happen), and routing (deciding how information travels through space and time between operations) can be unified if they are modelled as a graph embedding problem. The data flow graph of the loop is embedded in a routing resource graph representing the hardware across a number of cycles equal to the initiation interval.
Particle swarm optimization (PSO) has been shown to be successful in many continuous optimization problems. In this thesis, we propose algorithms to solve the scheduling, placing, and routing of loop operations simultaneously using PSO. We call this approach modulo-constrained hybrid particle swarm optimization (MCHPSO). There are many constraints and one optimization objective, the II, that need to be considered during the mapping and scheduling procedure. The scheduling algorithm tries to minimize the initiation interval at which the next iteration of the loop can start, under the resource and modulo constraints of the architecture being used.
When conditional branches, such as if-then-else statements, are present in the loop, they create multiple execution paths. By exploiting conditional branches through our predicated exclusivity technique, the MCHPSO algorithm reuses resources that lie on mutually exclusive execution paths, which may allow the loop to be scheduled with a lower II. Finally, a priority scheme algorithm along with recurrence-aware modulo scheduling is proposed to map inter-iteration dependencies onto CGRAs; it is able to reserve resources for all recurrence cycles and then map the remaining operations.
Acknowledgements
First and foremost I would like to thank God for the wisdom and perseverance that he has blessed me with during this PhD program, and indeed, throughout my life: "He who began a good work in you will carry it on to completion until the day of Christ Jesus." (Philippians 1:6)
It is my pleasure to thank the many people who made this thesis possible. I express my sincere thanks to my supervisors, Dr. T. S. Norvell and Dr. R. Venkatesan, for their intellectual assistance, financial support, and continuous encouragement during my research. Their enthusiasm, inspiration and sound advice were motivational and helped me through even the roughest patches of my graduate program. I thank Dr. P. Gillard for taking the time to read my work and offer invaluable comments and suggestions. I thank NSERC for supporting my research at Memorial. I thank Shuang Wu for teaching me his work on generating data flow graphs from a HARPO/L program. I thank him for allowing me to use his work for my test cases in the PhD program.
Last but not least, I thank my family for their boundless love, encouragement, and unconditional support, both financially and emotionally, throughout my PhD program. Especially, I thank Mohan Gnanaolivu, my father-in-law, for his valuable editorial corrections. I also thank my loving, supportive husband Praveen Gnanaolivu, whose faithful support during the final stages of this PhD is so appreciated. I would also like to thank my loving son Kevin for his sincere everyday prayers for the completion of my research. I thank all my friends for constantly encouraging me and reminding me of my aspirations.
Contents

Abstract  ii

Acknowledgements  iv

List of Tables  xi

List of Figures  xiii

List of Algorithms  xvi

List of Abbreviations  xvii

0 Introduction  0
  0.0 Reconfigurable Computing  0
  0.1 Coarse-Grained Reconfigurable Architecture  2
  0.2 Compiling Loops onto CGRAs with Modulo Scheduling  5
  0.3 Motivations and Objectives  6
  0.4 Thesis Contributions  8
  0.5 Thesis Overview  10

1 Compilation in Coarse-Grained Reconfigurable Architectures  12
  1.0 Introduction  12
  1.1 Coarse-Grained Reconfigurable Architecture  13
    1.1.0 Introduction  13
    1.1.1 Overview of some CGRAs  13
      1.1.1.0 MorphoSys  13
      1.1.1.1 KressArray  14
      1.1.1.2 Montium  14
      1.1.1.3 DReAM  15
      1.1.1.4 CHESS  15
      1.1.1.5 RaPiD  16
      1.1.1.6 PipeRench  16
      1.1.1.7 ADRES  17
    1.1.2 Comparison and Selection of the Target CGRA  18
  1.2 Scheduling  20
    1.2.0 Introduction  20
    1.2.1 Software Pipelining  21
    1.2.2 Modulo Scheduling  22
    1.2.3 Graph Embedding  24
    1.2.4 Modulo Reservation Table  25
    1.2.5 Routing Resource Graph  25
  1.3 Evolutionary Algorithms  26
    1.3.0 Overview  26
      1.3.0.0 Simulated Annealing  26
      1.3.0.1 Genetic Algorithm  27
      1.3.0.2 Ant Colony Optimization  28
      1.3.0.3 Particle Swarm Optimization Algorithm  29
    1.3.1 Selection of PSO Algorithm  32
  1.4 Various CGRA Compilation Procedures  33
    1.4.0 DRESC Compiler  36
      1.4.0.0 Advantages and Limitations  37
    1.4.1 Compilation using Modulo Graph Embedding  37
      1.4.1.0 Advantages and Limitations  38
    1.4.2 Compilation using Clustering  38
      1.4.2.0 Advantages and Limitations  39
    1.4.3 Compilation Using Modulo Scheduling with Backtracking Capability  39
      1.4.3.0 Advantages and Limitations  40
  1.5 Conclusion  40

2 Modulo Constrained Hybrid Particle Swarm Optimization Scheduling Algorithm  42
  2.0 Introduction  42
  2.1 Modulo Scheduling in CGRAs  43
    2.1.0 Problem Identification  43
    2.1.1 Solution Structure Formalization  44
      2.1.1.0 Data Flow Graph  47
      2.1.1.1 Target Architecture  48
      2.1.1.2 Minimal Initiation Interval  54
      2.1.1.3 Modulo Reservation Table  55
      2.1.1.4 Resource Routing Graph  56
  2.2 Proposed Modulo Scheduling Algorithm  61
    2.2.0 Modulo Scheduling with MCHPSO  61
    2.2.1 Particle Encoding for the Problem  62
    2.2.2 MCHPSO  63
      2.2.2.0 Need for the mutation operator  69
    2.2.3 Fitness Calculation  70
    2.2.4 Configuration File and Final Schedule  72
  2.3 Final schedule of the MCHPSO Algorithm  72
  2.4 Conclusion  73

3 Performance Analysis of MCHPSO Algorithm  74
  3.0 Introduction  74
  3.1 Analysis of Scheduling  75
  3.2 Modulo Scheduling with MCHPSO  76
    3.2.0 Experimental Set Up  76
      3.2.0.0 DFG Generation  76
      3.2.0.1 TA Graph Generation  78
    3.2.1 Scheduling Results  79
    3.2.2 Mapping of Nodes and Routing of Edges  85
    3.2.3 Analysis of Functional Units Usage for Different Topologies  90
    3.2.4 Analysis of Register Files Usage with Different Interconnections  92
    3.2.5 Effect of Varying Particle Size in MCHPSO algorithm  93
    3.2.6 Analyzing the Speedup of MCHPSO Algorithm  94
    3.2.7 Functional Units Capable of Routing and Performing Computations  95
  3.3 Comparison of MCHPSO with Other Modulo Scheduling Algorithms  98
  3.4 Conclusion  101

4 Exploiting conditional structures onto CGRAs  102
  4.0 Introduction  102
  4.1 Background on HARPO/L  103
  4.2 DFG characteristics  103
  4.3 Handling conditional statements  105
  4.4 Predicated execution with exclusivity  107
    4.4.0 Motivational example for exclusivity  107
    4.4.1 Mapping with MCHPSO predicated no exclusivity algorithm  111
      4.4.1.0 Method description  111
    4.4.2 Mapping with MCHPSO predicated exclusivity algorithm  116
      4.4.2.0 Method description  116
  4.5 Results  119
    4.5.0 Experimental Set Up  120
    4.5.1 DFG characteristics  121
    4.5.2 TA characteristics  121
    4.5.3 Predicated Execution  123
      4.5.3.0 With Exclusivity  123
      4.5.3.1 No Exclusivity  126
  4.6 Comparison  127
    4.6.0 II achieved  127
    4.6.1 Usage of resources in Exclusivity vs No exclusivity in 4 × 4 CGRA  129
    4.6.2 Overuse of resources in Exclusivity vs No exclusivity in 4 × 3 CGRA  130
  4.7 Conclusion  131

5 Recurrence exploitation in CGRAs  133
  5.0 Introduction  133
  5.1 Recurrence Handling  134
    5.1.0 Motivational Example  135
    5.1.1 Existing Recurrence Handling Approaches  137
      5.1.1.0 Rotation Scheduling  138
      5.1.1.1 Bidirectional Slack Scheduling  138
      5.1.1.2 Edge-centric Modulo Scheduling  139
      5.1.1.3 Recurrence Aware Modulo Scheduling  140
      5.1.1.4 Comparison of Existing Approaches  140
  5.2 Proposed Method  143
    5.2.0 Recurrence Aware Modulo Scheduling with Priority Scheme  143
    5.2.1 Architecture Extensions to Speedup Recurrence Handling  147
  5.3 Discussion of Results  148
    5.3.0 Experiment Set Up  148
    5.3.1 DFG with Recurrences  149
    5.3.2 TA Characteristics  150
    5.3.3 4 × 4 CGRA recurrence schedule results  150
    5.3.4 4 × 3 CGRA recurrence schedule results  153
  5.4 Conclusion  154

6 Conclusions and Future Work  155
  6.0 Contributions  155
  6.1 Suggested Future Work  157
  6.2 Concluding Remarks  159

Bibliography  162

A HARPO/L code for in-house if-then-else benchmarks  178
  A.0 if-then-else benchmark, one condition  178
  A.1 if-then-else benchmark, two conditions  179
  A.2 HARPO/L code if-then-else benchmark, three conditions  181
List of Tables

2.0 MRT showing all the resources occupied in II time  56
2.1 Final schedule result of the DFG onto the TA  72
3.0 DFG characteristics of the benchmarks  77
3.1 8 × 8 CGRA configuration  81
3.2 Scheduled and placed results of the lattice synthesis loop kernel  82
3.3 Routing results of lattice synthesis loop kernel (part 1)  83
3.4 Routing results of lattice synthesis loop kernel (part 2)  84
3.5 Overall mapping results of the DSP benchmarks in 8 × 8 CGRA  89
3.6 Overall mapping results of the DSP benchmarks in 4 × 4 CGRA  90
3.7 Usage of Functional Units with various topologies  92
3.8 Variation of particle size on an 8 × 8 CGRA  94
3.9 MCHPSO algorithm speedup comparison on an Intel i7 processor  96
3.10 Comparison of FU utilization with placement and routing  97
3.11 Comparison of MCHPSO results with Mei et al.'s work  99
3.12 Comparing MCHPSO with Dimitroulakos et al.'s work  100
4.0 DFG characteristics of the benchmarks  121
4.1 Resources available in the Target Architecture  123
4.2 Exclusivity results in 4 × 4 CGRA  125
4.3 Exclusivity results in 4 × 3 CGRA  126
4.4 4 × 4 CGRA results without exclusivity  127
4.5 4 × 3 CGRA results without exclusivity  128
4.6 II achieved in 4 × 3 CGRA and 4 × 4 CGRA  128
4.7 Total usage of 4 × 4 CGRA  129
4.8 Total usage and overuse of 4 × 3 CGRA  130
5.0 Recurrence Benchmark Characteristics  150
5.1 Recurrence schedule results in 4 × 4 CGRA  152
5.2 Recurrence schedule results in 4 × 3 CGRA  153
List of Figures

0.0 Advantages of Reconfigurable Computing  1
0.1 A Generic Coarse-Grain Reconfigurable System, taken from [Vassiliadis and Soudris, 2007a]  3
1.0 ADRES Architecture, taken from [Mei et al., 2005c]  18
1.1 a) Modulo Scheduling Example b) DFG and Configuration for 2 × 2 matrix, modified from [Mei et al., 2003b]  24
1.2 DRESC Compiler Framework, taken from [Berekovic et al., 2006]  34
1.3 Pseudocode of the modulo scheduling algorithm in DRESC, taken from [Mei et al., 2002]  35
2.0 Outline of overall mapping of loop kernel of DFG onto RRG of CGRA  46
2.1 A loop body converted into a DFG  49
2.2 4 × 4 Target Architecture Instance of ADRES  50
2.3 FU Topology (a) Mesh Topology (b) Meshplus1 Topology (c) Meshplus2 Topology  52
2.4 FU and RF Topology (a) Private RF (b) Private RF and Column Adjacent Topology (c) Private RF and Diagonal Adjacent Topology  53
2.5 Various Usage of Buses (a) Row Bus Connections (b) Row and Column Bus Connections  54
2.6 X edges in the RRG  58
2.7 Y edges in the RRG. Edges from the same type of source are shown in the same style.  59
2.8 Z edges in the RRG  60
2.9 DFG showing a simple loop structure without recurrence  63
2.10 TA taken for the mapping of DFG  64
2.11 Overall mapping of loop kernel of DFG onto RRG of CGRA  65
2.12 Compilation flow of the proposed algorithm  66
2.13 Particle encoding for scheduling  68
3.0 Lattice synthesis filter code  78
3.1 DFG description file for the lattice synthesis filter in Figure 3.0  79
3.2 DFG corresponding to the code in Figure 3.0  80
3.3 All particles' currentFitness versus iteration  85
3.4 Global best fitness for every iteration  86
3.5 BestFitness of all particles versus iteration  87
3.6 Percentage of register utilization in different topologies  93
4.0 DFG node types, taken from [Wu, 2011]  105
4.1 ALU modification for conditional branch a) original ALU b) modified ALU, taken from [Lee et al., 2010]  108
4.2 Example of HARPO/L DFG with if-then-else  109
4.3 MRT Comparison of Exclusivity and No-Exclusivity Algorithms  110
4.4 Predicates of the exclusive nodes in Figure 4.3  110
4.5 Predicated MCHPSO no exclusivity algorithm  111
4.6 SPLIT and MERGE edges  114
4.7 Predicated MCHPSO with exclusivity algorithm  117
4.8 The first three benchmarks' loop structure  122
5.0 Motivating example a) 2 × 2 target architecture template instance, b) RRG, c) DFG and d) Final schedule, place and route  136
5.1 Flowchart of RAMS algorithm, taken from [Oh et al., 2009]  141
5.2 Successful final schedule for the DFG shown in Figure 5.0  147
5.3 CGRA architecture with dedicated RFs for live values, taken from [Oh et al., 2009]  148
5.4 Comparison of 4 × 4 and 4 × 3 architecture configurations  151
List of Algorithms

1.0 The Standard PSO Algorithm  31
2.0 Mapping DFG onto RRG  67
2.1 The MCHPSO algorithm  69
2.2 Routing cost fitness value for MCHPSO  71
4.0 Adding Symbolic values to DFG cells  113
4.1 Adding Predicates to DFG cells  115
4.2 Creating exclusivity set  118
4.3 Exclusivity check of TA resource  119
4.4 Maximum Independent Set of DFG cells  119
5.0 Mapping DFG with recurrences onto CGRAs  143
5.1 Finding recurrence cycles with Kosaraju's strongly connected components algorithm  144
List of Abbreviations

CGRA     Coarse Grained Reconfigurable Architecture
FPGA     Field Programmable Gate Array
II       Initiation Interval
MII      Minimal Initiation Interval
DFG      Data Flow Graph
TA       Target Architecture
PSO      Particle Swarm Optimization
MCHPSO   Modulo Constrained Hybrid Particle Swarm Optimization
RRG      Routing Resource Graph
HARPO/L  HARdware Parallel Objects Language
MRT      Modulo Reservation Table
ASAP     As Soon As Possible
ALAP     As Late As Possible
DRESC    Dynamically Reconfigurable Embedded Systems Compiler
MRRG     Modulo Routing Resource Graph
FU       Functional Unit
RF       Register File
CB       Column Bus
RB       Row Bus
SRF      Shared Register File
IPC      Instruction Per Cycle
MU       Memory Unit
VLIW     Very Long Instruction Word
DSP      Digital Signal Processing
ASIC     Application Specific Integrated Circuit
SA       Simulated Annealing
ACO      Ant Colony Optimization
GA       Genetic Algorithm
ILP      Instruction Level Parallelism
TLP      Task Level Parallelism
Chapter 0
Introduction
0.0 Reconfigurable Computing
Reconfigurable systems [Abielmona, 2009] have drawn increasing attention from both academic and commercial researchers in the past few years because they combine flexibility with efficiency and upgradability [Todman et al., 2005]. The flexibility in reconfigurable devices mainly comes from their routing interconnect. Reconfigurable computing fills the gap between application-specific integrated circuits (ASICs) and general purpose processors (GPPs), as depicted in Figure 0.0. When compared with GPPs, reconfigurable computing has the ability to make substantial changes in the data path, in addition to the control flow. When compared with ASICs, it has the ability to adapt the hardware at runtime by "loading" a new configuration into memory. To avoid the bandwidth limitation between processor and memory, called the Von Neumann bottleneck, a portion of the application is mapped directly onto the hardware to increase the data parallelism in reconfigurable computing.
[Figure: performance versus flexibility. ASICs offer the highest performance, microprocessors the highest flexibility, and reconfigurable computing (FPGAs, CGRAs) lies between the two.]

Figure 0.0: Advantages of Reconfigurable Computing
The principal benefits of reconfigurable computing compared with ASICs and GPPs are the ability to design larger hardware with fewer gates and to realize the flexibility of a software-based solution while retaining the execution speed of a more traditional, hardware-based approach [Barr, 1998]. Due to the dynamic nature of reconfigurable computing, it is advantageous to have the software manage the process of deciding which hardware objects to execute.
Reconfigurable architectures are broadly classified into fine-grained and coarse-grained. The first devices used for fine-grained reconfigurable computing were field-programmable gate arrays (FPGAs). An FPGA consists of a matrix of programmable logic cells, executing bit-level operations, with a grid of interconnect lines running among them. FPGAs allow realizing systems from a low granularity level, that is, logic gates and flip-flops. This makes FPGAs very popular for the implementation of complex bit-level operations. However, FPGAs are inefficient for coarse-grained data path operations due to their high reconfiguration cost in performance and power [Hartenstein, 2001]. The coarser granularity greatly reduces the delay, power and configuration time relative to an FPGA device, at the expense of reduced flexibility [Dimitroulakos et al., 2007]. Moreover, coarse-grained reconfigurability has the advantage of much higher computational density than FPGAs.
0.1 Coarse-Grained Reconfigurable Architecture
Coarse-grained reconfigurable architectures (CGRAs) have emerged as a potential candidate for embedded systems in recent years. CGRAs have a data path of word width, whereas fine-grained architectures are much less efficient, with huge routing area overhead and poor routability. A major benefit of CGRAs over FPGAs is a massive reduction in configuration memory and configuration time, and a reduction in the complexity of the placement and routing (P&R) problem [Hartenstein, 2001]. These architectures combine the high performance of ASICs with the flexibility of microprocessors to accelerate computation-intensive parts of applications in embedded systems [Dimitroulakos et al., 2007]. However, there are still many outstanding issues, such as the lack of a good design methodology, in exploiting high performance and efficiency on CGRAs [Vassiliadis and Soudris, 2007a].
CGRAs consist of programmable, hardwired, coarse-grained processing elements (PEs), which support a predefined set of word-level operations, while the interconnection network is based on the needs of a specific architecture domain. A generic architecture of a coarse-grain reconfigurable system, shown in Figure 0.1, encompasses a set of coarse-grain reconfigurable units (CGRUs), a programmable interconnection network, a configuration memory, and a controller. The coarse-grained reconfigurable array executes the computationally intensive parts of the application while the main processor is responsible for the remaining parts of the application.
[Figure: block diagram of a generic coarse-grain reconfigurable system, including execution control.]

Figure 0.1: A Generic Coarse-Grain Reconfigurable System taken from [Vassiliadis and Soudris, 2007a].
The domain-specific, hardwired CGRU executes a logical or arithmetic operation required by the considered application domain. The CGRUs and interconnections are programmed by proper configuration (control) bits that are stored in configuration memory. The configuration memory may store one or multiple configuration contexts, but at any given time only one context is active. The controller is responsible for controlling the loading of configuration contexts from the main memory to the configuration memory, for monitoring the execution process of the reconfigurable hardware, and for activating the reconfiguration contexts. The interconnection network can be realized by a crossbar or a mesh structure.
CGRAs can provide massive amounts of parallelism and high computational capability. Typically, the application domains of CGRAs are Digital Signal Processing (DSP) and multimedia. These kinds of applications usually spend most of their execution time in loop structures. These computationally intensive parts have high levels of operation and data parallelism. The design of such systems requires a good correspondence between the coarse-grained reconfigurable architecture and the loop's characteristics. Kernels (loops) of an application are mapped onto the array in a highly parallel way. Generally, scheduling a kernel calls for richer interconnections. However, richer interconnections come with costs such as wider multiplexors, more wires, and more configuration bits, which translate to larger silicon area and higher power consumption. Moreover, even with the same amount of interconnection resources, we can expect variation among topologies, so choosing a good topology is an essential step in architecture exploration. Typically, the applications that belong to the application domain of CGRAs are characterized by a high data transfer rate between the processor and the memory [Dimitroulakos et al., 2007].
0.2 Compiling Loops onto CGRAs with Modulo Scheduling
There are abundant computational resources available for parallelism in CGRAs. The target applications of CGRAs are typically telecommunications and multimedia electronics, which often spend most of their time in critical segments, typically loops [Mei et al., 2003b]. The massive amount of parallelism found in CGRAs can be used to speed up the time-critical loops of an application. Moreover, the loops often exhibit a high degree of parallelism and require a great deal of computational resources. In order to map the critical loops, we have to consider both the data dependencies within an iteration of a loop and the inter-iteration dependencies. When compiling a loop onto a CGRA, each operation within the loop requires a resource on which to execute and a time at which to execute; the result of each executed operation then has to be routed to the dependent operations in the loop.

Since each loop iteration repeats the same pattern of operations, compiling loops onto CGRAs can be achieved by modulo scheduling [Hatanaka and Bagherzadeh, 2007]. Modulo scheduling is a software pipelining technique [Llosa et al., 2001] that overlaps several iterations of a loop by generating a schedule for one iteration of the loop and using the same schedule for subsequent iterations. Iterations are started at a constant interval called the Initiation Interval (II). The time taken to complete a loop of n iterations is roughly proportional to II. The main goal of modulo scheduling is to find a schedule with as low an II as possible.
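As a concrete illustration of the role of the II, the arithmetic can be sketched in a few lines of Python; the iteration counts and cycle figures below are made up for the example, not taken from the thesis:

```python
# Completion time of a loop: sequential vs. modulo-scheduled (software pipelined).

def sequential_cycles(n_iters, schedule_len):
    # Without overlap, iterations run back to back.
    return n_iters * schedule_len

def pipelined_cycles(n_iters, schedule_len, ii):
    # With modulo scheduling, iteration k starts at cycle k * II, so the
    # last iteration finishes at (n - 1) * II + schedule_len.
    return (n_iters - 1) * ii + schedule_len

# Hypothetical kernel: 100 iterations, a 10-cycle schedule, II = 2.
print(sequential_cycles(100, 10))    # 1000
print(pipelined_cycles(100, 10, 2))  # 208
```

For large n the pipelined time is dominated by the (n - 1) × II term, which is why minimizing II is the central objective.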
Scheduling, placing and routing loops onto CGRAs faces several architectural constraints and challenges. Modulo scheduling adds a time dimension to the combination of placement and routing, which then becomes very similar to placement and routing for FPGAs [Hatanaka and Bagherzadeh, 2007].
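The time dimension can be pictured by unrolling the hardware graph over II cycles: a resource used at time t occupies its copy at slot t mod II, and edges that cross the end of the schedule wrap around. The sketch below builds such a time-space graph under simplifying assumptions (every connection has a one-cycle latency; the names build_rrg, fu0 and fu1 are illustrative, not the thesis's implementation):

```python
# Sketch of a time-unrolled routing resource graph for modulo scheduling:
# II copies of every hardware resource, with edges advancing one cycle
# and wrapping around modulo II.

def build_rrg(resources, hw_edges, ii):
    """resources: list of resource names; hw_edges: (src, dst) pairs
    meaning dst can consume src's output one cycle later."""
    nodes = [(r, t) for r in resources for t in range(ii)]
    edges = [((src, t), (dst, (t + 1) % ii))  # advance one cycle, mod II
             for (src, dst) in hw_edges for t in range(ii)]
    return nodes, edges

# Tiny example: two functional units connected in a ring, II = 2.
nodes, edges = build_rrg(["fu0", "fu1"], [("fu0", "fu1"), ("fu1", "fu0")], 2)
print(len(nodes))                         # 4 time-space nodes
print((("fu0", 1), ("fu1", 0)) in edges)  # True: this edge wraps modulo II
```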
0.3 Motivations and Objectives
In order to solve the scheduling, placing and routing problem on CGRAs with modulo scheduling, several issues have to be considered in the mapping. A scheduling algorithm should be capable of efficiently exploiting regular data parallelism in CGRAs with a low initiation interval. The following issues motivated us to develop a modulo scheduling algorithm for CGRAs.

• An algorithm capable of achieving a lower initiation interval to start the successive iterations.

• An algorithm capable of routing intermediate data between the executed operations of the loop.

• An algorithm that is fast and efficient, with optimal usage of resources in the final schedule.

• An algorithm capable of mapping the different execution paths of a loop caused by conditional branches.

• An algorithm able to perform a parallel search of solutions with placement, scheduling and routing.

• An algorithm able to consider the hardware constraints and conserve resources.

• A scheduling algorithm compatible with the front-end application.

• An algorithm capable of mapping critical nodes and edges.

• A scheduling algorithm applicable to different CGRAs and different topologies.

• An algorithm capable of analyzing the best topology of the CGRA.
Unfortunately, the available parallelism in CGRAs has been exploited by only a few automated design and compilation tools [Mei et al., 2003b]. The modulo scheduling algorithm used in [Hatanaka and Bagherzadeh, 2007] and [Vassiliadis and Soudris, 2007b] was not able to find optimal usage of resources and took a long time to find a valid schedule. Several heuristic techniques have been tried by researchers to solve the modulo scheduling problem, but these techniques were not fast and efficient [Llosa et al., 1996]. For example, the existing scheduling algorithms find the placement and routing solution with a sequential search for each Data Flow Graph (DFG) operation and do not handle conditional code. Particle swarm optimization (PSO), applied to instruction scheduling in [Abdel-Kader, 2008], provides near-optimal solutions, with fast convergence and low execution time, for various combinatorial and multidimensional optimization problems. A simple PSO can get stuck in a locally optimal solution, but can be made efficient in combination with mutation operators [Grundy and Stacey, 2008]. To the best of our knowledge, PSO had not previously been used in modulo scheduling for coarse-grained architectures. As a result, a fast and efficient modulo scheduling algorithm for CGRAs with parallel search is developed in this thesis.
The objectives of this thesis are:

• To develop a fast and efficient scheduling, placing and routing algorithm, called modulo constrained hybrid particle swarm optimization (MCHPSO), to exploit loop-level parallelism in different target applications.

• To analyze the performance of MCHPSO in various CGRA topologies and configurations.

• To apply MCHPSO to various benchmarks in telecommunications and multimedia applications and to compare the II achieved with other scheduling algorithms.

• To develop an algorithm to analyze the DFG with conditional code generated from a HARdware Parallel Objects Language (HARPO/L) program and to schedule the conditional code with MCHPSO with efficient use of resources.

• To develop an algorithm to handle loop-carried dependences, or recurrences, in a DFG, where an operation depends on itself or on another operation from previous iterations.
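As an illustration of the last objective, the hypothetical kernel below carries a value from one iteration to the next: a[i] depends on a[i-1], so consecutive iterations cannot be fully overlapped, which bounds the achievable II from below:

```python
# Illustrative example (not from the thesis) of a loop-carried dependence:
# each iteration reads the value produced by the previous one, so the
# recurrence limits how much consecutive iterations can overlap.

def prefix_sums(b):
    a = [0] * len(b)
    a[0] = b[0]
    for i in range(1, len(b)):
        a[i] = a[i - 1] + b[i]   # a[i] depends on a[i-1]: a recurrence
    return a

print(prefix_sums([1, 2, 3, 4]))  # [1, 3, 6, 10]
```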
0.4 Thesis Contributions

The following are the contributions of this thesis.

• Designed the solution structure for the particles in PSO to map a DFG onto a time-space graph called the routing resource graph (RRG), where each particle represents a scheduling solution to the mapping process.

• Designed and implemented the MCHPSO algorithm to place, schedule and route a DFG onto a CGRA. The algorithm succeeded in scheduling with a low initiation interval and with minimal usage of resources, while violating no data dependency and satisfying the modulo constraints on the CGRA resources.

• Compared the performance of MCHPSO with other scheduling algorithms and analyzed MCHPSO on various topologies and CGRA configurations; the MCHPSO algorithm achieved faster execution time and better schedule results than the other algorithms. Analyzed the speedup of MCHPSO on an Intel i7 quad-core processor: MCHPSO parallelizes well across many logical processors and produces results faster.

• Designed and implemented a predicated exclusivity MCHPSO algorithm to map conditional code in a DFG. The exclusivity algorithm was able to minimize the number of resources used in the scheduling process by reusing the same resource for exclusive conditional code in the DFG mapped onto the CGRA.

• Designed a preprocessing algorithm to extract information from the DFG generated by the HARPO/L compiler. The algorithm adds predicates and symbolic information to the DFG cells (nodes and edges). Designed a method to create an exclusivity matrix of all DFG cells.

• Designed a method to find empty slots in the MRT (modulo reservation table) using a Maximum Independent Set algorithm.

• Analyzed the performance of the predicated exclusive MCHPSO algorithm with various CGRA configurations. Compared the performance of the predicated exclusive MCHPSO algorithm with the non-exclusive predicated MCHPSO algorithm on various benchmarks.

• Implemented and evaluated a method to handle loop-carried dependences in DFGs to be mapped onto CGRAs.
0.5 Thesis Overview
This t hesis is organized as follows. Chapter 1 provides a d etailed review of modulo scheduling in CGRAs. First, an overview of CGRA h as been ou tlined and it is followed by selecting a suitable CGRA for the selected prob lem. Secondly, an overview of modulo scheduling has been discussed. Thirdly, t he cha pter discusses evolut ionary a lgorithms and the use of particle swarm optimization in modulo scheduling.
Chapter 2 discusses the proposed algorithm, called Modulo-Constrained Hybrid Particle Swarm Optimization (MCHPSO). An overview of the compilation framework is given, and the chapter also provides a review of related work. The encoding of particles and the fitness calculation in MCHPSO are presented in this chapter.
Chapter 3 presents the simulation results for MCHPSO. The performance analysis of MCHPSO is discussed based on the interconnections, resource availability and particle size. MCHPSO speedup is analyzed on the Intel i7 quad-core processor.
Chapter 4 discusses the exploitation of conditional structures in CGRAs. This chapter presents the predicated exclusivity algorithm. The input DFG is taken from the HARPO/L (HARdware Parallel Objects Language) compiler, and the simulation results of the predicated exclusivity algorithm are discussed.
Chapter 5 presents recurrence handling in loops. This chapter reviews various methodologies to map recurrence relations onto CGRAs. It also presents the recurrence-aware prioritized MCHPSO algorithm and its simulation results.
Chapter 6 concludes the thesis and presents the scope for future work.
Chapter 1
Compilation in Coarse-Grained Reconfigurable Architectures
1.0 Introduction
Coarse-grained reconfigurable architectures (CGRAs) have the potential to exploit both the efficiency of hardware and the flexibility of software to map large applications.
A good compiler should employ the CGRA's resources to exploit a high degree of operation- and loop-level parallelism in the application's loops [Tuhin, 2007]. The compiler must carefully schedule the application's loop body and facilitate high performance at a reasonable cost.
An overview of CGRAs and the selection of the target architecture is given in Section 1.1. Compiling loops to CGRAs involves the modulo scheduling process, a combination of 3 tasks: scheduling, placement, and routing, which will be discussed in Section 1.2. In this thesis, the modulo scheduling is done with particle swarm optimization. The various kinds of evolutionary algorithms and the reason for the selection of PSO are discussed in Section 1.3. This chapter concludes with a discussion of the different compilation procedures attempted so far for CGRAs and the need for a new modulo scheduling algorithm in Section 1.4.
1.1 Coarse-Grained Reconfigurable Architecture
1.1.0 Introduction
Coarse-grained reconfigurable architectures have been used widely for accelerating time-consuming loops. The large number of processing elements (PEs) available in CGRAs can be used to exploit the inherent parallelism found in loops to accelerate the execution of applications. In a CGRA, the PEs are organized in a 2-dimensional (2D) array, connected by a configurable interconnect network [Dimitroulakos et al., 2009].
1.1.1 Overview of some CGRAs
1.1.1.0 MorphoSys
The MorphoSys architecture has been designed for multimedia applications, to accommodate applications with data parallelism and high throughput constraints, such as video compression [Singh et al., 2000a]. The components of the MorphoSys architecture, implemented as a single chip, are an array of reconfigurable cells (RCs) forming a processing unit (called the RC Array), a general-purpose (core) processor (TinyRISC) and a high-bandwidth memory interface. The computation-intensive operations are handled by the single instruction multiple data (SIMD) array of coarse-grained reconfigurable cells (CGRCs). Sequential processing and control of the RC Array operations are performed by the TinyRISC [Singh et al., 2000b]. A context word is loaded into each RC's context register for every execution cycle.
1.1.1.1 KressArray
KressArray (also known as rDPA) has a 32-bit-wide data path with an array of reconfigurable processing elements. The KressArray reconfigurable architecture features arithmetic and logic operators at the level of the C programming language, making the mapping simpler than for FPGAs [Hartenstein et al., 2000]. It consists of a mesh of PEs, also called rDPUs (reconfigurable Data Path Units), which are connected to each of their 4 nearest neighbors by 2 bidirectional links with a data path width of 32 bits, where "bidirectional" means a direction is selected at configuration time.
1.1.1.2 Montium
The coarse-grained reconfigurable part of the Chameleon system-on-chip is called the Montium Tile [Heysters and Smit, 2003]. The Montium Tile is especially designed for mobile computing and targets the 16-bit digital signal processing (DSP) algorithm domain [Smit et al., 2007]. Montium supports both integer and fixed-point arithmetic, with a 16-bit datapath width. The tile is interfaced with the outside world through the communication and configuration unit (CCU). The tile has 5 identical arithmetic and logic units (ALU1...ALU5) that can exploit spatial concurrency to enhance performance. Dedicated input/output units, placed around the array architecture, are used to handle fast and parallel transfers of input/output data [Alsolaim et al., 1999].
1.1.1.3 DReAM
Dynamically reconfigurable architecture for mobile systems (DReAM) [Alsolaim, 2002] was designed to be part of a system-on-a-chip (SoC) solution for the third and future generations of wireless mobile terminals. It consists of an array of concurrently operating coarse-grained reconfigurable processing units (RPUs). Each RPU was designed to execute all required arithmetic data manipulations and control-flow operations. To support fast dynamic reconfiguration, a configuration memory unit (CMU) holds configuration data for each of the RPUs and is controlled by one responsible communication switching unit (CSU).
1.1.1.4 CHESS
The reconfigurable arithmetic array (RAA), termed CHESS [Marshall et al., 1999], was developed by Hewlett-Packard (HP) Labs to provide high computational density, wide internal data bandwidth, distributed registers, and memory resources for important multimedia algorithm cores. CHESS also offers strong scalability, software flexibility and advanced features for dynamic reconfiguration. CHESS's functional units are 4-bit ALUs, and its 4-bit bus connections reduce the number of bits of configuration memory. The small configuration memory allows fast reconfiguration.
1.1.1.5 RaPiD
RaPiD [Ebeling, 2002] is a coarse-grained reconfigurable architecture designed to achieve the low cost and high power efficiency of application-specific integrated circuits (ASICs) without losing the flexibility of programmable processors. The RaPiD architecture is configured to form a linear computational pipeline with a linear array of functional units (FUs). Each RaPiD cell contains 3 ALUs, one multiplier, 6 general-purpose "datapath registers" and three small 32-word local memories. The RaPiD array is designed to be clocked at 100 MHz, and the reconfiguration time for the array is conservatively estimated to be 2000 cycles [Ebeling et al., 1997].
1.1.1.6 PipeRench
PipeRench [Goldstein et al., 2000] is a reconfigurable fabric with a network of interconnected configurable logic and storage elements. PipeRench contains a set of physical pipeline stages called stripes. In each stripe, the interconnection network accepts inputs from each processing element in that stripe and one of the register values from each register file in the previous stripe. Each PE contains an arithmetic logic unit (ALU) and a pass register file, where the ALU contains lookup tables (LUTs) and extra circuitry for carry chains, zero detection, and so on. PipeRench was designed to improve reconfiguration time, compilation time and forward compatibility, to increase flexibility, and to reduce chip development, maintenance and fabrication costs.
1.1.1.7 ADRES
The architecture for dynamically reconfigurable embedded systems (ADRES) [Mei et al., 2005a] tightly couples a very long instruction word (VLIW) processor and a reconfigurable array. The architecture has 2 virtual functional views, the VLIW processor view and the reconfigurable array view, built into a single architecture [Mei et al., 2003b]. The VLIW processor, consisting of several functional units and a multi-port register file (RF), serves as the first row of the reconfigurable array. Some FUs in the first row can connect with memory to facilitate data access for load/store operations. The reconfigurable array is intended to efficiently execute only the computationally intensive kernels of applications [Mei et al., 2003a]. The architecture template, shown in Figure 1.0, consists of many basic components, including computational, storage, and routing resources.
The FUs can execute a set of word-level operations selected by a control signal. Register files and memory blocks can store intermediate data. Routing resources, including wires, multiplexers, and buses, connect the computational resources and storage resources defined by the topology through point-to-point connections or a shared bus. Different instances of the architecture can be generated by a script-based technique, by specifying different values for the communication topology, the supported operation set, resource allocation, and latency in the target architecture [Zalamea et al., 2004].
The results can be written to the distributed RFs, which are small and have fewer ports than the shared RF, or they can be routed to other FUs. An output register buffers each FU's output to guarantee timing. Multiplexers are used to route data from different sources. The configuration RAM stores the configuration for each cycle. In ADRES, the integration of predicate support, distributed register files and configuration RAM makes it applicable and efficient for many applications.

Figure 1.0: ADRES architecture, showing the VLIW view and the reconfigurable array view, taken from [Mei et al., 2005c]
1.1.2 Comparison and Selection of the Target CGRA
The various CGRAs discussed above have their advantages and disadvantages. MorphoSys has a 16-bit granularity with a mesh-based structure, a fast memory interface and dynamic programming, but requires a manual placement and routing tool [University of California, 2009]. KressArray has a highly flexible mapper used to map massively communication-intensive applications [Hartenstein et al., 2000] and provides area-efficient and throughput-efficient designs, but it can be used only for a limited set of applications with regular data dependencies [Becker et al., 1998]. Montium focuses on providing sufficient flexibility and abundant parallelism, but has a limited configuration space [Guo, 2006]. ADRES uses the VLIW processor for non-kernel code and reduces the communication cost between the VLIW and the reconfigurable matrix through the shared RFs for resource sharing [Vassiliadis and Soudris, 2007a]. DReAM was designed for modern wireless communication systems and provides an acceptable trade-off between flexibility and application performance [Becker et al., 2000].
CHESS offers strong scalability and dynamic reconfiguration, but it has the constraints that the ALU and switchbox must be of the same size and that long wires are needed for the transfer of data [Marshall et al., 1999]. RaPiD features static and dynamic control to map a range of applications, but it has the disadvantage of a data path with an implicit directionality [Ebeling, 2002]. PipeRench trades configuration size for compilation speed through hardware virtualization, and improves compilation time, reconfiguration time and forward compatibility. However, PipeRench has a low bandwidth between main memory and processor, which limits the type of applications that can be sped up [Goldstein et al., 2000].
Among the various coarse-grained architectures discussed, the ADRES architecture was chosen for the proposed research. The reason for this choice is that the ADRES architecture is a flexible architecture template with low communication costs. The loops present in an application can be mapped onto the ADRES array in a highly parallel way, with ease of programming. The compiler within the ADRES template is automatically retargetable, i.e., it has been designed to be relatively easy to modify to generate code for different configurations, and it has provided a good deal of data for comparison.
1.2 Scheduling
1.2.0 Introduction
The objective of scheduling is to minimize the execution time of a parallel application by properly allocating tasks to the processors while avoiding processor stall cycles. Scheduling inner loop bodies is an NP-hard problem, which implies that there is no polynomial-time algorithm that can give an optimal solution to the problem (assuming P ≠ NP) [Kwok and Ahmad, 1999]. The ultimate goal of scheduling is to create an optimal schedule, i.e., a schedule with the shortest length for the given application. Schedule length, or makespan, is measured as the overall execution time of a parallel program in cycles. Additionally, when a schedule is produced, the scheduling algorithm must satisfy both resource and precedence constraints.
Depending on the constraints, scheduling may be broadly classified into 3 main categories [Ching and Keshab, 1995].
Time-Constrained Scheduling minimizes the number of required resources when the iteration period is fixed.
Unconstrained Scheduling does not fix either the timing or the resource usage during scheduling.
Resource-Constrained Scheduling fixes the number of resources, with the objective of determining the fastest schedule, i.e., the smallest iteration period.
List scheduling is the most commonly used scheduling approach. It can be classified under both resource-constrained and time-constrained scheduling. A scheduling list is statically constructed before node allocation begins, and most importantly, the sequencing in the list is not modified. List scheduling is often used for both instruction scheduling and processor scheduling [Beaty, 1994]. In an iteration, nodes with a higher priority are scheduled first and lower-priority nodes are deferred to a later clock cycle, based on priority functions such as as soon as possible (ASAP), as late as possible (ALAP), mobility, and height-based priority [Tuhin, 2007]. The priority sorting is carried out by selecting a node based on the priorities listed above and adding it to the priority-sorted list; sorting then proceeds to each child node of the selected node until all the nodes in the list are processed.
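The list-scheduling loop described above can be sketched as follows. This is a minimal illustration with hypothetical data structures: nodes are visited in priority order, and each node is placed at the earliest cycle at which all its predecessors have completed and a functional unit is still free.

```python
def list_schedule(nodes, preds, latency, priority, num_fus):
    """Greedy list scheduling: visit nodes in priority order, defer a node
    while no functional unit is free in its earliest feasible cycle."""
    start = {}
    busy = {}  # cycle -> number of FUs already issuing an operation
    for n in sorted(nodes, key=priority):
        # earliest cycle at which every predecessor's result is available
        t = max((start[p] + latency[p] for p in preds[n]), default=0)
        while busy.get(t, 0) >= num_fus:  # all FUs taken: defer to next cycle
            t += 1
        start[n] = t
        busy[t] = busy.get(t, 0) + 1
    return start
```

The `priority` argument stands in for any of the priority functions mentioned above (ASAP, ALAP, mobility, height); with a single FU, two ready nodes with the same earliest cycle are serialized, which is exactly the "deferred to a later clock cycle" behavior.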
1.2.1 Software Pipelining
Software pipelining [Lam, 1988] is a scheduling technique which overlaps the operations of successive iterations to exploit the processor's fast execution rate. Software pipelining is a global cyclic scheduling problem that exploits the instruction-level parallelism (ILP) available in loops. The idea is to look for a pattern of operations from various iterations (often termed the kernel) such that repeatedly iterating over this pattern produces the effect that iterations are initiated at a regular interval.
This interval is termed the initiation interval (II). Thus successive iterations of the loop are in execution at different stages of their computation. Once a schedule is obtained, the loop is reconstructed into a prologue, a kernel, and an epilogue. Instructions in the prologue are repeated until the pipeline is filled; the prologue consists of code from the first few iterations of the loop. The loop kernel, or steady state [Allan et al., 1995], consists of instructions from multiple iterations of the original loop, and a new iteration of the kernel is initiated every II cycles. Instructions in the epilogue complete the functionality of the code and consist of code to finish the last few iterations of the loop.
1.2.2 Modulo Scheduling
Modulo scheduling [Mei et al., 2003a] is a software pipelining technique which overlaps several iterations of a loop by starting successive iterations at a regular interval. The main goal of modulo scheduling is to simplify the process of software pipelining by generating a schedule for one iteration of the loop and using the same schedule for subsequent iterations at constant intervals. Modulo scheduling ensures that the schedule satisfies the data dependence constraints, both intra- and inter-iteration, with no resource availability conflicts.
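The modulo resource constraint just mentioned can be made concrete with a small sketch (a hypothetical set-based reservation table): an operation issued at cycle t on a functional unit occupies row t mod II of the table, so two operations clash exactly when they claim the same row on the same unit.

```python
def mrt_reserve(mrt, ii, fu, cycle):
    """Try to reserve (cycle mod II, fu) in a modulo reservation table.

    mrt is a set of (row, fu) pairs already claimed. Returns False on a
    modulo resource conflict, True after a successful reservation.
    """
    row = cycle % ii
    if (row, fu) in mrt:
        return False  # another operation already owns this slot every II cycles
    mrt.add((row, fu))
    return True
```

With II = 2, for instance, an operation on FU 0 at cycle 1 blocks any later operation on FU 0 at cycles 3, 5, 7, ..., since all of them map to row 1.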
The schedule for an iteration is divided into stages so that different stages of successive iterations are overlapped in execution. The number of stages in an iteration is called its stage count (SC), and the number of cycles per stage is the initiation interval. The initiation interval should be minimized to exploit as much parallelism from a loop as possible, and modulo scheduling tries to minimize it [Tuhin, 2007].
The II is constrained either by the loop-carried dependences of the loop (i.e., cases where data from an earlier iteration is used in a later iteration) or by the resource constraints of the hardware. The limit on the II set by loop-carried dependences is called the recurrence minimal initiation interval (RecMII), while the limit set by resource constraints is called the resource minimal initiation interval (ResMII). The minimal initiation interval (MII) is a lower bound used to start the pipeline scheduling process, and it is computed as MII = max(ResMII, RecMII) [Llosa et al., 2001]. If a valid schedule cannot be obtained with an II equal to MII, then the II is incremented by one and the scheduling process is repeated until a valid schedule is obtained or the algorithm gives up.
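The two bounds and the iterative search just described can be sketched as follows. The schedule-attempt step is abstracted behind a caller-supplied function, since any modulo scheduler (heuristic or otherwise) can fill that role; the formulas are the standard ones, with each recurrence summarized by the total latency and total dependence distance of a cycle in the dependence graph.

```python
import math

def res_mii(num_ops, num_fus):
    # Resource bound: each FU can issue at most one operation per cycle,
    # so num_ops operations need at least ceil(num_ops / num_fus) cycles.
    return math.ceil(num_ops / num_fus)

def rec_mii(recurrences):
    # Each recurrence is (total latency, total distance) of a dependence
    # cycle; the bound is the maximum of ceil(latency / distance).
    return max((math.ceil(lat / dist) for lat, dist in recurrences), default=1)

def modulo_schedule(num_ops, num_fus, recurrences, try_schedule, max_ii=64):
    ii = max(res_mii(num_ops, num_fus), rec_mii(recurrences))  # MII
    while ii <= max_ii:
        schedule = try_schedule(ii)  # attempt a modulo schedule at this II
        if schedule is not None:
            return ii, schedule
        ii += 1  # no valid schedule at this II: relax the interval
    return None  # the algorithm gives up
```

For example, 10 operations on 4 FUs give ResMII = 3; a recurrence with latency 3 and distance 1 gives RecMII = 3; if the scheduler only succeeds at II = 4, the search returns 4.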
Modulo scheduling can be illustrated by taking the example of the dependence graph shown in Figure 1.1b, along with a 2 x 2 architecture. The data dependence graph, unrolled for 3 iterations, is shown in Figure 1.1a. The initiation interval is 1, so at time cycle 2 all 3 iterations are executing at different stages.
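The overlap in this example can be traced with a small sketch: with stage count SC and initiation interval II, iteration i is executing stage (t - i*II) // II at cycle t, as long as that value lies in [0, SC).

```python
def inflight(t, num_iters, sc, ii):
    """(iteration, stage) pairs executing at cycle t of a modulo schedule."""
    return [(i, (t - i * ii) // ii)
            for i in range(num_iters)
            if 0 <= t - i * ii < sc * ii]
```

For SC = 3 and II = 1, cycle 2 holds iteration 0 in its last stage, iteration 1 in its middle stage, and iteration 2 in its first stage, which is the steady state of the figure.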
A modulo schedule can be generated by the use of heuristics or integer linear programming. When modulo scheduling is based on heuristics, it may not always give the optimal solution. Many heuristic algorithms have been developed for modulo scheduling, such as:
• Iterative modulo scheduling [Rau, 1994]
• Recurrence cycle aware modulo scheduling [Oh et al., 2009]
• Clustered modulo scheduling [Sanchez and Gonzalez, 2001]
• Swing modulo scheduling [Llosa et al., 1996]
• Hypernode reduction modulo scheduling [Llosa et al., 1995]
• Modulo scheduling with integrated register spilling [Zalamea et al., 2001].
Figure 1.1: Modulo scheduling example: (a) the data dependence graph unrolled for 3 iterations, showing the steady state; (b) the dependence graph mapped onto a 2 x 2 array of functional units (fu1-fu4).