Hardware and software architecture facilitating the operation by the industry of dynamically adaptable heterogeneous embedded systems.


HAL Id: tel-01019909

https://tel.archives-ouvertes.fr/tel-01019909

Submitted on 27 Nov 2014


Laurent Gantel

To cite this version:

Laurent Gantel. Hardware and software architecture facilitating the operation by the industry of dynamically adaptable heterogeneous embedded systems. Signal and Image processing. Université de Cergy Pontoise, 2014. English. NNT: 2014CERG0684. tel-01019909


Université de Cergy-Pontoise

PhD Thesis

Hardware and Software Architecture for Heterogeneous and Dynamically Reconfigurable Systems-on-Chip

by

Laurent Gantel

Equipes Traitement de l'Information et Systèmes (ETIS)
CNRS UMR8051

Embedded System Lab (ESL)
THALES Research & Technology FRANCE

Thesis defended on 14th January, 2014

M. Gilles Sassatelli     Reporter
M. Frédéric Petrot       Reporter
M. Daniel Chillet        Examiner
M. Guy Gogniat           Examiner
M. François Verdier      Director
M. Fabrice Lemonnier     Director

Tell me and I forget, teach me and I may remember, involve me and I learn.

Abstract

This thesis aims to define software and hardware mechanisms that help manage Dynamic and Heterogeneous Reconfigurable Systems-on-Chip (DHRSoC). The heterogeneity is due to the presence of general processing units and reconfigurable IPs. Our objective is to provide an application developer with an abstracted view of this heterogeneity regarding the mapping of tasks onto the available processing elements. First, we homogenize the user interface by defining a hardware thread model. Then, we continue with the homogenization of hardware thread management. We implemented OS services that save and restore a hardware thread context. Design tools have also been developed in order to overcome the relocation issue. The last step consisted in extending the access to the distributed OS services to every thread running on the platform. This access is provided independently of the thread location and is realized by implementing the MRAPI API. With these three steps, we build a solid basis to provide the developer, in future work, with a design flow dedicated to DHRSoC that allows precise architectural space explorations. Finally, to validate these mechanisms, we built a demonstration platform on a Virtex 5 FPGA running a dynamic tracking application.

Résumé

This thesis focuses on the definition of software and hardware mechanisms that ease the management of heterogeneous and dynamically reconfigurable systems-on-chip (DHRSoC). The heterogeneity of these architectures lies in the presence of both general-purpose processors and reconfigurable hardware modules. Our objective is to let an application developer abstract away this heterogeneity with respect to the allocation of tasks on the available computing units. This abstraction starts with a first phase of homogenization of the user interfaces (API) and the definition of a hardware thread model. The homogenization then continues with the management of these hardware threads. We implemented services at the operating system (OS) level that save and restore the context of a hardware thread. Design tools were also developed in order to overcome the problem of relocating a hardware thread within an FPGA. Finally, the last step was to extend access to the services offered by all the OSes distributed across the platform to all the threads running on it, independently of their location. This was achieved through an original implementation of the MRAPI API. With these three steps, we have laid a solid basis for offering the developer, in the future, a design flow dedicated to DHRSoC architectures that allows a precise architectural exploration of the system. Finally, in order to put these mechanisms to the test, we built a demonstration platform on a Virtex 5 FPGA featuring a target tracking application.


Acknowledgments

I would first like to thank my thesis supervisors: Amine Benkhelifa, who introduced me to the world of research and always pushed me to go further, from my first university years to the end of this doctorate, and who guided and motivated me throughout this thesis; François Verdier, whose advice and remarks were useful in carrying this project through; and Fabrice Lemonnier, who trusted me and welcomed me into the LSE laboratory at Thales Research and Technology during my Master's and my thesis.

Thanks also to the members of the jury who did me the honor of evaluating my work: Gilles Sassatelli and Frédéric Petrot, who agreed to be its reporters, and Daniel Chillet and Guy Gogniat, who were its examiners.

I particularly wish to thank my office colleagues. Amel Khiar was always there to encourage me, and I spent excellent moments with her. I thank her again for her infectious good humor and for everything she brought me during all these years. Many thanks to Liang Zhou, whom I came to know and greatly appreciate over time. Thanks also to Lounis Zerioul, Guy Wassi and Christian Gamom, who were also very present and who have become true friends over time.

I extend my thanks to the members of TRT whom I had the chance to work alongside, with whom I was able to collaborate in a pleasant working environment, and whose varied skills were very useful and above all very instructive to me, among them Jimmy Le Rhun, Christophe Clienti, Paul Brelet, Rémi Barrere, Téodora Petrisor, Philippe Millet, Philippe Bonnot and Lionel Thavot, as well as to the members of the ETIS laboratory, among others Frédéric de Melo, Lounis Kessal, Emmanuel Huck, Samuel Garcia, Thomas Lefebvre, Kaouthar Bousselam, Laurent Rodriguez, Benoit Miramond, Lotfi Bendaouia and Fakhreddine Ghaffari.

Part of these thanks goes to the members of the FOSFOR project with whom I worked regularly: Fabrice Muller, Daniel Chillet, Sébastien Pillement and Nicolas Knecht.

Finally, I wish to express all my gratitude to my family and loved ones.


Contents

1 Introduction
  1.1 Context
    1.1.1 Real-time applications for embedded systems
    1.1.2 Heterogeneous Systems-on-Chip
    1.1.3 Modern FPGAs
    1.1.4 Dynamic and Partial Reconfiguration
  1.2 HSoC programming model
    1.2.1 Programming issue
    1.2.2 Dynamically Reconfigurable HSoC
  1.3 Objectives

2 Unified Thread Model
  2.1 Related work
    2.1.1 Software kernel management
    2.1.2 Run-time manager
    2.1.3 Hardware thread model
    2.1.4 Conclusion
  2.2 Thread model
    2.2.1 Process definition
    2.2.2 Thread definition
    2.2.3 Software thread model
    2.2.4 Thread attributes
    2.2.5 Synchronization techniques among threads
    2.2.6 Conclusion
  2.3 Our Hardware Thread model
    2.3.1 Context: The FOSFOR project
    2.3.2 Hardware Thread specifications
    2.3.3 Hardware Thread architecture
  2.4 Hardware Thread programming model
    2.4.1 Operating System services protocol
    2.4.2 Network communication protocol
    2.4.3 Accelerator interface
  2.5 Conclusion

3 Hardware threads preemption using Dynamic and Partial Reconfiguration
  3.1 Introduction
  3.2 Related works
    3.2.3 Design tools
  3.3 FPGA reconfiguration knowledge
    3.3.1 Virtex 5 FPGA resources
    3.3.2 FPGA configuration
    3.3.3 Bitstream parser
  3.4 Preemption mechanisms
    3.4.1 Context management service
    3.4.2 Reconfiguration service
    3.4.3 Relocation Service
  3.5 Design flow for hardware threads relocation
    3.5.1 Standard flow
    3.5.2 Problematics
    3.5.3 Relocation flow
    3.5.4 Experimented tools
    3.5.5 Adapted Isolation Design Flow
  3.6 Conclusion

4 Operating System for Dynamically and Reconfigurable Heterogeneous SoC
  4.1 Context and definitions
    4.1.1 Kernel structure
    4.1.2 Thread API
  4.2 Related works
    4.2.1 Introduction
    4.2.2 Inter-core communication in MPSoC
    4.2.3 HRSoC middlewares
    4.2.4 Hybrid OS for HRSoC
    4.2.5 Conclusion
  4.3 Specifications
    4.3.1 Objectives
    4.3.2 Programming model
    4.3.3 Memory constraints
    4.3.4 Architecture
    4.3.5 Portability
  4.4 Conception
    4.4.1 Operating system architecture
    4.4.2 Platform architecture
    4.4.3 Multicore layer
  4.5 Implementation
    4.5.1 Modular operating system: MutekH
    4.5.2 MRAPI Specification
    4.5.3 Hardware architecture
    4.5.6 MRAPI types
    4.5.7 Resources system calls
  4.6 Conclusion

5 Application deployment
  5.1 Introduction
  5.2 Platform building
    5.2.1 Microblaze platform
    5.2.2 Read and Write timings
    5.2.3 System calls
    5.2.4 Hardware Threads encapsulation
  5.3 Tracking application
    5.3.1 Presentation
    5.3.2 The Camshift IP
    5.3.3 The DVI IP
    5.3.4 Application deployment
    5.3.5 Results and performances
  5.4 Conclusion

6 Conclusions
  6.1 Summary
    6.1.1 Discussion
    6.1.2 Key contributions
    6.1.3 Hypothesis and Limitations
  6.2 Future Work

A Network Interface API
  A.1 Supported requests
    A.1.1 Write request
    A.1.2 Read request
    A.1.3 Read request response
    A.1.4 Receive request

B Hardware CRC
  B.1 Relocation process
  B.2 CRC computation
  B.3 Hardware CRC module


List of Figures

1.1 Partial and Dynamic Reconfiguration (PDR) application example [Xilinx 2010a]
1.2 Design flow from developer's point of view
1.3 Xilinx Zynq 7000 EPP block diagram
1.4 Dynamic and Partial Reconfiguration principle
1.5 Abstraction level differences between hardware and software programming models
1.6 Heterogeneous threading application
1.7 Hardware Thread preemption

2.1 µC-Linux ICAP driver [Bergmann 2003]
2.2 RAPTOR software architecture [Rana 2007]
2.3 OS4RS platform architecture [Nollet 2003]
2.4 Operating System for Reconfigurable Systems software architecture [Steiger 2004]
2.5 VFPGA runtime manager architecture [El-Araby 2008]
2.6 Functional Unit architecture [Verdoscia 1994]
2.7 Hybrid Thread model [Agron 2009a]
2.8 ReconOS hardware thread model [Lubbers 2008]
2.9 Process and Thread
2.10 Thread life cycle
2.11 User Thread model
2.12 Kernel Thread model
2.13 Hybrid Thread model
2.14 FOSFOR platform architecture
2.15 Hardware Thread Architecture
2.16 OSSC architecture
2.17 Software and Hardware Thread States
2.18 Hardware Thread FSM example
2.19 Hardware Thread HDL files example
2.20 Network Interface architecture
2.21 OSSC Status Word content
2.22 System Call procedure
2.23 System Call procedure steps
2.24 Network Interface Send and Receive protocol
2.25 Network Interface Write and Read protocol
2.26 Parallel processing using pipelining

3.2 (a) Implementation of PRR-PRR relocation (b) Top-Level block diagram of ARC [Kallam 2009]
3.3 ICAP accelerators solutions [Liu 2009]
3.4 FaRM architecture [Duhem 2011]
3.5 Uparc architecture [Bonamy 2012]
3.6 ICAP Hard Macro block diagram [Hansen 2011]
3.7 RapidSmith screen capture [Lavin 2011]
3.8 OpenPR screen capture from FPGA Editor [Sohanghpurwala 2011]
3.9 Isolation Design Flow screen capture from FPGA Editor [Corbett 2012]
3.10 Slice-L and Slice-M [Xilinx 2009]
3.11 FPGA organization
3.12 Type 1 Packet Header Format [Xilinx 2009b]
3.13 Type 2 Packet Header Format [Xilinx 2009b]
3.14 Frame address [Xilinx 2009b]
3.15 Resources memory configuration for the Virtex 5 architecture
3.16 Frame composition [Xilinx 2009b]
3.17 Multiple Rows bitstream content
3.18 ICAP driver for Partial Reconfiguration
3.19 Partial bitstream relocation process
3.20 Partial reconfiguration: Partition and modules
3.21 Proxy Macro Placed and Routed example
3.22 Slice Macro
3.23 PlanAhead Slice Macro placement
3.24 Static route through Reconfigurable Partition
3.25 Relocation flow
3.26 Static place
3.27 XDL File structure
3.28 Internal and external switch matrices
3.29 PIP types
3.30 XDL Net example
3.31 Trusted routes
3.32 Test design
3.33 Software Bus Macro implementation
3.34 Routed software Bus Macro
3.35 Hardware Bus Macro extraction
3.36 Hardware Bus Macro extraction and homogenization
3.37 Adapted Isolation Design Flow
3.38 Design test - Partition isolation

4.1 Toppers/FMP [Tomiyama 2008]
4.2 SMP System [Huerta 2008]
4.3 ICPC Service [Lin 2009]
4.6 Self-reconfigurable platform [Shiyanovskii 2009a]
4.7 System framework overview [Guerin 2009a]
4.8 Hardware Dependant Software layer [Senouci 2006]
4.9 MCAPI for MPSoC [Matilainen 2011]
4.10 Hybrid Threads platform [Agron 2009b]
4.11 User point of view
4.12 Platform memory architecture
4.13 Syscall Procedure
4.14 Server types
4.15 OS Server Architecture
4.16 Message Template
4.17 Study Case Platform
4.18 Distant system call
4.19 Scenario 1 platform
4.20 Scenario 1 datagram
4.21 Scenario 2 platform
4.22 Scenario 2 datagram
4.23 Scenario 3a platform
4.24 Scenario 3a datagram
4.25 Scenario 3b platform
4.26 Scenario 3b datagram
4.27 Operating system architecture
4.28 MutekH global view
4.29 Homogeneous NoC-based Platform
4.30 Heterogeneous NoC-based Platform
4.31 MRAPI library file structure
4.32 MRAPI local tables
4.33 Requests management proxies

5.1 Demonstration platform
5.2 Microblaze platform
5.3 Read and write test platform
5.4 Bridge PLB-NoC architecture
5.5 Hardware platform used to test system calls procedures
5.6 Hardware MRAPI global architecture
5.7 MRAPI remote call sections
5.8 Target Tracking Application
5.9 Binary Long Object (Blob)
5.10 Pipelined Camshift hardware node
5.11 Pipelined Camshift User FSM
5.12 Integration of the DVI IP in the Demonstration Platform
5.13 Application deployment
5.16 Detailed application deployment

6.1 Hardware node implementation choices

A.1 Write request packet
A.2 Read request packet
A.3 Read request response


List of Tables

1.1 Processing Elements comparison regarding control ability, performances and general programmability
1.2 Platform technology comparison regarding control cost, flexibility and performances

3.1 Bitstream header contents
3.2 Bitstream initialization commands

4.1 Resources table example

5.1 Software layers footprints
5.2 Code execution time for a Microblaze processor (ML506 125 MHz)
5.3 Timings in cycles to write into platform memories
5.4 Timings to read from platform memories
5.5 Network Interface Communication Measurements
5.6 NoC Send timings for 1 KB data
5.7 HwMRAPI Resources usage
5.8 Timings to locally initialize a node
5.9 Timings to access a local Mutex resource
5.10 Timings to access a remote Mutex resource
5.11 Detailed timings to access a remote Mutex resource
5.12 Hardware Thread Resources Usage
5.13 Demonstration Platform resource utilization
5.14 Hardware Thread Resources Usage
5.15 Camshift slot resource utilization
5.16 Application timings

B.1 ICAP register involved in CRC computation


Introduction


1.1 Context

1.1.1 Real-time applications for embedded systems

Applications for embedded systems dedicated to image and signal processing are becoming increasingly complex. The amount of data processed by these systems keeps growing, so developers need more and more computing power. This is the case, for instance, of monitoring systems and of automotive or radar applications. This leads to the design of new computing systems able to respect the high performance constraints imposed by these applications and their environment.

In order to satisfy these constraints, applications must be profiled and divided into several tasks. Each task considered responsible for the failure to meet the constraints has to be implemented separately on a dedicated processing unit. For instance, communicating systems such as a network switch have to handle several protocols, transfer information at high rates and process large amounts of data. To achieve good performance and gain flexibility, communication protocol stacks may be implemented in hardware and take advantage of partial and dynamic reconfiguration (Fig. 1.1).

Figure 1.1: Partial and Dynamic Reconfiguration (PDR) application example [Xilinx 2010a]

In general, the multiplicity of features needed by the end-users and mostly the [...] global performance of the application. However, the drawback is that it complicates the development process.

Another constraint is the need for flexibility or, more precisely, for adaptability. The complexity of the applications requires adapting the parameters and the features provided by these systems. For example, the computation power can depend on the quality of service required, and the power consumption of a system can be monitored with respect to its environment or to random events. Also, as embedded systems are more and more integrated in our environment, these human or environmental interactions require the systems to adapt themselves to the various queries and needs that this implies.

In contrast, designers want a simple view of their application that abstracts the platform specificity, especially its heterogeneity (Fig. 1.2). The aim is to dissociate the functional validation of the application from the design exploration of its implementation.

In the functional validation, tasks are described by high-level execution parameters such as the execution time, the deadline, or the priority. During the design exploration, these parameters and new ones, like the power consumption or the memory usage, are added for one or several possible partitionings. These two points lead us to consider the design of heterogeneous systems-on-chip and the [...]

Figure 1.2: Design flow from developer's point of view
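The high-level parameters used during functional validation can be illustrated with a small sketch. It is only an illustration under stated assumptions: the `Task` record, its field names, the millisecond unit, and the convention that a larger number means a higher priority are all hypothetical and do not come from the thesis. The function lists the tasks whose execution time exceeds their deadline, which are the candidates for a dedicated processing unit.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical high-level task description (names and units are assumptions)."""
    name: str
    exec_time_ms: float  # measured or estimated execution time
    deadline_ms: float   # time budget allowed for the task
    priority: int        # assumed convention: larger value = higher priority


def tasks_missing_deadline(tasks):
    """Return the names of tasks that exceed their deadline, highest priority first."""
    late = [t for t in tasks if t.exec_time_ms > t.deadline_ms]
    late.sort(key=lambda t: t.priority, reverse=True)
    return [t.name for t in late]


# Illustrative values only, not measurements from the thesis.
example_tasks = [
    Task("acquire", exec_time_ms=2.0, deadline_ms=5.0, priority=1),
    Task("filter", exec_time_ms=12.0, deadline_ms=8.0, priority=3),
    Task("track", exec_time_ms=9.5, deadline_ms=9.0, priority=2),
]
print(tasks_missing_deadline(example_tasks))
```

During design exploration, the same record could be extended with fields such as power consumption or memory usage for each candidate partitioning.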

1.1.2 Heterogeneous Systems-on-Chip

Platforms based on different processing elements are called Heterogeneous Systems-on-Chip (HSoC). In such a platform, the application is divided into tasks. Whereas some tasks are implemented as hardware accelerators and allocated to a partition of the chip, others run as software tasks on computing processor elements. A hardware accelerator is defined as a hard-wired function developed to accelerate the processing of a task. A computing processor unit can be a General Purpose Processor (GPP), a specialized one like a Digital Signal Processor (DSP), a Graphics Processing Unit (GPU) or a simple Micro-Controller Unit (MCU).

Each of these processing elements is more or less suited to certain types of tasks [Leon Adams 2007]. The hardware accelerator is well suited to intensive processing tasks, especially tasks whose operations can be parallelized. Conversely, it can hardly be used for control-intensive tasks, which are better suited to a GPP. Homogeneous tasks with low data dependency can be easily and efficiently parallelized on a GPU, whereas heterogeneous tasks with complex data paths are not recommended for this architecture. Simple control tasks processing small and well-ordered data would likely be implemented on a Micro-Controller Unit. By combining these different processing elements, it is possible to adapt the application to be deployed to execution time constraints or to memory and logic resources. Table 1.1 summarizes the strengths and weaknesses of each processing element.

Processing Element   Control   Performances   Programmability
GPP                  +++       +              +++
GPU                  +         ++             ++
DSP                  +         ++             ++
MCU                  +++       +              +++
Hw. Acc.             +         +++            +

Table 1.1: Processing Elements comparison regarding control ability, performances and general programmability
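As a rough illustration of how Table 1.1 could feed a design exploration, the sketch below transcribes the table (counting the number of "+" signs) and ranks the processing elements for a given set of criterion weights. The weighting scheme, the function name and the dictionary keys are assumptions made for this example; the thesis does not define such a scoring method.

```python
# Scores transcribed from Table 1.1 (number of '+' signs per criterion).
TABLE_1_1 = {
    "GPP":      {"control": 3, "performance": 1, "programmability": 3},
    "GPU":      {"control": 1, "performance": 2, "programmability": 2},
    "DSP":      {"control": 1, "performance": 2, "programmability": 2},
    "MCU":      {"control": 3, "performance": 1, "programmability": 3},
    "Hw. Acc.": {"control": 1, "performance": 3, "programmability": 1},
}


def rank_processing_elements(weights):
    """Rank processing elements by a weighted sum of their Table 1.1 scores.

    `weights` maps a criterion name to its importance; missing criteria count
    as zero. Ties keep the table order because Python's sort is stable.
    """
    def score(element):
        return sum(weights.get(criterion, 0) * value
                   for criterion, value in TABLE_1_1[element].items())

    return sorted(TABLE_1_1, key=score, reverse=True)


# A task that only cares about raw performance favors the hardware accelerator.
print(rank_processing_elements({"performance": 1}))
```

A weighted sum is only one possible heuristic; a real exploration would also account for resources, power and communication costs.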

A platform which includes all these components can be implemented using different technologies: an Application Specific Integrated Circuit (ASIC), a Multi-Processor System-on-Chip (MPSoC), or a Field Programmable Gate Array (FPGA).

ASIC technology offers great performance but is very expensive and not flexible at all. In this document we consider an MPSoC to be a SoC made up of at most a dozen cores, like the OMAP5430 based on a Cortex-A15 multiprocessor core [Instrument 2011]. MPSoCs are less efficient but cheaper and more flexible regarding task placement, and software bugs can be fixed. The FPGA is a good trade-off between ASIC technology and the MPSoC choice because it is flexible, it provides better performance than an MPSoC, and both software and hardware bugs can be fixed after the system has been placed on the market. Table 1.2 summarizes the strengths and weaknesses of each technology.

Technology   Cost   Flexibility   Performances
ASIC         +      +             +++
FPGA         ++     +++           ++
MPSoC        +++    ++            +

Table 1.2: Platform technology comparison regarding control cost, flexibility and performances

The solution which interests us is the FPGA technology. The exact reasons for this choice, namely the characteristics, the potential, and the pros and cons of the latest family of FPGAs, are detailed in the next subsection.

1.1.3 Modern FPGAs

An FPGA is a reconfigurable chip composed of several logic elements whose [...]

A modern FPGA is a matrix of resources arranged in parallel columns. Each column contains either configurable logic blocks (CLB), block RAM memories (BRAM) or dedicated digital signal processing (DSP) blocks. For this platform we define a hardware accelerator as a hard-wired function using a set of resources allocated in a partition of the FPGA.

In addition to these configurable elements, the latest families of FPGAs, for instance Xilinx Zynq 7000 devices (Fig. 1.3), include hard-core elements to accelerate certain processing or communication. This is the case of the DDR controller, the Ethernet MAC controller, or even of hard-core processors implemented with all the needed peripherals as a full micro-controller unit (dual ARM Cortex-A9 cores with timers, UART, or ICAP (Internal Configuration Access Port) controllers).

Figure 1.3: Xilinx Zynq 7000 EPP block diagram

As modern FPGA matrices tend to become larger and larger, designers now have more space to implement multi-core systems including several soft-processors and hardware accelerators. In order to offer the best performances, and as told [...] is really efficient but rather expensive for small production lines, whereas the latter is flexible and can rely on many COTS components but does not allow the wanted performance to be reached. In short, an FPGA is a good trade-off between power consumption and processing power.

Moreover, all the processing units detailed in Section 1.1.2 can be implemented inside an FPGA. This capability gives the developer the flexibility to explore different solutions when designing the platform. Several architecture choices can be made and compared. Each function can then be implemented on the wanted processing unit in order to obtain the best partitioning.

1.1.4 Dynamic and Partial Reconfiguration

The natural evolution of FPGAs leads them, thanks to miniaturization, to offer more and more logic resources [Koch 2010b]. This increase helps to meet the growing need for features required by the end-user. To manage the dramatic increase in the size of FPGAs, and especially in design time, manufacturers added partial reconfiguration features to their FPGAs (Fig. 1.4).

Figure 1.4: Dynamic and Partial Reconfiguration principle

The use of partial reconfiguration has the advantage of decreasing the implementation time because partial modules can be implemented separately while the [...]

Modern FPGA manufacturers, for now Xilinx and Altera, provide mechanisms to dynamically reconfigure the chip. Dynamic reconfiguration allows a partial module to be reconfigured while keeping the static part unchanged. The system on the chip is thus able to reconfigure a part of itself without any disturbance to the execution of the rest of the system. In addition to the functional interest, it has a significant impact on resources for autonomous embedded systems-on-chip. Moreover, in some cases it is a good way to decrease the power consumption while being capable of providing a larger choice of hardware accelerators to a given application.

1.2 HSoC programming model

1.2.1 Programming issue

Despite the real interest of this technology, the main drawback of heterogeneous platforms is that they are difficult to program. Indeed, the abstraction level differences between software functions running on processors and hardware accelerators make the development of applications really tough. In order to ease the validation and the exploration of the possible partitionings for a given platform, a common abstraction has to be provided to the end-user (Fig. 1.5).

Figure 1.5: Abstraction level differences between hardware and software programming models

To achieve this, an emerging general trend consists in adopting a high-level language to describe the application. Coupled with new efficient tools able to simulate and automatically generate low-level source code, such a design flow would make it possible to tackle the programming issues of the latest FPGAs. Indeed, due to their increasing size, the system complexity is increasing too, and such tools would provide a simpler view of the whole system. For instance, a language such as the Synchronous Data-Flow language (SDF) [Lee 1987] provides a model of computation which can be adapted both to software and hardware threads, and so abstracts the heterogeneity of the platform.
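To give a feel for why SDF is independent of the implementation target, the sketch below solves the balance equations of a small SDF graph, that is, it finds how many times each actor must fire so that every edge produces exactly as many tokens as are consumed. This is a generic textbook computation, not code from the thesis; the function name, the edge format `(source, produced, destination, consumed)` and the example graph are assumptions.

```python
from fractions import Fraction
from math import lcm


def repetition_vector(actors, edges):
    """Smallest positive firing counts that balance every edge of an SDF graph.

    `edges` is a list of (src, produced, dst, consumed) tuples. The graph is
    assumed to be connected; a ValueError is raised if it is not, or if the
    token rates are inconsistent.
    """
    rate = {actors[0]: Fraction(1)}  # relative firing rate of each actor
    pending = list(edges)
    while pending:
        progressed = False
        for edge in list(pending):
            src, produced, dst, consumed = edge
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * produced / consumed
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * consumed / produced
            elif src in rate and dst in rate:
                if rate[src] * produced != rate[dst] * consumed:
                    raise ValueError("inconsistent SDF graph")
            else:
                continue  # neither end reached yet, retry on a later pass
            pending.remove(edge)
            progressed = True
        if not progressed:
            raise ValueError("SDF graph is not connected")
    scale = lcm(*(r.denominator for r in rate.values()))
    return {actor: int(r * scale) for actor, r in rate.items()}


# A produces 2 tokens consumed 3 at a time by B; B produces 1, C consumes 2.
print(repetition_vector(["A", "B", "C"], [("A", 2, "B", 3), ("B", 1, "C", 2)]))
```

The resulting firing counts say nothing about whether an actor runs as a software thread or as a hardware accelerator, which is the property the paragraph above relies on.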

An intermediate approach can be adopted which provides not a unique programming language to describe both the software and the hardware but, as a first step, a common programming model. A commonly adopted programming model in the embedded software domain is the threading model. To design a heterogeneous platform using this model, we have to raise the abstraction level of the hardware accelerators. This allows us to reuse legacy work in the software domain and so to focus on the hardware part of the model. In our case, the implementation choice is made between a software implementation on a processor and a reconfigurable hardware logic partition.

Like software threads, we define hardware threads. A hardware thread encapsulates the hardware accelerator and allows it to behave like a software thread. Namely, a hardware thread is able to access operating system services and has, from a certain point of view, a sequential execution. These services include the ability to create or delete a resource, and to perform a system call. The user should be able to preempt any thread, software or hardware, and so to save and restore its context. A particular effort should be made on the implementation of mechanisms permitting the threads to communicate in a transparent way. Our final objective is to offer the end-user a simple thread view of the application, and the designer an efficient way to create relocatable hardware accelerators which act like software threads (Fig. 1.6). To do so, hardware accelerators should be wrapped in what we will call a hardware thread in order to communicate. Developed in the standard hardware description language VHDL (Very High Speed Integrated Circuit Hardware Description Language), generic interfaces and an abstracted execution model will allow, in the future, the integration of this intermediate programming model with high-level design tools. This will result in the automatic generation of hardware threads, taking advantage of the existing low-level structure.
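The idea of a single thread view over software and hardware implementations can be sketched as a common interface with two back ends. This is a minimal illustration, assuming hypothetical class and attribute names (`Thread`, `SoftwareThread`, `HardwareThread`, `fsm_state`, `partition`); it does not reproduce the thesis's API. Both kinds of thread expose the same `preempt` and `resume` operations, and only the way a context is captured differs.

```python
from abc import ABC, abstractmethod


class Thread(ABC):
    """Common view of a thread, whatever its implementation (hypothetical API)."""

    def __init__(self, name):
        self.name = name
        self.state = "ready"
        self._saved_context = None

    @abstractmethod
    def capture_context(self):
        """Return a copy of the state needed to resume the thread later."""

    @abstractmethod
    def apply_context(self, context):
        """Restore a context previously returned by capture_context."""

    def preempt(self):
        self._saved_context = self.capture_context()
        self.state = "suspended"

    def resume(self):
        self.apply_context(self._saved_context)
        self.state = "running"


class SoftwareThread(Thread):
    def __init__(self, name):
        super().__init__(name)
        self.registers = {"pc": 0}  # stands in for the processor registers

    def capture_context(self):
        return dict(self.registers)

    def apply_context(self, context):
        self.registers = dict(context)


class HardwareThread(Thread):
    def __init__(self, name, partition):
        super().__init__(name)
        self.partition = partition  # reconfigurable partition hosting the accelerator
        self.fsm_state = 0          # stands in for the accelerator's internal state

    def capture_context(self):
        return {"fsm_state": self.fsm_state}

    def apply_context(self, context):
        self.fsm_state = context["fsm_state"]
```

Code that schedules threads can then call `preempt()` and `resume()` without knowing whether the thread runs on a processor or in a reconfigurable partition.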

Figure 1.6: Heterogeneous threading application

1.2.2 Dynamically Reconfigurable HSoC

[...] reconfigurable HSoC, hardware threads are defined as relocatable modules which can then be allocated to any available reconfigurable partition of the FPGA.

The system, using dynamic and partial reconfiguration, allows the user to preempt any module (Fig. 1.7). Namely, a part of the chip is divided into several dynamic partitions. Each partition is then allocated by the control part of the application to one hardware thread for a certain amount of time.

Figure 1.7: Hardware Thread preemption
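The time-sharing of partitions described above can be modeled with a toy allocator. The sketch is an assumption-laden illustration: the `PartitionManager` class, the round-robin eviction policy and the integer used as a thread context are all invented for the example, and the thread name "filter" is hypothetical. It shows a thread being preempted, its context saved, and the thread later resumed in a different partition, which is what relocation makes possible.

```python
class PartitionManager:
    """Toy model of reconfigurable partitions time-shared by hardware threads."""

    def __init__(self, n_partitions):
        self.slots = [None] * n_partitions  # thread resident in each partition
        self.live = {}    # context of each resident thread
        self.saved = {}   # contexts read back from preempted threads
        self._victim = 0  # round-robin eviction pointer (assumed policy)

    def schedule(self, thread):
        """Load `thread` into a partition and return the partition index.

        If no partition is free, a resident thread is preempted and its
        context is saved so that it can be restored later.
        """
        if thread in self.slots:
            return self.slots.index(thread)
        if None in self.slots:
            index = self.slots.index(None)
        else:
            index = self._victim
            self._victim = (index + 1) % len(self.slots)
            evicted = self.slots[index]
            self.saved[evicted] = self.live.pop(evicted)
        self.slots[index] = thread
        # Restore a saved context if there is one, otherwise start fresh.
        self.live[thread] = self.saved.pop(thread, 0)
        return index

    def step(self, thread):
        """Advance the (integer) context of a resident thread by one unit of work."""
        self.live[thread] += 1
```

For example, with two partitions, scheduling a third thread evicts the first one, and rescheduling the first one later places it in the other partition with its progress intact.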

The list of target applications can then be extended to multi-mode applications and to those which need environment adaptation. As a perspective, other applications based on the dynamic detection of events, such as security systems, could take advantage of this technology. We can also cite bio-inspired architectures which would rely on dynamic reconfiguration mechanisms in order to dynamically [...]

Moreover, being able to update a system after its release could help the designer to improve adaptability to unforeseen specification modifications, for instance when implementing an H.264 codec. We can also note that it could have a good effect on the design costs of these products.

1.3 Objectives

The goal of this PhD thesis is to propose a software and hardware architecture in order to improve the application development process when targeting a Heterogeneous System-on-Chip. With the increasing complexity of applications, an abstracted programming model has to be adopted to facilitate their description and improve the flexibility regarding the implementation choices. The proposed architecture should rely on the existing operating system structure and provide services and low-level mechanisms to easily handle the thread heterogeneity.

In Chapter 2, we propose a hardware thread model which abstracts this heterogeneity. Then we study mechanisms and tools for managing hardware threads in the same way as software ones. In the next chapter, an operating system dedicated to heterogeneous systems-on-chip is specified. The main feature of this operating system is to provide flexible access to the operating system services for every thread, software or hardware, whatever the core it is running on. Finally, an application will be detailed and [...]

Unified Thread Model


2.1 Related work

2.1.1 Software kernel management

With the emergence of heterogeneous platforms including both software processors
and reconfigurable areas, a natural way to tackle the heterogeneity of these
reconfigurable platforms has been to rely on the existing software abstraction layers. To

This scheme leads to the design of a new kind of platform in which a primitive or
function used by a task, or a task itself, can be accelerated in hardware. The following
works aim to provide a simple way to load and run these accelerators. They
abstract the complexity of the communication between a processor, namely an
application running on top of an operating system, and a hardware accelerator.

This is the case of the Egret platform [Bergmann 2003], whose objective is
to provide a fully modular platform. A MicroBlaze microcontroller unit runs
a µClinux operating system and allows the developer to choose which hardware
accelerators have to be executed. To do so, a classic driver using the IOCTL¹ API²
[IOC 1997] allows the developer to load a partial bitstream of the desired
configuration through the Internal Configuration Access Port (ICAP) of the FPGA (Fig.
2.1).

Figure 2.1: µClinux ICAP driver [Bergmann 2003]

The authors of [Donato 2005] presented a platform based on the Linux operating
system. This choice was made because its source code is freely available,
it has been ported to numerous platforms, and it is modular with regard to additional
drivers.

This platform has been named Caronte: it is composed of a Virtex-II Pro FPGA
including a PowerPC 405 and one ICAP port. A software driver allows the
developer to control the ICAP, again using the IOCTL protocol. When a new
IP³ core is loaded, a communication protocol allows this IP to declare
itself to the Core Manager IP, following a hot-plug philosophy. The interconnect is
a Wishbone bus, and specific Medium Access Controllers (MACs) are used to provide
the ability to allocate address space at run-time. This work led to the launch of
the commercial project PetaLinux, which aims to simplify the deployment of the
Linux operating system on reconfigurable platforms. The use of Linux in MPSoC
platforms is a growing trend, as shown by the recent acquisition of the PetaLogix
company by Xilinx.

¹ Input and Output Control

In [Rana 2007], a platform composed of several FPGAs is introduced. The whole
platform is supervised by a single processor running Linux, which allows
partial or total reconfiguration. Simple primitives are also implemented as a
driver using the IOCTL protocol.

The main issue to solve is the management of the concurrent execution of each
task present in the system. To handle this, we need to rely on a multitask operating
system providing simple and legacy means of communication to every task,
software or hardware (Fig. 2.2). In particular, hardware tasks are connected to a Medium
Access Controller (MAC), which provides the ability to dynamically allocate address
space for each loaded module at run-time.

Figure 2.2: RAPTOR software architecture [Rana 2007]

The operating system used to abstract the reconfiguration process is based on the
work of Donato et al. [Donato 2005]. When a reconfigurable accelerator is loaded on
the FPGA, a driver is loaded into the Linux kernel and is associated with this
accelerator. To control the module, the application relies on the classical IOCTL commands.

In all these works, the management of the hardware accelerators implies minimal
modification of the operating system and is easily portable. However, the accelerator
is considered as a hardware IP core and not as a hardware thread. From the user's
point of view, this situation leads to a heterogeneous programming model for the
developer. This is not sufficient with regard to our objectives, which require
a homogeneous programming model at a higher level of representation.

2.1.2 Run-time manager

Other solutions go further and propose to design a run-time manager. A
run-time manager is responsible for scheduling hardware accelerators at run-time and
managing the access to shared resources. The system knows which partitions are
available and which accelerators need to be loaded. Using an adaptive algorithm, a
real-time unit dynamically places and configures the accelerators. Beyond
managing the hardware accelerators as co-processor modules, the goal is to
define a model in which these accelerators can be considered as real tasks, in the
same way as the software ones.

Nollet et al. [Nollet 2003] introduce one of the first approaches to designing an
operating system dedicated to reconfigurable systems, called OS4RS. It specifically
targets Heterogeneous Reconfigurable Systems-on-Chip composed of ISPs
(Instruction Set Processors) and reconfigurable tiles.

This OS must be capable of providing a similar set of services for the
heterogeneous tasks as a traditional OS does for software applications. It is based on RTAI,
a real-time Linux extension.

The hardware tasks are placed into slots and connected to each other via a
network-on-chip. The Hardware Abstraction Layer (HAL) of the operating system
provides communication primitives such as send and receive, as well as control
messages to place a new task and to read or modify the network parameters (Fig. 2.3).
The communication API has been ported both in hardware and software. This
common interface allows a task to be migrated from a software to a hardware processing
element in a transparent way.

The operating system includes a two-level scheduler. The first level dispatches
the tasks to the processing units, whereas local schedulers handle the tasks assigned
to them. At the first level, the scheduler relies on a checkpointing mechanism to save
task contexts. The authors chose this solution because it makes
the context independent of the targeted processing element. At the lower level,
local schedulers may employ processor-specific contexts, since they will never move
tasks to another processor. The definition and the management of the checkpoints
(i.e. the definition of what needs to be saved) is up to the user. We can note that
this information is particularly difficult to define and is still an open issue.
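
The checkpointing idea can be illustrated with a minimal Python sketch. It is not taken from OS4RS: the class `CheckpointedTask`, its `save` and `restore` methods and the dictionary used as context are all hypothetical names chosen for illustration. The only point it makes is that a user-defined context saved at a checkpoint contains no processor-specific state, so the task can resume on a different processing element.

```python
from dataclasses import dataclass


@dataclass
class CheckpointedTask:
    """Hypothetical task summing a list, with a checkpoint after each step."""
    data: list
    index: int = 0
    total: int = 0

    def step(self):
        # One unit of work; the end of each step is a checkpoint.
        self.total += self.data[self.index]
        self.index += 1

    def done(self):
        return self.index >= len(self.data)

    def save(self):
        # User-defined context: only what is needed to resume elsewhere.
        return {"index": self.index, "total": self.total}

    @classmethod
    def restore(cls, data, context):
        return cls(data, context["index"], context["total"])


def run_with_migration(data, migrate_at):
    """Run on a first element, checkpoint, then resume on a second one."""
    task = CheckpointedTask(data)
    while not task.done() and task.index < migrate_at:
        task.step()
    context = task.save()                         # saved by the top-level scheduler
    task = CheckpointedTask.restore(data, context)  # resumed on another element
    while not task.done():
        task.step()
    return task.total
```

Deciding what goes into `save` is precisely the part left to the user, which is why the text above calls it an open issue.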

Figure 2.3: OS4RS platform architecture [Nollet 2003]

In [Steiger 2004], Platzner et al. also introduce an operating system dedicated to
reconfigurable systems and discuss two different points. The first discussion
is about design issues for a reconfigurable hardware operating system. The required
degree of flexibility paired with high computation demands asks for partially
reconfigurable hardware that is operated in a true multitasking manner.

For the authors, it is necessary to define three things: (1) a programming model
dedicated to reconfigurable systems with a set of well-defined system services, (2)
a run-time system to handle the dynamicity of the system and resolve conflicts
between executable objects, and (3) the smallest unit of execution, that is to say a
process or a thread.

They define a hardware thread as a pre-placed and pre-routed digital circuit
which can be loaded and relocated easily in any available slot of the FPGA. A
square is the simplest shape to manage, in spite of the fact that it also leads to more
internal fragmentation than more complex shapes such as polyominoes.
Then they explain that 1-Dimensional (1D) placement makes scheduling
the different threads easier but increases external fragmentation. On the other
hand, 2-Dimensional (2D) placement offers more placement possibilities and therefore less
external fragmentation, but is harder to manage.
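
External fragmentation in 1D placement can be shown with a short Python sketch. This is an illustration only, assuming a hypothetical FPGA modelled as a list of columns and a first-fit policy; neither the policy nor the function names come from [Steiger 2004].

```python
def first_fit_1d(columns, width):
    """Return the first index of a run of `width` free columns, or None."""
    run = 0
    for i, used in enumerate(columns):
        run = 0 if used else run + 1
        if run == width:
            return i - width + 1
    return None


def place(columns, width):
    """Mark `width` contiguous columns as used and return their start index."""
    start = first_fit_1d(columns, width)
    if start is None:
        return None
    for i in range(start, start + width):
        columns[i] = True
    return start


def release(columns, start, width):
    """Free the columns occupied by a thread that has finished."""
    for i in range(start, start + width):
        columns[i] = False


def fragmentation_demo():
    columns = [False] * 8              # hypothetical 8-column device
    place(columns, 3)                  # thread A: columns 0-2
    b = place(columns, 2)              # thread B: columns 3-4
    place(columns, 2)                  # thread C: columns 5-6
    release(columns, b, 2)             # B finishes
    free = columns.count(False)        # columns 3, 4 and 7 are free
    return free, place(columns, 3)     # but no 3 contiguous columns exist
```

In the demo three columns are free, yet a thread of width 3 cannot be placed because the free columns are not contiguous.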

In this paper, they target a real-time scenario where each incoming thread is
either accepted with a guarantee to meet the deadline, or rejected. As, in reality, the
distribution of FPGA resources is not homogeneous, we can assume that at least memory and
FIFOs are managed by the operating system, so that a thread can access
these resources using operating system services: memory allocation and message
queues. They conclude by saying that 1D placement is more realistic with regard to current
FPGA architectures, but that 2D placement is an interesting open issue, since
2D scheduling is much more attractive in terms of performance.

The second discussion deals with hard real-time task scheduling. Target

Figure 2.4: Operating System for Reconfigurable Systems software architecture [Steiger 2004]

between the operating system and the thread. This port is called the Standard Task
Interface (STI). The Task Communication Bus (TCB) runs horizontally through all
hardware thread areas into a number of dummy tasks.

The software operating system is divided into three layers (Fig. 2.4): a first layer
to manage tasks and resources, a second to handle the context issue, and a last
one which is responsible for communication and configuration.

In [Wigley 2001], the authors discuss the problem of scheduling relocatable hardware
tasks by an operating system. They give a specification of an ideal operating system
dedicated to reconfigurable computers. This operating system must provide a
scheduler able to manage explicit context changes, meaning that the user has to insert
checkpoints inside the task source code in order to ensure a correct context save.

In their specification, the operating system is responsible for managing the
virtual memory and protecting the platform's physical resources from conflicting accesses.
Task partitioning must be dynamic, as we must be able to perform load balancing or
task migration from software to hardware and vice versa.

it by initializing direct communication between these kinds of tasks. Otherwise, a
buffer should be used in order to process communication. A last point is the need
for verification tools and test cases, that is to say application examples which could
benefit from Dynamic and Partial Reconfiguration.

Another example of a run-time manager is introduced in [Shiyanovskii 2009b].
Reconfiguration is managed by a software layer on top of the real-time operating system.
This layer is called the Adaptation Manager, and can be customized in order to obtain a
trade-off between power consumption and execution speed. To do so, it relies
on a learning process which allows it to improve its decision skill.

The reconfigurable platform is composed of tiles which abstract the logic block
programming level to give the developer access to coarse-grain primitives
such as filters, FFT⁴ or other higher-level functions. The scheduler policy is based on
priority. Tasks can have three different states (Inactive, Active and Reserved) and
have real-time attributes such as execution time, deadline, or laxity.
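
The task attributes listed above can be captured in a small Python sketch. It rests on several assumptions that are not in [Shiyanovskii 2009b]: laxity is computed with the usual real-time definition (time to deadline minus remaining execution time), a larger `priority` value means a more urgent task, and only Inactive tasks are candidates for scheduling. All names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    INACTIVE = "Inactive"
    ACTIVE = "Active"
    RESERVED = "Reserved"


@dataclass
class Task:
    name: str
    priority: int            # assumption: larger value = more urgent
    execution_time: float    # remaining execution time
    deadline: float          # absolute deadline
    state: State = State.INACTIVE

    def laxity(self, now):
        # Usual definition: slack left if the task started right now.
        return self.deadline - now - self.execution_time


def pick_next(tasks):
    """Priority-based choice among Inactive tasks (assumed candidate set)."""
    candidates = [t for t in tasks if t.state is State.INACTIVE]
    return max(candidates, key=lambda t: t.priority, default=None)
```

A task with a deadline at 10, a remaining execution time of 4 and a current time of 3 has a laxity of 3: it can wait at most three more time units before missing its deadline.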

These works show that an operating system is necessary to manage the hardware
accelerators. This abstraction layer has to take advantage of dynamic
reconfiguration and provide high-level mechanisms to manage the available slots. This means
offering the end-user the ability to create, suspend, resume and delete a hardware
task. At a lower level, a reconfigurable partition should be seen as a processing
element. The operating system should be able to share this resource between all
hardware accelerators, leading us to view a hardware accelerator as the equivalent
of a software thread.

2.1.3 Hardware thread model

Using the ability to control Dynamic and Partial Reconfiguration (DPR), recent
articles have proposed abstraction models for hardware accelerators. The objective
is to improve the programmability of these heterogeneous platforms and to facilitate
the communication between the accelerators and the rest of the system by providing a
default interface.

The authors of [El-Araby 2008] define VFPGAs, an acronym which stands for Virtual
FPGAs. A VFPGA is a reconfigurable zone controlled by a processor (Fig. 2.5).
A VFPGA can be seen as a hardware task. This kind of task has three different
states: configured and waiting for input data (data in), processing, or sending data
(data out).

A virtualization manager is implemented to receive execution requests coming
from processors. It is responsible for loading the VFPGAs. As expected, different
tests show a gain in execution speed.

Figure 2.5: VFPGA runtime manager architecture [El-Araby 2008]

In [Verdoscia 1994], the authors tackle the issue of the hardware implementation of a
Data-Flow Graph (DFG) model of computation (MoC). In a DFG model, a process
can be represented by an actor. Actors communicate by sending each other packets
of data called tokens [Lee 1987]. Although this model is generally static, this paper
defines a dynamic model in which the actors' input and output tokens come from
and go to infinite FIFOs. Every actor has two inputs and a unique output,
which allows three types of links between them to be defined:

• classical link: 2 → 2 (two outputs of two different actors to the inputs of one or two other actors)

• joint link: 2 → 1 (two outputs of two different actors to the inputs of another actor)

• and replica link: 1 → 2 (one output of an actor to the inputs of one or two other actors)
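
The firing rule of such two-input, one-output actors can be sketched in Python. This is a software illustration under stated assumptions, not the hardware design of [Verdoscia 1994]: the `Actor` class, the `joint_link` helper and the use of `deque` objects as unbounded FIFOs are hypothetical.

```python
from collections import deque


class Actor:
    """Two-input, one-output dataflow actor fed by FIFOs."""

    def __init__(self, operation):
        self.operation = operation
        self.inputs = (deque(), deque())
        self.output = deque()

    def can_fire(self):
        # An actor fires only when a token is present on both inputs.
        return all(len(fifo) > 0 for fifo in self.inputs)

    def fire(self):
        left = self.inputs[0].popleft()
        right = self.inputs[1].popleft()
        self.output.append(self.operation(left, right))


def joint_link(src_a, src_b, dst):
    """2 -> 1 link: move one token from each source output to dst's inputs."""
    dst.inputs[0].append(src_a.output.popleft())
    dst.inputs[1].append(src_b.output.popleft())


def run_example():
    add = Actor(lambda x, y: x + y)
    mul = Actor(lambda x, y: x * y)
    sub = Actor(lambda x, y: x - y)
    add.inputs[0].append(1)
    add.inputs[1].append(2)
    mul.inputs[0].append(3)
    mul.inputs[1].append(4)
    for actor in (add, mul):
        if actor.can_fire():
            actor.fire()
    joint_link(mul, add, sub)      # sub receives 3*4 and 1+2
    sub.fire()
    return sub.output[0]           # (3 * 4) - (1 + 2)
```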

Actors are grouped in clusters which communicate by Message Passing. Inside a
cluster, actors are called Functional Units (FUs). These FUs communicate through
a crossbar. Messages exchanged between FUs, and between FUs and the host,
correspond to the graph configuration and the produced tokens. An FU is composed of
three elements (Fig. 2.6):

Figure 2.6: Functional Unit architecture [Verdoscia 1994]

• "Synchronization Unit": it is responsible for checking the presence of the input tokens. Two signals are generated: ABIL if the two tokens are present, and ABOL, with a delay of one cycle, to allow output firing

• "Computation Unit": it is composed of an ALU⁵, a multiplier and one Selection module. If a test is requested and it passes, the output is activated on the arrival of the ABOL signal

The proposed model has three advantages. Firstly, all actors have the same
architecture (two inputs, one output) and therefore the same interfaces with the external world;
secondly, it yields an architecture suited to VLSI; and finally, all actors are able
to manage loops and conditional instructions.

Much more complex accelerators have since been developed, such as Hybrid
Thread [Agron 2009a]. In this article, the authors define a model of POSIX⁶-compliant
hardware thread, capable of processing operating system calls through a shared
memory, as a software thread does (Fig. 2.7). A thread is composed of two finite
state machines (FSMs): one is used to answer operating system requests and get
system call results, and the other to process system calls and get access to a
heap. These FSMs are controlled by the hardware accelerators encapsulated in the
User Logic component.

Figure 2.7: Hybrid Thread model [Agron 2009a]

Heap and stack are stored in an internal Block RAM (BRAM) of the thread.
As in a software POSIX thread, the stack is used to store the system call
parameters. Moreover, in order to enhance the programmability of these threads, the
authors defined a high-level API which allows the developer to describe a
heterogeneous application using the C language. A dedicated compiler written in Python
translates the C code into a VHDL implementation of the Hybrid Thread.

In [Lubbers 2008], the authors introduce an operating system dedicated to
reconfigurable architectures: ReconOS. This operating system provides a homogeneous
abstraction layer to the threads, software or hardware, and allows them to
process system calls. This paper deals with the porting of ReconOS to a Linux-based
platform, and compares its performance with another one based on the eCos
operating system. The goal is to demonstrate the portability of the concepts brought
by the ReconOS architecture.

In this operating system, every service is managed by a software operating
system running on a processor. Hardware system calls are made through an API
described in a VHDL library. The hardware thread finite state machine is
synchronized with the software operating system in order for it to process the system call. The
interface responsible for the communication is called OSIF, for OS InterFace, and
represents a set of registers accessible through the processor bus (Fig. 2.8).

Figure 2.8: ReconOS hardware thread model [Lubbers 2008]

Regarding inter-thread communication, the thread heterogeneity is abstracted
by associating each hardware thread with a software one, which acts as a proxy or a
delegate. When requested by a hardware thread, the system call is executed by the
corresponding software thread.

In order to link the operating system resources requested by the hardware thread with the
ones accessible by the software one, a table of the used instances is maintained by
the delegate. In this way, the same hardware thread can be used by several instances
of a software thread. This mechanism has been implemented in anticipation of the future
use of partial and dynamic reconfiguration.
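
The delegate mechanism can be sketched in Python, with the hardware thread replaced by a list of requests. This is an illustration of the principle only and does not reproduce the ReconOS API: the request format, the call names `sem_post` and `sem_wait`, and the `resources` table mapping handles to operating system objects are assumptions.

```python
import queue
import threading


def delegate(requests, replies, resources):
    """Software proxy executing OS calls on behalf of one hardware thread."""
    while True:
        call, handle = requests.get()
        if call == "exit":
            break
        semaphore = resources[handle]   # table linking handles to OS objects
        if call == "sem_post":
            semaphore.release()
        elif call == "sem_wait":
            semaphore.acquire()
        replies.put((call, handle))     # acknowledge to the hardware side


def run_delegate_demo():
    requests, replies = queue.Queue(), queue.Queue()
    resources = {0: threading.Semaphore(0)}
    proxy = threading.Thread(target=delegate, args=(requests, replies, resources))
    proxy.start()
    # Stand-in for the hardware thread issuing system calls through its interface.
    for request in [("sem_post", 0), ("sem_wait", 0), ("exit", None)]:
        requests.put(request)
    proxy.join()
    return [replies.get_nowait() for _ in range(replies.qsize())]
```

Because the hardware side only refers to resources by handle, the same hardware description can be paired with different delegate instances, each holding its own table.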

2.1.4 Conclusion

As explained in the introduction, our choice is oriented towards the threading model.
Our goal is to propose a hardware thread model which is able to communicate with
software threads in the same way as has been proposed by Hybrid Thread
[Agron 2009a] or ReconOS [Lubbers 2008]. This model has to be adapted to the
reconfigurable platform and take advantage of the parallelism and the flexibility
offered by this type of platform. The definition of this model is the basic proposal
of this thesis and will lead us to define, in the next chapters, an operating system

2.2 Thread model

2.2.1 Process definition

A process is defined as an independent stream of instructions, running on top of a
processing element. A process groups together some of the resources of a processing
element, such as the memory space, the open files, the signal handlers and
other information. Grouping resources inside a single entity facilitates the
management of these resources by the running process [Tanenbaum 2001].

Process execution is protected by the fact that it has a private address space.
Processes are scheduled by the operating system kernel and compete for access
to the processing element. When a process is blocked by a system call, the scheduler
is responsible for saving the context of this process and selecting another process
among the ones ready to be executed.

2.2.2 Thread definition

A thread is executed inside a process (Fig. 2.9). The main difference between a
thread and a process is that the latter has a full view of the memory space
addressable by the processor, whereas threads inside the same process share the processing
element resources owned by the process.

Figure 2.9: Process and Thread

A threading model has the advantage of isolating the execution of application
functions from one another, and so encourages parallelism when targeting
multicore platforms. It improves programmability by dividing the application into
several tasks. In addition, a thread is easier to create or destroy than a process. A
simple representation of the thread life cycle is depicted in Figure 2.10.

Moreover, as a thread is a sub-entity of a process, it has a smaller context to
save than the latter. Indeed, it does not have to manage global resources such as

Figure 2.10: Thread life cycle

2.2.3 Software thread model

Generally, there are two ways to implement a thread model in an operating system:
either in user space or in kernel space.

2.2.3.1 User thread model

In the user thread model, the operating system kernel is only aware of a single thread
in the process. Threads are scheduled by a threads library implemented in user
space. The advantage of this model is that there is no need to modify the operating
system, which is interesting if the latter does not support the thread execution model.

Figure 2.11: User Thread model

The user-level scheduler allows only one thread to be actively running in the
process at a time. There is one thread table per process, which allows a fast context

cannot manage a clock interrupt, round-robin scheduling cannot be implemented.
Regarding parallelism, the main drawback of this model is that a blocking call
from a thread would block all the threads implemented inside the same process.
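
A user-level threads library can be approximated in a few lines of Python using generators. This is a sketch under assumptions, not a real threads library: each "thread" is a generator and gives up the processor voluntarily with `yield`, because the user-level scheduler has no clock interrupt with which to preempt it.

```python
from collections import deque


def user_level_scheduler(threads):
    """Run generator-based threads until all of them have finished.

    Each `yield` hands control back to the scheduler; a thread that never
    yields (for instance one stuck in a blocking call) starves the others.
    """
    ready = deque(threads)
    trace = []
    while ready:
        thread = ready.popleft()
        try:
            trace.append(next(thread))   # run until the next voluntary yield
            ready.append(thread)         # still alive: back to the ready queue
        except StopIteration:
            pass                         # thread finished
    return trace


def worker(name, steps):
    for i in range(steps):
        yield f"{name}{i}"               # cooperative switch point
```

The kernel sees only the single flow of control running `user_level_scheduler`, which is why a blocking system call made inside one `worker` would stop every other worker as well.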

2.2.3.2 Kernel thread model

In the kernel thread model, kernel threads are separate tasks which are associated
with a process. In a kernel thread model, one kernel thread per process is created.
The process table and the thread table are both managed at the kernel level. A
preemptive scheduling policy is used, in which the operating system decides which
thread is eligible to share the processor.

Figure 2.12: Kernel Thread model

Moreover, when a thread performs a blocking call, its state is notified to the
kernel, which can decide to preempt the thread in favor of another ready thread. As
threads are managed at the kernel level, the drawback is that system call costs are
higher than in the user thread model.

2.2.3.3 Hybrid thread model

In a hybrid thread model, several user-level threads run on top of a kernel
thread (Fig. 2.13). A commonly used hybrid thread model is the POSIX threads
specification (Pthreads). POSIX stands for Portable Operating System Interface.
Threads are user-level threads but are managed using kernel-assisted
context switching. This means that when a thread performs a system call, if the call is
non-blocking, the thread relies on the user-level API. Otherwise, the kernel thread is
notified that the thread is blocked and the kernel scheduler can try to find another
process in which at least one thread is runnable. This solution is more complex to
implement but tries to combine the best of the two models.

Figure 2.13: Hybrid Thread model

ation as the kernel thread structure is lighter than the process one. On the other
hand, performance is lower due to the necessity to regularly switch from user
mode to kernel mode. The hybrid thread model, like POSIX, tends to be adopted
because the memory footprint becomes negligible with regard to the available resources,
and above all because it is a widely used standard in the computing domain, the
adoption of a standard being a good thing for improving application
portability.

2.2.4 Thread attributes

2.2.4.1 Storage structures

At the time of its creation, a thread is associated with two storage structures:

• a Data structure: Data is where all of the program variables are stored. It is broken down into storage for global and static variables (static), storage for dynamically allocated storage (heap), and storage for variables that are local to the function.

• a Stack structure: The stack contains data about the program or procedure call flow in a thread. The stack, along with local storage, is allocated for each thread created. While in use by a thread, the stack and local storage are considered to be thread resources. When the thread ends, these resources return to the process for subsequent use by another thread.

2.2.4.2 Thread-private data

• Thread identifier: A unique number that can be used to identify the thread.

• Priority: if the operating system allows specification of a thread priority, this value determines the relative importance of one thread to other threads in the application.

• Call stack: The call stack contains data about the program flow or procedure call flow in the thread.

2.2.4.3 Thread-specific data (TLS)

Threads can have their own view of data items called thread-specific data.
Thread-specific data is different from thread-private data. The threads implementation
defines the thread-private data at the kernel level, while the application defines the
thread-specific data. Threads do not share thread-specific storage, but all functions
within that thread can access it.

Due to the design of the application, threads may not function correctly if they
share the global storage of the application. If eliminating the global storage is not
feasible, using thread-specific data is a good alternative.
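
As an illustration, Python's standard `threading.local` class offers thread-specific storage of this kind. The sketch below is only an example of the concept; the function and variable names are hypothetical.

```python
import threading


def thread_specific_demo():
    tls = threading.local()      # each thread gets its own view of `tls`
    seen = {}

    def worker(name):
        tls.value = name         # visible only to the thread that set it
        seen[name] = tls.value   # any function in this thread could read it

    threads = [threading.Thread(target=worker, args=(n,)) for n in ("a", "b")]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    # The main thread never set `value`, so it does not see the attribute.
    return seen, hasattr(tls, "value")
```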

2.2.5 Synchronization techniques among threads

Even if an application is thread-safe, in order to keep good performance, some
global resources have to be shared between threads. In this case, the most important
aspect of programming becomes the ability to synchronize threads. Synchronization
is the cooperative act of two or more threads that ensures that each thread reaches
a known point of operation relative to other threads before continuing.

Threads can be synchronized using operating system services. These services
assure the developer that critical resources are accessed in a safe way, and allow
threads to communicate. The most common synchronization primitives are:

• Mutexes

• Semaphores

• Condition variables

• Threads as synchronization primitives

• Message Passing

2.2.5.1 Mutexes

en- application code at a time. The mutex is usually logically associated with the data
it protects by the application.

Create, lock, unlock, and delete are operations typically performed on a mutex.
Any thread that successfully locks the mutex is the owner until it unlocks the mutex.
Any thread that attempts to lock the mutex waits until the owner unlocks the mutex.
When the owner unlocks the mutex, control is returned to one waiting thread, with
that thread becoming the owner of the mutex. There can be only one owner of a
mutex at a time.
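
A minimal Python sketch of this behaviour, using the standard `threading.Lock` as the mutex; the counter and the function name are illustrative only.

```python
import threading


def locked_increment(n_threads=4, n_iterations=1000):
    counter = 0
    mutex = threading.Lock()          # create

    def worker():
        nonlocal counter
        for _ in range(n_iterations):
            with mutex:               # lock: wait until the owner unlocks
                counter += 1          # only one thread updates the data at a time
            # leaving the `with` block unlocks the mutex

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter
```

The mutex is associated with `counter` only by convention in the application code, which matches the remark above that the association is logical.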

2.2.5.2 Semaphores

Semaphores can be used to control access to shared resources. A semaphore can be
thought of as an intelligent counter. Every semaphore has a current count, which is
greater than or equal to zero.

Any thread can decrement the count by locking or taking the semaphore.
Attempting to decrement the count past 0 causes the calling thread to wait for another
thread to unlock the semaphore. In the same way, any thread can increment the
count by unlocking or posting the semaphore. Posting a semaphore may wake up a
waiting thread if there is one present.

In their simplest form (with an initial count of 1), semaphores can be thought of
as a mutual exclusion (mutex). The important distinction between semaphores and
mutexes is the concept of ownership. No ownership is associated with a semaphore.
Unlike mutexes, it is possible for a thread that never took the semaphore to post
the semaphore.
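
The absence of ownership can be shown with a short Python sketch based on the standard `threading.Semaphore`; the function name and the event log are illustrative.

```python
import threading


def semaphore_handoff():
    semaphore = threading.Semaphore(0)   # count starts at zero
    events = []

    def waiter():
        semaphore.acquire()              # count is 0: the waiter blocks here
        events.append("waiter woke up")

    thread = threading.Thread(target=waiter)
    thread.start()
    events.append("main posts")
    semaphore.release()                  # main never took the semaphore
    thread.join()
    return events
```

Here the main thread posts a semaphore it never took, which would not be a valid use of a mutex.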

2.2.5.3 Condition variables and threads

Condition variables allow threads to wait for certain events or conditions to occur,
and they notify other threads that are also waiting for the same events or conditions.
A thread can wait on a condition variable, or broadcast a condition such that
one or all of the threads that are waiting on the condition variable become active.

Condition variables do not have ownership associated with them and are usually
stateless. A stateless condition variable means that if a thread signals a condition
variable to wake up a waiting thread when there are currently no waiting threads,
the signal is discarded and no action is taken. The signal is effectively lost. It is
possible for one thread to signal a condition immediately before a different thread
begins waiting for it, without any resulting action.
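
The lost-signal behaviour, and the usual remedy of pairing the condition variable with an explicit state flag, can be sketched with Python's standard `threading.Condition`. The function names and the `ready` flag are illustrative assumptions.

```python
import threading


def lost_signal(timeout=0.05):
    condition = threading.Condition()
    with condition:
        condition.notify()               # nobody is waiting: signal discarded
    with condition:
        return condition.wait(timeout)   # times out, the signal was not kept


def signal_with_state(timeout=1.0):
    condition = threading.Condition()
    ready = False                        # explicit state kept by the application

    def signaller():
        nonlocal ready
        with condition:
            ready = True
            condition.notify_all()

    thread = threading.Thread(target=signaller)
    thread.start()
    thread.join()                        # the signal happens before the wait
    with condition:
        # The predicate is checked first, so the early signal is not missed.
        return condition.wait_for(lambda: ready, timeout)
```

`lost_signal` returns `False` because the wait times out, whereas `signal_with_state` returns `True` even though the notification was sent before the wait started.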

2.2.5.4 Threads as synchronization primitives

Threads themselves can be used as synchronization primitives when one thread
specifically waits for another thread to complete. The waiting thread does not

2.2.5.5 Message Passing

A message passing API can be implemented on top of the previous mechanisms.
Threads can use this higher abstraction layer to synchronize and exchange data.
This API provides blocking or non-blocking primitives to transparently send or
receive messages from one thread to another. The implementation can be realized using
either the shared memory paradigm or a network protocol if a dedicated network is
available.
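
A shared-memory version of such an API can be sketched in Python on top of the standard `queue.Queue`, which is itself built from a lock and condition variables. The `send` and `receive` helpers are hypothetical names, not the API of any particular operating system.

```python
import queue
import threading


def send(mailbox, message):
    mailbox.put(message)


def receive(mailbox, blocking=True, timeout=None):
    """Return the next message, or None if none is available in time."""
    try:
        return mailbox.get(block=blocking, timeout=timeout)
    except queue.Empty:
        return None


def ping_pong():
    to_worker, to_main = queue.Queue(), queue.Queue()

    def worker():
        # Blocking receive, then reply with the transformed message.
        send(to_main, receive(to_worker).upper())

    thread = threading.Thread(target=worker)
    thread.start()
    send(to_worker, "ping")
    reply = receive(to_main, timeout=1.0)
    thread.join()
    return reply
```

The non-blocking form, `receive(mailbox, blocking=False)`, returns immediately with `None` when the mailbox is empty, so the caller can continue with other work.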

2.2.6 Conclusion

Finally, for a hardware thread to be considered as the equivalent of a software thread,
the operating system managing the hardware threads has to give them access to the
same services as the software ones. The hardware thread model has to take
this into account, and specify additional mechanisms which allow the developer to
process system calls.

2.3 Our Hardware Thread model

2.3.1 Context: The FOSFOR project

2.3.1.1 Presentation

The FOSFOR project is an ANR⁷ project started in January 2008 and completed
in December 2011. This is a collaboration between four partners: Thales Research
and Technology France based in Palaiseau, the ETIS lab located in Cergy-Pontoise,
the CAIRN from Lannion, and the LEAT based in Nice Sophia-Antipolis.

FOSFOR stands for Flexible Operating System FOr Reconfigurable platform.
The aim of this project is to define a new kind of heterogeneous platform. This
platform is heterogeneous in the sense that threads and operating systems could be
implemented either in software (running on one of the processors), or in hardware
(running in a partition of the FPGA).

Each part could then be adapted to the deployed application. The goal
is to propose a homogeneous programming model for the application. This
architecture is intended to demonstrate the viability of reconfigurable architectures with regard to the
complexity of the development process.

2.3.1.2 Platform architecture

The FOSFOR architecture is composed of multiple processing elements connected
to a central bus (Fig. 2.14). We distinguish software processing elements and
hardware processing elements. They respectively implement a software and a hardware
version of the RTEMS⁸ [RTE 1988] operating system. On each processor, a software
operating system manages classic software threads, whereas a hardware operating
system (HwOS) is able to manage reconfigurable partitions. Hardware accelerators
are scheduled into these partitions.

⁷ Agence Nationale pour la Recherche

Figure 2.14: FOSFOR platform architecture

The objective is to provide a homogeneous view of threads at the user level.
To achieve this, we abstract hardware accelerators into hardware threads. The
architecture of these hardware threads is defined in detail in Sections 2.3.2 and
2.3.3.

2.3.1.3 High-level communication mechanisms

Communication between threads can be handled in two ways. For
synchronization and small data transfers, threads can rely on the operating system services.
These services can be locally managed or shared between all processing elements.
For larger amounts of data, a middleware layer provides a message passing API with
Send and Receive primitives.

This middleware layer (Mw) is inserted between the application layer, based
on POSIX threads, and the operating system services API. If a thread wants to
communicate with another one, it can use the simple middleware API with its
transparent message passing protocol, or it can directly access the operating
system services, such as Mutex or Message Queue primitives.

This high-level API composed of these two types of primitives has been ported

software and hardware code. For instance, basing the description of the application
on the components can be a good solution to facilitate the implementation of
heterogeneous applications on HRSoC platforms.

In software, the MPCI⁹ layer included in RTEMS is the base of the heterogeneous
communication. It provides transparent access to distant services. We extended
it to the hardware implementation of the services. The bridge has to be transparent
in order to abstract both the location and the heterogeneity of the application threads. The
location of each hardware thread, which changes dynamically depending on the available
slots, is dynamically managed and abstracted by the middleware layer.

2.3.2 Hardware Thread specifications

2.3.2.1 Objectives

In order to reduce the programming complexity of the HRSoC, hardware
accelerators have to adopt the same behaviour as their software counterparts. To do so,
they should be able to obey the orders of the operating system. They must also have
the ability to call operating system services available in the whole platform, read
and write data from and to memories, and, specifically, they should be associated
with an interface allowing the developer to control the execution of these hardware
accelerators. The hardware thread life cycle should be equivalent to the software
one. All these features and interfaces are assembled in order to encapsulate the
accelerator and so to define what we call a hardware thread.

2.3.2.2 Definition

We defined a hardware thread to take advantage of the dynamic reconfiguration
provided, for instance, in Xilinx FPGAs. It is composed of two main parts:
a static part which contains all the interfaces with the platform, and a dynamic
application-specific part, which contains the Accelerator, the Finite State Machine
(FSM) controlling its execution, and a private memory (Fig. 2.15).

Compared to a software thread, a hardware thread will run on a reconfigurable
partition. This reconfigurable partition can be compared to a process, in which the
logic resources are equivalent to the processor resources shared between all the threads
running inside this process. In this scheme, a set of reconfigurable partitions is a
processing element containing several processor cores. A parallel can be drawn
between a reconfigurable partition and a processor core.

Static interfaces correspond to the user-level API. They give the thread
access to the operating system and communication services. The User FSM is
the sequential code executed by the thread and, finally, the dual-port memory
connected both to the Accelerator and the Network Interface is used as the heap
and stack storage by the thread.