HAL Id: tel-01019909
https://tel.archives-ouvertes.fr/tel-01019909
Submitted on 27 Nov 2014
Hardware and software architecture facilitating the operation by the industry of dynamically adaptable heterogeneous embedded systems.
Laurent Gantel
To cite this version:
Laurent Gantel. Hardware and software architecture facilitating the operation by the industry of
dynamically adaptable heterogeneous embedded systems. Signal and Image processing. Université
de Cergy Pontoise, 2014. English. ⟨NNT : 2014CERG0684⟩. ⟨tel-01019909⟩
Université de Cergy-Pontoise
PhD Thesis
Hardware and Software Architecture for Heterogeneous
and Dynamically Reconfigurable Systems-on-Chip
by
Laurent Gantel
Equipes Traitement de l'Information et Systèmes (ETIS)
CNRS UMR8051
Embedded System Lab (ESL)
THALES Research & Technology FRANCE
Thesis defended on 14th January, 2014
M. Gilles Sassatelli Reporter
M. Frédéric Petrot Reporter
M. Daniel Chillet Examiner
M. Guy Gogniat Examiner
M. François Verdier Director
M. Fabrice Lemonnier Director
Tell me and I forget, teach me and I may remember, involve me and I learn.
Abstract
This thesis aims to define software and hardware mechanisms that help manage Dynamic and Heterogeneous Reconfigurable Systems-on-Chip (DHRSoC). The heterogeneity is due to the presence of both general processing units and reconfigurable IPs. Our objective is to provide the application developer with an abstracted view of this heterogeneity regarding the mapping of tasks onto the available processing elements. First, we homogenize the user interface by defining a hardware thread model. Then, we pursue with the homogenization of hardware thread management. We implemented OS services that save and restore a hardware thread context. Design tools have also been developed in order to overcome the relocation issue. The last step consisted in extending access to the distributed OS services to every thread running on the platform. This access is provided independently of the thread location and is realized by implementing the MRAPI API. With these three steps, we build a solid basis to provide the developer, in future work, with a design flow dedicated to DHRSoC that allows precise architectural space explorations. Finally, to validate these mechanisms, we realize a demonstration platform on a Virtex 5 FPGA running a dynamic tracking application.
Résumé
This thesis addresses the definition of software and hardware mechanisms that ease the management of heterogeneous and dynamically reconfigurable systems-on-chip (DHRSoC). The heterogeneity of these architectures shows in the presence of both general-purpose processors and reconfigurable hardware modules. Our objective is to let an application developer abstract away this heterogeneity when allocating tasks to the available processing units. This abstraction begins with a first phase that homogenizes the user interfaces (API) and defines a hardware thread model. The homogenization then continues with the management of these hardware threads. We implemented services at the operating system (OS) level to save and restore the context of a hardware thread. Design tools were also developed to overcome the problem of relocating a hardware thread within an FPGA. Finally, the last step extended access to the services offered by all the OSes distributed across the platform to every thread running on it, regardless of its location. This was achieved through an original implementation of the MRAPI API. With these three steps, we have laid a solid basis for offering the developer, in the future, a design flow dedicated to DHRSoC architectures that enables a precise architectural exploration of the system. Finally, to test these mechanisms, we built a demonstration platform on a Virtex 5 FPGA featuring a target tracking application.
Acknowledgements
I would first like to thank my thesis supervisors: Amine Benkhelifa, who introduced me to the world of research and always pushed me to go further, from my first university years to the end of this doctorate, and who guided and motivated me throughout this thesis; François Verdier, whose advice and remarks helped me carry this project through; and Fabrice Lemonnier, who trusted me and welcomed me into the LSE laboratory at Thales Research and Technology during my Master's degree and my thesis.
Thanks also to the members of the jury who did me the honour of evaluating my work: Gilles Sassatelli and Frédéric Petrot, who agreed to be the reporters, and Daniel Chillet and Guy Gogniat, who were the examiners.
I particularly wish to thank my office colleagues. Amel Khiar was always there to encourage me, and I spent excellent moments with her. I thank her again for her infectious good mood and for everything she brought me during all these years. Many thanks to Liang Zhou, whom I came to know and greatly appreciate over time. Thanks also to Lounis Zerioul, Guy Wassi, and Christian Gamom, who were also very present and who became true friends over time.
I address my thanks to the members of TRT whom I had the chance to work alongside, with whom I was able to collaborate in a pleasant working environment, and whose varied skills were very useful and above all very instructive to me, among them Jimmy Le Rhun, Christophe Clienti, Paul Brelet, Rémi Barrere, Téodora Petrisor, Philippe Millet, Philippe Bonnot and Lionel Thavot, as well as to the members of the ETIS laboratory, including Frédéric de Melo, Lounis Kessal, Emmanuel Huck, Samuel Garcia, Thomas Lefebvre, Kaouthar Bousselam, Laurent Rodriguez, Benoit Miramond, Lotfi Bendaouia and Fakhreddine Ghaffari.
Part of these thanks goes to the members of the FOSFOR project with whom I worked regularly: Fabrice Muller, Daniel Chillet, Sébastien Pillement and Nicolas Knecht.
Finally, I wish to express all my gratitude to my family and my loved ones
Contents
1 Introduction
1.1 Context
1.1.1 Real-time applications for embedded systems
1.1.2 Heterogeneous Systems-on-Chip
1.1.3 Modern FPGAs
1.1.4 Dynamic and Partial Reconfiguration
1.2 HSoC programming model
1.2.1 Programming issue
1.2.2 Dynamically Reconfigurable HSoC
1.3 Objectives
2 Unified Thread Model
2.1 Related work
2.1.1 Software kernel management
2.1.2 Run-time manager
2.1.3 Hardware thread model
2.1.4 Conclusion
2.2 Thread model
2.2.1 Process definition
2.2.2 Thread definition
2.2.3 Software thread model
2.2.4 Thread attributes
2.2.5 Synchronization techniques among threads
2.2.6 Conclusion
2.3 Our Hardware Thread model
2.3.1 Context: The FOSFOR project
2.3.2 Hardware Thread specifications
2.3.3 Hardware Thread architecture
2.4 Hardware Thread programming model
2.4.1 Operating System services protocol
2.4.2 Network communication protocol
2.4.3 Accelerator interface
2.5 Conclusion
3 Hardware threads preemption using Dynamic and Partial Reconfiguration
3.1 Introduction
3.2 Related works
3.2.3 Design tools
3.3 FPGA reconfiguration knowledge
3.3.1 Virtex 5 FPGA resources
3.3.2 FPGA configuration
3.3.3 Bitstream parser
3.4 Preemption mechanisms
3.4.1 Context management service
3.4.2 Reconfiguration service
3.4.3 Relocation Service
3.5 Design flow for hardware threads relocation
3.5.1 Standard flow
3.5.2 Problematics
3.5.3 Relocation flow
3.5.4 Experimented tools
3.5.5 Adapted Isolation Design Flow
3.6 Conclusion
4 Operating System for Dynamically and Reconfigurable Heterogeneous SoC
4.1 Context and definitions
4.1.1 Kernel structure
4.1.2 Thread API
4.2 Related works
4.2.1 Introduction
4.2.2 Inter-core communication in MPSoC
4.2.3 HRSoC middlewares
4.2.4 Hybrid OS for HRSoC
4.2.5 Conclusion
4.3 Specifications
4.3.1 Objectives
4.3.2 Programming model
4.3.3 Memory constraints
4.3.4 Architecture
4.3.5 Portability
4.4 Conception
4.4.1 Operating system architecture
4.4.2 Platform architecture
4.4.3 Multicore layer
4.5 Implementation
4.5.1 Modular operating system: MutekH
4.5.2 MRAPI Specification
4.5.3 Hardware architecture
4.5.6 MRAPI types
4.5.7 Resources system calls
4.6 Conclusion
5 Application deployment
5.1 Introduction
5.2 Platform building
5.2.1 Microblaze platform
5.2.2 Read and Write timings
5.2.3 System calls
5.2.4 Hardware Threads encapsulation
5.3 Tracking application
5.3.1 Presentation
5.3.2 The Camshift IP
5.3.3 The DVI IP
5.3.4 Application deployment
5.3.5 Results and performances
5.4 Conclusion
6 Conclusions
6.1 Summary
6.1.1 Discussion
6.1.2 Key contributions
6.1.3 Hypothesis and Limitations
6.2 Future Work
A Network Interface API
A.1 Supported requests
A.1.1 Write request
A.1.2 Read request
A.1.3 Read request response
A.1.4 Receive request
B Hardware CRC
B.1 Relocation process
B.2 CRC computation
B.3 Hardware CRC module
List of Figures
1.1 Partial and Dynamic Reconfiguration (PDR) application example [Xilinx 2010a]
1.2 Design flow from developer's point of view
1.3 Xilinx Zynq 7000 EPP block diagram
1.4 Dynamic and Partial Reconfiguration principle
1.5 Abstraction level differences between hardware and software programming models
1.6 Heterogeneous threading application
1.7 Hardware Thread preemption
2.1 µC-Linux ICAP driver [Bergmann 2003]
2.2 RAPTOR software architecture [Rana 2007]
2.3 OS4RS platform architecture [Nollet 2003]
2.4 Operating System for Reconfigurable Systems software architecture [Steiger 2004]
2.5 VFPGA runtime manager architecture [El-Araby 2008]
2.6 Functional Unit architecture [Verdoscia 1994]
2.7 Hybrid Thread model [Agron 2009a]
2.8 ReconOS hardware thread model [Lubbers 2008]
2.9 Process and Thread
2.10 Thread life cycle
2.11 User Thread model
2.12 Kernel Thread model
2.13 Hybrid Thread model
2.14 FOSFOR platform architecture
2.15 Hardware Thread Architecture
2.16 OSSC architecture
2.17 Software and Hardware Thread States
2.18 Hardware Thread FSM example
2.19 Hardware Thread HDL files example
2.20 Network Interface architecture
2.21 OSSC Status Word content
2.22 System Call procedure
2.23 System Call procedure steps
2.24 Network Interface Send and Receive protocol
2.25 Network Interface Write and Read protocol
2.26 Parallel processing using pipelining
3.2 (a) Implementation of PRR-PRR relocation (b) Top-Level block diagram of ARC [Kallam 2009]
3.3 ICAP accelerators solutions [Liu 2009]
3.4 FaRM architecture [Duhem 2011]
3.5 Uparc architecture [Bonamy 2012]
3.6 ICAP Hard Macro block diagram [Hansen 2011]
3.7 RapidSmith screen capture [Lavin 2011]
3.8 OpenPR screen capture from FPGA Editor [Sohanghpurwala 2011]
3.9 Isolation Design Flow screen capture from FPGA Editor [Corbett 2012]
3.10 Slice-L and Slice-M [Xilinx 2009]
3.11 FPGA organization
3.12 Type 1 Packet Header Format [Xilinx 2009b]
3.13 Type 2 Packet Header Format [Xilinx 2009b]
3.14 Frame address [Xilinx 2009b]
3.15 Resources memory configuration for the Virtex 5 architecture
3.16 Frame composition [Xilinx 2009b]
3.17 Multiple Rows bitstream content
3.18 ICAP driver for Partial Reconfiguration
3.19 Partial bitstream relocation process
3.20 Partial reconfiguration: Partition and modules
3.21 Proxy Macro Placed and Routed example
3.22 Slice Macro
3.23 PlanAhead Slice Macro placement
3.24 Static route through Reconfigurable Partition
3.25 Relocation flow
3.26 Static place
3.27 XDL File structure
3.28 Internal and external switch matrices
3.29 PIP types
3.30 XDL Net example
3.31 Trusted routes
3.32 Test design
3.33 Software Bus Macro implementation
3.34 Routed software Bus Macro
3.35 Hardware Bus Macro extraction
3.36 Hardware Bus Macro extraction and homogenization
3.37 Adapted Isolation Design Flow
3.38 Design test - Partition isolation
4.1 Toppers/FMP [Tomiyama 2008]
4.2 SMP System [Huerta 2008]
4.3 ICPC Service [Lin 2009]
4.6 Self-reconfigurable platform [Shiyanovskii 2009a]
4.7 System framework overview [Guerin 2009a]
4.8 Hardware Dependant Software layer [Senouci 2006]
4.9 MCAPI for MPSoC [Matilainen 2011]
4.10 Hybrid Threads platform [Agron 2009b]
4.11 User point of view
4.12 Platform memory architecture
4.13 Syscall Procedure
4.14 Server types
4.15 OS Server Architecture
4.16 Message Template
4.17 Study Case Platform
4.18 Distant system call
4.19 Scenario 1 platform
4.20 Scenario 1 datagram
4.21 Scenario 2 platform
4.22 Scenario 2 datagram
4.23 Scenario 3a platform
4.24 Scenario 3a datagram
4.25 Scenario 3b platform
4.26 Scenario 3b datagram
4.27 Operating system architecture
4.28 MutekH global view
4.29 Homogeneous NoC-based Platform
4.30 Heterogeneous NoC-based Platform
4.31 MRAPI library file structure
4.32 MRAPI local tables
4.33 Requests management proxies
5.1 Demonstration platform
5.2 Microblaze platform
5.3 Read and write test platform
5.4 Bridge PLB-NoC architecture
5.5 Hardware platform used to test system calls procedures
5.6 Hardware MRAPI global architecture
5.7 MRAPI remote call sections
5.8 Target Tracking Application
5.9 Binary Long Object (Blob)
5.10 Pipelined Camshift hardware node
5.11 Pipelined Camshift User FSM
5.12 Integration of the DVI IP in the Demonstration Platform
5.13 Application deployment
5.16 Detailed application deployment
6.1 Hardware node implementation choices
A.1 Write request packet
A.2 Read request packet
A.3 Read request response
List of Tables
1.1 Processing Elements comparison regarding control ability, performances and general programmability
1.2 Platform technology comparison regarding control cost, flexibility and performances
3.1 Bitstream header contents
3.2 Bitstream initialization commands
4.1 Resources table example
5.1 Software layers footprints
5.2 Code execution time for a Microblaze processor (ML506, 125 MHz)
5.3 Timings in cycles to write into platform memories
5.4 Timings to read from platform memories
5.5 Network Interface Communication Measurements
5.6 NoC Send timings for 1 KB data
5.7 HwMRAPI Resources usage
5.8 Timings to locally initialize a node
5.9 Timings to access a local Mutex resource
5.10 Timings to access a remote Mutex resource
5.11 Detailed timings to access a remote Mutex resource
5.12 Hardware Thread Resources Usage
5.13 Demonstration Platform resource utilization
5.14 Hardware Thread Resources Usage
5.15 Camshift slot resource utilization
5.16 Application timings
B.1 ICAP register involved in CRC computation
1 Introduction
1.1 Context
1.1.1 Real-time applications for embedded systems
Applications for embedded systems dedicated to image and signal processing are becoming increasingly complex. The amount of data processed by these systems keeps growing, so developers need more and more computing power. This is the case, for instance, of monitoring systems and of automotive or radar applications. This leads to the design of new computing systems able to meet the high performance constraints imposed by these applications and their environment.
In order to satisfy these constraints, applications must be profiled and divided into several tasks. Each task that is considered responsible for a failure to meet the constraints has to be implemented separately on a dedicated processing unit. For instance, communicating systems such as a network switch have to handle several protocols, transfer information at high rates and process large amounts of data. To achieve good performance and gain flexibility, communication protocol stacks may be implemented in hardware and take advantage of partial and dynamic reconfiguration (Fig. 1.1).
In general, the multiplicity of features needed by the end-users and mostly the [...] global performance of the application. However, the drawback is that it complicates the development process.
Figure 1.1: Partial and Dynamic Reconfiguration (PDR) application example [Xilinx 2010a]
Another constraint is the need for flexibility, or more precisely, for adaptability. The complexity of the applications requires the parameters and the provided features of these systems to be adaptable. For example, the computation power can depend on the required quality of service, and the power consumption of a system can be adjusted according to its environment or to random events. Also, as embedded systems are more and more integrated in our environment, human and environmental interactions require these systems to adapt themselves to the various queries and needs that this implies.
In contrast, designers want a simple view of their application which abstracts the platform specificity, especially the heterogeneity (Fig. 1.2). The aim is to dissociate the functional validation of the application from the design exploration of its implementation.
In the functional validation, tasks are described by high-level execution parameters such as the execution time, the deadline, or the priority. During the design exploration, these parameters are completed with new ones, like the power consumption or the memory usage, for one or several possible partitionings. These two points lead us to consider the design of heterogeneous systems-on-chip and the [...]
Figure 1.2: Design flow from developer's point of view
1.1.2 Heterogeneous Systems-on-Chip
Platforms based on different processing elements are called Heterogeneous Systems-on-Chip (HSoC). In such a platform, the application is divided into tasks. Whereas some tasks are implemented as hardware accelerators and allocated into a partition of the chip, others run as software tasks on computing processor elements. A hardware accelerator is defined as a hard-wired function developed to accelerate the processing of a task. A computing processor unit can be a General Purpose Processor (GPP), a specialized one like a Digital Signal Processor (DSP), a Graphics Processing Unit (GPU) or a simple Micro-Controller Unit (MCU).
Each of these processing elements is more or less suited to certain types of tasks [Leon Adams 2007]. The hardware accelerator is well suited to intensive processing tasks, especially tasks whose operations can be parallelized. On the contrary, it can hardly be used for control-intensive tasks, which are better suited to a GPP. Homogeneous tasks with a low data dependency can be easily and efficiently parallelized on a GPU, whereas heterogeneous tasks with complex data paths are not recommended for this architecture. Simple control tasks processing small and well-ordered data would likely be implemented on a Micro-Controller Unit. By combining these different processing elements, it is possible to adapt the deployment of the application to execution time constraints or to the available memory and logic resources. Table 1.1 summarizes the strengths and weaknesses of each processing [...]
Processing Element   Control   Performances   Programmability
GPP                  +++       +              +++
GPU                  +         ++             ++
DSP                  +         ++             ++
MCU                  +++       +              +++
Hw. Acc.             +         +++            +
Table 1.1: Processing Elements comparison regarding control ability, performances and general programmability
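As an illustration only, the ratings of Table 1.1 can be turned into a small ranking helper. The sketch below assumes that each "+" counts as one point and that a task is described by hypothetical weights on the three criteria; neither the function `rank_elements` nor the weighting scheme comes from the thesis.

```python
# Ratings transcribed from Table 1.1, one point per '+' sign (assumption).
SCORES = {
    "GPP": {"control": 3, "performance": 1, "programmability": 3},
    "GPU": {"control": 1, "performance": 2, "programmability": 2},
    "DSP": {"control": 1, "performance": 2, "programmability": 2},
    "MCU": {"control": 3, "performance": 1, "programmability": 3},
    "HW_ACC": {"control": 1, "performance": 3, "programmability": 1},
}


def rank_elements(weights):
    """Return the processing elements sorted by weighted score, best first.

    `weights` is a hypothetical task profile mapping a criterion name to its
    importance. Ties keep the order of the table because the sort is stable.
    """
    def total(name):
        return sum(weights.get(key, 0) * value
                   for key, value in SCORES[name].items())

    return sorted(SCORES, key=total, reverse=True)


# A compute-bound task cares about raw performance only.
compute_bound = rank_elements({"performance": 1})
# A control-dominated task cares about control and ease of programming.
control_bound = rank_elements({"control": 1, "programmability": 1})
```

With these weights, the hardware accelerator comes first for the compute-bound profile, while the GPP and the MCU lead for the control-dominated one, which matches the qualitative discussion above.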
[...] platform which includes all these components can be implemented using different technologies: an Application Specific Integrated Circuit (ASIC), a Multi-Processor System-on-Chip (MPSoC), or a Field Programmable Gate Array (FPGA).
ASIC technology offers great performance but is very expensive and not flexible at all. In this document we consider an MPSoC as a SoC made up of at most a dozen cores, like the OMAP5430, which is based on a Cortex-A15 multiprocessor core [Instrument 2011]. MPSoCs are less efficient but cheaper and more flexible regarding task placement, and software bugs can be fixed after deployment. The FPGA is a good trade-off between the ASIC and MPSoC choices: it is flexible, it provides better performance than an MPSoC, and both software and hardware bugs can be fixed after the system has been placed on the market. Table 1.2 summarizes the strengths and weaknesses of each technology.
Technology   Cost   Flexibility   Performances
ASIC         +      +             +++
FPGA         ++     +++           ++
MPSoC        +++    ++            +
Table 1.2: Platform technology comparison regarding control cost, flexibility and performances
The solution which interests us is the FPGA technology. The exact reasons for this choice, namely the characteristics, the potential and the pros and cons of the latest FPGA families, are detailed in the next subsection.
1.1.3 Modern FPGAs
An FPGA is a reconfigurable chip composed of several logic elements whose [...]
A modern FPGA is a matrix of resources arranged in parallel columns. Each column contains either configurable logic blocks (CLB), block RAM memories (BRAM) or dedicated digital signal processing (DSP) blocks. For this platform, we define a hardware accelerator as a hard-wired function using a set of resources allocated in a partition of the FPGA.
In addition to these configurable elements, the latest families of FPGAs, for instance the Xilinx Zynq 7000 (Fig. 1.3), include hard-core elements to accelerate certain processing or communication functions. This is the case of the DDR controller, the Ethernet MAC controller, or even of hard-core processors implemented with all the needed peripherals as a full micro-controller unit (dual ARM Cortex-A9 cores with timers, UART, or ICAP (Internal Configuration Access Port) controllers).
Figure 1.3: Xilinx Zynq 7000 EPP block diagram
As modern FPGA matrices tend to become larger and larger, designers now have more space to implement multi-core systems including several soft-processors and hardware accelerators. In order to offer the best performance, and as told [...] is really efficient but rather expensive for small production lines, whereas the latter is flexible and can rely on many COTS components but does not reach the wanted performance. Namely, an FPGA is a good trade-off between power consumption and processing power.
Moreover, all the processing units detailed in Section 1.1.2 can be implemented inside an FPGA. This capability gives the developer the flexibility to explore different solutions when designing a platform. Several architecture choices can be made and compared. Each function can then be implemented on the desired processing unit in order to obtain the best partitioning.
1.1.4 Dynamic and Partial Reconfiguration
The natural evolution of FPGAs leads them, thanks to miniaturization, to offer more and more logic resources [Koch 2010b]. This increase helps to meet the growing number of features required by the end-user. To manage the dramatic increase in FPGA size, and especially its effect on design time, manufacturers added partial reconfiguration features to their FPGAs (Fig. 1.4).
Figure 1.4: Dynamic and Partial Reconfiguration principle
The use of partial reconfiguration has the advantage of decreasing the implementation time, because partial modules can be implemented separately while the [...]
Modern FPGA manufacturers, for now Xilinx and Altera, provide mechanisms to dynamically reconfigure the chip. Dynamic reconfiguration makes it possible to reconfigure a partial module while keeping the static part unchanged. The system on the chip is thus able to reconfigure a part of itself without disturbing the execution of the rest of the system. In addition to the functional interest, it has a substantial impact on the resource usage of autonomous embedded systems-on-chip. Moreover, in some cases it is a good way to decrease the power consumption while offering a larger choice of hardware accelerators to a given application.
1.2 HSoC programming model
1.2.1 Programming issue
Despite the real interest of this technology, the main drawback of heterogeneous platforms is that they are difficult to program. Indeed, the difference in abstraction level between software functions running on processors and hardware accelerators makes the development of applications really tough. In order to ease the validation and the exploration of the possible partitionings for a given platform, a common abstraction has to be provided to the end-user (Fig. 1.5).
Figure 1.5: Abstraction level differences between hardware and software programming models
To achieve it, an emerging general trend consists in adopting a high-level language to describe the application. Coupled with new efficient tools able to simulate the design and automatically generate low-level source code, such a design flow would make it possible to tackle the programming issues of the latest FPGAs. Indeed, as their size increases, the system complexity increases too, and such tools would provide a simpler view of the whole system. For instance, a language such as the Synchronous Data-Flow language (SDF) [Lee 1987] provides a model of computation which can be adapted both to software and hardware threads, and so abstracts the heterogeneity of the platform.
An intermediate approach can be adopted which provides, as a first step, not a unique programming language to describe both the software and the hardware, but a common programming model. In the embedded software domain, a commonly adopted programming model is the threading model. To design a heterogeneous platform using this model, we have to raise the abstraction level of the hardware accelerators. This allows us to reuse legacy work from the software domain and so to focus on the hardware part of the model. In our case, the implementation choice is made between a software implementation on a processor and a reconfigurable hardware logic partition.
By analogy with software threads, we define hardware threads. A hardware thread encapsulates the hardware accelerator and allows it to behave like a software thread. Namely, a hardware thread is able to access operating system services and has, from a certain point of view, a sequential execution. These services include the ability to create or delete a resource, and to perform a system call. The user should be able to preempt any thread, whether software or hardware, and so to save and restore its context. A particular effort should be made on the implementation of mechanisms allowing the threads to communicate in a transparent way. Our final objective is to offer the end-user a simple thread view of the application, and the designer an efficient way to create relocatable hardware accelerators which act like software threads (Fig. 1.6). To do so, hardware accelerators should be implemented in what we will call a hardware thread in order to communicate.
Hardware threads are developed in VHDL (Very High Speed Integrated Circuit Hardware Description Language), a standard hardware description language. Their generic interfaces and abstracted execution model will make it possible, in the future, to integrate this intermediate programming model with high-level design tools. This will result in the automatic generation of hardware threads, taking advantage of the existing low-level structure.
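The unified view described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the thesis implementation: the class names, the `syscall` entry point, the mutex-like resource and the single `progress` counter standing in for a thread context are all hypothetical.

```python
from abc import ABC, abstractmethod


class OSServices:
    """Toy operating system service layer shared by every thread."""

    def __init__(self):
        self.resources = {}

    def create_resource(self, name):
        self.resources[name] = {"owner": None}

    def delete_resource(self, name):
        del self.resources[name]

    def syscall(self, caller, request, name):
        # Software and hardware threads use the same entry point.
        resource = self.resources[name]
        if request == "lock" and resource["owner"] is None:
            resource["owner"] = caller.name
            return True
        if request == "unlock" and resource["owner"] == caller.name:
            resource["owner"] = None
            return True
        return False


class Thread(ABC):
    """Common thread view, whatever the underlying implementation."""

    def __init__(self, name, os_services):
        self.name = name
        self.os = os_services
        self.progress = 0  # stands in for the whole execution context

    @abstractmethod
    def location(self):
        """Tell where the thread body executes."""

    def save_context(self):
        return {"progress": self.progress}

    def restore_context(self, context):
        self.progress = context["progress"]


class SoftwareThread(Thread):
    def location(self):
        return "processor"


class HardwareThread(Thread):
    def location(self):
        return "reconfigurable partition"


services = OSServices()
services.create_resource("mutex0")
sw = SoftwareThread("sw0", services)
hw = HardwareThread("hw0", services)

hw_got_lock = services.syscall(hw, "lock", "mutex0")    # hardware thread locks
sw_got_lock = services.syscall(sw, "lock", "mutex0")    # refused, already owned
hw_released = services.syscall(hw, "unlock", "mutex0")  # owner releases
```

The point of the sketch is that the caller of `syscall`, `save_context` and `restore_context` never needs to know which kind of thread it is handling; only `location` differs.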
1.2.2 Dynamically Reconfigurable HSoC
Figure 1.6: Heterogeneous threading application
[...] reconfigurable HSoC, hardware threads are defined as relocatable modules which can then be allocated into any available reconfigurable partition of the FPGA.
The system, using dynamic and partial reconfiguration, allows the user to preempt any module (Fig. 1.7). Namely, a part of the chip is divided into several dynamic partitions. Each partition is then allocated by the control part of the application to one hardware thread for a certain amount of time.
Figure 1.7: Hardware Thread preemption
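The allocation and preemption scheme just described can be sketched as follows. This is a minimal model, assuming that a thread context can be reduced to a small dictionary and that any free partition can host any thread; the `PartitionManager` class and its method names are hypothetical and do not reflect the services specified later in the thesis.

```python
class PartitionManager:
    """Toy control part allocating reconfigurable partitions to hardware threads."""

    def __init__(self, partition_count):
        # None marks a free partition; otherwise the slot holds a thread name.
        self.partitions = [None] * partition_count
        self.saved_contexts = {}  # thread name -> context saved at preemption
        self.live_contexts = {}   # thread name -> context of a running thread

    def allocate(self, thread_name):
        """Place the thread in any free partition; return its index or None."""
        for index, occupant in enumerate(self.partitions):
            if occupant is None:
                self.partitions[index] = thread_name
                # Restore a previously saved context, or start from scratch.
                self.live_contexts[thread_name] = self.saved_contexts.pop(
                    thread_name, {"progress": 0})
                return index
        return None

    def run(self, thread_name, steps):
        """Let a running thread make some progress."""
        self.live_contexts[thread_name]["progress"] += steps

    def preempt(self, index):
        """Save the context of the thread in this partition and free it."""
        thread_name = self.partitions[index]
        if thread_name is None:
            return None
        self.saved_contexts[thread_name] = self.live_contexts.pop(thread_name)
        self.partitions[index] = None
        return thread_name


manager = PartitionManager(2)
slot_filter = manager.allocate("filter")
slot_tracker = manager.allocate("tracker")
manager.run("filter", 5)

manager.preempt(slot_filter)                 # "filter" context is saved
slot_display = manager.allocate("display")   # reuses the freed partition
refused = manager.allocate("filter")         # no free partition left
manager.preempt(slot_tracker)
slot_filter_again = manager.allocate("filter")  # relocated, context restored
```

In this run the "filter" thread starts in one partition, is preempted, and later resumes in a different partition with its progress intact, which is the behaviour that relocatable hardware threads are meant to allow.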
The list of target applications can then be extended to multi-mode applications and to those which need to adapt to their environment. As a perspective, other applications based on the dynamic detection of events, such as security systems, could take advantage of this technology. We can also cite bio-inspired architectures, which would rely on dynamic reconfiguration mechanisms in order to dynamically [...]
Moreover, being able to update a system after its release could help the designer to adapt to unforeseen specification changes, for instance when implementing an H264 codec. It could also have a positive effect on the design costs of these products.
1.3 Objectives
The goal of this PhD thesis is to propose a software and hardware architecture that improves the application development process when targeting a Heterogeneous System-on-Chip. With the increasing complexity of applications, an abstracted programming model has to be adopted to facilitate their description and to improve the flexibility regarding the implementation choices. The proposed architecture should rely on the existing operating system structure and provide services and low-level mechanisms to easily handle the thread heterogeneity.
In Chapter 2, we propose a hardware thread model which abstracts this heterogeneity. Then we study mechanisms and tools that manage hardware threads in the same way as software ones. In the next chapter, an operating system dedicated to heterogeneous systems-on-chip is specified. The main feature of this operating system is to provide flexible access to the operating system services for every thread, whether software or hardware, whatever core it is running on. Finally, an application will be detailed and [...]
2 Unified Thread Model
2.1 Related work
2.1.1 Software kernel management
With the emergence of heterogeneous platforms including both software processors and reconfigurable areas, a natural way to tackle the heterogeneity of these reconfigurable platforms has been to rely on the existing software abstraction layers. To [...]
This scheme leads to the design of a new kind of platform in which a primitive or function used by a task, or a task itself, can be accelerated in hardware. The following works aim to provide a simple way to load and run these accelerators. They abstract the complexity of the communication between a processor, namely an application running on top of an operating system, and a hardware accelerator.
This is the case of the Egret platform [Bergmann 2003], whose objective is
to provide a fully modular platform. A Microblaze micro-controller unit runs
a µC-Linux operating system and allows the developer to choose which hardware
accelerators have to be executed. To do so, a classic driver using the IOCTL
(Input and Output Control) API [IOC 1997] allows the developer to load a partial
bitstream of the desired configuration through the Internal Configuration Access
Port (ICAP) of the FPGA (Fig. 2.1).

Figure 2.1: µC-Linux ICAP driver [Bergmann 2003]

Authors of [Donato 2005] presented a platform based on the Linux operating
system. This choice was made because its source code is freely available,
it has been ported to numerous platforms, and it is modular with regard to
additional drivers.
This platform has been named Caronte: it is composed of a Virtex2Pro FPGA
including a PowerPC 405 and one ICAP port. A software driver allows the
developer to control the ICAP, again using the IOCTL protocol. When a new
IP core is loaded, a communication protocol allows this IP to declare
itself to the Core Manager IP, following a hot-plug philosophy. The interconnect is
a Wishbone bus, and specific Medium Access Controllers (MAC) are used to provide
the ability to allocate address space at run-time. This work led to the launch of
the commercial project PetaLinux, which aims to simplify the deployment of the
Linux operating system on reconfigurable platforms. The use of Linux in MPSoC
platforms is a growing trend, as shown by the recent acquisition of the PetaLogix
company by Xilinx.
In [Rana 2007], a platform composed of several FPGAs is introduced. The whole
platform is supervised by a single processor running Linux, which allows partial
or total reconfiguration. Simple primitives are also implemented as a
driver using the IOCTL protocol.

The main issue to solve is the management of the concurrent execution of each
task present in the system. To handle this, we need to rely on a multitask operating
system providing simple and legacy ways of communication to every task, software
or hardware (Fig. 2.2). In particular, hardware tasks are connected to a Medium
Access Controller (MAC), which provides the ability to dynamically allocate address
space for each loaded module at run-time.

Figure 2.2: RAPTOR software architecture [Rana 2007]
The operating system used to abstract the reconfiguration process is based on the
work of Donato et al. [Donato 2005]. When a reconfigurable accelerator is loaded on
the FPGA, a driver is loaded into the Linux kernel and is associated with this
accelerator. To control the module, the application relies on the classical IOCTL commands.

In all these works, the management of the hardware accelerators implies minimal
modification of the operating system and is easily portable. However, the accelerator
is considered as a hardware IP core and not as a hardware thread. From the user's
point of view, this leads to a heterogeneous programming model for the
developer. This is not sufficient for our objectives, which require
a homogeneous programming model at a higher level of representation.
2.1.2 Run-time manager
Other solutions go further and propose to design a run-time manager. A
run-time manager is responsible for scheduling hardware accelerators at run-time and
managing the access to shared resources. The system knows which partitions are
available and which accelerators need to be loaded. Using an adaptive algorithm, a
real-time unit dynamically places and configures the accelerators. Beyond the
management of the hardware accelerators as co-processor modules, the goal is to
define a model in which these accelerators can be considered as real tasks, in the
same way as the software ones.

Nollet et al. [Nollet 2003] introduce one of the first approaches to designing an
operating system dedicated to reconfigurable systems, called OS4RS. It specifically
targets Heterogeneous Reconfigurable Systems-on-Chip composed of ISPs
(Instruction Set Processors) and reconfigurable tiles.

This OS must be capable of providing a similar set of services for the
heterogeneous tasks as a traditional OS does for software applications. It is based on RTAI,
a real-time Linux extension.
The hardware tasks are placed into slots and connected to each other via a
network-on-chip. The Hardware Abstraction Layer (HAL) of the operating system
provides communication primitives such as send and receive, as well as control
messages to place a new task and to read or modify the network parameters (Fig. 2.3).
The communication API has been ported both in hardware and software. This
common interface allows a task to be migrated from a software to a hardware processing
element in a transparent way.

The operating system includes a two-level scheduler. The first level dispatches
the tasks on the processing units, whereas local schedulers handle the tasks assigned
to them. At the first level, the scheduler relies on a checkpointing mechanism to save
task contexts. The authors chose this solution because it makes
the context independent of the targeted processing element. At the lower level,
local schedulers may employ processor-specific contexts, since they will never move
tasks to another processor. The definition and the management of the checkpoints
(i.e. the definition of what needs to be saved) are up to the user. We can notice that
this information is particularly difficult to define and is still an open issue.

Figure 2.3: OS4RS platform architecture [Nollet 2003]
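
The checkpointing idea used by the first-level scheduler can be illustrated with a minimal Python sketch. It is not OS4RS code: the `ProcessingElement` class, the `accumulate` task and the dictionary used as a checkpoint are hypothetical names chosen for illustration. The only point is that the saved context contains nothing specific to a processing element, so the task can resume on another one.

```python
# Illustrative sketch only: a user-defined checkpoint that is independent of
# the processing element, so a task can be migrated between two of them.
# All names below are hypothetical and do not come from OS4RS.

def accumulate(data, checkpoint, budget):
    """Sum `data`, resuming from `checkpoint` and stopping after `budget` items.

    The checkpoint is what the user decided must be saved: the index of the
    next item and the partial sum. Returns (new_checkpoint, finished).
    """
    index, total = checkpoint["index"], checkpoint["total"]
    stop = min(len(data), index + budget)
    while index < stop:
        total += data[index]
        index += 1
    return {"index": index, "total": total}, index == len(data)


class ProcessingElement:
    """Stands for an ISP or a reconfigurable tile with its local scheduler."""

    def __init__(self, name):
        self.name = name

    def run(self, task, data, checkpoint, budget):
        return task(data, checkpoint, budget)


data = list(range(10))
software_pe = ProcessingElement("ISP")
hardware_pe = ProcessingElement("tile0")

# Start on the software processing element and stop at a checkpoint ...
state, done = software_pe.run(accumulate, data, {"index": 0, "total": 0}, budget=4)
# ... then resume from the saved context on another processing element.
state, done = hardware_pe.run(accumulate, data, state, budget=100)
result = state["total"]
```

Choosing what goes into the checkpoint (here an index and a partial sum) is left to the task writer, which is the difficulty pointed out above.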
In [Steiger 2004], Platzner et al. also introduce an operating system dedicated to
reconfigurable systems and discuss two different points. The first discussion
is about design issues for reconfigurable hardware operating systems. The required
degree of flexibility, paired with high computation demands, calls for partially
reconfigurable hardware that is operated in a true multitasking manner.

For the authors, it is necessary to define three things: (1) a programming model
dedicated to reconfigurable systems with a set of well-defined system services, (2)
a run-time system to handle the dynamicity of the system and resolve conflicts
between executable objects, and (3) the smallest unit of execution, that is to say a
process or a thread.

They define a hardware thread as a pre-placed and pre-routed digital circuit
which can be loaded and relocated easily in any available slot of the FPGA. A
square is the simplest shape to manage, even though it leads to more
internal fragmentation than more complex shapes such as polyominoes.
They then explain that 1-Dimensional (1D) placement makes the scheduling
of the different threads easier but increases external fragmentation. On the other
hand, 2-Dimensional (2D) placement offers more placement possibilities and so less
external fragmentation, but is harder to manage.
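
To make the notion of external fragmentation in 1D placement concrete, here is a minimal Python sketch. It assumes a first-fit policy on an FPGA seen as a row of columns. The `Fpga1D` class and the first-fit choice are assumptions made for illustration, not the placement algorithm of [Steiger 2004].

```python
# Illustrative sketch only: first-fit 1D placement of hardware threads on an
# FPGA modelled as a row of columns. Names and policy are hypothetical.

class Fpga1D:
    def __init__(self, columns):
        self.owner = [None] * columns   # owner[i] is the thread using column i

    def place(self, name, width):
        """Place `name` on the first run of `width` free columns.

        Returns the start column, or None when no contiguous run is wide enough.
        """
        run = 0
        for col, owner in enumerate(self.owner):
            run = run + 1 if owner is None else 0
            if run == width:
                start = col - width + 1
                self.owner[start:col + 1] = [name] * width
                return start
        return None

    def remove(self, name):
        self.owner = [None if o == name else o for o in self.owner]

    def free_columns(self):
        return self.owner.count(None)


fpga = Fpga1D(columns=8)
a = fpga.place("A", 3)   # columns 0-2
b = fpga.place("B", 2)   # columns 3-4
c = fpga.place("C", 3)   # columns 5-7
fpga.remove("A")
fpga.remove("C")
# Six columns are free, but they are split into two runs of three, so a
# 4-column thread cannot be placed: this is external fragmentation.
free = fpga.free_columns()
d = fpga.place("D", 4)
```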
In this paper, they target a real-time scenario where each incoming thread is
either accepted with a guarantee to meet the deadline or rejected. As FPGA
resource distribution is in reality not homogeneous, we can assume that at least memory and
FIFOs are managed by the operating system, so that a thread can access
these resources using operating system services: memory allocation and message
queues. They conclude that 1D placement is more realistic regarding current
FPGA architectures, but that 2D placement is an interesting open issue, since
2D scheduling is considerably more attractive in terms of performance.
The second discussion deals with hard real-time task scheduling. Target
Figure 2.4: Operating System for Reconfigurable Systems software architecture
[Steiger 2004]
between the operating system and the thread. This port is called Standard Task
Interface (STI). The Task Communication Bus (TCB) runs horizontally through all
hardware thread areas into a number of dummy tasks.

The software operating system is divided into three layers (Fig. 2.4): a first layer
to manage tasks and resources, a second to handle the context issue, and a last
one which is responsible for the communication and the configuration.
In [Wigley 2001], the authors discuss the scheduling of relocatable hardware
tasks by an operating system. They give a specification of an ideal operating system
dedicated to reconfigurable computers. This operating system must provide a
scheduler able to manage explicit context changes, namely the user has to insert
checkpoints inside the task source code in order to ensure a correct context save.

In their specification, the operating system is responsible for managing the
virtual memory and protecting platform physical resources from conflicting accesses.
Task partitioning must be dynamic, as we must be able to perform load balancing or
task migration from software to hardware and vice versa.
it by initializing direct communication between these kinds of tasks. Otherwise, a
buffer should be used in order to process communication. A last point is the need
for verification tools and test cases, that is to say application examples which could
benefit from Dynamic and Partial Reconfiguration.
Another example of a run-time manager is introduced in [Shiyanovskii 2009b].
Reconfiguration is managed by a software layer on top of the real-time operating system.
This layer is called the Adaptation Manager, and can be customized in order to get a
trade-off between power consumption and execution speed. To do so, it relies
on a learning process which allows it to improve its decision skill.

The reconfigurable platform is composed of tiles which abstract the logic block
programming level to give the developer access to coarse-grain primitives
such as filters, FFTs or other higher-level functions. The scheduling policy is based on
priority. Tasks can be in three different states (Inactive, Active and Reserved) and
have real-time attributes such as execution time, deadline, or laxity.
These works show that an operating system is necessary to manage the hardware
accelerators. This abstraction layer has to take advantage of dynamic
reconfiguration and provide high-level mechanisms to manage the available slots. It means
offering the end-user the ability to create, suspend, resume and delete a hardware
task. At a lower level, a reconfigurable partition should be seen as a processing
element. The operating system should be able to share this resource between all
hardware accelerators, leading us to view a hardware accelerator as the equivalent
of a software thread.
2.1.3 Hardware thread model
Using the ability to control Dynamic and Partial Reconfiguration (DPR), recent
articles have proposed abstraction models for the hardware accelerators. The objective
is to improve the programmability of these heterogeneous platforms and to facilitate
the communication between the accelerators and the rest of the system by providing a
default interface.

Authors of [El-Araby 2008] define VFPGAs. This acronym stands for Virtual
FPGAs. A VFPGA is a reconfigurable zone controlled by a processor (Fig. 2.5).
A VFPGA can be seen as a hardware task. This kind of task has three different
states: configured and waiting for input data (data in), processing, or sending data
(data out).

A virtualization manager is implemented to receive execution requests coming
from processors. It is responsible for loading the VFPGAs. As expected, different
tests show a gain in execution speed.

Figure 2.5: VFPGA runtime manager architecture [El-Araby 2008]
In [Verdoscia 1994], the authors tackle the issue of the hardware implementation of a
Data-Flow Graph (DFG) model of computation (MoC). In a DFG model, a process
can be represented by an actor. Actors communicate by sending each other packets
of data called tokens [Lee 1987]. Although this model is generally static, this paper
defines a dynamic model in which actor input and output tokens come from and go
to infinite FIFOs. Every actor has two inputs and a single output,
which allows three types of links to be defined between them:
• classical link: 2 → 2 (two outputs of two different actors to the inputs of one or two other actors)
• joint link: 2 → 1 (two outputs of two different actors to the inputs of another actor)
• replica link: 1 → 2 (one output of an actor to the inputs of one or two other actors)

Actors are grouped in clusters which communicate by Message Passing. Inside a
cluster, actors are called Functional Units (FUs). These FUs communicate through
a crossbar. Messages exchanged between FUs, and between FUs and the host,
correspond to the graph configuration and the produced tokens. A FU is composed of
three elements (Fig. 2.6):
• "Synchronization Unit": it is responsible for checking the presence of the input tokens. Two signals are generated: ABIL if the two tokens are present, and ABOL, with a delay of one cycle, to allow output firing
• "Computation Unit": it is composed of an ALU, a multiplier and one Selection
module. If a test is requested and passes, the output is activated on
the arrival of the ABOL signal

Figure 2.6: Functional Unit architecture [Verdoscia 1994]

The proposed model has three advantages. Firstly, all actors have the same
architecture (two inputs, one output) and so the same interfaces with the external world;
secondly, it yields an architecture adapted to VLSI; and finally, all actors are able
to manage loops and conditional instructions.
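
The firing rule of such a two-input, one-output actor can be sketched in Python as follows. This is a software illustration under our own assumptions: the `Actor` class and its methods are hypothetical, and the `can_fire` test only plays the role that the text assigns to the Synchronization Unit (both input tokens present), without modelling the ABIL and ABOL signals or their timing.

```python
# Illustrative sketch only: a data-flow actor with two input FIFOs and one
# output FIFO. It fires only when a token is present on both inputs.
# Names are hypothetical and do not come from [Verdoscia 1994].
from collections import deque
import operator


class Actor:
    def __init__(self, operation):
        self.operation = operation
        self.inputs = (deque(), deque())   # unbounded FIFOs of tokens
        self.output = deque()

    def can_fire(self):
        # Firing rule: one token available on each input.
        return all(len(fifo) > 0 for fifo in self.inputs)

    def fire(self):
        """Consume one token per input and produce one output token."""
        if not self.can_fire():
            return False
        left = self.inputs[0].popleft()
        right = self.inputs[1].popleft()
        self.output.append(self.operation(left, right))
        return True


# Joint link (2 -> 1): the outputs of `add` and `mul` feed the inputs of `sub`.
add, mul, sub = Actor(operator.add), Actor(operator.mul), Actor(operator.sub)
add.inputs[0].append(2)
add.inputs[1].append(3)
mul.inputs[0].append(4)
mul.inputs[1].append(5)

fired_early = sub.fire()            # no input tokens yet: the actor does not fire
add.fire()
mul.fire()
sub.inputs[0].append(add.output.popleft())
sub.inputs[1].append(mul.output.popleft())
sub.fire()
result = sub.output.popleft()       # (2 + 3) - (4 * 5)
```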
Much more complex accelerators have since been developed, such as Hybrid
Thread [Agron 2009a]. In this article, the authors define a model of POSIX-compliant
hardware thread, capable of processing operating system calls through a shared
memory, as a software thread does (Fig. 2.7). A thread is composed of two finite
state machines (FSM): one is used to answer operating system requests and get
system call results, and the other one to process system calls and get access to a
heap. These FSMs are controlled by the hardware accelerators encapsulated in the
User Logic component.

Figure 2.7: Hybrid Thread model [Agron 2009a]

Heap and stack are stored in an internal Block RAM (BRAM) of the thread.
As in a software POSIX thread, the stack is used to store the system call
parameters. Moreover, in order to enhance the programmability of these threads, the
authors defined a high-level API which allows the developer to describe a
heterogeneous application using the C language. A dedicated compiler written in Python
translates the C code into a VHDL implementation of the Hybrid Thread.
In [Lubbers 2008], the authors introduce an operating system dedicated to
reconfigurable architectures: ReconOS. This operating system provides a homogeneous
abstraction layer to the threads, software or hardware, and allows them to
process system calls. This paper deals with the port of ReconOS to a Linux-based
platform, and compares its performance with another one based on the eCOS
operating system. The goal is to demonstrate the portability of the concepts brought
by the ReconOS architecture.

In this operating system, every service is managed by a software operating
system running on a processor. Hardware system calls are made through an API
described in a VHDL library. The hardware thread finite state machine is
synchronized with the software operating system in order for it to process the system call. The
interface responsible for the communication is called OSIF, for OS InterFace, and
consists of a set of registers accessible through the processor bus (Fig. 2.8).

Figure 2.8: ReconOS hardware thread model [Lubbers 2008]

Regarding inter-thread communication, the thread heterogeneity is abstracted
by associating each hardware thread with a software one, which acts as a proxy or
delegate. When requested by a hardware thread, the system call is executed by the
corresponding software thread.

In order to link the operating system resources requested by the hardware thread with the
ones accessible by the software one, a table of the used instances is maintained by
the delegate. In this way, the same hardware thread can be used by several instances
of a software thread. This mechanism has been implemented in anticipation of the future
use of partial and dynamic reconfiguration.
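
The delegate mechanism can be sketched in Python, with an ordinary thread standing in for the hardware thread. This is not the ReconOS API: the two queues are a stand-in for the OSIF registers, and the call names (`mutex_lock`, `mutex_unlock`, `mbox_put`) as well as the `resources` table layout are assumptions made for illustration.

```python
# Illustrative sketch only: a software delegate thread executing system calls
# on behalf of a (simulated) hardware thread. The request/reply queues stand in
# for the OSIF registers; all names are hypothetical, not the ReconOS API.
import queue
import threading

requests = queue.Queue()   # hardware thread -> delegate
replies = queue.Queue()    # delegate -> hardware thread

# Table maintained by the delegate: resource index used by the hardware
# thread -> operating system object accessible from software.
resources = {0: threading.Lock(), 1: queue.Queue()}


def delegate():
    """Software proxy: performs each requested call, then replies."""
    while True:
        call, index, argument = requests.get()
        if call == "exit":
            break
        target = resources[index]
        if call == "mutex_lock":
            target.acquire()
        elif call == "mutex_unlock":
            target.release()
        elif call == "mbox_put":
            target.put(argument)
        replies.put(call)      # acknowledge the completed call


def hardware_thread():
    """Stands for the hardware FSM issuing system calls through the OSIF."""
    for call in (("mutex_lock", 0, None),
                 ("mbox_put", 1, 42),
                 ("mutex_unlock", 0, None)):
        requests.put(call)
        replies.get()          # wait until the delegate has executed the call
    requests.put(("exit", None, None))


workers = [threading.Thread(target=delegate),
           threading.Thread(target=hardware_thread)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

message = resources[1].get_nowait()
mutex_is_free = not resources[0].locked()
```

Because the hardware side only names resources by index, the same hardware thread description can be paired with different delegate tables, which mirrors the instance table described above.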
2.1.4 Conclusion
As explained in the introduction, our choice is oriented towards the threading model.
Our goal is to propose a hardware thread model which is able to communicate with
software threads in the same way as has been proposed by Hybrid Thread
[Agron 2009a] or ReconOS [Lubbers 2008]. This model has to be adapted to the
reconfigurable platform and take advantage of the parallelism and the flexibility
offered by this type of platform. The definition of this model is the basic proposal
of this thesis and will lead us to define, in the next chapters, an operating system
2.2 Thread model
2.2.1 Process definition
A process is defined as an independent stream of instructions, running on top of a
processing element. A process groups together some of the resources of a processing
element, such as the memory space, the open files, the signal handlers and
other information. Grouping resources inside a single entity facilitates the
management of these resources by the running process [Tanenbaum 2001].

Process execution is protected by the fact that it has a private address space.
Processes are scheduled by the operating system kernel and compete for access
to the processing element. When a process is blocked by a system call, the scheduler
is responsible for saving the context of this process and selecting another process
among the ones ready to be executed.

2.2.2 Thread definition
A thread is executed inside a process (Fig. 2.9). The main difference between a
thread and a process is that the latter has a full view of the memory space
addressable by the processor, whereas threads inside the same process share the processing
element resources owned by the process.

Figure 2.9: Process and Thread

A threading model has the advantage of isolating the executions of application
functions from one another, and so enforces parallelism when targeting
multicore platforms. It improves the programmability by dividing the application into
several tasks. In addition, a thread is easier to create or destroy than a process. A
simple representation of the thread life cycle is depicted in Figure 2.10.

Moreover, as a thread is a sub-entity of a process, it has a smaller context to
save than the latter. Indeed, it does not have to manage global resources such as
Figure 2.10: Thread life cycle
2.2.3 Software thread model
Generally, there are two ways to implement a thread model in an operating system:
either in user space or in kernel space.
2.2.3.1 User thread model
In the user thread model, the operating system kernel is only aware of a single thread
in the process. Threads are scheduled by a threads library implemented in user
space. The advantage of this model is that there is no need to modify the operating
system, which is interesting if the latter does not support the thread execution model.

Figure 2.11: User Thread model

The user-level scheduler allows only one thread to be actively running in the
process at a time. There is one thread table per process, which allows a fast context
cannot manage a clock interrupt, a round-robin scheduling cannot be implemented.
Regarding parallelism, the main drawback of this model is that a blocking call
from a thread would block all the threads implemented inside the same process.
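
A user-level thread library can be approximated in a few lines of Python using generators. This is only a sketch of the principle, with hypothetical names (`worker`, `run`): the kernel sees a single thread, the ready list plays the role of the per-process thread table, and a switch happens only when a user thread yields voluntarily.

```python
# Illustrative sketch only: a user-space thread library built on generators.
# The kernel sees a single thread; switching happens only when a user thread
# yields, since the library has no clock interrupt to preempt it.
from collections import deque


def worker(name, steps, trace):
    for step in range(steps):
        trace.append((name, step))
        yield                      # voluntary switch back to the scheduler


def run(threads):
    """Run the ready threads in turn until all of them have finished."""
    ready = deque(threads)         # the per-process thread table
    while ready:
        thread = ready.popleft()
        try:
            next(thread)           # resume the thread until its next yield
            ready.append(thread)
        except StopIteration:
            pass                   # thread finished: drop it from the table


trace = []
run([worker("A", 2, trace), worker("B", 3, trace)])
```

If `worker` performed a blocking call instead of yielding, `run` would not regain control and every other user thread would stall, which is the drawback noted above.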
2.2.3.2 Kernel thread model
In the kernel thread model, kernel threads are separate tasks which are associated
with a process. In a kernel thread model, one kernel thread per process is created.
The process table and the thread table are both managed at the kernel level. A
preemptive scheduling policy is used, in which the operating system decides which
thread is eligible to share the processor.

Figure 2.12: Kernel Thread model

Moreover, when a thread performs a blocking call, its state is notified to the
kernel, which can decide to preempt the thread in favor of another ready thread. As
threads are managed at the kernel level, the drawback is that system call costs are
higher than in the user thread model.
2.2.3.3 Hybrid thread model
In a hybrid thread model, several user-level threads run on top of a kernel
thread (Fig. 2.13). A commonly used hybrid thread model is the POSIX threads
specification (Pthreads). POSIX stands for Portable Operating System Interface.
Threads are user-level threads but are managed using kernel-assisted
context switching. This means that when a thread performs a system call, if the call is
non-blocking, the thread relies on the user-level API. Otherwise, the kernel thread is
notified that the thread is blocked, and the kernel scheduler can try to find another
process with at least one runnable thread. This solution is more complex to
implement but tries to combine the best of the two models.

Figure 2.13: Hybrid Thread model

ation as the kernel thread structure is lighter than the process one. On the other
hand, performance is lower due to the necessity to regularly switch from user
mode to kernel mode. Hybrid thread models like POSIX tend to be adopted
because the memory footprint becomes negligible with regard to the available resources,
and above all because it is a widely used standard in the computing domain, the
adoption of a standard being a good thing for the improvement of application
portability.
2.2.4 Thread attributes
2.2.4.1 Storage structures
At the time of its creation, a thread is associated with two storage structures:
• a Data structure: Data is where all of the program variables are stored. It is
broken down into storage for global and static variables (static), storage for
dynamically allocated storage (heap), and storage for variables that are local
to the function.
• a Stack structure: The stack contains data about the program or procedure
call flow in a thread. The stack, along with local storage, is allocated for each
thread created. While in use by a thread, the stack and local storage are
considered to be thread resources. When the thread ends, these resources return
to the process for subsequent use by another thread.
2.2.4.2 Thread-private data
• Thread identifier: A unique number that can be used to identify the thread.
• Priority: if the operating system allows specification of a thread priority, this
value determines the relative importance of one thread to other threads in
the application.
• Call stack: The call stack contains data about the program flow or procedure
call flow in the thread.
2.2.4.3 Thread-specific data (TLS)
Threads can have their own view of data items called thread-specific data.
Thread-specific data is different from thread-private data. The threads implementation
defines the thread-private data at the kernel level, while the application defines the
thread-specific data. Threads do not share thread-specific storage, but all functions
within that thread can access it.

Due to the design of the application, threads may not function correctly if they
share the global storage of the application. If eliminating the global storage is not
feasible, using thread-specific data is a good alternative.
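
As an illustration of thread-specific data, the following sketch uses Python's `threading.local`, which plays a role comparable to the POSIX `pthread_setspecific` and `pthread_getspecific` calls. The `context` object and the `request_id` attribute are hypothetical names chosen for this example.

```python
# Illustrative sketch only: thread-specific data with Python's threading.local.
import threading

context = threading.local()        # each thread gets its own attributes
results = {}
results_lock = threading.Lock()


def helper():
    # Any function called within the thread sees that thread's own value.
    return context.request_id * 2


def task(request_id):
    context.request_id = request_id
    value = helper()
    with results_lock:
        results[request_id] = value


threads = [threading.Thread(target=task, args=(i,)) for i in (1, 2, 3)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

# The main thread never set the attribute, so it does not see any value.
main_has_value = hasattr(context, "request_id")
```

The `context` name is global, yet each thread reads back only the value it stored, which is how thread-specific data replaces shared global storage.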
2.2.5 Synchronization techniques among threads
Even if an application is thread-safe, in order to keep good performance, some
global resources have to be shared between threads. In this case, the most important
aspect of programming becomes the ability to synchronize threads. Synchronization
is the cooperative act of two or more threads that ensures that each thread reaches
a known point of operation with respect to the other threads before continuing.

Threads can be synchronized using operating system services. These services
assure the developer that critical resources are accessed in a safe way and allow
threads to communicate. The most common synchronization primitives are:
• Mutexes
• Semaphores
• Condition variables
• Threads as synchronization primitives
• Message Passing
2.2.5.1 Mutexes
en-application code at a time. The mutex is usually logically associated with the data
it protects by the application.

Create, lock, unlock, and delete are operations typically performed on a mutex.
Any thread that successfully locks the mutex is the owner until it unlocks the mutex.
Any thread that attempts to lock the mutex waits until the owner unlocks the mutex.
When the owner unlocks the mutex, control is returned to one waiting thread, with
that thread becoming the owner of the mutex. There can be only one owner of a
mutex at a time.
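
The lock and unlock operations can be illustrated with a short Python sketch in which four threads increment a shared counter. Note that Python's `threading.Lock` is used here for brevity and does not itself record an owner; `threading.RLock` is the variant that enforces ownership. The `counter` and `increment` names are hypothetical.

```python
# Illustrative sketch only: a mutex protecting a shared counter.
import threading

counter = 0
counter_mutex = threading.Lock()   # create


def increment(times):
    global counter
    for _ in range(times):
        with counter_mutex:        # lock; unlocked when the block exits
            counter += 1


threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Every update happens while the mutex is held, so no increment is lost whatever the interleaving of the threads.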
2.2.5.2 Semaphores
Semaphores can be used to control access to shared resources. A semaphore can be
thought of as an intelligent counter. Every semaphore has a current count, which is
greater than or equal to zero.

Any thread can decrement the count by locking or taking the semaphore.
Attempting to decrement the count past 0 causes the calling thread to wait for another
thread to unlock the semaphore. In the same way, any thread can increment the
count by unlocking or posting the semaphore. Posting a semaphore may wake up a
waiting thread if there is one present.

In their simplest form (with an initial count of 1), semaphores can be thought of
as a mutual exclusion (mutex). The important distinction between semaphores and
mutexes is the concept of ownership. No ownership is associated with a semaphore.
Unlike mutexes, it is possible for a thread that never took the semaphore to post
the semaphore.
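
The absence of ownership can be shown with a small Python sketch: the consumer thread takes a semaphore whose count is zero and therefore waits, while the main thread, which never took the semaphore, posts it. The `items_ready` and `log` names are hypothetical.

```python
# Illustrative sketch only: a semaphore posted by a thread that never took it.
import threading

items_ready = threading.Semaphore(0)   # current count starts at zero
log = []


def consumer():
    items_ready.acquire()              # count is 0, so this call waits
    log.append("consumed")


thread = threading.Thread(target=consumer)
thread.start()
log.append("posting")
items_ready.release()                  # post: increments the count, wakes the waiter
thread.join()
```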
2.2.5.3 Condition variables and threads
Condition variables allow threads to wait for certain events or conditions to occur,
and they notify other threads that are also waiting for the same events or conditions.
A thread can wait on a condition variable, or broadcast a condition such that
one or all of the threads that are waiting on the condition variable become active.

Condition variables do not have ownership associated with them and are usually
stateless. A stateless condition variable means that if a thread signals a condition
variable to wake up a waiting thread when there are currently no waiting threads,
the signal is discarded and no action is taken. The signal is effectively lost. It is
possible for one thread to signal a condition immediately before a different thread
begins waiting for it, without any resulting action.
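
Because a signal sent with no waiter is lost, the usual remedy is to keep the state in a separate variable and to test it before waiting. The Python sketch below forces the unfavourable order (the producer signals before the consumer starts) to show that the consumer still proceeds. The `data_ready` flag and the function names are hypothetical.

```python
# Illustrative sketch only: guarding a stateless condition variable with a
# predicate so that an early signal is not missed.
import threading

condition = threading.Condition()
data_ready = False                 # state kept outside the condition variable
received = []


def producer():
    global data_ready
    with condition:
        data_ready = True
        condition.notify_all()     # has no effect if nobody is waiting yet


def consumer():
    with condition:
        # Checking the predicate first means that a signal sent before this
        # thread started waiting does not leave it blocked forever.
        while not data_ready:
            condition.wait()
        received.append("data")


producer_thread = threading.Thread(target=producer)
consumer_thread = threading.Thread(target=consumer)
producer_thread.start()
producer_thread.join()             # the signal is sent before anyone waits
consumer_thread.start()
consumer_thread.join()
```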
2.2.5.4 Threads as synchronization primitives
Threads themselves can be used as synchronization primitives when one thread
specifically waits for another thread to complete. The waiting thread does not
2.2.5.5 Message Passing
A message passing API can be implemented on top of the previous mechanisms.
Threads can use this higher abstraction layer to synchronize and exchange data.
This API provides blocking or non-blocking primitives to transparently send or
receive messages from one thread to another. The implementation can rely on
either the shared memory paradigm or a network protocol if a dedicated network is
available.
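
A shared-memory implementation of such an API can be sketched in Python on top of `queue.Queue`, which is itself built from a lock and condition variables. The `send` and `receive` helpers and the `None` end-of-stream sentinel are assumptions made for this example.

```python
# Illustrative sketch only: send/receive primitives over a shared-memory queue.
import queue
import threading

mailbox = queue.Queue()            # unbounded channel between two threads


def send(message):
    mailbox.put(message)           # does not block: the queue is unbounded


def receive():
    return mailbox.get()           # blocks until a message is available


received = []


def receiver():
    while True:
        message = receive()
        if message is None:        # sentinel marking the end of the stream
            break
        received.append(message)


thread = threading.Thread(target=receiver)
thread.start()
for value in (1, 2, 3):
    send(value)
send(None)
thread.join()                      # the main thread waits for the receiver to complete
```

The final `join` also shows a thread used as a synchronization primitive: the main thread resumes only once the receiver has completed.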
2.2.6 Conclusion
Finally, for a hardware thread to be considered equivalent to a software thread,
the operating system managing the hardware threads has to give them access to the
same services as the software ones. The hardware thread model has to take
this into account, and specify additional mechanisms which allow the developer to
process system calls.
2.3 Our Hardware Thread model
2.3.1 Context: The FOSFOR project
2.3.1.1 Presentation
The FOSFOR project is an ANR (Agence Nationale de la Recherche) project started
in January 2008 and completed in December 2011. This is a collaboration between
four partners: Thales Research and Technology France based in Palaiseau, the ETIS
lab located in Cergy-Pontoise, the CAIRN from Lannion, and the LEAT based in
Nice Sophia-Antipolis.

FOSFOR stands for Flexible Operating System FOr Reconfigurable platform.
The aim of this project is to define a new kind of heterogeneous platform. This
platform is heterogeneous in the sense that threads and operating systems can be
implemented either in software (running on one of the processors), or in hardware
(running in a partition of the FPGA).

Each part can then be adapted to the deployed application. The goal
is to propose a homogeneous programming model for the application. This
architecture is designed to demonstrate the viability of reconfigurable architectures with
regard to the complexity of the development process.
2.3.1.2 Platform architecture
The FOSFOR architecture is composed of multiple processing elements connected
to a central bus (Fig. 2.14). We distinguish software processing elements and
hardware processing elements. They implement respectively a software and a hardware
version of the RTEMS [RTE 1988] operating system. On each processor, a software
operating system manages classic software threads, whereas a hardware operating
system (HwOS) is able to manage reconfigurable partitions. Hardware accelerators
are scheduled into these partitions.

Figure 2.14: FOSFOR platform architecture
The objective is to provide a homogeneous view of threads at the user level.
To achieve this, we abstract hardware accelerators into hardware threads. The
architecture of these hardware threads is defined in detail in Sections 2.3.2 and
2.3.3.
2.3.1.3 High-level communication mechanisms
Communication between threads can be handled in two ways. For
synchronization and small data transfers, threads can rely on the operating system services.
These services can be locally managed or shared between all processing elements.
For larger amounts of data, a middleware layer provides a message passing API with
Send and Receive primitives.

This middleware layer (Mw) is inserted between the application layer, based
on POSIX threads, and the operating system services API. If a thread wants to
communicate with another one, it can use the simple middleware API and its
transparent message passing protocol, or it can directly access the operating
system services, such as Mutex or Message Queue primitives.

This high-level API composed of these two types of primitives has been ported
software and hardware code. For instance, basing the description of the application
on the components can be a good solution to facilitate the implementation of
heterogeneous applications on HRSoC platforms.

In software, the MPCI layer included in RTEMS is the base of the heterogeneous
communication. It provides transparent access to distant services. We extended
it to the hardware implementation of the services. The bridge has to be transparent
in order to abstract both the location and the heterogeneity of the application threads. The
location of each hardware thread, which changes dynamically depending on the available
slots, is dynamically managed and abstracted by the middleware layer.
2.3.2 Hardware Thread specifications
2.3.2.1 Objectives
In order to reduce the programming complexity of the HRSoC, hardware
accelerators have to adopt the same behaviour as their software counterparts. To do so,
they should be able to obey the orders of the operating system. They must also be
able to call operating system services available in the whole platform and to read
and write data from and to memories, and in particular they should be associated
with an interface allowing the developer to control the execution of these hardware
accelerators. The hardware thread life cycle should be equivalent to the software
one. All these features and interfaces are assembled in order to encapsulate the
accelerator and so to define what we call a hardware thread.
2.3.2.2 Definition
We defined a hardware thread to take advantage of the dynamic reconfiguration
provided, for instance, in the Xilinx FPGAs. It is composed of two main parts:
a static part which contains all the interfaces with the platform, and a dynamic
application-specific part, which contains the Accelerator, the Finite State Machine
(FSM) controlling its execution, and a private memory (Fig. 2.15).

Compared to a software thread, a hardware thread runs on a reconfigurable
partition. This reconfigurable partition can be compared to a process, in which the
logic resources are equivalent to the processor resources shared between all the threads
running inside this process. In this scheme, a set of reconfigurable partitions is a
processing element containing several processor cores. A parallel can be drawn
between a reconfigurable partition and a processor core.

Static interfaces correspond to the user-level API. They provide the thread
with access to the operating system and communication services. The User FSM is
the sequential code executed by the thread and, finally, the dual-port memory
connected both to the Accelerator and the Network Interface is used as the heap
and stack storage by the thread.