Extended overlay architectures for heterogeneous FPGA cluster management


HAL Id: hal-01643297

https://hal.archives-ouvertes.fr/hal-01643297

Submitted on 12 Feb 2018

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Extended overlay architectures for heterogeneous FPGA cluster management

Mohamad Najem, Théotime Bollengier, Jean-Christophe Le Lann, Loïc Lagadec

To cite this version:

Mohamad Najem, Théotime Bollengier, Jean-Christophe Le Lann, Loïc Lagadec. Extended overlay architectures for heterogeneous FPGA cluster management. Journal of Systems Architecture, Elsevier, 2017, 78, pp.1-14. ⟨10.1016/j.sysarc.2017.06.001⟩. ⟨hal-01643297⟩


Mohamad Najem^a,∗, Théotime Bollengier^a,b, Jean-Christophe Le Lann^a, Loïc Lagadec^a

^a Lab-STICC UMR 6532, ENSTA Bretagne, Brest, France

^b B<>com, Research Institute of Technology, Brest, France

1. Introduction

Nowadays, hardware architectures, especially the ones dedicated to signal and image processing, must offer high performance computation, design flexibility, and upgrade capabilities. Reconfigurable chips, such as Field Programmable Gate Arrays (FPGAs), have been addressed as a reasonable solution in this area, combining flexibility, re-programmability, power efficiency, and low development cost [1,2].

Moreover, a trend which has recently emerged is the remote use of commodity off-the-shelf (COTS) FPGAs as hardware acceleration units in a heterogeneous computing cluster [3,4]. Over the lifetime of the infrastructure, components of such a cluster are gradually updated and replaced to follow the evolution of technology over time, and FPGA sales and trends. This results in FPGAs with different characteristics and from different vendors being used at the same time. However, a bitstream generated for a given FPGA cannot be loaded into an FPGA of a different model. It is therefore not possible to blindly dispatch hardware applications in such an FPGA cluster. For this purpose, this paper proposes an FPGA resource virtualization approach based on overlay architectures. The idea is to deploy an intermediate layer of reconfigurable resources, the overlay-based virtual FPGA (vFPGA), to hide the specifics of the underlying infrastructure and offer a unified view of resources. An application design targeting the vFPGA is no longer tied to a limited set of FPGAs from the cluster, and can run on any node implementing the vFPGA.

∗ Corresponding author.

E-mail addresses: [email protected] (M. Najem), [email protected] (T. Bollengier), [email protected] (J.-C. Le Lann), [email protected] (L. Lagadec).

Moreover, the sharing of reconfigurable resources among different applications is a key requirement that gives rise to a higher hardware utilization. For this purpose, applications have to be efficiently instrumented, so that such a cluster remains simple to manage. In this work, we propose to extend the classical overlay architectures by adding new features to freely snapshot the state of the running application, and to restore the circuit back to any previously saved state. Preemptive scheduling of hardware tasks on a node and live migration of applications between nodes can then be deployed; our complete system can perform load balancing or consolidation, manage application priorities, and provide fault tolerance. Moreover, to support early performance estimation, we propose some accurate cost models for scheduling and live migration time.

The remainder of this paper is organized as follows. Section 2 presents the limitations of existing works from the literature, while in Section 3, the proposed FPGA virtualization approach is discussed and the classical overlay architectures are presented. Section 4 discusses ZeFF, the deployment architecture, while Section 5 presents our novel overlay features. Section 6 describes the complete system supporting live migration and offering scheduling of hardware tasks on overlays. Section 7 draws conclusions and discusses future lines of research.

2. Related works

The main goal of this paper is to propose an FPGA virtualization solution for an efficient remote use of FPGAs for general purpose computing, which has recently become a new area of interest for many researchers. A useful analysis of the difficulty of offering FPGAs as shareable resources is presented in [5], where four requirements are identified, including binary compatibility among FPGAs and resource sharing capabilities. The authors propose to expose FPGAs to the cloud stack as a resource pool, which can be dynamically managed by a global manager. They clearly highlight the need for virtualization support for an efficient sharing of FPGA resources.

The virtualization of FPGA resources provides more flexibility and portability for user applications. Several frameworks are proposed in the literature, which abstract the physical resources by a system layer (or services) providing a virtualized environment to access them. In [6], a paravirtualized Xen Virtual Machine (VM) environment provides multiple-user services to access FPGA accelerators. Moreover, BORPH [7] is also a well-known project working on UNIX interfaces for creating hardware-based drivers and providing FPGA hardware abstractions and management, while the VirtualRC platform [8] proposes a uniform hardware/software interface to access FPGAs. Moreover, in [9], the authors have proposed the Object-Oriented Communication Engine (OOCE), a system-level middleware for FPGA-SoC heterogeneous architectures. They abstract the FPGA synthesis flow and acceleration by a set of OS calls. These frameworks move a step forward in the virtualization of FPGAs. However, such services virtualize (or abstract) the access to the FPGA, not the resource itself. In other words, FPGAs are not completely virtualized in a generic way, and it remains an open question how to provide an efficient sharing algorithm between heterogeneous FPGAs.

Most hardware virtualization schemes are based on the dynamic reconfiguration of the FPGA [4,9–11]. The key idea is to decompose the physical FPGA into several reconfigurable regions using Dynamic Partial Reconfiguration (DPR). Each region is considered as a single virtualized FPGA resource (vFPGA), making the FPGA a multi-tenant device. Additionally, a static region is used to connect all vFPGAs in each physical FPGA to the system manager. The authors in [10] have proposed an architecture of four vFPGAs with the NetFPGA-10G infrastructure in the static region, while in [11] the control unit in the static region communicates with the external host via a PCI-Express interface. Moreover, in [4], the authors have proposed a prototype framework (RC3E) for integrating vFPGAs in the cloud, also using a DPR technique. The virtualization of FPGA resources based on the DPR technique seems very promising, but remains dependent on the FPGA; it is a service provided by FPGA vendors, and applications must be synthesized and programmed for a specific FPGA using vendor tools. Also, DPR-based vFPGAs do not meet application management requirements, especially when migration and scheduling techniques are envisioned to increase the usability of FPGAs [5,11].

Furthermore, sharing of FPGA resources is a natural requirement and a promising technique to reuse resources by different circuits (configurations). In [12], the authors propose to switch between signal processing DVB-T2 tasks in a time-multiplexed fashion using DPR. Another technique is proposed in [13] to share circuit-specific DSP blocks from Xilinx. Instead of reconfiguring a set of DSP blocks to implement all operations, the authors use multiple sets of DSP blocks controlled by state machines to ensure that each set achieves a high initiation interval. A hardware context-switch mechanism is introduced in [14], where user tasks, designed by HLS tools, are modified at design time: several checkpoints are inserted inside the code and a scan-chain structure is included to extract/restore the state of the circuit. A useful approach based on the FaRM (Fast Reconfigurable Manner) controller for an advanced scheduling of hardware tasks on a DPR-based resource is introduced in [15]. The authors present an architecture with on-chip controllers and FSMs enabling the pre-loading of the partial bitstream in order to reduce the PR configuration time overhead.

From the variety of approaches in the literature, it is clear that most works have focused on the virtualization of FPGAs as a software layer. Few efforts have been reported on the hardware architecture of the vFPGA, which is crucial for effective management of FPGA resources. Spatial sharing of FPGAs is provided through the DPR technique, while time sharing is addressed for some application-specific and circuit-specific contexts. DPR-based solutions can hardly be generalized to arbitrary tasks or FPGAs. Also, the compatibility requirement among DPRs of different FPGAs cannot be entirely fulfilled. Despite the effort in reducing resource reconfiguration time, DPRs might suffer from a significant reconfiguration latency, as they use relatively slow communication interfaces for the configuration [16]. In this paper, we propose a novel approach for the hardware virtualization of FPGAs. vFPGAs are overlay-based architectures that can be implemented on any FPGA device. Our generic solution homogenizes any cluster of heterogeneous FPGAs. Several overlay architectural features are introduced to extend the flexibility, providing a cost-effective configuration, thanks to our bitstream pre-loading feature, and a lightweight management for running applications.

3. FPGA virtualization: overlay approach

Virtualizing FPGAs aims at designing, implementing and running hardware applications independently from the FPGA technology and tools. This section presents our proposed method, highlights the advantages of overlay architectures, and introduces most of the existing architectures from the literature.

3.1. Overlay-based approach

We investigate overlay architectures to virtualize heterogeneous FPGAs. Overlays have been reported to offer many advantages over FPGAs:

• Bitstream compatibility: Overlays are portable over any physical FPGA. Hence, circuits targeting the overlay can be implemented on any physical FPGA supporting the virtual layer. All physical FPGAs, which may have different characteristics and come from different vendors, are seen as homogeneous reconfigurable overlays by front-end applications.

• Open-source usage: The ability to deploy a vFPGA, with a fully accessible architecture description, allows developers to use any open-source tool targeting such architectures. This alleviates the need for FPGA vendor tools.

• Flexibility: As a general rule, being independent from the underlay, overlay implementations can integrate advanced architectural features to provide new capabilities, such as application monitoring, clock management, bitstream pre-loading, etc.

Fig. 1 illustrates the proposed approach. It consists in deploying a layer between the application and the physical FPGA: the overlay. Overlays are reconfigurable architectures mapped on top of the COTS FPGAs. The proposed overlay-based virtual FPGA is composed of three layers (as shown in Fig. 1):

Fig. 1. Overview of the proposed FPGA virtualization approach.

• The computation layer, which is the set of reconfigurable elements available to the application.

• The configuration layer, which configures the computation layer.

• The snapshot layer, which snapshots the state of the computation layer, hence the state of the running application.

From the application interface, the virtual FPGA corresponds to a set of reconfigurable elements available to the application which targets the vFPGA (functional architecture), while the interface with the physical FPGA focuses on how the functional architecture is implemented and synthesized to the host FPGA. So, the functional architecture of the vFPGA and the implementation are independent, hence the concept of FPGA virtualization.

3.2. Overlay architectures

Overlays consist of a regular arrangement of Reconfigurable Elements (REs), connected by routing channels of an interconnection network. In the literature, overlays have been developed for various applications, and are either fine-grained or coarse-grained architectures. Fig. 2 shows a conceptual view of the computation layer of island-style overlay architectures.

3.2.1. Fine-grained overlays

Fine-grained overlays are FPGA-like architectures [17,19–21], where REs are composed of fine-grain reconfigurable elements, such as Configurable Logic Blocks (CLBs). Fig. 2a) presents a generic LUT-based architecture compatible with standard architectures used in the academic Versatile Place and Route (VPR) tool [22]. The basic architecture has Height × Width CLBs, each of which has I input bits and N output bits. A CLB is composed of N Basic Logic Elements (BLEs). A BLE has one LUT with K inputs and one register that can be bypassed (also called the virtual application register). Inputs of BLEs are derived from a global crossbar that has I + N inputs (the I CLB inputs plus N feedback signals from the BLE outputs). Each Switch Box (SB) has W unidirectional wires, connected to other wires from adjacent SBs in a configurable way. Each element of such a fine-grained architecture is configured by one or more bits from the configuration register.
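To make the BLE description concrete, the following is a minimal behavioural sketch of a single BLE, assuming a truth-table encoding of the LUT and a single bypass bit. The class name, field names and bit ordering are illustrative assumptions and are not taken from ZUMA or from the paper.

```python
from dataclasses import dataclass


@dataclass
class BLE:
    """Behavioural model of one Basic Logic Element (hypothetical encoding)."""
    k: int            # number of LUT inputs (K in the text)
    lut_bits: list    # 2**k truth-table bits; index = inputs read as an integer
    bypass: bool      # True: output the LUT directly; False: output the register
    reg: int = 0      # virtual application register (one bit of state)

    def lut(self, inputs):
        """Combinational LUT lookup; inputs[0] is taken as the least significant bit."""
        assert len(inputs) == self.k and len(self.lut_bits) == 2 ** self.k
        index = sum(bit << i for i, bit in enumerate(inputs))
        return self.lut_bits[index]

    def output(self, inputs):
        """Value driven on the BLE output before the next clock edge."""
        return self.lut(inputs) if self.bypass else self.reg

    def clock(self, inputs):
        """Rising clock edge: the register captures the LUT output."""
        self.reg = self.lut(inputs)


# A 2-input XOR configured twice: once combinational, once registered.
xor_comb = BLE(k=2, lut_bits=[0, 1, 1, 0], bypass=True)
xor_reg = BLE(k=2, lut_bits=[0, 1, 1, 0], bypass=False)
```

With this encoding, one BLE needs 2^K LUT bits plus one bypass bit of configuration, and holds exactly one bit of application state.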

Mapping a fine-grained reconfigurable architecture on top of a fine-grained FPGA might suffer from a significant virtualization overhead. However, due to the large capacities of commercial FPGAs, this approach makes sense for applications where portability is as important as resource utilization. Nevertheless, several works from the literature focus on optimizing the implementation of such a vFPGA on the physical FPGA. Brant and Lemieux have proposed to use target-specific dynamically re-programmable LUTs available in some host architectures (called LUTRAMs) to optimize the implementation of their fine-grained overlay ZUMA [17], getting a ratio down to 40 physical Look-Up Tables (LUTs) per virtual LUT. Koch et al. [19] integrated a fine-grained overlay in the datapath of a MIPS processor to get a portable custom instruction set. They also propose an optimization of the direct mapping of the overlay interconnection network into the switch fabric of the host FPGA. Besides, there also exist research efforts that focus on overlay integration and implementation. Wiersema et al. [21] propose to embed a ZUMA-based vFPGA architecture into their configurable system-on-chip ReconOS. Moreover, in [20], the authors designed a fine-grained overlay with extra routing resources to ease the task of their Just-In-Time synthesizer.

3.2.2. Coarse-grained overlays

The key attraction of coarse-grained overlays is their computational, energy, and software-like engineering efficiency. Fig. 2b) shows the basic architecture of coarse-grained overlays; REs consist of an array of Functional Units (FUs), or Processing Elements (PEs). FUs can execute common word-level operations, including addition, subtraction, and multiplication, compared to the single-bit BLE operation in fine-grained architectures. Some architectures may contain in-tile memories, such as register files, to hold temporary values and instructions. The majority of coarse-grained overlays fall into just two classes, using the classification in [16]: spatially-configured and time-multiplexed overlays. The largest group consists of spatially-configured overlays [16,23–25], where each FU executes a single arithmetic operation and data is transferred over dedicated point-to-point links between FUs. Both the FU and the interconnect remain unchanged while an application is executing, thus supporting the execution of pipelined Data Flow Graph (DFG) applications. The configuration layer is therefore similar to that of the fine-grained overlay, where a specific register is used to configure the entire architecture for implementing a given DFG function. Furthermore, the time-multiplexed coarse-grained overlays [26] behave similarly to multi-core architectures; each tile (FU) contains a specific instruction memory, and is time-multiplexed among multiple operations.

Fig. 2. Basic overlay architecture: a) LUT-based fine-grained VPR compatible architecture similar to ZUMA [17], and b) ALU-based coarse-grained architecture [18].

3.2.3. Summary

To summarize, the overlay architecture can be seen as a set of reconfigurable elements. Fine-grained architectures allow the synthesis of a large range of applications, but suffer from a significant overhead. Coarse-grained architectures are more cost-effective, but it is hard to define a particular configuration that suits a sufficient range of applications for the approach to be viable as a stand-alone solution. In this paper, we focus on exploiting overlay architectures for virtualizing FPGAs, and on extending their functionalities in order to provide an efficient management of applications in a heterogeneous cluster. For this purpose, the target overlay, which serves as a case study, is a fine-grain architecture similar to ZUMA, the most advanced open-source architecture freely available today. Furthermore, our work provides a baseline for future virtual FPGA approaches that may reduce the performance or area overhead through various means.

3.3. Virtual and physical synthesis

The overlay HDL description is automatically generated from a specification of the computation layer. This specification expresses the available resources and their interconnections; the configuration layer is then automatically derived and eventually added to the model. Finally, a model transformation generates a VHDL textual description of the architecture, allowing the overlay module to be instantiated from a user design. The generated RTL code is portable, simulation friendly, and synthesizable. The synthesis of this RTL model for the physical FPGA corresponds to the physical synthesis, and is a step required once each time a new FPGA is introduced to the infrastructure.

The top part of Fig. 3 shows the synthesis flow from the overlay generation to its physical implementation on the FPGA. Synthesizing an application design to the overlay architecture is done in different steps. First, an RTL synthesizer transforms the application description into a netlist composed of latches and arbitrary logic gates. This netlist is then transformed, optimized and mapped to the overlay resources. Next, it is placed and routed on the overlay. Finally, the virtual bitstream (vBitstream) is generated by extracting the configuration of each one of the overlay's resources according to the placed and routed netlist. These four synthesis steps are gathered in one step called “synthesis targeting the overlay” at the bottom of Fig. 3.
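The last of the four steps, vBitstream generation, can be sketched as a simple concatenation of per-resource configuration fields. The resource names, field widths and bit ordering below are hypothetical and only illustrate the idea; the actual layout of the configuration register is not specified here.

```python
def generate_vbitstream(resource_order, config):
    """Concatenate each resource's configuration bits in a fixed chain order.

    resource_order: list of (name, width) pairs, in configuration-chain order.
    config: dict mapping a resource name to its value (int) after place and
            route; resources left unused default to 0.
    """
    bits = []
    for name, width in resource_order:
        value = config.get(name, 0)
        if value >= 1 << width:
            raise ValueError(f"{name}: value {value} does not fit in {width} bits")
        # Most significant bit first within each resource field (an assumption).
        bits.extend((value >> i) & 1 for i in reversed(range(width)))
    return bits


# Hypothetical tiny overlay: one 2-input LUT, its bypass bit, one 4-way routing mux.
order = [("lut0", 4), ("bypass0", 1), ("mux0", 2)]
placed_and_routed = {"lut0": 0b0110, "bypass0": 1, "mux0": 0b10}
vbitstream = generate_vbitstream(order, placed_and_routed)
```

Because the chain order depends only on the overlay description, the same vBitstream is valid on every physical FPGA hosting that overlay.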

3.4. vFPGA architecture evolution

Overlays offer a clear separation of concerns between the architecture (the vFPGA) and the implementation (the FPGA) points of view. As previously mentioned, this promotes stability over time of the architecture, whichever physical FPGA acts as the host platform. Obviously, this does not necessarily mean that no new overlay templates should be made available. The questions are: when does change happen? And how can we ensure that overlays do support changes?

Change may be application driven, with the intent to offer a tailored overlay to new application needs. Yet, developing new overlays, in order to maximize performance for a class of applications, raises no portability issue as the applications are novel with no legacy to preserve. Change may also come from relaxing the implementation constraints, hence reflecting some improvements in the host platform up to the overlay. When the overlays evolve while the applications do not, portability is a critical issue. Two directions have to be considered for overlay scaling.

First, a new FPGA with more abundant resources can host several overlays, appearing as nodes within a hierarchical architecture. This requires updating the architecture supervisor, but programming tools are kept the same. Second, scaling can be absorbed by reshaping the same template, but then binary compatibility is lost.

However, as our framework offers full control over the design phase and operation of overlays, we ensure backward compatibility is preserved. This is illustrated in Fig. 4, in which an A-architecture bitstream is implemented over a B-architecture. A bitstream is first read back to produce a netlist, which is placed and routed over the new overlay, prior to bitstream generation. From a functional point of view, this appears as a model-to-model transformation. This translator can be automatically generated, assuming a tag is inserted within the bitstream to identify the target overlay.

Fig. 3. Two flows: overlay synthesis on the FPGA, and application synthesis on the overlay.

Fig. 4. Ensuring backward bitstream compatibility.

Fig. 5. ZeFF platform overview.

Algorithm 1 illustrates how the netlist is extracted from the bitstream. Note that the bitstream can be extremely noisy; only the information that makes sense is considered (paths ignore any route that does not start/end at a LUT, register or IO). Depending on the new overlay, the netlist is post-processed (technology mapping, LUT/flip-flop packing, etc.).

Algorithm 1: Generate a netlist
Data: Nodes, Nets, Sources
Result: A netlist
initialize Nets and Nodes as empty collections;
initialize Sources as the set of used LUTs plus used IOs;
while Sources is not empty do
    remove s from Sources;
    register s as Node;
    build p, the path from the output pin of s;
    for all the destinations d of p do
        add d to Sources;
    end
    register p as Net;
end
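Algorithm 1 can be sketched as a worklist traversal. The version below assumes the bitstream has already been decoded into a `routes` dictionary (a hypothetical intermediate form), and it adds a visited set, which Algorithm 1 does not mention, so that feedback paths do not cause elements to be processed twice.

```python
from collections import deque


def extract_netlist(used_luts, used_ios, routes):
    """Worklist traversal in the spirit of Algorithm 1.

    routes maps a source element to the elements reached from its output pin,
    as decoded from the bitstream (a hypothetical, pre-decoded form).
    Destinations that are not known LUTs or IOs are treated as noise.
    """
    valid = set(used_luts) | set(used_ios)
    sources = deque(sorted(valid))
    nodes, nets = [], []
    seen = set()                  # added here so feedback loops terminate
    while sources:
        s = sources.popleft()
        if s in seen:
            continue
        seen.add(s)
        nodes.append(s)           # register s as Node
        dests = [d for d in routes.get(s, []) if d in valid]  # drop noisy routes
        sources.extend(dests)     # add destinations to Sources
        if dests:                 # a path without sinks is not kept as a net
            nets.append((s, tuple(dests)))
    return nodes, nets


# Small example with one dangling (noisy) route and one feedback path.
routes = {
    "in0": ["lut0"],
    "in1": ["lut0", "dangling_wire"],
    "lut0": ["out0", "lut0"],
    "dangling_wire": ["lut9"],
}
nodes, nets = extract_netlist(["lut0"], ["in0", "in1", "out0"], routes)
```

The resulting `nodes` and `nets` would then be handed to the post-processing steps (technology mapping, packing) before place and route on the new overlay.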

4. ZeFF deployment platform

As previously discussed, designing a virtual FPGA (vFPGA) on top of COTS FPGAs allows any heterogeneous cluster to be homogenized. However, a natural requirement at this stage is to be able to communicate with the vFPGA layer: load a bitstream into the configuration register, extract the state of the snapshot, stop and run the clock, etc. For this purpose, this section introduces the integration of the vFPGA in a complete System on Chip (SoC), called ZeFF.

4.1. Hardware platform

A vFPGA is seen as an element, integrated along with a configuration controller, some memories and communication devices to form a SoC. To this end, Lagadec et al. [27] introduced ZeFF, a vFPGA host platform synthesized in the physical FPGA, along with its attached virtual FPGA. ZeFF offers monitoring and management facilities (guest configurations and data streams, remote access through a standard Ethernet interface and the TCP/IP protocol). As shown in Fig. 5, the SoC platform embeds, among others: a processor, some memory controllers (e.g. external RAMs, flash memories, and SD cards), and some communication peripherals (e.g. Ethernet, UART).

In this architecture, the vFPGA is wrapped as a peripheral device, connected to the system bus. The wrapper has a Wishbone interface, which maps slave registers in memory: i) to receive from external interfaces the value of the configuration register decoded from the vBitstream, ii) to extract and restore the state of the application, and iii) to run/stop the application clock. Moreover, the virtual inputs/outputs (vIOs) are also mapped in a specific memory area. The access to these vIOs is done through a dedicated DMA (stream in and out).

The ZeFF SoC is minimalist, in order to leave most of the host FPGA resources available to the vFPGA. The soft-core processor is a ZPU, known to be the smallest 32-bit processor supported by gcc. For example, the ZeFF platform without vFPGA occupies only 11% of the total LUTs and 10% of the total flip-flops on the Xilinx Artix-7, and 7% of the combinatorial functions and 2% of the total registers on the Altera Cyclone-IV. Even with its low resource and portability constraints, the SoC can run the real-time operating system FreeRTOS, the FatFs FAT file system module, and the lwIP TCP/IP stack.

The embedded software allows the platform and the vFPGA to be managed through an API and services such as a TFTP server to exchange the vFPGA's configuration and data files, a minimal web server to easily browse the file system, and a local or TELNET shell to issue commands. As a result, the vFPGA management API can be used from the embedded software, from shell commands for human interactions, and through a network protocol on top of TCP for remote machine management.

When porting the whole platform from one FPGA board to another, some parts of the SoC may have to be changed to adapt to different external devices, such as memories or transceivers, which can have different IO interfaces between boards. Therefore, the SoC is organized around a Wishbone shared bus, associated with a dedicated generator, easing the addition and removal of peripherals, thus making the platform more flexible.

4.2. Software platform

The vFPGA architecture that we integrated into ZeFF allows full control of the vFPGA clock and saving the execution state of the virtual fabric. These features help to abstract applications away from the bare-metal FPGA, providing introspection capabilities and flexibility over execution on the fabric. These control mechanisms are orchestrated through ZeFF's processor, giving the software flexibility to manage the virtual hardware execution flow. Software management of the vFPGA can be done at different levels, bringing the following four use cases:

(1) The vFPGA and application management can be entirely done by the embedded operating system running on ZeFF. Applications are treated on the vFPGA the same way common software processes are executed on processors. The OS manages which application runs and when, and also manages data processed by the vFPGA.

(2) The vFPGA management can also be done through an API. In this case, the application developer has to partition the application targeting the vFPGA, and must provide a script making use of the API to control the execution of its application segments. This is similar to a software application segmented into threads, in which the developer is in charge of synchronizing those threads.

(3) Another classical use of this API is software/hardware co-design, where the software part has a more important computation load and only delegates some parts of the processing to the vFPGA. However, ZeFF's soft-core processor (the ZPU) is not suited for computation loads. Running software/hardware applications on ZeFF requires a more powerful processor to be added to the platform via the system bus. It would only process computational tasks, letting the ZPU orchestrate the whole platform.

(4) ZeFF can also serve as an intermediary between the vFPGA and a remote machine, receiving data, virtual configurations and vFPGA management orders from the network. A cluster of ZeFF platforms could then be connected to and managed by a single host computer.

5. New overlay features

Full mastery of the virtual layer architecture makes it possible to integrate into the virtual fabric features that are considered missing in the host, increasing the FPGA capabilities and offering new features: controllability and introspection. In this section, we propose new overlay functionalities enhancing the application management capacity of overlay architectures.

5.1. Pre-configuration

Configuration latency is a challenge in reconfigurable computing, especially for frequent reconfigurations, as it can offset the performance improvement achieved by hardware acceleration. The configuration layer in the overlay can be implemented as a multiple-chain shift register to speed up the dynamic reconfiguration. This register contains the application vBitstream, which is a contiguous sequence of bits that corresponds to the adequate configuration of overlay resources (LUTs, CLBs, FU operations, etc.) to implement a given circuit. However, such an improvement is not sufficient, as the reconfiguration time also depends on the communication interface and the size of the vBitstream to be shifted. In order to achieve a higher efficiency, we propose to add a pre-loading feature to the overlay architecture by duplicating the configuration register, as shown in green in Fig. 6. The idea is to start the transfer of the new configuration file (vBitstream) without affecting the configuration of the executing application. Once the transfer is completed, the overlay switches from the old configuration to the new one in one clock cycle (shift config signal) on demand. In this way, the configuration latency can be neglected, enabling the implementation of cost-effective live migration and scheduling algorithms.
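The pre-loading idea can be illustrated with a small cycle-counting model of a duplicated configuration register. This is a behavioural sketch under assumptions: a single chain, one bit shifted per cycle, and a switch that copies the shadow chain into the active one; the class and method names are illustrative only.

```python
class PreloadableConfig:
    """Two configuration chains: 'active' drives the overlay, 'shadow' is being loaded."""

    def __init__(self, length):
        self.active = [0] * length
        self.shadow = [0] * length
        self.cycles = 0                    # clock cycles spent so far

    def shift_in(self, bit):
        """One clock cycle: push one bit into the shadow chain only."""
        self.shadow = [bit] + self.shadow[:-1]
        self.cycles += 1

    def preload(self, vbitstream):
        """Transfer a whole vBitstream while the active configuration keeps running."""
        assert len(vbitstream) == len(self.shadow)
        for bit in reversed(vbitstream):   # last bit first, so bit 0 ends at index 0
            self.shift_in(bit)

    def switch(self):
        """One clock cycle: the preloaded configuration becomes the active one."""
        self.active = list(self.shadow)    # assumed to be a parallel copy
        self.cycles += 1


cfg = PreloadableConfig(4)
cfg.preload([1, 0, 1, 1])      # 4 cycles; the overlay still runs the old configuration
still_old = list(cfg.active)
cfg.switch()                   # 1 cycle
```

In this model the running application only sees the single switch cycle; the transfer cycles overlap with its execution.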

5.2. The snapshot

In general, applications running on reconfigurable architectures can be represented by the resource configuration and the state of the application. The authors in [28] report two ways of accessing the state of a task that executes on an FPGA:

• By using the Internal Configuration Access Port (ICAP), which is mostly used for Dynamic Partial Reconfiguration (DPR). This solution remains technology and vendor dependent. Additionally, the state is read back along with configuration bits, which leads to a slow extraction process. However, the mechanism is transparent to the application.

• By decorating applications with some access facilities to state bits. This solution is portable, and state extraction is efficient. However, every application has to be reworked, and both area and frequency are impacted.

Fig. 6. Configuration double-chain shift register, reconfigurable elements and application registers, snapshot mechanism.

In an overlay context, configuring the vBitstream on the vFPGA happens as presented in the previous sub-section, while the state of the application is held in memory elements in the computation layer. In this architecture, these memories correspond to the virtual application registers integrated in each reconfigurable element. In order to enable the state extraction and restoring of applications running on overlays, a novel feature extends the proposed overlay architecture: the state layer. This is done by associating one snapshot register with each virtual application register (as shown in red in Fig. 6). Two global signals control the copy of the application register values to their associated snapshot registers (save), or force the snapshot register values into the application registers (restore). All snapshot registers are connected to form one or more shift registers, similarly to the configuration register, allowing the extraction or loading of the overlay state without affecting the execution. Extracting or loading a state snapshot requires several clock cycles (depending on the number of memory elements). However, saving or restoring a snapshot (between the application registers and the snapshot registers) only takes one clock cycle. A dedicated controller ensures the communication with the state layer through both the overlay state in and state out signals.
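The save, restore and shift operations of the state layer can be sketched with the following behavioural model. It assumes a single snapshot chain and uses illustrative names; it is not the authors' implementation, only one way to reproduce the behaviour described above.

```python
class SnapshotLayer:
    """Application registers paired with snapshot registers chained as a shift register."""

    def __init__(self, n):
        self.app = [0] * n         # virtual application registers (running state)
        self.snap = [0] * n        # associated snapshot registers

    def save(self):
        """One cycle: copy every application register into its snapshot register."""
        self.snap = list(self.app)

    def restore(self):
        """One cycle: force every snapshot value back into its application register."""
        self.app = list(self.snap)

    def shift(self, state_in=0):
        """One cycle of the snapshot chain: returns state out, takes state in."""
        state_out = self.snap[-1]
        self.snap = [state_in] + self.snap[:-1]
        return state_out

    def extract(self):
        """Shift the whole snapshot out (n cycles); application registers are untouched."""
        return [self.shift() for _ in range(len(self.snap))]

    def load(self, bits):
        """Shift a previously extracted snapshot back in (n cycles)."""
        for bit in bits:
            self.shift(bit)


# Moving the state of a running application from one vFPGA to another.
node_a = SnapshotLayer(3)
node_a.app = [1, 1, 0]
node_a.save()
bits = node_a.extract()
node_b = SnapshotLayer(3)
node_b.load(bits)
node_b.restore()
```

Only `save` and `restore` touch the application registers, which is why extraction and loading can overlap with execution in this model.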

5.3. Clock management

In order to stop/run the application clock, a modified version of the architecture's application register is proposed. We integrate an enable control signal to ensure clock gating by: i) allowing the propagation of the information (application clock is running), or ii) blocking the information (application clock is stopped).
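As a rough illustration, the enable-based gating can be modelled as a register that either captures its input or holds its value on each clock edge. The class below is a hypothetical behavioural sketch, not a description of the actual register implementation.

```python
class GatedRegister:
    """Application register with an enable input used as a clock gate."""

    def __init__(self):
        self.q = 0

    def clock(self, d, enable):
        """Rising edge: capture d when enabled, otherwise hold the current value."""
        if enable:
            self.q = d
        return self.q


reg = GatedRegister()
reg.clock(d=1, enable=True)    # application clock running: q becomes 1
reg.clock(d=0, enable=False)   # application clock stopped: q is held
```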

5.4. Hardware cost evaluation

The previously presented extra features of the overlay architecture allow the easy management of applications on any FPGA device. In this section, we evaluate the cost of these features for two FPGAs: the Xilinx Artix-7 X7A100T-1CSG324C (Nexys4 board), and the Altera Cyclone-V 5CGXFC9A7U19C8 (APF6-SP board). To this end, we have considered as a case study the ZUMA-like fine-grain overlay (see Section 3.2.1), with two architectures: (1) a 10 × 10 CLBs overlay with 4-input LUTs (K = 4) and 4 LUTs per CLB (N = 4) that fits in both FPGAs, and (2) a 20 × 20 CLBs overlay with 4-input LUTs (K = 4) and 4 LUTs per CLB (N = 4) that allocates more Cyclone-V FPGA resources. The only variable parameter in this experiment is the number of wires (W) in each routing channel. For a given architecture, W has a direct impact on the routability of the overlay and the timing performance of applications; the more wires, the less congestion in routing channels. The hardware cost is computed here as the percentage of the additional occupied resources on each host FPGA.

In this experiment, several syntheses of the overlay, with W ranging from 8 to 24, have been carried out for both FPGAs. Synthesizing a reconfigurable architecture on top of a physical FPGA makes it difficult for classical tools to determine the maximum circuit frequency, as all paths depend on the application that will later be configured. For this purpose, we have used the concept of Virtual Time Propagation Registers (VTPRs), proposed in [29], to break down physical logic chains into short segments, and prevent any combinatorial loop from appearing on the physical FPGA whatever the virtual configuration is. Figs. 7–9 plot the host FPGA resources occupied by the overlay: (i) without any feature (blue), (ii) with snapshot register (green), (iii) with pre-configuration register (orange), and (iv) with both features (brown). As shown in these figures, the snapshot register has a low cost compared to the naive implementation (blue). The average overhead is about 3.52% for the Artix-7 and 2% for the Cyclone-V. This cost is due to the additional single-bit snapshot shift register added in each reconfigurable element, and does not depend on the parameter W. However, it is important to remark that the overhead of the pre-configuration register is much higher; the average overhead is about 37.89% for the Artix-7 and about 20–30% for the Cyclone-V. In fact, this cost is due to the high number of bits required to configure the experimented fine-grained overlay. For instance, the vBitstream for this overlay with W=16 has 22,296 bits: 30.5% for the logic elements and 69.5% for the routing elements. By increasing the number of wires in SB channels, the number of required bits for the configuration increases. In case of a coarser architecture, the snapshot mechanism would have a higher overhead, while the pre-configuration would have less overhead, as in proportion the state is bigger and the configuration is smaller. In fact, the execution state of a coarse-grained overlay is the output words of each FU, while it is composed of only one bit per BLE in a fine-grained architecture. Also, the configuration selects from few operators per FU instead of a full LUT content, and routing is done word-wise instead of bit-wise.

Fig. 7. Occupied resources in terms of total used slices for the synthesis of a 10x10 CLBs overlay on Xilinx Artix-7 varying W.

Fig. 8. Occupied resources in terms of Adaptive Logic Module (ALM) for the synthesis of a 10x10 CLBs overlay on Altera Cyclone-V varying W.

Fig. 9. Occupied resources in terms of Adaptive Logic Module (ALM) for the synthesis of a 20x20 CLBs overlay on Altera Cyclone-V varying W.
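The vBitstream figures quoted for W=16 can be checked with a few lines of arithmetic. The sketch below splits the 22,296 bits according to the stated shares. It also tests one possible explanation of the logic share, namely a 10 × 10 overlay in which each BLE needs 2^K LUT bits plus one flip-flop bypass bit; this per-BLE layout is an assumption made for illustration and is not stated in the text.

```python
# Split of the vBitstream quoted in the text for W = 16.
TOTAL_BITS = 22_296
LOGIC_SHARE = 0.305          # share configuring the logic elements (from the text)

logic_bits = round(TOTAL_BITS * LOGIC_SHARE)
routing_bits = TOTAL_BITS - logic_bits
routing_share = routing_bits / TOTAL_BITS

# Assumption (not from the paper): 10 x 10 CLBs, N = 4 BLEs per CLB, and
# 2**K LUT bits plus one flip-flop bypass bit per BLE, with K = 4.
CLBS, N, K = 10 * 10, 4, 4
assumed_logic_bits = CLBS * N * (2**K + 1)

print(logic_bits, routing_bits, round(100 * routing_share, 1), assumed_logic_bits)
```

Under this assumption the logic share depends only on the CLB count, N and K, so adding wires to the routing channels would grow only the routing part of the vBitstream.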

6. Hardware task management: scheduling and live migration

As reconfigurable architectures are increasingly integrated in hardware designs, sharing of these resources is a key requirement that gives rise to a higher hardware utilization. In this section, we investigate a possible solution to truly share programmable logic in a generic framework, enabling an efficient and cost-effective scheduling and live migration of hardware applications in a cluster of heterogeneous FPGAs.

6.1. Proposed framework

A generic and complete system is proposed and shown in Fig. 10. It is composed of: (i) a hardware layer based on the vFPGA architecture previously presented in Section 3, integrating the proposed features from Section 5, (ii) a hypervisor to control the entire system (Section 4), and (iii) the system memory. The proposed architecture is built upon a smart interface, with master and slave interfaces that can be plugged to a wrapper. The slave interface deals with the control and status registers, while the master interface is responsible for accessing the memory. I/O streams are stored in double ping-pong buffers inside the on-board memory. A dedicated I/O stream controller is responsible for the management of these buffers. An interrupt is generated each time a complete input buffer ('Buff-I') is consumed, or one output buffer ('Buff-O') is filled. This IRQ informs the hypervisor to update the buffers in the memory. We also allocate an application scratchpad memory ('Mem') that corresponds to the on-chip BRAMs in classical FPGA designs.

Fig. 10. Overview of the proposed framework.
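The ping-pong buffering described above can be illustrated with a small software model of the output side. This is a minimal sketch and not the hardware controller: the class name `PingPongBuffer` and the `on_full` callback (standing in for the IRQ sent to the hypervisor) are hypothetical, and the drain is performed synchronously for simplicity.

```python
class PingPongBuffer:
    """Two equally sized halves: one is filled while the other is drained."""

    def __init__(self, size, on_full):
        self.size = size
        self.halves = [bytearray(), bytearray()]
        self.active = 0          # index of the half currently being filled
        self.on_full = on_full   # stands in for the IRQ raised to the hypervisor

    def write(self, data):
        for byte in data:
            half = self.halves[self.active]
            half.append(byte)
            if len(half) == self.size:
                full = self.active
                self.active ^= 1                 # producer moves to the other half
                self.on_full(full, bytes(half))  # hypervisor drains the full half
                half.clear()                     # the half is free again


# Usage: a 4-byte 'Buff-O' pair receiving 10 bytes of application output.
drained = []
out = PingPongBuffer(size=4, on_full=lambda idx, payload: drained.append((idx, payload)))
out.write(b"abcdefghij")
```

In this run the two halves are drained once each, and the last two bytes stay in the first half until it fills up or the buffers are flushed.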

6.2. Tasks scheduling

Fig. 11a) illustrates a typical finite state machine for task execution states: (i) Running, the application is executing on the vFPGA, (ii) Ready, the application is ready to run and the vFPGA is busy, (iii) Blocked, the application is waiting for an event to resume the execution on the vFPGA (e.g. waiting for input data to be provided). At this stage, we focus on the infrastructure itself, and therefore on the main actions required to set up such a scheduling scenario.

First, we define the structure 'Job' (shown in Fig. 11b) to represent the state of the entire system for each application. It includes the following elements:

• Initialization: It refers to the vBitstream file, and to the status registers.

• State: The state at a given time: (i) the value of the snapshot register (circuit state), (ii) the execution time, and (iii) the scratchpad memory.

• Data: This part contains pointers to the remaining input data to manipulate, and to the produced output data.
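As an illustration, the three parts of the Job structure could be grouped as follows. This is a sketch only: the field names are hypothetical, the execution time is assumed to be counted in clock cycles, and the scheduler state of Fig. 11a is stored in the Job for convenience.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    # Initialization: vBitstream and controller status registers
    vbitstream: bytes
    status_registers: dict = field(default_factory=dict)
    # State: circuit state, execution time so far, scratchpad contents
    snapshot: bytes = b""
    executed_cycles: int = 0     # assumption: time counted in clock cycles
    scratchpad: bytes = b""
    # Data: positions of the remaining input and of the produced output
    input_offset: int = 0
    output_offset: int = 0
    # Scheduler state (assumed to live in the Job): Running, Ready or Blocked
    state: str = "Ready"


# Usage: a fresh Job for an overlay whose vBitstream is 5116 bytes long.
job = Job(vbitstream=bytes(5116))
```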

This structure is manipulated by a set of functions (the hypervisor), which ensures the management and communication with the vFPGA and the initialization of the memory. It can be executed on the core, usually combined with the FPGA in modern SoC+FPGA devices, or can be synthesized as a lightweight soft-core IP on the FPGA (Zeff core previously presented in Section 4). As shown in Fig. 11a, two actions are required in order to set up a scheduling algorithm:

(1) Save Job: is called at each change from the Running state to Blocked or Ready. Fig. 11c) plots the sequence diagram, illustrating the set of actions and commands of the hypervisor to save the running Job. First, it sends a stop request to the vFPGA controller to freeze the application clock, and then starts extracting the value of the snapshot register. The next request is dedicated to the system memory in order to read the Job scratchpad memory and to flush the output buffers (Buff-O1 and Buff-O2). Collected data serve to update the Job structure and the input and output files.

Fig. 11. The entire software layer and functions: a) reminder of the classical scheduler FSM, b) the proposed Job structure, c) the sequence diagram of the save job action, and d) the one for the load job action.

Fig. 12. A Total-copy live migration process.

(2) Load Job: is requested by the scheduler each time a new job is elected to be executed (change to Running state). The sequence diagram of this action is shown in Fig. 11d). The hypervisor initializes the controller registers with new initialization information, such as the number of clock cycles to be executed. The application bitstream is then sent to this controller, which in turn pushes the value into the configuration register of the vFPGA. This step can be done upstream by sending it to the pre-configuration register before saving the Job, which results in reducing the scheduling overhead. The scratchpad memory is next restored and the double input buffers are filled. Finally, a request is sent to the controller to start the clock again.
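The two actions can be sketched in software against a stand-in for the hardware. Everything below is hypothetical: `FakeVFPGA` models the controller and the system memory as plain attributes, the Job is a dictionary, and the bookkeeping of partially consumed input buffers is deliberately left out.

```python
class FakeVFPGA:
    """Hypothetical stand-in for the vFPGA controller plus system memory."""

    def __init__(self):
        self.clock_running = False
        self.registers = {}
        self.config = b""               # configuration register (vBitstream)
        self.snapshot = b""             # snapshot register (circuit state)
        self.scratchpad = b""
        self.in_buffers = [b"", b""]    # Buff-I1, Buff-I2
        self.out_buffers = [b"", b""]   # Buff-O1, Buff-O2


def save_job(job, hw):
    hw.clock_running = False                     # stop request: freeze the clock
    job["snapshot"] = hw.snapshot                # extract the snapshot register
    job["scratchpad"] = hw.scratchpad            # read the scratchpad memory
    job["output"] += b"".join(hw.out_buffers)    # flush Buff-O1 and Buff-O2
    hw.out_buffers = [b"", b""]


def load_job(job, hw, buf_size):
    hw.registers = dict(job["registers"])        # e.g. number of cycles to run
    hw.config = job["vbitstream"]                # push the vBitstream
    hw.snapshot = job["snapshot"]                # restore the circuit state
    hw.scratchpad = job["scratchpad"]            # restore the scratchpad
    data, pos = job["input"], job["input_pos"]
    hw.in_buffers = [data[pos:pos + buf_size],   # fill both input buffers
                     data[pos + buf_size:pos + 2 * buf_size]]
    job["input_pos"] = min(len(data), pos + 2 * buf_size)
    hw.clock_running = True                      # restart the application clock
```

With the pre-configuration register, the `hw.config` assignment would be issued before `save_job` of the outgoing Job, so that only the remaining steps contribute to the switch time.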

6.3. Live migration

Live migration of applications is a technique widely used in datacenters to move a virtual machine (VM) from one physical host to another. The mechanism is the same as for the previously presented scheduling, except that the application is restored on a different fabric than the one it was previously executing on. The most frequently used migration algorithm is shown in Fig. 12: a Total-Copy process transfers the entire state and the data before the process execution is resumed on the destination FPGA [30].

The following actions are performed successively to migrate a Job from FPGA1 to FPGA2:

(1) Reservation: The cluster manager first searches for the host with the lowest workload, and sends a request to reserve the resource.

(2) Stop and Save the Job: The manager performs the same Save Job action, previously specified, in order to stop and update the Job structure.

(3) Job Copy: The entire Job is copied from the host FPGA1 to FPGA2. The time required to complete this action depends on: i) the size of the Job, which is a function of the vFPGA specification, and ii) the cluster network bandwidth.

(4) Data Copy: The data manipulated by the Job are then copied to FPGA2. The time required to complete this action also depends on the size of the data file, and on the network bandwidth.

(5) Load and Resume Job: At this stage, the manager requests a Load Job action to initialize the vFPGA with the new job and to resume the execution on the new host. This action is also detailed in the previous sub-section.
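The five steps above can be summarized as a short planning routine. This is a sketch under simplifying assumptions: the function names are hypothetical, the workload is a single number per host, and the copy time is modelled as size divided by bandwidth, ignoring protocol latency.

```python
def pick_target(workloads, exclude):
    """Step 1: choose the host with the lowest workload (source excluded)."""
    candidates = {name: load for name, load in workloads.items() if name != exclude}
    return min(candidates, key=candidates.get)


def copy_time_s(size_bytes, bandwidth_bytes_per_s):
    """Idealized transfer time: size over bandwidth, no protocol latency."""
    return size_bytes / bandwidth_bytes_per_s


def total_copy_plan(job_bytes, data_bytes, workloads, src, bandwidth_bytes_per_s):
    dst = pick_target(workloads, exclude=src)                             # (1) Reservation
    return dst, [
        ("save_job", src),                                                # (2) Stop and save
        ("copy_job", copy_time_s(job_bytes, bandwidth_bytes_per_s)),      # (3) Job copy
        ("copy_data", copy_time_s(data_bytes, bandwidth_bytes_per_s)),    # (4) Data copy
        ("load_job", dst),                                                # (5) Load and resume
    ]


# Usage with made-up workloads, sizes (bytes) and a made-up 1 MB/s link.
dst, plan = total_copy_plan(
    job_bytes=5208, data_bytes=1_000_000,
    workloads={"fpga1": 0.9, "fpga2": 0.2, "fpga3": 0.5},
    src="fpga1", bandwidth_bytes_per_s=1_000_000,
)
```

Because the whole state is transferred before execution resumes, the service interruption in this scheme is the sum of the save, copy and load times.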

6.4. Demonstration: task migration

This system has been demonstrated at the international Conference on Design & Architectures for Signal & Image Processing [31], in order to illustrate how to offer:

• a homogeneous view of a heterogeneous set of FPGAs;

• the live migration of a hardware application between two nodes;

• fault tolerance of an overlay cluster.

The setup includes two FPGAs from two vendors (Xilinx and Altera) as compute nodes, and a host PC as a controller. Fig. 13 shows the experimental setup. The two FPGAs are connected to the host PC through Ethernet; each one is attached to an ARM processor running a local hypervisor, which transfers the host management requests to the FPGA. For this demonstration, the processors are also used to display the image being processed, so that the audience can visualize the progress of the job execution.

The experimented vFPGA is a 14 × 13 CLBs ZUMA-like overlay architecture, and has 728 4-input LUTs and FFs. This architecture takes: (i) 64% of the total LUTs (40935) and 41% of the total FFs (52258) on the Xilinx Nexys-4 X7A100T-1CSG324C, and (ii) 30% of the total logic elements (34352) and 55% of the total FFs (63140) on the Cyclone V 5CGXFC9A7U19C8. The sizes of the vBitstream and the snapshot for this vFPGA are 5116 Bytes and 92 Bytes, respectively. The time required to configure, save and restore the vFPGA depends on several parameters and is studied in detail in the next section. In this demonstration, three image processing applications were synthesized, placed and routed for the vFPGA (synthesis results are shown in Table 1): (i) Sobel, which creates an image emphasizing edges based on the Sobel-Feldman operator, (ii) Smoothing, which uses a matrix to smooth the image data set, and (iii) a Sharpening operation of the image.

Fig. 13. Experimental setup of the DASIP demonstration.

Table 1
Image processing applications synthesized for a demonstration on a 14 × 13 fine-grained vFPGA.

Application   Virtual synthesis                               F max
              vLUTs         vFFs          vLUT-FF pairs       X7A100T-1CSG324C   5CGXFC9A7U19C8
Sobel         567 (77.9%)   195 (26.8%)   195 (34.4%)         2.7 MHz            1.8 MHz
Smoothing     340 (46.7%)   116 (15.9%)   116 (34.1%)         3.33 MHz           2.16 MHz
Sharpening    376 (51.6%)   244 (64.9%)   132 (35.1%)         3.22 MHz           2.09 MHz
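The snapshot size quoted for this vFPGA can be related to its dimensions with a short calculation. The sketch below counts one state bit per BLE, as stated for fine-grained overlays, and then pads the register to whole 32-bit words; the padding rule is an assumption that happens to reproduce the 92 Bytes figure.

```python
import math

ROWS, COLS, BLES_PER_CLB = 14, 13, 4
state_bits = ROWS * COLS * BLES_PER_CLB   # one flip-flop bit per BLE

WORD_BITS = 32   # assumption: the snapshot register is read in 32-bit words
snapshot_words = math.ceil(state_bits / WORD_BITS)
snapshot_bytes = snapshot_words * WORD_BITS // 8

print(state_bits, snapshot_words, snapshot_bytes)
```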

The first goal is to show binary compatibility between both FPGAs. To this end, the same vBitstream of one of the applications is dispatched by the host PC to both FPGAs. The result of the image filtering then starts appearing on both screens at the same time. This compatibility is ensured thanks to the vFPGA overlay architecture deployed on both physical FPGAs.

The live migration is then demonstrated by running the filter application on the first node, halting the application execution, capturing the execution state of the node's overlay, and then restoring the state of the vFPGA on the second node. The application resumes filtering the image on the second node, as shown in Fig. 13.

Finally, distributing computations over a set of nodes can speed up the execution or guarantee fault tolerance. Fault tolerance at the cluster level is illustrated by running one application on one node. The host controller periodically backs up the execution state of the running node (every second in this demonstration). Then the power of the running node is shut down. When the host controller notices that the running node has disappeared (the node no longer responds to heartbeat pings), the host sends the Job of the interrupted application along with the last execution state backup to the second node. The execution resumes on the second node from the last backup.
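The backup and failover behaviour can be mimicked with a small deterministic simulation. It is only a toy model: time advances in abstract steps, node "A" loses power at step `fail_at`, and the function and parameter names are hypothetical.

```python
def run_with_failover(total_steps, backup_period, fail_at):
    """Run one job on node A; node B resumes from the last backup after a failure."""
    progress, backup, node, t = 0, 0, "A", 0
    log = []
    while progress < total_steps:
        if node == "A" and t == fail_at:        # power cut: heartbeat is lost
            node, progress = "B", backup        # resume from the last backup
            log.append((t, "failover", progress))
        else:
            progress += 1                       # one unit of work
            if node == "A" and progress % backup_period == 0:
                backup = progress               # periodic state backup on the host
        t += 1
    return node, t, log


# Usage: 10 units of work, a backup every 3 units, power cut at step 5.
result = run_with_failover(total_steps=10, backup_period=3, fail_at=5)
```

The work done between the last backup and the failure is executed twice, which is why a shorter backup period reduces the recovery cost at the price of more frequent snapshots.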

6.5. Timing models

Neither efficient scheduling nor meaningful task migration can be achieved without an early knowledge of the reconfiguration time overhead. To this end, we have developed a cost model in terms of the time required for each component to perform both actions, save and load job, as previously presented. We first define the following variables:

• $S_{snap}$ and $S_{config}$: the size of the snapshot register and the vBitstream (in Bytes), which are functions of the overlay granularity and parameters.

• $S_{scratch}$ and $S_{Buff}$: the size in bytes of the scratchpad memory, and the input buffers ('Buff-I1' and 'Buff-I2' in the previous Fig. 10).

• $S_{d\_out}$: the size in bytes of the available data in the output buffer ('Buff-O1' or 'Buff-O2').

6.5.1. Save job

As can be deduced from Fig. 11c), the time taken to save the Job is a function of the time needed to extract the snapshot register ($T_{s\_snap}$), to save the application scratchpad memory ($T_{s\_scratch}$) and to flush out the output buffer ($T_{s\_BuffO}$). This function is described in equation (1), where $T_s^0$ is the constant part, and corresponds to the time required by the hypervisor (or core) to execute the request verification code (e.g. checking the compatibility between the bitstream of the application and the available vFPGA architecture, etc.). The cost model of the snapshot extraction is linearly approximated in terms of $S_{snap}$ as shown in this equation, where $L^r_{vFPGA}$ is the latency for a read operation of one word from the hardware layer. $T_{s\_scratch}$ and $T_{s\_BuffO}$ are also linear functions of the latency for a read operation from the on-board memory, $L^r_{Mem}$.

$$T_s = T_s^0 + L^r_{vFPGA} \times S_{snap} + L^r_{Mem} \times S_{scratch} + L^r_{Mem} \times S_{d\_out} \qquad (1)$$

6.5.2. Load job

The time required to complete the load job action can also be deduced from Fig. 11d). Equation (2) describes it as a function of the time needed to: (i) configure the vFPGA ($T_{config}$), (ii) restore the snapshot ($T_{rest\_snap}$), (iii) load the application scratchpad memory ($T_{l\_scratch}$) and (iv) fill both input buffers ($T_{l\_BuffI}$). $T_l^0$ is the constant part, and also corresponds to the time of the job verification. $L^w_{vFPGA}$ and $L^w_{Mem}$ are the latencies for a write operation of one word to the hardware architecture and to the memory, respectively.

$$T_l = T_l^0 + L^w_{vFPGA} \times S_{config} + L^w_{vFPGA} \times S_{snap} + L^w_{Mem} \times S_{scratch} + L^w_{Mem} \times (2 \times S_{Buff}) \qquad (2)$$
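Both cost models translate directly into code. The sketch below is one possible transcription: the dictionary keys are hypothetical names for the model parameters, and sizes given in bytes are converted to 32-bit words before being multiplied by a per-word latency, which is an assumption about how the units are meant to combine.

```python
WORD_BYTES = 4   # assumption: latencies are expressed per 32-bit word


def words(size_bytes):
    """Number of 32-bit words needed to hold size_bytes (ceiling division)."""
    return -(-size_bytes // WORD_BYTES)


def save_time(p, s_snap, s_scratch, s_d_out):
    """Equation (1): constant part plus three read transfers."""
    return (p["Ts0"]
            + p["Lr_vfpga"] * words(s_snap)
            + p["Lr_mem"] * words(s_scratch)
            + p["Lr_mem"] * words(s_d_out))


def load_time(p, s_config, s_snap, s_scratch, s_buff):
    """Equation (2): constant part plus four write transfers."""
    return (p["Tl0"]
            + p["Lw_vfpga"] * words(s_config)
            + p["Lw_vfpga"] * words(s_snap)
            + p["Lw_mem"] * words(s_scratch)
            + p["Lw_mem"] * words(2 * s_buff))


# Usage with placeholder parameters (unit latencies, arbitrary constants).
params = {"Ts0": 10.0, "Tl0": 20.0, "Lr_vfpga": 1.0, "Lw_vfpga": 1.0,
          "Lr_mem": 1.0, "Lw_mem": 1.0}
t_save = save_time(params, s_snap=8, s_scratch=4, s_d_out=4)
t_load = load_time(params, s_config=8, s_snap=8, s_scratch=4, s_buff=4)
```

The result is expressed in the time unit of the parameters, so a scheduler can compare it directly with the quantum it intends to grant.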

6.6. Reconfiguration time estimation

This section reports our experiments evaluating the accuracy of the proposed cost models, and explores the overhead of implementing scheduling algorithms in the proposed framework. The setup used in our experiments is the APF6-SP SoC+FPGA platform from Armadeus, which is composed of the i.MX6 Cortex-A9 processor and the 5CGXFC9A7U19C8 Cyclone V GX FPGA from Altera. The i.MX6 implements our proposed hypervisor, while the hardware architecture is synthesized for the Cyclone V FPGA. PCI-express Gen1 is used for the communication between the hypervisor and the vFPGA, and the on-board DDR3 is used to implement the system memories.

Fig. 14. Model fitting for both: a) Save job time, and b) Load job time.

Table 2
Estimated parameters for the APF6 SP target SoC+FPGA.

Parameter       Value [µs/32-bit]   Constant   Value [µs]
$L^r_{vFPGA}$   2.37                $T_s^0$    1300
$L^w_{vFPGA}$   0.96                $T_l^0$    3700
$L^r_{Mem}$     2.34                –          –
$L^w_{Mem}$     0.26                –          –

6.6.1. Parameters estimation and models accuracy

The first experiment aims at estimating the model parameters. To this end, we implement the entire system on the APF6-SP platform, and we use the CLOCK_MONOTONIC clock to measure each timing component. Several system configurations are also considered in order to find the model parameters: $S_{Buff}$ from 1 KB to 1 MB, $S_{scratch}$ from 64 Bytes to 124 KB, $S_{config}$ from 1 KB to 18 KB, and $S_{snap}$ from 20 Bytes to 340 Bytes. Table 2 reports the parameters estimated by linear regression in Matlab over more than 10,000 measurements. $L^w_{vFPGA}$ is lower than $L^r_{vFPGA}$, due to the fact that the PCI write burst is generated by the SoC, while the read burst was not supported in the experimented architecture. Similarly, $L^r_{Mem}$ and $L^w_{Mem}$ are estimated from the access to the on-board DDR.

The model's accuracy is evaluated based on the following two metrics: (i) the square of the correlation between the measured and estimated values, R² (closer to 1 indicates that the model fits better), and (ii) the Mean Absolute Error (MAE), which is the average of the absolute error of the regression (closer to 0 is better). Fig. 14a) and b) plot the estimated time compared to the measured time for the save Job and load Job, respectively. As can be noticed, the proposed cost models accurately estimate the time required to load and save jobs: R² is higher than 0.99, with an MAE of about 0.85 ms and 1.36 ms for $T_s$ and $T_l$, respectively.
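For reference, the two accuracy metrics can be computed as follows. This is a plain-Python sketch of the definitions given above (squared Pearson correlation and mean absolute error), not the Matlab code used in the experiments, and the sample values are made up.

```python
from statistics import mean


def r_squared(measured, estimated):
    """Square of the Pearson correlation between the two series."""
    m_mean, e_mean = mean(measured), mean(estimated)
    cov = sum((m - m_mean) * (e - e_mean) for m, e in zip(measured, estimated))
    var_m = sum((m - m_mean) ** 2 for m in measured)
    var_e = sum((e - e_mean) ** 2 for e in estimated)
    return cov * cov / (var_m * var_e)


def mean_absolute_error(measured, estimated):
    """Average of the absolute differences between the two series."""
    return mean(abs(m - e) for m, e in zip(measured, estimated))


# Usage on made-up timings (ms): the estimate is offset by a constant 0.5 ms.
measured = [1.0, 2.0, 3.0, 4.0]
estimated = [1.5, 2.5, 3.5, 4.5]
r2 = r_squared(measured, estimated)
mae = mean_absolute_error(measured, estimated)
```

The example shows why both metrics are reported together: a constant offset leaves the correlation perfect while the MAE still exposes the error.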

6.6.2. Equal time round robin scenarios

At this stage, we aim to evaluate the reconfiguration time overhead. To this end, we implement classical scheduling algorithms for a 14x13 CLBs ZUMA-like overlay on the APF6 SoC + Cyclone V C9 FPGA, with 4 BLEs in each CLB and 4-input LUTs. The pre-loading feature is first disabled. In our system, the nature of the application to be mapped on the overlay is defined by its bandwidth, or the frequency of producing data $F_{out}$. In this experiment, we choose different applications, as shown in Table 3, corresponding to different bandwidths.

We first implement the Equal Time Round Robin (ETRR) scheduling, which is a cyclic executive process without priority. It allocates a unique time-slice to each running Job; this is called the Quantum (denoted Q). The execution profile of these jobs for an ETRR with Q=250 ms is shown in Fig. 15. As can be noticed in this figure, Job 3, corresponding to the highest $F_{out}$, fills the output buffer 5 times, activating an interrupt signal to fill the input buffer (DMA IN update event) and to empty the output buffer (DMA OUT update event). Moreover, the measured time between two context switches is variable, depending on: (i) the time needed to empty the output buffer, where the size of the produced data is a function of $F_{out}$, (ii) the size of the old application scratchpad memory to save, (iii) the size of the new application scratchpad memory to load, and (iv) the time needed to fill the input buffers. The measured total context switch time is between 11 ms and 50 ms, corresponding to 4%–20% of performance overhead.
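The ETRR policy itself is simple to express. The following sketch simulates it with a fixed context switch cost; in the measured system this cost varies between switches, and the function and parameter names are hypothetical.

```python
from collections import deque


def etrr(work_ms, quantum_ms, switch_ms):
    """Equal Time Round Robin: every job gets the same quantum, no priority."""
    queue = deque(work_ms.items())
    clock, finish = 0.0, {}
    while queue:
        name, remaining = queue.popleft()
        run = min(quantum_ms, remaining)
        clock += run
        remaining -= run
        if remaining > 0:
            queue.append((name, remaining))   # back to the end of the queue
        else:
            finish[name] = clock              # completion time of this job
        if queue and queue[0][0] != name:
            clock += switch_ms                # save the old Job, load the next one
    return finish


# Usage: two jobs, Q = 250 ms, and an assumed constant 10 ms context switch.
finish_times = etrr({"a": 500, "b": 250}, quantum_ms=250, switch_ms=10)
```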

Moreover, we aim at comparing different configurations of ETRR to the basic First Come First Serve (FCFS) scheduling algorithm. The normalized total execution time is shown in Fig. 16, which is the time taken by all jobs to complete their computation on 4 GB of input data. FCFS requires 499.5 s to sequentially execute all Jobs, while the total time for ETRRs is between 504 and 560 s depending on the configuration of the system. From this experiment, we conclude that the size of both input and output buffers and the choice of the Quantum Q have significant impacts on the system performance, up to 12% of reconfiguration time overhead.
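The overhead percentages reported in this subsection follow from simple ratios. The sketch below reproduces them under two assumptions about how they were obtained: the context switch overhead is taken relative to the quantum Q, and the ETRR overhead relative to the FCFS total time.

```python
# Context switch time relative to the quantum (assumed definition).
Q_MS = 250.0
switch_ms = (11.0, 50.0)
switch_overhead = [t / Q_MS for t in switch_ms]

# Extra total execution time of ETRR relative to FCFS (assumed definition).
FCFS_S = 499.5
etrr_s = (504.0, 560.0)
etrr_overhead = [(t - FCFS_S) / FCFS_S for t in etrr_s]

print([round(100 * x, 1) for x in switch_overhead])
print([round(100 * x, 1) for x in etrr_overhead])
```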

6.6.3. Overlay reconfiguration time

In this paper, we have experimented with our virtualization approach using a fine-grained vFPGA architecture, which is a naive and generic overlay allowing the synthesis of a large spectrum of applications. However, the experimented architecture was not optimized to efficiently exploit physical FPGA resources. In this section, we use the previous timing models to extrapolate the impact of the complexity of overlay architectures on both the total time to load a job ($T_l$) and the total time to save a job ($T_s$). For this purpose, both the size of the input/output buffers and the scratchpad memory are constant in this experiment. We recall that overlay-based architectures are characterized by: (1) the size of their configuration register ($S_{config}$), and (2) the size of the snapshot register ($S_{snap}$).

Fig. 17 plots each component of the total time for a load job, considering different sizes of overlays; we have increased the matrix size of resource elements on the ZUMA-like architecture. As can be seen in this figure, the time to configure the overlay ($T_{config}$) is the most significant part (more than 90%) of the total time to load the job. This motivates the need for a bitstream pre-loading mechanism in order to tackle this challenge. The proposed non-preemptive pre-loading feature reduces the total reconfiguration time overhead by up to 99%, as the bitstream configuration is done upstream. Moreover, the time to restore the snapshot ($T_{rest\_snap}$) modestly increases with the vFPGA matrix size: $T_{rest\_snap}$ is estimated at 9.89 ms for the biggest architecture with 90k LUTs and FFs.

Table 3
Applications for a 14 × 13 ZUMA-like fine-grained overlay architecture with: 4 BLEs per CLB and LUT4.

Job id   App. name    $F_{out}$ (KB/s)   $S_{scratch}$ (Bytes)   Description
0        cordic       15.2               0                       Trigonometric function, 16 bits operations.
1        IIR filter   11.9               64                      IIR filter 4th order, 9 16 bits signed multiplication and addition, 20 bits accumulation.
2        mult         120                0                       Combinatorial multiplication, 16 bits operands and 32 bits result.
3        smult        1031.7             0                       Pipelined multiplication, 8 bits operands and 16 bits result.
4        sobel        83.96              4096                    Sobel filter for image processing, 3 × 3 pixels operations and 16 bits per pixel.

Fig. 15. Execution of the jobs for an ETRR scheduling with Q = 250 ms.

Fig. 16. The total execution time of different configurations of ETRR compared to the FCFS scheduling.

Fig. 17. The time for a load job: the architecture configuration time ($T_{config}$), the time to load both scratchpad and input buffers, and the time to restore the snapshot ($T_{rest\_snap}$).

Moreover, for the save job action, the overlay architecture only affects the time to save the snapshot. Fig. 18 shows both components of the total job saving time, considering different sizes of overlay architecture. Indeed, $T_{s\_snap}$ becomes significant for the biggest overlay architectures; however, it remains negligible compared to the total time of a load job for the same architecture.

Fig. 18. The time for a save job: the time to save the scratchpad memory and to empty the output buffer, and the time to save the snapshot ($T_{s\_snap}$).

7. Conclusion

This paper has presented a new hardware virtualization approach for heterogeneous FPGA clusters. When FPGAs are gradually updated and replaced to follow the technology evolution over the lifetime of the infrastructure, overlays have been shown to improve portability, speed up reconfiguration, and promote resource abstraction. The proposed idea is to map a second layer of reconfigurable resources on top of the commercial off-the-shelf (COTS) FPGAs in order to homogenize the infrastructure. This work has also demonstrated how slightly extending the overlay architecture can bring novel features for the sake of improved management of applications. The proposed platform was capable of node-to-node application migration. We also presented accurate linear models for the estimation of the reconfiguration time overhead. The design of efficient scheduling and live migration algorithms and systems for any FPGA platform can use the presented models for the overhead estimation: few measurements are required to adapt the models' parameters. In future work, we wish to develop power models for the dynamic reconfiguration. We will also investigate the development of efficient scheduling and load balancing, taking into account both performance and power as targets for optimization.