Implementing Wilson-Dirac Operator on the Cell Broadband Engine


HAL Id: inria-00203478

https://hal.inria.fr/inria-00203478

Submitted on 10 Jan 2008


Khaled Z. Ibrahim, François Bodin

To cite this version:

Khaled Z. Ibrahim, François Bodin. Implementing Wilson-Dirac Operator on the Cell Broadband Engine. [Research Report] PI 1880, 2007, pp.23. ⟨inria-00203478⟩


Communicating Systems (Systèmes communicants)

CAPS project

Internal publication (Publication interne) No. 1880, December 2007, 23 pages

Abstract: Computing the actions of Wilson-Dirac operators consumes most of the CPU time for the grand challenge problem of simulating Lattice Quantum Chromodynamics (Lattice QCD). This routine is challenging to implement on most computational environments because the same data are accessed with multiple patterns, which makes it difficult to align the data efficiently at compile time. Additionally, the low ratio of computation to memory access makes this computation bound by both memory bandwidth and memory latency.

In this work, we present an implementation of this routine on the Cell Broadband Engine. We propose runtime data fusion, an approach that aligns at runtime the data that cannot be aligned optimally at compile time, to improve SIMDized execution.

We also show a DMA optimization technique that reduces the impact of bandwidth limits on performance. Our implementation of this routine achieves 31.2 GFlops for single precision computations and 8.75 GFlops for double precision computations.

Key-words: IBM Cell BE, Vectorization, SIMD, Lattice QCD, Parallel Algorithms, Wilson-Dirac


Résumé: Implementing the Wilson-Dirac operator is one of the most expensive computations in lattice QCD. This computation raises many implementation challenges related to data access. Indeed, the multiple access patterns make it difficult to optimize data alignment for efficient memory accesses. Moreover, the ratio of computation to memory access saturates the memory bandwidth, which limits the utilization of the compute units.

This report presents an implementation of the Wilson-Dirac operator on the Cell architecture. Runtime data fusion techniques are proposed to resolve the data alignment constraints and to exploit the SIMD operators of the Cell compute units. In addition, a technique for optimizing DMA transfers is described. Our implementation reaches 31.2 GFlops in single precision and 8.75 GFlops in double precision.

Mots clés (keywords): IBM Cell, Vectorization, SIMD, Lattice QCD, Parallel algorithm, Wilson-Dirac

Figure 1: 4-dimensional space-time lattice QCD.

1 Introduction

Efficient implementation for computing the action of Wilson-Dirac operators is of critical importance for the simulation of lattice Quantum Chromodynamics (Lattice QCD). Simulating Lattice QCD aims at understanding the strong interactions that bind quarks and gluons together to form hadrons. In lattice QCD, a four-dimensional space-time continuum is simulated, where quantum fields (quarks) are represented at the lattice sites and quantum fields (gluons) are represented at the links between these sites. The lattice spacing should be small to obtain reliable results, which requires an enormous amount of computation. Figure 1 shows the discretization of the four-dimensional space-time of lattice QCD.

Many researchers have experimented with accelerators for scientific computing. Among the recently attractive technologies are graphics processing units (GPUs) and the Cell Broadband Engine. The Lattice QCD community, as well as other high performance computing communities, has started exploring the possibility of using these accelerators to build cost-effective supercomputers to simulate these problems.

Using GPUs, for instance, has been investigated [1, 2], especially with the advent of general purpose programming environments such as CUDA [3] for graphics cards. The main challenge in these environments is the over-protection that most manufacturers adopt to hide their proprietary internal hardware design.

The use of the Cell Broadband Engine is also under consideration by many Lattice QCD groups. An analytical model to predict the performance limits of simulating lattice QCD has been developed [4]. Some simplified computation has also been ported to the Cell [5]. These studies affirmed that the computation of lattice QCD is bandwidth limited (or memory bound) and tried to predict the performance of a real implementation.

In this study, we introduce an implementation of the main kernel routine for simulating Lattice QCD. In this implementation, we tried to answer two main questions: the first is how to SIMDize the computation in an efficient way; the second is how to distribute the lattice data and how to handle memory efficiently.

For efficient SIMDization, we introduce the notion of runtime data fusion to align at runtime the data that cannot be aligned optimally at compile time. Furthermore, while allocating lattice data in main memory, we introduce an analysis of the data at the frames level to create optimized DMA requests that remove redundant data transfers and improve the contiguity of memory accesses.

The rest of this report is organized as follows: Section 2 introduces the Cell Broadband Engine architecture and the software development environment. Section 3 introduces the Wilson-Dirac computation kernel. The SIMDization problem is tackled in Section 4. Section 5 details the proposed memory layout and the analysis leading to optimizing the memory transfers. We comment on the utilization of the Cell BE in Section 6. Section 7 concludes this report.

2 Cell Broadband Engine and Software Development Environment

In this study, we target developing an efficient implementation of the main kernel routine for simulating Lattice QCD on the Cell Broadband Engine (BE). We used IBM Cell BE SDK 3.0 [6]. We explored our implementation on the current Cell BE as well as on the future-generation Cell with enhanced double precision (EDP). For Cell EDP, the performance numbers are only estimates based on information collected from the simulator.

We used the simulator provided by the SDK to analyze the performance of our implementation, and we verified the performance, except for Cell EDP, on an IBM BladeCenter® QS20 system with dual Cell BE processors (running at 3.2 GHz).

Figure 2 outlines the basic components of the Cell BE processor. The Cell BE chip is composed of multiple heterogeneous cores: a PowerPC-compatible master processor with dual SMT (PPE) and eight synergistic processing elements (SPEs).

The execution unit on the PPE can handle control-flow-intensive code, while the execution unit on the SPE is optimized to handle SIMD computations. Each SPE has a register file of 128 registers, each 16 bytes wide. The SPE has a small (256 KB) special memory called the local store, which the execution unit can access with a pipelined latency of one cycle.

The main data are usually stored in external memory and are transferred to and from it through special DMA APIs. Each SPE has two pipelines: one specializes mainly in integer and floating point operations (even pipeline), and the other specializes mainly in shuffling, branching, and load/store operations (odd pipeline).

Figure 2: Cell Broadband Engine

3 Wilson-Dirac Operator

In this study, we ported the computation of the actions of the Wilson-Dirac operator on a spinor field, based on the code of the ETMC collaboration; see for instance [7, 8].

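Computing the actions of the Wilson-Dirac operator is the most time-consuming operation in simulating lattice QCD. Equation 1 details the computation of the actions of the Wilson-Dirac operator. This computation involves a sum over the quark field ($\psi_i$) multiplied by a gluon gauge link ($U_{i,\mu}$) through the spin projector ($I \pm \gamma_\mu$).

\[
\chi_i = \sum_{\mu \in \{x,y,z,t\}} \kappa_\mu \left\{ U_{i,\mu}\,(I - \gamma_\mu)\,\psi_{i+\hat{\mu}} + U^{\dagger}_{i-\hat{\mu},\mu}\,(I + \gamma_\mu)\,\psi_{i-\hat{\mu}} \right\} \tag{1}
\]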

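where

\[
\gamma_x = \begin{pmatrix} 0 & 0 & 0 & i \\ 0 & 0 & i & 0 \\ 0 & -i & 0 & 0 \\ -i & 0 & 0 & 0 \end{pmatrix},\quad
\gamma_y = \begin{pmatrix} 0 & 0 & 0 & -1 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ -1 & 0 & 0 & 0 \end{pmatrix},\quad
\gamma_z = \begin{pmatrix} 0 & 0 & i & 0 \\ 0 & 0 & 0 & -i \\ -i & 0 & 0 & 0 \\ 0 & i & 0 & 0 \end{pmatrix},\quad
\gamma_t = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix},
\]

$I$ is the identity matrix, and $\kappa_\mu$ represents the hopping term.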
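As a consistency check on the gamma matrices given above, the following sketch verifies two properties that the spin projector relies on: each $\gamma_\mu$ squares to the identity, and $(I - \gamma_\mu)(I + \gamma_\mu) = 0$. It is plain Python written for illustration, not part of the ported code; the helper names (`matmul`, `combine`, `apply`) and the sample spinor values are assumptions.

```python
# Gamma matrices as transcribed from the text; plain Python, no external libraries.
i = 1j

GAMMA = {
    "x": [[0, 0, 0, i], [0, 0, i, 0], [0, -i, 0, 0], [-i, 0, 0, 0]],
    "y": [[0, 0, 0, -1], [0, 0, 1, 0], [0, 1, 0, 0], [-1, 0, 0, 0]],
    "z": [[0, 0, i, 0], [0, 0, 0, -i], [-i, 0, 0, 0], [0, i, 0, 0]],
    "t": [[0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]],
}
IDENTITY = [[1 if r == c else 0 for c in range(4)] for r in range(4)]
ZERO = [[0] * 4 for _ in range(4)]


def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def combine(a, b, sign):
    """Return a + sign * b, element by element."""
    return [[a[r][c] + sign * b[r][c] for c in range(4)] for r in range(4)]


def apply(m, psi):
    """Apply a 4x4 matrix to the four spin components of psi."""
    return [sum(m[r][c] * psi[c] for c in range(4)) for r in range(4)]


for mu, g in GAMMA.items():
    minus = combine(IDENTITY, g, -1)    # I - gamma_mu
    plus = combine(IDENTITY, g, +1)     # I + gamma_mu
    assert matmul(g, g) == IDENTITY     # gamma_mu squared is the identity
    assert matmul(minus, plus) == ZERO  # the two projectors annihilate each other

# In the t direction the projected spinor has only two independent components:
# components 2 and 3 are the negatives of components 0 and 1.
psi = [1 + 2j, 3 - 1j, -2 + 0.5j, 4j]   # arbitrary sample values
projected = apply(combine(IDENTITY, GAMMA["t"], -1), psi)
print(projected)
```

Because each projected spinor carries only two independent components, every direction combines the spinor components differently (plain or multiplied by $\pm i$, added or subtracted), which is one way to see why a single static data layout cannot suit all accesses.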

Each gauge field link is represented by a special unitary SU(3) matrix (3×3 complex variables). The spinors are represented by four SU(3) vectors, each composed of three complex variables. The routine implementing this computation is called Hopping_Matrix. Parallelization of the routine involves dividing the lattice into two dependent subfields, odd and even, as shown in Figure 1. Each spinor of the odd subfield is surrounded by spinors of the even subfield and vice versa. The computation sweeps over the spinors of one subfield while keeping the other subfield temporarily constant, thus breaking the data dependency.
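The even/odd decomposition can be illustrated with a small sketch. It assumes a hypothetical 4×4×4×4 lattice with periodic boundaries and even extents (neither is stated in the text), and assigns each site to a subfield by the parity of the sum of its coordinates; the names `parity` and `neighbours` are illustrative.

```python
from itertools import product

DIMS = (4, 4, 4, 4)  # hypothetical lattice extents (x, y, z, t), all even


def parity(site):
    """0 for the even subfield, 1 for the odd subfield."""
    return sum(site) % 2


def neighbours(site, dims=DIMS):
    """The eight nearest neighbours of a site, with periodic boundaries."""
    result = []
    for mu in range(4):
        for step in (+1, -1):
            shifted = list(site)
            shifted[mu] = (shifted[mu] + step) % dims[mu]
            result.append(tuple(shifted))
    return result


sites = list(product(*(range(n) for n in DIMS)))
even_sites = [s for s in sites if parity(s) == 0]
odd_sites = [s for s in sites if parity(s) == 1]

# Every neighbour of an even site is odd, so a sweep that updates even sites
# only reads odd sites, and the updates are independent of each other.
assert all(parity(n) == 1 for s in even_sites for n in neighbours(s))
print(len(even_sites), len(odd_sites))
```

Under these assumptions the two subfields have equal size, and a sweep over one subfield reads only the other.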

Even though Equation 1 shows regular computation across all sites of the lattice, the computation usually faces the challenge of a low ratio of floating point operations to memory references. This makes the computation bound by memory bandwidth and latency.

In Section 4, we discuss the problem of efficiently SIMDizing this code, while data alignment and communication are investigated in Section 5.

4 SIMDizing Wilson-Dirac Computations on Cell Broadband Engine

The main problem that prevents SIMDizing this code efficiently is the different patterns of accessing the same data, due to the spin projector in Equation 1, which make no single representation optimal at runtime. Each gauge field SU(3) matrix is accessed twice (in the positive and negative directions). The computation may involve the original matrix or the conjugate transpose of the matrix, but the matrix is usually stored in only one representation. The problem is exacerbated for spinors because each spinor is accessed in eight different contexts depending on the space direction. Each access involves different spinor vectors and operations alternating between vector addition, subtraction, conjugate addition, and conjugate subtraction.

Aligning data such that all these operations are performed optimally at the same time is not possible. Earlier approaches to SIMDizing this code align the data in one layout and then use shuffle operations to change the alignment of the data at runtime to perform the needed computations.

Another problem is that data are represented by a 3×3 complex matrix and four 3×1 complex vectors, which do not map perfectly to power-of-2 data alignment. On the Cell processor, data should be aligned on a 16-byte boundary to be accessed efficiently. DMA transfers also work better when aligned on a 128-byte boundary.
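The mismatch with power-of-2 alignment can be made concrete with a short size calculation. The sketch below assumes the structures are stored densely, with each complex number taking two reals; the dictionary names and the `report` structure are illustrative, not taken from the original code.

```python
WORD = 16       # SPE register / local-store access granularity in bytes
DMA_LINE = 128  # preferred DMA alignment in bytes


def size_in_bytes(complex_elements, bytes_per_real):
    """Storage for a structure of complex numbers (real + imaginary part)."""
    return complex_elements * 2 * bytes_per_real


STRUCTURES = {"gauge link (3x3)": 9, "spinor (4 x 3x1)": 12}
PRECISIONS = {"single": 4, "double": 8}

report = {}
for name, elements in STRUCTURES.items():
    for precision, width in PRECISIONS.items():
        size = size_in_bytes(elements, width)
        report[(name, precision)] = (size, size % WORD == 0, size % DMA_LINE == 0)
        print(f"{name:18s} {precision:6s} {size:4d} bytes  "
              f"16-byte multiple: {size % WORD == 0}  "
              f"128-byte multiple: {size % DMA_LINE == 0}")
```

Under this dense-storage assumption, none of the four sizes is a multiple of 128 bytes, and the single precision gauge link (72 bytes) is not even a multiple of 16 bytes.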

In this work, we define Runtime Data Fusion as a solution to the above data alignment problems. We show that performance can be greatly improved using this technique.

Efficient SIMDization of the code requires aligning the data in a way that reduces the dependency between instructions, reduces the number of shuffle instructions, and allows efficient instructions such as multiply-add to be executed.

4.1 Runtime data fusion

Two conventional representations of complex structures are commonly used: the first combines the real and the imaginary parts into one structure; the second separates the real and the imaginary parts into two separate arrays. Figure 3 shows these data layouts for storing a 3×1 complex vector and a 3×3 matrix in double precision, aligned to 16-byte words.

In Table 1, we list the number of instructions needed to do a matrix-vector multiplication. We consider the average of the vector-matrix multiply and the vector-transposed conjugate matrix multiply. For the first representation, the real and the imaginary parts are treated differently, thus requiring shuffles. The second representation separates the real and imaginary parts into two separate arrays, which involves additional shuffles for transposing the matrix.

Figure 3: Multiple representations of structures based on complex variables aligned on 16-byte words. The top shows the merged representation, the middle shows the separate representation, and the bottom shows the fused representation.

The computational requirements based on these alignments favor separating the arrays for the real part and the imaginary part. The number of floating point instructions is reduced because of the possibility of using multiply-add and multiply-subtract instructions. These instructions cannot be used with the first layout because the real and imaginary parts share the same 16-byte word.

Even with its larger number of instructions, the first layout is favored by its perfect alignment within the boundary of the word. The second layout either requires padding (25% of the total space for spinors and 10% for the gauge field links) or leaves the data imperfectly aligned. Increasing the size of the data because of alignment can severely reduce the performance of this application because it is bandwidth limited, as will be detailed in Section 5. Data not aligned to a 16-byte boundary severely penalize loads and stores and nullify the computational benefit achieved by the reduced instruction count. We adopted the first alignment of complex data as the base for comparison, especially since the original implementation (optimized for SIMDization on Intel SSE2) adopted it throughout the whole code for simulating Lattice QCD.
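The padding percentages quoted above can be reproduced with a short calculation, under the assumption that in the separate layout each array of double precision real (or imaginary) parts is rounded up to a whole number of 16-byte words. The function name `padded_fraction` is illustrative.

```python
import math

WORD_BYTES = 16
DOUBLE_BYTES = 8


def padded_fraction(reals_per_array, bytes_per_real=DOUBLE_BYTES):
    """Fraction of the padded array that is padding when rounded up to 16 bytes."""
    used = reals_per_array * bytes_per_real
    padded = math.ceil(used / WORD_BYTES) * WORD_BYTES
    return (padded - used) / padded


# Separate layout: one array of real parts and one of imaginary parts.
vector_padding = padded_fraction(3)  # 3x1 complex vector: 3 reals per array
matrix_padding = padded_fraction(9)  # 3x3 complex matrix: 9 reals per array
# Merged layout: each (real, imaginary) pair of doubles fills one 16-byte word.
merged_padding = padded_fraction(2)

print(vector_padding, matrix_padding, merged_padding)
```

With these assumptions, the separate layout wastes a quarter of the space for each 3×1 vector and a tenth for each 3×3 matrix, while the merged layout needs no padding in double precision.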

Considering the fused version in Figure 3, the number of floating point operations can be reduced significantly and no shuffling is needed except at the fusion/disjoin stages. The fusion introduced in Figure 3 shows the two-way matrix fusion for complex variables of double precision. The fused matrix removes the need for shuffling in case of conjugate access to the array elements because the real and imaginary parts are aligned on separate 16-byte boundaries. Transposing or conjugating the matrix does not involve additional shuffles.

Instruction class               Merged  Separate  Fused
add, sub                            18        12      6
madd, msub                           0        10      9
mul                                 22        10      9
compute/reduce shuffles             31        14      0
transpose/conjugate shuffles         3         5      0
fuse shuffles                        0         0      6

Table 1: Instruction decomposition for vector-matrix multiply on SPE. The counts represent the average for vector-by-matrix and vector-by-transposed-matrix multiplies.
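Summing the columns of Table 1 gives a compact view of the trade-off. The sketch below only adds up the published counts; the dictionary layout and variable names are illustrative.

```python
TABLE_1 = {
    # instruction class: (merged, separate, fused)
    "add, sub": (18, 12, 6),
    "madd, msub": (0, 10, 9),
    "mul": (22, 10, 9),
    "compute/reduce shuffles": (31, 14, 0),
    "transpose/conjugate shuffles": (3, 5, 0),
    "fuse shuffles": (0, 0, 6),
}
LAYOUTS = ("merged", "separate", "fused")

# Total instructions and total shuffle instructions per layout.
totals = {name: sum(row[k] for row in TABLE_1.values())
          for k, name in enumerate(LAYOUTS)}
shuffles = {name: sum(row[k] for cls, row in TABLE_1.items() if "shuffles" in cls)
            for k, name in enumerate(LAYOUTS)}

for name in LAYOUTS:
    print(f"{name:9s} total={totals[name]:3d} shuffles={shuffles[name]:3d}")
```

The totals are 74, 51, and 30 instructions for the merged, separate, and fused layouts, of which 34, 19, and 6 are shuffles.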

This fused alignment is unfortunately not possible at compile time, especially for spinors, because it requires a unique order of accessing spinors that is known at compile time. In other words, given a spinor i, we need to determine a unique spinor that will always precede it in computation and a unique spinor that will always follow it. The surrounding spinors in every access context can be different, and keeping multiple coherent copies would be a very large space overhead.

Because of the performance associated with data fusion and the difficulty of doing it statically, we consider the following proposal for runtime data fusion:

1. Data are fused at runtime for a number of structures that depends on the number of elements per memory word, which usually involves some startup shuffle operations.

2. An optimized computation kernel is written assuming fused data. Fused data are kept alive in registers as long as they are needed.

3. The final results of the optimized kernel (output spinors) are then disjoined before they are stored to memory.

The steps involved in the code transformations to support runtime data fusion are shown in Figure 4. Our technique involves fusing unrolled code to align data that cannot be aligned statically due to the multiple access patterns encountered at runtime. The fusion process involves grouping data that will be accessed with the same access pattern into SPU words (i.e., aligning them on 16-byte boundaries). For single precision computation (4-byte floating point numbers), spinors are computed in groups of four output spinors. Consequently, the input gauge fields and the input spinors are combined into groups of four (4-way fusion). For double precision, optimal alignment requires fusing the computation for two output spinors (2-way fusion).

Runtime fusion adopts the same data structure used for the base, so no alignment problem is encountered. The fused version exists only during computation, living in registers.
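The fuse, compute, and disjoin steps can be sketched for 2-way double precision fusion as follows. This is an illustrative emulation in Python, not the SPE implementation: a two-element tuple stands in for one 16-byte register, `madd` and `nmsub` stand in for SIMD multiply-add and negative multiply-subtract, and all function names and sample values are assumptions.

```python
def fuse(a, b):
    """Fuse two complex sequences into per-element (lane0, lane1) words.

    Real and imaginary parts go to separate words, so each word models one
    16-byte register holding two doubles (2-way fusion).
    """
    re = [(x.real, y.real) for x, y in zip(a, b)]
    im = [(x.imag, y.imag) for x, y in zip(a, b)]
    return re, im


def disjoin(re, im):
    """Split fused words back into two ordinary complex sequences."""
    a = [complex(r[0], m[0]) for r, m in zip(re, im)]
    b = [complex(r[1], m[1]) for r, m in zip(re, im)]
    return a, b


def madd(x, y, acc):
    """Lane-wise acc + x * y."""
    return tuple(c + p * q for p, q, c in zip(x, y, acc))


def nmsub(x, y, acc):
    """Lane-wise acc - x * y."""
    return tuple(c - p * q for p, q, c in zip(x, y, acc))


def fused_matvec(u_re, u_im, v_re, v_im, dagger=False):
    """3x3 complex matrix times 3-vector on fused data, for two sites at once.

    u_re/u_im are flat row-major lists of 9 words. With dagger=True the
    conjugate transpose is applied by swapping indices and exchanging
    madd/nmsub; no lane shuffle is involved.
    """
    out_re, out_im = [], []
    for r in range(3):
        acc_re, acc_im = (0.0, 0.0), (0.0, 0.0)
        for c in range(3):
            k = c * 3 + r if dagger else r * 3 + c
            acc_re = madd(u_re[k], v_re[c], acc_re)
            acc_im = madd(u_re[k], v_im[c], acc_im)
            if dagger:  # imaginary part of the matrix enters with opposite sign
                acc_re = madd(u_im[k], v_im[c], acc_re)
                acc_im = nmsub(u_im[k], v_re[c], acc_im)
            else:
                acc_re = nmsub(u_im[k], v_im[c], acc_re)
                acc_im = madd(u_im[k], v_re[c], acc_im)
        out_re.append(acc_re)
        out_im.append(acc_im)
    return out_re, out_im


def reference(u, v, dagger=False):
    """Plain complex arithmetic on one site, used only to check the sketch."""
    return [sum((u[c * 3 + r].conjugate() if dagger else u[r * 3 + c]) * v[c]
                for c in range(3)) for r in range(3)]


# Two independent sites with arbitrary sample matrices and vectors.
ua = [complex(k + 1, 2 - k) for k in range(9)]
ub = [complex(2 * k - 3, k) for k in range(9)]
va = [1 + 2j, -1j, 3 - 1j]
vb = [0.5 + 0j, 2 + 2j, -1 + 4j]

u_re, u_im = fuse(ua, ub)                                   # step 1: fuse
v_re, v_im = fuse(va, vb)
ra, rb = disjoin(*fused_matvec(u_re, u_im, v_re, v_im))     # steps 2 and 3
da, db = disjoin(*fused_matvec(u_re, u_im, v_re, v_im, dagger=True))

assert ra == reference(ua, va) and rb == reference(ub, vb)
assert da == reference(ua, va, dagger=True) and db == reference(ub, vb, dagger=True)
```

In this emulation the inner loop contains only lane-wise multiply-accumulate operations, and switching to the conjugate transpose changes which operation is issued rather than how the data are arranged. The only data rearrangement happens in `fuse` and `disjoin`, mirroring the "fuse shuffles" row of Table 1.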

Figure 4: Code transformations for runtime data fusion computation.

The main feature of the Cell SPE that enables this technique is the large register file. Merging data structures at the beginning of the computation incurs minimal overhead if all fused data are kept in registers as long as they are needed for computation. Runtime fusion can prove difficult for other processors with SIMD instruction sets that have a small register count, for instance Intel SSE. We need to keep not only the input fused data but also intermediate results alive in registers, especially since the data are not accessed frequently. During the computation of a group of two spinors in double precision, almost 6 KB of memory are accessed, while the register file can hold only 2 KB. Since some registers are needed to hold shuffle patterns, intermediate results, and other bookkeeping data, careful register liveness analysis is needed to minimize the possibility of spilling registers to memory. We did this analysis on basic blocks ranging between 2,000 and 2,400 instructions. In our implementation, we managed to use almost 110 of the SPE 128-bit registers without needing to spill fused data or intermediate results to memory.
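The register pressure described above can be estimated with a back-of-the-envelope calculation. The register file size follows directly from 128 registers of 16 bytes each. The working-set accounting (eight input spinors, eight gauge links, and one output spinor per site) is an assumption made here for illustration; it happens to land close to the "almost 6 KB" quoted in the text, but the report does not spell out how that figure was obtained.

```python
REGISTERS = 128
REGISTER_BYTES = 16
DOUBLE_COMPLEX = 16                      # bytes per double precision complex number

register_file = REGISTERS * REGISTER_BYTES   # capacity of the SPE register file

spinor = 12 * DOUBLE_COMPLEX             # four 3x1 complex vectors
gauge_link = 9 * DOUBLE_COMPLEX          # one 3x3 complex matrix

# Assumed accounting per output spinor: eight neighbouring input spinors,
# eight gauge links (two per direction), and the output spinor itself.
per_output = 8 * spinor + 8 * gauge_link + spinor
working_set = 2 * per_output             # 2-way fusion handles two output spinors

print(register_file, working_set, round(working_set / register_file, 2))
print(f"registers in use: {110 / REGISTERS:.0%}")
```

Under this accounting the data touched for one fused group are nearly three times the register file capacity, which is why liveness analysis matters even with 128 registers.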

Figure 5 shows the dynamic instruction decomposition for four implementations of the base complex alignment and the fused version. We have two computation flows in terms of the shuffles needed: one for the double precision and the 2-way single precision, and the other for the 2-way double precision and the 4-way single precision. The perfect fusion kernels, 4-way single and 2-way double, provide a better chance of reducing data shuffles. Shuffling is reduced to less than 37% of the original shuffle operation count for single precision when we use 4-way fusion, and to less than 22% for double precision. Using no fusion for single precision (not presented) would incur additional overhead while transposing matrices and would provide much worse performance.

Figure 5 also shows reductions in floating point operations, in addition to the reduction in shuffle instructions, because runtime fusion of data allows using multiply-add or multiply-subtract more frequently, as clarified earlier in Table 1.

[Figure 5 plot: bars of dynamic instructions per spinor (axis from 0 to 2000) for 2-way single, 4-way single, double, and 2-way double; categories: Others (even), FP (even), Others (odd), LD/ST (odd), Shuffle (odd).]

Figure 5: Decomposition of dynamic instructions per spinor computation. Two fusion levels are shown for both single precision and double precision computations.

The impact on average execution cycles per spinor computation is detailed in Table 2. For these cycle counts, we aligned all data in the local store of the Cell SPE. The execution cycles were reduced significantly for the versions with fusion compared with the versions with less or no fusion. For single precision, the reduction is 35% for 4-way fusion compared with 2-way fusion. For double precision, the reduction is 40% on the current-generation Cell BE and is predicted to be 53% for the future-generation Cell EDP.

In Table 2, we define memory instructions efficiency as the ratio, expressed as a percentage, of the minimum number of load/store instructions needed by the computation to the actual number of load/store instructions executed during computation. As shown in the table, memory instructions efficiency is very high (ranging from 97.3% to 99%) in our implementation. Careful register liveness analysis is needed to achieve this, but it is also facilitated by the large register file available on the Cell BE, which makes it possible to hold intermediate results of the spinor data structures after fusion until the end of the computation. Only single precision with 4-way fusion has additional memory operations associated with the fuse/disjoin phase, reducing the memory instructions efficiency to 83%. Still, the overall performance with 4-way fusion is much better than with 2-way fusion.

Table 2 also shows the performance in GFlops for these implementations. Apparently, if the data are requested from the memory system outside the Cell BE, then the memory subsystem will not be able to afford these bandwidths. (Padding is added to the single precision gauge field matrix to make the number of elements even, which adds less than 4% overhead.) Only double precision on the current-generation Cell BE requires moderate bandwidth, because the execution of double precision computation is severely penalized during the issue stage of instruction execution. This table