HAL Id: inria-00203478
https://hal.inria.fr/inria-00203478
Submitted on 10 Jan 2008
Implementing Wilson-Dirac Operator on the Cell Broadband Engine
Khaled Z. Ibrahim, François Bodin
To cite this version:
Khaled Z. Ibrahim, François Bodin. Implementing Wilson-Dirac Operator on the Cell Broadband
Engine. [Research Report] PI 1880, 2007, pp.23. ⟨inria-00203478⟩
PUBLICATION INTERNE No 1880
IMPLEMENTING WILSON-DIRAC OPERATOR ON THE CELL BROADBAND ENGINE
Khaled Z. Ibrahim, F. Bodin
Systèmes communicants
Projet CAPS
Publication interne No 1880, December 2007, 23 pages
Abstract: Computing the actions of Wilson-Dirac operators consumes most of the CPU time for the grand challenge problem of simulating Lattice Quantum Chromodynamics (Lattice QCD). This routine is challenging to implement on most computational environments because the same data are accessed with multiple patterns, which makes it difficult to align the data efficiently at compile time. Additionally, the low ratio of computation to memory access makes this computation both memory-bandwidth and memory-latency bound.
In this work, we present an implementation of this routine on the Cell Broadband Engine. We propose runtime data fusion, an approach that aligns data at runtime, for data that cannot be aligned optimally at compile time, to improve SIMDized execution. We also show a DMA optimization technique that reduces the impact of bandwidth limits on performance. Our implementation of this routine achieves 31.2 GFlops for single precision computations and 8.75 GFlops for double precision computations.
Key-words: IBM Cell BE, Vectorization, SIMD, Lattice QCD, Parallel Algorithms,
Wilson-Dirac
Résumé: Implementing the Wilson-Dirac operator is one of the most expensive computations in lattice QCD. This computation poses many implementation challenges related to data access. The multiple access patterns make it difficult to optimize data alignment for efficient memory accesses. Moreover, the ratio of computation to memory access saturates the memory bandwidth, which limits the exploitation of the compute units.
This report presents an implementation of the Wilson-Dirac operator on the Cell architecture. Runtime data fusion techniques are proposed to address the data alignment constraints and the use of the SIMD operators of the Cell compute units. In addition, a technique for optimizing DMA transfers is described. Our implementation achieves 31.2 GFlops in single precision and 8.75 GFlops in double precision.
Keywords: IBM Cell, Vectorization, SIMD, Lattice QCD, Parallel Algorithm, Wilson-Dirac
Figure 1: 4-dimensional space-time lattice QCD.
1 Introduction
Efficient implementation of the action of Wilson-Dirac operators is of critical importance for the simulation of lattice Quantum Chromodynamics (Lattice QCD). Simulating Lattice QCD aims at understanding the strong interactions that bind quarks and gluons together to form hadrons. In lattice QCD, a four-dimensional space-time continuum is simulated, where quantum fields (quarks) are represented at the lattice sites and quantum fields (gluons) are represented on the links between these sites. The lattice spacing should be small to obtain reliable results, which requires an enormous amount of computation. Figure 1 shows the discretization of the four-dimensional space-time of lattice QCD.
The use of accelerators for scientific computing has long been explored by many researchers. Among the recently attractive technologies are graphics processing units (GPUs) and the Cell Broadband Engine. The Lattice QCD community, as well as other high performance computing communities, has started exploring the possibility of using these accelerators to build cost-effective supercomputers to simulate these problems.
Using GPUs, for instance, has been investigated [1, 2], especially with the advent of general purpose programming environments such as CUDA [3] for graphics cards. The main challenge in these environments is the over-protection that most manufacturers adopt to hide their proprietary internal hardware design.
The use of the Cell Broadband Engine is also under consideration by many Lattice QCD groups. An analytical model to predict the performance limits of simulating lattice QCD has been developed [4]. Some simplified computation was also ported to the Cell [5]. These studies affirmed that the computation of lattice QCD is bandwidth limited (or memory bound) and tried to predict the performance of a real implementation.
In this study, we introduce an implementation of the main kernel routine for simulating Lattice QCD. In this implementation, we tried to answer two main questions: the first is how to SIMDize the computation in an efficient way; the second is how to distribute the lattice data and how to handle memory efficiently.
For efficient SIMDization, we introduce the notion of runtime data fusion to align, at runtime, data that cannot be aligned optimally at compile time. Furthermore, while allocating lattice data in main memory, we introduce an analysis of data at the frame level to create optimized DMA requests that remove redundant data transfers and improve the contiguity of memory accesses.
The rest of this report is organized as follows: Section 2 introduces the Cell Broadband Engine architecture and the software development environment. Section 3 introduces the Wilson-Dirac computation kernel. The SIMDization problem is tackled in Section 4. Section 5 details the proposed memory layout and the analysis leading to optimizing the memory transfers. We comment on the utilization of the Cell BE in Section 6. Section 7 concludes this report.
2 Cell Broadband Engine and Software Development Environment
In this study, we target developing an efficient implementation of the main kernel routine of simulating Lattice QCD on the Cell Broadband Engine (BE). We used the IBM Cell BE SDK 3.0 [6]. We explored our implementation on the current Cell BE as well as the future generation Cell with enhanced double precision (EDP)¹.
We used the simulator provided by the SDK to analyze the performance of our implementation, and we verified the performance, except for Cell EDP, on an IBM BladeCenter® QS20 system with dual Cell BE processors (running at 3.2 GHz).
Figure 2 outlines the basic components of the Cell BE processor. The Cell BE chip is composed of multiple heterogeneous cores: a PowerPC-compatible master processor with dual SMT (the PPE) and eight synergistic processing elements (SPEs).
The execution unit on the PPE can handle control-flow-intensive codes, while the execution unit on the SPE is optimized to handle SIMD computations. Each SPE has a register file of 128 registers, each 16 bytes wide. The SPE has a small (256 KB) special memory called the local store, which the execution unit can access with a pipelined latency of one cycle.
The main data are usually stored in external memory, and data are transferred back and forth with memory through special DMA APIs. Each SPE has two pipelines: one is specialized mainly in integer and floating point operations (even pipeline), and the other is specialized mainly in shuffling, branching, and load/store operations (odd pipeline).
¹ For Cell EDP, the performance numbers are only estimates based on information collected from the simulator.
Figure 2: Cell Broadband Engine
3 Wilson-Dirac Operator
In this study, we ported the computation of the action of the Wilson-Dirac operator on a spinor field, based on the code of the ETMC collaboration; see for instance [7, 8].
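Computing the action of the Wilson-Dirac operator is the most time-consuming operation in simulating lattice QCD. Equation 1 details this computation. It involves a sum over the quark field ($\psi_i$) multiplied by a gluon gauge link ($U_{i,\mu}$) through the spin projector ($I \pm \gamma_\mu$).

\[
\chi_i = \sum_{\mu \in \{x,y,z,t\}} \kappa_\mu \left\{ U_{i,\mu}\,(I - \gamma_\mu)\,\psi_{i+\hat{\mu}} + U^{\dagger}_{i-\hat{\mu},\mu}\,(I + \gamma_\mu)\,\psi_{i-\hat{\mu}} \right\} \tag{1}
\]

where

\[
\gamma_x = \begin{pmatrix} 0 & 0 & 0 & i \\ 0 & 0 & i & 0 \\ 0 & -i & 0 & 0 \\ -i & 0 & 0 & 0 \end{pmatrix}, \quad
\gamma_y = \begin{pmatrix} 0 & 0 & 0 & -1 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ -1 & 0 & 0 & 0 \end{pmatrix},
\]
\[
\gamma_z = \begin{pmatrix} 0 & 0 & i & 0 \\ 0 & 0 & 0 & -i \\ -i & 0 & 0 & 0 \\ 0 & i & 0 & 0 \end{pmatrix}, \quad
\gamma_t = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix},
\]

$I$ is the identity matrix, and $\kappa_\mu$ represents the hopping term.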
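As a quick sanity check of the definitions above, the following Python sketch (standard library only) builds the four γ matrices as listed and verifies that each one squares to the identity, which implies that (I − γµ)(I + γµ) vanishes. The helper names (`matmul`, `identity_plus`, and so on) are illustrative and are not taken from the ETMC code.

```python
# Gamma matrices exactly as listed in the text (rows top to bottom).
IDENTITY = [[1 if r == c else 0 for c in range(4)] for r in range(4)]
GAMMA = {
    "x": [[0, 0, 0, 1j], [0, 0, 1j, 0], [0, -1j, 0, 0], [-1j, 0, 0, 0]],
    "y": [[0, 0, 0, -1], [0, 0, 1, 0], [0, 1, 0, 0], [-1, 0, 0, 0]],
    "z": [[0, 0, 1j, 0], [0, 0, 0, -1j], [-1j, 0, 0, 0], [0, 1j, 0, 0]],
    "t": [[0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]],
}


def matmul(a, b):
    """Plain 4x4 matrix product."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def identity_plus(g, sign):
    """Return I + sign * g, i.e. the spin projector (I +/- gamma)."""
    return [[IDENTITY[r][c] + sign * g[r][c] for c in range(4)]
            for r in range(4)]


def squares_to_identity(g):
    return matmul(g, g) == IDENTITY


def projectors_are_orthogonal(g):
    product = matmul(identity_plus(g, -1), identity_plus(g, +1))
    return all(entry == 0 for row in product for entry in row)


ALL_OK = all(squares_to_identity(g) and projectors_are_orthogonal(g)
             for g in GAMMA.values())
```

Because the entries are small integers and multiples of i, the comparisons are exact and no floating point tolerance is needed.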
Each gauge field link is represented by a special unitary SU(3) matrix (3×3 complex variables). Each spinor is represented by four SU(3) vectors, each composed of three complex variables. The routine implementing this computation is called Hopping_Matrix. Parallelization of the routine involves dividing the lattice into two dependent subfields, odd and even, as shown in Figure 1. Each spinor of the odd subfield is surrounded by spinors of the even subfield and vice versa. The computation sweeps over the spinors of one subfield while the other subfield is held temporarily constant, thus breaking the data dependency.
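The even/odd decomposition can be illustrated with a small Python sketch. It assumes, for illustration only, that a site's subfield is given by the parity of the sum of its four coordinates and that the lattice is periodic with even extents; neither convention is spelled out in this section, and the function names are hypothetical.

```python
from itertools import product


def parity(site):
    """0 for the 'even' subfield, 1 for the 'odd' one (assumed convention)."""
    return sum(site) % 2


def neighbours(site, dims):
    """The eight nearest neighbours of a 4-D site with periodic wrap-around."""
    for mu in range(len(dims)):
        for step in (+1, -1):
            nb = list(site)
            nb[mu] = (nb[mu] + step) % dims[mu]
            yield tuple(nb)


def opposite_parity_everywhere(dims):
    """True if every neighbour of every site lies in the other subfield."""
    sites = product(*(range(d) for d in dims))
    return all(parity(nb) != parity(site)
               for site in sites
               for nb in neighbours(site, dims))
```

With these assumptions, `opposite_parity_everywhere((4, 4, 4, 4))` holds, so updating one subfield only reads the other. An odd extent such as `(3, 4, 4, 4)` breaks the property at the periodic boundary.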
Even though Equation 1 shows a regular computation across all sites of the lattice, the computation usually faces the challenge of a low ratio of floating point operations to memory references. This makes the computation memory-bandwidth and memory-latency bound.
In Section 4, we discuss the problem of efficiently SIMDizing this code, while data alignment and communication are investigated in Section 5.
4 SIMDizing Wilson-Dirac Computations on Cell Broadband Engine
The main problem that prevents SIMDizing this code efficiently is that the same data are accessed with different patterns, due to the spin projector in Equation 1, so that no single representation is optimal at runtime. Each gauge field SU(3) matrix is accessed twice (positive and negative directions). The computation may involve the original matrix or the conjugate transpose of the matrix, while the matrix is usually stored in only one representation. The problem is exacerbated for spinors because each spinor is accessed in eight different contexts depending on the space direction. Each access involves different spinor vectors and operations alternating between vector addition, subtraction, conjugate addition, and conjugate subtraction.
Aligning data such that all these operations are performed optimally at the same time is not possible. Earlier approaches for SIMDizing this code align the data in one layout and then use shuffle operations to change the alignment of the data at runtime to perform the needed computations.
Another problem is that the data are represented by a 3×3 complex matrix and four 3×1 complex vectors, which do not map perfectly to power-of-2 data alignment. On the Cell processor, data should be aligned on a 16-byte boundary to be accessed efficiently. DMA transfers also perform better when aligned on a 128-byte boundary.
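The mismatch with power-of-2 alignment can be made concrete with a little arithmetic. The sketch below assumes tightly packed structures (a complex value is two reals, with no padding); the function names are illustrative.

```python
def round_up(nbytes, boundary):
    """Smallest multiple of `boundary` that is >= nbytes."""
    return -(-nbytes // boundary) * boundary


def structure_bytes(real_bytes):
    """Sizes of the two basic structures, assuming tightly packed complex values."""
    su3 = 3 * 3 * 2 * real_bytes      # 3x3 complex gauge link
    spinor = 4 * 3 * 2 * real_bytes   # four 3x1 complex vectors
    return {"su3": su3, "spinor": spinor}


single = structure_bytes(4)   # 4-byte floats
double = structure_bytes(8)   # 8-byte doubles

# Which structures end exactly on a 16-byte or 128-byte boundary?
fits_16 = {name: size % 16 == 0 for name, size in single.items()}
fits_128 = {name: size % 128 == 0
            for name, size in {**single, **double}.items()}
```

Under these assumptions the single precision gauge link occupies 72 bytes, which is not a multiple of 16 (`round_up(72, 16)` gives 80), and none of the four sizes (72, 96, 144, 192 bytes) is a multiple of the 128-byte DMA boundary.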
In this work, we define Runtime Data Fusion as a solution for the above problems of data alignment. We show that the performance can be greatly improved using this technique.
Efficient SIMDization of the code requires aligning the data in a way that reduces the dependency between instructions, reduces the number of shuffle instructions, and allows efficient instructions like multiply-add to be executed.
4.1 Runtime data fusion
Two conventional representations of complex structures are commonly used: the first combines the real and the imaginary parts into one structure; the second separates the real and the imaginary parts into two separate arrays. Figure 3 shows these data layouts for storing a 3×1 complex vector and a 3×3 matrix in double precision, aligned on 16-byte words.
In Table 1, we list the number of instructions that are needed to do a matrix-vector multiplication. We consider the average of the vector-matrix multiply and the vector-transposed conjugate matrix multiply. For the first representation, the real and the imaginary parts require different treatment, thus requiring shuffles. The second representation separates the real and imaginary parts into two separate arrays, which involves additional shuffles for transposing the matrix.

Figure 3: Multiple representations of structures based on complex variables aligned on 16-byte words. The top shows the merged representation, the middle shows the separate representation, and the bottom shows the fused representation.
The computational requirements based on these alignments favor separating the arrays for the real part and the imaginary part. The number of floating point instructions is reduced because of the possibility of using multiply-add and multiply-subtract instructions. These instructions cannot be used with the first layout because the real and imaginary parts share the same 16-byte word.
Even with its larger number of instructions, the first layout is favored by its perfect alignment within the boundary of the word. The second layout requires either padding (25% of the total space for spinors and 10% for the gauge field links) or data that are not perfectly aligned. Increasing the size of the data because of alignment can severely reduce the performance of this application because it is bandwidth limited, as will be detailed in Section 5. Not aligning data on a 16-byte boundary will severely penalize loading and storing data and will nullify the computational benefit achieved by the reduced instruction count. We adopted the first alignment of complex data as the base for comparison, especially since the original implementation (optimized for SIMDization on Intel SSE2) adopted it throughout the whole code for simulating Lattice QCD.
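One reading that reproduces the quoted padding figures is sketched below. It assumes double precision (two reals per 16-byte word) and that, in the separate layout, each real or imaginary array of a 3×1 vector (3 reals) and of a 3×3 matrix (9 reals) is padded up to a whole number of words. This interpretation is an assumption, not something the text states explicitly.

```python
def padding_fraction(n_reals, lanes):
    """Fraction of the padded array that is padding.

    n_reals: number of real values in the array.
    lanes:   number of reals per 16-byte word (2 for double precision).
    """
    padded = -(-n_reals // lanes) * lanes   # round up to whole words
    return (padded - n_reals) / padded


vector_padding = padding_fraction(3, 2)   # real (or imaginary) part of a 3x1 vector
matrix_padding = padding_fraction(9, 2)   # real (or imaginary) part of a 3x3 matrix
```

Under these assumptions a 3-element array grows to 4 elements (25% padding) and a 9-element array grows to 10 elements (10% padding), matching the percentages given for spinors and gauge links.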
Considering the fused version in Figure 3, the number of floating point instructions can be reduced significantly and no shuffling is needed except at the fusion/disjoin stages. The fusion shown in Figure 3 is the two-way matrix fusion for complex variables of double precision. The fused matrix removes the need for shuffling in case of conjugate access to the array elements because the real and imaginary parts are aligned on separate 16-byte boundaries. Transposing or conjugating the matrix does not involve additional shuffles.

                               Merged  Separate  Fused
add, sub                           18        12      6
madd, msub                          0        10      9
mul                                22        10      9
compute/reduce shuffles            31        14      0
transpose/conjugate shuffles        3         5      0
fuse shuffles                       0         0      6

Table 1: Instruction decomposition for vector-matrix multiply on the SPE. The counts represent the average for vector-by-matrix and vector-by-transposed-matrix multiplication.
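Tallying the Table 1 counts per layout makes the comparison easier to read. The sketch below simply sums the published numbers; the dictionary keys are illustrative labels for the table rows.

```python
# Instruction counts copied from Table 1 (per layout, per row).
TABLE1 = {
    "merged":   {"add_sub": 18, "madd_msub": 0,  "mul": 22,
                 "compute_shuffle": 31, "transpose_shuffle": 3, "fuse_shuffle": 0},
    "separate": {"add_sub": 12, "madd_msub": 10, "mul": 10,
                 "compute_shuffle": 14, "transpose_shuffle": 5, "fuse_shuffle": 0},
    "fused":    {"add_sub": 6,  "madd_msub": 9,  "mul": 9,
                 "compute_shuffle": 0,  "transpose_shuffle": 0, "fuse_shuffle": 6},
}

FP_ROWS = ("add_sub", "madd_msub", "mul")


def total(layout):
    """All instructions listed for one layout."""
    return sum(TABLE1[layout].values())


def floating_point(layout):
    """Arithmetic instructions only."""
    return sum(TABLE1[layout][row] for row in FP_ROWS)


def shuffles(layout):
    """Shuffle instructions only."""
    return total(layout) - floating_point(layout)


summary = {name: (floating_point(name), shuffles(name), total(name))
           for name in TABLE1}
```

The sums are 40 arithmetic plus 34 shuffle instructions (74 total) for the merged layout, 32 plus 19 (51) for the separate layout, and 24 plus 6 (30) for the fused layout.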
This fused alignment is unfortunately not possible at compile time, especially for spinors, because it requires a unique order of accessing spinors to be known at compile time. In other words, given a spinor i, we would need to determine a unique spinor that will always precede it in the computation and a unique spinor that will always follow it. The surrounding spinors in every access context can be different, and keeping multiple coherent copies would incur a very large space overhead.
Because of the performance associated with data fusion and the difficulty of doing it statically, we consider the following proposal for runtime data fusion:
1. Data are fused at runtime for a number of structures that depends on the number of elements per memory word, which usually involves some startup shuffle operations.
2. An optimized computation kernel is written assuming fused data. Fused data are kept alive in registers as long as they are needed.
3. The final results of the optimized kernel (output spinors) are then disjoined before storing them to memory.
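The three steps above can be sketched in plain Python for the 2-way (double precision) case. This is a minimal model under stated assumptions, not the SPE implementation: a 16-byte word is modeled as a 2-tuple of lanes, the hypothetical helpers `madd` and `nmsub` stand in for lane-wise fused multiply-add style instructions, and the kernel is a single 3×3 complex matrix times 3×1 vector product for two independent (matrix, vector) pairs.

```python
def fuse(values_a, values_b):
    """Step 1: pack element k of two structures into one real word and one
    imaginary word, each holding two lanes (one lane per structure)."""
    re = [(a.real, b.real) for a, b in zip(values_a, values_b)]
    im = [(a.imag, b.imag) for a, b in zip(values_a, values_b)]
    return re, im


def madd(x, y, acc):
    """Lane-wise x * y + acc (stand-in for a SIMD multiply-add)."""
    return tuple(xi * yi + ai for xi, yi, ai in zip(x, y, acc))


def nmsub(x, y, acc):
    """Lane-wise acc - x * y (stand-in for a SIMD negative multiply-subtract)."""
    return tuple(ai - xi * yi for xi, yi, ai in zip(x, y, acc))


def fused_matvec(u_re, u_im, v_re, v_im):
    """Step 2: kernel written for fused data; no shuffles are needed here."""
    out_re, out_im = [], []
    for r in range(3):
        acc_re, acc_im = (0.0, 0.0), (0.0, 0.0)
        for c in range(3):
            k = 3 * r + c                       # row-major matrix index
            acc_re = madd(u_re[k], v_re[c], acc_re)
            acc_re = nmsub(u_im[k], v_im[c], acc_re)
            acc_im = madd(u_re[k], v_im[c], acc_im)
            acc_im = madd(u_im[k], v_re[c], acc_im)
        out_re.append(acc_re)
        out_im.append(acc_im)
    return out_re, out_im


def disjoin(re, im):
    """Step 3: unpack the two lanes back into two separate complex vectors."""
    a = [complex(r[0], i[0]) for r, i in zip(re, im)]
    b = [complex(r[1], i[1]) for r, i in zip(re, im)]
    return a, b


def scalar_matvec(u, v):
    """Reference: ordinary complex matrix-vector product."""
    return [sum(u[3 * r + c] * v[c] for c in range(3)) for r in range(3)]


# Two independent (matrix, vector) pairs with arbitrary illustrative values.
u_a = [complex(k, k + 1) for k in range(9)]
v_a = [1 + 2j, 3 - 1j, -2 + 0.5j]
u_b = [complex(-k, 2) for k in range(9)]
v_b = [0.5j, 1 + 0j, 2 - 3j]

u_re, u_im = fuse(u_a, u_b)
v_re, v_im = fuse(v_a, v_b)
out_a, out_b = disjoin(*fused_matvec(u_re, u_im, v_re, v_im))
```

In this model, a conjugated matrix access would only swap which of `madd` and `nmsub` is applied to the imaginary words, which is consistent with the claim that conjugation needs no extra shuffles in the fused layout.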
The steps involved in the code transformations to support runtime data fusion are shown in Figure 4. Our technique involves fusing unrolled code to align data that cannot be aligned statically due to the multiple access patterns encountered at runtime. The fusion process groups data that will be accessed with the same access pattern into SPU words (i.e., aligning them on a 16-byte boundary). For single precision computation (4-byte floating point values), spinors are computed in groups of four output spinors. Consequently, the input gauge fields and the input spinors are combined into groups of four (4-way fusion). For double precision, optimal alignment requires fusing the computation for two output spinors (2-way fusion).
Runtime fusion adopts the same data structure used for the base version, so no alignment problem is encountered. The fused version only exists during computation, living in registers.
Figure 4: Code transformations for runtime data fusion computation.
The main feature of the Cell SPE that allows this technique is the large register file. Merging data structures at the beginning of the computation incurs minimal overhead if all fused data are kept in registers as long as they are needed for computation. Runtime fusion can prove difficult for other processors with SIMD instruction sets that have a small register count, for instance Intel SSE. We need to keep not only the input fused data but also intermediate results alive in registers, especially since the data are not accessed frequently. During the computation of a group of two spinors in double precision, almost 6 KB of memory are accessed while the register file can hold only 2 KB. Since some registers are needed to hold shuffle patterns, intermediate results, and other bookkeeping data, careful register liveness analysis is needed to minimize the possibility of spilling the register file to memory. We did this analysis on a basic block size ranging between 2,000 and 2,400 instructions. In our implementation, we managed to use almost 110 of the SPE's 128-bit registers without the need to spill fused data or intermediate results to memory.
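The register pressure can be estimated with a short calculation. The breakdown below is one plausible accounting and is an assumption: per output spinor it counts the eight neighbouring input spinors and eight gauge links implied by Equation 1 (four directions, forward and backward), plus the output spinor itself, all tightly packed in double precision.

```python
REGISTERS = 128            # SPE register count, from the text
REGISTER_BYTES = 16        # each register is 16 bytes wide
register_file_bytes = REGISTERS * REGISTER_BYTES

REAL_BYTES = 8                               # double precision
SPINOR_BYTES = 4 * 3 * 2 * REAL_BYTES        # four 3x1 complex vectors
SU3_BYTES = 3 * 3 * 2 * REAL_BYTES           # one 3x3 complex gauge link


def bytes_touched(n_outputs, directions=8):
    """Assumed data footprint for computing `n_outputs` fused output spinors."""
    per_output = directions * (SPINOR_BYTES + SU3_BYTES) + SPINOR_BYTES
    return n_outputs * per_output


footprint_2way = bytes_touched(2)
```

With these assumptions the register file holds 2048 bytes, while a 2-way fused computation touches 5760 bytes (about 5.6 KB), in line with the "almost 6 KB" figure quoted above.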
Figure 5 shows the dynamic instruction decomposition for four implementations of the base complex alignment and the fused version. We have two computation flows in terms of the shuffles needed: one for the base double precision and the 2-way single precision, and the other for the 2-way double precision and the 4-way single precision. The perfect fusion kernels, 4-way single and 2-way double, provide a better chance of reducing data shuffles. The shuffling is reduced to less than 37% of the original shuffle operation count for single precision when we use 4-way fusion, and to less than 22% for double precision. Using no fusion for single precision (not presented) would incur additional overhead when transposing matrices and would provide much worse performance.
Figure 5 also shows reductions in floating point operations, in addition to the reduction in shuffle instructions, because runtime fusion of data allows using multiply-add or multiply-subtract more frequently, as clarified earlier in Table 1.
[Figure 5: bar chart of dynamic instructions per spinor (scale 0 to 2000) for the double, 2-way double, 2-way single, and 4-way single implementations, broken down into Others (even), FP (even), Others (odd), LD/ST (odd), and Shuffle (odd).]
Figure 5: Decomposition of dynamic instructions per spinor computation. Two fusion levels are shown for both single precision and double precision computations.
The impact on the average execution cycles per spinor computation is detailed in Table 2. For these cycle counts, we aligned all data in the local store of the Cell SPE. The execution cycles were reduced significantly for the versions with fusion compared with the versions with less or no fusion. For single precision, the reduction is 35% for 4-way fusion compared with 2-way fusion. For double precision, the reduction is 40% on the current generation Cell BE and is predicted to be 53% for the future generation Cell EDP.
In Table 2, we define memory instructions efficiency as the percentage of the minimum load/store instructions needed by the computation to the actual load/store instructions executed during computation. As shown in the table, memory instructions efficiency is very high (ranging from 97.3% to 99%) in our implementation. Careful register liveness analysis is needed to achieve this, but it is also facilitated by the large register file available on the Cell BE, which makes it possible to hold intermediate results of the spinor data structures after fusion until the end of the computation. Only single precision with 4-way fusion has additional memory operations associated with the fuse/disjoin phase, reducing the memory instructions efficiency to 83%. Still, the overall performance with 4-way fusion is much better than with 2-way fusion.
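The metric defined above can be written as a small helper. The instruction counts in the example are hypothetical and chosen only to illustrate the formula; they are not measurements from Table 2.

```python
def memory_instruction_efficiency(min_ldst, actual_ldst):
    """Percentage of executed load/store instructions that were strictly needed.

    min_ldst:    minimum load/store instructions required by the computation.
    actual_ldst: load/store instructions actually executed.
    """
    if actual_ldst <= 0 or min_ldst < 0 or min_ldst > actual_ldst:
        raise ValueError("expected 0 <= min_ldst <= actual_ldst and actual_ldst > 0")
    return 100.0 * min_ldst / actual_ldst


# Hypothetical counts: 83 required instructions out of 100 executed.
example = memory_instruction_efficiency(83, 100)
```

Extra loads and stores, for example from register spills or from a fuse/disjoin phase, raise `actual_ldst` and lower the percentage.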
Table 2 also shows the performance in GFlops for these implementations. Apparently, if the data are requested from the memory system outside the Cell BE, then the memory subsystem will not be able to sustain these bandwidths². Only double precision on the current generation Cell BE requires moderate bandwidth because the execution of double precision computation is severely penalized during the issue stage of instruction execution. This table
² Padding is added to the single precision gauge field matrix to make the number of elements even (less than 4% additional overhead).