HAL Id: inria-00277901
https://hal.inria.fr/inria-00277901v2
Submitted on 13 May 2008
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of
scientific research documents, whether they are
published or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
Communication Avoiding Gaussian Elimination
Laura Grigori, James Demmel, H. Xiang
To cite this version:
Laura Grigori, James Demmel, H. Xiang. Communication Avoiding Gaussian Elimination. [Research Report] RR-6523, INRIA. 2008. inria-00277901v2
ISSN 0249-6399   ISRN INRIA/RR--6523--FR+ENG
Thème NUM

Communication Avoiding Gaussian Elimination

Laura GRIGORI∗, James W. DEMMEL†, Hua XIANG‡

N° 6523

Centre de recherche INRIA Saclay – Île-de-France
Parc Orsay Université

Thème NUM — Systèmes numériques
Équipe-Projet Grand-Large
Rapport de recherche n° 6523 — Mai 2008 — 18 pages
Abstract: This paper presents CALU, a Communication Avoiding algorithm for the LU factorization of dense matrices distributed in a two-dimensional (2D) cyclic layout. The algorithm is based on a new pivoting strategy, referred to as ca-pivoting, that is shown to be stable in practice. The ca-pivoting strategy leads to a significant decrease in the number of messages exchanged during the factorization of a block-column relative to conventional algorithms, and thus CALU overcomes the latency bottleneck of the LU factorization in current implementations such as ScaLAPACK and HPL.

The experimental part of this paper focuses on the evaluation of the performance of CALU on two computational systems: an IBM POWER 5 system with 888 compute processors distributed among 111 compute nodes, and a Cray XT4 system with 9660 dual-core AMD Opteron processors. We compare CALU with the ScaLAPACK PDGETRF routine that computes the LU factorization. Our experiments show that CALU leads to a reduction in the parallel time of the LU factorization. The gain depends on the size of the matrices and on the characteristics of the computer architecture. In particular, the effect is found to be significant in the cases when the latency time is an important factor of the overall time, as for example when a small matrix is executed on a large number of processors.

The factorization of a block-column, referred to as TSLU, reaches a performance of 215 GFLOPs/s on 64 processors of the IBM POWER 5 system, and a performance of 240 GFLOPs/s on 64 processors of the Cray XT4 system. This represents 44% and 36% of the theoretical peak performances on these systems. TSLU outperforms the corresponding routine PDGETF2 from ScaLAPACK by up to a factor of 4.37 on the IBM POWER 5 system and up to a factor of 5.58 on the Cray XT4 system. On square matrices of order 10^4, CALU outperforms PDGETRF by a factor of 1.24 on IBM POWER 5 and by a factor of 1.31 on Cray XT4. This represents 40% and 23% of the peak performance on these systems. The best improvement obtained by CALU is a speedup of 2.29 on IBM POWER 5 and a speedup of 1.81 on Cray XT4.

Key-words: dense LU factorization, parallel pivoting strategies, avoiding communication
∗ INRIA Saclay - Ile de France, Bat 490, Universite Paris-Sud 11, 91405 Orsay France (laura.grigori@inria.fr).
† Computer Science Division and Mathematics Department, UC Berkeley, CA 94720-1776, USA (demmel@cs.berkeley.edu).
‡
1 Introduction
Solving linear systems of equations is one of the most used operations in various applications and numerical simulations in scientific computing. These applications frequently lead to solving very large dense sets of linear equations, often with millions of rows and columns, and solving these problems is very time consuming.

In this paper we present a Communication-Avoiding LU factorization (CALU) algorithm for computing the LU factorization of a dense matrix A distributed in a two-dimensional (2D) layout. CALU is based on a new pivoting strategy that we show to be numerically stable in practice. CALU has two main characteristics. First, it is latency avoiding, as the new pivoting strategy allows for a significant decrease in the number of messages exchanged during the factorization relative to conventional algorithms, though that comes at the cost of some redundant computations. We refer to the new pivoting strategy as ca-pivoting. This approach is thus particularly beneficial on parallel architectures and for sizes of matrices for which the overhead associated with sending a message between two processors is an expensive factor in the algorithm. Moreover, today's technology trends predict that arithmetic will continue to improve exponentially faster than bandwidth, and bandwidth exponentially faster than latency. So CALU is well suited for future parallel architectures, in which conventional algorithms will spend more and more of their time communicating and less and less doing arithmetic. Second, it allows the usage of the best available sequential algorithm for computing the LU factorization of a block-column, as for example the recursive algorithms [6, 9].
CALU uses a block right-looking approach in which a dense matrix A with a 2D layout is factorized by traversing iteratively blocks of columns. At each iteration, first a block-column of width b is factored. Then the trailing matrix is updated, and the decomposition continues on the trailing matrix. The main difference with respect to other block right-looking algorithms lies in the factorization of a block-column, which is performed very efficiently in CALU by using the new ca-pivoting strategy as follows. The LU decomposition of the block-column is performed in two steps. The first step, a preprocessing step, identifies efficiently in parallel b pivot rows that provide good pivots for the LU factorization of the entire block-column. We describe in detail later in the paper how these rows are identified. The pivot rows are permuted to be in the first b positions of the block-column. In the second step the LU factorization with no pivoting of the block-column is performed. We refer to this approach for performing the LU factorization of a block-column as TSLU (Tall Skinny LU), since a block-column can be considered to be a matrix with a 1D layout for which the vertical dimension (number of rows) is much larger than the horizontal dimension (number of columns).
CALU overcomes the latency bottleneck of the classic LU factorization, as implemented in the ScaLAPACK PDGETRF routine [2]. In PDGETRF, the LU decomposition of an m × n matrix is performed in parallel using a block cyclic distribution of the matrix over a P_r by P_c grid of processors, where P_r × P_c = P and P is the number of processors. The latency bottleneck in ScaLAPACK lies in the LU factorization of a block-column that is spread over P_r processors, which leads to 2n log_2 P_r messages communicated during the factorization. All the other terms are of the form O(n/b) log_2 P_r + O(n/b) log_2 P_c, where b is the size of the block used in the 2D distribution. CALU has a number of messages communicated of 3(n/b) log_2 P_r + 3(n/b) log_2 P_c, i.e. smaller by a factor of b. The price for fewer messages is b(mn − n^2/2)/P_r more floating point work, which is a small fraction of the overall (mn^2 − n^3/3)/P work.
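As a quick sanity check of these counts, the sketch below (an illustration, not code from the paper; it keeps only the terms quoted above, and the configuration n, b, P_r, P_c is made up) compares the two message totals:

```python
import math

def messages_pdgetrf(n, b, pr, pc):
    # dominant terms: 2n log2(Pr) from the block-column factorizations,
    # plus O(n/b) log2(Pc) broadcasts along processor rows
    return 2 * n * math.log2(pr) + 3 * (n / b) * math.log2(pc)

def messages_calu(n, b, pr, pc):
    # CALU: 3(n/b) log2(Pr) + 3(n/b) log2(Pc) messages
    return 3 * (n / b) * (math.log2(pr) + math.log2(pc))

n, b, pr, pc = 10**4, 100, 16, 16
ratio = messages_pdgetrf(n, b, pr, pc) / messages_calu(n, b, pr, pc)
print(round(ratio, 1))   # ~33.8 here, i.e. on the order of b/3
```

The gap is dominated by the 2n log_2 P_r term of the conventional algorithm, which CALU replaces by (3n/b) log_2 P_r.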
This paper focuses on comparing CALU with the approach used in ScaLAPACK PDGETRF, and the parallel implementation we present for CALU follows the main steps used in ScaLAPACK. However, the ca-pivoting scheme can be used in other parallel algorithms implementing the LU factorization, as for example in the highly optimized High Performance Linpack (HPL) benchmark, used in determining the Top500 list [1]. This is the object of our current research.
The new ca-pivoting scheme used in CALU may lead to a different row permutation than the classic LU factorization. In this paper we present numerical results that show that the ca-pivoting scheme is stable in practice. We observe that it behaves as a threshold pivoting, where the minimum threshold value in practical experiments is 0.33. In other words, |L| is bounded by 3, while in LU factorization with partial pivoting, |L| is bounded by 1, where |L| denotes the matrix of absolute values of the entries of L. We also find that the accuracy tests performed in HPL [5], which consist of computing several scaled residuals, are fulfilled by the ca-pivoting strategy. Hence this approach could be used for evaluating the performance of parallel computers.
The algorithm presented in this paper bears some similarities to the Communication-Avoiding QR (CAQR) factorization discussed in [3]. Both algorithms use a reduce-like computation for the panel factorization, thus decreasing the communication cost. However the numerical stability issues related to the LU factorization lead to a number of significant differences. For instance, CALU performs the panel factorization twice, but the update of the trailing matrix is the same as in the classic LU factorization. In communication-avoiding QR, the panel factorization is performed once, but there is some redundant computation in the update of the trailing matrix.
The rest of the paper is organized as follows. Section 2 introduces CALU and the new ca-pivoting scheme. Section 3 describes the parallel LU factorization of a tall-skinny matrix using ca-pivoting, and discusses its performance in terms of computation and communication cost. Section 4 presents the parallel CALU algorithm for a matrix distributed in a 2D layout and discusses its computation and communication cost. Section 5 compares the classic LU factorization algorithms implemented in ScaLAPACK and the newly proposed CALU algorithm. Section 6 describes experimental results that first discuss the stability of the ca-pivoting scheme, and second evaluate the performance of CALU on two computational systems, an IBM POWER 5 system and a Cray XT4 system, located at the National Energy Research Scientific Computing Center (NERSC). Section 7 presents the conclusions and our future work.
2 Description of CALU

In this section we describe the main steps of the CALU algorithm for computing the LU factorization of a matrix A of size m × n. We use several notations. We refer to the submatrix of A formed by elements of row indices from i to j and column indices from d to e as A(i : j, d : e). If A is the result of the multiplication of two matrices B and C, we refer to the corresponding submatrix of A as (BC)(i : j, d : e).

CALU is a block algorithm that factorizes the input matrix by traversing iteratively blocks of columns. At the first iteration, the matrix A is partitioned as follows:

A = [ A_11  A_12 ]
    [ A_21  A_22 ]

where A_11 is of size b × b, A_21 is of size (m − b) × b, A_12 is of size b × (n − b) and A_22 is of size (m − b) × (n − b). As other classic right-looking algorithms, CALU first computes the LU factorization of the first block-column, then determines the block U_12, and updates the trailing matrix A_22.

The main difference with respect to other existing algorithms lies in the factorization of the first block-column. CALU uses the new ca-pivoting strategy, which consists in performing first a preprocessing step in which a good set of pivot rows is identified. Second, the pivot rows are permuted into the first b positions, and the LU factorization with no pivoting of the first block-column is performed. CALU considers that the first block-column is partitioned in P block-rows. We present here the simple case P = 4. For the sake of simplicity, we suppose that m is a multiple of 4.
The preprocessing step starts by performing the LU factorization with partial pivoting of each block-row, as follows (semicolons denote vertical stacking of blocks):

A = [A_0; A_1; A_2; A_3]
  = [Π̄_00 L̄_00 Ū_00; Π̄_10 L̄_10 Ū_10; Π̄_20 L̄_20 Ū_20; Π̄_30 L̄_30 Ū_30]
  = diag(Π̄_00, Π̄_10, Π̄_20, Π̄_30) · diag(L̄_00, L̄_10, L̄_20, L̄_30) · [Ū_00; Ū_10; Ū_20; Ū_30]
  ≡ Π̄_0 L̄_0 Ū_0
This leads to a decomposition in which the first factor, Π̄_0, is an m × m block diagonal matrix, where each diagonal block Π̄_i0 is a permutation matrix. The second factor, L̄_0, is an m × Pb block diagonal matrix, where each diagonal block L̄_i0 is an m/P × b lower unit trapezoidal matrix. The third factor, Ū_0, is a Pb × b matrix, where each block Ū_i0 is a b × b upper triangular factor. Note that this step aims at identifying in each block-row a set of b linearly independent rows, which correspond to the first b rows of Π̄_i0^T A_i, with i = 0 . . . 3.

From the P sets of local pivot rows, we perform a binary tree (of depth log_2 P = 2 in our example) of LU factorizations of matrices of size 2b × b to identify b global pivot rows. The 2 LU factorizations at the leaves of our depth-2 binary tree are shown here, combined in one matrix. This decomposition leads to a Pb × Pb permutation matrix Π̄_1, a Pb × 2b factor L̄_1 and a 2b × b factor Ū_1.
[ (Π̄_0^T A)(1 : b, 1 : b);
  (Π̄_0^T A)(m/P + 1 : m/P + b, 1 : b);
  (Π̄_0^T A)(2m/P + 1 : 2m/P + b, 1 : b);
  (Π̄_0^T A)(3m/P + 1 : 3m/P + b, 1 : b) ]
  = [Π̄_01 L̄_01 Ū_01; Π̄_11 L̄_11 Ū_11]
  = diag(Π̄_01, Π̄_11) · diag(L̄_01, L̄_11) · [Ū_01; Ū_11]
  ≡ Π̄_1 L̄_1 Ū_1
The global pivot rows are obtained after applying one more LU decomposition (at the root of our depth-2 binary tree) on the pivot rows identified previously:

[ (Π̄_1^T Π̄_0^T A)(1 : b, 1 : b);
  (Π̄_1^T Π̄_0^T A)(2m/P + 1 : 2m/P + b, 1 : b) ]
  = Π̄_02 L̄_02 Ū_02
  ≡ Π̄_2 L̄_2 Ū_2
The permutations identified in the preprocessing step are applied on the original matrix A. Then the LU factorization with no pivoting of the first block-column is performed, the block-row of U is computed and the trailing matrix is updated. Note that U_11 = Ū_2. The factorization continues on the trailing matrix Ā. The permutation matrices Π̄_0, Π̄_1, Π̄_2 do not have the same dimensions. By abuse of notation, we consider that Π̄_1, Π̄_2 are extended by the appropriate identity matrices to the dimension of Π̄_0.

Π̄_2^T Π̄_1^T Π̄_0^T A = [ L_11  0; L_21  I_{n−b} ] · [ I_b  0; 0  Ā ] · [ U_11  U_12; 0  U_22 ]
The ca-pivoting strategy has several important characteristics. First, when b = 1 or P = 1, ca-pivoting is equivalent to partial pivoting. Second, the elimination of each column of A leads to a rank-1 update of the trailing matrix. The rank-1 update property is shown experimentally to be very important for the stability of LU factorization [10]. A larger rank update might lead to an unstable LU factorization, as for example in another strategy suitable for parallel computing called parallel pivoting [10]. Third, the numerical tests presented in Section 6 show that it is stable in practice.
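The preprocessing tree can be prototyped sequentially. The following sketch is an illustration of the idea rather than the paper's implementation: `gepp_pivot_rows` stands in for the local LU factorizations with partial pivoting, and `tournament_pivots` runs the binary tree over the P row-blocks of an m × b panel.

```python
import numpy as np

def gepp_pivot_rows(M, b):
    """Indices of the b pivot rows chosen by Gaussian elimination
    with partial pivoting applied to M."""
    M = M.astype(float).copy()
    rows = np.arange(M.shape[0])
    for k in range(b):
        p = k + np.argmax(np.abs(M[k:, k]))   # partial pivoting
        M[[k, p]] = M[[p, k]]
        rows[[k, p]] = rows[[p, k]]
        M[k + 1:, k:] -= np.outer(M[k + 1:, k] / M[k, k], M[k, k:])
    return rows[:b]

def tournament_pivots(A, P, b):
    """ca-pivoting preprocessing: a binary tree of LU factorizations
    over P row-blocks of the m-by-b panel A, returning b global pivot rows."""
    m = A.shape[0]
    # leaves: local factorization of each block-row
    cand = [gepp_pivot_rows(A[i * m // P:(i + 1) * m // P], b) + i * m // P
            for i in range(P)]
    # internal nodes: factor the 2b stacked candidate rows of each pair
    while len(cand) > 1:
        pairs = zip(cand[0::2], cand[1::2])
        cand = [np.concatenate(pr)[gepp_pivot_rows(A[np.concatenate(pr)], b)]
                for pr in pairs]
    return cand[0]

rng = np.random.default_rng(0)
A = rng.standard_normal((16, 2))
piv = tournament_pivots(A, P=4, b=2)
print(sorted(int(i) for i in piv))   # two distinct row indices in [0, 16)
```

The returned rows would then be permuted to the top of the panel before the unpivoted LU factorization.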
3 TSLU algorithm

In this section we present TSLU, a parallel algorithm for computing the LU factorization of an m × b matrix A, with m ≫ b, which is distributed over P processors using a 1D layout. We also discuss its performance in terms of flops and number of messages exchanged during the factorization. This algorithm will be used in CALU for performing the factorization of a block-column.

TSLU essentially does an all-reduction (with a butterfly communication pattern) where the reduction operation is Gaussian elimination on a pair of matrices of size b × b stacked on top of one another. For completeness, we describe this all-reduction operation in more detail as follows, and then show an example.

The computation of TSLU is performed as an all-reduction operation, and uses a butterfly for the communication pattern. The butterfly method uses a tree-like computation as described in the previous section, and takes place in (log_2 P + 1) steps, starting from the bottom level k = 0 of a binary tree. Each node of the binary tree is associated with a set of processors. For the sake of simplicity, we suppose that the number of processors P is a power of two, that the processors are numbered from 0 to P − 1, and that P divides m.
We use notations similar to [3]: fstP(i, k) denotes the first processor affected to the node of the binary tree at level k to which processor i belongs; target(i, k) refers to the processor with which processor i exchanges data at level k of the tree in a butterfly pattern; tgtfstP(i, k) denotes the processor with which fstP(i, k) exchanges data at level k; level(i, k) denotes the node at level k of the binary tree which is assigned to a set of processors that includes processor i, and is computed as:

level(i, k)   = ⌊ i / 2^k ⌋
fstP(i, k)    = 2^k · level(i, k)
target(i, k)  = fstP(i, k) + (i + 2^{k−1}) mod 2^k
tgtfstP(i, k) = target(fstP(i, k), k) = fstP(i, k) + 2^{k−1}
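These index formulas translate directly into code; the helpers below (a sketch, assuming as above that P is a power of two) make the butterfly pairing easy to check:

```python
def level(i, k):
    # node at level k of the binary tree containing processor i
    return i >> k            # floor(i / 2**k)

def fstP(i, k):
    # first processor assigned to that node
    return (i >> k) << k     # 2**k * level(i, k)

def target(i, k):
    # butterfly partner of processor i at level k (k >= 1)
    return fstP(i, k) + (i + 2**(k - 1)) % 2**k

def tgtfstP(i, k):
    # partner of the node's first processor
    return fstP(i, k) + 2**(k - 1)

# With P = 4 processors: at level 1 the pairs are (0,1) and (2,3),
# at level 2 the pairs are (0,2) and (1,3).
print([target(i, 1) for i in range(4)])  # [1, 0, 3, 2]
print([target(i, 2) for i in range(4)])  # [2, 3, 0, 1]
```

This reproduces the exchanges of the 16 × 2 example discussed below.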
The algorithm starts with a local LU factorization on each processor of the m/P × b block-rows that it owns. Then at each level k of the binary tree and for each node at this level, pairs of processors perform redundantly an LU factorization. Consider for example a processor i and the node at level k which is mapped on processor i, and identified as level(i, k). The factors L and U computed at this node are denoted as L̄_{level(i,k),k}, Ū_{level(i,k),k}. Processor i and its target processor exchange data and perform redundantly the LU factorization of two matrices of size b × b.

Note that the local LU factorizations performed in the first three steps of TSLU are not performed in place (the input matrix is not overwritten). Hence TSLU needs an extra storage of size m × b to store the resulting L and U factors of these factorizations and a vector of size b to store the permutation vector.

TSLU algorithm
1. Let i be my processor number.
2. Compute the LU factorization of the local m/P × b group of rows, A_i = Π̄_i0 L̄_i0 Ū_i0. Let B_i be formed by the b pivot rows, B_i = (Π̄_i0^T A_i)(1 : b, 1 : b).
3. for k = 1 to log_2 P do
      if i ≥ tgtfstP(i, k) then φ = target(i, k), τ = i
      else φ = i, τ = target(i, k)
      end if
      Let l = level(i, k).
      (a) Processor φ exchanges its B_i with processor τ.
      (b) Compute the LU factorization of the two matrices B_φ and B_τ of size b × b stacked one on top of another: [B_φ; B_τ] = Π̄_{l,k} L̄_{l,k} Ū_{l,k}
      (c) Let B_i be formed by the first b pivot rows of Π̄_{l,k}^T [B_φ; B_τ].
      end if
   end for
4. Let the final permutation be Π̄ = Π̄_0 Π̄_1 . . . Π̄_{log_2 P}, and permute the local A = Π̄^T A.
5. Let U = Ū_{0,log_2 P}, where U is the upper triangular factor of A.
6. Compute the local L factor, L_i = A_i U^{−1}.
We illustrate the execution of this algorithm on a small example in Figure 1, where we suppose that the matrix A of size 16 × 2 is distributed following a 1D block cyclic distribution, with blocks of size 2 × 2 on 4 processors. Let

A = [ 2 0 2 0 0 1 2 0 2 1 4 1 0 0 1 4 ;
      4 1 0 0 1 4 1 2 0 2 1 0 0 2 0 2 ]^T

In this example, the 1st, 2nd, 9th and 10th rows are distributed on processor 0. First a local LU factorization is performed by each processor. For processor 0, the 1st and 9th rows are used as pivots. Second, the processors 0 and 1 exchange the 2 rows used as pivots in the local LU factorization. Then they perform redundantly the LU factorization of the 4 × 2 matrix formed by these rows stacked one on top of another. Similarly, processors 2 and 3 exchange their rows and perform redundantly the LU factorization of the matrix formed by these rows. In the third step, processors 0 and 2 exchange the 2 pivot rows identified in the second step, and perform an LU factorization on the 4 × 2 matrix formed by these rows. The same computation is performed by the processors 1 and 3. The rows identified in the third step represent the pivots that will be used to factorize the entire matrix A. In this simple example, the pivot rows used by TSLU happen to be the same as those used by Gaussian elimination with partial pivoting.

To study the performance of TSLU, we use a classical model to describe a machine architecture:
Figure 1: Example of execution of TSLU on 4 processors.
we use one parameter to describe the time per flop (add and multiply), denoted γ, and one parameter to count the time per divide, denoted γ_d. We estimate the time for sending a message of m words between two processors as α + mβ, where α denotes the latency and β the inverse of the bandwidth. We approximate the time of broadcasts and combines that involve P processors by assuming that log_2 P identical steps of communication and/or computation are needed. With these notations, the runtime of TSLU is estimated to be (we omit low order terms):

T_TSLU(m, b, P) = [ 2mb^2/P + (2b^3/3)(log_2 P − 1) ] γ + b (log_2 P + 1) γ_d + log_2 P · α + b^2 log_2 P · β        (1)

We compare this to ScaLAPACK later in Section 5.
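Equation (1) is easy to evaluate numerically. In the sketch below the machine parameters γ, γ_d, α, β are made-up values for illustration, not measurements from the paper:

```python
import math

def t_tslu(m, b, P, gamma, gamma_d, alpha, beta):
    """Runtime model of TSLU from Equation (1); low order terms omitted."""
    lg = math.log2(P)
    flops = (2 * m * b**2 / P + (2 * b**3 / 3) * (lg - 1)) * gamma
    divs = b * (lg + 1) * gamma_d
    comm = lg * alpha + b**2 * lg * beta
    return flops + divs + comm

# made-up machine parameters: 1 ns per flop and per divide,
# 5 us latency, 1 ns per word
t = t_tslu(m=10**5, b=64, P=64,
           gamma=1e-9, gamma_d=1e-9, alpha=5e-6, beta=1e-9)
print(t)
```

Note that the latency term grows only as log_2 P, independent of the panel height m.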
4 Parallel CALU algorithm for matrices distributed in a 2D layout

In this section we present a parallel algorithm that implements the CALU method presented in Section 2. We consider an m × n matrix block cyclically distributed over a bi-dimensional grid of processors P = P_r × P_c, using square blocks of dimension b × b. The parallel algorithm uses a block right-looking approach, as used for example in the PDGETRF routine in ScaLAPACK or in the HPL benchmark. That is, it iterates over block-columns of A, and at each step first a block-column of width b is factored. Then the trailing matrix is permuted and updated, and the decomposition continues on the trailing matrix. The main difference with the other algorithms is that CALU factors a block-column using the TSLU factorization presented in Section 3, which leads to an important reduction in the number of messages exchanged during the factorization.
Consider that the first j − 1 iterations of the LU factorization were performed. That is, the first j − 1 block-columns were factored and the trailing matrix was permuted and updated. The active matrix at step j is of dimension (m − (j − 1)b) × (n − (j − 1)b) = m_j × n_j. For the clarity of presentation, we suppose that b divides m and n. We describe here the main steps involved in the j-th iteration of CALU:

1. The column of the grid that holds block-column j computes its LU factorization using TSLU (the algorithm in Section 3).
2. Every processor in the processor column holding the matrix block-column j broadcasts along its processor row the locally stored subblock of L. It also broadcasts an array of size b that stores the permutation vector Π_j associated with the LU factorization of block-column j.
3. The matrix A is permuted according to Π_j.
4. Every processor in the processor row holding the matrix block-row j of U computes its local block.
5. Every processor in the processor row holding matrix block-row j of U broadcasts its local block down its column.
6. All processors update the trailing matrix.
In our current implementation, we use routines from ScaLAPACK for several steps of CALU. Step 3 is performed by a call to PDLASWP, step 4 is done by PDTRSM, and steps 5 and 6 correspond to a call to PDGEMM. However, CALU can be implemented differently, and can incorporate techniques which allow some overlap between computation and communication, such as the so-called look-ahead technique used in the HPL benchmark.

To estimate the performance of CALU, we assume that the network bandwidth and latency are not necessarily the same everywhere, e.g. they can be different along columns of the grid than along rows of the grid. We use a different bandwidth and latency for communication between processors in different rows and the same column (α_c and β_c) versus different columns and the same rows (α_r and β_r). This is a first step towards understanding certain hierarchical parallel machines, where there is high bandwidth among processors on the same chip (or node or module) and lower between processors on different chips (or nodes or modules).
The total computation time over a rectangular grid of processors is given in Equation 2 (we omit some lower order terms and the time of pivoting rows locally). In this estimation we consider that a total of (2n/b) · log_2 P_r messages are exchanged for swapping rows of matrix A in step 3. This is because the swapping of b rows occurs after each block-column factorization. Hence, this operation can be implemented in two steps, using 2 log_2 P_r messages. First, each processor sends at most b rows that need to be swapped to the root processor as a reduce operation. Second, the root processor broadcasts the necessary rows to all the processors in its processor column. However, in our current implementation we use PDLASWP, and this routine performs one message exchange for each row swap, which leads to a total of n log_2 P_r messages exchanged for step 3. In our current work, we are replacing this routine by a routine that is implemented as explained above, and we include the number of messages associated with the future routine instead of PDLASWP in our time estimation.
T_CALU(m, n, P_r, P_c) =
    [ (1/P)(mn^2 − n^3/3) + (1/P_r)(mn − n^2/2) · 2b + n^2 b/(2P_c) + (2nb^2/3)(log_2 P_r − 1) ] γ
  + n (log_2 P_r + 1) γ_d
  + log_2 P_r [ (3n/b) α_c + ( nb/2 + 3n^2/(2P_c) ) β_c ]
  + log_2 P_c [ (3n/b) α_r + (1/P_r)(mn − n^2/2) β_r ]        (2)

5 Comparison with ScaLAPACK's LU factorization
Consider that we decompose an m × n matrix which is distributed block cyclically over a P_r by P_c grid of processors, where P_r · P_c = P and m ≥ n. The two-dimensional block cyclic distribution uses square blocks of dimension b × b. The algorithm loops over n/b block-columns. At the j-th step, the first j − 1 block-columns of L and block-rows of U are already computed. At this step, the block-column j of L is factored (call to PDGETF2) using pivoting. The pivoting information is applied to the rest of the matrix (call to PDLASWP). The block-row j of U is computed using triangular solves (call to PDTRSM), and then the trailing matrix is updated (call to PDGEMM).

Equation 3 represents the runtime estimation of the PDGETRF routine in ScaLAPACK. To be consistent with the runtime estimation of CALU, we consider for PDGETRF as well that the swapping of b rows performed by a call to PDLASWP leads to 2 · log_2 P_r messages exchanged.

T_PDGETRF(m, n, P_r, P_c) =
    [ (1/P)(mn^2 − n^3/3) + (1/P_r)(mn − n^2/2) · b + n^2 b/(2P_c) ] γ
  + n γ_d
  + [ 2n (1 + 2/b) log_2 P_r + n ] α_c + ( nb/2 + 3n^2/(2P_c) ) log_2 P_r β_c
  + log_2 P_c [ (3n/b) α_r + (1/P_r)(mn − n^2/2) β_r ]        (3)

To better understand the differences between CALU and the LU factorization implemented in ScaLAPACK, we will compare the runtime estimations of the two factorizations as given by Equation 2 and Equation 3.
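Before comparing the models term by term, the latency terms along processor columns can be evaluated numerically; in the sketch below the configuration n, b, P_r is made up for illustration:

```python
import math

def col_messages_calu(n, b, pr):
    # alpha_c coefficient in Equation (2): (3n/b) log2(Pr)
    return (3 * n / b) * math.log2(pr)

def col_messages_pdgetrf(n, b, pr):
    # alpha_c coefficient in Equation (3): 2n(1 + 2/b) log2(Pr) + n
    return 2 * n * (1 + 2 / b) * math.log2(pr) + n

n, b, pr = 10**4, 100, 16
ratio = col_messages_pdgetrf(n, b, pr) / col_messages_calu(n, b, pr)
print(round(ratio, 1))   # ~76.3 here: CALU sends far fewer panel messages
```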
Comparing the addition and multiplication flop counts, CALU adds a lower order term of about b(mn − n^2/2)/P_r. This term comes from TSLU, which performs the factorization of a block-column twice: first to get the pivot rows, and second to actually compute the factors.

Comparing the division flop counts, CALU adds a lower order term of n log_2 P_r, all from the TSLUs of block-columns (the factorizations of the two b × b matrices).

Comparing communication costs within processor columns (α_c and β_c terms), for bandwidth, both algorithms have the same communication volume. For latency, CALU is lower by a factor of b(1 + 1/ log_2 P_r). The reduction in the number of messages within processor columns comes from the reduction in the factorization of a block-column performed by TSLU versus PDGETF2.

Comparing communication costs within processor rows (α_r and β_r terms), in PDGETRF, the number of broadcasts within processor rows is already of the order of n/b, and hence both algorithms have the same costs.

6 Experimental results
In this section, we evaluate the performance of the CALU algorithm; the goal of our experiments is three-fold. First, we study the numerical stability of the new ca-pivoting strategy. Second, we evaluate the performance improvement obtained in the panel factorization by TSLU compared to the corresponding routine in ScaLAPACK. And third, we evaluate the performance of CALU and compare it to the PDGETRF routine in ScaLAPACK.

The experiments are performed on two computational systems at the National Energy Research Scientific Computing Center (NERSC). The first system is an IBM p575 POWER 5 system, which has 888 compute processors distributed among 111 compute nodes. Each processor is clocked at 1.9 GHz and has a theoretical peak performance of 7.6 GFLOPs/s. Each node of 8 processors has 32 Gbytes of memory. The compute nodes are connected to each other with a high-bandwidth, low-latency switching network. The peak bandwidth is 3100 MB/s and the MPI point-to-point internode latency is 4.5 usec [7]. On IBM POWER 5 we use the BLAS routines from the ESSL library (Engineering and Scientific Subroutine Library). For all the runs we used the maximum number of processors available per node.

The second system is a Cray XT4 system with 9660 compute nodes. Each compute node has a 2.6 GHz dual-core AMD Opteron processor with a theoretical peak performance of 5.2 GFLOPs/s. Each compute node has 4 GBytes of memory. In our comparisons we use the routines PDGETRF and PDGETF2 from the Cray Scientific Libraries package, LibSci. However, these routines have no significant optimization with respect to the routines from ScaLAPACK [8]. In our tests we use ScaLAPACK in mixed mode, that is, MPI is used in between compute nodes, and threaded BLAS level parallelism on cores within a node. The threaded BLAS used is the libGoto library.
6.1 Stability of the ca-pivoting strategy

In this section we show that CALU is as stable as Gaussian elimination with partial pivoting. For this we summarize results that express the stability of Gaussian elimination, in terms of the pivot growth and the normwise backward stability attained. We perform our tests in Matlab, using matrices from a normal distribution with sizes varying from 1024 to 8192.

The growth factor involves the values of the elements of A during the elimination. We use the growth factor as defined by Trefethen and Schreiber [10],

g_T = max_{i,j,k} |a_ij^(k)| / σ_A,

where |a_ij^(k)| denotes the absolute value of the element of A at row i and column j at the k-th step of elimination, and σ_A is the standard deviation of the initial element distribution. It is shown experimentally in [10] that in practice g_T ≈ n^{2/3} for partial pivoting, and g_T ≈ n^{1/2} for complete pivoting (at least for n ≤ 1024).

In Figure 2 (left) we display the value of the growth factor g_T obtained for different block sizes and different numbers of processors. Here two samples are used for each test. From the point of view of stability, only the number of rows in the process grid P_r plays a role. Hence we vary only P_r, presented as P in Figure 2. We observe that the growth factor of ca-pivoting grows as c · n^{2/3} (c being a small constant around 1.5), and has the same behavior as partial pivoting.

The new ca-pivoting strategy does not ensure that the element of maximum magnitude is used as pivot at each step of factorization. Hence |L| is not bounded by 1 as in Gaussian elimination with partial pivoting. However, in practice the pivots used by ca-pivoting are very close to the elements of maximum magnitude in the respective columns. In Figure 2 (right) we display the value of the minimum threshold in CALU, where the threshold is computed at each step of factorization i as the quotient of the pivot used at step i divided by the maximum value in column i. We observe that this value is always larger than 0.33, meaning that in our tests |L| is bounded by 3. The average value of the threshold is larger than 0.84. We have performed experiments on different matrices, such as matrices following different random distributions and dense Toeplitz matrices, and we have obtained similar results.
To evaluate the stability of ca-pivoting in terms of normwise backward stability, we compute three accuracy tests as performed in the HPL benchmark, denoted as HPL1, HPL2 and HPL3. For stability, the expected values are of the order of O(1), that is, a slowly growing function of n. In HPL, the accuracy tests are passed if the values of the three quantities are smaller than 16.

HPL1 = ||Ax − b||_∞ / (ε ||A||_1 · N),
HPL2 = ||Ax − b||_∞ / (ε ||A||_1 ||x||_1),
HPL3 = ||Ax − b||_∞ / (ε ||A||_∞ ||x||_∞ · N).
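The three residuals are straightforward to compute. In the sketch below, numpy.linalg.solve (which itself uses LU with partial pivoting) stands in for a CALU solve, and the matrix size is made up for illustration:

```python
import numpy as np

def hpl_tests(A, x, b):
    """The three scaled residuals checked by the HPL benchmark."""
    eps = np.finfo(float).eps
    N = A.shape[0]
    r = np.abs(A @ x - b).max()              # ||Ax - b||_inf
    norm1 = np.abs(A).sum(axis=0).max()      # ||A||_1
    norminf = np.abs(A).sum(axis=1).max()    # ||A||_inf
    return (r / (eps * norm1 * N),
            r / (eps * norm1 * np.abs(x).sum()),
            r / (eps * norminf * np.abs(x).max() * N))

rng = np.random.default_rng(2)
N = 200
A = rng.standard_normal((N, N))
b = rng.standard_normal(N)
x = np.linalg.solve(A, b)                    # stands in for a CALU solve
print(all(t < 16 for t in hpl_tests(A, x, b)))   # expected: True
```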
  n     P     b     g_T     τ_ave   τ_min   w_b       HPL1      HPL2      HPL3
  2^13  256   32    497.43  0.84    0.40    4.22e-14  5.06e-02  2.26e-02  4.54e-03
        256   16    551.76  0.86    0.35    4.10e-14  2.24e-02  2.15e-02  4.34e-03
        128   64    463.84  0.84    0.42    4.11e-14  4.78e-02  2.21e-02  4.22e-03
        128   32    525.81  0.84    0.38    3.95e-14  3.90e-02  2.12e-02  4.30e-03
        128   16    573.11  0.86    0.38    3.70e-14  6.67e-02  1.97e-02  3.89e-03
        64    128   402.09  0.85    0.47    3.86e-14  2.09e-02  2.07e-02  3.84e-03
        64    64    457.49  0.84    0.43    3.82e-14  3.45e-02  2.05e-02  4.26e-03
        64    32    468.02  0.84    0.37    4.36e-14  6.64e-02  2.31e-02  4.85e-03
        64    16    482.58  0.86    0.39    3.87e-14  1.32e-02  2.08e-02  4.24e-03
  2^12  256   16    334.03  0.87    0.37    1.88e-14  1.38e-02  1.94e-02  4.36e-03
        128   32    341.13  0.86    0.42    2.15e-14  2.35e-02  2.22e-02  4.99e-03
        128   16    348.01  0.87    0.38    1.98e-14  5.58e-01  2.11e-02  3.95e-03
        64    64    294.77  0.86    0.47    2.03e-14  1.22e-02  2.13e-02  4.55e-03
        64    32    339.85  0.86    0.41    2.03e-14  2.39e-02  2.13e-02  4.56e-03
        64    16    306.10  0.87    0.37    1.99e-14  2.76e-02  2.10e-02  3.98e-03
  2^11  128   16    198.48  0.89    0.41    9.71e-15  3.74e-02  2.01e-02  4.36e-03
        64    32    201.92  0.88    0.43    1.13e-14  5.52e-02  2.32e-02  5.16e-03
        64    16    187.18  0.89    0.42    9.91e-15  2.83e-02  2.06e-02  4.49e-03
  2^10  64    16    131.98  0.90    0.44    5.13e-15  2.24e-02  2.11e-02  5.18e-03

Table 1: HPL accuracy tests for the ca-pivoting strategy
  n     S    g_T     w_b       HPL1      HPL2      HPL3
 2^13   5   325.36  2.64e-14  1.41e-01  1.40e-02  2.75e-03
 2^12   5   219.93  1.33e-14  1.22e-02  1.40e-02  3.02e-03
 2^11   5   151.42  6.78e-15  1.86e-02  1.38e-02  3.01e-03
 2^10  10   101.65  3.87e-15  2.41e-02  1.57e-02  3.63e-03

Table 2: HPL accuracy tests for Gaussian elimination with partial pivoting
[Figure 2 shows two plots for matrices of size n = 1024 to 8192 following a normal distribution (randn, 2D layout, new pivoting), for the tested configurations P ∈ {64, 128, 256} and b ∈ {16, 32, 64, 128}: (left) the average growth factor g_T, together with GEPP and the reference curves n^{2/3} and 2n^{2/3}; (right) the minimum threshold, which stays roughly between 0.34 and 0.46 for all configurations.]

Figure 2: The growth factor and the minimum threshold value for matrices following a normal distribution
We present in Table 1 the results obtained for the three tests for CALU, when varying the matrix size, the number of processors, and the block size. For the matrix of size n = 2^k in Table 1, the sample size is S = max{10 · 2^{10−k}, 3}. We also record the growth factor g_T, the average threshold τ_ave, the minimum threshold τ_min, and the componentwise backward error before iterative refinement, w_b. Usually, after 2 steps of iterative refinement, the componentwise backward error can be reduced to the order of 10^{−16}. We display in Table 2 the results obtained by LU factorization with partial pivoting for the same matrix sizes, where S is the sample size. All three tests, as performed in HPL, are passed by CALU. Moreover, for all the test cases, CALU leads to results of the same order of magnitude (10^{−2}, 10^{−3}) as LU factorization with partial pivoting.
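The iterative refinement step mentioned above can be sketched in a few lines of pure Python (an illustrative toy, not the implementation used in the experiments; `solve_lu` and `refine` are names we introduce): the residual r = b − Ax is computed, a correction d is obtained by solving Ad = r, and the solution is updated to x + d.

```python
def solve_lu(A, b):
    """Tiny Gaussian elimination with partial pivoting (illustrative only)."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]   # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))  # pivot row
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):             # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def refine(A, b, x, steps=2):
    """A few steps of iterative refinement on the computed solution x."""
    n = len(b)
    for _ in range(steps):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        d = solve_lu(A, r)                     # correction: A d = r
        x = [x[i] + d[i] for i in range(n)]
    return x
```

In practice the residual would be solved with the already-computed LU factors rather than by refactoring, and two steps typically suffice to drive w_b down to the order of 10^{−16}, as reported above.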
6.2 Performance of TSLU
We evaluate the performance of TSLU using matrices of size m × n, a block size b = n, and varying both m and n (m ∈ {10^3, 5·10^3, 10^4, 10^5, 10^6} and n ∈ {50, 100, 150}). Our goal is to study the performance improvement of TSLU compared to the ScaLAPACK PDGETF2 routine. The time ratio between PDGETF2 and TSLU obtained on the IBM POWER 5 system and the Cray XT4 system is displayed in Tables 3 and 4.

The improvement is expected in part due to using a better LU factorization algorithm in the local sequential LU factorization (step 2 of Algorithm TSLU), and in part due to reducing the latency cost. In our algorithm we use the recursive LU factorization, the RGETF2 routine as given in Appendix B of [6]. Recall that TSLU performs twice the number of flops of PDGETF2.
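The tournament idea behind the panel factorization can be sketched sequentially in a few lines (our own illustrative code with simulated "processors", not the parallel TSLU implementation): each processor nominates b candidate pivot rows from its local row block using partial pivoting, and candidate sets are merged pairwise up a binary tree, so that each panel needs only O(log P) messages instead of one message per column.

```python
def gepp_pivot_rows(block, b):
    """Return the b rows of `block` selected as pivots by partial pivoting
    (elimination is done on a copy; the original rows are returned)."""
    work = [row[:] for row in block]
    orig = list(block)
    for k in range(b):
        p = max(range(k, len(work)), key=lambda i: abs(work[i][k]))
        work[k], work[p] = work[p], work[k]
        orig[k], orig[p] = orig[p], orig[k]
        for i in range(k + 1, len(work)):      # eliminate below the pivot
            if work[k][k] != 0.0:
                f = work[i][k] / work[k][k]
                for j in range(k, b):
                    work[i][j] -= f * work[k][j]
    return orig[:b]

def tournament_pivots(blocks, b):
    """Binary-tree reduction of the per-'processor' candidate pivot rows."""
    cands = [gepp_pivot_rows(blk, b) for blk in blocks]
    while len(cands) > 1:
        merged = [gepp_pivot_rows(cands[i] + cands[i + 1], b)
                  for i in range(0, len(cands) - 1, 2)]
        if len(cands) % 2 == 1:
            merged.append(cands[-1])           # odd one out advances
        cands = merged
    return cands[0]

blocks = [[[1.0, 2.0], [3.0, 4.0], [0.5, 1.0]],
          [[9.0, 1.0], [2.0, 2.0], [1.0, 5.0]],
          [[4.0, 4.0], [6.0, 1.0], [2.0, 3.0]],
          [[7.0, 2.0], [8.0, 3.0], [1.0, 1.0]]]
print(tournament_pivots(blocks, 2))  # → [[9.0, 1.0], [1.0, 5.0]]
```

Note that the row with the globally largest first-column entry always survives the tournament, since it wins its local round and every merge it enters.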
To better understand these issues, we compare two different configurations of TSLU. In the first one, the local LU factorization performed by each processor on its group of rows is done using the classic LU factorization. We use the LAPACK DGETF2 routine, and the results for this configuration are displayed in the columns denoted Cl in Tables 3 and 4. In the second one, displayed in the columns Rec, we use the recursive LU factorization, the RGETF2 routine as given in Appendix B of [6]. In each table we show results for fixed m and different values of n and number of processors. Several results are missing; this is because either there was not enough memory to perform the factorization, or the input matrix is too small and some processors are not involved.
                           No of processors P = Pr × Pc
                   4           8          16          32          64
                  2×2         2×4         4×4         4×8         8×8
  m      n=b   Rec   Cl    Rec   Cl    Rec   Cl    Rec   Cl    Rec   Cl
 10^3     50   1.66 1.59   1.96 2.06   2.24 2.09    -    -      -    -
 10^3    100   1.27 1.17   1.44 1.37    -    -      -    -      -    -
 10^3    150   1.06 0.97    -    -      -    -      -    -      -    -
 5·10^3   50   1.62 1.08   1.48 1.44   1.97 1.94   1.78 2.05   2.09 1.71
 5·10^3  100   0.98 0.85   1.13 1.04   1.36 1.29   1.39 1.47    -    -
 5·10^3  150   1.08 0.81   1.00 1.01   1.06 0.96   0.99 0.97    -    -
 10^4     50   1.24 0.87   1.71 0.88   1.78 1.68   1.66 1.94   2.18 1.76
 10^4    100   1.34 0.81   1.01 0.80   1.26 1.15   1.30 1.28   1.51 1.21
 10^4    150   3.07 0.88   1.03 0.78   1.01 0.89   1.00 0.97   1.06 0.80
 10^5     50   1.07 0.70   1.09 0.72   1.18 0.85   1.15 1.32   1.50 1.23
 10^5    100   1.00 0.70   1.04 0.67   1.09 0.73   1.21 1.03   1.19 1.01
 10^5    150   1.13 0.68   1.13 0.69   1.36 0.77   1.08 0.84   1.03 0.75
 10^6     50   1.36 0.71   1.27 0.71   1.25 0.70   1.12 0.69   2.01 0.82
 10^6    100   1.84 0.75   1.95 0.87   1.62 0.73   2.90 0.84   1.08 0.70
 10^6    150   2.32 0.81   2.34 0.89   4.37 0.90   3.42 0.85   1.22 0.70

Table 3: Time ratio of PDGETF2 to TSLU obtained on the IBM POWER 5 system, using DGETF2 for the local LU factorization (Cl) and RGETF2 for the local LU factorization (Rec). The matrix factorized is m × n, with a block of size b = n.

On the IBM POWER 5 system (Table 3), the best improvement is obtained for the largest
.OntheIBMPOWER 5system(Table3), thebest improvementisobtainedfor thelargest
matrix in our test set
m = 10
6
and
n = b = 150
, where TSLU outperforms PDGETF2 by a fa torof4.37
on16
pro essors. Theimprovementduetolaten yredu tionisalmostafa tor2
. Thisshowsthat redu ingthelaten y ostisanimportantpartoftheoverallimprovement.Thebestperforman e ofTSLUon theIBMPOWER 5systemis
215
GFLOPs/s, and itis obtained form = 10
6
and
n = 150
on64
pro essors(we ounthere thetotal numberof ops performedbyTSLU).Thisrepresents44%
ofthetheoreti alpeakperforman e. Thisperforman e orrespondstoanimprovementof1.22
overPDGETF2.For small matri es, we an noti e that the improvement omes mainly from redu ing the
laten y ost. For intermediate size matri es and a small number of pro essors (up to
4
,8
pro essors), the improvement omes mainly from using re ursion. For thesame matri es anda large number of pro essors, the improvement omes from de reasing the laten y ost. On
small number of pro essors the re ursion leads to important improvements for large matri es
(
m = 10
5
, 10
6
and varying
n
), for instan e a fa tor of2.3
form = 10
5
and
n = 150
on4
pro essors. However, with in reasing number of pro essors, even for large matri es, redu ingthe laten yplaysanimportant rolein theoverallimprovement. Thebest resultsare obtained
on
16
and32
pro essorsfor the biggest matri esm = 10
6
and
n = 150
, showing the overall improvementfa torsof4.37
and3.42
respe tively.OntheCrayXT4system(Table4),thebest improvementis seenfor
m = 10^6 and n = 150: a factor of 5.58 on 4 processors and a factor of 5.52 on 8 processors. The best performance of the TSLU algorithm is 240 GFLOPs/s, obtained by TSLU on 64 processors for m = 10^6 and n = 150. This represents 36% of the theoretical peak performance, and corresponds to an improvement of 3.67
over PDGETF2.

                           No of processors P = Pr × Pc
                   4           8          16          32          64
                  2×2         2×4         4×4         4×8         8×8
  m      n=b   Rec   Cl    Rec   Cl    Rec   Cl    Rec   Cl    Rec   Cl
 10^3     50   1.42 2.23   1.85 2.71   2.09 3.09    -    -      -    -
 10^3    100   1.14 1.39   1.29 1.56    -    -      -    -      -    -
 10^3    150   1.12 0.91    -    -      -    -      -    -      -    -
 5·10^3   50   1.22 1.42   1.65 2.15   1.97 2.72   2.10 3.06   1.04 2.59
 5·10^3  100   1.27 1.24   1.25 1.32   1.35 1.53   1.38 1.65    -    -
 5·10^3  150   1.67 1.22   0.97 0.90   0.88 0.91   0.85 0.90    -    -
 10^4     50   1.20 1.14   1.37 1.19   1.85 2.42   1.03 2.88   2.03 3.10
 10^4    100   2.19 1.34   1.56 1.44   1.30 1.41   0.94 1.56   1.36 1.94
 10^4    150   2.61 1.30   2.03 1.44   0.92 0.88   0.81 0.90   0.87 0.93
 10^5     50   2.12 1.13   2.23 1.37   2.29 1.50   1.20 1.47   1.76 1.88
 10^5    100   3.14 1.25   3.13 1.45   2.97 1.43   1.92 1.39   2.38 1.39
 10^5    150   3.78 1.30   3.57 1.47   3.14 1.30   2.12 1.26   2.34 1.15
 10^6     50   2.99 1.49   3.02 1.51   2.86 1.28   2.09 1.03   2.14 1.31
 10^6    100   4.51 1.65   4.55 1.71   4.04 1.36   3.04 1.13   3.07 1.27
 10^6    150   5.58 1.61   5.52 1.76   4.80 1.39   3.60 1.12   3.67 1.20

Table 4: Time ratio of PDGETF2 to TSLU obtained on the Cray XT4 system, using DGETF2 for the local LU factorization (Cl) and RGETF2 for the local LU factorization (Rec). The matrix factorized is m × n, with a block of size b = n.

For the small matrices (m = 10^3 and n varying from 50 to 150, or m = 5·10^3 and n = 50 or 100), the best results are obtained using classic LU, hence the improvement is due solely to the reduction of the latency cost. For all the large matrices, using recursive LU shows better performance.

In summary, for all the cases tested, at least one of the new TSLU algorithms outperforms
the ScaLAPACK routine PDGETF2. For small matrices the usage of classic LU leads to better performance than the recursive LU. This is in accordance with the results in [6, 9], which show that the recursive algorithm does not fare better than the classic algorithm for small matrices. For larger matrices, recursive LU performs better than classic LU. The improvements obtained by TSLU are due to both reducing the latency cost and reducing the local LU factorization cost.
6.3 Performance of CALU
In this section we study the performance improvement of CALU compared to the ScaLAPACK PDGETRF routine. We use matrices of size m × m, varying both m and b (m ∈ {10^3, 5·10^3, 10^4} and b ∈ {50, 100, 150}). The time ratios between PDGETRF and CALU obtained on the IBM POWER 5 system and the Cray XT4 system are presented in Tables 5 and 6.

We have observed in Tables 3 and 4 that for a small number of processors, the best
performance for TSLU is obtained when recursive LU is used for the local LU factorization. The number of processors used for TSLU corresponds to the number of rows P_r in the 2D grid of processors used for CALU. Since in our tests for CALU P_r is relatively small (with values going from 2 to 8), in all our tests we use TSLU with recursive LU for the panel factorization.

For the IBM POWER 5 system, the results are displayed in Table 5. The first matrix (m = 10^3 and varying b) is relatively small, and thus we do not expect any important speedup with an increasing number of processors. In fact, for m = 10^3 we see no speedup for a number of
reduction of the latency cost. The best improvement is obtained for m = 10^3 and b = 50 (factors of 2.23 on 16 processors and 2.29 on 64 processors, respectively), mainly as a result of reducing the latency cost. An important improvement is obtained also for m = 5·10^3: a factor of 1.67 on 32 processors and a factor of 1.69 on 64 processors. For m = 10^4, the best improvement is a factor of 1.59 on 32 processors.

For the Cray XT4 system, the results are displayed in Table 6. We notice that the improvements
are smaller than on the IBM POWER 5 system. The best improvement obtained is a factor of 1.81 for m = 10^3 and b = 100 on 64 processors. For m = 5·10^3, the best improvements obtained are a factor of 1.38 on 8 processors and a factor of 1.36 on 64 processors, both for b = 150. For m = 10^4, the best improvements are a factor of 1.38 on 32 processors and a factor of 1.33 on 64
processors.

                             No of processors P = Pr × Pc
                    4             8             16            32            64
                   2×2           2×4           4×4           4×8           8×8
  m=n     b   Impvt GFlops  Impvt GFlops  Impvt GFlops  Impvt GFlops  Impvt GFlops
 10^3    50   1.57   9.79   1.59  11.9    2.23  11.8    2.07  11.5    2.25   9.8
 10^3   100   1.48   8.84   1.47  10.5    1.91  10.6    1.91  11.2    2.29  10.9
 10^3   150   1.36   8.26   1.41   9.9    1.70   9.5     -     -       -     -
 5·10^3  50   1.05  21.5    1.09  39.4    1.31  61.2    1.51  95.5    1.69 118.6
 5·10^3 100   1.06  21.1    1.13  37.3    1.21  56.7    1.67  84.1    1.66 103.1
 5·10^3 150   1.04  20.6    1.08  35.1    1.18  52.3    1.26  74.8    1.45  89.1
 10^4    50   1.00  23.27   1.00  45.1    1.08  80.3    1.17 143.2    1.35 213.9
 10^4   100   1.00  23.62   1.00  44.4    1.10  78.0    1.19 133.2    1.24 197.6
 10^4   150   1.01  23.47   1.02  42.3    1.33  74.1    1.59 122.1    1.17 173.8

Table 5: Time ratio of PDGETRF to CALU (Impvt columns) and performance of CALU in GFLOPs/s (GFlops columns) obtained on the IBM POWER 5 system. The matrix factorized is m × m, with a block of size b.
The improvements presented in Tables 5 and 6 are obtained for a fixed number of processors and a fixed block size. However, the best improvements do not always correspond to the best performance of CALU or PDGETRF: CALU can achieve its best performance for a different block size or grid shape than PDGETRF. Hence, an interesting question to answer is: for a given problem size m and a given maximum number of processors, what is the improvement obtained by the best CALU with respect to the best PDGETRF? To answer this question, we present in Table 7 the improvement obtained by taking the best performance independently for CALU and PDGETRF, when varying the number of processors (from 8 to 64) and the block size (values of 50, 100, and 150). For a given number of processors, we use one grid shape, as in our previous experiments. We also display the best performance for CALU and PDGETRF in GFLOPs/s (GFlops columns), the block size (b) and the number of processors for which the best performance was obtained, and the percentage of theoretical peak performance obtained by CALU (columns Prcnt). The speedup is computed as follows:
                       min_{P ≤ P_max, b} T_PDGETRF(m, m, P, b)
speedup(m, m, P_max) = ----------------------------------------
                       min_{P ≤ P_max, b} T_CALU(m, m, P, b)
CALU leads to improvements of up to 1.69 on the IBM POWER 5 and up to 1.53 on the Cray XT4.

                             No of processors P = Pr × Pc
                    4             8             16            32            64
                   2×2           2×4           4×4           4×8           8×8
  m=n     b   Impvt GFlops  Impvt GFlops  Impvt GFlops  Impvt GFlops  Impvt GFlops
 10^3    50   1.19   5.4    1.20   6.6    1.33   6.6    1.35   7.5    1.67   7.6
 10^3   100   1.28   5.5    1.39   7.0    1.52   7.2    1.60   8.5    1.81   8.3
 10^3   150   1.23   5.3    1.32   6.6    1.44   6.7     -     -       -     -
 5·10^3  50   1.03  19.2    1.09  33.3    1.12  44.3    1.16  69.2    1.11  67.2
 5·10^3 100   1.12  19.4    1.20  32.3    1.13  42.8    1.24  67.4    1.32  76.1
 5·10^3 150   1.23  19.1    1.38  31.0    1.22  40.8    1.35  61.9    1.36  70.5
 10^4    50   1.01  24.4    1.05  45.7    1.04  69.5    1.08 121.3    1.31 154.9
 10^4   100   1.09  25.3    1.18  46.5    1.13  69.8    1.22 118.2    1.33 153.3
 10^4   150   1.16  25.2    1.31  45.4    1.22  67.3    1.38 111.4    1.30 140.3

Table 6: Time ratio of PDGETRF to CALU (Impvt columns) and performance of CALU in GFLOPs/s (GFlops columns) obtained on the Cray XT4 system. The matrix factorized is m × m, with a block of size b.

Note that the improvements are obtained when the performance of the algorithm is a small percentage of the theoretical peak performance. The smallest percentage is obtained on the Cray
XT4, for m = 10^3 and P = 32. This is somewhat expected, since this is a small matrix executed on 32 dual-core processors. A better percentage is obtained on the IBM POWER 5, for example 40.6% for m = 10^4 on 64 processors. However, even when the percentage of peak performance is small, Table 7 shows that CALU lets a user use the same resources more efficiently than PDGETRF, and so it is always worth using.
IBM POWER 5
  m       speedup          CALU                       PDGETRF
                   GFlops    P    b   Prcnt    GFlops    P    b
 10^3      1.59     11.9     8   50   19.6       7.5     8   50
 5·10^3    1.69    118.6    64   50   24.4      70.0    64   50
 10^4      1.34    213.9    64   50   40.6     159.8    64  100

Cray XT4
  m       speedup          CALU                       PDGETRF
                   GFlops    P    b   Prcnt    GFlops    P    b
 10^3      1.53      8.5    32  100    2.5       5.54   32   50
 5·10^3    1.26     76.1    64  100   11.4      60.2    64   50
 10^4      1.31    154.9    64   50   23.2     118.1    64   50

Table 7: Speedup estimated as the ratio of the best PDGETRF over the best CALU for a given problem size, the best performance for CALU and PDGETRF in GFLOPs/s (GFlops columns), the block size (b columns) and the number of processors (P columns) for which the best performance was obtained. Prcnt denotes the percentage of theoretical peak performance obtained by CALU.

7 Conclusions and future work
In this paper we have introduced CALU, a new algorithm for computing the LU factorization of dense matrices. This algorithm uses a new pivoting strategy, ca-pivoting, which efficiently computes the LU factorization of a block-column, and leads to an important decrease in the number of messages of CALU with respect to classic algorithms.

We have compared CALU with the corresponding PDGETRF routine from ScaLAPACK. Our experiments have shown that, depending on the size of the matrix and the characteristics of the underlying computer architecture, it is either latency reduction, or recursion, or both that are the major factors in reducing the parallel time of the LU factorization. Interestingly, the gains due to latency reduction are not limited to small matrices, but affect large matrices as well. In these cases the recursion is very efficient in reducing the local LU factorization time, and thus leaves the latency as the time-consuming bottleneck, which needs to be alleviated.

The factorization of a block-column, TSLU, outperforms the corresponding routine PDGETF2 from ScaLAPACK by up to a factor of 4.37 on the IBM POWER 5 system and up to a factor of 5.52 on the Cray XT4 system. CALU outperforms PDGETRF by up to a factor of 2.29 on the IBM POWER 5 and up to a factor of 1.81 on the Cray XT4.

The factorization of a block-column lies on the critical path of the parallel LU factorization, and hence we expect that the usage of the ca-pivoting strategy by other parallel LU algorithms, such as HPL, will lead to improvements of the overall time.

As future work, it will be interesting to study the suitability of the new ca-pivoting strategy for parallel LU on multicore architectures. Another direction consists of using the ca-pivoting strategy for the LU factorization or the incomplete LU factorization of sparse matrices. This may pay off much more than in the dense case, since there is a higher proportion of communication versus computation.
Acknowledgments

The authors thank M. Hoemmen, J. Langou, and J. Riedy for helpful discussions on this subject.
References

[1] Top500 supercomputer sites. Available at www.top500.org.
[2] J. Choi, J. J. Dongarra, L. S. Ostrouchov, A. P. Petitet, D. W. Walker, and R. C. Whaley. The Design and Implementation of the ScaLAPACK LU, QR and Cholesky Factorization Routines. Scientific Programming, 5(3):173-184, 1996. ISSN 1058-9244.
[3] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-avoiding parallel and sequential QR factorizations. Technical report, UC Berkeley, 2008.
[4] J. J. Dongarra, R. A. V. D. Geijn, and D. W. Walker. Scalability Issues Affecting the Design of a Dense Linear Algebra Library. Journal of Parallel and Distributed Computing, 22:523-537, 1994.
[5] J. J. Dongarra, P. Luszczek, and A. Petitet. The LINPACK Benchmark: Past, Present and Future. Concurrency: Practice and Experience, 15:803-820, 2003.
[6] F. Gustavson. Recursion Leads to Automatic Variable Blocking for Dense Linear-Algebra Algorithms. IBM Journal of Research and Development, 41(6):737-755, 1997.
[7] D. Skinner. IBM SP Parallel Scaling Overview. http://www.nersc.gov/news/reports/technical/seaborg_scaling.
[8] A. Tate. Personal communication.
[9] S. Toledo. Locality of reference in LU Decomposition with partial pivoting. SIAM J. Matrix Anal. Appl., 18(4), 1997.
[10] L. N. Trefethen and R. S. Schreiber. Average-case stability of Gaussian elimination. SIAM J. Matrix