• Aucun résultat trouvé

Communication Avoiding Gaussian Elimination

N/A
N/A
Protected

Academic year: 2021

Partager "Communication Avoiding Gaussian Elimination"

Copied!
22
0
0

Texte intégral

(1)

HAL Id: inria-00277901

https://hal.inria.fr/inria-00277901v2

Submitted on 13 May 2008

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

publics ou privés.

Communication Avoiding Gaussian Elimination

Laura Grigori, James Demmel, H. Xiang

To cite this version:

Laura Grigori, James Demmel, H. Xiang. Communication Avoiding Gaussian Elimination. [Research

Report] RR-6523, INRIA. 2008. �inria-00277901v2�

(2)

a p p o r t

d e r e c h e r c h e

0

2

4

9

-6

3

9

9

IS

R

N

IN

R

IA

/R

R

--6

5

2

3

--F

R

+

E

N

G

Thème NUM

Communication Avoiding Gaussian Elimination

Laura GRIGORI — James W. DEMMEL — Hua XIANG

N° 6523

(3)
(4)

Centre de recherche INRIA Saclay – Île-de-France

Parc Orsay Université

Laura GRIGORI

, JamesW. DEMMEL

, HuaXIANG

ThèmeNUMSystèmesnumériques

Équipe-ProjetGrand-large

Rapportdere her he n°6523Mai200818pages

Abstra t: ThispaperpresentsCALU,aCommuni ationAvoidingalgorithmfortheLU

fa tor-izationof densematri esdistributed in atwo-dimensional(2D) y li layout. Thealgorithmis

basedonanewpivotingstrategy,referredtoas a-pivoting,thatisshowntobestableinpra ti e.

The a-pivoting strategy leads to asigni ant de rease in the number of messages ex hanged

duringthefa torizationofablo k- olumnrelativelyto onventionalalgorithms,andthusCALU

over omes the laten y bottlene k of the LU fa torization as in urrent implementations like

S aLAPACKandHPL.

Theexperimentalpartofthis paperfo useson theevaluation oftheperforman eof CALU

on two omputational systems, an IBM POWER 5 system with 888 ompute pro essors

dis-tributedamong111 omputenodes,andaCrayXT4systemwith

9660

dual- oreAMDOpteron pro essors. We ompare CALU with S aLAPACKPDGETRF routine that omputes the LU

fa torization. OurexperimentsshowthatCALUleadstoaredu tionintheparallel timeofthe

LUfa torization. Thegaindependsonthesizeofthematri esandonthe hara teristi softhe

omputerar hite ture. In parti ularthe ee t isfound to be signi antin the aseswhenthe

laten y time isan importantfa torof the overall time, asfor examplewhen asmall matrixis

exe utedonlargenumberofpro essors.

The fa torization of a blo k- olumn, referred to as TSLU, rea hes a performan e of

215

GFLOPs/son

64

pro essorsoftheIBMPOWER5system,andaperforman eof

240

GFLOPs/s on

64

pro essorsof theCray XT4system. It represents

44%

and

36%

of the theoreti alpeak performan esonthese systems. TSLUoutperforms the orrespondingroutinePDGETF2 from

S aLAPACKuptoafa torof

4.37

ontheIBMPOWER5systemanduptoafa torof

5.58

on theCrayXT4system.

Onsquarematri esoforder

10

4

,CALU outperformsPDGETRFbyafa torof1.24onIBM

POWER 5 and by a fa tor of

1.31

on Cray XT4. It represents

40%

and

23%

of the peak performan eon thesesystems. Thebest improvementobtainedbyCALU is aspeedup of

2.29

onIBMPOWER5andaspeedupof

1.81

onCrayXT4.

Key-words: denseLUfa torization,parallel pivotingstrategies,avoiding ommuni ation

INRIA Sa lay-Ile de Fran e, Bat 490, Universite Paris-Sud 11, 91405 Orsay Fran e

(laura.grigoriinria.f r).

Computer S ien e Division and Mathemati s Department, UC Berkeley, CA 94720-1776, USA

(demmel s.berkeley.ed u).

(5)

Résumé: Dans e papiernousprésentonsCALU,unalgorithmequiminimizeles

ommuni a-tionsdelafa torizationLUdesmatri esdensesave unedistributionbi-dimensionnelle y lique.

L'algorithmeest basésurune nouvellestrategiedepivotage,appelée a-pivotage,quiest stable

enpratique. Lastrategie a-pivotagemèneàunediminutionsigni ativedunombredemessages

é hangéspendantlafa torisationd'unbloque- olonnerelativementauxalgorithmes

onvention-nels, et CALU surmonte ainsi le goulot d'étranglementde laten ede la fa torisation LUdans

desréalisations ourantes ommeS aLAPACK etHPL.

Lapartie expérimentalede et arti lese on entre surl'évaluation de CALU surdeux

ma- hines,unema hineIBMPOWER5ave 888pro esseursdistribuésparmi111noeudsde al ul,

et une ma hine Cray XT4 ave

9660

pro esseurs dual- ore AMD Opteron. Nous omparons CALUàlaroutinePDGETRFdeS aLAPACKqui al ulelafa torisationLU.Nosexpérien es

montrentqueCALUmèneàunerédu tiondutemps delafa torisationLU.Legaindépendde

latailledesmatri esetdes ara téristiquesdel'ar hite turedelama hine. Enparti ulierl'eet

s'avèresigni atifdansles asoùletempsdelaten eestunfa teurimportantdutempsglobal,

omme parexemple quand une matri e de taille réduite est exé utée sur ungrand nombrede

pro esseurs.

Lafa torisationd'unbloque- olonne,désignéesouslenom deTSLU,atteint

215

GFLOPs/s sur

64

pro esseurs de IBM POWER 5, et

240

GFLOPs/s sur

64

pro esseurs de Cray XT4. Ce ireprésente

44%

et

36%

desexé utionsmaximalesthéoriquessur esma hines. Lemeilleur speedup de TSLUpar rapport à laroutine orrespondante PDGETF2 de S aLAPACKest de

4.37

surIBMPOWER5etde

5.58

surCrayXT4. Surdesmatri es arréesdetaille

10

4

,CALUsurpassePDGETRFparunfa teur

1.24

surIBM POWER5et parunfa teur de

1.31

surCrayXT4. Ce i représente

40%

et

23%

del'exé ution maximalesur essystèmes. LameilleureaméliorationobtenueparCALUestunspeedupde

2.29

surIBMPOWER 5et unspeedupde

1.81

surCrayXT4.

(6)

1 Introdu tion

Solvinglinearsystemsofequationsisoneofthemostusedoperationinvariousappli ationsand

numeri alsimulationsins ienti omputing. Theseappli ationsfrequentlyleadtosolvingvery

largedensesets oflinearequations, oftenwithmillionsof rowsand olumns, andsolvingthese

problemsisverytime onsuming.

InthispaperwepresentaCommuni ation-AvoidingLUfa torization(CALU)algorithmfor

omputingtheLUfa torizationof adensematrix

A

distributed inatwo-dimensional(2D) lay-out. CALUisbasedonanewpivotingstrategy,thatweshowitisnumeri allystableinpra ti e.

CALU has twomain hara teristi s. First,it islaten y avoiding, asthenew pivotingstrategy

allowsfor asigni antde rease in the numberof messagesex hangedduring thefa torization

relativelyto onventionalalgorithms,thoughthat omes atthe ostofsomeredundant

ompu-tations. Werefertothenewpivotingstrategyas a-pivoting. Thisapproa histhusparti ularly

bene ial on parallel ar hite tures and for sizes of matri es for whi h the overhead asso iated

with sendingamessagebetweentwopro essorsis anexpensivefa torin thealgorithm.

More-over, today'ste hnology trends predi t that arithmeti will ontinue to improveexponentially

fasterthanbandwidth,andbandwidthexponentiallyfasterthanlaten y. SoCALUiswellsuited

forfuture parallel ar hite tures, inwhi h onventionalalgorithms willspendmoreandmoreof

theirtime ommuni atingandlessandlessdoingarithmeti . Se ond,itallowstheusageofthe

bestavailablesequentialalgorithmfor omputingtheLUfa torizationofablo k- olumn,asfor

examplethere ursivealgorithms[6,9℄.

CALU uses ablo kright-looking approa h in whi h adense matrix

A

with a2D layoutis fa torizedbytraversingiterativelyblo ksof olumns. Atea hiteration,rstablo k- olumnof

width

b

is fa tored. Then thetrailing matrixis updated, and thede omposition ontinueson thetrailingmatrix. Themaindieren ewithrespe ttootherblo kright-lookingalgorithmslies

inthefa torizationofablo k- olumn,whi hisperformedverye ientlyinCALUbyusingthe

new a-pivotingstrategyas follows. TheLUde ompositionoftheblo k- olumnisperformedin

twosteps. Therststep,aprepro essingstep,identiese ientlyin parallel

b

pivotrows,that providegood pivots forthe LUfa torizationof the entire blo k- olumn. Wedes ribe in detail

laterin thepaperhowtheserowsareidentied. Thepivotrowsarepermutedto bein therst

b

positionsoftheblo k- olumn. Inthese ondsteptheLUfa torizationwithnopivotingofthe blo k- olumnis performed. Werefer tothis approa h forperformingtheLUfa torizationof a

blo k- olumn asTSLU (Tall Skinny LU), sin ea blo k- olumn an onsidered to beamatrix

with a 1D layout for whi h the verti al dimension (number of rows)is mu h larger than the

horizontaldimension(numberof olumns).

CALU over omesthelaten y bottlene kof the lassi LUfa torization, as implemented in

S aLAPACKPDGETRFroutine[2℄. InPDGETRF,theLUde ompositionofan

m × n

matrix is performedin parallel using ablo k y li distribution of the matrixovera

P

r

by

P

c

gridof pro essors, where

P

r

× P

c

= P

and

P

isthe numberof pro essors. Thelaten y bottlene k in S aLAPACKlies in theLU fa torizationof a blo k- olumn that is spread over

P

r

pro essors, that leads to

2n log

2

P

r

number of messages ommuni ated during the fa torization. All the other terms are of the form

O(n/b) log

2

P

r

+ O(n/b) log

2

P

c

, where

b

is the size of the blo k usedinthe2Ddistribution. CALUhasanumberofmessages ommuni atedof

3(n/b) log

2

P

r

+

3(n/b) log

2

P

c

, i.e. smallerby a fa torof

b

. Thepri e for fewer messagesis

b(mn − n

2

/2)/P

r

moreoatingpointwork,whi hisasmall fra tionoftheoverall

(mn

2

− n

3

/3)/P

work.

Thispaperfo useson omparingCALUwiththeapproa husedinS aLAPACKPDGETRF,

and the parallel implementation we present for CALU follows the main steps used in

S aLA-PACK.However,the a-pivotings heme anbeused inotherparallel algorithmsimplementing

(7)

exam-ple in thehighly optimizedHigh Performan e Linpa k(HPL) ben hmark, usedin determining

theTop500list[1℄. Thisistheobje tofour urrentresear h.

Thenew a-pivotings hemeusedinCALUmayleadtoadierentrowpermutationthanthe

lassi LUfa torization. Inthispaperwepresentnumeri alresultsthat showthat a-pivoting

s heme is stable in pra ti e. We observe that it behaves as a threshold pivoting, where the

minimumthresholdvalueinpra ti alexperimentsis

0.33

. Inother words,

|L|

isbounded by

3

, whileinLUfa torizationwithpartialpivoting,

|L|

isboundedby

1

,where

|L|

denotesthematrix ofabsolutevaluesoftheentriesof

L

. Wealsondthatthea ura ytestsperformedinHPL[5℄, whi h onsist of omputing several s aled residuals, are fullled by the a-pivoting strategy.

Hen e thisapproa h ouldbeusedforevaluatingtheperforman eofparallel omputers.

ThealgorithmpresentedinthispaperbearssomesimilaritiestotheCommuni ation-Avoiding

QR (CAQR) fa torization dis ussed in [3℄. Both algorithms usea redu e-like omputation for

thepanelfa torization,thusde reasingthe ommuni ation ost. Howeverthenumeri alstability

issues relatedto theLUfa torization leadto anumberofsigni antdieren es. Forinstan e,

CALU performsthepanelfa torizationtwi e,but theupdateofthetrailingmatrixisthesame

as in the lassi LU fa torization. In ommuni ation-avoiding QR, the panel fa torization is

performedon e,butthereissomeredundant omputationintheupdateofthetrailingmatrix.

The rest of the paper is organized as follows. Se tion 2 introdu es CALU and the new

a-pivoting s heme. Se tion 3 des ribes the parallel LU fa torization of a tall-skinny matrix

using a-pivoting, and dis usses its performan e in terms of omputation and ommuni ation

ost. Se tion 4 presents the parallel CALU algorithm of amatrix distributed in a 2D layout

and dis usses its omputation and ommuni ation ost. Se tion 5 ompares the lassi LU

fa torizationalgorithms implemented inS aLAPACKand thenewproposed CALU algorithm.

Se tion 6 des ribes experimental results that rst dis uss the stability of a-pivoting s heme,

andse ondevaluatetheperforman eofCALUontwo omputationalsystems,anIBMPOWER

5 systemand aCray XT4system, lo ated at National Energy Resear h S ienti Computing

Center (NERSC).AndSe tion7presentsthe on lusionsandourfuturework.

2 Des ription of CALU

Inthisse tionwedes ribethemainstepsofCALUalgorithmfor omputingtheLUfa torization

of amatrix

A

ofsize

m × n

. Weuseseveralnotations. Werefertothe submatrixof

A

formed by elements of rowindi es from

i

to

j

and olumn indi es from

d

to

e

as

A(i : j, d : e)

. If

A

is theresult ofthe multipli ationof twomatri es

B

and

C

, we referto thesubmatrix of

A

as

(BC)(i : j, d : e)

.

CALUis ablo kalgorithmthat fa torizes theinputmatrixbytraversingiterativelyblo ks

of olumns. Attherstiteration,thematrix

A

ispartitionedasfollows:

A =



A

11

A

12

A

21

A

22



where

A

11

is of size

b × b

,

A

21

is of size

(m − b) × b

,

A

12

is of size

b × (n − b)

and

A

22

is of size

(m − b) × (n − b)

. Asother lassi rightlooking algorithms,CALU rst omputesthe LU fa torizationoftherst blo k- olumn, thendeterminestheblo k

U

12

,andupdates thetrailing matrix

A

22

.

Themain dieren ewithrespe tto otherexisting algorithmsliesinthefa torizationofthe

rst blo k- olumn. CALU uses the new a-pivoting strategy, onsisting in performing rst a

prepro essing step in whi h agood set of pivot rowsis identied. Se ond, thepivot rowsare

(8)

rstblo k- olumnisperformed. CALU onsidersthattherstblo k- olumnispartitionedin

P

blo k-rows. Wepresentherethesimple ase

P = 4

. Forthesakeofsimpli ity,wesupposethat

m

isamultipleof

4

.

The prepro essing step starts by performing the LU fa torization with partial pivoting of

ea hblo k-row,asfollows:

A

=

A

0

A

1

A

2

A

3

=

¯

Π

00

L

¯

00

U

¯

00

¯

Π

10

L

¯

10

U

¯

10

¯

Π

20

L

¯

20

U

¯

20

¯

Π

30

L

¯

30

U

¯

30

=

¯

Π

00

¯

Π

10

¯

Π

20

¯

Π

30

·

¯

L

00

¯

L

10

¯

L

20

¯

L

30

·

¯

U

00

¯

U

10

¯

U

20

¯

U

30

Π

¯

0

L

¯

0

U

¯

0

Thisleadstoade ompositioninwhi htherstfa tor,

Π

¯

0

,isan

m×m

blo kdiagonalmatrix, whereea hdiagonalblo k

Π

¯

i0

isapermutationmatrix. These ondfa tor,

L

¯

0

,isan

m×P b

blo k diagonal matrix, where ea h diagonal blo k

L

¯

i0

is an

m/P × b

lowerunit trapezoidalmatrix. Thethirdfa tor,

U

¯

0

,isa

P b × b

matrix,whereea hblo k

U

¯

i0

isa

b × b

uppertriangularfa tor. Note that this step aimsat identifying in ea h blo k-rowaset of

b

linearlyindependentrows, whi h orrespondtotherst

b

rowsof

Π

¯

T

i0

A

i

,with

i = 0 . . . 3

.

From the

P

sets of lo al pivot rows, we perform abinary tree (of depth

log

2

P = 2

in our example)ofLUfa torizationsofmatri esofsize

2b × b

toidentify

b

globalpivotrows. The2LU fa torizationsat theleavesofourdepth-2binarytreeareshownhere, ombinedin onematrix.

Thisde ompositionleadstoa

P b × P b

permutationmatrix

Π

¯

1

,a

P b × 2b

fa tor

L

¯

1

anda

2b × b

fa tor

U

¯

1

.

¯

Π

T

0

A (1 : b, 1 : b)

¯

Π

T

0

A (m/P + 1 : m/P + b, 1 : b)

¯

Π

T

0

A (2m/P + 1 : 2m/P + b, 1 : b)

¯

Π

T

0

A (3m/P + 1 : 3m/P + b, 1 : b)

=



¯

Π

01

L

¯

01

U

¯

01

¯

Π

11

L

¯

11

U

¯

11



=



¯

Π

01

¯

Π

11



·



¯

L

01

¯

L

11



·



¯

U

01

¯

U

11



Π

¯

1

L

¯

1

U

¯

1

Theglobalpivot rowsareobtainedafter applying onemoreLUde omposition(at theroot

ofourdepth-2binarytree)onthepivotrowsidentiedpreviously:



¯

Π

T

1

Π

¯

T

0

A (1 : b, 1 : b)

¯

Π

T

1

Π

¯

T

0

A (2m/P + 1 : 2m/P + b, 1 : b)



= ¯

Π

02

L

¯

02

U

¯

02

≡ ¯

Π

2

L

¯

2

U

¯

2

Thepermutations identiedin theprepro essing stepareapplied ontheoriginal matrix

A

. ThentheLUfa torizationwithnopivotingoftherstblo k- olumnisperformed,theblo k-row

of

U

is omputed and thetrailing matrixis updated. Note that

U

11

= ¯

U

2

. Thefa torization ontinuesonthetrailingmatrix

A

¯

. Thepermutation matri es

Π

¯

0

, ¯

Π

1

, ¯

Π

2

donothavethesame dimensions. By abuse of notation, we onsider that

Π

¯

1

, ¯

Π

2

are extended by the appropriate identitymatri estothedimensionof

Π

¯

0

.

¯

Π

T

2

Π

¯

T

1

Π

¯

T

0

A =



L

11

L

21

I

n−b



·



I

b

¯

A



·



U

11

U

12

U

22



The a-pivotingstrategyhasseveral important hara teristi s. First,when

b = 1

or

P = 1

, a-pivoting isequivalentto partialpivoting. Se ond,theeliminationof ea h olumn of

A

leads toarank-1update ofthetrailingmatrix. Therank-1updatepropertyisshownexperimentally

tobeveryimportantforthestabilityofLUfa torization[10℄. Alargerankupdatemightleadto

anunstableLUfa torization,asforexampleinanotherstrategysuitableforparallel omputing

alled parallelpivoting [10℄. Third,thenumeri altestspresentedin Se tion 6showthatit an

(9)

3 TSLU algorithm

In this se tion we present TSLU, a parallel algorithm for omputing the LU fa torization of

an

m × b

matrix

A

, with

m ≫ b

, whi h is distributed over

P

pro essorsusing a 1D layout. We also dis uss its performan e in terms of ops and number of messages ex hanged during

the fa torization. This algorithm will be used in CALU for performing the fa torization of a

blo k- olumn.

TSLUessentiallydoesanall-redu tion (withabuttery ommuni ationpattern)wherethe

redu tionoperationisGaussian eliminationonapairofmatri esofsize

b × b

sta kedontopof oneanother. For ompleteness,wedes ribethisall-redu tionoperationinmoredetailasfollows,

andthenshowanexample.

The omputation of TSLUis performed asanall-redu tion operation, and usesabuttery

forthe ommuni ationpattern. Thebutterymethodusesatree-like omputationasdes ribed

inthepreviousse tion,andtakespla ein(

log

2

P + 1

)steps,startingfromthebottomlevel

k = 0

ofabinarytree. Ea hnodeofthebinarytreeisasso iatedwithasetofpro essors. Forthesake

ofsimpli ity,wesupposethatthepro essorsareapoweroftwo,numberedfrom

0

to

P − 1

,and that

m

divides

P

. Weusenotationssimilarto[3℄:

f stP (i, k)

denotestherstpro essorae ted to thenode ofthebinary treeat level

k

to whi hpro essor

i

belongs;

target(i, k)

referstothe pro essor with whi h pro essor

i

ex hanges data at level

k

of the tree in a buttery pattern;

tgtf stP (i, k)

denotes thepro essorwith whi h

f stP (i, k)

ex hangesdata at level

k

;

level(i, k)

denotesthenodeatlevel

k

ofthebinarytreewhi hisassignedtoasetofpro essorsthatin ludes pro essor

i

,andis omputedas:

level(i, k) =



i

2

k



f stP (i, k) =

2

k

level(i, k)

target(i, k) =

f stP (i, k) + (i + 2

k−1

) mod 2

k

tgtf stP (i, k) =

target(f stP (i, k), k) = f stP (i, k) + 2

k−1

Thealgorithmstartswithalo al LUfa torizationonea hpro essorofthe

m/P × b

blo k-rowsthatit owns. Thenat ea hlevel

k

ofthebinarytreeandforea hnodeat thislevel,pairs ofpro essorsperformredundantlyanLUfa torization. Considerforexampleapro essor

i

and thenode atlevel

k

whi his mappedonpro essor

i

, andidentiedas

level(i, k)

. Thefa tors

L

and

U

omputedatthisnodearedenoted as

L

¯

level(i,k),k

,

U

¯

level(i,k),k

. Pro essor

i

anditstarget pro essorex hange data andperformredundantlytheLU fa torizationof twomatri esof size

b × b

.

Notethatthesequen eoflo alLUfa torizationsperformedin therstthreestepsofTSLU

are notperformed in pla e (the inputmatrixis notoverwritten). Hen e TSLUneedsan extra

storageofsize

m × b

to storetheresultingLand Ufa torsof thesefa torizationsandave tor ofsize

b

tostorethepermutationve tor.

(10)

TSLU algorithm

1. Let

i

bemypro essornumber.

2. ComputetheLUfa torizationofthelo al

m/P × b

groupofrows

A

i

= ¯

Π

i0

L

¯

i0

U

¯

i0

. Let

B

i

beformedbythe

b

pivotrows,

B

i

= ¯

Π

T

i0

A

i

 (1 : b, 1 : b)

. 3. for

k = 1

to

log

2

P

do

iif

i ≥ tgtf stP (i, k)

theni

φ = target(i, k)

,

τ = i

ielsei

φ = i

,

τ = target(i, k)

iendif

iLet

l = level(i, k)

.

(a) Pro essor

φ

ex hangesits

B

i

withPro essor

τ

.

(b) ComputetheLUfa torizationofthetwomatri es

B

φ

and

B

τ

ofsize

b × b

sta kedone ontopofanother:



B

φ

B

τ



= ¯

Π

l,k

L

¯

l,k

U

¯

l,k

( ) Let

B

i

beformedbytherst

b

pivotrowsof

Π

¯

T

l,k



B

φ

B

τ



iend if endfor

4. Letthenalpermutation

Π = ¯

¯

Π

0

Π

¯

1

. . . ¯

Π

log

2

p

andpermutethelo al

A = ¯

Π

T

A

.

5. Let

U = ¯

U

0 log

2

P

,where

U

istheuppertriangularfa torof

A

. 6. Computethelo al

L

fa tor,

L

i

= A

i

U

−1

.

Weillustratetheexe utionofthisalgorithmonasmallexampleinFigure1,wherewesuppose

thatthematrix

A

ofsize

16×2

isdistributedfollowinga1Dblo k y li distribution,withblo ks ofsize

2 × 2

on

4

pro essors. Let

A =



2

0 2

0 0

1 2

0 2

1 4

1 0

0

1 4

4

1 0

0 1

4 1

2 0

2 1

0 0

2

0 2



T

Inthis example, the1st, 2nd, 9th, 10th rowsare distributed on pro essor

0

. First a lo al LUfa torizationisperformedbyea hpro essor. Forpro essor

0

,the1st and9throwsareused as pivots. Se ond the pro essors

0

and

1

ex hange the

2

rows used aspivots in the lo al LU fa torization. ThentheyperformredundantlytheLUfa torizationof the

4 × 2

matrixformed bythese rowssta kedoneontopof another. Similarly,pro essors

2

and

3

ex hange theirrows andperformredundantlytheLUfa torizationofthematrixformedbytheserows. Inthethird

step,pro essors

0

and

2

ex hangethe

2

pivotrowsidentiedinthese ondstep,andperforman LUfa torizationonthe

4 × 2

matrixformedbytheserows. Thesame omputationisperformed by thepro essors

1

and

3

. The rowsidentied in the third steprepresentthe pivots that will beusedto fa torizetheentirematrix

A

. Inthis simpleexample,thepivotrowsused by TSLU happentobethesameasthoseusedbyGaussianeliminationwithpartialpivoting.

Tostudytheperforman eofTSLU,weusea lassi almodeltodes ribeama hinear hite ture

(11)

0

1

2

3

0

1

2

3

2

2

3

3

0

0

1

1

4

2

1

0

2

0

1

2

0

0

0

2

2

4

2

4

1

0

1

4

1

4

2

2

2

4

0

1

2

0

1

2

2

0

0

0

4

1

1

0

4

1

1

0

0

1

1

4

0

0

0

2

2

1

0

2

1

0

4

2

2

4

0

1

2

0

0

0

1

4

2

1

0

2

4

1

4

2

2

2

4

1

0

2

0

0

0

2

4

1

4

4

1

2

4

4

1

4

2

4

1

1

4

4

1

1

4

4

1

1

4

4

1

Figure1: Exampleofexe utionofTSLUon4pro essors.

we use one parameter to des ribe the time per op (add and multiply), denoted

γ

, and one parameterto ountthetimeperdivide,denoted

γ

d

. Weestimatethetimeforsendingamessage of

m

wordsbetweentwopro essorsas

α + mβ

,where

α

denotesthelaten yand

β

theinverseof thebandwidth. Weapproximatethetimeofbroad astsand ombinesthat involve

P

pro essors by assuming

log

2

P

identi al steps of ommuni ation and/or omputation are needed. With thesenotations,theruntimeofTSLUisestimatedtobe(weomitloworderterms):

T

T SLU

(m, b, P ) =

h

2mb

2

P

+

2b

3

3

(log

2

P − 1)

i

γ+

+b(log

2

P + 1)γ

d

+

+ log

2

P α + b

2

log

2

P β

(1)

We omparethistoS aLAPACK laterinSe tion5.

4 Parallel CALU algorithm for matri es distributed in a 2D

layout

In this se tion we present a parallel algorithm that implements the CALU method presented

in Se tion 2. We onsider an

m × n

matrix blo k y li ally distributed overa bi-dimensional gridof pro essors

P = P

r

× P

c

,using squareblo ksofdimension

b × b

. The parallelalgorithm usesablo kright-lookingapproa h,asusedforexampleinPDGETRFroutineinS aLAPACK

or in HPL ben hmark. That is, it iterates overblo k- olumns of

A

, and at ea h step rst a blo k- olumnofwidth

b

isfa tored. Thenthetrailingmatrixispermutedandupdated,andthe de omposition ontinuesonthetrailingmatrix. Themain dieren ewith theotheralgorithms

isthatCALUfa torsablo k- olumnusingtheTSLUfa torizationpresentedinSe tion3,whi h

leadstoanimportantredu tioninthenumberofmessagesex hangedduringthefa torization.

Considerthat therst

j − 1

iterationsof theLUfa torizationwereperformed. That is,the rst

j − 1

blo k olumnswerefa toredandthetrailingmatrixwaspermutedandupdated. The a tivematrixatstep

j

isofdimension

(m − (j − 1)b) × (n − (j − 1)b)

=

m

j

× n

j

. Forthe larity ofpresentation,wesupposethat

m

and

n

divide

b

. Wedes ribeherethemainstepsinvolvedin the

j

-thiterationofCALU:

(12)

1. The olumn of the grid that holds blo k- olumn

j

omputes its LU fa torization using TSLU(AlgorithminSe tion3).

2. Every pro essor in the pro essor olumn holding the matrix blo k- olumn

j

broad asts along its pro essor row the lo ally stored subblo k of

L

. It also broad asts an array of size

b

thatstoresthepermutationve tor

Π

j

asso iatedwiththeLUfa torizationof blo k- olumn

j

.

3. Thematrix

A

ispermuteda ordingto

Π

j

.

4. Every pro essor in the pro essor row holding the matrixblo k-row

j

of

U

omputesits lo alblo k.

5. Everypro essorin thepro essorrowholdingmatrixblo k-row

j

of

U

broad astsitslo al blo kdownits olumn.

6. Allpro essorsupdate thetrailingmatrix.

Inour urrentimplementation,weuseroutinesfromS aLAPACKforseveralstepsofCALU.

Step 3 is performed by a allto PDLASWP, step 4 is done by PDTRSM, and steps 5and 6

orrespond to a all to PDGEMM. However, CALU an be implemented dierently, and an

in orporate te hniques whi h allow some overlapbetween omputation and ommuni ation as

theso- alledlook-aheadte hniqueusedin HPLben hmark.

Toestimatetheperforman eofCALU,weassumethatthenetworkbandwidthandlaten y

is notne essarilythe sameeverywhere,e.g. it anbedierent along olumns of thegridthan

along rowsof the grid. Weuse adierent bandwidthand laten y for ommuni ationbetween

pro essorsin dierentrowsandthesame olumn (

α

c

and

β

c

)versusdierent olumnsandthe samerows(

α

r

and

β

r

). This isarst steptowardsunderstanding ertainhierar hi alparallel ma hines,wherethereishighbandwidthamongpro essorsonthesame hip(ornodeormodule)

andlowerbetweenpro essorsondierent hips(ornodesormodules).

Thetotal omputationtime overare tangulargridofpro essorsisgiveninEquation2(we

omitsomelowerordertermsandthetimeofpivotingrowslo ally). Inthisestimationwe onsider

that atotalof

(2n/b) · log

2

P

r

messagesareex hangedforswappingrowsofmatrix

A

instep3. This is be ause the swapping of

b

rows o urs after ea h blo k- olumn fa torization. Hen e, this operation anbeimplemented intwosteps, using

2 log

2

P

r

messages. Firstea h pro essor sends at most

b

rows that need to be swapped to the root pro essor as a redu e operation. Se ond the root pro essorbroad asts the ne essaryrows to all the pro essorsin its pro essor

olumn. Howeverin our urrentimplementationweusePDLASWP, andthis routineperforms

onemessageex hangeforea hrowswap,whi hleadstoatotalof

n log

2

P

r

messagesex hanged forstep3. Inour urrentwork,wearerepla ingthis routinebyaroutinethatis implemented

asexplained above, andwein lude thenumberof messagesasso iated withthe future routine

insteadofPDLASWPinourtimeestimation.

T

CALU

(m, n, P

r

, P

c

) =

h

1

P



mn

2

n

3

3



+

1

P

r



mn −

n

2

2



2b +

n

2

b

2P

c

+

2nb

2

3

(log

2

P

r

− 1)

i

γ+

+n(log

2

P

r

+ 1)γ

d

+

+ log

2

P

r

h

3n

b

α

c

+



nb

2

+

3n

2

2P

c



β

c

i

+

+ log

2

P

c

h

3n

b

α

r

+



1

P

r



mn −

n

2

2



β

r

i

(2)

(13)

5 Comparison with the S aLAPACK's LU fa torization

Consider that we de ompose an

m × n

matrix whi h is distributed blo k y li ally overa

P

r

by

P

c

grid of pro essors, where

P

r

· P

c

= P

and

m ≥ n

. The two-dimensional blo k y li distributionusessquareblo ksofdimension

b × b

. Thealgorithmloopsover

n/b

blo k- olumns. Atthe

j

-thstep,therst

j − 1

blo k- olumnsof

L

andblo k-rowsof

U

arealready omputed. At thisstep,theblo k- olumn

j

of

L

isfa tored( allto

P DGET F 2

)usingpivoting. Thepivoting informationis appliedto therestof thematrix( allto

P DLASW P

). Theblo k-row

j

of

U

is omputed using triangular solves( all to PDTRSM), and then the trailing matrix is updated

( allto

P DGEM M

).

Equation3representstheruntimeestimationofPDGETRFroutineinS aLAPACKLU.To

be onsistentwiththeruntimeestimationofCALU,we onsiderforPDGETRFaswellthatthe

swappingof

b

rowsperformedbya allto PDLASWPleadsto

2 · log

2

P

r

messagesex hanged.

T

P DGET RF

(m, n, P

r

, P

c

)

=

h

1

P



mn

2

n

3

3



+

P

1

r



mn −

n

2

2



b +

n

2P

2

b

c

i

γ+

+nγ

d

+

+

2n 1 +

2

b

 log

2

P

r

+ n α

c

+



nb

2

+

3n

2

2P

c



log

2

P

r

β

c

+

+ log

2

P

c

h

3n

b

α

r

+

1

P

r



mn −

n

2

2



β

r

i

(3)

Tobetterunderstandthedieren esbetweenCALU and theLUfa torizationimplemented

in S aLAPACK,wewill ompare theruntime estimationof the twofa torizationsas givenby

Equation2andEquation3.

Comparing the additions, multipli ations op ounts, CALU adds a lower order term of

about

b(mn − n

2

/2)/P

r

. This term omes from TSLU,whi h performs twi e thefa torization ofablo k- olumn, rstto getthepivotrows,andse ondto a tually omputethefa tors.

Comparingthedivisionop ounts,CALUaddsalowerordertermof

n log

2

P

r

,allfromthe TSLUsofblo k- olumns(the fa torizationsoftwo

b × b

matri es).

Comparing ommuni ation ostswithinpro essor olumns(

α

c

and

β

c

terms),forbandwidth, bothalgorithmshavethesame ommuni ationvolume. Forlaten y,CALUis lowerbyafa tor

of

b(1 + 1/ log

2

P

r

)

. Theredu tionin the numberof messageswithin pro essor olumns omes fromtheredu tioninthefa torizationofablo k- olumnperformedbyTSLUversusPDGETF2.

Comparing ommuni ations osts within pro essorrows(

α

r

and

β

r

terms), in PDGETRF, thenumberof broad astswithin pro essorrowsis alreadyofthe orderof

n/b

,and hen eboth algorithmshavethesame osts.

6 Experimental results

Inthisse tion,weevaluatetheperforman eofCALUalgorithm,andthegoalofourexperiments

isthree-fold. First,westudythenumeri alstabilityofthenew a-pivotingstrategy. Se ond,we

evaluatetheperforman e improvementobtainedin thepanelfa torization byTSLU ompared

tothe orrespondingroutineinS aLAPACK.Andthird,weevaluatetheperforman eofCALU

and ompareittoPDGETRFroutineinS aLAPACK.

Theexperiments areperformedon two omputational systemsat the National Energy

Re-sear h S ienti Computing Center (NERSC). The rst system is an IBM p575 POWER 5

system, whi h has 888 ompute pro essors distributed among 111 ompute nodes. Ea h

pro- essor is lo ked at

1.9

GHz and has a theoreti alpeak performan e of

7.6

GFLOPs/s. Ea h nodeof

8

pro essorshas

32

Gbytesofmemory. The omputenodesare onne tedtoea hother

(14)

withahigh-bandwidth,low-laten yswit hingnetwork. Thepeak bandwidthis

3100

MB/sand theMPIPointtoPointinternodelaten yis

4.5

use [7℄. OnIBMPOWER5weusetheBLAS routinesfromtheESSLlibrary(EngineeringandS ienti Subroutinelibrary). Foralltheruns

weusedthemaximumnumberofpro essorsavailablepernode.

These ond system is a Cray XT4system with

9660

ompute nodes. Ea h ompute node has a 2.6 GHz dual- ore AMD Opteron pro essorwith atheoreti al peak performan e of 5.2

GFLOPs/s. Ea h ompute node has 4 GBytes of memory. In our omparisons we use the

routinesPDGETRFandPDGETF2fromtheCrayS ienti Librariespa kage,LibS i. However

theseroutineshavenosigni antoptimizationwithrespe ttotheroutinesfromS aLAPACK[8℄.

InourtestsweuseS aLAPACKinmixedmode,thatisMPIisusedinbetween omputenodes,

andthreadedBLASlevelparallelismon oreswithinanode. ThethreadedBLASusedislibGoto

library.

6.1 Stability of a-pivoting strategy

In this se tionweshow that CALU is as stable asGaussian elimination with partial pivoting.

ForthiswesummarizeresultsthatexpressthestabilityofGaussianelimination,intermsofthe

pivot growthand the normwiseba kward stability attained. We performour tests in Matlab,

usingmatri esfromanormaldistributionwithvaryingsizefrom

1024

to

8192

.

Thegrowthfa torinvolvesthevaluesoftheelementsof

A

duringtheelimination. Weusethe growthfa tor asdened byTrefethenandS hreiber[10℄,

g

T

=

max

i,j,k

|a

(k)

ij

|

σ

A

,where

a

(k)

ij

denotes the absolute valueof the element of

A

at row

i

and olumn

j

at the

k

-th step of elimination, and

σ

A

isthestandarddeviationoftheinitialelementdistribution. Itisshownexperimentally in [10℄ thatin pra ti e

g

T

≈ n

2/3

for partialpivoting, and

g

T

≈ n

1/2

for ompletepivoting (at

leastfor

n 6 1024

).

InFigure 2 (left)wedisplay thevalue of the growth fa tor

g

T

obtainedfor dierent blo k sizes and dierent number of pro essors. Here twosamples are used for ea h test. From the

pointofviewofstability,onlythenumberofrowsinthepro essgrid

P

r

playsarole. Hen ewe varyonly

P

r

,presentedas

P

inFigure2. Weobservethatthegrowthfa torof a-pivotinggrows as

c · n

2/3

(

c

beingasmall onstantaround

1.5

),andhasthesamebehavioraspartialpivoting. Thenew a-pivoting strategy does notensure that the element of maximum magnitude is

used as pivot at ea h step of fa torization. Hen e

|L|

is not bounded by

1

as in Gaussian elimination withpartial pivoting. However,in pra ti ethe pivotsused by a-pivoting arevery

loseto theelementsofmaximummagnitudein therespe tive olumns. InFigure2(right)we

displaythevalueoftheminimumthresholdinCALU,wherethethresholdis omputedat ea h

stepoffa torization

i

asthequotientofthepivotusedatstep

i

dividedbythemaximumvalue in olumn

i

. Weobservethat this valueis alwayslargerthan

0.33

, meaning that in ourtests

|L|

isboundedby

3

. Theaveragevalueofthethresholdislargerthan

0.84

. Wehaveperformed experiments on dierent matri es, as matri es following dierent random distributions, dense

Toeplitzmatri es,andwehaveobtainedsimilarresults.

Toevaluatethestabilityof a-pivotingin termsofnormwiseba kwardstability,we ompute

three a ura y tests as performed in the HPL ben hmark, and denoted as HPL1, HPL2 and

HPL3. For stability, the expe ted values are of the order of

O(1)

, that is a slowly growing fun tion of

n

. In HPL, the a ura y tests are passedif the values of the three quantities are smallerthan

16

.

HPL1

= ||Ax − b||

/(ǫ||A||

1

∗ N ),

HPL2

= ||Ax − b||

/(ǫ||A||

1

||x||

1

),

HPL3

= ||Ax − b||

/(ǫ||A||

||x||

∗ N ).

(15)

n P b

g

T

τ

ave

τ

min

w

b

HPL1 HPL2 HPL3

256 32 497.43 0.84 0.40 4.22e-14 5.06e-02 2.26e-02 4.54e-03

16 551.76 0.86 0.35 4.10e-14 2.24e-02 2.15e-02 4.34e-03

64 463.84 0.84 0.42 4.11e-14 4.78e-02 2.21e-02 4.22e-03

128 32 525.81 0.84 0.38 3.95e-14 3.90e-02 2.12e-02 4.30e-03

2

13

16 573.11 0.86 0.38 3.70e-14 6.67e-02 1.97e-02 3.89e-03

128 402.09 0.85 0.47 3.86e-14 2.09e-02 2.07e-02 3.84e-03

64 64 457.49 0.84 0.43 3.82e-14 3.45e-02 2.05e-02 4.26e-03

32 468.02 0.84 0.37 4.36e-14 6.64e-02 2.31e-02 4.85e-03

16 482.58 0.86 0.39 3.87e-14 1.32e-02 2.08e-02 4.24e-03

256 16 334.03 0.87 0.37 1.88e-14 1.38e-02 1.94e-02 4.36e-03

128 32 341.13 0.86 0.42 2.15e-14 2.35e-02 2.22e-02 4.99e-03

16 348.01 0.87 0.38 1.98e-14 5.58e-01 2.11e-02 3.95e-03

2

12

64 294.77 0.86 0.47 2.03e-14 1.22e-02 2.13e-02 4.55e-03

64 32 339.85 0.86 0.41 2.03e-14 2.39e-02 2.13e-02 4.56e-03

16 306.10 0.87 0.37 1.99e-14 2.76e-02 2.10e-02 3.98e-03

128 16 198.48 0.89 0.41 9.71e-15 3.74e-02 2.01e-02 4.36e-03

2

11

64 32 201.92 0.88 0.43 1.13e-14 5.52e-02 2.32e-02 5.16e-03

16 187.18 0.89 0.42 9.91e-15 2.83e-02 2.06e-02 4.49e-03

2

10

64 16 131.98 0.90 0.44 5.13e-15 2.24e-02 2.11e-02 5.18e-03

Table1: HPLa ura ytestsfor a-pivotingstrategy

n S

g

T

w

b

HPL1 HPL2 HPL3

2

13

5 325.36 2.64e-14 1.41e-01 1.40e-02 2.75e-03

2

12

5 219.93 1.33e-14 1.22e-02 1.40e-02 3.02e-03

2

11

5 151.42 6.78e-15 1.86e-02 1.38e-02 3.01e-03

2

10

10 101.65 3.87e-15 2.41e-02 1.57e-02 3.63e-03

(16)

1024

2048

4096

8192

100

200

300

400

500

600

700

Average growth factor: g

T

(randn, 2D layout, New pivoting)

P=256,b=32

P=256,b=16

P=128,b=64

P=128,b=32

P=128,b=16

P=64,b=128

P=64,b=64

P=64,b=32

P=64,b=16

GEPP

n

2/3

2*n

2/3

1024

2048

4096

8192

0.34

0.38

0.42

0.46

Minimun threshold (randn, 2D layout, New pivoting)

P=256,b=32

P=256,b=16

P=128,b=64

P=128,b=32

P=128,b=16

P=64,b=128

P=64,b=64

P=64,b=32

P=64,b=16

Figure 2: Thegrowthfa torandtheminimum thresholdvalueformatri esfollowinganormal

distribution

Wepresent in Table1the resultsobtainedforthe three testsfor CALU,when varyingthe

matrix size, the number of pro essors and the blo k size. For the matrix of size

n = 2

k

in

Table1, thesample size is

S = max{10 ∗ 2

10−k

, 3}

. We also re ordthe growthfa tor

g

T

, the averagethreshold

τ

ave

,theminimumthreshold

τ

min

,andthe omponentwiseba kwarderror be-foreiterativerenements

w

b

. Usuallyafter2iterativerenements,the omponentwiseba kward error anbe redu edto theorder of

10

−16

. Wedisplayin Table2the resultsobtainedby LU

fa torizationwithpartialpivotingforthesamematrixsizes,where

S

isthesamplesize. Allthe three tests,asperformedin HPL,arepassedbyCALU.Moreover,forall thetest ases,CALU

leads to resultsof thesame order of magnitude (

10

−2

,

10

−3

) asLU fa torization with partial

pivoting.

6.2 Performan e of TSLU

We evaluate theperforman eof TSLUusing matri esof asize

m × n

, ablo k size

b = n

, and varying both

m

and

n

(

m ∈ {10

3

, 5 · 10

3

, 10

4

, 10

5

, 10

6

}

and

n ∈ {50, 100, 150}

). Ourgoalis to studytheperforman eimprovementofTSLU omparedtotheS aLAPACKPDGETF2routine.

ThetimeratiobetweenPDGETF2 andTSLUobtainedontheIBMPOWER5systemandthe

CrayXT4systemisdisplayedinTable3andTable4.

Theimprovementisexpe tedinpartduetousingabetterLUfa torizationalgorithminthe

lo al sequentialLUfa torization (step 2of AlgorithmTSLU), and in partdue toredu ingthe

laten y ost. In ouralgorithm weuse the re ursiveLUfa torization, the RGETF2routine as

giveninAppendixBof[6℄. Re allthatTSLUperformstwi ethenumberofopsofPDGETF2.

Tobetterunderstandtheseissues,we omparetwodierent ongurationsofTSLU.Intherst

onethe lo al LUfa torization performed by ea h pro essoron itsgroup ofrowsisdone using

the lassi LU fa torization. We use the LAPACK DGETF2 routine, and the results for this

onguration are displayedin the olumns denoted

Cl

in Tables 3and 4. In the se ond one, displayed in the olumns

Rec

, we use there ursiveLU fa torization, the RGETF2 routine as givenin AppendixBof[6℄.

Inea htableweshowresultsforxed

m

anddierentvaluesof

n

andnumberofpro essors. Severalresultsaremissingintheplots,andthisisbe auseeithertherewasnotenoughmemory

toperformthefa torizationortheinputmatrixistoosmallandsomepro essorsarenotinvolved

(17)

Noofpro essors

P = P

r

× P

c

m

n = b

4 8 16 32 64

2 × 2

2 × 4

4 × 4

4 × 8

8 × 8

Re Cl Re Cl Re Cl Re Cl Re Cl

10

3

50 1.66 1.59 1.96 2.06 2.24 2.09 - - -

-10

3

100 1.27 1.17 1.44 1.37 - - -

-10

3

150 1.06 0.97 - - -

-5 · 10

3

50 1.62 1.08 1.48 1.44 1.97 1.94 1.78 2.05 2.09 1.71

5 · 10

3

100 0.98 0.85 1.13 1.04 1.36 1.29 1.39 1.47 -

-5 · 10

3

150 1.08 0.81 1.00 1.01 1.06 0.96 0.99 0.97 -

-10

4

50 1.24 0.87 1.71 0.88 1.78 1.68 1.66 1.94 2.18 1.76

10

4

100 1.34 0.81 1.01 0.80 1.26 1.15 1.30 1.28 1.51 1.21

10

4

150 3.07 0.88 1.03 0.78 1.01 0.89 1.00 0.97 1.06 0.80

10

5

50 1.07 0.70 1.09 0.72 1.18 0.85 1.15 1.32 1.50 1.23

10

5

100 1.00 0.70 1.04 0.67 1.09 0.73 1.21 1.03 1.19 1.01

10

5

150 1.13 0.68 1.13 0.69 1.36 0.77 1.08 0.84 1.03 0.75

10

6

50 1.36 0.71 1.27 0.71 1.25 0.70 1.12 0.69 2.01 0.82

10

6

100 1.84 0.75 1.95 0.87 1.62 0.73 2.90 0.84 1.08 0.70

10

6

150 2.32 0.81 2.34 0.89 4.37 0.90 3.42 0.85 1.22 0.70

Table3: TimeratioofPDGETF2toTSLUobtainedonIBMPOWER5system,usingDGETF2

forthelo alLUfa torization(Cl),andusingRGETF2forthelo alLUfa torization(Re ). The

matrixfa torizedis

m × n

,withablo kofsize

b = n

.

OntheIBMPOWER 5system(Table3), thebest improvementisobtainedfor thelargest

matrix in our test set

m = 10

6

and

n = b = 150

, where TSLU outperforms PDGETF2 by a fa torof

4.37

on

16

pro essors. Theimprovementduetolaten yredu tionisalmostafa tor

2

. Thisshowsthat redu ingthelaten y ostisanimportantpartoftheoverallimprovement.

Thebestperforman e ofTSLUon theIBMPOWER 5systemis

215

GFLOPs/s, and itis obtained for

m = 10

6

and

n = 150

on

64

pro essors(we ounthere thetotal numberof ops performedbyTSLU).Thisrepresents

44%

ofthetheoreti alpeakperforman e. Thisperforman e orrespondstoanimprovementof

1.22

overPDGETF2.

For small matri es, we an noti e that the improvement omes mainly from redu ing the

laten y ost. For intermediate size matri es and a small number of pro essors (up to

4

,

8

pro essors), the improvement omes mainly from using re ursion. For thesame matri es and

a large number of pro essors, the improvement omes from de reasing the laten y ost. On

small number of pro essors the re ursion leads to important improvements for large matri es

(

m = 10

5

, 10

6

and varying

n

), for instan e a fa tor of

2.3

for

m = 10

5

and

n = 150

on

4

pro essors. However, with in reasing number of pro essors, even for large matri es, redu ing

the laten yplaysanimportant rolein theoverallimprovement. Thebest resultsare obtained

on

16

and

32

pro essorsfor the biggest matri es

m = 10

6

and

n = 150

, showing the overall improvementfa torsof

4.37

and

3.42

respe tively.

OntheCrayXT4system(Table4),thebest improvementis seenfor

m = 10

6

and

n = 150

: afa tor of

5.58

on

4

pro essorsand afa torof

5.52

on

8

pro essors. Thebest performan eof the TSLUalgorithm is

240

GFLOPs/s, obtained by TSLUon

64

pro essorsfor

m = 10

6

and

n = 150

. This represents

36%

of the theoreti al peak performan e and it orresponds to an improvementof

3.67

overPDGETF2.

(18)

Noofpro essors

P = P

r

× P

c

m

n = b

4 8 16 32 64

2 × 2

2 × 4

4 × 4

4 × 8

8 × 8

Re Cl Re Cl Re Cl Re Cl Re Cl

10

3

50 1.42 2.23 1.85 2.71 2.09 3.09 - - -

-10

3

100 1.14 1.39 1.29 1.56 - - -

-10

3

150 1.12 0.91 - - -

-5 · 10

3

50 1.22 1.42 1.65 2.15 1.97 2.72 2.10 3.06 1.04 2.59

5 · 10

3

100 1.27 1.24 1.25 1.32 1.35 1.53 1.38 1.65 -

-5 · 10

3

150 1.67 1.22 0.97 0.90 0.88 0.91 0.85 0.90 -

-10

4

50 1.20 1.14 1.37 1.19 1.85 2.42 1.03 2.88 2.03 3.10

10

4

100 2.19 1.34 1.56 1.44 1.30 1.41 0.94 1.56 1.36 1.94

10

4

150 2.61 1.30 2.03 1.44 0.92 0.88 0.81 0.90 0.87 0.93

10

5

50 2.12 1.13 2.23 1.37 2.29 1.50 1.20 1.47 1.76 1.88

10

5

100 3.14 1.25 3.13 1.45 2.97 1.43 1.92 1.39 2.38 1.39

10

5

150 3.78 1.30 3.57 1.47 3.14 1.30 2.12 1.26 2.34 1.15

10

6

50 2.99 1.49 3.02 1.51 2.86 1.28 2.09 1.03 2.14 1.31

10

6

100 4.51 1.65 4.55 1.71 4.04 1.36 3.04 1.13 3.07 1.27

10

6

150 5.58 1.61 5.52 1.76 4.80 1.39 3.60 1.12 3.67 1.20

Table4: Time ratioofPDGETF2 toTSLUobtainedonCrayXT4system,using DGETF2for

the lo al LUfa torization (Cl), andusing RGETF2 forthe lo al LU fa torization(Re ). The

matrixfa torizedis

m × n

,withablo kofsize

b = n

.

Forthesmallmatri es(

m = 10

3

and

n

varyingfrom

50

to

150

or

m = 5 · 10

3

and

n = 50

or

100

)thebestresultsareobtainedusing lassi LU,hen esolely duetoredu tionof thelaten y ost. Forallthelargematri es,usingre ursiveLUshowsabetterperforman e.

Insummary, forallthe asestested, at leastoneof thenewTSLU algorithmsoutperforms

theS aLAPACKroutinePDGETF2. Forsmallmatri estheusageof lassi LUleadsto better

performan ethanthe re ursiveLU.Thisis ina ordan ewith theresultsin [6,9℄whi hshow

thatthere ursivealgorithmdonotfarebetterthanthe lassi algorithmforsmallmatri es. For

largermatri es, re ursiveLUperformsbetterthan lassi LU.The improvementsobtainedby

TSLUareduetobothredu ingthelaten yandthelo al LUfa torization osts.

6.3 Performan e of CALU

Inthisse tionwestudy theperforman eimprovementofCALU omparedtotheS aLAPACK

PDGETRFroutine. Weusematri esofasize

m × m

,andvaryingboth

m

and

b

(

m ∈ {10

3

, 5 ·

10

3

, 10

4

}

and

b ∈ {50, 100, 150}

). The timeratio betweenPDGETRF and CALU obtainedon theIBMPOWER5systemandtheCrayXT4systemarepresentedinTables5and6.

We have observed in Tables 3 and 4 that for a small number of pro essors, the best

per-forman efor TSLUis obtainedwhen re ursiveLU isused for thelo al LU fa torization. The

number of pro essorsused for TSLU orresponds to the numberof rows

P

r

in the2D grid of pro essorsusedforCALU.Sin einourtestsforCALU

P

r

isrelativelysmall(withvaluesgoing from

2

to

8

),inallourtestsweuseforthepanelfa torizationTSLUwithre ursiveLU.

ForIBMPOWER5system,theresultsaredisplayedinTable5. Therstmatrix(

m = 10

3

and varying

b

) is relatively small, and thus we do not expe t any important speedup with in reasing number of pro essors. In fa t, for

m = 10

3

we see no speedup for a number of

(19)

redu tionofthelaten y ost. Thebestimprovementisobtainedfor

m = 10

3

and

b = 50

(fa tors of

2.23

on

16

pro essorsand

2.29

respe tivelyon

64

pro essors),mainly asaresultof redu ing the laten y ost. An important improvementis obtainedalso for

m = 5 · 10

3

, afa tor of

1.67

on

32

pro essorsand afa torof

1.69

on

64

pro essors. For

m = 10

4

,thebest improvementisa

fa torof

1.59

on

32

pro essors.

ForCrayXT4system,theresultsaredisplayedinTable6. Wenoti ethattheimprovements

aresmallerthanonthe IBMPOWER 5system. Thebestimprovementobtainedisafa torof

1.81

for

m = 10

3

and

b = 100

on

64

pro essors. For

m = 5 · 10

3

,thebestimprovementsobtained

areafa torof

1.38

on

8

pro essorsand afa torof

1.36

on

64

pro essors,bothfor

b = 150

. For

m = 10

4

,thebest improvementsareafa torof

1.38

on

32

pro essorsandafa torof

1.33

on

64

pro essors.

Noofpro essors

P = P

r

× P

c

m = n

b

4 8 16 32 64

2 × 2

2 × 4

4 × 4

4 × 8

8 × 8

Impvt GFlops Impvt GFlops Impvt GFlops Impvt GFlops Impvt GFlops

CALU CALU CALU CALU CALU

10

3

50 1.57 9.79 1.59 11.9 2.23 11.8 2.07 11.5 2.25 9.8

10

3

100 1.48 8.84 1.47 10.5 1.91 10.6 1.91 11.2 2.29 10.9

10

3

150 1.36 8.26 1.41 9.9 1.70 9.5 - - -

-5 · 10

3

50 1.05 21.5 1.09 39.4 1.31 61.2 1.51 95.5 1.69 118.6

5 · 10

3

100 1.06 21.1 1.13 37.3 1.21 56.7 1.67 84.1 1.66 103.1

5 · 10

3

150 1.04 20.6 1.08 35.1 1.18 52.3 1.26 74.8 1.45 89.1

10

4

50 1.00 23.27 1.00 45.1 1.08 80.3 1.17 143.2 1.35 213.9

10

4

100 1.00 23.62 1.00 44.4 1.10 78.0 1.19 133.2 1.24 197.6

10

4

150 1.01 23.47 1.02 42.3 1.33 74.1 1.59 122.1 1.17 173.8

Table5: Time ratio of PDGETRF to CALU (Impvt olumns) and performan e for CALU in

GFLOPs/s (GFlops olumns) obtained on IBM POWER 5 system. The matrix fa torized is

m × m

,withablo kofsize

b

.

TheimprovementspresentedinTables6and5areobtainedforaxednumberofpro essors

and a xed blo ksize. Howeverthe best improvementsdo not orrespond alwaysto the best

performan eofCALUorPDGETRF.CALU anhaveabetterperforman eforadierentblo k

size or grid shape than PDGETRF. Hen e, an interesting question to answer is: for a given

problemsize

m

andagivenmaximumnumberofpro essors,what istheimprovementobtained bythebest CALUwithrespe tto thebestPDGETRF?Toanswerthisquestion,wepresentin

Table7theimprovementobtainedbytakingthebestperforman eindependentlyforCALUand

PDGETRF,whenvaryingthenumberofpro essors(from

8

to

64

)andtheblo ksize(valuesof

50

,

100

and

150

). Fora givennumberofpro essors,weuse onegrid shape, asin ourprevious experiments. We also display the best performan e for CALU and PDGETRF in GFLOPs/s

(GFlops olumns),theblo ksize(

b

)andthenumberofpro essorsforwhi hthebestperforman e wasobtained,and theper entageoftheoreti alpeakperforman eobtainedbyCALU( olumns

Pr nt). Thespeedup is omputedas follows:

speedup(m, m, P

max

) =

min

P ≤P

max

,b

T

P DGET RF

(m, m, P, b)

min

P ≤P

max

,b

T

CALU

(m, m, P, b)

CALU leadsto improvementsupto

1.69

onIBM POWER 5and upto

1.53

onCray XT4. Note that the improvements are obtained when the performan e of the algorithm is a small

(20)

Noofpro essors

P = P

r

× P

c

m = n

b

4 8 16 32 64

2 × 2

2 × 4

4 × 4

4 × 8

8 × 8

Impvt GFlops Impvt GFlops Impvt GFlops Impvt GFlops Impvt GFlops

CALU CALU CALU CALU CALU

10

3

50 1.19 5.4 1.20 6.6 1.33 6.6 1.35 7.5 1.67 7.6

10

3

100 1.28 5.5 1.39 7.0 1.52 7.2 1.60 8.5 1.81 8.3

10

3

150 1.23 5.3 1.32 6.6 1.44 6.7 - - -

-5 · 10

3

50 1.03 19.2 1.09 33.3 1.12 44.3 1.16 69.2 1.11 67.2

5 · 10

3

100 1.12 19.4 1.20 32.3 1.13 42.8 1.24 67.4 1.32 76.1

5 · 10

3

150 1.23 19.1 1.38 31.0 1.22 40.8 1.35 61.9 1.36 70.5

10

4

50 1.01 24.4 1.05 45.7 1.04 69.5 1.08 121.3 1.31 154.9

10

4

100 1.09 25.3 1.18 46.5 1.13 69.8 1.22 118.2 1.33 153.3

10

4

150 1.16 25.2 1.31 45.4 1.22 67.3 1.38 111.4 1.30 140.3

Table6: Time ratio of PDGETRF to CALU (Impvt olumns) and performan e for CALU in

GFLOPs/s(GFlops olumns)obtainedonCrayXT4system. Thematrixfa torizedis

m × m

, withablo kofsize

b

.

per entageof the theoreti al peak performan e. The smallestper entageis obtained on Cray

XT4,for

m = 10

3

and

P = 32

. Thisissomehowexpe ted,sin ethisisasmallmatrixexe uted on 32 dual- ore pro essors. A better per entage is obtained on IBMPOWER 5, for example

40.6

for

m = 10

4

on

64

pro essors. However,evenwhen theper entageofpeakperforman eis small,theTable7showsthatCALU willletausermoree ientlyusethesameresour esthan

PDGETRF,andsoitisalwaysworthusingit.

IBMPower5

m speedup CALU PDGETRF

GFlops P b Pr nt GFlops P b

10

3

1.59 11.9 8 50 19.6 7.5 8 50

5 · 10

3

1.69 118.6 64 50 24.4 70.0 64 50

10

4

1.34 213.9 64 50 40.6 159.8 64 100 CrayXT4

m speedup CALU PDGETRF

GFlops P b Pr nt GFlops P b

10

3

1.53 8.5 32 100 2.5 5.54 32 50

5 · 10

3

1.26 76.1 64 100 11.4 60.2 64 50

10

4

1.31 154.9 64 50 23.2 118.1 64 50

Table7:SpeedupestimatedastheratioofbestPDGETRFoverbestCALUforagivenproblem

size,thebestperforman eforCALUandPDGETRFinGFLOPs/s(GFlops olumns),theblo k

size(

b

olumns)andthenumberofpro essors(

P

olumns)forwhi hthebestperforman ewas obtained. Pr nt denotestheper entageoftheoreti alpeakperforman eobtainedbyCALU.

(21)

7 Con lusions and future work

Inthispaperwehaveintrodu edCALU,anewalgorithmfor omputingtheLUfa torizationof

densematri es. This algorithm usesanewpivoting strategy, the a-pivoting, whi h isused to

e iently omputethe LUfa torizationof ablo k- olumn, andleadstoanimportantde rease

in thenumberofmessagesofCALU withrespe tto lassi algorithms.

We have ompared CALU with the orresponding PDGETRF routine from S aLAPACK.

Ourexperimentshaveshownthatdependingonthesizeofthematrixandthe hara teristi sof

theunderlying omputerar hite ture, itis either laten yredu tionorre ursion orbothwhi h

aremajor fa torsin redu ingtheparalleltime ofthe LUfa torization. Interestingly, thegains

due to laten y redu tionare not limited onlyto the small matri es, but ae ts also the large

matri es. Inthese asesthere ursionisverye ientinredu ingthelo alLUfa torizationtime

andthusleavesthelaten yasthetime onsumingbottlene k,whi hneedstobealleviated.

Thefa torizationofablo k- olumn,TSLU,outperformsthe orrespondingroutinePDGETF2

from S aLAPACK upto a fa tor of

4.37

on the IBM POWER5 system and upto afa tor of

5.52

on the CrayXT4system. CALU outperformsPDGETRF up to afa tor of

2.29

on IBM POWER5anduptoafa torof

1.81

onCrayXT4.

Thefa torizationofablo k- olumnliesonthe riti alpathoftheparallelLUfa torization,

andhen eweexpe tthat theusageof a-pivoting strategybyotherparallel LUalgorithms, as

HPL,willleadtoimprovementsoftheoveralltime.

Asfuturework,itwill beinterestingto studythesuitabilityofthenew a-pivotingstrategy

for parallel LUon multi ore ar hite tures. Anotherdire tion onsists of using the a-pivoting

strategy for the LU fa torization orthe in omplete LUfa torization of sparse matri es. This

maypayomu hmorethanthedense ase,sin ethereisahigherproportionof ommuni ation

versus omputation.

A knowledgments

TheauthorsthankM.Hoemmen,J.Langou,andJ.Riedyforhelpfuldis ussionsonthissubje t.

Referen es

[1℄ Top500super omputersites. availableatwww.top500.org.

[2℄ J.Choi, J.J.Dongarra, L. S.Ostrou hov,A. P.Petitet, D.W. Walker, and R.C.Whaley. The

Design and Implementation of the S aLAPACK LU, QR and CholeskyFa torization Routines.

S ienti Programming,5(3):173184,1996. ISSN1058-9244.

[3℄ J. Demmel,L. Grigori, M. Hoemmen, and J. Langou. Communi ation-avoiding parallel and

se-quentialQRfa torizations. Te hni alreport, UCBerkeley,2008.

[4℄ J.J.Dongarra, R.A.V.D.Geijn, andD.W.Walker. S alabilityIssuesAe tingtheDesignofa

DenseLinearAlgebraLibrary. Journalof Parallel andDistributedComputing, 22:523537,1994.

[5℄ J.J.Dongarra,P.Lusz zek,andA.Petitet.TheLINPACKBen hmark: Past,PresentandFuture.

Con urren y: Pra ti e andExperien e, 15:803820,2003.

[6℄ F.Gustavson. Re ursion Leadsto Automati VariableBlo king for DenseLinear-Algebra

Algo-rithms. IBMJournalof Resear h andDevelopment,41(6):737755,1997.

[7℄ D.Skinner.IBMSPParallelS alingOverview.http://www.ners .gov/news/reports/te hni al/seabo rg_s a ling.

[8℄ A.Tate. Personal ommuni ation.

[9℄ S.Toledo. Lo alityofreferen einLUDe ompositionwithpartialpivoting. SIAMJ.Matrix Anal.

Appl.,18(4),1997.

[10℄ L.N.TrefethenandR.S.S hreiber.Average- asestabilityofGaussianelimination.SIAMJ.Matrix

(22)

Parc Orsay Université - ZAC des Vignes

4, rue Jacques Monod - 91893 Orsay Cedex (France)

Centre de recherche INRIA Bordeaux – Sud Ouest : Domaine Universitaire - 351, cours de la Libération - 33405 Talence Cedex

Centre de recherche INRIA Grenoble – Rhône-Alpes : 655, avenue de l’Europe - 38334 Montbonnot Saint-Ismier

Centre de recherche INRIA Lille – Nord Europe : Parc Scientifique de la Haute Borne - 40, avenue Halley - 59650 Villeneuve d’Ascq

Centre de recherche INRIA Nancy – Grand Est : LORIA, Technopôle de Nancy-Brabois - Campus scientifique

615, rue du Jardin Botanique - BP 101 - 54602 Villers-lès-Nancy Cedex

Centre de recherche INRIA Paris – Rocquencourt : Domaine de Voluceau - Rocquencourt - BP 105 - 78153 Le Chesnay Cedex

Centre de recherche INRIA Rennes – Bretagne Atlantique : IRISA, Campus universitaire de Beaulieu - 35042 Rennes Cedex

Centre de recherche INRIA Sophia Antipolis – Méditerranée : 2004, route des Lucioles - BP 93 - 06902 Sophia Antipolis Cedex

Éditeur

Figure

Figure 1: Example of exeution of TSLU on 4 proessors.
Table 1: HPL auray tests for a-pivoting strategy
Figure 2: The growth fator and the minimum threshold value for matries following a normal
Table 3: Time ratio of PDGETF2 to TSLU obtained on IBM POWER 5 system, using DGETF2
+4

Références

Documents relatifs

The aims of this study are: (1) to build a feasible methodology using content and structural analysis of social media (in particular, Twitter) with respect to

T is the address of the current subterm (which is not closed, in general) : it is, therefore, an instruction pointer which runs along the term to be performed ; E is the

Toute reproduction ou représentation intégrale ou partielle, par quelque procédé que ce so'it, des pages publiées dans la présente édition, faite sans l'autorisation de l'éditeur

[r]

The comparison between the ducted and the open DP thruster near wake characteristics indicates that the rate of reduction in the open OP thruster flow axial velocity, vorticity,

To prevent abusing exceptions that would provide Buddhist novice monks and nuns flexibility to occasionally use music for legitimate purposes for the sake

In order to simplify the process of converting numbers from the modular representation to the positional representation of numbers, we will consider an approximate method that allows

In a second time, we investigate the inverse problem consisting in recovering the shape of the plasma inside the tokamak, from the knowledge of Cauchy data on the external boundary Γ