HAL Id: inria-00508277
https://hal.inria.fr/inria-00508277v2
Submitted on 23 Sep 2010
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Parallel GMRES with a multiplicative Schwarz preconditioner
Désiré Nuentsa Wakam, Guy-Antoine Atenekeng Kahou

To cite this version:
Désiré Nuentsa Wakam, Guy-Antoine Atenekeng Kahou. Parallel GMRES with a multiplicative Schwarz preconditioner. [Research Report] RR-7342, INRIA. 2010. inria-00508277v2
ISSN 0249-6399
ISRN INRIA/RR--7342--FR+ENG
Observation and Modeling for Environmental Sciences

Parallel GMRES with a multiplicative Schwarz preconditioner

Désiré Nuentsa Wakam*, Guy-Antoine Atenekeng Kahou†

N° 7342, version 2

Centre de recherche INRIA Rennes – Bretagne Atlantique

Theme: Observation and Modeling for Environmental Sciences
Équipes-Projets Sage

Research report n° 7342, version 2 (initial version July 2010, revised version September 2010), 24 pages
Abstract: In this paper, we present a hybrid solver for linear systems that combines a Krylov subspace method as accelerator with an overlapping domain decomposition method as preconditioner. The preconditioner uses an explicit formulation associated to one iteration of the classical multiplicative Schwarz method. To avoid communications and synchronizations between subdomains, the Newton-basis GMRES implementation is used as accelerator. Thus, it is necessary to divide the computation of the orthonormal basis into two steps: the preconditioned Newton basis is computed, then it is orthogonalized. The first step is merely a sequence of matrix-vector products and solutions of linear systems associated to subdomains; we describe the fine-grained parallelism that is used in these kernel operations. The second step uses a parallel implementation of dense QR factorization on the resulting basis. At each application of the preconditioner operator, local systems associated to the subdomains are solved with some accuracy depending on the global physical problem. We show that this step can be further parallelized with calls to external third-party solvers. To this end, we define two levels of parallelism in the solver: the first level is intended for the computation and the communication across all the subdomains; the second level of parallelism is used inside each subdomain to solve the smaller linear systems induced by the preconditioner.

Numerical experiments are performed on several problems to demonstrate the benefits of such approaches, mainly in terms of global efficiency and numerical robustness.

Key-words: domain decomposition, preconditioning, multiplicative Schwarz, parallel GMRES, Newton basis, multilevel parallelism
* INRIA, Campus universitaire de Beaulieu, 35042 Rennes Cedex. E-mail: desire.nuentsa_wakam@irisa.fr
†
Résumé: In this paper, we present a hybrid solver for the solution of linear systems on parallel architectures. The solver combines a Krylov subspace accelerator with a preconditioner based on domain decomposition. The preconditioner is defined from an explicit formulation corresponding to one iteration of multiplicative Schwarz. The accelerator is of GMRES type. In order to reduce the communication between the subdomains, we use a parallel version of GMRES that divides the construction of the basis of the associated Krylov subspace into two steps. The first step generates the basis vectors in a pipeline of operations across all the processors; the main operations here are matrix-vector products and the solution of linear subsystems in the subdomains. The second step performs a parallel QR factorization on the rectangular matrix formed by the previously built vectors, which yields an orthogonal basis of the Krylov subspace.

The subsystems arising from the application of the preconditioner are solved at different levels of accuracy by an LU or ILU factorization, depending on the difficulty of the underlying problem. Moreover, this step can be done in parallel via calls to external solvers. To this end, we use two levels of parallelism in our global solver. The first level defines the computations and the communications across the subdomains. The second level is defined inside each subdomain for the solution of the subsystems arising from the preconditioner.

Several numerical tests are carried out to validate this approach, in particular regarding efficiency and robustness.

Mots-clés: domain decomposition, preconditioning, multiplicative Schwarz, parallel GMRES, Newton basis, multilevel parallelism.
1 Introduction

In this paper, we are interested in the parallel computation of the solution of the linear system (1)

    Ax = b    (1)

with A ∈ R^{n×n} and x, b ∈ R^n. Over the past two decades, the GMRES iterative method proposed by Saad and Schultz [33] has proved very successful for this type of systems, particularly when A is a large sparse nonsymmetric matrix. Usually, to be robust, the method solves a preconditioned system (2)

    M^{-1}Ax = M^{-1}b    or    AM^{-1}(Mx) = b    (2)

where M^{-1} is a preconditioner operator that accelerates the convergence of the iterative method.
On computing environments with a distributed architecture, preconditioners based on domain decomposition are of natural use. Their formulation reduces the global problem to several subproblems, where each subproblem is associated to a subdomain; therefore, one or more subdomains are associated to a node of the parallel computer and the global system is solved by exchanging information between neighboring subdomains. Generally, in domain decomposition methods, there are two ways of deriving the subdomains: (i) from the underlying physical domain and (ii) from the adjacency graph of the coefficient matrix A. In either of these partitioning techniques, subdomains may overlap. Overlapping domain decomposition approaches are known as Schwarz methods, while non-overlapping approaches refer to Schur complement techniques. Here, we are interested in preconditioners based on the first class. Depending on how the global solution is obtained, the Schwarz method is additive or multiplicative [38, Ch. 1]. The former approach computes the solution of subproblems simultaneously in all subdomains. It is akin to the block Jacobi method; therefore, the additive Schwarz method has a straightforward implementation in a parallel environment [13]. Furthermore, it is often used in conjunction with Schur complement techniques to produce hybrid preconditioners [14, 23, 34].

The multiplicative Schwarz method builds a solution of the global system by alternating successively through all the subdomains; it is therefore similar to the block Gauss-Seidel method on an extended system. Thus, compared to the additive approach, it will theoretically require fewer iterations to converge. However, good efficiency is difficult to obtain in a parallel environment due to the high data dependencies between the subdomains. The traditional approach to overcome this is through graph coloring, by associating different colors to neighboring subdomains. Hence, the solutions in subdomains of the same color can be updated in parallel [38, Ch. 1]. Recently, a different approach has been proposed [4, 6]. In that work, the authors proposed an explicit formulation associated to one iteration of the multiplicative Schwarz method. This formulation requires that the matrix is partitioned in block diagonal form [5] such that each block has a maximum of two neighbors; from this explicit formula, the residual vector is determined independently of the computation of the new approximate global solution, and it can therefore be parallelized through sequences of matrix-vector products and local solutions in subdomains.
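The contrast between the two variants can be sketched sequentially. The following numpy toy (two overlapping subdomains on a small diagonally dominant system, unrelated to the actual solver kernels of this report) applies one additive and one multiplicative sweep: the multiplicative sweep refreshes the residual after each local solve, the additive sweep freezes it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
A = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonally dominant
b = rng.standard_normal(n)
x_true = np.linalg.solve(A, b)

# Two overlapping index sets (subdomains).
doms = [np.arange(0, 6), np.arange(4, 10)]

def schwarz_sweep(x, multiplicative):
    """One Schwarz iteration: a local solve on each subdomain."""
    corrections = []
    for d in doms:
        r = b - A @ x                    # residual (frozen if additive)
        dx = np.linalg.solve(A[np.ix_(d, d)], r[d])
        if multiplicative:
            x = x.copy()
            x[d] += dx                   # apply the correction immediately
        else:
            corrections.append((d, dx))  # apply all corrections at the end
    if not multiplicative:
        x = x.copy()
        for d, dx in corrections:
            x[d] += dx
    return x

x0 = np.zeros(n)
err0 = np.linalg.norm(x0 - x_true)
err_mult = np.linalg.norm(schwarz_sweep(x0, True) - x_true)
err_add = np.linalg.norm(schwarz_sweep(x0, False) - x_true)
assert err_mult < err0 and err_add < err0
```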
The first purpose of this paper is to present the parallelism that is obtained when this explicit formulation is used to build a preconditioned Krylov basis for the GMRES method. This is achieved through the Newton basis implementation proposed in [7] and applied in [18, 37]. Hence, the usual inner products and global synchronizations are avoided across all the subdomains, and the resulting algorithm leads to a pipeline parallelism. We will refer to this as a first level of parallelism in GMRES. The second and main purpose of our work here is to further use parallel operations when solving the local systems associated to subdomains. To this end, we introduce a second level of parallelism to deal with those subsystems. The first observation is that, for systems that arise from non-elliptic operators, the number of subdomains should be small to guarantee the convergence. Consequently, the local systems could be very large and should be solved efficiently to accelerate the convergence. The second observation comes from the architecture of current parallel computers. These two levels of parallelism enable an efficient usage of the allocated resources by dividing tasks across and inside all the available nodes. This approach has recently been used to enhance the scalability of hybrid preconditioners based on additive Schwarz and Schur complement techniques [22].
The remaining part of this paper is organized as follows. Section 2 recalls the explicit formulation of the multiplicative Schwarz preconditioner. After that, a parallel implementation of the preconditioned Newton-basis GMRES is given. Section 3 presents the second level of parallelism introduced to solve the linear systems in subdomains. As those systems should be solved several times with different right-hand sides, the natural way is to use a parallel third-party solver based on an (incomplete) LU factorization. The parallelism in these solvers is intended for distributed-memory computers; we discuss the benefits of this approach. In Section 4, we provide intensive numerical results that reveal good performance of this parallel hybrid solver. The test matrices are taken either from public academic repositories or from industrial test cases. Concluding remarks and future directions of this work are given at the end of the paper.

2 A parallel version of GMRES preconditioned by multiplicative Schwarz
In this section, the explicit formulation of the multiplicative Schwarz method is introduced. Then we show how to use it efficiently as a parallel preconditioner for the GMRES method. A good parallelism is obtained thanks to the use of the Newton-Krylov basis.

2.1 Explicit formulation of the multiplicative Schwarz preconditioner
From the equation (1), we consider a permutation of the matrix A into p overlapping partitions A_i. We denote by C_i the overlapping matrix between A_i and A_{i+1} (i = 1, ..., p−1). Here, each diagonal block has a maximum of two neighbors (see Figure 1.a). Assuming that there is no full line (or column) in the sparse matrix A, such a partitioning can be obtained from the adjacency graph of A by means of profile reduction and level sets [5]. We define \bar{A}_i (resp. \bar{C}_i) as the matrix A_i (resp. C_i) completed by identity to the size of A (see Figure 1.b). If \bar{A}_i and \bar{C}_i are nonsingular matrices, then the matrix associated to one iteration of the classical multiplicative Schwarz method is defined [6] as:

    M^{-1} = \bar{A}_p^{-1} \bar{C}_{p-1} \bar{A}_{p-1}^{-1} \bar{C}_{p-2} ... \bar{A}_2^{-1} \bar{C}_1 \bar{A}_1^{-1}    (3)

This explicit formulation is very useful to provide a stand-alone algorithm for the essential operation y ← M^{-1}x used in iterative methods. However, since dependencies between subdomains are still present in this expression, no efficient parallel algorithm could be obtained for this single operation. Now, if a sequence of vectors v_i should be generated such that v_i ← M^{-1}Av_{i−1}, then the algorithm would produce a very good pipeline parallelism. For the preconditioned Krylov method, these vectors are simply those that span the Krylov subspace. However, to get a parallel process, the basis of this Krylov subspace should be formed with the Newton polynomials as described in the next section.
Figure 1: Partitioning of A into four subdomains. (a): block-diagonal form of A with diagonal blocks A_1, ..., A_4 and overlap blocks C_1, C_2, C_3; (b): the submatrix A_k completed by identity.

2.2 Background on GMRES with the Newton basis
A left preconditioned restarted GMRES(m) method minimizes the residual vector r_m = M^{-1}(b − Ax_m) in the Krylov subspace x_0 + K_m, where x_0 is the initial approximation, x_m the current iterate, and K_m = span{r_0, M^{-1}Ar_0, ..., (M^{-1}A)^{m−1}r_0}. The new approximation is of the form x_m = x_0 + V_m y_m, where y_m minimizes ||r_m||. The most time-consuming part of this method is the construction of the orthonormal basis V_m of K_m; the Arnoldi process is generally used for this purpose [32]. It constructs the basis and orthogonalizes it at the same time. The effect of this is the presence of global communication between all the processes. Hence, no parallelism could be obtained across the subdomains with this approach. However, synchronization points can be avoided by decoupling the construction of V_m into two independent phases: first the basis is generated a priori, then it is orthogonalized.

Many authors proposed different ways to generate this a priori basis [7, 40, 16]. For stability purposes, we choose the Newton basis introduced by Bai et al. [7] and used by Erhel [18]:

    \hat{V}_{m+1} = [μ_0 r_0, μ_1 (M^{-1}A − λ_1 I) r_0, ..., μ_m ∏_{j=1}^{m} (M^{-1}A − λ_j I) r_0]    (4)

where the μ_j and λ_j (j = 1, ..., m) are respectively scaling factors and approximate eigenvalues of A. With a chosen initial vector v_0 = r_0/||r_0||, a sequence of vectors v_1, v_2, ..., v_m is first generated such that

    v_j = σ_j (M^{-1}A − λ_j I) v_{j−1},    j = 1, ..., m    (5)

where

    σ_j = 1/||(M^{-1}A − λ_j I) v_{j−1}||    (6)

To avoid global communication, the vectors v_j are normalized at the end of the process. Hence, the scalars μ_j from the equation (4) are easily computed as products of the scalars σ_j. These steps are explained in detail in Sections 2.2.1 and 2.2.2. At this point, we get a normalized basis V_{m+1} such that

    M^{-1}A V_m = V_{m+1} T_m    (7)

where T_m is a bidiagonal matrix built from the scalars σ_j and λ_j. Therefore, V_{m+1} should be orthogonalized using a QR factorization:

    V_{m+1} = Q_{m+1} R_{m+1}    (8)

As we show in Section 2.2.1, and following the distribution of vectors in Figure 3.(b), the v_i's are distributed in blocks of consecutive rows between all the processors; however, their overlapped regions are not duplicated between neighboring processors. Thus, to perform the QR factorization, we use an algorithm introduced by Sameh [35] with a parallel implementation provided by Sidje [37]. Recently, a new approach called TSQR has been proposed by Demmel et al. [17], which aims primarily to minimize the communications and to make better use of the BLAS kernel operations. Their current algorithm used in [27] is implemented with POSIX threads and can only be used when a whole matrix V_m is available in the memory of one SMP node. Nevertheless, the algorithm is SPMD-style and should be easily ported to distributed memory nodes.

So far, at the end of the factorization, we get an orthogonal basis Q_{m+1} implicitly represented as a set of orthogonal reflectors. The matrix R_{m+1} is available in the memory of the last processor. To perform the minimization step in GMRES, we derive an Arnoldi-like relation [32, Section 6.3] using equations (7) and (8):

    M^{-1}A V_m = Q_{m+1} R_{m+1} T_m = Q_{m+1} \bar{G}_m    (9)

Hence the matrix \bar{G}_m is in Hessenberg form, and the new approximate solution is given by x_m = x_0 + V_m y_m, where the vector y_m minimizes the function J defined by

    J(y) = ||β e_1 − \bar{G}_m y||,    β = ||r_0||,    e_1 = [1, 0, ..., 0]^T    (10)

The matrix \bar{G}_m is in the memory of the last processor. Since m ≪ n, this least-squares problem is sequentially and easily solved using a QR factorization of \bar{G}_m. The outline of this GMRES algorithm with the Newton basis can be found in [18].

2.2.1 Parallel processing of the preconditioned Newton basis
In this section, we generate the Krylov vectors v_j (j = 0, ..., m) of V_{m+1} from the equation (5). Consider the partitioning of Figure 1 with a total of p subdomains. At this point, we assume that the number of processes is equal to the number of subdomains. Thus, each process computes the sequence of vectors v_j^{(k)} = (M^{-1}A − λ_j I)v_{j−1}^{(k)}, where v_j^{(k)} is the set of block rows from the vector v_j owned by process P_k. The kernel computation reduces to some extent to two major operations, z = Ax and y = M^{-1}z. For these operations, we consider the matrix and vector distributions in Figures 2 and 3. On each process P_k, the overlapping submatrix is zeroed to yield the matrix B_k in Figure 2. In Figure 3.(b), the distribution for the vector x is plotted. The overlapping parts are repeated on all processes. Hence, for each subvector x^{(k)} on process P_k, x_u^{(k)} and x_d^{(k)} denote respectively the overlapping parts with x^{(k−1)} and x^{(k+1)}. The pseudocode for the matrix-vector product z = Ax then follows in Algorithm 1.

Now we consider the matrix distribution in Figure 3.(a) for the second operation y = M^{-1}z. According to relation (3), each process k solves locally the linear system A_k t^{(k)} = z^{(k)} for t^{(k)}, followed by a product y_d^{(k)} = C_k t_d^{(k)} with the overlapped matrix C_k. However, the process P_k should first receive the overlapping part of y_d^{(k−1)} from the process P_{k−1}. Algorithm 2 describes the application of the preconditioner M^{-1} to a vector z.

The form of the M^{-1} operator produces a data dependency between neighboring processes. Hence, the computation of a single vector v_j = (M^{-1}A − λ_j I)v_{j−1} is sequential over all the processes. However, since the v_j are computed one after another, a process can start to compute its own part of the vector v_j even if the whole previous vector v_{j−1} is not yet available. We refer to this as a pipeline parallelism since the vectors v_j are computed across all the processes as in a pipeline. This would not be possible with the Arnoldi process, as all the global communications introduce synchronization points between the processes and prevent the use of the pipeline; see for instance the data dependencies implied by this process in Erhel [18].

Figure 2: Distribution for the matrix-vector product (on each process, the overlapping submatrix of B_k is zeroed).

Figure 3: Matrix and vector splittings. (a): distribution for the operation y ← M^{-1}x over the blocks A_1, ..., A_4 and C_1, ..., C_3; (b): distribution of the vector x, with x_d^{(1)} = x_u^{(2)}, x_d^{(2)} = x_u^{(3)}, etc.
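Ignoring the distribution, the recursion (5)-(6) and the relations (7)-(9) can be verified sequentially with a few lines of numpy (here M = I and arbitrary real shifts, purely as a sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 12, 5
A = rng.standard_normal((n, n)) + n * np.eye(n)
r0 = rng.standard_normal(n)
lam = np.linspace(1.0, 2.0, m)      # stand-in shifts (Ritz values in the report)

V = np.zeros((n, m + 1))
T = np.zeros((m + 1, m))            # bidiagonal: lambda_j and 1/sigma_j
V[:, 0] = r0 / np.linalg.norm(r0)
for j in range(1, m + 1):
    w = A @ V[:, j - 1] - lam[j - 1] * V[:, j - 1]  # (M^{-1}A - lambda_j I) v_{j-1}
    sigma = 1.0 / np.linalg.norm(w)                 # equation (6)
    V[:, j] = sigma * w                             # equation (5)
    T[j - 1, j - 1] = lam[j - 1]
    T[j, j - 1] = 1.0 / sigma

# Relation (7): (M^{-1}A) V_m = V_{m+1} T_m, here with M = I.
assert np.allclose(A @ V[:, :m], V @ T)

# Equations (8)-(9): after V_{m+1} = Q_{m+1} R_{m+1}, the matrix
# G_m = R_{m+1} T_m is upper Hessenberg.
Q, R = np.linalg.qr(V)
G = R @ T
assert np.allclose(A @ V[:, :m], Q @ G)
assert np.allclose(np.tril(G, -2), 0.0)
```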
To better understand the actual pipeline parallelism, we recall here all the dependencies presented in [39]. For the parallel matrix-vector product in Algorithm 1, if we set z = h(x) = Ax, then z^{(k)} = h_k(x^{(k−1)}, x^{(k)}, x^{(k+1)}), and Figure 4.(a) illustrates these dependencies. For the preconditioner application y ← M^{-1}z as written in Algorithm 2, we set y = g(z) = M^{-1}z; then y^{(k)} = g_k(y^{(k−1)}, z^{(k)}, z^{(k+1)}), and the dependencies are depicted in Figure 4.(b). Finally, for the operation y ← M^{-1}Ax, if y = f(x) = M^{-1}Ax = g ∘ h(x), then y^{(k)} = f_k(y^{(k−1)}, x^{(k)}, x^{(k+1)}, x^{(k+2)}), and we combine the two graphs to obtain the dependencies in Figure 5.a. Here the dependency on x^{(k−1)} is combined with that of y^{(k−1)}. In the pipelined computation v_j = f(v_{j−1}), the dependency on x^{(k+2)} in Figure 5.b delays the computation of the v_j^{(k)} until the subvector v_{j−1}^{(k+2)} is available.
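The locality z^{(k)} = h_k(x^{(k−1)}, x^{(k)}, x^{(k+1)}) can be observed on a toy block tridiagonal matrix (a stand-in for the block-diagonal-with-overlap structure of Figure 1): perturbing a distant subvector leaves the nearer blocks of z untouched.

```python
import numpy as np

rng = np.random.default_rng(4)
p, bs = 4, 3                  # four subdomains, block size 3
n = p * bs
# Block tridiagonal A: block row k only touches blocks k-1, k, k+1.
A = np.zeros((n, n))
for k in range(p):
    lo, hi = k * bs, (k + 1) * bs
    c0, c1 = max(0, lo - bs), min(n, hi + bs)
    A[lo:hi, c0:c1] = rng.standard_normal((bs, c1 - c0))

x = rng.standard_normal(n)
z = A @ x

# Perturbing x^{(4)} leaves z^{(1)} and z^{(2)} unchanged,
# since z^{(k)} = h_k(x^{(k-1)}, x^{(k)}, x^{(k+1)}).
x2 = x.copy()
x2[3 * bs:] += 1.0
z2 = A @ x2
assert np.allclose(z[:2 * bs], z2[:2 * bs])
assert not np.allclose(z[2 * bs:], z2[2 * bs:])
```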
Algorithm 1: z = Ax
 1: /* Process P_k holds B_k, x^{(k)} */
 2: k ← myrank()
 3: z^{(k)} ← B_k x^{(k)}  /* local matrix-vector product */
 4: if k < p then
 5:   Send z_d^{(k)} to process P_{k+1}
 6: end if
 7: if k > 1 then
 8:   Receive z_d^{(k−1)} from process P_{k−1}
 9:   z_u^{(k)} ← z_u^{(k)} + z_d^{(k−1)}
10: end if
11: /* Communication for the consistency of overlapped regions */
12: if k > 1 then
13:   Send z_u^{(k)} to process P_{k−1}
14: end if
15: if k < p then
16:   Receive z_u^{(k+1)} from process P_{k+1}
17:   z_d^{(k)} ← z_u^{(k+1)}
18: end if
19: return z^{(k)}
Algorithm 2: y = M^{-1}z
 1: /* Process P_k holds A_k, z^{(k)} */
 2: k ← myrank()
 3: if k > 1 then
 4:   Receive y_d^{(k−1)} from process P_{k−1}
 5:   z_u^{(k)} ← y_d^{(k−1)}
 6: end if
 7: Solve the local system A_k y^{(k)} = z^{(k)} for y^{(k)}
 8: if k < p then
 9:   y_d^{(k)} ← C_k y_d^{(k)}
10:   Send y_d^{(k)} to process P_{k+1}
11: end if
12: /* Communication for the consistency of overlapped regions */
13: if k > 1 then
14:   Send y_u^{(k)} to process P_{k−1}
15: end if
16: if k < p then
17:   Receive y_u^{(k+1)} from process P_{k+1}
18:   y_d^{(k)} ← y_u^{(k+1)}
19: end if
20: return y^{(k)}
Each vector of the basis is thus computed across all the subdomains, and if τ denotes the time to compute a subvector x^{(k)}, including the communication overhead, then the time to compute one vector is

    t(1) = pτ    (11)

and the time to compute the m vectors of the basis follows

    t(m) = pτ + 3(m − 1)τ    (12)

(see Figure 6). The efficiency e_p of the overall algorithm is therefore computed as:

    e_p = m t(1) / (p t(m)) = pτm / (p(pτ + 3(m − 1)τ)) = m / (p + 3(m − 1))    (13)

Hence, the efficiency is close to 1/3 for a large value of m. However, this result strongly suggests keeping the number p of subdomains very small relative to m.
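A quick numeric reading of the model (13):

```python
# Efficiency model of equation (13): e_p = m / (p + 3(m - 1)).
def efficiency(p, m):
    return m / (p + 3 * (m - 1))

# For a fixed p, the efficiency approaches 1/3 as m grows...
assert abs(efficiency(4, 10_000) - 1 / 3) < 1e-3
# ...and for a fixed basis size m, fewer subdomains means better efficiency.
assert efficiency(4, 32) > efficiency(16, 32)
```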
Figure 4: Dependency graphs. (a): graph for the matrix-vector product; (b): graph for applying M^{-1}.

Figure 5: Dependency graph for y = M^{-1}Ax.
2.2.2 Computation of shifts and scaling factors

So far, we have not yet explained how to compute the shifts λ_j and the scaling factors σ_j of the equations (5) and (6). As suggested by Reichel [31], the values λ_j should be the m eigenvalues of greatest magnitude of the matrix M^{-1}A. Since these eigenvalues cannot be cheaply computed, we perform one iteration of classical GMRES(m) with the Arnoldi process. Hence, the m eigenvalues computed from the output Hessenberg matrix are the approximate eigenvalues of the preconditioned global matrix [33, Section 3]. These eigenvalues are arranged using the Leja ordering while grouping together two complex conjugate eigenvalues. Recently, Philippe and Reichel [29] proved that, in some cases, roots of the Chebyshev polynomials can be used efficiently.

The scalars σ_j are determined without using global reduction operations. Indeed, such operations introduce synchronization points between all the processes and consequently destroy the pipeline parallelism. Algorithms 1 and 2 compute to some extent the sequence v_j = (M^{-1}A − λ_j I)v_{j−1}. We are interested in the scalars σ_j = ||v_j||_2. For this purpose, on process P_k, we denote by \hat{v}_j^{(k)} the subvector v_j^{(k)} without the overlapping part. In Algorithm 2, after the line 8, an instruction is therefore added to compute the local sum σ_j^{(k)} = ||\hat{v}_j^{(k)}||_2^2. The result is sent to the process P_{k+1} at the same time as the overlapping subvector at line 10. The next process P_{k+1} receives the result and adds its own contribution. This is repeated until the last process P_p. At this step, the scalar σ_j = sqrt(σ_j^{(p)}) is the norm ||v_j||_2. An illustration is given in Figure 6 during the computation of the vector v_3.

Figure 6: Double recursion during the computation of the Krylov basis on processes P_1, ..., P_4: the local partial sums σ_3^{(k)} = ||v_3^{(k)}||_2^2 + σ_3^{(k−1)} are accumulated along the pipeline, with σ_3 = sqrt(σ_3^{(4)}), overlapped with the communications for Ax, for M^{-1}, and for the consistency of the computed vector in the overlapped region.
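The shift computation can be sketched sequentially: one Arnoldi cycle yields a Hessenberg matrix whose eigenvalues are the Ritz values, which are then Leja-ordered (this sketch works on the complex values directly and omits the grouping of conjugate pairs described above).

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 30, 6
A = rng.standard_normal((n, n)) + n * np.eye(n)

# One Arnoldi cycle: A Q_m = Q_{m+1} H, with H upper Hessenberg.
Q = np.zeros((n, m + 1))
H = np.zeros((m + 1, m))
q = rng.standard_normal(n)
Q[:, 0] = q / np.linalg.norm(q)
for j in range(m):
    w = A @ Q[:, j]
    for i in range(j + 1):
        H[i, j] = Q[:, i] @ w
        w -= H[i, j] * Q[:, i]
    H[j + 1, j] = np.linalg.norm(w)
    Q[:, j + 1] = w / H[j + 1, j]
ritz = np.linalg.eigvals(H[:m, :m])     # approximate eigenvalues (shifts)

def leja_order(vals):
    """Greedy Leja ordering: start from the value of largest modulus,
    then repeatedly pick the value maximizing the product of distances
    to the values already chosen."""
    vals = list(vals)
    out = [vals.pop(int(np.argmax(np.abs(vals))))]
    while vals:
        scores = [np.prod(np.abs(v - np.array(out))) for v in vals]
        out.append(vals.pop(int(np.argmax(scores))))
    return np.array(out)

shifts = leja_order(ritz)
assert len(shifts) == m
assert np.isclose(abs(shifts[0]), np.abs(ritz).max())
```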
3 Parallel computation of local systems in subdomains
During each application of the operator M^{-1}, several linear systems A_k y^{(k)} = z^{(k)} should be solved for y^{(k)} with different values of z^{(k)} (see Algorithm 2, line 7). The matrices A_k, k = 1, ..., p, are sparse and nonsymmetric, as the global matrix A. Hence, during the whole iterative process and inside each subdomain, a sparse linear system with different right-hand sides is being solved. On today's supercomputers with SMP (Symmetric MultiProcessing) nodes, it is natural to introduce another level of parallelism within each subdomain. So, in this section, we first give the motivations for using this second level of parallelism. Then we discuss the appropriate solver to use in subdomains. Finally, we give the main steps of our parallel solver with these two levels of parallelism.
3.1 Motivations for two levels of parallelism

In domain decomposition methods, the classical approach is to assign one or several subdomains to a unique processor. In terms of numerical convergence and parallel performance, this approach has some limitations, as pointed out in [22, 24]:

- For some difficult problems, such as those arising from non-elliptic operators, the number of iterations tends to increase with the number of subdomains. It is therefore essential to keep this number of subdomains small in order to provide a robust iterative method. Moreover, with a fixed size m of the Krylov basis, a better efficiency is obtained with a small number of subdomains, as equation (13) suggests. By doing this, however, the subsystems are getting large and could not be solved efficiently with only one processor per subdomain. In this situation, the second level of parallelism improves the numerical convergence of the method while providing fast and effective solutions in subdomains.

- In a wide range of parallel supercomputers, each compute node is made of several processors accessing more or less the same global memory space (see Figure 7.a). Usually, this space is logically divided by the resource manager if only a subset of processors is scheduled. As a result, the allocated space could be insufficient to handle all the data associated to a subdomain; it is therefore necessary to schedule more processors in a node to access the amount of required memory. In this situation, with one level of parallelism, only one processor will be working (the processor P_0 for instance in Figure 7.b). However, if a new level of parallelism is introduced inside each scheduled node, all the processors can contribute to the treatment of the data stored in the memory of the node (Figure 7.c). Moreover, with this approach, data associated to a particular subdomain can be distributed on more than one SMP node; in this case, the new level of parallelism is defined across all the nodes responsible for a subdomain (Figure 7.d).

3.2 Local solver in subdomains: accuracy and parallelism issues
The sparse linear systems associated to subdomains are in the scope of the preconditioner operator M^{-1}. Therefore, the solutions of these systems can be obtained more or less accurately:

- An accurate solution, with a direct solver for instance, leads to a powerful preconditioner, thus accelerating the convergence of the global iterative method. This approach is usually known as a hybrid direct/iterative technique. The drawback here is the significant memory needed in each subdomain. Indeed, not only the local submatrix A_k should be stored, but also the L_k and U_k matrices from its LU factorization. With the presence of fill-in during the numerical factorization, the size of those factors can increase significantly compared to the size of the original submatrix.

- The problem of memory can be overcome by using an incomplete (ILU) factorization of the local matrices, thus producing fast solutions in subdomains with fair accuracy. In this case, the global iterative method is more likely to stagnate for some complex problems.

- An alternative could be to use an iterative method inside each subdomain. However, a large difference in the accuracy of local solutions could change the expression of the preconditioner between outer cycles of GMRES.

Figure 7: Distribution of subdomains on SMP nodes. (a): an SMP node (two sockets with L1/L2 caches and a shared memory system); (b): one subdomain per SMP node, only processor P_0 working; (c): one subdomain per SMP node, all processors working; (d): one subdomain spread over two SMP nodes.

So far, only the numerical accuracy and the memory issues have been discussed, but not the underlying parallel programming model. Beforehand, we assume that a parallel direct solver is used within each computing node when there is enough memory; otherwise, an incomplete solver is applied. On SMP nodes with processors sharing the same memory, solvers that distribute tasks among computing units by using a thread model are of natural use (Spooles [3], PaStiX [25], PARDISO [36]). On the other side, high-performance sparse solvers that use the message-passing model are intended primarily for distributed memory processors. Nevertheless, on SMP nodes, this approach can be used efficiently thanks to the availability of intranode communication channels for the underlying message passing interface (MPI) [12, 21]. These channels are appropriate for the large amount of communication introduced between processors during the solution in subdomains; for instance, the comparative study in [2, Sect. 5] gives the volume and size of these communications for some direct solvers. As the exchanged messages are of small size, a good data transfer can be obtained inside SMP nodes [11, 30]. Moreover, it is possible to split a subdomain over more than one SMP node. For all these reasons, we primarily use distributed direct solvers in subdomains for the second level of parallelism.
3.3 Main steps of the parallel solver
The solver is composed of the same steps whether a complete or an incomplete factorization is used for the subsystems:

1. Partitioning: The matrix A is first permuted into the block diagonal form by the host process (see Figure 1). After that, the local matrices A_k and C_k should be distributed to the other processes. If a distributed solver is used in subdomains, then the processors are dispatched in multiple communicators. For instance, Figure 8 shows a distribution of four SMP nodes with four processes around two levels of MPI communicators. The first level is intended for the communication across the subdomains. Hence, in this communicator, the submatrices are distributed to the level 0 processes (i.e. P_{k,0} where k = 0, ..., p−1 and p is the number of subdomains). Then, for each subdomain k, a second communicator is created between the level 1 processes to manage communications inside the subdomains (i.e. P_{k,j} where j = 0, ..., p_k−1 and p_k is the number of processes in the subdomain k).
Figure 8: MPI communicators for the two levels of parallelism (processes P_{k,j}: the level 0 communicator links the local roots P_{k,0} across the subdomains, and a level 1 communicator links the processes inside each subdomain).
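The bookkeeping behind Figure 8 amounts to an MPI_Comm_split with the subdomain index as color; a plain-Python sketch of the rank mapping (with mpi4py one would call comm.Split(color, key)):

```python
# Mapping of a world rank to the process grid P_{k,j} of Figure 8:
# the color k selects the subdomain (level 1 communicator), and the
# level 0 communicator groups the local roots P_{k,0}.
p, procs_per_dom = 4, 4

def grid_position(world_rank):
    k = world_rank // procs_per_dom    # subdomain index (color)
    j = world_rank % procs_per_dom     # rank inside the subdomain (key)
    return k, j

level0 = [r for r in range(p * procs_per_dom) if grid_position(r)[1] == 0]
assert grid_position(0) == (0, 0)
assert grid_position(7) == (1, 3)
assert level0 == [0, 4, 8, 12]
```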
2. Setup: In this phase, the symbolic and numerical factorizations are performed on the submatrices A_k by the underlying local solver. This step is purely parallel across all the subdomains. At the end of this phase, the factors L_k and U_k reside in the processors responsible for the subdomain k. This is totally managed by the underlying local solver. Prior to this phase, a preprocessing step can be performed to scale the elements of the matrix.

3. Solve: This is the iterative phase of the solver. With an initial guess x_0 and the shifts λ_j, the solver computes all the equations (5)-(10) as outlined here:

   (a) Pipeline computation of V_{m+1}
   (b) Parallel QR factorization V_{m+1} = Q_{m+1} R_{m+1}
   (c) Sequential computation of \bar{G}_m such that M^{-1}AV_m = Q_{m+1}\bar{G}_m on the process P_{p−1,0}
   (d) Sequential solution of the least-squares problem (10) for y_m on the process P_{p−1,0}
   (e) Broadcast of the vector y_m to the processes P_{k,0}
   (f) Parallel computation of x_m = x_0 + V_m y_m

   The convergence is reached when ||b − Ax_m|| < ε||b||; otherwise the process restarts with x_0 = x_m.
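Steps (a)-(f) can be condensed into a sequential numpy sketch of one restarted cycle (M = I and all shifts λ_j = 0, i.e. a monomial a-priori basis that is only safe for small m; numpy's qr and lstsq stand in for the parallel kernels):

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 40, 8
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)
x = np.zeros(n)

for cycle in range(2):
    r = b - A @ x
    beta = np.linalg.norm(r)
    V = np.zeros((n, m + 1))
    T = np.zeros((m + 1, m))
    V[:, 0] = r / beta
    for j in range(1, m + 1):        # (a) a-priori basis (pipelined in the solver)
        w = A @ V[:, j - 1]          # M = I, shift lambda_j = 0
        s = np.linalg.norm(w)
        V[:, j] = w / s
        T[j, j - 1] = s              # A V_m = V_{m+1} T_m
    Q, R = np.linalg.qr(V)           # (b) QR of the basis (parallel in the solver)
    G = R @ T                        # (c) Hessenberg matrix G_m = R_{m+1} T_m
    # (d) least squares: r = Q (beta * R[:, 0]), so minimize over y
    y = np.linalg.lstsq(G, beta * R[:, 0], rcond=None)[0]
    x = x + V[:, :m] @ y             # (e)-(f) correction with the unfactored V

rel_res = np.linalg.norm(b - A @ x) / np.linalg.norm(b)
assert rel_res < 1e-6
```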
When two levels of parallelism are used during the computation of V_m, level 1 processors perform multiple backward and forward sweeps in parallel to solve the local systems. After the parallel QR factorization in step 3b, the Q factor is never formed explicitly. Indeed, only the unfactored basis V is used to apply the new correction, as shown in step 3f. Nevertheless, a routine to form this factor is provided in the solver for general purposes.

4 Numerical experiments
In this section, we perform several experiments to demonstrate the parallel performance and the numerical robustness of our parallel solver on a wide range of real problems. In Section 4.1, we first give the software and the hardware architecture on which the tests are done. Sections 4.3 and 4.4 provide intensive results in terms of scalability and robustness on several test cases. Those cases, listed in Table 1, are taken from various sources, ranging from public repositories to industrial software.
4.1 Software and hardware framework
The solver is named GPREMS¹ (GMRES PREconditioned by Multiplicative Schwarz). It is intended to be deployed on distributed memory computers that communicate through message passing (MPI). The parallelism in subdomains is based either on the message passing or the threads model, depending on the underlying solver. The whole library is built on top of PETSc (Portable, Extensible Toolkit for Scientific Computation) [8, 9]. The motivation for this choice is the uniform access to a wide range of external packages for the linear systems in subdomains. Moreover, the PETSc package provides several optimized functions and data structures to manipulate the distributed matrices and vectors. Figure 9 gives the position of GPREMS in the abstraction layer of PETSc. We give only the components that are used.

Figure 9: GPREMS library in PETSc (application codes on top of GPREMS; the PETSc KSP Krylov subspace methods, PC preconditioners, index sets, vectors and matrices; external packages such as SuperLU_DIST, MUMPS, HYPRE and PaStiX; BLAS, LAPACK, ScaLAPACK, BLACS and MPI at the lowest level of abstraction).

¹ GPREMS uses the PETSc environment, but it is not distributed as a part of the PETSc libraries. Nevertheless, the library is configured easily once a usable version of PETSc is available on the targeted architecture.
All the tests in this paper are performed on IBM p575 SMP nodes connected through an Infiniband DDR network. Each node is composed of 32 Power6 processors sharing the same L3 cache. A Power6 processor is a dual-core 2-way SMT with a peak frequency of 4.7 GHz. Figure 10 (obtained with hwloc [10]) shows the topology of the first eight processors in one compute node. A total of 128 nodes is available in this supercomputer (named Vargas²), which is part of the French CNRS-IDRIS computing facilities.
Figure 10: Vargas architecture (one compute node: System 128 GB; shared L3 32 MB; four L2 caches of 4 MB each; cores #0 to #7; hardware threads P#0 to P#15)
4.2 Test matrices

This section gives the main characteristics of the set of problems under study.
The first set is taken from the University of Florida (UFL) Sparse Matrix Collection [15]. In this set, we choose the problems MEMCHIP and PARA-4, respectively from the Freescale and Schenk_ISEI directories; their main characteristics are given in Table 1. The MEMCHIP
Table 1: General properties of the four test matrices

Matrix name        MEMCHIP              PARA_4           CASE_004   CASE_017
Order              2,707,524            153,226          7,980      381,689
No. entries (nnz)  14,810,202           2,930,882        430,909    37,464,962
Entries per row    5                    19               53         98
Pattern symmetry   91%                  100%             88%        93%
2D/3D problem      No                   Yes              2D         3D
Source             UFL (K. Gullapalli)  UFL (O. Schenk)  FLUOREM    FLUOREM
problem arises from the circuit simulation field. The density of the associated matrix is rather small: the number of entries is about 1.4 · 10^7, but the number of nonzero elements per row reveals a rather small density. Hence, this problem is easily solved with GPREMS using an incomplete factorization in subdomains. We present the results for this case in section 4.3. The next matrix is PARA-4, coming from 2D semiconductor device simulations. Compared to the previous problem, the geometry of this problem makes it difficult to solve. Hence, we choose this matrix to show the robustness of our solver when it is combined with a direct solver in subdomains. Again, the results for this case are shown in section 4.3.
The second set contains two test matrices (CASE_004 and CASE_017) provided by FLUOREM [20], a fluid dynamics software publisher. In those cases, the sparse matrix corresponds to the global Jacobian matrix resulting from the partial first-order derivatives of the discrete steady Euler or Reynolds-averaged Navier-Stokes equations. The derivatives are done with respect to the seven conservative fluid variables (mass: ρ, momentum: [ρu, ρv, ρw], energy: ρE, turbulence: [ρk, ρω] with the k−ω model); the steady equation can be roughly written as F(Q) = 0, where Q denotes the fluid variables and F(Q) the divergence of the flux function. If Q_h and F_h are the finite volume discretizations over the computational domain of Q and F respectively, then the matrix A of the linear system (1) is the global Jacobian matrix J_{F_h}(Q_h). A ∈ R^(n×n) is non-symmetric and indefinite, and when the stationary flow is dominated by advection, it corresponds to a strongly non-elliptic operator. The two-level parallelism is then useful in those cases in that we can increase the number of processors to speed up the overall execution time while guaranteeing that the number of iterations remains almost constant. Section 4.4 gives the results with the multilevel approach on the CASE_017 case. The matrix CASE_004 is used to show the numerical robustness of GPREMS compared to the additive Schwarz approach implemented in PETSc.
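For illustration only (FLUOREM forms these derivatives analytically), a Jacobian of a discrete residual F can be assembled column by column with forward differences; the toy 1D upwind-type residual below is a hypothetical stand-in for F_h, not the Euler or Navier-Stokes operator of the paper:

```python
# Hedged sketch: finite-difference assembly of J_F(Q), column by column.
# flux_residual is an illustrative 1D upwind-type residual with a quadratic
# source term; it is NOT the FLUOREM operator, only a stand-in.

def flux_residual(q):
    """Toy discrete residual F(Q): first-order upwind transport + source."""
    n = len(q)
    return [q[i] - (q[i - 1] if i > 0 else 1.0) + 0.1 * q[i] ** 2
            for i in range(n)]

def jacobian(F, q, eps=1e-7):
    """Dense forward-difference Jacobian J[i][j] = dF_i/dQ_j."""
    n = len(q)
    f0 = F(q)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        qp = list(q)
        qp[j] += eps                 # perturb one unknown at a time
        fj = F(qp)
        for i in range(n):
            J[i][j] = (fj[i] - f0[i]) / eps
    return J

q = [1.0, 0.5, 0.25, 0.125]
A = jacobian(flux_residual, q)       # lower-bidiagonal, non-symmetric
```

The resulting matrix is non-symmetric (each row couples only to its upstream neighbour), which mirrors the advection-dominated structure described above.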
4.3 Convergence and scalability studies on the MEMCHIP and PARA-4 cases
We discuss here the results of the numerical convergence of GPREMS on the two problems MEMCHIP and PARA-4. We use respectively m = 12 and m = 24 as the size of the Newton basis (search directions) for the two cases. Note that the iterative process stops when the relative residual norm ||b − Ax||/||b|| is less than 10^-8 or when 1500 iterations are reached. The total number of iterations is usually a multiple of m, as the Newton basis should be generated a priori; hence the convergence is tested only at each external cycle. ILU(p)
denotes the use of an incomplete factorization in subdomains with p levels of fill-in in the factored matrix. The package used to perform this factorization is EUCLID [26] from the HYPRE suite [19]. Some other problems, like PARA-4, are more difficult to solve; hence, the use of a hybrid direct-iterative solver is necessary to achieve good convergence in a reasonable CPU time. The MUMPS [1] package is then used at this point. In all cases so far, the relative residual norm of the computed solution is given with respect to the number of iterations. We compare the convergence history on 4, 8 and 16 subdomains. Following the convergence history, we give the CPU time spent in the various phases: the initialization phase, the setup phase and the iterative phase.

For the MEMCHIP case with about
1.5 · 10^7 entries, it takes less than 25 cycles to converge, as shown in Figure 11.(a). Moreover, the number of iterations tends to decrease with 16 subdomains. The ILU(0) factorization is used in subdomains, meaning that no new elements are allowed during the numerical factorization and thus no extra space is needed. This has the effect of speeding up the setup phase and the application of the M^-1 operator in the iterative step. The CPU times of these steps are shown in Table 2, along with the initialization time (Init) and the total time. These results prove the scalability of almost all the steps in GPREMS. Apart from the initialization step, which is sequential, all other steps benefit from the first level of parallelism in the method. The setup phase is fully parallel across the subdomains: each process responsible for a subdomain performs a sequential factorization on its local matrix. As the size of those submatrices decreases with the number of subdomains, the setup time decreases as well. The iterative part of the solver (given by Iter. loop time in Table 2) scales very well in this case. For instance, 300 iterations of GMRES are performed in less than 103 s. when using 4 subdomains in the block-diagonal partitioning. With 8 subdomains, the method requires 312 iterations (26 restarts or cycles) computed in less than 65 s. When the problem is subdivided into 16 subdomains, the method reveals a superlinear speedup, thanks to the rather small number of iterations needed to converge. This is a particular case, probably due to the unpredictable properties of the ILU(0) factorization; this particular case on 16 subdomains does not appear when ILU(1) is used; see Figure 11.(b).
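The cycle structure described above, with convergence tested only every m steps, can be sketched as follows; the inner loop here is plain Richardson iteration standing in for the preconditioned Newton-basis GMRES cycle of the paper, and the matrix, relaxation parameter and tolerances are illustrative only:

```python
# Hedged sketch of the outer driver: the test ||b - A x|| / ||b|| < tol is
# applied only after each cycle of m inner steps, so the iteration count is
# always a multiple of m, as with the a priori generated Newton basis.
# The inner Richardson step is a stand-in, NOT the paper's GMRES cycle.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def norm(v):
    return sum(vi * vi for vi in v) ** 0.5

def solve_by_cycles(A, b, m=12, tol=1e-8, max_it=1500, omega=0.5):
    x = [0.0] * len(b)
    nb = norm(b)
    it = 0
    while it < max_it:
        for _ in range(m):                       # one cycle: m inner steps
            r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
            x = [xi + omega * ri for xi, ri in zip(x, r)]
            it += 1
        r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
        if norm(r) / nb < tol:                   # tested per cycle only
            break
    return x, it

A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
b = [1.0, 0.0, 1.0]                              # exact solution (1, 1, 1)
x, it = solve_by_cycles(A, b, m=12)
```

On this toy system the driver stops at the first cycle boundary after the relative residual drops below 10^-8, so the reported iteration count is a multiple of m.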
Figure 11: Convergence history with the MEMCHIP case on 4, 8 and 16 subdomains ((a): GPREMS(12)+ILU0, relative residual norm vs. iterations; (b): GPREMS(12)+ILU1, preconditioned residual norm vs. iterations; curves for D = 4, 8, 12, 16, 20)
Table 2: MEMCHIP: CPU time for various numbers of subdomains; ILU(0) is used in subdomains

#Processors  #Subdomains  Iterations  Init   Setup  Iter. loop  Time/Iter  Total
4            4            48          26.38  3.39   33.2        0.69       62.97
8            8            72          27.4   2.03   25.98       0.36       55.41
16           16           60          30.41  1      17.59       0.29       49
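For reference, ILU(0) as used in the runs above allows no fill-in: L and U keep exactly the nonzero pattern of A. A minimal dense-storage sketch (EUCLID of course works on true sparse structures):

```python
# Hedged sketch of ILU(0): incomplete LU with zero fill-in. The factors are
# packed into one array LU (L unit lower triangular, U upper triangular);
# any update that would create an entry outside the pattern of A is dropped.
# Dense storage is used only for clarity.

def ilu0(A):
    n = len(A)
    LU = [row[:] for row in A]
    pattern = [[A[i][j] != 0.0 for j in range(n)] for i in range(n)]
    for i in range(1, n):
        for k in range(i):
            if not pattern[i][k]:
                continue                 # multiplier outside pattern: skip
            LU[i][k] /= LU[k][k]
            for j in range(k + 1, n):
                if pattern[i][j]:        # keep only positions already in A
                    LU[i][j] -= LU[i][k] * LU[k][j]
    return LU

A = [[ 4.0, -1.0,  0.0, -1.0],
     [-1.0,  4.0, -1.0,  0.0],
     [ 0.0, -1.0,  4.0, -1.0],
     [-1.0,  0.0, -1.0,  4.0]]
LU = ilu0(A)
```

Positions such as (1,3), zero in A, stay zero in the factors; an ILU(p) with p > 0 would progressively admit such fill entries, at the price of the larger setup times reported below for ILU(2).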
Obviously, when the subsystems are solved approximately, the global method loses some robustness. Thus, for many 2D/3D cases, the number of iterations gets very large and, consequently, the overall time needed to converge is too high. We illustrate this behaviour with the PARA-4 2D case. An incomplete LU factorization (ILU(2)) is first performed on each submatrix to solve the systems induced by the preconditioner operator. The ILU(2) factorization keeps the second-order fill-ins in the factors, leading to a more accurate preconditioner than that with ILU(0). However, this is not good enough to achieve a good convergence rate as in the previous case. For instance, in Figure 12.(a), after reaching 10^-7, the residual norm tends to stagnate. This can be observed with 4, 8 or 16 subdomains. On the other hand, it takes a certain time to set up the block matrices for the iterative phase, as shown in Table 3. It can be seen that the maximum time over all processors (Setup) to get the incomplete factorization of the local matrices is twice the time of the iterative loop.
Now, if a direct solver is used within each subdomain, the method gets a better convergence thanks to the accuracy achieved during the application of the preconditioner. This behaviour is observed in Figure 12.(b), in which we show the convergence history for the numbers of subdomains under study. The number of iterations grows slightly when we change from 4 to 8 subdomains. This does not impact the overall time spent in the iterative step, as shown in the second part of Table 3. Indeed, this step reveals a very good speedup, as we only need 35 s. with 8 subdomains compared to 45 s. for 4 subdomains. With 16 processors (subdomains), the total time continues to decrease, even with the slight increase of the global number of iterations. It is also worth pointing out the time spent in the setup phase. It takes roughly 5 s. on four processors to perform the symbolic and the numerical factorization of the submatrices. Note that each process holds a local matrix
Figure 12: Convergence history with the PARA-4 case on 4, 8 and 16 subdomains ((a): GPREMS(24)+ILU(2); (b): GPREMS(24)+LU; preconditioned residual norm vs. iterations, curves for D = 4, 8, 16)
associated to one subdomain. Hence, the setup time is the maximum time over all active processors. When the matrix is split into 16 subdomains, the size of these submatrices is rather small and the setup time drops to 1 s. This is almost 400 times smaller than with the ILU(2) factorization. The main reason is the algorithm used by the two methods. In MUMPS, almost all the operations are done on block matrices (the so-called frontal matrices) via calls to level 2 and level 3 BLAS functions. Therefore, the different levels of the cache memory are used efficiently during the numerical factorization, which is the most time-consuming step. Moreover, a multithreaded BLAS is linked to MUMPS to further speed up the method on SMP nodes. For this reason, and given the rather small size of the submatrices, the second level of parallelism is not applied up to this point.
Table 3: PARA-4: CPU time of GPREMS(24) for various numbers of subdomains and various types of solvers in subdomains

Incomplete solver in subdomains (ILU(2) - EUCLID)
#D  #Proc.  Iter.  Init  Setup   Iter. loop  Time/Iter  Total    s_p   e_p
4   4       1008   1.33  418.17  654.45      0.65       1073.95  -     -
8   8       1008   1.31  409.68  599.64      0.59       1010.63  1.06  0.53
16  16      1008   1.51  398.61  542.75      0.54       942.87   1.14  0.28

Direct solver in subdomains (LU - MUMPS)
#D  #Proc.  Iter.  Init  Setup   Iter. loop  Time/Iter  Total    s_p   e_p
4   4       144    1.32  5.12    45.43       0.32       51.87    -     -
8   8       216    1.26  1.67    35.06       0.16       37.98    1.37  0.68
16  16      240    1.37  0.73    29.21       0.12       31.31    1.66  0.41
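The derived columns of Table 3 (speedup relative to the four-processor run, s_p = T_4/T_p, and efficiency e_p = 4 · s_p/p) can be recomputed from the Total column; a minimal sketch, with the times hard-coded from the direct-solver (LU - MUMPS) rows:

```python
# Hedged check of Table 3's derived columns: s_p = T_4 / T_p relative to the
# 4-processor baseline, and e_p = 4 * s_p / p.

def speedup_efficiency(total_times, base_procs=4):
    """total_times maps processor count p -> total CPU time T_p (seconds)."""
    t_base = total_times[base_procs]
    return {p: (t_base / t, base_procs * (t_base / t) / p)
            for p, t in total_times.items()}

# Total times (s.) from Table 3, direct solver (LU - MUMPS) in subdomains.
derived = speedup_efficiency({4: 51.87, 8: 37.98, 16: 31.31})
```

Rounded to two digits, this reproduces the s_p and e_p entries of the table (1.37/0.68 on 8 processors, 1.66/0.41 on 16).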
In Table 3, we also give the speedup and the efficiency of the method relative to four processors: s_p = T_4/T_p and e_p = 4 · s_p/p. We observe that the efficiency decreases very fast with the number of subdomains. With ILU(2) in subdomains, for instance, the efficiencies on 8 and 16 subdomains are respectively 0.53 and 0.28. This behaviour can be explained by the formula in Equation (13), as p should be kept small relative to m. This is one of the reasons to use the second level of parallelism inside the subdomains.

4.4 Benefits of the two-level parallelism
Now, let us consider the FLUOREM problem CASE_017 listed in Table 1. The geometry of this 3D case is a jet engine compressor. Although the turbulent variables are frozen
study involving some distributed linear solvers [28]. From this study, solvers that combine hybrid direct-iterative approaches, like GPREMS, provide an efficient way to deal with such problems. However, the number of subdomains should be kept small to provide fast convergence. Therefore, the two levels of parallelism introduced in section 3.2 provide a way to do this efficiently, even with a large number of processors. In Table 4, we report the benefit of using this approach on the above-named 3D case. We keep 4 and 8 subdomains and we increase the total number of processors. As a result, almost all the steps in GPREMS get a noticeable speedup. Typically, with 4 subdomains, when only one processor is active per subdomain, the time to set up the block matrices is almost 203 s. and the time spent in the iterative loop is 622 s. Moving to 8 active processors per subdomain decreases these times to 59 s. and 181 s. respectively. The same observation can be made when using 8 subdomains, in which case the overall time drops from 540 s. to 238 s. Note that the overall time includes the first sequential step that permutes the matrix into block-diagonal form and distributes the block matrices to all active processors.
The last two columns of Table 4 give the intranode speedup and efficiency of GPREMS. These are different from the ones computed with one level of parallelism. Clearly, we want to show the benefit of adding more processors in a subdomain. Hence, if d is the number of subdomains, s_p = T_d/T_p and e_p = d · s_p/p, where T_p is the CPU time on p processors. When we add more processors in each subdomain, we observe a noticeable speedup; with 8 subdomains, for instance, the speedup moves from 1.18 to 2.28 when 1 and 8 processes are used within each subdomain. However, a look at the efficiency reveals a quite poor scalability. This is a general observation when the tests are done on multicore nodes, where the compute units share the same memory bandwidth; indeed, the speed of sparse matrix computations is determined by the size of this memory bandwidth rather than by the clock rate of the CPUs. Nevertheless, the total CPU time to get the solution always suggests using all the available processors/cores in the compute node; note that these processors in the node would be idle otherwise, as they are scheduled to access more memory.

Table 4: Benefits of the two levels of parallelism for various phases of GPREMS on CASE_017, with a restart at 64 and the MUMPS direct solver in subdomains
#D  #Proc.  Iter.  Init   Setup   Iter. loop  Time/Iter  Total   s_p   e_p
4   4       128    70.78  203.67  622.94      4.87       897.39  -     -
4   8       128    70.77  125.77  411.42      3.21       607.96  1.48  0.74
4   16      128    69.79  73.15   280.41      2.19       423.35  2.12  0.53
4   32      128    70.77  59.44   181.45      1.42       311.66  2.88  0.36
8   8       256    69.91  71.73   399.34      1.56       540.98  -     -
8   16      256    70.36  42.99   343.25      1.34       456.59  1.18  0.59
8   32      256    71.61  28.72   206.7       0.81       307.02  1.76  0.44
8   64      256    71.85  20.85   144.79      0.57       237.49  2.28  0.28
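The intranode columns of Table 4 follow the d-relative definitions s_p = T_d/T_p and e_p = d · s_p/p; a quick check against the d = 8 rows, with the totals hard-coded from the table:

```python
# Hedged check of Table 4's intranode columns: with the number of subdomains
# d fixed, the baseline is the run with one processor per subdomain (p = d).

d = 8
totals = {8: 540.98, 16: 456.59, 32: 307.02, 64: 237.49}  # Table 4, d = 8
s = {p: totals[d] / t for p, t in totals.items()}          # s_p = T_d / T_p
e = {p: d * s[p] / p for p in totals}                      # e_p = d * s_p / p
```

Rounded to two digits this gives the reported 1.18/0.59 on 16 processors and 2.28/0.28 on 64, showing the speedup growing while the efficiency decays on the shared-bandwidth node.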
4.5 Numerical robustness of GPREMS
In this section, our aim is to show the robustness of the GPREMS solver on CASE_004 listed in Table 1. The matrix of this 2D test case corresponds to the global Jacobian of the discrete steady Euler equations [20]. The two turbulent variables are kept during the differentiation, predicting a difficult convergence for any iterative method. The reciprocal condition number is estimated to be 3.27 · 10^-6. The matrix is permuted into 4 block diagonal matrices, producing local matrices with approximately 2,000 rows/columns and 120,000 nonzero entries. For these relatively small block matrices, we choose an incomplete local solver in subdomains when applying the M^-1 operator. We also allow 5 degrees of fill-in (ILU(5)) during the incomplete numerical factorization. Figure 13 gives the history of the preconditioned residual norm at each restart for 16 and 32 Krylov vectors. This norm drops to 10^-11 in less than 200 iterations for GPREMS(16) with ILU(5) in subdomains. When the size of the Krylov subspace is increased to 32, the residual norm reaches 10^-9 in less than 100 iterations. This is almost the same behaviour when a direct solver is used within the subdomains, in which case the overall number of iterations is 112 and 96, respectively, for 16 and 32 Krylov vectors.
Figure 13: The multiplicative Schwarz approach (GPREMS) compared to the restricted additive Schwarz (ASM) on CASE_004; 16 and 32 vectors in the Krylov subspace at each restart; 4 subdomains (block diagonal partitioning in GPREMS and ParMETIS for ASM); local systems solved with different accuracies (ILU(5) and LU). Curves (preconditioned residual norm vs. iterations): GPREMS(16)+ILU(5), GPREMS(32)+ILU(5), GPREMS(16)+LU, GPREMS(32)+LU, ASM+ILU(5)+GMRES(16), ASM+ILU(5)+GMRES(32), ASM+LU+GMRES(16), ASM+LU+GMRES(32)
Then we compare with the restricted additive Schwarz method used as a preconditioner for GMRES. The tests are done with the implementation provided in PETSc release 3.0.0-p2. Here, the Krylov basis is built with the modified Gram-Schmidt (MGS) method and the input matrix is partitioned into 4 subdomains using ParMETIS. In Figure 13, we again depict the preconditioned residual norm with respect to the number of iterations. The curves are given in magenta with the same line specifications as for GPREMS. The main observation is that the iterative method stagnates from the first restart when using ILU(5) in subdomains (the magenta plain and dashed curves). With a direct solver, it is necessary to form 32 Krylov vectors at each restart in order to achieve convergence (as shown by the magenta dash-dotted curve).
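The qualitative gap between the two preconditioners can be sketched in a few lines: additive Schwarz applies all local solves to the same residual and sums the corrections, while multiplicative Schwarz refreshes the residual after each local solve. The toy example below uses exact local solves on non-overlapping diagonal blocks (GPREMS and PETSc's ASM use overlap), so it only illustrates the ordering effect; for a block lower-triangular matrix, one multiplicative sweep is already exact while the additive correction is not.

```python
# Hedged sketch: one application of z = M^-1 b for both preconditioners on a
# 4x4 matrix split into two 2x2 subdomains. The diagonal blocks are diagonal
# matrices so each local solve stays one line; this is illustration only.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def solve_block(diag, rhs):
    """Exact local solve; blocks here are diagonal, so it is elementwise."""
    return [r / d for d, r in zip(diag, rhs)]

A = [[2.0, 0.0, 0.0, 0.0],
     [0.0, 2.0, 0.0, 0.0],
     [1.0, 0.0, 3.0, 0.0],
     [0.0, 1.0, 0.0, 3.0]]      # block lower triangular
b = [2.0, 4.0, 7.0, 10.0]
blocks = [(0, 2), (2, 4)]
diag = [A[i][i] for i in range(4)]

# Additive Schwarz: every subdomain sees the same residual b.
z_add = [0.0] * 4
for lo, hi in blocks:
    z_add[lo:hi] = solve_block(diag[lo:hi], b[lo:hi])

# Multiplicative Schwarz: the residual is refreshed after each local solve.
z_mul = [0.0] * 4
for lo, hi in blocks:
    r = [bi - axi for bi, axi in zip(b, matvec(A, z_mul))]
    z_mul[lo:hi] = solve_block(diag[lo:hi], r[lo:hi])
```

Here z_mul solves A z = b exactly after a single sweep, whereas z_add leaves a nonzero residual on the second block; this ordering effect is consistent with the convergence gap between GPREMS and ASM observed in Figure 13, although the real methods also differ in overlap and partitioning.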
5 Conclusion

In this paper, we give an implementation of a parallel solver for the solution of linear systems on distributed-memory computers. This implementation is based on a hybrid technique that combines a multiplicative Schwarz preconditioner with a GMRES accelerator. A very good efficiency is obtained using a Newton-basis GMRES version; hence we define two levels of parallelism during the computation of the orthonormal basis needed by GMRES: the first level is expressed through pipelined operations across all the subdomains; the second level uses parallelism inside third-party solvers to build the solution of the subsystems induced by the domain decomposition. The numerical experiments demonstrate the efficiency of this approach on a wide range of test cases. Indeed, with the first level of parallelism, we show a good efficiency when using either a direct solver or an incomplete LU factorization. For the second level, used within the subdomains, the gain obtained is two-fold: the convergence is guaranteed when the number of processors grows as the number of subdomains remains the same, and the global efficiency increases as we add more processors in subdomains. Moreover, on some difficult problems, the use of a direct solver in subdomains implies accessing the whole memory of an SMP node. Therefore, this approach keeps all the processors of the node busy and enables a good usage of the allocated computing resources. Finally, on some test cases we show that this solver can be competitive with respect to other overlapping domain decomposition preconditioners.

However, more work needs to be done to achieve very good scalability on massively parallel computers. Presently, an attempt to increase the number of subdomains up to 32 increases the number of iterations as well. The main reason is the form of the multiplicative Schwarz method, as each correction of the residual vector should go through all the subdomains. A first attempt is to use more than two levels of splitting, but the efficiency of such an approach is not always guaranteed, specifically if the linear system arises from non-elliptic partial differential equations. An ongoing work is to keep the two levels of splitting for the preconditioner operator and then to further accelerate the GMRES method with spectral information gathered during the iterative process.
References
[1] Amestoy, P. R., Duff, I. S., L'Excellent, J.-Y., and Koster, J. A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM Journal on Matrix Analysis and Applications 23, 1 (2001), 15-41.
[2] Amestoy, P. R., Duff, I. S., L'Excellent, J.-Y., and Li, X. S. Analysis and comparison of two general sparse solvers for distributed memory computers. ACM Transactions on Mathematical Software 27 (2000), 2001.
[3] Ashcraft, C., and Grimes, R. SPOOLES: An object-oriented sparse matrix library. In Proceedings of the 9th SIAM Conference on Parallel Processing for Scientific Computing (1999).
[4] Atenekeng-Kahou, G.-A. Parallélisation de GMRES préconditionné par une itération de Schwarz multiplicatif. PhD thesis, University of Rennes 1 and University of Yaounde 1, 2008.
[5] Atenekeng-Kahou, G.-A., Grigori, L., and Sosonkina, M. A partitioning algorithm for block-diagonal matrices with overlap. Parallel Computing 34, 6-8 (2008), 332-344.
[6] Atenekeng-Kahou, G.-A., Kamgnia, E., and Philippe, B. An explicit formulation of the multiplicative Schwarz preconditioner. Applied Numerical Mathematics 57, 11-12 (2007), 1197-1213.
[7] Bai, Z., Hu, D., and Reichel, L. A Newton basis GMRES implementation. IMA Journal of Numerical Analysis 14, 4 (1994), 563-581.
[8] Balay, S., Buschelman, K., Eijkhout, V., Gropp, W. D., Kaushik, D., Knepley, M. G., McInnes, L. C., Smith, B. F., and Zhang, H. PETSc users manual. Tech. Rep. ANL-95/11-Revision 3.0.0, Argonne National Laboratory, 2008.
[9] Balay, S., Buschelman, K., Gropp, W. D., Kaushik, D., Knepley, M. G., McInnes, L. C., Smith, B. F., and Zhang, H. PETSc Web page, 2009. http://www.mcs.anl.gov/petsc.
[10] Broquedis, F., Clet-Ortega, J., Moreaud, S., Furmento, N., Goglin, B., Mercier, G., Thibault, S., and Namyst, R. hwloc: a generic framework for managing hardware affinities in HPC applications. In Proceedings of the 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP2010), Pisa, Italia (2010).
[11] Buntinas, D., and Mercier, G. Data transfers between processes in an SMP system: performance study and application to MPI. In Proceedings of the 2006 International Conference on Parallel Processing (ICPP 2006), IEEE Computer Society (2006), pp. 487-496.
[12] Buntinas, D., Mercier, G., and Gropp, W. Implementation and evaluation of shared-memory communication and synchronization operations in MPICH2 using the Nemesis communication subsystem. Parallel Computing 33, 9 (2007), 634-644. Selected Papers from EuroPVM/MPI 2006.
[13] Cai, X.-C., and Saad, Y. Overlapping domain decomposition algorithms for general sparse matrices. Numerical Linear Algebra with Applications 3, 3 (1996), 221-237.
[14] Carvalho, L., Giraud, L., and Meurant, G. Local preconditioners for two-level non-overlapping domain decomposition methods. Numerical Linear Algebra with Applications 8, 4 (2001), 207-227.
[15] Davis, T. The University of Florida sparse matrix collection, 2010. Submitted to ACM Transactions on Mathematical Software.
[16] de Sturler, E. A parallel variant of GMRES(m). In Proceedings of the 13th IMACS World Congress on Computation and Applied Mathematics, Dublin, Ireland (1991), J. J. H. Miller and R. Vichnevetsky, Eds., Criterion Press.
[17] Demmel, J., Grigori, L., Hoemmen, M., and Langou, J. Communication-optimal parallel and sequential QR and LU factorizations. Tech. Rep. UCB/EECS-2008-89, University of California EECS, May 2008.
[18] Erhel, J. A parallel GMRES version for general sparse matrices. Electronic Transactions on Numerical Analysis 3 (1995), 160-176.
[19] Falgout, R., and Yang, U. hypre: a library of high performance preconditioners. In Lecture Notes in Computer Science (2002), vol. 2331, Springer-Verlag, pp. 632-641.
[20] FLUOREM. http://www.fluorem.com.
[21] Geoffray, P., Prylli, L., and Tourancheau, B. BIP-SMP: High performance message passing over a cluster of commodity SMPs. In Proceedings of the 1999 ACM/IEEE Conference on Supercomputing (1999).
[22] Giraud, L., Haidar, A., and Pralet, S. Using multiple levels of parallelism to enhance the performance of domain decomposition solvers. Parallel Computing, In Press (2010).
[23] Giraud, L., Haidar, A., and Watson, L. Parallel scalability study of hybrid preconditioners in three dimensions. Parallel Computing 34, 6-8 (2008), 363-379.
[24] Haidar, A. On the parallel scalability of hybrid linear solvers for large 3D problems. PhD thesis, Institut National Polytechnique de Toulouse, 2008.
[25] Hénon, P., Pellegrini, F., Ramet, P., Roman, J., and Saad, Y. High performance complete and incomplete factorizations for very large sparse systems by using Scotch and PaStiX softwares. In Eleventh SIAM Conference on Parallel Processing for Scientific Computing (2004).
[26] Hysom, D., and Pothen, A. A scalable parallel algorithm for incomplete factor preconditioning. SIAM Journal on Scientific Computing 22 (2000), 2194-2215.
[27] Mohiyuddin, M., Hoemmen, M., Demmel, J., and Yelick, K. Minimizing communication in sparse matrix solvers. In SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (New York, NY, USA, 2009), ACM, pp. 1-12.
[28] Nuentsa Wakam, D., Erhel, J., Canot, É., and Atenekeng Kahou, G.-A. A comparative study of some distributed linear solvers on systems arising from fluid dynamics simulations. In Parallel Computing: From Multicores and GPU's to Petascale (2010), vol. 19 of Advances in Parallel Computing, IOS Press, pp. 51-58.
[29] Philippe, B., and Reichel, L. On the generation of Krylov subspace bases. Tech. Rep. RR-7099, INRIA, 2009.
[30] Rabenseifner, R., Hager, G., and Jost, G. Hybrid MPI/OpenMP parallel programming on clusters of multi-core SMP nodes. In Parallel, Distributed, and Network-Based Processing, Euromicro Conference on (2009), 427-436.
[31] Reichel, L. Newton interpolation at Leja points. BIT 30 (1990), 305-331.
[32] Saad, Y. Iterative Methods for Sparse Linear Systems, second ed. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.
[33] Saad, Y., and Schultz, M. H. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing 7, 3 (1986), 856-869.
[34] Saad, Y., and Sosonkina, M. Distributed Schur complement techniques for general sparse linear systems. SIAM Journal on Scientific Computing 21, 4 (1999), 1337-1356.
[35] Sameh, A. Solving the linear least squares problem on a linear array of processors. In High Speed Computer and Algorithm Organization (1977), D. L. D. Kuck and A. Sameh, Eds., Academic Press, pp. 207-228.
[36] Schenk, O., and Gärtner, K. Solving unsymmetric sparse systems of linear equations with PARDISO. Journal of Future Generation Computer Systems 20 (2004), 475-487.
[37] Sidje, R. B. Alternatives for parallel Krylov subspace basis computation. Numerical Linear Algebra with Applications 4, 4 (1997), 305-331.
[38] Smith, B., Bjørstad, P., and Gropp, W. Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, 1996.
[39] Souopgui, I. Parallélisation d'une double récurrence dans GMRES. Master's thesis, University of Yaounde 1, 2005.
[40] Walker, H. F. Implementation of the GMRES method using Householder transformations. SIAM Journal on Scientific and Statistical Computing 9, 1 (1988), 152-163.