HAL Id: inria-00508277
https://hal.inria.fr/inria-00508277v2
Submitted on 23 Sep 2010
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Parallel GMRES with a multiplicative Schwarz preconditioner
Désiré Nuentsa Wakam, Guy-Antoine Atenekeng Kahou

To cite this version:
Désiré Nuentsa Wakam, Guy-Antoine Atenekeng Kahou. Parallel GMRES with a multiplicative Schwarz preconditioner. [Research Report] RR-7342, INRIA. 2010. inria-00508277v2
ISSN 0249-6399
ISRN INRIA/RR--7342--FR+ENG
Observation and Modeling for Environmental Sciences

Parallel GMRES with a multiplicative Schwarz preconditioner

Désiré Nuentsa Wakam*, Guy-Antoine Atenekeng Kahou†

N° 7342, version 2

Centre de recherche INRIA Rennes – Bretagne Atlantique

Theme: Observation and Modeling for Environmental Sciences
Équipes-Projets Sage

Research report n° 7342, version 2 (initial version July 2010, revised version September 2010), 24 pages
Abstract: In this paper, we present a hybrid solver for linear systems that combines a Krylov subspace method as accelerator with an overlapping domain decomposition method as preconditioner. The preconditioner uses an explicit formulation associated to one iteration of the classical multiplicative Schwarz method. To avoid communications and synchronizations between subdomains, the Newton-basis GMRES implementation is used as accelerator. Thus, it is necessary to divide the computation of the orthonormal basis into two steps: the preconditioned Newton basis is computed, then it is orthogonalized. The first step is merely a sequence of matrix-vector products and solutions of linear systems associated to subdomains; we describe the fine-grained parallelism that is used in these kernel operations. The second step uses a parallel implementation of dense QR factorization on the resulting basis. At each application of the preconditioner operator, local systems associated to the subdomains are solved with some accuracy depending on the global physical problem. We show that this step can be further parallelized with calls to external third-party solvers. To this end, we define two levels of parallelism in the solver: the first level is intended for the computation and the communication across all the subdomains; the second level of parallelism is used inside each subdomain to solve the smaller linear systems induced by the preconditioner.

Numerical experiments are performed on several problems to demonstrate the benefits of such approaches, mainly in terms of global efficiency and numerical robustness.

Key-words: domain decomposition, preconditioning, multiplicative Schwarz, parallel GMRES, Newton basis, multilevel parallelism
* INRIA, Campus universitaire de Beaulieu, 35042 Rennes Cedex. E-mail: desire.nuentsa_wakam@irisa.fr
†
Résumé: In this paper, we present a hybrid solver for the solution of linear systems on parallel architectures. The solver combines a Krylov subspace accelerator with a preconditioner based on domain decomposition. The preconditioner is defined from an explicit formulation corresponding to one iteration of multiplicative Schwarz. The accelerator is of GMRES type. In order to reduce the communication between the subdomains, we use a parallel version of GMRES that divides the construction of the basis of the associated Krylov subspace into two steps. The first step generates the basis vectors in a pipeline of operations across all the processors; the main operations here are matrix-vector products and the solution of linear subsystems in the subdomains. The second step performs a parallel QR factorization on the rectangular matrix formed by the previously built vectors, which yields an orthogonal basis of the Krylov subspace.

The subsystems arising from the application of the preconditioner are solved at different levels of accuracy by an LU or ILU factorization, depending on the difficulty of the underlying problem. Moreover, this step can be done in parallel via calls to external solvers. To this end, we use two levels of parallelism in our global solver. The first level defines the computations and the communications across the subdomains. The second level is defined inside each subdomain for the solution of the subsystems arising from the preconditioner.

Several numerical tests are carried out to validate this approach, in particular regarding efficiency and robustness.

Mots-clés: domain decomposition, preconditioning, multiplicative Schwarz, parallel GMRES, Newton basis, multilevel parallelism.
1 Introduction

In this paper, we are interested in the parallel computation of the solution of the linear system (1)

    Ax = b    (1)

with A ∈ R^{n×n} and x, b ∈ R^n. Over the past two decades, the GMRES iterative method proposed by Saad and Schultz [33] has proved very successful for this type of systems, particularly when A is a large sparse nonsymmetric matrix. Usually, to be robust, the method solves a preconditioned system (2)

    M^{-1}Ax = M^{-1}b    or    AM^{-1}(Mx) = b    (2)

where M^{-1} is a preconditioner operator that accelerates the convergence of the iterative method.
On computing environments with a distributed architecture, preconditioners based on domain decomposition are of natural use. Their formulation reduces the global problem to several subproblems, where each subproblem is associated to a subdomain; therefore, one or more subdomains are associated to a node of the parallel computer and the global system is solved by exchanging information between neighboring subdomains. Generally, in domain decomposition methods, there are two ways of deriving the subdomains: (i) from the underlying physical domain and (ii) from the adjacency graph of the coefficient matrix A. In either of these partitioning techniques, subdomains may overlap. Overlapping domain decomposition approaches are known as Schwarz methods, while non-overlapping approaches refer to Schur complement techniques. Here, we are interested in preconditioners based on the first class. Depending on how the global solution is obtained, the Schwarz method is additive or multiplicative [38, Ch. 1]. The former approach computes the solution of subproblems simultaneously in all subdomains. It is akin to the block Jacobi method; therefore, the additive Schwarz method has a straightforward implementation in a parallel environment [13]. Furthermore, it is often used in conjunction with Schur complement techniques to produce hybrid preconditioners [14, 23, 34].

The multiplicative Schwarz method builds a solution of the global system by alternating successively through all the subdomains; it is therefore similar to the block Gauss-Seidel method on an extended system. Thus, compared to the additive approach, it will theoretically require fewer iterations to converge. However, good efficiency is difficult to obtain in a parallel environment due to the high data dependencies between the subdomains. The traditional approach to overcome this is through graph coloring, by associating different colors to neighboring subdomains. Hence, the solutions in subdomains of the same color can be updated in parallel [38, Ch. 1]. Recently, a different approach has been proposed [4, 6]. In that work, the authors proposed an explicit formulation associated to one iteration of the multiplicative Schwarz method. This formulation requires that the matrix is partitioned in block diagonal form [5] such that each block has a maximum of two neighbors; from this explicit formula, the residual vector is determined independently of the computation of the new approximate global solution, and it can therefore be parallelized through sequences of matrix-vector products and local solutions in subdomains.
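The contrast between the two variants can be sketched sequentially. The following numpy toy (two overlapping subdomains on a small diagonally dominant system, unrelated to the actual solver kernels of this report) applies one additive and one multiplicative sweep: the multiplicative sweep refreshes the residual after each local solve, the additive sweep freezes it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
A = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonally dominant
b = rng.standard_normal(n)
x_true = np.linalg.solve(A, b)

# Two overlapping index sets (subdomains).
doms = [np.arange(0, 6), np.arange(4, 10)]

def schwarz_sweep(x, multiplicative):
    """One Schwarz iteration: a local solve on each subdomain."""
    corrections = []
    for d in doms:
        r = b - A @ x                    # residual (frozen if additive)
        dx = np.linalg.solve(A[np.ix_(d, d)], r[d])
        if multiplicative:
            x = x.copy()
            x[d] += dx                   # apply the correction immediately
        else:
            corrections.append((d, dx))  # apply all corrections at the end
    if not multiplicative:
        x = x.copy()
        for d, dx in corrections:
            x[d] += dx
    return x

x0 = np.zeros(n)
err0 = np.linalg.norm(x0 - x_true)
err_mult = np.linalg.norm(schwarz_sweep(x0, True) - x_true)
err_add = np.linalg.norm(schwarz_sweep(x0, False) - x_true)
assert err_mult < err0 and err_add < err0
```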
The first purpose of this paper is to present the parallelism that is obtained when this explicit formulation is used to build a preconditioned Krylov basis for the GMRES method. This is achieved through the Newton basis implementation proposed in [7] and applied in [18, 37]. Hence, the usual inner products and global synchronizations are avoided across all the subdomains, and the resulting algorithm leads to a pipeline parallelism. We will refer to this as a first level of parallelism in GMRES. The second and main purpose of our work here is to further use parallel operations when solving the local systems associated to subdomains. To this end, we introduce a second level of parallelism to deal with those subsystems. The first observation is that, for systems that arise from non-elliptic operators, the number of subdomains should be small to guarantee the convergence. Consequently, the local systems could be very large and should be solved efficiently to accelerate the convergence. The second observation comes from the architecture of current parallel computers. These two levels of parallelism enable an efficient usage of the allocated resources by dividing tasks across and inside all the available nodes. This approach has recently been used to enhance the scalability of hybrid preconditioners based on additive Schwarz and Schur complement techniques [22].
The remaining part of this paper is organized as follows. Section 2 recalls the explicit formulation of the multiplicative Schwarz preconditioner. After that, a parallel implementation of the preconditioned Newton-basis GMRES is given. Section 3 presents the second level of parallelism introduced to solve the linear systems in subdomains. As those systems should be solved several times with different right-hand sides, the natural way is to use a parallel third-party solver based on an (incomplete) LU factorization. The parallelism in these solvers is intended for distributed-memory computers; we discuss the benefits of this approach. In Section 4, we provide intensive numerical results that reveal good performance of this parallel hybrid solver. The test matrices are taken either from public academic repositories or from industrial test cases. Concluding remarks and future directions of this work are given at the end of the paper.

2 A parallel version of GMRES preconditioned by multiplicative Schwarz
In this section, the explicit formulation of the multiplicative Schwarz method is introduced. Then we show how to use it efficiently as a parallel preconditioner for the GMRES method. A good parallelism is obtained thanks to the use of the Newton-Krylov basis.

2.1 Explicit formulation of the multiplicative Schwarz preconditioner
From the equation (1), we consider a permutation of the matrix A into p overlapping partitions A_i. We denote by C_i the overlapping matrix between A_i and A_{i+1} (i = 1, ..., p−1). Here, each diagonal block has a maximum of two neighbors (see Figure 1.a). Assuming that there is no full line (or column) in the sparse matrix A, such a partitioning can be obtained from the adjacency graph of A by means of profile reduction and level sets [5]. We define \bar{A}_i (resp. \bar{C}_i) as the matrix A_i (resp. C_i) completed by identity to the size of A (see Figure 1.b). If \bar{A}_i and \bar{C}_i are nonsingular matrices, then the matrix associated to one iteration of the classical multiplicative Schwarz method is defined [6] as:

    M^{-1} = \bar{A}_p^{-1} \bar{C}_{p-1} \bar{A}_{p-1}^{-1} \bar{C}_{p-2} ... \bar{A}_2^{-1} \bar{C}_1 \bar{A}_1^{-1}    (3)

This explicit formulation is very useful to provide a stand-alone algorithm for the essential operation y ← M^{-1}x used in iterative methods. However, since dependencies between subdomains are still present in this expression, no efficient parallel algorithm could be obtained for this single operation. Now, if a sequence of vectors v_i should be generated such that v_i ← M^{-1}Av_{i−1}, then the algorithm would produce a very good pipeline parallelism. For the preconditioned Krylov method, these vectors are simply those that span the Krylov subspace. However, to get a parallel process, the basis of this Krylov subspace should be formed with the Newton polynomials as described in the next section.
Figure 1: Partitioning of A into four subdomains. (a): block-diagonal form of A with diagonal blocks A_1, ..., A_4 and overlap blocks C_1, C_2, C_3; (b): the submatrix A_k completed by identity.

2.2 Background on GMRES with the Newton basis
A left preconditioned restarted GMRES(m) method minimizes the residual vector r_m = M^{-1}(b − Ax_m) in the Krylov subspace x_0 + K_m, where x_0 is the initial approximation, x_m the current iterate, and K_m = span{r_0, M^{-1}Ar_0, ..., (M^{-1}A)^{m−1}r_0}. The new approximation is of the form x_m = x_0 + V_m y_m, where y_m minimizes ||r_m||. The most time-consuming part of this method is the construction of the orthonormal basis V_m of K_m; the Arnoldi process is generally used for this purpose [32]. It constructs the basis and orthogonalizes it at the same time. The effect of this is the presence of global communication between all the processes. Hence, no parallelism could be obtained across the subdomains with this approach. However, synchronization points can be avoided by decoupling the construction of V_m into two independent phases: first the basis is generated a priori, then it is orthogonalized.

Many authors proposed different ways to generate this a priori basis [7, 40, 16]. For stability purposes, we choose the Newton basis introduced by Bai et al. [7] and used by Erhel [18]:

    \hat{V}_{m+1} = [μ_0 r_0, μ_1 (M^{-1}A − λ_1 I) r_0, ..., μ_m ∏_{j=1}^{m} (M^{-1}A − λ_j I) r_0]    (4)

where the μ_j and λ_j (j = 1, ..., m) are respectively scaling factors and approximate eigenvalues of A. With a chosen initial vector v_0 = r_0/||r_0||, a sequence of vectors v_1, v_2, ..., v_m is first generated such that

    v_j = σ_j (M^{-1}A − λ_j I) v_{j−1},    j = 1, ..., m    (5)

where

    σ_j = 1/||(M^{-1}A − λ_j I) v_{j−1}||    (6)

To avoid global communication, the vectors v_j are normalized at the end of the process. Hence, the scalars μ_j from the equation (4) are easily computed as products of the scalars σ_j. These steps are explained in detail in Sections 2.2.1 and 2.2.2. At this point, we get a normalized basis V_{m+1} such that

    M^{-1}A V_m = V_{m+1} T_m    (7)

where T_m is a bidiagonal matrix built from the scalars σ_j and λ_j. Therefore, V_{m+1} should be orthogonalized using a QR factorization:

    V_{m+1} = Q_{m+1} R_{m+1}    (8)

As we show in Section 2.2.1, and following the distribution of vectors in Figure 3.(b), the v_i's are distributed in blocks of consecutive rows between all the processors; however, their overlapped regions are not duplicated between neighboring processors. Thus, to perform the QR factorization, we use an algorithm introduced by Sameh [35] with a parallel implementation provided by Sidje [37]. Recently, a new approach called TSQR has been proposed by Demmel et al. [17], which aims primarily to minimize the communications and to make better use of the BLAS kernel operations. Their current algorithm used in [27] is implemented with POSIX threads and can only be used when a whole matrix V_m is available in the memory of one SMP node. Nevertheless, the algorithm is SPMD-style and should be easily ported to distributed memory nodes.

So far, at the end of the factorization, we get an orthogonal basis Q_{m+1} implicitly represented as a set of orthogonal reflectors. The matrix R_{m+1} is available in the memory of the last processor. To perform the minimization step in GMRES, we derive an Arnoldi-like relation [32, Section 6.3] using equations (7) and (8):

    M^{-1}A V_m = Q_{m+1} R_{m+1} T_m = Q_{m+1} \bar{G}_m    (9)

Hence the matrix \bar{G}_m is in Hessenberg form, and the new approximate solution is given by x_m = x_0 + V_m y_m, where the vector y_m minimizes the function J defined by

    J(y) = ||β e_1 − \bar{G}_m y||,    β = ||r_0||,    e_1 = [1, 0, ..., 0]^T    (10)

The matrix \bar{G}_m is in the memory of the last processor. Since m ≪ n, this least-squares problem is sequentially and easily solved using a QR factorization of \bar{G}_m. The outline of this GMRES algorithm with the Newton basis can be found in [18].

2.2.1 Parallel processing of the preconditioned Newton basis
In this section, we generate the Krylov vectors v_j (j = 0, ..., m) of V_{m+1} from the equation (5). Consider the partitioning of Figure 1 with a total of p subdomains. At this point, we assume that the number of processes is equal to the number of subdomains. Thus, each process computes the sequence of vectors v_j^{(k)} = (M^{-1}A − λ_j I)v_{j−1}^{(k)}, where v_j^{(k)} is the set of block rows from the vector v_j owned by process P_k. The kernel computation reduces to some extent to two major operations, z = Ax and y = M^{-1}z. For these operations, we consider the matrix and vector distributions in Figures 2 and 3. On each process P_k, the overlapping submatrix is zeroed to yield the matrix B_k in Figure 2. In Figure 3.(b), the distribution for the vector x is plotted. The overlapping parts are repeated on all processes. Hence, for each subvector x^{(k)} on process P_k, x_u^{(k)} and x_d^{(k)} denote respectively the overlapping parts with x^{(k−1)} and x^{(k+1)}. The pseudocode for the matrix-vector product z = Ax then follows in Algorithm 1.

Now we consider the matrix distribution in Figure 3.(a) for the second operation y = M^{-1}z. According to relation (3), each process k solves locally the linear system A_k t^{(k)} = z^{(k)} for t^{(k)}, followed by a product y_d^{(k)} = C_k t_d^{(k)} with the overlapped matrix C_k. However, the process P_k should first receive the overlapping part of y_d^{(k−1)} from the process P_{k−1}. Algorithm 2 describes the application of the preconditioner M^{-1} to a vector z.

The form of the M^{-1} operator produces a data dependency between neighboring processes. Hence, the computation of a single vector v_j = (M^{-1}A − λ_j I)v_{j−1} is sequential over all the processes. However, since the v_j are computed one after another, a process can start to compute its own part of the vector v_j even if the whole previous vector v_{j−1} is not yet available. We refer to this as a pipeline parallelism since the vectors v_j are computed across all the processes as in a pipeline. This would not be possible with the Arnoldi process, as all the global communications introduce synchronization points between the processes and prevent the use of the pipeline; see for instance the data dependencies implied by this process in Erhel [18].

Figure 2: Distribution for the matrix-vector product (on each process, the overlapping submatrix of B_k is zeroed).

Figure 3: Matrix and vector splittings. (a): distribution for the operation y ← M^{-1}x over the blocks A_1, ..., A_4 and C_1, ..., C_3; (b): distribution of the vector x, with x_d^{(1)} = x_u^{(2)}, x_d^{(2)} = x_u^{(3)}, etc.
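Ignoring the distribution, the recursion (5)-(6) and the relations (7)-(9) can be verified sequentially with a few lines of numpy (here M = I and arbitrary real shifts, purely as a sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 12, 5
A = rng.standard_normal((n, n)) + n * np.eye(n)
r0 = rng.standard_normal(n)
lam = np.linspace(1.0, 2.0, m)      # stand-in shifts (Ritz values in the report)

V = np.zeros((n, m + 1))
T = np.zeros((m + 1, m))            # bidiagonal: lambda_j and 1/sigma_j
V[:, 0] = r0 / np.linalg.norm(r0)
for j in range(1, m + 1):
    w = A @ V[:, j - 1] - lam[j - 1] * V[:, j - 1]  # (M^{-1}A - lambda_j I) v_{j-1}
    sigma = 1.0 / np.linalg.norm(w)                 # equation (6)
    V[:, j] = sigma * w                             # equation (5)
    T[j - 1, j - 1] = lam[j - 1]
    T[j, j - 1] = 1.0 / sigma

# Relation (7): (M^{-1}A) V_m = V_{m+1} T_m, here with M = I.
assert np.allclose(A @ V[:, :m], V @ T)

# Equations (8)-(9): after V_{m+1} = Q_{m+1} R_{m+1}, the matrix
# G_m = R_{m+1} T_m is upper Hessenberg.
Q, R = np.linalg.qr(V)
G = R @ T
assert np.allclose(A @ V[:, :m], Q @ G)
assert np.allclose(np.tril(G, -2), 0.0)
```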
To better understand the actual pipeline parallelism, we recall here all the dependencies presented in [39]. For the parallel matrix-vector product in Algorithm 1, if we set z = h(x) = Ax, then z^{(k)} = h_k(x^{(k−1)}, x^{(k)}, x^{(k+1)}), and Figure 4.(a) illustrates these dependencies. For the preconditioner application y ← M^{-1}z as written in Algorithm 2, we set y = g(z) = M^{-1}z; then y^{(k)} = g_k(y^{(k−1)}, z^{(k)}, z^{(k+1)}), and the dependencies are depicted in Figure 4.(b). Finally, for the operation y ← M^{-1}Ax, if y = f(x) = M^{-1}Ax = g ∘ h(x), then y^{(k)} = f_k(y^{(k−1)}, x^{(k)}, x^{(k+1)}, x^{(k+2)}), and we combine the two graphs to obtain the dependencies in Figure 5.a. Here the dependency on x^{(k−1)} is combined with that of y^{(k−1)}. In the pipelined computation v_j = f(v_{j−1}), the dependency on x^{(k+2)} in Figure 5.b delays the computation of the v_j^{(k)} until the subvector v_{j−1}^{(k+2)} is available.
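The locality z^{(k)} = h_k(x^{(k−1)}, x^{(k)}, x^{(k+1)}) can be observed on a toy block tridiagonal matrix (a stand-in for the block-diagonal-with-overlap structure of Figure 1): perturbing a distant subvector leaves the nearer blocks of z untouched.

```python
import numpy as np

rng = np.random.default_rng(4)
p, bs = 4, 3                  # four subdomains, block size 3
n = p * bs
# Block tridiagonal A: block row k only touches blocks k-1, k, k+1.
A = np.zeros((n, n))
for k in range(p):
    lo, hi = k * bs, (k + 1) * bs
    c0, c1 = max(0, lo - bs), min(n, hi + bs)
    A[lo:hi, c0:c1] = rng.standard_normal((bs, c1 - c0))

x = rng.standard_normal(n)
z = A @ x

# Perturbing x^{(4)} leaves z^{(1)} and z^{(2)} unchanged,
# since z^{(k)} = h_k(x^{(k-1)}, x^{(k)}, x^{(k+1)}).
x2 = x.copy()
x2[3 * bs:] += 1.0
z2 = A @ x2
assert np.allclose(z[:2 * bs], z2[:2 * bs])
assert not np.allclose(z[2 * bs:], z2[2 * bs:])
```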
Algorithm 1: z = Ax
 1: /* Process P_k holds B_k, x^{(k)} */
 2: k ← myrank()
 3: z^{(k)} ← B_k x^{(k)}  /* local matrix-vector product */
 4: if k < p then
 5:   Send z_d^{(k)} to process P_{k+1}
 6: end if
 7: if k > 1 then
 8:   Receive z_d^{(k−1)} from process P_{k−1}
 9:   z_u^{(k)} ← z_u^{(k)} + z_d^{(k−1)}
10: end if
11: /* Communication for the consistency of overlapped regions */
12: if k > 1 then
13:   Send z_u^{(k)} to process P_{k−1}
14: end if
15: if k < p then
16:   Receive z_u^{(k+1)} from process P_{k+1}
17:   z_d^{(k)} ← z_u^{(k+1)}
18: end if
19: return z^{(k)}
Algorithm 2: y = M^{-1}z
 1: /* Process P_k holds A_k, z^{(k)} */
 2: k ← myrank()
 3: if k > 1 then
 4:   Receive y_d^{(k−1)} from process P_{k−1}
 5:   z_u^{(k)} ← y_d^{(k−1)}
 6: end if
 7: Solve the local system A_k y^{(k)} = z^{(k)} for y^{(k)}
 8: if k < p then
 9:   y_d^{(k)} ← C_k y_d^{(k)}
10:   Send y_d^{(k)} to process P_{k+1}
11: end if
12: /* Communication for the consistency of overlapped regions */
13: if k > 1 then
14:   Send y_u^{(k)} to process P_{k−1}
15: end if
16: if k < p then
17:   Receive y_u^{(k+1)} from process P_{k+1}
18:   y_d^{(k)} ← y_u^{(k+1)}
19: end if
20: return y^{(k)}
Each vector of the basis is thus computed across all the subdomains, and if τ denotes the time to compute a subvector x^{(k)}, including the communication overhead, then the time to compute one vector is

    t(1) = pτ    (11)

and the time to compute the m vectors of the basis follows

    t(m) = pτ + 3(m − 1)τ    (12)

(see Figure 6). The efficiency e_p of the overall algorithm is therefore computed as:

    e_p = m t(1) / (p t(m)) = pτm / (p(pτ + 3(m − 1)τ)) = m / (p + 3(m − 1))    (13)

Hence, the efficiency is close to 1/3 for a large value of m. However, this result strongly suggests keeping the number p of subdomains very small relative to m.
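A quick numeric reading of the model (13):

```python
# Efficiency model of equation (13): e_p = m / (p + 3(m - 1)).
def efficiency(p, m):
    return m / (p + 3 * (m - 1))

# For a fixed p, the efficiency approaches 1/3 as m grows...
assert abs(efficiency(4, 10_000) - 1 / 3) < 1e-3
# ...and for a fixed basis size m, fewer subdomains means better efficiency.
assert efficiency(4, 32) > efficiency(16, 32)
```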
Figure 4: Dependency graphs. (a): graph for the matrix-vector product; (b): graph for applying M^{-1}.

Figure 5: Dependency graph for y = M^{-1}Ax.
2.2.2 Computation of shifts and scaling factors

So far, we have not yet explained how to compute the shifts λ_j and the scaling factors σ_j of the equations (5) and (6). As suggested by Reichel [31], the values λ_j should be the m eigenvalues of greatest magnitude of the matrix M^{-1}A. Since these eigenvalues cannot be cheaply computed, we perform one iteration of classical GMRES(m) with the Arnoldi process. Hence, the m eigenvalues computed from the output Hessenberg matrix are the approximate eigenvalues of the preconditioned global matrix [33, Section 3]. These eigenvalues are arranged using the Leja ordering while grouping together two complex conjugate eigenvalues. Recently, Philippe and Reichel [29] proved that, in some cases, roots of the Chebyshev polynomials can be used efficiently.

The scalars σ_j are determined without using global reduction operations. Indeed, such operations introduce synchronization points between all the processes and consequently destroy the pipeline parallelism. Algorithms 1 and 2 compute to some extent the sequence v_j = (M^{-1}A − λ_j I)v_{j−1}. We are interested in the scalars σ_j = ||v_j||_2. For this purpose, on process P_k, we denote by \hat{v}_j^{(k)} the subvector v_j^{(k)} without the overlapping part. In Algorithm 2, after the line 8, an instruction is therefore added to compute the local sum σ_j^{(k)} = ||\hat{v}_j^{(k)}||_2^2. The result is sent to the process P_{k+1} at the same time as the overlapping subvector at line 10. The next process P_{k+1} receives the result and adds its own contribution. This is repeated until the last process P_p. At this step, the scalar σ_j = sqrt(σ_j^{(p)}) is the norm ||v_j||_2. An illustration is given in Figure 6 during the computation of the vector v_3.

Figure 6: Double recursion during the computation of the Krylov basis on processes P_1, ..., P_4: the local partial sums σ_3^{(k)} = ||v_3^{(k)}||_2^2 + σ_3^{(k−1)} are accumulated along the pipeline, with σ_3 = sqrt(σ_3^{(4)}), overlapped with the communications for Ax, for M^{-1}, and for the consistency of the computed vector in the overlapped region.
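The shift computation can be sketched sequentially: one Arnoldi cycle yields a Hessenberg matrix whose eigenvalues are the Ritz values, which are then Leja-ordered (this sketch works on the complex values directly and omits the grouping of conjugate pairs described above).

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 30, 6
A = rng.standard_normal((n, n)) + n * np.eye(n)

# One Arnoldi cycle: A Q_m = Q_{m+1} H, with H upper Hessenberg.
Q = np.zeros((n, m + 1))
H = np.zeros((m + 1, m))
q = rng.standard_normal(n)
Q[:, 0] = q / np.linalg.norm(q)
for j in range(m):
    w = A @ Q[:, j]
    for i in range(j + 1):
        H[i, j] = Q[:, i] @ w
        w -= H[i, j] * Q[:, i]
    H[j + 1, j] = np.linalg.norm(w)
    Q[:, j + 1] = w / H[j + 1, j]
ritz = np.linalg.eigvals(H[:m, :m])     # approximate eigenvalues (shifts)

def leja_order(vals):
    """Greedy Leja ordering: start from the value of largest modulus,
    then repeatedly pick the value maximizing the product of distances
    to the values already chosen."""
    vals = list(vals)
    out = [vals.pop(int(np.argmax(np.abs(vals))))]
    while vals:
        scores = [np.prod(np.abs(v - np.array(out))) for v in vals]
        out.append(vals.pop(int(np.argmax(scores))))
    return np.array(out)

shifts = leja_order(ritz)
assert len(shifts) == m
assert np.isclose(abs(shifts[0]), np.abs(ritz).max())
```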
3 Parallel computation of local systems in subdomains
During each application of the operator M^{-1}, several linear systems A_k y^{(k)} = z^{(k)} should be solved for y^{(k)} with different values of z^{(k)} (see Algorithm 2, line 7). The matrices A_k, k = 1, ..., p, are sparse and nonsymmetric, as the global matrix A. Hence, during the whole iterative process and inside each subdomain, a sparse linear system with different right-hand sides is being solved. On today's supercomputers with SMP (Symmetric MultiProcessing) nodes, it is natural to introduce another level of parallelism within each subdomain. So, in this section, we first give the motivations for using this second level of parallelism. Then we discuss the appropriate solver to use in subdomains. Finally, we give the main steps of our parallel solver with these two levels of parallelism.
3.1 Motivations for two levels of parallelism

In domain decomposition methods, the classical approach is to assign one or several subdomains to a unique processor. In terms of numerical convergence and parallel performance, this approach has some limitations, as pointed out in [22, 24]:

- For some difficult problems, such as those arising from non-elliptic operators, the number of iterations tends to increase with the number of subdomains. It is therefore essential to keep this number of subdomains small in order to provide a robust iterative method. Moreover, with a fixed size m of the Krylov basis, a better efficiency is obtained with a small number of subdomains, as equation (13) suggests. By doing this, however, the subsystems are getting large and could not be solved efficiently with only one processor per subdomain. In this situation, the second level of parallelism improves the numerical convergence of the method while providing fast and effective solutions in subdomains.

- In a wide range of parallel supercomputers, each compute node is made of several processors accessing more or less the same global memory space (see Figure 7.a). Usually, this space is logically divided by the resource manager if only a subset of processors is scheduled. As a result, the allocated space could be insufficient to handle all the data associated to a subdomain; it is therefore necessary to schedule more processors in a node to access the amount of required memory. In this situation, with one level of parallelism, only one processor will be working (the processor P_0 for instance in Figure 7.b). However, if a new level of parallelism is introduced inside each scheduled node, all the processors can contribute to the treatment of the data stored in the memory of the node (Figure 7.c). Moreover, with this approach, data associated to a particular subdomain can be distributed on more than one SMP node; in this case, the new level of parallelism is defined across all the nodes responsible for a subdomain (Figure 7.d).

3.2 Local solver in subdomains: accuracy and parallelism issues
The sparse linear systems associated to subdomains are in the scope of the preconditioner operator M^{-1}. Therefore, the solutions of these systems can be obtained more or less accurately:

- An accurate solution, with a direct solver for instance, leads to a powerful preconditioner, thus accelerating the convergence of the global iterative method. This approach is usually known as a hybrid direct/iterative technique. The drawback here is the significant memory needed in each subdomain. Indeed, not only the local submatrix A_k should be stored, but also the L_k and U_k matrices from its LU factorization. With the presence of fill-in during the numerical factorization, the size of those factors can increase significantly compared to the size of the original submatrix.

- The problem of memory can be overcome by using an incomplete (ILU) factorization of the local matrices, thus producing fast solutions in subdomains with fair accuracy. In this case, the global iterative method is more likely to stagnate for some complex problems.

- An alternative could be to use an iterative method inside each subdomain. However, a large difference in the accuracy of local solutions could change the expression of the preconditioner between outer cycles of GMRES.

Figure 7: Distribution of subdomains on SMP nodes. (a): an SMP node (two sockets with L1/L2 caches and a shared memory system); (b): one subdomain per SMP node, only processor P_0 working; (c): one subdomain per SMP node, all processors working; (d): one subdomain spread over two SMP nodes.

So far, only the numerical accuracy and the memory issues have been discussed, but not the underlying parallel programming model. Beforehand, we assume that a parallel direct solver is used within each computing node when there is enough memory; otherwise, an incomplete solver is applied. On SMP nodes with processors sharing the same memory, solvers that distribute tasks among computing units by using a thread model are of natural use (Spooles [3], PaStiX [25], PARDISO [36]). On the other side, high-performance sparse solvers that use the message-passing model are intended primarily for distributed memory processors. Nevertheless, on SMP nodes, this approach can be used efficiently thanks to the availability of intranode communication channels for the underlying message passing interface (MPI) [12, 21]. These channels are appropriate for the large amount of communication introduced between processors during the solution in subdomains; for instance, the comparative study in [2, Sect. 5] gives the volume and size of these communications for some direct solvers. As the exchanged messages are of small size, a good data transfer can be obtained inside SMP nodes [11, 30]. Moreover, it is possible to split a subdomain over more than one SMP node. For all these reasons, we primarily use distributed direct solvers in subdomains for the second level of parallelism.
3.3 Main steps of the parallel solver
The solver is composed of the same steps whether a complete or an incomplete factorization is used for the subsystems:

1. Partitioning: The matrix A is first permuted into the block diagonal form by the host process (see Figure 1). After that, the local matrices A_k and C_k should be distributed to the other processes. If a distributed solver is used in subdomains, then the processors are dispatched in multiple communicators. For instance, Figure 8 shows a distribution of four SMP nodes with four processes around two levels of MPI communicators. The first level is intended for the communication across the subdomains. Hence, in this communicator, the submatrices are distributed to the level 0 processes (i.e. P_{k,0} where k = 0, ..., p−1 and p is the number of subdomains). Then, for each subdomain k, a second communicator is created between the level 1 processes to manage communications inside the subdomains (i.e. P_{k,j} where j = 0, ..., p_k−1 and p_k is the number of processes in the subdomain k).
Figure 8: MPI communicators for the two levels of parallelism (processes P_{k,j}: the level 0 communicator links the local roots P_{k,0} across the subdomains, and a level 1 communicator links the processes inside each subdomain).
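The bookkeeping behind Figure 8 amounts to an MPI_Comm_split with the subdomain index as color; a plain-Python sketch of the rank mapping (with mpi4py one would call comm.Split(color, key)):

```python
# Mapping of a world rank to the process grid P_{k,j} of Figure 8:
# the color k selects the subdomain (level 1 communicator), and the
# level 0 communicator groups the local roots P_{k,0}.
p, procs_per_dom = 4, 4

def grid_position(world_rank):
    k = world_rank // procs_per_dom    # subdomain index (color)
    j = world_rank % procs_per_dom     # rank inside the subdomain (key)
    return k, j

level0 = [r for r in range(p * procs_per_dom) if grid_position(r)[1] == 0]
assert grid_position(0) == (0, 0)
assert grid_position(7) == (1, 3)
assert level0 == [0, 4, 8, 12]
```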
2. Setup: In this phase, the symbolic and numerical factorizations are performed on the submatrices A_k by the underlying local solver. This step is purely parallel across all the subdomains. At the end of this phase, the factors L_k and U_k reside in the processors responsible for the subdomain k. This is totally managed by the underlying local solver. Prior to this phase, a preprocessing step can be performed to scale the elements of the matrix.

3. Solve: This is the iterative phase of the solver. With an initial guess x_0 and the shifts λ_j, the solver computes all the equations (5)-(10) as outlined here:

   (a) Pipeline computation of V_{m+1}
   (b) Parallel QR factorization V_{m+1} = Q_{m+1} R_{m+1}
   (c) Sequential computation of \bar{G}_m such that M^{-1}AV_m = Q_{m+1}\bar{G}_m on the process P_{p−1,0}
   (d) Sequential solution of the least-squares problem (10) for y_m on the process P_{p−1,0}
   (e) Broadcast of the vector y_m to the processes P_{k,0}
   (f) Parallel computation of x_m = x_0 + V_m y_m

   The convergence is reached when ||b − Ax_m|| < ε||b||; otherwise the process restarts with x_0 = x_m.
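Steps (a)-(f) can be condensed into a sequential numpy sketch of one restarted cycle (M = I and all shifts λ_j = 0, i.e. a monomial a-priori basis that is only safe for small m; numpy's qr and lstsq stand in for the parallel kernels):

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 40, 8
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)
x = np.zeros(n)

for cycle in range(2):
    r = b - A @ x
    beta = np.linalg.norm(r)
    V = np.zeros((n, m + 1))
    T = np.zeros((m + 1, m))
    V[:, 0] = r / beta
    for j in range(1, m + 1):        # (a) a-priori basis (pipelined in the solver)
        w = A @ V[:, j - 1]          # M = I, shift lambda_j = 0
        s = np.linalg.norm(w)
        V[:, j] = w / s
        T[j, j - 1] = s              # A V_m = V_{m+1} T_m
    Q, R = np.linalg.qr(V)           # (b) QR of the basis (parallel in the solver)
    G = R @ T                        # (c) Hessenberg matrix G_m = R_{m+1} T_m
    # (d) least squares: r = Q (beta * R[:, 0]), so minimize over y
    y = np.linalg.lstsq(G, beta * R[:, 0], rcond=None)[0]
    x = x + V[:, :m] @ y             # (e)-(f) correction with the unfactored V

rel_res = np.linalg.norm(b - A @ x) / np.linalg.norm(b)
assert rel_res < 1e-6
```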
When two levels of parallelism are used during the computation of V_m, level 1 processors perform multiple backward and forward sweeps in parallel to solve the local systems. After the parallel QR factorization in step 3b, the Q factor is never formed explicitly. Indeed, only the unfactored basis V is used to apply the new correction, as shown in step 3f. Nevertheless, a routine to form this factor is provided in the solver for general purposes.

4 Numerical experiments
In this section, we perform several experiments to demonstrate the parallel performance and the numerical robustness of our parallel solver on a wide range of real problems. In Section 4.1, we first give the software and the hardware architecture on which the tests are done. Sections 4.3 and 4.4 provide intensive results in terms of scalability and robustness on several test cases. Those cases, listed in Table 1, are taken from various sources, ranging from public repositories to industrial software.
4.1 Software and hardware framework
The solver is named GPREMS¹ (GMRES PREconditioned by Multiplicative Schwarz). It is intended to be deployed on distributed memory computers that communicate through message passing (MPI). The parallelism in subdomains is based either on the message passing or the threads model, depending on the underlying solver. The whole library is built on top of PETSc (Portable, Extensible Toolkit for Scientific Computation) [8, 9]. The motivation for this choice is the uniform access to a wide range of external packages for the linear systems in subdomains. Moreover, the PETSc package provides several optimized functions and data structures to manipulate the distributed matrices and vectors. Figure 9 gives the position of GPREMS in the abstraction layer of PETSc. We give only the components that are used.

Figure 9: GPREMS library in PETSc (application codes on top of GPREMS; the PETSc KSP Krylov subspace methods, PC preconditioners, index sets, vectors and matrices; external packages such as SuperLU_DIST, MUMPS, HYPRE and PaStiX; BLAS, LAPACK, ScaLAPACK, BLACS and MPI at the lowest level of abstraction).

¹ GPREMS uses the PETSc environment, but it is not distributed as a part of the PETSc libraries. Nevertheless, the library is configured easily once a usable version of PETSc is available on the targeted architecture.
All the tests in this paper are performed on IBM p575 SMP nodes connected through an Infiniband DDR network. Each node is composed of 32 Power6 processors sharing the same L3 cache. A Power6 processor is a dual-core 2-way SMT with a peak frequency of 4.7 GHz. Figure 10 (obtained with hwloc [10]) shows the topology of the first eight processors in one compute node. A total of 128 nodes is available in this supercomputer (named Vargas²), which is part of the French CNRS-IDRIS computing facilities.
Figure 10: Vargas architecture (one compute node: System 128 GB; shared L3 32 MB; four L2 caches of 4 MB each; cores #0 to #7; hardware threads P#0 to P#15)
4.2 Test matrices

This section gives the main characteristics of the set of problems under study.
The first set is taken from the University of Florida (UFL) Sparse Matrix Collection [15]. In this set, we choose the problems MEMCHIP and PARA-4, respectively from the Freescale and Schenk_ISEI directories; their main characteristics are given in Table 1. The MEMCHIP
Table 1: General properties of the four test matrices

Matrix name        MEMCHIP              PARA_4           CASE_004   CASE_017
Order              2,707,524            153,226          7,980      381,689
No. entries (nnz)  14,810,202           2,930,882        430,909    37,464,962
Entries per row    5                    19               53         98
Pattern symmetry   91%                  100%             88%        93%
2D/3D problem      No                   Yes              2D         3D
Source             UFL (K. Gullapalli)  UFL (O. Schenk)  FLUOREM    FLUOREM
problem arises from the circuit simulation field. The density of the associated matrix is rather small: the number of entries is about 1.4 · 10^7, but the number of nonzero elements per row reveals a rather small density. Hence, this problem is easily solved with GPREMS using an incomplete factorization in subdomains. We present the results for this case in section 4.3. The next matrix is PARA-4, coming from 2D semiconductor device simulations. Compared to the previous problem, the geometry of this problem makes it difficult to solve. Hence, we choose this matrix to show the robustness of our solver when it is combined with a direct solver in subdomains. Again, the results for this case are shown in section 4.3.
The second set contains two test matrices (CASE_004 and CASE_017) provided by FLUOREM [20], a fluid dynamics software publisher. In those cases, the sparse matrix corresponds to the global Jacobian matrix resulting from the partial first-order derivatives of the discrete steady Euler or Reynolds-averaged Navier-Stokes equations. The derivatives are done with respect to the seven conservative fluid variables (mass: ρ, momentum: [ρu, ρv, ρw], energy: ρE, turbulence: [ρk, ρω] with the k−ω model); the steady equation can be roughly written as F(Q) = 0, where Q denotes the fluid variables and F(Q) the divergence of the flux function. If Q_h and F_h are the finite volume discretizations over the computational domain of Q and F respectively, then the matrix A of the linear system (1) is the global Jacobian matrix J_{F_h}(Q_h). A ∈ R^(n×n) is non-symmetric and indefinite, and when the stationary flow is dominated by advection, it corresponds to a strongly non-elliptic operator. The two-level parallelism is then useful in those cases in that we can increase the number of processors to speed up the overall execution time while guaranteeing that the number of iterations remains almost constant. Section 4.4 gives the results with the multilevel approach on the CASE_017 case. The matrix CASE_004 is used to show the numerical robustness of GPREMS compared to the additive Schwarz approach implemented in PETSc.
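For illustration only (FLUOREM forms these derivatives analytically), a Jacobian of a discrete residual F can be assembled column by column with forward differences; the toy 1D upwind-type residual below is a hypothetical stand-in for F_h, not the Euler or Navier-Stokes operator of the paper:

```python
# Hedged sketch: finite-difference assembly of J_F(Q), column by column.
# flux_residual is an illustrative 1D upwind-type residual with a quadratic
# source term; it is NOT the FLUOREM operator, only a stand-in.

def flux_residual(q):
    """Toy discrete residual F(Q): first-order upwind transport + source."""
    n = len(q)
    return [q[i] - (q[i - 1] if i > 0 else 1.0) + 0.1 * q[i] ** 2
            for i in range(n)]

def jacobian(F, q, eps=1e-7):
    """Dense forward-difference Jacobian J[i][j] = dF_i/dQ_j."""
    n = len(q)
    f0 = F(q)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        qp = list(q)
        qp[j] += eps                 # perturb one unknown at a time
        fj = F(qp)
        for i in range(n):
            J[i][j] = (fj[i] - f0[i]) / eps
    return J

q = [1.0, 0.5, 0.25, 0.125]
A = jacobian(flux_residual, q)       # lower-bidiagonal, non-symmetric
```

The resulting matrix is non-symmetric (each row couples only to its upstream neighbour), which mirrors the advection-dominated structure described above.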
4.3 Convergence and scalability studies on the MEMCHIP and PARA-4 cases
We discuss here the results of the numerical convergence of GPREMS on the two problems MEMCHIP and PARA-4. We use respectively m = 12 and m = 24 as the size of the Newton basis (search directions) for the two cases. Note that the iterative process stops when the relative residual norm ||b − Ax||/||b|| is less than 10^-8 or when 1500 iterations are reached. The total number of iterations is usually a multiple of m, as the Newton basis should be generated a priori; hence the convergence is tested only at each external cycle. ILU(p)
denotes the use of an incomplete factorization in subdomains with p levels of fill-in in the factored matrix. The package used to perform this factorization is EUCLID [26] from the HYPRE suite [19]. Some other problems, like PARA-4, are more difficult to solve; hence, the use of a hybrid direct-iterative solver is necessary to achieve good convergence in a reasonable CPU time. The MUMPS [1] package is then used at this point. In all cases so far, the relative residual norm of the computed solution is given with respect to the number of iterations. We compare the convergence history on 4, 8 and 16 subdomains. Following the convergence history, we give the CPU time spent in the various phases: the initialization phase, the setup phase and the iterative phase.

For the MEMCHIP case with about
1.5 · 10^7 entries, it takes less than 25 cycles to converge, as shown in Figure 11.(a). Moreover, the number of iterations tends to decrease with 16 subdomains. The ILU(0) factorization is used in subdomains, meaning that no new elements are allowed during the numerical factorization and thus no extra space is needed. This has the effect of speeding up the setup phase and the application of the M^-1 operator in the iterative step. The CPU times of these steps are shown in Table 2, along with the initialization time (Init) and the total time. These results prove the scalability of almost all the steps in GPREMS. Apart from the initialization step, which is sequential, all other steps benefit from the first level of parallelism in the method. The setup phase is fully parallel across the subdomains: each process responsible for a subdomain performs a sequential factorization on its local matrix. As the size of those submatrices decreases with the number of subdomains, the setup time decreases as well. The iterative part of the solver (given by Iter. loop time in Table 2) scales very well in this case. For instance, 300 iterations of GMRES are performed in less than 103 s. when using 4 subdomains in the block-diagonal partitioning. With 8 subdomains, the method requires 312 iterations (26 restarts or cycles) computed in less than 65 s. When the problem is subdivided into 16 subdomains, the method reveals a superlinear speedup, thanks to the rather small number of iterations needed to converge. This is a particular case, probably due to the unpredictable properties of the ILU(0) factorization; this particular case on 16 subdomains does not appear when ILU(1) is used; see Figure 11.(b).
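The cycle structure described above, with convergence tested only every m steps, can be sketched as follows; the inner loop here is plain Richardson iteration standing in for the preconditioned Newton-basis GMRES cycle of the paper, and the matrix, relaxation parameter and tolerances are illustrative only:

```python
# Hedged sketch of the outer driver: the test ||b - A x|| / ||b|| < tol is
# applied only after each cycle of m inner steps, so the iteration count is
# always a multiple of m, as with the a priori generated Newton basis.
# The inner Richardson step is a stand-in, NOT the paper's GMRES cycle.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def norm(v):
    return sum(vi * vi for vi in v) ** 0.5

def solve_by_cycles(A, b, m=12, tol=1e-8, max_it=1500, omega=0.5):
    x = [0.0] * len(b)
    nb = norm(b)
    it = 0
    while it < max_it:
        for _ in range(m):                       # one cycle: m inner steps
            r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
            x = [xi + omega * ri for xi, ri in zip(x, r)]
            it += 1
        r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
        if norm(r) / nb < tol:                   # tested per cycle only
            break
    return x, it

A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
b = [1.0, 0.0, 1.0]                              # exact solution (1, 1, 1)
x, it = solve_by_cycles(A, b, m=12)
```

On this toy system the driver stops at the first cycle boundary after the relative residual drops below 10^-8, so the reported iteration count is a multiple of m.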
Figure 11: Convergence history with the MEMCHIP case on 4, 8 and 16 subdomains ((a): GPREMS(12)+ILU0, relative residual norm vs. iterations; (b): GPREMS(12)+ILU1, preconditioned residual norm vs. iterations; curves for D = 4, 8, 12, 16, 20)
Table 2: MEMCHIP: CPU time for various numbers of subdomains; ILU(0) is used in subdomains

#Processors  #Subdomains  Iterations  Init   Setup  Iter. loop  Time/Iter  Total
4            4            48          26.38  3.39   33.2        0.69       62.97
8            8            72          27.4   2.03   25.98       0.36       55.41
16           16           60          30.41  1      17.59       0.29       49
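For reference, ILU(0) as used in the runs above allows no fill-in: L and U keep exactly the nonzero pattern of A. A minimal dense-storage sketch (EUCLID of course works on true sparse structures):

```python
# Hedged sketch of ILU(0): incomplete LU with zero fill-in. The factors are
# packed into one array LU (L unit lower triangular, U upper triangular);
# any update that would create an entry outside the pattern of A is dropped.
# Dense storage is used only for clarity.

def ilu0(A):
    n = len(A)
    LU = [row[:] for row in A]
    pattern = [[A[i][j] != 0.0 for j in range(n)] for i in range(n)]
    for i in range(1, n):
        for k in range(i):
            if not pattern[i][k]:
                continue                 # multiplier outside pattern: skip
            LU[i][k] /= LU[k][k]
            for j in range(k + 1, n):
                if pattern[i][j]:        # keep only positions already in A
                    LU[i][j] -= LU[i][k] * LU[k][j]
    return LU

A = [[ 4.0, -1.0,  0.0, -1.0],
     [-1.0,  4.0, -1.0,  0.0],
     [ 0.0, -1.0,  4.0, -1.0],
     [-1.0,  0.0, -1.0,  4.0]]
LU = ilu0(A)
```

Positions such as (1,3), zero in A, stay zero in the factors; an ILU(p) with p > 0 would progressively admit such fill entries, at the price of the larger setup times reported below for ILU(2).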
Obviously, when the subsystems are solved approximately, the global method loses some robustness. Thus, for many 2D/3D cases, the number of iterations gets very large and, consequently, the overall time needed to converge is too high. We illustrate this behaviour with the PARA-4 2D case. An incomplete LU factorization (ILU(2)) is first performed on each submatrix to solve the systems induced by the preconditioner operator. The ILU(2) factorization keeps the second-order fill-ins in the factors, leading to a more accurate preconditioner than that with ILU(0). However, this is not good enough to achieve a good convergence rate as in the previous case. For instance, in Figure 12.(a), after reaching 10^-7, the residual norm tends to stagnate. This can be observed with 4, 8 or 16 subdomains. On the other hand, it takes a certain time to set up the block matrices for the iterative phase, as shown in Table 3. It can be seen that the maximum time over all processors (Setup) to get the incomplete factorization of the local matrices is twice the time of the iterative loop.
Now, if a direct solver is used within each subdomain, the method gets a better convergence thanks to the accuracy achieved during the application of the preconditioner. This behaviour is observed in Figure 12.(b), in which we show the convergence history for the numbers of subdomains under study. The number of iterations grows slightly when we change from 4 to 8 subdomains. This does not impact the overall time spent in the iterative step, as shown in the second part of Table 3. Indeed, this step reveals a very good speedup, as we only need 35 s. with 8 subdomains compared to 45 s. for 4 subdomains. With 16 processors (subdomains), the total time continues to decrease, even with the slight increase of the global number of iterations. It is also worth pointing out the time spent in the setup phase. It takes roughly 5 s. on four processors to perform the symbolic and the numerical factorization of the submatrices. Note that each process holds a local matrix
Figure 12: Convergence history with the PARA-4 case on 4, 8 and 16 subdomains ((a): GPREMS(24)+ILU(2); (b): GPREMS(24)+LU; preconditioned residual norm vs. iterations, curves for D = 4, 8, 16)
associated to one subdomain. Hence, the setup time is the maximum time over all active processors. When the matrix is split into 16 subdomains, the size of these submatrices is rather small and the setup time drops to 1 s. This is almost 400 times smaller than with the ILU(2) factorization. The main reason is the algorithm used by the two methods. In MUMPS, almost all the operations are done on block matrices (the so-called frontal matrices) via calls to level 2 and level 3 BLAS functions. Therefore, the different levels of the cache memory are used efficiently during the numerical factorization, which is the most time-consuming step. Moreover, a multithreaded BLAS is linked to MUMPS to further speed up the method on SMP nodes. For this reason, and given the rather small size of the submatrices, the second level of parallelism is not applied up to this point.
Table 3: PARA-4: CPU time of GPREMS(24) for various numbers of subdomains and various types of solvers in subdomains

Incomplete solver in subdomains (ILU(2) - EUCLID)
#D  #Proc.  Iter.  Init  Setup   Iter. loop  Time/Iter  Total    s_p   e_p
4   4       1008   1.33  418.17  654.45      0.65       1073.95  -     -
8   8       1008   1.31  409.68  599.64      0.59       1010.63  1.06  0.53
16  16      1008   1.51  398.61  542.75      0.54       942.87   1.14  0.28

Direct solver in subdomains (LU - MUMPS)
#D  #Proc.  Iter.  Init  Setup   Iter. loop  Time/Iter  Total    s_p   e_p
4   4       144    1.32  5.12    45.43       0.32       51.87    -     -
8   8       216    1.26  1.67    35.06       0.16       37.98    1.37  0.68
16  16      240    1.37  0.73    29.21       0.12       31.31    1.66  0.41
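The derived columns of Table 3 (speedup relative to the four-processor run, s_p = T_4/T_p, and efficiency e_p = 4 · s_p/p) can be recomputed from the Total column; a minimal sketch, with the times hard-coded from the direct-solver (LU - MUMPS) rows:

```python
# Hedged check of Table 3's derived columns: s_p = T_4 / T_p relative to the
# 4-processor baseline, and e_p = 4 * s_p / p.

def speedup_efficiency(total_times, base_procs=4):
    """total_times maps processor count p -> total CPU time T_p (seconds)."""
    t_base = total_times[base_procs]
    return {p: (t_base / t, base_procs * (t_base / t) / p)
            for p, t in total_times.items()}

# Total times (s.) from Table 3, direct solver (LU - MUMPS) in subdomains.
derived = speedup_efficiency({4: 51.87, 8: 37.98, 16: 31.31})
```

Rounded to two digits, this reproduces the s_p and e_p entries of the table (1.37/0.68 on 8 processors, 1.66/0.41 on 16).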
In Table 3, we also give the speedup and the efficiency of the method relative to four processors: s_p = T_4/T_p and e_p = 4 · s_p/p. We observe that the efficiency decreases very fast with the number of subdomains. With ILU(2) in subdomains, for instance, the efficiencies on 8 and 16 subdomains are respectively 0.53 and 0.28. This behaviour can be explained by the formula in Equation (13), as p should be kept small relative to m. This is one of the reasons to use the second level of parallelism inside the subdomains.

4.4 Benefits of the two-level parallelism
Now, let us consider the FLUOREM problem CASE_017 listed in Table 1. The geometry of this 3D case is a jet engine compressor. Although the turbulent variables are frozen
study involving some distributed linear solvers [28]. From this study, solvers that combine hybrid direct-iterative approaches, like GPREMS, provide an efficient way to deal with such problems. However, the number of subdomains should be kept small to provide fast convergence. Therefore, the two levels of parallelism introduced in section 3.2 provide a way to do this efficiently, even with a large number of processors. In Table 4, we report the benefit of using this approach on the above-named 3D case. We keep 4 and 8 subdomains and we increase the total number of processors. As a result, almost all the steps in GPREMS get a noticeable speedup. Typically, with 4 subdomains, when only one processor is active per subdomain, the time to set up the block matrices is almost 203 s. and the time spent in the iterative loop is 622 s. Moving to 8 active processors per subdomain decreases these times to 59 s. and 181 s. respectively. The same observation can be made when using 8 subdomains, in which case the overall time drops from 540 s. to 238 s. Note that the overall time includes the first sequential step that permutes the matrix into block-diagonal form and distributes the block matrices to all active processors.
The last two columns of Table 4 give the intranode speedup and efficiency of GPREMS. These are different from the ones computed with one level of parallelism. Clearly, we want to show the benefit of adding more processors in a subdomain. Hence, if d is the number of subdomains, s_p = T_d/T_p and e_p = d · s_p/p, where T_p is the CPU time on p processors. When we add more processors in each subdomain, we observe a noticeable speedup; with 8 subdomains, for instance, the speedup moves from 1.18 to 2.28 when 1 and 8 processes are used within each subdomain. However, a look at the efficiency reveals a quite poor scalability. This is a general observation when the tests are done on multicore nodes, where the compute units share the same memory bandwidth; indeed, the speed of sparse matrix computations is determined by the size of this memory bandwidth rather than by the clock rate of the CPUs. Nevertheless, the total CPU time to get the solution always suggests using all the available processors/cores in the compute node; note that these processors in the node would be idle otherwise, as they are scheduled to access more memory.

Table 4: Benefits of the two levels of parallelism for various phases of GPREMS on CASE_017, with a restart at 64 and the MUMPS direct solver in subdomains
#D  #Proc.  Iter.  Init   Setup   Iter. loop  Time/Iter  Total   s_p   e_p
4   4       128    70.78  203.67  622.94      4.87       897.39  -     -
4   8       128    70.77  125.77  411.42      3.21       607.96  1.48  0.74
4   16      128    69.79  73.15   280.41      2.19       423.35  2.12  0.53
4   32      128    70.77  59.44   181.45      1.42       311.66  2.88  0.36
8   8       256    69.91  71.73   399.34      1.56       540.98  -     -
8   16      256    70.36  42.99   343.25      1.34       456.59  1.18  0.59
8   32      256    71.61  28.72   206.7       0.81       307.02  1.76  0.44
8   64      256    71.85  20.85   144.79      0.57       237.49  2.28  0.28
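The intranode columns of Table 4 follow the d-relative definitions s_p = T_d/T_p and e_p = d · s_p/p; a quick check against the d = 8 rows, with the totals hard-coded from the table:

```python
# Hedged check of Table 4's intranode columns: with the number of subdomains
# d fixed, the baseline is the run with one processor per subdomain (p = d).

d = 8
totals = {8: 540.98, 16: 456.59, 32: 307.02, 64: 237.49}  # Table 4, d = 8
s = {p: totals[d] / t for p, t in totals.items()}          # s_p = T_d / T_p
e = {p: d * s[p] / p for p in totals}                      # e_p = d * s_p / p
```

Rounded to two digits this gives the reported 1.18/0.59 on 16 processors and 2.28/0.28 on 64, showing the speedup growing while the efficiency decays on the shared-bandwidth node.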
4.5 Numerical robustness of GPREMS
In this section, our aim is to show the robustness of the GPREMS solver on CASE_004 listed in Table 1. The matrix of this 2D test case corresponds to the global Jacobian of the discrete steady Euler equations [20]. The two turbulent variables are kept during the differentiation, predicting a difficult convergence for any iterative method. The reciprocal condition number is estimated to be 3.27 · 10^-6. The matrix is permuted into 4 block diagonal matrices, producing local matrices with approximately 2,000 rows/columns and 120,000 nonzero entries. For these relatively small block matrices, we choose an incomplete local solver in subdomains when applying the M^-1 operator. We also allow 5 degrees of fill-in (ILU(5)) during the incomplete numerical factorization. Figure 13 gives the history of the preconditioned residual norm at each restart for 16 and 32 Krylov vectors. This norm drops to 10^-11 in less than 200 iterations for GPREMS(16) with ILU(5) in subdomains. When the size of the Krylov subspace is increased to 32, the residual norm reaches 10^-9 in less than 100 iterations. This is almost the same behaviour when a direct solver is used within the subdomains, in which case the overall number of iterations is 112 and 96, respectively, for 16 and 32 Krylov vectors.
Figure 13: The multiplicative Schwarz approach (GPREMS) compared to the restricted additive Schwarz (ASM) on CASE_004; 16 and 32 vectors in the Krylov subspace at each restart; 4 subdomains (block diagonal partitioning in GPREMS and ParMETIS for ASM); local systems solved with different accuracies (ILU(5) and LU). Curves (preconditioned residual norm vs. iterations): GPREMS(16)+ILU(5), GPREMS(32)+ILU(5), GPREMS(16)+LU, GPREMS(32)+LU, ASM+ILU(5)+GMRES(16), ASM+ILU(5)+GMRES(32), ASM+LU+GMRES(16), ASM+LU+GMRES(32)
Then we compare with the restricted additive Schwarz method used as a preconditioner for GMRES. The tests are done with the implementation provided in PETSc release 3.0.0-p2. Here, the Krylov basis is built with the modified Gram-Schmidt (MGS) method and the input matrix is partitioned into 4 subdomains using ParMETIS. In Figure 13, we again depict the preconditioned residual norm with respect to the number of iterations. The curves are given in magenta with the same line specifications as for GPREMS. The main observation is that the iterative method stagnates from the first restart when using ILU(5) in subdomains (the magenta plain and dashed curves). With a direct solver, it is necessary to form 32 Krylov vectors at each restart in order to achieve convergence (as shown by the magenta dash-dotted curve).
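The qualitative gap between the two preconditioners can be sketched in a few lines: additive Schwarz applies all local solves to the same residual and sums the corrections, while multiplicative Schwarz refreshes the residual after each local solve. The toy example below uses exact local solves on non-overlapping diagonal blocks (GPREMS and PETSc's ASM use overlap), so it only illustrates the ordering effect; for a block lower-triangular matrix, one multiplicative sweep is already exact while the additive correction is not.

```python
# Hedged sketch: one application of z = M^-1 b for both preconditioners on a
# 4x4 matrix split into two 2x2 subdomains. The diagonal blocks are diagonal
# matrices so each local solve stays one line; this is illustration only.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def solve_block(diag, rhs):
    """Exact local solve; blocks here are diagonal, so it is elementwise."""
    return [r / d for d, r in zip(diag, rhs)]

A = [[2.0, 0.0, 0.0, 0.0],
     [0.0, 2.0, 0.0, 0.0],
     [1.0, 0.0, 3.0, 0.0],
     [0.0, 1.0, 0.0, 3.0]]      # block lower triangular
b = [2.0, 4.0, 7.0, 10.0]
blocks = [(0, 2), (2, 4)]
diag = [A[i][i] for i in range(4)]

# Additive Schwarz: every subdomain sees the same residual b.
z_add = [0.0] * 4
for lo, hi in blocks:
    z_add[lo:hi] = solve_block(diag[lo:hi], b[lo:hi])

# Multiplicative Schwarz: the residual is refreshed after each local solve.
z_mul = [0.0] * 4
for lo, hi in blocks:
    r = [bi - axi for bi, axi in zip(b, matvec(A, z_mul))]
    z_mul[lo:hi] = solve_block(diag[lo:hi], r[lo:hi])
```

Here z_mul solves A z = b exactly after a single sweep, whereas z_add leaves a nonzero residual on the second block; this ordering effect is consistent with the convergence gap between GPREMS and ASM observed in Figure 13, although the real methods also differ in overlap and partitioning.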
5 Conclusion

In this paper, we give an implementation of a parallel solver for the solution of linear systems on distributed-memory computers. This implementation is based on a hybrid technique that combines a multiplicative Schwarz preconditioner with a GMRES accelerator. A very good efficiency is obtained using a Newton-basis GMRES version; hence we define two levels of parallelism during the computation of the orthonormal basis needed by GMRES: the first level is expressed through pipelined operations across all the subdomains; the second level uses parallelism inside third-party solvers to build the solution of the subsystems induced by the domain decomposition. The numerical experiments demonstrate the efficiency of this approach on a wide range of test cases. Indeed, with the first level of parallelism, we show a good efficiency when using either a direct solver or an incomplete LU factorization. For the second level, used within the subdomains, the gain obtained is two-fold: the convergence is guaranteed when the number of processors grows as the number of subdomains remains the same, and the global efficiency increases as we add more processors in subdomains. Moreover, on some difficult problems, the use of a direct solver in subdomains implies accessing the whole memory of an SMP node. Therefore, this approach keeps all the processors of the node busy and enables a good usage of the allocated computing resources. Finally, on some test cases we show that this solver can be competitive with respect to other overlapping domain decomposition preconditioners.

However, more work needs to be done to achieve very good scalability on massively parallel computers. Presently, an attempt to increase the number of subdomains up to 32 increases the number of iterations as well. The main reason is the form of the multiplicative Schwarz method, as each correction of the residual vector should go through all the subdomains. A first attempt is to use more than two levels of splitting, but the efficiency of such an approach is not always guaranteed, specifically if the linear system arises from non-elliptic partial differential equations. An ongoing work is to keep the two levels of splitting for the preconditioner operator and then to further accelerate the GMRES method with spectral information gathered during the iterative process.
References
[1] Amestoy, P. R., Duff, I. S., L'Excellent, J.-Y., and Koster, J. A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM Journal on Matrix Analysis and Applications 23, 1 (2001), 15-41.
[2] Amestoy, P. R., Duff, I. S., L'Excellent, J.-Y., and Li, X. S. Analysis and comparison of two general sparse solvers for distributed memory computers. ACM Transactions on Mathematical Software 27 (2000), 2001.
[3] Ashcraft, C., and Grimes, R. SPOOLES: An object-oriented sparse matrix library. In Proceedings of the 9th SIAM Conference on Parallel Processing for Scientific Computing (1999).
[4] Atenekeng-Kahou, G.-A. Parallélisation de GMRES préconditionné par une itération de Schwarz multiplicatif. PhD thesis, University of Rennes 1 and University of Yaounde 1, 2008.
[5] Atenekeng-Kahou, G.-A., Grigori, L., and Sosonkina, M. A partitioning algorithm for block-diagonal matrices with overlap. Parallel Computing 34, 6-8 (2008), 332-344.
[6] Atenekeng-Kahou, G.-A., Kamgnia, E., and Philippe, B. An explicit formulation of the multiplicative Schwarz preconditioner. Applied Numerical Mathematics 57, 11-12 (2007), 1197-1213.
[7] Bai, Z., Hu, D., and Reichel, L. A Newton basis GMRES implementation. IMA Journal of Numerical Analysis 14, 4 (1994), 563-581.
[8] Balay, S., Buschelman, K., Eijkhout, V., Gropp, W. D., Kaushik, D., Knepley, M. G., McInnes, L. C., Smith, B. F., and Zhang, H. PETSc users manual. Tech. Rep. ANL-95/11-Revision 3.0.0, Argonne National Laboratory, 2008.
[9] Balay, S., Buschelman, K., Gropp, W. D., Kaushik, D., Knepley, M. G., McInnes, L. C., Smith, B. F., and Zhang, H. PETSc Web page, 2009. http://www.mcs.anl.gov/petsc.
[10] Broquedis, F., Clet-Ortega, J., Moreaud, S., Furmento, N., Goglin, B., Mercier, G., Thibault, S., and Namyst, R. hwloc: a generic framework for managing hardware affinities in HPC applications. In Proceedings of the 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP2010), Pisa, Italia (2010).
[11] Buntinas, D., and Mercier, G. Data transfers between processes in an SMP system: performance study and application to MPI. In Proceedings of the 2006 International Conference on Parallel Processing (ICPP 2006), IEEE Computer Society (2006), pp. 487-496.
[12] Buntinas, D., Mercier, G., and Gropp, W. Implementation and evaluation of shared-memory communication and synchronization operations in MPICH2 using the Nemesis communication subsystem. Parallel Computing 33, 9 (2007), 634-644. Selected Papers from EuroPVM/MPI 2006.
[13] Cai, X.-C., and Saad, Y. Overlapping domain decomposition algorithms for general sparse matrices. Numerical Linear Algebra with Applications 3, 3 (1996), 221-237.
[14] Carvalho, L., Giraud, L., and Meurant, G. Local preconditioners for two-level non-overlapping domain decomposition methods. Numerical Linear Algebra with Applications 8, 4 (2001), 207-227.
[15] Davis, T. The University of Florida sparse matrix collection, 2010. Submitted to ACM Transactions on Mathematical Software.
[16] de Sturler, E. A parallel variant of GMRES(m). In Proceedings of the 13th IMACS World Congress on Computation and Applied Mathematics, Dublin, Ireland (1991), J. J. H. Miller and R. Vichnevetsky, Eds., Criterion Press.
[17] Demmel, J., Grigori, L., Hoemmen, M., and Langou, J. Communication-optimal parallel and sequential QR and LU factorizations. Tech. Rep. UCB/EECS-2008-89, University of California EECS, May 2008.
[18] Erhel, J. A parallel GMRES version for general sparse matrices. Electronic Transactions on Numerical Analysis 3 (1995), 160-176.
[19] Falgout, R., and Yang, U. hypre: a library of high performance preconditioners. In Lecture Notes in Computer Science (2002), vol. 2331, Springer-Verlag, pp. 632-641.
[20] FLUOREM. http://www.fluorem.com.
[21] Geoffray, P., Prylli, L., and Tourancheau, B. BIP-SMP: High performance message passing over a cluster of commodity SMPs. In Proceedings of the 1999 ACM/IEEE Conference on Supercomputing (1999).
[22] Giraud, L., Haidar, A., and Pralet, S. Using multiple levels of parallelism to enhance the performance of domain decomposition solvers. Parallel Computing, In Press (2010).
[23] Giraud, L., Haidar, A., and Watson, L. Parallel scalability study of hybrid preconditioners in three dimensions. Parallel Computing 34, 6-8 (2008), 363-379.
[24] Haidar, A. On the parallel scalability of hybrid linear solvers for large 3D problems. PhD thesis, Institut National Polytechnique de Toulouse, 2008.
[25] Hénon, P., Pellegrini, F., Ramet, P., Roman, J., and Saad, Y. High performance complete and incomplete factorizations for very large sparse systems by using Scotch and PaStiX softwares. In Eleventh SIAM Conference on Parallel Processing for Scientific Computing (2004).
[26] Hysom, D., and Pothen, A. A scalable parallel algorithm for incomplete factor preconditioning. SIAM Journal on Scientific Computing 22 (2000), 2194-2215.
[27] Mohiyuddin, M., Hoemmen, M., Demmel, J., and Yelick, K. Minimizing communication in sparse matrix solvers. In SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (New York, NY, USA, 2009), ACM, pp. 1-12.
[28] Nuentsa Wakam, D., Erhel, J., Canot, É., and Atenekeng Kahou, G.-A. A comparative study of some distributed linear solvers on systems arising from fluid dynamics simulations. In Parallel Computing: From Multicores and GPU's to Petascale (2010), vol. 19 of Advances in Parallel Computing, IOS Press, pp. 51-58.
[29] Philippe, B., and Reichel, L. On the generation of Krylov subspace bases. Tech. Rep. RR-7099, INRIA, 2009.
[30] Rabenseifner, R., Hager, G., and Jost, G. Hybrid MPI/OpenMP parallel programming on clusters of multi-core SMP nodes. In Parallel, Distributed, and Network-Based Processing, Euromicro Conference on (2009), 427-436.
[31] Reichel, L. Newton interpolation at Leja points. BIT 30 (1990), 305-331.
[32] Saad, Y. Iterative Methods for Sparse Linear Systems, second ed. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.
[33] Saad, Y., and Schultz, M. H. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing 7, 3 (1986), 856-869.
[34] Saad, Y., and Sosonkina, M. Distributed Schur complement techniques for general sparse linear systems. SIAM Journal on Scientific Computing 21, 4 (1999), 1337-1356.
[35] Sameh, A. Solving the linear least squares problem on a linear array of processors. In High Speed Computer and Algorithm Organization (1977), D. L. D. Kuck and A. Sameh, Eds., Academic Press, pp. 207-228.
[36] Schenk, O., and Gärtner, K. Solving unsymmetric sparse systems of linear equations with PARDISO. Journal of Future Generation Computer Systems 20 (2004), 475-487.
[37] Sidje, R. B. Alternatives for parallel Krylov subspace basis computation. Numerical Linear Algebra with Applications 4, 4 (1997), 305-331.
[38] Smith, B., Bjørstad, P., and Gropp, W. Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, 1996.
[39] Souopgui, I. Parallélisation d'une double récurrence dans GMRES. Master's thesis, University of Yaounde 1, 2005.
[40] Walker, H. F. Implementation of the GMRES method using Householder transformations. SIAM Journal on Scientific and Statistical Computing 9, 1 (1988), 152-163.