Solving large sparse linear systems with a variable s-step GMRES preconditioned by DD

(1)

HAL Id: hal-01528636

https://hal.inria.fr/hal-01528636

Submitted on 23 Nov 2017

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Solving large sparse linear systems with a variable s-step GMRES preconditioned by DD

David Imberti, Jocelyne Erhel

To cite this version:

David Imberti, Jocelyne Erhel. Solving large sparse linear systems with a variable s-step GMRES

preconditioned by DD. DD24 - International Conference on Domain Decomposition Methods, Feb

2017, Longyearbyen, Norway. �hal-01528636�

(2)

FibGMRES

DI & JE

&DNW

GMRES DGMRES and AGMRES VGMRES

Solving large sparse linear systems with a variable s-step GMRES preconditioned by DD

David Imberti and Jocelyne Erhel

Joint work with D´ esir´ e Nuentsa Wakam (first part)

FLUMINANCE team, Inria Rennes, France

DD24

Longyearbyen - Svalbard, Norway, February 2017

1 / 27

(3)

FibGMRES

DI & JE

&DNW

Numerical simulations

Solving large sparse linear systems is at the heart of many numerical simulations

Simulation of the velocity field on the Greenland inlandsis, with Elmer/Ice model Image : Fabien Gillet-Chaulet, CNRS and LGGE, Grenoble.

More information on the websiteIntersticesfor the general public [”modeliser-et-simuler-la-fonte-des-calottes-polaires”, Nodet and JE, 2015]

2 / 27

(4)

FibGMRES

DI & JE

&DNW

Outline

1

Krylov subspace algorithm GMRES

2

m-step preconditioned and deflated GMRES

3

Variable s-step algorithm VGMRES(m,s)

3 / 27

(5)

FibGMRES

DI & JE

&DNW

Preconditioned GMRES

Ax=b, A∈R^n×n x,b∈Rⁿ B=AM⁻¹ x₀∈Rⁿ r₀=b−Ax₀

GMRES(m): a Krylov subspace method

[Saad and Schultz 1986, Meurant’s book 1999, Saad’s book 2003, Simoncini and Szyld 2007, JE 2011, ...]

K_m(B,r₀) =span{r₀,Br₀, . . .B^m−1r₀}

Findxm∈x₀+M⁻¹K_m(B,r₀) such thatkr_mk₂=kb−Bxmk₂= min_x∈x

0 +M−1Km(B,r0 )kb−Bxk₂

Building blocks of GMRES(m)

Build an orthonormal basisV_k+1of the Krylov subspaceK_k+1 get the Arnoldi-like relationBV_k=V_k+1H_kfork= 1, . . . ,m Minimize the residual in the Krylov subspace

x=x₀+M⁻¹V_kyimpliesr=r₀−BV_ky=V_k+1(βe₁−H_ky) Solve the least-squares problem: min_y∈R_kkβe₁−H_kyk Restart if not converged

x₀=x₀+M⁻¹V_my_m

4 / 27

(6)

FibGMRES

DI & JE

&DNW

Preconditioned GMRES

Ax=b, A∈R^n×n x,b∈Rⁿ B=AM⁻¹ x₀∈Rⁿ r₀=b−Ax₀

GMRES(m): a Krylov subspace method

[Saad and Schultz 1986, Meurant’s book 1999, Saad’s book 2003, Simoncini and Szyld 2007, JE 2011, ...]

K_m(B,r₀) =span{r₀,Br₀, . . .B^m−1r₀}

Findxm∈x₀+M⁻¹K_m(B,r₀) such thatkr_mk₂=kb−Bxmk₂= min_x∈x

0 +M−1Km(B,r0 )kb−Bxk₂

Building blocks of GMRES(m)

Build an orthonormal basisV_k+1of the Krylov subspaceK_k+1 get the Arnoldi-like relationBV_k=V_k+1H_kfork= 1, . . . ,m Minimize the residual in the Krylov subspace

x=x₀+M⁻¹V_kyimpliesr=r₀−BV_ky=V_k+1(βe₁−H_ky) Solve the least-squares problem: min_y∈R_kkβe₁−H_kyk Restart if not converged

x₀=x₀+M⁻¹V_my_m

4 / 27

(7)

FibGMRES

DI & JE

&DNW

GMRES ... practical issues

Arnoldi process

1:

v₁=r₀/kr₀k₂

2:

fork= 1,mdo

3:

p=Bv_k

4:

fori= 1 :kdo

5:

h_ik=v_i^Tp

6:

p=p−h_ikv_i

7:

end for

8:

h_k+1,k=||p||₂

9:

v_k+1=p/h_k+1,k

10:

end for

⇓

BV_m=V_m+1H_m

Granularity issues in parallel algorithms

⇒Communication-avoiding strategies

Generate the basis vectors[Reichel 1990, Bai et al 1994]

Orthogonalize the basis[JE 1995, Sidje 1997, Demmel et al 2011]

Compute the basis block by block[De Sturler 1994, Hoemmen 2010]

Preconditioning issues

⇒multilevel methods to deal with large systems Schwarz preconditioning[Atenekeng Kahou et al 2007, Dufaud+Tromeur-Dervout 2010, Giraud+Haidar 2009,..]

Filtering and Schur complement[Li et al 2003, Grigori et al 2011]

Multilevel parallelism[Nuentsa Wakam et al 2011, Giraud et al 2010, ...]

Complexity and stagnation issues with restarted GMRES(m)

⇒deflation to recover possible loss of information

Deflation by preconditioning[JE et al 1996, Burrage et al 1998, Baglama et al 1998, ...]

Deflation by augmented basis[Morgan 1995, Morgan 2002,...]

Strategy

Combine ’communication-avoiding’ GMRES ... and Deflation ... and Domain Decomposition preconditioners [Nuentsa Wakam et al 2013, Nuentsa Wakam and Pacull 2013, Nuentsa Wakam and JE 2013]

5 / 27

(8)

FibGMRES

DI & JE

&DNW

GMRES ... practical issues

Arnoldi process

1:

v₁=r₀/kr₀k₂

2:

fork= 1,mdo

3:

p=Bv_k

4:

fori= 1 :kdo

5:

h_ik=v_i^Tp

6:

p=p−h_ikv_i

7:

end for

8:

h_k+1,k=||p||₂

9:

v_k+1=p/h_k+1,k

10:

end for

⇓

BV_m=V_m+1H_m

Strategy

5 / 27

(9)

FibGMRES

DI & JE

&DNW

GMRES ... practical issues

Arnoldi process

1:

v₁=r₀/kr₀k₂

2:

fork= 1,mdo

3:

p=Bv_k

4:

fori= 1 :kdo

5:

h_ik=v_i^Tp

6:

p=p−h_ikv_i

7:

end for

8:

h_k+1,k=||p||₂

9:

v_k+1=p/h_k+1,k

10:

end for

⇓

BV_m=V_m+1H_m

Strategy

5 / 27

(10)

FibGMRES

DI & JE

&DNW

GMRES ... practical issues

Arnoldi process

1:

v₁=r₀/kr₀k₂

2:

fork= 1,mdo

3:

p=Bv_k

4:

fori= 1 :kdo

5:

h_ik=v_i^Tp

6:

p=p−h_ikv_i

7:

end for

8:

h_k+1,k=||p||₂

9:

v_k+1=p/h_k+1,k

10:

end for

⇓

BV_m=V_m+1H_m

Strategy

5 / 27

(11)

FibGMRES

DI & JE

&DNW

GMRES ... practical issues

Arnoldi process

1:

v₁=r₀/kr₀k₂

2:

fork= 1,mdo

3:

p=Bv_k

4:

fori= 1 :kdo

5:

h_ik=v_i^Tp

6:

p=p−h_ikv_i

7:

end for

8:

h_k+1,k=||p||₂

9:

v_k+1=p/h_k+1,k

10:

end for

⇓

BV_m=V_m+1H_m

Strategy

5 / 27

(12)

FibGMRES

DI & JE

&DNW

Outline

1

Krylov subspace algorithm GMRES

2

m-step preconditioned and deflated GMRES

3

Variable s-step algorithm VGMRES(m,s)

6 / 27

(13)

FibGMRES

DI & JE

&DNW

m-step GMRES(m)

Building blocks of m-step GMRES(m)

Build a basisW_m+1of the Krylov subspaceK_m+1such thatBWm=W_m+1T_m+1 Build an orthonormal basisV_m+1ofK_m+1such thatW_m+1=V_m+1R_m+1 get the Arnoldi-like relationsBWm=Vm+1HmandBVm=Vm+1HmR_m⁻¹ [Nuentsa Wakam and JE 2013, Hoemmen 2010]

Minimize the residual in the Krylov subspace

x=x₀+M⁻¹Vmyimpliesr=r₀−BVmy=V_m+1(βe₁−HmR⁻¹_m y) Solve the least-squares problem:

min

y∈Rmkβe₁−H_mR_m⁻¹yk Alternative to minimize the residual in the Krylov subspace [Imberti and JE 2016]

x=x₀+M⁻¹W_myimpliesr=r₀−BW_my=V_m+1(βe₁−H_my) Solve the least-squares problem:

min

y∈Rmkβe₁−H_myk

Computation of vectorsw_k Monomial basis:w_k+1=Bw_k

Newton basis:σ_k+1w_k+1= (B−α_k+1I)w_k

where the shiftsα_kare computed during a preliminary cycle of GMRES(m) and scaling is done withσ_k

[Reichel 1990, Bai et al 1994, JE 1995, Nuentsa Wakam and JE 2013]

Chebyshev basis[Joubert+Carey 1992, Philippe+Reichel 2012]

Stability and parallelism issues

the condition number ofW_mincreases withm Parallel performances increases withm

8 / 27

(20)

FibGMRES

DI & JE

&DNW

GMRES combined with Domain Decomposition

Main steps with Domain Decomposition preconditioning

Partition the weighted graph of the matrix in parallel with PARMETIS.

Redistribute the matrix and right-hand-side according to the PARMETIS partitioning.

Perform a parallel iterative row and column scaling on the matrix and the right-hand side vector [Amestoy et al 2008].

Define the overlap between the submatrices for the additive Schwarz preconditioner [Cai and Sarkis 1999, Efstathiou and Gander 2003]

M_RAS⁻¹ = D X

k=1

(R_k⁰)^T(A^δ_k)⁻¹R^δ_k

Setup the submatrices (ILU or LU factorization).

Solve iteratively the preconditioned system using GMRES.

9 / 27

(21)

FibGMRES

DI & JE

&DNW

Deflation strategies

Restarted GMRES(m)

The convergence rate depends on the spectral distribution inB Smallest eigenvalues slow down the convergence

Deflation occurs when the Krylov subspace is large enough With restarting : loss of spectral information, risk of stalling

Accelerating the restarted GMRES[Simoncini and Szyld, 2007]

Approximate the smallest eigenvalues and the associated invariant subspaceUr Explicit deflation technique

[JE et al 1996; Burrage et al 1998; Moriya et al 2000, Nuentsa Wakam et al 2013 ]:

BM¯⁻¹˜x=b

withM¯⁻¹= (In+Ur(|λ_n|T⁻¹−Ir)U_r^TandT=U_r^TBUr Augmented techniques

[Morgan 2000, 2002, Giraud et al 2010, Nuentsa Wakam and JE 2013]

x_m∈x₀+K_m(B,r₀)+span{U_r}

10 / 27

(22)

FibGMRES

DI & JE

&DNW

Deflation strategies

Restarted GMRES(m)

The convergence rate depends on the spectral distribution inB Smallest eigenvalues slow down the convergence

Deflation occurs when the Krylov subspace is large enough With restarting : loss of spectral information, risk of stalling

Accelerating the restarted GMRES[Simoncini and Szyld, 2007]

Approximate the smallest eigenvalues and the associated invariant subspaceUr Explicit deflation technique

[JE et al 1996; Burrage et al 1998; Moriya et al 2000, Nuentsa Wakam et al 2013 ]:

BM¯⁻¹˜x=b withM¯⁻¹= (In+Ur(|λ_n|T⁻¹−Ir)U_r^TandT=U_r^TBUr Augmented techniques

[Morgan 2000, 2002, Giraud et al 2010, Nuentsa Wakam and JE 2013]

x_m∈x₀+K_m(B,r₀)+span{U_r}

10 / 27

(23)

FibGMRES

DI & JE

&DNW

Experiments with CFD matrices

FLUOREM matrices

in MatrixMarket collection large, sparse, nonsymmetric matrices

linearization of Navier-Stokes: symmetric profile with structured blocks [Nuentsa Wakam+Pacull 2013]

RM07R n= 381,689; nnz=37,464,962

Software

Krylov solvers integrated in Petsc [Nuentsa Wakam 2011]

Schwarz preconditioning combined with GMRES, Newton basis and deflation DGMRES: preconditioning deflation AGMRES: augmented subspace

11 / 27

(24)

FibGMRES

DI & JE

&DNW

Experiments with CFD matrices

FLUOREM matrices

RM07R n= 381,689; nnz=37,464,962

Software

11 / 27

(25)

FibGMRES

DI & JE

&DNW

Experiments with CFD matrices

FLUOREM matrices

RM07R n= 381,689; nnz=37,464,962

Software

11 / 27

(26)

FibGMRES

DI & JE

&DNW

Convergence with Augmented GMRES (AGMRES)

RM07R: effect of the restarting

1e-10 1e-08 1e-06 0.0001 0.01 1

0 100 200 300 400 500 600 700 800 900

Rel. Res. Norm

Number of Iterations RM07R-AGMRES(m,2)-GMRES(m)

GMRES(32) AGMRES(32,2) GMRES(48) AGMRES(48,2)

RM07R: effect of the number of subdomains

1e-10 1e-08 1e-06 0.0001 0.01 1

0 200 400 600 800 1000

Rel. Res. Norm

Number of Iterations RM07R-GMRES(m)-AGMRES(m,2)-D32-64

GMRES(32)-D32 AGMRES(32,2)-D32 GMRES(32)-D64 AGMRES(32,2)-D64

CPU time on parallel computers

RM07R, n = 381,689, nz = 37,464,962

D GMRES(32) AGMRES(32,r)

ITS Time (s) ITS Time (s)

16 254 379.3 169 224.1

32 886 573.4 212 91.41

64 - - 287 62.39

12 / 27

(27)

FibGMRES

DI & JE

&DNW

Parallel CPU Time with AGMRES

Convection-Diffusion test cases

3DCONSKY 121 : size = 1,771,561; nonzeros = 50,178,241 3DCOSKY 161 : size= 4,173,281; nonzeros = 118,645,121 [Nuentsa Wakam and JE 2013]

3DCONSKY 121

16 32 64 128 256

0 5 10 15 20 25 30 35 40

Subdomains

CPU Time

3DCONSKY_121 GMRES(16) AGMRES(16,1)

3DCONSKY 161

16 32 64 96 128 256

0 20 40 60 80 100 120 140 160 180 200

Subdomains

CPU Time

3DCONSKY_161 GMRES(16) AGMRES(16,1)

13 / 27

(28)

FibGMRES

DI & JE

&DNW

Outline

1

Krylov subspace algorithm GMRES

2

m-step preconditioned and deflated GMRES

3

Variable s-step algorithm VGMRES(m,s)

14 / 27

(29)

FibGMRES

DI & JE

&DNW

Variable s-step VGMRES(m,s)

Variablesstep GMRES: VGMRES(m,s)

Variable block sizes_jand Krylov sizel₁=s₁,l_j=lj−1+s_j,j≥2 [Imberti and JE 2016]

Build a basisW_lj of the Krylov subspaceK_ljfor 1≤j≤J Build an orthonormal basisV_lj₊₁of the Krylov subspaceK_lj₊₁ get the Arnoldi-like relationBW_lj=V_lj₊₁H_lj

x=x₀+M⁻¹W_ljyimpliesr=r₀−BW_ljy=V_lj₊₁(βe₁−H_ljy) Solve the least-squares problem:

min

y∈Rljkβe₁−H_ljyk

Test convergence at each stepj

Restart if not converged at stepJwithl_J=m

By induction, get the Arnoldi-like relation

h v₁,BW_lji

=V_lj₊₁R_lj₊₁,BW_lj=V_lj₊₁H_lj

16 / 27

(35)

FibGMRES

DI & JE

&DNW

Block computation and orthogonalization with VGMRES(m,s)

Stepjof VGMRES(m,s)

h v₁,BW_lji

16 / 27

(36)

FibGMRES

DI & JE

&DNW

Block computation and orthogonalization with VGMRES(m,s)

Stepjof VGMRES(m,s)

h v₁,BW_lji

16 / 27

(37)

FibGMRES

DI & JE

&DNW

Block computation and orthogonalization with VGMRES(m,s)

Stepjof VGMRES(m,s)

h v₁,BW_lji

16 / 27

(38)

FibGMRES

DI & JE

&DNW

Stability issues

Condition number ofBC_jandBW_lj block basis

W_lj=

C₁,C₂, . . . ,C_j κ(BW_lj)≥ max

1≤k≤jκ(BC_k) symmetric case: exponential growth of the condition number of a Krylov basis [Beckermann 2000]

nonsymmetric case with|λ₁|>|λ₂| monomial basisC_j={u,Bu, . . . ,B^sj⁻¹u}

[Imberti and JE 2016]

κ(BC_j)≥cste|λ₁/λ₂|^sj⁻¹

Choice of the block sizes_j

Objective: small condition numbers of the first blocks fixed sequences_j=s: SGMRES(m,s) [Hoemmen 2010, ...]

adaptive increasing sequences_jwiths_j≤s: VGMRES(m,s) [Imberti and JE 2016]

17 / 27

(39)

FibGMRES

DI & JE

&DNW

Stability issues

Condition number ofBC_jandBW_lj block basis

W_lj=

C₁,C₂, . . . ,C_j κ(BW_lj)≥ max

1≤k≤jκ(BC_k) symmetric case: exponential growth of the condition number of a Krylov basis [Beckermann 2000]

nonsymmetric case with|λ₁|>|λ₂| monomial basisC_j={u,Bu, . . . ,B^sj⁻¹u}

[Imberti and JE 2016]

κ(BC_j)≥cste|λ₁/λ₂|^sj⁻¹

Choice of the block sizes_j

Objective: small condition numbers of the first blocks fixed sequences_j=s: SGMRES(m,s) [Hoemmen 2010, ...]

adaptive increasing sequences_jwiths_j≤s: VGMRES(m,s) [Imberti and JE 2016]

17 / 27

(40)

FibGMRES

DI & JE

&DNW

Communication issues

The number of messages is related to the number of stepsJ Objective: reduce the number of steps

Number of steps

SGMRES(m,s):J=m/ssteps

VGMRES(m,s):Jsteps withl_J=mandJ=J₁+J₂ FibGMRES(m,s):Fibonacci sequence capped ats

J₂=O(m/s),J₁=O(log_φ(s))

18 / 27

(41)

FibGMRES

DI & JE

&DNW

Communication issues

The number of messages is related to the number of stepsJ Objective: reduce the number of steps

Number of steps

SGMRES(m,s):J=m/ssteps

VGMRES(m,s):Jsteps withl_J=mandJ=J₁+J₂ FibGMRES(m,s):Fibonacci sequence capped ats

J₂=O(m/s),J₁=O(log_φ(s))

18 / 27

(42)

FibGMRES

DI & JE

&DNW

FibGMRES(m,s) and Reverse FibGMRES(m,s)

Sequences with m = 48 and s = 16

Sequences of FibGMRES(m,s) with m = 96

j 1 2 3 4 5 6 7 8 9 10

s

j

1 2 3 5 8 13 16 16 16 16 l

j

1 3 6 11 19 32 48 64 80 96

Variable increasing block size for m = 96 and s = 16

j 1 2 3 4 5 6 7 8 9

s

FibGMRES

DI & JE

&DNW

Conclusion

DGMRES(m,r) and AGMRES(m,r)

Combined DD preconditioning with m-step GMRES and deflation

Parallel efficiency and fast convergence with a large number of subdomains Difficult choice of the restarting parameter m

VGMRES(m,s) and FibGMRES(m,s)

s -step algorithm combined with a variable block size s

j

Relationship between convergence rate and condition numbers of the blocks

Numerical experiments with a Fibonacci sequence

Future work

Adaptive variation of s

Blocks computed via a Newton basis Schwarz preconditioning with deflation Parallel computations

27 / 27

(51)

FibGMRES

DI & JE

&DNW

Conclusion

DGMRES(m,r) and AGMRES(m,r)

Combined DD preconditioning with m-step GMRES and deflation

Parallel efficiency and fast convergence with a large number of subdomains Difficult choice of the restarting parameter m

VGMRES(m,s) and FibGMRES(m,s)

s -step algorithm combined with a variable block size s

j

Relationship between convergence rate and condition numbers of the blocks

Numerical experiments with a Fibonacci sequence

Future work

Adaptive variation of s

Blocks computed via a Newton basis Schwarz preconditioning with deflation Parallel computations

27 / 27

(52)

FibGMRES

DI & JE

&DNW

Conclusion

DGMRES(m,r) and AGMRES(m,r)

Combined DD preconditioning with m-step GMRES and deflation

Parallel efficiency and fast convergence with a large number of subdomains Difficult choice of the restarting parameter m

VGMRES(m,s) and FibGMRES(m,s)

s -step algorithm combined with a variable block size s

j

Relationship between convergence rate and condition numbers of the blocks

Numerical experiments with a Fibonacci sequence

Future work

Adaptive variation of s

Blocks computed via a Newton basis Schwarz preconditioning with deflation Parallel computations

27 / 27