Toward High Performance Matrix Multiplication for Exact Computation
Pascal Giorgi
Joint work with Romain Lebreton (U. Waterloo). Funded by the French ANR project HPAC.
Séminaire CASYS - LJK, April 2014
Motivations
Matrix multiplication plays a central role in computer algebra.
algebraic complexity of O(n^ω) with ω < 2.3727 [Williams 2011]
Modern processors provide many levels of parallelism.
superscalar, SIMD units, multiple cores
High performance matrix multiplication
✓ numerical computing = classical algorithm + hardware arithmetic
✗ exact computing ≠ numerical computing:
- the algebraic algorithm is not the most efficient one (≠ complexity model)
- the arithmetic is not directly in the hardware (e.g. Z, F_q, Z[x], Q[x,y,z])
Motivation: superscalar processors with SIMD
Hierarchical memory:
- L1 cache: 32 kB, 4 cycles
- L2 cache: 256 kB, 12 cycles
- L3 cache: 8 MB, 36 cycles
- RAM: 32 GB, 36 cycles + 57 ns
Motivations: practical algorithms
High performance algorithms (rules of thumb)
- the best asymptotic complexity is not always faster: constants matter
- a better arithmetic count is not always faster: caches matter
- process multiple data at the same time: vectorization
- fine/coarse grain task parallelism matters: multicore parallelism
Our goal: try to incorporate these rules into exact matrix multiplication.
Outline
1 Matrix multiplication with small integers
2 Matrix multiplication with multi-precision integers
3 Matrix multiplication with polynomials
Matrix multiplication with small integers
This corresponds to the case where each entry of the result fits in one processor register:
A, B ∈ Z^{n×n} such that ||AB||_∞ < 2^s, where s is the register size.
Main interests
- ring isomorphism: computation over Z/pZ can be done in Z/2^sZ when n(p−1)² < 2^s
- it is a building block for matrix multiplication with larger integers
Matrix multiplication with small integers
Two possibilities for hardware support:
- use the floating-point mantissa, i.e. s = 53
- use native integers, i.e. s = 64
Using floating point
- historically the first approach in computer algebra [Dumas, Gautier, Pernet 2002]
✓ out-of-the-box performance from optimized BLAS
✗ handles matrices with entries < 2^26
Using native integers
✓ apply the same optimizations as BLAS libraries [Goto, Van De Geijn 2008]
✓ handles matrices with entries < 2^32
Matrix multiplication with small integers
# vector operations per cycle (pipelined):

                      SIMD              floating point   integers
Nehalem (2008)        SSE4, 128 bits    1 mul + 1 add    1 mul + 2 add
Sandy Bridge (2011)   AVX, 256 bits     1 mul + 1 add    --
Haswell (2013)        AVX2, 256 bits    2 FMA            1 mul + 2 add
[Figure: GFLOPS vs. matrix dimension (32 to 8192) for OpenBLAS double (SSE, AVX, AVX2) and our integer code (SSE, AVX2); benchmark on Intel i7-4960HQ @ 2.60GHz]
Matrix multiplication with small integers
Matrix multiplication modulo a small integer
Let p be such that n(p−1)² < 2^53:
1. perform the multiplication over Z using BLAS
2. reduce the result modulo p
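As a concrete sketch of these two steps (illustrative only: a plain triple loop stands in for the BLAS dgemm call, and `mul_mod_p` is a name chosen here, not a LinBox API), the computation stays exact because every intermediate value is below 2^53 and hence representable in a double:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Step 1: multiply over Z in double precision (exact while n*(p-1)^2 < 2^53).
// Step 2: reduce every entry modulo p.
// Matrices are stored row-major in a flat vector.
std::vector<double> mul_mod_p(const std::vector<double>& A,
                              const std::vector<double>& B,
                              std::size_t n, double p) {
    std::vector<double> C(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)          // stand-in for dgemm
        for (std::size_t l = 0; l < n; ++l)
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + l] * B[l * n + j];
    for (double& c : C) c = std::fmod(c, p);     // exact on integers < 2^53
    return C;
}
```

In the real code the first step is a single dgemm call on the whole matrix, and the reduction pass is vectorized.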
[Figure: GFLOPS vs. matrix dimension (32 to 8192) for OpenBLAS and our mod-p code (classic and Winograd variants); benchmark on Intel i7-4960HQ @ 2.60GHz]
Matrix multiplication with multi-precision integers
Direct approach
Let M(k) be the bit complexity of multiplying k-bit integers, and A, B ∈ Z^{n×n} such that ||A||_∞, ||B||_∞ ∈ O(2^k).
Computing AB with the direct algorithm costs n^ω M(k) bit operations.
✗ not the best possible complexity, since M(k) is super-linear
✗ not efficient in practice

Remark: use an evaluation/interpolation technique for better performance!
Multi-modular matrix multiplication
Multi-modular approach
Assume $\|AB\|_\infty < M = \prod_{i=1}^{k} m_i$, with primes m_i ∈ O(1); then AB can be reconstructed with the CRT from the (AB) mod m_i:
1. for each m_i compute A_i = A mod m_i and B_i = B mod m_i
2. for each m_i compute C_i = A_i·B_i mod m_i
3. reconstruct C = AB from (C_1, …, C_k)

Bit complexity: O(n^ω k + n²R(k)), where R(k) is the cost of reduction/reconstruction:
- R(k) = O(M(k) log(k)) using a divide-and-conquer strategy
- R(k) = O(k²) using the naive approach
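The three CRT steps above can be sketched as follows (a toy version with tiny moduli; names are illustrative, naive modular products stand in for the BLAS-based ones, and the modular inverse is found by brute force):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using u64 = std::uint64_t;

// Naive modular inverse by exhaustive search (fine for the tiny moduli here).
static u64 inv_mod(u64 a, u64 m) {
    for (u64 x = 1; x < m; ++x)
        if (a * x % m == 1) return x;
    return 0; // not invertible (cannot happen for coprime inputs)
}

// Multi-modular product of n x n matrices (row-major, flat storage):
// reduce A and B mod each m_i, multiply mod m_i, then lift every
// entry back with the CRT. Requires ||AB||_inf < prod(m_i).
std::vector<u64> crt_matmul(const std::vector<u64>& A,
                            const std::vector<u64>& B,
                            std::size_t n, const std::vector<u64>& m) {
    std::size_t k = m.size();
    u64 M = 1;
    for (u64 mi : m) M *= mi;
    // Steps 1-2: C_i = (A mod m_i) * (B mod m_i) mod m_i.
    std::vector<std::vector<u64>> C(k, std::vector<u64>(n * n, 0));
    for (std::size_t t = 0; t < k; ++t)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t l = 0; l < n; ++l)
                for (std::size_t j = 0; j < n; ++j)
                    C[t][i*n+j] = (C[t][i*n+j] +
                        (A[i*n+l] % m[t]) * (B[l*n+j] % m[t])) % m[t];
    // Step 3: CRT reconstruction, entry by entry.
    std::vector<u64> R(n * n, 0);
    for (std::size_t t = 0; t < k; ++t) {
        u64 Mi = M / m[t];
        u64 ci = inv_mod(Mi % m[t], m[t]);  // |M_i^{-1}|_{m_i}
        for (std::size_t e = 0; e < n * n; ++e)
            R[e] = (R[e] + C[t][e] * Mi % M * ci) % M;
    }
    return R;
}
```

In the real code steps 1-3 are the dgemm-based reductions/reconstructions described on the next slides.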
Multi-modular matrix multiplication
Improving the naive approach with linear algebra
- the reduction/reconstruction of n² entries corresponds to a matrix multiplication
✓ improves the bit complexity from O(n²k²) to O(n²k^{ω−1})
✓ benefits from optimized matrix multiplication, i.e. SIMD
Remark: a similar approach has been used by [Doliskani, Schost 2010] in a non-distributed code.
Multi-modular reductions of an integer matrix
Let us assume M = ∏_{i=1}^k m_i < β^k with m_i < β.

Multi-reduction of a single entry
Let a = a_0 + a_1β + … + a_{k−1}β^{k−1} be the value to reduce modulo the m_i; then
$$\begin{pmatrix} |a|_{m_1} \\ \vdots \\ |a|_{m_k} \end{pmatrix} = \begin{pmatrix} 1 & |\beta|_{m_1} & \cdots & |\beta^{k-1}|_{m_1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & |\beta|_{m_k} & \cdots & |\beta^{k-1}|_{m_k} \end{pmatrix} \times \begin{pmatrix} a_0 \\ \vdots \\ a_{k-1} \end{pmatrix} - Q \times \begin{pmatrix} m_1 \\ \vdots \\ m_k \end{pmatrix}$$
with ||Q||_∞ < kβ².

Lemma: if kβ² ∈ O(1), then the reduction of n² integers modulo the m_i costs O(n²k^{ω−1}) + O(n²k) bit operations.
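A scalar sketch of this identity (the name `multi_reduce` is illustrative; a real implementation evaluates the inner products of the k x k matrix with one dgemm call and applies the correction term Q afterwards):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using u64 = std::uint64_t;

// Reduce a batch of beta-adic expanded integers modulo several m_i at once:
// one (k x k) constant matrix times a (k x N) digit matrix, as on the slide.
// digits[j][e] is the j-th beta-adic digit of entry e.
std::vector<std::vector<u64>> multi_reduce(
        const std::vector<std::vector<u64>>& digits,  // k x N
        u64 beta, const std::vector<u64>& m) {
    std::size_t k = digits.size(), N = digits[0].size();
    std::size_t nm = m.size();
    std::vector<std::vector<u64>> res(nm, std::vector<u64>(N, 0));
    for (std::size_t i = 0; i < nm; ++i) {
        // Walk the row (1, |beta|_{m_i}, ..., |beta^{k-1}|_{m_i}).
        u64 pw = 1;
        for (std::size_t j = 0; j < k; ++j) {
            for (std::size_t e = 0; e < N; ++e)
                res[i][e] += digits[j][e] * pw;   // accumulates < k*beta^2
            pw = pw * beta % m[i];
        }
        // One final reduction plays the role of the correction term Q.
        for (std::size_t e = 0; e < N; ++e) res[i][e] %= m[i];
    }
    return res;
}
```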
Multi-modular reconstruction of an integer matrix
Let us assume M = ∏_{i=1}^k m_i < β^k with m_i < β, and let M_i = M/m_i.

CRT formula: $a = \left(\sum_{i=1}^{k} |a|_{m_i} \cdot M_i\,|M_i^{-1}|_{m_i}\right) \bmod M$

Reconstruction of a single entry
Let $M_i|M_i^{-1}|_{m_i} = \alpha_0^{(i)} + \alpha_1^{(i)}\beta + \dots + \alpha_{k-1}^{(i)}\beta^{k-1}$ be the β-adic expansions of the CRT constants; then
$$\begin{pmatrix} a_0 \\ \vdots \\ a_{k-1} \end{pmatrix} = \begin{pmatrix} \alpha_0^{(1)} & \cdots & \alpha_0^{(k)} \\ \vdots & \ddots & \vdots \\ \alpha_{k-1}^{(1)} & \cdots & \alpha_{k-1}^{(k)} \end{pmatrix} \times \begin{pmatrix} |a|_{m_1} \\ \vdots \\ |a|_{m_k} \end{pmatrix}$$
with a_i < kβ² and a = (a_0 + … + a_{k−1}β^{k−1}) mod M the CRT solution.

Lemma: if kβ² ∈ O(1), then the reconstruction of n² integers from their images modulo the m_i costs O(n²k^{ω−1}) + Õ(n²k) bit operations.
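A scalar sketch of this reconstruction (illustrative names; in practice the digit matrix of the CRT constants is precomputed once and the pseudo-digits of all n² entries are obtained with one dgemm call):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using u64 = std::uint64_t;

static u64 inv_mod(u64 a, u64 m) {   // tiny moduli: brute force is enough
    for (u64 x = 1; x < m; ++x)
        if (a * x % m == 1) return x;
    return 0;
}

// Rebuild a from its residues r_i via the beta-adic digit matrix of the
// CRT constants M_i * |M_i^{-1}|_{m_i}, then evaluate the pseudo-digit
// expansion modulo M. k is the number of base-beta digits (beta^k > M).
u64 crt_rebuild(const std::vector<u64>& r, const std::vector<u64>& m,
                u64 beta, std::size_t k) {
    u64 M = 1;
    for (u64 mi : m) M *= mi;
    std::size_t nm = m.size();
    // alpha[i][j] = j-th beta-adic digit of M_i * |M_i^{-1}|_{m_i}.
    std::vector<std::vector<u64>> alpha(nm, std::vector<u64>(k, 0));
    for (std::size_t i = 0; i < nm; ++i) {
        u64 c = (M / m[i]) * inv_mod((M / m[i]) % m[i], m[i]);
        for (std::size_t j = 0; j < k; ++j) { alpha[i][j] = c % beta; c /= beta; }
    }
    // Pseudo-digits a_j = sum_i alpha[i][j] * r_i (each < k*beta^2),
    // then evaluate sum_j a_j beta^j modulo M.
    u64 a = 0, pw = 1;
    for (std::size_t j = 0; j < k; ++j) {
        u64 aj = 0;
        for (std::size_t i = 0; i < nm; ++i) aj += alpha[i][j] * r[i];
        a = (a + aj % M * pw) % M;
        pw = pw * beta % M;
    }
    return a;
}
```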
Matrix multiplication with multi-precision integers
Implementation of the multi-modular approach
- choose β = 2^16 to optimize β-adic conversions
- choose the m_i such that nβm_i < 2^53 and use BLAS dgemm
- use a linear storage for multi-modular matrices

We compare sequential performance with:
- the FLINT library¹: uses divide and conquer
- the Mathemagix library²: uses divide and conquer
- Doliskani's code³: uses dgemm for reductions only

1. www.flintlib.org  2. www.mathemagix.org  3. courtesy of J. Doliskani
Matrix multiplication with multi-precision integers
[Figure: integer matrix multiplication, matrix dim = 32; time in seconds vs. entry bitsize (2^4 to 2^16) for FLINT, Mathemagix, Doliskani's code and our code; benchmark on Intel Xeon-2620 @ 2.0GHz]
Matrix multiplication with multi-precision integers
[Figure: integer matrix multiplication, matrix dim = 128; time in seconds vs. entry bitsize (2^4 to 2^16) for FLINT, Mathemagix, Doliskani's code and our code; benchmark on Intel Xeon-2620 @ 2.0GHz]
Matrix multiplication with multi-precision integers
[Figure: integer matrix multiplication, matrix dim = 512; time in seconds vs. entry bitsize (2^4 to 2^16) for FLINT, Mathemagix, Doliskani's code and our code; benchmark on Intel Xeon-2620 @ 2.0GHz]
Matrix multiplication with multi-precision integers
[Figure: integer matrix multiplication, matrix dim = 1024; time in seconds vs. entry bitsize (2^4 to 2^16) for FLINT, Mathemagix, Doliskani's code and our code; benchmark on Intel Xeon-2620 @ 2.0GHz]
Parallel multi-modular matrix multiplication
1. for i = 1…k compute A_i = A mod m_i and B_i = B mod m_i
2. for i = 1…k compute C_i = A_i·B_i mod m_i
3. reconstruct C = AB from (C_1, …, C_k)
Parallelization of multi-modular reduction/reconstruction
each thread reduces (resp. reconstructs) a chunk of the given matrix
[Diagram: the reductions A_0 = A mod m_0, …, A_6 = A mod m_6 are split into chunks over thread 0 … thread 3]
Parallel multi-modular matrix multiplication
1. for i = 1…k compute A_i = A mod m_i and B_i = B mod m_i
2. for i = 1…k compute C_i = A_i·B_i mod m_i
3. reconstruct C = AB from (C_1, …, C_k)
Parallelization of modular multiplication
- each thread computes a bunch of the matrix multiplications mod m_i
[Diagram: the products C_0 = A_0B_0 mod m_0, …, C_6 = A_6B_6 mod m_6 are distributed over thread 0 … thread 3]
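This splitting can be sketched with std::thread in place of the OpenMP tasks used in the actual code (illustrative names; a naive modular product stands in for the BLAS-based one):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

using u64 = std::uint64_t;

// Each thread handles a bunch of moduli (round-robin split):
// C_i = A_i * B_i mod m_i, all matrices n x n, row-major flat storage.
void parallel_modmul(const std::vector<std::vector<u64>>& A,
                     const std::vector<std::vector<u64>>& B,
                     std::vector<std::vector<u64>>& C,
                     const std::vector<u64>& m, std::size_t n,
                     unsigned nthreads) {
    std::vector<std::thread> pool;
    std::size_t k = m.size();
    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back([&, t] {
            for (std::size_t i = t; i < k; i += nthreads)     // this thread's moduli
                for (std::size_t r = 0; r < n; ++r)
                    for (std::size_t l = 0; l < n; ++l)
                        for (std::size_t c = 0; c < n; ++c)
                            C[i][r*n+c] = (C[i][r*n+c] +
                                A[i][r*n+l] * B[i][l*n+c]) % m[i];
        });
    for (auto& th : pool) th.join();   // barrier before CRT reconstruction
}
```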
Parallel Matrix multiplication with multi-precision integers
- based on OpenMP tasks
- CPU affinity (hwloc-bind), custom allocator (tcmalloc)
- still in progress: a better memory strategy
[Figure: speedup of parallel integer matrix multiplication vs. number of cores (2, 6, 8, 12) for (dim = 1024, bitsize = 1024), (dim = 1024, bitsize = 4096) and (dim = 256, bitsize = 8192); benchmark on Intel Xeon-2620 @ 2.0GHz (2 NUMA nodes with 6 cores each)]
Matrix multiplication over F_p[x]
We consider the "easiest" case:
A, B ∈ F_p[x]^{n×n} such that deg(AB) < k = 2^t
- p is a Fourier prime, i.e. p = 2^t·q + 1
- p is such that n(p−1)² < 2^53

Complexity
O(n^ω k + n²k log(k)) operations in F_p using evaluation/interpolation with FFT

Remark: using a Vandermonde matrix one can get an approach similar to the integer case, i.e. O(n^ω k + n²k^{ω−1})
Matrix multiplication over F_p[x]
Evaluation/interpolation scheme
Let θ be a primitive k-th root of unity in F_p.
1. for i = 1…k compute A_i = A(θ^{i−1}) and B_i = B(θ^{i−1})
2. for i = 1…k compute C_i = A_i·B_i over F_p
3. interpolate C = AB from (C_1, …, C_k)

- steps 1 and 3: O(n²) calls to FFT_k over F_p
- step 2: k matrix multiplications modulo a small prime p
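For n = 1 the scheme is ordinary polynomial multiplication, which can be sketched as below (illustrative names; a naive O(k²) DFT stands in for FFT_k, and the caller supplies θ, θ^{-1} and k^{-1} mod p):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using u64 = std::uint64_t;

// Naive length-k DFT at a k-th root of unity theta mod p
// (stand-in for the O(k log k) FFT used in practice).
std::vector<u64> dft(const std::vector<u64>& f, u64 theta, u64 p) {
    std::size_t k = f.size();
    std::vector<u64> out(k, 0);
    u64 ti = 1;                                  // theta^i
    for (std::size_t i = 0; i < k; ++i) {
        u64 x = 1;                               // (theta^i)^j
        for (std::size_t j = 0; j < k; ++j) {
            out[i] = (out[i] + f[j] * x) % p;
            x = x * ti % p;
        }
        ti = ti * theta % p;
    }
    return out;
}

// Multiply f, g in F_p[x] with deg(fg) < k: evaluate at the k-th roots
// of unity, multiply pointwise, interpolate (inverse DFT at theta^{-1},
// scaled by k^{-1} mod p).
std::vector<u64> polmul(std::vector<u64> f, std::vector<u64> g,
                        std::size_t k, u64 theta, u64 theta_inv,
                        u64 k_inv, u64 p) {
    f.resize(k, 0); g.resize(k, 0);
    std::vector<u64> F = dft(f, theta, p), G = dft(g, theta, p);
    for (std::size_t i = 0; i < k; ++i) F[i] = F[i] * G[i] % p;
    std::vector<u64> h = dft(F, theta_inv, p);
    for (u64& c : h) c = c * k_inv % p;
    return h;
}
```

For n > 1 step 2 becomes k matrix products over F_p, which is where the BLAS-based modular multiplication of the first part is reused.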
FFT with SIMD over F_p
Butterfly operation modulo p: compute X + Y mod p and (X − Y)θ^{2i} mod p.
Barrett's modular multiplication with a constant (as in NTL); computing in [0, 2p) removes two conditionals [Harvey 2014].
Let X, Y ∈ [0, 2p), W ∈ [0, p), p < β/4 and W′ = ⌊Wβ/p⌋.

Algorithm: Butterfly(X, Y, W, W′, p)
1: X′ := X + Y mod 2p
2: T := X − Y + 2p
3: Q := ⌊W′T/β⌋ (1 high short product)
4: Y′ := (WT − Qp) mod β (2 low short products)
5: return (X′, Y′)

✓ SSE/AVX provide 16- and 32-bit low short products
✗ no high short product available (use a full product)
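A possible scalar C++ rendering of this butterfly with β = 2^32 (an illustrative sketch, not the SIMD code; `butterfly` is a name chosen here):

```cpp
#include <cassert>
#include <cstdint>

using u32 = std::uint32_t;
using u64 = std::uint64_t;

// Butterfly in [0, 2p) with Barrett multiplication by the constant W,
// beta = 2^32, Wp = floor(W * 2^32 / p), p < 2^30.
void butterfly(u32& X, u32& Y, u32 W, u32 Wp, u32 p) {
    u32 Xp = X + Y;                  // X' = X + Y mod 2p
    if (Xp >= 2 * p) Xp -= 2 * p;
    u32 T = X - Y + 2 * p;           // T in [0, 4p)
    u32 Q = (u64(Wp) * T) >> 32;     // high short product
    u32 Yp = W * T - Q * p;          // two low short products, taken mod 2^32;
    X = Xp; Y = Yp;                  // Y' already lies in [0, 2p)
}
```

The 32-bit wrap-around of `W * T` and `Q * p` is exactly the "mod β" of the algorithm; on the SIMD side the missing high short product is what forces a full product.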
Matrix multiplication over F_p[x]
Implementation
- radix-4 FFT with 128-bit SSE (29-bit primes)
- BLAS-based matrix multiplication over F_p [FFLAS-FFPACK library]
[Figure: polynomial matrix multiplication performance; time in seconds for FLINT, Mathemagix (MMX) and our code on (n, deg) pairs ranging from (16, 1024) to (2048, 32); benchmark on Intel Xeon-2620 @ 2.0GHz]
Matrix multiplication over Z[x]
A, B ∈ Z[x]^{n×n} such that deg(AB) < d and ||(AB)_i||_∞ < 2^k

Complexity
- Õ(n^ω d log(d) log(k)) bit operations using Kronecker substitution
- O(n^ω d log(k) + n²d log(d) log(k)) bit operations using CRT + FFT

Remark: if the result's degree and bitsize are not too large, CRT with Fourier primes may suffice.
Matrix multiplication over Z[x]
Implementation
- use CRT with Fourier primes
- re-use multi-modular reduction/reconstruction with linear algebra
- re-use multiplication in F_p[x]
[Figure: time in seconds for FLINT and our code, for (n = 16, deg = 8192) with log(k) ∈ {64, 128, 256, 512} and (n = 128, deg = 512) with log(k) ∈ {64, 128, 256}; benchmark on Intel Xeon-2620 @ 2.0GHz]
Parallel matrix multiplication over Z[x]
Very first attempt (work still in progress)
- parallel CRT with linear algebra (same code as in the Z case)
- perform each multiplication over F_p[x] in parallel
- some parts of the code are still sequential

  n     d      log(k)   6 cores   12 cores   seq. time
  64    1024   600      ×3.52     ×4.88      61.1s
  32    4096   600      ×3.68     ×5.02      64.4s
  32    2048   1024     ×3.95     ×5.73      54.5s
  128   128    1024     ×3.76     ×5.55      53.9s
Polynomial Matrix in LinBox (proposal)
Generic handler class for Polynomial Matrix
template<size_t type, size_t storage, class Field>
class PolynomialMatrix;

Specialization for different memory strategies:

// Matrix of polynomials
template<class Field>
class PolynomialMatrix<PMType::polfirst, PMStorage::plain, Field>;

// Polynomial of matrices
template<class Field>
class PolynomialMatrix<PMType::matfirst, PMStorage::plain, Field>;

// Polynomial of matrices (partial view on monomials)
template<class Field>
class PolynomialMatrix<PMType::matfirst, PMStorage::view, Field>;
Conclusion
High performance tools for exact linear algebra:
- matrix multiplication through floating point
- multi-dimensional CRT
- FFT for polynomials over word-size prime fields
- adaptive matrix representation

We provide in the LinBox library (www.linalg.org):
- efficient sequential/parallel matrix multiplication over Z
- efficient sequential matrix multiplication over F_p[x] and Z[x]