Toward High Performance Matrix Multiplication for Exact Computation
Pascal Giorgi
Joint work with Romain Lebreton (U. Waterloo). Funded by the French ANR project HPAC.
Séminaire CASYS - LJK, April 2014
Motivations
Matrix multiplication plays a central role in computer algebra.
algebraic complexity of O(n^ω) with ω < 2.3727 [Williams 2011]
Modern processors provide many levels of parallelism.
superscalar, SIMD units, multiple cores
High performance matrix multiplication
✓ numerical computing = classical algorithm + hardware arithmetic
✗ exact computing ≠ numerical computing:
- the algebraic algorithm is not the most efficient one (≠ complexity model)
- the arithmetic is not directly in the hardware (e.g. Z, F_q, Z[x], Q[x,y,z])
Motivation: superscalar processors with SIMD
Hierarchical memory:
- L1 cache: 32 kB, 4 cycles
- L2 cache: 256 kB, 12 cycles
- L3 cache: 8 MB, 36 cycles
- RAM: 32 GB, 36 cycles + 57 ns
Motivations: practical algorithms
High performance algorithms (rules of thumb)
- the best asymptotic complexity is not always faster: constants matter
- a better arithmetic count is not always faster: caches matter
- process multiple data at the same time: vectorization
- fine/coarse grain task parallelism matters: multicore parallelism
Our goal: try to incorporate these rules into exact matrix multiplication.
Outline
1 Matrix multiplication with small integers
2 Matrix multiplication with multi-precision integers
3 Matrix multiplication with polynomials
Matrix multiplication with small integers
This corresponds to the case where each entry of the result fits in one processor register:
A, B ∈ Z^{n×n} such that ||AB||_∞ < 2^s, where s is the register size.
Main interests
- ring isomorphism: computation over Z/pZ can be done in Z/2^sZ when n(p−1)² < 2^s
- it is a building block for matrix multiplication with larger integers
Matrix multiplication with small integers
Two possibilities for hardware support:
- use the floating-point mantissa, i.e. s = 53
- use native integers, i.e. s = 64
Using floating point
- historically the first approach in computer algebra [Dumas, Gautier, Pernet 2002]
✓ out-of-the-box performance from optimized BLAS
✗ handles matrices with entries < 2^26
Using native integers
✓ apply the same optimizations as BLAS libraries [Goto, Van De Geijn 2008]
✓ handles matrices with entries < 2^32
Matrix multiplication with small integers
# vector operations per cycle (pipelined):

                      SIMD              floating point   integers
Nehalem (2008)        SSE4, 128 bits    1 mul + 1 add    1 mul + 2 add
Sandy Bridge (2011)   AVX, 256 bits     1 mul + 1 add    --
Haswell (2013)        AVX2, 256 bits    2 FMA            1 mul + 2 add
[Figure: GFLOPS vs. matrix dimension (32 to 8192) for OpenBLAS double (SSE, AVX, AVX2) and our integer code (SSE, AVX2); benchmark on Intel i7-4960HQ @ 2.60GHz]
Matrix multiplication with small integers
Matrix multiplication modulo a small integer
Let p be such that n(p−1)² < 2^53:
1. perform the multiplication over Z using BLAS
2. reduce the result modulo p
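As a concrete sketch of these two steps (illustrative only: a plain triple loop stands in for the BLAS dgemm call, and `mul_mod_p` is a name chosen here, not a LinBox API), the computation stays exact because every intermediate value is below 2^53 and hence representable in a double:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Step 1: multiply over Z in double precision (exact while n*(p-1)^2 < 2^53).
// Step 2: reduce every entry modulo p.
// Matrices are stored row-major in a flat vector.
std::vector<double> mul_mod_p(const std::vector<double>& A,
                              const std::vector<double>& B,
                              std::size_t n, double p) {
    std::vector<double> C(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)          // stand-in for dgemm
        for (std::size_t l = 0; l < n; ++l)
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + l] * B[l * n + j];
    for (double& c : C) c = std::fmod(c, p);     // exact on integers < 2^53
    return C;
}
```

In the real code the first step is a single dgemm call on the whole matrix, and the reduction pass is vectorized.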
[Figure: GFLOPS vs. matrix dimension (32 to 8192) for OpenBLAS and our mod-p code (classic and Winograd variants); benchmark on Intel i7-4960HQ @ 2.60GHz]
Matrix multiplication with multi-precision integers
Direct approach
Let M(k) be the bit complexity of multiplying k-bit integers, and A, B ∈ Z^{n×n} such that ||A||_∞, ||B||_∞ ∈ O(2^k).
Computing AB with the direct algorithm costs n^ω M(k) bit operations.
✗ not the best possible complexity, since M(k) is super-linear
✗ not efficient in practice

Remark: use an evaluation/interpolation technique for better performance!
Multi-modular matrix multiplication
Multi-modular approach
Assume $\|AB\|_\infty < M = \prod_{i=1}^{k} m_i$, with primes m_i ∈ O(1); then AB can be reconstructed with the CRT from the (AB) mod m_i:
1. for each m_i compute A_i = A mod m_i and B_i = B mod m_i
2. for each m_i compute C_i = A_i·B_i mod m_i
3. reconstruct C = AB from (C_1, …, C_k)

Bit complexity: O(n^ω k + n²R(k)), where R(k) is the cost of reduction/reconstruction:
- R(k) = O(M(k) log(k)) using a divide-and-conquer strategy
- R(k) = O(k²) using the naive approach
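The three CRT steps above can be sketched as follows (a toy version with tiny moduli; names are illustrative, naive modular products stand in for the BLAS-based ones, and the modular inverse is found by brute force):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using u64 = std::uint64_t;

// Naive modular inverse by exhaustive search (fine for the tiny moduli here).
static u64 inv_mod(u64 a, u64 m) {
    for (u64 x = 1; x < m; ++x)
        if (a * x % m == 1) return x;
    return 0; // not invertible (cannot happen for coprime inputs)
}

// Multi-modular product of n x n matrices (row-major, flat storage):
// reduce A and B mod each m_i, multiply mod m_i, then lift every
// entry back with the CRT. Requires ||AB||_inf < prod(m_i).
std::vector<u64> crt_matmul(const std::vector<u64>& A,
                            const std::vector<u64>& B,
                            std::size_t n, const std::vector<u64>& m) {
    std::size_t k = m.size();
    u64 M = 1;
    for (u64 mi : m) M *= mi;
    // Steps 1-2: C_i = (A mod m_i) * (B mod m_i) mod m_i.
    std::vector<std::vector<u64>> C(k, std::vector<u64>(n * n, 0));
    for (std::size_t t = 0; t < k; ++t)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t l = 0; l < n; ++l)
                for (std::size_t j = 0; j < n; ++j)
                    C[t][i*n+j] = (C[t][i*n+j] +
                        (A[i*n+l] % m[t]) * (B[l*n+j] % m[t])) % m[t];
    // Step 3: CRT reconstruction, entry by entry.
    std::vector<u64> R(n * n, 0);
    for (std::size_t t = 0; t < k; ++t) {
        u64 Mi = M / m[t];
        u64 ci = inv_mod(Mi % m[t], m[t]);  // |M_i^{-1}|_{m_i}
        for (std::size_t e = 0; e < n * n; ++e)
            R[e] = (R[e] + C[t][e] * Mi % M * ci) % M;
    }
    return R;
}
```

In the real code steps 1-3 are the dgemm-based reductions/reconstructions described on the next slides.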
Multi-modular matrix multiplication
Improving the naive approach with linear algebra
- the reduction/reconstruction of n² entries corresponds to a matrix multiplication
✓ improves the bit complexity from O(n²k²) to O(n²k^{ω−1})
✓ benefits from optimized matrix multiplication, i.e. SIMD
Remark: a similar approach has been used by [Doliskani, Schost 2010] in a non-distributed code.
Multi-modular reductions of an integer matrix
Let us assume M = ∏_{i=1}^k m_i < β^k with m_i < β.

Multi-reduction of a single entry
Let a = a_0 + a_1β + … + a_{k−1}β^{k−1} be the value to reduce modulo the m_i; then
$$\begin{pmatrix} |a|_{m_1} \\ \vdots \\ |a|_{m_k} \end{pmatrix} = \begin{pmatrix} 1 & |\beta|_{m_1} & \cdots & |\beta^{k-1}|_{m_1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & |\beta|_{m_k} & \cdots & |\beta^{k-1}|_{m_k} \end{pmatrix} \times \begin{pmatrix} a_0 \\ \vdots \\ a_{k-1} \end{pmatrix} - Q \times \begin{pmatrix} m_1 \\ \vdots \\ m_k \end{pmatrix}$$
with ||Q||_∞ < kβ².

Lemma: if kβ² ∈ O(1), then the reduction of n² integers modulo the m_i costs O(n²k^{ω−1}) + O(n²k) bit operations.
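A scalar sketch of this identity (the name `multi_reduce` is illustrative; a real implementation evaluates the inner products of the k x k matrix with one dgemm call and applies the correction term Q afterwards):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using u64 = std::uint64_t;

// Reduce a batch of beta-adic expanded integers modulo several m_i at once:
// one (k x k) constant matrix times a (k x N) digit matrix, as on the slide.
// digits[j][e] is the j-th beta-adic digit of entry e.
std::vector<std::vector<u64>> multi_reduce(
        const std::vector<std::vector<u64>>& digits,  // k x N
        u64 beta, const std::vector<u64>& m) {
    std::size_t k = digits.size(), N = digits[0].size();
    std::size_t nm = m.size();
    std::vector<std::vector<u64>> res(nm, std::vector<u64>(N, 0));
    for (std::size_t i = 0; i < nm; ++i) {
        // Walk the row (1, |beta|_{m_i}, ..., |beta^{k-1}|_{m_i}).
        u64 pw = 1;
        for (std::size_t j = 0; j < k; ++j) {
            for (std::size_t e = 0; e < N; ++e)
                res[i][e] += digits[j][e] * pw;   // accumulates < k*beta^2
            pw = pw * beta % m[i];
        }
        // One final reduction plays the role of the correction term Q.
        for (std::size_t e = 0; e < N; ++e) res[i][e] %= m[i];
    }
    return res;
}
```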
Multi-modular reconstruction of an integer matrix
Let us assume M = ∏_{i=1}^k m_i < β^k with m_i < β, and let M_i = M/m_i.

CRT formula: $a = \left(\sum_{i=1}^{k} |a|_{m_i} \cdot M_i\,|M_i^{-1}|_{m_i}\right) \bmod M$

Reconstruction of a single entry
Let $M_i|M_i^{-1}|_{m_i} = \alpha_0^{(i)} + \alpha_1^{(i)}\beta + \dots + \alpha_{k-1}^{(i)}\beta^{k-1}$ be the β-adic expansions of the CRT constants; then
$$\begin{pmatrix} a_0 \\ \vdots \\ a_{k-1} \end{pmatrix} = \begin{pmatrix} \alpha_0^{(1)} & \cdots & \alpha_0^{(k)} \\ \vdots & \ddots & \vdots \\ \alpha_{k-1}^{(1)} & \cdots & \alpha_{k-1}^{(k)} \end{pmatrix} \times \begin{pmatrix} |a|_{m_1} \\ \vdots \\ |a|_{m_k} \end{pmatrix}$$
with a_i < kβ² and a = (a_0 + … + a_{k−1}β^{k−1}) mod M the CRT solution.

Lemma: if kβ² ∈ O(1), then the reconstruction of n² integers from their images modulo the m_i costs O(n²k^{ω−1}) + Õ(n²k) bit operations.
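A scalar sketch of this reconstruction (illustrative names; in practice the digit matrix of the CRT constants is precomputed once and the pseudo-digits of all n² entries are obtained with one dgemm call):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using u64 = std::uint64_t;

static u64 inv_mod(u64 a, u64 m) {   // tiny moduli: brute force is enough
    for (u64 x = 1; x < m; ++x)
        if (a * x % m == 1) return x;
    return 0;
}

// Rebuild a from its residues r_i via the beta-adic digit matrix of the
// CRT constants M_i * |M_i^{-1}|_{m_i}, then evaluate the pseudo-digit
// expansion modulo M. k is the number of base-beta digits (beta^k > M).
u64 crt_rebuild(const std::vector<u64>& r, const std::vector<u64>& m,
                u64 beta, std::size_t k) {
    u64 M = 1;
    for (u64 mi : m) M *= mi;
    std::size_t nm = m.size();
    // alpha[i][j] = j-th beta-adic digit of M_i * |M_i^{-1}|_{m_i}.
    std::vector<std::vector<u64>> alpha(nm, std::vector<u64>(k, 0));
    for (std::size_t i = 0; i < nm; ++i) {
        u64 c = (M / m[i]) * inv_mod((M / m[i]) % m[i], m[i]);
        for (std::size_t j = 0; j < k; ++j) { alpha[i][j] = c % beta; c /= beta; }
    }
    // Pseudo-digits a_j = sum_i alpha[i][j] * r_i (each < k*beta^2),
    // then evaluate sum_j a_j beta^j modulo M.
    u64 a = 0, pw = 1;
    for (std::size_t j = 0; j < k; ++j) {
        u64 aj = 0;
        for (std::size_t i = 0; i < nm; ++i) aj += alpha[i][j] * r[i];
        a = (a + aj % M * pw) % M;
        pw = pw * beta % M;
    }
    return a;
}
```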
Matrix multiplication with multi-precision integers
Implementation of the multi-modular approach
- choose β = 2^16 to optimize β-adic conversions
- choose the m_i such that nβm_i < 2^53 and use BLAS dgemm
- use a linear storage for multi-modular matrices

We compare sequential performance with:
- the FLINT library¹: uses divide and conquer
- the Mathemagix library²: uses divide and conquer
- Doliskani's code³: uses dgemm for reductions only

1. www.flintlib.org  2. www.mathemagix.org  3. courtesy of J. Doliskani
Matrix multiplication with multi-precision integers
[Figure: integer matrix multiplication, matrix dim = 32; time in seconds vs. entry bitsize (2^4 to 2^16) for FLINT, Mathemagix, Doliskani's code and our code; benchmark on Intel Xeon-2620 @ 2.0GHz]
Matrix multiplication with multi-precision integers
[Figure: integer matrix multiplication, matrix dim = 128; time in seconds vs. entry bitsize (2^4 to 2^16) for FLINT, Mathemagix, Doliskani's code and our code; benchmark on Intel Xeon-2620 @ 2.0GHz]
Matrix multiplication with multi-precision integers
[Figure: integer matrix multiplication, matrix dim = 512; time in seconds vs. entry bitsize (2^4 to 2^16) for FLINT, Mathemagix, Doliskani's code and our code; benchmark on Intel Xeon-2620 @ 2.0GHz]
Matrix multiplication with multi-precision integers
[Figure: integer matrix multiplication, matrix dim = 1024; time in seconds vs. entry bitsize (2^4 to 2^16) for FLINT, Mathemagix, Doliskani's code and our code; benchmark on Intel Xeon-2620 @ 2.0GHz]
Parallel multi-modular matrix multiplication
1. for i = 1…k compute A_i = A mod m_i and B_i = B mod m_i
2. for i = 1…k compute C_i = A_i·B_i mod m_i
3. reconstruct C = AB from (C_1, …, C_k)
Parallelization of multi-modular reduction/reconstruction
each thread reduces (resp. reconstructs) a chunk of the given matrix
[Diagram: the reductions A_0 = A mod m_0, …, A_6 = A mod m_6 are split into chunks over thread 0 … thread 3]
Parallel multi-modular matrix multiplication
1. for i = 1…k compute A_i = A mod m_i and B_i = B mod m_i
2. for i = 1…k compute C_i = A_i·B_i mod m_i
3. reconstruct C = AB from (C_1, …, C_k)
Parallelization of modular multiplication
- each thread computes a bunch of the matrix multiplications mod m_i
[Diagram: the products C_0 = A_0B_0 mod m_0, …, C_6 = A_6B_6 mod m_6 are distributed over thread 0 … thread 3]
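This splitting can be sketched with std::thread in place of the OpenMP tasks used in the actual code (illustrative names; a naive modular product stands in for the BLAS-based one):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

using u64 = std::uint64_t;

// Each thread handles a bunch of moduli (round-robin split):
// C_i = A_i * B_i mod m_i, all matrices n x n, row-major flat storage.
void parallel_modmul(const std::vector<std::vector<u64>>& A,
                     const std::vector<std::vector<u64>>& B,
                     std::vector<std::vector<u64>>& C,
                     const std::vector<u64>& m, std::size_t n,
                     unsigned nthreads) {
    std::vector<std::thread> pool;
    std::size_t k = m.size();
    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back([&, t] {
            for (std::size_t i = t; i < k; i += nthreads)     // this thread's moduli
                for (std::size_t r = 0; r < n; ++r)
                    for (std::size_t l = 0; l < n; ++l)
                        for (std::size_t c = 0; c < n; ++c)
                            C[i][r*n+c] = (C[i][r*n+c] +
                                A[i][r*n+l] * B[i][l*n+c]) % m[i];
        });
    for (auto& th : pool) th.join();   // barrier before CRT reconstruction
}
```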
Parallel Matrix multiplication with multi-precision integers
- based on OpenMP tasks
- CPU affinity (hwloc-bind), custom allocator (tcmalloc)
- still in progress: a better memory strategy
[Figure: speedup of parallel integer matrix multiplication vs. number of cores (2, 6, 8, 12) for (dim = 1024, bitsize = 1024), (dim = 1024, bitsize = 4096) and (dim = 256, bitsize = 8192); benchmark on Intel Xeon-2620 @ 2.0GHz (2 NUMA nodes with 6 cores each)]
Matrix multiplication over F_p[x]
We consider the "easiest" case:
A, B ∈ F_p[x]^{n×n} such that deg(AB) < k = 2^t
- p is a Fourier prime, i.e. p = 2^t·q + 1
- p is such that n(p−1)² < 2^53

Complexity
O(n^ω k + n²k log(k)) operations in F_p using evaluation/interpolation with FFT

Remark: using a Vandermonde matrix one can get an approach similar to the integer case, i.e. O(n^ω k + n²k^{ω−1})
Matrix multiplication over F_p[x]
Evaluation/interpolation scheme
Let θ be a primitive k-th root of unity in F_p.
1. for i = 1…k compute A_i = A(θ^{i−1}) and B_i = B(θ^{i−1})
2. for i = 1…k compute C_i = A_i·B_i over F_p
3. interpolate C = AB from (C_1, …, C_k)

- steps 1 and 3: O(n²) calls to FFT_k over F_p
- step 2: k matrix multiplications modulo a small prime p
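For n = 1 the scheme is ordinary polynomial multiplication, which can be sketched as below (illustrative names; a naive O(k²) DFT stands in for FFT_k, and the caller supplies θ, θ^{-1} and k^{-1} mod p):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using u64 = std::uint64_t;

// Naive length-k DFT at a k-th root of unity theta mod p
// (stand-in for the O(k log k) FFT used in practice).
std::vector<u64> dft(const std::vector<u64>& f, u64 theta, u64 p) {
    std::size_t k = f.size();
    std::vector<u64> out(k, 0);
    u64 ti = 1;                                  // theta^i
    for (std::size_t i = 0; i < k; ++i) {
        u64 x = 1;                               // (theta^i)^j
        for (std::size_t j = 0; j < k; ++j) {
            out[i] = (out[i] + f[j] * x) % p;
            x = x * ti % p;
        }
        ti = ti * theta % p;
    }
    return out;
}

// Multiply f, g in F_p[x] with deg(fg) < k: evaluate at the k-th roots
// of unity, multiply pointwise, interpolate (inverse DFT at theta^{-1},
// scaled by k^{-1} mod p).
std::vector<u64> polmul(std::vector<u64> f, std::vector<u64> g,
                        std::size_t k, u64 theta, u64 theta_inv,
                        u64 k_inv, u64 p) {
    f.resize(k, 0); g.resize(k, 0);
    std::vector<u64> F = dft(f, theta, p), G = dft(g, theta, p);
    for (std::size_t i = 0; i < k; ++i) F[i] = F[i] * G[i] % p;
    std::vector<u64> h = dft(F, theta_inv, p);
    for (u64& c : h) c = c * k_inv % p;
    return h;
}
```

For n > 1 step 2 becomes k matrix products over F_p, which is where the BLAS-based modular multiplication of the first part is reused.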
FFT with SIMD over F_p
Butterfly operation modulo p: compute X + Y mod p and (X − Y)θ^{2i} mod p.
Barrett's modular multiplication with a constant (as in NTL); computing in [0, 2p) removes two conditionals [Harvey 2014].
Let X, Y ∈ [0, 2p), W ∈ [0, p), p < β/4 and W′ = ⌊Wβ/p⌋.

Algorithm: Butterfly(X, Y, W, W′, p)
1: X′ := X + Y mod 2p
2: T := X − Y + 2p
3: Q := ⌊W′T/β⌋ (1 high short product)
4: Y′ := (WT − Qp) mod β (2 low short products)
5: return (X′, Y′)

✓ SSE/AVX provide 16- and 32-bit low short products
✗ no high short product available (use a full product)
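A possible scalar C++ rendering of this butterfly with β = 2^32 (an illustrative sketch, not the SIMD code; `butterfly` is a name chosen here):

```cpp
#include <cassert>
#include <cstdint>

using u32 = std::uint32_t;
using u64 = std::uint64_t;

// Butterfly in [0, 2p) with Barrett multiplication by the constant W,
// beta = 2^32, Wp = floor(W * 2^32 / p), p < 2^30.
void butterfly(u32& X, u32& Y, u32 W, u32 Wp, u32 p) {
    u32 Xp = X + Y;                  // X' = X + Y mod 2p
    if (Xp >= 2 * p) Xp -= 2 * p;
    u32 T = X - Y + 2 * p;           // T in [0, 4p)
    u32 Q = (u64(Wp) * T) >> 32;     // high short product
    u32 Yp = W * T - Q * p;          // two low short products, taken mod 2^32;
    X = Xp; Y = Yp;                  // Y' already lies in [0, 2p)
}
```

The 32-bit wrap-around of `W * T` and `Q * p` is exactly the "mod β" of the algorithm; on the SIMD side the missing high short product is what forces a full product.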
Matrix multiplication over F_p[x]
Implementation
- radix-4 FFT with 128-bit SSE (29-bit primes)
- BLAS-based matrix multiplication over F_p [FFLAS-FFPACK library]
[Figure: polynomial matrix multiplication performance; time in seconds for FLINT, Mathemagix (MMX) and our code on (n, deg) pairs ranging from (16, 1024) to (2048, 32); benchmark on Intel Xeon-2620 @ 2.0GHz]
Matrix multiplication over Z[x]
A, B ∈ Z[x]^{n×n} such that deg(AB) < d and ||(AB)_i||_∞ < 2^k

Complexity
- Õ(n^ω d log(d) log(k)) bit operations using Kronecker substitution
- O(n^ω d log(k) + n²d log(d) log(k)) bit operations using CRT + FFT

Remark: if the result's degree and bitsize are not too large, CRT with Fourier primes may suffice.
Matrix multiplication over Z[x]
Implementation
- use CRT with Fourier primes
- re-use multi-modular reduction/reconstruction with linear algebra
- re-use multiplication in F_p[x]
[Figure: time in seconds for FLINT and our code, for (n = 16, deg = 8192) with log(k) ∈ {64, 128, 256, 512} and (n = 128, deg = 512) with log(k) ∈ {64, 128, 256}; benchmark on Intel Xeon-2620 @ 2.0GHz]
Parallel matrix multiplication over Z[x]
Very first attempt (work still in progress)
- parallel CRT with linear algebra (same code as in the Z case)
- perform each multiplication over F_p[x] in parallel
- some parts of the code are still sequential

  n     d      log(k)   6 cores   12 cores   seq. time
  64    1024   600      ×3.52     ×4.88      61.1s
  32    4096   600      ×3.68     ×5.02      64.4s
  32    2048   1024     ×3.95     ×5.73      54.5s
  128   128    1024     ×3.76     ×5.55      53.9s
Polynomial Matrix in LinBox (proposal)
Generic handler class for Polynomial Matrix
template<size_t type, size_t storage, class Field>
class PolynomialMatrix;

Specialization for different memory strategies:

// Matrix of polynomials
template<class Field>
class PolynomialMatrix<PMType::polfirst, PMStorage::plain, Field>;

// Polynomial of matrices
template<class Field>
class PolynomialMatrix<PMType::matfirst, PMStorage::plain, Field>;

// Polynomial of matrices (partial view on monomials)
template<class Field>
class PolynomialMatrix<PMType::matfirst, PMStorage::view, Field>;
Conclusion
High performance tools for exact linear algebra:
- matrix multiplication through floating point
- multi-dimensional CRT
- FFT for polynomials over word-size prime fields
- adaptive matrix representation

We provide in the LinBox library (www.linalg.org):
- efficient sequential/parallel matrix multiplication over Z
- efficient sequential matrix multiplication over F_p[x] and Z[x]