
Toward High Performance Matrix Multiplication for Exact Computation

Pascal Giorgi

Joint work with Romain Lebreton (U. Waterloo)
Funded by the French ANR project HPAC

Séminaire CASYS - LJK, April 2014

Motivations

Matrix multiplication plays a central role in computer algebra:
algebraic complexity of O(n^ω) with ω < 2.3727 [Williams 2011].

Modern processors provide many levels of parallelism:
superscalar execution, SIMD units, multiple cores.

High performance matrix multiplication
✓ numerical computing = classic algorithm + hardware arithmetic
✗ exact computing ≠ numerical computing:
the best algebraic algorithm is not the most efficient one (≠ complexity model);
the arithmetic is not directly in the hardware (e.g. Z, F_q, Z[x], Q[x,y,z]).

Motivation: superscalar processor with SIMD

Hierarchical memory:
L1 cache: 32 kB - 4 cycles
L2 cache: 256 kB - 12 cycles
L3 cache: 8 MB - 36 cycles
RAM: 32 GB - 36 cycles + 57 ns

Motivations: practical algorithms

High performance algorithms (rules of thumb):
the best asymptotic complexity is not always faster: constants matter;
a better arithmetic count is not always faster: caches matter;
process multiple data at the same time: vectorization;
fine/coarse grain task parallelism matters: multicore parallelism.

Our goal: try to incorporate these rules into exact matrix multiplication.


Outline

1 Matrix multiplication with small integers

2 Matrix multiplication with multi-precision integers

3 Matrix multiplication with polynomials


Matrix multiplication with small integers

This corresponds to the case where each entry of the result holds in one processor register:
A, B ∈ Z^{n×n} such that ||AB|| < 2^s, where s is the register size.

Main interests:
ring isomorphism: computation over Z/pZ is congruent to Z/2^s Z when n(p−1)^2 < 2^s;
it is a building block for matrix multiplication with larger integers.

Matrix multiplication with small integers

Two possibilities for hardware support:
use the floating-point mantissa, i.e. s = 53;
use native integers, i.e. s = 64.

Using floating point
historically the first approach in computer algebra [Dumas, Gautier, Pernet 2002]
✓ out-of-the-box performance from optimized BLAS
✗ handles matrices with entries < 2^26

Using native integers
✓ apply the same optimizations as BLAS libraries [Goto, Van De Geijn 2008]
✓ handle matrices with entries < 2^32

Matrix multiplication with small integers

# vector operations per cycle (pipelined):

                     SIMD           floating point   integers
Nehalem (2008)       SSE4 128-bit   1 mul + 1 add    1 mul + 2 add
Sandy Bridge (2011)  AVX 256-bit    1 mul + 1 add    -
Haswell (2013)       AVX2 256-bit   2 FMA            1 mul + 2 add

[Figure: GFLOPS vs. matrix dimension (32 to 8192): OpenBLAS double with SSE, AVX, and AVX2 vs. our integer code with SSE and AVX2]

benchmark on Intel i7-4960HQ @ 2.60GHz

Matrix multiplication with small integers

Matrix multiplication modulo a small integer: let p be such that (p−1)^2 × n < 2^53.

1 perform the multiplication in Z using BLAS
2 reduce the result modulo p
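As a minimal sketch of this two-step scheme (plain C++ loops standing in for the optimized BLAS dgemm one would call in practice), the following accumulates each dot product exactly in the 53-bit double mantissa and reduces once at the end; it assumes entries in [0, p) with n(p−1)^2 < 2^53:

#include <cmath>
#include <cstddef>
#include <vector>

// C = A*B mod p, with A, B stored row-major and entries in [0, p).
// Exactness: every dot product is < n*(p-1)^2 < 2^53, so doubles lose nothing.
std::vector<double> mul_mod_p(const std::vector<double>& A,
                              const std::vector<double>& B,
                              std::size_t n, double p) {
    std::vector<double> C(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)              // step 1: multiply over Z
        for (std::size_t l = 0; l < n; ++l)
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + l] * B[l * n + j];
    for (double& c : C) c = std::fmod(c, p);         // step 2: reduce mod p
    return C;
}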

[Figure: GFLOPS vs. matrix dimension (32 to 8192): OpenBLAS vs. mod p with the classic algorithm and mod p with Winograd's algorithm]

benchmark on Intel i7-4960HQ @ 2.60GHz


Outline

1 Matrix multiplication with small integers

2 Matrix multiplication with multi-precision integers

3 Matrix multiplication with polynomials

Matrix multiplication with multi-precision integers

Direct approach

Let M(k) be the bit complexity of k-bit integer multiplication, and let A, B ∈ Z^{n×n} with ||A||, ||B|| ∈ O(2^k).

Computing AB with the direct algorithm costs O(n^ω M(k)) bit operations.
✗ not the best possible complexity, since M(k) is super-linear
✗ not efficient in practice

Remark: use an evaluation/interpolation technique for better performance!

Multi-modular matrix multiplication

Multi-modular approach

Let ||AB|| < M = ∏_{i=1}^k m_i, with primes m_i ∈ O(1); then AB can be reconstructed with the CRT from the (AB) mod m_i:

1 for each m_i compute A_i = A mod m_i and B_i = B mod m_i
2 for each m_i compute C_i = A_i B_i mod m_i
3 reconstruct C = AB from (C_1, ..., C_k)

Bit complexity: O(n^ω k + n^2 R(k)), where R(k) is the cost of reduction/reconstruction:
R(k) = O(M(k) log(k)) using a divide-and-conquer strategy
R(k) = O(k^2) using the naive approach

Multi-modular matrix multiplication

Improving the naive approach with linear algebra

The reduction/reconstruction of n^2 entries corresponds to a matrix multiplication:
✓ improves the bit complexity from O(n^2 k^2) to O(n^2 k^{ω−1})
✓ benefits from optimized matrix multiplication, i.e. SIMD

Remark: a similar approach has been used by [Doliskani, Schost 2010] in non-distributed code.

Multi-modular reductions of an integer matrix

Let us assume M = ∏_{i=1}^k m_i < β^k with m_i < β.

Multi-reduction of a single entry

Let a = a_0 + a_1 β + ... + a_{k−1} β^{k−1} be a value to reduce modulo the m_i; then

\begin{pmatrix} |a|_{m_1} \\ \vdots \\ |a|_{m_k} \end{pmatrix}
=
\begin{pmatrix}
1 & |\beta|_{m_1} & \cdots & |\beta^{k-1}|_{m_1} \\
\vdots & \vdots & \ddots & \vdots \\
1 & |\beta|_{m_k} & \cdots & |\beta^{k-1}|_{m_k}
\end{pmatrix}
\times
\begin{pmatrix} a_0 \\ \vdots \\ a_{k-1} \end{pmatrix}
- Q \times
\begin{pmatrix} m_1 \\ \vdots \\ m_k \end{pmatrix}

with ||Q|| < kβ^2.

Lemma: if kβ^2 ∈ O(1), then the reduction of n^2 integers modulo the m_i's costs O(n^2 k^{ω−1}) + O(n^2 k) bit operations.
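To make the lemma concrete, here is a hedged scalar sketch (one entry a, plain loops; in the real code all n^2 entries are batched so the |β^j|_{m_i} table multiplies a k × n^2 digit matrix in one dgemm). Each dot product stays below kβ^2, so it fits a 64-bit accumulator for β = 2^16, and the final `% m[i]` plays the role of the −Q × m_i correction:

#include <cstdint>
#include <vector>

// Residues |a|_{m_i} for all i at once, from the beta-adic digits of a.
// Assumes beta = 2^16, m_i < beta, and k*beta^2 < 2^64 (true for any k < 2^32).
std::vector<std::uint64_t> multi_reduce(const std::vector<std::uint64_t>& digits, // a_0..a_{k-1} < beta
                                        const std::vector<std::uint64_t>& m) {    // moduli m_i < beta
    const std::uint64_t beta = 1u << 16;
    std::vector<std::uint64_t> r(m.size());
    for (std::size_t i = 0; i < m.size(); ++i) {
        std::uint64_t acc = 0, pw = 1;                 // pw = |beta^j|_{m_i}
        for (std::size_t j = 0; j < digits.size(); ++j) {
            acc += digits[j] * pw;                     // row i of the matrix-vector product
            pw = pw * beta % m[i];
        }
        r[i] = acc % m[i];                             // the "- Q_i * m_i" correction
    }
    return r;
}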

Multi-modular reconstruction of an integer matrix

Let us assume M = ∏_{i=1}^k m_i < β^k with m_i < β, and let M_i = M/m_i.

CRT formula: a = (∑_{i=1}^k |a|_{m_i} · M_i |M_i^{−1}|_{m_i}) mod M

Reconstruction of a single entry

Let M_i |M_i^{−1}|_{m_i} = α_0^{(i)} + α_1^{(i)} β + ... + α_{k−1}^{(i)} β^{k−1} give the β-adic digits of the CRT constants; then

\begin{pmatrix} a_0 \\ \vdots \\ a_{k-1} \end{pmatrix}
=
\begin{pmatrix}
\alpha_0^{(1)} & \cdots & \alpha_0^{(k)} \\
\vdots & \ddots & \vdots \\
\alpha_{k-1}^{(1)} & \cdots & \alpha_{k-1}^{(k)}
\end{pmatrix}
\times
\begin{pmatrix} |a|_{m_1} \\ \vdots \\ |a|_{m_k} \end{pmatrix}

with a_i < kβ^2 and a = (a_0 + ... + a_{k−1} β^{k−1}) mod M the CRT solution.

Lemma: if kβ^2 ∈ O(1), then the reconstruction of n^2 integers from their images modulo the m_i's costs O(n^2 k^{ω−1}) + Õ(n^2 k) bit operations.
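A tiny worked instance of this formula (my own numbers): take k = 2, β = 10, m_1 = 3, m_2 = 5, so M = 15. The CRT constants are M_1|M_1^{−1}|_{m_1} = 5·2 = 10 and M_2|M_2^{−1}|_{m_2} = 3·2 = 6, whose β-adic digits give the columns (0, 1) and (6, 0). For an entry with residues |a|_3 = 2 and |a|_5 = 3:

\begin{pmatrix} a_0 \\ a_1 \end{pmatrix}
= \begin{pmatrix} 0 & 6 \\ 1 & 0 \end{pmatrix}
\begin{pmatrix} 2 \\ 3 \end{pmatrix}
= \begin{pmatrix} 18 \\ 2 \end{pmatrix},
\qquad
a = (18 + 2 \cdot 10) \bmod 15 = 38 \bmod 15 = 8,

and indeed 8 mod 3 = 2 and 8 mod 5 = 3. Note that a_0 = 18 exceeds β, consistent with the bound a_i < kβ^2 before the final reduction mod M.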

Matrix multiplication with multi-precision integers

Implementation of the multi-modular approach:
choose β = 2^16 to optimize the β-adic conversions
choose the m_i such that nβm_i < 2^53 and use BLAS dgemm
use a linear storage for multi-modular matrices

Compare sequential performance with:
FLINT library [1]: uses divide and conquer
Mathemagix library [2]: uses divide and conquer
Doliskani's code [3]: uses dgemm for reductions only

1. www.flintlib.org   2. www.mathemagix.org   3. courtesy of J. Doliskani

Matrix multiplication with multi-precision integers

[Figures: integer matrix multiplication, time in seconds vs. entry bitsize (2^4 to 2^16), comparing FLINT, Mathemagix, Doliskani's code, and our code, for matrix dimensions 32, 128, 512, and 1024]

benchmark on Intel Xeon-2620 @ 2.0GHz

Parallel multi-modular matrix multiplication

1 for i = 1...k compute A_i = A mod m_i and B_i = B mod m_i
2 for i = 1...k compute C_i = A_i B_i mod m_i
3 reconstruct C = AB from (C_1, ..., C_k)

Parallelization of the multi-modular reduction/reconstruction: each thread reduces (resp. reconstructs) a chunk of the given matrix.

[Diagram: the reductions A_0 = A mod m_0, ..., A_6 = A mod m_6 split into chunks across threads 0-3]

Parallel multi-modular matrix multiplication

1 for i = 1...k compute A_i = A mod m_i and B_i = B mod m_i
2 for i = 1...k compute C_i = A_i B_i mod m_i
3 reconstruct C = AB from (C_1, ..., C_k)

Parallelization of the modular multiplications: each thread computes a bunch of the matrix multiplications mod m_i.

[Diagram: the products C_0 = A_0 B_0 mod m_0, ..., C_6 = A_6 B_6 mod m_6 distributed across threads 0-3]

Parallel matrix multiplication with multi-precision integers

based on OpenMP tasks
CPU affinity (hwloc-bind) and allocator (tcmalloc); a better memory strategy is still work in progress!

[Figure: speedup of parallel integer matrix multiplication on 2 to 12 cores, for dim = 1024 / bitsize = 1024, dim = 1024 / bitsize = 4096, and dim = 256 / bitsize = 8192]

benchmark on Intel Xeon-2620 @ 2.0GHz (2 NUMA nodes with 6 cores each)
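A minimal sketch of the OpenMP task structure (illustrative only; the k modular products of step 2 become independent tasks, and the reduction/reconstruction chunks are dispatched the same way):

#include <cstdio>
#include <omp.h>

int main() {
    const int k = 8;                       // number of moduli m_i
    #pragma omp parallel                   // team of threads
    #pragma omp single                     // one thread creates the tasks
    for (int i = 0; i < k; ++i) {
        #pragma omp task firstprivate(i)   // one task per product C_i = A_i*B_i mod m_i
        {
            // ... modular matrix product for modulus m_i goes here ...
            std::printf("C_%d computed by thread %d\n", i, omp_get_thread_num());
        }
    }
    return 0;
}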


Outline

1 Matrix multiplication with small integers

2 Matrix multiplication with multi-precision integers

3 Matrix multiplication with polynomials

Matrix multiplication over F_p[x]

We consider the "easiest" case: A, B ∈ F_p[x]^{n×n} such that deg(AB) < k = 2^t, where
p is a Fourier prime, i.e. p = 2^t q + 1, and
p is such that n(p−1)^2 < 2^53.

Complexity
O(n^ω k + n^2 k log(k)) operations in F_p, using evaluation/interpolation with the FFT.

Remark: using a Vandermonde matrix one can get an approach similar to the integer case, i.e. O(n^ω k + n^2 k^{ω−1}).

Matrix multiplication over F_p[x]

Evaluation/interpolation scheme

Let θ be a primitive k-th root of unity in F_p.
1 for i = 1...k compute A_i = A(θ^{i−1}) and B_i = B(θ^{i−1})
2 for i = 1...k compute C_i = A_i B_i over F_p
3 interpolate C = AB from (C_1, ..., C_k)

steps 1 and 3: O(n^2) calls to FFT_k over F_p
step 2: k matrix multiplications modulo the small prime p
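A hedged scalar sketch of this scheme (n = 1 for brevity, so step 2 is a pointwise product; for matrices it becomes k independent matrix products mod p; a radix-2 recursive NTT stands in for the optimized radix-4 SIMD FFT of the actual code):

#include <cstdint>
#include <vector>

using u64 = std::uint64_t;

// Radix-2 recursive NTT: returns (a(theta^0), ..., a(theta^{k-1})) mod p.
// Assumes k is a power of two, theta a primitive k-th root of unity, p < 2^32.
std::vector<u64> ntt(const std::vector<u64>& a, u64 theta, u64 p) {
    std::size_t k = a.size();
    if (k == 1) return a;
    std::vector<u64> even(k / 2), odd(k / 2);
    for (std::size_t i = 0; i < k / 2; ++i) { even[i] = a[2 * i]; odd[i] = a[2 * i + 1]; }
    even = ntt(even, theta * theta % p, p);
    odd  = ntt(odd,  theta * theta % p, p);
    std::vector<u64> out(k);
    u64 w = 1;
    for (std::size_t i = 0; i < k / 2; ++i) {        // butterfly: E(x^2) +- x*O(x^2)
        u64 t = w * odd[i] % p;
        out[i]         = (even[i] + t) % p;
        out[i + k / 2] = (even[i] + p - t) % p;
        w = w * theta % p;
    }
    return out;
}

u64 pow_mod(u64 b, u64 e, u64 p) {
    u64 r = 1;
    for (; e; e >>= 1, b = b * b % p) if (e & 1) r = r * b % p;
    return r;
}

// deg(A*B) < k, both inputs zero-padded to length k.
std::vector<u64> poly_mul(std::vector<u64> A, std::vector<u64> B, u64 theta, u64 p) {
    std::size_t k = A.size();
    A = ntt(A, theta, p);                                        // step 1: evaluation
    B = ntt(B, theta, p);
    for (std::size_t i = 0; i < k; ++i) A[i] = A[i] * B[i] % p;  // step 2: pointwise products
    std::vector<u64> C = ntt(A, pow_mod(theta, p - 2, p), p);    // step 3: inverse NTT ...
    u64 kinv = pow_mod(k % p, p - 2, p);
    for (u64& c : C) c = c * kinv % p;                           // ... scaled by 1/k
    return C;
}
// e.g. the Fourier prime p = 97 = 2^5*3 + 1 with theta = pow_mod(5, 96/8, 97) = 64 for k = 8.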

FFT with SIMD over F_p

Butterfly operation modulo p: compute X + Y mod p and (X − Y)θ^{2i} mod p.

Barrett's modular multiplication with a constant (as in NTL), computing in [0, 2p) to remove two conditionals [Harvey 2014].

Let X, Y ∈ [0, 2p), W ∈ [0, p), p < β/4 and W' = ⌊Wβ/p⌋.

Algorithm: Butterfly(X, Y, W, W', p)
1: X' := X + Y mod 2p
2: T := X − Y + 2p
3: Q := ⌊W'T/β⌋ (one high short product)
4: Y' := (WT − Qp) mod β (two low short products)
5: return (X', Y')

✓ SSE/AVX provide 16- and 32-bit low short products
✗ no high short product available (use a full product)
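A scalar C++ rendering of this butterfly (a sketch after [Harvey 2014] with β = 2^64; names are mine, and the SIMD version replaces these lines with vector intrinsics on 16/32-bit lanes):

#include <cstdint>

using u64 = std::uint64_t;
using u128 = unsigned __int128;

// Precompute W' = floor(W * 2^64 / p), once per twiddle factor W.
inline u64 precompute_Wp(u64 W, u64 p) { return (u64)(((u128)W << 64) / p); }

// X, Y in [0, 2p), W in [0, p), p < 2^62 (i.e. p < beta/4 with beta = 2^64).
// On return: X = X+Y mod 2p and Y = (X-Y)*W mod p, both kept in [0, 2p).
inline void butterfly(u64& X, u64& Y, u64 W, u64 Wp, u64 p) {
    u64 Xp = X + Y;
    if (Xp >= 2 * p) Xp -= 2 * p;              // X' := X + Y mod 2p
    u64 T = X - Y + 2 * p;                     // T in (0, 4p)
    u64 Q = (u64)(((u128)Wp * T) >> 64);       // high short product
    u64 Yp = W * T - Q * p;                    // two low short products, mod 2^64
    X = Xp;
    Y = Yp;                                    // lies in [0, 2p) by Harvey's bound
}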

Matrix multiplication over F_p[x]

Implementation
radix-4 FFT with 128-bit SSE (29-bit primes)
BLAS-based matrix multiplication over F_p [FFLAS-FFPACK library]

[Figure: polynomial matrix multiplication, time in seconds for FLINT, Mathemagix (MMX), and our code, over (dimension, degree) pairs ranging from n = 16, deg = 8192 down to n = 2048, deg = 32]

benchmark on Intel Xeon-2620 @ 2.0GHz

Matrix multiplication over Z[x]

A, B ∈ Z[x]^{n×n} such that deg(AB) < d and ||AB|| < k.

Complexity
Õ(n^ω d log(d) log(k)) bit operations using Kronecker substitution
O(n^ω d log(k) + n^2 d log(d) log(k)) bit operations using CRT + FFT

Remark: if the result's degree and bitsize are not too large, the CRT with Fourier primes may suffice.
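Since Kronecker substitution is only named above, here is a hedged toy of the idea (my own example, sizes kept tiny so a 128-bit integer suffices): evaluate each polynomial at x = 2^s, do one integer multiplication, then read the product's coefficients back from s-bit slices.

#include <cstdint>
#include <vector>

using u64 = std::uint64_t;
using u128 = unsigned __int128;

// Evaluate the polynomial at x = 2^s (coefficients must be < 2^s).
u128 pack(const std::vector<u64>& a, unsigned s) {
    u128 v = 0;
    for (std::size_t i = a.size(); i-- > 0; ) v = (v << s) | a[i];
    return v;
}

// c = a*b via one integer product; s must bound the product's coefficients,
// and (deg(ab)+1)*s must fit in 128 bits for this toy version.
std::vector<u64> poly_mul_kronecker(const std::vector<u64>& a,
                                    const std::vector<u64>& b, unsigned s) {
    u128 prod = pack(a, s) * pack(b, s);
    std::vector<u64> c(a.size() + b.size() - 1);
    for (u64& ci : c) {
        ci = (u64)(prod & (((u128)1 << s) - 1));  // read one s-bit slice
        prod >>= s;
    }
    return c;
}
// e.g. a = {1,2}, b = {3,4}, s = 16 gives c = {3,10,8}: (1+2x)(3+4x) = 3+10x+8x^2.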

Matrix multiplication over Z[x]

Implementation
use the CRT with Fourier primes
re-use the multi-modular reduction/reconstruction with linear algebra
re-use the multiplication in F_p[x]

[Figure: time in seconds for FLINT and our code, for n = 16, deg = 8192 with log(k) = 64, 128, 256, 512 and n = 128, deg = 512 with log(k) = 64, 128, 256]

benchmark on Intel Xeon-2620 @ 2.0GHz

Parallel matrix multiplication over Z[x]

Very first attempt (work still in progress):
parallel CRT with linear algebra (same code as in the Z case)
perform each multiplication over F_p[x] in parallel
some parts of the code are still sequential

n     d      log(k)   6 cores   12 cores   seq. time
64    1024   600      ×3.52     ×4.88      61.1 s
32    4096   600      ×3.68     ×5.02      64.4 s
32    2048   1024     ×3.95     ×5.73      54.5 s
128   128    1024     ×3.76     ×5.55      53.9 s

Polynomial matrix in LinBox (proposition)

Generic handler class for polynomial matrices:

template<size_t type, size_t storage, class Field>
class PolynomialMatrix;

Specializations for different memory strategies:

// Matrix of polynomials
template<class Field>
class PolynomialMatrix<PMType::polfirst, PMStorage::plain, Field>;

// Polynomial of matrices
template<class Field>
class PolynomialMatrix<PMType::matfirst, PMStorage::plain, Field>;

// Polynomial of matrices (partial view on monomials)
template<class Field>
class PolynomialMatrix<PMType::matfirst, PMStorage::view, Field>;

Conclusion

High performance tools for exact linear algebra:
matrix multiplication through floating point
multi-dimensional CRT
FFT for polynomials over word-size prime fields
adaptive matrix representation

We provide in the LinBox library (www.linalg.org):
efficient sequential/parallel matrix multiplication over Z
efficient sequential matrix multiplication over F_p[x] and Z[x]
