HAL Id: lirmm-01277362
https://hal-lirmm.ccsd.cnrs.fr/lirmm-01277362
Submitted on 22 Feb 2016
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Synthesis of fixed-point programs: the case of matrix
multiplication
Mohamed Amine Najahi
To cite this version:
Mohamed Amine Najahi. Synthesis of fixed-point programs: the case of matrix multiplication. EJCIM: École Jeunes Chercheurs en Informatique Mathématique, Apr 2013, Perpignan, France. lirmm-01277362.
13th École Jeunes Chercheurs en Informatique Mathématique (EJCIM 2013)
Perpignan, April 12th, 2013

Synthesis of fixed-point programs:
the case of matrix multiplication

Amine Najahi
Advisers: M. Martel and G. Revy
Équipe-projet DALI, Univ. Perpignan Via Domitia
LIRMM, CNRS: UMR 5506 - Univ. Montpellier 2
How easy is it to program a product of matrices?
Well, in floating-point, it is very easy!

    #define N 80
    int main() {
        int i, j, k;
        float A[N][N] = { ... };
        float B[N][N] = { ... };
        float C[N][N] = { 0, ..., 0 };
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++)
                    /* dot product of row i and column j */
                    C[i][j] += A[i][k] * B[k][j];
        return 0;
    }
Motivation

Embedded systems are ubiquitous
▸ microprocessors and/or DSPs dedicated to one or a few specific tasks
▸ they must satisfy constraints: area, energy consumption, design cost

Some embedded systems do not have any FPU (floating-point unit)
▸ such systems are highly used in audio and video applications
▸ floating-point computations in applications must then either go through software implementing floating-point arithmetic, or be converted to fixed-point arithmetic
Outline of the talk
1. Background of fixed-point arithmetic
1.1 Basics of fixed-point arithmetic
1.2 Numerical and combinatorial issues in fixed-point programs
1.3 CGPE
2. Matrix multiplication in fixed-point
2.1 An accurate algorithm
2.2 A compact algorithm
2.3 Closest pair algorithm
3. Conclusion
Background of fixed-point arithmetic
Background of fixed-point arithmetic — Basics of fixed-point arithmetic
Principles of fixed-point arithmetic

Main idea of fixed-point arithmetic:
▸ interpret bit words as integers coupled with a scale factor 2^-n

For instance, the 8-bit word 10000010 has integer value z = 2^7 + 2^1 = 130. With the scale factor 2^-4, its fixed-point value is

    130 × 2^-4 = (2^7 + 2^1) × 2^-4 = 2^3 + 2^-3 = 8.125

The scale factor (or fixed-point format) is implicit: only the programmer is aware of it.

Let us denote by Qa,b a fixed-point format with a integer bits and b fractional bits. The same word 10000010 then represents:
▸ in Q1,7: 130 × 2^-7 = (1.015625)₁₀
▸ in Q2,6: 130 × 2^-6 = (2.03125)₁₀
▸ in Q0,8: 130 × 2^-8 = (0.5078125)₁₀
Background of fixed-point arithmetic — Basics of fixed-point arithmetic
Basic fixed-point operators

Addition
▸ The two variables have to be in the same fixed-point format
▸ The sum of two Qa,b variables yields a Qa+1,b variable

Example (8-bit words, result truncated back to 8 bits):
5.0625 + 2.828125 = 7.890625, which truncates to 7.875

Multiplication
▸ No need for the two variables to have the same fixed-point format
▸ The product of a Qa,b variable by a Qc,d variable yields a Qa+c,b+d variable

Example (8-bit words, result truncated back to 8 bits):
5.0625 × 1.421875 = 7.1982421875, which truncates to 7.125
Background of fixed-point arithmetic — Numerical and combinatorial issues in fixed-point programs
First example: a size-3 dot product

Let us consider the arithmetic expression

    (a0 × b0) + (a1 × b1) + (a2 × b2)

and the following input fixed-point formats:

                        a0          b0        a1           b1          a2        b2
    Value               [0.1,1.57]  [0,1.98]  [0.01,0.87]  [1.1,1.86]  [0,15.4]  [2,3.3]
    Fixed-point format  Q1,7        Q1,7      Q0,8         Q1,7        Q4,4      Q2,6

The products ci = ai × bi are in Q2,14, Q1,15, and Q6,10, respectively. Let us focus on 2 different schemes to compute the sum of products:

▸ (c0 + (c1 + c2)): in full precision, c1 + c2 is in Q7,15 and the final sum in Q8,15;
  with 16-bit precision, they become Q7,9 and Q8,8
▸ ((c0 + c1) + c2): in full precision, c0 + c1 is in Q3,15 and the final sum in Q7,15;
  with 16-bit precision, they become Q3,13 and Q7,9

With 16-bit precision the two schemes thus yield different output formats (Q8,8 vs. Q7,9): the evaluation order affects the accuracy of the result.
Background of fixed-point arithmetic — CGPE
The CGPE¹ software tool

Written by Revy and Mouilleron to aid in emulating floating-point in software. It is a tool that generates fast and certified code:
▸ fast: code that reduces the evaluation latency on a given target, by using the target architecture features (as much as possible)
▸ certified: code for which we can bound the error entailed by the evaluation within the given target's arithmetic

(Figure: the CGPE code generator. The front-end computes a set of DAGs from polynomial.xml; the middle-end applies filters 1 to n that decorate the DAGs using architecture.xml; the back-end emits C files together with accuracy certificates.)

¹ Code Generation for Polynomial Evaluation
Matrix multiplication in fixed-point
Matrix multiplication in fixed-point — Defining the problem

We are provided with
▸ a black box (CGPE) that synthesizes code for dot products in fixed-point arithmetic: from 2 real interval vectors, it produces C code and an accuracy certificate
▸ 2 matrices A and B in I(R^{n×n}):

    A = ( [−4.54,7.78]    ···  [−0.789,0.967] )
        (      ⋮           ⋱        ⋮         )
        ( [12.51,24.14]   ···  [−0.921,0.791] )

    and B = ( [−64,45.78]      ···  [−0.287,0.7] )
            (      ⋮            ⋱        ⋮       )
            ( [125.1,245.14]   ···  [−5.74,7.32] )

We are asked to
▸ generate code that evaluates all the products C = MN in fixed-point arithmetic,
  • where M ∈ A and N ∈ B
Matrix multiplication in fixed-point — Tradeoffs to consider

Remark: the suggested strategy should be efficient in terms of the tradeoffs below for all matrices of size smaller than 80 × 80.

1. Size of the generated code
▸ We are targeting embedded systems: code size should be as tight as possible

2. Accuracy of the generated code
▸ Accuracy certificates should be produced that bound the absolute error
▸ The guaranteed absolute error should be as tight as possible
Matrix multiplication in fixed-point — An accurate algorithm
An accurate algorithm

Main idea: generate a dot-product code for each coefficient of the resulting matrix.

AccurateProduct
Inputs: two square matrices A ∈ I(R^{n×n}) and B ∈ I(R^{n×n})
Outputs: C code to compute the product MN for all M ∈ A and N ∈ B
Steps:
1: for 1 ≤ i ≤ n do
2:   for 1 ≤ j ≤ n do
3:     cgpeGenDotProduct(A_i, B_j);
4:   end for
5: end for

(where A_i denotes row i of A and B_j column j of B)

Illustration on the product of two 2 × 2 matrices:

    C_{1,1} = cgpeGenDotProduct(A_1, B_1)   C_{1,2} = cgpeGenDotProduct(A_1, B_2)
    C_{2,1} = cgpeGenDotProduct(A_2, B_1)   C_{2,2} = cgpeGenDotProduct(A_2, B_2)
Matrix multiplication in fixed-point — An accurate algorithm
Analysis of AccurateProduct

For square matrices of size n, n² calls to cgpeGenDotProduct are issued.
▸ Each dot product uses more than 2n instructions (n multiplications + n additions)
↪ The generated code for the product is proportional in size to 2n³
↪ More than 1,024,000 instructions (2 × 80³) for 80 × 80 matrices

Advantages
✓ Easy to generate code: two nested loops and n² calls to the routine cgpeGenDotProduct
✓ The reference in terms of numerical quality

Drawbacks
✗ Code size is proportional to 2n³
✗ Such code sizes are prohibitive in embedded systems
Matrix multiplication in fixed-point — A compact algorithm
A compact algorithm

Main idea: generate a unique dot-product code shared by all the computations.

CompactProduct
Inputs: two square matrices A ∈ I(R^{n×n}) and B ∈ I(R^{n×n})
Outputs: C code to compute the product MN for all M ∈ A and N ∈ B
Steps:
1: compute v such that v = A_1 ∪ A_2 ∪ ··· ∪ A_n
2: compute w such that w = B_1 ∪ B_2 ∪ ··· ∪ B_n
3: cgpeGenDotProduct(v, w);

Illustration on the product of two 2 × 2 matrices: every coefficient is computed by the same generated code,

    C_{i,j} = cgpeGenDotProduct(A_1 ∪ A_2, B_1 ∪ B_2)   for i, j ∈ {1, 2}
Matrix multiplication in fixed-point — A compact algorithm
Analysis of CompactProduct

For square matrices of size n, only one call to cgpeGenDotProduct is issued.
▸ The dot product uses around 2n instructions (n multiplications + n additions)
↪ The generated code for the product is proportional in size to 2n
↪ Around 160 instructions for 80 × 80 matrices

Advantages
✓ Easy to generate code: compute the union of all vectors of A and B and call the routine cgpeGenDotProduct
✓ The reference in terms of code size

Drawbacks
✗ Numerical quality deteriorates dramatically
Matrix multiplication in fixed-point — Closest pair algorithm
A closest pair algorithm

Main idea: fuse together only rows or columns that are close to each other.

The Hausdorff distance d_H
d_H : I(R^n) × I(R^n) → R
For interval vectors A and B with components A_i = [lo(A_i), hi(A_i)] and B_i = [lo(B_i), hi(B_i)]:

    d_H(A, B) = max_{1≤i≤n} max{ |lo(A_i) − lo(B_i)| , |hi(A_i) − hi(B_i)| }

Example
Let A = ([−4, 7], [−11, 102]) and B = ([−2, 88], [−23, 1]) be two vectors in I(R²). We have:

    d_H(A, B) = 101   (attained at |102 − 1|)
    ∪(A, B) = ([−4, 88], [−23, 102])
Matrix multiplication in fixed-point — Closest pair algorithm
ClosestPairFusion

Inputs:
  n vectors v_1, ..., v_n in I(R^m)
  a routine findClosestPair based on d_H
  a routine Union that applies the union operator
  the number k of output vectors
Outputs:
  k vectors in I(R^m)
Steps:
1: B = {v_1, ..., v_n}
2: while size(B) > k do
3:   (u_1, u_2) = findClosestPair(B)
4:   remove(u_1, B)
5:   remove(u_2, B)
6:   add(Union(u_1, u_2), B)
7: end while
Matrix multiplication in fixed-point — Closest pair algorithm
Illustration of ClosestPairFusion

Row vectors:
    v1 = ([−4,4], [−5,5], [−5,5], [−6,6])
    v2 = ([−2,2], [−1,1], [−3,3], [−9,9])
    v3 = ([−7,7], [−4,4], [−12,12], [−11,11])
    v4 = ([−8,8], [−1,1], [−10,10], [−9,9])
Pairwise distances: dH(v1,v2)=4, dH(v1,v3)=7, dH(v1,v4)=5, dH(v2,v3)=9, dH(v2,v4)=7, dH(v3,v4)=3
↪ fuse the closest pair: v3 ∪ v4 = ([−8,8], [−4,4], [−12,12], [−11,11])
New distances: dH(v1,v2)=4, dH(v1,v3∪v4)=7, dH(v2,v3∪v4)=9
↪ fuse v1 and v2: v1 ∪ v2 = ([−4,4], [−5,5], [−5,5], [−9,9])

Column vectors:
    w1 = ([−3,3], [−1,1], [−4,4], [−9,9])
    w2 = ([−14,14], [−11,11], [−8,8], [−7,7])
    w3 = ([−5,5], [−3,3], [−11,11], [−10,10])
    w4 = ([−6,6], [−9,9], [−1,1], [−2,2])
Pairwise distances: dH(w1,w2)=11, dH(w1,w3)=7, dH(w1,w4)=8, dH(w2,w3)=9, dH(w2,w4)=8, dH(w3,w4)=10
↪ fuse the closest pair: w1 ∪ w3 = ([−5,5], [−3,3], [−11,11], [−10,10])
New distances: dH(w1∪w3,w2)=9, dH(w1∪w3,w4)=10, dH(w2,w4)=8
↪ fuse w2 and w4: w2 ∪ w4 = ([−14,14], [−11,11], [−8,8], [−7,7])
Matrix multiplication in fixed-point — Closest pair algorithm
Analysis of the closest pair algorithm

For square matrices of size n, k × l calls to cgpeGenDotProduct are issued (the rows of A are fused into k vectors and the columns of B into l vectors).
▸ Each dot product uses more than 2n instructions (n multiplications + n additions)
↪ The generated code for the product is proportional in size to 2nkl
↪ For 80 × 80 matrices, the table below gives the number of instructions (2nkl = 160kl):

  k\l |     1      2      4      5      8     10     16     20      40      80
 -----+------------------------------------------------------------------------
    1 |   160    320    640    800   1280   1600   2560   3200    6400   12800
    2 |   320    640   1280   1600   2560   3200   5120   6400   12800   25600
    4 |   640   1280   2560   3200   5120   6400  10240  12800   25600   51200
    5 |   800   1600   3200   4000   6400   8000  12800  16000   32000   64000
    8 |  1280   2560   5120   6400  10240  12800  20480  25600   51200  102400
   10 |  1600   3200   6400   8000  12800  16000  25600  32000   64000  128000
   16 |  2560   5120  10240  12800  20480  25600  40960  51200  102400  204800
   20 |  3200   6400  12800  16000  25600  32000  51200  64000  128000  256000
   40 |  6400  12800  25600  32000  51200  64000 102400 128000  256000  512000
   80 | 12800  25600  51200  64000 102400 128000 204800 256000  512000 1024000

Advantages
✓ Code size can be controlled through the parameters k and l

Drawbacks
✗ Numerical quality deteriorates with small values of k and l
Matrix multiplication in fixed-point — Closest pair algorithm
Let us compare these algorithms

These results were produced for interval matrices of size 80 × 80:
▸ the center of each interval is randomly selected in [−1000, 1000]
▸ the diameter of the intervals is fixed to 100

(Figure: certified evaluation error over the 80 × 80 output matrix, for each algorithm.)

AccurateProduct
▸ Largest certified error: ≈ 0.1254
▸ Mean certified error: ≈ 0.0865
▸ Number of instructions: ≈ 1,024,000

CompactProduct
▸ Largest certified error: ≈ 0.5585
▸ Mean certified error: ≈ 0.5585
▸ Number of instructions: ≈ 160
Matrix multiplication in fixed-point — Closest pair algorithm
Let us compare these algorithms, cont'd

(Figures: certified evaluation error over the 80 × 80 output matrix for ClosestPairFusion with various values of k and l, interpolating between the CompactProduct and AccurateProduct extremes.)
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Matrix multiplication in fixed-point Closest pair algorithm
Let us compare these algorithms, cont’d
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Matrix multiplication in fixed-point Closest pair algorithm
Let us compare these algorithms, cont’d
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Matrix multiplication in fixed-point Closest pair algorithm
Let us compare these algorithms, cont’d
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Certified evaluation error
’-’ matrix 0 10 20 30 40 50 60 70 0 10 20 30 40 50 60 70 0 0.1 0.2 0.3 0.4 0.5
Matrix multiplication in fixed-point Closest pair algorithm
Let us compare these algorithms, cont’d
[Figure: tradeoff plot as a function of k (number of fused rows and columns, 1 to 80). Left axis: certified bound for the absolute error (roughly 0.05 to 0.6), with curves for the maximum certified error and the mean certified error. Right axis: approximate number of instructions (code size, roughly 1 to 1 024 000, logarithmic scale).]
Conclusion
Outline of the talk
1. Background of fixed-point arithmetic
1.1 Basics of fixed-point arithmetic
1.2 Numerical and combinatorial issues in fixed-point programs
1.3 CGPE
2. Matrix multiplication in fixed-point
2.1 An accurate algorithm
2.2 A compact algorithm
2.3 Closest pair algorithm
Conclusion
In this talk:
- We suggested 3 strategies to generate code for matrix products in fixed-point arithmetic
- The accurate algorithm performs well in terms of numerical quality, but its code size is prohibitive
- The compact algorithm generates concise code but degrades the numerical quality
- The closest pair algorithm enables tradeoffs between code size and numerical quality
For the future, we will be working on:
- Suggesting similar algorithms for the discrete convolution in fixed-point arithmetic
- Investigating the synthesis of VHDL code for building blocks like matrix multiplication
Conclusion
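To make the summary concrete, the core operation that all three strategies synthesize code for is a fixed-point dot product accumulated in a wider format. The sketch below is illustrative only, not the generated code: the Q-format choices (inputs in Q1.31, a 64-bit accumulator, a final shift back to 32 bits) are hypothetical conventions for the example, whereas the tool presented in the talk selects and certifies the formats automatically.

```c
#include <stdint.h>

/* Illustrative sketch of a fixed-point dot product, the building block of a
   fixed-point matrix product.  Entries are assumed to be Q1.31 words; each
   32x32 product is kept exactly in a 64-bit accumulator, and the sum is
   shifted back to a 32-bit word at the end.  Format choices are hypothetical. */
static uint32_t dot_q31(const uint32_t *a, const uint32_t *b, int n)
{
    int64_t acc = 0;                          /* wide accumulator: exact sums */
    for (int i = 0; i < n; i++)
        acc += (int64_t)(int32_t)a[i] * (int32_t)b[i];
    return (uint32_t)(acc >> 32);             /* truncate back to one word */
}
```

Keeping the accumulator wide defers all rounding to the single final shift, which is what makes the per-entry evaluation error certifiable by interval analysis.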
Conclusion
Example of code generated by CGPE
// File automatically generated by CGPE (Code Generation for Polynomial Evaluation)
// Scheme: ((((a0+(S*(a1+(T*a2))))+((S*(T*T))*(a3+(T*a4))))+((S*(T*T))*((T*T)*(a5+(T*a6)))))+(((S*(T*T))*((T*T)*(T*T)))*((a7+(T*a8))+((T*T)*(a9+(T*a10))))))
// Degree: [9,1]
uint32_t func_0(uint32_t T, uint32_t S)
{
    uint32_t r0  = mul(T, 0x5a82685d);   // 1.31
    uint32_t r1  = 0xb504f31f - r0;      // 1.31
    uint32_t r2  = mul(S, r1);           // 2.30
    uint32_t r3  = 0x00000020 + r2;      // 2.30
    uint32_t r4  = mul(T, T);            // 0.32
    uint32_t r5  = mul(S, r4);           // 1.31
    uint32_t r6  = mul(T, 0x386fd5f4);   // 1.31
    uint32_t r7  = 0x43df72f7 - r6;      // 1.31
    uint32_t r8  = mul(r5, r7);          // 2.30
    uint32_t r9  = r3 + r8;              // 2.30
    uint32_t r10 = mul(T, 0x28724100);   // 1.31
    uint32_t r11 = 0x308b1798 - r10;     // 1.31
    uint32_t r12 = mul(r4, r11);         // 1.31
    uint32_t r13 = mul(r5, r12);         // 2.30
    uint32_t r14 = r9 + r13;             // 2.30
    uint32_t r15 = mul(r4, r4);          // 0.32
    uint32_t r16 = mul(r5, r15);         // 1.31
    uint32_t r17 = mul(T, 0x106c5cd9);   // 1.31
    uint32_t r18 = 0x1d7bf968 - r17;     // 1.31
    uint32_t r19 = mul(T, 0x00fa9aa4);   // 1.31
    uint32_t r20 = 0x05dfffa4 - r19;     // 1.31
    uint32_t r21 = mul(r4, r20);         // 1.31
    uint32_t r22 = r18 + r21;            // 1.31
    uint32_t r23 = mul(r16, r22);        // 2.30
    uint32_t r24 = r14 + r23;            // 2.30
    return r24;
}
/* Error bound computed using MPFI:
 * [-101430164957300904987779234264749461384198787082116213489690022556337011401715b-283,
 *  25586666393682615675400092400526394119988758102711854693922360480750444185767b-282]
 * ~ [-2^{-27.191}, 2^{-28.1781}]
 */
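The generated code above calls a helper `mul` whose definition is not shown on the slide. In 32-bit fixed-point code of this kind, the natural primitive is a 32×32-bit multiplication that returns the high half of the exact 64-bit product, which also explains the Q-format annotations (multiplying a Qm.f by a Qm'.f' word and keeping the top 32 bits yields a Q(m+m').(f+f'−32) word). The definition below is a plausible reconstruction under that assumption, not necessarily CGPE's actual implementation.

```c
#include <stdint.h>

/* Assumed semantics of the mul() helper used by the generated code:
   32x32 -> 32 unsigned multiplication keeping the high half of the
   exact 64-bit product. */
static inline uint32_t mul(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * (uint64_t)b) >> 32);
}
```

Under this reading, `mul(T, T)` on a Q0.32 input yields a Q0.32 result, matching the `// 0.32` comment on `r4` in the listing.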