• Aucun résultat trouvé

Synthesis of fixed-point programs based on instruction selection, the case of polynomial evaluation

N/A
N/A
Protected

Academic year: 2021

Partager "Synthesis of fixed-point programs based on instruction selection, the case of polynomial evaluation"

Copied!
44
0
0

Texte intégral

(1)

HAL Id: lirmm-01277361

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01277361

Submitted on 22 Feb 2016

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

publics ou privés.

Synthesis of fixed-point programs based on instruction

selection, the case of polynomial evaluation

Mohamed Amine Najahi

To cite this version:

Mohamed Amine Najahi. Synthesis of fixed-point programs based on instruction selection, the case

of polynomial evaluation. RAIM: Rencontres Arithmétiques de l’Informatique Mathématique, Jun

2012, Dijon, France. 5ièmes Rencontres Arithmétique de l’Informatique Mathématique, 2012.

�lirmm-01277361�

(2)

Dijon, 20-22 juin 2012

Synthesis of fixed-point programs based on

instruction selection

... the case of polynomial evaluation

Amine Najahi

Advisors

: Matthieu Martel and Guillaume Revy

Joint work with Christophe Mouilleron

Équipe-projet DALI, Univ. Perpignan Via Domitia

LIRMM, CNRS: UMR 5506 - Univ. Montpellier 2

D

(3)

Motivation

Embedded systems

are ubiquitous

I

microprocessors and/or DSPs dedicated to one or a few specific tasks

I

satisfy constraints: area, energy consumption, conception cost

Some embedded systems

do not have any FPU

(floating-point unit)

Embedded systems

No FPU

Highly used in audio and video applications

(4)

Motivation

Embedded systems

are ubiquitous

I

microprocessors and/or DSPs dedicated to one or a few specific tasks

I

satisfy constraints: area, energy consumption, conception cost

Some embedded systems

do not have any FPU

(floating-point unit)

Embedded systems

No FPU

Highly used in audio and video applications

(5)

Motivation

Embedded systems

are ubiquitous

I

microprocessors and/or DSPs dedicated to one or a few specific tasks

I

satisfy constraints: area, energy consumption, conception cost

Some embedded systems

do not have any FPU

(floating-point unit)

FP computations

Applications

Embedded systems

No FPU

Highly used in audio and video applications

(6)

Motivation

Embedded systems

are ubiquitous

I

microprocessors and/or DSPs dedicated to one or a few specific tasks

I

satisfy constraints: area, energy consumption, conception cost

Some embedded systems

do not have any FPU

(floating-point unit)

floating−point arithmetic

Software implementing

FP computations

Applications

Embedded systems

No FPU

Highly used in audio and video applications

(7)

Motivation

Embedded systems

are ubiquitous

I

microprocessors and/or DSPs dedicated to one or a few specific tasks

I

satisfy constraints: area, energy consumption, conception cost

Some embedded systems

do not have any FPU

(floating-point unit)

Applications

Embedded systems

No FPU

Highly used in audio and video applications

(8)

How to use floating-point programs on embedded systems?

Two approaches to continue using numerical algorithms on these cores:

1.

convert the entire numerical application from floating to fixed-point arithmetic

2.

write a floating-point emulation library and link the numerical application against it

Fixed-point conversion

3

produces a fast code

3

consumes less energy

7

machine specific: no standard

7

smaller dynamic range than

floating-point

7

tedious and time consuming

Floating-point support design

3

tons of code are written using

floating-point

3

an algorithm can be synthesized on

a PC and then transferred to the

device without modifications

7

slower

7

tedious and time consuming

(9)

Fixed-point conversion vs. floating-point emulation design

Floating to fixed-point conversion tools:

I

addressed by the ANR project DEFIS, with IRISA, LIP6, CEA, THALES, INPIXAL

I

some tools are currently developed: ID.Fix,

. . .

I

two main approaches:

1.

statistical methods: perform well, but provide no guarantees and may be slow.

2.

analytical methods: usually quite pessimistic, but they are safer to use.

Floating-point emulation support:

I

a number of high quality emulation libraries exist: FLIP, SoftFloat,

. . .

I

more or less compliant with the IEEE-754 standard

I

FLIP: relies on polynomial evaluation to evaluate division and square root

a huge number of schemes for evaluating a given polynomial

development of CGPE

(10)

Outline of the talk

1. The CGPE tool

2. Instruction selection: an extension of CGPE

(11)

Outline of the talk

1. The CGPE tool

2. Instruction selection: an extension of CGPE

(12)

Overview of CGPE

Goal of CGPE

: automate the design of fast and certified C codes for evaluating

univariate/bivariate polynomials

I

in fixed-point arithmetic

I

by using the target architecture features (as much as possible)

Remarks

:

I

fast

that reduce the evaluation latency on a given target

I

certified

for which we can bound the error entailed by the evaluation within the

(13)

Global architecture of CGPE

Input of CGPE

1.

polynomial coefficients and variables: value intervals, fixed-point format, ...

2.

set of criteria: maximum error bound and bound on latency (or the lowest)

3.

some architectural constraints: operator cost, parallelism, ...

<polynomial>

<c o e f f i c i e n t x=" 0 "y=" 0 "inf=" 0 x 0 0 0 0 0 0 2 0 "sup=" 0 x 0 0 0 0 0 0 2 0 "sign=" 0 "i n t e g e r _ p a r t=" 2 "f r a c t i o n _ p a r t=" 30 "/ > <c o e f f i c i e n t x=" 0 "y=" 1 "inf=" 0 x 8 0 0 0 0 0 0 0 "sup=" 0 x 8 0 0 0 0 0 0 0 "sign=" 0 "i n t e g e r _ p a r t=" 1 "f r a c t i o n _ p a r t=" 31 "/ > <c o e f f i c i e n t x=" 1 "y=" 1 "inf=" 0 x 4 0 0 0 0 0 0 0 "sup=" 0 x 4 0 0 0 0 0 0 0 "sign=" 0 "i n t e g e r _ p a r t=" 1 "f r a c t i o n _ p a r t=" 31 "/ > <c o e f f i c i e n t x=" 2 "y=" 1 "inf=" 0 x 1 0 0 0 0 0 0 0 "sup=" 0 x 1 0 0 0 0 0 0 0 "sign=" 1 "i n t e g e r _ p a r t=" 1 "f r a c t i o n _ p a r t=" 31 "/ > <c o e f f i c i e n t x=" 3 "y=" 1 "inf=" 0 x 0 7 f e 9 3 e 4 "sup=" 0 x 0 7 f e 9 3 e 4 "sign=" 0 "i n t e g e r _ p a r t=" 1 "f r a c t i o n _ p a r t=" 31 "/ > <c o e f f i c i e n t x=" 4 "y=" 1 "inf=" 0 x 0 4 e e f 6 9 4 "sup=" 0 x 0 4 e e f 6 9 4 "sign=" 1 "i n t e g e r _ p a r t=" 1 "f r a c t i o n _ p a r t=" 31 "/ > <c o e f f i c i e n t x=" 5 "y=" 1 "inf=" 0 x 0 3 2 d 6 6 4 3 "sup=" 0 x 0 3 2 d 6 6 4 3 "sign=" 0 "i n t e g e r _ p a r t=" 1 "f r a c t i o n _ p a r t=" 31 "/ > <c o e f f i c i e n t x=" 6 "y=" 1 "inf=" 0 x 0 1 c 6 c e b d "sup=" 0 x 0 1 c 6 c e b d "sign=" 1 "i n t e g e r _ p a r t=" 1 "f r a c t i o n _ p a r t=" 31 "/ > <c o e f f i c i e n t x=" 7 "y=" 1 "inf=" 0 x 0 0 a e b e 7 d "sup=" 0 x 0 0 a e b e 7 d "sign=" 0 "i n t e g e r _ p a r t=" 1 "f r a c t i o n _ p a r t=" 31 "/ > <c o e f f i c i e n t x=" 8 "y=" 1 "inf=" 0 x 0 0 2 0 0 0 0 0 "sup=" 0 x 0 0 2 0 0 0 0 0 "sign=" 1 "i n t e g e r _ p a r t=" 1 "f r a c t i o n _ p a r t=" 31 "/ > <v a r i a b l e x=" 1 "y=" 0 " inf=" 0 x 0 0 0 0 0 0 0 0 " sup=" 0 x f f f f f e 0 0 " sign=" 0 "i n t e g e r _ p a r t=" 0 " f r a c t i o n _ p a r t=" 32 "/ > <v a r i a b l e x=" 0 "y=" 1 " inf=" 0 x 8 0 0 0 0 0 0 0 " sup=" 0 x b 5 0 4 f 3 3 4 " sign=" 0 "i n t e g e r _ p a r t=" 1 " f r a c t i o n _ p a r t=" 31 "/ > <a b s o l u t e _ e v a l e r r o r value=" 2 5 0 8 1 3 7 3 4 8 3 1 5 8 6 9 3 0 1 2 4 6 3 0 5 3 5 2 8 1 1 8 0 4 0 3 8 0 9 7 6 7 3 3 1 9 8 9 2 1 b -191 " stric t=" false "/ > </polynomial>

cgpe--deg ree=" [8 ,1] "--xml-input=cgpe-test1.xml --coefs=" [ 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 ] " \ --l a t e n c y=lo west --gappa-c e r t i f i c a t e--output \ --s c h e d u l e=" [4 ,2] "--max-kept=5 --o p e r a t o r s=" [ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 : 1 3 3 3 3 3 3 3 3 1 1 1 3 3 3 3 3 1 ] " ...

(14)

Global architecture of CGPE (cont’d)

Internals of CGPE

CGPE proceeds in two steps:

1. Computation step:

I

computes evaluation schemes while

reducing their latency on unbounded

parallelism

I

considers only two possible arithmetic

operations: addition and multiplication

I

produces DAGs that represent the

computed efficient schemes

Computation of low latency parenthesizations

Selection of effective parenthesizations

a(t)

E

approx≤ θ

F(s,t)

E

eval≤ η ST231 features

C code Certificate

Computation of polynomial approximant

CGPE

CGPE

2. Filtering step:

I

prunes the evaluation schemes that do not satisfy different criteria:

(15)

Global architecture of CGPE (cont’d)

Output of CGPE

u i n t 3 2 _ t f u n c _ d 9 _ 0(u i n t 3 2 _ tT,u i n t 3 2 _ tS) { u i n t 3 2 _ tr0 =T>> 2; // (+) Q [ 1.31] u i n t 3 2 _ tr1 = 0x 8 0 0 0 0 0 0 0+ r0; // (+) Q [ 1.31] u i n t 3 2 _ tr2 =mul(S,r1) ; // (+) Q [ 2.30] u i n t 3 2 _ tr3 = 0x 0 0 0 0 0 0 2 0+ r2; // (+) Q [ 2.30] u i n t 3 2 _ tr4 =mul(T,T) ; // (+) Q [ 0.32] u i n t 3 2 _ tr5 =mul(S,r4) ; // (+) Q [ 1.31] u i n t 3 2 _ tr6 =mul(T, 0x 0 7 f e 9 3 e 4) ; // (+) Q [ 1.31] u i n t 3 2 _ tr7 = 0x 1 0 0 0 0 0 0 0 -r6; // ( -) Q [ 1.31] u i n t 3 2 _ tr8 =mul(r5,r7) ; // ( -) Q [ 2.30] u i n t 3 2 _ tr9 =r3 -r8; // (+) Q [ 2.30] u i n t 3 2 _ tr10 =mul(r4,r4) ; // (+) Q [ 0.32] u i n t 3 2 _ tr11 =mul(S,r10) ; // (+) Q [ 1.31] u i n t 3 2 _ tr12 =mul(T, 0x 0 3 2 d 6 6 4 3) ; // (+) Q [ 1.31] u i n t 3 2 _ tr13 = 0x 0 4 e e f 6 9 4 -r12; // ( -) Q [ 1.31] u i n t 3 2 _ tr14 =mul(T, 0x 0 0 a e b e 7 d) ; // (+) Q [ 1.31] u i n t 3 2 _ tr15 = 0x 0 1 c 6 c e b d -r14; // ( -) Q [ 1.31] u i n t 3 2 _ tr16 =r4 >> 11; // ( -) Q [ 1.31] u i n t 3 2 _ tr17 =r15 +r16; // ( -) Q [ 1.31] u i n t 3 2 _ tr18 =mul(r4,r17) ; // ( -) Q [ 1.31] u i n t 3 2 _ tr19 =r13 +r18; // ( -) Q [ 1.31] u i n t 3 2 _ tr20 =mul(r11,r19) ; // ( -) Q [ 2.30] u i n t 3 2 _ tr21 =r9 -r20; // (+) Q [ 2.30] retu rnr21; }

Listing 1:

C code

## C o e f f i c i e n t s and v a r i a b l e s d e f i n i t i o n a0=fixed< -30 ,dn>(0x00000020p-30) ; a1=fixed< -31 ,dn>(0x80000000p-31) ; a2=fixed< -31 ,dn>(0x40000000p-31) ; . . . a8=fixed< -31 ,dn>(0x00aebe7dp-31) ; a9=fixed< -31 ,dn>(0x00200000p-31) ;

T=fixed< -32 ,dn>(fixed< -23 ,dn>(var0) ) ;

S=fixed< -31 ,dn>(var1) ; C e r t i f i e d B o u n d= 2 5 0 8 1 3 7 3 4 8 3 1 5 8 6 9 3 0 1 2 4 6 3 0 5 3 5 2 8 1 1 8 0 4 0 3 8 0 9 7 6 7 3 3 1 9 8 9 2 1b-191; ## E v a l u a t i o n sch eme r0 fixed< -31 ,dn>= T*a2; Mr0 =T*a2; r1 fixed< -31 ,dn>= a1+r0; Mr1 =a1 +Mr0; . . . r21 fixed< -30 ,dn>=r9 -r20; Mr21=Mr9 -Mr20; ## R e s u l t s { ( var0 in [0x00000000p-32 ,0xfffffe00p-32] /\var1 in [0x80000000p-31 ,0xb504f334p-31] -> /\r0 in[0 ,0xffffffffp-31] /\r0 -Mr0 in? . . . /\r21 in [0 ,0xffffffffp-30] /\ |r21 -Mr21| -C e r t i f i e d B o u n d<= 0 /\ C e r t i f i e d B o u n d in? ) }

(16)

Global architecture of CGPE (cont’d)

Output of CGPE

u i n t 3 2 _ t f u n c _ d 9 _ 0(u i n t 3 2 _ tT,u i n t 3 2 _ tS) { u i n t 3 2 _ tr0 =T>> 2; // (+) Q [ 1.31] u i n t 3 2 _ tr1 = 0x 8 0 0 0 0 0 0 0+ r0; // (+) Q [ 1.31] u i n t 3 2 _ tr2 =mul(S,r1) ; // (+) Q [ 2.30] u i n t 3 2 _ tr3 = 0x 0 0 0 0 0 0 2 0+ r2; // (+) Q [ 2.30] u i n t 3 2 _ tr4 =mul(T,T) ; // (+) Q [ 0.32] u i n t 3 2 _ tr5 =mul(S,r4) ; // (+) Q [ 1.31] u i n t 3 2 _ tr6 =mul(T, 0x 0 7 f e 9 3 e 4) ; // (+) Q [ 1.31] u i n t 3 2 _ tr7 = 0x 1 0 0 0 0 0 0 0 -r6; // ( -) Q [ 1.31] u i n t 3 2 _ tr8 =mul(r5,r7) ; // ( -) Q [ 2.30] u i n t 3 2 _ tr9 =r3 -r8; // (+) Q [ 2.30] u i n t 3 2 _ tr10 =mul(r4,r4) ; // (+) Q [ 0.32] u i n t 3 2 _ tr11 =mul(S,r10) ; // (+) Q [ 1.31] u i n t 3 2 _ tr12 =mul(T, 0x 0 3 2 d 6 6 4 3) ; // (+) Q [ 1.31] u i n t 3 2 _ tr13 = 0x 0 4 e e f 6 9 4 -r12; // ( -) Q [ 1.31] u i n t 3 2 _ tr14 =mul(T, 0x 0 0 a e b e 7 d) ; // (+) Q [ 1.31] u i n t 3 2 _ tr15 = 0x 0 1 c 6 c e b d -r14; // ( -) Q [ 1.31] u i n t 3 2 _ tr16 =r4 >> 11; // ( -) Q [ 1.31] u i n t 3 2 _ tr17 =r15 +r16; // ( -) Q [ 1.31] u i n t 3 2 _ tr18 =mul(r4,r17) ; // ( -) Q [ 1.31] u i n t 3 2 _ tr19 =r13 +r18; // ( -) Q [ 1.31] u i n t 3 2 _ tr20 =mul(r11,r19) ; // ( -) Q [ 2.30] u i n t 3 2 _ tr21 =r9 -r20; // (+) Q [ 2.30] retu rnr21; }

Listing 3:

C code

13 12 11 10 9 8 7 6 5 4 3 2 1 T a2 r0 a1 r1 S r2 a0 r3 T T r4 S r5 T a4 r6 a3 r7 r8 r9 r10 S r11 T a6 r12 a5 r13 T a8 r14 a7 r15 a9 r16 r17 r18 r19 r20 r21

(17)

The CGPE tool

Achievements and lacking features of CGPE

Features achieved by CGPE

3

validated on the ST200 core

3

so far, no ambushes were

encountered for

,

3

,

1

,

1

3

· · ·

3

produced optimal schemes for some

of the above functions such as

Features lacking in CGPE

7

simplistic description of the

underlying architecture

(ex. no handling of advanced

operators such as ST200

shift_and_add instruction)

7

the only shifts handled correspond

to the multiplication by a power of 2

7

hypotheses are made on the format

of the input coefficients

Problem:

without hypotheses on the formats of the input coefficients, CGPE fails

(18)

Achievements and lacking features of CGPE

Features achieved by CGPE

3

validated on the ST200 core

3

so far, no ambushes were

encountered for

,

3

,

1

,

1

3

· · ·

3

produced optimal schemes for some

of the above functions such as

Features lacking in CGPE

7

simplistic description of the

underlying architecture

(ex. no handling of advanced

operators such as ST200

shift_and_add instruction)

7

the only shifts handled correspond

to the multiplication by a power of 2

7

hypotheses are made on the format

of the input coefficients

Problem:

without hypotheses on the formats of the input coefficients, CGPE fails

(19)

The CGPE tool

Shift handling in CGPE

There are 4 types of shifts to consider:

1. multiplication by a power of 2 shifts:

allows to gain a few cycles

shifting is usually less costly than multiplication

addition of a Q

[

1

.

31

]

and a Q

[

2

.

30

]

3. leading zeros’ elimination shifts:

used to gain some bits of precision

0x40000000 in the Q

[

2

.

30

]

format

0x80000000 in the Q

[

1

.

31

]

format

4. overflow prevention shifts:

used before an arithmetic operation to prevent it from

overflowing

to prevent the addition of a Q

[

1

.

31

]

and a Q

[

1

.

31

]

from overflowing the Q

[

1

.

31

]

format,

both operands are shifted to the Q

[

2

.

30

]

format

Remark:

to detect whether one of these shifts is needed, we rely on:

I

fixed-point arithmetic rules (for case 2)

(20)

The CGPE tool

Shift handling in CGPE

There are 4 types of shifts to consider:

1. multiplication by a power of 2 shifts:

allows to gain a few cycles

shifting is usually less costly than multiplication

2. alignment shifts:

used to align commas for an arithmetic operation

addition of a Q

[

1

.

31

]

and a Q

[

2

.

30

]

0x40000000 in the Q

[

2

.

30

]

format

0x80000000 in the Q

[

1

.

31

]

format

4. overflow prevention shifts:

used before an arithmetic operation to prevent it from

overflowing

to prevent the addition of a Q

[

1

.

31

]

and a Q

[

1

.

31

]

from overflowing the Q

[

1

.

31

]

format,

both operands are shifted to the Q

[

2

.

30

]

format

Remark:

to detect whether one of these shifts is needed, we rely on:

I

fixed-point arithmetic rules (for case 2)

(21)

The CGPE tool

Shift handling in CGPE

There are 4 types of shifts to consider:

1. multiplication by a power of 2 shifts:

allows to gain a few cycles

shifting is usually less costly than multiplication

2. alignment shifts:

used to align commas for an arithmetic operation

addition of a Q

[

1

.

31

]

and a Q

[

2

.

30

]

3. leading zeros’ elimination shifts:

used to gain some bits of precision

0x40000000 in the Q

[

2

.

30

]

format

0x80000000 in the Q

[

1

.

31

]

format

overflowing

to prevent the addition of a Q

[

1

.

31

]

and a Q

[

1

.

31

]

from overflowing the Q

[

1

.

31

]

format,

both operands are shifted to the Q

[

2

.

30

]

format

Remark:

to detect whether one of these shifts is needed, we rely on:

I

fixed-point arithmetic rules (for case 2)

(22)

Shift handling in CGPE

There are 4 types of shifts to consider:

1. multiplication by a power of 2 shifts:

allows to gain a few cycles

shifting is usually less costly than multiplication

2. alignment shifts:

used to align commas for an arithmetic operation

addition of a Q

[

1

.

31

]

and a Q

[

2

.

30

]

3. leading zeros’ elimination shifts:

used to gain some bits of precision

0x40000000 in the Q

[

2

.

30

]

format

0x80000000 in the Q

[

1

.

31

]

format

4. overflow prevention shifts:

used before an arithmetic operation to prevent it from

overflowing

to prevent the addition of a Q

[

1

.

31

]

and a Q

[

1

.

31

]

from overflowing the Q

[

1

.

31

]

format,

both operands are shifted to the Q

[

2

.

30

]

format

Remark:

to detect whether one of these shifts is needed, we rely on:

I

fixed-point arithmetic rules (for case 2)

(23)

Shift handling in CGPE

There are 4 types of shifts to consider:

1. multiplication by a power of 2 shifts:

allows to gain a few cycles

shifting is usually less costly than multiplication

2. alignment shifts:

used to align commas for an arithmetic operation

addition of a Q

[

1

.

31

]

and a Q

[

2

.

30

]

3. leading zeros’ elimination shifts:

used to gain some bits of precision

0x40000000 in the Q

[

2

.

30

]

format

0x80000000 in the Q

[

1

.

31

]

format

4. overflow prevention shifts:

used before an arithmetic operation to prevent it from

overflowing

to prevent the addition of a Q

[

1

.

31

]

and a Q

[

1

.

31

]

from overflowing the Q

[

1

.

31

]

format,

both operands are shifted to the Q

[

2

.

30

]

format

Problem:

shifts may affect the critical path, potentially increasing the latency of the DAG

Solution:

use more advanced instructions to help absorb this increase

(24)

Outline of the talk

1. The CGPE tool

2. Instruction selection: an extension of CGPE

(25)

The problem of instruction selection

A well known problem in compilation that was proven to be NP-complete on DAGs.

Usually solved using a tiling algorithm:

I

input:

a DAG representing an arithmetic expression.

a set of tiles, with a cost for each.

a function that associates a cost to a subtree.

I

output:

a set of covering tiles that minimize the cost function.

+

1

7

×

3

x

1

x

2

×

3

x

3

x

4

((

x

1

·

x

2

) + (

x

3

·

x

4

))

FmaLeft

3

6

x

1

x

2

3

×

x

3

x

4

FmaLeft

(

x

1

,

x

2

, (

x

3

·

x

4

))

FmaRight

3

6

×

3

x

1

x

2

x

3

x

4

FmaRight

((

x

1

·

x

2

),

x

3

,

x

4

)

(26)

The problem of instruction selection

A well known problem in compilation that was proven to be NP-complete on DAGs.

Usually solved using a tiling algorithm:

I

input:

a DAG representing an arithmetic expression.

a set of tiles, with a cost for each.

a function that associates a cost to a subtree.

I

output:

a set of covering tiles that minimize the cost function.

+

1

4

×

3

x

1

x

2

×

3

x

3

x

4

((

x

1

·

x

2

) + (

x

3

·

x

4

))

FmaLeft

3

6

x

1

x

2

3

×

x

3

x

4

FmaLeft

(

x

1

,

x

2

, (

x

3

·

x

4

))

FmaRight

3

6

×

3

x

1

x

2

x

3

x

4

FmaRight

((

x

1

·

x

2

),

x

3

,

x

4

)

(27)

Remark on instruction selection

Some work in the area

Voronenko and Püschel from the Spiral group (2004):

Automatic Generation of Implementations for DSP Transforms on Fused

Multiply-Add Architectures.

3

They provide a short proof of optimality in the case of trees.

7

Their method handles FMAs in DAGs but is not generic.

(28)

Instruction selection: an extension of CGPE

The NOLTIS tiling algorithm

Near-Optimal Instruction Selection algorithm (Koes and Goldstein in CGO-2008)

1:

BottomUpDP()

2:

TopDownSelect()

3:

ImproveCSEDecision()

4:

BottomUpDP()

5:

TopDownSelect()

+

×

+

×

a

1

a

2 α

a

3

a

0

1

BottomUpDP

()

3

4

2

5

TopDownSelect

()

the progress step by step of the tiling algorithm

(29)

Instruction selection: an extension of CGPE

The NOLTIS tiling algorithm

Near-Optimal Instruction Selection algorithm (Koes and Goldstein in CGO-2008)

1:

BottomUpDP()

2:

TopDownSelect()

3:

ImproveCSEDecision()

4:

BottomUpDP()

5:

TopDownSelect()

+

×

+

×

a

1

a

2 α

a

3

a

0

1

BottomUpDP

()

3

4

2

5

TopDownSelect

()

the progress step by step of the tiling algorithm

(30)

Instruction selection: an extension of CGPE

The NOLTIS tiling algorithm

Near-Optimal Instruction Selection algorithm (Koes and Goldstein in CGO-2008)

1:

BottomUpDP()

2:

TopDownSelect()

3:

ImproveCSEDecision()

4:

BottomUpDP()

5:

TopDownSelect()

+

×

+

×

a

1

a

2 α

a

3

a

0

1

BottomUpDP

()

3

4

2

5

TopDownSelect

()

the progress step by step of the tiling algorithm

(31)

Instruction selection: an extension of CGPE

The NOLTIS tiling algorithm

Near-Optimal Instruction Selection algorithm (Koes and Goldstein in CGO-2008)

1:

BottomUpDP()

2:

TopDownSelect()

3:

ImproveCSEDecision()

4:

BottomUpDP()

5:

TopDownSelect()

+

×

+

×

a

1

a

2 α

a

3

a

0

1

BottomUpDP

()

3

4

2

5

the progress step by step of the tiling algorithm

(32)

Instruction selection: an extension of CGPE

The NOLTIS tiling algorithm

Near-Optimal Instruction Selection algorithm (Koes and Goldstein in CGO-2008)

1:

BottomUpDP()

2:

TopDownSelect()

3:

ImproveCSEDecision()

4:

BottomUpDP()

5:

TopDownSelect()

+

×

+

×

a

1

a

2 α

a

3

a

0

3

1

4

2

5

TopDownSelect

()

the progress step by step of the tiling algorithm

(33)

Instruction selection: an extension of CGPE

The NOLTIS tiling algorithm

Near-Optimal Instruction Selection algorithm (Koes and Goldstein in CGO-2008)

1:

BottomUpDP()

2:

TopDownSelect()

3:

ImproveCSEDecision()

4:

BottomUpDP()

5:

TopDownSelect()

+

×

+

×

a

1

a

2 α

a

3

a

0

1

3

4

2

5

TopDownSelect

()

the progress step by step of the tiling algorithm

(34)

Instruction selection: an extension of CGPE

Instruction tiles considered in CGPE

Classical tiles

1.

addition tile.

2.

multiplication tile.

3.

shift tile.

Advanced tiles

5.

add3 tiles (left and right).

6.

shiftAdd tiles (available on the ST200 core).

7.

square tile.

add

+

multiply

×

shift



fma_left

×

+

×

add3_left

+

×

+

shift_add_left

+

×



square

×

(35)

Instruction tiles considered in CGPE

Classical tiles

1.

addition tile.

2.

multiplication tile.

3.

shift tile.

Advanced tiles

4.

fma tiles (left and right).

5.

add3 tiles (left and right).

6.

shiftAdd tiles (available on the ST200 core).

7.

square tile.

add

+

multiply

×

shift



fma_left

×

+

×

add3_left

+

×

+

shift_add_left

+

×



square

×

(36)

Simple example

Original code

u i n t 3 2 _ tf u n c _ d 9 _ 0(u i n t 3 2 _ tT,u i n t 3 2 _ tS) { u i n t 3 2 _ tr0 =T>> 2; // (+) Q [ 1.31] u i n t 3 2 _ tr1 = 0x 8 0 0 0 0 0 0 0+r0; // (+) Q [ 1.31] u i n t 3 2 _ tr2 =mul(S,r1) ; // (+) Q [ 2.30] u i n t 3 2 _ tr3 = 0x 0 0 0 0 0 0 2 0+r2; // (+) Q [ 2.30] u i n t 3 2 _ tr4 =mul(T,T) ; // (+) Q [ 0.32] u i n t 3 2 _ tr5 =mul(S,r4) ; // (+) Q [ 1.31] u i n t 3 2 _ tr6 =mul(T, 0x 0 7 f e 9 3 e 4) ; // (+) Q [ 1.31] u i n t 3 2 _ tr7 = 0x 1 0 0 0 0 0 0 0-r6; // ( -) Q [ 1.31] u i n t 3 2 _ tr8 =mul(r5,r7) ; // ( -) Q [ 2.30] u i n t 3 2 _ tr9 =r3 -r8; // (+) Q [ 2.30] u i n t 3 2 _ tr10 =mul(r4,r4) ; // (+) Q [ 0.32] u i n t 3 2 _ tr11 =mul(S, r10) ; // (+) Q [ 1.31] u i n t 3 2 _ tr12 =mul(T, 0x 0 3 2 d 6 6 4 3) ; // (+) Q [ 1.31] u i n t 3 2 _ tr13 = 0x 0 4 e e f 6 9 4-r12; // ( -) Q [ 1.31] u i n t 3 2 _ tr14 =mul(T, 0x 0 0 a e b e 7 d) ; // (+) Q [ 1.31] u i n t 3 2 _ tr15 = 0x 0 1 c 6 c e b d-r14; // ( -) Q [ 1.31] u i n t 3 2 _ tr16 =r4 >> 11; // ( -) Q [ 1.31] u i n t 3 2 _ tr17 =r15 +r16; // ( -) Q [ 1.31] u i n t 3 2 _ tr18 =mul(r4,r17) ; // ( -) Q [ 1.31] u i n t 3 2 _ tr19 =r13 +r18; // ( -) Q [ 1.31] u i n t 3 2 _ tr20 =mul(r11,r19) ; // ( -) Q [ 2.30] u i n t 3 2 _ tr21 =r9 -r20; // (+) Q [ 2.30] ret urnr21; }

Listing 4:

Original C code

With the fma in 3 cycles and the

shift in 1 cycle

u i n t 3 2 _ tf u n c _ t i l e d(u i n t 3 2 _ tT,u i n t 3 2 _ tS) { u i n t 3 2 _ tr0 =power(T, -2) ; u i n t 3 2 _ tr1 =add(0x80000000,r0) ; u i n t 3 2 _ tr2 =f m a _ r i g h t(0x00000020,S,r1) ; u i n t 3 2 _ tr3 =sq uare(T) ; u i n t 3 2 _ tr4 =mul(S,r3) ; u i n t 3 2 _ tr5 =mul(T, 0x 0 7 f e 9 3 e 4) ; u i n t 3 2 _ tr6 =sub(0x10000000,r5) ; u i n t 3 2 _ tr7 =mul(r4,r6) ; u i n t 3 2 _ tr8 =sub(r2,r7) ; u i n t 3 2 _ tr9 =sq uare(r3) ; u i n t 3 2 _ tr10 =mul(S,r9) ; u i n t 3 2 _ tr11 =mul(T, 0x 0 3 2 d 6 6 4 3) ; u i n t 3 2 _ tr12 =sub(0x04eef694,r11) ; u i n t 3 2 _ tr13 =mul(T, 0x 0 0 a e b e 7 d) ; u i n t 3 2 _ tr14 =sub(0x01c6cebd,r13) ; u i n t 3 2 _ tr15 =power(r3, -11) ; u i n t 3 2 _ tr16 =add(r14,r15) ; u i n t 3 2 _ tr17 =f m a _ r i g h t(r12,r3,r16) ; u i n t 3 2 _ tr18 =mul(r10,r17) ; u i n t 3 2 _ tr19 =sub(r8,r18) ; retu rnr19; }

(37)

Simple example

Original code

u i n t 3 2 _ tf u n c _ d 9 _ 0(u i n t 3 2 _ tT,u i n t 3 2 _ tS) { u i n t 3 2 _ tr0 =T>> 2; // (+) Q [ 1.31] u i n t 3 2 _ tr1 = 0x 8 0 0 0 0 0 0 0+r0; // (+) Q [ 1.31] u i n t 3 2 _ tr2 =mul(S,r1) ; // (+) Q [ 2.30] u i n t 3 2 _ tr3 = 0x 0 0 0 0 0 0 2 0+r2; // (+) Q [ 2.30] u i n t 3 2 _ tr4 =mul(T,T) ; // (+) Q [ 0.32] u i n t 3 2 _ tr5 =mul(S,r4) ; // (+) Q [ 1.31] u i n t 3 2 _ tr6 =mul(T, 0x 0 7 f e 9 3 e 4) ; // (+) Q [ 1.31] u i n t 3 2 _ tr7 = 0x 1 0 0 0 0 0 0 0-r6; // ( -) Q [ 1.31] u i n t 3 2 _ tr8 =mul(r5,r7) ; // ( -) Q [ 2.30] u i n t 3 2 _ tr9 =r3 -r8; // (+) Q [ 2.30] u i n t 3 2 _ tr10 =mul(r4,r4) ; // (+) Q [ 0.32] u i n t 3 2 _ tr11 =mul(S, r10) ; // (+) Q [ 1.31] u i n t 3 2 _ tr12 =mul(T, 0x 0 3 2 d 6 6 4 3) ; // (+) Q [ 1.31] u i n t 3 2 _ tr13 = 0x 0 4 e e f 6 9 4-r12; // ( -) Q [ 1.31] u i n t 3 2 _ tr14 =mul(T, 0x 0 0 a e b e 7 d) ; // (+) Q [ 1.31] u i n t 3 2 _ tr15 = 0x 0 1 c 6 c e b d-r14; // ( -) Q [ 1.31] u i n t 3 2 _ tr16 =r4 >> 11; // ( -) Q [ 1.31] u i n t 3 2 _ tr17 =r15 +r16; // ( -) Q [ 1.31] u i n t 3 2 _ tr18 =mul(r4,r17) ; // ( -) Q [ 1.31] u i n t 3 2 _ tr19 =r13 +r18; // ( -) Q [ 1.31] u i n t 3 2 _ tr20 =mul(r11,r19) ; // ( -) Q [ 2.30] u i n t 3 2 _ tr21 =r9 -r20; // (+) Q [ 2.30] ret urnr21; }

Listing 6:

Original C code

With the fma in 3 cycles and the

shift in 3 cycle

u i n t 3 2 _ tf u n c _ t i l e d(u i n t 3 2 _ tT,u i n t 3 2 _ tS) { u i n t 3 2 _ tr0 =f m a _ r i g h t(0x80000000,T, 0x 4 0 0 0 0 0 0 0) ; u i n t 3 2 _ tr1 =f m a _ r i g h t(0x00000020,S,r0) ; u i n t 3 2 _ tr2 =sq uare(T) ; u i n t 3 2 _ tr3 =mul(S,r2) ; u i n t 3 2 _ tr4 =mul(T, 0x 0 7 f e 9 3 e 4) ; u i n t 3 2 _ tr5 =sub(0x10000000,r4) ; u i n t 3 2 _ tr6 =mul(r3,r5) ; u i n t 3 2 _ tr7 =sub(r1,r6) ; u i n t 3 2 _ tr8 =sq uare(r2) ; u i n t 3 2 _ tr9 =mul(S,r8) ; u i n t 3 2 _ tr10 =mul(T, 0x 0 3 2 d 6 6 4 3) ; u i n t 3 2 _ tr11 =sub(0x04eef694,r10) ; u i n t 3 2 _ tr12 =mul(T, 0x 0 0 a e b e 7 d) ; u i n t 3 2 _ tr13 =sub(0x01c6cebd,r12) ; u i n t 3 2 _ tr14 =power(r2, -11) ; u i n t 3 2 _ tr15 =add(r13,r14) ; u i n t 3 2 _ tr16 =f m a _ r i g h t(r11,r2,r15) ; u i n t 3 2 _ tr17 =mul(r9,r16) ; u i n t 3 2 _ tr18 =sub(r7,r17) ; retu rnr18; }

(38)

Instruction selection: an extension of CGPE

Remarks on instruction selection in CGPE

A separation is achieved between the computation of DAGs (Intermediate

Representation) and the code generation process

I

the code can be generated according different criteria

cost function

I

this general approach allows to tackle other problems (sum, dot-product, ...)

We are not bound to use these tiles, we can add many others

I

CGPE can thus serve as a platform of simulation

I

this general approach allows to give

some feedback on the eventual

need or usefulness

of some tiles

Gappa certificate Numeric Filter Output Filter Expressions Generator Polynomial.xml Tiles.xml Tiling filter <polynomial> <coefficient ... > <coefficient ... > <coefficient ... > <tiles> </polynomial> </tiles> <add, cost=1, operands=2 ...> <mul, cost=3, operands=2 ...> <shift, cost=1, operands=1 ...>

Step 1 Step 2 C code ... ... ... ... ... <variable ... > ...

(39)

Remarks on instruction selection in CGPE

A separation is achieved between the computation of DAGs (Intermediate

Representation) and the code generation process

I

the code can be generated according different criteria

cost function

I

this general approach allows to tackle other problems (sum, dot-product, ...)

We are not bound to use these tiles, we can add many others

I

CGPE can thus serve as a platform of simulation

I

this general approach allows to give

some feedback on the eventual

need or usefulness

of some tiles

Gappa certificate Numeric Filter Output Filter Expressions Generator Polynomial.xml Tiles.xml Tiling filter <polynomial> <coefficient ... > <coefficient ... > <coefficient ... > <tiles> </polynomial> </tiles> <add, cost=1, operands=2 ...> <mul, cost=3, operands=2 ...> <shift, cost=1, operands=1 ...>

Step 1 Step 2 C code ... ... ... ... ... <variable ... > ...

(40)

Outline of the talk

1. The CGPE tool

2. Instruction selection: an extension of CGPE

(41)

Conclusion

Code synthesis for fast and certified polynomial evaluation

I

fast and certified C codes, in fixed point arithmetic

I

tool to automate polynomial evaluation implementation, using at best architectural

features

I

implemented in the tool CGPE (Code Generation for Polynomial Evaluation)

http://cgpe.gforge.inria.fr/

Extension of CGPE based on instruction selection:

I

automatic handling of all input formats.

I

better usage of the

advanced

architectural features (such as fma, add-3,

shift-and-add, ...)

I

using a tiling algorithm implies more modularity, as code generation is now an

(42)

Conclusion and perspectives

Current work and perspectives

Current work

I

keep working on instruction selection in CGPE

I

make CGPE more general to tackle other problems, like

matrix inversion

and

multiplication

, ...

Further extensions of CGPE

I

handle other arithmetics like floating-point arithmetic, where the fma tile is more and

more ubiquitous

(43)

Current work and perspectives

Current work

I

keep working on instruction selection in CGPE

I

make CGPE more general to tackle other problems, like

matrix inversion

and

multiplication

, ...

Further extensions of CGPE

I

handle other arithmetics like floating-point arithmetic, where the fma tile is more and

more ubiquitous

(44)

5e Rencontres Arithmétique de l’Informatique Mathématique (RAIM 2012)

Dijon, 20-22 juin 2012

Synthesis of fixed-point programs based on

instruction selection

... the case of polynomial evaluation

Amine Najahi

Advisors

: Matthieu Martel and Guillaume Revy

Joint work with Christophe Mouilleron

Équipe-projet DALI, Univ. Perpignan Via Domitia

LIRMM, CNRS: UMR 5506 - Univ. Montpellier 2

D

Références

Documents relatifs

In Section 5.1, we give presentations of certain braid groups of T 2 , in Section 5.2, we describe the set of homotopy classes of split 2-valued maps of T 2 , and in Section 5.3,

Unit´e de recherche INRIA Rennes, Irisa, Campus universitaire de Beaulieu, 35042 RENNES Cedex Unit´e de recherche INRIA Rh ˆone-Alpes, 46 avenue F´elix Viallet, 38031 GRENOBLE Cedex

Hence, the surface position results from the interplay of the three involved physical mechanisms: surface tension and probe/liquid interaction forces act directly, whereas

L’accès à ce site Web et l’utilisation de son contenu sont assujettis aux conditions présentées dans le site LISEZ CES CONDITIONS ATTENTIVEMENT AVANT D’UTILISER CE SITE

The overall objective of the current research is to investigate CALL normalization at the level of English Language Department at Djilali Liabes University, as tackling the issue at

This paper gives a polynomial algorithm for the synthesis of a non-deterministic asynchronous automaton from a regular Mazurkiewicz trace language.. This new construction is based on

These algorithms search recursively in the APEG where a symmetric binary operator is repeated (we referred at these parts as homogeneous parts). As the propagation algorithms tend

For our analytical approach, first the analytical expression of Pb must be com- puted and it represents the most time consuming part equal to 46 seconds for method based on