L*Université de Bordeaux

(1)

Travaux de Recherche période 2011-2012

Bertrand LE GAL  

(

bertrand.legal@ims-bordeaux.fr

)

Laboratoire IMS- UMR CNRS 5218 Institut Polytechnique de Bordeaux

Université de Bordeaux France

Septembre 2012

31.05.2010

^1/2

L"Université de Bordeaux publie une étude comparative des initiatives campus verts menées à l"échelon international, qui représente une source précieuse d"informations et de réflexions pour l"élaboration de son nouveau modèle d"Université dans le cadre de l"Opération campus. Elle fait le choix de diffuser en accès libre cette étude, le développement durable étant l"affaire de tous et pour l"intérêt de tous.

L*Université de Bordeaux s*est engagée à bâtir un nouveau modèle d*Université, et parallèlement à de- venir leader en matière de développement durable. C*est en ce sens que début 2009, elle a répondu favorablement, conjointement avec l*Université Bordeaux 1 Sciences Technologies, à la proposition d*Ecocampus-Nobatek et d*EDF : réaliser un retour d*expériences et des analyses sur des projets campus verts en France, Europe et Amérique du Nord.

L*objectif de cette étude (cf. page suivante) a été d*observer et de capturer les bonnes pratiques et ac- tions exemplaires relatives aux grands piliers du développement durable : domaines économiques, so- ciaux, environnementaux et organisationnels. L*Université de Bordeaux va s*y référer pour mettre en Wuvre une gouvernance et une stratégie à long terme au service d*un campus plus vivable et plus équi- table pour l*ensemble de la communauté universitaire.

Avec le Grenelle de l*environnement comme repère à atteindre puis à dépasser, l*Université de Bor- deaux entend constituer un site pilote à travers une démarche de développement durable globale par : - l*intégration permanente des dimensions humaines dans le projet immobilier et l*aménagement (acces- sibilité, santé, lisibilité, confort, cadre de vie) ;

- une transformation énergétique radicale des bâtiments dans le cadre de leur rénovation en démarche HQE® et un schéma directeur énergétique pour une réduction maximale des gaz à effet de serre ; - la mise en valeur et la sanctuarisation d*un parc sur le site universitaire de Talence-Pessac- Gradignan, véritable poumon vert à l*échelle de l*agglomération, atout exceptionnel pour la qualité de vie des usagers et le développement de la biodiversité en milieu urbain ;

- un plan de déplacement sur l*ensemble des domaines du campus universitaire, afin de réduire l*usage individuel de la voiture et son impact en s*appuyant sur des réseaux de transports en commun perfor- mants et le développement des modes alternatifs ;

- une ouverture concertée sur la ville, visant à favoriser le développement économique des territoires, celui de la vie de campus et à créer une mixité sociale et fonctionnelle ;

- et enfin, condition sine qua non de réussite, la mise en place d*un processus d*information et de concer- tation auprès de tous les membres et acteurs de l*Université, pour une compréhension partagée des en- jeux et un apprentissage des comportements responsables.

Aussi, l*Université de Bordeaux entend-elle élaborer un agenda 21 et faire de son campus un site d*expérimentation permettant de développer des approches innovantes à partir des compétences des laboratoires.

L*étude « Initiatives campus verts » est téléchargeable sur le site www.univ-bordeaux.fr

Contacts presse Université de Bordeaux

Anne SEYRAFIAN . Norbert LOUSTAUNAU . T 33 (0)5 56 33 80 84 . communication@univ-bordeaux.fr Contact Nobatek-Ecocampus

Julie CREPIN, chef de projet . T 33 (0)5 56 84 63 72 . jcrepin@nobatek.com

L*Université de Bordeaux

Vers un nouveau modèle d"Université DURABLE

(2)

Equipe CSN - IMS Septembre 2012

Réduction de la complexité matérielle des processeurs softcore

2

(3)

Les différentes cibles architecturales et technologiques

3

Architecture dédiée Implanta2on logicielle Codesign & so8cores

Circuit ASIC préconçu et fondu par un 2ers

(4)

Méthodologie de suppression des instructions

๏ Phase d’optimisation du coeur

➡ Code source applicatif

‣ Compilé à l’aide de GCC,

‣ Décompilé afin d’obtenir les

instructions réellement exécutées,

➡ Analyse du code assembleur,

‣ Déterminer les instructions inutilisées,

➡ Modification du code VHDL,

‣ Modification du décodeur d’instructions,

‣ Suppression des unités matérielles (ALUs) inutilisées,

➡ Modification du compilateur,

๏ Méthodologie automatisée,

4

VHDL conﬁg.

GCC compiler (modiﬁed)

Source code Binary

program

Modiﬁed processor Instruc2on

usages

Processor (Leon 3)

(5)

Evaluation des performances de l’approche

5

1 Enc/décodeur ADPCM MiBench

Communication numérique

2 Enc/décodeur BCH (31,21) [19]

3 Enc/décodeur de GOLAY [19]

4 Décodeur LDCP Hand written

5 Décodeur MP3 Helix community

6 CRC 32b TRAP [24]

Application cryptographique

et/ou sécurité

7 MD5 Fast DES kit

8 SHA-1 MiBench

9 SHA-2

10 Enc/décodeur ARC4

11 Enc/décodeur DES / 3-DES Fast DES kit 12 Enc/décodeur AES (128b, 512b)

13 Contrôle moteur TRAP

Application de type contrôle

14 Tri de données TRAP

15 Queens TRAP

16 Recherche de motifs TRAP

17 Compression de texte (v42) TRAP

18 g3fax Hand written

19 Contrôle d’écran LCD Hand written

20 Filtrage LMS Hand written

Application de traitement du

signal

21 Filtrage RIF (512 points) TRAP

22 Traitement FFT, iFFT (virgule fixe) TRAP

23 Filtre d’annulation d’écho LibGSM

24 Décompresion vidéo MJPEG TRAP Application de

traitement de la vidéo 25 Vidéo surveillance (detec. mouvements) Hand written

26 Egalisation du contraste temps réel Hand written

26 applications diﬀérentes

2 coeurs de processeur de type softcore

3 cibles technologiques

(6)

Résultats obtenus: processeur «Plasma»

6

0 % 20 % 40 % 60 % 80 % 100 %

1 3 5 7 9 11 13 15 17 19 21 23 25

USELESS INSTRUCTIONS (%)

0 % 10 % 20 % 30 % 40 % 50 %

1 3 5 7 9 11 13 15 17 19 21 23 25

SPARTAN-6 (%) ASIC-65nm (%)

Gain en % sur le cout en surface - Cibles matérielles (ASIC et FPGA) Taux d’instructions inutilisées

(7)

Résultats obtenus: processeur «Léon-3»

7

0 % 20 % 40 % 60 % 80 % 100 %

1 3 5 7 9 11 13 15 17 19 21 23 25

Taux d’instructions inutilisées

0 4000 8000 12000 16000 20000

1 3 5 7 9 11 13 15 17 19 21 23 25

Architecture optimisée Architecture originale

Cout matériel sur cible ACTEL ProASIC-3 (# de eCore)

(8)

Conception d’ASIP dédié aux FEC

(architecture flexible pour les codes LDPC)

8

(9)

Introduction aux communications numériques

9

Informations émises QAM-256

Informations reçues QAM-256

(10)

Introduction aux codes correcteurs d’erreurs

6

Correction des erreurs de transmission

1

, C

2

)

v

3

= G(C

1

, C

2

, C

4

) v

4

= G(C

3

, C

4

)

v

5

= G(C

2

, C

3

) v

6

= G(C

1

, C

3

, C

4

)

DVB-S2 : 20 à 30 itérations, 64k noeuds 32k checks, 90Mbits/s

(12)

Présentation de l’architecture de l’ASIP

12

data

input ASIP

core

data output

uProcessor core

Configurable LDPC decoder

array I/O

data control

signals

RAMs RAMs RAMs RAMs

PU PU PU PU

I I

RAM

Computation sequencer

D M

Contraintes de conception:

- Flexibilité & évolutivité - Surcoût matériel limité - Contraintes de débit élevées

(13)

any bank memory access conflict. Previous works such as [19]

[20], proposed methods to map the data in different memory banks without access conflict. In our case, the mapping of LLR values in the P block memories is not constrained. This map- ping is just an information that has to be considered during the schedule of the PU execution to know the LLR availability, as explained in the next sub-section.

Global memory banks

Processing units with their own local memories RAM

(LLR) RAM

(LLR)

PU (logic)

Reg.

ﬁle

PU (logic)

Reg.

ﬁle

PU (logic)

Reg.

ﬁle

PU (logic)

Reg.

ﬁle instr.

decod.

virtual.

layer

uProcessor core System interface

∏ / ∏

instr.

RAM

control signals

-1

decision data LLR data

Conﬁg. data Status

Fig. 3: Homogeneous Single-Instruction Multiple-Data matrix

4.3. Design flow for LDPC decoder generation

Designing LDPC decoder based on our ASIP architecture model requires a design flow to automatically generate the SIMD matrix, the memory mapping and the PU execution spe- cification. Proposed automatic design methodology is detailed in Fig. 4. A first step is the analysis of the LDPC code and in particular its parity check matrix H. This analysis enables to determine the degrees d

^vn

and d

^cm

of each variable node n and parity check node m, respectively. It also enables to esti- mate the maximum parallelism level of the SIMD matrix. This information associated to the bipartite graph representation of the LDPC code is required to the construction of a constraint graph over the PU execution. The rest of the design flow is then applied on the constraint graph.

First, an allocation task is executed for a given parallelism level P . The purpose of the allocation algorithm is to map all the LLR values T

ⁿ

to the P memory blocks. It means that the size of each memory block is equal to ⌈n/P ⌉. Three different memory mappings are proposed in our design flow : block by block, data by data modulo P and fixed by the designer. The two first approaches are low cost in terms of control resources because the data accesses are regular. However, they introduce a memory mapping constraint for the scheduling-binding that do not take into account the LDPC code construction.

The most critical task is the scheduling-binding of the PU executions. This task is performed concurrently in order to take into account the memory mapping. A resource constrained sche- duling also called List-Based scheduling is used. This algorithm is a generalization of the ASAP algorithm with the inclusion of memory mapping constraints. A scheduling priority list is pro- vided according to a priority function. Naturally, the efficiency of this algorithm mainly depends on the priority function. In our design flow, this function depends on the mobility of the PU executions and the data availability. Once all the tasks are completed, the VHDL RTL description of the SIMD matrix is generated. Finally, the Plasma processor has to be programmed to execute the corresponding firmware C-code.

Architecture parameters LDPC H

matrix

Constraint

graph Scheduling

Binding Memory mapping

Code generation

PU

conﬁguration Analysis

Firmware C codes SIMD matrix

generation

Fig. 4: Methodology for the generation of the ASIP LDPC decoder

4.4. Firmware C-code dedicated to the ASIP architecture The Plasma CPU executes all MIPS I (TM) user mode ins- tructions except unaligned load and store operations. Instruc- tions are divided into three types : R, I and J. As it has the same instruction set as a MIPS processor, a GNU tool chain can be used for its programming. Eleven new instructions have been added to the Plasma CPU instruction set to increase its ef- ficiency in terms of execution cycles. As some of the MIPS I instructions and corresponding hardware resources are useless in our design, we have optimized the softcore processor. To per- form this optimization, we have applied an automated methodo- logy described in [11]. The methodology is based on the extrac- tion of the application characteristics from the binary program file to remove useless parts of the processor core. An example of firmware C-code example to illustrate the Plasma CPU pro- gramming process is given in Listing 1. The firmware is part of the LDPC decoding that considered loop = 20 iterations and a frame size n. Six instructions have been defined to directly specify the PU execution :

– First.P-C : register initialization and parity check node – P-C : parity check node

– First.Var : register initialization and variable node – Var : variable node

– First.P-C&Var : register initialization, parity check node and variable node

– P-C&Var : parity check node and variable node

v o i d l d p c _ d e c o d e r ( )

{ i n t l o o p = 2 0 ;

w h i l e ( l o o p )

{ F i r s t . P−C ( 1 ) ; P−C( n−1);

F i r s t . P−C&Var ( 1 ) ; P−C&Var ( n−1);

F i r s t . P−C ( 1 ) ; P−C&Var ( n−1);

/ / . . . . . . . . . . . . F i r s t . Var ( 1 ) ;

Var ( n−1); l o o p −= 1 ; }

}

Listing 1: Firmware C-code example for proposed ASIP architecture

Architecture interne de l’ASIP

13

MIPS-32b core

Instruction memory

Shared LLR memories

Processing elements + local memories Interleavers

(Benes) Abstraction layer

memories

(cycle => data set)

(14)

Architecture (simplifiée) du processeur élémentaire (PE)

14

dynamique des

noeuds dynamique des

messages Cout matériel

(outil Synplify) Chemin

critique max.

12 bits 10 bits 210 LUT(s) 69 REG(s) 1 RAM 14,599 ns 10 bits 8 bits 173 LUT(s) 57 REG(s) 1 RAM 12,606 ns 9 bits 7 bits 171 LUT(s) 51 REG(s) 1 RAM 11,090 ns 8 bits 6 bits 133 LUT(s) 45 REG(s) 1 RAM 10.356 ns 6 bits 4 bits 101 LUT(s) 33 REG(s) 1 RAM 8.538 ns

Coeur sans mémoire - Résultats obtenus pour une cible Virtex-6

VHDL conﬁg.

sub Register

file

MIN(s)

sign xor

fifo

sign min1/2

T_m

E_mn

cst₁

cst₂ mux

=

inv add

min₁

sign

Parity check

processing LLR update

processing

(t) (t)

T_nm^(t) T_m^(t+1)

E^(t+1)_mn

min₂

(15)

Mémorisation des données d’entrée - 2 approches

15

En fonction du code LDPC

à implanter 2 approches sont proposées pour mémoriser les entrées du système

=> Favorise l’exploitation du parallélisme entre les calculs (diminution des conflits

d’accès à la mémoire)

LDPC frame n

V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 ...

V1

V2

V3

V4

V5

V6

V7

V8

V9

V10

V11

...

RAM (1) RAM (2) RAM (3)

...

RAM (4) LLR memory

filling LDPC frame n

V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 ...

V1

V5

V9

...

V2

V6

V10

...

V3

V7

V11

...

RAM (1) RAM (2) RAM (3)

V4

V8

V12

...

RAM (4) LLR memory

filling

Global memory banks

Processing elements with their own local memories RAM

(LLR) RAM

(LLR)

(logic)PE

Reg.file

(logic)PE

Reg.file

(logic)PE

Reg.file

(logic)PE

Reg.file instr.

decod.

virtual.

layer

∏ / ∏

instr.

RAM

control signals

-1

data output data input

(16)

Automatic compilation flow for efficient usage rate

16

๏ Compilation flow inputs

➡ LDPC H matrix (one positions)

➡ # of PU to allocate

➡ Max. fixed point format

➡ Targetted throughtput (c)

๏ Estimated parmeters

➡ Throughput at 100MHz (FPGA)

➡ Frequency to achieve (c) throughput

➡ Implementation characteristics (usage rates, cycle/bit, estimated area, etc.)

๏ Generated files

➡ ASIP processor + PU matrix (RTL)

➡ C codes for ASIP execution

➡ Testbench at various levels (C => VHDL) any bank memory access conflict. Previous works such as [19]

[20], proposed methods to map the data in different memory banks without access conflict. In our case, the mapping of LLR values in the P block memories is not constrained. This mapping is just an information that has to be considered during the schedule of the PU execution to know the LLR availability, as explained in the next sub-section.

Global memory banks

Processing units with their own local memories RAM

(LLR) RAM

(LLR)

PU (logic)

Reg.

ﬁle

PU (logic)

Reg.

ﬁle

PU (logic)

Reg.

ﬁle

PU (logic)

Reg.

ﬁle instr.

decod.

virtual.

layer

∏ / ∏

instr.

RAM

control signals

-1

decision data LLR data

Conﬁg. data Status

Fig. 3: Homogeneous Single-Instruction Multiple-Data matrix

4.3. Design flow for LDPC decoder generation

Designing LDPC decoder based on our ASIP architecture model requires a design flow to automatically generate the SIMD matrix, the memory mapping and the PU execution spe- cification. Proposed automatic design methodology is detailed in Fig. 4. A first step is the analysis of the LDPC code and in particular its parity check matrix H. This analysis enables to determine the degrees d^vn and d^cm of each variable node n and parity check node m, respectively. It also enables to esti- mate the maximum parallelism level of the SIMD matrix. This information associated to the bipartite graph representation of the LDPC code is required to the construction of a constraint graph over the PU execution. The rest of the design flow is then applied on the constraint graph.

First, an allocation task is executed for a given parallelism level P. The purpose of the allocation algorithm is to map all the LLR values Tⁿ to the P memory blocks. It means that the size of each memory block is equal to ⌈n/P⌉. Three different memory mappings are proposed in our design flow : block by block, data by data modulo P and fixed by the designer. The two first approaches are low cost in terms of control resources because the data accesses are regular. However, they introduce a memory mapping constraint for the scheduling-binding that do not take into account the LDPC code construction.

The most critical task is the scheduling-binding of the PU executions. This task is performed concurrently in order to take into account the memory mapping. A resource constrained scheduling also called List-Based scheduling is used. This algorithm is a generalization of the ASAP algorithm with the inclusion of memory mapping constraints. A scheduling priority list is pro- vided according to a priority function. Naturally, the efficiency of this algorithm mainly depends on the priority function. In our design flow, this function depends on the mobility of the PU executions and the data availability. Once all the tasks are completed, the VHDL RTL description of the SIMD matrix is generated. Finally, the Plasma processor has to be programmed to execute the corresponding firmware C-code.

Architecture parameters LDPC H

matrix

Constraint

graph Scheduling

Binding Memory mapping

Code generation

PU conﬁguration Analysis

Firmware C codes SIMD matrix

generation

Fig. 4: Methodology for the generation of the ASIP LDPC decoder

4.4. Firmware C-code dedicated to the ASIP architecture The Plasma CPU executes all MIPS I (TM) user mode instructions except unaligned load and store operations. Instruc- tions are divided into three types : R, I and J. As it has the same instruction set as a MIPS processor, a GNU tool chain can be used for its programming. Eleven new instructions have been added to the Plasma CPU instruction set to increase its efficiency in terms of execution cycles. As some of the MIPS I instructions and corresponding hardware resources are useless in our design, we have optimized the softcore processor. To per- form this optimization, we have applied an automated methodology described in [11]. The methodology is based on the extrac- tion of the application characteristics from the binary program file to remove useless parts of the processor core. An example of firmware C-code example to illustrate the Plasma CPU programming process is given in Listing 1. The firmware is part of the LDPC decoding that considered loop = 20 iterations and a frame size n. Six instructions have been defined to directly specify the PU execution :

– First.P-C : register initialization and parity check node – P-C : parity check node

– First.Var : register initialization and variable node – Var : variable node

– First.P-C&Var : register initialization, parity check node and variable node

– P-C&Var : parity check node and variable node

v o i d l d p c _ d e c o d e r ( )

{ i n t l o o p = 2 0 ;

w h i l e ( l o o p )

{ F i r s t . P−C ( 1 ) ; P−C( n−1);

F i r s t . P−C&Var ( 1 ) ; P−C&Var ( n−1);

F i r s t . P−C ( 1 ) ; P−C&Var ( n−1);

/ / . . . . . . . . . . . . F i r s t . Var ( 1 ) ;

Var ( n−1);

l o o p −= 1 ; }

}

Listing 1: Firmware C-code example for proposed ASIP architecture

(17)

The automated compilation flow for LDPC codes

17

(18)

Equipe CSN - IMS Septembre 2012 18

2

0 1 3 4 5 6 7 8 9 ¹⁰

C₀ C₁ C₂

11

C₃ C₄ C₅ C₆

M₀

M₁ M₃

M₅ M₂

M₆

M₄

Computations LLR(1) => C4 LLR(3) => C4 LLR(5) => C4 LLR(7) => C4 C4 => LLR(1) C4 => LLR(3) C4 => LLR(5) C4 => LLR(7)

2

0 1 3 4 5 6 7 8 9 ¹⁰

C₀ C₁ C₂

11

C₃ C₄ C₅ C₆

M₀

M₁ M₃

M₅ M₂

M₆

M₄

The Tanner graph shows the links between the variable nodes and the check nodes.

Horizontal scheduling => when a check node is updated, all linked var. nodes are updated

Factorized graph model - Check nodes centred (C^x) - Mx includes all computations  for Cx update

- Edges show Cx nodes 

having execution dependency => M^xnodes that cannot be  computed in parallel

(19)

Equipe CSN - IMS Septembre 2012 19

2

0 1 3 4 5 6 7 8 9 ¹⁰

C₀ C₁ C₂

11

C₃ C₄ C₅ C₆

M₀

M₁ M₃

M₅ M₂

M₆

M₄

Computations Mx have to be scheduled (clock cycles) and bind (PUx) on the architecture.

Sched. and binding constraints:

- Computation dependency (edges)  - Acces conflits (shared memory)

P.E. 1 1

2 3 4

# of processing elements 5

6 7 8 clock

cycle

RAM₀ RAM₁ RAM₂ RAM₃ 1

2 3 4 5 6 7 8

Global memory acces reservation table Processing Unit reservation table

P.E. 2 P.E. 2 P.E. 4

# of processing elements

(20)

Computation scheduling and binding algorithm

20

2

0 1 3 4 5 6 7 8 9 ¹⁰

C₀ C₁ C₂

11

C₃ C₄ C₅ C₆

M₀

M₁ M₃

M₅ M₂

M₆

M₄

P.E. 1 1

2 3 4

6 7 8 clock

cycle

2 3 4 5 6 7 8

P.E. 2 P.E. 2 P.E. 4

# of processing elements

(21)

Computation scheduling and binding algorithm

21

2

0 1 3 4 5 6 7 8 9 ¹⁰

C₀ C₁ C₂

11

C₃ C₄ C₅ C₆

M₀

M₁ M₃

M₅ M₂

M₆

M₄

P.E. 1 1

2 3 4

6 7 8 clock

cycle

2 3 4 5 6 7 8

P.E. 2 P.E. 2 P.E. 4

# of processing elements M₄

M₄ M₄ M₄

M₄ M₄

M₄ M₄ M₄

(22)

Computation scheduling and binding algorithm

22

2

0 1 3 4 5 6 7 8 9 ¹⁰

C₀ C₁ C₂

11

C₃ C₄ C₅ C₆

M₀

M₁ M₃

M₅ M₂

M₆

M₄

P.E. 1 1

2 3 4

6 7 8 clock

cycle

2 3 4 5 6 7 8

P.E. 2 P.E. 2 P.E. 4

M₄ M₄ M₄

M₄ M₄

M₄ M₄ M₄

(23)

Computation scheduling and binding algorithm

23

2

0 1 3 4 5 6 7 8 9 ¹⁰

C₀ C₁ C₂

11

C₃ C₄ C₅ C₆

M₀

M₁ M₃

M₅ M₂

M₆

M₄

P.E. 1 1

2 3 4

6 7 8 clock

cycle

2 3 4 5 6 7 8

P.E. 2 P.E. 2 P.E. 4

M₄ M₄ M₄

M₄ M₄

M₄ M₄ M₄

(24)

Computation scheduling and binding algorithm

24

2

0 1 3 4 5 6 7 8 9 ¹⁰

C₀ C₁ C₂

11

C₃ C₄ C₅ C₆

M₀

M₁ M₃

M₅ M₂

M₆

M₄

P.E. 1 1

2 3 4

6 7 8 clock

cycle

2 3 4 5 6 7 8

P.E. 2 P.E. 2 P.E. 4

M₄ M₄ M₄

M₄ M₄

M₄ M₄ M₂

M₄ M₄ M₄ M₄

M₂ M₂ M₂

M₂ M₂ M₂ M₂

M₂ M₂ M₂

M₂

(25)

Computation scheduling and binding algorithm

25

2

0 1 3 4 5 6 7 8 9 ¹⁰

C₀ C₁ C₂

11

C₃ C₄ C₅ C₆

M₀

M₁ M₃

M₅ M₂

M₆

M₄

P.E. 1 1

2 3 4

6 7 8 clock

cycle

2 3 4 5 6 7 8

P.E. 2 P.E. 2 P.E. 4

M₄ M₄ M₄

M₄ M₄

M₄ M₄ M₂

M₄ M₄ M₄ M₄

M₂ M₂ M₂

M₂ M₂ M₂ M₂

M₂ M₂ M₂

M₂

(26)

Computation scheduling and binding algorithm

26

2

0 1 3 4 5 6 7 8 9 ¹⁰

C₀ C₁ C₂

11

C₃ C₄ C₅ C₆

M₀

M₁ M₃

M₅ M₂

M₆

M₄

P.E. 1 1

2 3 4

6 7 8 clock

cycle

2 3 4 5 6 7 8

P.E. 2 P.E. 2 P.E. 4

M₄ M₄ M₄

M₄ M₄

M₄ M₄ M₂

M₄ M₄ M₄ M₄

M₂ M₂ M₂

M₂ M₂ M₂ M₂

M₂ M₂ M₂

M₂

Impossible to schedule the M⁵ computation:

- It do not share LLR value => OK - It needs to access RAM2 to read  one LLR value: 

+ Requied one clock cycle in  memory reservation table.

(27)

P.E. 1 1

2 3 4 5 6 7 8 macro

cycle 1

2 3 4 5 6 7 8

P.E. 2 P.E. 2 P.E. 4 M₄

M₄ M₄ M₄

M₄ M₄

M₄ M₄ M₂

M₄ M₄ M₄ M₄

M₂ M₂ M₂

M₂ M₂ M₂ M₂

M₂ M₂ M₂

M₂ M₅

M₅ M₅ M₅

...

macro cycle 0

9 9

M₅ M₅

M₅

Computation scheduling and binding algorithm

27

2

0 1 3 4 5 6 7 8 9 ¹⁰

C₀ C₁ C₂

11

C₃ C₄ C₅ C₆

M₀

M₁ M₃

M₅ M₂

M₆

M₄

Impossible to schedule the M⁵ computation:

- It do not share LLR value => OK - It needs to access RAM2 to read  one LLR value: 

+ Requied one clock cycle in  memory reservation table.

(28)

L'ordonnancement des calculs sur les PEs

28

Contraintes LDCP H

matrice

Modèle interne

Testbench Architecture

VHDL

Ordonnancement et assignation

Allocation mémoire

Génération (VHDL & C)

Modèle C (BER) Code C

(commande ASIP) Analyse

(29)

Résultats d’expérimentation (norme DVB-S2)

29

# bits # entrées

LUTs FFs Slices RAMs DSPs Fc

(10. 12. 10)

32 cores 47898 7873 55771 416 0 100MHz 64 cores 57414 12738 70152 415 0 100MHz 96 cores 85387 18051 103438 416 0 100MHz 128 cores 96582 23308 119890 416 0 100MHz

(6. 8. 6)

(5. 7. 5)

Architecture supportant tous les codes DVB-S2 + tailles inférieures Résultats obtenus post-synthèse à l’aide de Synopsys Synplify (2011).

Cout = PCIe + Coeur Proc. + Matrice SIMD