
(1)

Research Work, 2011-2012 Period

Bertrand LE GAL (bertrand.legal@ims-bordeaux.fr)

Laboratoire IMS, UMR CNRS 5218, Institut Polytechnique de Bordeaux
Université de Bordeaux, France

September 2012

The Université de Bordeaux is publishing a comparative study of green-campus initiatives conducted internationally, a valuable source of information and reflection for the design of its new University model within the Opération campus framework. It has chosen to release this study in open access, since sustainable development concerns everyone and serves everyone's interest.

The Université de Bordeaux has committed to building a new University model and, in parallel, to becoming a leader in sustainable development. To that end, in early 2009 it responded favourably, jointly with the Université Bordeaux 1 Sciences Technologies, to a proposal from Ecocampus-Nobatek and EDF: to compile feedback and analyses on green-campus projects in France, Europe and North America.

The goal of this study (see next page) was to observe and capture good practices and exemplary actions relating to the main pillars of sustainable development: economic, social, environmental and organisational. The Université de Bordeaux will draw on it to implement long-term governance and a long-term strategy in the service of a more liveable, more equitable campus for the whole university community.

With the Grenelle de l'environnement as a benchmark to reach and then exceed, the Université de Bordeaux intends to become a pilot site through a global sustainable-development approach based on:

- the permanent integration of human dimensions into the building project and site planning (accessibility, health, legibility, comfort, quality of life);

- a radical energy transformation of the buildings as part of their HQE®-certified renovation, and an energy master plan for a maximum reduction of greenhouse-gas emissions;

- the enhancement and protection of a park on the Talence-Pessac-Gradignan university site, a genuine green lung at the scale of the metropolitan area and an exceptional asset for users' quality of life and for urban biodiversity;

- a mobility plan covering all campus sites, to reduce individual car use and its impact by relying on efficient public-transport networks and on the development of alternative modes;

- a concerted opening onto the city, aiming to foster local economic development and campus life and to create social and functional diversity;

- and finally, as a sine qua non of success, an information and consultation process involving all members and stakeholders of the University, for a shared understanding of the issues and the learning of responsible behaviour.

The Université de Bordeaux also intends to draw up an Agenda 21 and to make its campus an experimentation site for developing innovative approaches based on the expertise of its laboratories.

The study « Initiatives campus verts » can be downloaded from www.univ-bordeaux.fr

Press contacts, Université de Bordeaux
Anne SEYRAFIAN . Norbert LOUSTAUNAU . T +33 (0)5 56 33 80 84 . communication@univ-bordeaux.fr

Contact Nobatek-Ecocampus
Julie CREPIN, project manager . T +33 (0)5 56 84 63 72 . jcrepin@nobatek.com

Université de Bordeaux: towards a new SUSTAINABLE University model

(2)

Equipe CSN - IMS, September 2012

Reducing the hardware complexity of softcore processors

(3)

The different architectural and technology targets

[Figure: the design-space trade-off between a dedicated architecture, a software implementation, and codesign & softcores; at one end, a pre-designed ASIC circuit fabricated by a third party]

(4)

Instruction-removal methodology

๏ Core optimisation phase

➡ Application source code

‣ compiled with GCC,

‣ disassembled to obtain the instructions actually executed;

➡ Assembly-code analysis,

‣ to determine the unused instructions;

➡ VHDL code modification,

‣ modification of the instruction decoder,

‣ removal of the unused hardware units (ALUs);

➡ Compiler modification.

๏ Fully automated methodology.

[Flow diagram: source code is compiled by the (modified) GCC compiler into the binary program; instruction-usage analysis of the Leon-3 processor drives the VHDL configuration, producing the modified processor]

(5)

Performance evaluation of the approach

Digital communications:
1. ADPCM encoder/decoder (MiBench)
2. BCH (31,21) encoder/decoder [19]
3. Golay encoder/decoder [19]
4. LDPC decoder (hand written)
5. MP3 decoder (Helix community)
6. CRC 32b (TRAP [24])

Cryptographic and/or security applications:
7. MD5 (Fast DES kit)
8. SHA-1 (MiBench)
9. SHA-2
10. ARC4 encoder/decoder
11. DES / 3-DES encoder/decoder (Fast DES kit)
12. AES (128b, 512b) encoder/decoder

Control-type applications:
13. Motor control (TRAP)
14. Data sorting (TRAP)
15. Queens (TRAP)
16. Pattern matching (TRAP)
17. Text compression (v42) (TRAP)
18. g3fax (hand written)
19. LCD screen control (hand written)

Signal-processing applications:
20. LMS filtering (hand written)
21. FIR filtering (512 points) (TRAP)
22. FFT, iFFT (fixed point) (TRAP)
23. Echo-cancellation filter (LibGSM)

Video-processing applications:
24. MJPEG video decompression (TRAP)
25. Video surveillance (motion detection) (hand written)
26. Real-time contrast equalisation (hand written)

26 different applications
2 softcore processor cores
3 technology targets

(6)

Results obtained: the «Plasma» processor

[Charts: rate of unused instructions (%) per application (1-25), and area-cost gain (%) on the hardware targets, Spartan-6 FPGA and 65 nm ASIC]

(7)

Results obtained: the «Leon-3» processor

[Charts: rate of unused instructions (%) per application, and hardware cost on the ACTEL ProASIC-3 target (# of eCores), optimised architecture vs. original architecture]

(8)

Design of an ASIP dedicated to FEC
(a flexible architecture for LDPC codes)

(9)

Introduction to digital communications

[Constellations: transmitted QAM-256 symbols vs. received QAM-256 symbols]

(10)

Introduction to error-correcting codes

information to transmit = [v1, v2, v3, v4, v5, v6]

insertion of the redundancy:
transmitted information = [v1, v2, v3, v4, v5, v6, r1, r2, r3, r4, r5, r6]

transmission of the information:
received information = [v1, v2, v3, v4, v5, v6, r1, r2, r3, r4, r5, r6]

correction of the transmission errors:
decoded information = [v1, v2, v3, v4, v5, v6]

(11)

The LDPC decoding principle

Example of a received LDPC frame: frame = [v1, v2, v3, v4, v5, v6]

Decoding (parity-check) matrix and associated Tanner graph:

C1 = F(v2, v3, v6)
C2 = F(v1, v2, v3, v5)
C3 = F(v1, v4, v5, v6)
C4 = F(v3, v4, v6)

Decoding process:

v1 = G(C2, C3)        v2 = G(C1, C2)
v3 = G(C1, C2, C4)    v4 = G(C3, C4)
v5 = G(C2, C3)        v6 = G(C1, C3, C4)

DVB-S2: 20 to 30 iterations, 64k variable nodes, 32k checks, 90 Mbit/s

(12)

Overview of the ASIP architecture

[Diagram: ASIP core with data input/output; a uProcessor core drives a configurable LDPC decoder array through control signals; RAM banks feed the processing units (PU) under a computation sequencer]

Design constraints:
- flexibility & scalability
- limited hardware overhead
- high throughput requirements

(13)


… any bank memory access conflict. Previous works such as [19], [20] proposed methods to map the data in different memory banks without access conflicts. In our case, the mapping of LLR values in the P block memories is not constrained. This mapping is just information that has to be considered during the scheduling of the PU executions to know the LLR availability, as explained in the next sub-section.

[Figure: global memory banks RAM (LLR) x4; processing units PU (logic) with their own local register files; instruction decoder and virtualization layer; uProcessor core, system interface, instruction RAM, interleaver/de-interleaver (∏ / ∏⁻¹); decision data, LLR data, configuration data and status signals]

Fig. 3: Homogeneous Single-Instruction Multiple-Data matrix

4.3. Design flow for LDPC decoder generation

Designing an LDPC decoder based on our ASIP architecture model requires a design flow to automatically generate the SIMD matrix, the memory mapping and the PU execution specification. The proposed automatic design methodology is detailed in Fig. 4. A first step is the analysis of the LDPC code and in particular of its parity check matrix H. This analysis determines the degrees d_vn and d_cm of each variable node n and parity check node m, respectively. It also estimates the maximum parallelism level of the SIMD matrix. This information, together with the bipartite graph representation of the LDPC code, is required for the construction of a constraint graph over the PU executions. The rest of the design flow is then applied to the constraint graph.

First, an allocation task is executed for a given parallelism level P. The purpose of the allocation algorithm is to map all the LLR values T_n to the P memory blocks, so the size of each memory block is equal to ⌈n/P⌉. Three different memory mappings are proposed in our design flow: block by block, data by data modulo P, and fixed by the designer. The first two approaches are low cost in terms of control resources because the data accesses are regular. However, they introduce a memory mapping constraint for the scheduling-binding that does not take the LDPC code construction into account.

The most critical task is the scheduling-binding of the PU executions. These two tasks are performed concurrently in order to take the memory mapping into account. A resource-constrained scheduling, also called list-based scheduling, is used. This algorithm is a generalization of the ASAP algorithm with the inclusion of memory mapping constraints. A scheduling priority list is provided according to a priority function; naturally, the efficiency of the algorithm mainly depends on this function. In our design flow, it depends on the mobility of the PU executions and on the data availability. Once all the tasks are completed, the VHDL RTL description of the SIMD matrix is generated. Finally, the Plasma processor has to be programmed to execute the corresponding firmware C-code.

[Flow diagram: the LDPC H matrix and the architecture parameters enter the analysis step, which builds the constraint graph; memory mapping and scheduling/binding then feed code generation, producing the PU configuration, the firmware C codes and the SIMD matrix generation]

Fig. 4: Methodology for the generation of the ASIP LDPC decoder

4.4. Firmware C-code dedicated to the ASIP architecture

The Plasma CPU executes all MIPS I (TM) user-mode instructions except unaligned load and store operations. Instructions are divided into three types: R, I and J. As it has the same instruction set as a MIPS processor, a GNU tool chain can be used for its programming. Eleven new instructions have been added to the Plasma CPU instruction set to increase its efficiency in terms of execution cycles. As some of the MIPS I instructions and the corresponding hardware resources are useless in our design, we have optimized the softcore processor by applying the automated methodology described in [11], which extracts the application characteristics from the binary program file in order to remove useless parts of the processor core. An example of firmware C-code illustrating the Plasma CPU programming process is given in Listing 1. The firmware is part of an LDPC decoding run with loop = 20 iterations and a frame size n. Six instructions have been defined to directly specify the PU execution:

- First.P-C: register initialization and parity check node
- P-C: parity check node
- First.Var: register initialization and variable node
- Var: variable node
- First.P-C&Var: register initialization, parity check node and variable node
- P-C&Var: parity check node and variable node

void ldpc_decoder()
{
    int loop = 20;
    while (loop)
    {
        First.PC(1);     PC(n-1);
        First.PC&Var(1); PC&Var(n-1);
        First.PC(1);     PC&Var(n-1);
        // ............
        First.Var(1);    Var(n-1);
        loop -= 1;
    }
}

Listing 1: Firmware C-code example for proposed ASIP architecture

Internal architecture of the ASIP

[Diagram: MIPS-32b core with its instruction memory; shared LLR memories; processing elements with local memories; (Benes) interleavers; abstraction-layer memories (cycle => data set)]

(14)

(Simplified) architecture of the processing element (PE)

Node dynamics | Message dynamics | Hardware cost (Synplify tool) | Max. critical path
12 bits | 10 bits | 210 LUTs, 69 REGs, 1 RAM | 14.599 ns
10 bits |  8 bits | 173 LUTs, 57 REGs, 1 RAM | 12.606 ns
 9 bits |  7 bits | 171 LUTs, 51 REGs, 1 RAM | 11.090 ns
 8 bits |  6 bits | 133 LUTs, 45 REGs, 1 RAM | 10.356 ns
 6 bits |  4 bits | 101 LUTs, 33 REGs, 1 RAM |  8.538 ns

Core without memory - results obtained for a Virtex-6 target

[Datapath: parity-check processing (sign xor, min1/min2 search over the incoming Tnm(t) messages, register file, FIFO) followed by LLR update processing producing Emn(t+1) and Tm(t+1)]

(15)

Storing the input data: two approaches

Depending on the LDPC code to be implemented, two approaches are proposed for storing the system inputs => this favours the exploitation of parallelism between the computations (fewer memory access conflicts).

[Figures: LLR memory filling for an LDPC frame over four RAMs, either block by block (RAM(1) = {V1, V2, V3, ...}, RAM(2) = {V4, V5, V6, ...}) or interleaved modulo P (RAM(1) = {V1, V5, V9, ...}, RAM(2) = {V2, V6, V10, ...}); overall architecture with the global memory banks, the processing elements with their own local register files, the instruction decoder/virtualization layer, the interleavers and the uProcessor core with its system interface]

(16)

Automatic compilation flow for an efficient usage rate

๏ Compilation flow inputs

➡ LDPC H matrix (positions of the ones)

➡ # of PUs to allocate

➡ Max. fixed-point format

➡ Targeted throughput (c)

๏ Estimated parameters

➡ Throughput at 100 MHz (FPGA)

➡ Frequency needed to achieve the target throughput (c)

➡ Implementation characteristics (usage rates, cycles/bit, estimated area, etc.)

๏ Generated files

➡ ASIP processor + PU matrix (RTL)

➡ C codes for ASIP execution

➡ Testbench at various levels (C => VHDL)

(17)

The automated compilation flow for LDPC codes

(18)

[Figure: Tanner graph with variable nodes 0-11 and check nodes C0-C6, its factorized model M0-M6, and the list of computations for the C4 update: LLR(1), LLR(3), LLR(5), LLR(7) => C4, then C4 => LLR(1), LLR(3), LLR(5), LLR(7)]

The Tanner graph shows the links between the variable nodes and the check nodes.

Horizontal scheduling => when a check node is updated, all linked variable nodes are updated.

Factorized graph model:
- check-node centred (Cx)
- Mx includes all computations for the Cx update
- edges link Cx nodes having an execution dependency => Mx nodes that cannot be computed in parallel

(19)

[Figure: factorized graph and the two reservation tables used by the scheduler: the Processing-Unit reservation table (# of processing elements x clock cycles) and the global-memory access reservation table (RAM0-RAM3 x clock cycles)]

The computations Mx have to be scheduled (clock cycles) and bound (PUx) on the architecture.

Scheduling and binding constraints:
- computation dependencies (edges)
- access conflicts (shared memory)

(20) - (25)

Computation scheduling and binding algorithm

[Animation over six slides: the Processing-Unit and global-memory reservation tables are filled step by step, first with the M4 computations, then with the M2 computations]

(26)

Computation scheduling and binding algorithm

[Figure: reservation tables filled with the M4 and M2 computations]

It is impossible to schedule the M5 computation in the current macro-cycle:
- it does not share any LLR value => OK
- but it needs to access RAM2 to read one LLR value => it requires one free clock cycle in the memory reservation table.

(27)

Computation scheduling and binding algorithm

[Figure: the M5 computation is deferred to the next macro-cycle; macro cycle 0 holds the M4 and M2 computations, and macro cycle 1 starts with M5 in both reservation tables]

(28)

Scheduling the computations on the PEs

[Flow: the constraints and the LDPC H matrix are analysed into an internal model; memory allocation and scheduling/binding feed the generation step (VHDL & C), which produces the VHDL architecture, a testbench, a C model (BER) and the C code driving the ASIP]

(29)

Experimental results (DVB-S2 standard)

Format (# bits)   Cores       LUTs    FFs     Slices   RAMs   DSPs   Fc
(10.12.10)        32 cores    47898    7873    55771   416    0      100 MHz
                  64 cores    57414   12738    70152   415    0      100 MHz
                  96 cores    85387   18051   103438   416    0      100 MHz
                  128 cores   96582   23308   119890   416    0      100 MHz
(6.8.6)           32 cores    10578    5852    16430   380    0      100 MHz
                  64 cores    25421    7638    33059   416    0      100 MHz
                  96 cores    31409   12933    44342   416    0      100 MHz
                  128 cores   56735   15989    72724   416    0      100 MHz
(5.7.5)           32 cores     9526    5528    15054   364    0      100 MHz
                  64 cores    20968    7638    28606   403    0      100 MHz
                  96 cores    27391   11934    39325   416    0      100 MHz
                  128 cores   39606   15262    54868   416    0      100 MHz

Architecture supporting all DVB-S2 codes, plus smaller sizes. Post-synthesis results obtained with Synopsys Synplify (2011).

Cost = PCIe + processor core + SIMD matrix
