Research Work, 2011-2012 Period
Bertrand LE GAL
(bertrand.legal@ims-bordeaux.fr)
IMS Laboratory - UMR CNRS 5218, Institut Polytechnique de Bordeaux
Université de Bordeaux, France
September 2012
CSN Team - IMS, September 2012
Reducing the hardware complexity of softcore processors
The various architectural and technological targets
Dedicated architecture - Software implementation - Codesign & softcores
Pre-designed ASIC circuit, fabricated by a third party
Instruction-removal methodology
๏ Core optimization phase
➡ Application source code
‣ Compiled with GCC,
‣ Disassembled to obtain the instructions actually executed,
➡ Analysis of the assembly code,
‣ Identify the unused instructions,
➡ Modification of the VHDL code,
‣ Modification of the instruction decoder,
‣ Removal of the unused hardware units (ALUs),
➡ Modification of the compiler,
๏ Fully automated methodology.
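The analysis step above can be sketched in C. This is a hypothetical illustration, assuming the program is available as raw 32-bit MIPS I instruction words (the real flow works from the binary produced by GCC); `scan_program` and the usage tables are illustrative names, not part of the actual tool.

```c
#include <stdint.h>

/* Usage tables filled by the analysis: any entry left at 0 after the
 * scan marks a decoder path / hardware unit that can be stripped from
 * the processor's VHDL model. */
int opcode_used[64];   /* primary opcode field, bits 31..26        */
int funct_used[64];    /* funct field, bits 5..0, when opcode == 0 */

/* Scan n 32-bit MIPS I words and record which opcodes really occur. */
void scan_program(const uint32_t *words, int n)
{
    for (int i = 0; i < n; i++) {
        uint32_t op = (words[i] >> 26) & 0x3Fu;
        opcode_used[op] = 1;
        if (op == 0)                /* SPECIAL: R-type, decoded by funct */
            funct_used[words[i] & 0x3Fu] = 1;
    }
}
```

For a program that, say, only uses `addu`, `lw` and `jr`, every other table entry stays at 0, so the corresponding decode logic and execution units can be removed from the core.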
[Flow diagram: source code → GCC compiler (modified) → binary program → instruction usages → VHDL config. → modified processor (Leon 3)]
Performance evaluation of the approach
Digital communications:
  1  ADPCM encoder/decoder                   MiBench
  2  BCH (31,21) encoder/decoder             [19]
  3  Golay encoder/decoder                   [19]
  4  LDPC decoder                            Hand written
  5  MP3 decoder                             Helix community
  6  32-bit CRC                              TRAP [24]
Cryptography and/or security applications:
  7  MD5                                     Fast DES kit
  8  SHA-1                                   MiBench
  9  SHA-2
 10  ARC4 encoder/decoder
 11  DES / 3-DES encoder/decoder             Fast DES kit
 12  AES encoder/decoder (128b, 512b)
Control-type applications:
 13  Motor control                           TRAP
 14  Data sorting                            TRAP
 15  Queens                                  TRAP
 16  Pattern matching                        TRAP
 17  Text compression (v42)                  TRAP
 18  g3fax                                   Hand written
 19  LCD screen control                      Hand written
Signal processing applications:
 20  LMS filtering                           Hand written
 21  FIR filtering (512 taps)                TRAP
 22  FFT/iFFT processing (fixed point)       TRAP
 23  Echo cancellation filter                LibGSM
Video processing applications:
 24  MJPEG video decompression               TRAP
 25  Video surveillance (motion detection)   Hand written
 26  Real-time contrast equalization         Hand written

26 different applications
2 softcore processor cores
3 technology targets
Results obtained: the "Plasma" processor
[Bar charts over the 26 applications: rate of unused instructions (0-100 %); area-cost gain in % on the hardware targets (Spartan-6 FPGA and 65 nm ASIC)]
Results obtained: the "Leon-3" processor
[Bar charts over the 26 applications: rate of unused instructions (0-100 %); hardware cost on an ACTEL ProASIC-3 target (# of eCore), optimized vs. original architecture]
Design of an ASIP dedicated to FEC
(a flexible architecture for LDPC codes)
Introduction to digital communications

[Constellation diagrams: transmitted QAM-256 symbols vs. received QAM-256 symbols]
Introduction to error-correcting codes

information to transmit = [v1, v2, v3, v4, v5, v6]
insertion of the redundancy:
transmitted information = [v1, v2, v3, v4, v5, v6, r1, r2, r3, r4, r5, r6]
transmission of the information:
received information = [v1, v2, v3, v4, v5, v6, r1, r2, r3, r4, r5, r6]
correction of the transmission errors:
decoded information = [v1, v2, v3, v4, v5, v6]
Principle of LDPC code decoding

frame = [v1, v2, v3, v4, v5, v6]

[Figure: decoding matrix (LDPC), associated Tanner graph, example of a received LDPC frame]

Decoding process:
C1 = F(v2, v3, v6)      C2 = F(v1, v2, v3, v5)
C3 = F(v1, v4, v5, v6)  C4 = F(v3, v4, v6)
v1 = G(C2, C3)  v2 = G(C1, C2)  v3 = G(C1, C2, C4)
v4 = G(C3, C4)  v5 = G(C2, C3)  v6 = G(C1, C3, C4)

DVB-S2: 20 to 30 iterations, 64k variable nodes, 32k checks, 90 Mbit/s
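The F/G update scheme above can be illustrated with a deliberately simplified hard-decision bit-flipping decoder on this slide's 4×6 example code. This is a pedagogical sketch only: the actual DVB-S2 decoders work on soft values (LLRs), and the flipping rule here (flip the single variable failing the most checks) is one classic heuristic, not the ASIP's algorithm.

```c
#define NV 6   /* variable nodes */
#define NC 4   /* check nodes    */

/* Parity-check matrix of the slide's example:
 *   C1 = F(v2,v3,v6), C2 = F(v1,v2,v3,v5),
 *   C3 = F(v1,v4,v5,v6), C4 = F(v3,v4,v6). */
static const int H[NC][NV] = {
    {0,1,1,0,0,1},
    {1,1,1,0,1,0},
    {1,0,0,1,1,1},
    {0,0,1,1,0,1},
};

/* In-place hard-decision decoding; returns 1 on a valid codeword. */
int decode_bitflip(int v[NV], int max_iter)
{
    for (int it = 0; it < max_iter; it++) {
        int c[NC], all_ok = 1;
        for (int m = 0; m < NC; m++) {        /* check-node pass: F() */
            c[m] = 0;
            for (int n = 0; n < NV; n++)
                if (H[m][n]) c[m] ^= v[n];
            if (c[m]) all_ok = 0;
        }
        if (all_ok) return 1;                 /* every parity satisfied */
        int worst = 0, worst_bad = 0;         /* variable-node pass: G() */
        for (int n = 0; n < NV; n++) {
            int bad = 0;
            for (int m = 0; m < NC; m++)
                if (H[m][n] && c[m]) bad++;
            if (bad > worst_bad) { worst_bad = bad; worst = n; }
        }
        v[worst] ^= 1;   /* flip the variable failing the most checks */
    }
    return 0;
}
```

Starting from the all-zero codeword with v1 flipped, the decoder recovers the codeword in a single iteration, since v1 is the only variable with two unsatisfied checks (C2 and C3).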
Presentation of the ASIP architecture

[Block diagram: ASIP core = uProcessor core + configurable LDPC decoder array; data input/output and I/O data control signals; RAM banks, processing units (PU), computation sequencer]

Design constraints:
- flexibility & scalability
- limited hardware overhead
- high throughput requirements
[…] any bank memory access conflict. Previous works such as [19], [20] proposed methods to map the data in different memory banks without access conflicts. In our case, the mapping of LLR values in the P memory blocks is not constrained. This mapping is simply information that has to be taken into account during the scheduling of the PU executions, in order to know the LLR availability, as explained in the next sub-section.
[Fig. 3: Homogeneous Single-Instruction Multiple-Data matrix — uProcessor core (instruction RAM, system interface, configuration/status data, control signals), global memory banks (LLR RAMs), interleaving networks Π / Π⁻¹, and processing units (logic) with their own local memories and register files; interfaces for LLR data in and decision data out]
4.3. Design flow for LDPC decoder generation
Designing an LDPC decoder based on our ASIP architecture model requires a design flow to automatically generate the SIMD matrix, the memory mapping and the PU execution specification. The proposed automatic design methodology is detailed in Fig. 4. The first step is the analysis of the LDPC code, in particular of its parity check matrix H. This analysis determines the degrees d_vn and d_cm of each variable node n and parity check node m, respectively. It also estimates the maximum parallelism level of the SIMD matrix. This information, associated with the bipartite graph representation of the LDPC code, is required for the construction of a constraint graph over the PU executions. The rest of the design flow is then applied to the constraint graph.
First, an allocation task is executed for a given parallelism level P. The purpose of the allocation algorithm is to map all the LLR values T_n to the P memory blocks; the size of each memory block is thus equal to ⌈n/P⌉. Three different memory mappings are proposed in our design flow: block by block, data by data modulo P, and fixed by the designer. The first two approaches are low cost in terms of control resources because the data accesses are regular. However, they introduce a memory mapping constraint for the scheduling-binding that does not take the LDPC code construction into account.
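The two regular mappings can be sketched as follows; a minimal illustration assuming 0-indexed LLR values (`map_block`, `map_modulo` and `slot_t` are illustrative names, not the tool's API):

```c
/* Sketch of the two regular LLR-to-bank mappings: N LLR values are
 * spread over P banks of ceil(N/P) words each. */
typedef struct { int bank; int addr; } slot_t;

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* "Block by block": bank 0 holds T_0..T_{B-1}, bank 1 the next B, ... */
slot_t map_block(int n, int N, int P)
{
    int B = ceil_div(N, P);
    slot_t s = { n / B, n % B };
    return s;
}

/* "Data by data modulo P": T_n goes to bank n mod P, address n / P. */
slot_t map_modulo(int n, int N, int P)
{
    (void)N;                       /* N unused for this mapping */
    slot_t s = { n % P, n / P };
    return s;
}
```

Both mappings need only a divider/modulo by a constant to generate addresses, which is why their control cost is low; the designer-fixed mapping instead requires an explicit lookup table.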
The most critical task is the scheduling-binding of the PU executions. These two steps are performed jointly in order to take the memory mapping into account. A resource-constrained scheduling, also called list-based scheduling, is used. This algorithm is a generalization of the ASAP algorithm with the inclusion of memory mapping constraints. A scheduling priority list is built according to a priority function; naturally, the efficiency of the algorithm mainly depends on this function. In our design flow, it depends on the mobility of the PU executions and on the data availability. Once all the tasks are completed, the VHDL RTL description of the SIMD matrix is generated. Finally, the Plasma processor has to be programmed to execute the corresponding firmware C-code.
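A greatly simplified sketch of this list scheduling with reservation tables is shown below. It assumes unit-latency computations, a single read port per bank, implicit PU binding (any free PU) and tasks already sorted by the priority function; the real flow also handles dependencies and multi-cycle PU executions.

```c
#include <string.h>

#define MAXC  64          /* scheduling horizon in clock cycles */
#define NPU    2          /* processing units                   */
#define NBANK  4          /* shared LLR memory banks            */

/* Simplified task model: each computation occupies one PU for one
 * cycle and one read port of memory bank `bank` in the same cycle. */
typedef struct { int bank; } task_t;

/* Greedy list scheduling against two reservation tables (PU usage and
 * bank-port usage); returns the makespan, or -1 if MAXC is too short. */
int list_schedule(const task_t *t, int nt, int cycle_of[])
{
    int pu_used[MAXC] = {0};
    int bank_used[MAXC][NBANK];
    memset(bank_used, 0, sizeof bank_used);
    int makespan = 0;
    for (int i = 0; i < nt; i++) {
        int c = 0;                  /* first cycle with free PU + port */
        while (c < MAXC && (pu_used[c] >= NPU || bank_used[c][t[i].bank]))
            c++;
        if (c == MAXC) return -1;
        pu_used[c]++;
        bank_used[c][t[i].bank] = 1;
        cycle_of[i] = c;
        if (c + 1 > makespan) makespan = c + 1;
    }
    return makespan;
}
```

For example, four tasks all reading the same bank serialize over four cycles despite two PUs being available, whereas tasks alternating between two banks pack two per cycle: the bank-port reservation table, not the PU count, is the bottleneck.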
[Fig. 4: Methodology for the generation of the ASIP LDPC decoder — inputs: architecture parameters and LDPC H matrix; analysis → constraint graph → memory mapping and scheduling-binding → code generation; outputs: PU configuration, SIMD matrix generation, firmware C codes]
4.4. Firmware C-code dedicated to the ASIP architecture

The Plasma CPU executes all MIPS I (TM) user-mode instructions except unaligned load and store operations. Instructions are divided into three types: R, I and J. As it has the same instruction set as a MIPS processor, a GNU tool chain can be used for its programming. Eleven new instructions have been added to the Plasma CPU instruction set to increase its efficiency in terms of execution cycles. As some of the MIPS I instructions and corresponding hardware resources are useless in our design, we have optimized the softcore processor. To perform this optimization, we have applied the automated methodology described in [11], which extracts the application characteristics from the binary program file in order to remove useless parts of the processor core. A firmware C-code example illustrating the Plasma CPU programming process is given in Listing 1. The firmware is the part of the LDPC decoding that considers loop = 20 iterations and a frame size n. Six instructions have been defined to directly specify the PU execution:
– First.P-C: register initialization and parity check node
– P-C: parity check node
– First.Var: register initialization and variable node
– Var: variable node
– First.P-C&Var: register initialization, parity check node and variable node
– P-C&Var: parity check node and variable node
void ldpc_decoder()
{
    int loop = 20;
    while (loop)
    {
        First.P-C(1);     P-C(n-1);
        First.P-C&Var(1); P-C&Var(n-1);
        First.P-C(1);     P-C&Var(n-1);
        // ............
        First.Var(1);     Var(n-1);
        loop -= 1;
    }
}

Listing 1: Firmware C-code example for the proposed ASIP architecture
Internal architecture of the ASIP

[Block diagram: MIPS-32b core, instruction memory, shared LLR memories, processing elements + local memories, interleavers (Benes), abstraction-layer memories (cycle => data set)]
(Simplified) architecture of the processing element (PE)

Core without memories - results obtained for a Virtex-6 target (Synplify tool):

Node dynamic   Message dynamic   Hardware cost               Max. critical path
12 bits        10 bits           210 LUTs, 69 REGs, 1 RAM    14.599 ns
10 bits         8 bits           173 LUTs, 57 REGs, 1 RAM    12.606 ns
 9 bits         7 bits           171 LUTs, 51 REGs, 1 RAM    11.090 ns
 8 bits         6 bits           133 LUTs, 45 REGs, 1 RAM    10.356 ns
 6 bits         4 bits           101 LUTs, 33 REGs, 1 RAM     8.538 ns
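The fixed-point behaviour suggested by the table and the PE datapath (saturation to the message dynamic, plus the min1/min2 and sign extraction of the min-sum parity-check update) can be sketched as below. This is an illustrative bit-accurate model under assumed conventions (symmetric two's-complement saturation), not the PE's exact RTL.

```c
#include <stdlib.h>

/* Saturate x to a q-bit symmetric two's-complement range
 * [-(2^(q-1)-1), 2^(q-1)-1], as a fixed-point message would be. */
int sat_q(int x, int q)
{
    int lim = (1 << (q - 1)) - 1;
    if (x >  lim) return  lim;
    if (x < -lim) return -lim;
    return x;
}

/* Min-sum check-node front end: over `deg` incoming messages, find the
 * two smallest magnitudes (min1/min2) and the product of the signs. */
void min1_min2(const int *msg, int deg, int *min1, int *min2, int *sgn)
{
    *min1 = *min2 = 1 << 30;
    *sgn = 1;
    for (int i = 0; i < deg; i++) {
        int m = abs(msg[i]);
        if (msg[i] < 0) *sgn = -*sgn;
        if (m < *min1)      { *min2 = *min1; *min1 = m; }
        else if (m < *min2) { *min2 = m; }
    }
}
```

Keeping only min1, min2 and the sign product is what lets the PE serialize a whole check-node update through one small comparator tree, which is why shrinking the message dynamic directly shrinks the LUT cost and the critical path in the table above.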
[Datapath diagram: parity-check processing (sign xor, min1/min2 search, FIFO) and LLR-update processing (register file, sub/add, comparison against cst1/cst2, mux, inversion); inputs Tnm(t), Emn(t); outputs Tm(t+1), Emn(t+1); VHDL configuration]
Storing the input data - two approaches

Depending on the LDPC code to implement, two approaches are proposed for storing the system inputs
=> favors exploiting the parallelism between computations (reduces memory access conflicts)
[Figure: two LLR memory-filling schemes for an LDPC frame over 4 RAM banks — block by block (RAM 1 ← V1, V2, V3, ...; RAM 2 ← next block; ...) vs. data by data modulo P (RAM 1 ← V1, V5, V9, ...; RAM 2 ← V2, V6, V10, ...); shown within the SIMD matrix with its global memory banks, interleavers Π / Π⁻¹ and processing elements with their local register files]
Automatic compilation flow for efficient usage rate

๏ Compilation flow inputs
➡ LDPC H matrix (positions of the ones)
➡ # of PUs to allocate
➡ Max. fixed-point format
➡ Targeted throughput (c)
๏ Estimated parameters
➡ Throughput at 100 MHz (FPGA)
➡ Frequency required to achieve the (c) throughput
➡ Implementation characteristics (usage rates, cycles/bit, estimated area, etc.)
๏ Generated files
➡ ASIP processor + PU matrix (RTL)
➡ C codes for ASIP execution
➡ Testbenches at various levels (C => VHDL)
The automated compilation flow for LDPC codes
The Tanner graph shows the links between the variable nodes and the check nodes.
Horizontal scheduling => when a check node is updated, all linked variable nodes are updated.
Factorized graph model:
- check-node centred (Cx)
- Mx includes all the computations for the Cx update
- edges show Cx nodes having an execution dependency => Mx nodes that cannot be computed in parallel

[Figure: Tanner graph with variable nodes 0-11 and check nodes C0-C6; factorized graph with macro-computations M0-M6. Computations for C4: LLR(1), LLR(3), LLR(5), LLR(7) => C4, then C4 => LLR(1), LLR(3), LLR(5), LLR(7)]
Computation scheduling and binding algorithm

The computations Mx have to be scheduled (clock cycles) and bound (PUx) on the architecture.
Scheduling and binding constraints:
- computation dependencies (edges)
- access conflicts (shared memory)

[Figure: Processing Unit reservation table (processing elements × clock cycles) and global memory access reservation table (RAM0-RAM3 × clock cycles)]
It is impossible to schedule the M5 computation in the same cycles:
- it does not share any LLR value with the computations already placed => OK
- but it needs to access RAM2 to read one LLR value => one clock cycle is required in the memory reservation table.

[Figure: PU and global-memory reservation tables progressively filled with the M4 and M2 computations; the M5 computations are placed at the end of macro-cycle 0 and spill into macro-cycle 1]
Scheduling of the computations on the PEs

[Flow diagram: constraints + LDPC H matrix → analysis → internal model → memory allocation → scheduling and binding → generation (VHDL & C) → VHDL architecture, testbench, C model (BER), C code (ASIP command)]
Experimental results (DVB-S2 standard)

# bits        # cores   LUTs    FFs     Slices   RAMs   DSPs   Fc
(10, 12, 10)  32        47898   7873    55771    416    0      100 MHz
              64        57414   12738   70152    415    0      100 MHz
              96        85387   18051   103438   416    0      100 MHz
              128       96582   23308   119890   416    0      100 MHz
(6, 8, 6)     32        10578   5852    16430    380    0      100 MHz
              64        25421   7638    33059    416    0      100 MHz
              96        31409   12933   44342    416    0      100 MHz
              128       56735   15989   72724    416    0      100 MHz
(5, 7, 5)     32        9526    5528    15054    364    0      100 MHz
              64        20968   7638    28606    403    0      100 MHz
              96        27391   11934   39325    416    0      100 MHz
              128       39606   15262   54868    416    0      100 MHz

The architecture supports all DVB-S2 codes as well as smaller frame sizes.
Post-synthesis results obtained with Synopsys Synplify (2011).
Cost = PCIe + processor core + SIMD matrix