
(1)

Some recent work…

Equipe CSN by Bertrand LE GAL

(bertrand.legal@ims-bordeaux.fr)

 

Laboratoire IMS - UMR CNRS 5218, Institut Polytechnique de Bordeaux

Université de Bordeaux 1, France

7 February 2013

31.05.2010


The Université de Bordeaux is publishing a comparative study of green campus initiatives conducted internationally, a valuable source of information and reflection for the development of its new University model within the framework of the Opération campus. It has chosen to release this study in open access, sustainable development being everyone's business and in everyone's interest.

The Université de Bordeaux has committed to building a new University model and, in parallel, to becoming a leader in sustainable development. It is in this spirit that, in early 2009, it responded favorably, jointly with the Université Bordeaux 1 Sciences Technologies, to the proposal from Ecocampus-Nobatek and EDF: to compile feedback and analyses on green campus projects in France, Europe and North America.

The objective of this study (see next page) was to observe and capture the good practices and exemplary actions relating to the main pillars of sustainable development: economic, social, environmental and organizational. The Université de Bordeaux will draw on it to implement long-term governance and strategy in the service of a more livable and more equitable campus for the whole university community.

With the Grenelle de l'environnement as a benchmark to reach and then exceed, the Université de Bordeaux intends to become a pilot site through a comprehensive sustainable development approach based on:
- the permanent integration of human dimensions into the real-estate and planning project (accessibility, health, legibility, comfort, living environment);

- a radical energy transformation of the buildings as part of their renovation under the HQE® approach, and an energy master plan for a maximal reduction of greenhouse gases;
- the enhancement and protection of a park on the Talence-Pessac-Gradignan university site, a veritable green lung at the scale of the metropolitan area and an exceptional asset for users' quality of life and for the development of biodiversity in an urban environment;

- a mobility plan covering all areas of the university campus, in order to reduce individual car use and its impact by relying on high-performance public transport networks and the development of alternative modes;

- a concerted opening toward the city, aimed at fostering the economic development of the surrounding territories and of campus life, and at creating social and functional diversity;

- and finally, a sine qua non condition of success, the establishment of an information and consultation process involving all members and stakeholders of the University, for a shared understanding of the issues and the learning of responsible behaviors.

The Université de Bordeaux therefore intends to draw up an Agenda 21 and to make its campus an experimentation site for developing innovative approaches based on the expertise of its laboratories.

The study « Initiatives campus verts » can be downloaded from www.univ-bordeaux.fr

Press contacts, Université de Bordeaux
Anne SEYRAFIAN . Norbert LOUSTAUNAU . T +33 (0)5 56 33 80 84 . communication@univ-bordeaux.fr

Contact Nobatek-Ecocampus
Julie CREPIN, project manager . T +33 (0)5 56 84 63 72 . jcrepin@nobatek.com

Université de Bordeaux

Toward a new SUSTAINABLE University model

(2)

Equipe CSN - Design Group Workshop

B. Le Gal, 22 November 2013

Sorry…


(3)


Washing machine… number of iterations required vs. SNR


JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 10

[Figure] Fig. 12. Throughput performance of LDPC decoders depending on the number of decoding iterations (air throughput in Mbps vs. number of layered decoding iterations, 3 to 25; one curve per code: 1944x972, 2048x384, 4000x2000, 4896x2448, 8000x4000, 9972x4986, 20000x10000, 64800x25600, 64800x31400).

[Figure] Fig. 13. Decoding latency of LDPC decoders depending on the number of decoding iterations (latency in µs vs. number of layered decoding iterations, 3 to 25; same set of codes as Fig. 12).

Throughputs achieved by the x86 LDPC decoders vary from more than 300 Mbps for 3 decoding iterations down to 60 Mbps for 25 iterations.

Code structure does not impact throughput performance, contrary to GPU-based decoders. Indeed, in related GPU works the Quasi-Cyclic structure of the code can heavily impact throughput [24]. The proposed approach is insensitive to code structure parameters: all the XXX offer very close performances.

Code length (from 2k up to 8k) does not affect throughput. However, this observation no longer holds for longer frame lengths, which decrease throughput. This performance reduction comes from the fact that the decoder memory footprint becomes larger than the processor cache.

Figure 13 provides the decoding latency of the x86 decoders. Latency is very low and varies from 200 µs for 3 decoding iterations up to 1.5 ms for 25 iterations on the DVB-S2 long frame code. Latency increases linearly with the number of decoding iterations. However, the latency increase is strongly impacted by the frame length. In all cases, the x86 decoder

[Figure] Fig. 14. Throughput of the x86 LDPC decoders depending on Eb/N0 (air throughput in Mbps; code rate 1/2, different LDPC matrices and frame lengths from 1944x972 up to 64800x31400).

[Figure] Fig. 15. Decoding latency of the x86 LDPC decoders depending on Eb/N0 (latency in µs; code rate 1/2, different LDPC matrices and frame lengths from 1944x972 up to 64800x31400).

provides short decoding latency (less than 1 ms for frame lengths shorter than or equal to 20k).

The second evaluation focuses on the SNR value. It shows the performance improvement introduced by the early-termination criterion when the SNR level increases. The decoding process was configured to process at most 20 decoding iterations, stopping early as soon as a valid codeword is discovered. Figures 14 and 15 respectively provide the throughput and the decoding latency of the x86 decoder depending on the SNR value.

Figure 14 shows that the…

This set of experiments shows that the x86 implementation of the LDPC decoding process is efficient in terms of throughput and decoding latency. The selected LDPC decoding algorithm and its implementation offer performances (on a single processor core) higher than those required by current communication standards. Moreover, the processing latency introduced by the decoding process remains low (< 1 ms at 25 decoding iterations, except for DVB-S2 long frames). These performances enable the proposed solution to be used in practical SDR (software-defined radio) systems.
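The early-termination control flow discussed above can be sketched with a toy decoder. This is not the paper's min-sum decoder: it is a hard-decision bit-flipping decoder for the (7,4) Hamming code, a hypothetical stand-in chosen only to show the iteration loop exiting as soon as the syndrome check passes instead of always running the configured maximum.

```cpp
#include <cstddef>
#include <vector>

// Hard-decision bit-flipping decoder for the (7,4) Hamming code, used here
// only to illustrate early termination. Returns the number of iterations
// actually spent (0 if the word was already a valid codeword), or -1 on
// decoding failure after max_iter iterations.
int bit_flip_decode(std::vector<int>& bits, int max_iter) {
    const int n = 7, m = 3;
    for (int it = 0; it <= max_iter; ++it) {
        // Syndrome: check c involves the bits j whose column value (j+1)
        // has bit c set -- the classic Hamming parity-check matrix.
        std::vector<int> unsat(m, 0);
        bool valid = true;
        for (int c = 0; c < m; ++c) {
            int parity = 0;
            for (int j = 0; j < n; ++j)
                if (((j + 1) >> c) & 1) parity ^= bits[j];
            unsat[c] = parity;
            if (parity) valid = false;
        }
        if (valid) return it; // early termination: stop on a valid codeword
        // Flip the first bit involved in the largest number of
        // unsatisfied checks.
        int best = 0, best_count = -1;
        for (int j = 0; j < n; ++j) {
            int count = 0;
            for (int c = 0; c < m; ++c)
                if ((((j + 1) >> c) & 1) && unsat[c]) ++count;
            if (count > best_count) { best = j; best_count = count; }
        }
        bits[best] ^= 1;
    }
    return -1;
}
```

With a single bit error the loop stops after one flip, regardless of the configured maximum, which is exactly the throughput gain the early-termination criterion exploits at high SNR.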

(4)


Digital SoCs, a joint architecture


Most SoCs integrate dedicated digital blocks

+ processors

The same goes for FPGA circuits (Xilinx & Altera)

Grouping them within a single chip => performance & reliability

(5)


The best example of these evolutions: your phone!


SoC A7 (iPhone 5S), SoC A6 (iPhone 4S)

Tech. 32 nm, dimensions 9.7 mm x 9.97 mm

(6)


Or your connected tablets…


Sony Tablet S1, NVIDIA Tegra 2 (SoC)

(7)


Digital blocks of different granularities


Implementation of simple digital functions

Implementation of complex digital functions or simple systems

Implementation of complex digital systems

(8)


Different ways of playing with « Lego »


The integration process is constrained by the logic-gate library provided by the foundry.

Algorithm + technology library

(9)


Different ways of playing with « Lego »


(10)


Programmable architectures


(11)


Software LDPC decoders


Objectives of LDPC code simulation

Simulate to evaluate and/or validate the performance (BER) of an LDPC code


[Table] Fig. 7. Memory accesses required by the LDPC decoder: collapsed and uncollapsed reads and writes for each code (576x288 up to 64880x32400), for the flooding (F) and layered (L) schedules.

[Figure] Fig. 8. BER performance of LDPC codes having a code rate of 1/2, different LDPC matrices and different frame lengths (BER vs. Eb/N0; codes 816x408 up to 64800x32400).

is mainly due to the fact that, from an architectural point of view, the cores share the same memory caches and, from a software point of view, synchronization barriers are required to synchronize the processing threads (e.g. to exchange data).

Efficiently implementing an application on this kind of architecture, which provides SM and SIMD parallelism at the same time, is complex. In this section we detail the parallelism levels available in the LDPC decoding algorithm, and then explain the approach used to achieve high throughput.

Implementing an efficient LDPC decoder on such architectures is a challenging task. Indeed, the LDPC decoding algorithm has characteristics that make it inefficient: it requires a large memory set, data reuse is quite low between consecutive computations, the amount of computation per memory access is low, and a large fraction of the memory accesses are irregular.

The LDPC decoding process is memory-consuming and necessitates collapsed and uncollapsed read/write accesses to the memory. The main optimization stage consists in identifying an efficient way to reduce the memory footprint and the number of memory accesses. The constraints on memory optimization differ from previous GPU works because decoding computations are sequential.

[Figure] Fig. 9. BER performance of the 8000x4000 LDPC code depending on the number of iterations (2 to 100 iterations; BER vs. Eb/N0).

However, such functionality is used to boost LDPC decoding performance. The technique used and its associated cost are presented below.

Finally, performance is improved by taking advantage of LDPC code characteristics (the degrees of the VN and CN nodes) to reduce memory accesses, minimize control instructions and improve data locality in registers.
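As an illustration of this kind of degree-driven specialization (a sketch only; the excerpt does not give the decoder's actual kernels), a min-sum-style check-node update can be specialized on the node degree: with a compile-time degree the compiler can fully unroll the loops and keep temporaries in registers, removing the control instructions the generic version needs.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Generic min-sum check-node update: for each edge j, the outgoing
// magnitude is the minimum input magnitude over all edges k != j, and the
// outgoing sign is the product of all input signs except sign(in[j]).
std::vector<float> cn_update_generic(const std::vector<float>& in) {
    const std::size_t d = in.size();
    float min1 = std::numeric_limits<float>::infinity();
    float min2 = min1;   // smallest and second-smallest magnitudes
    std::size_t pos = 0; // edge holding min1
    int sign_prod = 1;   // product of all input signs
    for (std::size_t k = 0; k < d; ++k) {
        const float mag = std::fabs(in[k]);
        if (in[k] < 0.0f) sign_prod = -sign_prod;
        if (mag < min1) { min2 = min1; min1 = mag; pos = k; }
        else if (mag < min2) { min2 = mag; }
    }
    std::vector<float> out(d);
    for (std::size_t j = 0; j < d; ++j) {
        const float mag = (j == pos) ? min2 : min1;
        out[j] = ((in[j] < 0.0f) ? -sign_prod : sign_prod) * mag;
    }
    return out;
}

// Degree-specialized variant: D is a template parameter, so both loops have
// a compile-time trip count that the compiler can fully unroll, and no
// degree array has to be read at run time.
template <int D>
std::array<float, D> cn_update_fixed(const std::array<float, D>& in) {
    float min1 = std::numeric_limits<float>::infinity(), min2 = min1;
    int pos = 0, sign_prod = 1;
    for (int k = 0; k < D; ++k) {
        const float mag = std::fabs(in[k]);
        if (in[k] < 0.0f) sign_prod = -sign_prod;
        if (mag < min1) { min2 = min1; min1 = mag; pos = k; }
        else if (mag < min2) { min2 = mag; }
    }
    std::array<float, D> out;
    for (int j = 0; j < D; ++j)
        out[j] = ((in[j] < 0.0f) ? -sign_prod : sign_prod)
               * ((j == pos) ? min2 : min1);
    return out;
}
```

One instance of `cn_update_fixed` is emitted per degree actually present in the code, which is practical because LDPC codes use only a handful of distinct node degrees.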

These optimization techniques are presented in the following subsections.

A. Memory optimization

The LDPC decoder consumes memory to store the VN values and the messages from VN to CN and vice versa. Moreover, the LDPC decoder needs to store the H matrix information to perform computations over the right data sets. The amount of memory depends on the LDPC code and on the computation scheduling (flooding/layered).

1) Flooding algorithm: The flooding-based algorithm is composed of two main stages. The first one computes the messages from Check Nodes to Variable Nodes and the second stage performs the inverse processing. This algorithm requires at least two memory arrays to store the messages from/to VN


Simulate to evaluate the impact of reducing the number of iterations on performance (BER) => impact on computational complexity
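This kind of BER evaluation can be sketched as follows. The routine below is a minimal illustration, not the team's simulator: it estimates the bit error rate of uncoded BPSK over an AWGN channel by Monte-Carlo simulation (a real study would insert the LDPC encoder/decoder into this loop); the Eb/N0 and bit-count parameters in the usage are arbitrary.

```cpp
#include <cmath>
#include <cstddef>
#include <random>

// Estimate the uncoded BPSK bit error rate at a given Eb/N0 (in dB) by
// Monte-Carlo simulation: transmit all-zero bits as +1, add Gaussian
// noise, and count how many received samples cross the decision threshold.
double estimate_ber(double ebn0_db, std::size_t n_bits, unsigned seed) {
    const double ebn0 = std::pow(10.0, ebn0_db / 10.0);
    const double sigma = std::sqrt(1.0 / (2.0 * ebn0)); // noise std. dev.
    std::mt19937 rng(seed);
    std::normal_distribution<double> noise(0.0, sigma);
    std::size_t errors = 0;
    for (std::size_t i = 0; i < n_bits; ++i)
        if (1.0 + noise(rng) < 0.0) ++errors; // sample decoded as a 1
    return static_cast<double>(errors) / static_cast<double>(n_bits);
}
```

Reaching BER levels around 10^-8, as in Figs. 8-11, requires simulating billions of bits per SNR point, which is precisely why decoder throughput dominates simulation time.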

(12)


Software LDPC decoders


Objectives of LDPC code simulation


[Figure] Fig. 10. BER performance of the 8000x4000 LDPC code running 25 iterations with different BP replacement heuristics (parameter values from 0.00 to 0.95; BER vs. Eb/N0).

[Figure] Fig. 11. BER performance of the 8000x4000 LDPC code running 20 iterations with different data word-lengths (floating-point reference and QX.2/QX.3/QX.4 formats with 5/6/4, 5/7/5 and 6/8/6 bit allocations; BER vs. Eb/N0).

from/to CN. The array sizes are specified by the number of messages to exchange, named z. Moreover, to execute the decoding computation, the decoder must know the degree of each CN and VN. Two arrays whose dimensions equal n and m respectively are required.

The node degree information is only required to execute the VN kernel. However, to execute the CN kernel, data interleaving is required. Indeed, each CN must read the right VN messages and must generate messages to the right VNs. A data array of size z is required to store the interleaving rules (the interleavings from VN to CN and from CN to VN are identical).

The word-length of the array elements depends on the values to store. For most LDPC codes, one byte is required to store the CN and VN degrees, two bytes are required for the interleaving rules, and several bytes per message value².

2) Layered algorithm: The layered algorithm is composed of a single processing loop that sequentially:

²In the case of a floating-point decoder, 4 bytes are used. When a fixed-point format is preferred, 1 byte is enough.

1) computes a CN value using its linked VN values and the latest CN-to-VN messages;

2) updates the linked VN values and the CN-to-VN messages from the computed CN value.

This computation scheduling removes the requirement to store the VN-to-CN messages, reducing the memory footprint. All the other data arrays are however still required.
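A back-of-the-envelope comparison of the two schedules can be derived from the storage just described (one degree byte per VN and CN, 2-byte interleaving rules per edge, per-edge messages, per-VN values). The sketch below encodes only those assumptions; the z, n, m values used in the example are hypothetical, not those of a real code.

```cpp
#include <cstddef>

// Approximate memory footprint (bytes) of a flooding-schedule decoder.
// z = number of edges (messages), n = number of VNs, m = number of CNs,
// msg_bytes = bytes per stored message/value (4 for float, 1 fixed-point).
std::size_t flooding_footprint(std::size_t z, std::size_t n, std::size_t m,
                               std::size_t msg_bytes) {
    return 2 * z * msg_bytes   // VN-to-CN and CN-to-VN message arrays
         + n * msg_bytes       // VN values
         + 2 * z               // interleaving rules (2 bytes per edge)
         + n + m;              // VN and CN degree arrays (1 byte each)
}

// The layered schedule drops the VN-to-CN message array; everything else
// is kept.
std::size_t layered_footprint(std::size_t z, std::size_t n, std::size_t m,
                              std::size_t msg_bytes) {
    return z * msg_bytes + n * msg_bytes + 2 * z + n + m;
}
```

For a hypothetical rate-1/2 code with n = 8000, m = 4000 and z = 24000 edges stored in 32-bit floating point, this gives about 284 000 bytes for flooding versus 188 000 bytes for layered, illustrating why the layered schedule fits the processor caches more easily.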

B. Using SIMD capability

The SIMD capability provided by x86 architectures makes it possible to perform identical computations on multiple data sets. Several parallelism levels can be exploited in LDPC decoding:

The first parallelism level is located inside node computation: each node (variable node or check node) requires similar computations to compute its value from incoming messages and to update outgoing messages. However, the node degree (number of messages) is rarely a power of two and is mostly irregular. These LDPC code characteristics rule out SIMD optimization at this computation level.

The second parallelism level is located at the node computation level: each node (variable node or check node) can be computed independently of the others. Using a SIMD approach, it becomes possible to speed up node computations. This approach was mainly used in GPU-related works to speed up the decoding process. However, it necessitates that the node degrees are identical (false for irregular codes) and, moreover, due to data dependencies, it is hardly usable in layered decoders.

The third parallelism level is located at the codeword level: different codewords can be decoded simultaneously. Indeed, all codewords are decoded using the same computation sequence over different data sets. The SIMD paradigm enables parallel processing, but increases decoding latency. Latency is not an issue in a simulation context.

Depending on the SIMD instruction set available on the x86 architecture, it becomes possible to process:

- 4 codeword decodings in parallel (in 32-bit floating-point format) for the x86 SSE architecture, and up to 8 codeword decodings for the x86 AVX architecture;

- 8 codeword decodings in parallel (in 8-bit fixed-point format³) for the x86 SSE architecture, and up to 16 codeword decodings for the x86 AVX2 architecture.

However, using SIMD processing is not cost-free: consecutive frames received by the LDPC decoder must first be interleaved, to align data from the different frames in memory before decoding, and then deinterleaved at the end of the computation.
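The frame interleaving mentioned above amounts to a transpose between a frame-major and an element-major layout, so that one SIMD load fetches the i-th value of all frames at once. A minimal scalar sketch (the actual decoder presumably uses SIMD shuffles; the function names are ours):

```cpp
#include <cstddef>
#include <vector>

// Interleave F frames of N values each: src is frame-major (src[f*N + i]),
// dst is element-major (dst[i*F + f]), so the i-th value of every frame
// becomes contiguous in memory.
void interleave(const std::vector<float>& src, std::vector<float>& dst,
                std::size_t F, std::size_t N) {
    for (std::size_t f = 0; f < F; ++f)
        for (std::size_t i = 0; i < N; ++i)
            dst[i * F + f] = src[f * N + i];
}

// Inverse transform, applied once decoding is finished.
void deinterleave(const std::vector<float>& src, std::vector<float>& dst,
                  std::size_t F, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t f = 0; f < F; ++f)
            dst[f * N + i] = src[i * F + f];
}
```

The cost of the two transposes is linear in the frame length and is amortized over the whole iterative decoding, so it is usually negligible compared with the SIMD speed-up.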

C. LDPC decoder specialization

The LDPC decoder performance increases as (a) the number of executed instructions decreases and (b) the temporary data

³Data are stored in memory in 8-bit format; however, processing requires


Evaluate the performance of a decoding heuristic (and select the most efficient parameter value)

Choose the (fixed-point) data format providing the lowest hardware complexity and the best performance (BER)

(13)


Pre-implementation performance study

๏ Faithfully estimate performance before developing the circuit,

➡ Simulation of digital systems on computers / workstations,

➡ Realistic models of the behavior,

‣ Comparison of algorithms,

‣ Comparison of codes,

‣ Evaluation of optimizations,

➡ Time-consuming studies (> weeks of simulation),

๏ Starting model < 100 kbps,


(14)


Application to LDPC codes (x86 target), different codes


[Chart] Simulated throughput for LDPC codes from 200x100 up to 64800x32400, compiled with GCC 4.6 (scale 0 to 120).

State of the art

First works carried out on desktop computers.

Gain on simulation times of around a factor of 100.

« High throughput LDPC decoding on GPU and CPU targets ». IEEE Transactions / Springer, January 2014.

(15)


Application to LDPC codes (GPU target), different #iters


[Chart] Simulated throughput versus the number of decoding iterations (3 to 25), for codes from 816x408 up to 64kx32k (scale 0 to 1500).

State of the art

Study carried out on GPU targets.

Gain by a factor of 4 to 16 over the state of the art.

Simulation 7 times faster than the actual circuit…

« A high throughput efficient approach for decoding LDPC codes onto GPU devices ». IEEE Embedded Systems Letters, November 2014.

(16)


Application to Polar Codes, different code sizes


[Chart] Simulated throughput for Polar code sizes from 256 up to 512k, for s = 0.7 and r = 0.2 / 0.5 / 0.9 (scale 0 to 1800).

State of the art

Application of the approach to other FEC families.

Achieved throughputs above 1 Gbps => is it still worth designing dedicated circuits?

« More than 1Gbps throughput Polar Codes decoders on CPU target ». IEEE Transactions / Springer, January-March 2014.

(17)


SDR: the first digital SoCs for 2014?


The Tegra 4 platform for mobile phones & tablets (2014).

SDR in action!


TABLE III
POLAR CODE DECODER PERFORMANCES (T/P: throughput in Mbps, L: latency in µs, M: memory footprint in kBytes).

Rate      2^9    2^10   2^11   2^12   2^13   2^14   2^15
T/P 0.20  107    94.9   75.8   66.7   58.4   53.3   44.2
T/P 0.50  72.9   70.5   62.6   56.8   51.6   47.1   39.3
T/P 0.75  67.7   62.3   55.4   50.3   50.3   42.4   35.8
T/P 0.90  65.4   58.6   51.7   46.9   46.9   38.8   33.1
L   0.20  76     172    432    981    2244   4914   11864
L   0.50  112    232    522    1153   2540   5564   13344
L   0.75  121    262    591    1300   2601   6173   14624
L   0.90  125    279    633    1396   2793   6740   15844
M         18.12  36.25  72.5   145    290    580    1160

III. CONCLUSION

Not yet...

(18)


Dedicated architectures (ASIC, FPGA)

