Bertrand LE GAL and Christophe JEGO

(1)

Low-latency software LDPC decoders for x86 multi-core devices

[email protected]

Bertrand LE GAL and Christophe JEGO

IMS laboratory, CNRS UMR 5218 Digital Circuits and Systems team Bordeaux-INP, University of Bordeaux  

France

IEEE International Workshop on Signal Processing Systems (SIPS), October 3rd, 2017

Lorient, France

(2)


Historically, software decoders were limited to…


1053-587X (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSP.2014.2311964, IEEE Transactions on Signal Processing


[Figure: variable nodes V(1), V(2), V(3), ..., V(n-1), V(n) connected to check nodes C(1), C(2), ..., C(m)]

Fig. 1: Bipartite graph representation

In this paper, unlike [8] and [9], we propose to study a nonsymmetric NISC architecture in which P and M are not necessarily equal. The numbers M and P can be selected to provide the most efficient resource usage under a throughput or hardware resource constraint. This provides a higher degree of freedom during the computation/data assignment phase of the design flow, which keeps the PU utilization rate high. We also propose to overcome the limited programmability issue by using an automated design flow that schedules the computations on the different PUs and maps the data onto the different MUs. Finally, the designed NISC-based LDPC decoder implements layered decoding [10], which provides the same decoding performance as TPMP scheduling for half the number of iterations. This whole design framework can be used to automatically generate an LDPC decoder able to support a predetermined set of codes under throughput/resource constraints. The hardware efficiency of the generated decoders is higher than that of state-of-the-art flexible LDPC decoders, even for the challenging-to-implement unstructured LDPC codes.

The remainder of the paper is organized as follows. In section II, the characteristics of structured and unstructured LDPC codes are detailed. Section III recalls the LDPC decoding algorithm and its simplified versions. The proposed NISC-based architectural model is described in section IV. Section V details the associated automated design flow. Hardware implementation results and Bit Error Rate (BER) performance are presented in section VI. Finally, conclusions are drawn in section VII.

II. STRUCTURED AND UNSTRUCTURED LDPC CODES

An $(n, k)$ LDPC code is a linear block code defined by a parity check matrix $H$ of size $m = n - k$ rows and $n$ columns. The ratio $R = k/n$ denotes the code rate. As shown in Figure 1, an LDPC code may be represented by a bipartite graph, also called Tanner graph [11], in which the two types of nodes are the variable nodes $\{V(i)\}_{i=1}^{n}$ and the check nodes $\{C(j)\}_{j=1}^{m}$. In this graph representation, the degree of a node is equal to the number of nodes connected to it. The degrees of a variable node $V(i)$ and a parity check node $C(j)$ are denoted as $d_v(i)$ and $d_c(j)$, respectively. A code is considered regular if $d_v(i)$ and $d_c(j)$ are constant for all $i$, $j$ indices in the $H$ matrix.

(a) WiMAX (576, 288) (b) Gallager (4000, 2000) [13]

Fig. 2: An illustration of structured and unstructured LDPC H matrices

Alternatively, an irregular LDPC code does not have constant node degrees over all rows/columns.
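The degree definitions above can be checked mechanically on a small matrix. The following is a minimal sketch, not taken from the paper: the helper names and the two toy matrices are hypothetical, and the matrices are far too small to be practical LDPC codes.

```python
def node_degrees(H):
    """Return (d_v, d_c) for a binary parity check matrix H given as a list of rows.

    d_v[i] is the degree of variable node V(i) (number of ones in column i),
    d_c[j] is the degree of check node C(j) (number of ones in row j).
    """
    d_c = [sum(row) for row in H]
    d_v = [sum(col) for col in zip(*H)]
    return d_v, d_c


def is_regular(H):
    """A code is regular when all d_v(i) are equal and all d_c(j) are equal."""
    d_v, d_c = node_degrees(H)
    return len(set(d_v)) == 1 and len(set(d_c)) == 1


# Hypothetical toy matrices (illustration only).
H_IRREGULAR = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]
H_REGULAR = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]

if __name__ == "__main__":
    print(node_degrees(H_IRREGULAR), is_regular(H_IRREGULAR))
    print(node_degrees(H_REGULAR), is_regular(H_REGULAR))
```

In the first matrix the check nodes all have degree 3 but the variable nodes have degree 2 or 1, so the code is irregular; in the second one every node has degree 2.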

When designing LDPC codes for digital communication standards, the main issue is to obtain a wide range of codes, in terms of code length and code rate, that have good decoding performance. Moreover, the structure of an LDPC code should be as regular as possible to ease the encoding and decoding processes, and thus the design of the encoder and decoder architectures. Most standards contain structured codes based on quasi-cyclic (QC) submatrices. In Figure 2a, the H matrix can be partitioned into blocks of 96 × 96 that can be processed over 24 processing elements [12]. Despite their hardware-friendly properties, structured codes are prone to produce Tanner graphs with short cycles. Moreover, the use of a structure limits the flexibility in terms of code length and code rate.

On the other hand, unstructured/random codes enable a flexible choice of the parameters (n, R) and a performance that achieves the channel capacity [2]. However, the lack of structure in the resulting H matrix increases the complexity of the encoding and decoding interleavers. For these reasons, only a few works [8], [9] address the design of decoders that can process unstructured LDPC codes. The LDPC decoder proposed in this work supports both structured and unstructured codes.

III. LDPC DECODING ALGORITHM

LDPC codes can be efficiently decoded using the Belief Propagation (BP) algorithm. This algorithm operates on the bipartite graph representation of the code by iteratively exchanging messages between the variable and parity check nodes along the edges of the graph [14]. The Min-Sum (MS) algorithm is an approximation of the BP algorithm. It provides a significant reduction in computational complexity for a reasonable loss in error correction performance. This performance degradation can be partially compensated by using correction factors in the MS algorithm. The two well-known corrected MS versions are denoted as normalized MS and offset MS [15]. Based on these different improvements, many LDPC decoders were described in previous papers. A review can be found in [16].
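To make the relation between BP, normalized MS and offset MS concrete, here is a minimal sketch of the check-node magnitude computation only (signs are left out). The function names and the values `alpha=0.75` and `eta=0.5` are illustrative assumptions, not parameters taken from the paper.

```python
import math


def bp_magnitude(inputs, exclude):
    """Exact BP check-node magnitude for the edge `exclude` (tanh rule)."""
    prod = 1.0
    for k, x in enumerate(inputs):
        if k != exclude:
            prod *= math.tanh(abs(x) / 2.0)
    return 2.0 * math.atanh(prod)


def ms_magnitude(inputs, exclude):
    """Plain Min-Sum: smallest magnitude among the other incoming messages."""
    return min(abs(x) for k, x in enumerate(inputs) if k != exclude)


def normalized_ms_magnitude(inputs, exclude, alpha=0.75):
    """Normalized MS: scale the minimum by a factor alpha < 1 (assumed value)."""
    return alpha * ms_magnitude(inputs, exclude)


def offset_ms_magnitude(inputs, exclude, eta=0.5):
    """Offset MS: subtract an offset eta (assumed value) and clip at zero."""
    return max(ms_magnitude(inputs, exclude) - eta, 0.0)


if __name__ == "__main__":
    llrs = [1.0, 2.0, -3.0]  # hypothetical incoming messages
    print(bp_magnitude(llrs, 0), ms_magnitude(llrs, 0))
    print(normalized_ms_magnitude(llrs, 0), offset_ms_magnitude(llrs, 0))
```

Plain MS overestimates the BP magnitude, which is why both corrected versions shrink the minimum, one by scaling and the other by subtraction.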

In previous works on flexible LDPC decoder design, TPMP scheduling is usually selected to exchange messages on the graph. It makes PU and MU mapping easier, especially for structured codes. Contrary to previous work, we have selected horizontal layered scheduling. It halves the number of iterations for the same decoding performance. Moreover, the horizontal layered approach is less complex than a


Fig. 1. N = 8 polar code systematic encoder graph.

low bit error rate values. With the increasing power efficiency of current processors, a software channel decoder could also be considered a viable solution in SDR systems.

In modern digital communication systems, channel decoding functions are usually implemented in dedicated hardware in order to meet the silicon cost and energy constraints imposed by mobile systems. However, recent mobile systems also require:

• flexibility to be adapted to the system context;

• low development time to meet time to market pressure;

• genericity and scalability to enable multi-standard support.

A way to achieve these objectives is to implement mobile application algorithms onto programmable architectures such as multi-core or many-core devices. Many recent studies [13]–[20] have focused on efficient implementations of complete or partial receivers. Indeed, today’s processors offer an important amount of parallelism [21] that enables the implementation of complex algorithms such as the ones used in SDR systems under real-time constraints.

In this paper, we take advantage of the parallelism available in x86 processors (SIMD, multi-core) in order to efficiently map the successive cancellation (SC) decoding of polar codes. The proposed approach is generic enough to be easily applied to other kinds of targets such as embedded processors. However, unlike [12], we propose to apply the SIMD facility to process F frames in parallel. This provides a more regular program structure and guarantees a very high utilization rate of the SIMD processing units. In addition, the memory mapping is optimized in order to reduce the probability of cache misses during execution. Other software optimizations such as function duplication and data packing allow the SC decoder to reach up to 3.2 Gbps on a single processor core.

The remainder of the paper is organized as follows. Section II presents polar codes and the successive cancellation decoding algorithm. In section III, all the optimization techniques that contribute to the speedup of the decoder are detailed. Then, the experimental setup is introduced in section IV. An analysis of the latency and throughput is

November 10, 2014 DRAFT

[Figure: decoder block diagram with a channel buffer (SRAM 1 to SRAM 4), P processing elements (PEs), LLR RAM, RAM S1 to RAM S4, frozen-bit ROM, unrolled ALU, and soft and hard datapaths]


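Algorithm 1: horizontal layered offset Min-Sum algorithm

init

  $t = 0$, $T_i^{(0)} = \frac{2 y_i}{\sigma^2}$, $i \in [1, .., n]$ and $E_{ji}^{(0)} = 0$, $j \in [1, .., m]$

repeat

  for all $j$ do

    STEP (-1-)

    $|E_j^{(t+1)}| = +\infty$

    for all $i \in N(j)$ do

      Variable to parity check message $T_{ij}^{(t)}$ processing

      $T_{ij}^{(t)} = T_i^{(t)} - E_{ji}^{(t)}$

      Parity check node $E_j^{(t+1)}$ update

      $|E_j^{(t+1)}| = \mathrm{Min}\left(|T_{ij}^{(t)}|, |E_j^{(t+1)}|\right)$

      $\mathrm{sgn}\left(E_j^{(t+1)}\right) = \prod_{i \in N(j)} \mathrm{sgn}\left(T_{ij}^{(t)}\right)$

    end for

    STEP (-2-)

    for all $i \in N(j)$ do

      Parity check to variable message $E_{ji}^{(t+1)}$ processing

      $|E_{ji}^{(t+1)}| = \begin{cases} |E_j^{(t+1)}|, & \text{if } |T_{ij}^{(t)}| \neq |E_j^{(t+1)}| \\ \mathrm{Min}_{i' \in N(j) \setminus i} |T_{i'j}^{(t)}|, & \text{otherwise} \end{cases}$

      $|E_{ji}^{(t+1)}| = \mathrm{Max}\left(|E_{ji}^{(t+1)}| - \eta, 0\right)$

      $\mathrm{sgn}\left(E_{ji}^{(t+1)}\right) = \mathrm{sgn}\left(E_j^{(t+1)}\right) \times \mathrm{sgn}\left(T_{ij}^{(t)}\right)$

      Variable node $T_i^{(t+1)}$ update

      $T_i^{(t+1)} = T_{ij}^{(t)} + E_{ji}^{(t+1)}$

    end for

  end for

  $t = t + 1$

until $t \geq t_{max}$

The decoded bits are estimated by $\mathrm{sign}(T_i^{(t)})$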

vertical layered approach for the check node processing of the MS algorithm, as explained in [17]. The difficult PU and MU mapping is handled by an automated tool that assigns computations to PUs and maps data onto MUs. In a horizontal layered schedule, the check nodes are processed sequentially and the variable nodes are directly updated by the corresponding check-to-variable messages. The chosen horizontal layered offset Min-Sum decoding algorithm is described in Algorithm 1. It is composed of two main steps, 1 and 2. $y_i$ is the channel observation related to the received bit $i$ in log-likelihood ratio (LLR) format. $T_i^{(t)}$ denotes the a posteriori LLR of the variable node $i$ during the iteration $t$ in the case of a BPSK modulation over an AWGN channel of variance $\sigma^2$. $E_j^{(t)}$ corresponds to the LLR of the parity check node $j$ during the iteration $t$. $T_{ij}^{(t)}$ and $E_{ji}^{(t)}$ denote the messages that are sent from variable node $i$ to parity check node $j$ and from parity check node $j$ to variable node $i$, respectively. $M(i)$ is the set of all the parity check nodes that are connected to the variable node $i$. $N(j)$ is the set of all the variable nodes that are connected to the parity check node $j$. $N(j) \setminus i$ is the set of variable nodes that are connected to the parity check node $j$ except the variable node $i$. $\eta$ is the correction factor that is used in the offset Min-Sum version in order to improve the accuracy of the extrinsic messages.
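The two steps of the horizontal layered offset Min-Sum schedule can be sketched in plain, floating-point Python. This is an illustrative sketch only: the function name, the default `eta` and `t_max` values, the convention that a positive LLR means bit 0, and the (7, 4) Hamming test matrix are assumptions, and the code is unrelated to the generated hardware described in the paper.

```python
def layered_oms_decode(H, llr, eta=0.5, t_max=10):
    """Horizontal layered offset Min-Sum decoding.

    H: binary parity check matrix (list of rows), every row of degree >= 2.
    llr: initial a posteriori values T_i (assumed: positive LLR means bit 0).
    eta, t_max: offset and iteration count (hypothetical default values).
    """
    m = len(H)
    N = [[i for i, h in enumerate(row) if h] for row in H]  # N(j)
    T = list(llr)                                           # T_i
    E = [{i: 0.0 for i in N[j]} for j in range(m)]          # E_ji, initialized to 0

    for _ in range(t_max):
        for j in range(m):  # check nodes are processed sequentially
            # STEP 1: variable-to-check messages, minimum search, sign product.
            T_ij = {i: T[i] - E[j][i] for i in N[j]}
            mags = sorted(abs(v) for v in T_ij.values())
            min1, min2 = mags[0], mags[1]
            sign_prod = 1
            for v in T_ij.values():
                if v < 0:
                    sign_prod = -sign_prod
            # STEP 2: check-to-variable messages and immediate T_i update.
            for i in N[j]:
                mag = min1 if abs(T_ij[i]) != min1 else min2
                mag = max(mag - eta, 0.0)  # offset correction
                sign = sign_prod * (-1 if T_ij[i] < 0 else 1)
                E[j][i] = sign * mag
                T[i] = T_ij[i] + E[j][i]

    return [1 if t < 0 else 0 for t in T]


if __name__ == "__main__":
    # (7, 4) Hamming code used as a tiny stand-in for an LDPC matrix.
    H = [
        [1, 0, 1, 0, 1, 0, 1],
        [0, 1, 1, 0, 0, 1, 1],
        [0, 0, 0, 1, 1, 1, 1],
    ]
    # All-zero codeword with one unreliable, wrong-signed observation.
    received = [2.0, 2.0, -0.5, 2.0, 2.0, 2.0, 2.0]
    print(layered_oms_decode(H, received))
```

Keeping the two smallest magnitudes in step 1 is one common way to obtain the "minimum over all other neighbours" of step 2 without a second search; a variable node updated by check j is seen with its new value by check j + 1 within the same iteration, which is what distinguishes the layered schedule from TPMP.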

[Figure: a NISC controller sends control signals to the global memory banks (MUs storing the LLRs Ti), to the ∏ / ∏⁻¹ interconnection network and to the processing units (PUs, each with its own local register file) of the SIMD matrix; a system interface handles the k information bits, the LLRs Ti and the IO status]

Fig. 3: NISC-based LDPC decoder architecture

IV. NISC-BASED LDPC DECODER ARCHITECTURE

An alternative to the ASIP architecture is the NISC architecture [18]. Similar to ISP-based architectures, a NISC generates signals to control a datapath during each clock cycle. However, instead of using an abstraction such as an instruction set, the NISC controller stores control values in a ROM and sends them directly to each resource (PUs, MUs, interconnection network, etc.). The absence of an instruction set makes the control of the datapath more flexible. Indeed, in a NISC, at each clock cycle, any hardware resource in the architecture can be activated independently. NISC-based architectures can thus achieve better resource utilization than conventional instruction-set based processors. This high hardware efficiency comes at the cost of reduced programmability. An automated tool is then required to generate the content of the control ROM (cf. section V).

In this section, we describe the architecture of a NISC-based LDPC decoder. This architecture is automatically generated by the design flow described in section V.

A. Top level architecture

The architectural model is composed of a NISC controller associated with a homogeneous SIMD matrix, as detailed in Figure 3. The SIMD matrix is a specialized form of parallel machine, where P Processing Units (PUs) and M Memory Units (MUs) process and store independent data (LLRs $T_i$), respectively. As opposed to previous works, the proposed SIMD matrix has a nonsymmetric structure in which the number of PUs and the number of MUs may differ ($P \neq M$).

Each PU includes dedicated processing logic and a register file to store local data ($E_{ji}$ messages). An interconnection network is implemented to manage the exchange of LLRs $T_i$. This interconnection network performs the interleave $\Pi$ and deinterleave $\Pi^{-1}$ functions. These functions route messages between the variable nodes and the check nodes.
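As a purely software analogy of the interleave and deinterleave functions, the routing can be seen as a permutation and its inverse. The function names and the example permutation below are hypothetical; the actual network and its control values are generated by the design flow.

```python
def interleave(values, perm):
    """Pi: output position k receives the value stored at position perm[k]."""
    return [values[p] for p in perm]


def inverse_permutation(perm):
    """Build the permutation that undoes `perm` (Pi^-1)."""
    inv = [0] * len(perm)
    for out_pos, in_pos in enumerate(perm):
        inv[in_pos] = out_pos
    return inv


if __name__ == "__main__":
    llrs_in_memory = [10, 11, 12, 13]  # hypothetical LLR values held by the MUs
    perm = [2, 0, 3, 1]                # hypothetical routing pattern
    routed = interleave(llrs_in_memory, perm)                   # MUs to PUs
    restored = interleave(routed, inverse_permutation(perm))    # PUs back to MUs
    print(routed, restored)
```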

Finally, a system interface monitors the data exchanges with the outside world.

B. Processing Unit architecture

The PU architecture is detailed in Figure 4. It is composed of two cascaded blocks devised to operate in pipeline mode.

These blocks correspond to the two steps 1 and 2 of

๏ Validate and compare error correction code families

๏ Benchmarking of decoding algorithms or code construction techniques

๏ Parameter optimization

๏ Estimation of hardware decoder performances before development

(3)

Currently, they can fulfill other real-time performance requirements

๏ Provide design and runtime flexibility

๏ Software decoders are at least as fast as many hardware circuits

๏ Currently compatible with some industrial use cases

➡ Throughputs are higher than 1 Gbps on multi-core or many-core devices.

➡ Processing latencies, from hundreds of µs up to ms, are too high.

➡ Consecutive frame configurations (N, rate) can differ, which prevents exploiting inter-frame parallelism [1].

[1] OpenAirInterface 5G software alliance for democratising wireless innovation


(6)

The processing performance of GPU & CPU devices


Multicore device (e.g. INTEL Core-i7)

One chip composed hierarchically of physical processor cores (4) and SIMD unit (1).

During 1 clock cycle, a SIMD instr. can perform 32 computations on 8-bit fixed-point data => 32 8b-oper.

During 1 clock cycle, a physical processor core (superscalar) can perform up to 6 SIMD instr. => 192 8b-oper.

During 1 clock cycle, a Core-i7 processor can execute 4 cores x 6 SIMD instr. => 768 8b-oper.

INTEL Core-i7 processor NVIDIA Tegra K1 GPU

GPU devices (e.g. NVIDIA Titan GPU)

One chip composed hierarchically of stream processors (14) and cores (2688). Each stream processor controls a set of cores (192).

During 1 clock cycle, 2688 floating-point operations can be executed.

However, more computations are required to hide processing and memory access latencies.

With 1 to 3 GHz clock frequency, it delivers (theoretically) a high processing performance.
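The per-cycle figures quoted on this slide follow from simple products, which the short sketch below reproduces. The function names are hypothetical and the 3 GHz clock used in the example is only one point in the 1 to 3 GHz range mentioned above; these are theoretical peaks, and the CPU figure counts 8-bit fixed-point operations while the GPU figure counts floating-point operations.

```python
def cpu_ops_per_cycle(cores=4, simd_instr_per_cycle=6, lanes_8bit=32):
    """Peak 8-bit operations per clock cycle on the multicore CPU of the slide."""
    return cores * simd_instr_per_cycle * lanes_8bit


def gpu_ops_per_cycle(stream_processors=14, cores_per_stream_processor=192):
    """Peak floating-point operations per clock cycle on the GPU of the slide."""
    return stream_processors * cores_per_stream_processor


def peak_giga_ops(ops_per_cycle, clock_ghz):
    """Theoretical peak in giga-operations per second for a given clock."""
    return ops_per_cycle * clock_ghz


if __name__ == "__main__":
    print(cpu_ops_per_cycle())                       # 4 x 6 x 32
    print(gpu_ops_per_cycle())                       # 14 x 192
    print(peak_giga_ops(cpu_ops_per_cycle(), 3.0))   # assumed 3 GHz clock
```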

(7)

The structure of standardized LDPC codes

๏ Standardized H matrices have a Quasi-Cyclic structure,

➡ Compressed matrix definition,

➡ Z expansion factor,

➡ Shifting coefficients,

๏ This QC structure of the H matrix

➡ Reduces the H memory footprint,

➡ Limits data dependencies during decoding, which makes parallel computing easy,

๏ From a hardware point of view, the Z factor « enforces »,

➡ Z processing units,

➡ Z memory banks,

➡ One or two Z × Z data interleavers.

Fig. 1. An illustration of a QC-LDPC H matrix from the IEEE 802.16e (WiMAX) standard: N = 576, rate = 1/2 and Z = 24.

II. LDPC DECODING ALGORITHM

An LDPC code is a linear block code defined by a sparse binary M × N parity-check matrix called H. In the H matrix, the M = N − K rows are associated with parity-check nodes (CNs) and the N columns with received bit nodes (VNs), where N is the codeword length and K is the number of information bits in the received sequence. The number of ones in the y-th row of H is defined as the row degree d_c(y). Similarly, the number of ones in the x-th column is defined as the column degree d_v(x). Decoding can be performed with a message passing (MP) approach in which VNs and CNs exchange soft information.

Approximations of the sum-product algorithm (SPA), such as Offset Min-Sum (OMS) or Normalized Min-Sum (NMS), are commonly used to implement LDPC decoders [23]. Indeed, they drastically reduce the computational complexity of SPA decoding while causing only a negligible degradation of the error correction performance.

In their general form, H matrices are unstructured and irregular, meaning that d_c and d_v are not constant in the H matrix.

However, to facilitate the design of LDPC decoders, a special class of LDPC codes was proposed (Quasi-Cyclic LDPC codes). QC-LDPC codes are LDPC codes that are composed of an array of Z × Z circulant identity sub-matrices. Z is the order that defines the parallelism level. This structure eases H matrix storage and ensures that Z CNs or VNs can be processed in parallel without memory access conflicts. Consequently, QC-LDPC codes have been adopted in many standards [2], [3], [4], [5]. For each standardized QC-LDPC code, the Z parameter and the rotation offsets (rid) of each non-null sub-matrix are specified. Fig. 1 shows the H matrix of a QC-LDPC code from the IEEE 802.16e standard (N = 576, rate = 1/2 and Z = 24). This H matrix structure enables hardware architectures to allocate at least Z = 24 processing units for parallel processing.

Moreover, the QC structure, which groups VN elements into subsets, helps the data interleaving. Indeed, when Z memory banks are available, interleaving can be implemented with a simple shift register or a multiplexer network generated according to the Z value.
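The link between the compressed QC description and the rotation-based interleaving can be sketched as follows. The base matrix, the value Z = 3, the use of -1 for an all-zero block and the right-shift convention (row z of a block shifted by s has its one in column (z + s) mod Z) are illustrative assumptions, not values from a standard.

```python
def expand_qc(base, Z):
    """Expand a compressed QC description into the full binary H matrix.

    base[r][c] is a cyclic shift in [0, Z) or -1 for an all-zero Z x Z block.
    """
    H = []
    for base_row in base:
        for z in range(Z):
            row = []
            for shift in base_row:
                block_row = [0] * Z
                if shift >= 0:
                    block_row[(z + shift) % Z] = 1
                row.extend(block_row)
            H.append(row)
    return H


def rotate(values, shift):
    """Cyclic rotation: element z of the result is values[(z + shift) % Z]."""
    return values[shift:] + values[:shift]


if __name__ == "__main__":
    base = [[0, 1, -1],
            [2, -1, 0]]  # hypothetical 2 x 3 base matrix
    H = expand_qc(base, Z=3)
    for row in H:
        print(row)
    # The Z check nodes of a block read a rotated group of Z values,
    # so the interleaver for that block reduces to one rotation.
    print(rotate([10, 11, 12], 1))
```

Only the base matrix and Z need to be stored, which is the memory saving mentioned above; the full H is shown here only to make the structure visible.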

To reduce the computational and memory complexities of LDPC decoder architectures, several schedulings of the CN computations were proposed [24]. Indeed, in the original formulation of the LDPC decoding algorithm, named flooding schedule, all the VN and CN computations are executed in parallel. This formulation was mainly applied in the first decoder architectures [25], [26] and in GPU-based software LDPC decoder implementations [27], [19], [20], [6], [18] because it provides large parallelization opportunities. However, a horizontal layered formulation is more efficient for both hardware and software implementations [28], [29], [30]. It halves the number of decoding iterations required to reach similar error correction performance. Moreover, it reduces the memory requirement by about 30%. More details about the horizontal layered algorithm used in this work are provided in [7].

The main drawback of the layered scheduling formulation is its maximum computation parallelism level, which is lower than that of the flooding formulation, where all VNs or CNs can be evaluated in parallel. Indeed, in the layered formulation, CNs can be executed in parallel only if they do not share the same VN neighbors, which increases the complexity of parallelizing it.

This is the major argument behind discarding it in many multi-core and many-core implementation works. However, for multi-core devices such as x86 or ARM processors, the parallelization issue was solved by using an inter-frame parallelization approach [7], [22]. Nonetheless, processing Q frames in parallel raises the processing latency above 100 µs for the decoding of most LDPC codes.

In [22] and [7], the main objective was to design a generic decoder implementation for both structured and unstructured LDPC codes. In fact, rescheduling the CN computations to maintain a constant parallelism level without violating the VN access policy with unstructured LDPC codes was a complex task.

For this reason, the inter-frame parallelization was discarded.

Since standardized LDPC codes are QC-LDPC codes, it is possible to benefit from the code structure to optimize multi-core implementations. Indeed, the CN elements involved in a Z × Z sub-matrix are independent and can thus be updated in parallel.
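The independence property can be verified on a small example: within one group of Z consecutive rows of a QC matrix, no column contains more than one 1, so the Z check nodes never touch the same VN. The sketch below uses a hypothetical 4 × 4 matrix with Z = 2, written out by hand; the function names are illustrative.

```python
def rows_are_independent(rows):
    """True when no column holds more than one 1, i.e. no VN is shared."""
    return all(sum(col) <= 1 for col in zip(*rows))


def layers_are_conflict_free(H, Z):
    """Check every group of Z consecutive check nodes (one block row of H)."""
    return all(rows_are_independent(H[r:r + Z]) for r in range(0, len(H), Z))


# Hypothetical QC matrix with Z = 2: blocks [[I0, I1], [I1, I0]],
# where I_s is the 2 x 2 identity cyclically shifted by s.
H_QC = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
]

if __name__ == "__main__":
    print(layers_are_conflict_free(H_QC, Z=2))  # rows of one block row
    print(layers_are_conflict_free(H_QC, Z=4))  # mixing two block rows
```

Grouping rows from different block rows breaks the property, which is why the parallelism exploited inside one frame is tied to Z.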

III. PARALLELIZATION STRATEGY FOR LOW-LATENCY

In this section, we describe an LDPC decoding parallelization strategy and the optimization techniques applied to reduce the processing latency while maintaining a high processing throughput for software LDPC decoders.

Low processing latency requires parallelizing the decoding of a single frame (intra-frame), contrary to previous works [22], [7], in order to avoid system-level frame buffering and to shorten the decoding process. However, as for all efficient parallelized software implementations, regular computations and memory accesses are required. Indeed, multi-core processors have SIMD units that can process Q 8-bit fixed-point values in parallel. However, the main concern for an efficient intra-frame parallelization comes from irregular memory accesses that depend on the H matrix. Indeed, depending on the Z value and the sub-block rotation values, accessing

[Figure: reconstructed H matrix of the WiMAX 576 × 288 LDPC code (Z = 24), built from Z × Z shifted identity matrices]


(8)

The standardized LDPC codes structure

๏ Standardized H matrices have a Quasi-Cyclic structure,

➡ Compressed matrix definition,

➡ Z expansion factor,

➡ Shifting coefficients,

๏ This QC structure of the H matrix

➡ Reduces the H memory footprint,

➡ Limits data dependencies during decoding, which makes parallel computing easy,

๏ From a hardware point of view, the Z factor « enforces » the design,

➡ Z processing units,

➡ Z memory banks,

➡ One or two Z × Z data interleavers.

[Figure: hardware decoder architecture with an FSM controller sending control signals to the memory units (MUs storing the LLRs Ti), the ∏ / ∏⁻¹ interleavers and the processing units (PUs with register files); a system interface handles the k information bits, the LLRs Ti and the IO status]
