Design of a GF(64)-LDPC Decoder Based on the EMS Algorithm

HAL Id: hal-00777131

https://hal.archives-ouvertes.fr/hal-00777131

Submitted on 31 Jan 2013

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Design of a GF(64)-LDPC Decoder Based on the EMS Algorithm

Emmanuel Boutillon, Laura Conde-Canencia, Ali Al Ghouwayel

To cite this version:

Emmanuel Boutillon, Laura Conde-Canencia, Ali Al Ghouwayel. Design of a GF(64)-LDPC Decoder Based on the EMS Algorithm. IEEE Transactions on Circuits and Systems Part 1 Fundamental Theory and Applications, Institute of Electrical and Electronics Engineers (IEEE), 2013, 60 (10), pp. 2644-2656. doi:10.1109/TCSI.2013.2279186. hal-00777131.


Design of a GF(64)-LDPC Decoder Based on the EMS Algorithm

Emmanuel Boutillon, Senior Member, IEEE, Laura Conde-Canencia, Member, IEEE, and Ali Al Ghouwayel

Abstract—This paper presents the architecture, performance and implementation results of a serial GF(64)-LDPC decoder based on a reduced-complexity version of the Extended Min-Sum algorithm. The main contributions of this work concern the variable node processing, the codeword decision and the elementary check node processing. Post-synthesis area results show that the decoder occupies less than 20% of a Virtex 4 FPGA for a decoding throughput of 2.95 Mbps. The implemented decoder performs within 0.7 dB of the Belief Propagation algorithm for different code lengths and rates.

Moreover, the proposed architecture can be easily adapted to decode very high Galois Field orders, such as GF(4096) or higher, by slightly modifying a marginal part of the design.

Index Terms—Non-Binary low-density parity-check decoders, low-complexity architecture, FPGA synthesis, Extended Min Sum algorithm.

I. INTRODUCTION

The extension of binary Low-Density Parity-Check (LDPC) codes to high-order Galois fields (GF(q), with q > 2) aims at further closing the performance gap to the Shannon limit when using small or moderate codeword lengths [1]. In [2], it has been shown that this family of codes, named Non-Binary (NB) LDPC, outperforms convolutional turbo-codes (CTC) and binary LDPC codes because it combines the steep waterfall region typical of CTC for short codewords with the low error floor typical of binary LDPC. Compared to binary LDPC codes, NB-LDPC codes generally present higher girths, which leads to better decoding performance. Moreover, since NB-LDPC codes are defined on high-order fields, a closer connection can be made between NB-LDPC and high-order modulation schemes. When associating binary LDPC with M-ary modulation, the demapper generates likelihoods that are correlated at the binary level, initializing the decoder with messages that are already correlated. The use of iterative demapping partially mitigates this effect but increases the whole decoder complexity. Conversely, in the NB case, the symbol likelihoods are uncorrelated, which automatically improves the performance of the decoding algorithms [3] [4]. Moreover, a better performance of the q-ary receiver processing has been observed in MIMO systems [5] [6].

Finally, NB-LDPC codes also outperform binary LDPC codes in the presence of burst errors [7] [8]. Further research on NB-LDPC considers their definition over finite groups G(q), which

E. Boutillon and L. Conde-Canencia are with the Lab-STICC laboratory, CNRS, Université de Bretagne Sud, Lorient.

A. Al Ghouwayel is with the Lebanese International University.

Copyright (c) 2012 IEEE. Personal use of this material is permitted.

However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.

is a more general framework than finite Galois fields GF(q) [9]. This leads to hybrid [10] and split or clustered NB-LDPC codes [11], increasing the degrees of freedom in terms of code construction while keeping the same decoding complexity.

From an implementation point of view, NB-LDPC codes highly increase complexity compared to binary LDPC, especially at the receiver side. The direct application of the Belief Propagation (BP) algorithm to GF(q)-LDPC leads to a computational complexity dominated by O(q^2), and considering values of q > 16 results in prohibitive complexity. Therefore, an important effort has been dedicated to the design of reduced-complexity decoding algorithms for NB-LDPC codes. In [12] and [13], the authors present an FFT-based BP decoding that reduces the complexity to the order of O(d_c × q × log q), where d_c is the check node degree. This algorithm is also described in the logarithm domain [14], leading to the so-called log-BP-FFT. In [15] [16], the authors introduce the Extended Min-Sum (EMS), which is based on a generalization of the Min-Sum algorithm used for binary LDPC codes ([17], [18] and [19]).

Its principle is the truncation of the vector messages from q to n_m values (n_m << q), introducing a performance degradation compared to the BP algorithm. However, with an appropriate estimation of the truncated values, the EMS algorithm can approach, or even in some cases slightly outperform, the BP-FFT decoder. Moreover, the complexity/performance trade-off can be adjusted with the value of the n_m parameter, making the EMS decoder architecture easily adaptable to both implementation and performance constraints. A complexity comparison of the different iterative decoding algorithms applied to NB-LDPC is presented in [20]. Finally, the Min-Max algorithm and its selective-input version are presented in [21].

In recent years, several hardware implementations of NB-LDPC decoding algorithms have been proposed. In [22] and [23], the authors consider the implementation of the FFT-BP on an FPGA device. In [24], the authors evaluate implementation costs for various values of q by extending the layered decoder to the NB case. An architecture for a parallel or serial implementation of the EMS decoder is proposed in [16]. Also, the implementation of the Min-Max decoder is considered in [25], [26] and optimized in [27] for GF(32). Finally, a recent paper1 presents an implementation of an NB-LDPC decoder based on the Bubble-Check algorithm and a low-latency variable node processing [28].

Even if the theoretical complexity of the EMS is in the order of O(n_m × log n_m), for a practical implementation, the parallel insertion needed to reorder the vector messages at the

1Paper published during the reviewing process of our manuscript.


TABLE I
NOTATION

Code parameters:
q       order of the Galois field
m       number of bits in a GF(q) symbol, m = log2(q)
H       parity-check matrix
M       number of rows in H
N       number of columns in H, or number of symbols in a codeword
d_c     check node degree
d_v     variable node degree
h_j,k   an element of the H matrix

Notation for the decoding algorithm:
X          a codeword
x_k        a GF(q) symbol in a codeword
x_k,i      the i-th bit of the binary representation of x_k
Y          received codeword (channel information)
y_k        a GF(q) symbol in a received codeword
y_k,i      the i-th noisy channel sample in y_k
n_m        size of the truncated message in the EMS algorithm
L_k(x)     LLR value of the k-th symbol
x~_k       symbol of GF(q) that maximizes P(y_k|x)
c^_k       a decoded symbol
C^         the decoded codeword
{L_k(x)}   the intrinsic message, x ∈ GF(q)
C2V_jk     check-to-variable message associated to edge h_j,k
V2C_jk     variable-to-check message associated to edge h_j,k
λ_k        EMS message associated to symbol x_k
λ_k(l)_GF  GF(q) value of the l-th element in the EMS message
λ_k(l)_L   LLR value of the l-th element in the EMS message

Architecture parameters:
n_b     number of quantization bits for an intrinsic message
n_y     number of quantization bits for the representation of y_k,i
n_it    number of decoding iterations
n_op    number of operations in an elementary check node processing
L_dec   latency of the decoding process (in number of clock cycles)
L_VN    latency of the variable node processing
L_CN    latency of the check node processing
n_bub   number of bubbles
S_C2V   subset of GF(q), S_C2V = {C2V_GF(l)}, l = 1...n_m
S̄_C2V   subset of GF(q) that contains the symbols not in S_C2V

Elementary Check Node (ECN) increases the complexity to the order of O(n_m^2). An algorithm to reduce the EMS ECN complexity is introduced in [29], bringing it down to the order of O(n_m √n_m). The complexity of this architecture was further reduced, without sacrificing performance, with the L-Bubble-Check algorithm [30].

As the EMS decoder considers Log-Likelihood Ratios (LLR) for the reliability messages, a key component of the NB decoder is the circuit that generates the a priori LLRs from the binary channel values. An LLR generator circuit is proposed in [31], but this algorithm is software oriented rather than hardware oriented, since it builds the LLR list dynamically. In [32], an original circuit is proposed, as well as the accompanying sorter that provides the NB LLR values to the processing nodes of the EMS decoder.

In this paper, we present the design and a reduced-complexity implementation of the L-Bubble Check EMS NB-LDPC decoder, focusing our attention on the following points: the Variable Node (VN) update, the Check Node (CN) processing as a systolic array of ECNs, and the codeword decision-making.

Table I summarizes the notation used in the paper.

The paper is organized as follows: section II introduces ultra-sparse quasi-cyclic NB-LDPC codes, which are the ones considered by the decoder architecture. This section also reviews NB-LDPC decoding with particular attention to the Min-Sum and the EMS algorithms. Section III is dedicated to the global decoder architecture and its scheduling. The VN architecture is detailed in section IV. The CN processor and the L-Bubble Check ECN architecture are presented in section V. Section VI is dedicated to performance and complexity issues and, finally, conclusions and perspectives are discussed in section VII.

II. NB-LDPC CODES AND EMS DECODING

This section provides a review of NB-LDPC codes and the associated decoding algorithms. In particular, the Min-Sum and the EMS algorithms are described in detail.

A. Definition of NB-LDPC codes

An NB-LDPC code is a linear block code defined by a very sparse parity-check matrix H whose nonzero elements belong to a finite field GF(q), where q > 2. The construction of these codes is expressed as a set of parity-check equations over GF(q), where a single parity equation involving d_c codeword symbols is:

Σ_{k=1..d_c} h_j,k x_k = 0,

where h_j,k are the nonzero values of the j-th row of H and the elements of GF(q) are {0, α^0, α^1, ..., α^(q−2)}. The dimension of the matrix H is M × N, where M is the number of parity-Check Nodes (CN) and N is the number of Variable Nodes (VN), i.e. the number of GF(q) symbols in a codeword. A codeword is denoted by X = (x_1, x_2, ..., x_N), where each x_k, k = 1...N, is a GF(q) symbol represented by m = log2(q) bits: x_k = (x_k,1 x_k,2 ... x_k,m).

The Tanner graph of an NB-LDPC code is usually much sparser than that of its binary counterpart for the same rate and binary code length ([33], [34]). Also, the best error-correcting performance is obtained with the lowest possible VN degree, d_v = 2. These so-called ultra-sparse codes [33] reduce the effect of stopping and trapping sets, and thus the message-passing algorithms become closer to optimal Maximum Likelihood decoding. For this reason, all the codes considered in this paper are ultra-sparse. To obtain both good error-correcting performance and a hardware-friendly LDPC decoder, we consider the optimized non-binary protograph-based codes [35] [36] with d_v = 2 proposed by D. Declercq et al. [37]. These matrices are designed to maximize the girth of the associated bipartite graph and to minimize the multiplicity of the cycles of minimum length [38]. This NB-LDPC matrix structure is similar to that of most binary LDPC standards (DVB-S2, DVB-T2, WiMAX, ...), and allows different decoder schedulings: parallel or serial node processors2. Finally, the nonzero values of H are limited to only d_c distinct values and each parity check uses exactly those d_c distinct GF(q) values. This limitation in the choice of the h_j,k values reduces the storage requirements.

B. Min-Sum algorithm for NB-LDPC decoding

The EMS algorithm [15] is an extension of the Min-Sum ([39] [40]) algorithm from binary to NB LDPC codes. In this

2The final choice will be determined by the latency and surface constraints.


section we review the principles of the Min-Sum algorithm, starting with the definition of the NB LLR values and the exchanged messages in the Tanner graph.

1) Definition of NB LLR values: Considering a BPSK modulation and an Additive White Gaussian Noise (AWGN) channel, the received noisy codeword Y consists of N × m binary symbols independently affected by noise: Y = (y_1,1 y_1,2 ... y_1,m y_2,1 ... y_N,m), where y_k,i = B(x_k,i) + w_k,i, k ∈ {1, 2, ..., N}, i ∈ {1, ..., m}, w_k,i is the realization of an AWGN of variance σ^2, and B(x) = 2x − 1 represents the BPSK modulation that associates symbol '−1' to bit 0 and symbol '+1' to bit 1.

The first step of the Min-Sum algorithm is the computation of the LLR value for each symbol of the codeword. With the hypothesis that the GF(q) symbols are equiprobable, the LLR value L_k(x) of the k-th symbol is given by [21]:

L_k(x) = ln( P(y_k | x~_k) / P(y_k | x) )    (1)

where x~_k is the symbol of GF(q) that maximizes P(y_k|x), i.e. x~_k = arg max_{x ∈ GF(q)} P(y_k|x).

Note that L_k(x~_k) = 0 and, for all x ∈ GF(q), L_k(x) ≥ 0. Thus, when the LLR of a symbol increases, its reliability decreases. This LLR definition avoids the need to re-normalize the messages after each node update computation and reduces the effect of quantization when considering a finite-precision representation of the LLR values.

As developed in [32], L_k(x) can be expressed as:

L_k(x) = Σ_{i=1..m} [ (y_k,i − B(x_i))^2 / (2σ^2) − (y_k,i − B(x~_k,i))^2 / (2σ^2) ]    (2)

       = (1 / (2σ^2)) Σ_{i=1..m} 2 y_k,i (B(x~_k,i) − B(x_i)).    (3)

Using (3), L_k(x) can be written as:

L_k(x) = Σ_{i=1..m} |LLR(y_k,i)| Δ_k,i,    (4)

where Δ_k,i = x_i XOR x~_k,i, i.e. Δ_k,i = 0 if x_i and x~_k,i are equal, 1 otherwise, and LLR(y_k,i) = 2 y_k,i / σ^2 is the LLR of the received bit y_k,i.
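Equation (4) reduces each symbol LLR to a sum of bit-level LLR magnitudes over the bits where x differs from the hard-decision symbol x~_k. As an illustration, a small Python sketch (function and variable names are ours, not the paper's):

```python
import itertools

def symbol_llrs(y, sigma2):
    """Compute L_k(x) for all x in GF(2^m) from the m noisy BPSK samples
    y = (y_1, ..., y_m), following Eq. (4): L_k(x) is the sum of |LLR(y_i)|
    over the bit positions where x differs from the hard-decision symbol."""
    m = len(y)
    # Hard-decision symbol x~_k: bit i is 1 if y_i > 0 (B maps bit 1 to +1).
    x_tilde = tuple(1 if yi > 0 else 0 for yi in y)
    bit_llr_mag = [abs(2.0 * yi / sigma2) for yi in y]  # |LLR(y_k,i)|
    llrs = {}
    for bits in itertools.product([0, 1], repeat=m):
        # Delta_{k,i} = x_i XOR x~_{k,i} selects the differing bits.
        llrs[bits] = sum(mag for mag, b, bt in zip(bit_llr_mag, bits, x_tilde)
                         if b != bt)
    return llrs, x_tilde

llrs, x_tilde = symbol_llrs([0.8, -1.1, 0.3], sigma2=1.0)
assert llrs[x_tilde] == 0.0              # L_k(x~_k) = 0
assert all(v >= 0.0 for v in llrs.values())
```

The two asserted properties are exactly the ones the text relies on: the most likely symbol has LLR 0 and all other LLRs are non-negative.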

2) Definition of the edge messages: The Check-to-Variable (C2V) and the Variable-to-Check (V2C) messages associated to edge h_j,k are denoted C2V_jk and V2C_jk, respectively. Since the degree of the VNs is equal to 2, we denote the two C2V (respectively V2C) messages associated to the variable node k (k = 1...N) as C2V_j_k(1)k and C2V_j_k(2)k (respectively V2C_j_k(1)k and V2C_j_k(2)k), where j_k(1) and j_k(2) indicate the positions of the two nonzero values of the k-th column of matrix H. Similarly, the d_c C2V (respectively V2C) messages associated to CN j (j = 1...M) are denoted C2V_jk_j(v) (respectively V2C_jk_j(v)), v = 1...d_c, where k_j(v) indicates the position of the v-th nonzero value in the j-th row of H.

3) The Min-Sum decoding process: The Min-Sum algorithm is performed on the Tanner bipartite graph. At a high level, this algorithm does not differ from the classical binary decoding algorithms that use the horizontal shuffle scheduling [41] or the layered decoder [42] principle.

The decoding process iterates n_it times, and each iteration performs M CN updates and M × d_c VN updates. During the last iteration a decision is taken on each symbol; the decoded symbol is denoted ĉ_k and the decided codeword Ĉ. The codeword decision performed in the VN processors concludes the decoding process, and the decoder then sequentially outputs Ĉ to the next block of the communication chain.

The steps of the algorithm can be described as:

Initialisation: generate the intrinsic messages {L_k(x)}, x ∈ GF(q), k = 1...N, and set V2C_j_k(v)k = L_k for k = 1...N and v = 1, 2.

Decoding iterations: for 1 to the maximum number of iterations,

for (j = 1...M) do

1) Retrieve in parallel from memory the V2C_jk_j(v), v = 1...d_c, messages associated to CN j.

2) Perform the CN processing to generate d_c new C2V_jk_j(v), v = 1...d_c, messages3.

3) For each variable node k_j(v) connected to CN j, update the second V2C message using the new C2V message and the L_k intrinsic message.

Final decision: for each variable node, make a decision ĉ_k using the C2V_j_k(1)k and C2V_j_k(2)k messages and the intrinsic message.

4) VN equations in the Min-Sum algorithm: Let L(x), V2C(x) and C2V(x) be respectively the intrinsic, V2C and C2V LLR values associated to symbol x. The decoding equations are:

Step 1: VN computation: for all x ∈ GF(q),

V2C(x) = C2V(x) + L(x)    (5)

Step 2: Determination of the minimum V2C LLR value:

x̂ = arg min_{x ∈ GF(q)} {V2C(x)}    (6)

Step 3: Normalization:

V2C(x) = V2C(x) − V2C(x̂)    (7)

5) CN equations in the Min-Sum algorithm: With the forward-backward algorithm [43], a CN of degree d_c can be decomposed into 3(d_c − 2) ECNs, where an ECN has two input messages U and V and one output message E (see Figure 7):

E(x) = min_{(x_u, x_v) ∈ GF(q)^2, x_u ⊕ x_v = x} {U(x_u) + V(x_v)}    (8)

where ⊕ is the addition in GF(q).

3Note that the multiplicative coefficients associated to the edge of the Tanner graph are included in the CN processor.
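For reference, equation (8) can be read as a min-convolution under the GF(q) addition (bitwise XOR in binary-extension fields). A direct, unoptimized sketch over the full field, illustrating the O(q^2) cost that motivates the EMS truncation (names are illustrative):

```python
def ecn_full(U, V, q):
    """Elementary check node, Eq. (8): E(x) = min over x_u XOR x_v = x
    of U(x_u) + V(x_v). U, V and E are lists of q LLR values indexed by
    the GF(2^m) symbol; GF addition is bitwise XOR. The double loop costs
    O(q^2), which is why the EMS truncates messages to n_m values."""
    E = [float("inf")] * q
    for xu in range(q):
        for xv in range(q):
            x = xu ^ xv  # GF(2^m) addition
            E[x] = min(E[x], U[xu] + V[xv])
    return E

E = ecn_full([0.0, 2.0, 3.0, 5.0], [0.0, 1.0, 4.0, 6.0], q=4)
assert E[0] == 0.0  # x_u = x_v = 0 gives U(0) + V(0) = 0
```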


6) Decision-making equations in the Min-Sum algorithm: The decision ĉ_k, k = 1...N, is expressed as:

ĉ_k = arg min_{x ∈ GF(q)} {C2V_j_k(1)k(x) + C2V_j_k(2)k(x) + L_k(x)}    (9)

C. The EMS algorithm

The main characteristic of the EMS is to reduce the size of the edge messages from q to n_m (n_m << q) by considering the sorted list of the n_m smallest LLR values (i.e. the set of the n_m most probable symbols) and by giving a default LLR value to the others.

Let λ_k be the EMS message associated to the k-th symbol x_k knowing y_k (the so-called intrinsic message). λ_k is composed of n_m couples (λ_k(l)_L, λ_k(l)_GF), l = 1...n_m, where λ_k(l)_GF is a GF(q) element and λ_k(l)_L is its associated LLR: L_k(λ_k(l)_GF) = λ_k(l)_L. The LLRs verify λ_k(1)_L ≤ λ_k(2)_L ≤ ... ≤ λ_k(n_m)_L, and λ_k(1)_L = 0. In the EMS, a default LLR value λ_k(n_m)_L + O is associated to each symbol of GF(q) that does not belong to the set {λ_k(l)_GF}, l = 1...n_m, where O is a positive offset whose value is determined to maximize the decoding performance [15].

The structure of the V2C and C2V messages is identical to the structure of the intrinsic message λ_k. The output message of the VN contains only, in sorted order, the n_m smallest LLR values V2C(l)_L, l = 1...n_m, and their associated GF symbols V2C(l)_GF, l = 1...n_m. Similarly, the output message of the CN contains only the n_m smallest LLR values C2V(l)_L, l = 1...n_m (sorted in increasing order), their associated GF symbols C2V(l)_GF, l = 1...n_m, and the default LLR value C2V(n_m)_L + O.

Except for the approximation of the exchanged messages, the EMS algorithm does not differ from the Min-Sum algorithm, i.e., it corresponds to equations (5) to (9).
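As a software illustration of this truncation (helper name and the explicit normalization are ours):

```python
def truncate_message(llrs, n_m, offset):
    """Build an EMS message from a full LLR table {gf_symbol: LLR}:
    keep the n_m smallest LLRs in sorted order, normalized so that the
    first LLR is 0 (Section II-C). Symbols outside the kept list
    implicitly take the default value llr_list[-1] + offset."""
    items = sorted(llrs.items(), key=lambda kv: kv[1])[:n_m]
    gf_list = [gf for gf, _ in items]
    llr_list = [llr - items[0][1] for _, llr in items]  # lambda_k(1)_L = 0
    default_llr = llr_list[-1] + offset                 # lambda_k(n_m)_L + O
    return gf_list, llr_list, default_llr

gf, llr, default = truncate_message({0: 0.0, 1: 3.5, 2: 1.2, 3: 7.0},
                                    n_m=3, offset=1.0)
assert gf == [0, 2, 1] and llr == [0.0, 1.2, 3.5] and default == 4.5
```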

III. ARCHITECTURE AND DECODING SCHEDULING

This section presents the architecture of the decoder and its characteristics in terms of parallelism, throughput and latency.

A. Level of parallelism

We propose a serial architecture that implements a horizontal shuffled scheduling with a single CN processor and d_c VN processors. The choice of a serial architecture is motivated by surface constraints, as our final objective is to include the decoder in an existing wireless demonstrator platform [44] (see section VI). The horizontal shuffled scheduling provides faster convergence because, within one iteration, a CN processor already benefits from the processing of the previous CNs. This simple serial design constitutes a first FPGA implementation, to be considered as a reference for future parallel or partially parallel enhanced architecture designs.

B. The overall decoder architecture

The overall view of the decoder architecture is presented in Figure 1. A single CN processor is connected to d_c VN processors and d_c RAM V2C memory banks. The CN processor receives in parallel d_c V2C messages and provides, after computation, d_c C2V messages. The C2V messages are then sent to the VN processors to compute the V2C messages of their second edge.

Fig. 1. Overall decoder architecture

Note that, for the sake of simplicity, we have omitted the description of the permutation nodes that implement the GF(q) multiplications. The effect of this multiplication is to replace the GF(q) value V2C_GF(l) by V2C_GF(l) × h_j,k, where the GF multiplication requires only a few XOR operations.

1) Structure of the RAMs: The channel information Y and the V2C messages associated to the N variables are stored in d_c memory banks RAMy and RAM V2C, respectively4. Each memory bank contains information related to N/d_c variables.

In the case of RAMy, the received values (y_k,i), i = 1...m, associated to the variable x_k are stored in m consecutive memory addresses, each of size n_y bits, where n_y is the number of bits of the fixed-point representation of y_k,i (i.e. the size of RAMy is (N/d_c × m) words of n_y bits). Similarly, each RAM V2C is also associated to N/d_c variables. The information V2C_k related to x_k is stored in n_m consecutive memory addresses, each location containing a couple (V2C_L(l), V2C_GF(l)), i.e., two binary words of size (n_b, m), where n_b is the number of bits to encode the V2C_L(l) values. To reduce memory requirements, for each symbol x_k, only the channel samples y_k,i and the extrinsic messages are stored in the RAM blocks.

The intrinsic LLRs are stored after their computation, but they are overwritten by the V2C messages during the first decoding iteration. Each time an intrinsic LLR is required for the VN update, it is re-computed in the VN processor by the LLR generator circuit. This approach avoids the memorization of all the LLRs of the input message (q values) and thus saves significant area when considering high-order Galois fields (q ≥ 64).

The partition of the N variables among the d_c memories is a coloring problem: the d_c variables associated to a given CN must each be stored in a different memory bank to avoid memory access conflicts (i.e. each memory bank must have a different color). A general solution to this problem has been

4In this paper, we represent two separate RAMs for the sake of clarity.

However, in the implementation, RAMy and RAM V2C are merged into a single RAM.


studied in [45]. Since the NB-LDPC matrices considered in our study are highly structured (see [37]), the problem of partitioning is solved by the structure of the code.

2) Wormhole layer scheduling: The proposed architecture considers a wormhole scheduling. The decoding process starts reading the stored Y and V2C information sequentially and sends, in m + n_m clock cycles, the whole V2C message to the CN. After a maximum delay L_CN, the CN starts to send the C2V messages to the VN processors, again with a value C2V(l), l = 1...n_m, at each clock cycle5.

After a delay of L_VN (see section IV-B), the VNs send the new V2C messages to the memory. The process is pipelined, i.e., every Δ = (m + L_CN + n_m) clock cycles, a new CN processing is started. The total time to process n_it decoding iterations is:

L_dec = n_it × M × Δ + L_VN + n_m    (10)

where L_dec is given in clock cycles. Figure 2 illustrates the scheduling of the decoding process.

Fig. 2. Scheduling of the global architecture
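Equation (10) is plain arithmetic on the pipeline parameters; a small sketch with made-up parameter values (not the paper's operating point):

```python
def decoding_latency(n_it, M, m, L_CN, n_m, L_VN):
    """Total decoder latency in clock cycles, Eq. (10): a new CN
    processing starts every Delta = m + L_CN + n_m cycles, and the
    last V2C write-back adds L_VN + n_m cycles at the end."""
    delta = m + L_CN + n_m
    return n_it * M * delta + L_VN + n_m

# Illustrative numbers only: GF(64) so m = 6; the other values are guesses.
cycles = decoding_latency(n_it=8, M=48, m=6, L_CN=10, n_m=12, L_VN=25)
assert cycles == 8 * 48 * 28 + 25 + 12
```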

3) The decoding steps: The decoding process iterates n_it times, performing M CN updates and M × d_c VN updates at each iteration. During the last iteration a decision is taken on each symbol. The codeword decision is performed in the VN processors. This concludes the decoding process, and the decoder then sequentially outputs Ĉ to the next block of the communication chain. Note that the interface of the decoder is then rather simple:

1) Load the y_k and store them in RAMy (N × m clock cycles).

2) Compute the intrinsic information from y_k to initialize the V2C messages.

3) Perform the n_it decoding iterations.

4) During the second edge processing of the last iteration, use the decision process to determine ĉ_k.

5) Output the decoded message (N clock cycles) and wait for the new input codeword to decode.

IV. VARIABLE NODE ARCHITECTURE

Although most papers on NB-LDPC decoder architectures focus on the CN, the implementation of the VN architecture

5The time scheduling of the C2V message generation is not fully regular (see section V-C), but we consider a global latency L_CN such that the last element C2V(n_m) arrives after L_CN + n_m clock cycles.

Fig. 3. Variable node architecture of the EMS NB-LDPC decoder

is almost as complex, if not more, than that of the CN in terms of control. In the proposed decoder, the VN processor works in three different steps: 1) the intrinsic generation; 2) the VN update; and 3) the codeword decision. During the first step, prior to the decoding iterations, the Intrinsic Generation Module (IGM) circuit is active and generates the intrinsic messages (λ_k), k = 1...N, from the received y_k samples.

During the VN update, all the blocks of the VN processor, except the Decision block, are active. Finally, during the last decoding iteration, the Decision block is active (see Figure 3).

A. The Intrinsic Generator Module (IGM)

The role of the IGM is to compute the λ_k intrinsic messages.

In [32], the authors propose an efficient systolic architecture to perform this task. The purpose is to iteratively construct the intrinsic LLR list considering, at the beginning, only the first coordinate, then the first two coordinates, and so on, up to the complete computation of the intrinsic vector. The systolic architecture works as a FIFO that can be fed when needed. Once the input symbols y_k,i are received, and after a delay of m + 2 clock cycles (m = log2(q)), the IGM generates a new output λ_k(l) at every clock cycle. When pipelined, this module generates a new intrinsic vector every n_m + 1 clock cycles. Each intrinsic message is stored in the corresponding V2C memory location in order to be used during the first step of the iterative decoding process.

In the present design, in order to minimize the amount of memory, the intrinsic messages are not stored but re- generated when needed, i.e., during each VN update of the iterative decoding process. This choice was dictated by the limited memory resources of the existing FPGA platform. In another context, it could be preferable to generate only once the intrinsic messages, store them in a specific memory and retrieve them when needed.

B. The VN update

In the VN processor, the blocks involved in the VN update are the following: the elementary LLR generator (eLLR), the Sorter, the IGM, the Flag memory and the Min block.

The task of the VN update is simple: it extracts, in sorted order, the n_m smallest values, and their associated GF(q) symbols, from the set S = {C2V_L(x) + L(x)} indexed by x ∈ GF(q), to generate the new V2C message.


The set of GF(q) values can be divided into two disjoint subsets S_C2V and S̄_C2V, with S_C2V the subset of GF(q) defined as S_C2V = {C2V_GF(l)}, l = 1...n_m. In this set, C2V_L(x) = C2V_L(l), with l such that C2V_GF(l) = x. The second set, S̄_C2V, contains the symbols not in S_C2V. If x ∈ S̄_C2V, then C2V_L(x) takes the default value C2V_L(n_m) + O (see section II-C). The generation of S_C2V is done serially in 3 steps:

1) C2V_GF(l) is sent to the eLLR module to compute L(C2V_GF(l)) according to (4). The value of C2V_GF(l) is also used to set a flag from 0 to 1 in the Flag memory of size q = 2^m, to indicate that this GF(q) value now belongs to S_C2V. To be specific, the Flag memory is implemented as two memory blocks in parallel, working in ping-pong mode to allow the pipelining of two consecutive C2V messages without conflicts.

2) L(C2V_GF(l)) is added to C2V_L(l) to generate S_C2V(l). Note that S_C2V is no longer sorted in increasing order.

3) The Sorter reorders serially the values in S_C2V in increasing order. The architecture of this Sorter is described in section IV-C.

The IGM is used to generate the second set S̄_C2V. Each output value λ(l)_L of the IGM is first added to C2V_L(n_m) + O. Then, if λ(l)_GF belongs to S_C2V (i.e. the flag value at address λ(l)_GF in the Flag memory equals '1'), the value is discarded and a new value λ(l+1)_L is provided by the IGM component to the Min component.

The Min component serially selects the input with the minimum LLR value from S_C2V and S̄_C2V. Each time it retrieves a value from a set, it triggers the production of a new value of this set, until all the n_m values of V2C are generated.
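The three-step generation of S_C2V, the offset stream S̄_C2V and the final Min selection can be modeled in software as a lazy merge of two sorted streams. A behavioral sketch (names and data layout are ours; the hardware does this serially, one couple per clock cycle):

```python
import heapq

def intrinsic_llr_of(intrinsic, gf):
    """Stand-in for the eLLR block: look up L(x) for one GF symbol.
    `intrinsic` is the full sorted list of (llr, gf) couples."""
    return next(llr for llr, g in intrinsic if g == gf)

def vn_update(c2v_gf, c2v_llr, intrinsic, offset, n_m):
    """Sketch of the VN update (Section IV-B): build S_C2V from the C2V
    couples plus their intrinsic LLRs (then re-sorted, mimicking the
    Sorter), build S_bar from the intrinsic couples not flagged in
    S_C2V shifted by the default LLR C2V_L(n_m) + O, and let the Min
    component output the n_m smallest couples in sorted order."""
    flags = set(c2v_gf)                            # the Flag memory
    default = c2v_llr[-1] + offset                 # C2V_L(n_m) + O
    s_c2v = sorted((c2v_llr[l] + intrinsic_llr_of(intrinsic, c2v_gf[l]),
                    c2v_gf[l]) for l in range(len(c2v_gf)))
    s_bar = [(llr + default, gf) for llr, gf in intrinsic if gf not in flags]
    merged = heapq.merge(s_c2v, s_bar)             # the Min component
    return [next(merged) for _ in range(n_m)]      # first n_m couples only

intrinsic = [(0.0, 5), (1.0, 2), (2.5, 0), (4.0, 7)]  # toy sorted intrinsic
out = vn_update(c2v_gf=[2, 0], c2v_llr=[0.0, 1.5],
                intrinsic=intrinsic, offset=1.0, n_m=3)
assert out[0][0] <= out[1][0] <= out[2][0]  # output is sorted
```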

C. The architecture of the Sorter block in the VN

The Sorter block in the VN processor is composed of ⌈log2(n_m)⌉ stages, where ⌈x⌉ is the smallest integer greater than or equal to x (see Figure 4). The i-th (i = 1, ..., ⌈log2(n_m)⌉) stage serially receives two sorted lists of size 2^(i−1) and provides a sorted list of size 2^i. The first received list goes into FIFO H and the second list goes into FIFO L. Then, the Min Select block compares the first values of the two FIFOs, pulls the minimum one from the corresponding FIFO and outputs it. In practice, a stage starts to output the sorted list as soon as the first element of the second list is received. The latency of a stage is then 2^(i−1) + 1 clock cycles, plus one cycle for the pipeline, i.e. 2^(i−1) + 2 clock cycles. The size of FIFO H is doubled (i.e. 2 × 2^(i−1)) in order to allow receiving a new input list while outputting the current sorted list.

As an example, to order a list of n_m = 16 values, the Sorter consists of 4 stages. The first stage receives 16 sequences of size 2^0 = 1 and outputs 8 sorted lists of size 2^1 = 2 (i.e. the elements are ordered by couples). The second stage outputs 4 lists of size 2^2 = 4, the third stage outputs 2 lists of size 8 and, finally, the last stage outputs the whole sorted list of size 2^4 = 16. The global latency of the Sorter is then expressed

Fig. 4. Architecture of the Sorter block in the VN processor

as:

L_sorter(n_m) = Σ_{i=1..⌈log2(n_m)⌉} (2^(i−1) + 2)    (11)

Note that the Sorter is able to process continuously blocks whose size is a power of two; i.e., for n_m = 12, it is able to process a new block every 16 clock cycles and the latency is L_sorter(n_m) = 23.

D. Decision circuit architecture

The architecture of the simplified codeword decision circuit is presented in Figure 5. The optimal decoding is given by:

ˆ

ck = arg min

x∈GF(q){C2Vjkk(1)(x)L+C2Vjkk(2)(x)L+L(x)} (12) Since the decision is done during the second branch update, we can replace in equation (12) C2Vjkk(1)(x)L +L(x) by V2Cjkk(2)(x)L (see equation (5)). Thus, we can write:

ˆck= arg min

x∈GF(q){V2Cjkk(2)(x)L+C2Vjkk(2)(x)L} (13) The processing of this equation is rather complex, since it requires either an exhaustive search for all values of x, or a complex Content Addressable Memory (CAM) to search for the common GF(q) values in the V2C and C2V messages. At this point, any method leading to a hardware simplification without significant performance degradation can be accepted.

In a very pragmatic way, we tried several methods and we propose to replace, , in equation (13), x GF(q) by x {V2Cjk

k(2)(m)GF}m=1,2,3 in order to reduce the size of the CAM fromnm to 3.

LetS0 be the set of the common values between the C2V and V2C messages, indexed bym:

S0={{C2Vjkk(2)(l)}GFl=1...nm}∩{{V2Cjkk(2)(m)}GFl=1,2} (14) The decided symbolcˆk is defined as:

ˆ

ck = arg min{V2Cjkk(2)(3)L;C2Vjkk(2)(l)L+V2Cjkk(2)(m)L} (15) wherearg minrefers to the associated GF(q) value.

Figure 5 presents the architecture of the Decision circuit and Figure 6 shows performance simulation of the decision circuit comparing CAM sizes 3 and 12 for 8 and 20 decoding iterations. Note that reducing the CAM size from 12 to 3 does not introduce any performance loss when considering 20 decoding iterations.
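A behavioral sketch of the size-3 CAM decision of equations (14)-(15) (list-based; names and the exact fallback rule are our reading of the equations):

```python
def decide_symbol(c2v, v2c):
    """Sketch of the simplified decision, Eq. (14)-(15): match the C2V
    message against only the 3 best V2C symbols (the size-3 CAM) and
    pick the GF value minimizing the summed LLR; if no C2V symbol
    matches (or no match beats it), fall back to the third-best V2C
    symbol V2C(3). c2v and v2c are lists of (llr, gf) sorted by LLR."""
    cam = {gf: llr for llr, gf in v2c[:3]}    # the size-3 CAM
    best_llr, best_gf = v2c[2]                # fallback candidate V2C(3)
    for c_llr, gf in c2v:
        if gf in cam and c_llr + cam[gf] < best_llr:
            best_llr, best_gf = c_llr + cam[gf], gf
    return best_gf

v2c = [(0.0, 4), (0.5, 1), (2.0, 6), (3.0, 0)]
c2v = [(0.0, 1), (1.0, 6), (2.5, 3)]
assert decide_symbol(c2v, v2c) == 1  # matched symbol 1: 0.0 + 0.5 beats all
```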


Fig. 5. Architecture of the codeword decision circuit

[Plot: FER versus $E_b/N_0$ (dB), from $10^{-6}$ to $10^{-2}$, with curves for CAM sizes 12 and 3 at 20 and 8 decoding iterations]

Fig. 6. Simulation of the decoder performance for different CAM sizes in the decision circuit

E. The latency of the VN

The critical path in the VN is the one containing the Sorter block, because this block waits for the arrival of the last C2V message before starting its processing. The latency $L_{VN}$ is then determined by the latency of the Sorter, i.e. $L_{sorter}$, plus one clock cycle for the adder and another one for the Min block:

$$L_{VN} = L_{sorter}(n_m) + 2 \qquad (16)$$

V. THE CHECK NODE PROCESSOR

The CN processor receives $d_c$ messages $V2C_{jk_j}(v)$, performs its update based on the parity test described in equation (8), and generates $d_c$ messages $C2V_{jk_j}(v)$ to be sent to the corresponding $d_c$ VNs. The processing of the received messages is executed according to the Forward-Backward algorithm [43], which splits the data processing into 3 layers of $d_c - 2$ ECNs, as shown in Figure 7. The main advantage of this architecture is that it can easily be modified to implement different values of $d_c$ (i.e., to support different code rates).
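The forward-backward schedule itself can be sketched independently of the ECN internals. In the toy model below the elementary operator `ecn` is a parameter; XOR over plain integers stands in for the real vector-message ECN of equation (8), purely to check the wiring, and the function name is ours.

```python
def check_node_update(v2c, ecn):
    """Forward-backward CN schedule: output j combines all inputs except
    input j, using 3 layers of dc - 2 elementary check nodes (ECNs)."""
    dc = len(v2c)
    # Forward layer (dc - 2 ECNs): fwd[j] combines inputs 0..j.
    fwd = [v2c[0]]
    for j in range(1, dc - 1):
        fwd.append(ecn(fwd[-1], v2c[j]))
    # Backward layer (dc - 2 ECNs), built from the last input down.
    bwd = [v2c[-1]]
    for j in range(dc - 2, 0, -1):
        bwd.append(ecn(bwd[-1], v2c[j]))
    bwd.reverse()  # now bwd[j] combines inputs j+1..dc-1
    # Merge layer (dc - 2 ECNs): C2V_j = fwd[j-1] combined with bwd[j].
    out = [bwd[0]]
    for j in range(1, dc - 1):
        out.append(ecn(fwd[j - 1], bwd[j]))
    out.append(fwd[-1])
    return out
```

With $d_c = 6$ inputs, exactly $3 \times (d_c - 2) = 12$ ECN evaluations are performed, as in Figure 7.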

Each ECN receives two vector messages U and V, each composed of $n_m$ (LLR, GF) couples, and outputs a vector message E whose elements are defined by equation (8) [15] [16]. This equation corresponds to extracting the $n_m$ minimum values of a matrix $T_\Sigma$, defined as $T_\Sigma(i,j) = U(i) + V(j)$ for $(i,j) \in [1, n_m]^2$. In [16], the authors propose the use of a sorter of size $n_m$, which gives an $O(n_m^2)$ computational complexity and constitutes the bottleneck of the EMS algorithm. In order to reduce this computational complexity, two simplified algorithms were proposed [29] [30]. In [29], the Bubble-Check algorithm simplifies the ECN processing by

Fig. 7. Architecture scheme of a forward/backward CN processor with $d_c = 6$. The number of ECNs is $3 \times (d_c - 2)$

Fig. 8. L-Bubble Check exploration of matrix $T_\Sigma$. The $n_{bub} = 4$ values in the sorter are initialized with the matrix values $T_\Sigma(i,1)$, for $i = 1, \ldots, 4$, and only a maximum of $4 \times n_m - 4$ values in $T_\Sigma$ are considered in the ECN processing. $T_\Sigma(i,j) = U(i) + V(j)$

exploiting the properties of the matrix $T_\Sigma$ and by considering a two-dimensional formulation of the problem. This results in a reduction of the size of the sorter, theoretically to the order of $\sqrt{n_m}$. It is also shown in [29] that no performance loss is introduced when considering a sorter size smaller than the theoretical one.

In [30], the authors assume that the most reliable symbols are mainly distributed in the first two rows and first two columns of matrix $T_\Sigma$ and propose the so-called L-Bubble Check, which presents an interesting performance/complexity trade-off for the EMS ECN processing. As depicted in Figure 8, the $n_{bub} = 4$ values in the sorter are initialized with the matrix values $T_\Sigma(i,1)$, $i = 1, \ldots, 4$, and only a maximum of $4 \times n_m - 4$ values in $T_\Sigma$ are considered in the ECN processing. Simulation results provided in [30] show that the complexity reduction introduced by the L-Bubble Check algorithm does not cause any significant performance loss. For this reason, we adopt the L-Bubble Check algorithm for the implementation of the present NB-LDPC decoder.
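The candidate region of the L-Bubble Check is easy to model in software. The sketch below is behavioral only: it enumerates the whole L-shaped region (first two rows and first two columns of $T_\Sigma$, i.e. $4 \times n_m - 4$ entries) at once, whereas the hardware visits it serially through the $n_{bub} = 4$ bubble sorter. It assumes (LLR, GF) lists of length $n_m$ sorted by increasing LLR, models GF($2^p$) addition as XOR of symbol indices, and the function name is ours.

```python
import heapq

def ecn_l_bubble_region(U, V, nm):
    """L-Bubble-restricted ECN model: score only the first two rows and
    first two columns of T(i, j) = U[i] + V[j] (4*nm - 4 candidates),
    keep the best LLR per output GF value, return the nm smallest."""
    region = {(i, j) for i in range(2) for j in range(nm)}   # rows 1-2
    region |= {(i, j) for i in range(nm) for j in range(2)}  # columns 1-2
    best = {}
    for i, j in region:
        llr = U[i][0] + V[j][0]
        gf = U[i][1] ^ V[j][1]          # GF(2^p) addition of the symbols
        if gf not in best or llr < best[gf]:
            best[gf] = llr
    return heapq.nsmallest(nm, ((llr, gf) for gf, llr in best.items()))
```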

A. The L-Bubble ECN Architecture

The L-Bubble ECN architecture is depicted in Figure 9.

The input values are stored in two RAMs U and V to be read during the ECN processing. At each clock cycle, each RAM
