Extended Barrel-Shifter for Versatile QC-LDPC Decoders

(1)

HAL Id: hal-02441491

https://hal.archives-ouvertes.fr/hal-02441491

Submitted on 15 Jan 2020

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

publics ou privés.

Decoders

Emmanuel Boutillon, Hassan Harb

To cite this version:

Emmanuel Boutillon, Hassan Harb. Extended Barrel-Shifter for Versatile QC-LDPC Decoders. IEEE

wireless communications letters, IEEE comsoc, In press. �hal-02441491�

(2)

Extended Barrel-Shifter for Versatile QC-LDPC

Decoders

Emmanuel Boutillon Senior Member, IEEE, and Hassan Harb

Universit´e de Bretagne-Sud, Lab-STICC, UMR 6285 CNRS – Lorient, France

emmanuel.boutillon@univ-ubs.fr and hassan.harb@univ-ubs.fr

Abstract—In this paper, we present a Barrel-Shifter of size n extended by an additional layer that can handle any circular permutation on a vector of size m, m ≤ n, thanks to a specific initial positioning of the data. The construction of the so-called Extended Barrel-Shifter is motivated by the hardware decoder constraint related to the Low-Density Parity-Check (LDPC) code recently adopted for the 5G mobile standard. The proposed algorithm requires 42 % less multiplexers than the best state-of-the-art solution for n = 384 of the 5G LDPC standard. This proposal is also able to process several inputs in parallel without extra hardware cost.

Index Terms—Barrel-Shifter, circular-shift network, LDPC decoder, 5G standard.

I. INTRODUCTION

A Barrel-Shifter (BS) is a hardware component that allows to perform any circular rotation on a vector of size n. BS’s are key components of Quasi-Cyclic Low-Density Parity-Check (QC-LDPC) decoder [1]. In fact, QC-LDPC codes are hardware friendly error control codes based on a prototype matrix where each null element is replaced by an n × n null matrix and each non-null element is replaced by an n×n circularly shifted identity matrix. This particular structure allows implementing a decoder with n Processing Elements (PE) in parallel without any memory access conflict, as shown in Fig. 1. The BS of size n inserted between the memories and the PE allows the reordering of the data sent from the RAM’s to PE’s, according to the structure of the QC-LDPC code.

Fig. 1. Parallel structure of a QC-LDPC decoder.

Recently, a new family of QC-LDPC codes has been adopted for the next 3GPP standard (the so-called 5G) [2]. The particularity of this standard is that the expansion factor sizes of the QC-LDPC code can take 51 values ranging between m = 2 and m = 384 according to the code rate and the size of the code. This new characteristic translates in hardware by architectures that are able to perform rotations on vectors

of variable size. The idea is to set the number n of PE’s to the maximum expansion factor of the code to achieve the maximum decoding throughput. When the effective expansion factor m of the QC-LDPC code is smaller than n, then only m PE’s among the n PE’s are used. This property is also required for a design of a versatile QC-LDPC decoder able to process several types of codes (WIFI, DVB-S2 for example).

A Benes Network (BN) of size n, with n a power of 2 is able to perform any kind of permutation [3]. In [4], a network inspired from the Benes’s network is proposed to perform permutations for any value of n without extra cost. However, since only a reduced subset of permutations (i.e., rotations) is required, those solutions are not optimal. The BN requires a total of DBN layers of multiplexers (MUX’s) in its critical

path and a hardware complexity given by CBN MUX’s with

DBN= 2 log2(n) − 1, (1)

CBN= (2 log2(n) − 1)n. (2)

This high number of layers introduces a high propagation delay, thus reducing the clock frequency of the design. Pipelin-ing the BN has also its own drawback, since it introduces latency in a processing loop. The work in [5] proposes to prune the BN in order to tailor it exactly to the requirement of the application. The resulting architecture can be significantly smaller than the initial BN when the number of required expansion factor sizes is low, which is not the case for the 5G standard. Moreover, the resulting structure inherits from the long path delay of the mother BN. For a low number of distinct sizes, [6] proposed a two stage network where the first stage is composed of an n/d parallel BS’s, each of size d, with d the greatest common divisor of all the required sizes. This method is effective when d is high enough but of little help for the 5G where d = 1 (expansion factors of 2 and 3 are both required).

The best reference is given in [7] where a new architecture of a flexible BS called Quasi-Cyclic LDPC Shift Network (QSN) is proposed. When the vector size m is smaller than n, QSN proposes to select the first m PE’s and the first m RAM’s of the architecture and put the other n−m PE’s and RAM’s in the idle mode (see Fig. 2.a). The QSN interconnection network is then able to process any cyclic rotation between PE’s and RAM’s of index 0, 1, . . . , m − 1, m ≤ n thanks to a clever association of two partial size-n BS’s and a final MUX’s stage. Its overall complexity, CQSN in terms of MUX’s is given as

CQSN= 2(dlog2(n)e + 1)n − 2dlog2

(n)e+1_{+ 1,}

(3)

Fig. 2. Comparison between QSN and EBS principles. a) with QSN, only the first m PE are used. b) With EBS, the m active PE’s are spread regularly

among the n PE _.

where dxe represents the smallest integer greater than or equal to x. In summary, QSN roughly doubles the complexity compared to a simple BS while its number of layers, DQSN,

is equal to the number of layers of a classical BS plus one, i.e.,

DQSN= dlog2(n)e + 1. (4)

The hardware cost and the critical path of QSN is not negligible with regards to the other blocks. For example, for the 5G LDPC decoder, a combinational QSN costs roughly 80% of the area required for the PE’s, moreover, without pipeline layers, the QSN limits the maximum clock frequency due to its long critical path.

In this paper, we tackle the same problem with a new approach. In fact, nothing prevents the designer to select any subset of m PE’s and m memories to relax the constraint on the design. Based on this idea, we propose a solution called Extended Barrel-Shifter (EBS) composed of a classical BS of size n extended with an extra layer of MUX’s. Fig. 2.b illustrates the principle of EBS compared to the QSN. Note that the EBS has exactly the same number of layers than the QSN but it significantly reduces the number of MUX’s (from 38.3 % for n = 48 up to 44.5 % for n = 512). Moreover, EBS also offers the possibility to work in parallel with T vectors of size m, provided that T m ≤ n.

The rest of the paper is divided into three sections. Section II presents the principle of the EBS. Section III describes the hardware architecture and compares EBS with the state of the art. Finally Section IV concludes the paper.

II. EXTENDEDBARREL-SHIFTER

In this section we first mathematically describe the EBS, and then we explain its extension for the parallel processing of several vectors.

A. Formal description of EBS

Let X0 be a vector of size m ≤ n, i.e., X0 ₌

(x0

0, x01, . . . , x0m−1). The rotation Rp of index p applied on

X0 _{to generate X}p _{= R}p_(X0_{) is the permutation that shifts}

the ith element x0i of X0 to the position (i + p) mod n in

Xp_{, thus}

Xp= Rp(X0) = (x_(−p)0 , x0_(−p+1), . . . , x0_(m−p−1)), (5) where 0 ≤ p < m and operations of the indexes are performed modulo m, e.g. x0

(−p) refers to x 0

m−p when p > 0.

The proposed EBS allows to perform the operation Rp_(X0₎

for any value of m, 1 < m ≤ n and p, 0 ≤ p < m thanks to operations on a vector of size n. To simplify notations, in the sequel, X refers always to a vector of size m and Y and Z to vectors of size n. Moreover, rotation on a vector of size n will be denoted with a bar as ¯R. Mathematically, EBS can be defined by the three following functions:

1) Initial mapping: An injective map πmbetween X0 and

Z0 _{defined as}

Z0= πm(X0). (6)

This map associates the ithcoordinate x0i of X0 to the πm(i)

coordinate z_π0 m(i) of Z 0_{, i.e., z}0 πm(i)= x 0 i with πm(i) = bin/mc, i = 0, 1, . . . , m − 1, (7)

where bac is the floor function that returns the highest integer smaller than or equal to a. For example, for n = 8 and m = 5, the vector X0 _{= (x}0

0, x01, x02, x03, x04) is mapped into Z0 as

Z0 _{= (x}0

0, x01, ∅, x02, x03, ∅, x04, ∅) where ∅ stands for ‘idle’,

or equivalently, unused position. Note that the mapping spread regularly the value on X0 _{into the vector Z}0_.

We also define Zp_{as the result of the mapping π}

m applied

on Xp_{, i.e.,}

Zp= πm(Rp(X0)). (8)

Back to the previous example, p = 3 would lead to Z3 ₌ _π

5(R3(X0)) = π5((x02, x03, x04, x00, x01)) =

(x0

2, x03, ∅, x04, x00, ∅, x01, ∅).

2) Rotation: The rotation ¯Rp¯ _{of index ¯}_{p = π}

m(p) applied

to Z0 _{is given by}

Yp= ¯Rπm(p)_(Z0_). ₍₉₎

Back to the previous example and considering p = 3, we have π5(p = 3) = b3 × 8/5c = 4, and thus Y3 =

¯

R4_(Z0_{) = (x}0

3, ∅, x04, ∅, x00, x01, ∅, x02). We can notice that

the initial position of the first coordinate x0₀of X0in Z0is 0. After a rotation of index πm(p), x00gets in position πm(p) in

Yp. This position is identical to the position of x00in Zp. Since

the values of X0 are spread regularly in Z0thanks to (6), this property is not affected by the rotation. In other words, after the rotation ¯Rπm(p)_(Z0_{), x}0

(4)

all the other values are also close to their final position. Let us examine what is happening for the rest of the values of Yp_.

3) Final Mapping: For a given index i = 0, 1, . . . , m − 1, the position of x0i in Yp is given by the summation (modulo

n) of its initial position πm(i) in Z0 and the rotation factor

πm(p), i.e., πm(p) + πm(i). The final position of x0i in Zp

should be πm(i + p). Let us define δm,p(i) as the difference

between the theoretical position of xi in Zp and its effective

position in Yp_{, i.e.,}

δm,p(i) = πm(i + p) − πm(p) − πm(i). (10)

According to (7), it is possible to rewrite (10) as

δm,p(i) = b(i + p)n/m)c − bin/mc − bpn/mc. (11)

Theorem 1: For all (a, b) ∈ R2, ba + bc − bac − bbc ∈ {0, 1}. Proof: The real a can be decomposed as a sum of its integer part na = bac and a fractional part ra = a − na ∈ [0, 1[,

i.e, a = na+ ra. Similarly, b can be written as b = nb+ rb

with rb ∈ [0, 1[. Thus ba + bc = na+ nb+ bra+ rbc. Since

ra+ rb ∈ [0, 2[, bra+ rbc can only take the values 0 or 1

Corollary: δm,p(i) ∈ {0, 1}.

Proof Use theorem 1 with a = in/m and b = pn/m. From the corollary, we deduce that from Yp, we still have to perform either one shift right when δm,p(i) = 1 or none

when δm,p(i) = 0 for each index i to obtain Zp. Thanks to

this additional permutation, for all p, m, n, 0 ≤ p < m ≤ n, we have

Zp= δm,p( ¯Rπm(p)(Z0)). (12)

Fig. 3.a shows graphically the equivalence of the two paths to obtain Zpwhile Fig. 3.b gives an example for n = 8, m = 5 and p = 3.

Fig. 3. Illustration of the proposed method, a) general case, b) example for n = 8, m = 5 and a rotation index of p = 3.

Let us examine how to implement in practice the final map-ping δm,p(i). The Euclidean division of ni (respectively, np)

by m gives the quotient qm,iand the remainder 0 ≤ rm,i< m,

i.e., ni = qm,im + rm,i (respectively, np = qm,pm + rm,p).

Thus, (11) yields:

δm,p(i) =bqm,i+ qm,p+ (rm,i+ rm,p)/mc

− bqm,i+ rm,i/mc − bqm,p+ rm,p/mc (13)

=b(rm,i+ rm,p)/mc.

Note that δm,p is a vector of size m and that δm,p(i)

indi-cates the local control signal associated to the ith coordinate x0

i of X0. Vector δm,p should follow the same transformation

than X0 _{in (8) in order to send the local control signal}

associated to the ith _{coordinate of x}0

i of X0 to the final

position of x0

i in Zp, i.e.,

m,p= πm(Rp(δm,p)). (14)

Thus, the local signal control m,p(j) allows to select either

z_jp = yp_j if m,p(j) = 0 or zjp = y p

j−1 if m,p(j) = 1,

(j = 0, 1, 2, . . . , n − 1). Back to the previous example, given m = 5 and p = 3, r5,3 = 3 × 8 mod 5 = 4

and r5,i = 8i mod 5 = {0, 3, 1, 4, 2}i=0,1,...,4, thus δ5,3 =

(b4/5c, b7/5c, b5/5c, b8/5c, b6/5c) = (0, 1, 1, 1, 1) and thus 5,3 = π5(R3(δ5,3)) = (1, 1, ∅, 1, 0, ∅, 1, ∅), where 0∅0 stands for idle position, as shown in Fig. 3.b. In the sequel, all the idle positions are set with a zero value, i.e., 5,3 =

(1, 1, 0, 1, 0, 0, 1, 0). B. Parallelism

When m ≤ n/2, if only one vector is processed, most of the PE’s stay in the idle mode. EBS offers the possibility to take profit of the idling PE to either process several independent frames in parallel, and thus increase the processing throughput, or to process the same frame with several decoding hypothesis to increase the decoding performance using techniques likes [8] or [9] to name only a few. In both cases a ‘Single Instruction, Multiple Data (SIMD)’ type of processing should be used, i.e., vectors of same size should be concurrently rotated by the same index. EBS can process in parallel T vectors with the same rotation index p, provided that T m ≤ n. Let {X0

k}k=0,1,...,T −1 be the T > 1 vectors to be processed

in parallel, with x0

k,i representing the ith coordinate of the

kth_{vector. From those T vectors, we generate a single vector}

X00of size T m as the concatenation of the elements of vectors {X0

k}k=0,1,...,T −1, i.e, j = 0, 1, 2, . . . , mT − 1

x00_j = x0_r_j_,q_j, (15) where qj and rj are given by the Euclidean division of j

by T , i.e, j = qjT + rj. For example, if T = 2 and

m = 3, then the T = 2 vectors X00 = (x00,0, x00,1, x00,2)

and X10 = (x01,0, x01,1, x01,2) are mapped to X00 =

(x00,0, x01,0, x00,1, x01,1, x00,2, x01,2). Thus by construction, any

rotation of index pT on X0 generates a rotation of index p on each component vectors X_k0 of X00. It is thus possible to perform those operations on Z00= πm(X00) with the method

described in section II-A.

III. HARDWARE ARCHITECTURE

In this section, we first review the general architecture, then we focus on the particular case of EBS for the 5G standard requirement.

A. Global architecture

EBS can be split up into two parts: first, a BS of size n that can perform any rotation on a vector Z of size n. Let 0 ≤ ¯p < n be the index of rotation on Z, L = dlog2(n)e the

(5)

(¯p(L − 1), ¯p(L − 2), . . . , ¯p(0))2 the binary representation of

¯

p. The BS is composed of L layers, each one is composed of n multiplexers (MUX’s) that allows either to perform the identity function if ¯p(l) = 0, with l the index of the layer, 0 ≤ l < L, or a circular rotation of index 2l _{if ¯}_{p(l) = 1.}

Second, the final extra stage of Multiplexers performs the δ permutation. Contrary to the BS, in the final stage, the n multiplexers are controlled independently by the signal m,p(j), j = 0, 1, . . . , n − 1. The overall complexity CEBS

in number of MUX’s is thus given as

CEBS= (dlog2(n)e + 1)n, (16)

and the number of layers of MUX’s DEBS is equal to DQSN

given in (4). Fig. 4 shows the datapath of EBS for n = 8. The determination of the effective rotation index ¯p = πm(p)

and the generation of control vector m,p as a function of m

and p can be obtained directly by using a Read Only Memory (ROM) that stores all configurations. For a given value of m, p can take the values 0, 1, . . . , m − 1, i.e., m possible values. Thus, it is possible to generate a unique address A(m, p) for a given couple (m, p) as

A(m, p) = (m − 1)m/2 + p, (17) to access a word in a ROM of size SR= n(n+1)/2 (addresses

vary from 0 up to A(n, n − 1)). The word at address A(m, p) is of size n+L bits, containing the n values of vector m,pand

the L bits of the binary representation of πm(p), respectively.

Fig. 4. Detailed EBS datapath for n = 8.

B. Complexity analysis

Table I compares the number of MUX’s and the number of layers between BN, QSN and EBS. The table also gives the relative percentage saving S in number of MUX’s between QSN and EBS with S defined as

S =CQSN− CEBS CQSN

× 100%. (18) For the size n = 384, this reduction reaches S = 42.3% which is very significant. Table I also compares the synthesis results of QSN [7] and EBS using a Virtex 4 LX160-10 FPGA.

TABLE I

COMPLEXITY COMPARISON(IN NUMBER OFMUX’S)AND DELAY(IN NUMBER OF LAYERS OFMUX’S)BETWEEN THEBENES, QSNAND THE

PROPOSEDEBSNETWORKS.

n 48 64 128 256 384 512 DBN 11 11 13 15 17 17 DQSN 7 7 8 9 10 10 DEBS 7 7 8 9 10 10 ˜ DEBS 7 7 8 9 10 10 CBN 528 704 1664 3840 6528 8704 CQSN 497 705 1665 4097 6657 9217 CEBS 336 448 1024 2304 3840 5120 S 32.4% 36.4% 38.5% 43.8% 42.3% 44.5% ˜ CQSN 4290 5996 13589 - - -˜ CEBS 2712 3600 8192 18634 39752 53080 ˜ S 36.8% 40% 39.7% - -

-The complexity ˜CQSNand ˜CEBS are given in number of Look

Up Table (LUT). The observed LUT reduction ˜S between ˜

CQSNand ˜CESB is consistent with the theoretical reduction S.

Although the number of layers is equal for both architectures, the reduction in MUX’s translates also in a reduction in the number of interconnection wires, which in turn, reduces the complexity of the place and route operation in an Application Specific Integrated Circuit (ASIC) or in a Field Programmable Gate Array (FPGA). This reduction of complexity enables the reduction of both area and power consumption, along with a higher clock frequency. However, EBS has also some drawbacks compared to QSN. First it requires some memories to store the control configuration which is not the case for QSN where control signals are generated thanks to few simple logical gates. Second, the initialization of the RAM with the incoming frames and the output of the decoded frames are more complex than [7] due to the mapping πm. Nevertheless,

those operations are performed once during the initialization phase and the output phase so their latency is out of the global latency of the iterative decoding process.

The next section describes the particular case of the permu-tation network associated to a 5G LDPC decoder.

C. The EBS for the 5G LDPC decoder

In this section, we describe the EBS applied to the 5G LDPC decoder. Although described for this special case, some optimisation methods can also be used in the general case. First, when p = 0, there is no rotation, and thus, no need to apply the permutation δm,p (note that (11) is always 0 when

p = 0). In other words, there is no need to store the vector m,0, nor the value of πm(p = 0) since it is always equal

to 0. Moreover, for the 5G LDPC code, n = 384 is not a prime number. Thus, there exist values of m that divide n. In that case, n/m = q is an integer and thus, (11) yields to δm,p(i) = 0, for all values of p. Moreover, in that case,

according to (7), πm(p) = pq, thus a simple multiplier is

enough to determine the value of πm(p). For the 5G, only a

subset S of size 51 values of m among 383 are used. It is thus possible to identify the value of m by its position in the set S = {m0, m1, ..., m50}, or equivalently, ma = S(a), with a

(6)

with Ma ∈ {2, 3, 5, 7, 9, 11, 13, 15}, ea ∈ {0, 1, . . . , 7} and

ma ≤ 384). Based on these remarks, the size of the ROM

can be reduced by computing the address thanks to a table B generated recursively as B(0) = 0, B(a) = B(a − 1) if ma

divides n, B(a) = B(a − 1) + ma− 1 otherwise. Then from

B, the address of the ROM is computed as A(a, 0) = 0 for all a = 0, 1, . . . , 50, A(a, p) = 0 for all p = 0, 1, . . . , ma− 1 if

ma divides n, and otherwise, A(a, p) = B(a) + p. A second

ROM C of size 51 words can store the value of qa = n/ma

when qa is integer, or 0 otherwise. Thus, the test qa = 0 and

p > 0 indicates that address A(a, p) = B(a) + p should be used to access ROM E that contains the values ma,p, and the

ROM P that contains the values πm(p). The size of ROM E

and P, for the 5G code is 2766 words (note that, in practice, ROM E and P can be merged in a single ROM). When the conditions qa 6= 0 and p > 0 is not fulfilled, then the 0

address is used in ROM E and P. It gives respectively the null vector for ma,p and a dummy zero value for πma(p)

added to pqa to generate the πma(p). Note that when ROM

P gives a non-null value for πma(p), then pqa is necessarily

equal to 0, thus the final addition gives πma(p) in all cases.

Fig. 5 illustrates the particular control structure of EBS for the 5G LDPC decoders. Finally, Fig. 6 shows the contents of ROM E. From the arabesques of the ROM contents, we can deduce that there is still redundancy that can be suppressed in this memory.

Fig. 5. Control of the EBS for the 5G LDPC code.

Finally, after determining the contents of ROM E, we realized that m,p(j) is always equal to zero for 16 values

of j defined as j = 24 × c, c = 1, 2, . . . , 16. This observation allows the reduction of the word size of ROM E from 384 to 368 bits, and also allows us to eliminate 16 multiplexers in the final layer of EBS.

IV. CONCLUSION

In this paper we show an EBS of size n can perform any circular-shift rotation of vector of size m ≤ n albeit to an appropriate initial positioning of the data in the processing element. We also extend the method to process an identical rotation on T vectors of size m, T m ≤ n, which allows either to increase the processing throughput and/or to enhance the decoding performance of the LDPC decoder. Compared to the QSN, the EBS requires between 38 % and 50 % less MUX’s

48 96 144 192 240 288 336 384 500 1000 1500 2000 2500

Fig. 6. Graphical representation of the ROM E contents. A ‘one’ is represented by a black dot, a ‘zero’ by a white dot.

for n ≥ 48. This reduction of MUX’s translates in a reduction of wires and thus, overall, in a significant area and critical path reductions. EBS has some drawbacks: it requires a more complex generation of control signals than for the QSN (we propose using a ROM to store the control signals) and requires more complex input/output operation, since each data should be stored in the appropriate memory bank. We showed that the proposed architecture is suitable for the 5G LDPC code where all the lifting sizes are covered.

ACKNOWLEDGMENT

This work has been funded by the Brittany region and the EU through the FEDER program in the frame of the FLEXDEC-5G project. The authors would like to thank Jérémie Nadal, Alireza Tasdighi, Ali Al-Ghouwayel, Cédric Marchand, Laura Conde-Canencia and Franklin Cochachin for their corrections and suggestions to improve the paper.

REFERENCES

[1] D. E. Hocevar, “A Reduced Complexity Decoder Architecture Via Lay-ered Decoding of LDPC Codes,” in IEEE Workshop on Signal Processing Systems, 2004. SIPS 2004., Oct 2004, pp. 107–112.

[2] 3GPP, Available Online, https://www.3gpp.org/.

[3] V. Benes, “Optimal Rearrangeable Multistage Connecting Networks,” Bell Syst. Tech. J., vol. 43, no. 7, pp. 1641–1656, 1964.

[4] D. Oh and K. K. Parhi, “Low-Complexity Switch Network for Recon-figurable LDPC Decoders,” IEEE Transactions On Very Large Scale Integration (VLSI) Systems, vol. 18, no. 1, pp. 85–94, Jan 2010. [5] J. Lin, Z. Wang, L. Li, J. Sha, and M. Gao, “Efficient Shuffle Network

Architecture and Application For WiMAX LDPC Decoders,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 56, no. 3, pp. 215–219, March 2009.

[6] M. Rovini, G. Gentile, and L. Fanucci, “Multi-Size Circular Shifting Networks for Decoders of Structured LDPC Codes,” Electronics Letters, vol. 43, no. 17, pp. 938–940, August 2007.

[7] X. Chen, S. Lin, and V. Akella, “QSN−A Simple Circular-Shift Network for Reconfigurable Quasi-Cyclic LDPC Decoders,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 57, no. 10, pp. 782–786, Oct 2010.

[8] T. Tonnellier, C. Leroux, B. Le Gal, B. Gadat, C. Jego, and N. Van Wambeke, “Lowering the Error Floor of Turbo Codes With CRC Verifica-tion,” IEEE Wireless Communications Letters, vol. 5, no. 4, pp. 404–407, Aug 2016.

[9] F. Leduc-Primeau, S. Hemati, S. Mannor, and W. J. Gross, “Lowering Error Floors Using Dithered Belief Propagation,” in 2010 IEEE Global Telecommunications Conference GLOBECOM 2010, Dec 2010, pp. 1–6.