DESIGN OF AN ASIP LDPC DECODER COMPLIANT WITH DIGITAL COMMUNICATION STANDARDS


Bertrand LE GAL and Christophe JEGO

IPB / ENSEIRB-MATMECA, CNRS IMS, UMR 5218
351 Cours de la Libération, 33405 Talence
Université de Bordeaux, France
firstname.surname@ims-bordeaux.fr

ABSTRACT

An Application-Specific Instruction-set Processor (ASIP) is a promising approach for designing an LDPC decoder that has to be compliant with multiple standards. Indeed, channel decoding is mainly dominated by dedicated hardware implementations that cannot easily support a large variety of digital communication standards. In this paper, an LDPC decoder architecture based on a publicly available MIPS processor core associated with a homogeneous matrix of processing units is presented. The proposed architecture corresponds to an intermediate approach between the creation of a new application-specific instruction-set processor and a fully dedicated decoder. The design and the FPGA prototyping of the resulting architectures are then described. Results demonstrate the potential of this ASIP approach to implement efficient and flexible LDPC decoders.

Index Terms—LDPC codes, ASIP architecture, MIPS processor, SIMD matrix, digital communication standards.

1. INTRODUCTION

In telecommunication systems, Forward Error Correction (FEC) is used to improve digital communication quality. Error correction encoding consists in adding redundancy to the binary information sequence before transmission over a communication channel. This redundancy allows the FEC decoder to detect and/or correct the effects of noise and interference encountered during transmission. If $k$ information bits are encoded into a codeword of length $n$ bits, the ratio $R = k/n$ is called the code rate. Nowadays, advanced FEC techniques such as LDPC codes [1] closely approach the ultimate limit of channel capacity on a variety of channel models.

LDPC codes are a family of FEC codes that are especially attractive for digital communication standards. They have been adopted as part of the channel coding of several standards such as WiFi (IEEE 802.11n), WiMAX (IEEE 802.16e), 10GBASE-T (IEEE 802.3an) and Digital Video Broadcasting (DVB-S2, T2 and C2). However, the design of a fully compliant LDPC decoder architecture is still a major challenge. Indeed, the lack of homogeneity in the standardized matrices that define the LDPC codes leads to an over-dimensioned and/or partially compliant decoder.

LDPC codes can be efficiently decoded using the Belief Propagation (BP) algorithm. This algorithm operates on the bipartite graph representation of the code by iteratively exchanging messages between the variable and parity check nodes along the edges of the graph [2]. The Min-Sum (MS) algorithm is an alternative method that significantly reduces the hardware complexity of the BP algorithm. Moreover, modified versions of the MS algorithm, such as normalized MS or offset MS, use additional correction factors and offer decoding performance comparable to that of the BP algorithm. Based on these improvements, many LDPC decoders have been described in previous papers; a brief review can be found in [3].

The schedule defines the order in which messages are passed between the nodes of the bipartite graph. Since a bipartite graph contains cycles, the schedule directly affects the convergence rate of the algorithm and hence its computational complexity. We recall that a cycle in a bipartite graph is a finite set of connected edges that starts and ends at the same node, with no node appearing more than once. The classical schedule is flooding, where a decoder iteration is divided into two phases: in the first phase, all the variable nodes send messages to their neighboring parity check nodes, and in the second phase the parity check nodes send messages to their neighboring variable nodes. More efficient layered schedules have been proposed in the literature [4].

The authors wish to thank Valentin Dorison (Enseirb-Matmeca student) for his significant help during the architecture design of the LDPC decoder.

Indeed, the parity check matrix can be viewed as a set of horizontal or vertical layers that are decoded sequentially. A decoder iteration is then split into sub-layer iterations. Layered schedules speed up the decoding convergence by a factor of two. They also ensure a good match between decoding algorithms on the one hand and decoder architectures on the other hand.

The most effective way to implement area-efficient and power-efficient LDPC decoders is to design fully dedicated architectures. However, these architectures are unsuitable for digital receivers that have to support several physical layer specifications, such as a base station or customer premises equipment. Indeed, flexibility, adaptivity and reconfigurability are essential for these applications. Another way to implement applications under flexibility constraints relies on high-end processors. Currently, General-Purpose Processors (GPP) and Digital Signal Processors (DSP) provide high computational performance. Moreover, programming languages and compiler tools offer a high degree of flexibility. However, general-purpose processors are unsuitable for embedded systems that have to achieve high performance with low power dissipation. A third type of architecture combines both approaches: dedicated hardware elements to achieve the required performance and low-cost processor cores to introduce flexibility in the architecture.

It corresponds to flexible architectures with limited programmability and customization possibilities, targeted at a class of applications with high levels of data and instruction parallelism. These architectures are called Application-Specific Instruction-set Processors [5]. In this paper, we detail a flexible LDPC decoder that benefits from the ASIP approach.

2012 IEEE Workshop on Signal Processing Systems

The remainder of the paper is organized as follows. Section 2 discusses related work on ASIP approaches for FEC decoding. Then, the LDPC decoding algorithm and its simplified versions are recalled in Section 3. The challenging issue of designing ASIP LDPC decoders is detailed in Section 4. Implementation results and BER performance measured for an FPGA target are given in Section 5. Finally, conclusions are drawn in Section 6.

2. RELATED WORK

Nowadays, a designer who requires an ASIP in an embedded system has two possibilities: designing a dedicated processor from scratch or reusing an existing flexible softcore processor. The first approach is based on the complete design of a processor core dedicated to a specific application or to an application domain. In such a methodology, the designer has to identify the required functionalities (instructions, processing resources, memory requirements) and then fully describe the processor core using a hardware description language. Automated tools, e.g. Processor Designer from Synopsys using the LISA language [6] or IP Designer from Target based on the nML language [7], were introduced to facilitate the RTL description of the processor. These tools generate the processor description and its development tool flow. This approach enables efficient dedicated processor implementations. However, it has serious drawbacks: architecture validation effort, long design time and a human-unreadable RTL description.

A second approach consists in using publicly available flexible softcore processors. In this way, a designer can benefit from full support of the processor instruction set, an established design flow and well-documented modular HDL descriptions. Moreover, softcore processor descriptions often rely on optimized primitives of current technology targets, which provides more efficient implementations. Many flexible softcore processors are available in the literature [8] [9].

The designer can customize the softcore processor by adding application-specific instructions that are implemented on specifically designed hardware extensions [10]. These extensions are often directly connected to the processor's data-path. It is also possible to automatically reduce the softcore processor functionalities and hardware complexity according to application requirements, as explained in [11].

Some ASIP approaches for FEC decoding can be found in the literature. A first motivation is to propose an architecture that addresses Turbo codes and LDPC codes for a variety of standards. Several flexible decoder architectures are based on an optimized data path combined with a reduced instruction set [12] [13] [14]. These application-specific processors were described in the LISA language using the Processor Designer tool from Synopsys. Other studies focus on ASIP designs optimized only for layered decoding of structured LDPC codes [15] [16]. However, all these research works propose a customized processor designed from scratch. Unfortunately, this practice may be incompatible with time-to-market pressure. To our knowledge, no previous research work explores the reuse of available flexible softcore processors for the design of a flexible LDPC decoder. Moreover, a design flow is also introduced in order to implement efficient ASIP LDPC decoders. Note that it corresponds to an intermediate approach between the creation of a new softcore processor and a fully dedicated decoder.

3. LDPC DECODING ALGORITHM

Irregular Repeat Accumulate (IRA) codes are a family of LDPC codes that can be encoded and decoded with linear complexity while still keeping good BER performance. An IRA code is characterized by a parity check matrix composed of two sub-matrices: a sparse sub-matrix and a staircase lower triangular sub-matrix. Moreover, periodicity has been introduced in the matrix design in order to reduce storage requirements. This family of LDPC codes has been adopted in current digital communication standards such as DVB (S2, T2 and C2), WiFi and WiMAX. Its structure makes it possible to split a decoding iteration into sub-layer iterations. The parity check matrix is viewed as horizontal or vertical layers that can be decoded sequentially.

In order to decrease the complexity of the standard BP decoding algorithm, simplified versions have been proposed. The best known is the offset Min-Sum algorithm, in which the parity check node processing is replaced by the selection of the minimum magnitude among the incoming messages, corrected by an offset. As previously explained, layered schedules speed up the decoding convergence for a given number of iterations. In this paper, we employ the horizontal layered decoding strategy because it is favorable in terms of computational complexity and converges quickly. The chosen horizontal layered offset Min-Sum decoding algorithm is summed up in Algorithm 1.

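Algorithm 1: horizontal layered offset Min-Sum algorithm

init: $t = 0$, $T_n^{(0)} = \frac{2 y_n}{\sigma^2}$ $\forall n \in [1, .., N]$ and $E_{mn}^{(0)} = 0$
repeat
    for all $m$ do
        for all $n \in N(m)$ do
            Variable to parity check message $T_{nm}^{(t)}$ processing:
                $T_{nm}^{(t)} = T_n^{(t)} - E_{mn}^{(t)}$
            Parity check node $E_m^{(t+1)}$ processing:
                $\mathrm{sgn}(E_m^{(t+1)}) = \prod_{n \in N(m)} \mathrm{sgn}(T_{nm}^{(t)})$
                $|E_m^{(t+1)}| = \mathrm{Max}\left(\mathrm{Min}_{n \in N(m)} |T_{nm}^{(t)}| - \eta,\ 0\right)$
        end for
        for all $n \in N(m)$ do
            Parity check to variable message $E_{mn}^{(t+1)}$ processing:
                $\mathrm{sgn}(E_{mn}^{(t+1)}) = \prod_{n' \in N(m) \setminus n} \mathrm{sgn}(T_{n'm}^{(t)})$
                $|E_{mn}^{(t+1)}| = \mathrm{Max}\left(\mathrm{Min}_{n' \in N(m) \setminus n} |T_{n'm}^{(t)}| - \eta,\ 0\right)$
            Variable node $T_n^{(t+1)}$ processing:
                $T_n^{(t+1)} = T_n^{(t)} + \sum_{m \in M(n)} E_{mn}^{(t+1)}$
        end for
    end for
    $t = t + 1$
until $t \geq t_{max}$

The decoded bits are estimated through $\mathrm{sign}(T_n^{(t)})$.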

$y_n$ is the channel observation related to the received bit $n$. $T_n^{(t)}$ denotes the a posteriori log-likelihood ratio (LLR) of the variable node $n$ during iteration $t$, in the case of a BPSK modulation over an additive white Gaussian noise (AWGN) channel of variance $\sigma^2$. The sign of $T_n^{(t)}$ corresponds to a hard decision on the variable node $n$ and the absolute value $|T_n^{(t)}|$ represents the reliability of the decision. Similarly, $E_m^{(t)}$ corresponds to the soft value of the parity check node $m$ during iteration $t$. Let $T_{nm}^{(t)}$ and $E_{mn}^{(t)}$ denote the messages that are sent from variable node $n$ to parity check node $m$ and from parity check node $m$ to variable node $n$, respectively. $M(n)$ is the set of all the parity check nodes that are connected to the variable node $n$. $N(m)$ is the set of all the variable nodes that are connected to the parity check node $m$. $N(m) \setminus n$ is the set of variable nodes that are connected to the parity check node $m$, excluding the variable node $n$.

In a bipartite graph representation, the degree of a node is the number of edges connected to it. The degrees of a variable node $n$ and a parity check node $m$ are denoted $d_{v_n}$ and $d_{c_m}$, respectively. $\eta$ is the offset employed in the offset Min-Sum version in order to reduce the effect of the parity check node processing simplification.
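To make the data flow of Algorithm 1 concrete, the following Python sketch decodes a code given as a list of parity check rows. It is a minimal illustration and not the authors' implementation: the function name, the `H_rows` representation and the default offset `eta=0.5` are assumptions. The variable node update uses the usual layered form $T_n = T_{nm} + E_{mn}$, and the syndrome-based early stop is an addition that Algorithm 1 does not include.

```python
def layered_offset_min_sum(H_rows, llr, eta=0.5, t_max=20):
    """Horizontal layered offset Min-Sum sketch.

    H_rows[m] lists the variable-node indices N(m) of parity check m
    (each check is assumed to have degree >= 2).
    llr[n] is the channel LLR 2*y_n/sigma^2 (positive values favour bit 0).
    """
    T = list(llr)                                     # a posteriori LLRs T_n
    E = [dict.fromkeys(row, 0.0) for row in H_rows]   # messages E_mn, init 0
    bits = [1 if t < 0 else 0 for t in T]
    for _ in range(t_max):
        for m, row in enumerate(H_rows):              # one layer per check row
            # Variable to parity check messages: T_nm = T_n - E_mn
            Tnm = {n: T[n] - E[m][n] for n in row}
            for n in row:
                others = [Tnm[k] for k in row if k != n]
                sign = 1.0
                for v in others:                      # sign product over N(m)\n
                    if v < 0:
                        sign = -sign
                # Offset-corrected minimum magnitude, clipped at zero
                mag = max(min(abs(v) for v in others) - eta, 0.0)
                E[m][n] = sign * mag
                T[n] = Tnm[n] + E[m][n]               # layered LLR update
        bits = [1 if t < 0 else 0 for t in T]
        # Assumed early stop: all parity checks satisfied
        if all(sum(bits[n] for n in row) % 2 == 0 for row in H_rows):
            break
    return bits
```

For example, with the three parity checks of a (7,4) Hamming code and an all-zero codeword whose fifth LLR has the wrong sign, a single iteration is enough to recover the codeword.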

4. ASIP ARCHITECTURE FOR LDPC CODES

4.1. ASIP architecture model

In order to address a large variety of LDPC codes specified in existing communication standards, we have designed a decoding architecture from an existing flexible softcore processor.

To achieve this, we evaluated several publicly available MIPS processor implementations and selected the Plasma processor. This processor is a public domain 32-bit soft processor designed by Steve Rhoads, which implements most of the MIPS-I (TM) instruction set [8]. As it has the same instruction set as a MIPS processor, it can be programmed with the same GNU tool chain. The designed architecture is composed of a Plasma microprocessor controller associated with a homogeneous Single-Instruction Multiple-Data (SIMD) matrix, as detailed in Fig. 1. The SIMD matrix is a specialized form of parallel computing, where $P$ Processing Units (PUs) and $P$ block memories respectively compute and store independent data (the LLRs $T_n$). All PUs are dedicated to the same specific function. Moreover, a register file is dedicated to each PU to store local data (the messages $E_{mn}$). The duplication of the PUs provides high computation rates and the SIMD matrix ensures the homogeneous property. The LLR transfers between PUs and block memories are performed by an interconnection network that implements the interleave $\Pi$ and deinterleave $\Pi^{-1}$ functions.

The communication between the microprocessor core and the SIMD matrix is provided by a system interface that manages data exchanges. The proposed LDPC decoder architecture addresses two challenges:

– genericity: the computation capacity can be adapted as a function of frame size, code rate and throughput,
– programmability: the architecture can process LDPC codes of different standards (WiFi, WiMAX and DVB).

Fig. 1: ASIP architecture model

The architecture of the PU is detailed in Fig. 2. It is composed of two cascaded blocks that have been designed to operate in pipeline mode. As a horizontal layered decoding strategy has been adopted, a PU is defined to process the calculations of one parity check node. The first block is in charge of processing the variable to parity check messages $T_{nm}^{(t)}$. Then, the sign $\mathrm{sgn}(E_m^{(t+1)})$ and the absolute value $|E_m^{(t+1)}|$ of the parity check node $m$ are calculated from the messages $T_{nm}^{(t)}$ of the connected variable nodes $n$. $d_{c_m}$ clock periods are thus necessary to compute the value of the parity check node $m$, as a function of its degree $d_{c_m}$. In a second step, the parity check to variable messages $E_{mn}^{(t+1)}$ of the parity check node $m$ are updated. This task is performed using the sign $\mathrm{sgn}(E_m^{(t+1)})$ and the two minimum values associated with the absolute value $|E_m^{(t+1)}|$ that were previously computed in the first block. Each message is then stored in a register file allocated to the PU, as shown in Fig. 2. The log-likelihood ratio $T_n^{(t+1)}$ of the variable node $n$ is also recalculated to take into account the new parity check to variable message $E_{mn}^{(t+1)}$. In the second block, $d_{c_m}$ clock periods are also necessary to complete all the computations. Moreover, some registers and a FIFO component are present in order to support the two-stage pipeline of the PU architecture.

Fig. 2: Processing Unit architecture
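The two-block behaviour of the PU described above can be modelled in software as follows. This is only a behavioural sketch under stated assumptions: the function names are hypothetical, each loop iteration stands for one clock period, and the pipeline registers and FIFO are not modelled. The first block accumulates the sign product and the two smallest magnitudes, and the second block derives each message $E_{mn}$ from them.

```python
def pu_first_block(t_nm):
    """Accumulate the sign product and the two smallest magnitudes,
    consuming one message T_nm per (modelled) clock period."""
    sign, min1, min2, idx = 1, float("inf"), float("inf"), -1
    for i, v in enumerate(t_nm):
        if v < 0:
            sign = -sign
        a = abs(v)
        if a < min1:                  # new smallest: old smallest becomes second
            min1, min2, idx = a, min1, i
        elif a < min2:                # new second smallest
            min2 = a
    return sign, min1, min2, idx


def pu_second_block(t_nm, sign, min1, min2, idx, eta):
    """Produce one message E_mn per (modelled) clock period."""
    out = []
    for i, v in enumerate(t_nm):
        # Exclude the node's own contribution: use the second minimum
        # when this node supplied the first one.
        mag = max((min2 if i == idx else min1) - eta, 0.0)
        # Multiplying by the node's own sign removes it from the product.
        s = sign * (-1 if v < 0 else 1)
        out.append(s * mag)
    return out
```

Keeping only two minima and one index is what lets the second block serve every connected variable node without re-scanning the $d_{c_m}$ inputs.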

4.2. Benes network

The transfer of the LLRs $T_n$ between PUs and block memories is a major bottleneck of the proposed ASIP architecture. Indeed, PU execution can stall because concurrent accesses to LLR values have to be performed without any conflict.

One well-known solution consists in employing interconnection networks to solve the collision problem. A review of this technique can be found in [17]. In the previous sub-section, we introduced a homogeneous SIMD matrix. This matrix is composed of the same number $P$ of PUs and block memories, as detailed in Fig. 3. In order to handle the message exchanges, we propose to use a multi-stage interconnection network architecture based on a Benes topology. The Benes network [18] is suitable for $P$-to-$P$ permutations. It offers path diversity, where $P$ paths exist for each source/destination pair. Moreover, the latency associated with the Benes topology is constant and corresponds to the network diameter. However, conflicts are avoided only if all sources have different destinations. Unfortunately, standardized LDPC codes are not designed with this type of constraint in mind. Consequently, the executions on the PUs have to be scheduled so that LLR values are processed without any memory bank access conflict. Previous works such as [19] [20] proposed methods to map the data in different memory banks without access conflict. In our case, the mapping of LLR values in the $P$ block memories is not constrained. This mapping is simply a piece of information that has to be considered when scheduling the PU executions in order to know the LLR availability, as explained in the next sub-section.


Fig. 3: Homogeneous Single-Instruction Multiple-Data matrix
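As a rough aid for sizing such a network, the sketch below computes the dimensions of a classical Benes network and checks whether a set of simultaneous accesses is conflict-free. It assumes that the number of ports is a power of two, which is the textbook construction; how the paper handles other values such as $P = 24$ is not stated, so that case is rejected here. Both function names are hypothetical.

```python
def benes_dimensions(p):
    """Return (stages, switches_per_stage) of a classical Benes network
    built from 2x2 switches, for p = 2**k ports."""
    if p < 2 or p & (p - 1):
        raise ValueError("this sketch only covers power-of-two port counts")
    k = p.bit_length() - 1            # k = log2(p)
    return 2 * k - 1, p // 2


def is_conflict_free(banks):
    """True when every access issued in the same cycle targets a distinct
    memory bank, i.e. the requested routing is a (partial) permutation."""
    return len(set(banks)) == len(banks)
```

With these assumptions, a 64-port network has 11 stages of 32 switches, and the constant latency mentioned above corresponds to the traversal of those stages.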

4.3. Design flow for LDPC decoder generation

Designing an LDPC decoder based on our ASIP architecture model requires a design flow that automatically generates the SIMD matrix, the memory mapping and the PU execution specification. The proposed automatic design methodology is detailed in Fig. 4. The first step is the analysis of the LDPC code and in particular of its parity check matrix $H$. This analysis determines the degrees $d_{v_n}$ and $d_{c_m}$ of each variable node $n$ and parity check node $m$, respectively. It also provides an estimate of the maximum parallelism level of the SIMD matrix. This information, associated with the bipartite graph representation of the LDPC code, is required for the construction of a constraint graph over the PU executions. The rest of the design flow is then applied to the constraint graph.

First, an allocation task is executed for a given parallelism level $P$. The purpose of the allocation algorithm is to map all the LLR values $T_n$ to the $P$ memory blocks. This means that the size of each memory block is equal to $n/P$, where $n$ is the frame size. Three different memory mappings are proposed in our design flow: block by block, data by data modulo $P$, and fixed by the designer. The first two approaches are low cost in terms of control resources because the data accesses are regular. However, they introduce a memory mapping constraint for the scheduling-binding step that does not take the LDPC code construction into account.
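The two regular mappings can be illustrated with the short sketch below. The function names and the (bank, address) return convention are assumptions made for illustration, and the block size is rounded up when the frame size is not a multiple of $P$.

```python
def map_block_by_block(n, frame_len, p):
    """Consecutive LLRs fill one memory block before moving to the next.
    Returns (bank, address) for LLR index n."""
    size = -(-frame_len // p)         # ceil(frame_len / p) words per block
    return n // size, n % size


def map_modulo(n, p):
    """Consecutive LLRs are spread over the blocks in round-robin order.
    Returns (bank, address) for LLR index n."""
    return n % p, n // p
```

For a 576-bit frame and $P = 24$, LLR number 30 is stored in bank 1 at address 6 with the block-by-block mapping, and in bank 6 at address 1 with the modulo mapping.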

The most critical task is the scheduling-binding of the PU executions. Scheduling and binding are performed concurrently in order to take the memory mapping into account. A resource-constrained scheduling algorithm, also called list-based scheduling, is used. This algorithm is a generalization of the ASAP algorithm that includes memory mapping constraints. A scheduling priority list is built according to a priority function. Naturally, the efficiency of this algorithm mainly depends on the priority function. In our design flow, this function depends on the mobility of the PU executions and on the data availability. Once all the tasks are completed, the VHDL RTL description of the SIMD matrix is generated. Finally, the Plasma processor has to be programmed to execute the corresponding firmware C-code.


Fig. 4: Methodology for the generation of the ASIP LDPC decoder
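The principle of list-based scheduling under memory constraints can be sketched as follows. This is a simplified model and not the design flow itself: each task accesses a single memory bank, at most `p` tasks run per time slot, and the priority function is supplied by the caller instead of being derived from mobility and data availability. All names are hypothetical.

```python
def list_schedule(banks, deps, p, priority):
    """Greedy list scheduling.

    banks[name] -> memory bank accessed by the task
    deps[name]  -> tasks that must finish in an earlier slot
    p           -> number of PUs (tasks per time slot)
    priority    -> key function; lower values are scheduled first
    Returns the list of time slots, each a list of task names.
    """
    done, slots = set(), []
    pending = sorted(banks, key=priority)
    while pending:
        slot, used = [], set()
        for name in pending:
            ready = all(d in done for d in deps.get(name, ()))
            # Resource constraints: a free PU and a bank not yet accessed
            if ready and len(slot) < p and banks[name] not in used:
                slot.append(name)
                used.add(banks[name])
        if not slot:
            raise ValueError("no task can be scheduled (cyclic dependencies?)")
        done.update(slot)
        pending = [t for t in pending if t not in done]
        slots.append(slot)
    return slots
```

Tasks that target an already used bank are simply deferred to a later slot, which is how the memory mapping enters the schedule as a constraint.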

4.4. Firmware C-code dedicated to the ASIP architecture

The Plasma CPU executes all MIPS I (TM) user mode instructions except unaligned load and store operations. Instructions are divided into three types: R, I and J. As the Plasma CPU has the same instruction set as a MIPS processor, a GNU tool chain can be used to program it. Eleven new instructions have been added to the Plasma CPU instruction set to increase its efficiency in terms of execution cycles. As some of the MIPS I instructions and the corresponding hardware resources are useless in our design, we have optimized the softcore processor. To perform this optimization, we applied the automated methodology described in [11]. The methodology is based on the extraction of the application characteristics from the binary program file in order to remove useless parts of the processor core. An example of firmware C-code that illustrates the Plasma CPU programming process is given in Listing 1. The firmware is the part of the LDPC decoding that considers loop = 20 iterations and a frame size $n$. Six instructions have been defined to directly specify the PU execution:

– First.P-C: register initialization and parity check node
– P-C: parity check node
– First.Var: register initialization and variable node
– Var: variable node
– First.P-C&Var: register initialization, parity check node and variable node
– P-C&Var: parity check node and variable node

    void ldpc_decoder() {
        int loop = 20;
        while (loop) {
            First.P-C(1);
            P-C(n-1);
            First.P-C&Var(1);
            P-C&Var(n-1);
            First.P-C(1);
            P-C&Var(n-1);
            // ............
            First.Var(1);
            Var(n-1);
            loop -= 1;
        }
    }

Listing 1: Firmware C-code example for proposed ASIP architecture

[Plot: BER (log scale, from $10^{-11}$ to $10^{0}$) versus Eb/N0 (from 0 to 4.5) for the code rates 2/5, 1/2, 3/5, 2/3, 3/4, 4/5, 5/6, 8/9 and 9/10]

Fig. 5: BER performance for 64K DVB-T2 LDPC codes measured on an ASIP architecture that contains 64 PUs

5. EXPERIMENTAL RESULTS

Designing ASIP architectures for LDPC decoding is a challenging issue. In this section, implementation results of two LDPC decoders based on the proposed ASIP architecture are detailed. It is shown in [21] that a good trade-off between hardware complexity and decoding performance can be achieved with a 5-bit quantization scheme for LLR values. Let us consider the notation $(x, y, z)$, where $x$, $y$ and $z$ refer to the number of quantization bits of the LLR $T_n$, the message $T_{nm}$ and the message $E_{mn}$, respectively. A uniform quantizer with 1 sign bit, 2 magnitude bits and 2 fractional bits has been selected for LLR values in all our investigations. This means that the fixed-point version of the LDPC decoding algorithm and the decoder architectures have been implemented with the $(5,7,5)$ quantization scheme.
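The 5-bit LLR format can be illustrated with the following quantizer sketch. It assumes a symmetric, saturating sign-magnitude interpretation in which the 2 magnitude bits are integer bits, giving a step of 0.25 and a range of ±3.75; the paper does not spell out these details, and the function name is hypothetical.

```python
def quantize_llr(x, int_bits=2, frac_bits=2):
    """Uniform quantization with saturation (assumed sign-magnitude format:
    1 sign bit, int_bits integer bits, frac_bits fractional bits)."""
    scale = 1 << frac_bits                          # 4 levels per unit
    max_code = (1 << (int_bits + frac_bits)) - 1    # largest magnitude code: 15
    code = round(x * scale)                         # nearest quantization level
    code = max(-max_code, min(max_code, code))      # saturate to +/-15
    return code / scale
```

For instance, an LLR of 1.3 is represented as 1.25, and any value above 3.75 saturates to 3.75.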

Two different LDPC decoders have been designed and then implemented on a Xilinx Virtex-6 LX240T FPGA. The first one is dedicated to the decoding of the smallest LDPC code of the WiMAX standard: LDPC (576, 288). Another study has been carried out to demonstrate the potential of our ASIP architecture model to design a flexible and efficient decoder that supports all the LDPC code configurations of the DVB-T2 standard. The resulting decoder is more complex in terms of hardware resources. Indeed, the SIMD matrix is made of 64 PUs and 64 block memories to exploit the long frame (64800 LLRs) mode. Implementation results after place and route are given in Table 1. Computational resources of the WiMAX decoder take up 5,342 slice Flip-Flops and 8,685 slice LUTs. This means that the occupation rates are only about 1% and 5% of an XC6VLX240T FPGA for slice registers and slice LUTs, respectively. In addition, memory resources for this decoder take up 53 BlockRAMs of 36 kbits.

Table 1: FPGA implementation results for two ASIP architectures

Virtex-6 LX240T    WiMAX (576,288) LDPC code    64K DVB-T2 LDPC codes
PU                 P = 24                       P = 64
Quantization       (5,7,5)                      (5,7,5)
Frequency          100 MHz                      100 MHz
Slice              3,137 (8%)                   12,216 (32%)
Flip-Flop          5,342 (1%)                   12,336 (4%)
LUT                8,685 (5%)                   33,839 (22%)
RAM 36Kb           53 (12%)                     353 (84%)
Throughput         33 Mbps                      62 Mbps

For its part, the DVB-T2 decoder occupies 12,807 slice Flip-Flops and 41,094 slice LUTs. It is well known that the major bottleneck of an LDPC decoder that has to support the long frame mode of the DVB-T2 standard is the memory usage. In our design, 336 BlockRAMs of 36 kbits are necessary to support both frame modes and all the code rates. The clock frequency has been fixed at 100 MHz and 20 iterations have been chosen for the decoding process. This results in a throughput of 33 Mbps and 65 Mbps for the WiMAX (576, 288) LDPC and DVB-T2 (64800, 32400) LDPC decoders, respectively.

In order to validate the designed ASIP LDPC decoders, BER performance measurements have to be carried out. For this reason, we successively integrated the two LDPC decoder versions into an experimental setup composed of a computer associated with the Virtex-6 FPGA ML605 evaluation kit. The LDPC encoder and an AWGN channel emulator are implemented in software running on the computer. The intrinsic information generated by the channel emulator is truncated and rounded, and is sent to the FPGA board through a PCI Express interface. The experimental setup operates frame by frame. First, a comparison between floating-point simulated performance, fixed-point simulated performance and performance measured on the experimental setup, in terms of BER of the designed WiMAX LDPC decoder, is presented in Fig. 6. For the decoding process, the offset Min-Sum algorithm is employed. The number of iterations has been fixed at 20 for all the investigations. Results are given for the LDPC codes (576, 288) and (2304, 1152), which have a code rate equal to 1/2, over a Gaussian channel using a BPSK mapping. The ASIP decoder prototype shows quasi-identical performance when compared to fixed-point simulation. The observed BER performance fulfills the WiMAX standard requirements. Measured BER performance obtained with our experimental setup for 9 code rates of the DVB-T2 standard is plotted in Fig. 5. The error floor observed for the code rate 2/5 can only be removed by implementing a more robust simplified version of the BP algorithm. Fortunately, all other results are compliant with the DVB-T2 standard requirements.

Fig. 6: BER performance for WiMAX LDPC codes
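The software side of such a measurement chain (channel emulation, LLR computation and error counting) can be sketched as below. The sketch assumes unit-energy BPSK symbols with bit 0 mapped to +1, an Eb/N0 value expressed in dB, and the standard relation $\sigma^2 = 1/(2 R \, E_b/N_0)$; the function names are hypothetical and the truncation and rounding step is left out.

```python
import math
import random


def noise_variance(ebn0_db, rate):
    """AWGN variance for unit-energy BPSK at a given Eb/N0 (dB) and code rate."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    return 1.0 / (2.0 * rate * ebn0)


def channel_llrs(bits, ebn0_db, rate, rng):
    """Map bits to BPSK symbols (0 -> +1, 1 -> -1), add Gaussian noise and
    return the intrinsic LLRs 2*y/sigma^2."""
    sigma2 = noise_variance(ebn0_db, rate)
    sigma = math.sqrt(sigma2)
    return [2.0 * ((1.0 - 2.0 * b) + rng.gauss(0.0, sigma)) / sigma2
            for b in bits]


def bit_error_rate(sent, decoded):
    """Fraction of positions where the decoded bits differ from the sent bits."""
    return sum(s != d for s, d in zip(sent, decoded)) / len(sent)
```

A BER point is then obtained by generating many frames with `channel_llrs`, decoding them, and averaging `bit_error_rate` over all frames at the chosen Eb/N0.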

6. CONCLUSION

In this paper, an LDPC decoder architecture based on a publicly available Plasma CPU associated with a homogeneous SIMD matrix of processing units has been detailed. The ASIP architecture model and a design flow to generate and manage LDPC decoders have been successively presented. Implementation results and measured BER performance demonstrate the potential of an ASIP approach based on an existing softcore processor. Indeed, the proposed architecture can be easily and rapidly programmed to process any LDPC code. Note that our design approach also makes it possible to implement an LDPC decoder that supports all the LDPC codes of one or more digital communication standards.

7. REFERENCES

[1] R. G. Gallager, “Low density parity check codes,” IRE Trans. Inform. Theory, vol. IT, pp. 21–28, Jan. 1962.

[2] F. Kschischang, B. Frey, and H.-A. Loeliger, “Factor graphs and the sum-product algorithm,” Information Theory, IEEE Transactions on, vol. 47, no. 2, Feb. 2001.

[3] F. Guilloud, E. Boutillon, J. Tousch, and J.-L. Danger, “Generic description and synthesis of LDPC decoders,” Communications, IEEE Transactions on, vol. 55, no. 11, Nov. 2007.

[4] D. Hocevar, “A reduced complexity decoder architecture via layered decoding of LDPC codes,” in IEEE Workshop on Signal Processing Systems, SIPS 2004, Oct. 2004.

[5] A. Orailoglu and A. Veidenbaum, “Guest editors’ introduction: application-specific microprocessors,” Design Test of Computers, IEEE, vol. 20, no. 1, Jan.-Feb. 2003.

[6] A. Hoffmann, O. Schliebusch, A. Nohl, G. Braun, O. Wahlen, and H. Meyr, “A methodology for the design of application specific instruction set processors (ASIP) using the machine description language LISA,” in Computer Aided Design, 2001. ICCAD 2001. IEEE/ACM International Conference on, 2001, pp. 625–630.

[7] A. Fauth, J. Van Praet, and M. Freericks, “Describing instruction set processors using nml,” in European Design and Test Conference, EDTC 1995, March 1995.

[8] S. Rhoads, “Plasma 32-bit softcore,” www.plasmacpu.no-ip.org, Tech. Rep., 2011.

[9] GRLIB IP Library User’s Manual, Aeroflex Gaisler, 2010.

[10] R. E. Gonzalez, “Xtensa: A configurable and extensible processor,” IEEE Micro, vol. 20, no. 2, April 2000.

[11] B. Le Gal and C. Jego, “Improving architecture efficiency of softcore processors,” in Embedded Real Time Software and Systems, ERTS 2012, Feb. 2012.

[12] M. Alles, T. Vogt, and N. Wehn, “FlexiChaP: A reconfigurable ASIP for convolutional, Turbo, and LDPC code decoding,” in Turbo Codes and Related Topics, 2008 5th International Symposium on, Sept. 2008.

[13] F. Naessens, B. Bougard, S. Bressinck, L. Hollevoet, P. Raghavan, L. Van der Perre, and F. Catthoor, “A unified instruction set programmable architecture for multi-standard advanced forward error correction,” in IEEE Workshop on Signal Processing Systems, SiPS 2008, Oct. 2008.

[14] P. Murugappa, R. Al-Khayat, A. Baghdadi, and M. Jezequel, “A flexible high throughput multi-ASIP architecture for LDPC and turbo decoding,” in Design, Automation Test in Europe Conference Exhibition, 2011, March 2011.

[15] F. Vacca, G. Masera, H. Moussa, A. Baghdadi, and M. Jezequel, “Flexible architectures for LDPC decoders based on network on chip paradigm,” in Digital System Design, Architectures, Methods and Tools, 2009. DSD ’09. 12th Euromicro Conference on, Aug. 2009.

[16] X. Zhang, Y. Tian, J. Cui, Y. Xu, and Z. Lai, “An multi-rate LDPC decoder based on ASIP for DMB-TH,” in ASICON ’09. IEEE 8th International Conference on, Oct. 2009.

[17] G. Masera, F. Quaglio, and F. Vacca, “Implementation of a flexible LDPC decoder,” Circuits and Systems II: Express Briefs, IEEE Transactions on, vol. 54, no. 6, June 2007.

[18] V. E. Benes, Mathematical theory of connecting networks and telephone traffic. Academic Press, New York, 1965.

[19] A. Tarable, S. Benedetto, and G. Montorsi, “Mapping interleaving laws to parallel turbo and LDPC decoder architectures,” Information Theory, IEEE Transactions on, vol. 50, no. 9, pp. 2002–2009, Sept. 2004.

[20] C. Chavet and P. Coussy, “A memory mapping approach for parallel interleaver design with multiples read and write accesses,” in Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, 2010.

[21] C. Marchand, L. Conde-Canencia, and E. Boutillon, “Architecture and finite precision optimization for layered LDPC decoders,” Journal of Signal Processing Systems, vol. 65, pp. 185–197, 2011. [Online]. Available: http://dx.doi.org/10.1007/s11265-011-0604-z