
Optimized Precoded Spatio-Temporal Partial-Response Signaling over Frequency-Selective MIMO Channels

Jelle Bailleul, Student Member, IEEE, Lennert Jacobs, Member, IEEE, Mamoun Guenach, Senior Member, IEEE, Marc Moeneclaey, Fellow, IEEE

Abstract—Due to the continuous demand for higher bit rates, the management of the spatio-temporal intersymbol interference in frequency-selective multiple-input multiple-output (MIMO) channels becomes increasingly important. For single-input single-output channels, equalized precoded partial-response signaling is capable of handling a large amount of intersymbol interference, but, to date, no equalization scheme with general partial-response signaling has been presented for the frequency-selective MIMO channel. Not only does this contribution extend partial-response signaling to the MIMO channel by proposing a general spatio-temporal partial-response precoder, but it also develops a minimum mean-squared-error optimization framework in which the equalization coefficients and the spatio-temporal target response are jointly optimized. Three iterative optimization algorithms are discussed, which update (part of) a row of the target impulse response matrix in each iteration. In particular, the third algorithm reformulates this row optimization as a lattice decoding problem. Numerical simulations confirm that the general partial-response signaling clearly outperforms the traditional full-response signaling in terms of the mean squared error and the bit error rate. The third optimization algorithm has a better performance but a higher complexity, compared to the first and the second algorithm.

Index Terms—decision-feedback equalization, lattice decoding, MIMO frequency-selective channel, minimum mean-squared-error equalization, partial-response signaling, precoding.

I. INTRODUCTION

Because of the continuous growth in bit rate and associated bandwidth, the spatio-temporal (ST) intersymbol interference (ISI) caused by frequency-selective (FS) multiple-input multiple-output (MIMO) channels becomes an increasingly prominent challenge in the design of high-speed communication systems. To mitigate this ST ISI, several equalization schemes have been proposed in the literature.

The optimal detector for uncoded transmission performs maximum likelihood sequence detection (MLSD), which involves a bank of matched filters and the application of the Viterbi algorithm [1], [2]. Unfortunately, its exponential complexity in both the number of data streams and the channel memory is intolerably large for most applications, therefore limiting the practicability of this detector. A more practically feasible alternative is the MIMO linear equalization scheme, in which linear equalization filters first reduce the ST ISI, and a symbol-by-symbol detector then recovers the transmitted data [3]. In contrast to the MLSD, this equalization scheme requires only limited complexity, but suffers from a (sometimes considerable) performance degradation.

J. Bailleul, L. Jacobs, and M. Moeneclaey are with the Department of Telecommunications and Information Processing, Ghent University, Ghent, Belgium (email: jelle.bailleul, lennert.jacobs, marc.moeneclaey@ugent.be).

M. Guenach is with Imec, Leuven, Belgium, and with the Department of Telecommunications and Information Processing, Ghent University, Ghent, Belgium (email: guenach@ieee.org). Part of the research work was conducted at Nokia Bell Labs, Antwerp.

Besides the linear equalizers, several nonlinear equalization schemes for continuous transmission over FS MIMO channels have been investigated that combine adequate performance with low complexity. For example, a decision-feedback equalizer (DFE) [3], [4], which uses the previously detected symbols at the receiver (RX) as input, can completely eliminate the postcursor ISI, provided that the previous symbols were correctly detected.

However, its performance may be severely degraded by the propagation of erroneous decisions. Error propagation can be overcome by transferring the feedback structure from the RX to the transmitter (TX), resulting in an equalization scheme called Tomlinson-Harashima precoding (THP). The THP scheme was originally introduced in the context of FS SISO channels [5], [6], and has also been described for flat and FS MIMO channels in [7] and [8], respectively. However, THP requires a modulo operator at both the TX and the RX, inducing both a power and a modulo loss.

The equalization performance can also be enhanced by partial-response signaling (PRS) [9], in which the channel is equalized to a target impulse response (TIR) with integer coefficients. In this TIR, a controlled amount of ISI is allowed, thereby considerably facilitating the equalization task. At the RX, the data detection is executed based on the equalized channel output in the extended signal set, after which the original data is retrieved from these decisions by means of a feedback filter at the RX. However, this feedback structure is prone to error propagation. Alternatively, the feedback filter can be omitted when employing a precoder at the TX. Contrary to THP, this precoder interestingly does not induce any power loss. A common example of PRS is duo-binary signaling, where the TIR in z-transform notation is $1 + z^{-1}$ [10], [11].

Interestingly, a relation between PRS and lattice-reduction-aided (LRA) equalization has been established in [12], [13], [14]. LRA techniques have been applied to flat fading MIMO channels with only spatial ISI [15], [16], [17], [18]. More precisely, a lattice reduction algorithm, e.g., the LLL algorithm [19] or the element-based reduction algorithm [18], first computes a reduced channel matrix that is better conditioned than the original channel matrix, i.e., closer to being orthogonal. Next, the equalizer with respect to this reduced channel is obtained by either the zero-forcing (ZF) or the minimum mean-squared-error (MMSE) criterion. After decision, the data is then recovered by means of the inverse base transformation.

This procedure has been applied to block transmission over FS channels for single-input single-output (SISO) [20] and MIMO [21] systems, and to mitigate the spatial ISI with low complexity in MIMO orthogonal frequency division multiplexing systems [22]. Additionally, [16] and [23] have proved that optimal diversity is reached by the LRA equalizers for a flat fading MIMO channel and a FS MIMO channel, respectively.

However, continuous transmission, e.g., PRS, might be preferred over block transmission to avoid guard times and to allow for time-invariant equalization filters. In [12], generalized PRS for the flat fading MIMO channel has been derived from the LRA equalizer by implementing the inverse base transformation at the TX instead of the RX. The similarity between PRS and LRA has been emphasized for a FS SISO channel in [13], [14], in which the channel's transfer function is decomposed as the product of a target response and a reduced channel. The ZF infinite-length equalizer is then designed to completely equalize the latter, and the target response inducing the least noise enhancement is considered optimal. This optimization is done by means of a look-up table in [13], but a more advanced algorithm has been proposed in [14] by reformulating this problem as a lattice decoding problem [24].

Although PRS has already been described in the literature for the FS SISO channel [9] and for the flat-fading MIMO channel [12], no general ST PRS equalization scheme has been investigated for the FS MIMO channel. Indeed, the TIR in [12] and [10] is confined to spatial-only components and temporal-only components, respectively. Moreover, most contributions assume that the TIR is given [9], [10], [11], and no optimization with respect to this TIR is performed at all, resulting in an inferior trial-and-error selection of the target response.

This contribution focuses on PRS applied to a symbol-spaced, complex-valued, discrete-time, FS MIMO channel. To address the problem of ISI in continuous transmission over the FS channel, a general PRS architecture is presented, consisting of a ST PRS precoder at the TX and a ST-DFE (encompassing a forward and a decision-feedback filter) at the RX. Interestingly, the TIR of the PRS contains both spatial and temporal components that are complex-valued, as the baseband channel can be complex-valued. Contrary to the trial-and-error TIR selection often found in the literature [9], [10], [11], the ST-TIR is jointly optimized with the ST-DFE according to the MMSE criterion. First, the optimal equalization coefficients can readily be expressed as a function of the TIR. The resulting MSE is then minimized with respect to the TIR by three different algorithms. Interestingly, the third algorithm reformulates the TIR optimization as a sequence of lattice decoding problems, which can be solved by the algorithm presented in [24]. In [14], a similar algorithm is proposed to compute a real-valued TIR in the context of a FS SISO channel with a whitened-matched-filter front-end.

Figure 1. Model of a PRS-ST communication link. The TX consists of the partial-response precoder, whereas the RX aims to reduce the ISI by means of a linear feedforward and linear feedback filter.

However, the present contribution significantly differs from [14], since the real-valued TIR in [14] has been derived by minimizing a lower bound on the noise enhancement of the infinite-length ZF equalizer for a SISO channel, whereas this contribution jointly optimizes the complex-valued TIR coefficients and more practical, finite-length filters according to the MMSE-criterion for a MIMO channel.

The remainder of the paper is organized as follows. Section II mathematically describes the communication link, consisting of a PRS precoder at the TX, a FS MIMO channel, and a nonlinear ST-DFE at the RX, the latter comprising a linear feedforward filter and a linear feedback filter operating on the previously detected symbols. The filters and the ST-TIR are calculated by minimizing the MSE in Section III.

Following the simple optimization with respect to the filters of the ST-DFE in subsection III-A, the computation of the optimized TIR is more challenging (subsection III-B). In total, three iterative optimization algorithms, each performing a row-by-row optimization of the TIR, are discussed (subsections III-B1 to III-B4). To assess the performance of these algorithms, the bit error rate (BER) expression is derived in Section IV, after which numerical results are given in Section V. Finally, conclusions are drawn in Section VI.

In the following, upper-case (lower-case) boldface letters are used for matrices (vectors). The notations $(\cdot)^*$, $(\cdot)^T$, $(\cdot)^H$, $\mathrm{E}[\cdot]$, $\|\cdot\|$, $\mathrm{Tr}(\cdot)$, $\mathrm{Re}[\cdot]$, $\mathrm{Im}[\cdot]$, $\det(\cdot)$, and $|\cdot|$ represent the complex conjugate, the transpose, the Hermitian transpose, the statistical expectation, the Euclidean vector norm, the trace, the real part, the imaginary part, the determinant, and the absolute value, respectively. Additionally, $I$, $0$, and $\mathbf{1}$ denote the identity matrix, the all-zero matrix, and the all-ones column. The elements of the single-entry matrix $J_{l,q}$ are zero, except for the $(l,q)$th element, which equals 1. The set of all integers ranging from $a$ to $b$ is written as $\{a, \dots, b\}$, whereas $X \cup Y$, $X \cap Y$, and $X \setminus Y$ denote the union and intersection of the sets $X$ and $Y$, and the complement of $Y$ relative to $X$, respectively.

Finally, an optimized variable is indicated by $(\cdot)^\star$.

II. SYSTEM MODEL

Fig. 1 displays the system model of an equalized complex-valued $N_r \times N_t$ MIMO baseband channel with spatio-temporal PRS (PRS-ST). At the TX, the input consists of $N_t$ continuous complex-valued data symbol streams, i.e., $c(k) = [c_1(k), \dots, c_l(k), \dots, c_{N_t}(k)]^T$, with $c_l(k)$ denoting the $k$th symbol from the $l$th stream. All symbols are independently and uniformly drawn from the complex symbol set $\mathcal{C} = \{0, 1, \dots, \sqrt{M}-1\} + j\{0, 1, \dots, \sqrt{M}-1\}$, where $M$ is assumed to be an integer power of 4. The binary labels associated with the elements of $\mathcal{C}$ are assigned according to the binary-reflected Gray mapping.

Next, the symbol vectors $c(k)$ are applied to the MIMO partial-response precoder at the symbol rate $1/T_s$. This precoder is characterized by the target polynomial $T_{\mathrm{tar}}(D) = \sum_{m=0}^{L_T} T(m) D^m$ with Gaussian integer $N_t \times N_t$ matrices $T(m)$, $m = 0, \dots, L_T$, i.e., $T(m) \in \mathbb{Z}[j]^{N_t \times N_t}$ with $\mathbb{Z}[j] = \{a + jb \,|\, a, b \in \mathbb{Z}\}$. In contrast to previous work on PRS that focused mainly on SISO channels ($N_t = N_r = 1$), on strictly temporal partial response ($T(m)$ diagonal), or on strictly spatial partial response ($T(m) = 0_{N_t}$ for $m > 0$), the more general complex-valued MIMO precoder presented here allows ST components (with spatial-only and temporal-only components as particular cases). The input sequence $\{c(k)\}$ is converted by the precoder into the precoded output sequence $\{b(k)\}$ according to

$$b(k) = \left[\, T^{-1}(0) \left( c(k) - \sum_{m=1}^{L_T} T(m)\, b(k-m) \right) \right]_{\sqrt{M}}, \qquad (1)$$

where $[\cdot]_X$ symbolizes the element-wise modulo-$X$ reduction of both the real and the imaginary part to the interval $[0, X)$.

To avoid any power loss, PRS requires that all entries of the precoded sequence $b(k)$, just like the elements of $c(k)$, are independently and uniformly drawn from the symbol set $\mathcal{C}$. This property is obtained when all entries of $T(m)$ are Gaussian integers and $T(0)$ is a (complex-valued) unimodular matrix (Appendix A), since one can easily prove that $T^{-1}(0) \in \mathbb{Z}[j]^{N_t \times N_t}$ if and only if the determinant of the Gaussian integer matrix $T(0)$ belongs to the set $\{1, -1, j, -j\}$ ([25] for the real-valued case).
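To make the recursion in (1) concrete, a minimal Python sketch of the precoder is given below; the function and variable names are illustrative (not taken from the paper), and the Gaussian-integer arithmetic is emulated in floating point.

```python
import numpy as np

def prs_precode(c, T, M):
    """Spatio-temporal PRS precoder of (1).

    c: (N, Nt) array of symbols drawn from C; T: list [T(0), ..., T(LT)] of
    Gaussian-integer Nt x Nt matrices with T(0) unimodular; M: constellation size."""
    N, Nt = c.shape
    LT = len(T) - 1
    sqrtM = int(round(np.sqrt(M)))
    T0_inv = np.linalg.inv(T[0])       # Gaussian-integer valued, since T(0) is unimodular
    b = np.zeros((N, Nt), dtype=complex)
    for k in range(N):
        # feedback part sum_{m=1}^{LT} T(m) b(k-m), with b(k) = 0 for k < 0
        fb = sum(T[m] @ b[k - m] for m in range(1, LT + 1) if k - m >= 0)
        v = T0_inv @ (c[k] - fb)
        # v is Gaussian-integer valued up to floating-point error; round, then reduce
        v = np.round(v.real) + 1j * np.round(v.imag)
        # element-wise modulo-sqrt(M) reduction of real and imaginary parts to [0, sqrt(M))
        b[k] = np.mod(v.real, sqrtM) + 1j * np.mod(v.imag, sqrtM)
    return b
```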

The components of $b(k)$ are mapped to the $M$-QAM constellation, i.e.,

$$a(k) = 2\Delta\, b(k) + (1+j)\Delta(-\sqrt{M}+1)\,\mathbf{1}, \qquad (2)$$

where $\Delta = \sqrt{\tfrac{3}{2(M-1)}}$ such that the symbol energy is normalized, i.e., $\mathrm{E}\!\left[a(k)a^H(k)\right] = I_{N_t}$. The sequence $a(k)$ is then transmitted over the discrete-time FS $N_r \times N_t$ MIMO channel, which encompasses the complex-valued impulse response matrices $H(m)$ and the zero-mean circularly symmetric complex Gaussian noise $n(k)$. The channel input-output relationship is given by

$$y(k) = \sum_{m=-L_H^{(1)}}^{L_H^{(2)}} H(m)\, a(k-m) + n(k), \qquad (3)$$

in which the channel impulse response is assumed to be finite, i.e., $H(m) = 0_{N_r \times N_t}$ for all $m \notin \{-L_H^{(1)}, \dots, L_H^{(2)}\}$.
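A corresponding sketch of the mapping (2) and the channel model (3) follows, again with illustrative names; the channel taps H and the noise variance N0 are assumed to be supplied by the caller.

```python
import numpy as np

def map_to_qam(b, M):
    """M-QAM mapping of (2): shifts and scales the precoded symbols so that
    E[a(k) a(k)^H] = I_Nt."""
    delta = np.sqrt(3.0 / (2.0 * (M - 1)))
    return 2 * delta * b + (1 + 1j) * delta * (1 - np.sqrt(M))

def channel_output(a, H, N0, rng):
    """FS MIMO channel of (3): H maps the tap index m in {-LH1, ..., LH2} to an
    Nr x Nt matrix; a is an (N, Nt) array of transmitted symbol vectors."""
    N = a.shape[0]
    Nr = next(iter(H.values())).shape[0]
    y = np.zeros((N, Nr), dtype=complex)
    for k in range(N):
        for m, Hm in H.items():
            if 0 <= k - m < N:
                y[k] += Hm @ a[k - m]
    # circularly symmetric complex Gaussian noise, variance N0 per component
    noise = np.sqrt(N0 / 2) * (rng.standard_normal((N, Nr)) + 1j * rng.standard_normal((N, Nr)))
    return y + noise
```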

The RX first applies $y(k)$ to the linear MIMO feedforward equalization filter characterized by the $N_t \times N_r$ impulse response matrices $W(m)$. Next, the output of the linear MIMO feedback filter with impulse response matrices $B(m)$, operating on the previously detected $\hat{a}(k)$, is subtracted from the feedforward filter output. When the decisions $\hat{a}(k-m)$ are correct, the decision variable $u(k)$ can be formulated as

$$u(k) = \sum_{m_2=-L_W^{(1)}}^{L_W^{(2)}} \sum_{m_1=-L_H^{(1)}}^{L_H^{(2)}} W(m_2) H(m_1)\, a(k-m_1-m_2) + \sum_{m=-L_W^{(1)}}^{L_W^{(2)}} W(m)\, n(k-m) - \sum_{m \in \Phi_B} B(m)\, a(k-m), \qquad (4)$$

where both filters are assumed to be finite, i.e., $W(m) = 0_{N_t \times N_r}$ for all $m \notin \{-L_W^{(1)}, \dots, L_W^{(2)}\}$ and $B(m) = 0_{N_t}$ for all $m \notin \Phi_B$, with $\Phi_B = \{\Phi_B(1), \dots, \Phi_B(L_B)\}$. The $L_B$ active time instants do not necessarily coincide with the time-delay indices $1, \dots, L_B$; the only restriction, imposed by causality, is $\Phi_B(i) \geq 1$ for $i = 1, \dots, L_B$. According to (1), the target response vector

$$u_T(k) = \sum_{m=0}^{L_T} T(m)\, a(k-m) \qquad (5)$$

and the original sequence $c(k)$ are related by

$$c(k) = \left[ \frac{u_T(k)}{2\Delta} + c_v \right]_{\sqrt{M}}, \qquad (6)$$

where

$$c_v = \frac{(1+j)(\sqrt{M}-1)}{2} \sum_{m=0}^{L_T} T(m)\, \mathbf{1}. \qquad (7)$$

The equalization coefficients are therefore selected such that the decision variable $u(k)$ from (4) approaches the target response vector $u_T(k)$ as closely as possible. Consequently, the decision on $c(k)$, denoted $\hat{c}(k)$, follows from replacing the quantity $\frac{u_T(k)}{2\Delta}$ in the right-hand side of (6) by the Gaussian integer which is nearest to $\frac{u(k)}{2\Delta}$, after which the modulo operator maps this decision to the original $M$-QAM constellation. The input of the feedback equalizer, i.e., the decision on $a(k)$, denoted $\hat{a}(k)$, is constructed by substituting $\hat{c}(k)$ for $c(k)$ in the precoder equation (1), yielding the decision on the vectors $b(k)$, denoted $\hat{b}(k)$, which are then mapped to $M$-QAM symbols.
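The detection chain described above can be sketched as follows, using the scaled variable $v(k) = u(k)/(2\Delta) + c_v$ that is formalized in Section IV; the names are illustrative.

```python
import numpy as np

def detect_symbols(u, T, M):
    """Symbol-by-symbol detection based on (6)-(7): u is an (N, Nt) array of
    decision variables, T the list [T(0), ..., T(LT)], and M the constellation size."""
    delta = np.sqrt(3.0 / (2.0 * (M - 1)))
    sqrtM = np.sqrt(M)
    Nt = T[0].shape[0]
    c_v = (1 + 1j) * (sqrtM - 1) / 2 * sum(Tm @ np.ones(Nt) for Tm in T)   # offset of (7)
    v = u / (2 * delta) + c_v
    # decision in the extended set: round to the nearest Gaussian integer ...
    v_int = np.round(v.real) + 1j * np.round(v.imag)
    # ... and fold it back into the original symbol set C via the modulo operator
    return np.mod(v_int.real, sqrtM) + 1j * np.mod(v_int.imag, sqrtM)
```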

To simplify the notation in (4), the following matrices are introduced:

$$W = \left[\, W(-L_W^{(1)}) \;\cdots\; W(L_W^{(2)}) \,\right], \qquad (8)$$

$$B = \left[\, B(\Phi_B(1)) \;\cdots\; B(\Phi_B(L_B)) \,\right], \qquad (9)$$

$$G(m) = \left[\, H^H(m+L_W^{(1)}) \;\cdots\; H^H(m-L_W^{(2)}) \,\right]^H, \qquad (10)$$

and

$$\bar{n}(k) = \left[\, n^H(k+L_W^{(1)}) \;\cdots\; n^H(k-L_W^{(2)}) \,\right]^H. \qquad (11)$$

Based on these shorthand notations and $\Phi_G = \{-L_H^{(1)} - L_W^{(1)}, \dots, L_H^{(2)} + L_W^{(2)}\}$, the decision variable $u(k)$ from (4) can be rewritten as

$$u(k) = \sum_{m \in \Phi_G} W G(m)\, a(k-m) + W \bar{n}(k) - \sum_{m \in \Phi_B} B(m)\, a(k-m). \qquad (12)$$

III. MMSE EQUALIZATION

In this contribution, the MMSE criterion is employed to jointly optimize the feedforward filter $W$, the feedback filter $B$, and the TIR $T = \left[\, T(0) \;\cdots\; T(L_T) \,\right]$. With $e(k) = u(k) - u_T(k)$ denoting the difference between the decision variable $u(k)$ and the target response vector $u_T(k)$, the corresponding normalized MSE is defined as

$$\mathrm{MSE} \triangleq \frac{\mathrm{E}\!\left[\|e(k)\|^2\right]}{\mathrm{E}\!\left[\|a(k)\|^2\right]} = \frac{\mathrm{Tr}(R_e)}{\mathrm{E}\!\left[\|a(k)\|^2\right]} = \frac{1}{N_t} \sum_{l=1}^{N_t} \mathrm{MSE}^{(l)}, \qquad (13)$$

where the expectation is taken over the data symbols and the noise, and the error covariance matrix $R_e$ is given by $R_e \triangleq \mathrm{E}\!\left[e(k)e^H(k)\right]$. More precisely, the $l$th diagonal element of $R_e$, denoted by $\mathrm{MSE}^{(l)}$, equals the MSE corresponding to the $l$th data stream. The MSE from (13) equals the normalized sum of all $\mathrm{MSE}^{(l)}$.

First, a simplified notation is developed. The vector $y_W$ contains all channel output samples contributing to the decision variable $u(k)$. Vectors $a_T$ and $a_B$ comprise all symbols contributing to $u_T(k)$ and to the input of the feedback filter, respectively. All remaining data symbols are collected in the vector $a_N$, i.e.,

$$y_W = \left[\, y^H(k+L_W^{(1)}) \;\cdots\; y^H(k-L_W^{(2)}) \,\right]^H, \qquad (14)$$

$$a_B = \left[\, a^H(k-\Phi_B(1)) \;\cdots\; a^H(k-\Phi_B(L_B)) \,\right]^H, \qquad (15)$$

$$a_T = \left[\, a^H(k) \;\cdots\; a^H(k-L_T) \,\right]^H, \qquad (16)$$

$$a_N = \left[\, \cdots\; a^H(k-m) \;\cdots \,\right]^H \quad \forall m \in \Phi_G \setminus \{\Phi_T \cup \Phi_B\}. \qquad (17)$$

Since all symbols in $a(k)$ are spatially and temporally uncorrelated, all cross correlations between $a_T$, $a_B$, and $a_N$ are zero and the autocorrelation matrix of these vectors is equal to the identity matrix, e.g., $\mathrm{E}[a_T a_T^H] = I$. In this contribution, the intersection of $\Phi_T = \{0, \dots, L_T\}$ and $\Phi_B$ is assumed to be empty by design. Otherwise, the target response matrices $T(m)$ for $m \in \Phi_T \cap \Phi_B$ would not influence the communication link's quality, as the feedback equalizer is capable of removing the contribution from $a(k-m)$ to $e(k)$ for any $m \in \Phi_B$. Based on (3) and the notations in (14)-(17), the input-output relationship between $y_W$ and the data symbols can be rewritten as

$$y_W = G_T a_T + G_B a_B + G_N a_N + \bar{n}(k), \qquad (18)$$

where

$$G_T = \left[\, G(0) \;\cdots\; G(L_T) \,\right], \qquad (19)$$

$$G_B = \left[\, G(\Phi_B(1)) \;\cdots\; G(\Phi_B(L_B)) \,\right], \qquad (20)$$

$$G_N = \left[\, \cdots\; G(m) \;\cdots \,\right] \quad \forall m \in \Phi_G \setminus \{\Phi_T \cup \Phi_B\}. \qquad (21)$$

Consequently, the MSE defined in (13) can be expressed as

$$\mathrm{MSE} = \frac{1}{N_t}\, \mathrm{E}\!\left[ \left\| W\!\left(G_T a_T + G_B a_B + G_N a_N + \bar{n}(k)\right) - B a_B - T a_T \right\|^2 \right]. \qquad (22)$$

In the subsections below, the optimal $W$ and $B$ are first formulated for a given $T$ in subsection III-A, after which three algorithms are presented in subsection III-B to optimize the matrix $T$.

A. Optimization over W and B

The optimal coefficients $W^\star$ and $B^\star$, for given $T$, are straightforwardly obtained by equating the derivatives of the MSE in (22) with respect to $W$ and $B$ to zero, yielding [4], [10]

$$W^\star = T G_T^H \left( G_T G_T^H + G_N G_N^H + R_n \right)^{-1}, \qquad (23)$$

$$B^\star = W^\star G_B, \qquad (24)$$

where $R_n \triangleq \mathrm{E}\!\left[\bar{n}(k)\bar{n}^H(k)\right]$. The $\mathrm{MSE}_{W^\star, B^\star}$ is then given by plugging (23) and (24) into (22), yielding

$$\mathrm{MSE}_{W^\star, B^\star} = \frac{1}{N_t}\, \mathrm{Tr}\!\left\{ T G T^H \right\}, \qquad (25)$$

where $G = I_{N_t(L_T+1)} - G_T^H \left( G_T G_T^H + G_N G_N^H + R_n \right)^{-1} G_T$. For full-response signaling (FRS), i.e., $T = \left[\, I_{N_t} \;\; 0_{N_t \times L_T N_t} \,\right]$, the expressions in (23) and (24) are verified to simplify to the standard expressions for the MMSE equalizer [3], [4].
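As an illustration of (23)-(25), the following sketch builds the block matrices $G(m)$, $G_T$, $G_B$, and $G_N$ from the channel taps and evaluates $W^\star$, $B^\star$, and $\mathrm{MSE}_{W^\star, B^\star}$ for a given TIR; the noise is assumed spatially and temporally white with variance $N_0$ (as in the simulations of Section V), and all names are illustrative.

```python
import numpy as np

def build_G(H, LH1, LH2, LW1, LW2, Nr, Nt):
    """Block matrices G(m) from (10): vertical stack of H(m+LW1), ..., H(m-LW2)."""
    def Htap(m):
        return H.get(m, np.zeros((Nr, Nt), dtype=complex))
    return {m: np.vstack([Htap(m + n) for n in range(LW1, -LW2 - 1, -1)])
            for m in range(-LH1 - LW1, LH2 + LW2 + 1)}

def mmse_solution(G, T, Phi_B, LT, N0):
    """W*, B*, and the MSE of (23)-(25) for a given TIR T = [T(0) ... T(LT)],
    assuming white noise with variance N0."""
    Nt = T.shape[0]
    Phi_T = list(range(LT + 1))
    Phi_G = sorted(G.keys())
    G_T = np.hstack([G[m] for m in Phi_T])
    G_B = np.hstack([G[m] for m in Phi_B])
    G_N = np.hstack([G[m] for m in Phi_G if m not in Phi_T and m not in Phi_B])
    Rn = N0 * np.eye(G_T.shape[0])                                 # noise covariance
    S_inv = np.linalg.inv(G_T @ G_T.conj().T + G_N @ G_N.conj().T + Rn)
    W = T @ G_T.conj().T @ S_inv                                   # (23)
    B = W @ G_B                                                    # (24)
    calG = np.eye(Nt * (LT + 1)) - G_T.conj().T @ S_inv @ G_T      # matrix G of (25)
    mse = np.real(np.trace(T @ calG @ T.conj().T)) / Nt            # (25)
    return W, B, calG, mse
```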

B. Optimization over T

Instead of considering a channel-independent TIR $T$ such as FRS or duo-binary, this contribution optimizes $T$ by minimizing (25) subject to two constraints: (i) all entries of $T$ must belong to $\mathbb{Z}[j]$, and (ii) $\det(T(0))$ must belong to $\{-1, 1, j, -j\}$. TIRs satisfying these constraints are called feasible, and ensure that $b(k)$ and $c(k)$ possess identical statistics (see Appendix A).

Equation (25) reveals that the $l$th row of $T$ only influences $\mathrm{MSE}^{(l)} = \left( T G T^H \right)_{l,l}$, the MSE of the $l$th data stream. Hence, when the constraint on $\det(T(0))$ is satisfied, constructing a new target impulse response matrix $T'$ by multiplying one row of $T$ with the complex conjugate of $\det(T(0))$ results in $\det(T'(0)) = 1$, while the MSE from (25) remains unaltered. Therefore, without loss of generality, the second constraint will be replaced by $\det(T(0)) = 1$ in the sequel.

Although the $l$th row of $T$ affects only $\mathrm{MSE}^{(l)}$, the rows of $T$ cannot be chosen independently from each other due to the constraint $\det(T(0)) = 1$, impeding the optimization of each row individually. This subsection instead discusses three iterative algorithms to compute an optimized TIR $T^\star$. In the $i$th iteration of each algorithm, the TIR is incremented as follows:

$$T_{i+1} = T_i + T_{\mathrm{inc}}, \qquad (26)$$

where the increment matrix $T_{\mathrm{inc}}$ is a function of $T_i$. The corresponding $\mathrm{MSE}_{i+1}$ is given by

$$\mathrm{MSE}_{i+1} = \frac{1}{N_t}\, \mathrm{Tr}\!\left( T_{i+1} G T_{i+1}^H \right) \qquad (27)$$

$$= \mathrm{MSE}_i + \frac{1}{N_t}\, \mathrm{Tr}\!\left( T_i G T_{\mathrm{inc}}^H + T_{\mathrm{inc}} G T_i^H + T_{\mathrm{inc}} G T_{\mathrm{inc}}^H \right). \qquad (28)$$

In the $i$th iteration, $T_{\mathrm{inc}}$ is restricted to consist of only one nonzero row, namely the row with index $l_i^\star$. Hence, the $i$th iteration focuses entirely on the reduction of $\mathrm{MSE}_i^{(l_i^\star)}$, which is chosen to be the largest among the reducible $\mathrm{MSE}_i^{(l)}$: $\mathrm{MSE}_i^{(l_i^\star)}$ is defined as the largest diagonal element of the error covariance matrix that can be reduced by an increment $T_{\mathrm{inc}}$ computed by a specific algorithm. More specifically, when the sequence $\mathrm{MSE}_i^{(m_1)}, \dots, \mathrm{MSE}_i^{(m_{N_t})}$ is sorted from large to small, i.e., $\mathrm{MSE}_i^{(m_n)} \geq \mathrm{MSE}_i^{(m_{n+1})}$ for $n = 1, \dots, N_t - 1$, then $l_i^\star = m_n$ if and only if $\mathrm{MSE}_i^{(l)}$ is not reducible for $l = m_1, \dots, m_{n-1}$ and $\mathrm{MSE}_i^{(m_n)}$ is reducible. Further, to guarantee the feasibility of $T_{i+1}$, each entry of $T_{\mathrm{inc}}$ must be a Gaussian integer, i.e., $(T_{\mathrm{inc}})_{l,q} \in \mathbb{Z}[j]$, and, when updating a row of $T_i(0)$, the first $N_t$ elements of the increment must additionally be a linear combination of the other rows of $T_i(0)$ such that $\det(T(0))$ does not alter. Initializing $T_0$ as any feasible TIR then obviously results in an optimized $T^\star$, since convergence to a (local) optimum is guaranteed as the MSE is lowered in each iteration and bounded below by 0. Unfortunately, this optimization method does not guarantee convergence to the global optimum, making the initialization and the computation of $T_{\mathrm{inc}}$ crucial for the quality of $T^\star$. Initializing $T_0$ as FRS is the most logical choice because the resulting MSE is then upper bounded by the MSE of FRS.

The first algorithm A1 updates the element of the $l_i^\star$th row of $T_i$ inducing the largest decrease in $\mathrm{MSE}_i^{(l_i^\star)}$, yielding a fast and low-complexity algorithm. The second algorithm A2 is similar to A1, but updates the element of the $l_i^\star$th row of $T_i$ yielding the largest guaranteed decrease, which is defined in subsection III-B2 and depends not only on the current $T_i$, but also on the potential of future TIR increments. Due to this extra information, convergence to a better (local) optimum is expected at the cost of a slightly larger complexity. In the last algorithm A3, even better performance is expected as the entire $l_i^\star$th row of $T_i$ is updated by solving a relatively complex lattice decoding problem. Below, a detailed description of A1, A2, and A3 is included.

1) Algorithm 1 (A1): In algorithm A1, $T_{\mathrm{inc}}$ is restricted to the set $\left\{ T_{\mathrm{inc}}^{(l,q)},\ l \in \{1, \dots, N_t\},\ q \in \{1, \dots, (L_T+1)N_t\} \right\}$. When $q > N_t$, $T_{\mathrm{inc}}^{(l,q)}$ increments the $(l,q)$th element of $T_i$, i.e.,

$$T_{\mathrm{inc}}^{(l,q)} = \lambda_{l,q} J_{l,q}, \qquad q > N_t. \qquad (29)$$

When $q \leq N_t$, $T_{\mathrm{inc}}^{(l,q)}$ adds a multiple of row $q$ of $T_i(0)$ to row $l$ of $T_i(0)$, i.e.,

$$T_{\mathrm{inc}}^{(l,q)} = \left[\, \lambda_{l,q} J_{l,q} T_i(0) \;\; 0 \,\right], \qquad q \leq N_t,\ l \neq q. \qquad (30)$$

In (29) and (30), $\lambda_{l,q}$ must belong to $\mathbb{Z}[j]$, and its optimum value, $\lambda_{l,q}^\star$, is determined by maximizing the decrease $\delta_{l,q} = \mathrm{MSE}_i^{(l)} - \mathrm{MSE}_{i+1}^{(l)}$, where $\mathrm{MSE}_{i+1}$ is obtained by plugging (29) and (30) into (28). For given $\lambda_{l,q}$, this decrease is verified to be

$$\delta_{l,q} = -2\,\mathrm{Re}\!\left\{ \lambda_{l,q} A \right\} - \left| \lambda_{l,q} \right|^2 B \qquad (31)$$

with

$$A = \begin{cases} \left( G T_i^H \right)_{q,l} & q > N_t \\ \left( T_i(0)\, G_{N_t} T_i^H \right)_{q,l} & q \leq N_t,\ l \neq q \end{cases} \qquad (32)$$

and

$$B = \begin{cases} \left( G \right)_{q,q} & q > N_t \\ \left( T_i(0)\, G_{N_t, N_t} T_i^H(0) \right)_{q,q} & q \leq N_t,\ l \neq q, \end{cases} \qquad (33)$$

Table I
PSEUDOCODE OF ALGORITHM A1

1: $T_0 = [\, I_{N_t} \;\; 0 \,]$, $i = 0$
2: while $\exists l$: $\mathrm{MSE}_i^{(l)}$ reducible do
3:   Select $l_i^\star$ as the index of the largest reducible $\mathrm{MSE}_i^{(l)}$
4:   $q_i^\star = \arg\max_q \delta_{l_i^\star, q}^\star$
5:   Compute $T_{i+1} = T_i + T_{\mathrm{inc}}^{(l_i^\star, q_i^\star)}$ with $T_{\mathrm{inc}}^{(l_i^\star, q_i^\star)}$ given by (29)-(30)
6:   $i \leftarrow i + 1$
7: end while
8: $T^\star = T_i$

where $G_{N_t}$ and $G_{N_t, N_t}$ are constructed by keeping the first $N_t$ rows of $G$ and the first $N_t$ columns of $G_{N_t}$, respectively. The optimal Gaussian integer $\lambda_{l,q}^\star$ that maximizes $\delta_{l,q}$ is given by

$$\lambda_{l,q}^\star = \left\lfloor \frac{-A}{B} \right\rceil, \qquad (34)$$

where $\lfloor \cdot \rceil$ rounds both the real and the imaginary part to the nearest integer. For given $(l,q)$, the largest decrease of $\mathrm{MSE}_i^{(l)}$ corresponding to the increment $T_{\mathrm{inc}}^{(l,q)}$ with $\lambda_{l,q} = \lambda_{l,q}^\star$ is then given by inserting (34) into (31), which yields

$$\delta_{l,q}^\star = -2\,\mathrm{Re}\!\left\{ \lambda_{l,q}^\star A \right\} - \left| \lambda_{l,q}^\star \right|^2 B. \qquad (35)$$

For given $l$, $\mathrm{MSE}_i^{(l)}$ is called reducible when the largest reduction of $\mathrm{MSE}_i^{(l)}$ resulting from an increment matrix $T_{\mathrm{inc}}^{(l,q)}$, i.e., $\max_q \delta_{l,q}^\star$, is larger than 0. Otherwise, $\mathrm{MSE}_i^{(l)}$ is irreducible.

In the $i$th iteration, A1 determines the index $l_i^\star$ such that $\mathrm{MSE}_i^{(l_i^\star)}$ is the largest of the reducible $\mathrm{MSE}_i^{(l)}$. The algorithm then increments $T_i$ with $T_{\mathrm{inc}}^{(l_i^\star, q_i^\star)}$, where $q_i^\star = \arg\max_q \delta_{l_i^\star, q}^\star$, and reduces $\mathrm{MSE}_i^{(l_i^\star)}$ by an amount $\delta_{l_i^\star, q_i^\star}^\star$. Algorithm A1 terminates when none of the $\mathrm{MSE}_i^{(l)}$ can be further reduced. Due to the greedy nature of this algorithm, the resulting $T^\star$ is not guaranteed to be globally optimal. For example, several small decreases may be better than one large decrease, or one small decrease in the current iteration could enable a large decrease in future iterations. Both these events are completely ignored by this greedy algorithm. Finally, Table I lists the pseudocode of A1.
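A simplified rendering of A1 is sketched below. It operates on the matrix $G$ of (25) (written calG) and, to keep the example short, replaces the closed form (34) by a small exhaustive search over $\lambda$ and only updates the temporal entries $q > N_t$, so the row-combination updates of $T_i(0)$ in (30) are omitted; it is an illustrative sketch rather than the full algorithm of Table I.

```python
import numpy as np
from itertools import product

def a1_optimize_tir(calG, Nt, LT, max_iter=200, lam_range=2):
    """Greedy single-element TIR optimization in the spirit of A1 (Table I).

    calG is the Hermitian matrix G appearing in (25).  The Gaussian-integer
    coefficient lambda is found by exhaustive search over a small grid, and
    only entries with q > Nt are updated, so det(T(0)) = 1 is preserved."""
    T = np.hstack([np.eye(Nt), np.zeros((Nt, Nt * LT))]).astype(complex)  # FRS init
    cand = [x + 1j * y for x, y in product(range(-lam_range, lam_range + 1), repeat=2)
            if x or y]

    def mse_rows(T):
        # diagonal of T calG T^H, i.e., MSE^(l) for each data stream
        return np.real(np.diag(T @ calG @ T.conj().T))

    for _ in range(max_iter):
        base = mse_rows(T)
        best_per_row = [(0.0, None)] * Nt      # (largest decrease, (q, lambda)) per row
        for l in range(Nt):
            for q in range(Nt, Nt * (LT + 1)):  # temporal part of row l only
                for lam in cand:
                    Ttry = T.copy()
                    Ttry[l, q] += lam
                    dec = base[l] - mse_rows(Ttry)[l]
                    if dec > best_per_row[l][0] + 1e-12:
                        best_per_row[l] = (dec, (q, lam))
        reducible = [l for l in range(Nt) if best_per_row[l][1] is not None]
        if not reducible:                       # no MSE^(l) can be reduced: converged
            break
        l_star = max(reducible, key=lambda l: base[l])   # largest reducible MSE^(l)
        q, lam = best_per_row[l_star][1]
        T[l_star, q] += lam
    return T
```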

2) Algorithm 2 (A2): The structure of algorithm A2 is identical to the structure of algorithm A1 described in subsection III-B1, except that the selection criterion (line 4 in Table I) exercises more caution. Instead of opting for the TIR increment $T_{\mathrm{inc}}^{(l,q)}$ inducing the largest decrease in the largest reducible $\mathrm{MSE}_i^{(l)}$, a more promising TIR increment is chosen. More precisely, this algorithm selects the TIR increment $T_{\mathrm{inc}}^{(l,q)}$ inducing the largest guaranteed decrease rather than the largest current decrease. This guaranteed decrease $d_{l,q}^\star$ is defined as the sum of two terms:

$$d_{l,q}^\star = \delta_{l,q}^\star + \delta_{l,q,\mathrm{LB}}^\star \qquad \forall (l,q). \qquad (36)$$

The first term, $\delta_{l,q}^\star$, is defined in (35) and represents the optimal decrease achieved by the increment $T_{\mathrm{inc}}^{(l,q)}$ in iteration $i$. The second term, $\delta_{l,q,\mathrm{LB}}^\star$, is a lower bound on the maximal realizable decrease achievable by TIR increments in future iterations $i+1$, $i+2$, etc. For given $T_{i+1}$, the reduction of the MSE realized during iteration $i+1+m$, with $m > 0$, usually depends on the increments previously performed during iterations $i+1, \dots, i+m$. To compute the lower bound $\delta_{l,q,\mathrm{LB}}^\star$, the increments are constrained in iterations $i+2$, $i+3$, etc. such that, for all $m > 0$, the decrease achieved in iteration $i+1+m$ is influenced only by $T_{i+1}$ and by the increment made in iteration $i+1+m$, but not by the increments in iterations $i+1, \dots, i+m$. This way, the sum of the decreases related to the increments in the iterations $i+1$, $i+2$, etc. represents a lower bound on the maximum possible decrease, which would be accomplished by unconstrained increments.

When $T_{\mathrm{inc}}^{(l_j, q_j)}$ denotes the increment in iteration $j$, equations (32) and (33) imply that the MSE reduction achieved in iteration $j$ depends on the $l_j$th row of $T_j$ and, if $q_j \leq N_t$, also on the $q_j$th row of $T_j(0)$. Hence, in order that the MSE reduction in iteration $i+1+m$ is not influenced by the increments made in iterations $i+1, \dots, i+m$ for any $m > 0$, the quantities $l_{i+1+m}$ and $q_{i+1+m}$ must be outside the sets $\{l_{i+n} \,|\, n = 1, \dots, m\}$ and $\{l_{i+n} \,|\, q_{i+n} \leq N_t,\ n = 1, \dots, m\}$, respectively. Consequently, the largest value of $m$ to be considered cannot exceed $N_t - 1$. When $\{(l_{i+1+m}, q_{i+1+m}) \,|\, m > 0\}$ is constructed this way, the MSE reduction corresponding to iteration $i+1+m$ can be computed from (35) by substituting the matrices $T_{i+1}$ and $T_{i+1}(0)$ (rather than $T_{i+1+m}$ and $T_{i+1+m}(0)$) for $T_i$ and $T_i(0)$ in (32) and (33). For given $T_i$ and $T_{\mathrm{inc}}^{(l_i, q_i)}$, the suboptimal greedy algorithm from Table II selects in each iteration $i+1, i+2, \dots, i+N_t$ the constrained increment that achieves the largest decrease, and sums these decreases to obtain $\delta_{l_i, q_i, \mathrm{LB}}^\star$. Subsequently, the guaranteed decrease $d_{l_i, q_i}^\star$ corresponding to $T_{\mathrm{inc}}^{(l_i, q_i)}$ then follows from (36).

Table II
CALCULATION OF $\delta_{l,q,\mathrm{LB}}^\star$

1: Input: $T_i$, $T_{\mathrm{inc}}^{(l,q)}$
2: $T_{i+1} = T_i + T_{\mathrm{inc}}^{(l,q)}$, $\delta_{\mathrm{LB}}^\star = 0$, $m = 0$
3: $\forall (l_{\mathrm{inc}}, q_{\mathrm{inc}})$: calculate $\Delta_{l_{\mathrm{inc}}, q_{\mathrm{inc}}} = \delta_{l_{\mathrm{inc}}, q_{\mathrm{inc}}}^\star$ using (35) with $T_i$ replaced by $T_{i+1}$
4: while $m < N_t$ do
5:   $(l_{i+1+m}, q_{i+1+m}) = \arg\max_{(l_{\mathrm{inc}}, q_{\mathrm{inc}})} \Delta_{l_{\mathrm{inc}}, q_{\mathrm{inc}}}$
6:   $\delta_{\mathrm{LB}}^\star \leftarrow \delta_{\mathrm{LB}}^\star + \Delta_{l_{i+1+m}, q_{i+1+m}}$
7:   $\Delta_{l_{i+1+m}, :} = 0$
8:   if $q_{i+1+m} \leq N_t$ then
9:     $\Delta_{:, l_{i+1+m}} = 0$
10:  end if
11:  $m \leftarrow m + 1$
12: end while
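The greedy computation of Table II can be summarized by the following sketch; it takes a precomputed array of candidate decreases $\delta^\star_{l,q}$ (evaluated with $T_{i+1}$, as in line 3 of Table II) and applies the masking rules discussed above. The names are illustrative.

```python
import numpy as np

def guaranteed_decrease_lb(delta, Nt):
    """Greedy lower bound of Table II: delta is an (Nt, (LT+1)*Nt) array of
    candidate decreases delta*_{l,q} computed for T_{i+1}; entries that would
    interact with an already selected increment are masked out."""
    delta = delta.copy()
    lb = 0.0
    for _ in range(Nt):
        l_sel, q_sel = np.unravel_index(np.argmax(delta), delta.shape)
        if delta[l_sel, q_sel] <= 0:
            break                       # no further guaranteed decrease
        lb += delta[l_sel, q_sel]
        delta[l_sel, :] = 0             # row l_sel may not be updated again
        if q_sel < Nt:                  # row l_sel of T(0) was modified ...
            delta[:, l_sel] = 0         # ... so it may not serve as a source row q
    return lb
```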

Algorithm A2 opts in the $i$th iteration for the increment matrix that induces the largest guaranteed reduction, denoted by $d_{l_i^\star, q_i^\star}^\star$, of $\mathrm{MSE}_i^{(l_i^\star)}$, where $l_i^\star$ equals the index of the largest of the reducible $\mathrm{MSE}_i^{(l)}$ (as in A1, $\mathrm{MSE}_i^{(l)}$ is called reducible when $\max_{q_i} \delta_{l,q_i}^\star > 0$), and $q_i^\star = \arg\max_{q_i} d_{l_i^\star, q_i}^\star$. Consequently, the pseudocode of A2 differs from A1 only in the selection criterion in line 4 of Table I. Although this criterion more carefully selects the increment $T_{\mathrm{inc}}^{(l,q)}$ in A2, the obtained $T^\star$ is not guaranteed to be globally optimal. Still, A2 is expected to yield a better performance than A1, as it can only benefit from the additional information in the increment selection. Moreover, the convergence condition for A1 and A2 is the same, i.e., both algorithms terminate when none of the $\mathrm{MSE}_i^{(l)}$ can be further reduced. Hence, neither algorithm is able to further lower the MSE when one algorithm is initialized with the TIR obtained by the other algorithm.

3) Algorithm 3 (A3): In algorithm A3, the $i$th iteration determines the optimized increment for an entire row of $T_i$ rather than part of a row as in the algorithms above. In this regard, the Hermitian positive-definite matrix $G$ is first factorized according to its Cholesky decomposition, i.e., $G = L L^H$, where $L$ is a lower triangular matrix. Equation (27) can thus be rewritten as

$$\mathrm{MSE}_{i+1} = \frac{1}{N_t}\, \mathrm{Tr}\!\left( T_{i+1} L L^H T_{i+1}^H \right) = \frac{1}{N_t} \sum_{l=1}^{N_t} \left\| t_{l,i+1} L \right\|^2, \qquad (37)$$

where $t_{l,i+1}$ equals the $l$th row of $T_{i+1}$. When $l_i$ denotes the index of the single row that is updated in the $i$th iteration, the updated row $t_{l_i, i+1}$ can be obtained by

$$t_{l_i, i+1} = t_{l_i, i} + t_{\mathrm{inc}}, \qquad (38)$$

while the other rows remain unaltered. To satisfy the additional constraints, all components of the row increment $t_{\mathrm{inc}}$ must belong to $\mathbb{Z}[j]$, and its first $N_t$ components must be a linear combination of the first $N_t$ components of the other rows $t_{l,i}$ with $l \neq l_i$. For a more mathematical description, $t_{\mathrm{inc}}$ and $L$ are first decomposed as

$$t_{\mathrm{inc}} = \left[\, t_{\mathrm{inc}}^{(0)} \;\; \lambda_{\mathrm{inc}}^{(1)} \,\right], \qquad (39)$$

$$L = \left[\, (L^{(0)})^H \;\; (L^{(1)})^H \,\right]^H, \qquad (40)$$

where $t_{\mathrm{inc}}^{(0)}$ and $L^{(0)}$ contain the first $N_t$ elements of $t_{\mathrm{inc}}$ and the first $N_t$ rows of the matrix $L$, respectively, whereas $\lambda_{\mathrm{inc}}^{(1)}$ and $L^{(1)}$ comprise the last $N_t L_T$ elements of $t_{\mathrm{inc}}$ and the last $N_t L_T$ rows of $L$, respectively. In (39), $t_{\mathrm{inc}}^{(0)}$ must be a linear combination of the first $N_t$ elements of the rows $t_{l,i}$ with $l \neq l_i$, which can be expressed as

$$t_{\mathrm{inc}}^{(0)} = \sum_{\substack{l=1 \\ l \neq l_i}}^{N_t} \lambda_{l,\mathrm{inc}}^{(0)}\, t_{l,i}^{(0)} = \lambda_{\mathrm{inc}}^{(0)}\, T_{\bar{l}_i, i}^{(0)}, \qquad (41)$$

where $\lambda_{\mathrm{inc}}^{(0)}$ represents the row vector containing all $\lambda_{l,\mathrm{inc}}^{(0)}$, $t_{l,i}^{(0)}$ consists of the first $N_t$ elements of the $l$th row of $T_i$, and $T_{\bar{l}_i, i}^{(0)}$ is obtained by removing the $l_i$th row from $T_i(0)$. For given $T_i$, the increment $t_{\mathrm{inc}}$ in the $i$th iteration is selected to minimize $\mathrm{MSE}_{i+1}^{(l_i)}$. After the substitution of (38), (39), (40), and (41) into (37), this minimization problem can be reformulated as the following closest point search:

$$\lambda_{\mathrm{inc}}^\star = \arg\min_{\lambda \in \mathbb{Z}[j]^{(L_T+1)N_t - 1}} \left\| \lambda\, G_{\mathrm{lat}} - x \right\|^2, \qquad (42)$$

where $G_{\mathrm{lat}} = \begin{bmatrix} T_{\bar{l}_i, i}^{(0)} L^{(0)} \\ L^{(1)} \end{bmatrix}$, $x = -t_{l_i, i}\, L$, and $\lambda = \left[\, \lambda_{\mathrm{inc}}^{(0)} \;\; \lambda_{\mathrm{inc}}^{(1)} \,\right]$. To solve (42), the lattice decoding algorithm presented in [24] is applied after decomposing all complex-valued quantities in (42) into their real and their imaginary parts. This algorithm searches for a point $\hat{x}$ from a lattice with generator matrix $G_{\mathrm{lat}}$ that is closest to $x$, by recursively decomposing the lattice into lower-dimensional sublattices. Further, $\mathrm{MSE}_i^{(l_i)}$ is reducible when $\lambda_{\mathrm{inc}}^\star$ has at least one nonzero element. The pseudocode of A3 (Table III) is similar to A1 and A2, except that A3 updates an entire row of $T_i$ in each iteration. The increments $T_{\mathrm{inc}}$ allowed in A1 and A2 correspond to a vector $\lambda_{\mathrm{inc}}^\star$ with exactly one nonzero element, thus forming a small subset of all increments allowed in A3. Hence, A3 could potentially further lower the MSE when A3 is initialized with the TIR derived by either A1 or A2, whereas the reverse is not true, implying that A3 has superior performance to A1 and A2.

Table III
PSEUDOCODE OF ALGORITHM A3

1: $T_0 = [\, I_{N_t} \;\; 0 \,]$, $i = 0$
2: while $\exists l$: $\mathrm{MSE}_i^{(l)}$ reducible do
3:   Select $l_i^\star$ as the index of the largest reducible $\mathrm{MSE}_i^{(l)}$
4:   Compute the update $t_{\mathrm{inc}}$ by solving the optimization problem (42)
5:   Update $T_{i+1}$ by computing $t_{l_i^\star, i+1}$ using (38)
6:   $i \leftarrow i + 1$
7: end while
8: $T^\star = T_i$
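The row update (38)-(42) can be sketched as follows. Since the exact closest-point search of [24] is outside the scope of a short example, the sketch solves the real-valued form of (42) with a simple Babai (rounding) approximation; all names are illustrative.

```python
import numpy as np

def a3_row_update(T, calG, l_i):
    """One A3 row update: builds G_lat and x from (42) for row l_i of the TIR
    T (an Nt x (LT+1)Nt matrix) and the matrix G from (25) (calG here), and
    approximates the Gaussian-integer closest-point search by Babai rounding
    instead of the exact lattice decoder of [24]."""
    Nt = T.shape[0]
    L = np.linalg.cholesky(calG)                   # calG = L L^H, L lower triangular
    T0_rest = np.delete(T[:, :Nt], l_i, axis=0)    # T(0) with row l_i removed
    L0, L1 = L[:Nt, :], L[Nt:, :]
    G_lat = np.vstack([T0_rest @ L0, L1])          # generator matrix of (42)
    x = -T[l_i] @ L
    # real-valued reformulation of (42): minimize ||lam_r @ A_r - x_r|| over integers
    A_r = np.block([[G_lat.real, G_lat.imag], [-G_lat.imag, G_lat.real]])
    x_r = np.concatenate([x.real, x.imag])
    lam_r, *_ = np.linalg.lstsq(A_r.T, x_r, rcond=None)
    lam_r = np.round(lam_r)                        # Babai rounding (approximation only)
    n = G_lat.shape[0]
    lam = lam_r[:n] + 1j * lam_r[n:]
    # row increment of (38)-(41): first Nt entries are a combination of the other rows of T(0)
    t_inc = np.concatenate([lam[:Nt - 1] @ T0_rest, lam[Nt - 1:]])
    T_new = T.copy()
    T_new[l_i] = T[l_i] + t_inc
    # keep the update only if it lowers MSE^(l_i); the rounding is not guaranteed optimal
    if np.linalg.norm(T_new[l_i] @ L) ** 2 < np.linalg.norm(T[l_i] @ L) ** 2:
        return T_new
    return T
```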

4) Complexity considerations: This section provides some general remarks on the complexity per iteration of the proposed algorithms. For A1, at most $(L_T+1)N_t^2 - N_t$ optimal $\lambda_{l,q}^\star$ from (34) must be calculated per iteration, which is of course feasible in polynomial time. In A2, for each increment in the current iteration, all $\lambda_{l,q}^\star$ of the next iteration must be determined as well. Hence, at most $\left( (L_T+1)N_t^2 - N_t \right)^2$ different $\lambda_{l,q}^\star$ must be computed, yielding, compared to A1, a larger complexity that is still polynomial in $N_t$ and $L_T$. Clearly, A3 is the most complex, as each iteration must solve a closest point problem. Unfortunately, this problem is NP-hard and no polynomial time algorithm is available (yet) to solve it. Indeed, the search time of the lattice decoding algorithm in [24] rises exponentially with the problem's dimensions, e.g., $N_t$ and $L_T$.

IV. BER EXPRESSION FOR PRS

This section describes how the BER expression for MIMO PRS-ST is derived. In [26], the symbol error rate in the case of SISO PRS with an $M$-PAM constellation has already been discussed in detail, and the present contribution extends this work to obtain the BER for MIMO PRS-ST with an $M$-QAM constellation using a two-dimensional binary-reflected Gray mapping.

First, a scaled and translated version $v(k)$ of the decision variable $u(k)$ is introduced, i.e.,

$$v(k) = \frac{u(k)}{2\Delta} + c_v \qquad (43)$$

with $c_v$ defined in (7). Based on the decomposition of $u(k)$ from (12), the target response vector $u_T(k)$ from (5), and the mapping rule (2), $v(k)$ can be expressed as

$$v(k) = \sum_{m=0}^{L_T} T(m)\, b(k-m) + \mathrm{isi}(k) + n_v(k), \qquad (44)$$

where $n_v(k) = \frac{W \bar{n}(k)}{2\Delta}$ and $\mathrm{isi}(k) = \sum_{m \in \Phi_G} E(m)\, a(k-m)$ with

$$2\Delta\, E(m) = \begin{cases} W G(m) - T(m) & m \in \Phi_T \\ W G(m) - B(m) & m \in \Phi_B \\ W G(m) & m \in \Phi_G \setminus \{\Phi_T \cup \Phi_B\}. \end{cases} \qquad (45)$$

Because (1) yields $\sum_{m=0}^{L_T} T(m)\, b(k-m) = c(k) + T(0)\, d(k)\sqrt{M}$, where the components of $d(k)$ are Gaussian integers, $v(k)$ from (44) reduces to $v(k) = c_{\mathrm{ex}}(k) + \mathrm{isi}(k) + n_v(k)$ with $c_{\mathrm{ex}}(k) = c(k) + T(0)\, d(k)\sqrt{M}$. At the RX, the symbol-by-symbol detector makes a decision $\hat{c}_{\mathrm{ex}}(k)$ of $c_{\mathrm{ex}}(k)$ in the extended symbol set by rounding $v(k)$ to the closest Gaussian integer. As this procedure is performed for each data stream individually, the decision of the received symbol at instant $k$ in the $l$th stream, $\hat{c}_{\mathrm{ex}}^{(l)}(k)$, is based on

$$v^{(l)}(k) = c_{\mathrm{ex}}^{(l)}(k) + \mathrm{isi}^{(l)}(k) + n_v^{(l)}(k), \qquad (46)$$

where $v^{(l)}(k)$, $c_{\mathrm{ex}}^{(l)}(k)$, $\mathrm{isi}^{(l)}(k)$, and $n_v^{(l)}(k)$ equal the $l$th component of the associated vectors $v(k)$, $c_{\mathrm{ex}}(k)$, $\mathrm{isi}(k)$, and $n_v(k)$, respectively. Afterwards, the modulo operator is applied to $\hat{c}_{\mathrm{ex}}^{(l)}(k)$, yielding the decision $\hat{c}^{(l)}(k) \in \mathcal{C}$ of the symbol $c^{(l)}(k)$ transmitted in the $l$th data stream at instant $k$. As the error performance does not depend on the symbol index $k$, this index is dropped in the sequel. The BER for the $l$th data stream, $\mathrm{BER}^{(l)}$, can be expressed as

$$\mathrm{BER}^{(l)} = \sum_{(c, \hat{c}) \in \mathcal{C}^2} \frac{N_{\mathrm{bit}}(c, \hat{c})}{\log_2(M)}\, \Pr\!\left( c^{(l)} = c,\ \hat{c}^{(l)} = \hat{c} \right), \qquad (47)$$

in which $N_{\mathrm{bit}}(c, \hat{c})$ represents the number of bits by which the binary labels of $c$ and $\hat{c}$ differ. In Appendix B, an elaborate discussion is given on how the following simple but accurate approximation for $\mathrm{BER}^{(l)}$ is derived from (47):

$$\mathrm{BER}^{(l)} \approx \frac{1}{\log_2(M)}\, \mathrm{E}_a\!\left[ 4\, Q\!\left( \frac{0.5 - \mathrm{Re}\!\left\{ \mathrm{isi}^{(l)} \right\}}{\sigma_{n_v^{(l)}}} \right) \right], \qquad (48)$$

where $Q(\cdot)$ represents the tail distribution function of the standard normal distribution, $\sigma_{n_v^{(l)}}$ equals the standard deviation of the real part of $n_v^{(l)}(k)$, and $\mathrm{E}_a[\cdot]$ denotes the expectation over all symbols contributing to $\mathrm{Re}\{\mathrm{isi}^{(l)}\}$. The computational complexity of (48) increases exponentially with the channel and filter lengths, thus rapidly becoming prohibitively large in numerical simulations. To circumvent this complexity problem, a large number, $N_{\mathrm{large}}$, of realizations of $\mathrm{Re}\{\mathrm{isi}^{(l)}\}$ is generated, and the expectation in (48) is replaced by the arithmetical average.
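The Monte Carlo evaluation of (48) then amounts to the following sketch, where isi_realizations is assumed to hold $N_{\mathrm{large}}$ generated samples of $\mathrm{Re}\{\mathrm{isi}^{(l)}\}$ and sigma the standard deviation of $\mathrm{Re}\{n_v^{(l)}\}$.

```python
import numpy as np
from scipy.stats import norm

def ber_approx(isi_realizations, sigma, M=4):
    """Monte Carlo version of (48): averages the Q-function over generated
    realizations of the residual ISI on one data stream."""
    q = norm.sf((0.5 - isi_realizations) / sigma)   # Q(x) = P(N(0,1) > x)
    return 4.0 * np.mean(q) / np.log2(M)
```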

V. NUMERICAL RESULTS

This section characterizes the performance of the optimization algorithms (A1, A2, and A3) discussed above by numerical simulations for three scenarios, whose simulation settings are summarized in Table IV.


Table IV
SIMULATION SETTINGS

Common settings: 4-QAM constellation; $\Phi_B = \{4, 5\}$; 3500 channel realizations; $L_H^{(1)} = 0$, $L_H^{(2)} = 25$.
PRS settings: FRS ($L_T = 0$, $T(0) = I$); PRS-S ($L_T = 0$; A1, A2, and A3); PRS-ST ($L_T = 3$; A1, A2, and A3).
Scenarios: S1: $\mu = 2$, $L_W^{(1)} + L_W^{(2)} + 1 = 13$; S2: $\mu = 2$, $L_W^{(1)} + L_W^{(2)} + 1 = 21$; S3: $\mu = 5$, $L_W^{(1)} + L_W^{(2)} + 1 = 13$.

In all scenarios, the channel $H(m)$ is a FS Rayleigh-fading $4 \times 4$ MIMO channel with an exponentially decaying power delay profile with base $\mu$, i.e., $\mathrm{E}\!\left[ \left| (H(m))_{p,q} \right|^2 \right] \propto \mu^{-(m + L_H^{(1)})}$ for $p, q \in \{1, \dots, 4\}$ and $m \in \{-L_H^{(1)}, \dots, L_H^{(2)}\}$. All channel taps are further assumed to be spatially and temporally uncorrelated. Moreover, all components of the circularly symmetric Gaussian noise $n(k)$ are also spatially and temporally uncorrelated and possess variance $N_0$, whereas the components of $a(k)$ belong to a Gray-mapped 4-QAM constellation. In the first scenario S1, a severely FS channel ($\mu = 2$) is equalized using a DFE consisting of a 13-tap feedforward filter and a 2-tap feedback filter with $\Phi_B = \{4, 5\}$. Extending the feedforward filter to 21 taps in the second scenario S2 evidently improves the performance at the cost of a larger complexity. The third scenario S3 is identical to S1, except that the FS channel is less severe ($\mu = 5$).
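For reference, channel realizations with this power delay profile could be generated as follows; the unit-total-power normalization is an assumption, since the paper does not state the scaling explicitly.

```python
import numpy as np

def generate_channel(Nr=4, Nt=4, LH2=25, mu=2.0, rng=None):
    """FS Rayleigh-fading MIMO channel: E[|H(m)_{p,q}|^2] proportional to mu^{-m},
    m = 0, ..., LH2 (here L_H^(1) = 0), with i.i.d. circularly symmetric taps."""
    rng = np.random.default_rng() if rng is None else rng
    profile = mu ** (-np.arange(LH2 + 1, dtype=float))
    profile /= profile.sum()            # assumed normalization: unit total power
    H = {}
    for m in range(LH2 + 1):
        std = np.sqrt(profile[m] / 2)
        H[m] = std * (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt)))
    return H
```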

In all scenarios, a comparison is made between: (i) PRS-ST, whose TIR consists of both spatial and temporal components ($L_T = 3$); (ii) spatial-only PRS (PRS-S), whose TIR possesses only spatial components ($L_T = 0$); and (iii) traditional FRS.

This discussion principally focuses on the performance in terms of MSE and BER ($N_{\mathrm{large}} = 4 \cdot 10^7$), because the former is the objective function of the optimization, whereas the latter is an important performance measure in practice.

In the case of S1, Fig. 2 depicts the 1/MSE and the BER performances averaged over 3500 channel realizations as a function of $\mathrm{SNR} = E_b/N_0$, with $E_b$ the transmitted energy per bit.

Furthermore, Table V lists the SNR in dB needed to achieve a BER of $10^{-8}$ in all scenarios. What is immediately apparent is the significant improvement achieved by PRS compared to FRS. Indeed, the smaller MSE accomplished by PRS is a direct consequence of the additional minimization over $T$, but, more interestingly, Fig. 2 and Table V reveal that also the BER of the PRS is significantly lower compared to FRS.

One noteworthy example is the drastic reduction of the error floor by considering PRS instead of FRS in S1 (Fig. 2).

Moreover, the longer feedforward filter in S2 allows PRS to completely remove the error floor encountered by FRS, or at least to lower it below $10^{-8}$. Finally, PRS accomplishes a reduction of up to nearly 16 dB in SNR to reach a target BER of $10^{-8}$ in S3 (Table V).

Figure 2. Scenario S1: average MSE (upper plot) and average BER (lower plot) with $\mu = 2$ and a 13-tap $W$. Compared to FRS, PRS drastically improves the performance. In particular, PRS-ST optimized using A3 considerably reduces the BER floor.

Significant differences in average performance occur between the algorithms when the TIR has both spatial and temporal components ($L_T = 3$). In this configuration, A3 not only yields an average MSE that is smaller than the MSEs of A1 and A2, but, more importantly, also drastically lowers the BER. For instance, only A3 is able to reduce the error floor below $10^{-8}$ in S1, whereas its SNR required to reach a BER of $10^{-8}$ in S2 is approximately 3.0 dB and 2.5 dB lower compared to A1 and A2, respectively; in S3 the gain of A3 compared to A1 and A2 is about 2.2 dB and 1.5 dB, respectively. These results therefore imply not only that A1 and A2 do not yield the global optimum, but also that A3 is to be preferred when performance is essential. Furthermore, the numerical results confirm our expectation that A2 outperforms A1 on average.

However, the gain of A2 compared to A1 on average turns out to be quite moderate: a lower error floor is achieved in S1 (by a factor of 2), whereas the SNR required to meet the BER target decreases by less than 0.8 dB in S2 and S3.

When the TIR has only spatial components, i.e., $L_T = 0$, the performance difference between A1, A2, and A3 is almost negligible for all scenarios. Indeed, $T$ contains only $T(0)$, limiting the increments in all algorithms to the set of consecutive row additions. All algorithms are therefore similar, resulting in
