
2.4 TOTAL LEAST-SQUARES AND DATA LEAST-SQUARES PROBLEMS

2.4.1 Problems Formulation

In the previous sections of this chapter, we have considered the case where only the vector x is contaminated by the error and the matrix H is known precisely. The total least-squares approach is suitable for solving estimation problems that can be formulated as a system of over-determined linear equations of the form H s ≅ x, in which both the entries of the data matrix H and the sensor vector x are contaminated by noise, i.e., we have the system of linear equations

$(H_{\rm true} + N)\, s \cong x_{\rm true} + n = x$   (2.83)

where the true data (the data matrix H_true and the measurement vector x_true) are unknown.

In contrast to the TLS, in the LS problem it is assumed that the data matrix H_true is known precisely, i.e., the noise matrix N ∈ R^{m×n} is zero or negligibly small and only the measurement vector x is contaminated by the unknown noise n, while in the data least-squares (DLS) problem it is assumed that the noise n is zero or negligibly small and only the noise contained in the matrix N exists. In the TLS problem, it is assumed that the noise has zero mean with a Gaussian distribution.
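To make the three settings concrete, the following NumPy sketch (illustrative only; dimensions, noise levels and variable names are ours) builds an observed pair (H, x) from noise-free quantities; setting N = 0 gives the LS setting and setting the noise vector to zero gives the DLS setting.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3                               # over-determined system: m > n
H_true = rng.standard_normal((m, n))        # noise-free data matrix (unknown in practice)
s_true = np.array([1.0, -2.0, 0.5])         # parameters to be estimated
x_true = H_true @ s_true                    # noise-free measurement vector

sigma_N, sigma_n = 0.1, 0.1                 # illustrative noise levels
N = sigma_N * rng.standard_normal((m, n))   # perturbation of the data matrix
nvec = sigma_n * rng.standard_normal(m)     # perturbation of the sensor vector

H = H_true + N                              # observed data matrix
x = x_true + nvec                           # observed sensor vector
# TLS model: both H and x noisy (as above); LS model: N = 0; DLS model: nvec = 0.
```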

In many signal processing applications the TLS problem is reformulated as an approximate linear regression problem of the form

$x_i \cong h_i^T s, \qquad (i = 1, 2, \ldots, m)$   (2.84)

where $h_i^T = [h_{i1}, h_{i2}, \ldots, h_{in}]$ is the i-th row of H.

The TLS solution that eliminates the effects of certain types of noise in the signals can be shown to be related to a lower-rank approximation of the augmented matrix H̄ = [x, H].

Based on this result, we show how the unknown vector s can be estimated from noisy data.

This section is concerned with the estimation algorithms that are designed to alleviate the effects of noise present in both the input and the sensor signals. We will show that the total least-squares estimation procedure can produce unbiased estimates in the presence of certain types of noise disturbances in the signals. This procedure will then be extended to the case of arbitrary noise distribution. There is a large class of problems requiring on-line estimation of the signals and the parameters of the underlying systems.

2.4.1.1 A Historical Overview of the TLS Problem

The total least-squares (TLS) method was independently derived in several areas of science, and is known to statisticians as the orthogonal regression or the error-in-variables problem.


Fig. 2.3 This figure illustrates the optimization criteria employed in the total least-squares (TLS), least-squares (LS) and data least-squares (DLS) estimation procedures for the problem of finding a straight-line approximation to a set of points. The TLS optimization assumes that the measurements of the x and y variables are in error, and seeks an estimate such that the sum of the squared values of the perpendicular distances of each of the points from the straight-line approximation is minimized.

The LS criterion assumes that only the measurements of the y variable are in error, and therefore the error associated with each point is parallel to the y axis. The LS therefore minimizes the sum of the squared values of such errors. The DLS criterion assumes that only the measurements of the x variable are in error.

The error-in-variables problem has a long history in the statistical literature. Pearson in 1901 [581] solved the two-variable model-fitting problem that may be formulated as follows: given a set of points (x_k, y_k) for k = 1, 2, . . . , m, we wish to find the optimal straight line

$y = m x + c$   (2.85)

that minimizes the sum of the squared values of the perpendicular distances between the points in the set and the straight line. Figure 2.3 describes this problem graphically. Pearson solved this problem and expressed the modelling errors associated with the points in terms of the mean, standard deviation and correlation coefficients of the data. In the classical least-squares problem, we wish to find the values of the slope and intercept (m, c) that minimize the sum of the squared distances between y_k and its predicted values. In other words, the LS solution results from minimizing the sum of the squared values of the vertical distances between the line and the measurements y_k. It assumes that the variables x_k are error-free and all the noise is contained in y_k. The Data Least-Squares (DLS) algorithms are another class of estimation techniques, based on the assumption that the points y_k are error-free and all the noise is contained only in the measurements x_k. Therefore, the DLS algorithm attempts to minimize the sum of the squared distances between the line and the measurements x_k along the horizontal axis. For example, the DLS solution is useful in equalization problems that involve certain types of deconvolution models. Unlike both the LS and DLS approaches,

the TLS algorithms assume that both x_k and y_k are contaminated by noise. Consequently, we can consider the standard LS and DLS algorithms as special cases of the extended TLS technique.

Example 2.3 Let us determine the optimal line y = mx + c by the ordinary LS, TLS and DLS algorithms, and also the solution to the minimum 1-norm and infinity-norm problems, for the data points:

(x_k, y_k) = (1, 2), (2, 1.5), (3, 3), (4, 2.5), (5, 3.5).

This problem is equivalent to solving the system of linear equations with respect to s:

$H s \cong x$, or equivalently to minimizing $\|x - H s\|$, where $x = [\,2 \;\; 1.5 \;\; 3 \;\; 2.5 \;\; 3.5\,]^T$, $s = [\,m \;\; c\,]^T$, and the data (input) matrix is

$H^T = \begin{bmatrix} 1 & 2 & 3 & 4 & 5 \\ 1 & 1 & 1 & 1 & 1 \end{bmatrix}.$

Fig. 2.4 illustrates the solutions of this problem obtained using different criteria.
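As an illustration (not from the text), the LS and TLS fits of Example 2.3 can be computed in a few lines of NumPy; the TLS estimate is read off the right singular vector associated with the smallest singular value of the augmented matrix H̄ = [x, H], as derived in Section 2.4.2 below. Note that plain TLS, as applied here, also treats the column of ones in H as noisy.

```python
import numpy as np

# Data points of Example 2.3
xk = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
yk = np.array([2.0, 1.5, 3.0, 2.5, 3.5])

H = np.column_stack([xk, np.ones_like(xk)])   # rows h_i^T = [x_i, 1]
x = yk                                        # "sensor" vector in the section's notation

# Ordinary LS: minimize ||x - H s||_2
s_ls, *_ = np.linalg.lstsq(H, x, rcond=None)

# TLS: smallest right singular vector of the augmented matrix Hbar = [x, H]
Hbar = np.column_stack([x, H])
_, _, Vt = np.linalg.svd(Hbar)
v = Vt[-1]                                    # right singular vector for sigma_{n+1}
s_tls = -v[1:] / v[0]                         # scale so that [-1, s^T]^T is proportional to v

print("LS  (m, c):", s_ls)    # vertical residuals minimized
print("TLS (m, c):", s_tls)   # perpendicular residuals minimized
```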

The problem of fitting a straight line to a noisy data set has been generalized to that of fitting a hyper-plane to noisy, higher-dimensional data [581,582].

In the field of numerical analysis, the total least-squares problem was first introduced by Golub and Van Loan in 1980 [501], and studied extensively and refined by Van Huffel, Vandewalle, Lemmerling et al. [581, 582], Hansen and O'Leary, and many other researchers [581, 582].

There are many applications that require on-line adaptive computation of the parameters s for the system model. Adaptive algorithms that employ the TLS formulation and their extensions have been developed and analyzed by Amari and Kawanabe, Mathews, Cichocki, Unbehauen, Xu, Oja, Douglas along with many others [41, 581, 582, 282, 827, 1309, 293, 294].

2.4.2 Total Least-Squares Estimation

The TLS solution explicitly recognizes that both the input matrix H and the sensor vector x may be contaminated by noise. Let H_true = H − N represent the noise-free input matrix, and let x_true = x − n represent the noise-free desired response vector. Here, we do not consider any possible relationships that might exist among the elements of H.

Such constraints may be incorporated into the TLS formulation, but this will result in a considerable increase in the complexity of the solution. The TLS procedure attempts to estimate both the noise matrix N and the noise vector n so as to satisfy an exact solution of the system of linear equations:

$(H - N)\, s_{\rm TLS} = (x - n).$   (2.86)


Fig. 2.4 Straight-line fits for the five points marked by 'x' obtained using: (a) LS (L_2-norm), (b) TLS, (c) DLS, (d) L_1-norm, (e) L_∞-norm, and (f) combined results.

Generally, there may be many choices of N and n that satisfy (2.86). Among all such choices, we select N and n such that¹¹

$\big\| [\, n \;\; N \,] \big\|_F = \Big( \sum_{i=1}^{m} n_i^2 + \sum_{i=1}^{m} \sum_{j=1}^{n} n_{ij}^2 \Big)^{1/2}$   (2.87)

¹¹ We have assumed that the signals are real-valued. The extension to complex-valued data is straightforward.

is minimized, where n_ij is the (i, j)-th element of N and n_i is the i-th element of n. To solve the above problem, we rewrite (2.86) as

$\big( \bar{H} - [\, n \;\; N \,] \big) \begin{bmatrix} -1 \\ s_{\rm TLS} \end{bmatrix} = 0,$   (2.88)

where H̄ = [x, H]. In the above equation, 0 is an m-dimensional vector filled with all zeros.

In general, because of the noise, the augmented input matrix H̄ has full rank. If we assume that m > n + 1, the rank of H̄ is (n + 1).

The problem of finding n and N can be recast as that of finding the smallest perturbation of the augmented input matrix H̄ that results in a rank-n matrix.

Let us expand the matrix H̄ using the singular value decomposition as

$\bar{H} = \sum_{i=1}^{n+1} \sigma_i\, u_i v_i^T,$   (2.89)

where the σ_i's are the singular values of H̄, arranged in descending order of magnitude, and the u_i's and v_i's are, respectively, the left singular vectors containing m elements each and the right singular vectors containing (n + 1) elements each. We assume that we have selected the singular vectors such that they have unit length, and that the sets {u_i ; i = 1, 2, . . . , n+1} and {v_i ; i = 1, 2, . . . , n+1} contain orthogonal elements, so that u_i^T u_j = 0 and v_i^T v_j = 0 for i ≠ j. It is well known that the rank-n approximation of H̄ introducing the least amount of perturbation to its entries is given by [581]

$\hat{\bar{H}} = \sum_{i=1}^{n} \sigma_i\, u_i v_i^T.$   (2.90)

Moreover, the error matrix [ n N ] is given by

$[\, n \;\; N \,] = \sigma_{n+1}\, u_{n+1} v_{n+1}^T.$   (2.91)
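A minimal NumPy sketch of (2.90) and (2.91): the best rank-n approximation of the augmented matrix is obtained by discarding the term associated with the smallest singular value, and the discarded rank-one term is exactly the estimated error matrix [n N]. The function name is ours.

```python
import numpy as np

def best_rank_n_and_error(Hbar):
    """Split the m x (n+1) augmented matrix Hbar into its best rank-n
    approximation (Eq. 2.90) and the removed rank-one term [n N] (Eq. 2.91)."""
    U, S, Vt = np.linalg.svd(Hbar, full_matrices=False)
    Hbar_hat = U[:, :-1] @ np.diag(S[:-1]) @ Vt[:-1, :]   # drop sigma_{n+1}
    nN = S[-1] * np.outer(U[:, -1], Vt[-1, :])            # sigma_{n+1} u_{n+1} v_{n+1}^T
    return Hbar_hat, nN

# Hbar_hat + nN reconstructs Hbar up to rounding error:
# np.allclose(Hbar_hat + nN, Hbar)  ->  True
```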

Taking into account (2.88) and (2.90), we can write

$\hat{\bar{H}} \begin{bmatrix} -1 \\ s_{\rm TLS} \end{bmatrix} = 0.$   (2.92)

It can be verified that the vector

$\begin{bmatrix} -1 \\ s_{\rm TLS} \end{bmatrix} = -\frac{v_{n+1}}{v_{n+1,1}}$   (2.93)

(where v_{n+1,1} is the first non-zero entry of the right singular vector v_{n+1}) satisfies (2.92).

Thus, the total least-squares solution is described by the right singular vector corresponding to the smallest singular value of the augmented input matrix H̄. An efficient approach to computing this singular vector is to find the vector v that minimizes the cost function¹²

$J(v) = \frac{v^T \bar{H}^T \bar{H}\, v}{\|v\|_2^2},$   (2.94)

and then to normalize the resulting vector so that

$\begin{bmatrix} -1 \\ s_{\rm TLS} \end{bmatrix} = -\frac{v_{\rm opt}}{v_{{\rm opt},1}},$   (2.95)

where v_opt denotes the solution to the optimization problem in (2.94), and v_opt,1 is the first entry of v_opt. Simple calculations will show that the vector v minimizing J(v) is identical to the (n + 1)-th right singular vector of H̄, and choosing v = v_{n+1} will provide the minimum value of J(v), given by

$J(v_{\rm opt}) = \sigma_{n+1}^2.$   (2.96)

It is also straightforward to show that the optimum choice of v corresponds to the eigenvector associated with the smallest eigenvalue of the matrix H̄^T H̄. Thus, a numerical method based on the SVD or on minor component analysis (MCA) can be applied for finding the TLS estimate of the coefficients. The above derivation assumes that the smallest singular value of the augmented data matrix H̄ is unique. If this is not the case, the TLS problem has an infinite number of solutions. To uniquely define s_TLS in such situations, usually the solution for which ‖s_TLS‖_2 is the smallest among all the possibilities is chosen.
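The batch TLS solution can therefore be sketched in a few lines; the routine below (our naming) takes the eigenvector of H̄^T H̄ associated with its smallest eigenvalue, i.e. the minor component that MCA-based methods estimate adaptively, and assumes that the smallest singular value is unique and that the first entry of the minor eigenvector is non-zero.

```python
import numpy as np

def tls_solve(H, x):
    """Batch TLS estimate of s in H s ~ x via the minor eigenvector of Hbar^T Hbar."""
    Hbar = np.column_stack([x, H])                # augmented data matrix [x, H]
    eigvals, eigvecs = np.linalg.eigh(Hbar.T @ Hbar)
    v = eigvecs[:, 0]                             # eigenvector of the smallest eigenvalue
    s_tls = -v[1:] / v[0]                         # scale so [-1, s^T]^T is proportional to v
    J_min = eigvals[0]                            # minimum of (2.94); equals sigma_{n+1}^2, cf. (2.96)
    return s_tls, J_min
```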

When the noise in the entries of the augmented data matrix H̄ belongs to independent and identically distributed (i.i.d.) Gaussian processes with zero mean, it can be shown that the TLS solution obtained by minimizing the cost function in (2.94) is the maximum likelihood estimate of the coefficient vector. When the noise sequences satisfy the i.i.d. condition, the standard TLS estimate is unbiased. In other words, we obtain an unbiased estimate if the noise variances of the vector x (sensor signals) and of the data matrix H are the same, i.e., σ_n² = σ_N².

Consequently, even though the standard total least-squares approach described above results in unbiased estimates of the parameters of the system model, it does not necessarily provide a good estimate of the signals of interest. In order to obtain better estimates of the parameters, one may use an augmented data matrix H̄ with a number of columns n_0 ≫ n + 1 (where (n + 1) is the minimum number of columns needed by the TLS approach) and then approximate this matrix with a rank-n matrix. The noise matrix estimated using such an approximation has rank larger than one, and the estimate is therefore considered to be better than the one provided by the standard TLS algorithm described above.

¹² Scaling the coefficient vector by a scalar multiplier does not change the cost function. Consequently, we can also formulate this problem equivalently as that of minimizing $J(v) = v^T \bar{H}^T \bar{H}\, v$ subject to $\|v\|_2^2 = 1$.


2.4.3 Adaptive Generalized Total Least-Squares

The standard (ordinary) TLS is a method which gives an improved unbiased estimator only when the noise (errors) in the data matrix H and in the sensor vector x are i.i.d. and exhibit the same variance. However, in practice, the data matrix H and the observation vector x represent different physical quantities and are therefore usually subject to different noise or error levels. The generalized TLS (GTLS) problem deals with the case where the data errors Δh_ij = n_ij are i.i.d. with zero mean and variance σ_N² (i.e., R_NN = σ_N² I_n) and where the observation (sensor) vector components Δx_i = n_i are also i.i.d. with zero mean and variance σ_n² ≠ σ_N².

There are many situations in which the parameters of the underlying system in the estimation problem are time-varying, while the input signals are corrupted by uncorrelated noise. Even in the situations in which the characteristics of the operating environment do not vary over time, either adaptive or iterative solutions are often sought, because the singular value decomposition-based solutions tend to be computationally expensive and such methods usually do not exploit the special structures or sparseness of the system model to reduce the computational complexity. In this section, we discuss how the traditional adaptive filtering algorithms such as the least-mean-square (LMS) algorithm can be modified to account for the presence of the additive i.i.d. noise in both the sensor signals and the mixing (data) matrix.

Solving the generalized TLS problem consists in finding the vector s which minimizes [582, 294]

$\gamma\, \|\Delta H\|_F^2 + (1 - \gamma)\, \|\Delta x\|_F^2, \qquad \text{where} \quad \frac{1-\gamma}{\gamma} = \beta = \frac{\sigma_n^2}{\sigma_N^2},$   (2.97)

and ΔH and Δx refer to the perturbations of the matrix H and of the sensor vector x, respectively.

By changing the parameter γ in the range [0, 1], we obtain as special cases the standard LS, TLS and DLS problems. The parameter γ = 0 (β = ∞) yields the standard LS formulation, since in this case σ_N² = 0, whereas γ = 0.5 gives the standard TLS formulation, since then σ_N² = σ_n², and finally γ = 1 (β = 0) results in the DLS formulation with σ_n² = 0.

Let us first consider the standard mean square error cost function formulated as

$J_e(s) = E\{e^T(k)\, e(k)\},$   (2.98)

where the error vector e is defined as

$e(k) = x - H\, s(k) = (x_{\rm true} + n) - (H_{\rm true} + N)\, s(k),$   (2.99)

where H_true and x_true are the unknown true parameters. The cost function (2.98) can be evaluated as follows:

$J_e(s) = E\{(x_{\rm true} - H_{\rm true}\, s)^T (x_{\rm true} - H_{\rm true}\, s)\} + \sigma_n^2 + s^T R_{NN}\, s$
$\phantom{J_e(s)} = E\{(x_{\rm true} - H_{\rm true}\, s)^T (x_{\rm true} - H_{\rm true}\, s)\} + \sigma_N^2\, (\beta + s^T s)$   (2.100)

under the assumption that the noise components are uncorrelated, i.i.d., and R_NN = σ_N² I_n.

It is obvious that minimizing the cost function J_e(s) with respect to the vector s will yield a biased solution, since the noise terms are functions of s. To avoid this problem, we can use the modified mean square error cost function formulated as the generalized TLS problem, which can be further reformulated as the following optimization problem [282, 294]:

Minimize the cost function

$J(s) = \frac{1}{2}\, \frac{E\{e^T(k)\, e(k)\}}{\beta + s^T s} = \frac{1}{2}\, \frac{E\{(x_{\rm true} - H_{\rm true}\, s(k))^T (x_{\rm true} - H_{\rm true}\, s(k))\}}{\beta + s^T s} + \frac{\sigma_N^2}{2}.$   (2.101)

The above cost function removes the effect of the noise, assuming that the power ratio of the noise components β = σ_n²/σ_N² is known, since the last term in (2.101) is independent of s.

To derive the iterative adaptive algorithm, we represent the cost function as

$J(s) = \sum_{i=1}^{m} J_i(s),$   (2.102)

where

$J_i(s) = E\{\varepsilon^2(k)\} = \frac{1}{2}\, \frac{E\{e_i^2(k)\}}{\beta + s^T s}$

and e_i = x_i − h_i^T s (h_i^T denotes the i-th row of H).

Then the instantaneous gradient components can be evaluated as

$\frac{\partial J_i(s)}{\partial s} = -\frac{e_i(k)\, h_i}{\beta + s^T s} - \frac{e_i^2(k)\, s}{[\beta + s^T s]^2}.$   (2.103)
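The gradient (2.103) is easy to check numerically; the short sketch below (with arbitrary illustrative values) compares it against a central finite-difference approximation of J_i(s).

```python
import numpy as np

def J_i(s, h_i, x_i, beta):
    e = x_i - h_i @ s
    return 0.5 * e**2 / (beta + s @ s)

def grad_J_i(s, h_i, x_i, beta):                 # Eq. (2.103)
    d = beta + s @ s
    e = x_i - h_i @ s
    return -e * h_i / d - e**2 * s / d**2

rng = np.random.default_rng(1)
s, h_i, x_i, beta = rng.standard_normal(3), rng.standard_normal(3), 0.7, 2.0
eps, I = 1e-6, np.eye(3)
num_grad = np.array([(J_i(s + eps*I[j], h_i, x_i, beta) -
                      J_i(s - eps*I[j], h_i, x_i, beta)) / (2*eps) for j in range(3)])
print(np.max(np.abs(num_grad - grad_J_i(s, h_i, x_i, beta))))   # tiny, i.e. they agree
```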

Hence, the iterative discrete-time algorithm exploiting the gradient descent approach can be written as

$s(k+1) = s(k) + \eta(k)\, \tilde{e}_i(k)\, [\, h_i + \tilde{e}_i(k)\, s(k) \,], \qquad i = k \ \mathrm{modulo}\ (m+1),$   (2.104)

where

$\tilde{e}_i(k) = \frac{e_i(k)}{\beta + s^T(k)\, s(k)} = \frac{x_i - h_i^T s(k)}{\beta + s^T(k)\, s(k)}.$   (2.105)

Since the term (β + s^T(k) s(k))^{-1} is always positive, it can be absorbed by the positive learning rate, and the algorithm can therefore be represented in the simplified form

$s(k+1) = s(k) + \eta(k)\, e_i(k)\, [\, h_i + \tilde{e}_i(k)\, s(k) \,], \qquad i = k \ \mathrm{modulo}\ (m+1).$   (2.106)

Remark 2.7 It should be noted that the index i is taken modulo (m + 1), i.e., the rows h_i of the matrix H and the elements x_i of the vector x are selected and processed in a cyclic order. In other words, after the first m iterations, for the (m + 1)-th iteration we revert back to the first row of H and the first component of x. We continue with the (m + 2)-nd iteration using the second row of H and the second component of x, and so on, repeating the cycle every m iterations. Moreover, it is interesting to note that the above algorithm simplifies to the standard LMS algorithm when β = ∞, while it becomes the standard DLS algorithm for β = 0.
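A compact sketch of the adaptive GTLS recursion (2.105)-(2.106) with the cyclic row selection described in Remark 2.7; the learning rate, the number of sweeps and the initialization below are placeholders. For β → ∞ the update reduces to the LMS form, and for β = 0 to the DLS form (start from a non-zero s in that case to avoid division by zero).

```python
import numpy as np

def gtls_adaptive(H, x, beta, eta=0.01, n_sweeps=200, s0=None):
    """Iterative GTLS sketch based on Eqs. (2.105)-(2.106).
    beta = sigma_n^2 / sigma_N^2; rows h_i and entries x_i are processed cyclically."""
    m, n = H.shape
    s = np.zeros(n) if s0 is None else np.asarray(s0, dtype=float).copy()
    for _ in range(n_sweeps):
        for i in range(m):                           # cyclic selection of (h_i, x_i)
            h_i = H[i]
            e_i = x[i] - h_i @ s                     # instantaneous error
            e_tilde = e_i / (beta + s @ s)           # Eq. (2.105)
            s = s + eta * e_i * (h_i + e_tilde * s)  # Eq. (2.106)
    return s
```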

Using the concept of component averaging (say, for a block of all indices i in one iteration cycle) and by applying self-normalization as in the Kaczmarz or NLMS algorithms, we can easily derive a novel GTLS iterative formula for the sparse matrix H as

s(k+ 1) =s(k) +η(k)

where 0 < η(k) < 2 is the normalized learning rate (relaxation parameter) and r_j is the number of non-zero elements h_ij of the column j.

2.4.4 Extended TLS for Correlated Noise Statistics

We noted earlier that the TLS solution is unbiased when the noise in H̄ belongs to an i.i.d. process. When the noise is i.i.d. and Gaussian-distributed, this is also the maximum likelihood solution. Unfortunately, when the noise is correlated, the estimates are no longer guaranteed to be unbiased. In such a case a modification is needed to make the TLS approach useful, especially when the input signal is corrupted by non-i.i.d. noise sequences.

For the purpose of the derivation, we will assume that the noise samples are Gaussian distributed with zero mean. If the noise is non-Gaussian, the procedure described here will still result in unbiased estimates of the system model parameters; however, these estimates will no longer satisfy the maximum likelihood property. Let the statistical expectation of the product of the augmented input matrix H̄ with its own transpose be given by [827]

$E\{\bar{H}^T \bar{H}\} = R_{\bar{H}\bar{H}} + \bar{R}_{NN},$

where $R_{\bar{H}\bar{H}}$ and $\bar{R}_{NN}$ respectively represent the autocorrelation matrices of the noise-free signal component and of the zero-mean noise component. We have assumed that the two components are uncorrelated with each other. Let H̃ denote the transformation of the augmented input matrix given by

$\tilde{H} = \bar{H}\, \bar{R}_{NN}^{-1/2}.$   (2.111)

The (scaled) autocorrelation matrix of H̃ is then

$E\{\tilde{H}^T \tilde{H}\} = \bar{R}_{NN}^{-1/2}\, R_{\bar{H}\bar{H}}\, \bar{R}_{NN}^{-1/2} + I.$   (2.112)

Thus, the transformed input matrix is corrupted by a noise process that is i.i.d. Gaussian with zero mean. Therefore, the maximum likelihood estimate s̃_ETLS of the coefficient vector for the transformed input matrix is given by the solution of the optimization problem

$\min_{\tilde{s}}\; J(\tilde{s}) = \frac{\tilde{s}^T \tilde{H}^T \tilde{H}\, \tilde{s}}{\tilde{s}^T \tilde{s}}.$   (2.113)

Obviously, this solution is unbiased. Let ŝ_ETLS denote the coefficient vector for H̄, obtained by appropriately transforming the optimal solution s̃_ETLS of (2.113). Since

$\tilde{H}\, \tilde{s}_{\rm ETLS} = \bar{H}\, \bar{R}_{NN}^{-1/2}\, \tilde{s}_{\rm ETLS},$   (2.114)

we conclude that ŝ_ETLS, which is the optimal solution for the correlated noise problem, is related to s̃_ETLS through the transformation

$\hat{s}_{\rm ETLS} = \bar{R}_{NN}^{-1/2}\, \tilde{s}_{\rm ETLS}.$   (2.115)

The solution associated with the augmented regression vector [x_i, h_i^T]^T is obtained by an appropriate scaling of ŝ_ETLS and is given as

$\begin{bmatrix} -1 \\ s_{\rm ETLS} \end{bmatrix} = -\frac{\hat{s}_{\rm ETLS}}{\hat{s}_{{\rm ETLS},1}},$   (2.116)

where ŝ_ETLS,1 is the first non-zero element of ŝ_ETLS. Since scaling the solution vector does not change the cost function, after substituting (2.115) and (2.116) into (2.113) we can state the optimization problem for the extended TLS approach as [827]

$\min_{s}\; J(s) = \frac{\begin{bmatrix} -1 \\ s \end{bmatrix}^T \bar{H}^T \bar{H} \begin{bmatrix} -1 \\ s \end{bmatrix}}{\begin{bmatrix} -1 \\ s \end{bmatrix}^T \bar{R}_{NN} \begin{bmatrix} -1 \\ s \end{bmatrix}}.$   (2.117)

The solution to the above optimization problem is given by the generalized eigenvector¹³ corresponding to the smallest generalized eigenvalue of the matrix pencil $(\bar{H}^T \bar{H},\; \bar{R}_{NN})$.

¹³ Given two square matrices G and H, the generalized eigenvector of the matrix pencil (G, H) is a vector v that satisfies the equality G v = λ H v, where the constant λ is known as a generalized eigenvalue. Assuming that the inverse of the matrix H exists, the generalized eigenvectors of the matrix pencil (G, H) are the eigenvectors of H^{-1} G.
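For a positive-definite R̄_NN, the whitening recipe (2.111)-(2.116) can be sketched directly, using an eigendecomposition to form R̄_NN^{-1/2}; for a singular R̄_NN the generalized eigenproblem of the pencil (H̄^T H̄, R̄_NN) has to be handled instead, e.g. with scipy.linalg.eig. Function and variable names below are ours.

```python
import numpy as np

def etls_solve(H, x, R_NN):
    """Extended TLS sketch: whiten the augmented matrix with R_NN^{-1/2} (Eq. 2.111),
    take the minor right singular vector (Eq. 2.113), and transform back (Eqs. 2.115-2.116).
    R_NN is the (n+1) x (n+1) noise correlation matrix of [x_i, h_i^T]^T, assumed positive definite."""
    Hbar = np.column_stack([x, H])
    w, V = np.linalg.eigh(R_NN)
    W = V @ np.diag(w ** -0.5) @ V.T          # R_NN^{-1/2}
    _, _, Vt = np.linalg.svd(Hbar @ W)        # whitened augmented matrix, Eq. (2.111)
    s_tilde = Vt[-1]                          # minimizer of Eq. (2.113)
    s_hat = W @ s_tilde                       # Eq. (2.115)
    return -s_hat[1:] / s_hat[0]              # Eq. (2.116): rescale so the first entry is -1
```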

2.4.4.1 Choice of R̄_NN in Some Practical Situations

The coefficient estimate obtained by solving the optimization problem of (2.117) can be shown to be unbiased for any noise correlation matrix R̄_NN. However, we need a priori knowledge of the noise correlation matrix in order to implement the procedure. Fortunately, we need to use only a scaled version of the noise correlation matrix, since the minimization of the cost function

$J(s) = \frac{\begin{bmatrix} -1 \\ s \end{bmatrix}^T \bar{H}^T \bar{H} \begin{bmatrix} -1 \\ s \end{bmatrix}}{\begin{bmatrix} -1 \\ s \end{bmatrix}^T c\, \bar{R}_{NN} \begin{bmatrix} -1 \\ s \end{bmatrix}},$   (2.118)

where c is a positive scalar constant, gives the same solution as the optimization problem in (2.117). There are many situations in which we can provide an estimate of such a scaled noise correlation matrix. Some of these situations are:

Uncorrelated Noise in the Input Signals: In many estimation problems, the matrix H (or, equivalently, each of its column vectors h_i) is corrupted by additive i.i.d. noise with variance σ_N², and the vector of sensor signals x is contaminated by independent noise with variance σ_n². In such cases, we can use

$c\, \bar{R}_{NN} = {\rm diag}\{\beta, 1, 1, \ldots, 1\},$   (2.119)

where β = σ_n²/σ_N² is the ratio of the variances of the noise sequences associated with x_i and h_i.

Data Least-Squares (DLS) Problem: In a variety of the estimation problems that belong to this class, the sensor signal x_i contains no noise and the noise in h_i can be reliably modelled as i.i.d. An appropriate choice of c R̄_NN in this case can be

$c\, \bar{R}_{NN} = {\rm diag}\{0, 1, 1, \ldots, 1\}.$   (2.120)

Least-Squares (LS) Problems: In this situation, we assume that the regression vector h_i is noise-free and that only x_i is corrupted by noise. Then, choosing

$c\, \bar{R}_{NN} = {\rm diag}\{1, 0, 0, \ldots, 0\}$   (2.121)

results in the standard least-squares problem formulation.
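The three choices above can be written down directly; the small sketch below (with an illustrative n and β) constructs the scaled matrices c R̄_NN of (2.119)-(2.121). The positive-definite choice (2.119) can be passed straight to the whitening-based routine etls_solve sketched earlier, whereas the singular DLS and LS choices call for the generalized eigenproblem of the pencil (H̄^T H̄, c R̄_NN).

```python
import numpy as np

n = 3                                   # number of unknown parameters (illustrative)
beta = 0.5                              # assumed known ratio sigma_n^2 / sigma_N^2

cR_tls = np.diag([beta] + [1.0] * n)    # Eq. (2.119): noise in both x_i and h_i
cR_dls = np.diag([0.0] + [1.0] * n)     # Eq. (2.120): noise only in h_i  (DLS)
cR_ls  = np.diag([1.0] + [0.0] * n)     # Eq. (2.121): noise only in x_i  (LS)

# e.g. s = etls_solve(H, x, cR_tls)  for the positive-definite TLS-type choice
```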
