2.3 LEAST ABSOLUTE DEVIATION (1-NORM) SOLUTION OF SYSTEMS OF LINEAR EQUATIONS

The use of the minimum 1-norm or least absolute deviation (LAD) criterion can often provide a useful alternative to the minimum 2-norm (least-squares) or the infinity-norm (Chebyshev) solution of a system of linear and nonlinear equations, especially in signal processing applications [282]. These applications span the areas of deconvolution, state space estimation, inversion and parameter estimation. The LAD solutions of systems of linear algebraic equations have certain properties not shared by the ordinary least-squares solutions, such as:

(1) The minimum 1-norm solution of an overdetermined system of linear equations always exists, although it is not necessarily unique, in contrast to the minimum 2-norm solution, which is always unique when the matrix H has full rank.

(2) The minimum 1-norm solutions are robust to outliers, that is, the solution is resistant (insensitive) to some large changes in the data. This is an extremely useful property when the data are known to be contaminated by occasional “wild points” (outliers) or spiky noise.

(3) For fitting a number of data points by a constant, the 1-norm estimate can be interpreted as the median, while the 2-norm estimate is interpreted as the mean.

(4) The minimum 1-norm solutions are in general sparse, in the sense that they have a small number of non-zero components in the underdetermined case (see Section 2.5).

(5) Minimum 1-norm problems are equivalent to linear programming problems and vice versa: a linear programming problem may also be formulated as a minimum 1-norm problem (see the reformulation sketched below), while linear least-squares problems can be considered as a special case of the quadratic programming problem [282].
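As a simple illustration of this equivalence (a standard reformulation, stated here only for orientation and not taken from the original derivation), the overdetermined LAD problem min_s \|x - H s\|_1 can be rewritten as a linear program by introducing auxiliary variables t_i that bound the residual magnitudes:

\begin{aligned}
\min_{s,\,t}\ \ & \sum_{i=1}^{m} t_i \\
\text{subject to}\ \ & -t_i \;\le\; x_i - \sum_{j=1}^{n} h_{ij} s_j \;\le\; t_i, \qquad i = 1, 2, \ldots, m .
\end{aligned}

At the optimum each t_i equals |e_i(s)|, so both problems share the same minimizers s.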

2.3.1 Neural Network Architectures Using a Smooth Approximation and Regularization

The most straightforward approach to the least absolute value problem is to approximate the absolute value functions |e_i(s)| (i = 1, 2, ..., m) by smooth differentiable functions, for example

J_\rho(s) = \sum_{i=1}^{m} \rho[e_i(s)],   (2.73)

where

\rho[e_i(s)] = \frac{1}{\gamma} \log(\cosh(\gamma e_i(s))),   (2.74)

with γ ≫ 1. To ensure a good approximation, the coefficient γ must be sufficiently large (although its actual value is not critical). Applying the gradient approach, we obtain the associated system of differential equations

\frac{ds_j}{dt} = \mu_j \sum_{i=1}^{m} h_{ij} \Psi_i[e_i(s)],   (2.75)

where

\Psi_i[e_i(s)] = \frac{\partial \rho[e_i(s)]}{\partial e_i(s)} = \tanh(\gamma e_i(s)),   (2.76)

e_i(s) = x_i - \sum_{j=1}^{n} h_{ij} s_j, \qquad \mu_j = \frac{1}{\tau_j} > 0.

It should be noted that, for a large value of the gain parameter γ (typically γ > 50), the sigmoidal activation function Ψ_i[e_i(s)] approximates the sign (hard-limiter) function quite well, and such a network is able to find a solution which approaches the minimum 1-norm solution as γ → ∞. On the other hand, for small values of γ (typically 0.1 < γ < 1) the activation function is almost linear over a wide range, and the network approximately solves the least-squares (minimum 2-norm) problem.
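As an illustrative sketch (not code from the book), the gradient system (2.75)-(2.76) can be simulated by simple Euler integration. The matrix H, the observation vector x, the step size eta and the gain gamma below are arbitrary choices made only to demonstrate the robustness of the smoothed 1-norm criterion to a single outlier.

```python
import numpy as np

def lad_logcosh(H, x, gamma=50.0, eta=2e-4, n_iter=50000):
    """Approximate minimum 1-norm solution of H s = x via log-cosh smoothing (Eqs. 2.73-2.76)."""
    m, n = H.shape
    s = np.zeros(n)
    for _ in range(n_iter):
        e = x - H @ s                          # residuals e_i(s) = x_i - sum_j h_ij s_j
        s += eta * (H.T @ np.tanh(gamma * e))  # Euler step of ds_j/dt in Eq. (2.75)
    return s

# Illustrative data: a 20 x 3 system with one "wild point" in x
rng = np.random.default_rng(0)
H = rng.standard_normal((20, 3))
x = H @ np.array([1.0, -2.0, 0.5])
x[3] += 5.0                                    # a single outlier in the observations
print(lad_logcosh(H, x))                       # should stay close to [1, -2, 0.5]
```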

In fact, by controlling the gain parameter γ (i.e., by changing its value over a wide range, say 0.1 < γ < 1000), the network is able to solve a system of linear equations in the minimum p-norm sense with 1 < p < 2 (see footnote 10). However, in order to achieve the exact 1-norm solution, it is necessary that the gain γ approaches infinity. Unfortunately, a large value of the parameter γ is difficult to control, and this is inconvenient from a practical implementation point of view, since an infinite gain is in fact often responsible for various parasitic effects, such as parasitic oscillations which can decrease the final accuracy.

Footnote 10: Strictly speaking, the activation function for the p-norm problem is given as Ψ_i[e_i(s)] = |e_i(s)|^{p-1} sign[e_i(s)] (1 ≤ p ≤ ∞).


Fig. 2.2 Detailed architecture of the Amari-Hopfield continuous-time (analog) model of a recurrent neural network with regularization.

To avoid this problem (especially for ill-conditioned problems), we have developed a modified cost function with regularization [282]

J(s, \alpha) = \sum_{i=1}^{m} \rho[e_i(s)] + \sum_{j=1}^{n} \alpha_j \vartheta(s_j),   (2.77)

where \vartheta(s) is a convex function, typically \vartheta(s) = s^2. The minimization of the above energy function leads to the set of differential equations

\frac{ds_j}{dt} = \mu_j \left( \sum_{i=1}^{m} h_{ij} \Psi_i[e_i(s)] - \alpha_j \psi(s_j) \right), \qquad \psi(s_j) = \frac{d\vartheta(s_j)}{ds_j},   (2.78)

on the basis of which we can easily realize an appropriate neural network, called the Amari-Hopfield network, illustrated in Fig. 2.2. Such a network will force the residuals e_i(s) with the smallest absolute values to tend to zero, while other residuals with large values (corresponding to outliers) will be inhibited (suppressed). The algorithm given by Eq. (2.78) makes it possible to obtain an approximate minimum 1-norm solution, even when the gain parameter γ has a relatively low value (typically γ > 80) [282].

2.3.2 Neural Network Model for LAD Problem Exploiting Inhibition Principles

Inhibition plays an important role in self-regulatory control in biologically plausible neural networks and also in many artificial neural networks (ANN), mainly in various decision-making and selection tasks [282]. The extremal form of inhibition is the Winner-Take-All (WTA) function. In general, the function of an inhibition sub-network is to suppress some signals (e.g., the strongest signals) while allowing other signals to be transmitted for further processing. Our goal in this section is to employ this mechanism explicitly for solving the LAD problem [282].

For an overdetermined linear system of equations (cf. Eq. (2.3)), in general, it is impossible to satisfy all the m equations exactly, but there may exist a point s in the n-space satisfying at least n of the equations if the matrix H is of full rank. This can be formulated by the following Theorem [282]:

Theorem 2.2 There exists a minimizer s_LAD ∈ IR^n of the energy function J_1(s) = \|e\|_1 = \sum_{i=1}^{m} |e_i(s)| (with e = x - H s and m > n) for which the residuals e_i(s) = 0 for at least n values of i, say i_1, i_2, \ldots, i_n, where n denotes the rank of the matrix H.

From this theorem, it follows that the minimum 1-norm solution s_LAD of the overdetermined m × n system interpolates at least n of the m (m > n) observed or measured data points, assuming H is of full rank. We can say that the minimum 1-norm solution is the median solution. On the other hand, the ordinary minimum 2-norm LS solution is the mean solution, since it tries to satisfy most of the equations in the set, but usually does not solve any of them exactly. However, in the special case that the matrix H is nonsingular (for m = n), all the n equations are solved exactly with residuals equal to zero. From this analysis, it follows that by using the ordinary least-squares technique, we can solve the LAD problem iteratively in two stages. In the first stage, we compute all the residuals e_i(s_LS) (i = 1, 2, . . . , m) and select, from the set of m equations, only the n equations corresponding to the n residuals which are the smallest in absolute value; the rest of the equations are ignored (or inhibited). In the second stage, we use the reduced set of determined equations to estimate the vector s_LAD. The algorithm is summarized as follows:

Algorithm Outline: LAD Solution Using Multi-stage LS Procedure

Step 1. Compute the LS solution as

s_LS = (H^T H)^{-1} H^T x   (2.79)

and, on the basis of the residual vector e(s) = x - H s_LS, select the reduced set of equations corresponding to the smallest residuals in modulus. After eliminating the (m - n)


equations, we obtain the reduced set of determined equations in the form H_r s = x_r, where H_r is a nonsingular n × n reduced matrix and x_r ∈ IR^n is a reduced sensor vector.

Step 2. Compute the LAD solution as

s_LAD = H_r^{-1} x_r.   (2.80)
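For readers who want to experiment numerically, the two-step procedure above can be written in a few lines of numpy. The following is only a sketch (the function name lad_two_stage is ours, not from the text) and assumes that H has full column rank, so that the reduced n × n system is nonsingular.

```python
import numpy as np

def lad_two_stage(H, x):
    """Approximate LAD solution of an overdetermined system H s = x (m > n)."""
    m, n = H.shape
    s_ls, *_ = np.linalg.lstsq(H, x, rcond=None)  # Step 1: LS solution, Eq. (2.79)
    e = x - H @ s_ls                              # residual vector e(s) = x - H s_LS
    keep = np.argsort(np.abs(e))[:n]              # keep the n smallest residuals in modulus
    H_r, x_r = H[keep], x[keep]                   # reduced determined system H_r s = x_r
    return np.linalg.solve(H_r, x_r)              # Step 2: s_LAD = H_r^{-1} x_r, Eq. (2.80)
```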

Example 2.1 Let us consider the following ill-conditioned LAD problem:

Minimize J(s, H) = \|x - H s\|_1, where

H^T =
[ 1.1   4.0   7.0  −3.1   5.5
  2.0   5.1   8.0  −4.0   6.9
  3.1   6.0   8.9  −5.0   8.1 ]

and x = [−2  1.1  4.2  0.5  2.1]^T.

In the first step, we obtain the LS (2-norm) solution and the corresponding vector of residuals as

s_2 = (H^T H)^{-1} H^T x = [1.6673  1.4370  2.1198]^T,

e_2 = x - H s_2 = [0.1368  0.1795  0.1015  0.1821  0.1844]^T.

Since the residuals corresponding to the fourth and fifth rows have the largest amplitudes, we can remove them and compute the optimal LAD solution as

s_1 = H_r^{-1} x_r = [2  1  2]^T,   (2.81)

where x_r = [−2  1.1  4.2]^T and

H_r =
[ 1.1   2     3.1
  4     5.1   6
  7     8     8.9 ].

Let us now consider a more difficult example with some ambiguities.

Example 2.2 Let us consider the following LAD problem:

Minimize J(s) = \|x - H s\|_1, where

H^T =
[ 1   1   1   1   1
  1   2   3   4   5 ]

and x = [1  1  2  3  3]^T.

It is impossible to find the LAD solution explicitly in one step. In the first step, we obtain the LS (2-norm) solution and the error vector (residuals) as

s_2 = (H^T H)^{-1} H^T x = [0.2  0.6]^T,

e_2 = x - H s_2 = [0.2  −0.4  0  0.4  −0.2]^T.

Since the residuals corresponding to the second and fourth rows have the same largest amplitude, and the remaining residuals of the first and fifth rows are also tied in modulus, we remove only the second and the fourth rows and solve the reduced (still overdetermined) system in the least-squares sense. The optimal solution is then computed as

s_1 = (H_r^T H_r)^{-1} H_r^T x_r = [0.5  0.5]^T,   (2.82)

where x_r = [1  2  3]^T and

H_r =
[ 1   1
  1   3
  1   5 ].
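The numbers in Example 2.2 are easy to reproduce with numpy; in the short check below (our own sketch, not from the text) the index set keep = [0, 2, 4] corresponds to the first, third and fifth equations that remain after the second and fourth rows are removed.

```python
import numpy as np

H = np.column_stack([np.ones(5), np.arange(1.0, 6.0)])  # H^T = [[1 1 1 1 1], [1 2 3 4 5]]
x = np.array([1.0, 1.0, 2.0, 3.0, 3.0])

s2, *_ = np.linalg.lstsq(H, x, rcond=None)    # LS solution: approximately [0.2, 0.6]
e2 = x - H @ s2                               # residuals: approximately [0.2, -0.4, 0, 0.4, -0.2]

keep = [0, 2, 4]                              # drop the second and fourth equations
Hr, xr = H[keep], x[keep]
s1, *_ = np.linalg.lstsq(Hr, xr, rcond=None)  # reduced LS solution: approximately [0.5, 0.5], Eq. (2.82)
print(s2, e2, s1)
```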

The above simple algorithm can easily be implemented on-line by the Amari-Hopfield neural network shown in Fig. 2.2, with the adaptive activation functions Ψ_i(e_i) (taking one of two forms: e_i or 0). In the first phase of the computation, all the activation functions are Ψ_i(e_i) = e_i, thus both the LS estimate s_LS and the corresponding residuals e(s) = x - H s_LS are simultaneously and automatically estimated by the neural network.

In the second phase of the computation, the inhibition control circuit (not explicitly shown in Fig. 2.2) selects the (m - n) residuals e_p(s_LS) with the largest modulus from the set of all m residuals e_i(s_LS) and inhibits the corresponding (m - n) hidden neurons by switching their activation functions to Ψ_p(e_p) = 0, allowing the smallest n residuals to be further processed in the network. In this way, in the second phase of the computation, only the n equations for which the residuals are minimized to zero are selected, while the rest of the equations are simply discarded.

The inhibition control subnetwork can be realized on the basis of the Winner-Take-All principle. Firstly, the circuit selects the largest signal |e_p(s_LS)|, which is immediately inhibited, and the corresponding switch is opened. Then, the procedure is repeated sequentially for the remaining signals |e_i(s)| until (m - n) residuals have been inhibited.

While both the LS and LAD formulations of the estimation problem and their solutions are commonly employed in practice, it is worth mentioning not only their usefulness, but also their limitations [282]. These techniques are able to provide unbiased estimates of the coefficient vector in an ergodic environment only when the entries of the matrix H are known precisely. In other words, an implicit assumption employed in the LS and LAD estimation procedures, and many of their variations, is that there is negligible noise or error contained in the data matrix H. Thus, until now, we have attributed all the uncertainty about the system parameters to the noise contained in the sensor signals x.

2.4 TOTAL LEAST-SQUARES AND DATA LEAST-SQUARES PROBLEMS
