Representation and Approximation in Vector Spaces
3.3 Error minimization via gradients
While the orthogonality theorem is used principally throughout this chapter as the geometrical basis for finding a minimum-error approximation under an induced norm, it is pedagogically worthwhile to consider another approach based on gradients, which reaffirms what we already know but demonstrates the use of some new tools.
Minimizing ‖e‖² for the induced norm requires minimizing

J(c) = ⟨x − Σᵢ cᵢ pᵢ, x − Σⱼ cⱼ pⱼ⟩.    (3.13)

Using the vector notations defined in (3.3), (3.6), and (3.7), we can write (3.13) as

J(c) = ⟨x, x⟩ − c^H p − p^H c + c^H R c,    (3.14)

where R is the Grammian and p is the cross-correlation vector.
Gradient formulas appropriate for this optimization are presented in section E.1.1 of appendix E. In particular, the following gradient formulas are derived:

∇_c (p^H c) = 0,    ∇_c (c^H p) = p,    ∇_c (c^H R c) = R c.
Taking the gradient of (3.14) using the last two of these, we obtain

∇_c J(c) = −p + R c.

Equating this result to zero we obtain

R c = p,

giving us again the normal equations.
To determine whether the extremum we have obtained by the gradient is in fact a minimum, we compute the gradient a second time. We have the Hessian matrix

∇²_c J(c) = R,

which is a positive-semidefinite matrix, so the extremum must be a minimum.
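As a concrete check of the gradient approach, the following sketch (with small hypothetical real data: two basis vectors p₁, p₂ and a data vector x, all invented for illustration) forms the Grammian R and cross-correlation p, solves the normal equations R c = p by Cramer's rule, and verifies that the gradient term R c − p vanishes at the solution. Exact rational arithmetic is used so the check is exact.

```python
from fractions import Fraction as F

# Hypothetical basis vectors and data vector (real case)
p1 = [F(1), F(0), F(1)]
p2 = [F(0), F(1), F(1)]
x  = [F(1), F(1), F(1)]

dot = lambda u, v: sum(a * b for a, b in zip(u, v))

# Grammian R and cross-correlation vector p
R = [[dot(p1, p1), dot(p2, p1)],
     [dot(p1, p2), dot(p2, p2)]]
p = [dot(x, p1), dot(x, p2)]

# Normal equations R c = p, solved by Cramer's rule for this 2x2 system
det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
c = [(p[0] * R[1][1] - p[1] * R[0][1]) / det,
     (R[0][0] * p[1] - R[1][0] * p[0]) / det]

# The gradient term R c - p is zero at the minimizing coefficients
grad = [R[0][0] * c[0] + R[0][1] * c[1] - p[0],
        R[1][0] * c[0] + R[1][1] * c[1] - p[1]]
print(c, grad)   # c = [2/3, 2/3], grad = [0, 0]
```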
Restricting attention for the moment to real variables, consider the plot of the norm of the error J(c) as a function of the variables c₁, c₂, ..., cₘ. Such a plot is called an error surface. Because J(c) is quadratic in c and R is positive semidefinite, the error surface is a parabolic bowl. Figure 3.3 illustrates such an error surface for two variables c₁ and c₂. Because of its parabolic shape, any extremum must be a minimum, and is in fact a global minimum.
Figure 3.3: An error surface for two variables
3.4 Matrix representations of least-squares problems
While vector space methods apply to both infinite- and finite-dimensional vectors (signals), the notational power of matrices can be applied when the basis vectors are finite dimensional.
The linear combination of the finite set of vectors {p₁, p₂, ..., pₘ} can be written as

x̂ = Σᵢ₌₁ᵐ cᵢ pᵢ.

This is the linear combination of the columns of the matrix A defined by

A = [p₁  p₂  ...  pₘ],

which we compute by x̂ = A c. The approximation problem can be stated as follows:
Determine c to minimize ‖e‖² in the problem

x = A c + e = x̂ + e.    (3.16)

The minimum ‖e‖² = ‖x − A c‖² occurs when e is orthogonal to each of the vectors pⱼ:

⟨x − A c, pⱼ⟩ = 0,   j = 1, 2, ..., m.

Stacking these orthogonality conditions, we obtain

[ p₁^H ]
[ p₂^H ] (x − A c) = 0.
[  ...  ]
[ pₘ^H ]

Recognizing that the stack of vectors is simply A^H, we obtain

A^H A c = A^H x.    (3.17)
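The orthogonality conditions above can be verified directly on a small example. The sketch below reuses hypothetical data (two real basis vectors as the columns of A, hand-solved coefficients c = [2/3, 2/3] for x = [1, 1, 1]) and checks that the residual e = x − A c has zero inner product with every basis vector.

```python
from fractions import Fraction as F

# Columns of A and the data vector (hypothetical example)
p1 = [F(1), F(0), F(1)]
p2 = [F(0), F(1), F(1)]
x  = [F(1), F(1), F(1)]

dot = lambda u, v: sum(a * b for a, b in zip(u, v))

# Solution of the normal equations A^H A c = A^H x for this data
c = [F(2, 3), F(2, 3)]

# Residual e = x - A c
e = [x[k] - c[0] * p1[k] - c[1] * p2[k] for k in range(3)]

# Orthogonality: <e, p_j> = 0 for each basis vector
print(dot(e, p1), dot(e, p2))   # 0 0
```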
The matrix A^H A is the Grammian R, and the vector A^H x is the cross-correlation p. We can write (3.17) as

R c = p.    (3.18)

These equations are the normal equations. Then the optimal (least-squares) coefficients are
c = (A^H A)⁻¹ A^H x = R⁻¹ p.    (3.19)

By theorem 3.1, A^H A is positive definite if the p₁, ..., pₘ are linearly independent. The matrix (A^H A)⁻¹ A^H is called a pseudoinverse of A, and is often denoted A†. More is said about pseudoinverses in section 4.9. While (3.19) provides an analytical prescription for the optimal coefficients, it should rarely be computed explicitly as shown, since many problems are numerically unstable (subject to amplification of roundoff errors). Numerical stability is discussed in section 4.10. Stable methods for computing pseudoinverses are discussed in sections 5.3 and 7.3. In MATLAB, the pseudoinverse may be computed using the command pinv.
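For a full-column-rank real A, the pseudoinverse formula (A^H A)⁻¹ A^H can be evaluated by hand on a small hypothetical 3×2 matrix, as in the Python sketch below; in practice a numerically stable routine (such as MATLAB's pinv) should be used instead of this explicit formula.

```python
from fractions import Fraction as F

# Hypothetical 3x2 matrix with linearly independent columns
A = [[F(1), F(0)], [F(0), F(1)], [F(1), F(1)]]

# A^T A and its explicit 2x2 inverse
at_a = [[sum(A[k][i] * A[k][j] for k in range(3)) for j in range(2)] for i in range(2)]
det = at_a[0][0] * at_a[1][1] - at_a[0][1] * at_a[1][0]
inv = [[at_a[1][1] / det, -at_a[0][1] / det],
       [-at_a[1][0] / det, at_a[0][0] / det]]

# Pseudoinverse A+ = (A^T A)^{-1} A^T, a 2x3 matrix
A_pinv = [[sum(inv[i][j] * A[k][j] for j in range(2)) for k in range(3)] for i in range(2)]

# For full column rank, A+ A = I (a defining property of the pseudoinverse)
ident = [[sum(A_pinv[i][k] * A[k][j] for k in range(3)) for j in range(2)] for i in range(2)]
print(ident)   # [[1, 0], [0, 1]]
```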
Using (3.19), the approximation is
The matnx P = A (AH A)-"AH is a projection matrix, which we encountered in section 2.13. The matrix P projects onto the range of A. Consider geometrically what is taking place: we wish to solve the equation Ac = x, but there is no exact solution, since x is not in the range of A. So we project x orthogonally down onto the range of A, and find the best solution in that range space. The idea is shown in figure 3.4
Figure 3.4: Projection solution
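The defining properties of this orthogonal projection, P² = P (projecting twice changes nothing) and P = P^H, can be confirmed on a small hypothetical real example:

```python
from fractions import Fraction as F

# Hypothetical 3x2 matrix with linearly independent columns
A = [[F(1), F(0)], [F(0), F(1)], [F(1), F(1)]]

# Inverse of A^T A, computed explicitly for this 2x2 case
at_a = [[sum(A[k][i] * A[k][j] for k in range(3)) for j in range(2)] for i in range(2)]
det = at_a[0][0] * at_a[1][1] - at_a[0][1] * at_a[1][0]
inv = [[at_a[1][1] / det, -at_a[0][1] / det],
       [-at_a[1][0] / det, at_a[0][0] / det]]

# Projection matrix P = A (A^T A)^{-1} A^T, a 3x3 matrix
P = [[sum(A[r][i] * inv[i][j] * A[s][j] for i in range(2) for j in range(2))
      for s in range(3)] for r in range(3)]

# Idempotence: P^2 = P
P2 = [[sum(P[r][k] * P[k][s] for k in range(3)) for s in range(3)] for r in range(3)]
print(P2 == P)   # True
```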
A useful representation of the Grammian R = A^H A can be obtained by considering A as a stack of rows,

A = [ q₁^H ]
    [ q₂^H ]
    [  ... ]
    [ q_N^H ]

so that A^H = [q₁  q₂  ...  q_N] and

R = A^H A = Σᵢ₌₁ᴺ qᵢ qᵢ^H.
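This outer-product representation is easy to check numerically. The sketch below (real-valued, hypothetical 3×2 example) accumulates Σᵢ qᵢ qᵢ^T over the rows of A and compares it with A^T A computed directly.

```python
# Hypothetical 3x2 real matrix; each row is q_i^T
A = [[1, 0], [0, 1], [1, 1]]

# Accumulate the sum of outer products q_i q_i^T
R = [[0, 0], [0, 0]]
for q in A:
    for i in range(2):
        for j in range(2):
            R[i][j] += q[i] * q[j]

# Direct computation of the Grammian A^T A
R_direct = [[sum(A[k][i] * A[k][j] for k in range(3)) for j in range(2)] for i in range(2)]
print(R, R == R_direct)   # [[2, 1], [1, 2]] True
```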
3.4.1 Weighted least-squares
A weight can also be applied to the data points, reflecting the confidence in the data, as illustrated by the next example. This is naturally incorporated into the inner product. Define a weighted inner product as

⟨x, y⟩_W = y^H W x.
Then minimizing ‖e‖²_W = ‖A c − x‖²_W leads to the weighted normal equations

A^H W A c = A^H W x,    (3.23)

so the coefficients which minimize the weighted squared error are

c = (A^H W A)⁻¹ A^H W x.    (3.24)
Another approach to (3.24) is to presume that we have a factorization of the weight W = S^H S (see section 5.2). Then we weight the equation:

S A c = S x.

Multiplying through by (S A)^H and solving for c, we obtain

c = ((S A)^H S A)⁻¹ (S A)^H S x,

which is equivalent to (3.24).
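The equivalence of the two routes can be checked on a small hypothetical example: a real 3×2 matrix A, diagonal weights w = (1, 1, 4) (the third data point trusted most), and the exact factor S = diag(1, 1, 2) with S^T S = W.

```python
from fractions import Fraction as F

# Hypothetical data and diagonal weights
A = [[F(1), F(0)], [F(0), F(1)], [F(1), F(1)]]
x = [F(1), F(1), F(1)]
w = [F(1), F(1), F(4)]

def solve_normal(B, y):
    """Solve B^T B c = B^T y for a real 3x2 B by Cramer's rule."""
    bt_b = [[sum(B[k][i] * B[k][j] for k in range(3)) for j in range(2)] for i in range(2)]
    bt_y = [sum(B[k][i] * y[k] for k in range(3)) for i in range(2)]
    det = bt_b[0][0] * bt_b[1][1] - bt_b[0][1] * bt_b[1][0]
    return [(bt_y[0] * bt_b[1][1] - bt_y[1] * bt_b[0][1]) / det,
            (bt_b[0][0] * bt_y[1] - bt_b[1][0] * bt_y[0]) / det]

# Direct route: weighted normal equations A^T W A c = A^T W x
atwa = [[sum(w[k] * A[k][i] * A[k][j] for k in range(3)) for j in range(2)] for i in range(2)]
atwx = [sum(w[k] * A[k][i] * x[k] for k in range(3)) for i in range(2)]
det = atwa[0][0] * atwa[1][1] - atwa[0][1] * atwa[1][0]
c_direct = [(atwx[0] * atwa[1][1] - atwx[1] * atwa[0][1]) / det,
            (atwa[0][0] * atwx[1] - atwa[1][0] * atwx[0]) / det]

# Factored route: W = S^T S with S = diag(1, 1, 2); ordinary LS on (SA, Sx)
s = [F(1), F(1), F(2)]
SA = [[s[k] * A[k][j] for j in range(2)] for k in range(3)]
Sx = [s[k] * x[k] for k in range(3)]
c_factored = solve_normal(SA, Sx)

print(c_direct, c_direct == c_factored)   # [5/9, 5/9] True
```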
3.4.2 Statistical properties of the least-squares estimate
The matrix least-squares solution (3.20) has some useful statistical properties. Suppose that the signal x has the true model according to the equation

x = A c₀ + e,    (3.25)

for some "true" model parameter vector c₀, and that we assume a statistical model for the model error e: assume that each component of e is a zero-mean, i.i.d. random variable with variance σ². The estimated parameter vector is

c = (A^H A)⁻¹ A^H x.    (3.26)
This least-squares estimate, being a function of the random vector x, is itself a random vector. We will determine the mean and covariance matrix for this random vector.

Mean of c. Substituting the "true" model of (3.25) into (3.26), we obtain

c = (A^H A)⁻¹ A^H A c₀ + (A^H A)⁻¹ A^H e = c₀ + (A^H A)⁻¹ A^H e.

If we now take the expected value of our estimated parameter vector, we obtain

E[c] = E[c₀ + (A^H A)⁻¹ A^H e] = c₀,

since each component of e has zero mean. Thus, the expected value of the estimate is equal to the true value. Such an estimate is said to be unbiased.
Covariance of c. The covariance can be written as

Cov[c] = E[(c − c₀)(c − c₀)^H] = (A^H A)⁻¹ A^H E[e e^H] A (A^H A)⁻¹.

Since the components of e are i.i.d., it follows that E[e e^H] = σ² I, so that

Cov[c] = σ² (A^H A)⁻¹ = σ² R⁻¹.
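These statistical properties can be checked by simulation. The Monte Carlo sketch below uses a hypothetical real model (3×2 matrix A, true parameters c₀ = (1, 2), Gaussian noise with σ = 0.5): it draws many noise realizations, forms the least-squares estimate each time, and compares the empirical mean with c₀ and the empirical variance of the first coefficient with the theoretical value σ²[(A^T A)⁻¹]₁₁ = 0.25 · 2/3.

```python
import random

# Hypothetical model: x = A c0 + e, with i.i.d. zero-mean Gaussian noise
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c0 = [1.0, 2.0]
sigma = 0.5
random.seed(0)

def lstsq(A, x):
    """Solve the normal equations A^T A c = A^T x for a real 3x2 A."""
    at_a = [[sum(A[k][i] * A[k][j] for k in range(3)) for j in range(2)] for i in range(2)]
    at_x = [sum(A[k][i] * x[k] for k in range(3)) for i in range(2)]
    det = at_a[0][0] * at_a[1][1] - at_a[0][1] * at_a[1][0]
    return [(at_x[0] * at_a[1][1] - at_x[1] * at_a[0][1]) / det,
            (at_a[0][0] * at_x[1] - at_a[1][0] * at_x[0]) / det]

# Estimate c over many independent noise realizations
trials = []
for _ in range(5000):
    e = [random.gauss(0.0, sigma) for _ in range(3)]
    x = [A[k][0] * c0[0] + A[k][1] * c0[1] + e[k] for k in range(3)]
    trials.append(lstsq(A, x))

# Empirical mean should approach c0 (unbiasedness) and the empirical
# variance of c[0] should approach sigma^2 * [(A^T A)^{-1}]_{11} = 1/6
mean = [sum(t[i] for t in trials) / len(trials) for i in range(2)]
var0 = sum((t[0] - mean[0]) ** 2 for t in trials) / len(trials)
print(mean, var0)
```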