
A method for minimizing a sum of squares of non-linear functions without calculating derivatives

By M. J. D. Powell (Applied Mathematics Group, Theoretical Physics Division, A.E.R.E., Harwell, Berks.)

The minimum of a sum of squares can often be found very efficiently by applying a generalization of the least squares method for solving overdetermined linear simultaneous equations. This paper describes and discusses an original method that has comparable convergence but, unlike the classical procedure, requires no derivatives. The number of times the individual terms of the sum of squares have to be calculated is approximately proportional to the number of variables. Finding a solution to a set of fifty non-linear equations in fifty unknowns required the left-hand sides of the equations to be worked out fewer than two hundred times.

1. Introduction

Although the problem of minimizing a sum of squares of non-linear functions occurs frequently in curve fitting, in determining physical parameters from experimental observations, and in solving non-linear simultaneous equations, which may be overdetermined, it appears that this field of research is practically neglected by numerical analysts. This is probably because of the undeniable efficacy of the "generalized least squares" method which, although well known, will be described in Section 2. Most users of this procedure are thankful that only first derivatives of the functions are required, particularly as in many applications the eventual convergence is quadratic. However, it will be shown that a procedure, which is similar to the well-known one, can be devised that has the same convergence properties but does not require any derivatives. In essence it approximates the derivatives by differences, but this is done in such a way that substantially more function evaluations are required only before the initial iteration. The new method is described in Section 3.

In Section 4, three important properties of the method are derived. They are (i) that it is unlikely to have converged before the minimum is reached, (ii) that, in a sense, it chooses conjugate directions of search, and (iii) that it yields a ready approximation to the variance-covariance matrix.

Some numerical examples are provided in Section 5. They include an illustration that the procedure to be described may be applicable to fitting problems in which the "best fit" is necessarily poor, although the fast convergence depends on the sum of squares tending to zero.

2. The generalized least squares method

It is required to find $x_1, x_2, \ldots, x_n$ ($\mathbf{x}$, say) to minimize

$$F(\mathbf{x}) = \sum_{k=1}^{m} \left[ f^{(k)}(\mathbf{x}) \right]^2, \qquad m \geqslant n. \tag{1}$$

It is hoped that using a superscript to distinguish the m different functions that appear in the sum of squares will not be found confusing; derivatives will be written out explicitly. In addition the notation

$$g_i^{(k)}(\mathbf{x}) = \frac{\partial f^{(k)}(\mathbf{x})}{\partial x_i} \tag{2}$$

and

$$G_{ij}^{(k)}(\mathbf{x}) = \frac{\partial^2 f^{(k)}(\mathbf{x})}{\partial x_i \, \partial x_j} \tag{3}$$

will be used.

The method is iterative, and an iteration requires an approximation $\xi$ to the position of the minimum. If the actual minimum is at $\xi + \delta$ then, by differentiating (1),

$$\sum_{k=1}^{m} f^{(k)}(\xi + \delta) \, g_i^{(k)}(\xi + \delta) = 0; \qquad i = 1, 2, \ldots, n. \tag{4}$$

By approximating the left-hand side of (4) by the first two terms of the Taylor series in $\delta$ about $\xi$, equation (5) is obtained.

$$\sum_{k=1}^{m} \left[ f^{(k)}(\xi) \, g_i^{(k)}(\xi) + \sum_{j=1}^{n} \delta_j \left\{ g_i^{(k)}(\xi) \, g_j^{(k)}(\xi) + f^{(k)}(\xi) \, G_{ij}^{(k)}(\xi) \right\} \right] = 0; \qquad i = 1, 2, \ldots, n. \tag{5}$$

The least squares method hinges on the further approximation that the term $G_{ij}^{(k)}(\xi) f^{(k)}(\xi)$ can be ignored.

This term is of order $\delta$ if $f^{(k)}(\xi)$ is zero at the minimum, and it vanishes if $f^{(k)}$ is linear in the variables. In all other cases the convergence of the procedure will be only linear, the correction to $\xi$ being calculated by solving

$$\sum_{j=1}^{n} \left\{ \sum_{k=1}^{m} g_i^{(k)}(\xi) \, g_j^{(k)}(\xi) \right\} \delta_j = - \sum_{k=1}^{m} f^{(k)}(\xi) \, g_i^{(k)}(\xi); \qquad i = 1, 2, \ldots, n. \tag{6}$$

Note that the matrix of these equations is in general positive definite, so that

$$\sum_{i=1}^{n} \delta_i \, \frac{\partial F(\xi)}{\partial x_i} < 0 \tag{7}$$

unless all the derivatives of $F(\mathbf{x})$ at $\xi$ are zero. Therefore, unless $\xi$ happens to be a stationary point of $F(\mathbf{x})$,


extending the iteration to calculate a positive value of $\lambda$, $\lambda_m$ say, which minimizes $F(\xi + \lambda \delta)$, provides a theoretical guarantee that the least squares method will converge. Of course $\xi + \lambda_m \delta$ is chosen as the new approximation to the minimum. The positive semi-definite case is discussed in Section 4.

If the second derivatives $G_{ij}^{(k)}(\xi)$ are not zero, the quadratic convergence depends on the functions $f^{(k)}(\xi)$ being of the same order as the correction $\delta$. In this case numerical estimates of the derivatives $g_i^{(k)}(\xi)$ in (6) that are in error by order $\delta$ are acceptable. On the other hand, if the second derivatives are zero, the functions are linear and difference estimates of the derivatives are exact. It is for these reasons that the procedure to be described has convergence comparable to the generalized least squares method.
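To make the classical iteration of this section concrete, the following sketch implements equations (6) with the line-search extension. It is an illustration in Python, not code from the paper; the function names and the crude halving line search are our own assumptions.

```python
import numpy as np

def gauss_newton(f, jac, xi, tol=1e-8, max_iter=100):
    """Sketch of the generalized least squares method of Section 2.

    f(x)   returns the m-vector of residuals f^(k)(x);
    jac(x) returns the m-by-n Jacobian, jac(x)[k, i] = g_i^(k)(x).
    """
    for _ in range(max_iter):
        r = f(xi)
        J = jac(xi)
        # Equations (6): (J^T J) delta = -J^T f, the normal equations.
        delta = np.linalg.solve(J.T @ J, -J.T @ r)
        # Extend the iteration by choosing lambda_m to reduce F(xi + lambda*delta);
        # a simple halving search stands in for an exact line minimization.
        lam, F0 = 1.0, r @ r
        while lam > 1e-12:
            r_new = f(xi + lam * delta)
            if r_new @ r_new < F0:
                break
            lam *= 0.5
        xi = xi + lam * delta
        if np.linalg.norm(lam * delta) < tol:
            break
    return xi
```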

3. The procedure without derivatives

The new method is iterative, and at the start of an iteration n linearly independent directions in the space of the variables, $d(1), d(2), \ldots, d(n)$ say, are required, together with estimates of the derivatives of the $f^{(k)}$ along the directions. The notation that will be used for the estimated derivative of the kth function along the ith direction is $\gamma^{(k)}(i)$, so

$$\gamma^{(k)}(i) \approx \sum_{j=1}^{n} g_j^{(k)}(\mathbf{x}) \, d_j(i); \qquad i = 1, 2, \ldots, n; \quad k = 1, 2, \ldots, m. \tag{8}$$

To equilibrate the matrix of equations (11), the directions should be scaled so that

$$\sum_{k=1}^{m} \left[ \gamma^{(k)}(i) \right]^2 = 1; \qquad i = 1, 2, \ldots, n. \tag{9}$$

It is intentional that the notation does not allow for the dependence of $\gamma^{(k)}(i)$ on $\mathbf{x}$, because the approximation to the derivative is a number which is calculated when $d(i)$ is chosen. If $\mathbf{x}$ is changed by $\delta$, the resultant error in $\gamma^{(k)}(i)$ will be of order $\delta$ multiplied by a second-derivative term, and it has been pointed out that this can be tolerated.

As in the least squares method, an approximation to the position of the minimum, $\xi$, is required and a correction to it, $\delta$, is calculated. The correction is worked out by substituting the approximate derivatives in (6) so, if

$$\delta = \sum_{i=1}^{n} q(i) \, d(i), \tag{10}$$

the numbers $q(i)$ are defined by the equations

$$\sum_{j=1}^{n} \left\{ \sum_{k=1}^{m} \gamma^{(k)}(i) \, \gamma^{(k)}(j) \right\} q(j) = - \sum_{k=1}^{m} f^{(k)}(\xi) \, \gamma^{(k)}(i); \qquad i = 1, 2, \ldots, n. \tag{11}$$

It is convenient to define

$$p(i) = \sum_{k=1}^{m} f^{(k)}(\xi) \, \gamma^{(k)}(i). \tag{12}$$
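As an illustration, here is a minimal sketch of equations (10)-(12) in Python. The array conventions (rows of `Gamma` holding the $\gamma(i)$, columns of `D` holding the $d(i)$) are our own.

```python
import numpy as np

def correction(Gamma, f_xi, D):
    """Compute the correction delta of equations (10)-(12).

    Gamma : n-by-m array, Gamma[i] holding the estimates gamma^(k)(i);
    f_xi  : m-vector of residuals f^(k)(xi);
    D     : n-by-n array whose columns D[:, i] are the directions d(i).
    """
    A = Gamma @ Gamma.T           # left-hand side matrix of equations (11)
    p = Gamma @ f_xi              # the numbers p(i) of equation (12)
    q = np.linalg.solve(A, -p)    # equations (11): A q = -p
    delta = D @ q                 # equation (10)
    return delta, q, p
```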

As recommended in Section 2, the iteration is extended to find $\lambda_m$ to minimize $F(\xi + \lambda \delta)$ but, because (8) is an approximation, $\lambda_m$ is not necessarily positive. The procedure described by Powell (1964) is used for finding the minimum along a line and, at the same time, estimates of the derivatives of the functions $f^{(k)}$ in the direction $\delta$ are worked out in the following way.

The function values $f^{(k)}(\xi + \lambda_1 \delta)$ and $f^{(k)}(\xi + \lambda_2 \delta)$, $k = 1, 2, \ldots, m$, which yield the lowest and next lowest values of $F(\xi + \lambda \delta)$, are noted. These are differenced to provide the approximation

$$\frac{f^{(k)}(\xi + \lambda_1 \delta) - f^{(k)}(\xi + \lambda_2 \delta)}{\lambda_1 - \lambda_2} = \mu^{(k)}(\delta). \tag{13}$$

The approximation is improved by

$$\nu^{(k)}(\delta) = \mu^{(k)}(\delta) - \theta f^{(k)}(\xi + \lambda_m \delta), \tag{14}$$

where

$$\theta = \sum_{k=1}^{m} \mu^{(k)}(\delta) \, f^{(k)}(\xi + \lambda_m \delta) \Big/ \sum_{k=1}^{m} \left[ f^{(k)}(\xi + \lambda_m \delta) \right]^2, \tag{15}$$

because it is known that the derivative of $F(\mathbf{x})$ along $\delta$ at $\xi + \lambda_m \delta$ must be zero. Finally $\nu^{(k)}(\delta)$ and $\delta$ are scaled so that the Euclidean norm of the derivative vector is unity, in accordance with (9).
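The update (13)-(15) in vector form, as a sketch; the names are ours, and the final rescaling demanded by (9) is indicated in the comments.

```python
import numpy as np

def derivatives_along_delta(f1, f2, lam1, lam2, f_min):
    """Estimate the derivatives of the f^(k) along delta, per (13)-(15).

    f1, f2 : residual m-vectors at xi + lam1*delta and xi + lam2*delta,
             the lowest and next lowest points found by the line search;
    f_min  : residual m-vector at the line minimum xi + lam_m*delta.
    """
    mu = (f1 - f2) / (lam1 - lam2)            # equation (13)
    theta = (mu @ f_min) / (f_min @ f_min)    # equation (15)
    nu = mu - theta * f_min                   # equation (14)
    # By construction f_min @ nu == 0 up to rounding, which is (19) below.
    # Per (9), gamma(t) = nu/|nu| is stored, and delta is divided by |nu|.
    return nu
```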

Of course the derivatives along $\delta$ have been calculated in order that $\delta$ may replace one of $d(1), d(2), \ldots, d(n)$. $d(t)$ is replaced, where $t$ is the integer such that

$$|p(t) \, q(t)| = \max_i |p(i) \, q(i)|. \tag{16}$$

$\xi + \lambda_m \delta$ replaces the original value of $\xi$, and then the next iteration may be commenced.
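A sketch of the bookkeeping at the end of an iteration, under the same array conventions as before; `nu_scaled` and `delta_scaled` are the vectors of (14) after the rescaling of (9).

```python
import numpy as np

def replace_direction(D, Gamma, p, q, delta_scaled, nu_scaled):
    """Replace d(t), with t chosen by equation (16)."""
    t = int(np.argmax(np.abs(p * q)))   # |p(t) q(t)| = max_i |p(i) q(i)|
    D[:, t] = delta_scaled              # the new search direction
    Gamma[t] = nu_scaled                # its derivative estimates, unit norm by (9)
    return t
```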

An important point to notice is that, apart from calculating function values, the most laborious stage of the iteration is solving the equations (11). After each iteration just one row and one column of the left-hand side matrix are changed so that, if the inverse of the old matrix is stored, that of the new can be worked out by partitioning in an $n^2$ rather than an $n^3$ process. The details of this calculation have been set out very clearly by Rosen (1960).
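The paper recommends Rosen's partitioning for this $n^2$ refresh of the inverse. As a substitute illustration of the same idea, the sketch below applies the Sherman-Morrison formula twice, since changing row t and column t of a symmetric matrix is a rank-two modification; this is not Rosen's procedure, only an $O(n^2)$ stand-in.

```python
import numpy as np

def sherman_morrison(Ainv, a, b):
    """Inverse of (A + a b^T), given A^{-1}, in O(n^2) operations."""
    Aa, bA = Ainv @ a, b @ Ainv
    return Ainv - np.outer(Aa, bA) / (1.0 + b @ Aa)

def refresh_inverse(Ainv, A_old, new_row, t):
    """Update A^{-1} after row t and column t of the symmetric matrix
    A are replaced by new_row (and its transpose)."""
    n = Ainv.shape[0]
    e = np.zeros(n); e[t] = 1.0
    u = new_row - A_old[t, :]            # the change in row t
    Ainv = sherman_morrison(Ainv, e, u)  # accounts for A + e_t u^T
    v = u.copy(); v[t] = 0.0             # column change; (t, t) entry done
    return sherman_morrison(Ainv, v, e)  # accounts for + v e_t^T
```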

For the first iteration, $d(1), d(2), \ldots, d(n)$ are chosen to be the coordinate directions. A starting value of $\xi$ has to be provided, and then values of $\gamma^{(k)}(i)$ must be worked out. This calculation requires increments $e_1, e_2, \ldots, e_n$ to be specified which will yield reasonable estimates of the first derivatives. They are calculated from

$$\gamma^{(k)}(i) = s_i \left[ f^{(k)}(\xi + e_i u(i)) - f^{(k)}(\xi) \right] / e_i, \tag{17}$$

where $u(i)$ is the $i$th coordinate vector and $s_i$ is a scaling factor, introduced so that (9) may be satisfied. In accordance with (17), for the first iteration,

$$d(i) = (0, 0, \ldots, 0, s_i, 0, \ldots, 0), \tag{18}$$

the only non-zero element being the $i$th component.
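The first-iteration setup of (17)-(18) as a sketch; the forward-difference form of (17) follows the reconstruction above and should be read as an assumption.

```python
import numpy as np

def initial_directions(f, xi, e):
    """First-iteration setup along the coordinate directions, (17)-(18).

    f : returns the m-vector of residuals; xi : starting point;
    e : the increments e_1, ..., e_n.
    """
    n, f0 = len(xi), f(xi)
    D = np.zeros((n, n))
    Gamma = np.zeros((n, len(f0)))
    for i in range(n):
        step = np.zeros(n); step[i] = e[i]
        slope = (f(xi + step) - f0) / e[i]   # forward-difference slopes
        s = 1.0 / np.linalg.norm(slope)      # scaling factor s_i for (9)
        Gamma[i] = s * slope                 # equation (17)
        D[i, i] = s                          # equation (18)
    return D, Gamma
```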


The obvious criterion for ultimate convergence has been found satisfactory, although it is probably not difficult to construct examples in which it fails. It is to stop iterating when both $\delta$ and $\lambda_m \delta$ have acceptably small components.

4. Properties of the method

For this discussion, it is convenient to introduce a notation that has purposely been avoided so far. It is to consider the numbers $f^{(1)}(\xi), f^{(2)}(\xi), \ldots, f^{(m)}(\xi)$ as the components of a vector $f(\xi)$. Similarly, $\gamma(i)$ will represent $\gamma^{(1)}(i), \gamma^{(2)}(i), \ldots, \gamma^{(m)}(i)$. Therefore, the effect of equation (14) is to ensure that

$$f(\xi + \lambda_m \delta) \cdot \nu(\delta) = 0, \tag{19}$$

and a necessary condition for $\xi$ to be an approximate minimum of $F(\mathbf{x})$ is that

$$f(\xi) \cdot \gamma(i) = 0; \qquad i = 1, 2, \ldots, n. \tag{20}$$

From (19), it may be seen that after an iteration

$$f(\xi) \cdot \gamma(t) = 0, \tag{21}$$

$\xi$ now being the new point $\xi + \lambda_m \delta$, so at least one of the equations (20) is satisfied. In consequence, on the next iteration, $p(t)$ is zero and, by (16), the direction just defined cannot be replaced until an iteration is started from a point different from the current $\xi$. Because (16) also ensures that $q(t)$ is not zero, the directions $d(1), d(2), \ldots, d(n)$ remain linearly independent, and these two facts are practically always sufficient to ensure that the procedure will not "stick."

The words "practically always" have been chosen because the possibility that \p(f).q(i)\ is zero for all i has not been considered. Unless the matrix of equations (11) is positive semi-definite, it can only occur if all the numbers p(i) are zero, which is the condition for con- vergence (20). Otherwise, in the semi-definite case, the numbers q(j) are not defined by the equations, and will be calculated to be the components of an eigenvector of the matrix, the eigenvalue being zero. Along such an eigenvector the derivative of each fw(x) is predicted to be zero, and a search along 8 will correct the predictions.

If the derivatives of the individual functions are in fact zero, the position of the minimum is poorly determined, and more functions are required to define it.

The linear approximations behind the least squares method are such that, to first order in $\delta$, equations (20) are satisfied at the predicted position of the minimum, $\xi + \delta$. Therefore, if $f(\xi) \cdot \gamma(i) = O(\delta^2)$,

$$f(\xi + \delta) \cdot \gamma(i) = O(\delta^2). \tag{22}$$

Hence, provided $\lambda_m$ is of order unity,

$$f(\xi + \lambda_m \delta) \cdot \gamma(i) = O(\delta^2) \tag{23}$$

and

$$\nu(\delta) \cdot \gamma(i) = O(\delta). \tag{24}$$

These equations show that the derivative vector of a new direction is, to order $\delta$, orthogonal to the derivative vectors of those directions that satisfy (20) at the initial point of the iteration. Therefore, remembering (9), the left-hand side matrix of equations (11) tends to the unit matrix. Furthermore, by (8), the successive directions $d$ that are chosen tend to be mutually conjugate (Powell, 1964) with respect to the matrix

$$\Gamma_{ij} = \sum_{k=1}^{m} g_i^{(k)}(\mathbf{x}) \, g_j^{(k)}(\mathbf{x}). \tag{25}$$

The last important property of the method is that it provides a ready approximation to the least squares variance-covariance matrix, $H$ say, which, by definition, is the inverse of $\Gamma$. It will now be proved that

$$H \approx \sum_{i=1}^{n} d(i) \, d^{T}(i). \tag{26}$$

For the proof, it is most convenient to use a matrix notation, and the following definitions are employed:

$$B_{ij} = g_i^{(j)}(\mathbf{x}); \qquad i = 1, 2, \ldots, n; \quad j = 1, 2, \ldots, m,$$
$$C_{ij} = \gamma^{(j)}(i); \qquad i = 1, 2, \ldots, n; \quad j = 1, 2, \ldots, m,$$
$$D_{ij} = d_i(j); \qquad i = 1, 2, \ldots, n; \quad j = 1, 2, \ldots, n.$$

Therefore, from (25),

$$\Gamma = B B^{T} \tag{27}$$

and it is required to prove that

$$\Gamma^{-1} \approx D D^{T}. \tag{28}$$

From (8),

$$C \approx D^{T} B \tag{29}$$

and it has just been observed that

$$C C^{T} \approx I, \tag{30}$$

so

$$D^{T} B B^{T} D \approx I. \tag{31}$$

$D$ has an inverse because the directions $d(i)$ are linearly independent; therefore, from (27) and (31),

$$\Gamma \approx (D^{T})^{-1} D^{-1} = (D D^{T})^{-1}, \tag{32}$$

which proves (28).

Note that, particularly if the procedure converges in less than $n$ iterations, (30) may be a very crude approximation. This is usually acceptable, because the elements of a variance-covariance matrix are not often required to high accuracy. If they are required more precisely, the following formula, derived from (27) and (29), may be used:

$$\Gamma^{-1} \approx D (C C^{T})^{-1} D^{T}. \tag{33}$$

The elements of $(C C^{T})^{-1}$ will have been calculated, if the recommendation of partitioning in Section 3 is heeded.
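The two covariance approximations side by side, as a sketch under the matrix conventions of this section; the function name is ours.

```python
import numpy as np

def covariance_estimates(D, C):
    """The variance-covariance approximations of equations (26)/(28) and (33).

    D : n-by-n array whose columns are the directions d(i);
    C : n-by-m array whose rows are the derivative estimates gamma(i).
    """
    H_crude = D @ D.T                             # equation (28)
    H_refined = D @ np.linalg.inv(C @ C.T) @ D.T  # equation (33); (C C^T)^{-1}
                                                  # is already held if the
                                                  # partitioned update is used
    return H_crude, H_refined
```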

5. Numerical examples

The method was tested using the trigonometrical equations introduced by Fletcher and Powell (1963).


Table 1
Number of function values to solve equations (34)

 n    LST. SQS.   NEW METHOD   DAVIDON   CONJ. DIRN.
 3        6           19          —           61
 3        5           18          19          84
 5        5           24          23          104
 5       10           24          36          103
10        5           38          29          329
10        8           34          89          369
20        6           46          84         1519
20        9           65          86         2206
30       12           75          92           —
30       10           61         169           —
50        —          173         119           —
50        —          101           —           —

They are

$$\sum_{j=1}^{n} \left( A_{kj} \sin x_j + B_{kj} \cos x_j \right) = E_k; \qquad k = 1, 2, \ldots, m, \tag{34}$$

and a solution is obtained by defining

$$f^{(k)}(\mathbf{x}) = \sum_{j=1}^{n} \left( A_{kj} \sin x_j + B_{kj} \cos x_j \right) - E_k. \tag{35}$$

The elements of $A$ and $B$ are random integers between $-100$ and $+100$, and the components of the initial value of $\xi$ differ from those of a known solution by up to $\pm 0.1\pi$. No difficulties were encountered in obtaining the expected answer from the initial approximation.

The number of function values required to calculate $x_1, x_2, \ldots, x_n$ to accuracy 0.0001 is given in Table 1 for values of $n$ ranging from 3 to 50. $m$ was chosen to be equal to $n$ so that a comparison could be made with the results of Fletcher and Powell (1963) and Powell (1964). The number of function values required by the least squares method, as described in Section 2, is also tabulated.
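For reference, the test problem can be set up as follows. This is a sketch; the function name, the random-number generator, and the choice of a known solution point are our own assumptions, consistent with the description above.

```python
import numpy as np

def trig_test_problem(n, m, seed=0):
    """Set up the test equations (34)-(35) of Fletcher and Powell (1963).

    A and B hold random integers in [-100, 100]; E is chosen so that a
    known point x_star solves the equations exactly; the start point is
    displaced from x_star by up to 0.1*pi in each component.
    """
    rng = np.random.default_rng(seed)
    A = rng.integers(-100, 101, size=(m, n)).astype(float)
    B = rng.integers(-100, 101, size=(m, n)).astype(float)
    x_star = rng.uniform(-np.pi, np.pi, size=n)
    E = A @ np.sin(x_star) + B @ np.cos(x_star)
    x0 = x_star + rng.uniform(-0.1 * np.pi, 0.1 * np.pi, size=n)

    def residuals(x):
        # equation (35)
        return A @ np.sin(x) + B @ np.cos(x) - E

    return residuals, x0, x_star
```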

In comparing the columns of the table, it must be remembered that each time function values are required by the least squares method, or by Davidon's method, all the first derivatives must be provided as well. Also, the methods of the last two columns are designed to minimize a general function, and take only $F(\mathbf{x})$ into account when choosing the directions of search. As well as the labour of calculating first derivatives, a further factor may be significant: the number of administrative operations in an iteration of each of the last three methods is of order $n^2$, while solving equations (6) is an $n^3$ process.

An experiment was tried to find out the effect of applying the method of this paper to a sum of squares of non-linear functions which do not all tend to zero at the minimum.

Table 2
Number of function values to minimize a sum of squares that does not tend to zero

 n    δ = 0   δ = 0.1   δ = 1   δ = 10
 3      21       29       31       29
 3      16       16       14       21
 5      17       17       37       33
 5      20       20       29       34
10      29       26       47       78
10      26       29       47       86
20      41       42      118      175
20      35       36       88       93
30      47       59       77      134
30      46       47      106      206

Table 3
The procedure applied to Rosenbrock's function

ITERATION      x1        x2       f(1)      f(2)
     0      -1.2000    1.0000   -4.4000    2.2000
     1      -1.0770    0.7294   -4.3053    2.0770
     2      -0.9759    0.5294   -4.2304    1.9759
     3      -0.4205    0.0701   -1.0669    1.4205
     4      -0.4270    0.0765   -1.0577    1.4270
     5      -0.3573    0.0181   -1.0949    1.3573
     6      -0.4232    0.1697   -0.0941    1.4232
     7      -0.1620   -0.0048   -0.3110    1.1620
     8       0.0380   -0.0419   -0.4336    0.9620
     9       0.4193    0.1554   -0.2045    0.5807
    10       0.4089    0.1641   -0.0307    0.5911
    11       0.6089    0.3465   -0.2420    0.3911
    12       0.6770    0.4309   -0.2747    0.3230
    13       0.7602    0.5854    0.0741    0.2398
    14       0.8207    0.6675   -0.0605    0.1793
    15       0.8725    0.7514   -0.0992    0.1275
    16       0.9747    0.9514    0.0132    0.0253
    17       0.9841    0.9678   -0.0062    0.0159
    18       0.9919    0.9827   -0.0110    0.0081
    19       0.9986    0.9973    0.0009    0.0014
    20       1.0000    1.0000   -0.0001    0.0000

The individual functions are again defined by (35), but $m$ is chosen to be equal to $2n$. The initial value of $\xi$ is chosen as before but, before commencing the iterations, all the values of $E_k$ are changed by random numbers between $-\delta$ and $+\delta$. Again the position of the minimum is found to accuracy 0.0001; the required number of function values is given in Table 2.


The table shows that the method can be very effective even if the individual functions do not tend to zero at the minimum. The number of function values quoted for $\delta = 0$ is less than the corresponding number in Table 1, because the minimum is better determined in this experiment, as there are twice as many functions as variables.

This paper would not be complete without an example showing the effect of the procedure on Rosenbrock's (1960) minimization problem

$$f^{(1)} = 10(x_2 - x_1^2), \qquad f^{(2)} = 1 - x_1. \tag{36}$$

Because there are only two variables, the results of each iteration are given in Table 3. The total number of function values required is 70, and during the iterations the progress can best be described as "lively."
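For completeness, the residuals of (36) in code; this sketch is ours, and the printed check reproduces iteration 0 of Table 3.

```python
import numpy as np

def rosenbrock_residuals(x):
    """Equation (36): f^(1) = 10(x_2 - x_1^2), f^(2) = 1 - x_1."""
    return np.array([10.0 * (x[1] - x[0] ** 2), 1.0 - x[0]])

# The standard starting point (-1.2, 1.0) reproduces iteration 0 of
# Table 3: f^(1) = -4.4, f^(2) = 2.2.
print(rosenbrock_residuals(np.array([-1.2, 1.0])))
```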

Once the corner of the parabolic valley has been turned, the variables increase monotonically to their final values, the eventual convergence being particularly impressive.

As well as being tried on the examples presented, the procedure has been used to solve a number of practical problems at A.E.R.E. It has proved thoroughly successful, and it is particularly encouraging that there appears to be no tendency for the method to become less efficient as the number of variables is increased.

References

DAVIDON, W. C. (1959). "Variable metric method for minimization," A.E.C. Research and Development Report, ANL-5990 (Rev.).

FLETCHER, R., and POWELL, M. J. D. (1963). "A rapidly convergent descent method for minimization," The Computer Journal, Vol. 6, p. 163.

POWELL, M. J. D. (1964). "An efficient method for finding the minimum of a function of several variables without calculating derivatives," The Computer Journal, Vol. 7, p. 155.

ROSEN, J. B. (1960). "The gradient projection method for nonlinear programming. Part I: Linear constraints," Journal of S.I.A.M., Vol. 8, p. 181.

ROSENBROCK, H. H. (1960). "An automatic method for finding the greatest or least value of a function," The Computer Journal, Vol. 3, p. 175.
