
3 Line-Searches

3.4 Updating the Trial Stepsize

We turn now to the possibilities for interpolation and extrapolation in an algorithm such as 3.3.1. The simplest way to satisfy the safeguard-reduction property 3.1.3 is a rough doubling and halving process such as

- if tR = +∞, replace t by 2t;
- if tR < +∞, replace t by 1/2 (tL + tR).
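In Python, this crude update is only a few lines; a minimal sketch, where using math.inf as the sentinel for tR = +∞ is an implementation choice rather than something prescribed by the text:

```python
import math

def crude_update(t, t_L, t_R):
    """Rough doubling/halving update of the trial stepsize."""
    if math.isinf(t_R):          # no right bound found yet: extrapolate
        return 2.0 * t
    return 0.5 * (t_L + t_R)     # a bracket [t_L, t_R] is on hand: bisect it
```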

More intelligent formulae exist, however: as the number of cycles increases in the line-search, more and more information is accumulated from q, which can be used to guess where a convenient t is likely to lie. Then, the idea is to fit a simple model-function (like a polynomial) to this information. The model-function is used to obtain a desired value, say td, for the next trial, and it remains to force td inside the safeguard ]tL, tR[, so as to ensure the safeguard-reduction property 3.1.3.

Remark 3.4.1 The idea of having q(t) := f(xk + tdk) as line-search function, of fitting a model to it, say θ(t), and of choosing td minimizing θ, is attractive but may not be the most suitable. Remember that the descent test (3.2.1) might be satisfied by no minimizer of q.

A possible way round this discrepancy is to compute td minimizing the tilted function t ↦ θ(t) - mtq'(0); or equivalently to choose θ(t) fitting q(t) - mtq'(0); or also to take q(t) := f(xk + tdk) - mt(∇f(xk), dk) as line-search function. The resulting td will certainly aim at satisfying (3.2.1) and

(∇f(xk + tdk), dk) ≥ m(∇f(xk), dk).

It will thus aim at satisfying (3.2.4) as well, and this strategy is more consistent with a Wolfe criterion, say; see Fig. 3.4.1.

The above strategy may look anti-natural, but luckily the perturbation term mtq'(0) is small, admitting that m itself is small (see Remark 3.2.2). □
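As a small illustration of the tilted fit, suppose the model θ were quadratic, θ(t) = (1/2)ct² + q'(0)t + q(0) with c > 0 (a hypothetical choice; the fit actually recommended in §(b) below is cubic). Minimizing θ(t) - mtq'(0) then amounts to solving θ'(td) = mq'(0):

```python
def tilted_quadratic_minimizer(c, qp0, m):
    """Minimizer of theta(t) - m*t*q'(0) for theta(t) = 0.5*c*t**2 + q'(0)*t + q(0).

    Setting the derivative to zero gives c*t + q'(0) = m*q'(0), hence
    t_d = (m - 1)*q'(0)/c, positive since q'(0) < 0 and 0 < m < 1.
    """
    return (m - 1.0) * qp0 / c
```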


Fig. 3.4.1. A perturbation of the line-search function

(a) Forcing the Safeguard-Reduction Property The forcing mechanism can be done as follows:

- When no tR < +∞ has been found yet, one chooses K > 1 and the next trial t+ is max{td, Kt} (each extrapolation multiplies the stepsize at least by K; K may vary with the number of extrapolations but one should not let it tend to 1 without precautions).

- When some tR < +∞ is on hand, one chooses p ∈ ]0, 1/2] and one does the following:
  - replace td by max{td, (1 - p)tL + ptR};
  - then replace the new td thus obtained by min{td, ptL + (1 - p)tR};
  - finally, the next trial t+ is set to this last td.

In other words, t+ is forced inside the interval obtained from [tL, tR] by chopping off p(tR - tL) from its two endpoints. At the next cycle, t+ will become a tL or a tR (unless clause (0) occurs) and in both cases, the bracket [tL, tR] will be reduced by a factor of at least 1 - p; p may vary at each interpolation but one must not let p ↓ 0 without precaution.
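The whole forcing step fits in a few lines. A minimal sketch, assuming K > 1 and p ∈ ]0, 1/2] are supplied as parameters (the function name and the default values are illustrative, not from the text):

```python
import math

def force_safeguard(t_d, t, t_L, t_R, K=2.0, p=0.1):
    """Force the desired stepsize t_d into the safeguarded interval.

    - Extrapolation (t_R = +inf): next trial is max{t_d, K*t}, K > 1.
    - Interpolation (t_R finite): clip t_d into
      [(1-p)*t_L + p*t_R, p*t_L + (1-p)*t_R], p in ]0, 1/2],
      so the bracket shrinks by a factor of at least 1 - p per cycle.
    """
    if math.isinf(t_R):
        return max(t_d, K * t)
    t_d = max(t_d, (1.0 - p) * t_L + p * t_R)   # chop p*(t_R - t_L) off the left
    t_d = min(t_d, p * t_L + (1.0 - p) * t_R)   # ... and off the right
    return t_d
```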

These questions, particularly that of choosing td, are of only moderate interest because efficient line-searches need on the average far less than two cycles to reach (0) and to accomplish the descent iteration. Asymptotic properties of the interpolation formulae are therefore hardly relevant.

Remark 3.4.2 The question of the initial trial is crucial, since the above-mentioned score of less than two cycles per line-search is certainly not attainable without a good initialization.

With Newtonian methods, one must try t = 1 first, so the Newton step has a chance to prevail.

For other methods, the following technique is universally used: pretend that q is quadratic,

q(t) ≈ (1/2) at² + q'(0)t + q(0)   (a > 0 is unknown),

and that its decrease from t = 0 to the (asserted) optimal t* := -q'(0)/a is going to be Δ := f(xk-1) - f(xk); it is straightforward to check that t* is then given by t* = -2Δ/q'(0), which happens to be an excellent initialization.

Observe that, at the first descent iteration k = 1, Δ does not exist. In the notations of Fig. 1.2.1, it is actually block (U0) which should give to block (A) an idea of the very first initial trial; for example (U0) may pass to (A) an estimate of Δ together with x1 and δ. These are the kind of details that help an optimization program to run efficiently. □
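As a sketch, the resulting initialization reads as follows; the newtonian flag and the function name are illustrative additions, not part of the text:

```python
def initial_trial(delta, qp0, newtonian=False):
    """Initial trial stepsize for the line-search.

    delta : decrease f(x_{k-1}) - f(x_k) observed at the previous iteration
            (or an estimate passed by block (U0) when k = 1)
    qp0   : q'(0) = (grad f(x_k), d_k), assumed negative
    """
    if newtonian:
        return 1.0                # give the Newton step a chance to prevail
    return -2.0 * delta / qp0     # t* = -2*Delta/q'(0) from the quadratic model
```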

Remark 3.4.3 Having thus settled the question of the initialization, let us come again to the question of stopping criteria. The "ideal" event ‖∇f(xk)‖ ≤ δ occurs rarely in practice, for many possible reasons. One is that δ may have been chosen too small by the user and, in view of the speed of the minimization algorithm, iterations should go on essentially forever.

Another reason, a very common one, is that s(x) is actually not the gradient of f at x, either (i) because of a mistake in the black box (U1) - this is fairly frequent, see Remark 1.2.2 - or (ii) simply because of roundoff errors: (U1) can work only with finitely many digits, and there must be a threshold under which the computation errors become important. Then, observed values of q and of q' [= (s, d)] are inconsistent with each other and a proof like that of Theorem 3.3.2 does not reflect reality. For example, the property

[q(tR) - q(t*)] / (tR - t*) → q'(t*)   when tR ↓ t*

may become totally wrong. As a result, (0) never occurs, tR - tL does tend to zero and the line-search loops forever. The cause of this problem can be (i) or (ii) above; in both cases, the process must be stopped manually, so as to "spare" computing time (i.e. reduce it from infinity to a reasonable value!).

Thus, in addition to δ for the ideal test (1.2.1), the user [i.e. block (U0)] must set another tolerance, say δ', allowing block (A) to somehow guess what a very small stepsize is. This δ' defines a threshold, under which tR - tL must be considered as essentially 0. In these conditions, an "emergency stopping criterion", acting when tR - tL becomes lower than this threshold, can be inserted in Step 4 of Algorithm 3.3.1. Note also that another emergency stop can be inserted in Step 3 to prevent infinite loops occurring when the objective function is unbounded from below.
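A minimal sketch of these two emergency tests; the name delta_prime stands for δ', and the lower bound f_min used for the Step-3 guard is a hypothetical user-supplied parameter:

```python
import math

def emergency_stop(t_L, t_R, q_t, delta_prime, f_min):
    """Emergency criteria for the line-search loop (illustrative names).

    Step 4: stop when the bracket t_R - t_L has shrunk below delta_prime,
            i.e. is essentially zero for the accuracy at hand.
    Step 3: stop when q(t) falls under f_min, guarding against an
            objective unbounded from below.
    """
    bracket_collapsed = math.isfinite(t_R) and (t_R - t_L) < delta_prime
    unbounded_below = q_t < f_min
    return bracket_collapsed or unbounded_below
```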

It is interesting to observe that this δ'-precaution is not sufficient, however: the user may have overestimated the accuracy of the calculations in (U1) and δ' may never act, again because of roundoff errors. There exists, at last, an unfailing means for Algorithm 3.3.1 to detect that it is starting to loop forever. When the new stepsize t+ becomes close enough to a previous one, say tL, there holds

xk + tL dk = xk + t+ dk,

although tL ≠ t+. This is another effect of roundoff errors - but beneficial, this time: when it happens, Algorithm 3.3.1 can be safely stopped.
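In floating point this test can be taken literally, comparing the two candidate points componentwise; a minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def looping_detected(x_k, d_k, t_L, t_plus):
    """True when the new trial t+ no longer changes the iterate:
    although t_L != t_plus as real numbers, roundoff may give
    x_k + t_L*d_k == x_k + t_plus*d_k, and the line-search can stop."""
    return t_plus != t_L and np.array_equal(x_k + t_L * d_k, x_k + t_plus * d_k)
```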

All this belongs more to the art of computer programming than to mathematics and explains what we meant in the introduction of this Section 3, when mentioning that implementing a good line-search requires experience. □

We gave the above details in Remark 3.4.3 because they illustrate the kind of care that must be exercised when organizing automatic calculations. We conclude this section with some more details concerning the fit of q by some simple function. Although not particularly exciting, they are further illustrations of another kind of precaution: when doing a calculation, one should try to avoid division by 0!

(b) Computing the Interpolation td The most widely used fit for q is by a cubic function, which is done by the following calculations:

- Call α and α_ the two stepsize values that have been tried last (the current one and the previous one; at the first cycle, α_ = 0).

- We have on hand q := q(α), q' := q'(α), q_ := q(α_) and q'_ := q'(α_).

- These four data define a polynomial of degree ≤ 3 in t, which we find convenient to write as

θ(t) := (1/3) a(t - α)³ + b(t - α)² + q'(t - α) + q.

- The coefficients a and b are identified by equating θ(α_) and θ'(α_) with q_ and q'_ respectively. Knowing that ε := α - α_ ≠ 0, this gives the linear system

(1/3) ε²a - εb = Q' - q'
ε²a - 2εb = q'_ - q'

in which we have set Q' := (q - q_)/ε. With P' := q' + q'_ - 3Q', its unique solution is

ε²a = q' + q'_ + 2P'   and   εb = q' + P'.

- Then the idea is to take td as the local minimum of θ (if it exists), i.e. one of the real solutions (if they exist) of the equation

θ'(t) = a(t - α)² + 2b(t - α) + q' = 0.    (3.4.1)

With respect to the unknown t - α, the reduced discriminant of this equation is

Δ := b² - aq' = (P'² - q'q'_)/ε²,

which we assume nonnegative, otherwise there is nothing to compute.

- Clearly enough, if t - α = (-b ± Δ^{1/2})/a solves (3.4.1), then

θ''(t) = 2a(t - α) + 2b = ±2Δ^{1/2}.    (3.4.2)

Because Oil must be nonnegative at td, it is the "+" sign that prevails; in a word, td can be computed by either of the following equivalent formulae:

td - α = (-b + Δ^{1/2})/a    (3.4.3)

td - α = (-b + Δ^{1/2})(b + Δ^{1/2}) / [a(b + Δ^{1/2})] = -q'/(b + Δ^{1/2}).    (3.4.4)

- The tradition is to use (3.4.3) if b ≤ 0 and (3.4.4) if b > 0. Then, roundoff errors are reduced because the additions involve two nonnegative numbers. In particular, the denominator in (3.4.4) cannot be zero.

- Now comes the delicate part of the calculation. In both cases, the desired td is expressed as

td = α + N/D    (3.4.5)

but this division may blow up if D is close to 0. On the other hand, we know that if td is going to be outside the interval ]tL, tR[ (assumed to be known; in case of extrapolation we can temporarily set tR = 10Kt, say), the forcing mechanism of §(a) above will kill the computation of td. A formula like (3.4.5) is then useless anyway.

Now, a key observation is that α ∈ [tL, tR]. In fact, the current trial α is either tL or tR, as can be seen from a long enough contemplation of Algorithm 3.1.2. Then the property

tL - α < N/D < tR - α,

which must be satisfied by td, implies

|N| / |D| < tR - tL.    (3.4.6)

To sum up, td should be computed from (3.4.5) only if (3.4.6) holds. Otherwise the cubic model is helpless, as it predicts a new stepsize outside the bracket [tL, tR].

Then td should be set for example to tL or tR according to the sign of q'.
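Putting §(b) together, here is a sketch of the whole computation: it forms the coefficients through ε²a and εb, chooses between (3.4.3) and (3.4.4) according to the sign of b, and checks (3.4.6) in multiplied-out form before daring the division (3.4.5). The function name and the fallbacks returned when the cubic model is helpless are illustrative choices consistent with the last sentence above.

```python
import math

def cubic_trial(alpha, q, qp, alpha_, q_, qp_, t_L, t_R):
    """Desired stepsize t_d from a cubic fit of q, guarded against
    division by a (nearly) zero denominator.

    alpha,  q,  qp  : current trial,  q(alpha),  q'(alpha)
    alpha_, q_, qp_ : previous trial, q(alpha_), q'(alpha_)
    t_L, t_R        : current bracket (t_R assumed finite here; see the
                      text for a temporary substitute when extrapolating)
    """
    eps = alpha - alpha_                         # nonzero by assumption
    Qp = (q - q_) / eps
    Pp = qp + qp_ - 3.0 * Qp
    disc = (Pp * Pp - qp * qp_) / (eps * eps)    # reduced discriminant b^2 - a*q'
    if disc < 0.0:                               # no local minimum of theta
        return t_L if qp > 0.0 else t_R
    root = math.sqrt(disc)
    a = (qp + qp_ + 2.0 * Pp) / (eps * eps)      # from eps^2 * a = q' + q'_ + 2P'
    b = (qp + Pp) / eps                          # from eps   * b = q' + P'
    if b <= 0.0:
        N, D = -b + root, a                      # formula (3.4.3)
    else:
        N, D = -qp, b + root                     # formula (3.4.4)
    if abs(N) < abs(D) * (t_R - t_L):            # overflow guard (3.4.6)
        return alpha + N / D                     # formula (3.4.5)
    return t_L if qp > 0.0 else t_R              # cubic model helpless here
```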

Remark 3.4.4 Perhaps the most important reason for these precautions is the danger
