5 Optimization algorithms

(1)

5 Optimization algorithms

We present in this chapter some algorithms which allow us to find or approximate the solution of optimization problems.

5.1 Unconstrained optimization

We consider the following minimization problem: find u∈V such thatJ(u) = inf_v∈VJ(v).

The goal is to find a minimizing sequence (u_k) ∈ V (i.e. such that J(u_k) →J(u)). Then J(u_k)−J(u_k+1) should be as large as possible. For anyw∈V,

J(uk+w) =J(uk) + (J⁰(uk), w) +kwkε(w).

|(J⁰(u_k), w)| ≤ kJ⁰(u_k)k.kwk, with equality ifw=λJ⁰(u_k), λ∈R. Then, based on the first order approximation of J (by neglecting the second and higher order derivatives), J(u_k)− J(u_k+w) is the largest forwopposite to the gradient:

uk+1=uk−ρkJ⁰(uk),

withρk ∈R+.

Gradient algorithms consist of following the line of greatest slope, given by the gradient of the cost function.

5.1.1 Gradient algorithm with fixed step

LetV be a Hilbert space. The fixed step gradient algorithm is:

u_k+1=u_k−ρJ⁰(u_k).

Theorem 5.1 Let J : V →R be a Gˆateaux-differentiable and α-convex function. If J⁰ is Lipschitz continuous(i.e. ∃M such that kJ⁰(v1)−J⁰(v2)k ≤Mkv1−v2k, ∀v1, v2∈V), and if 0 < ρ < 2α

M², then the fixed step gradient algorithm converges to the minimum: uk →u, and the convergence is geometric (kuk+1−uk ≤γkuk−uk).

Proof

uis given byJ⁰(u) = 0. Letw_k =u_k−u. Thenw_k+1=w_k−ρ(J⁰(u_k)−J⁰(u)). Then kwk+1k²=kwkk²−2ρhuk−u, J⁰(uk)−J⁰(u)i+ρ²kJ⁰(uk)−J⁰(u)k²

≤ kwkk²−2ραkuk−uk²+ρ²M²kuk−uk²=γ²kwkk²

withγ²= 1−2ρα+ρ²M²= 1−ρ(2α−ρM²)<1 (quadratic function ofρ, maximum value

= 1 forρ= 0 or 2α/M²). Thenkw_kk ≤γ^kkw₀k.

Example of a quadratic functional: J(v) = ¹₂hAv, vi − hb, vi uk+1=uk−ρ(Auk−b).

(2)

5.1.2 Gradient algorithm with optimal step ( J(uk−ρkJ⁰(uk)) = inf

ρ>0J(uk−ρJ⁰(uk)), uk+1=uk−ρkJ⁰(uk)

This is an improvement of the fixed step algorithm, with a one-dimensional minimization problem on the step size.

Theorem 5.2 IfJ is Gˆateaux-differentiable, α-convex, and if J⁰ is Lipschitz continuous on every bounded subset ofV, then the optimal step gradient algorithm converges.

Proof

i) Assume that J⁰(u_k) 6= 0 (otherwise u_k is the minimum and the algorithm has converged). Letf(ρ) =J(u_k−ρJ⁰(u_k)). Thenf⁰(ρ) =−hJ⁰(u_k−ρJ⁰(u_k)), J⁰(u_k)i.

(ρ₂−ρ₁)(f⁰(ρ₂)−f⁰(ρ1)) =−(ρ2−ρ₁)hJ⁰(u_k−ρ₂J⁰(u_k)−J⁰(u_k−ρ₁J⁰(u_k)), J⁰(u_k)i

≥α(ρ2−ρ1)²kJ⁰(uk)k².

Then f(ρ) is α-convex, and has then a unique minimum, defined byf⁰(ρk) = 0. And then hJ⁰(uk+1), J⁰(uk)i = 0 (the successive descent directions are orthogonal). Then hJ⁰(uk+1), uk+1−uki= 0, and thenJ(uk)−J(uk+1)≥ ^α₂kuk−uk+1k².

ii) (J(u_k)) is a decreasing sequence, bounded below (byJ(u)), and then it is a convergent sequence. ThenJ(u_k)−J(u_k+1)→0. Then ku_k−u_k+1k →0.

iii) kJ⁰(uk)k²=hJ⁰(uk), J⁰(uk)−J⁰(uk+1)i ≤ kJ⁰(uk)k kJ⁰(uk)−J⁰(uk+1)k. ThenkJ⁰(uk)k ≤ kJ⁰(uk)−J⁰(uk+1)k.

iv) As (J(uk)) is a decreasing sequence, (uk) is bounded. Otherwise, ifkukk →+∞, then J(uk)→+∞asJ(uk)≥J(u0) +hJ⁰(u0), uk−u0i+^α₂kuk−u0k².

Moreover,J⁰ is Lipschitz continuous. Using iii), the Lipschitz continuity on a bounded subset, and ii), J⁰(uk)→0.

v) αkuk −uk² ≤ hJ⁰(u_k)−J⁰(u), u_k −ui ≤ kJ⁰(u_k)k kuk −uk. Then kuk −uk ≤

1

αkJ⁰(u_k)k →0.

Note that the last inequality of the proof gives an estimation of the error.

Application to quadratic functionals:

J(v) =1

2hAv, vi − hb, vi,

where A is a symmetric positive definite matrix, andb ∈ Rⁿ. Then J⁰(v) = Av−b. The optimal step ρ^k is characterized byhJ⁰(u_k), J⁰(u_k+1)i= 0:

hAuk−b, A(u_k−ρ_k(Au_k−b))−bi= 0 Letwk =Auk−b. Thenρk = kwkk²

hAw_k, w_ki. An iteration of the optimal step gradient algorithm consists in computing wk,ρk and thenuk+1.

It can be difficult to find the optimalρk. An easy approximation consists in approximating f(ρ) = J(uk−ρJ⁰(uk)) by a parabola ˜f(ρ) (quadratic function), uniquely determined by f˜(0) := f(0) = J(uk), ˜f⁰(0) := f⁰(0) = −kJ⁰(uk)k², and a third value (e.g. f˜(ρk−1) :=

f(ρk−1)).

(3)

5.1.3 Conjugate gradient

The main issue of the optimal step (or fixed step) gradient algorithm is that it usually does not converge in a finite number of iterations.

LetJ be a quadratic functional defined onRⁿ: J(v) =1

2hAv, vi − hb, vi, whereA is symmetric definite positive.

Conjugate gradient algorithm: Assume thatu1, . . . ,ukhave been computed. We assume that J⁰(ui) 6= 0, ∀i ≤ k, otherwise the algorithm has converged. Then let uk+1 be the minimum ofJ on the affine subset containinguk and spanned by (J⁰(ui))0≤i≤k:

J(uk+1) = inf

α∈R^k+1J uk+

k

X

i=0

αiJ⁰(ui)

! ,

whereα= (α₀, . . . , α_k). Note thatJ has a unique minimum on this subset.

Remark: The conjugate gradient (inf overα∈R^k+1) is better than the optimal step gradient (inf over (0, . . . ,0, α_k)).

Remark: The successive gradientsJ⁰(ui) are mutually orthogonal.

Proof

The derivative is equal to zero at the optimum: hJ⁰(uk+1), J⁰(ui)i= 0,∀0≤i≤k.

Corollary 5.1 The gradients (J⁰(ui))_0≤i≤k are linearly independent.

Corollary 5.2 The conjugate gradient algorithm converges in at mostniterations.

Definition 5.1 Two non-zero vectorsp_i andp_j areconjugate with respect to A (A being a symmetric positive definite matrix) ifhpi, Ap_ji= 0.

Proposition 5.1 Conjugate vectors are linearly independent.

Proof

Assume that there exist (λ1, . . . , λk) such that

k

X

i=0

λipi= 0. Then 0 =hA(Pk

i=0λipi), pji= λjhApj, pji. Aspj6= 0 andA is symmetric positive definite,λj= 0.

Letd_k=u_k+1−u_k=

k

X

i=0

δ^k_iJ⁰(u_i) be the descent direction at iterationk.

Proposition 5.2 The descent directions (d_i)are mutually conjugate.

Proof

J⁰(uk+1) = J⁰(uk +dk) = A(uk +dk)−b = J⁰(uk) +Adk. Then for 0 ≤ i < j ≤ k, 0 =hJ⁰(uj+1), J⁰(ui)i=hJ⁰(uj), J⁰(ui)i+hAdj, J⁰(ui)i=hAdj, J⁰(ui)i.

ThenhAdj, dii=hAdj,Pi

l=0δⁱ_lJ⁰(ul)i= 0,∀0≤i < j≤k.

(4)

Computation of the direction d_k using Gram-Schmidt: the first direction is d₀ = J⁰(u₀). Then, as the direction lives in the subspace spanned by (J⁰(u_i)), the second direction can be set to

d1=J⁰(u1) +β0d0.

We then use the conjugation constraint to findβ0: 0 =hd1, Ad0i=hJ⁰(u1), Ad0i+β0hd0, Ad0i and then

β₀=−hJ⁰(u₁), Ad₀i hd0, Ad0i . Using a similar process, at iterationk,

d_k+1=J⁰(u_k+1) +

k

X

i=0

β_id_i

as the successive directions span the same vector subspace as the successive derivatives. Then, the conjugation constraint gives: 0 = hdk+1, Adki=hJ⁰(uk+1, Adki+βkhdk, Adki+ 0, and thenβk=−^hJ⁰^(u_hd^k+1^),Ad^kⁱ

k,Ad_ki .

Moreover, for i < k, 0 = hdk+1, Ad_ii = hJ⁰(u_k+1), Ad_ii+β_ihdi, Ad_ii. But J⁰(u_i+1) = Au_i+1−b=Au_i−b+A(u_i+1−u_i) =J⁰(u_i)−Aρ_id_i.

Then hJ⁰(u_k+1), Ad_ii= _ρ¹

ihJ⁰(u_k+1), J⁰(u_i+1−J⁰(u_i)i= 0 as the gradients are orthogonal.

Thenβ_i= 0 fori < k. Finally,

d_k+1=J⁰(u_k+1) +β_kd_k, with β_k =−hJ⁰(u_k+1), Ad_ki hdk, Adki . Computation of the step size ρk:

The iterate is defined byuk+1=uk+ρkdk. As the derivative is equal to zero at the optimum (with respect toρ),hJ⁰(u_k+1), d_ki= 0. Then, asJ⁰(u_k+1) =J⁰(u_k+ρ_kd_k) =J⁰(u_k) +Aρ_kd_k, the step size is given by

ρk=−hJ⁰(uk), dki hAdk, d_ki . Algorithm:

• Choose any u0; setd0=J⁰(u0);

• u_k+1=u_k−ρ_kd_k, withρ_k= hJ⁰(u_k), d_ki hAdk, dki ;

• dk+1=J⁰(uk+1) +βkdk, with βk =−hJ⁰(uk+1), Adki hd_k, Ad_ki .

This is one of the best method for solving linear systems Ax = b, where A is symmetric positive definite. Note that this algorithm can be extended to non quadratic functionals.

5.2 Constrained optimization

5.2.1 Gradient algorithm with projection We consider the following optimization problem:

v∈Kinf J(v)

whereKis a closed convex subset ofV (a Hilbert space), andJis Gˆateaux-differentiable and α-convex. The minimumuofJ overKis characterized by Euler’s inequality: hJ⁰(u), v−ui ≥

(5)

0,∀v∈K. Then forρ >0,hu−u+ρJ⁰(u), v−ui ≥0 andh(u−ρJ⁰(u))−u, v−ui ≤0. As K is convex, then

u=P rojK(u−ρJ⁰(u)).

The gradient algorithm with projection is the following:

u_k+1=P roj_K(u_k−ρJ⁰(u_k)), ρ >0.

Note that ifK=V, this algorithm is exactly the fixed step gradient algorithm.

Theorem 5.3 If K is a closed non-empty convex subset of V, if J : V → R is Gˆateaux- differentiable and α-convex, ifJ⁰ is Lipschitz continuous onV (letM be the Lipschitz constant), and if

0< a≤ρ≤b < 2α M², then the gradient algorithm with projection converges, and

kuk−uk ≤β^kku0−uk with β <1.

Proof

The projection onK is Lipschitz continuous, with a Lipschitz constant of 1. Thenkuk+1− uk ≤ k(uk −u)−ρ(J⁰(uk)−J⁰(u))k. Then kuk+1−uk² ≤ k(uk −u)k²+ρ²k(J⁰(uk)− J⁰(u))k²−2ρhuk−u, J⁰(uk)−J⁰(u)i. Using the Lipschitz continuity ofJ⁰andα-convexity ofJ, kuk+1−uk²≤ k(uk−u)k²+ρ²M²k(uk−u)k²−2ραk(uk−u)k²= (1−2ρα+ρ²M²)k(uk−u)k². f(ρ) := (1−2ρα+ρ²M²) isα-convex, and reaches its maximum value 1 forρ= 0 andρ=_M^2α2. Then for 0< a≤ρ≤b < _M^2α2,f(ρ)≤β²<1.

5.2.2 Identification of a saddle-point: Uzawa’s algorithm

As the projection operator is not explicity known in general, the projection at each iteration can be very difficult. The idea is then to consider the Lagrangian associated to the constrainted minimization problem, and to identify the saddle-points of the Lagrangian.

We consider the convex minimization problem:

inf

F(v)≤0J(v),

where J :V →R is convex, and F : V →R^m is convex. We assume that the hypotheses of Kuhn-Tucker theorem are satisfied. Let L(v, q) =J(v) +hq, F(v)i. Then, by definition, (u, p) is a saddle-point if

∀q∈R^m+, L(u, q)≤ L(u, p)≤ L(v, p), ∀v∈V.

We deduce thathp−q, F(u)i ≥0 for allq∈R^m+. This is equivalent tohp−q, p−(p+ρF(u))i ≤ 0, withρ >0. Then

p=P roj_R^m

+(p+ρF(u)), ∀ρ >0.

Uzawa’s algorithm is then the following:

• one choosesp₀∈R^m+;

• pn being known, computeun solution of the (unconstrained) optimization problem L(un, pn) = inf

v∈VL(v, pn), ∀v∈V;

(6)

• compute the next Lagrange multiplier:

pn+1=P roj_R^m

+(pn+ρnF(un)), ρn>0.

Note that the minimization problem is now unconstrained, and the Lagrange multiplier is given by a projection onR^m+, which is straightforward.

Theorem 5.4 Under the previous hypotheses, and if 0 < a ≤ ρn ≤ b < _M^2α2, then the algorithm converges: un →uinV.

Note that the convergence of (pn) is not ensured.

Proof

u_n anduare characterized by the following inequalities (see proposition 2.15):

hJ⁰(u_n), v−u_ni+hp_n, F(v)−F(u_n)i ≥0, ∀v∈V, hJ⁰(u), v−ui+hp_n, F(v)−F(u)i ≥0, ∀v∈V.

If we denote byrn=pn−p, by choosing respectivelyv=uandv=un,

−hJ⁰(un)−J⁰(u), un−ui − hrn, F(un)−F(u)i ≥0.

hrn, F(vn)−F(u)i ≤ −hJ⁰(un)−J⁰(u), un−ui ≤ −αkun−uk²

and krn+1k ≤ krn+ρn(F(un)−F(u))k (by Lipschitz-continuity of the projection). Then krn+1k²≤ krnk²+ 2ρnhrn, F(un)−F(u)i+ρ²_nkF(un)−F(u)k²≤ krnk²−2ρnαkun−uk²+ ρ²_nM²kun−uk².

As 0 < a ≤ ρn ≤ b < _M^2α2 ⇔ 2αρn−ρ²_nM² ≥ β > 0, krn+1k² ≤ krnk²−βkun−uk². Then the sequence (krnk) is decreasing, and bounded below, and then it converges. As

0≤βkun−uk²≤ krnk²− krn+1k²,kun−uk →0.

Remark: Uzawa’s algorithm has a dual interpretation: it is exactly the gradient algorithm with fixed step and projection applied to the dual problem.