5 Optimization algorithms
We present in this chapter some algorithms which allow us to find or approximate the solution of optimization problems.
5.1 Unconstrained optimization
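We consider the following minimization problem: find $u \in V$ such that $J(u) = \inf_{v \in V} J(v)$.
The goal is to build a minimizing sequence $(u_k) \subset V$, i.e. such that $J(u_k) \to J(u)$. At each iteration, the decrease $J(u_k) - J(u_{k+1})$ should be as large as possible. For any $w \in V$,
\[ J(u_k + w) = J(u_k) + \langle J'(u_k), w \rangle + \|w\|\,\varepsilon(w). \]
By the Cauchy-Schwarz inequality, $|\langle J'(u_k), w \rangle| \le \|J'(u_k)\|\,\|w\|$, with equality if $w = \lambda J'(u_k)$, $\lambda \in \mathbb{R}$. Then, based on the first order approximation of $J$ (neglecting the second and higher order derivatives), $J(u_k) - J(u_k + w)$ is largest, for a given $\|w\|$, when $w$ is opposite to the gradient:
\[ u_{k+1} = u_k - \rho_k J'(u_k), \]
with $\rho_k \in \mathbb{R}_+$.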
Gradient algorithms consist of following the line of greatest slope, given by the gradient of the cost function.
5.1.1 Gradient algorithm with fixed step
Let $V$ be a Hilbert space. The fixed step gradient algorithm is:
\[ u_{k+1} = u_k - \rho J'(u_k). \]
Theorem 5.1 Let $J : V \to \mathbb{R}$ be a Gâteaux-differentiable and $\alpha$-convex function. If $J'$ is Lipschitz continuous (i.e. $\exists M$ such that $\|J'(v_1) - J'(v_2)\| \le M \|v_1 - v_2\|$, $\forall v_1, v_2 \in V$), and if $0 < \rho < \frac{2\alpha}{M^2}$, then the fixed step gradient algorithm converges to the minimum: $u_k \to u$, and the convergence is geometric ($\|u_{k+1} - u\| \le \gamma \|u_k - u\|$ with $\gamma < 1$).
Proof
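$u$ is characterized by $J'(u) = 0$. Let $w_k = u_k - u$. Then $w_{k+1} = w_k - \rho (J'(u_k) - J'(u))$, and
\[ \|w_{k+1}\|^2 = \|w_k\|^2 - 2\rho \langle u_k - u, J'(u_k) - J'(u) \rangle + \rho^2 \|J'(u_k) - J'(u)\|^2 \le \|w_k\|^2 - 2\rho\alpha \|u_k - u\|^2 + \rho^2 M^2 \|u_k - u\|^2 = \gamma^2 \|w_k\|^2, \]
with $\gamma^2 = 1 - 2\rho\alpha + \rho^2 M^2 = 1 - \rho(2\alpha - \rho M^2) < 1$ (a quadratic function of $\rho$, equal to 1 for $\rho = 0$ and $\rho = 2\alpha/M^2$, and strictly smaller than 1 in between). Then $\|w_k\| \le \gamma^k \|w_0\|$.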
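As a short worked complement (not part of the original proof), the bound on the contraction factor can be minimized over the step. This is only a sketch, using the expression $\gamma^2(\rho) = 1 - 2\rho\alpha + \rho^2 M^2$ from the proof:

```latex
\[
\frac{d}{d\rho}\,\gamma^2(\rho) = -2\alpha + 2\rho M^2 = 0
\iff \rho^\ast = \frac{\alpha}{M^2},
\qquad
\gamma^2(\rho^\ast) = 1 - \frac{\alpha^2}{M^2}.
\]
```

Since $\alpha \le M$, this value lies in $[0, 1)$. It optimizes the upper bound from the proof only, so the best step for a particular $J$ may be different.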
Example of a quadratic functional: $J(v) = \frac{1}{2}\langle Av, v \rangle - \langle b, v \rangle$, for which the iteration reads $u_{k+1} = u_k - \rho (A u_k - b)$.
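The quadratic example can be tried numerically. The following is a minimal sketch in pure Python, not a reference implementation; the helper names (`matvec`, `fixed_step_gradient`), the stopping tolerance and the 2x2 test problem are illustrative assumptions.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def fixed_step_gradient(A, b, u0, rho, tol=1e-10, max_iter=10_000):
    """Iterate u_{k+1} = u_k - rho * (A u_k - b) until the gradient is small."""
    u = list(u0)
    for k in range(max_iter):
        grad = [au - bi for au, bi in zip(matvec(A, u), b)]  # J'(u_k) = A u_k - b
        if sum(g * g for g in grad) ** 0.5 < tol:
            return u, k
        u = [ui - rho * gi for ui, gi in zip(u, grad)]
    return u, max_iter


# Hypothetical test problem: A = diag(2, 1), so alpha = 1 and M = 2,
# and Theorem 5.1 asks for 0 < rho < 2 * alpha / M**2 = 0.5.
A = [[2.0, 0.0], [0.0, 1.0]]
b = [2.0, 1.0]
u, iterations = fixed_step_gradient(A, b, [0.0, 0.0], rho=0.4)
print(u, iterations)  # u should be close to the exact solution [1.0, 1.0]
```

The bound $2\alpha/M^2$ is a sufficient condition; for this particular matrix larger steps may still converge.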
5.1.2 Gradient algorithm with optimal step
\[ J(u_k - \rho_k J'(u_k)) = \inf_{\rho > 0} J(u_k - \rho J'(u_k)), \qquad u_{k+1} = u_k - \rho_k J'(u_k). \]
This is an improvement of the fixed step algorithm, with a one-dimensional minimization problem on the step size.
Theorem 5.2 If $J$ is Gâteaux-differentiable, $\alpha$-convex, and if $J'$ is Lipschitz continuous on every bounded subset of $V$, then the optimal step gradient algorithm converges.
Proof
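i) Assume that $J'(u_k) \neq 0$ (otherwise $u_k$ is the minimum and the algorithm has converged). Let $f(\rho) = J(u_k - \rho J'(u_k))$. Then $f'(\rho) = -\langle J'(u_k - \rho J'(u_k)), J'(u_k) \rangle$, and
\[ (\rho_2 - \rho_1)(f'(\rho_2) - f'(\rho_1)) = -(\rho_2 - \rho_1) \langle J'(u_k - \rho_2 J'(u_k)) - J'(u_k - \rho_1 J'(u_k)), J'(u_k) \rangle \ge \alpha (\rho_2 - \rho_1)^2 \|J'(u_k)\|^2. \]
Then $f$ is strongly convex (with constant $\alpha \|J'(u_k)\|^2 > 0$), and therefore has a unique minimum, characterized by $f'(\rho_k) = 0$. Hence $\langle J'(u_{k+1}), J'(u_k) \rangle = 0$ (the successive descent directions are orthogonal). Then $\langle J'(u_{k+1}), u_{k+1} - u_k \rangle = 0$, and by $\alpha$-convexity $J(u_k) - J(u_{k+1}) \ge \frac{\alpha}{2} \|u_k - u_{k+1}\|^2$.
ii) $(J(u_k))$ is a decreasing sequence, bounded below (by $J(u)$), hence it is a convergent sequence. Then $J(u_k) - J(u_{k+1}) \to 0$, and by i) $\|u_k - u_{k+1}\| \to 0$.
iii) $\|J'(u_k)\|^2 = \langle J'(u_k), J'(u_k) - J'(u_{k+1}) \rangle \le \|J'(u_k)\| \, \|J'(u_k) - J'(u_{k+1})\|$. Then $\|J'(u_k)\| \le \|J'(u_k) - J'(u_{k+1})\|$.
iv) As $(J(u_k))$ is a decreasing sequence, $(u_k)$ is bounded. Otherwise $\|u_k\| \to +\infty$ along a subsequence, and then $J(u_k) \to +\infty$ along it, as $J(u_k) \ge J(u_0) + \langle J'(u_0), u_k - u_0 \rangle + \frac{\alpha}{2} \|u_k - u_0\|^2$.
Moreover, $J'$ is Lipschitz continuous on bounded subsets. Using iii), this Lipschitz continuity, and ii), $J'(u_k) \to 0$.
v) $\alpha \|u_k - u\|^2 \le \langle J'(u_k) - J'(u), u_k - u \rangle \le \|J'(u_k)\| \, \|u_k - u\|$. Then $\|u_k - u\| \le \frac{1}{\alpha} \|J'(u_k)\| \to 0$.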
Note that the last inequality of the proof gives an error estimate: $\|u_k - u\| \le \frac{1}{\alpha} \|J'(u_k)\|$.
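Application to quadratic functionals:
\[ J(v) = \frac{1}{2} \langle Av, v \rangle - \langle b, v \rangle, \]
where $A$ is a symmetric positive definite matrix, and $b \in \mathbb{R}^n$. Then $J'(v) = Av - b$. The optimal step $\rho_k$ is characterized by $\langle J'(u_k), J'(u_{k+1}) \rangle = 0$:
\[ \langle A u_k - b, A(u_k - \rho_k (A u_k - b)) - b \rangle = 0. \]
Let $w_k = A u_k - b$. Then $\rho_k = \frac{\|w_k\|^2}{\langle A w_k, w_k \rangle}$. An iteration of the optimal step gradient algorithm consists in computing $w_k$, $\rho_k$ and then $u_{k+1} = u_k - \rho_k w_k$.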
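The three computations of one iteration ($w_k$, $\rho_k$, $u_{k+1}$) can be sketched as follows. This is an illustrative pure-Python sketch under assumed names (`matvec`, `dot`, `optimal_step_gradient`), with an assumed stopping tolerance and a hypothetical 2x2 test system.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(x, y):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(x, y))


def optimal_step_gradient(A, b, u0, tol=1e-10, max_iter=10_000):
    """Optimal step gradient for J(v) = 1/2 <Av, v> - <b, v>, A symmetric positive definite."""
    u = list(u0)
    for k in range(max_iter):
        w = [au - bi for au, bi in zip(matvec(A, u), b)]  # w_k = A u_k - b
        ww = dot(w, w)
        if ww ** 0.5 < tol:
            return u, k
        rho = ww / dot(matvec(A, w), w)                   # rho_k = |w_k|^2 / <A w_k, w_k>
        u = [ui - rho * wi for ui, wi in zip(u, w)]       # u_{k+1} = u_k - rho_k w_k
    return u, max_iter


# Hypothetical test system; its exact solution is (1/11, 7/11).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
u, iterations = optimal_step_gradient(A, b, [2.0, 1.0])
print(u, iterations)
```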
It can be difficult to find the optimal $\rho_k$. An easy approximation consists in approximating $f(\rho) = J(u_k - \rho J'(u_k))$ by a parabola $\tilde f(\rho)$ (quadratic function), uniquely determined by $\tilde f(0) := f(0) = J(u_k)$, $\tilde f'(0) := f'(0) = -\|J'(u_k)\|^2$, and a third value (e.g. $\tilde f(\rho_{k-1}) := f(\rho_{k-1})$).
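The parabola fit can be written in closed form: with $\tilde f(\rho) = f(0) + f'(0)\rho + c\rho^2$, the third condition fixes $c$, and the minimizer is $-f'(0)/(2c)$ when $c > 0$. The sketch below illustrates this; the function name `parabolic_step` and the choice to return `None` when the fitted parabola is not convex are assumptions of this example.

```python
def parabolic_step(f0, df0, rho_prev, f_prev):
    """Minimizer of the parabola q with q(0) = f0, q'(0) = df0, q(rho_prev) = f_prev.

    Returns None if the fitted parabola is not strictly convex.
    """
    if rho_prev == 0:
        raise ValueError("rho_prev must be non-zero")
    # q(rho) = f0 + df0 * rho + c * rho**2, with c fixed by the third condition
    c = (f_prev - f0 - df0 * rho_prev) / rho_prev ** 2
    if c <= 0:
        return None  # no finite minimizer; the caller needs a fallback step
    return -df0 / (2.0 * c)


def f(rho):
    """Toy one-dimensional function, itself a parabola with minimum at rho = 3."""
    return (rho - 3.0) ** 2


step = parabolic_step(f(0.0), -6.0, 1.0, f(1.0))  # f'(0) = -6
print(step)  # 3.0
```

When $f$ is itself quadratic, as for the quadratic functionals above, the fit is exact and the optimal step is recovered.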
5.1.3 Conjugate gradient
The main drawback of the optimal step (or fixed step) gradient algorithm is that it usually does not converge in a finite number of iterations.
Let $J$ be a quadratic functional defined on $\mathbb{R}^n$: $J(v) = \frac{1}{2} \langle Av, v \rangle - \langle b, v \rangle$, where $A$ is symmetric positive definite.
Conjugate gradient algorithm: Assume that $u_0, \dots, u_k$ have been computed. We assume that $J'(u_i) \neq 0$, $\forall i \le k$, otherwise the algorithm has converged. Then let $u_{k+1}$ be the minimum of $J$ on the affine subspace containing $u_k$ and spanned by $(J'(u_i))_{0 \le i \le k}$:
\[ J(u_{k+1}) = \inf_{\alpha \in \mathbb{R}^{k+1}} J\Big( u_k + \sum_{i=0}^{k} \alpha_i J'(u_i) \Big), \]
where $\alpha = (\alpha_0, \dots, \alpha_k)$. Note that $J$ has a unique minimum on this subspace.
Remark: The conjugate gradient step (inf over $\alpha \in \mathbb{R}^{k+1}$) decreases $J$ at least as much as the optimal step gradient step from the same point (inf over $(0, \dots, 0, \alpha_k)$).
Remark: The successive gradients $J'(u_i)$ are mutually orthogonal.
Proof
The derivative is equal to zero at the optimum: $\langle J'(u_{k+1}), J'(u_i) \rangle = 0$, $\forall\, 0 \le i \le k$.
Corollary 5.1 The gradients $(J'(u_i))_{0 \le i \le k}$ are linearly independent.
Corollary 5.2 The conjugate gradient algorithm converges in at most $n$ iterations.
Definition 5.1 Two non-zero vectors $p_i$ and $p_j$ are conjugate with respect to $A$ ($A$ being a symmetric positive definite matrix) if $\langle p_i, A p_j \rangle = 0$.
Proposition 5.1 Mutually conjugate vectors are linearly independent.
Proof
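Assume that there exist $(\lambda_0, \dots, \lambda_k)$ such that $\sum_{i=0}^{k} \lambda_i p_i = 0$. Then, for each $j$, $0 = \langle A(\sum_{i=0}^{k} \lambda_i p_i), p_j \rangle = \lambda_j \langle A p_j, p_j \rangle$. As $p_j \neq 0$ and $A$ is symmetric positive definite, $\lambda_j = 0$.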
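Let $d_k = u_{k+1} - u_k = \sum_{i=0}^{k} \delta^k_i J'(u_i)$ be the descent direction at iteration $k$.
Proposition 5.2 The descent directions $(d_i)$ are mutually conjugate.
Proof
$J'(u_{k+1}) = J'(u_k + d_k) = A(u_k + d_k) - b = J'(u_k) + A d_k$. Then for $0 \le i < j \le k$, $0 = \langle J'(u_{j+1}), J'(u_i) \rangle = \langle J'(u_j), J'(u_i) \rangle + \langle A d_j, J'(u_i) \rangle = \langle A d_j, J'(u_i) \rangle$.
Then $\langle A d_j, d_i \rangle = \langle A d_j, \sum_{l=0}^{i} \delta^i_l J'(u_l) \rangle = 0$, $\forall\, 0 \le i < j \le k$.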
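Computation of the direction $d_k$ using Gram-Schmidt: the first direction is $d_0 = J'(u_0)$. Then, as the direction lives in the subspace spanned by $(J'(u_i))$, the second direction can be set to
\[ d_1 = J'(u_1) + \beta_0 d_0. \]
We then use the conjugation constraint to find $\beta_0$: $0 = \langle d_1, A d_0 \rangle = \langle J'(u_1), A d_0 \rangle + \beta_0 \langle d_0, A d_0 \rangle$ and then
\[ \beta_0 = -\frac{\langle J'(u_1), A d_0 \rangle}{\langle d_0, A d_0 \rangle}. \]
Using a similar process, at iteration $k$,
\[ d_{k+1} = J'(u_{k+1}) + \sum_{i=0}^{k} \beta_i d_i \]
as the successive directions span the same vector subspace as the successive derivatives. Then, the conjugation constraint gives: $0 = \langle d_{k+1}, A d_k \rangle = \langle J'(u_{k+1}), A d_k \rangle + \beta_k \langle d_k, A d_k \rangle + 0$, and then $\beta_k = -\frac{\langle J'(u_{k+1}), A d_k \rangle}{\langle d_k, A d_k \rangle}$.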
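Moreover, for $i < k$, $0 = \langle d_{k+1}, A d_i \rangle = \langle J'(u_{k+1}), A d_i \rangle + \beta_i \langle d_i, A d_i \rangle$. But, writing $u_{i+1} = u_i + \rho_i d_i$, $J'(u_{i+1}) = A u_{i+1} - b = A u_i - b + A(u_{i+1} - u_i) = J'(u_i) + \rho_i A d_i$.
Then $\langle J'(u_{k+1}), A d_i \rangle = \frac{1}{\rho_i} \langle J'(u_{k+1}), J'(u_{i+1}) - J'(u_i) \rangle = 0$ as the gradients are orthogonal.
Then $\beta_i = 0$ for $i < k$. Finally,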
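\[ d_{k+1} = J'(u_{k+1}) + \beta_k d_k, \quad \text{with } \beta_k = -\frac{\langle J'(u_{k+1}), A d_k \rangle}{\langle d_k, A d_k \rangle}. \]
Computation of the step size $\rho_k$:
The iterate is defined by $u_{k+1} = u_k + \rho_k d_k$. As the derivative is equal to zero at the optimum (with respect to $\rho$), $\langle J'(u_{k+1}), d_k \rangle = 0$. Then, as $J'(u_{k+1}) = J'(u_k + \rho_k d_k) = J'(u_k) + \rho_k A d_k$, the step size is given by
\[ \rho_k = -\frac{\langle J'(u_k), d_k \rangle}{\langle A d_k, d_k \rangle}. \]
Algorithm: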
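• Choose any $u_0$; set $d_0 = J'(u_0)$;
• $u_{k+1} = u_k - \rho_k d_k$, with $\rho_k = \frac{\langle J'(u_k), d_k \rangle}{\langle A d_k, d_k \rangle}$ (here the sign of the step is absorbed into the update, so that $\rho_k > 0$);
• $d_{k+1} = J'(u_{k+1}) + \beta_k d_k$, with $\beta_k = -\frac{\langle J'(u_{k+1}), A d_k \rangle}{\langle d_k, A d_k \rangle}$.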
This is one of the best methods for solving linear systems $Ax = b$, where $A$ is symmetric positive definite. Note that this algorithm can be extended to non-quadratic functionals.
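The three steps of the algorithm translate almost line by line into code. The following pure-Python sketch is illustrative only: the helper names, the tolerance and the 2x2 test system are assumptions, and the gradient is updated through $J'(u_{k+1}) = J'(u_k) - \rho_k A d_k$, which follows from $u_{k+1} = u_k - \rho_k d_k$.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(x, y):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(x, y))


def conjugate_gradient(A, b, u0, tol=1e-12):
    """Conjugate gradient for J(v) = 1/2 <Av, v> - <b, v>, A symmetric positive definite."""
    u = list(u0)
    g = [au - bi for au, bi in zip(matvec(A, u), b)]    # J'(u_0) = A u_0 - b
    d = list(g)                                         # d_0 = J'(u_0)
    for k in range(len(b)):                             # at most n iterations
        if dot(g, g) ** 0.5 < tol:
            return u, k
        Ad = matvec(A, d)
        dAd = dot(d, Ad)
        rho = dot(g, d) / dAd                           # rho_k = <J'(u_k), d_k> / <A d_k, d_k>
        u = [ui - rho * di for ui, di in zip(u, d)]     # u_{k+1} = u_k - rho_k d_k
        g = [gi - rho * adi for gi, adi in zip(g, Ad)]  # J'(u_{k+1}) = J'(u_k) - rho_k A d_k
        beta = -dot(g, Ad) / dAd                        # beta_k = -<J'(u_{k+1}), A d_k> / <d_k, A d_k>
        d = [gi + beta * di for gi, di in zip(g, d)]    # d_{k+1} = J'(u_{k+1}) + beta_k d_k
    return u, len(b)


# Hypothetical test system; its exact solution is (1/11, 7/11).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
u, iterations = conjugate_gradient(A, b, [2.0, 1.0])
print(u, iterations)
```

In exact arithmetic the loop stops after at most $n$ iterations; in floating point, a practical code would usually allow a few more iterations and test the residual.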
5.2 Constrained optimization
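5.2.1 Gradient algorithm with projection
We consider the following optimization problem:
\[ \inf_{v \in K} J(v), \]
where $K$ is a closed convex subset of $V$ (a Hilbert space), and $J$ is Gâteaux-differentiable and $\alpha$-convex. The minimum $u$ of $J$ over $K$ is characterized by Euler's inequality: $\langle J'(u), v - u \rangle \ge 0$, $\forall v \in K$. Then for $\rho > 0$, $\langle u - u + \rho J'(u), v - u \rangle \ge 0$ and $\langle (u - \rho J'(u)) - u, v - u \rangle \le 0$. As $K$ is a closed convex set, this is exactly the characterization of the projection onto $K$:
\[ u = \mathrm{Proj}_K(u - \rho J'(u)). \]
The gradient algorithm with projection is the following:
\[ u_{k+1} = \mathrm{Proj}_K(u_k - \rho J'(u_k)), \quad \rho > 0. \]
Note that if $K = V$, this algorithm is exactly the fixed step gradient algorithm.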
Theorem 5.3 If $K$ is a closed non-empty convex subset of $V$, if $J : V \to \mathbb{R}$ is Gâteaux-differentiable and $\alpha$-convex, if $J'$ is Lipschitz continuous on $V$ (let $M$ be the Lipschitz constant), and if
\[ 0 < a \le \rho \le b < \frac{2\alpha}{M^2}, \]
then the gradient algorithm with projection converges, and
\[ \|u_k - u\| \le \beta^k \|u_0 - u\| \quad \text{with } \beta < 1. \]
Proof
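The projection onto $K$ is Lipschitz continuous, with a Lipschitz constant of 1. As $u = \mathrm{Proj}_K(u - \rho J'(u))$, $\|u_{k+1} - u\| \le \|(u_k - u) - \rho (J'(u_k) - J'(u))\|$. Then $\|u_{k+1} - u\|^2 \le \|u_k - u\|^2 + \rho^2 \|J'(u_k) - J'(u)\|^2 - 2\rho \langle u_k - u, J'(u_k) - J'(u) \rangle$. Using the Lipschitz continuity of $J'$ and the $\alpha$-convexity of $J$,
\[ \|u_{k+1} - u\|^2 \le \|u_k - u\|^2 + \rho^2 M^2 \|u_k - u\|^2 - 2\rho\alpha \|u_k - u\|^2 = (1 - 2\rho\alpha + \rho^2 M^2) \|u_k - u\|^2. \]
$f(\rho) := 1 - 2\rho\alpha + \rho^2 M^2$ is a convex quadratic function of $\rho$, equal to 1 for $\rho = 0$ and $\rho = \frac{2\alpha}{M^2}$. Then for $0 < a \le \rho \le b < \frac{2\alpha}{M^2}$, $f(\rho) \le \max(f(a), f(b)) =: \beta^2 < 1$.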
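When the projection is explicit, the algorithm is a one-line change to the fixed step gradient. The sketch below assumes that $K$ is a box $\{v : lo_i \le v_i \le hi_i\}$ in $\mathbb{R}^n$, for which the projection is a componentwise clamp, and that $J$ is the quadratic functional used earlier; the names, the stopping test and the test data are hypothetical.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def project_box(v, lo, hi):
    """Projection onto the box K = {v : lo_i <= v_i <= hi_i} (componentwise clamp)."""
    return [min(max(vi, l), h) for vi, l, h in zip(v, lo, hi)]


def projected_gradient(A, b, lo, hi, u0, rho, tol=1e-12, max_iter=10_000):
    """Iterate u_{k+1} = Proj_K(u_k - rho * (A u_k - b)) for a box K."""
    u = project_box(u0, lo, hi)
    for k in range(max_iter):
        grad = [au - bi for au, bi in zip(matvec(A, u), b)]  # J'(u_k) = A u_k - b
        u_new = project_box([ui - rho * gi for ui, gi in zip(u, grad)], lo, hi)
        if max(abs(x - y) for x, y in zip(u_new, u)) < tol:  # iterates have stalled
            return u_new, k + 1
        u = u_new
    return u, max_iter


# Hypothetical problem: the unconstrained minimum is (1, 1), but the box
# caps the first coordinate at 0.5, so the constrained minimum is (0.5, 1).
A = [[2.0, 0.0], [0.0, 1.0]]
b = [2.0, 1.0]
u, iterations = projected_gradient(A, b, lo=[0.0, 0.0], hi=[0.5, 2.0],
                                   u0=[0.0, 0.0], rho=0.4)
print(u, iterations)
```

Here $\alpha = 1$ and $M = 2$, so the step $\rho = 0.4$ satisfies the condition of Theorem 5.3.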
5.2.2 Identification of a saddle-point: Uzawa’s algorithm
As the projection operator is not explicitly known in general, computing the projection at each iteration can be very difficult. The idea is then to consider the Lagrangian associated with the constrained minimization problem, and to identify the saddle-points of the Lagrangian.
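We consider the convex minimization problem:
\[ \inf_{F(v) \le 0} J(v), \]
where $J : V \to \mathbb{R}$ is convex, and $F : V \to \mathbb{R}^m$ is convex. We assume that the hypotheses of the Kuhn-Tucker theorem are satisfied. Let $\mathcal{L}(v, q) = J(v) + \langle q, F(v) \rangle$. Then, by definition, $(u, p)$ is a saddle-point if
\[ \forall q \in \mathbb{R}^m_+, \quad \mathcal{L}(u, q) \le \mathcal{L}(u, p) \le \mathcal{L}(v, p), \quad \forall v \in V. \]
We deduce that $\langle p - q, F(u) \rangle \ge 0$ for all $q \in \mathbb{R}^m_+$. This is equivalent to $\langle p - q, p - (p + \rho F(u)) \rangle \le 0$, with $\rho > 0$. Then
\[ p = \mathrm{Proj}_{\mathbb{R}^m_+}(p + \rho F(u)), \quad \forall \rho > 0. \]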
Uzawa's algorithm is then the following:
• choose $p_0 \in \mathbb{R}^m_+$;
• $p_n$ being known, compute $u_n$, solution of the (unconstrained) optimization problem
\[ \mathcal{L}(u_n, p_n) = \inf_{v \in V} \mathcal{L}(v, p_n); \]
• compute the next Lagrange multiplier:
\[ p_{n+1} = \mathrm{Proj}_{\mathbb{R}^m_+}(p_n + \rho_n F(u_n)), \quad \rho_n > 0. \]
Note that the minimization problem is now unconstrained, and the Lagrange multiplier is given by a projection onto $\mathbb{R}^m_+$, which is straightforward (componentwise positive part).
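The steps above can be sketched on a small model problem. The sketch assumes $J(v) = \frac{1}{2}\|v\|^2 - \langle b, v \rangle$ and affine constraints $F(v) = Cv - d \le 0$, so that the inner minimization has the closed form $u_n = b - C^T p_n$; this setting, the function name `uzawa`, the fixed step and the iteration count are illustrative assumptions, not part of the text.

```python
def uzawa(b, C, d, rho, n_iter=200):
    """Uzawa iterations for min 1/2 |v|^2 - <b, v> subject to C v - d <= 0.

    C is a list of m rows of length n; rho is a fixed step (rho_n = rho).
    """
    m, n = len(d), len(b)
    p = [0.0] * m  # p_0 = 0 belongs to R^m_+
    u = list(b)
    for _ in range(n_iter):
        # u_n minimizes L(., p_n): the gradient v - b + C^T p_n vanishes
        u = [b[j] - sum(C[i][j] * p[i] for i in range(m)) for j in range(n)]
        # constraint values F(u_n) = C u_n - d
        F = [sum(C[i][j] * u[j] for j in range(n)) - d[i] for i in range(m)]
        # p_{n+1} = projection of p_n + rho * F(u_n) onto R^m_+
        p = [max(0.0, pi + rho * Fi) for pi, Fi in zip(p, F)]
    return u, p


# Hypothetical problem: closest point to b = (1, 1) in {v : v_1 + v_2 <= 1}.
# Here alpha = 1 and M^2 = |C|^2 = 2, so the bound on the step is 2 * alpha / M^2 = 1.
b = [1.0, 1.0]
C = [[1.0, 1.0]]
d = [1.0]
u, p = uzawa(b, C, d, rho=0.4)
print(u, p)  # u should approach (0.5, 0.5), with multiplier close to 0.5
```

For a general $J$, the closed-form update of `u` would be replaced by an inner unconstrained solver, for instance one of the gradient algorithms of Section 5.1.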
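Theorem 5.4 Under the previous hypotheses, and if $0 < a \le \rho_n \le b < \frac{2\alpha}{M^2}$ (where, as used in the proof, $\alpha$ is the $\alpha$-convexity constant of $J$ and $M$ is a Lipschitz constant of $F$), then the algorithm converges: $u_n \to u$ in $V$.
Note that the convergence of $(p_n)$ is not ensured.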
Proof
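$u_n$ and $u$ are characterized by the following inequalities (see proposition 2.15):
\[ \langle J'(u_n), v - u_n \rangle + \langle p_n, F(v) - F(u_n) \rangle \ge 0, \quad \forall v \in V, \]
\[ \langle J'(u), v - u \rangle + \langle p, F(v) - F(u) \rangle \ge 0, \quad \forall v \in V. \]
If we denote $r_n = p_n - p$, by choosing respectively $v = u$ and $v = u_n$, and adding the two inequalities,
\[ -\langle J'(u_n) - J'(u), u_n - u \rangle - \langle r_n, F(u_n) - F(u) \rangle \ge 0. \]
Hence, by $\alpha$-convexity of $J$,
\[ \langle r_n, F(u_n) - F(u) \rangle \le -\langle J'(u_n) - J'(u), u_n - u \rangle \le -\alpha \|u_n - u\|^2, \]
and $\|r_{n+1}\| \le \|r_n + \rho_n (F(u_n) - F(u))\|$ (by Lipschitz continuity of the projection). Then
\[ \|r_{n+1}\|^2 \le \|r_n\|^2 + 2\rho_n \langle r_n, F(u_n) - F(u) \rangle + \rho_n^2 \|F(u_n) - F(u)\|^2 \le \|r_n\|^2 - 2\rho_n \alpha \|u_n - u\|^2 + \rho_n^2 M^2 \|u_n - u\|^2. \]
As $0 < a \le \rho_n \le b < \frac{2\alpha}{M^2}$ implies $2\alpha\rho_n - \rho_n^2 M^2 \ge \beta > 0$ for some $\beta$ independent of $n$, $\|r_{n+1}\|^2 \le \|r_n\|^2 - \beta \|u_n - u\|^2$. Then the sequence $(\|r_n\|)$ is decreasing and bounded below, hence it converges. As
\[ 0 \le \beta \|u_n - u\|^2 \le \|r_n\|^2 - \|r_{n+1}\|^2, \]
$\|u_n - u\| \to 0$.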
Remark: Uzawa’s algorithm has a dual interpretation: it is exactly the gradient algorithm with fixed step and projection applied to the dual problem.