Optimization in higher dimensions

(1)

Quasi-Newton Methods

Conjugate Gradient Method

(2)

(3)

A bit of history

[Nocedal, Wright, Numerical Optimization 06], Chapters 6-7

• in the 1950s W. C. Davidon was using the "coordinate descent" method (gradient descent along the coordinates)

• the computer would always crash before the simulation was finished

• Davidon decided to find a way of accelerating the optimization process: he found one of the most creative ideas in nonlinear optimization

• Fletcher and Powell demonstrated that this algorithm was faster and more reliable than the methods existing at the time

• paradoxically, Davidon's paper was not accepted for publication. It remained a technical report for more than thirty years, until it appeared in SIAM Journal on Optimization in 1991!

(4)

Motivation

Recall the Variable Metric Method and replace $A_i^{-1}$ by $S_i$:

Algorithm 1 (Generic Variable Metric method)

Choose the starting point $x_0$.

Iteration $i$:

compute $f(x_i)$, $\nabla f(x_i)$ and possibly $D^2 f(x_i)$

choose a symmetric positive-definite matrix $S_i$ and compute the new direction $d_i = -S_i \nabla f(x_i)$

perform a line-search from $x_i$ in the direction $d_i$, giving the new iterate $x_{i+1} = x_i + t_i d_i = x_i - t_i S_i \nabla f(x_i)$.

• in the modified Newton method $S_i$ is computed as follows: find the Hessian $D^2 f(x_i)$, modify it to make it "well positive definite", then invert it or solve the linear system $S_i^{-1} d_i = -\nabla f(x_i)$ for $d_i$

• in quasi-Newton methods we try to skip all of this and compute $S_i$ recursively with one objective: $S_i - (D^2 f(x_i))^{-1} \to 0$

(5)

Variable Metric method: quadratic case

• minimize $f(x) = \frac{1}{2} x^T A x - b^T x$ with the Steepest Descent line-search

• denote $E(x_i) = f(x_i) - \min f$: the error in terms of the objective function

• $x_{i+1} = x_i - t_{opt} S_i \nabla f(x_i)$ is equivalent to a Steepest-Descent step after the change of coordinates $x = S_i^{1/2} \xi$

• step $i$ of the VM method is just a Steepest-Descent step for the matrix $S_i^{1/2} A S_i^{1/2}$. Therefore we have the estimate

$$E(x_{i+1}) \le \left( \frac{Q-1}{Q+1} \right)^2 E(x_i)$$

where $Q$ is the condition number of $S_i^{1/2} A S_i^{1/2}$

• if $S_i$ is close to $D^2 f(x_i)^{-1} = A^{-1}$ then $S_i^{1/2} A S_i^{1/2}$ is close to the identity matrix, so $Q$ is close to 1.

• Finally, if $Q$ converges to 1, we eventually get that $E(x_{i+1})/E(x_i) \to 0$, i.e. super-linear convergence

(6)

Basic rules for updating $S_i$

• Taylor's expansion formula tells us that

$$\nabla f(x_{i+1}) - \nabla f(x_i) \approx D^2 f(x_i)(x_{i+1} - x_i)$$

• Therefore, it is reasonable to request that

$$S_{i+1}(\nabla f(x_{i+1}) - \nabla f(x_i)) = x_{i+1} - x_i$$

called the secant relation (compare with the 1D case)

• With the notations $g_i = \nabla f(x_i)$, $p_i = x_{i+1} - x_i$, $q_i = g_{i+1} - g_i$ we have

$$S_{i+1} q_i = p_i,$$

called the quasi-Newton equation

• this leaves us with infinitely many possibilities... another goal is that $S_{i+1} - S_i$ is as simple as possible!

• initialization? one may simply choose $S_0 = \mathrm{Id}$

(7)

Small rank updates

• idea: find $S_{i+1} = S_i + B_i$ where $B_i$ has low rank

• Rank 1 updates: $B_i = \alpha_i z_i z_i^T$ - one may find $B_i$ such that the quasi-Newton relation holds

$$S_{i+1} = S_i + \alpha_i z_i z_i^T$$

• the quasi-Newton relation $p_i = S_{i+1} q_i$ implies $z_i = \omega_i (p_i - S_i q_i)$ for some scalar $\omega_i$

• in the end we get

$$S_{i+1} = S_i + \frac{1}{(p_i - S_i q_i)^T q_i} (p_i - S_i q_i)(p_i - S_i q_i)^T$$

• it is not possible to guarantee that $S_{i+1}$ is positive definite if $S_i$ is
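The rank 1 formula above can be checked numerically. The following is a minimal sketch, assuming NumPy; the function name `sr1_update` and the test vectors are illustrative choices, not part of the slides. It verifies the quasi-Newton relation and shows one case where positive definiteness is lost even though $p^T q > 0$.

```python
import numpy as np

def sr1_update(S, p, q):
    """Rank 1 update S + r r^T / (r^T q) with r = p - S q (hypothetical helper)."""
    r = p - S @ q                      # defect of the quasi-Newton relation S q = p
    denom = r @ q
    if abs(denom) <= 1e-12 * np.linalg.norm(r) * np.linalg.norm(q):
        return S                       # formula ill-defined: keep the old matrix
    return S + np.outer(r, r) / denom

S = np.eye(2)                          # positive-definite starting matrix
p = np.array([1.0, -0.5])
q = np.array([1.0, 1.0])               # note p @ q = 0.5 > 0
S_new = sr1_update(S, p, q)

print(S_new @ q)                       # equals p: the quasi-Newton relation holds
print(np.linalg.eigvalsh(S_new))       # one eigenvalue is negative here
```

In this example the update satisfies $S_{i+1} q_i = p_i$ but is indefinite, which is the weakness the rank 2 updates address.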

(8)

Rank 2 updates: DFP

• Davidon-Fletcher-Powell: historically, the first "good" quasi-Newton method

• use rank 2 updates: guarantee the positive-definiteness of $S_{i+1}$ under reasonable hypotheses

Proposition 1

Let $S$ be a positive definite symmetric matrix and $p$ and $q$ be two vectors such that $p^T q > 0$. Then the matrix

$$S' = S + \frac{1}{p^T q} p p^T - \frac{1}{q^T S q} S q q^T S$$

is symmetric positive definite and satisfies $S' q = p$.

• Proof: just compute $S' q$ and $x^T S' x$ and do a bit of linear algebra.

• How to get this idea? Just choose $S_{i+1} = S_i + \alpha u u^T + \beta v v^T$ (rank 2 update)

• then choose $u = p_i$ and $v = S_i q_i$

(9)

DFP method

• DFP update:

$$S_{i+1} = S_i + \frac{1}{p_i^T q_i} p_i p_i^T - \frac{1}{q_i^T S_i q_i} S_i q_i q_i^T S_i$$

• the condition $q_i^T p_i > 0$ is equivalent to

$$(\nabla f(x_{i+1}) - \nabla f(x_i)) \cdot (x_{i+1} - x_i) > 0,$$

which is true if $f$ is strictly convex: a reasonable assumption near a minimum...

• when using a Wolfe line-search we can guarantee that $q_i^T p_i > 0$.

• for the quadratic case DFP becomes the conjugate gradient method

• it turns out DFP is not the best method out there... it does not "self-correct" when $S_i$ gets far from the inverse Hessian
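As a quick illustration of the DFP formula and of Proposition 1, here is a small sketch assuming NumPy; `dfp_update` and the sample vectors are hypothetical names and data chosen for the example.

```python
import numpy as np

def dfp_update(S, p, q):
    """DFP update of the inverse-Hessian approximation S (requires p @ q > 0)."""
    Sq = S @ q
    return S + np.outer(p, p) / (p @ q) - np.outer(Sq, Sq) / (q @ Sq)

S = np.eye(2)
p = np.array([1.0, -0.5])
q = np.array([1.0, 1.0])               # p @ q = 0.5 > 0, as Proposition 1 requires
S_new = dfp_update(S, p, q)

print(S_new)                           # [[2.5, -1.5], [-1.5, 1.0]]
print(S_new @ q)                       # equals p
print(np.linalg.eigvalsh(S_new))       # both eigenvalues are positive
```

With the same pair $(p, q)$ a rank 1 update can become indefinite, whereas the DFP matrix stays positive definite.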

(10)

Duality: quasi-Newton relation

• any quasi-Newton update can generate another one:

$S_{i+1} = S_i + B_i(S_i, p_i, q_i)$ such that $S_{i+1} q_i = p_i$

then $q_i = S_{i+1}^{-1} p_i$ where $S_{i+1}^{-1} = (S_i + B_i(S_i, p_i, q_i))^{-1}$

switching the roles of $p_i$ and $q_i$ we get a different update, called the dual update

• how to get the dual of DFP: replace $S_i$ with $S_i^{-1}$ and interchange $p_i$ and $q_i$

$$S_{i+1}^{-1} = S_i^{-1} + \frac{1}{q_i^T p_i} q_i q_i^T - \frac{1}{p_i^T S_i^{-1} p_i} S_i^{-1} p_i p_i^T S_i^{-1}$$

• a direct computation or the Sherman-Morrison formula gives:

$$S_{i+1} = S_i - \frac{p_i q_i^T S_i + S_i q_i p_i^T}{p_i^T q_i} + \left( 1 + \frac{q_i^T S_i q_i}{p_i^T q_i} \right) \frac{p_i p_i^T}{p_i^T q_i}$$

(11)

The BFGS update

• BFGS: Broyden, Fletcher, Goldfarb, Shanno

$$S_{i+1} = S_i - \frac{p_i q_i^T S_i + S_i q_i p_i^T}{p_i^T q_i} + \left( 1 + \frac{q_i^T S_i q_i}{p_i^T q_i} \right) \frac{p_i p_i^T}{p_i^T q_i}$$

• widely used in most of the codes implemented today

• since BFGS is the dual of DFP, and a matrix is positive-definite if and only if its inverse is positive-definite, the BFGS update maintains positive-definiteness if $p_i^T q_i > 0$ (same hypothesis as for DFP to work...)

[Nocedal, Wright, Numerical Optimization 06], Chapters 6-7

• Local super-linear convergence: if an algorithm using BFGS with Wolfe's line-search converges to $x^*$, where $f$ is strongly convex with Lipschitz Hessian, then the convergence rate is super-linear

• BFGS can also be found by minimizing a certain distance between the current inverse Hessian approximation $S_i$ and the rank 2 update $S_{i+1}$, among matrices verifying the secant condition!

• BFGS has effective self-correcting properties
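The duality between DFP and BFGS can be verified numerically. The sketch below assumes NumPy; the helper names and the sample matrix are illustrative. It applies the BFGS formula to $S$ and compares the inverse of the result with the DFP formula applied to $S^{-1}$ with $p$ and $q$ interchanged.

```python
import numpy as np

def dfp_update(S, p, q):
    Sq = S @ q
    return S + np.outer(p, p) / (p @ q) - np.outer(Sq, Sq) / (q @ Sq)

def bfgs_update(S, p, q):
    """BFGS update of the inverse-Hessian approximation S (requires p @ q > 0)."""
    pq = p @ q
    Sq = S @ q                          # S is symmetric, so q^T S = (S q)^T
    return (S - (np.outer(p, Sq) + np.outer(Sq, p)) / pq
            + (1.0 + (q @ Sq) / pq) * np.outer(p, p) / pq)

S = np.array([[2.0, 0.5], [0.5, 1.0]])  # symmetric positive definite
p = np.array([1.0, -0.5])
q = np.array([1.0, 1.0])

S_bfgs = bfgs_update(S, p, q)
dual = dfp_update(np.linalg.inv(S), q, p)   # DFP on the inverse, roles swapped

print(np.allclose(S_bfgs @ q, p))           # quasi-Newton relation
print(np.allclose(np.linalg.inv(S_bfgs), dual))
```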

(12)

Extreme cases

Dimension 1:

• the quasi-Newton relation is just $S_{i+1} = \frac{p_i}{q_i}$ and we get

$$x_{i+1} = x_i - \frac{x_i - x_{i-1}}{f'(x_i) - f'(x_{i-1})} f'(x_i)$$

which is the false position (or secant) method
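A minimal sketch of this one-dimensional iteration, in plain Python. The function name `secant_minimize`, the stopping rule and the test function $f(x) = x^4/4 - x$ are assumptions made for the example.

```python
def secant_minimize(fprime, x_prev, x_curr, tol=1e-12, max_iter=50):
    """1D quasi-Newton iteration: S = p / q, then x_new = x - S * f'(x)."""
    for _ in range(max_iter):
        g_prev, g_curr = fprime(x_prev), fprime(x_curr)
        if abs(g_curr) < tol or g_curr == g_prev:
            break
        S = (x_curr - x_prev) / (g_curr - g_prev)   # scalar inverse-Hessian estimate
        x_prev, x_curr = x_curr, x_curr - S * g_curr
    return x_curr

# f(x) = x**4 / 4 - x has derivative x**3 - 1 and its minimizer at x = 1
x_min = secant_minimize(lambda x: x**3 - 1.0, 0.5, 2.0)
print(x_min)
```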

Large dimension:

• same disadvantage as Newton methods - an $n \times n$ matrix may be too large to store in memory

• it is possible to store only the update vectors and compute matrix-vector products using only scalar products

$$(u v^T) x = u (v^T x) = (v^T x) u$$

• limited memory BFGS (L-BFGS): use only the last $m$ vectors $p_i, q_i$ in order to compute $S_{i+1}$ - good behavior in practice despite being an approximation of BFGS
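One common way to apply the BFGS matrix without storing it is the so-called two-loop recursion (see the Nocedal-Wright reference cited earlier); the slides do not spell it out, so the sketch below is an assumption about how the idea is typically realized. It assumes NumPy, starts from $S_0 = \mathrm{Id}$, and the names `lbfgs_apply` and `bfgs_update` are illustrative. The dense matrix is built only to check the result.

```python
import numpy as np

def lbfgs_apply(g, pairs):
    """Return S g using only scalar products; pairs = [(p, q), ...], oldest first."""
    r = g.copy()
    alphas = []
    for p, q in reversed(pairs):            # newest pair first
        a = (p @ r) / (p @ q)
        alphas.append(a)
        r -= a * q
    for (p, q), a in zip(pairs, reversed(alphas)):   # oldest pair first
        b = (q @ r) / (p @ q)
        r += (a - b) * p
    return r

def bfgs_update(S, p, q):                   # dense update, used for checking only
    pq, Sq = p @ q, S @ q
    return (S - (np.outer(p, Sq) + np.outer(Sq, p)) / pq
            + (1.0 + (q @ Sq) / pq) * np.outer(p, p) / pq)

rng = np.random.default_rng(0)
n, m = 6, 3
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                 # SPD matrix, so q = A p gives p @ q > 0
pairs = []
for _ in range(m):
    p = rng.standard_normal(n)
    pairs.append((p, A @ p))
g = rng.standard_normal(n)

S = np.eye(n)
for p, q in pairs:
    S = bfgs_update(S, p, q)
direction_dense = S @ g
direction_lbfgs = lbfgs_apply(g, pairs)
print(np.allclose(direction_dense, direction_lbfgs))
```

The recursion costs a few scalar products per stored pair, which matches the $O(mN)$ cost quoted for L-BFGS.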

(13)

Computational cost per iteration

• after the function value, gradient and Hessian are computed (this is non-negligible in some applications):

GD: $O(N)$

Newton: $O(N^3)$ in the worst case (solving a linear system) - it all depends on the structure of the Hessian

BFGS, DFP: $O(N^2)$ - matrix-vector products

L-BFGS: $O(mN)$ where $m$ is the fixed number of gradients to remember

(14)

Practical example: the N-dimensional Rosenbrock

$$f(x) = \sum_{i=1}^{N-1} \left[ 100 (x_{i+1} - x_i^2)^2 + (1 - x_i)^2 \right]$$

with global minimum at $x^* = (1, 1, ..., 1)$.

• ill conditioning: the optimization process wants to achieve $x_{i+1} \approx x_i^2$ rather than minimizing $(x_i - 1)^2$ and going towards the global minimum!
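For reference, the function and its gradient can be written in a few lines. This is a sketch assuming NumPy; the function names are illustrative. The sample point lies exactly on the curved valley $x_{i+1} = x_i^2$, where only the $(1 - x_i)^2$ terms remain.

```python
import numpy as np

def rosenbrock(x):
    """N-dimensional Rosenbrock function."""
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

def rosenbrock_grad(x):
    """Analytic gradient of the N-dimensional Rosenbrock function."""
    t = x[1:] - x[:-1] ** 2                      # valley defect x_{i+1} - x_i^2
    g = np.zeros_like(x)
    g[:-1] += -400.0 * x[:-1] * t - 2.0 * (1.0 - x[:-1])
    g[1:] += 200.0 * t
    return g

x_valley = np.array([0.5, 0.25, 0.0625])         # satisfies x_{i+1} = x_i^2
print(rosenbrock(x_valley))                      # 0.8125 = 0.5**2 + 0.75**2
print(rosenbrock(np.ones(3)), rosenbrock_grad(np.ones(3)))
```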

(18)

Conclusion: quasi-Newton methods

equivalent of the Secant method in higher dimensions

achieve super-linear convergence without using the Hessian

for extremely large $n$ BFGS may be costly from a memory point of view: if possible use L-BFGS instead

BFGS and L-BFGS are often available in standard optimization libraries:

Example: scipy.optimize.minimize
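A short usage sketch, assuming SciPy is installed. `rosen` and `rosen_der` are the Rosenbrock function and gradient shipped with `scipy.optimize`; the starting point is an arbitrary choice for the example.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])        # arbitrary starting point

# Full BFGS: stores a dense inverse-Hessian approximation
res_bfgs = minimize(rosen, x0, jac=rosen_der, method="BFGS")

# Limited-memory variant: keeps only the last few update pairs
res_lbfgs = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")

print(res_bfgs.x, res_bfgs.nit)
print(res_lbfgs.x, res_lbfgs.nit)
```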

(19)

(20)

Motivation

• if $A$ is symmetric, positive-definite then solving the system $Ax = b$ is equivalent to minimizing the quadratic function

$$f : x \mapsto \frac{1}{2} x^T A x - b \cdot x$$

• the gradient of this quadratic function is $\nabla f(x) = Ax - b$

• direct method: process details about the matrix $A$ (factorization) and then solve the system: complexity is between $O(n^2)$ and $O(n^3)$.

• in contrast to this, iterative algorithms produce an approximation of the solution, which might be good enough for very large $n$

• for example: the gradient algorithm with Steepest-Descent will quickly converge to the optimum, but we can do better

(21)

Conjugate directions

• A given symmetric positive-definite matrix $A$ defines a scalar product $\langle x, y \rangle = x^T A y$

• Two (non-zero) directions $d_1$ and $d_2$ are called conjugate with respect to $A$ if they are orthogonal w.r.t. the above scalar product:

$$d_1 \text{ and } d_2 \text{ are conjugate} \iff d_1^T A d_2 = 0$$

• we may also call two directions which are conjugate w.r.t. $A$ as being $A$-orthogonal

• why is this useful? suppose $d_1, ..., d_k$ are mutually $A$-orthogonal and we have the decomposition

$$d = \sum_{j=1}^{k} \alpha_j d_j$$

Then, using the orthogonality property, we can find the coefficients $\alpha_i$ explicitly:

$$d_i^T A d = \alpha_i d_i^T A d_i \Rightarrow \alpha_i = \frac{d_i^T A d}{d_i^T A d_i} = \frac{\langle d, d_i \rangle}{\langle d_i, d_i \rangle}$$

• Consequence: if $d_1, ..., d_k$ are mutually $A$-orthogonal then they are linearly independent! (for a proof, use the above formula to see that $d = 0 \Rightarrow \alpha_i = 0$)

(22)

Why is this concept useful?

Proposition 2 (Solve a system using Conjugate Directions)

Let $A$ be a symmetric positive-definite matrix and $d_1, ..., d_n$ a (complete) system of $n$ non-zero $A$-orthogonal vectors. Then the solution $x^*$ to the system $Ax = b$ is given by the formula

$$x^* = \sum_{j=1}^{n} \frac{b^T d_j}{d_j^T A d_j} d_j$$

• An equivalent formulation:

$$x^* = A^{-1} b = \sum_{j=1}^{n} \frac{b^T d_j}{d_j^T A d_j} d_j = \left( \sum_{j=1}^{n} \frac{1}{d_j^T A d_j} d_j d_j^T \right) b$$

which gives us the explicit inverse of $A$

$$A^{-1} = \sum_{j=1}^{n} \frac{1}{d_j^T A d_j} d_j d_j^T$$

• All this is good when we know a complete family of $A$-orthogonal directions!
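One case where a complete $A$-orthogonal family is available for free is an orthonormal basis of eigenvectors of $A$ (eigenvectors of a symmetric matrix are both orthogonal and $A$-orthogonal). The sketch below uses this to check the two formulas of Proposition 2; it assumes NumPy, and the matrix and right-hand side are arbitrary example data.

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])         # symmetric, diagonally dominant, hence SPD
b = np.array([1.0, 2.0, 3.0])

# Eigenvectors of a symmetric matrix form an A-orthogonal family
_, V = np.linalg.eigh(A)
directions = V.T                        # one direction per row

# x* = sum_j (b^T d_j / d_j^T A d_j) d_j
x_star = sum((b @ d) / (d @ A @ d) * d for d in directions)

# A^{-1} = sum_j d_j d_j^T / (d_j^T A d_j)
A_inv = sum(np.outer(d, d) / (d @ A @ d) for d in directions)

print(np.allclose(A @ x_star, b))
print(np.allclose(A_inv @ A, np.eye(3)))
```

Computing eigenvectors is of course more expensive than solving the system, so this only illustrates the formulas.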

(24)

Conjugate Directions: quadratic case

Algorithm 2 (Conjugate Directions method)

Let $A$ be an $n \times n$ symmetric positive-definite matrix, $b$ a vector and $f(x) = \frac{1}{2} x^T A x - b^T x$ the quadratic form associated to $A$ and $b$.

Let $d_0, ..., d_{n-1}$ be a system of $A$-orthogonal vectors and $x_0$ a starting point.

Then, with the notation $g_i = \nabla f(x_i) = A x_i - b$, the iterative process

$$x_{i+1} = x_i + \gamma_i d_i, \quad \gamma_i = -\frac{d_i^T g_i}{d_i^T A d_i}, \quad i = 0, ..., n-1$$

converges to the unique minimizer $x^*$ of $f$ in $n$ steps.

• The step $\gamma_i$ is optimal in the direction $d_i$: define $q(t) = f(x + td)$, then $q'(t) = \nabla f(x + td) \cdot d = d \cdot \nabla f(x) + t \, d^T A d$, which vanishes for $t = -\frac{d^T \nabla f(x)}{d^T A d}$

• Proof: just look at $x_n$ and see that it gives exactly the formula for $x^*$.

• Important idea: $d_k^T A (x_k - x_0) = 0$ for any $k \ge 0$

• Again: all this is good when we know a complete family of $A$-orthogonal directions!

(26)

Properties of the Conjugate Directions Method

• define for each $i \ge 1$ the linear space $B_{i-1} = \mathrm{Span}\{d_0, ..., d_{i-1}\}$

• if we define the affine subspaces $M_i = x_0 + B_{i-1}$ then $\{x_0\} = M_0 \subset M_1 \subset ... \subset M_n = \mathbb{R}^n$

• the Conjugate Directions method generates the minimizers of $f$ in each of the affine spaces $M_i$

Proposition 3

For every $1 \le i \le n$ the vector $x_i$ is the minimizer of $f$ on the affine subspace $M_i = x_0 + B_{i-1}$. In particular, as shown previously, $x_i$ minimizes $f$ on the line $\{x_{i-1} + t d_{i-1} : t \in \mathbb{R}\}$.

Proof:

• Compute the gradient $g_i = \nabla f(x_i) = A x_i - b$ and note that $g_i$ is orthogonal to $d_0, ..., d_{i-1}$.

• Then obtain that $\nabla f(x_i) \cdot (x - x_i) = 0$ for $x \in x_0 + B_{i-1}$.

• $f$ is strictly convex so Euler's inequality tells us that $x_i$ is indeed the minimizer of $f$ in $x_0 + B_{i-1}$.

(27)

Build a basis of conjugated directions

• recall the Gram-Schmidt procedure

• define the $A$-projection of $v$ on $u$:

$$\mathrm{proj}_u(v) = \frac{\langle u, v \rangle}{\langle u, u \rangle} u = \frac{u^T A v}{u^T A u} u$$

Algorithm 3 (Gram-Schmidt)

0. Take a basis $(v_i)$ of $\mathbb{R}^n$: e.g. the canonical basis.

1. $u_1 = v_1$

2. $u_2 = v_2 - \mathrm{proj}_{u_1}(v_2)$

3. $u_3 = v_3 - \mathrm{proj}_{u_1}(v_3) - \mathrm{proj}_{u_2}(v_3)$

...

n. $u_n = v_n - \mathrm{proj}_{u_1}(v_n) - ... - \mathrm{proj}_{u_{n-1}}(v_n)$

In the end normalize the vectors: $d_i = \frac{1}{\sqrt{u_i^T A u_i}} u_i$

• in this form the process is not numerically stable: due to rounding errors the vectors $u_k$ may not be exactly $A$-orthogonal...
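The procedure above, followed by the Conjugate Directions iteration of Algorithm 2, can be sketched as follows. This assumes NumPy; the helper name `a_orthogonalize` and the small test system are illustrative, and the classical (not stabilized) form of Gram-Schmidt is used, exactly as in the algorithm.

```python
import numpy as np

def a_orthogonalize(A, V):
    """Classical Gram-Schmidt w.r.t. the scalar product <x, y> = x^T A y."""
    us = []
    for v in V:
        u = v.astype(float)                       # astype returns a copy
        for w in us:
            u -= (w @ A @ v) / (w @ A @ w) * w    # subtract the A-projection on w
        us.append(u)
    return [u / np.sqrt(u @ A @ u) for u in us]   # normalize in the A-norm

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
D = np.array(a_orthogonalize(A, np.eye(3)))       # start from the canonical basis

# Conjugate Directions method: one optimal step per direction
x = np.zeros(3)
for d in D:
    g = A @ x - b
    x = x - (d @ g) / (d @ A @ d) * d

print(np.allclose(D @ A @ D.T, np.eye(3)))        # directions are A-orthonormal
print(np.allclose(A @ x, b))                      # solution reached in n steps
```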

(28)

Conjugate Gradient Method

• we can compute the family of $A$-orthogonal directions during the optimization algorithm

Algorithm 4 (Conjugate Gradient)

Choose an arbitrary initialization point $x_0$ and set $d_0 = -g_0 = -\nabla f(x_0) = b - A x_0$.

Loop on $i = 0, ..., n-1$:

if $\nabla f(x_i) = 0$ then stop.

$x_{i+1} = x_i + \gamma_i d_i$ with $\gamma_i = -\frac{d_i^T g_i}{d_i^T A d_i}$

Compute the new gradient $g_{i+1} = \nabla f(x_{i+1}) = A x_{i+1} - b$

Compute the new direction $d_{i+1} = -g_{i+1} + \beta_i d_i$ with $\beta_i = \frac{g_{i+1}^T A d_i}{d_i^T A d_i}$

• as before, $\gamma_i$ is the optimal step in the direction $d_i$

• the parameter $\beta_i$ is chosen such that $d_{i+1}^T A d_i = 0$

• the new direction $d_{i+1}$ is obtained by removing from the anti-gradient direction $-g_{i+1}$ its $A$-projection on the previous direction $d_i$
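A direct transcription of Algorithm 4 might look as follows. This is a sketch assuming NumPy and a dense matrix; the function name, the stopping tolerance and the test system are choices made for the example.

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10):
    """Conjugate Gradient for Ax = b with A symmetric positive definite."""
    x = np.asarray(x0, dtype=float)
    g = A @ x - b                        # gradient of the quadratic function
    d = -g
    for _ in range(len(b)):              # at most n steps in exact arithmetic
        if np.linalg.norm(g) <= tol:
            break
        Ad = A @ d
        dAd = d @ Ad
        x = x - (d @ g) / dAd * d        # optimal step gamma_i along d_i
        g = A @ x - b
        d = -g + (g @ Ad) / dAd * d      # beta_i makes d_{i+1} A-orthogonal to d_i
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x = conjugate_gradient(A, b, np.zeros(3))
print(x, np.linalg.norm(A @ x - b))
```

Only one matrix-vector product `A @ d` per iteration is strictly needed; the gradient is recomputed here from its definition to stay close to the algorithm as stated.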

(29)

Main properties of CG

Proposition 4 (CG is a Conjugate Direction method)

If the algorithm does not terminate at step $i$ then:

the gradients $g_0, ..., g_{i-1}$ at $x_0, ..., x_{i-1}$ are non-zero and $\mathrm{Span}\{g_0, g_1, ..., g_{i-1}\} = \mathrm{Span}\{g_0, A g_0, ..., A^{i-1} g_0\}$

the directions $d_0, ..., d_{i-1}$ are non-zero and $\mathrm{Span}\{d_0, d_1, ..., d_{i-1}\} = \mathrm{Span}\{g_0, A g_0, ..., A^{i-1} g_0\}$

the directions $d_0, ..., d_{i-1}$ are $A$-orthogonal

Alternative formulas for $\gamma_i$ and $\beta_i$:

$$\gamma_i = \frac{g_i^T g_i}{d_i^T A d_i} \quad \text{and} \quad \beta_i = \frac{g_{i+1}^T g_{i+1}}{g_i^T g_i}.$$

• A sequence of the type $g_0, A g_0, A^2 g_0, ...$ is called a Krylov sequence

(30)

Consequences and convergence

• $x_i$ is the minimizer of $f$ in the affine subspace

$$x_0 + \mathrm{Span}\{d_0, ..., d_{i-1}\} = x_0 + \mathrm{Span}\{g_0, A g_0, ..., A^{i-1} g_0\}$$

• $x_i$ is the minimizer of $f$ in the affine subspace generated by $x_0$ and polynomials of $A$ of degree at most $i-1$ times $g_0$ (denote this polynomial space by $P_{i-1}$)

$$x_0 + \left\{ p(A) g_0 : p(z) = \sum_{j=0}^{i-1} p_j z^j \right\}$$

• error in terms of the objective function: $E(x) = f(x) - \min f = \frac{1}{2} (x - x^*)^T A (x - x^*)$

Proposition 5 (Error for CG)

$$E(x_i) = \min_{p \in P_{i-1}} \frac{1}{2} (x_0 - x^*)^T A (\mathrm{Id} - A p(A))^2 (x_0 - x^*)$$

• Proof: write $x_i = x_0 + p(A) g_0$ and recall that $\nabla f(x_i) = A (x_i - x^*)$

(31)

Error in terms of the spectrum of A

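Corollary

Let $\Sigma$ be the spectrum of $A$. Then

$$E(x_i) \le E(x_0) \min_{p \in P_i^*} \max_{\lambda \in \Sigma} p^2(\lambda),$$

where $P_i^*$ is the set of polynomials $p$ of degree at most $i$ such that $p(0) = 1$.

Another estimate is

$$E(x_i) \le \frac{1}{2} |x^* - x_0|^2 \min_{p \in P_i^*} \max_{\lambda \in \Sigma} \lambda p^2(\lambda),$$

• Proof: use an orthonormal basis made of eigenvectors of $A$

• denote by $Q$ the condition number of $A$. Then there exists a polynomial $q_s \in P_s^*$ such that

$$\max_{\lambda \in \Sigma} q_s(\lambda)^2 \le 4 \left( \frac{\sqrt{Q} - 1}{\sqrt{Q} + 1} \right)^{2s}$$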

(32)

Error estimate in terms of the condition number

• for the Conjugate Gradient algorithm we have

$$E(x_N) \le 4 \left( \frac{\sqrt{Q} - 1}{\sqrt{Q} + 1} \right)^{2N} E(x_0),$$

where $Q$ is the condition number of $A$.

• compare this with the error estimate for the Steepest-Descent

$$E(x_N) \le \left( \frac{Q - 1}{Q + 1} \right)^{2N} E(x_0)$$

• in order to reduce the initial error by a factor of $\varepsilon$ one needs $O(Q \log(1/\varepsilon))$ steps with Steepest Descent compared to $O(\sqrt{Q} \log(1/\varepsilon))$ steps with CG. This is a big difference!

• CG is supposed to converge in $n$ iterations, however rounding errors may prevent the convergence!

• moreover, if $A$ has $k \le n$ distinct eigenvalues then CG converges in $k$ iterations!

• Often, for $n$ large, the process is stopped before reaching $n$ iterations, when the error estimate is small enough
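The statement about distinct eigenvalues is easy to observe numerically. The sketch below assumes NumPy; the matrix is an artificial example built with only three distinct eigenvalues in dimension 30, and the function name is illustrative.

```python
import numpy as np

def cg_residual_norms(A, b, iters):
    """Run CG from x0 = 0 and return |Ax_k - b| for k = 0, ..., iters."""
    x = np.zeros_like(b)
    g = A @ x - b
    d = -g
    norms = [np.linalg.norm(g)]
    for _ in range(iters):
        Ad = A @ d
        dAd = d @ Ad
        x = x - (d @ g) / dAd * d
        g = A @ x - b
        d = -g + (g @ Ad) / dAd * d
        norms.append(np.linalg.norm(g))
    return norms

rng = np.random.default_rng(0)
n = 30
Qmat, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal matrix
eigs = np.repeat([1.0, 5.0, 10.0], n // 3)            # only 3 distinct eigenvalues
A = Qmat @ np.diag(eigs) @ Qmat.T
b = rng.standard_normal(n)

norms = cg_residual_norms(A, b, 3)
print(norms)        # the residual drops to rounding level after 3 iterations
```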

(33)

Example: Hilbert matrices

$A = \left( \frac{1}{i+j-1} \right)_{1 \le i,j \le n}$, ill conditioned

• comparison between GD with optimal step and CG: the residual $|Ax - b|$ is monitored at every iteration

• the residual decreases slowly for GD: the algorithm tends to go multiple times in the same direction! CG optimizes once and for all in the current direction.

• a small residual does not mean that $x$ is close to $x^*$: $Ax - b = A(x - x^*)$!
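A small experiment in the spirit of this comparison can be set up as follows. This is a sketch assuming NumPy; the dimension $n = 8$, the right-hand side $b = A(1, ..., 1)^T$ and the number of iterations are choices made for the example, not values taken from the slides.

```python
import numpy as np

n = 8
idx = np.arange(1, n + 1)
A = 1.0 / (idx[:, None] + idx[None, :] - 1.0)     # Hilbert matrix
x_true = np.ones(n)                               # assumed exact solution
b = A @ x_true

def gd_optimal_step(A, b, iters):
    x = np.zeros_like(b)
    for _ in range(iters):
        g = A @ x - b
        gAg = g @ A @ g
        if gAg <= 0.0:
            break
        x = x - (g @ g) / gAg * g                 # exact line-search step
    return x

def cg(A, b, iters):
    x = np.zeros_like(b)
    g = A @ x - b
    d = -g
    for _ in range(iters):
        Ad = A @ d
        dAd = d @ Ad
        if dAd <= 0.0:
            break
        x = x - (d @ g) / dAd * d
        g = A @ x - b
        d = -g + (g @ Ad) / dAd * d
    return x

def energy_error(x):
    e = x - x_true
    return 0.5 * e @ A @ e                        # E(x) = f(x) - min f

x_gd, x_cg = gd_optimal_step(A, b, n), cg(A, b, n)
for name, x in [("GD", x_gd), ("CG", x_cg)]:
    print(name, "residual:", np.linalg.norm(A @ x - b),
          "error:", np.linalg.norm(x - x_true), "E:", energy_error(x))
```

Printing both the residual and the distance to `x_true` makes the last remark visible: the two quantities can differ by many orders of magnitude for an ill-conditioned matrix.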

(37)

Important application: approximate solution of PDEs

Consider Laplace's equation

Find $u \in H_0^1(D)$ such that

$-\Delta u = f$ in $D$

$u = 0$ on $\partial D$

where $f \in L^2(D)$ is a given source.

It is possible to associate to this a variational formulation:

Find $u \in V$ such that $\forall v \in V$ we have $a(u, v) = \ell(v)$ where

the Hilbert space $V$ is the Sobolev space $H_0^1(D)$

$a(\cdot, \cdot)$ is a bilinear form on $V$ given by $a(u, v) = \int_D \nabla u \cdot \nabla v \, dx$

$\ell(\cdot)$ is a linear form on $V$ given by $\ell(v) = \int_D f v \, dx$

The Lax-Milgram theorem assures us that such a problem has a unique solution in $V$.

(38)

Finite element method

The finite element method proposes to search for an approximation $u_h$ in a finite dimensional subspace $V_h \subset V$.

The variational formulation is replaced by:

Find $u_h \in V_h$ such that $\forall v_h \in V_h$ we have $a(u_h, v_h) = \ell(v_h)$

Advantage: $V_h$ being of finite dimension, we can choose a basis $B = \{\varphi_i\}_{i=1}^N$ and the variational formulation becomes a linear system $A \bar{u} = b$ with

$A = (a(\varphi_i, \varphi_j))$, $b = (\ell(\varphi_i))$

where $\bar{u}$ are the coordinates of $u_h$ in the basis $B$.

The choice of the basis is important: one objective is to have a system given by a sparse matrix

(39)

Construct a finite element space

The domain $D$ is discretized using a mesh $T_h$ which consists of a partition into triangles in 2D or tetrahedra in 3D.

The parameter $h$, which indicates the convergence of the method, is typically related to the size of the mesh elements.

(40)

Construct a finite element space (2)

A basis $\{\varphi_1, ..., \varphi_{N_h}\}$ of finite element functions is introduced on the mesh $T_h$

Example

$N_h$ is the number of vertices $a_1, ..., a_{N_h}$ of the mesh

For each $i = 1, ..., N_h$, $\varphi_i$ is affine on each triangle $T \in T_h$, and $\varphi_i(a_i) = 1$ and $\varphi_i(a_j) = 0$ for $i \ne j$

(41)

Formulation of a matrix system

Decompose the solution $u_h$ in the basis of finite elements

$$u_h = \sum_{i=1}^{N_h} u_i \varphi_i$$

and the variational problem becomes a linear system of size $N_h \times N_h$

$$KU = F$$

where

$U = (u_1, ..., u_{N_h})^T$ is the vector of coefficients

$K$ is the stiffness (rigidity) matrix given by $K_{ij} = a(\varphi_i, \varphi_j)$

$F$ is the vector $F = (\ell(\varphi_i))_{i=1,...,N_h}$.

• The matrix $K$ will be symmetric and positive-definite so we are in the good framework where CG works!

• when $N_h$ is large (a few tens of thousands of elements) direct methods may become impractical (computation time, memory limitations)
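To make the system $KU = F$ concrete, here is a one-dimensional analogue, which is an assumption made for brevity (the slides work with triangles in 2D): $-u'' = 1$ on $(0, 1)$ with $u(0) = u(1) = 0$, piecewise affine hat functions on a uniform grid, and a dense matrix instead of a sparse one. The sketch assumes NumPy and solves the system with a basic CG loop; the exact solution $u(x) = x(1-x)/2$ is used only for checking.

```python
import numpy as np

n = 49                                   # number of interior nodes
h = 1.0 / (n + 1)                        # mesh size
nodes = np.linspace(h, 1.0 - h, n)

# Stiffness matrix K_ij = integral of phi_i' phi_j' for hat functions
K = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h
# Load vector F_i = integral of f * phi_i with f = 1
F = h * np.ones(n)

def cg_solve(A, b, tol=1e-12, max_iter=None):
    x = np.zeros_like(b)
    g = A @ x - b
    d = -g
    for _ in range(max_iter or 5 * len(b)):
        if np.linalg.norm(g) <= tol:
            break
        Ad = A @ d
        gamma = (g @ g) / (d @ Ad)       # alternative formula for gamma_i
        x = x + gamma * d
        g_new = g + gamma * Ad           # updated gradient A x - b
        d = -g_new + (g_new @ g_new) / (g @ g) * d
        g = g_new
    return x

U = cg_solve(K, F)
exact = nodes * (1.0 - nodes) / 2.0
print(np.max(np.abs(U - exact)))
```

In a realistic 2D or 3D computation $K$ would be stored in a sparse format, so that each product $d \mapsto Kd$ costs a number of operations proportional to the number of non-zero entries.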

(42)

Some results

(43)

CG for general functions

Algorithm 5 (Fletcher-Reeves CG on $\mathbb{R}^n$)

Choose a starting point $x^0$. Set the cycle counter $k = 1$.

Cycle $k$:

Initialization of the cycle: given $x_0$ compute $g_0 = \nabla f(x_0)$, $d_0 = -g_0$

Inner loop: for $i = 0, ..., n-1$

if $g_i = 0$ terminate, otherwise set $x_{i+1}$ as the minimizer of $t \mapsto f(x_i + t d_i)$

compute $g_{i+1} = \nabla f(x_{i+1})$

set $d_{i+1} = -g_{i+1} + \beta_i d_i$ with $\beta_i = \frac{g_{i+1}^T g_{i+1}}{g_i^T g_i}$

When the loop is finished, replace $x_0$ with $x_n$ and restart.

• note that in the inner loop we have a Steepest Descent line-search (exact minimization along $d_i$): this is not applicable in general. A line-search procedure should be used instead!

• It can be proved that in the non-degenerate case the convergence is quadratic in the number of cycles, i.e.

$$|x^{k+1} - x^*| \le C |x^k - x^*|^2$$

where $x^k$ is the sequence of starting points for cycles
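The cycle structure above can be sketched as follows. This assumes NumPy and SciPy; the one-dimensional minimization is delegated to `scipy.optimize.minimize_scalar` (Brent's method by default) as a stand-in for the exact line-search, and the test function, tolerance and cycle limit are arbitrary choices for the example.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fletcher_reeves(f, grad, x0, max_cycles=100, tol=1e-6):
    """Fletcher-Reeves nonlinear CG with a restart every n inner iterations."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    for _ in range(max_cycles):
        g = grad(x)
        d = -g                                   # each cycle starts with steepest descent
        for _ in range(n):
            if np.linalg.norm(g) <= tol:
                return x
            t = minimize_scalar(lambda s: f(x + s * d)).x   # numerical 1D minimization
            x = x + t * d
            g_new = grad(x)
            d = -g_new + (g_new @ g_new) / (g @ g) * d      # Fletcher-Reeves beta
            g = g_new
    return x

# Strongly convex, non-quadratic test function (hypothetical example)
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
f = lambda x: 0.5 * x @ A @ x - b @ x + 0.25 * np.sum(x ** 4)
grad = lambda x: A @ x - b + x ** 3

x_min = fletcher_reeves(f, grad, np.zeros(3))
print(x_min, np.linalg.norm(grad(x_min)))
```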

(44)

Comparison with previous methods

• again on the Rosenbrock function for $N = 100$

• in general nonlinear CG converges faster than GD but not necessarily faster than quasi-Newton methods

(45)

Conclusion on Conjugate Gradient method

when a complete system of $A$-orthogonal directions is known everything is explicit

it can be made into an iterative algorithm with a convergence rate much better than that of Steepest Descent

it converges in $n$ iterations (theoretically). In practice, for large $n$, we usually stop the process once the error estimate

$$E(x_N) \le 4 \left( \frac{\sqrt{Q} - 1}{\sqrt{Q} + 1} \right)^{2N} E(x_0)$$

is satisfactory.

cost of a step in CG: $O(n)$ + cost of a matrix-vector multiplication $d \mapsto Ad$.

This is particularly efficient when $A$ is sparse (has few non-zero elements)

Disadvantage: sensitivity to the condition number!

(46)

Conclusions: unconstrained optimization in ND

Gradient Descent algorithms: sensitive to conditioning!

Newton methods: fast convergence under the right hypotheses. Major practical inconveniences:

compute the Hessian matrix and (possibly) store it

doesn't necessarily decrease the function value

solve a linear system at every iteration

Variable metric methods: compute an approximation of the inverse Hessian

BFGS: rank 2 updates, standard in available implementations

even better for large $n$: L-BFGS - limit memory by using only information from the previous $m$ iterations

Conjugate Gradient methods: less sensitive to conditioning than Steepest Descent

Gauss-Newton: non-linear least squares

Nelder-Mead: gradient free method

(47)

Practical discussion

• get used to the structure of algorithms which are already implemented: in the practical session you will play with tools from scipy.optimize

• keep in mind to minimize the number of function evaluations in your codes:

not all functions to be optimized are computed in a cheap way

when the value of a function or its gradient is used multiple times, store it in a variable

in some computations involving physical simulations the gradient can often be computed using existing information from the solution given by the model: there is no point computing it multiple times
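One way to follow this advice with `scipy.optimize.minimize` is to return the value and the gradient from a single function and pass `jac=True`, so that intermediate quantities are computed once and shared. The sketch below assumes NumPy and SciPy; the Rosenbrock function stands in for an expensive model, and the call counter is only there for illustration.

```python
import numpy as np
from scipy.optimize import minimize

calls = {"n": 0}                          # counts evaluations of the "model"

def value_and_grad(x):
    """Return f(x) and its gradient together, sharing the intermediate t."""
    calls["n"] += 1
    t = x[1:] - x[:-1] ** 2               # computed once, used by value and gradient
    value = np.sum(100.0 * t ** 2 + (1.0 - x[:-1]) ** 2)
    grad = np.zeros_like(x)
    grad[:-1] = -400.0 * x[:-1] * t - 2.0 * (1.0 - x[:-1])
    grad[1:] += 200.0 * t
    return value, grad

x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
# jac=True tells SciPy that the objective returns the pair (value, gradient)
res = minimize(value_and_grad, x0, jac=True, method="BFGS")
print(res.x, res.nit, calls["n"])
```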