Optimization Reminders Part II
Jean-Christophe Pesquet
Center for Visual Computing, Inria, CentraleSupélec
Institut Universitaire de France
jean-christophe@pesquet.eu
Iterating projections
Feasibility problem
Problem
Let $\mathcal{H}$ be a Hilbert space. Let $m \in \mathbb{N} \setminus \{0,1\}$.
Let $(C_i)_{1 \le i \le m}$ be closed convex subsets of $\mathcal{H}$ such that $\bigcap_{i=1}^{m} C_i \neq \varnothing$. We want to
Find $\widehat{x} \in \bigcap_{i=1}^{m} C_i$.
POCS (Projection Onto Convex Sets) algorithm
For every $n \in \mathbb{N} \setminus \{0\}$, let $i_{n-1}$ denote the remainder after division of $n-1$ by $m$.
Set $x_0 \in \mathcal{H}$
For $n = 0, 1, \ldots$
    $x_{n+1} = P_{C_{i_n}}(x_n)$.
POCS algorithm with relaxation
Let $(\lambda_n)_{n\in\mathbb{N}}$ be a sequence in $[\epsilon_1, 2-\epsilon_2]$, where $(\epsilon_1,\epsilon_2) \in \left]0,+\infty\right[^2$ is such that $\epsilon_1 + \epsilon_2 < 2$.
For every $n \in \mathbb{N} \setminus \{0\}$, let $i_{n-1}$ denote the remainder after division of $n-1$ by $m$.
Set $x_0 \in \mathcal{H}$
For $n = 0, 1, \ldots$
    $x_{n+1} = x_n + \lambda_n \bigl( P_{C_{i_n}}(x_n) - x_n \bigr)$.
Convergence of POCS
Theorem
The sequence $(x_n)_{n\in\mathbb{N}}$ generated by the POCS algorithm converges to a point in $\bigcap_{i=1}^{m} C_i$.
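As a minimal numerical sketch of these iterations, assume $\mathcal{H} = \mathbb{R}^2$ and take two illustrative sets, a closed ball and a half-space, whose projections have the standard closed forms; all function names and data below are hypothetical.

import numpy as np

def proj_ball(x, c, rho):
    # Projection onto the closed ball B(c, rho)
    d = x - c
    nd = np.linalg.norm(d)
    return x if nd <= rho else c + rho * d / nd

def proj_halfspace(x, a, b):
    # Projection onto the half-space {y : <a | y> <= b}
    viol = a @ x - b
    return x if viol <= 0 else x - viol * a / (a @ a)

def pocs(x0, projections, n_iter=200, lam=1.0):
    # Cyclic (relaxed) projections: x_{n+1} = x_n + lam * (P_{C_{i_n}}(x_n) - x_n)
    x = np.asarray(x0, dtype=float)
    m = len(projections)
    for n in range(n_iter):
        P = projections[n % m]      # i_n = remainder after division of n by m
        x = x + lam * (P(x) - x)
    return x

# Example: intersection of a unit ball and a half-space in R^2 (nonempty by construction)
sets = [lambda x: proj_ball(x, c=np.zeros(2), rho=1.0),
        lambda x: proj_halfspace(x, a=np.array([1.0, 1.0]), b=0.5)]
print(pocs(np.array([3.0, -2.0]), sets))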
Exercise 1
Let $\mathcal{H}$ be a real Hilbert space.
1. Let $c \in \mathcal{H}$ and $\rho \in \left]0,+\infty\right[$. What is the expression of the projection onto the closed ball $B(c,\rho)$ with center $c$ and radius $\rho$?
2. We consider three closed balls which are assumed to have a common point. Propose an algorithm to compute such a point.
Lagrange duality
Constrained optimization problem
Let $\mathcal{H}$ be a Hilbert space. Let $f\colon \mathcal{H} \to \left]-\infty,+\infty\right]$.
Let $(m,q) \in \mathbb{N}^2$. For every $i \in \{1,\ldots,m\}$, let $g_i\colon \mathcal{H} \to \mathbb{R}$ and, for every $j \in \{1,\ldots,q\}$, let $h_j\colon \mathcal{H} \to \mathbb{R}$.
Let
$C = \{ x \in \mathcal{H} \mid (\forall i \in \{1,\ldots,m\})\; g_i(x) = 0,\ (\forall j \in \{1,\ldots,q\})\; h_j(x) \le 0 \}$.
We want to
Find $\widehat{x} \in \operatorname{Argmin}_{x \in C} f(x)$.
Remark: A vector $x \in \mathcal{H}$ is said to be feasible if $x \in \operatorname{dom} f \cap C$.
Definitions
The Lagrange function (or Lagrangian) associated with the previous problem is defined as
$(\forall x \in \mathcal{H})(\forall \mu = (\mu_i)_{1 \le i \le m} \in \mathbb{R}^m)(\forall \lambda = (\lambda_j)_{1 \le j \le q} \in [0,+\infty[^q)$
$\mathcal{L}(x,\mu,\lambda) = f(x) + \sum_{i=1}^{m} \mu_i g_i(x) + \sum_{j=1}^{q} \lambda_j h_j(x)$.
The vectors $\mu$ and $\lambda$ are called Lagrange multipliers.
Remark: $\operatorname{dom} \mathcal{L} = \operatorname{dom} f \times \mathbb{R}^m \times [0,+\infty[^q$.
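For instance, taking $\mathcal{H} = \mathbb{R}^N$ and an illustrative quadratic objective (a toy choice, not tied to the exercises below), with the single equality constraint $g_1(x) = \sum_{i=1}^{N} x^{(i)} - 1$ and the inequality constraints $h_j(x) = -x^{(j)} \le 0$, the Lagrangian reads:

% Illustrative Lagrangian for minimizing (1/2)||x||^2 over the above constraints
\[
  \mathcal{L}(x,\mu,\lambda)
  = \frac{1}{2}\|x\|^2
  + \mu \Bigl( \sum_{i=1}^{N} x^{(i)} - 1 \Bigr)
  - \sum_{j=1}^{N} \lambda_j x^{(j)},
  \qquad \mu \in \mathbb{R},\ \lambda \in [0,+\infty[^N .
\]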
Saddle points
$(\widehat{x},\widehat{\mu},\widehat{\lambda}) \in \mathcal{H} \times \mathbb{R}^m \times [0,+\infty[^q$ is a saddle point of $\mathcal{L}$ if
$(\forall (x,\mu,\lambda) \in \mathcal{H} \times \mathbb{R}^m \times [0,+\infty[^q)$
$\mathcal{L}(\widehat{x},\mu,\lambda) \le \mathcal{L}(\widehat{x},\widehat{\mu},\widehat{\lambda}) \le \mathcal{L}(x,\widehat{\mu},\widehat{\lambda})$.
Theorem
Let $\underline{\mathcal{L}}$ and $\overline{\mathcal{L}}$ be defined as
$(\forall (\mu,\lambda) \in \mathbb{R}^m \times [0,+\infty[^q)\quad \underline{\mathcal{L}}(\mu,\lambda) = \inf_{x \in \mathcal{H}} \mathcal{L}(x,\mu,\lambda)$
$(\forall x \in \mathcal{H})\quad \overline{\mathcal{L}}(x) = \sup_{\mu \in \mathbb{R}^m,\, \lambda \in [0,+\infty[^q} \mathcal{L}(x,\mu,\lambda)$.
$(\widehat{x},\widehat{\mu},\widehat{\lambda}) \in \mathcal{H} \times \mathbb{R}^m \times [0,+\infty[^q$ is a saddle point of $\mathcal{L}$ if and only if
$(\forall x \in \mathcal{H})\quad \overline{\mathcal{L}}(\widehat{x}) \le \overline{\mathcal{L}}(x)$,
$(\forall (\mu,\lambda) \in \mathbb{R}^m \times [0,+\infty[^q)\quad \underline{\mathcal{L}}(\mu,\lambda) \le \underline{\mathcal{L}}(\widehat{\mu},\widehat{\lambda})$,
and $\underline{\mathcal{L}}(\widehat{\mu},\widehat{\lambda}) = \overline{\mathcal{L}}(\widehat{x})$.
Remark: $\overline{\mathcal{L}}$ is called the primal Lagrange function and $\underline{\mathcal{L}}$ the dual Lagrange function.
Sufficient condition for a constrained minimum
Assume that there exists a feasible point.
If $(\widehat{x},\widehat{\mu},\widehat{\lambda}) \in \mathcal{H} \times \mathbb{R}^m \times [0,+\infty[^q$ is a saddle point of $\mathcal{L}$, then $\widehat{x}$ is a minimizer of $f$ over $C$.
In addition, the complementary slackness condition holds:
$(\forall j \in \{1,\ldots,q\})\quad \widehat{\lambda}_j h_j(\widehat{x}) = 0$.
Convex case
Assume that $f$ is a convex function, $(g_i)_{1 \le i \le m}$ are affine functions, and $(h_j)_{1 \le j \le q}$ are convex functions. Assume that the Slater condition holds, i.e. there exists $x \in \operatorname{dom} f$ such that
$(\forall i \in \{1,\ldots,m\})\; g_i(x) = 0$ and $(\forall j \in \{1,\ldots,q\})\; h_j(x) < 0$.
$\widehat{x}$ is a minimizer of $f$ over $C$ if and only if there exist $\widehat{\mu} \in \mathbb{R}^m$ and $\widehat{\lambda} \in [0,+\infty[^q$ such that $(\widehat{x},\widehat{\mu},\widehat{\lambda})$ is a saddle point of the Lagrangian.
Remark: Under the assumptions of the above theorem, if $\widehat{x}$ is a minimizer of $f$ over $C$, then $\mathcal{L}(\cdot,\widehat{\mu},\widehat{\lambda})$ is a convex function which attains its minimum at $\widehat{x}$.
This optimality condition is often used to calculate $\widehat{x}$, in conjunction with the complementary slackness condition.
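As a quick illustration, consider the toy one-dimensional problem of minimizing $f(x) = x^2$ subject to the single inequality constraint $h_1(x) = 1 - x \le 0$ (the Slater condition holds, e.g. at $x = 2$); the computation below is only an example of how these two conditions are combined.

% Lagrangian optimality:
\[
  \mathcal{L}(x,\lambda) = x^2 + \lambda(1-x), \qquad
  \partial_x \mathcal{L}(x,\lambda) = 2x - \lambda = 0 \;\Rightarrow\; x = \tfrac{\lambda}{2}.
\]
% Complementary slackness: \lambda (1-x) = 0. Taking \lambda = 0 gives x = 0, which is
% infeasible; hence 1 - x = 0, so \widehat{x} = 1 and \widehat{\lambda} = 2 \ge 0.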
Differentiable case
Karush-Kuhn-Tucker (KKT) theorem
Assume that $f$, $(g_i)_{1 \le i \le m}$, and $(h_j)_{1 \le j \le q}$ are continuously differentiable on $\mathcal{H} = \mathbb{R}^N$.
Assume that $\widehat{x}$ is a local minimizer of $f$ over $C$ satisfying the following Mangasarian-Fromovitz constraint qualification conditions:
(i) $\{\nabla g_i(\widehat{x}) \mid i \in \{1,\ldots,m\}\}$ is a family of linearly independent vectors;
(ii) there exists $z \in \mathbb{R}^N$ such that
$(\forall i \in \{1,\ldots,m\})\; \langle \nabla g_i(\widehat{x}) \mid z \rangle = 0$ and $(\forall j \in J(\widehat{x}))\; \langle \nabla h_j(\widehat{x}) \mid z \rangle < 0$,
where $J(\widehat{x}) = \{ j \in \{1,\ldots,q\} \mid h_j(\widehat{x}) = 0 \}$ is the set of active inequality constraints at $\widehat{x}$.
Then, there exist $\widehat{\mu} \in \mathbb{R}^m$ and $\widehat{\lambda} \in [0,+\infty[^q$ such that $\widehat{x}$ is a critical point of $\mathcal{L}(\cdot,\widehat{\mu},\widehat{\lambda})$ and the complementary slackness condition holds.
Remark: A sufficient condition for the Mangasarian-Fromovitz conditions to be satisfied is that $\{\nabla g_i(\widehat{x}) \mid i \in \{1,\ldots,m\}\} \cup \{\nabla h_j(\widehat{x}) \mid j \in J(\widehat{x})\}$ is a family of linearly independent vectors.
Exercise 2
Let $f$ be defined as
$(\forall x = (x^{(i)})_{1 \le i \le N} \in \mathbb{R}^N)\quad f(x) = \sum_{i=1}^{N} \exp(x^{(i)})$
with $N > 1$. We want to find a minimizer of $f$ on $\mathbb{R}^N$ subject to the constraints
$\sum_{i=1}^{N} x^{(i)} = 1, \qquad (\forall i \in \{1,\ldots,N\})\; x^{(i)} \ge 0.$
1. What can be said about the existence/uniqueness of a solution to this problem ?
2. Apply the Lagrange multiplier method.
Exercise 3
By using the Lagrange multiplier method, solve the following problem:
maximize $(x^{(N)})^3 - \frac{1}{2}(x^{(N)})^2$ over $x = (x^{(i)})_{1 \le i \le N} \in B$,
where $B$ is the unit sphere of $\mathbb{R}^N$, centered at $0$.
A few algorithms
Problem
Let $\mathcal{H}$ be a Hilbert space.
Let $f\colon \mathcal{H} \to \mathbb{R}$ be differentiable.
Let $C$ be a nonempty closed convex subset of $\mathcal{H}$.
We want to
Find $\widehat{x} \in \operatorname{Argmin}_{x \in C} f(x)$.
Objective: Build a sequence $(x_n)_{n\in\mathbb{N}}$ converging to a minimizer.
Principle of first-order methods
- If $f$ is differentiable, then, at iteration $n$, we have
  $(\forall x \in \mathcal{H})\quad f(x) = f(x_n) + \langle \nabla f(x_n) \mid x - x_n \rangle + o(\|x - x_n\|)$.
  So if $\|x_{n+1} - x_n\|$ is small enough and $x_{n+1}$ is chosen such that $\langle \nabla f(x_n) \mid x_{n+1} - x_n \rangle < 0$, then $f(x_{n+1}) < f(x_n)$.
- In particular, the steepest descent direction is given by $x_{n+1} - x_n = -\gamma_n \nabla f(x_n)$, $\gamma_n \in \left]0,+\infty\right[$.
- To ensure that the solution belongs to $C$, we can add a projection step.
- A relaxation parameter $\lambda_n$ can also be added.
Principle of first-order methods
The projected gradient algorithm has the following form:
$x_{n+1} = x_n + \lambda_n \bigl( P_C(x_n - \gamma_n \nabla f(x_n)) - x_n \bigr)$,
where $\gamma_n \in \left]0,+\infty\right[$ and $\lambda_n \in \left]0,1\right]$.
Remark: $x$ is a fixed point of the projected gradient iteration if and only if $x \in C$ and
$(\forall y \in C)\quad \langle \nabla f(x) \mid y - x \rangle \ge 0$.
Proof: If $x$ is a fixed point, then
$x = x + \lambda_n \bigl( P_C(x - \gamma_n \nabla f(x)) - x \bigr) \;\Leftrightarrow\; x = P_C(x - \gamma_n \nabla f(x))$.
According to the characterization of the projection, for every $y \in C$,
$\langle x - \gamma_n \nabla f(x) - x \mid y - x \rangle \le 0 \;\Leftrightarrow\; \langle \nabla f(x) \mid y - x \rangle \ge 0$.
Remark: When $f$ is convex, $x$ is a fixed point of the projected gradient iteration if and only if $x$ is a global minimizer of $f$ over $C$.
Remark: If $C = \mathcal{H}$ and $\lambda_n = 1$, we recover the standard gradient descent iteration:
$x_{n+1} = x_n - \gamma_n \nabla f(x_n)$.
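A minimal Python sketch of the projected gradient iteration, assuming an illustrative least-squares objective f(x) = 0.5*||A x - y||^2 on R^N and the box constraint C = [0,1]^N, whose projection is a simple clipping; all names and data below are hypothetical.

import numpy as np

def projected_gradient(grad_f, proj_C, x0, gamma, lam=1.0, n_iter=500):
    # Iterates x_{n+1} = x_n + lam * (P_C(x_n - gamma * grad_f(x_n)) - x_n)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x + lam * (proj_C(x - gamma * grad_f(x)) - x)
    return x

# Illustrative data
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
grad_f = lambda x: A.T @ (A @ x - y)        # gradient of the least-squares term
proj_C = lambda x: np.clip(x, 0.0, 1.0)     # projection onto the box [0,1]^5
beta = np.linalg.norm(A.T @ A, 2)           # Lipschitz constant of grad_f (spectral norm)
print(projected_gradient(grad_f, proj_C, np.zeros(5), gamma=1.0 / beta))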
Convergence
Convergence theorem
Assume that $f$ is convex and has a Lipschitzian gradient with constant $\beta \in \left]0,+\infty\right[$, i.e.
$(\forall (x,y) \in \mathcal{H}^2)\quad \|\nabla f(x) - \nabla f(y)\| \le \beta \|x - y\|$.
Assume that $\operatorname{Argmin}_{x \in C} f(x) \neq \varnothing$.
Assume that $\inf_{n\in\mathbb{N}} \gamma_n > 0$, $\sup_{n\in\mathbb{N}} \gamma_n < 2/\beta$, $\inf_{n\in\mathbb{N}} \lambda_n > 0$, and $\sup_{n\in\mathbb{N}} \lambda_n \le 1$.
Then the sequence $(x_n)_{n\in\mathbb{N}}$ generated by the projected gradient algorithm converges to a minimizer of $f$ over $C$.
In addition, if $f$ is strongly convex, the convergence is linear, i.e. there exists $\chi \in [0,1[$ such that
$(\forall n\in\mathbb{N})\quad \|x_n - \widehat{x}\| \le \chi^n \|x_0 - \widehat{x}\|$,
where $\widehat{x}$ is the unique minimizer of $f$ over $C$.
Remark: If $f$ is nonconvex with a $\beta$-Lipschitzian gradient, it can only be proved that $(f(x_n))_{n\in\mathbb{N}}$ is a convergent sequence, provided that $\gamma_n \le 1/\beta$.
Example: Uzawa algorithm
Problem
Let $\mathcal{L}\colon \mathcal{H} \times [0,+\infty[^q \to \mathbb{R}$ be differentiable with respect to its second argument. We want to find a saddle point of $\mathcal{L}$.
Solution
Set $\lambda_0 \in [0,+\infty[^q$
For $n = 0, 1, \ldots$
    Set $\gamma_n \in \left]0,+\infty\right[$, $\rho_n \in \left]0,1\right]$
    $x_n \in \operatorname{Argmin} \mathcal{L}(\cdot, \lambda_n)$
    $\lambda_{n+1} = \lambda_n + \rho_n \bigl( P_{[0,+\infty[^q}(\lambda_n + \gamma_n \nabla_\lambda \mathcal{L}(x_n, \lambda_n)) - \lambda_n \bigr)$.
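A minimal Python sketch of Uzawa iterations, assuming the illustrative quadratic problem of minimizing 0.5*||x - y||^2 subject to A x <= b, for which the inner minimization in x has the closed form x = y - A^T lam; the data below are hypothetical.

import numpy as np

def uzawa(A, b, y, gamma=0.1, rho=1.0, n_iter=2000):
    # Lagrangian: L(x, lam) = 0.5*||x - y||^2 + <lam | A x - b>, with lam >= 0
    lam = np.zeros(A.shape[0])
    for _ in range(n_iter):
        x = y - A.T @ lam                          # x_n = argmin_x L(x, lam_n) (closed form here)
        grad_lam = A @ x - b                       # gradient of L with respect to lam
        lam = lam + rho * (np.maximum(lam + gamma * grad_lam, 0.0) - lam)  # projected dual ascent
    return x, lam

# Toy data: project y onto {x in R^2 : x^(1) + x^(2) <= 1}
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
y = np.array([2.0, 0.0])
x_hat, lam_hat = uzawa(A, b, y)
print(x_hat)   # expected to be close to [1.5, -0.5]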
Principle of second-order methods
- If $f$ is twice differentiable, then, at iteration $n$, we have
  $(\forall x \in \mathcal{H})\quad f(x) = f(x_n) + \langle \nabla f(x_n) \mid x - x_n \rangle + \frac{1}{2} \langle x - x_n \mid \nabla^2 f(x_n)(x - x_n) \rangle + o(\|x - x_n\|^2)$.
- If $\nabla^2 f(x_n)$ is positive definite, the minimizer $x_{n+1}$ of the quadratic approximation is given by Newton's iteration (see the sketch below):
  $\nabla f(x_n) + \nabla^2 f(x_n)(x_{n+1} - x_n) = 0 \;\Leftrightarrow\; x_{n+1} = x_n - \nabla^2 f(x_n)^{-1} \nabla f(x_n)$.
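A minimal Python sketch of Newton's iteration, assuming an illustrative smooth strictly convex test function f(x) = sum_i exp(x^(i)) + 0.5*||x||^2 whose gradient and Hessian are written by hand.

import numpy as np

def newton(grad_f, hess_f, x0, n_iter=20, tol=1e-10):
    # Newton iterations x_{n+1} = x_n - (hess f(x_n))^{-1} grad f(x_n)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        step = np.linalg.solve(hess_f(x), grad_f(x))   # solve the linear system rather than inverting
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

grad_f = lambda x: np.exp(x) + x                        # gradient of the assumed test function
hess_f = lambda x: np.diag(np.exp(x)) + np.eye(x.size)  # Hessian (positive definite)
print(newton(grad_f, hess_f, np.zeros(3)))              # each coordinate tends to about -0.567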
Convergence
Convergence theorem
Let $f\colon \mathcal{H} \to \left]-\infty,+\infty\right]$ be three times continuously differentiable in a neighborhood of a local minimizer $\widehat{x}$ and assume that $\nabla^2 f(\widehat{x})$ is positive definite.
Then, there exists $\epsilon \in \left]0,+\infty\right[$ such that, if $\|x_0 - \widehat{x}\| \le \epsilon$, then $(x_n)_{n\in\mathbb{N}}$ converges to $\widehat{x}$.
In addition, the convergence is quadratic, i.e. there exists $\kappa \in \left]0,+\infty\right[$ such that
$(\forall n\in\mathbb{N})\quad \|x_{n+1} - \widehat{x}\| \le \kappa \|x_n - \widehat{x}\|^2$.
Numerical behaviour
- Although the convergence of Newton's algorithm is faster than that of gradient descent in terms of number of iterations, the computational cost of each iteration is higher.
- To improve the convergence guarantees of Newton's algorithm, it may be modified in practice as follows (a sketch is given after this list):
  $(\forall n\in\mathbb{N})\quad x_{n+1} = x_n - \gamma_n \bigl( \nabla^2 f(x_n) + \lambda_n \mathrm{Id} \bigr)^{-1} \nabla f(x_n)$, with $(\gamma_n, \lambda_n) \in \left]0,+\infty\right[^2$.
- Quasi-Newton algorithms read
  $(\forall n\in\mathbb{N})\quad x_{n+1} = x_n - H_n^{-1} \nabla f(x_n)$,
  where $H_n$ is a positive definite matrix providing some approximation of the Hessian.
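A minimal sketch of the regularized Newton step above, reusing the same illustrative test function as in the previous sketch; the constant choices gamma_n = 1 and lambda_n = 1e-3 are made only for illustration.

import numpy as np

def damped_newton(grad_f, hess_f, x0, gamma=1.0, lam=1e-3, n_iter=50):
    # Regularized step: x_{n+1} = x_n - gamma * (hess f(x_n) + lam * Id)^{-1} grad f(x_n)
    x = np.asarray(x0, dtype=float)
    I = np.eye(x.size)
    for _ in range(n_iter):
        x = x - gamma * np.linalg.solve(hess_f(x) + lam * I, grad_f(x))
    return x

grad_f = lambda x: np.exp(x) + x                        # same assumed test function as above
hess_f = lambda x: np.diag(np.exp(x)) + np.eye(x.size)
print(damped_newton(grad_f, hess_f, 5.0 * np.ones(3)))  # converges to about -0.567 in each coordinate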
Exercise 4
Let $\mathcal{H}$ and $\mathcal{G}$ be real Hilbert spaces and let $L \in \mathcal{B}(\mathcal{H},\mathcal{G})$. Let $y \in \mathcal{G}$ and let $\alpha \in \left]0,+\infty\right[$.
We want to minimize the function defined as
$(\forall x \in \mathcal{H})\quad f(x) = \frac{1}{2}\|Lx - y\|^2 + \frac{\alpha}{2}\|x\|^2$.
1. Give the form of the gradient descent algorithm allowing us to solve this problem.
2. How does Newton's algorithm read for this function?
3. Study the convergence of the gradient descent algorithm by performing the eigendecomposition of $L^* L$.