Convex Optimization with Inexact Model of the Objective Function
Alexander Tyurin¹, Pavel Dvurechensky², Alexander Gasnikov³
¹HSE (Russian Federation)
²WIAS (Germany)
³MIPT (Russian Federation)
General optimization problem
General convex optimization problem:

    F(x) → min_{x∈Q}.

We denote by x∗ a solution of this problem.
1. F(x) is a convex function.
2. Q ⊆ R^n is a convex, closed set.
Examples:
1. Physics, gas networks:

    (1/3) Σ_{i=1}^N α_i |x_i|³ → min_{Ax=d}.

2. Machine learning, logistic regression:

    Σ_{i=1}^N log(1 + exp(x_i^T w)) → min_{w∈R^n}.
Bregman divergence
Prox-function d(x): continuously differentiable on int Q and 1-strongly convex, i.e.

    d(x) ≥ d(y) + ⟨∇d(y), x − y⟩ + (1/2)‖x − y‖²,  x, y ∈ int Q.

Bregman divergence: V(x, y) := d(x) − d(y) − ⟨∇d(y), x − y⟩.
Examples:

    V(x, y) = (1/2)‖x − y‖₂²;   V(x, y) = Σ_{i=1}^n x_i log(x_i/y_i) (KL-divergence).
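As a quick illustration, here is a minimal Python sketch (ours, not from the slides) of these two divergences; the helper names are ours.

import numpy as np

def v_euclidean(x, y):
    # d(x) = (1/2)||x||_2^2 gives V(x, y) = (1/2)||x - y||_2^2.
    return 0.5 * np.sum((x - y) ** 2)

def v_kl(x, y):
    # d(x) = sum_i x_i log x_i (negative entropy, 1-strongly convex
    # w.r.t. ||.||_1 on the probability simplex) gives the KL divergence.
    return np.sum(x * np.log(x / y))

x = np.array([0.2, 0.3, 0.5])
y = np.ones(3) / 3
print(v_euclidean(x, y), v_kl(x, y))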
Definition of (δ, L)-model
Definition
A pair (F_δ(y), ψ_δ(x, y)) is a (δ, L)-model of a function F at a point y with respect to the norm ‖·‖ if for all x ∈ Q

    0 ≤ F(x) − F_δ(y) − ψ_δ(x, y) ≤ (L/2)‖x − y‖² + δ,   ψ_δ(x, x) = 0 ∀x ∈ Q,

and ψ_δ(x, y) is a convex function of x for all y ∈ Q.
Definition (Devolder–Glineur–Nesterov, 2013)
A function F admits a (δ, L)-oracle at a point y if there exists a pair (F_δ(y), ∇F_δ(y)) such that

    0 ≤ F(x) − F_δ(y) − ⟨∇F_δ(y), x − y⟩ ≤ (L/2)‖x − y‖² + δ,  ∀x ∈ Q.
See also the definition of the (δ, L)-inexact oracle in O. Devolder's PhD thesis, 2013.
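The defining inequality is easy to probe numerically. Below is a hedged sanity check (ours, not from the slides): sample points around y and test the two-sided (δ, L)-model inequality; the example verifies that the linearization of a smooth quadratic is a (0, L)-model.

import numpy as np

def is_delta_L_model(F, F_delta, psi_delta, y, delta, L,
                     n_samples=1000, radius=1.0, seed=0):
    # Empirically test 0 <= F(x) - F_delta(y) - psi_delta(x, y)
    #                     <= (L/2)||x - y||^2 + delta on random points.
    rng = np.random.default_rng(seed)
    for _ in range(n_samples):
        x = y + radius * rng.standard_normal(y.shape)
        gap = F(x) - F_delta(y) - psi_delta(x, y)
        if gap < -1e-12 or gap > 0.5 * L * np.sum((x - y) ** 2) + delta + 1e-12:
            return False
    return True

A = np.diag([1.0, 3.0])
F = lambda x: 0.5 * x @ A @ x
grad = lambda y: A @ y
y0 = np.array([1.0, -2.0])
# Linear model of a smooth quadratic: a (0, L)-model with L = max eigenvalue.
print(is_delta_L_model(F, F, lambda x, y: grad(y) @ (x - y), y0, delta=0.0, L=3.0))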
Definition of inexact solution
Definition
Consider an optimization problem

    ψ(x) → min_{x∈Q},

where ψ is a convex function. Then Argmin^{δ̃}_{x∈Q} ψ(x) is the set of points x̃ such that

    ∃ h ∈ ∂ψ(x̃):  ⟨h, x∗ − x̃⟩ ≥ −δ̃.

An arbitrary point of Argmin^{δ̃}_{x∈Q} ψ(x) is denoted by argmin^{δ̃}_{x∈Q} ψ(x).
From the definition it follows that

    ψ(x̃) − ψ(x∗) ≤ δ̃.

The converse is not always true.
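For intuition, here is a small sketch (ours) of this condition for a separable quadratic over a box, where the exact minimizer is obtained by clipping; since ψ is smooth here, the subgradient is just the gradient.

import numpy as np

def check_inexact_argmin(grad_psi, x_tilde, x_star, delta_tilde):
    # The inexact-argmin condition <grad psi(x~), x* - x~> >= -delta~.
    return grad_psi(x_tilde) @ (x_star - x_tilde) >= -delta_tilde

A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 2.0])
grad_psi = lambda x: A @ x - b                      # psi(x) = (1/2)x^T A x - b^T x
x_star = np.clip(np.linalg.solve(A, b), 0.0, 1.0)   # exact minimizer on [0,1]^2
x_tilde = x_star + np.array([-0.01, 0.0])           # a nearby inexact point
print(check_inexact_argmin(grad_psi, x_tilde, x_star, delta_tilde=1e-3))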
Gradient descent
Input: x_0, N – number of steps, sequences {δ_k}_{k=0}^{N−1}, {δ̃_k}_{k=0}^{N−1}, and L_0 > 0.
Step 0: L_1 = L_0/2.
Step k+1: α_{k+1} = 1/L_{k+1},

    φ_{k+1}(x) = V(x, x_k) + α_{k+1} ψ_{δ_k}(x, x_k),   x_{k+1} = argmin^{δ̃_k}_{x∈Q} φ_{k+1}(x).

If F_{δ_k}(x_{k+1}) > F_{δ_k}(x_k) + ψ_{δ_k}(x_{k+1}, x_k) + (L_{k+1}/2)‖x_{k+1} − x_k‖² + δ_k, then L_{k+1} := 2L_{k+1} and repeat the current step.
Otherwise, L_{k+2} = L_{k+1}/2 and go to the next step.
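A minimal Python sketch (ours, not the authors' code) of this adaptive scheme for the Euclidean setup V(x, y) = (1/2)‖x − y‖₂², Q = R^n, and the exact linear model ψ(x, y) = ⟨∇F(y), x − y⟩, so that δ_k = δ̃_k = 0; all helper names are ours.

import numpy as np

def gradient_descent(F, grad, x0, N, L0=1.0):
    x, L = x0.copy(), L0 / 2
    xs = [x.copy()]
    for _ in range(N):
        while True:
            g = grad(x)
            x_new = x - g / L              # argmin of V(., x_k) + (1/L) psi(., x_k)
            if F(x_new) <= F(x) + g @ (x_new - x) + 0.5 * L * np.sum((x_new - x) ** 2):
                break
            L *= 2                         # model check failed: double L, repeat step
        x = x_new
        xs.append(x.copy())
        L /= 2                             # optimistic halving for the next step
    return np.mean(xs[:-1], axis=0)        # averaged point, as in the theorem below

A = np.diag([1.0, 10.0])
F = lambda x: 0.5 * x @ A @ x
print(F(gradient_descent(F, lambda x: A @ x, np.array([1.0, 1.0]), N=200)))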
Fast gradient descent
Input: x_0, N – number of steps, sequences {δ_k}_{k=0}^{N−1}, {δ̃_k}_{k=0}^{N−1}, and L_0 > 0.
Step 0: y_0 = u_0 = x_0; L_1 = L_0/2; α_0 = 0; A_0 = α_0.
Step k+1: Find α_{k+1} from A_k + α_{k+1} = L_{k+1} α_{k+1}²; set A_{k+1} = A_k + α_{k+1},

    y_{k+1} = (α_{k+1} u_k + A_k x_k)/A_{k+1},
    φ_{k+1}(x) = V(x, u_k) + α_{k+1} ψ_{δ_k}(x, y_{k+1}),   u_{k+1} = argmin^{δ̃_k}_{x∈Q} φ_{k+1}(x),
    x_{k+1} = (α_{k+1} u_{k+1} + A_k x_k)/A_{k+1}.

If F_{δ_k}(x_{k+1}) > F_{δ_k}(y_{k+1}) + ψ_{δ_k}(x_{k+1}, y_{k+1}) + (L_{k+1}/2)‖x_{k+1} − y_{k+1}‖² + δ_k, then L_{k+1} := 2L_{k+1} and repeat the current step.
Otherwise, L_{k+2} = L_{k+1}/2 and go to the next step.
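A matching sketch (ours) of the fast method under the same assumptions: Euclidean prox-function, exact linear model, Q = R^n.

import numpy as np

def fast_gradient_descent(F, grad, x0, N, L0=1.0):
    x, u = x0.copy(), x0.copy()
    L, A = L0 / 2, 0.0
    for _ in range(N):
        while True:
            alpha = (1 + np.sqrt(1 + 4 * L * A)) / (2 * L)  # root of A + a = L a^2
            A_new = A + alpha
            y = (alpha * u + A * x) / A_new
            u_new = u - alpha * grad(y)    # argmin of V(., u_k) + alpha psi(., y_{k+1})
            x_new = (alpha * u_new + A * x) / A_new
            if F(x_new) <= F(y) + grad(y) @ (x_new - y) + 0.5 * L * np.sum((x_new - y) ** 2):
                break
            L *= 2                         # model check failed: double L, repeat step
        x, u, A = x_new, u_new, A_new
        L /= 2
    return x

A_mat = np.diag([1.0, 10.0])
F = lambda x: 0.5 * x @ A_mat @ x
print(F(fast_gradient_descent(F, lambda x: A_mat @ x, np.array([1.0, 1.0]), N=100)))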
Theorem (Gradient descent)
Let V(x∗, x_0) ≤ R², where x_0 is the starting point, x∗ is the solution closest to x_0 in the sense of the Bregman divergence V, and x̄_N = (1/N) Σ_{k=0}^{N−1} x_k. For the proposed algorithm we have

    F(x̄_N) − F(x∗) ≤ 2LR²/N + (2L/N) Σ_{k=0}^{N−1} δ̃_k + (2/N) Σ_{k=0}^{N−1} δ_k.
Theorem (Fast gradient descent)
Let V(x∗, x_0) ≤ R², where x_0 is the starting point and x∗ is the solution closest to x_0 in the sense of the Bregman divergence V. For the proposed algorithm we have

    F(x_N) − F(x∗) ≤ 8LR²/(N+1)² + 8L (Σ_{k=0}^{N−1} δ̃_k)/(N+1)² + 2 (Σ_{k=0}^{N−1} δ_k A_{k+1})/A_N.
Gradient descent vs Fast gradient descent
Let δ̃_k = δ̃ and δ_k = δ. Then

    F(x̄_N) − F(x∗) ≤ 2LR²/N + 2Lδ̃ + 2δ,   F(x_N) − F(x∗) ≤ 8LR²/(N+1)² + 8Lδ̃/(N+1) + 2Nδ.

Fast gradient descent is more robust to δ̃, the error from solving the intermediate optimization problem inexactly, but it accumulates δ, the error of the (δ, L)-model.
There are intermediate gradient methods with convergence rate equal to (in the case δ̃ = 0 – Devolder–Glineur–Nesterov, 2013)

    O(1) LR²/N^p + O(1) N^{1−p} δ̃ + O(1) N^{p−1} δ,  where p ∈ [1, 2] (D. Kamzolov, 2017).
Linear model (Nesterov, 1983)
Assume that F is a smooth convex function and the gradient of F is Lipschitz-continuous with parameter L. Then

    0 ≤ F(x) − F(y) − ⟨∇F(y), x − y⟩ ≤ (L/2)‖x − y‖²,  ∀x, y ∈ Q.

We can take ψ_{δ_k}(x, y) = ⟨∇F(y), x − y⟩, F_{δ_k}(y) = F(y), and δ_k = 0 for all k ≥ 0. If, moreover, we can solve the intermediate optimization problems exactly, so that δ̃_k = 0 for all k ≥ 0, then

    F(x_N) − F(x∗) ≤ 8LR²/(N+1)².

The bound is optimal.
Universal method (Nesterov, 2013)
Let

    ‖∇F(x) − ∇F(y)‖_* ≤ L_ν ‖x − y‖^ν,  ∀x, y ∈ Q.

Then

    0 ≤ F(x) − F(y) − ⟨∇F(y), x − y⟩ ≤ (L(δ)/2)‖x − y‖² + δ,  ∀x, y ∈ Q,

where

    L(δ) = L_ν ( (L_ν/(2δ)) · (1−ν)/(1+ν) )^{(1−ν)/(1+ν)}

and δ > 0 is a free parameter. Consequently, we can take ψ_{δ_k}(x, y) = ⟨∇F(y), x − y⟩ and F_{δ_k}(y) = F(y). Let us assume that δ̃_k = 0 for all k ≥ 0 and take

    δ_k = ε α_{k+1}/(4 A_{k+1}),  ∀k,

where ε is the desired accuracy of the solution in function value.
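A tiny helper (ours) for this L(δ) trade-off; note that for ν = 1 it reduces to the usual Lipschitz constant L₁.

def L_of_delta(L_nu, nu, delta):
    # L(delta) = L_nu * (L_nu/(2*delta) * (1-nu)/(1+nu)) ** ((1-nu)/(1+nu)).
    return L_nu * (L_nu / (2 * delta) * (1 - nu) / (1 + nu)) ** ((1 - nu) / (1 + nu))

print(L_of_delta(L_nu=1.0, nu=0.5, delta=1e-3))  # grows as delta shrinks (nu < 1)
print(L_of_delta(L_nu=1.0, nu=1.0, delta=1e-3))  # nu = 1: L(delta) = L_1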
Universal method (Nesterov, 2013)
In order to have

    F(x_N) − F(x∗) ≤ ε,

it is enough to do

    N ≤ inf_{ν∈[0,1]} [ 64^{(1+ν)/(1+3ν)} · (2^{(2−2ν)/(1+ν)})^{(1−ν)/(1+3ν)} · (L_ν R^{1+ν}/ε)^{2/(1+3ν)} ]

steps. The bound is optimal.
Universal conditional gradient (Frank–Wolfe) method
In the fast gradient method we have

    φ_{k+1}(x) = V(x, u_k) + α_{k+1} ψ_{δ_k}(x, y_{k+1}).

Instead, let us take (the idea goes back to A. Nemirovski, 2013)

    φ̃_{k+1}(x) = α_{k+1} ψ_{δ_k}(x, y_{k+1}).

Let us look at this substitution from the viewpoint of the error δ̃_k. One can show that it is enough to take δ̃_k = 2R_Q² for all k ≥ 0, where R_Q is the diameter of the set Q. Let us take

    δ_k = ε α_{k+1}/(4 A_{k+1}),  ∀k,

where ε is the desired accuracy of the solution in function value. Finally,
    N ≤ inf_{ν∈(0,1]} ( 96^{2ν/(1+ν)} · (2^{(2−2ν)/(1+ν)})^{2ν/(1−ν)} · L_ν R_Q^{1+ν}/ε )^{1/ν}.
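To see why dropping V(x, u_k) helps: the auxiliary problem becomes linear minimization over Q, which a linear minimization oracle solves at a vertex. Below is a minimal sketch (ours) for the unit simplex; for simplicity it uses the classical step size 2/(k+2) of the standard Frank–Wolfe method rather than the adaptive scheme above.

import numpy as np

def frank_wolfe(grad, x0, N):
    x = x0.copy()
    for k in range(N):
        g = grad(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0              # argmin of <g, .> over the simplex: a vertex
        x += 2.0 / (k + 2) * (s - x)       # classical step size 2/(k+2)
    return x

A = np.array([[2.0, 0.5], [0.5, 1.0]])
x = frank_wolfe(lambda x: A @ x, np.array([0.5, 0.5]), N=500)
print(x)  # approaches the minimizer of (1/2) x^T A x over the simplex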
Composite optimization (Nesterov, 2008)
    F(x) := f(x) + h(x) → min_{x∈Q},

where f is a smooth convex function whose gradient is Lipschitz-continuous with parameter L, and h is a convex function. One can show that

    0 ≤ F(x) − F(y) − ⟨∇f(y), x − y⟩ − h(x) + h(y) ≤ (L/2)‖x − y‖²,  ∀x, y ∈ Q.

We can take ψ_{δ_k}(x, y) = ⟨∇f(y), x − y⟩ + h(x) − h(y), F_{δ_k}(y) = F(y), and δ_k = 0 for all k ≥ 0. On the one hand, the methods do not change; on the other hand, the auxiliary optimization problems become more complicated.
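For example, for h(x) = λ‖x‖₁ in the Euclidean setup the auxiliary problem is a prox step with a closed-form solution (soft thresholding). A sketch (ours), with a plain ISTA-style loop:

import numpy as np

def composite_step(x, grad_f, L, lam):
    # argmin_z  <grad_f(x), z - x> + lam*||z||_1 + (L/2)||z - x||_2^2
    z = x - grad_f(x) / L
    return np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)

A = np.array([[1.0, 0.2], [0.2, 2.0]])
b = np.array([1.0, -1.0])
grad_f = lambda x: A @ x - b               # f(x) = (1/2)x^T A x - b^T x
x = np.zeros(2)
for _ in range(200):                       # plain ISTA-style iteration
    x = composite_step(x, grad_f, L=2.2, lam=0.1)
print(x)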
Superposition of functions (Nemirovski–Nesterov, 1985)
    F(x) := f(f_1(x), …, f_m(x)) → min_{x∈Q},

where each f_k is a smooth convex function whose gradient is Lipschitz-continuous with parameter L_k, k = 1, …, m. The function f is convex, monotone, and Lipschitz-continuous with parameter M with respect to the 1-norm. Then

    0 ≤ F(x) − F(y) − f(f_1(y) + ⟨∇f_1(y), x − y⟩, …, f_m(y) + ⟨∇f_m(y), x − y⟩) + F(y) ≤ (M Σ_{i=1}^m L_i/2)‖x − y‖²,  ∀x, y ∈ Q.

We can take ψ_{δ_k}(x, y) = f(f_1(y) + ⟨∇f_1(y), x − y⟩, …, f_m(y) + ⟨∇f_m(y), x − y⟩) − F(y), F_{δ_k}(y) = F(y), and δ_k = 0 for all k ≥ 0.
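A concrete instance (ours): f = max, which is monotone and 1-Lipschitz w.r.t. the 1-norm, applied to m = 2 smooth quadratics; the sketch checks the two-sided model inequality at one pair of points.

import numpy as np

A1, b1 = np.diag([1.0, 2.0]), np.array([1.0, 0.0])
A2, b2 = np.diag([3.0, 1.0]), np.array([0.0, 1.0])
f1 = lambda x: 0.5 * x @ A1 @ x - b1 @ x
f2 = lambda x: 0.5 * x @ A2 @ x - b2 @ x
g1 = lambda x: A1 @ x - b1
g2 = lambda x: A2 @ x - b2

F = lambda x: max(f1(x), f2(x))            # f = max applied to f1, f2

def psi(x, y):
    # f(f1(y) + <g1(y), x - y>, f2(y) + <g2(y), x - y>) - F(y)
    return max(f1(y) + g1(y) @ (x - y), f2(y) + g2(y) @ (x - y)) - F(y)

x, y = np.array([0.3, -0.2]), np.array([1.0, 1.0])
L = 1.0 * (2.0 + 3.0)                      # M * (L_1 + L_2) with M = 1
print(0.0 <= F(x) - F(y) - psi(x, y) <= 0.5 * L * np.sum((x - y) ** 2))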
Proximal method with an inexact solution of the intermediate optimization problem; Devolder–Glineur–Nesterov, 2013
Let us consider

    f(x) := min_{y∈Q} [ φ(y) + (L/2)‖y − x‖₂² ] → min_{x∈R^n},   (1)

and denote Λ(x, y) := φ(y) + (L/2)‖y − x‖₂². The function φ is convex, and the point y(x) satisfies

    max_{y∈Q} [ Λ(x, y(x)) − Λ(x, y) + (L/2)‖y − y(x)‖₂² ] ≤ δ.

Then the pair

    ( φ(y(x)) + (L/2)‖y(x) − x‖₂² − δ,   ψ_δ(z, x) = ⟨L(x − y(x)), z − x⟩ )

is a (δ, L)-model of f(z) at the point x w.r.t. the 2-norm. From this result one can obtain the Catalyst technique (Lin–Mairal–Harchaoui, 2015).
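A sketch (ours) of this construction for a toy quadratic φ: the inner problem Λ(x, ·) is minimized approximately by gradient steps, and the resulting pair approximates the (δ, L)-model of the Moreau envelope; all names are ours.

import numpy as np

L = 1.0
phi = lambda y: 2.0 * np.sum((y - 1.0) ** 2)          # 4-smooth convex quadratic
grad_phi = lambda y: 4.0 * (y - 1.0)

def inexact_prox(x, n_inner=50):
    # Gradient steps on Lambda(x, .) = phi(.) + (L/2)||. - x||_2^2,
    # which is (4 + L)-smooth, hence step size 1/(4 + L).
    y = x.copy()
    for _ in range(n_inner):
        y -= (grad_phi(y) + L * (y - x)) / (4.0 + L)
    return y

def f_delta(x):
    y = inexact_prox(x)
    return phi(y) + 0.5 * L * np.sum((y - x) ** 2)    # (minus delta in theory)

def psi(z, x):
    y = inexact_prox(x)
    return L * (x - y) @ (z - x)                      # <L(x - y(x)), z - x>

x = np.array([3.0, -1.0])
print(f_delta(x), psi(x + 0.1, x))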
Saddle point problem; Devolder–Glineur–Nesterov, 2013
Let us consider

    f(x) := max_{y∈Q} [ ⟨x, b − Ay⟩ − φ(y) ] → min_{x∈R^n},   (2)

where φ(y) is a μ-strongly convex function w.r.t. the p-norm (1 ≤ p ≤ 2). Then f is a smooth convex function and the gradient of f is Lipschitz-continuous with parameter

    L = (1/μ) max_{‖y‖_p≤1} ‖Ay‖₂².

If y_δ(x) is a δ-solution of the max problem, then the pair

    ( ⟨x, b − Ay_δ(x)⟩ − φ(y_δ(x)),   ψ_δ(z, x) = ⟨b − Ay_δ(x), z − x⟩ )

is a (δ, 2L)-model of f(z) at the point x w.r.t. the 2-norm.
Augmented Lagrangians; Devolder–Glineur–Nesterov, 2013
Let us consider

    φ(y) + (μ/2)‖Ay − b‖₂² → min_{Ay=b, y∈Q}

and its dual problem

    f(x) := max_{y∈Q} [ ⟨x, b − Ay⟩ − φ(y) − (μ/2)‖Ay − b‖₂² ] → min_{x∈R^n},

with Λ(x, y) denoting the expression in brackets. If y_δ(x) is a solution in the sense of the inequality

    max_{y∈Q} ⟨∇_y Λ(x, y_δ(x)), y − y_δ(x)⟩ ≤ δ,

then the pair

    ( ⟨x, b − Ay_δ(x)⟩ − φ(y_δ(x)) − (μ/2)‖Ay_δ(x) − b‖₂²,   ψ_δ(z, x) = ⟨b − Ay_δ(x), z − x⟩ )

is a (δ, μ^{−1})-model of f(z) at the point x w.r.t. the 2-norm.
First new example: Min-min problem
Let us consider

    f(x) := min_{y∈Q} F(y, x) → min_{x∈R^n}.

The set Q is convex and bounded. The function F is smooth and convex in both variables:

    ‖∇F(y′, x′) − ∇F(y, x)‖₂ ≤ L‖(y′, x′) − (y, x)‖₂,  ∀y, y′ ∈ Q, x, x′ ∈ R^n.

If we can find a point ỹ_δ(x) ∈ Q such that

    ⟨∇_y F(ỹ_δ(x), x), y − ỹ_δ(x)⟩ ≥ −δ,  ∀y ∈ Q,

then

    F(ỹ_δ(x), x) − f(x) ≤ δ,   ‖∇f(x′) − ∇f(x)‖₂ ≤ L‖x′ − x‖₂,

and the pair

    ( F(ỹ_δ(x), x) − 2δ,   ψ_δ(z, x) = ⟨∇_x F(ỹ_δ(x), x), z − x⟩ )

is a (6δ, 2L)-model of f(x) at the point x w.r.t. the 2-norm.
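A sketch (ours) for a toy F(y, x): the inner minimization over the box Q = [0, 1]² is approximated by projected gradient steps, and the model's linear part uses the x-gradient at the approximate inner solution.

import numpy as np

# F(y, x) = (1/2)||y - x||_2^2 + (1/2)||x||_2^2, jointly smooth and convex.
grad_y = lambda y, x: y - x
grad_x = lambda y, x: (x - y) + x

def y_tilde(x, n_inner=100):
    y = np.zeros_like(x)
    for _ in range(n_inner):
        y = np.clip(y - 0.5 * grad_y(y, x), 0.0, 1.0)  # projected gradient step
    return y

def psi(z, x):
    return grad_x(y_tilde(x), x) @ (z - x)             # <grad_x F(y~, x), z - x>

x = np.array([2.0, -0.5])
print(y_tilde(x), psi(x + 0.1, x))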
Second new example: Nesterov’s clustering model, 2018
Let us consider

    f_μ(x = (z, p)) = g(z, p) + μ Σ_{k=1}^n z_k ln z_k + (μ/2)‖p‖₂² → min_{z∈S_n(1), p≥0}.

Let us introduce the norm

    ‖x‖² = ‖(z, p)‖² = ‖z‖₁² + ‖p‖₂².

Assume that

    ‖∇g(x₂) − ∇g(x₁)‖_* ≤ L‖x₂ − x₁‖,  L ≤ μ.

Then the pair

    ( f_μ(y),   ⟨∇g(y), x − y⟩ + (μ − L) Σ_{k=1}^n z_k ln z_k + ((μ − L)/2)‖p‖₂² )

is a (0, 2L)-model of f_μ(x = (z, p)) at the point y.
Third new example: Proximal Sinkhorn algorithm
Let us introduce the smoothed Wasserstein distance (Peyré–Cuturi, 2018)

    W_γ(p, q) = min_{π∈Π(p,q)} [ Σ_{i,j=1}^n C_{ij} π_{ij} + γ Σ_{i,j} π_{ij} log(π_{ij}/π⁰_{ij}) ],

where

    Π(p, q) = { π ∈ R^{n×n}_+ : Σ_{j=1}^n π_{ij} = p_i, Σ_{i=1}^n π_{ij} = q_j }.

It is well known (Franklin–Lorentz, 1989; Dvurechensky et al., 2018) that the complexity (in arithmetic operations, a.o.) of the Sinkhorn algorithm for this problem is

    O( n² min{ ln n/(γε), exp(C̃/γ) ln(1/ε) } ).
Third new example: Proximal Sinkhorn algorithm
If we want to calculate W_0(p, q) with precision ε, one can calculate W_γ(p, q) with γ = ε/(4 ln n) to precision ε/2, so the complexity in this case is Õ(n²/ε²) a.o. Can we do better? The answer is yes!
Let us use the proximal model of f(x) at a point y of the form

    ψ_δ(x, y) = f(x) − f(y)

with an arbitrary L > 0. Then the proximal gradient descent (Chen–Teboulle, 1993)

    x_{k+1} = argmin_{x∈Q} { ψ_δ(x, x_k) + L V(x, x_k) }

converges as O(LR²/k), where R² = V(x∗, x_0).
Third new example: Proximal Sinkhorn algorithm
So, let us apply this proximal setup to the calculation of W_0(p, q). Put x = π, Q = Π(p, q), L = γ, and choose V = KL. We obtain the proximal gradient method

    π_{k+1} = argmin_{π∈Π(p,q)} [ Σ_{i,j=1}^n C_{ij} π_{ij} + γ Σ_{i,j} π_{ij} log(π_{ij}/π^k_{ij}) ],

which requires O(γ ln n/ε) iterations to reach precision ε. But this assumes that we solve the auxiliary problem exactly, i.e. that we can calculate W_γ(p, q) exactly. From the above we know that if γ is not small (note: we can choose γ as we want!), we can solve this problem to very high precision with complexity Õ(n²) a.o. So the total number of a.o. will be Õ(n²/ε), which is better than Õ(n²/ε²) for the plain Sinkhorn algorithm.
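A compact sketch (ours) of this proximal Sinkhorn scheme: each outer KL-prox step is itself an entropic OT problem with kernel π_k ⊙ exp(−C/γ), solved by inner Sinkhorn iterations; the toy data and all names are ours.

import numpy as np

def sinkhorn(K, p, q, n_iter=1000):
    # Balance the kernel K to the marginals p, q; returns diag(u) K diag(v).
    u, v = np.ones_like(p), np.ones_like(q)
    for _ in range(n_iter):
        u = p / (K @ v)
        v = q / (K.T @ u)
    return u[:, None] * K * v[None, :]

def proximal_sinkhorn(C, p, q, gamma, n_outer=30):
    pi = np.outer(p, q)                    # feasible starting plan
    for _ in range(n_outer):
        # KL-prox step: argmin <C, pi> + gamma * KL(pi | pi_k) over Pi(p, q)
        # has the entropic-OT form with kernel pi_k * exp(-C / gamma).
        pi = sinkhorn(pi * np.exp(-C / gamma), p, q)
    return pi

n = 4
rng = np.random.default_rng(0)
C = rng.random((n, n))
p = np.ones(n) / n
q = np.ones(n) / n
pi = proximal_sinkhorn(C, p, q, gamma=0.2)
print(np.sum(C * pi))                      # approaches W_0(p, q)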
Auxiliary optimization problem
Let us take φ_{k+1}(x) = V(x, u_k) + α_{k+1} ψ_{δ_k}(x, y_{k+1}). From

    ⟨∇φ_{k+1}(x), x − x∗⟩ ≤ δ̃

we can get

    φ_{k+1}(x) − φ_{k+1}(x∗) ≤ δ̃.

Suppose the objective of the intermediate optimization problem is Lipschitz-continuous with parameter M and x is an ε-solution in function value. Then, since φ_{k+1} is 1-strongly convex,

    ⟨∇φ_{k+1}(x), x − x∗⟩ ≤ ‖∇φ_{k+1}(x)‖_* ‖x − x∗‖ ≤ M√(2ε).

If the auxiliary optimization problem is solved by a linearly convergent method, then we only need c times more steps, where c is a small constant.
Thank you!