(1)

Convex Optimization with Inexact Model of the Objective Function

Alexander Tyurin¹, Pavel Dvurechensky², Alexander Gasnikov³

¹ HSE (Russian Federation)

² WIAS (Germany)

³ MIPT (Russian Federation)

(2)

General optimization problem

General convex optimization problem:

$$F(x) \to \min_{x \in Q}.$$

We denote by $x_*$ the solution of this problem.

1. $F(x)$ is a convex function.

2. $Q \subseteq \mathbb{R}^n$ is a convex, closed set.

Examples:

1. Physics, gas networks:
$$\frac{1}{3}\sum_{i=1}^{N} \alpha_i |x_i|^3 \to \min_{Ax = d}.$$

2. Machine learning, logistic regression:
$$\sum_{i=1}^{N} \log\big(1 + \exp(x_i^T w)\big) \to \min_{w \in \mathbb{R}^n}.$$

2 / 26
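To make the second example concrete (an illustration of mine, not part of the slides), here is a minimal numpy sketch that minimizes the logistic-regression objective by plain gradient descent; the synthetic data X and the iteration budget are arbitrary placeholders.

```python
import numpy as np

# Minimal sketch: minimize sum_i log(1 + exp(x_i^T w)) over w by gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # rows are the data points x_i

def F(w):
    return np.sum(np.logaddexp(0.0, X @ w))   # sum_i log(1 + exp(x_i^T w))

def grad_F(w):
    s = 1.0 / (1.0 + np.exp(-(X @ w)))        # sigmoid(x_i^T w)
    return X.T @ s

w = np.zeros(5)
step = 1.0 / (0.25 * np.linalg.norm(X, 2) ** 2)   # 1/L, since L <= ||X||_2^2 / 4
for _ in range(200):
    w = w - step * grad_F(w)
print(F(w))
```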

(3)

Bregman divergence

Prox-function $d(x)$: continuously differentiable on $\mathrm{int}\, Q$ and 1-strongly convex, i.e.
$$d(x) \ge d(y) + \langle \nabla d(y), x - y\rangle + \frac{1}{2}\|x - y\|^2, \qquad x, y \in \mathrm{int}\, Q.$$

Bregman divergence: $V(x,y) \stackrel{\mathrm{def}}{=} d(x) - d(y) - \langle \nabla d(y), x - y\rangle$.

Examples:

- $V(x,y) = \frac{1}{2}\|x - y\|_2^2$;

- $V(x,y) = \sum_{i=1}^{n} x_i \log\frac{x_i}{y_i}$ (KL divergence).
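In code, the two Bregman divergences above look as follows (a small illustration of mine; x and y are arbitrary points, lying on the simplex in the KL case).

```python
import numpy as np

# V(x, y) = d(x) - d(y) - <grad d(y), x - y> for two standard prox-functions.

def bregman_euclidean(x, y):
    # d(x) = 0.5 * ||x||_2^2  ->  V(x, y) = 0.5 * ||x - y||_2^2
    return 0.5 * np.sum((x - y) ** 2)

def bregman_kl(x, y):
    # d(x) = sum_i x_i log x_i (negative entropy, 1-strongly convex w.r.t. ||.||_1
    # on the simplex)  ->  V(x, y) = sum_i x_i log(x_i / y_i)
    return np.sum(x * np.log(x / y))

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
print(bregman_euclidean(x, y), bregman_kl(x, y))
```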

(4)

Definition of (δ, L)-model

Definition

A pair $(F_\delta(y), \psi_\delta(x,y))$ is a $(\delta, L)$-model of a function $F$ at a point $y$ with respect to a norm $\|\cdot\|$ if, for all $x \in Q$,
$$0 \le F(x) - F_\delta(y) - \psi_\delta(x,y) \le \frac{L}{2}\|x - y\|^2 + \delta, \qquad \psi_\delta(x,x) = 0 \quad \forall x \in Q,$$
and $\psi_\delta(x,y)$ is a convex function of $x$ for every $y \in Q$.

Definition (Devolder–Glineur–Nesterov, 2013)

A function $F$ admits a $(\delta, L)$-oracle at a point $y$ if there exists a pair $(F_\delta(y), \nabla F_\delta(y))$ such that
$$0 \le F(x) - F_\delta(y) - \langle \nabla F_\delta(y), x - y\rangle \le \frac{L}{2}\|x - y\|^2 + \delta \qquad \forall x \in Q.$$

4 / 26
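As a quick numerical sanity check (mine, not from the slides): for a quadratic $F$ and the linear model $\psi_\delta(x,y) = \langle \nabla F(y), x - y\rangle$, $F_\delta(y) = F(y)$, the first definition should hold with $\delta = 0$ and $L$ the largest eigenvalue of the Hessian.

```python
import numpy as np

# Check the (0, L)-model inequality for F(x) = 0.5 x^T A x with the linear model.
rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
A = B @ B.T                              # positive semidefinite Hessian
L = np.linalg.eigvalsh(A).max()

F = lambda x: 0.5 * x @ A @ x
grad_F = lambda x: A @ x

for _ in range(1000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    gap = F(x) - F(y) - grad_F(y) @ (x - y)
    assert -1e-10 <= gap <= 0.5 * L * np.sum((x - y) ** 2) + 1e-10
print("(0, L)-model inequality verified on random points")
```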

(5)

Definition of (δ, L)-inexact oracle; Devolder, PhD Thesis’13

(6)

Definition of inexact solution

Definition

Consider an optimization problem
$$\psi(x) \to \min_{x \in Q},$$
where $\psi$ is a convex function. Then $\mathrm{Arg\,min}^{\tilde\delta}_{x \in Q}\psi(x)$ is the set of points $\tilde{x}$ such that
$$\exists\, h \in \partial\psi(\tilde{x}): \quad \langle h, x - \tilde{x}\rangle \ge -\tilde\delta \qquad \forall x \in Q.$$
An arbitrary point of $\mathrm{Arg\,min}^{\tilde\delta}_{x \in Q}\psi(x)$ is denoted by $\arg\min^{\tilde\delta}_{x \in Q}\psi(x)$.

From the definition it follows that
$$\psi(\tilde{x}) - \min_{x \in Q}\psi(x) \le \tilde\delta.$$
The converse is not always true.

6 / 26

(7)

Gradient descent

Input: $x_0$, $N$ – number of steps, sequences $\{\delta_k\}_{k=0}^{N-1}$, $\{\tilde\delta_k\}_{k=0}^{N-1}$, and $L_0 > 0$.

Step 0: $L_1 := \frac{L_0}{2}$.

Step $k+1$: $\alpha_{k+1} := \frac{1}{L_{k+1}}$,
$$\varphi_{k+1}(x) := V(x, x_k) + \alpha_{k+1}\psi_{\delta_k}(x, x_k), \qquad x_{k+1} := \arg\min^{\tilde\delta_k}_{x \in Q}\varphi_{k+1}(x).$$
If
$$F_{\delta_k}(x_{k+1}) > F_{\delta_k}(x_k) + \psi_{\delta_k}(x_{k+1}, x_k) + \frac{L_{k+1}}{2}\|x_{k+1} - x_k\|^2 + \delta_k,$$
then set $L_{k+1} := 2L_{k+1}$ and repeat the current step. Otherwise set $L_{k+2} := \frac{L_{k+1}}{2}$ and go to the next step.
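A minimal numpy sketch of this scheme (my own illustration) in the Euclidean set-up with $Q = \mathbb{R}^n$, the exact linear model $\psi(x,y) = \langle \nabla F(y), x - y\rangle$ and $\delta_k = \tilde\delta_k = 0$, so the auxiliary step is an explicit gradient step.

```python
import numpy as np

def adaptive_gd(F, grad_F, x0, N, L0=1.0):
    """Gradient method with backtracking on L for the (0, L)-model
    psi(x, y) = <grad F(y), x - y>, Euclidean prox, Q = R^n."""
    x = x0.copy()
    L = L0 / 2.0
    for _ in range(N):
        while True:
            alpha = 1.0 / L
            x_new = x - alpha * grad_F(x)   # argmin of V(x', x) + alpha * psi(x', x)
            model = F(x) + grad_F(x) @ (x_new - x) + 0.5 * L * np.sum((x_new - x) ** 2)
            if F(x_new) > model:
                L *= 2.0                    # model too optimistic: double L, redo step
            else:
                break
        x = x_new
        L /= 2.0                            # try a smaller L on the next step
    return x

# toy quadratic example
A = np.diag([1.0, 10.0])
F = lambda z: 0.5 * z @ A @ z
grad_F = lambda z: A @ z
print(adaptive_gd(F, grad_F, np.array([5.0, 5.0]), N=50))
```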

(8)

Fast gradient descent

Input: $x_0$, $N$ – number of steps, sequences $\{\delta_k\}_{k=0}^{N-1}$, $\{\tilde\delta_k\}_{k=0}^{N-1}$, and $L_0 > 0$.

Step 0: $y_0 = u_0 = x_0$; $L_1 := \frac{L_0}{2}$; $\alpha_0 = 0$; $A_0 = 0$.

Step $k+1$: find $\alpha_{k+1}$ from $A_k + \alpha_{k+1} = L_{k+1}\alpha_{k+1}^2$ and set $A_{k+1} = A_k + \alpha_{k+1}$,
$$y_{k+1} = \frac{\alpha_{k+1}u_k + A_k x_k}{A_{k+1}},$$
$$\varphi_{k+1}(x) = V(x, u_k) + \alpha_{k+1}\psi_{\delta_k}(x, y_{k+1}), \qquad u_{k+1} = \arg\min^{\tilde\delta_k}_{x \in Q}\varphi_{k+1}(x), \qquad x_{k+1} = \frac{\alpha_{k+1}u_{k+1} + A_k x_k}{A_{k+1}}.$$
If
$$F_{\delta_k}(x_{k+1}) > F_{\delta_k}(y_{k+1}) + \psi_{\delta_k}(x_{k+1}, y_{k+1}) + \frac{L_{k+1}}{2}\|x_{k+1} - y_{k+1}\|^2 + \delta_k,$$
then set $L_{k+1} := 2L_{k+1}$ and repeat the current step. Otherwise set $L_{k+2} := \frac{L_{k+1}}{2}$ and go to the next step.

8 / 26
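The same kind of sketch (mine) for the fast method: Euclidean set-up, $Q = \mathbb{R}^n$, exact linear model, $\delta_k = \tilde\delta_k = 0$; $\alpha_{k+1}$ is taken as the larger root of $L_{k+1}\alpha^2 = A_k + \alpha$.

```python
import numpy as np

def adaptive_fgm(F, grad_F, x0, N, L0=1.0):
    """Fast gradient method with the (0, L)-model psi(x, y) = <grad F(y), x - y>,
    Euclidean prox, Q = R^n, adaptive L."""
    x = u = x0.copy()
    L, A = L0 / 2.0, 0.0
    for _ in range(N):
        while True:
            alpha = (1.0 + np.sqrt(1.0 + 4.0 * L * A)) / (2.0 * L)  # root of L a^2 = A + a
            A_new = A + alpha
            y = (alpha * u + A * x) / A_new
            u_new = u - alpha * grad_F(y)   # argmin of V(x', u) + alpha * psi(x', y)
            x_new = (alpha * u_new + A * x) / A_new
            model = F(y) + grad_F(y) @ (x_new - y) + 0.5 * L * np.sum((x_new - y) ** 2)
            if F(x_new) > model:
                L *= 2.0
            else:
                break
        x, u, A = x_new, u_new, A_new
        L /= 2.0
    return x

A_mat = np.diag([1.0, 10.0])
F = lambda z: 0.5 * z @ A_mat @ z
grad_F = lambda z: A_mat @ z
print(adaptive_fgm(F, grad_F, np.array([5.0, 5.0]), N=50))
```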

(9)

Theorem (Gradient descent)

Let $V(x_*, x_0) \le R^2$, where $x_0$ is the starting point, $x_*$ is the solution closest to $x_0$ in the sense of the Bregman divergence $V$, and
$$\bar{x}_N = \frac{1}{N}\sum_{k=0}^{N-1} x_k.$$
For the proposed algorithm we have
$$F(\bar{x}_N) - F(x_*) \le \frac{2LR^2}{N} + \frac{2L}{N}\sum_{k=0}^{N-1}\tilde\delta_k + \frac{2}{N}\sum_{k=0}^{N-1}\delta_k.$$

Theorem (Fast gradient descent)

Let $V(x_*, x_0) \le R^2$, where $x_0$ is the starting point and $x_*$ is the solution closest to $x_0$ in the sense of the Bregman divergence $V$. For the proposed algorithm we have
$$F(x_N) - F(x_*) \le \frac{8LR^2}{(N+1)^2} + \frac{8L\sum_{k=0}^{N-1}\tilde\delta_k}{(N+1)^2} + \frac{2\sum_{k=0}^{N-1}\delta_k A_{k+1}}{A_N}.$$

(10)

Gradient descent vs Fast gradient descent

Let $\tilde\delta_k = \tilde\delta$ and $\delta_k = \delta$ for all $k$. Then
$$F(\bar{x}_N) - F(x_*) \le \frac{2LR^2}{N} + 2L\tilde\delta + 2\delta, \qquad F(x_N) - F(x_*) \le \frac{8LR^2}{(N+1)^2} + \frac{8L\tilde\delta}{N+1} + 2N\delta.$$

The fast gradient method is more robust to $\tilde\delta$, the error made when solving the auxiliary (intermediate) optimization problem, but it accumulates $\delta$, the error of the $(\delta, L)$-model itself.

There are intermediate gradient methods with convergence rate (in the case $\tilde\delta = 0$ – Devolder–Glineur–Nesterov, 2013)
$$O(1)\frac{LR^2}{N^p} + O(1)L N^{1-p}\tilde\delta + O(1)N^{p-1}\delta,$$
where $p \in [1,2]$ (D. Kamzolov, 2017).

10 / 26

(11)

Linear model (Nesterov, 1983)

Assume that $F$ is a smooth convex function whose gradient is Lipschitz-continuous with parameter $L$. Then
$$0 \le F(x) - F(y) - \langle \nabla F(y), x - y\rangle \le \frac{L}{2}\|x - y\|^2 \qquad \forall x, y \in Q.$$
We can take $\psi_{\delta_k}(x,y) = \langle \nabla F(y), x - y\rangle$, $F_{\delta_k}(y) = F(y)$, and $\delta_k = 0$ for all $k \ge 0$. If, in addition, the auxiliary problems are solved exactly, so that $\tilde\delta_k = 0$ for all $k \ge 0$, then
$$F(x_N) - F(x_*) \le \frac{8LR^2}{(N+1)^2}.$$
The bound is optimal.
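In code, this choice of model is just the linearization of $F$ packaged as the pair the framework expects (an illustrative helper of mine; F and grad_F stand for any smooth convex function and its gradient).

```python
import numpy as np

def linear_model(F, grad_F, y):
    """(0, L)-model of a smooth F at y: F_delta(y) = F(y),
    psi_delta(x, y) = <grad F(y), x - y>."""
    F_delta = F(y)
    g = grad_F(y)
    psi = lambda x: g @ (x - y)
    return F_delta, psi

A = np.diag([1.0, 3.0])
F = lambda x: 0.5 * x @ A @ x
grad_F = lambda x: A @ x
F_y, psi = linear_model(F, grad_F, np.array([1.0, 1.0]))
print(F_y, psi(np.array([0.0, 0.0])))
```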

(12)

Universal method (Nesterov, 2013)

Let
$$\|\nabla F(x) - \nabla F(y)\| \le L_\nu\|x - y\|^\nu \qquad \forall x, y \in Q.$$
Then, for any $\delta > 0$,
$$0 \le F(x) - F(y) - \langle \nabla F(y), x - y\rangle \le \frac{L(\delta)}{2}\|x - y\|^2 + \delta \qquad \forall x, y \in Q,$$
where
$$L(\delta) = L_\nu\left( \frac{L_\nu}{2\delta}\cdot\frac{1-\nu}{1+\nu} \right)^{\frac{1-\nu}{1+\nu}}.$$
Consequently, we can take $\psi_{\delta_k}(x,y) = \langle \nabla F(y), x - y\rangle$ and $F_{\delta_k}(y) = F(y)$. Assume that $\tilde\delta_k = 0$ for all $k \ge 0$ and take
$$\delta_k = \frac{\varepsilon\,\alpha_{k+1}}{4A_{k+1}} \quad \forall k,$$
where $\varepsilon$ is the desired accuracy of the solution in terms of the objective value.

12 / 26
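A small helper of my own that evaluates the $L(\delta)$ formula above; at $\nu = 1$ it returns $L_1$ independently of $\delta$, while at $\nu = 0$ it grows like $1/\delta$.

```python
def L_of_delta(L_nu, nu, delta):
    """Smoothing constant L(delta) for a gradient that is Hoelder-continuous
    with exponent nu in [0, 1] and constant L_nu (formula above)."""
    if nu == 1.0:
        return L_nu                      # smooth case: no dependence on delta
    power = (1.0 - nu) / (1.0 + nu)
    return L_nu * (L_nu / (2.0 * delta) * (1.0 - nu) / (1.0 + nu)) ** power

print(L_of_delta(L_nu=1.0, nu=0.5, delta=1e-3))
```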

(13)

Universal method (Nesterov, 2013)

In order to have
$$F(x_N) - F(x_*) \le \varepsilon,$$
it is enough to perform
$$N \le \inf_{\nu \in [0,1]} \left[ 64^{\frac{1+\nu}{1+3\nu}} \left( \frac{2-2\nu}{1+\nu} \right)^{\frac{1-\nu}{1+3\nu}} \left( \frac{L_\nu R^{1+\nu}}{\varepsilon} \right)^{\frac{2}{1+3\nu}} \right]$$
steps.

The bound is optimal.

(14)

Universal conditional gradient (Frank–Wolfe) method

In the fast gradient method the auxiliary function is
$$\varphi_{k+1}(x) = V(x, u_k) + \alpha_{k+1}\psi_{\delta_k}(x, y_{k+1}).$$
Instead, let us take (an idea that goes back to A. Nemirovski, 2013)
$$\tilde\varphi_{k+1}(x) = \alpha_{k+1}\psi_{\delta_k}(x, y_{k+1}).$$
Let us view this substitution as an error $\tilde\delta_k$ in the auxiliary problem. One can show that it is enough to take $\tilde\delta_k = 2R_Q^2$ for all $k \ge 0$, where $R_Q$ is the diameter of the set $Q$. Let us take
$$\delta_k = \frac{\varepsilon\,\alpha_{k+1}}{4A_{k+1}} \quad \forall k,$$
where $\varepsilon$ is the desired accuracy of the solution in terms of the objective value.

Finally,
$$N \le \inf_{\nu \in (0,1]} \left( 96^{\frac{1+\nu}{2}} \left( \frac{2-2\nu}{1+\nu} \right)^{\frac{1-\nu}{2}} \frac{L_\nu R_Q^{1+\nu}}{\varepsilon} \right)^{\frac{1}{\nu}}.$$

14 / 26
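For intuition only (my own sketch, not the slides' full universal conditional gradient method): with $\tilde\varphi_{k+1}$ the auxiliary step becomes a linear minimization over $Q$; over the probability simplex it is solved exactly by putting all mass on the smallest coordinate of the linear term.

```python
import numpy as np

def fw_auxiliary_step(g):
    """Conditional-gradient auxiliary step: minimize the linear function <g, x>
    over the probability simplex Q = {x >= 0, sum x = 1}. The minimizer is the
    unit vector at the smallest coordinate of g."""
    v = np.zeros_like(g)
    v[np.argmin(g)] = 1.0
    return v

print(fw_auxiliary_step(np.array([0.3, -1.2, 0.7])))  # -> [0., 1., 0.]
```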

(15)

Composite optimization (Nesterov, 2008)

$$F(x) \stackrel{\mathrm{def}}{=} f(x) + h(x) \to \min_{x \in Q},$$
where $f$ is a smooth convex function with $L$-Lipschitz-continuous gradient and $h$ is a convex function. One can show that
$$0 \le F(x) - F(y) - \langle \nabla f(y), x - y\rangle - h(x) + h(y) \le \frac{L}{2}\|x - y\|^2 \qquad \forall x, y \in Q.$$
We can take $\psi_{\delta_k}(x,y) = \langle \nabla f(y), x - y\rangle + h(x) - h(y)$, $F_{\delta_k}(y) = F(y)$, and $\delta_k = 0$ for all $k \ge 0$. On the one hand, the methods themselves do not change; on the other hand, the auxiliary optimization problems become more complicated.
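For example (my own sketch): with $h(x) = \lambda\|x\|_1$, $Q = \mathbb{R}^n$ and the Euclidean prox, the auxiliary problem of the gradient method has a closed-form soft-thresholding solution, so $\tilde\delta_k = 0$ is achievable.

```python
import numpy as np

def composite_aux_step(x_k, grad_f_at_xk, L, lam):
    """Exact solution of the auxiliary problem
       min_x  0.5 * ||x - x_k||^2 + (1/L) * ( <grad f(x_k), x - x_k> + lam * ||x||_1 ),
    i.e. a proximal-gradient (ISTA) step via soft-thresholding."""
    z = x_k - grad_f_at_xk / L
    return np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)

print(composite_aux_step(np.array([1.0, -0.2]), np.array([0.5, 0.5]), L=2.0, lam=0.3))
```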

(16)

Superposition of functions (Nemirovski–Nesterov, 1985)

$$F(x) \stackrel{\mathrm{def}}{=} f\big(f_1(x), \ldots, f_m(x)\big) \to \min_{x \in Q},$$
where each $f_k$ is a smooth convex function whose gradient is Lipschitz-continuous with parameter $L_k$, $k = 1, \ldots, m$, and $f$ is a convex, monotone (non-decreasing in each argument) function that is Lipschitz-continuous with parameter $M$ with respect to the 1-norm. Therefore,
$$0 \le F(x) - F(y) - f\big(f_1(y) + \langle \nabla f_1(y), x - y\rangle, \ldots, f_m(y) + \langle \nabla f_m(y), x - y\rangle\big) + F(y) \le \frac{M\sum_{i=1}^{m}L_i}{2}\|x - y\|^2 \quad \forall x, y \in Q.$$
We can take $\psi_{\delta_k}(x,y) = f\big(f_1(y) + \langle \nabla f_1(y), x - y\rangle, \ldots, f_m(y) + \langle \nabla f_m(y), x - y\rangle\big) - F(y)$, $F_{\delta_k}(y) = F(y)$, and $\delta_k = 0$ for all $k \ge 0$.

16 / 26
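A common special case (my own illustration, not from the slides) is $f = \max$, which is convex, monotone and 1-Lipschitz w.r.t. the 1-norm; the model of $F(x) = \max_i f_i(x)$ is then the maximum of the linearizations, shifted so that $\psi(y, y) = 0$.

```python
import numpy as np

def max_model(fs, grads, y):
    """Model psi(x, y) for F(x) = max_i f_i(x), built from linearizations of f_i.
    fs and grads are lists of callables f_i and their gradients (illustrative names)."""
    F_y = max(f(y) for f in fs)
    def psi(x):
        linearizations = [f(y) + g(y) @ (x - y) for f, g in zip(fs, grads)]
        return max(linearizations) - F_y
    return F_y, psi

fs = [lambda x: x @ x, lambda x: (x - 1.0) @ (x - 1.0)]
grads = [lambda x: 2 * x, lambda x: 2 * (x - 1.0)]
F_y, psi = max_model(fs, grads, np.array([0.5, 0.5]))
print(F_y, psi(np.array([0.0, 0.0])))
```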

(17)

Proximal method with inexact solution in an intermediate optimization problem; Devolder–Glineur–Nesterov, 2013

Let us consider
$$f(x) \stackrel{\mathrm{def}}{=} \min_{y \in Q}\Big\{ \underbrace{\varphi(y) + \frac{L}{2}\|y - x\|_2^2}_{\Lambda(x,y)} \Big\} \to \min_{x \in \mathbb{R}^n}. \qquad (1)$$
The function $\varphi$ is convex, and $y(x)$ is an approximate minimizer in the sense that
$$\max_{y \in Q}\Big\{ \Lambda(x, y(x)) - \Lambda(x, y) + \frac{L}{2}\|y - y(x)\|_2^2 \Big\} \le \delta.$$
Then the pair
$$\Big( \varphi(y(x)) + \frac{L}{2}\|y(x) - x\|_2^2 - \delta, \quad \psi_\delta(z, x) = \langle L(x - y(x)), z - x\rangle \Big)$$
is a $(\delta, L)$-model of $f(z)$ at the point $x$ w.r.t. the 2-norm. From this result one can obtain the Catalyst technique (Lin–Mairal–Harchaoui, 2015).

(18)

Saddle point problem; Devolder–Glineur–Nesterov, 2013

Let us consider
$$f(x) \stackrel{\mathrm{def}}{=} \max_{y \in Q}\big[ \langle x, b - Ay\rangle - \varphi(y) \big] \to \min_{x \in \mathbb{R}^n}, \qquad (2)$$
where $\varphi(y)$ is a $\mu$-strongly convex function w.r.t. the $p$-norm ($1 \le p \le 2$). Then $f$ is a smooth convex function whose gradient is Lipschitz-continuous with parameter
$$L = \frac{1}{\mu}\max_{\|y\|_p \le 1}\|Ay\|_2^2.$$
If $y_\delta(x)$ is a $\delta$-solution of the inner maximization problem, then the pair
$$\big( \langle x, b - Ay_\delta(x)\rangle - \varphi(y_\delta(x)), \quad \psi_\delta(z,x) = \langle b - Ay_\delta(x), z - x\rangle \big)$$
is a $(\delta, 2L)$-model of $f(z)$ at the point $x$ w.r.t. the 2-norm.

18 / 26

(19)

Augmented Lagrangians; Devolder–Glineur–Nesterov, 2013

Let us consider
$$\varphi(y) + \frac{\mu}{2}\|Ay - b\|_2^2 \to \min_{Ay = b,\ y \in Q}$$
and its dual problem
$$f(x) \stackrel{\mathrm{def}}{=} \max_{y \in Q}\Big\{ \underbrace{\langle x, b - Ay\rangle - \varphi(y) - \frac{\mu}{2}\|Ay - b\|_2^2}_{\Lambda(x,y)} \Big\} \to \min_{x \in \mathbb{R}^n}.$$
If $y_\delta(x)$ is a solution in the sense of the inequality
$$\max_{y \in Q}\langle \nabla_y\Lambda(x, y_\delta(x)), y - y_\delta(x)\rangle \le \delta,$$
then the pair
$$\Big( \langle x, b - Ay_\delta(x)\rangle - \varphi(y_\delta(x)) - \frac{\mu}{2}\|Ay_\delta(x) - b\|_2^2, \quad \psi_\delta(z,x) = \langle b - Ay_\delta(x), z - x\rangle \Big)$$
is a $(\delta, \mu^{-1})$-model of $f(z)$ at the point $x$ w.r.t. the 2-norm.

(20)

First new example: Min-min problem

Let us consider
$$f(x) \stackrel{\mathrm{def}}{=} \min_{y \in Q}F(y, x) \to \min_{x \in \mathbb{R}^n}.$$
The set $Q$ is convex and bounded. The function $F$ is smooth and convex in both variables, with
$$\|\nabla F(y', x') - \nabla F(y, x)\|_2 \le L\|(y', x') - (y, x)\|_2 \qquad \forall y, y' \in Q,\ x, x' \in \mathbb{R}^n.$$
If we can find a point $\tilde{y}_\delta(x) \in Q$ such that
$$\langle \nabla_y F(\tilde{y}_\delta(x), x), y - \tilde{y}_\delta(x)\rangle \ge -\delta \qquad \forall y \in Q,$$
then
$$F(\tilde{y}_\delta(x), x) - f(x) \le \delta, \qquad \|\nabla f(x') - \nabla f(x)\|_2 \le L\|x' - x\|_2,$$
and the pair
$$\big( F(\tilde{y}_\delta(x), x) - 2\delta, \quad \psi_\delta(z, x) = \langle \nabla_x F(\tilde{y}_\delta(x), x), z - x\rangle \big)$$
is a $(6\delta, 2L)$-model of $f(x)$ at the point $x$ w.r.t. the 2-norm.

20 / 26

(21)

Second new example: Nesterov’s clustering model, 2018

Let us consider
$$f_\mu(x = (z, p)) = g(z, p) + \mu\sum_{k=1}^{n} z_k\ln z_k + \frac{\mu}{2}\|p\|_2^2 \to \min_{z \in S_n(1),\ p \ge 0}.$$
Let us introduce the norm
$$\|x\|^2 = \|(z, p)\|^2 = \|z\|_1^2 + \|p\|_2^2.$$
Assume that
$$\|\nabla g(x_2) - \nabla g(x_1)\| \le L\|x_2 - x_1\|, \qquad L \le \mu.$$
Then
$$\Big( f_\mu(y), \quad \langle \nabla g(y), x - y\rangle + (\mu - L)\sum_{k=1}^{n} z_k\ln z_k + \frac{\mu - L}{2}\|p\|_2^2 \Big)$$
is a $(0, 2L)$-model of $f_\mu(x = (z, p))$ at the point $y$.

(22)

Third new example: Proximal Sinkhorn algorithm

Let us introduce the entropy-smoothed Wasserstein distance (Peyré–Cuturi, 2018)
$$W_\gamma(p, q) = \min_{\pi \in \Pi(p,q)}\Big\{ \sum_{i,j=1}^{n} C_{ij}\pi_{ij} + \gamma\sum_{i,j}\pi_{ij}\log\frac{\pi_{ij}}{\pi^0_{ij}} \Big\},$$
where $\pi^0$ is a fixed reference transport plan and
$$\Pi(p,q) = \Big\{ \pi \in \mathbb{R}^{n \times n}_{+}: \ \sum_{j=1}^{n}\pi_{ij} = p_i, \ \sum_{i=1}^{n}\pi_{ij} = q_j \Big\}.$$

It is well known (Franklin–Lorenz, 1989; Dvurechensky et al., 2018) that the complexity (in arithmetic operations, a.o.) of the Sinkhorn algorithm for this problem is
$$O\left( n^2\min\left\{ \frac{\ln n}{\gamma\varepsilon}, \ \exp\left(\frac{\tilde{C}}{\gamma}\right)\ln\varepsilon^{-1} \right\} \right).$$

22 / 26
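A minimal Sinkhorn sketch of my own (dense, no log-domain stabilization, and with the plain negative-entropy regularizer rather than a reference plan); C, p, q and gamma are illustrative inputs.

```python
import numpy as np

def sinkhorn(C, p, q, gamma, n_iter=1000):
    """Plain Sinkhorn iterations for entropy-regularized OT: alternately scale
    the kernel K = exp(-C / gamma) so the plan matches the marginals p and q."""
    K = np.exp(-C / gamma)
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)
        u = p / (K @ v)
    return u[:, None] * K * v[None, :]        # transport plan pi

n = 4
rng = np.random.default_rng(2)
C = rng.random((n, n))
p = q = np.full(n, 1.0 / n)
pi = sinkhorn(C, p, q, gamma=0.1)
print(pi.sum(axis=1), pi.sum(axis=0))         # approximately p and q
```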

(23)

Third new example: Proximal Sinkhorn algorithm

If we want to compute $W_0(p,q)$ with precision $\varepsilon$, we can compute $W_\gamma(p,q)$ with $\gamma = \varepsilon/(4\ln n)$ to precision $\varepsilon/2$. The complexity in this case is $\tilde{O}(n^2/\varepsilon^2)$ a.o. Can we do better?

The answer is yes!

Use the proximal model of $f(x)$ at a point $y$,
$$\psi_\delta(x, y) = f(x) - f(y),$$
and choose an arbitrary $L > 0$. Then the proximal gradient method (Chen–Teboulle, 1993)
$$x_{k+1} = \arg\min_{x \in Q}\big\{ \psi_\delta(x, x_k) + L\,V(x, x_k) \big\}$$
converges as $O(LR^2/k)$, where $R^2 = V(x_*, x_0)$.

(24)

Third new example: Proximal Sinkhorn algorithm

So let us apply this proximal set-up to the computation of $W_0(p,q)$. Put $x = \pi$, $Q = \Pi(p,q)$, $L = \gamma$, and choose $V$ to be the KL divergence. We obtain that the proximal gradient method
$$\pi^{k+1} = \arg\min_{\pi \in \Pi(p,q)}\Big\{ \sum_{i,j=1}^{n} C_{ij}\pi_{ij} + \gamma\sum_{i,j}\pi_{ij}\log\frac{\pi_{ij}}{\pi^k_{ij}} \Big\}$$
requires $O(\gamma\ln n/\varepsilon)$ iterations to reach precision $\varepsilon$. This assumes that the auxiliary problem is solved exactly, i.e. that we can compute $W_\gamma(p,q)$ exactly. But from the above we know that if $\gamma$ is not small (note: we can choose $\gamma$ as we like!), this problem can be solved to very high precision with complexity $\tilde{O}(n^2)$. So the total number of a.o. is $\tilde{O}(n^2/\varepsilon)$, which is better than $\tilde{O}(n^2/\varepsilon^2)$ for the plain Sinkhorn algorithm.

24 / 26
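Combining the proximal model with Sinkhorn (a sketch of my own, not the authors' implementation): each outer KL-proximal step is again a matrix-scaling problem, now with the kernel $\pi^k \odot \exp(-C/\gamma)$, so it can be handled by the same scaling routine.

```python
import numpy as np

def sinkhorn_scaling(K, p, q, n_iter=1000):
    """Scale a positive kernel K so that diag(u) K diag(v) has marginals p, q."""
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)
        u = p / (K @ v)
    return u[:, None] * K * v[None, :]

def proximal_sinkhorn(C, p, q, gamma, n_outer=50, n_inner=1000):
    """Outer Bregman proximal-point iterations in the KL geometry; each inner
    problem is a Sinkhorn-type scaling with kernel pi_k * exp(-C / gamma)."""
    pi = np.outer(p, q)                   # feasible starting plan
    K0 = np.exp(-C / gamma)
    for _ in range(n_outer):
        pi = sinkhorn_scaling(pi * K0, p, q, n_iter=n_inner)
    return pi

rng = np.random.default_rng(3)
C = rng.random((4, 4))
p = q = np.full(4, 0.25)
pi = proximal_sinkhorn(C, p, q, gamma=0.5)
print(np.sum(C * pi))                     # approximation of W_0(p, q)
```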

(25)

Auxiliary optimization problem

Let us take $\varphi_{k+1}(x) = V(x, u_k) + \alpha_{k+1}\psi_{\delta_k}(x, y_{k+1})$. From the condition
$$\langle \nabla\varphi_{k+1}(\tilde{x}), x - \tilde{x}\rangle \ge -\varepsilon \qquad \forall x \in Q$$
we can get
$$\varphi_{k+1}(\tilde{x}) - \varphi_{k+1}(x_*) \le \varepsilon.$$
Now suppose the objective of the auxiliary optimization problem is Lipschitz-continuous with parameter $M$ and that $\tilde{x}$ is an $\varepsilon$-solution in terms of the function value. Then
$$\langle \nabla\varphi_{k+1}(\tilde{x}), \tilde{x} - x_*\rangle \le \|\nabla\varphi_{k+1}(\tilde{x})\|\,\|\tilde{x} - x_*\| \le M\sqrt{2\varepsilon}.$$
If the auxiliary optimization problem is solved by a linearly convergent method, it is therefore enough to do $c$ times more steps, where $c$ is a small constant.

(26)

Thank you!

26 / 26
