Convex Optimization with Inexact Model of the Objective Function
Alexander Tyurin¹, Pavel Dvurechensky², Alexander Gasnikov³
¹HSE (Russian Federation)
²WIAS (Germany)
³MIPT (Russian Federation)
General optimization problem
General convex optimization problem:

    F(x) → min_{x∈Q}.

We denote by x∗ a solution of this problem.
1. F(x) is a convex function.
2. Q ⊆ R^n is a convex, closed set.
Examples:
1. Physics, gas networks:

    (1/3) Σ_{i=1}^N α_i |x_i|³ → min_{Ax=d}.

2. Machine learning, logistic regression:

    Σ_{i=1}^N log(1 + exp(x_i^T w)) → min_{w∈R^n}.
Bregman divergence
Prox-function d(x): continuously differentiable on int Q and 1-strongly convex, i.e.

    d(x) ≥ d(y) + ⟨∇d(y), x − y⟩ + (1/2)‖x − y‖²,  x, y ∈ int Q.

Bregman divergence: V(x, y) := d(x) − d(y) − ⟨∇d(y), x − y⟩.
Examples:

    V(x, y) = (1/2)‖x − y‖₂²;   V(x, y) = Σ_{i=1}^n x_i log(x_i/y_i) (KL-divergence).
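As a quick illustration, here is a minimal Python sketch (ours, not from the slides) of these two divergences; the helper names are ours.

import numpy as np

def v_euclidean(x, y):
    # d(x) = (1/2)||x||_2^2 gives V(x, y) = (1/2)||x - y||_2^2.
    return 0.5 * np.sum((x - y) ** 2)

def v_kl(x, y):
    # d(x) = sum_i x_i log x_i (negative entropy, 1-strongly convex
    # w.r.t. ||.||_1 on the probability simplex) gives the KL divergence.
    return np.sum(x * np.log(x / y))

x = np.array([0.2, 0.3, 0.5])
y = np.ones(3) / 3
print(v_euclidean(x, y), v_kl(x, y))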
Definition of (δ, L)-model
Definition
A pair (F_δ(y), ψ_δ(x, y)) is a (δ, L)-model of a function F at a point y with respect to the norm ‖·‖ if for all x ∈ Q

    0 ≤ F(x) − F_δ(y) − ψ_δ(x, y) ≤ (L/2)‖x − y‖² + δ,   ψ_δ(x, x) = 0 ∀x ∈ Q,

and ψ_δ(x, y) is a convex function of x for all y ∈ Q.
Definition (Devolder–Glineur–Nesterov, 2013)
A function F admits a (δ, L)-oracle at a point y if there exists a pair (F_δ(y), ∇F_δ(y)) such that

    0 ≤ F(x) − F_δ(y) − ⟨∇F_δ(y), x − y⟩ ≤ (L/2)‖x − y‖² + δ,  ∀x ∈ Q.
See also the definition of the (δ, L)-inexact oracle in O. Devolder's PhD thesis, 2013.
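The defining inequality is easy to probe numerically. Below is a hedged sanity check (ours, not from the slides): sample points around y and test the two-sided (δ, L)-model inequality; the example verifies that the linearization of a smooth quadratic is a (0, L)-model.

import numpy as np

def is_delta_L_model(F, F_delta, psi_delta, y, delta, L,
                     n_samples=1000, radius=1.0, seed=0):
    # Empirically test 0 <= F(x) - F_delta(y) - psi_delta(x, y)
    #                     <= (L/2)||x - y||^2 + delta on random points.
    rng = np.random.default_rng(seed)
    for _ in range(n_samples):
        x = y + radius * rng.standard_normal(y.shape)
        gap = F(x) - F_delta(y) - psi_delta(x, y)
        if gap < -1e-12 or gap > 0.5 * L * np.sum((x - y) ** 2) + delta + 1e-12:
            return False
    return True

A = np.diag([1.0, 3.0])
F = lambda x: 0.5 * x @ A @ x
grad = lambda y: A @ y
y0 = np.array([1.0, -2.0])
# Linear model of a smooth quadratic: a (0, L)-model with L = max eigenvalue.
print(is_delta_L_model(F, F, lambda x, y: grad(y) @ (x - y), y0, delta=0.0, L=3.0))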
Definition of inexact solution
Definition
Consider an optimization problem

    ψ(x) → min_{x∈Q},

where ψ is a convex function. Then Argmin^{δ̃}_{x∈Q} ψ(x) is the set of points x̃ such that

    ∃ h ∈ ∂ψ(x̃):  ⟨h, x∗ − x̃⟩ ≥ −δ̃.

An arbitrary point of Argmin^{δ̃}_{x∈Q} ψ(x) is denoted by argmin^{δ̃}_{x∈Q} ψ(x).
From the definition it follows that

    ψ(x̃) − ψ(x∗) ≤ δ̃.

The converse is not always true.
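For intuition, here is a small sketch (ours) of this condition for a separable quadratic over a box, where the exact minimizer is obtained by clipping; since ψ is smooth here, the subgradient is just the gradient.

import numpy as np

def check_inexact_argmin(grad_psi, x_tilde, x_star, delta_tilde):
    # The inexact-argmin condition <grad psi(x~), x* - x~> >= -delta~.
    return grad_psi(x_tilde) @ (x_star - x_tilde) >= -delta_tilde

A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 2.0])
grad_psi = lambda x: A @ x - b                      # psi(x) = (1/2)x^T A x - b^T x
x_star = np.clip(np.linalg.solve(A, b), 0.0, 1.0)   # exact minimizer on [0,1]^2
x_tilde = x_star + np.array([-0.01, 0.0])           # a nearby inexact point
print(check_inexact_argmin(grad_psi, x_tilde, x_star, delta_tilde=1e-3))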
Gradient descent
Input: x_0, N – number of steps, sequences {δ_k}_{k=0}^{N−1}, {δ̃_k}_{k=0}^{N−1}, and L_0 > 0.
Step 0: L_1 = L_0/2.
Step k+1: α_{k+1} = 1/L_{k+1},

    φ_{k+1}(x) = V(x, x_k) + α_{k+1} ψ_{δ_k}(x, x_k),   x_{k+1} = argmin^{δ̃_k}_{x∈Q} φ_{k+1}(x).

If F_{δ_k}(x_{k+1}) > F_{δ_k}(x_k) + ψ_{δ_k}(x_{k+1}, x_k) + (L_{k+1}/2)‖x_{k+1} − x_k‖² + δ_k, then L_{k+1} := 2L_{k+1} and repeat the current step.
Otherwise, L_{k+2} = L_{k+1}/2 and go to the next step.
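A minimal Python sketch (ours, not the authors' code) of this adaptive scheme for the Euclidean setup V(x, y) = (1/2)‖x − y‖₂², Q = R^n, and the exact linear model ψ(x, y) = ⟨∇F(y), x − y⟩, so that δ_k = δ̃_k = 0; all helper names are ours.

import numpy as np

def gradient_descent(F, grad, x0, N, L0=1.0):
    x, L = x0.copy(), L0 / 2
    xs = [x.copy()]
    for _ in range(N):
        while True:
            g = grad(x)
            x_new = x - g / L              # argmin of V(., x_k) + (1/L) psi(., x_k)
            if F(x_new) <= F(x) + g @ (x_new - x) + 0.5 * L * np.sum((x_new - x) ** 2):
                break
            L *= 2                         # model check failed: double L, repeat step
        x = x_new
        xs.append(x.copy())
        L /= 2                             # optimistic halving for the next step
    return np.mean(xs[:-1], axis=0)        # averaged point, as in the theorem below

A = np.diag([1.0, 10.0])
F = lambda x: 0.5 * x @ A @ x
print(F(gradient_descent(F, lambda x: A @ x, np.array([1.0, 1.0]), N=200)))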
Fast gradient descent
Input: x_0, N – number of steps, sequences {δ_k}_{k=0}^{N−1}, {δ̃_k}_{k=0}^{N−1}, and L_0 > 0.
Step 0: y_0 = u_0 = x_0; L_1 = L_0/2; α_0 = 0; A_0 = α_0.
Step k+1: Find α_{k+1} from A_k + α_{k+1} = L_{k+1} α_{k+1}²; set A_{k+1} = A_k + α_{k+1},

    y_{k+1} = (α_{k+1} u_k + A_k x_k)/A_{k+1},
    φ_{k+1}(x) = V(x, u_k) + α_{k+1} ψ_{δ_k}(x, y_{k+1}),   u_{k+1} = argmin^{δ̃_k}_{x∈Q} φ_{k+1}(x),
    x_{k+1} = (α_{k+1} u_{k+1} + A_k x_k)/A_{k+1}.

If F_{δ_k}(x_{k+1}) > F_{δ_k}(y_{k+1}) + ψ_{δ_k}(x_{k+1}, y_{k+1}) + (L_{k+1}/2)‖x_{k+1} − y_{k+1}‖² + δ_k, then L_{k+1} := 2L_{k+1} and repeat the current step.
Otherwise, L_{k+2} = L_{k+1}/2 and go to the next step.
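A matching sketch (ours) of the fast method under the same assumptions: Euclidean prox-function, exact linear model, Q = R^n.

import numpy as np

def fast_gradient_descent(F, grad, x0, N, L0=1.0):
    x, u = x0.copy(), x0.copy()
    L, A = L0 / 2, 0.0
    for _ in range(N):
        while True:
            alpha = (1 + np.sqrt(1 + 4 * L * A)) / (2 * L)  # root of A + a = L a^2
            A_new = A + alpha
            y = (alpha * u + A * x) / A_new
            u_new = u - alpha * grad(y)    # argmin of V(., u_k) + alpha psi(., y_{k+1})
            x_new = (alpha * u_new + A * x) / A_new
            if F(x_new) <= F(y) + grad(y) @ (x_new - y) + 0.5 * L * np.sum((x_new - y) ** 2):
                break
            L *= 2                         # model check failed: double L, repeat step
        x, u, A = x_new, u_new, A_new
        L /= 2
    return x

A_mat = np.diag([1.0, 10.0])
F = lambda x: 0.5 * x @ A_mat @ x
print(F(fast_gradient_descent(F, lambda x: A_mat @ x, np.array([1.0, 1.0]), N=100)))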
Theorem (Gradient descent)
Let V(x∗, x_0) ≤ R², where x_0 is the starting point, x∗ is the solution closest to x_0 in the sense of the Bregman divergence V, and x̄_N = (1/N) Σ_{k=0}^{N−1} x_k. For the proposed algorithm we have

    F(x̄_N) − F(x∗) ≤ 2LR²/N + (2L/N) Σ_{k=0}^{N−1} δ̃_k + (2/N) Σ_{k=0}^{N−1} δ_k.
Theorem (Fast gradient descent)
Let V(x∗, x_0) ≤ R², where x_0 is the starting point and x∗ is the solution closest to x_0 in the sense of the Bregman divergence V. For the proposed algorithm we have

    F(x_N) − F(x∗) ≤ 8LR²/(N+1)² + 8L (Σ_{k=0}^{N−1} δ̃_k)/(N+1)² + 2 (Σ_{k=0}^{N−1} δ_k A_{k+1})/A_N.
Gradient descent vs Fast gradient descent
Let δ̃_k = δ̃ and δ_k = δ. Then

    F(x̄_N) − F(x∗) ≤ 2LR²/N + 2Lδ̃ + 2δ,   F(x_N) − F(x∗) ≤ 8LR²/(N+1)² + 8Lδ̃/(N+1) + 2Nδ.

Fast gradient descent is more robust to δ̃, the error from solving the intermediate optimization problem inexactly, but it accumulates δ, the error of the (δ, L)-model.
There are intermediate gradient methods with convergence rate equal to (in the case δ̃ = 0 – Devolder–Glineur–Nesterov, 2013)

    O(1) LR²/N^p + O(1) N^{1−p} δ̃ + O(1) N^{p−1} δ,  where p ∈ [1, 2] (D. Kamzolov, 2017).
Linear model (Nesterov, 1983)
Assume that F is a smooth convex function and the gradient of F is Lipschitz-continuous with parameter L. Then

    0 ≤ F(x) − F(y) − ⟨∇F(y), x − y⟩ ≤ (L/2)‖x − y‖²,  ∀x, y ∈ Q.

We can take ψ_{δ_k}(x, y) = ⟨∇F(y), x − y⟩, F_{δ_k}(y) = F(y), and δ_k = 0 for all k ≥ 0. If, moreover, we can solve the intermediate optimization problems exactly, so that δ̃_k = 0 for all k ≥ 0, then

    F(x_N) − F(x∗) ≤ 8LR²/(N+1)².

The bound is optimal.
Universal method (Nesterov, 2013)
Let

    ‖∇F(x) − ∇F(y)‖_* ≤ L_ν ‖x − y‖^ν,  ∀x, y ∈ Q.

Then

    0 ≤ F(x) − F(y) − ⟨∇F(y), x − y⟩ ≤ (L(δ)/2)‖x − y‖² + δ,  ∀x, y ∈ Q,

where

    L(δ) = L_ν ( (L_ν/(2δ)) · (1−ν)/(1+ν) )^{(1−ν)/(1+ν)}

and δ > 0 is a free parameter. Consequently, we can take ψ_{δ_k}(x, y) = ⟨∇F(y), x − y⟩ and F_{δ_k}(y) = F(y). Let us assume that δ̃_k = 0 for all k ≥ 0 and take

    δ_k = ε α_{k+1}/(4 A_{k+1}),  ∀k,

where ε is the desired accuracy of the solution in function value.
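A tiny helper (ours) for this L(δ) trade-off; note that for ν = 1 it reduces to the usual Lipschitz constant L₁.

def L_of_delta(L_nu, nu, delta):
    # L(delta) = L_nu * (L_nu/(2*delta) * (1-nu)/(1+nu)) ** ((1-nu)/(1+nu)).
    return L_nu * (L_nu / (2 * delta) * (1 - nu) / (1 + nu)) ** ((1 - nu) / (1 + nu))

print(L_of_delta(L_nu=1.0, nu=0.5, delta=1e-3))  # grows as delta shrinks (nu < 1)
print(L_of_delta(L_nu=1.0, nu=1.0, delta=1e-3))  # nu = 1: L(delta) = L_1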
Universal method (Nesterov, 2013)
In order to have

    F(x_N) − F(x∗) ≤ ε,

it is enough to do

    N ≤ inf_{ν∈[0,1]} [ 64^{(1+ν)/(1+3ν)} · (2^{(2−2ν)/(1+ν)})^{(1−ν)/(1+3ν)} · (L_ν R^{1+ν}/ε)^{2/(1+3ν)} ]

steps. The bound is optimal.
Universal conditional gradient (Frank–Wolfe) method
In the fast gradient method we have

    φ_{k+1}(x) = V(x, u_k) + α_{k+1} ψ_{δ_k}(x, y_{k+1}).

Instead, let us take (the idea goes back to A. Nemirovski, 2013)

    φ̃_{k+1}(x) = α_{k+1} ψ_{δ_k}(x, y_{k+1}).

Let us look at this substitution from the viewpoint of the error δ̃_k. One can show that it is enough to take δ̃_k = 2R_Q² for all k ≥ 0, where R_Q is the diameter of the set Q. Let us take

    δ_k = ε α_{k+1}/(4 A_{k+1}),  ∀k,

where ε is the desired accuracy of the solution in function value. Finally,
    N ≤ inf_{ν∈(0,1]} ( 96^{2ν/(1+ν)} · (2^{(2−2ν)/(1+ν)})^{2ν/(1−ν)} · L_ν R_Q^{1+ν}/ε )^{1/ν}.
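To see why dropping V(x, u_k) helps: the auxiliary problem becomes linear minimization over Q, which a linear minimization oracle solves at a vertex. Below is a minimal sketch (ours) for the unit simplex; for simplicity it uses the classical step size 2/(k+2) of the standard Frank–Wolfe method rather than the adaptive scheme above.

import numpy as np

def frank_wolfe(grad, x0, N):
    x = x0.copy()
    for k in range(N):
        g = grad(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0              # argmin of <g, .> over the simplex: a vertex
        x += 2.0 / (k + 2) * (s - x)       # classical step size 2/(k+2)
    return x

A = np.array([[2.0, 0.5], [0.5, 1.0]])
x = frank_wolfe(lambda x: A @ x, np.array([0.5, 0.5]), N=500)
print(x)  # approaches the minimizer of (1/2) x^T A x over the simplex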
Composite optimization (Nesterov, 2008)
    F(x) := f(x) + h(x) → min_{x∈Q},

where f is a smooth convex function whose gradient is Lipschitz-continuous with parameter L, and h is a convex function. One can show that

    0 ≤ F(x) − F(y) − ⟨∇f(y), x − y⟩ − h(x) + h(y) ≤ (L/2)‖x − y‖²,  ∀x, y ∈ Q.

We can take ψ_{δ_k}(x, y) = ⟨∇f(y), x − y⟩ + h(x) − h(y), F_{δ_k}(y) = F(y), and δ_k = 0 for all k ≥ 0. On the one hand, the methods do not change; on the other hand, the auxiliary optimization problems become more complicated.
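For example, for h(x) = λ‖x‖₁ in the Euclidean setup the auxiliary problem is a prox step with a closed-form solution (soft thresholding). A sketch (ours), with a plain ISTA-style loop:

import numpy as np

def composite_step(x, grad_f, L, lam):
    # argmin_z  <grad_f(x), z - x> + lam*||z||_1 + (L/2)||z - x||_2^2
    z = x - grad_f(x) / L
    return np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)

A = np.array([[1.0, 0.2], [0.2, 2.0]])
b = np.array([1.0, -1.0])
grad_f = lambda x: A @ x - b               # f(x) = (1/2)x^T A x - b^T x
x = np.zeros(2)
for _ in range(200):                       # plain ISTA-style iteration
    x = composite_step(x, grad_f, L=2.2, lam=0.1)
print(x)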
Superposition of functions (Nemirovski–Nesterov, 1985)
    F(x) := f(f_1(x), …, f_m(x)) → min_{x∈Q},

where each f_k is a smooth convex function whose gradient is Lipschitz-continuous with parameter L_k, k = 1, …, m. The function f is convex, monotone, and Lipschitz-continuous with parameter M with respect to the 1-norm. Then

    0 ≤ F(x) − F(y) − f(f_1(y) + ⟨∇f_1(y), x − y⟩, …, f_m(y) + ⟨∇f_m(y), x − y⟩) + F(y) ≤ (M Σ_{i=1}^m L_i/2)‖x − y‖²,  ∀x, y ∈ Q.

We can take ψ_{δ_k}(x, y) = f(f_1(y) + ⟨∇f_1(y), x − y⟩, …, f_m(y) + ⟨∇f_m(y), x − y⟩) − F(y), F_{δ_k}(y) = F(y), and δ_k = 0 for all k ≥ 0.
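A concrete instance (ours): f = max, which is monotone and 1-Lipschitz w.r.t. the 1-norm, applied to m = 2 smooth quadratics; the sketch checks the two-sided model inequality at one pair of points.

import numpy as np

A1, b1 = np.diag([1.0, 2.0]), np.array([1.0, 0.0])
A2, b2 = np.diag([3.0, 1.0]), np.array([0.0, 1.0])
f1 = lambda x: 0.5 * x @ A1 @ x - b1 @ x
f2 = lambda x: 0.5 * x @ A2 @ x - b2 @ x
g1 = lambda x: A1 @ x - b1
g2 = lambda x: A2 @ x - b2

F = lambda x: max(f1(x), f2(x))            # f = max applied to f1, f2

def psi(x, y):
    # f(f1(y) + <g1(y), x - y>, f2(y) + <g2(y), x - y>) - F(y)
    return max(f1(y) + g1(y) @ (x - y), f2(y) + g2(y) @ (x - y)) - F(y)

x, y = np.array([0.3, -0.2]), np.array([1.0, 1.0])
L = 1.0 * (2.0 + 3.0)                      # M * (L_1 + L_2) with M = 1
print(0.0 <= F(x) - F(y) - psi(x, y) <= 0.5 * L * np.sum((x - y) ** 2))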
Proximal method with an inexact solution of the intermediate optimization problem; Devolder–Glineur–Nesterov, 2013
Let us consider

    f(x) := min_{y∈Q} [ φ(y) + (L/2)‖y − x‖₂² ] → min_{x∈R^n},   (1)

and denote Λ(x, y) := φ(y) + (L/2)‖y − x‖₂². The function φ is convex, and the point y(x) satisfies

    max_{y∈Q} [ Λ(x, y(x)) − Λ(x, y) + (L/2)‖y − y(x)‖₂² ] ≤ δ.

Then the pair

    ( φ(y(x)) + (L/2)‖y(x) − x‖₂² − δ,   ψ_δ(z, x) = ⟨L(x − y(x)), z − x⟩ )

is a (δ, L)-model of f(z) at the point x w.r.t. the 2-norm. From this result one can obtain the Catalyst technique (Lin–Mairal–Harchaoui, 2015).
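A sketch (ours) of this construction for a toy quadratic φ: the inner problem Λ(x, ·) is minimized approximately by gradient steps, and the resulting pair approximates the (δ, L)-model of the Moreau envelope; all names are ours.

import numpy as np

L = 1.0
phi = lambda y: 2.0 * np.sum((y - 1.0) ** 2)          # 4-smooth convex quadratic
grad_phi = lambda y: 4.0 * (y - 1.0)

def inexact_prox(x, n_inner=50):
    # Gradient steps on Lambda(x, .) = phi(.) + (L/2)||. - x||_2^2,
    # which is (4 + L)-smooth, hence step size 1/(4 + L).
    y = x.copy()
    for _ in range(n_inner):
        y -= (grad_phi(y) + L * (y - x)) / (4.0 + L)
    return y

def f_delta(x):
    y = inexact_prox(x)
    return phi(y) + 0.5 * L * np.sum((y - x) ** 2)    # (minus delta in theory)

def psi(z, x):
    y = inexact_prox(x)
    return L * (x - y) @ (z - x)                      # <L(x - y(x)), z - x>

x = np.array([3.0, -1.0])
print(f_delta(x), psi(x + 0.1, x))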
Saddle point problem; Devolder–Glineur–Nesterov, 2013
Let us consider

    f(x) := max_{y∈Q} [ ⟨x, b − Ay⟩ − φ(y) ] → min_{x∈R^n},   (2)

where φ(y) is a μ-strongly convex function w.r.t. the p-norm (1 ≤ p ≤ 2). Then f is a smooth convex function and the gradient of f is Lipschitz-continuous with parameter

    L = (1/μ) max_{‖y‖_p≤1} ‖Ay‖₂².

If y_δ(x) is a δ-solution of the max problem, then the pair

    ( ⟨x, b − Ay_δ(x)⟩ − φ(y_δ(x)),   ψ_δ(z, x) = ⟨b − Ay_δ(x), z − x⟩ )

is a (δ, 2L)-model of f(z) at the point x w.r.t. the 2-norm.
Augmented Lagrangians; Devolder–Glineur–Nesterov, 2013
Let us consider

    φ(y) + (μ/2)‖Ay − b‖₂² → min_{Ay=b, y∈Q}

and its dual problem

    f(x) := max_{y∈Q} [ ⟨x, b − Ay⟩ − φ(y) − (μ/2)‖Ay − b‖₂² ] → min_{x∈R^n},

with Λ(x, y) denoting the expression in brackets. If y_δ(x) is a solution in the sense of the inequality

    max_{y∈Q} ⟨∇_y Λ(x, y_δ(x)), y − y_δ(x)⟩ ≤ δ,

then the pair

    ( ⟨x, b − Ay_δ(x)⟩ − φ(y_δ(x)) − (μ/2)‖Ay_δ(x) − b‖₂²,   ψ_δ(z, x) = ⟨b − Ay_δ(x), z − x⟩ )

is a (δ, μ^{−1})-model of f(z) at the point x w.r.t. the 2-norm.
First new example: Min-min problem
Let us consider

    f(x) := min_{y∈Q} F(y, x) → min_{x∈R^n}.

The set Q is convex and bounded. The function F is smooth and convex in both variables:

    ‖∇F(y′, x′) − ∇F(y, x)‖₂ ≤ L‖(y′, x′) − (y, x)‖₂,  ∀y, y′ ∈ Q, x, x′ ∈ R^n.

If we can find a point ỹ_δ(x) ∈ Q such that

    ⟨∇_y F(ỹ_δ(x), x), y − ỹ_δ(x)⟩ ≥ −δ,  ∀y ∈ Q,

then

    F(ỹ_δ(x), x) − f(x) ≤ δ,   ‖∇f(x′) − ∇f(x)‖₂ ≤ L‖x′ − x‖₂,

and the pair

    ( F(ỹ_δ(x), x) − 2δ,   ψ_δ(z, x) = ⟨∇_x F(ỹ_δ(x), x), z − x⟩ )

is a (6δ, 2L)-model of f(x) at the point x w.r.t. the 2-norm.
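A sketch (ours) for a toy F(y, x): the inner minimization over the box Q = [0, 1]² is approximated by projected gradient steps, and the model's linear part uses the x-gradient at the approximate inner solution.

import numpy as np

# F(y, x) = (1/2)||y - x||_2^2 + (1/2)||x||_2^2, jointly smooth and convex.
grad_y = lambda y, x: y - x
grad_x = lambda y, x: (x - y) + x

def y_tilde(x, n_inner=100):
    y = np.zeros_like(x)
    for _ in range(n_inner):
        y = np.clip(y - 0.5 * grad_y(y, x), 0.0, 1.0)  # projected gradient step
    return y

def psi(z, x):
    return grad_x(y_tilde(x), x) @ (z - x)             # <grad_x F(y~, x), z - x>

x = np.array([2.0, -0.5])
print(y_tilde(x), psi(x + 0.1, x))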
Second new example: Nesterov’s clustering model, 2018
Let us consider

    f_μ(x = (z, p)) = g(z, p) + μ Σ_{k=1}^n z_k ln z_k + (μ/2)‖p‖₂² → min_{z∈S_n(1), p≥0}.

Let us introduce the norm

    ‖x‖² = ‖(z, p)‖² = ‖z‖₁² + ‖p‖₂².

Assume that

    ‖∇g(x₂) − ∇g(x₁)‖_* ≤ L‖x₂ − x₁‖,  L ≤ μ.

Then the pair

    ( f_μ(y),   ⟨∇g(y), x − y⟩ + (μ − L) Σ_{k=1}^n z_k ln z_k + ((μ − L)/2)‖p‖₂² )

is a (0, 2L)-model of f_μ(x = (z, p)) at the point y.
Third new example: Proximal Sinkhorn algorithm
Let us introduce the smoothed Wasserstein distance (Peyré–Cuturi, 2018)

    W_γ(p, q) = min_{π∈Π(p,q)} [ Σ_{i,j=1}^n C_{ij} π_{ij} + γ Σ_{i,j} π_{ij} log(π_{ij}/π⁰_{ij}) ],

where

    Π(p, q) = { π ∈ R^{n×n}_+ : Σ_{j=1}^n π_{ij} = p_i, Σ_{i=1}^n π_{ij} = q_j }.

It is well known (Franklin–Lorentz, 1989; Dvurechensky et al., 2018) that the complexity (in arithmetic operations, a.o.) of the Sinkhorn algorithm for this problem is

    O( n² min{ ln n/(γε), exp(C̃/γ) ln(1/ε) } ).
Third new example: Proximal Sinkhorn algorithm
If we want to calculate W_0(p, q) with precision ε, one can calculate W_γ(p, q) with γ = ε/(4 ln n) to precision ε/2, so the complexity in this case is Õ(n²/ε²) a.o. Can we do better? The answer is yes!
Let us use the proximal model of f(x) at a point y of the form

    ψ_δ(x, y) = f(x) − f(y)

with an arbitrary L > 0. Then the proximal gradient descent (Chen–Teboulle, 1993)

    x_{k+1} = argmin_{x∈Q} { ψ_δ(x, x_k) + L V(x, x_k) }

converges as O(LR²/k), where R² = V(x∗, x_0).
Third new example: Proximal Sinkhorn algorithm
So, let us apply this proximal setup to the calculation of W_0(p, q). Put x = π, Q = Π(p, q), L = γ, and choose V = KL. We obtain the proximal gradient method

    π_{k+1} = argmin_{π∈Π(p,q)} [ Σ_{i,j=1}^n C_{ij} π_{ij} + γ Σ_{i,j} π_{ij} log(π_{ij}/π^k_{ij}) ],

which requires O(γ ln n/ε) iterations to reach precision ε. But this assumes that we solve the auxiliary problem exactly, i.e. that we can calculate W_γ(p, q) exactly. From the above we know that if γ is not small (note: we can choose γ as we want!), we can solve this problem to very high precision with complexity Õ(n²) a.o. So the total number of a.o. will be Õ(n²/ε), which is better than Õ(n²/ε²) for the plain Sinkhorn algorithm.
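A compact sketch (ours) of this proximal Sinkhorn scheme: each outer KL-prox step is itself an entropic OT problem with kernel π_k ⊙ exp(−C/γ), solved by inner Sinkhorn iterations; the toy data and all names are ours.

import numpy as np

def sinkhorn(K, p, q, n_iter=1000):
    # Balance the kernel K to the marginals p, q; returns diag(u) K diag(v).
    u, v = np.ones_like(p), np.ones_like(q)
    for _ in range(n_iter):
        u = p / (K @ v)
        v = q / (K.T @ u)
    return u[:, None] * K * v[None, :]

def proximal_sinkhorn(C, p, q, gamma, n_outer=30):
    pi = np.outer(p, q)                    # feasible starting plan
    for _ in range(n_outer):
        # KL-prox step: argmin <C, pi> + gamma * KL(pi | pi_k) over Pi(p, q)
        # has the entropic-OT form with kernel pi_k * exp(-C / gamma).
        pi = sinkhorn(pi * np.exp(-C / gamma), p, q)
    return pi

n = 4
rng = np.random.default_rng(0)
C = rng.random((n, n))
p = np.ones(n) / n
q = np.ones(n) / n
pi = proximal_sinkhorn(C, p, q, gamma=0.2)
print(np.sum(C * pi))                      # approaches W_0(p, q)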
Auxiliary optimization problem
Let us take φ_{k+1}(x) = V(x, u_k) + α_{k+1} ψ_{δ_k}(x, y_{k+1}). From

    ⟨∇φ_{k+1}(x), x − x∗⟩ ≤ δ̃

we can get

    φ_{k+1}(x) − φ_{k+1}(x∗) ≤ δ̃.

Suppose the objective of the intermediate optimization problem is Lipschitz-continuous with parameter M and x is an ε-solution in function value. Then, since φ_{k+1} is 1-strongly convex,

    ⟨∇φ_{k+1}(x), x − x∗⟩ ≤ ‖∇φ_{k+1}(x)‖_* ‖x − x∗‖ ≤ M√(2ε).

If the auxiliary optimization problem is solved by a linearly convergent method, then we only need c times more steps, where c is a small constant.
Thank you!