Homework 5: Proximal methods, monotone operators, incremental methods

(1)

10-801: Advanced Optimization and Randomized Methods

Homework 5: Proximal methods, monotone operators, incremental methods

(April 9, 2014)

Instructor: Suvrit Sra Due:April 21, 2014

1. Consider the convex optimization problem

min f(x) x∈ X,

whereXis closed and convex, whilef :X →Ris Lipschitz continuous and convex. Suppose further that we have a functionD_φ:X × X →R

Dφ(x, y) =φ(x)−φ(y)− h∇φ(y), x−yi,

whereφis strongly convex with parameterµand continuously differentiable onX (as a resultD_φ(x, y)≥

µ

2kx−yk²₂).

(a) Show thatDφ(x, y)is strongly convex as a function ofxby proving that Dφ(x, y)≥Dφ(z, y) +h∇zD(z, y), x−zi+µ

2kx−zk²₂. (b) Consider the general iterative algorithm

x^k+1= argmin

x∈X

{hx, g^ki+ 1 αk

Dφ(x, x^k)}, g^k ∈∂f(x^k). (5.1)

• Write down the optimality conditions for (5.1)

• Use these optimality conditions to write the above update explicitly in terms of∇φand∇φ^∗. Hint:You’ll need the fact:φis strongly convex onX, soφ^∗is finite everywhere and differentiable, with

∇φ^∗≡(∇φ+NX)⁻¹; then consideru∈ ∇φ(x) +NX(x), whereNX is the normal cone.

(c) Show why the projected subgradient iteration

x^k+1=P_X(x^k−αkg^k), k= 0,1, . . . ,

is actually a special case of iteration (5.1).

2. The Douglas-Rachford iteration for minimizingf(x) +g(x)is given by x^k= prox_g(z^k)

v^k= prox_f(2x^k−z^k) z^k+1=z^k+γk(v^k−x^k)

Show that forγ_k = 1, we can rewrite the above iteration usingaveraged reflectionsas

z^k+1= [1

2(RfRg+I)](z^k), where the reflection operators areR_f:= 2 prox_f−I, andR_g:= 2 prox_g−I. 3. Consider the following separable convex optimization problem

x∈minRⁿ

F(x) :=X^m

i=1fi(x),

where eachfi:Rⁿ→R∪ {+∞}(e.g.,fi(x) =δC(x)for some closed convex setC).

(a) Derive a Douglas-Rachford (DR) iteration to optimizeF(x)using the “product space trick”. Justify all your steps.

5-1

(2)

HOMEWORK5 Proximal methods, monotone operators, incremental methods April 9, 2014

(b) WriteF(x) =f1(x) +Pm

i=2fi(x). Introduce variablesx2, . . . , xm=x1. Now, obtain the (Lagrange) dual problem in terms of the conjugate functionsf_i^∗. Show how to solve this dual problem using DR.

(c) Compare (in words) the two formulations in (a) and (b) above. Are there situations where you would prefer one over the other?

4. LetA₁, . . . , A_T be matrices inR^m×n, and lety₁, . . . , y_T ∈R. Consider the trace-norm regularized optimization problem

min

X∈R^m×n

X^T

j=1(yj−tr(X^TAj))²+λkXktr, where thetrace normiskXk_tr:=P

iσ_i(X)(sum of singular values).

(a) Derive a closed-form solution for the proximity operator of the trace-norm 1

2kX−Yk²_F+λkXk_tr,

Hint 1:Ifr:Rⁿ →Ris symmetric convex andabsolute(r(x) =r(|x₁|,|x₂|, . . . ,|x_n|)), andσ:R^m×n →Rⁿ+is the singular value map, then the conjugate of the compositionr◦σ, i.e.,(r◦σ)^∗is (no surprise)r^∗◦σ. (b) Present pseudo-code for solving this problem via proximal-gradients. Comment on how to select the

step-size parameter.

5. LetXbe a closed and bounded convex set. Letf be strongly convex with parameterµ. Assume we run the stochastic gradient method

x^k+1=P_X(x^k−αkg^k),

whereg^k is astochastic subgradient, i.e.,E[g^k |ξ_[k−1]]∈∂f(x^k), that hasfinite variance, i.e.,E[kg^kk²]≤σ². In this exercise, we’ll study a small modification to the simple convergence analysis from Lecture 19. In particular, we’ll show that a weighted average of the iteratesx^k demonstratesO(1/k)convergence rate.

(a) Prove the following inequality (we essentially proved it in class already):

E[kx^k+1−x^∗k²]≤Ekx^k−x^∗k²+α²_kE[kgkk²]−2α_k[f(x^k)−f(x^∗) +^µ₂kx^k−x^∗k²].

(b) Show from this inequality it follows that

E[f(x^k)]−f(x^∗)≤ α_kσ²

2 +α⁻¹_k −µ

2 E[kx^k−x^∗k²]−_2α¹

kE[kx^k+1−x^∗k²]. (5.2) (c) Show that choosing stepsizeαk= _µ(k+1)² , implies that

Ef(¯x^k)−f^∗≤ 2σ² µ(k+ 1), wherex¯^k :=_k(k+1)² Pk

t=1tx^t.

(d) Show howx¯^kcan be efficiently updated from iterationk→k+ 1.

6. Supposef is a convex function on a set C. An alternative definition of strong convexity off onC with coefficientµ >0is

f(αx+ (1−α)y) +^µ₂α(1−α)kx−yk²₂≤αf(x) + (1−α)f(y).

Supposefis a continuously differentiable function on int(C). Show that the following two are equivalent:

(a) f is strongly convex with strong convexity coefficientµ (b) h∇f(x)− ∇f(y), x−yi ≥µkx−yk²₂, ∀x, y∈int(C).

Hint:For part (b), it might help to definez(α) =αx+ (1−α)yand invoke the integral representation

f(z(α)) =f(x) + Z 1

0

h∇f(x+t(z(α)−x)), z(α)−xidt

5-2

(3)

HOMEWORK5 Proximal methods, monotone operators, incremental methods April 9, 2014

7. Iff is not convex, we can still define a prox-operator, which is now a set-valued map:

prox^λ_f ≡y7→Argmin

x∈Rⁿ

1

2kx−yk²₂+λf(x), λ >0.

Obtain the prox-maps for the following functions (a) f(x) =kxk0, i.e., the`0-“norm”.

(b) f(x) =kxk_1/2:= (P

i|xi|^1/2)²

(c) Does there exist a nonconvexf for which the prox-map is a singleton (forn >1)?

8. Consider the convex optimization problem

min f(x) +h(Ax), (5.3)

wherefandhare closed convex functions, andAhas full column rank. Assume that∂(f+h◦A) =∂f+∂(h◦A) (assume a similar qualification on the dual if needed).

(a) Write the Fenchel dual of this problem

(b) Show why running Douglas-Rachford on the dual yields the Alternating Direction Method of Multipliers (ADMM) for solving (5.3). (Hint:You may need to use the full DR method, not just its averaged reflections incarnation).

5-3