
5.3 Algorithms: Subgradient Descent

The central algorithm we explore in this note is simple (sub)gradient descent on the dual problem (5.13):

Algorithm 5.1 (Subgradient descent). Given $y^0 \in \mathbb{R}^m$, for $\nu = 0, 1, 2, \dots$ generate the sequence $\{y^\nu\}_{\nu=0}^{\infty}$ via
$$y^{\nu+1} = y^\nu + \lambda_\nu d^\nu, \qquad \text{where } d^\nu \in -\partial\left(\phi_{\varepsilon,L}\left(A^T y^\nu\right) - b^T y^\nu\right),$$
and $\lambda_\nu$ is an appropriate step length parameter.

For $\varepsilon > 0$, $\phi_{\varepsilon,L}$ is continuously differentiable on its domain, and the algorithm amounts to the method of steepest descent.
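To make the iteration concrete, the following Python sketch implements the bare loop of Algorithm 5.1 under the update rule stated above. The oracle `dual_subgradient` (returning an element of $\partial(\phi_{\varepsilon,L}(A^T y) - b^T y)$) and the step rule `step_length` are hypothetical placeholders supplied by the caller, since the algorithm statement fixes neither.

```python
import numpy as np

def subgradient_descent(A, b, dual_subgradient, step_length,
                        y0=None, max_iter=500, tol=1e-8):
    """Sketch of Algorithm 5.1: y^{nu+1} = y^nu + lambda_nu * d^nu,
    where -d^nu is a subgradient of the dual objective at y^nu."""
    m = A.shape[0]
    y = np.zeros(m) if y0 is None else y0.astype(float)
    for nu in range(max_iter):
        # d^nu is the *negative* of a subgradient of phi_{eps,L}(A^T y) - b^T y
        d = -dual_subgradient(A, b, y)
        if np.linalg.norm(d) <= tol:       # this subgradient vanishes: stop
            break
        y = y + step_length(nu, y, d) * d  # lambda_nu is chosen by the caller
    return y
```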

5.3.1 Nonsmooth Case: ε = 0

In this section, we present and analyze a subgradient descent method with exact line search and variants thereof suitable for solving the dual problem above for the case ε=0, that is, we do not smooth the problem.

Using the notation of indicator functions, we have
$$\phi_{0,L}(x) = \iota_{R_L}(x) \equiv \begin{cases} 0 & \text{for } x \in R_L \\ +\infty & \text{otherwise.} \end{cases}$$

The specific instance of (5.13) that we address is
$$\min_{y \in \mathbb{R}^m} \; \iota_{R_L}(A^T y) - y^T b. \tag{5.14}$$
Since the set $R_L$ is a rectangle, nonsmooth calculus yields the following simple expression for the subdifferential of the dual objective:
$$\partial\left(\iota_{R_L}(A^T y^\nu) - b^T y^\nu\right) = A\, N_{R_L}(A^T y^\nu) - b. \tag{5.15}$$

Here, we have used the fact that the subdifferential of the indicator function of the box $R_L$ at the point $x$, denoted $\partial\iota_{R_L}(x)$, is equivalent to the normal cone mapping of $R_L$ at the point $x$:
$$\partial\iota_{R_L}(x) = N_{R_L}(x) \equiv \begin{cases} \{\, w \in \mathbb{R}^n \mid (z - x)^T w \le 0 \text{ for all } z \in R_L \,\} & \text{if } x \in R_L \\ \emptyset & \text{if } x \notin R_L. \end{cases}$$
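As an illustration of (5.15) and the normal cone formula above, here is a minimal sketch (Python with NumPy; the function name is ours) that returns one element of the subdifferential of the dual objective at a feasible $y$ by picking a particular vector from $N_{R_L}(A^T y)$.

```python
import numpy as np

def dual_subgradient_element(A, b, y, L, tol=1e-12):
    """One element of  A N_{R_L}(A^T y) - b  from (5.15).

    The normal cone to the box R_L at x = A^T y consists of vectors w with
    w_j >= 0 where x_j = L_j, w_j <= 0 where x_j = -L_j, and w_j = 0 on
    inactive coordinates; here we simply take w_j = +/-1 on the active set."""
    x = A.T @ y
    if np.any(np.abs(x) > L + tol):
        raise ValueError("A^T y lies outside R_L: the subdifferential is empty")
    w = np.zeros(A.shape[1])
    upper = np.abs(x - L) <= tol   # coordinates active at the upper face
    lower = np.abs(x + L) <= tol   # coordinates active at the lower face
    w[upper] = 1.0
    w[lower] = -1.0
    return A @ w - b
```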

Remark 5.2. It is important to note that we assume that we can perform exact arithmetic. This assumption is necessary due to the composition of the normal cone mapping of $R_L$ with $A^T$: while we can determine the exact evaluation of the normal cone for a given $A^T y^\nu$, we cannot guarantee exact evaluation of the matrix–vector product and, since the normal cone mapping is not Lipschitz continuous on $R_L$, this can lead to large computational errors.

Problem (5.14) is a linear programming problem. The algorithm we analyze below solves a parametric version of problem (5.14), where the parameter $L$ changes dynamically at each iteration. To see how the parameter might be changed from one iteration to the next, we look to a trivial extension of the primal problem:

$$\underset{(x,L)\,\in\,\mathbb{R}^n\times\mathbb{R}^n_+}{\text{minimize}} \;\; \sum_{j=1}^n L_j |x_j| \quad \text{subject to } Ax = b. \tag{5.16}$$

It is clear that $L = 0$ together with any feasible $x$ is an optimal solution to problem (5.16), and that the (global) optimal value is 0. However, this is not the only solution. Indeed, the sparsest solution $\bar{x}$ to $Ax = b$ together with a weight $\bar{L}$ satisfying $\bar{L}_j = 0$ only for those elements $j$ on the support of $\bar{x}$ is also a solution. The algorithm we study below finds a weight compatible with the sparsest element $\bar{x}$. A more satisfying reformulation would yield a weight that is in some sense optimal for the sparsest element $\bar{x}$, but this is beyond the scope of this work.
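A tiny numerical illustration of these two kinds of solutions of (5.16), on hypothetical data: with a 2-sparse $\bar{x}$, both the trivial weight $L = 0$ and a weight vanishing exactly on the support of $\bar{x}$ attain the optimal value 0.

```python
import numpy as np

# Toy data (hypothetical): a 3x5 system with a 2-sparse solution xbar.
A = np.array([[1., 0., 2., 0., 1.],
              [0., 1., 0., 3., 1.],
              [1., 1., 0., 0., 2.]])
xbar = np.array([2., 0., 0., 0., 1.])        # sparse solution, support {0, 4}
b = A @ xbar                                 # so (xbar, L) is feasible for (5.16)

def weighted_l1(L, x):
    """Objective of (5.16): sum_j L_j |x_j|."""
    return np.sum(L * np.abs(x))

L_trivial = np.zeros(5)                      # L = 0: any feasible x is optimal
L_support = np.array([0., 1., 1., 1., 0.])   # L_j = 0 exactly on supp(xbar)

print(weighted_l1(L_trivial, xbar))          # 0.0
print(weighted_l1(L_support, xbar))          # 0.0 as well
```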

5.3.2 Dynamically Rescaled Descent with Exact Line Search

There are three unresolved issues in our discussion to this point, namely the choice of elements from the subdifferential, the choice of the step length, and the adjustment of the weights $L_j$. Our strategy is given in Algorithm 5.4 below. In the description of the algorithm, we use some geometric notions that we introduce first. It will be convenient to define the set $C$ by
$$C \equiv \{\, y \in \mathbb{R}^m \mid A^T y \in R_L \,\}.$$
This set is polyhedral, being the preimage of the box $R_L$ under the linear mapping $A^T$.

Lemma 5.3 (Normal cone projection). Let $A$ be full rank and denote the normal cone to $C \equiv \{y \in \mathbb{R}^m \mid A^T y \in R_L\}$ at $y \in C$ by $N_C(y)$. Then
$$P_{N_C(y)} b = A\bar{w} \tag{5.17}$$
for
$$\bar{w} = \operatorname{argmin}\left\{\|Aw - b\|^2 \,\middle|\, w \in N_{R_L}(A^T y)\right\}. \tag{5.18}$$

Proof. If $A$ is full rank, then all points $y \in C$ satisfy the constraint qualification that $A$ is injective on $N_{R_L}(A^T y)$, that is, the only vector $w \in N_{R_L}(A^T y)$ for which $Aw = 0$ is $w = 0$. Then, by convex or nonsmooth analysis (see, e.g., [14, Theorem 6.14]) the set $C$ is regular and
$$N_C(y) = A\, N_{R_L}(A^T y) = \left\{\, u = Aw \,\middle|\, w \in N_{R_L}(A^T y) \,\right\}.$$
By the definition of the projection,
$$P_{N_C(y)} b \equiv \operatorname{argmin}\left\{\|u - b\|^2 \,\middle|\, u \in N_C(y)\right\},$$
hence
$$P_{N_C(y)} b = \operatorname{argmin}\left\{\|u - b\|^2 \,\middle|\, u \in A\, N_{R_L}(A^T y)\right\} = A\operatorname{argmin}\left\{\|Aw - b\|^2 \,\middle|\, w \in N_{R_L}(A^T y)\right\} = A\bar{w}.$$
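Lemma 5.3 turns the projection onto $N_C(y)$ into a sign-constrained least squares problem over the coordinates active in the box. The sketch below (our naming; it uses `scipy.optimize.nnls`) solves (5.18) by flipping the sign of the columns active at the lower face so that the constraint becomes plain nonnegativity; coordinates with $L_j = 0$ are treated as upper-active only, a simplification of the sketch rather than part of the lemma.

```python
import numpy as np
from scipy.optimize import nnls

def project_onto_normal_cone(A, b, y, L, tol=1e-12):
    """Compute P_{N_C(y)} b = A wbar, with wbar solving (5.18).

    N_{R_L}(A^T y) = {w : w_j >= 0 if (A^T y)_j =  L_j,
                          w_j <= 0 if (A^T y)_j = -L_j,
                          w_j  = 0 otherwise},
    so (5.18) is a nonnegative least squares problem in the active
    coordinates after flipping the sign of the lower-face columns."""
    x = A.T @ y
    upper = np.abs(x - L) <= tol
    lower = np.abs(x + L) <= tol
    active = np.where(upper | lower)[0]
    wbar = np.zeros(A.shape[1])
    if active.size == 0:                       # normal cone is {0}
        return A @ wbar, wbar
    signs = np.where(upper[active], 1.0, -1.0)
    # minimize ||A[:, active] diag(signs) v - b||^2 over v >= 0
    v, _ = nnls(A[:, active] * signs, b)
    wbar[active] = signs * v
    return A @ wbar, wbar
```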

Algorithm 5.4 (Dynamically rescaled descent with exact line search).

Initialization: Set $\nu = 0$, $\tau > 0$, $L^0 = (\|a_1\|, \|a_2\|, \dots, \|a_n\|)$, where $a_j$ is the $j$th column of $A$, $y^0 = 0$ and the direction $d^0 = b$.

Main iteration: While $\|d^\nu\| > \tau$ do

(Exact line search.) Calculate the step length $\lambda_\nu > 0$ according to
$$\lambda_\nu \equiv \operatorname*{argmin}_{\lambda > 0}\left\{\iota_{R_{L^\nu}}\!\left(A^T(y^\nu + \lambda d^\nu)\right) - b^T(y^\nu + \lambda d^\nu)\right\}. \tag{5.19}$$
Set $\tilde{y} = y^\nu + \lambda_\nu d^\nu$.

(Subgradient selection and preliminary rescaling.) Define
$$J_{\nu+1} = \{\, j \mid |a_j^T \tilde{y}| = L^\nu_j \,\}, \tag{5.20}$$
$$S(L,J,\gamma) = \left(s_1(L,J,\gamma), s_2(L,J,\gamma), \dots, s_n(L,J,\gamma)\right), \quad \text{where } s_j(L,J,\gamma) = \begin{cases} \gamma L_j & \text{for all } j \in J \\ L_j & \text{else,} \end{cases} \tag{5.21}$$
and
$$C(L,J,\gamma) = \{\, y \in \mathbb{R}^m \mid A^T y \in R_{S(L,J,\gamma)} \,\}, \quad \text{where } R_{S(L,J,\gamma)} \equiv [-s_1(L,J,\gamma), s_1(L,J,\gamma)] \times \cdots \times [-s_n(L,J,\gamma), s_n(L,J,\gamma)]. \tag{5.22}$$
Choose $\gamma \ge 0$ small enough that $P_{N_{C(L^\nu, J_{\nu+1}, \gamma)}(\tilde{y})} b \in \operatorname{ri}\left(N_{C(L^\nu, J_{\nu+1}, \gamma)}(\tilde{y})\right)$. Compute the direction
$$d^{\nu+1} \equiv b - P_{N_{C(L^\nu, J_{\nu+1}, \gamma)}(\tilde{y})} b. \tag{5.23}$$

(Rescaling.) Let $J^+_{\nu+1} \equiv \{\, j \mid a_j^T d^{\nu+1} > 0 \,\}$, $J^-_{\nu+1} \equiv \{\, j \mid a_j^T d^{\nu+1} < 0 \,\}$, and define
$$I_{\nu+1}(\gamma) \equiv \operatorname*{argmin}_{j \in J^+_{\nu+1} \cup J^-_{\nu+1}}\left\{ \frac{L^\nu_j - \gamma\, a_j^T \tilde{y}}{a_j^T d^{\nu+1}} \;\middle|\; j \in J^+_{\nu+1}; \quad \frac{-L^\nu_j - \gamma\, a_j^T \tilde{y}}{a_j^T d^{\nu+1}} \;\middle|\; j \in J^-_{\nu+1} \right\}. \tag{5.24}$$
Choose $\gamma_{\nu+1} \in [0, \gamma]$ to satisfy
$$I_{\nu+1}(\gamma_{\nu+1}) \subseteq I_{\nu+1}(0). \tag{5.25}$$
Set
$$L^{\nu+1}_j = \begin{cases} \gamma_{\nu+1} L^\nu_j & \text{for } j \in J_{\nu+1} \\ L^\nu_j & \text{else} \end{cases} \tag{5.26}$$
and $y^{\nu+1} = \gamma_{\nu+1}\tilde{y}$. Increment $\nu = \nu + 1$.

End do.
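The bookkeeping in steps (5.20) and (5.24)–(5.26) reduces to a few ratio tests and index updates. The sketch below (Python; the names and the packaging into a single helper are ours, and the direction $d^{\nu+1}$ is assumed to have been computed already, for instance by solving (5.18) as in the earlier sketch) carries out one pass of the rescaling step for a candidate value of $\gamma$.

```python
import numpy as np

def rescale_step(A, ytilde, L, d_next, gamma, tol=1e-12):
    """One pass of the rescaling bookkeeping in Algorithm 5.4.

    ytilde : point reached by the exact line search,
    L      : current weights L^nu,
    d_next : new search direction d^{nu+1} from (5.23),
    gamma  : candidate value of gamma_{nu+1}."""
    x = A.T @ ytilde
    J = np.abs(np.abs(x) - L) <= tol             # active set J_{nu+1}, cf. (5.20)

    slopes = A.T @ d_next
    Jplus, Jminus = slopes > tol, slopes < -tol  # J^+_{nu+1} and J^-_{nu+1}

    def ratios(g):
        # step-to-boundary ratios appearing in (5.24), for the scaled point g * ytilde
        r = np.full(len(L), np.inf)
        r[Jplus] = (L[Jplus] - g * x[Jplus]) / slopes[Jplus]
        r[Jminus] = (-L[Jminus] - g * x[Jminus]) / slopes[Jminus]
        return r

    # indices entering the comparison (5.25): I_{nu+1}(gamma) vs. I_{nu+1}(0)
    I_gamma, I_zero = np.argmin(ratios(gamma)), np.argmin(ratios(0.0))

    L_next = np.where(J, gamma * L, L)           # (5.26): shrink weights on J_{nu+1}
    y_next = gamma * ytilde                      # keeps the faces in J_{nu+1} active
    return L_next, y_next, J, I_gamma, I_zero
```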

We begin with some observations. The next proposition shows that the directions chosen by (5.23) with $P_{N_{C(L^\nu, J_{\nu+1}, \gamma)}(\tilde{y})} b \in \operatorname{ri}\left(N_{C(L^\nu, J_{\nu+1}, \gamma)}(\tilde{y})\right)$ are subgradient descent directions that are not only feasible, but orthogonal to the active constraints. We use orthogonality of the search directions to the active constraints to guarantee finite termination of the algorithm.

Proposition 5.5 (Feasible directions). Let $C \equiv \{y \in \mathbb{R}^m \mid A^T y \in R_L\}$, $\bar{y} \in C$, and define the direction $\bar{d} \equiv b - P_{N_C(\bar{y})} b$. Then
$$-\bar{d} \in \partial\left(\iota_{R_L}(A^T \bar{y}) - b^T \bar{y}\right)$$
and there exists a $\bar{\lambda} > 0$ such that $\bar{y} + \lambda\bar{d} \in C$ for all $\lambda \in [0, \bar{\lambda}]$. Moreover, if $P_{N_C(\bar{y})} b \in \operatorname{ri}(N_C(\bar{y}))$, then the direction $\bar{d}$ is orthogonal to the $j$th column of $A$ for all $j$ such that $|a_j^T \bar{y}| = L_j$.

Proof. The inclusion $-\bar{d} \in \partial\left(\iota_{R_L}(A^T \bar{y}) - b^T \bar{y}\right)$ follows immediately from (5.15). The feasibility of this direction follows from Lemma 5.3 and the polyhedrality of $C$, since the polar to the normal cone to $C$ at a point $y \in C$ is equivalent to the tangent cone, which consists only of feasible directions to $C$ at $y$, a feasible direction being a direction $d$ for which $y + \lambda d \in C$ for all $\lambda > 0$ sufficiently small.

Indeed, let $a_j$ denote the $j$th column of the matrix $A$ and recall the definition of the contingent cone to $C$ at $y \in C$:
$$K_C(y) \equiv \{\, w \in \mathbb{R}^m \mid \text{for all } \nu,\ y + \lambda_\nu w^\nu \in C \text{ for some } w^\nu \to w,\ \lambda_\nu \searrow 0 \,\}.$$
Since $C$ is convex, the contingent cone and the tangent cone are equivalent [2, Corollary 6.3.7], and since $C$ is polyhedral the tangent cone can be written as
$$T_C(y) \equiv \{\, w \in \mathbb{R}^m \mid \text{for all } \nu,\ y + \lambda_\nu w \in C \text{ for some } \lambda_\nu \searrow 0 \,\},$$
that is, the tangent cone consists entirely of feasible directions. Now the tangent and normal cones to $C$ are convex and polar to each other [14, Corollary 6.30], so, by Lemma 5.3, what remains to be shown is that $b - P_{N_C(y)} b$ lies in the polar to the normal cone to $C$. This follows since $N_C(y)$ is nonempty, closed, and convex. Hence for all $w \in N_C(y)$ and for any $b$,
$$w^T\left(b - P_{N_C(y)} b\right) \le 0,$$
that is, $b - P_{N_C(y)} b$ is in the polar to the normal cone.

To see the final statement of the proposition, denote by $\bar{J}$ the set $\{\, j = 1, 2, \dots, n \mid |a_j^T \bar{y}| = L_j \,\}$. If the projection lies in the relative interior of $N_C(\bar{y})$, then the projection onto $N_C(\bar{y})$ is equivalent to the projection onto the subspace containing $N_C(\bar{y})$:
$$P_{N_C(\bar{y})} b = P_{D(\bar{y})} b, \quad \text{where } D(\bar{y}) \equiv \{\, Aw \mid w \in \mathbb{R}^n \text{ with } w_j = 0 \text{ for } j \notin \bar{J} \,\}.$$
Thus, $a_j^T\left(b - P_{N_C(\bar{y})} b\right) = a_j^T\left(b - P_{D(\bar{y})} b\right) = 0$ for $j \in \bar{J}$, as claimed.

Remark 5.6 (Detection of orthogonality of feasible directions). The interiority condition $P_{N_C(\bar{y})} b \in \operatorname{ri}(N_C(\bar{y}))$ guaranteeing orthogonality of the directions can easily be checked. Let $\bar{w} \equiv \operatorname{argmin}\{\|Aw - b\|^2 \mid w \in N_{R_L}(A^T \bar{y})\}$. By Lemma 5.3, $P_{N_C(\bar{y})} b = A\bar{w}$. Then $A\bar{w}$, and hence $P_{N_C(\bar{y})} b$, lies in $\operatorname{ri}(N_C(\bar{y}))$ if and only if $\bar{w}_j \ne 0$ for all $j$ such that $|a_j^T \bar{y}| = L_j$.
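In code, the test of Remark 5.6 is a one-line check once $\bar{w}$ from (5.18) is available; a minimal sketch (our naming) follows.

```python
import numpy as np

def projection_in_relative_interior(A, y, L, wbar, tol=1e-12):
    """Remark 5.6: A wbar = P_{N_C(y)} b lies in ri(N_C(y)) iff
    wbar_j != 0 for every active index j, i.e. every j with |a_j^T y| = L_j."""
    active = np.abs(np.abs(A.T @ y) - L) <= tol
    return bool(np.all(np.abs(wbar[active]) > tol))
```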

Calculation of the direction in (5.23) of Algorithm 5.4 is suggested by Lemma 5.3, where it is shown that the projection is the image under $A$ of the solution to a least squares problem over a cone. Also, by Proposition 5.5, the direction is the negative of a subgradient of the objective in (5.14) with the box $R_{S(L^\nu, J_{\nu+1}, \gamma)}$, that is,
$$-d^{\nu+1} \equiv -b + P_{N_{C(L^\nu, J_{\nu+1}, \gamma)}(\tilde{y})} b \in \partial\left(\iota_{R_{S(L^\nu, J_{\nu+1}, \gamma)}}(A^T \tilde{y}) - b^T \tilde{y}\right).$$

The description as a projection onto the normal cone of a polyhedron is perhaps less helpful than the explicit formulation of Lemma 5.3 for suggesting how this can be computed, but it provides greater geometrical insight. Moreover, the projection provides an elegant criterion for maintaining orthogonality of the search directions with the active constraints.

The exact line search step has an explicit formulation given in the next proposition.

The exact line search step length $\bar{\lambda}$ given by (5.19) has the explicit representation
$$\bar{\lambda} = \min_j\left\{ \frac{L^\nu_j - a_j^T y^\nu}{a_j^T d^\nu} \;\middle|\; a_j^T d^\nu > 0; \quad \frac{-L^\nu_j - a_j^T y^\nu}{a_j^T d^\nu} \;\middle|\; a_j^T d^\nu < 0 \right\}. \tag{5.27}$$

Proof. Application of nonsmooth calculus provides a generalization of the fact, from the optimization of smooth objectives, that the exact line search step extends to the tangent of a level set of the objective, from which we can extract (5.27). However, it is perhaps easiest to see the explicit formulation by direct inspection: the indicator function $\iota_{R_{L^\nu}}$ is zero at all points in $R_{L^\nu}$, so the step length is the largest $\lambda$ such that $A^T(y^\nu + \lambda d^\nu) \in R_{L^\nu}$, which is precisely the minimum in (5.27).
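The ratio test is straightforward to implement; the sketch below (our naming, assuming the current iterate is feasible) returns the largest step that keeps $A^T(y + \lambda d)$ inside the box $R_L$, as described in the proof.

```python
import numpy as np

def exact_line_search(A, y, d, L, tol=1e-12):
    """Largest lambda >= 0 with A^T (y + lambda d) in R_L, cf. (5.19) and (5.27)."""
    x, slopes = A.T @ y, A.T @ d
    lam = np.inf
    for j in range(len(L)):
        if slopes[j] > tol:                   # moving toward the upper face  L_j
            lam = min(lam, (L[j] - x[j]) / slopes[j])
        elif slopes[j] < -tol:                # moving toward the lower face -L_j
            lam = min(lam, (-L[j] - x[j]) / slopes[j])
    return lam
```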

We show in this section that for sufficiently sparse solutions $\bar{x}$ to $Ax = b$, the steepest subgradient descent algorithm with exact line search (Algorithm 5.4) recovers $\bar{x}$ exactly. Before we continue, however, we must specify precisely what is meant by “sufficiently sparse.”