
4.3 Convergence for Methods with Cyclic Order

In this section, we discuss convergence under the cyclic order. We consider a randomized order in the next section. We focus on the sequence $\{x_k\}$ rather than $\{z_k\}$, which need not lie within $X$ in the case of iterations (4.24) and (4.25) when $X \ne \Re^n$. In summary, the idea is to show that the effect of taking subgradients of $f_i$ or $h_i$ at points near $x_k$ (e.g., at $z_k$ rather than at $x_k$) is inconsequential and diminishes as the stepsize $\alpha_k$ becomes smaller, as long as some subgradients relevant to the algorithms are uniformly bounded in norm by some constant. This is similar to the convergence mechanism of incremental gradient methods described in Section 4.2. We use the following assumptions throughout the present section.
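To make the cyclic component order concrete, the following Python sketch mirrors the two-step structure used in the proof of Proposition 4.2 below: a proximal step on $f_{i_k}$ producing $z_k$, followed by the projected subgradient step $x_{k+1} = P_X\bigl(z_k - \alpha_k \tilde\nabla h_{i_k}(z_k)\bigr)$, with $i_k$ cycling through $1, \ldots, m$. It is only an illustration of the component order; the precise definitions of iterations (4.23)-(4.25) are given earlier in the chapter, and the helpers `prox_f`, `subgrad_h`, and `project_X` are hypothetical, assumed supplied by the user.

```python
import numpy as np

def cyclic_incremental_method(x0, prox_f, subgrad_h, project_X, alpha, num_cycles):
    """Sketch of a cyclic incremental proximal/subgradient iteration.

    prox_f[i](x, a)  -- proximal operator of f_i with stepsize a (hypothetical helper)
    subgrad_h[i](z)  -- some subgradient of h_i at z (hypothetical helper)
    project_X(u)     -- Euclidean projection onto the constraint set X
    alpha(k)         -- stepsize used at iteration k
    """
    x = np.asarray(x0, dtype=float)
    m = len(prox_f)
    k = 0
    for _ in range(num_cycles):
        for i in range(m):                        # cyclic order: components 1, ..., m in each cycle
            z = prox_f[i](x, alpha(k))            # proximal step on f_i, yielding z_k
            x = project_X(z - alpha(k) * subgrad_h[i](z))  # subgradient step on h_i, then project
            k += 1
    return x
```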

Assumption 4.1 (For iterations (4.23) and (4.24)). There is a constant $c \in \Re$ such that for all $k$
$$\max\bigl\{\|\tilde\nabla f_{i_k}(z_k)\|,\ \|\tilde\nabla h_{i_k}(z_k)\|\bigr\} \le c. \tag{4.26}$$
Furthermore, for all $k$ that mark the beginning of a cycle (i.e., all $k > 0$ with $i_k = 1$), we have for all $j = 1, \ldots, m$:
$$\max\bigl\{f_j(x_k) - f_j(z_{k+j-1}),\ h_j(x_k) - h_j(z_{k+j-1})\bigr\} \le c\,\|x_k - z_{k+j-1}\|. \tag{4.27}$$

Assumption 4.2 (For iteration (4.25)). There is a constant $c$ such that for all $k$

$$\max\bigl\{\|\tilde\nabla f_{i_k}(x_{k+1})\|,\ \|\tilde\nabla h_{i_k}(x_k)\|\bigr\} \le c. \tag{4.28}$$

Furthermore, for all $k$ that mark the beginning of a cycle (i.e., all $k > 0$ with $i_k = 1$), we have for all $j = 1, \ldots, m$:
$$\max\bigl\{f_j(x_k) - f_j(x_{k+j-1}),\ h_j(x_k) - h_j(x_{k+j-1})\bigr\} \le c\,\|x_k - x_{k+j-1}\|, \tag{4.29}$$
$$f_j(x_{k+j-1}) - f_j(x_{k+j}) \le c\,\|x_{k+j-1} - x_{k+j}\|. \tag{4.30}$$

The condition (4.27) is satisfied if for each $i$ and $k$, there is a subgradient of $f_i$ at $x_k$ and a subgradient of $h_i$ at $x_k$ whose norms are bounded by $c$.

Conditions that imply the preceding assumptions are:

(a) For algorithm (4.23): $f_i$ and $h_i$ are Lipschitz continuous over the set $X$.

(b) For algorithms (4.24) and (4.25): $f_i$ and $h_i$ are Lipschitz continuous over the entire space $\Re^n$.

(c) For algorithms (4.23), (4.24), and (4.25): $f_i$ and $h_i$ are polyhedral (this is a special case of (a) and (b)).

(d) The sequences $\{x_k\}$ and $\{z_k\}$ are bounded, since then $f_i$ and $h_i$, being real-valued and convex, are Lipschitz continuous over any bounded set that contains $\{x_k\}$ and $\{z_k\}$ (see, e.g., Bertsekas (2009, Proposition 5.4.2)).

The following proposition provides a key estimate that reveals the convergence mechanism of our methods.

Proposition 4.2. Let $\{x_k\}$ be the sequence generated by any one of the algorithms (4.23)-(4.25), with a cyclic order of component selection. Then for all $y \in X$ and all $k$ that mark the beginning of a cycle (i.e., all $k$ with $i_k = 1$), we have
$$\|x_{k+m} - y\|^2 \le \|x_k - y\|^2 - 2\alpha_k\bigl(F(x_k) - F(y)\bigr) + \alpha_k^2 \beta m^2 c^2, \tag{4.31}$$
where $\beta = \frac{1}{m} + 4$ in the case of (4.23) and (4.24), and $\beta = \frac{5}{m} + 4$ in the case of (4.25).

Proof. We first prove the result for algorithms (4.23) and (4.24), and then indicate the modifications necessary for algorithm (4.25). Using Proposition 4.1(b), we have for all $y \in X$ and $k$,

$$\|z_k - y\|^2 \le \|x_k - y\|^2 - 2\alpha_k\bigl(f_{i_k}(z_k) - f_{i_k}(y)\bigr). \tag{4.32}$$

Also, using the nonexpansion property of the projection (i.e., $\|P_X(u) - P_X(v)\| \le \|u - v\|$ for all $u, v \in \Re^n$), the definition of subgradient, and (4.26), we obtain for all $y \in X$ and $k$:

$$\begin{aligned}
\|x_{k+1} - y\|^2 &= \bigl\|P_X\bigl(z_k - \alpha_k \tilde\nabla h_{i_k}(z_k)\bigr) - y\bigr\|^2\\
&\le \bigl\|z_k - \alpha_k \tilde\nabla h_{i_k}(z_k) - y\bigr\|^2\\
&\le \|z_k - y\|^2 - 2\alpha_k \tilde\nabla h_{i_k}(z_k)'(z_k - y) + \alpha_k^2 \bigl\|\tilde\nabla h_{i_k}(z_k)\bigr\|^2\\
&\le \|z_k - y\|^2 - 2\alpha_k\bigl(h_{i_k}(z_k) - h_{i_k}(y)\bigr) + \alpha_k^2 c^2.
\end{aligned} \tag{4.33}$$

Combining (4.32) and (4.33), and using the definition $F_j = f_j + h_j$, we have
$$\begin{aligned}
\|x_{k+1} - y\|^2 &\le \|x_k - y\|^2 - 2\alpha_k\bigl(f_{i_k}(z_k) + h_{i_k}(z_k) - f_{i_k}(y) - h_{i_k}(y)\bigr) + \alpha_k^2 c^2\\
&= \|x_k - y\|^2 - 2\alpha_k\bigl(F_{i_k}(z_k) - F_{i_k}(y)\bigr) + \alpha_k^2 c^2.
\end{aligned} \tag{4.34}$$
Now let $k$ mark the beginning of a cycle (i.e., $i_k = 1$). Then, at iteration $k + j - 1$, $j = 1, \ldots, m$, the selected components are $\{f_j, h_j\}$, in view of the assumed cyclic order. We may thus replicate the preceding inequality with

$k$ replaced by $k+1, \ldots, k+m-1$, and add to obtain
$$\|x_{k+m} - y\|^2 \le \|x_k - y\|^2 - 2\alpha_k \sum_{j=1}^m\bigl(F_j(z_{k+j-1}) - F_j(y)\bigr) + m\alpha_k^2 c^2,$$
or, equivalently,
$$\|x_{k+m} - y\|^2 \le \|x_k - y\|^2 - 2\alpha_k\bigl(F(x_k) - F(y)\bigr) + m\alpha_k^2 c^2 + 2\alpha_k \sum_{j=1}^m\bigl(F_j(x_k) - F_j(z_{k+j-1})\bigr). \tag{4.35}$$
The remainder of the proof deals with appropriately bounding the last term above.

From (4.27), we have for $j = 1, \ldots, m$ that
$$F_j(x_k) - F_j(z_{k+j-1}) \le 2c\,\|x_k - z_{k+j-1}\|. \tag{4.36}$$
We also have
$$\|x_k - z_{k+j-1}\| \le \|x_k - x_{k+1}\| + \cdots + \|x_{k+j-2} - x_{k+j-1}\| + \|x_{k+j-1} - z_{k+j-1}\|, \tag{4.37}$$
and by the definition of algorithms (4.23) and (4.24), the nonexpansion property of the projection, and (4.26), each of the terms on the right-hand side above is bounded by $2\alpha_k c$, except for the last, which is bounded by $\alpha_k c$. Thus (4.37) yields $\|x_k - z_{k+j-1}\| \le \alpha_k(2j - 1)c$, which, together with (4.36), shows that
$$F_j(x_k) - F_j(z_{k+j-1}) \le 2\alpha_k c^2(2j - 1). \tag{4.38}$$
Combining (4.35) and (4.38), we have

$$\|x_{k+m} - y\|^2 \le \|x_k - y\|^2 - 2\alpha_k\bigl(F(x_k) - F(y)\bigr) + m\alpha_k^2 c^2 + 4\alpha_k^2 c^2 \sum_{j=1}^m (2j - 1),$$
and finally
$$\|x_{k+m} - y\|^2 \le \|x_k - y\|^2 - 2\alpha_k\bigl(F(x_k) - F(y)\bigr) + m\alpha_k^2 c^2 + 4\alpha_k^2 c^2 m^2,$$
which is of the form (4.31) with $\beta = \frac{1}{m} + 4$.

For algorithm (4.25), a similar argument goes through using Assumption 4.2. In place of (4.32), using the nonexpansion property of the projection, the definition of subgradient, and (4.28), we obtain, for all $y \in X$ and $k \ge 0$,

$$\|z_k - y\|^2 \le \|x_k - y\|^2 - 2\alpha_k\bigl(h_{i_k}(x_k) - h_{i_k}(y)\bigr) + \alpha_k^2 c^2. \tag{4.39}$$


In place of (4.33), using Proposition 4.1(b), we have
$$\|x_{k+1} - y\|^2 \le \|z_k - y\|^2 - 2\alpha_k\bigl(f_{i_k}(x_{k+1}) - f_{i_k}(y)\bigr). \tag{4.40}$$
Combining these equations, in analogy with (4.34), we obtain
$$\|x_{k+1} - y\|^2 \le \|x_k - y\|^2 - 2\alpha_k\bigl(F_{i_k}(x_k) - F_{i_k}(y)\bigr) - 2\alpha_k\bigl(f_{i_k}(x_{k+1}) - f_{i_k}(x_k)\bigr) + \alpha_k^2 c^2. \tag{4.41}$$
As earlier, we let $k$ mark the beginning of a cycle, replicate the preceding inequality with $k$ replaced by $k+1, \ldots, k+m-1$, and add, to obtain
$$\|x_{k+m} - y\|^2 \le \|x_k - y\|^2 - 2\alpha_k\bigl(F(x_k) - F(y)\bigr) + m\alpha_k^2 c^2 + 2\alpha_k \sum_{j=1}^m\bigl(F_j(x_k) - F_j(x_{k+j-1})\bigr) + 2\alpha_k \sum_{j=1}^m\bigl(f_j(x_{k+j-1}) - f_j(x_{k+j})\bigr). \tag{4.42}$$

We now bound the last two terms in the preceding relation, using Assumption 4.2. From (4.29), we have
$$F_j(x_k) - F_j(x_{k+j-1}) \le 2c\,\|x_k - x_{k+j-1}\| \le 2c\bigl(\|x_k - x_{k+1}\| + \cdots + \|x_{k+j-2} - x_{k+j-1}\|\bigr),$$
and since by (4.28) and the definition of the algorithm, each norm term on the right-hand side above is bounded by $2\alpha_k c$,
$$F_j(x_k) - F_j(x_{k+j-1}) \le 4\alpha_k c^2(j - 1).$$

Also, from (4.28) and (4.30) and the nonexpansion property of the projection, we have
$$f_j(x_{k+j-1}) - f_j(x_{k+j}) \le c\,\|x_{k+j-1} - x_{k+j}\| \le 2\alpha_k c^2.$$
Combining the preceding relations and adding, we obtain
$$2\alpha_k \sum_{j=1}^m\bigl(F_j(x_k) - F_j(x_{k+j-1})\bigr) + 2\alpha_k \sum_{j=1}^m\bigl(f_j(x_{k+j-1}) - f_j(x_{k+j})\bigr) \le 8\alpha_k^2 c^2 \sum_{j=1}^m (j - 1) + 4m\alpha_k^2 c^2 \le 4\alpha_k^2 c^2 m^2 + 4m\alpha_k^2 c^2,$$
which, together with (4.42), yields (4.31) with $\beta = \frac{5}{m} + 4$.

Among other things, Proposition 4.2 guarantees that with a cyclic order, given the iterate $x_k$ at the start of a cycle and any point $y \in X$ having lower cost than $x_k$ (for example an optimal point), the algorithm yields a point $x_{k+m}$ at the end of the cycle that will be closer to $y$ than $x_k$, provided the stepsize $\alpha_k$ is less than
$$\frac{2\bigl(F(x_k) - F(y)\bigr)}{\beta m^2 c^2}.$$
In particular, for any $\epsilon > 0$ and assuming that there exists an optimal solution $x^*$, either we are within $\alpha_k \beta m^2 c^2/2 + \epsilon$ of the optimum,
$$F(x_k) \le F(x^*) + \frac{\alpha_k \beta m^2 c^2}{2} + \epsilon,$$
or the squared distance to the optimum will be strictly decreased by at least $2\alpha_k\epsilon$:
$$\|x_{k+m} - x^*\|^2 < \|x_k - x^*\|^2 - 2\alpha_k\epsilon.$$
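To see how this dichotomy follows from (4.31), set $y = x^*$ and note that if the first alternative fails, i.e.,
$$F(x_k) > F(x^*) + \frac{\alpha_k \beta m^2 c^2}{2} + \epsilon,$$
then
$$-2\alpha_k\bigl(F(x_k) - F(x^*)\bigr) + \alpha_k^2 \beta m^2 c^2 < -2\alpha_k\epsilon,$$
so (4.31) gives $\|x_{k+m} - x^*\|^2 < \|x_k - x^*\|^2 - 2\alpha_k\epsilon$, which is the second alternative.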

Thus, using Proposition 4.2, we can provide various types of convergence results. As an example, for a constant stepsize ($\alpha_k \equiv \alpha$), convergence can be established to a neighborhood of the optimum, which shrinks to 0 as $\alpha \to 0$, as stated in the following proposition. Its proof and all the proofs of the propositions that follow are given in Bertsekas (2010).

Proposition 4.3. Let $\{x_k\}$ be the sequence generated by any one of the algorithms (4.23)-(4.25), with a cyclic order of component selection, and let the stepsize $\alpha_k$ be fixed at some positive constant $\alpha$.

(a) If $F^* = -\infty$, then
$$\liminf_{k\to\infty} F(x_k) = F^*.$$

(b) If $F^* > -\infty$, then
$$\liminf_{k\to\infty} F(x_k) \le F^* + \frac{\alpha \beta m^2 c^2}{2},$$

where $c$ and $\beta$ are the constants of Proposition 4.2.
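As a numerical illustration of Proposition 4.3 (a toy problem not taken from the text), consider $F(x) = \sum_{i=1}^m |x - b_i|$ on $X = \Re$, with $f_i(x) = |x - b_i|$ and $h_i \equiv 0$, so that each incremental step is a one-dimensional proximal step (a soft-thresholding move toward $b_i$) and $F$ is minimized at the median of the $b_i$. Running the cyclic method with several constant stepsizes suggests that the cost gap at the end of the run shrinks roughly in proportion to $\alpha$, in line with the threshold $\alpha\beta m^2 c^2/2$.

```python
import numpy as np

rng = np.random.default_rng(0)
b = rng.normal(size=20)                       # component "targets"; F is minimized at their median
m = len(b)
F = lambda x: float(np.sum(np.abs(x - b)))    # F(x) = sum_i |x - b_i|
F_star = F(np.median(b))                      # optimal value

def final_cost_gap(alpha, num_cycles=500):
    x = 10.0                                  # start far from the median
    for _ in range(num_cycles):
        for i in range(m):                    # cyclic order of component selection
            # proximal step on f_i(x) = |x - b_i|: move toward b_i by at most alpha
            x = b[i] + np.sign(x - b[i]) * max(abs(x - b[i]) - alpha, 0.0)
    return F(x) - F_star

for alpha in (1e-1, 1e-2, 1e-3):
    print(f"alpha = {alpha:g}:  F(x_k) - F* ~ {final_cost_gap(alpha):.4f}")
```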

The next proposition gives an estimate of the number of iterations needed to guarantee a given level of optimality up to the threshold tolerance $\alpha\beta m^2 c^2/2$ of the preceding proposition.

Proposition 4.4. Assume that $X^*$ is nonempty. Let $\{x_k\}$ be a sequence generated as in Proposition 4.3. Then, for $\epsilon > 0$ we have
$$\min_{0 \le k \le N} F(x_k) \le F^* + \frac{\alpha \beta m^2 c^2 + \epsilon}{2}, \tag{4.43}$$
where $N$ is given by
$$N = m\left\lfloor \frac{\operatorname{dist}(x_0; X^*)^2}{\alpha\epsilon} \right\rfloor. \tag{4.44}$$

According to Proposition 4.4, to achieve a cost function value within $O(\epsilon)$ of the optimal, the term $\alpha\beta m^2 c^2$ must also be of order $O(\epsilon)$, so $\alpha$ must be of order $O(\epsilon/m^2 c^2)$, and from (4.44), the number of necessary iterations $N$ is $O(m^3 c^2/\epsilon^2)$ and the number of necessary cycles is $O\bigl((mc)^2/\epsilon^2\bigr)$. This is the same type of estimate as for the nonincremental subgradient method (i.e., $O(1/\epsilon^2)$, counting a cycle as one iteration of the nonincremental method, and viewing $mc$ as a Lipschitz constant for the entire cost function $F$), and does not reveal any advantage for the incremental methods given here. However, in the next section, we demonstrate a much more favorable iteration complexity estimate for the incremental methods that use a randomized order of component selection.
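To spell out this arithmetic (using the quantities of Propositions 4.3 and 4.4, and ignoring the effect of rounding in (4.44)): taking $\alpha$ of order $\epsilon/(\beta m^2 c^2)$ makes the threshold term in (4.43) of order $\epsilon$, and substituting into (4.44) gives
$$N = m\left\lfloor \frac{\operatorname{dist}(x_0; X^*)^2}{\alpha\epsilon} \right\rfloor \approx \frac{\beta m^3 c^2 \operatorname{dist}(x_0; X^*)^2}{\epsilon^2} = O\!\left(\frac{m^3 c^2}{\epsilon^2}\right),$$
so the number of cycles is $N/m = O\bigl((mc)^2/\epsilon^2\bigr)$.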

Exact Convergence for a Diminishing Stepsize

We can also obtain an exact convergence result for the case where the stepsize $\alpha_k$ diminishes to zero. The idea is that with a constant stepsize $\alpha$ we can get to within an $O(\alpha)$-neighborhood of the optimum, as shown above, so with a diminishing stepsize $\alpha_k$, we should be able to reach an arbitrarily small neighborhood of the optimum. However, for this to happen, $\alpha_k$ should not be reduced too fast, and should satisfy $\sum_{k=0}^\infty \alpha_k = \infty$ (so that the method can "travel" infinitely far if necessary).
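For instance (a minimal sketch, not from the text), the harmonic schedule below satisfies both $\sum_{k} \alpha_k = \infty$ and $\sum_{k} \alpha_k^2 < \infty$, so it meets all the stepsize conditions of Proposition 4.5 below; the constant `alpha0` is a hypothetical tuning parameter, and such a schedule could be passed as the `alpha` argument of the sketch given at the start of this section.

```python
def diminishing_stepsize(k, alpha0=1.0):
    """Harmonic stepsize schedule: alpha_k = alpha0 / (k + 1).

    The partial sums of alpha_k diverge (so the method can travel arbitrarily far),
    while the partial sums of alpha_k**2 converge, as required in Proposition 4.5
    for convergence of the iterates {x_k}.
    """
    return alpha0 / (k + 1)
```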

Proposition 4.5. Let $\{x_k\}$ be the sequence generated by any one of the algorithms (4.23)-(4.25), with a cyclic order of component selection, and let the stepsize $\alpha_k$ satisfy
$$\lim_{k\to\infty} \alpha_k = 0, \qquad \sum_{k=0}^\infty \alpha_k = \infty.$$
Then,
$$\liminf_{k\to\infty} F(x_k) = F^*.$$
Furthermore, if $X^*$ is nonempty and
$$\sum_{k=0}^\infty \alpha_k^2 < \infty,$$
then $\{x_k\}$ converges to some $x^* \in X^*$.
