
4.3 Convergence for Methods with Cyclic Order

In this section, we discuss convergence under the cyclic order. We consider a randomized order in the next section. We focus on the sequence $\{x_k\}$ rather than $\{z_k\}$, which need not lie within $X$ in the case of iterations (4.24) and (4.25) when $X \ne \Re^n$. In summary, the idea is to show that the effect of taking subgradients of $f_i$ or $h_i$ at points near $x_k$ (e.g., at $z_k$ rather than at $x_k$) is inconsequential and diminishes as the stepsize $\alpha_k$ becomes smaller, as long as some subgradients relevant to the algorithms are uniformly bounded in norm by some constant. This is similar to the convergence mechanism of incremental gradient methods described in Section 4.2. We use the following assumptions throughout the present section.
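To make the cyclic component order concrete, the following Python sketch mirrors the two-step structure used in the proof of Proposition 4.2 below: a proximal step on $f_{i_k}$ producing $z_k$, followed by the projected subgradient step $x_{k+1} = P_X\bigl(z_k - \alpha_k \tilde\nabla h_{i_k}(z_k)\bigr)$, with $i_k$ cycling through $1, \ldots, m$. It is only an illustration of the component order; the precise definitions of iterations (4.23)-(4.25) are given earlier in the chapter, and the helpers `prox_f`, `subgrad_h`, and `project_X` are hypothetical, assumed supplied by the user.

```python
import numpy as np

def cyclic_incremental_method(x0, prox_f, subgrad_h, project_X, alpha, num_cycles):
    """Sketch of a cyclic incremental proximal/subgradient iteration.

    prox_f[i](x, a)  -- proximal operator of f_i with stepsize a (hypothetical helper)
    subgrad_h[i](z)  -- some subgradient of h_i at z (hypothetical helper)
    project_X(u)     -- Euclidean projection onto the constraint set X
    alpha(k)         -- stepsize used at iteration k
    """
    x = np.asarray(x0, dtype=float)
    m = len(prox_f)
    k = 0
    for _ in range(num_cycles):
        for i in range(m):                        # cyclic order: components 1, ..., m in each cycle
            z = prox_f[i](x, alpha(k))            # proximal step on f_i, yielding z_k
            x = project_X(z - alpha(k) * subgrad_h[i](z))  # subgradient step on h_i, then project
            k += 1
    return x
```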

Assumption 4.1 (For iterations (4.23) and (4.24)). There is a constant $c \in \Re$ such that for all $k$
$$\max\bigl\{\|\tilde\nabla f_{i_k}(z_k)\|,\ \|\tilde\nabla h_{i_k}(z_k)\|\bigr\} \le c. \tag{4.26}$$
Furthermore, for all $k$ that mark the beginning of a cycle (i.e., all $k > 0$ with $i_k = 1$), we have for all $j = 1, \ldots, m$:
$$\max\bigl\{f_j(x_k) - f_j(z_{k+j-1}),\ h_j(x_k) - h_j(z_{k+j-1})\bigr\} \le c\,\|x_k - z_{k+j-1}\|. \tag{4.27}$$

Assumption 4.2 (For iteration (4.25)). There is a constant $c$ such that for all $k$

$$\max\bigl\{\|\tilde\nabla f_{i_k}(x_{k+1})\|,\ \|\tilde\nabla h_{i_k}(x_k)\|\bigr\} \le c. \tag{4.28}$$

Furthermore, for all $k$ that mark the beginning of a cycle (i.e., all $k > 0$ with $i_k = 1$), we have for all $j = 1, \ldots, m$:
$$\max\bigl\{f_j(x_k) - f_j(x_{k+j-1}),\ h_j(x_k) - h_j(x_{k+j-1})\bigr\} \le c\,\|x_k - x_{k+j-1}\|, \tag{4.29}$$
$$f_j(x_{k+j-1}) - f_j(x_{k+j}) \le c\,\|x_{k+j-1} - x_{k+j}\|. \tag{4.30}$$

The condition (4.27) is satisfied if for each $i$ and $k$, there is a subgradient of $f_i$ at $x_k$ and a subgradient of $h_i$ at $x_k$ whose norms are bounded by $c$.

Conditions that imply the preceding assumptions are:

(a) For algorithm (4.23): $f_i$ and $h_i$ are Lipschitz continuous over the set $X$.

(b) For algorithms (4.24) and (4.25): $f_i$ and $h_i$ are Lipschitz continuous over the entire space $\Re^n$.

(c) For algorithms (4.23), (4.24), and (4.25): $f_i$ and $h_i$ are polyhedral (this is a special case of (a) and (b)).

(d) The sequences $\{x_k\}$ and $\{z_k\}$ are bounded, since then $f_i$ and $h_i$, being real-valued and convex, are Lipschitz continuous over any bounded set that contains $\{x_k\}$ and $\{z_k\}$ (see, e.g., Bertsekas (2009, Proposition 5.4.2)).

The following proposition provides a key estimate that reveals the convergence mechanism of our methods.

Proposition 4.2. Let $\{x_k\}$ be the sequence generated by any one of the algorithms (4.23)-(4.25), with a cyclic order of component selection. Then for all $y \in X$ and all $k$ that mark the beginning of a cycle (i.e., all $k$ with $i_k = 1$), we have
$$\|x_{k+m} - y\|^2 \le \|x_k - y\|^2 - 2\alpha_k\bigl(F(x_k) - F(y)\bigr) + \alpha_k^2 \beta m^2 c^2, \tag{4.31}$$
where $\beta = \frac{1}{m} + 4$ in the case of (4.23) and (4.24), and $\beta = \frac{5}{m} + 4$ in the case of (4.25).

Proof. We first prove the result for algorithms (4.23) and (4.24), and then indicate the modifications necessary for algorithm (4.25). Using Proposition 4.1(b), we have for all $y \in X$ and $k$,

$$\|z_k - y\|^2 \le \|x_k - y\|^2 - 2\alpha_k\bigl(f_{i_k}(z_k) - f_{i_k}(y)\bigr). \tag{4.32}$$

Also, using the nonexpansion property of the projection (i.e., $\|P_X(u) - P_X(v)\| \le \|u - v\|$ for all $u, v \in \Re^n$), the definition of subgradient, and (4.26), we obtain for all $y \in X$ and $k$:

$$\begin{aligned}
\|x_{k+1} - y\|^2 &= \bigl\|P_X\bigl(z_k - \alpha_k \tilde\nabla h_{i_k}(z_k)\bigr) - y\bigr\|^2\\
&\le \bigl\|z_k - \alpha_k \tilde\nabla h_{i_k}(z_k) - y\bigr\|^2\\
&\le \|z_k - y\|^2 - 2\alpha_k \tilde\nabla h_{i_k}(z_k)'(z_k - y) + \alpha_k^2 \bigl\|\tilde\nabla h_{i_k}(z_k)\bigr\|^2\\
&\le \|z_k - y\|^2 - 2\alpha_k\bigl(h_{i_k}(z_k) - h_{i_k}(y)\bigr) + \alpha_k^2 c^2.
\end{aligned} \tag{4.33}$$

Combining (4.32) and (4.33), and using the definition $F_j = f_j + h_j$, we have
$$\begin{aligned}
\|x_{k+1} - y\|^2 &\le \|x_k - y\|^2 - 2\alpha_k\bigl(f_{i_k}(z_k) + h_{i_k}(z_k) - f_{i_k}(y) - h_{i_k}(y)\bigr) + \alpha_k^2 c^2\\
&= \|x_k - y\|^2 - 2\alpha_k\bigl(F_{i_k}(z_k) - F_{i_k}(y)\bigr) + \alpha_k^2 c^2.
\end{aligned} \tag{4.34}$$
Now let $k$ mark the beginning of a cycle (i.e., $i_k = 1$). Then, at iteration $k + j - 1$, $j = 1, \ldots, m$, the selected components are $\{f_j, h_j\}$, in view of the assumed cyclic order. We may thus replicate the preceding inequality with

$k$ replaced by $k+1, \ldots, k+m-1$, and add to obtain
$$\|x_{k+m} - y\|^2 \le \|x_k - y\|^2 - 2\alpha_k \sum_{j=1}^m\bigl(F_j(z_{k+j-1}) - F_j(y)\bigr) + m\alpha_k^2 c^2,$$
or, equivalently,
$$\|x_{k+m} - y\|^2 \le \|x_k - y\|^2 - 2\alpha_k\bigl(F(x_k) - F(y)\bigr) + m\alpha_k^2 c^2 + 2\alpha_k \sum_{j=1}^m\bigl(F_j(x_k) - F_j(z_{k+j-1})\bigr). \tag{4.35}$$
The remainder of the proof deals with appropriately bounding the last term above.

From (4.27), we have for $j = 1, \ldots, m$ that
$$F_j(x_k) - F_j(z_{k+j-1}) \le 2c\,\|x_k - z_{k+j-1}\|. \tag{4.36}$$
We also have
$$\|x_k - z_{k+j-1}\| \le \|x_k - x_{k+1}\| + \cdots + \|x_{k+j-2} - x_{k+j-1}\| + \|x_{k+j-1} - z_{k+j-1}\|, \tag{4.37}$$
and by the definition of algorithms (4.23) and (4.24), the nonexpansion property of the projection, and (4.26), each of the terms on the right-hand side above is bounded by $2\alpha_k c$, except for the last, which is bounded by $\alpha_k c$. Thus (4.37) yields $\|x_k - z_{k+j-1}\| \le \alpha_k(2j - 1)c$, which, together with (4.36), shows that
$$F_j(x_k) - F_j(z_{k+j-1}) \le 2\alpha_k c^2(2j - 1). \tag{4.38}$$
Combining (4.35) and (4.38), we have

$$\|x_{k+m} - y\|^2 \le \|x_k - y\|^2 - 2\alpha_k\bigl(F(x_k) - F(y)\bigr) + m\alpha_k^2 c^2 + 4\alpha_k^2 c^2 \sum_{j=1}^m (2j - 1),$$
and finally
$$\|x_{k+m} - y\|^2 \le \|x_k - y\|^2 - 2\alpha_k\bigl(F(x_k) - F(y)\bigr) + m\alpha_k^2 c^2 + 4\alpha_k^2 c^2 m^2,$$
which is of the form (4.31) with $\beta = \frac{1}{m} + 4$.

For algorithm (4.25), a similar argument goes through using Assumption 4.2. In place of (4.32), using the nonexpansion property of the projection, the definition of subgradient, and (4.28), we obtain, for all $y \in X$ and $k \ge 0$,

$$\|z_k - y\|^2 \le \|x_k - y\|^2 - 2\alpha_k\bigl(h_{i_k}(x_k) - h_{i_k}(y)\bigr) + \alpha_k^2 c^2. \tag{4.39}$$


In place of (4.33), using Proposition 4.1(b), we have
$$\|x_{k+1} - y\|^2 \le \|z_k - y\|^2 - 2\alpha_k\bigl(f_{i_k}(x_{k+1}) - f_{i_k}(y)\bigr). \tag{4.40}$$
Combining these equations, in analogy with (4.34), we obtain
$$\|x_{k+1} - y\|^2 \le \|x_k - y\|^2 - 2\alpha_k\bigl(F_{i_k}(x_k) - F_{i_k}(y)\bigr) - 2\alpha_k\bigl(f_{i_k}(x_{k+1}) - f_{i_k}(x_k)\bigr) + \alpha_k^2 c^2. \tag{4.41}$$
As earlier, we let $k$ mark the beginning of a cycle, replicate the preceding inequality with $k$ replaced by $k+1, \ldots, k+m-1$, and add, to obtain
$$\|x_{k+m} - y\|^2 \le \|x_k - y\|^2 - 2\alpha_k\bigl(F(x_k) - F(y)\bigr) + m\alpha_k^2 c^2 + 2\alpha_k \sum_{j=1}^m\bigl(F_j(x_k) - F_j(x_{k+j-1})\bigr) + 2\alpha_k \sum_{j=1}^m\bigl(f_j(x_{k+j-1}) - f_j(x_{k+j})\bigr). \tag{4.42}$$

We now bound the last two terms in the preceding relation, using Assumption 4.2. From (4.29), we have
$$F_j(x_k) - F_j(x_{k+j-1}) \le 2c\,\|x_k - x_{k+j-1}\| \le 2c\bigl(\|x_k - x_{k+1}\| + \cdots + \|x_{k+j-2} - x_{k+j-1}\|\bigr),$$
and since by (4.28) and the definition of the algorithm, each norm term on the right-hand side above is bounded by $2\alpha_k c$,
$$F_j(x_k) - F_j(x_{k+j-1}) \le 4\alpha_k c^2(j - 1).$$

Also, from (4.28) and (4.30) and the nonexpansion property of the projection, we have
$$f_j(x_{k+j-1}) - f_j(x_{k+j}) \le c\,\|x_{k+j-1} - x_{k+j}\| \le 2\alpha_k c^2.$$
Combining the preceding relations and adding, we obtain
$$2\alpha_k \sum_{j=1}^m\bigl(F_j(x_k) - F_j(x_{k+j-1})\bigr) + 2\alpha_k \sum_{j=1}^m\bigl(f_j(x_{k+j-1}) - f_j(x_{k+j})\bigr) \le 8\alpha_k^2 c^2 \sum_{j=1}^m (j - 1) + 4m\alpha_k^2 c^2 \le 4\alpha_k^2 c^2 m^2 + 4m\alpha_k^2 c^2,$$
which, together with (4.42), yields (4.31) with $\beta = \frac{5}{m} + 4$.

Among other things, Proposition 4.2 guarantees that with a cyclic order, given the iterate $x_k$ at the start of a cycle and any point $y \in X$ having lower cost than $x_k$ (for example an optimal point), the algorithm yields a point $x_{k+m}$ at the end of the cycle that will be closer to $y$ than $x_k$, provided the stepsize $\alpha_k$ is less than
$$\frac{2\bigl(F(x_k) - F(y)\bigr)}{\beta m^2 c^2}.$$
In particular, for any $\epsilon > 0$ and assuming that there exists an optimal solution $x^*$, either we are within $\alpha_k \beta m^2 c^2/2 + \epsilon$ of the optimum,
$$F(x_k) \le F(x^*) + \frac{\alpha_k \beta m^2 c^2}{2} + \epsilon,$$
or the squared distance to the optimum will be strictly decreased by at least $2\alpha_k\epsilon$:
$$\|x_{k+m} - x^*\|^2 < \|x_k - x^*\|^2 - 2\alpha_k\epsilon.$$
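To see how this dichotomy follows from (4.31), set $y = x^*$ and note that if the first alternative fails, i.e.,
$$F(x_k) > F(x^*) + \frac{\alpha_k \beta m^2 c^2}{2} + \epsilon,$$
then
$$-2\alpha_k\bigl(F(x_k) - F(x^*)\bigr) + \alpha_k^2 \beta m^2 c^2 < -2\alpha_k\epsilon,$$
so (4.31) gives $\|x_{k+m} - x^*\|^2 < \|x_k - x^*\|^2 - 2\alpha_k\epsilon$, which is the second alternative.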

Thus, using Proposition 4.2, we can provide various types of convergence results. As an example, for a constant stepsize ($\alpha_k \equiv \alpha$), convergence can be established to a neighborhood of the optimum, which shrinks to 0 as $\alpha \to 0$, as stated in the following proposition. Its proof and all the proofs of the propositions that follow are given in Bertsekas (2010).

Proposition 4.3. Let $\{x_k\}$ be the sequence generated by any one of the algorithms (4.23)-(4.25), with a cyclic order of component selection, and let the stepsize $\alpha_k$ be fixed at some positive constant $\alpha$.

(a) If $F^* = -\infty$, then
$$\liminf_{k\to\infty} F(x_k) = F^*.$$

(b) If $F^* > -\infty$, then
$$\liminf_{k\to\infty} F(x_k) \le F^* + \frac{\alpha \beta m^2 c^2}{2},$$

where $c$ and $\beta$ are the constants of Proposition 4.2.
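As a numerical illustration of Proposition 4.3 (a toy problem not taken from the text), consider $F(x) = \sum_{i=1}^m |x - b_i|$ on $X = \Re$, with $f_i(x) = |x - b_i|$ and $h_i \equiv 0$, so that each incremental step is a one-dimensional proximal step (a soft-thresholding move toward $b_i$) and $F$ is minimized at the median of the $b_i$. Running the cyclic method with several constant stepsizes suggests that the cost gap at the end of the run shrinks roughly in proportion to $\alpha$, in line with the threshold $\alpha\beta m^2 c^2/2$.

```python
import numpy as np

rng = np.random.default_rng(0)
b = rng.normal(size=20)                       # component "targets"; F is minimized at their median
m = len(b)
F = lambda x: float(np.sum(np.abs(x - b)))    # F(x) = sum_i |x - b_i|
F_star = F(np.median(b))                      # optimal value

def final_cost_gap(alpha, num_cycles=500):
    x = 10.0                                  # start far from the median
    for _ in range(num_cycles):
        for i in range(m):                    # cyclic order of component selection
            # proximal step on f_i(x) = |x - b_i|: move toward b_i by at most alpha
            x = b[i] + np.sign(x - b[i]) * max(abs(x - b[i]) - alpha, 0.0)
    return F(x) - F_star

for alpha in (1e-1, 1e-2, 1e-3):
    print(f"alpha = {alpha:g}:  F(x_k) - F* ~ {final_cost_gap(alpha):.4f}")
```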

The next proposition gives an estimate of the number of iterations needed to guarantee a given level of optimality up to the threshold tolerance $\alpha\beta m^2 c^2/2$ of the preceding proposition.

Proposition 4.4. Assume that $X^*$ is nonempty. Let $\{x_k\}$ be a sequence generated as in Proposition 4.3. Then, for $\epsilon > 0$ we have
$$\min_{0 \le k \le N} F(x_k) \le F^* + \frac{\alpha \beta m^2 c^2 + \epsilon}{2}, \tag{4.43}$$
where $N$ is given by
$$N = m\left\lfloor \frac{\operatorname{dist}(x_0; X^*)^2}{\alpha\epsilon} \right\rfloor. \tag{4.44}$$

According to Proposition 4.4, to achieve a cost function value within $O(\epsilon)$ of the optimal, the term $\alpha\beta m^2 c^2$ must also be of order $O(\epsilon)$, so $\alpha$ must be of order $O(\epsilon/m^2 c^2)$, and from (4.44), the number of necessary iterations $N$ is $O(m^3 c^2/\epsilon^2)$ and the number of necessary cycles is $O\bigl((mc)^2/\epsilon^2\bigr)$. This is the same type of estimate as for the nonincremental subgradient method (i.e., $O(1/\epsilon^2)$, counting a cycle as one iteration of the nonincremental method, and viewing $mc$ as a Lipschitz constant for the entire cost function $F$), and does not reveal any advantage for the incremental methods given here. However, in the next section, we demonstrate a much more favorable iteration complexity estimate for the incremental methods that use a randomized order of component selection.
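To spell out this arithmetic (using the quantities of Propositions 4.3 and 4.4, and ignoring the effect of rounding in (4.44)): taking $\alpha$ of order $\epsilon/(\beta m^2 c^2)$ makes the threshold term in (4.43) of order $\epsilon$, and substituting into (4.44) gives
$$N = m\left\lfloor \frac{\operatorname{dist}(x_0; X^*)^2}{\alpha\epsilon} \right\rfloor \approx \frac{\beta m^3 c^2 \operatorname{dist}(x_0; X^*)^2}{\epsilon^2} = O\!\left(\frac{m^3 c^2}{\epsilon^2}\right),$$
so the number of cycles is $N/m = O\bigl((mc)^2/\epsilon^2\bigr)$.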

Exact Convergence for a Diminishing Stepsize

We can also obtain an exact convergence result for the case where the stepsize $\alpha_k$ diminishes to zero. The idea is that with a constant stepsize $\alpha$ we can get to within an $O(\alpha)$-neighborhood of the optimum, as shown above, so with a diminishing stepsize $\alpha_k$, we should be able to reach an arbitrarily small neighborhood of the optimum. However, for this to happen, $\alpha_k$ should not be reduced too fast, and should satisfy $\sum_{k=0}^\infty \alpha_k = \infty$ (so that the method can "travel" infinitely far if necessary).
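For instance (a minimal sketch, not from the text), the harmonic schedule below satisfies both $\sum_{k} \alpha_k = \infty$ and $\sum_{k} \alpha_k^2 < \infty$, so it meets all the stepsize conditions of Proposition 4.5 below; the constant `alpha0` is a hypothetical tuning parameter, and such a schedule could be passed as the `alpha` argument of the sketch given at the start of this section.

```python
def diminishing_stepsize(k, alpha0=1.0):
    """Harmonic stepsize schedule: alpha_k = alpha0 / (k + 1).

    The partial sums of alpha_k diverge (so the method can travel arbitrarily far),
    while the partial sums of alpha_k**2 converge, as required in Proposition 4.5
    for convergence of the iterates {x_k}.
    """
    return alpha0 / (k + 1)
```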

Proposition 4.5. Let $\{x_k\}$ be the sequence generated by any one of the algorithms (4.23)-(4.25), with a cyclic order of component selection, and let the stepsize $\alpha_k$ satisfy
$$\lim_{k\to\infty} \alpha_k = 0, \qquad \sum_{k=0}^\infty \alpha_k = \infty.$$
Then,
$$\liminf_{k\to\infty} F(x_k) = F^*.$$
Furthermore, if $X^*$ is nonempty and
$$\sum_{k=0}^\infty \alpha_k^2 < \infty,$$
then $\{x_k\}$ converges to some $x^* \in X^*$.
