• Aucun résultat trouvé

Supplementary materials for “A Study on Truncated Newton Methods for Linear Classification”

N/A
N/A
Protected

Academic year: 2022

Partager "Supplementary materials for “A Study on Truncated Newton Methods for Linear Classification”"

Copied!
51
0
0

Texte intégral

(1)

Supplementary materials for “A Study on Truncated Newton Methods for Linear Classification”

Leonardo Galli [email protected]

Department of Information Engineering University of Florence

Via Santa Marta 3, Firenze, Italia

Chih-Jen Lin [email protected]

Department of Computer Science National Taiwan University Taipei, Taiwan, 106

We report here some equations from the original paper that might be useful in the rest of the supplementary materials file. In Table I we collected the notations we used in this paper.

The linear classification problem:

minwww f(www) = 1

2wwwTwww+CXl

i=1ξ(yiwwwTxxxi), (i) where (yi, xxxi), i= 1, ..., l are the training data, yi =±1 is the label, xxxi ∈ Rn is a feature vector,wwwTwww/2 is the regularization term, C >0 is a regularization parameter andξ(yiwwwTxxxi) is a generic LC1 (continuously differentiable with locally Lipschitz continuous gradient) convex loss function. WithwwwTwww, f is a LC1 strongly convex function and the minimumwww of f exists and is unique.

The following two losses respectively correspond to logistic regression (C2, i.e. twice continuously differentiable) and l2-loss linear SVM (LC1)

ξLR(ywwwTxxx) = log 1 + exp −ywwwTxxx

ξL2(ywwwTxxx) = (max(0,1−ywwwTxxx))2. (ii) The gradientand theHessian of the objective functionf:

g

gg:=∇f(www) =www+CXl

i=1ξ0(yiwwwTxxxi)yixxxi, H:=∇2f(www) =I+CXTDX,

(iii) where I is the identity matrix, X = [xxx1, . . . , xxxl]T is the data matrix and D is a diagonal matrix withDii00(yiwwwTxxxi). The Newton equation:

2f(wwwk)sss=−∇f(wwwk). (iv) Generictruncation criterion:

ratio(sssjk)≤ηk, (v) Armijo condition:

f(wwwkksssk)≤f(wwwk) +γωkgggTksssk, with 0< γ <1. (vi)

(2)

Algorithm I: Conjugate Gradient

Input: sss1k = 000∈Rn,ddd1k=−∇Q1=−gggk

1 for j=1, 2, . . . do

2 αjk= k∇Qjk2

d d djkTHkdddjk 3 sssj+1k =sssjkjkdddjk

4 ∇Qj+1=∇QjjkHkdddjk

5 if (v) is satified then

6 sssk =sssj+1k

7 returnsssk

8 βj = k∇Qk∇Qj+1k2

jk2

9 dddj+1k =−∇Qj+1jdddjk Wolfe condition:

∇f(wwwkksssk)Tsssk≥γgggkTsssk, with 0< γ <1. (vii) Level set:

L1 :={www∈Rn:f(www)≤f(www1)}. (viii) This lemma can be obtained directly from the structure of the losses and it will be needed in Section I.

Lemma I The Hessian matrix (iii) is bounded, i.e. there exist two constants M1, M2 >0 such that

M1 ≥ kHkk ≥M2 ∀k. (ix)

Proof By the definition of the losses functions (ii) we have that 0≤Dii≤2 (see Appendix A for details). Thus from (iii) we have that ∀k Hkis bounded. Furthermore, with the matrix I in (iii),Hk is bounded above zero.

I. Global and Local Convergence of TNCG In this section we relax f to be anyf ∈ LC1.

I.1 Global Convergence by Treating TNCG as a Common-Directions Algorithm

In Algorithm I we detail the instructions of the conjugate gradient method for solving a generic quadratic strongly convex function of the shapeQ(sss) =gggTksss+ 12sssTHksss. We define

∇Qj :=∇Q(sssjk), where∇Q(sss) =gggk+Hksss.

Lemma II Letsssk be the direction returned by Algorithm I and assume that it terminates aftermkiterations. Thensssk=Pmk

j=1αjkdddjk, wherePk={ddd1k, . . . dddmkk}are conjugate directions

(3)

Algorithm II:Truncated Newton Conjugate Gradient Input: www1 ∈Rn starting point

1 for k= 1,2, . . . do

2 compute the directionsssk by approximately solving (iv) with a CG method, until (v) is satisfied

3 compute a step lengthωk

4 wwwk+1=wwwkksssk

with respect toHk and{α1k, . . . , αmkk} are scalars defined to minimize minαααk

g

ggTksssk+1

2sssTkHksssk s.t. sssk=Pkαααk,

(x)

where Pk:= [ddd1k|. . .|dddmk]∈Rn×m andgggk∈ {ddd1k, . . . , dddmk}.

Proof From instructions of Algorithm I we directly get that sssk=sssmkk+1 =

mk

X

j=1

αjkdddjk.

In addition, from Theorem 5.2 of Nocedal and Wright (1999) and the fact thatsss1k= 000 we have thatsssk is the minimizer of (x) on the span of the conjugate directions {ddd1k, . . . dddmkk}.

This concludes the proof.

We now point out that in the framework (Lee et al., 2017), the total number of common- directions m is assumed to be fixed through the whole optimization procedure. Since the termination criterion (v) is employed in Algorithm I, in the TNCG case this number mk depends on the iterationk. On the other hand, it is possible to see that in the proofs of Lee et al. (2017), m is never really required to be fixed. In fact, as long as mis bounded, the problem (x) is well defined and all the properties required for global convergence follow from the fact thatwwwk is its solution, together with the fact thatgggk ∈ {ddd1k, . . . dddmkk}. Note that problem (x) is highly dependent onk, and the dimensionality ofαααkcan be seen as a detail of the internal procedure for obtainingwwwk. For these reasons, we can relax the assumption that m is a fixed constant, and we instead require that it is bounded. In the case of Algorithm I, Theorem 5.1 from Nocedal and Wright (1999) ensures that the procedure terminates within niterations, so we have that mk is always bounded byn.

I.2 Global Convergence Following a More Classical Path

We first need to prove that the directionsssk obtained by Algorithm I is able to satisfy (xi) and (xii). In this section we relax the assumption that Hk employed inQ(sss) of Algorithm I is the Hessian (or the generalized Hessian) off. In fact, it is interesting to point out that for obtaining global convergence of{wwwk},Hk can be a generic positive definite matrix. Indeed in general, second-order information off are not needed to obtain global convergence to a stationary point.

(4)

Proposition III Let f ∈LC1. In addition, assume that Hk is positive definite and ∀k Hk is bounded as in (ix). Letsssk be generated by Algorithm I. Then there exist two constants a1>0 anda2 >0 such that

∇f(wwwk)Tsssk≤ −a1k∇f(wwwk)k2, (xi)

kssskk ≤a2k∇f(wwwk)k. (xii) Proof If we pre-multiply

∇Qj =∇Q1+

j−1

X

i=1

αiHkdddik

bydddjk we obtain

d d

djkT∇Qj =dddjkT∇Q1, (xiii) sincedddjk are all mutually conjugated. From Theorem 5.1 of Hestenes and Stiefel (1952) we have that

k∇Qjk2 =−∇QiTdddjk ∀1≤i≤j, (xiv) which together with (xiii) let us obtain that

αj = k∇Qjk2 dddjkTHkdddjk

=−∇QjTdddjk dd

djkTHkdddjk

=−∇Q1Tdddjk dddjkTHkdddjk

. (xv)

From (xv) we get that s s

sk=sssj+1k =

j

X

i=1

αidddik =−

j

X

i=1

∇Q1Tdddik d

d

dikTHkdddikdddik. (xvi) Thus,

∇f(wwwk)Tsssk =−

j

X

i=1

∇f(wwwk)Tdddik2

dddikTHkdddik

≤ −

∇f(wwwk)T∇f(wwwk)2

d

dd1kTHkddd1k ,

where the equality follows from the fact that∇Q1 =∇f(wwwk) and the inequality follows from the fact thatHk is positive definite and from the fact that in the algorithm initialization we haveddd1k=∇f(wwwk). Thus,sssk is a descent direction and we get

|∇f(wwwk)Tsssk| ≥ k∇f(wwwk)k4

k∇f(wwwk)k2kHkk ≥ 1

M1k∇f(wwwk)k2,

(5)

where the last inequality follows from upper-boundedness ofH with the constant M1 >0.

This proves that (xi) is satisfied with a1 = M1

1.Now to prove (xii) we first observe that Hk

is also lower bounded, i.e.

dddjkTHkdddjk≥M2dddjkTIdddjk ≥M2kdddjkk2, (xvii) where I is the identity matrix. Thus, from (xvi) and (xvii) we can obtain

kssskk=ksssj+1k k=

j

X

i=1

∇Q1Tdddik d

d

dikTHkdddik

kdddikk

j

X

i=1

kdddikk2k∇Q1k

|dddikTHkdddik|

=

j

X

i=1

kdddikk2

dddikTHkdddik

k∇f(wwwk)k

≤ j

M2k∇f(wwwk)k

≤ n M2

k∇f(wwwk)k,

where the last inequality follows from the fact that Algorithm I always terminates in a number of steps that is bounded by n (see Theorem 5.1 of Nocedal and Wright (1999)).

Thus, (xii) is proved to be satisfied witha2 = Mn

2.

From Proposition III we have thatsssk is a descent direction and this means that the Armijo line search procedure terminates finitely (see for instance Nocedal and Wright (1999) for details). Then we obtain global convergence by exploiting Armijo line search properties. In the following theorem we relax the assumption thatsssk is obtained by Algorithm I. In fact, the Theorem IV is still valid for any procedure that yields a directionsssk that satisfies (xi) and (xii).

Theorem IV Let f ∈LC1 and let L1 defined in (viii) be compact. Let {wwwk} be a sequence generated by Algorithm II, where at each stepk the direction in Step 3 is obtained by any procedure that yields asssk that satisfies (xi)and (xii). Then,

k→∞lim k∇f(wwwk)k= 0. (xviii)

Proof From the Armijo line search instruction we get that

f(wwwk+1)≤f(wwwk) +γωk∇f(wwwk)Tsssk, (xix)

(6)

and from this we obtain that

f(www1)−f(wwwk) =

k

X

i=1

f(wwwi)−f(wwwi+1)

≥ −γ

k

X

i=1

ωi∇f(wwwi)Tsssi

≥γ

k

X

i=1

ωia1k∇f(wwwi)k2,

(xx)

where the last inequality follows from (xi). By contradiction, there exists a subsequence K ⊂ {1,2. . .} for which

k∇f(wwwk)k ≥ >0, ∀k∈K. (xxi) From Armijo line search properties we get thatwwwk∈ L1 and thus the left part of inequality (xx) is bounded from above, which means that also the right part of (xx) is bounded from

above. Note that the right part of (xx) is also positive, thus,

k→∞lim γ

k

X

i=1

ωia1k∇f(wwwi)k<∞.

Since k∇f(wwwk)k is bounded from below because of (xxi), we get that

k→∞lim

k∈K k

X

i=1

ωi <∞, which, in turn, brings to

k→∞lim

k∈K

ωk= 0. (xxii)

Because of (xxii), it is not possible to have an infinite subsequence of K in which Armijo always terminates withωk = 1.Thus, from Armijo line search properties there exists a ˆkfor which ∀k >k, kˆ ∈K Armijo always terminates with a ωk such that

f(wwwkk

δ sssk)> f(wwwk) +γωk

δ ∇f(wwwk)Tsssk.

From the Mean Value Theorem, applied on the above inequality, we obtain that

∇f(zzzk)Tsssk> γ∇f(wwwk)Tsssk, (xxiii) wherezzzk =wwwkkωδksssk andθk∈[0,1]. Since {wwwk}K ∈ L1, fromK we can extract another infinite subsequence (redefinedK) for which

k→∞lim

k∈K

ww

wk= ˆwww. (xxiv)

(7)

Now, again from the fact that {wwwk}K ∈ L1, we have that there exists a constant M >0 such thatk∇f(wwwk)k< M ∀wwwk∈ L1 and thus, together with (xii) we get

kssskk ≤a2k∇f(wwwk)k ≤a2M. (xxv) Thus, from (xxiv), (xxii) and (xxv) we get that

k→∞lim

k∈K

zzzk= lim

k→∞k∈K

wwwkk

ωksssk

δ = ˆwww. (xxvi)

From (xxv) we have that {sssk}K is limited, so from K we can extract another infinite subsequence (redefined K) for which

k→∞lim

k∈K

s

ssk= ˆsss. (xxvii)

Moreover, from (xi) and (xxi) we have

∇f( ˆwww)Tsˆss≤ −a1k∇f( ˆwww)k2≤ −a12<0. (xxviii) Finally, from (xxiii), (xxvi) and (xxvii) we get that

γ∇f( ˆwww)Tsˆss= lim

k→∞k∈K

γ∇f(wwwk)Tsssk ≤ lim

k→∞k∈K

∇f(zzzk)Tsssk=∇f( ˆwww)Tˆsss, which is absurd, because from (xxviii) we have ∇f( ˆwww)Tsˆss <0 and γ <1.

I.3 Q-Superlinear Local Convergence

In this subsection we are going to show details on how to obtain Q-Superlinear local convergence for Algorithm II. We study local convergence by assuming that the ratio employed in the stopping rule (v) is (xl). Note that iff is the function defined in (i), the strong convexity implies BD-regularity, and, in turn, BD-regularity1 implies the following property (see Lemma 2.6 of Qi (1993)).

Lemma V If ∇f is BD-regular at a generic pointwww¯ and ∇f is locally Lipschitz, then there exist a constant δ > 0, a neighborhood N( ¯www, δ) := {www ∈ Rn : kwww−wwwk ≤¯ δ} ofwww¯ and a constant M >0 such that H is nonsingular and

kH−1k ≤M ∀H ∈∂B∇f(www) ∀www∈N( ¯www, δ). (xxix) From semismoothness we obtain the following (see Proposition 1 of Pang and Qi (1993)).

Lemma VI If ∇f is semismooth atwww∈Rn, then

slim

s s→0 H∈∂∇f(www+sss)

k∇f(www+sss)− ∇f(www)−Hsssk

ksssk = 0. (xxx)

1. Note that in Qi (1993) this property is called strong BD-regularity, but later on it was simply called BD-regularity, even by the author of Qi (1993), for instance in Mart´ınez and Qi (1995).

(8)

In addition, if ∇f is semismooth, also a second order Taylor expansion is available.

Proposition VII If ∇f is semismooth on Rn, then f(uuu) =f(www) +∇f(www)T(uuu−www) +1

2(uuu−www)TH(uuu−www), where H ∈∂f(zzz) for some zzz∈(uuu, www).

The following proposition is an extension of Lemma VI (see Proposition 2.4 from Facchinei (1995)).

Proposition VIII Assume ∇f to be semismooth on Rn and let {θk} be any sequence of numbers such thatθk∈(0,1] ∀k. Then, for every sequence {wwwk} converging to a pointwww and for every sequence{Hk} such that Hk∈∂∇f(wwwk(wwwk−www)), we have

∇f(www) =∇f(wwwk)−Hk(wwwk−www) +o(kwwwk−wwwk).

In the proof of Lemma below (called Lemma 1 in the main paper), we first provide the result for a genericforcing sequence to clarify that when (xliii) is employed andf ∈LC1 Q-L convergence is still an open question. If we instead directly assume ηk→0, the following proof would be more similar to the one given in Theorem 5.3 of Qi and Sun (2006).

Lemma IX Let ∇f be semismooth and BD-regular. Let {wwwk} be a generic sequence con- vergent to a critical pointwww. Further at eachwwwk we generate asssk by solving the Newton linear equation (iv)to satisfy the condition (v) where theratio employed is (xl). Then there exist δ >0, M >0 such that for any wwwk∈N(www, δ) we have

kwwwk+sssk−wwwk ≤o(kwwwk−wwwk) +M ηkkwwwk−wwwk. (xxxi) Further if ηk →0 in generating sssk, then we have

k→∞lim

kwwwk+sssk−wwwk

kwwwk−wwwk = 0. (xxxii) Proof Since∇f is semismooth, by Lemma V we obtain that there exist constantsM1, δ1 >0 such that ifkwww−wwwk ≤δ1 and H∈∂B∇f(www) thenH is nonsingular and

kH−1k ≤M1. (xxxiii)

From semismoothness of ∇f we also have local Lipschitz continuity of∇f. This, together with stationarity ofwww means that there exist two constants L,δ∈(0, δ1) such that

k∇f(www)k ≤Lkwww−wwwk ∀www∈N(www, δ). (xxxiv) By (xxxiv), (xxxiii), Lemma VI, (xl) and the stationarity ofwww, for anywwwk∈N(www, δ) we have:

kwwwk+sssk−wwwk=kwwwk−www−(Hk)−1∇f(wwwk) + (Hk−1)(Hksssk+∇f(wwwk))k

≤ k(Hk)−1k · kHksssk+∇f(wwwk)k+k∇f(wwwk)− ∇f(www)−Hk(wwwk−www)k

≤o(kwwwk−wwwk) +ηkM1k∇f(wwwk)k

≤o(kwwwk−wwwk) +ηkMkwwwk−wwwk,

which proves (xxxi) with M :=M1L. Now, ifηk→0, from (xxxi), we also get (xxxii).

The following is called Theorem 2 in the main paper.

(9)

Theorem X Let f be strongly convex and∇f be semismooth. Let L1 defined in (viii)be compact. Let {wwwk} be generated by Algorithm II with γ ∈(0,12) and ηk → 0. Then:

(1) there exists kˆ such that (vi) is satisfied withωk= 1∀k≥k.ˆ (2) {wwwk} is Q-SL convergent to a critical pointwww.

Proof Let {wwwk} be a sequence generated by Algorithm II in which γ ∈ (0,12), ηk → 0.

Since ∇f is semismooth we have that f ∈LC1, and since L1 is compact we also have that the (generalized) Hessian is bounded. Furthermore, since Hk employed in Algorithm II is the actual (generalized) Hessian of f and f is strongly convex, we have that Hk is positive definite and bounded for allwww ∈ L1. Thus, all the assumptions of Proposition III of the supplementary are satisfied and we obtain (xi) and (xii). Now, from Theorem IV of the supplementary we obtain limk→∞k∇f(wwwk)k= 0, which together with the strict convexity off implies that{wwwk}is convergent to a critical pointwww. This (together with the fact that strong convexity off implies BD-regularity of ∇f) means that we can apply Lemma IX on {wwwk}, since it is a sequence converging to a critical pointwww in whichsssk is generated by solving the Newton linear equation (iv) to satisfy the condition (v) where theratioemployed is (xl).

Since ηk→0, we can now apply (xxxii), on both kwwwk+sssk−wwwk

kwwwk−wwwk ≥ kwwwk−wwwk

kwwwk−wwwk− kssskk

kwwwk−wwwk ≥1− kssskk kwwwk−wwwk and

kwwwk+sssk−wwwk

kwwwk−wwwk ≥ kssskk

kwwwk−wwwk−kwwwk−wwwk

kwwwk−wwwk ≥ kssskk

kwwwk−wwwk−1 and obtain

k→∞lim

kssskk

kwwwk−wwwk = 1. (xxxv) From this and (xxxii) we can obtain

k→∞lim

kwwwk+sssk−wwwk

kssskk = lim

k→∞

kwwwk+sssk−wwwk

kwwwk−wwwk ·kwwwk−wwwk kssskk = 0, which leads to

kwwwk+sssk−wwwk=o(kssskk). (xxxvi) Since ∇f is semismooth, we get that H is locally upper-bounded in N(www, δ) for any δ.

That is, there exists a M >0 such that

M >kHk, ∀H ∈∂∇f(www) ∀www∈N(www, δ). (xxxvii) From (xxxii) and the fact that {wk} →w,there exist aδ >0 and ak(δ) such that

wk∈N(w, δ) andwk+sk∈N(w, δ)∀k≥k(δ). (xxxviii) By Proposition VII and since ∇f(www) = 000, we can write

f(wwwk+sssk) =f(www) +1

2(wwwk+sssk−www)THk(wwwk+sssk−www) f(wwwk) =f(www) +1

2(wwwk−www)TRk(wwwk−www),

(10)

whereHk∈∂∇f(zzzk) for somezzzk∈(www, wwwk+sssk) andRk∈∂∇f(uuuk) for someuuuk∈(www, wwwk).

Subtracting the two above equalities we have f(wwwk+sssk)−f(wwwk) =−1

2(wwwk−www)Rk(wwwk−www) +o(kssskk2),

where the last term follows from (xxxvi), (xxxvii) and (xxxviii). By subtracting 12∇f(wwwk)Tsssk, we get

f(wwwk+sssk)−f(wwwk)−1

2∇f(wwwk)Tsssk

=−1

2(wwwk−www)TRk(wwwk−www)−1

2∇f(wwwk)Tsssk+o(kssskk2)

=−1

2(wwwk+sssk−www)TRk(wwwk−www) + 1

2sssTkRk(wwwk−www)−1

2∇f(wwwk)Tsssk+o(kssskk2)

=−1

2sssTk (∇f(wwwk)−Rk(wwwk−www)) +o(kssskk2)

=−1

2sssTk (∇f(wwwk)− ∇f(www)−Rk(wwwk−www)) +o(kssskk2)

=o(kssskk2),

where the third equality follows from (xxxv), (xxxvi), and (xxxvii), while the last equality follows from Proposition VIII and again (xxxv). Thus, from the last equation we have

f(wwwk+sssk)−f(wwwk) = 1

2∇f(wwwk)Tsssk+o(kssskk2) +γ∇f(wwwk)Tsssk−γ∇f(wwwk)Tsssk

≤γ∇f(wwwk)Tsssk+o(kssskk2)− 1

2 −γ

a1k∇f(wwwk)k2

≤γ∇f(wwwk)Tsssk+o(kssskk2)− 1

2 −γ a1

a22kssskk2,

(xxxix)

where the first inequality follows from (xi), while the second inequality from (xii). The inequality (xxxix) proves that there exists a ˆkfor which (vi) is satisfied withωk = 1,∀ k >ˆk and γ ∈ (0,12). This means that ∀k > kˆ we can replacewwwk+sssk with wwwk+1 in the limit (xxxii) and obtain that{wwwk} Q-superlinearly converges towww.

II. Details on Truncation Criteria in TNCG We first recall all the ratio:

ˆ residual:

ratio= kgggk+Hksssjkk

kgggkk . (xl)

ˆ residuall1:

ratio= kgggk+Hksssjkk1

kgggkk1 . (xli)

(11)

Algorithm III:PCG for solving (xlvi). AssumeM has been factorized to EET.

1 Given ηk<1,∆k>0, let ˆs=0,rˆ= ˆd=−E−1gk, γ= ˆrT

2 while True do

3 ˆv←E−1(Hk(E−Td)),ˆ

4 α← kˆrk2/(ˆdTˆv)

5 ˆs←ˆs+αdˆ

6 ˆr←rˆ−αvˆ

7 if kˆrk< ηkkˆgkk then

8 returnsk=E−T

9 γnew←rˆT

10 β←γnew

11 ˆd←ˆr+βdˆ

12 γ←γnew

ˆ quadratic:

ratio= (Qj −Qj−1)

Qj/j , (xlii)

Now we recall all the forcing sequences:

ˆ constant:

ηk=c0 withc0 ∈(0,1). (xliii)

ˆ adaptive:

ηk= min{c1;c2kgggkkc3}, with c1∈(0,1), c2 >0, c3 ∈(0,1]. (xliv)

ˆ adaptivel1:

ηk= min{c1;c2kgggkkc13}, with c1∈(0,1), c2 >0, c3 ∈(0,1]. (xlv) III. Preconditioning

In this section we will show that Algorithm 2 from the original paper is equivalent to applying Algorithm III and then obtaining sssfromsss=E−1sˆss. For sake of simplicity in this section and in Algorithm III we will not use the iteration counterj and k will only be reported on Hk andgggk. Let us write the instructions of the original CG method (Algorithm I) applied to the system

E−1HkE−Tˆsss=−E−1gggk. (xlvi) At the internal generic CG step we have

ˆ

s= ˆs+αd,ˆ (xlvii)

where

α= kˆrk2TE−1HkE−T

(12)

with

ˆ

r=−E−1HkE−Tˆs−E−1gk =E−1 −HkE−Tˆs−gk

. (xlviii)

Pre-multiplying (xlvii) by E−1, using the transformations=E−1ˆsand the definition

d:=E−1dˆ (xlix)

we get

s=s+αd.

Again using the transformation s=E−1ˆs, (xlviii) can be re-written to get

rˆ=E−1(−Hks−gk) =E−1r, (l) which together with (xlix) brings to

α= rTM−1r

dTv , wherev=Hkd.

Moreover we have that

dˆ = ˆr+βd,ˆ (li)

where

β = rˆTnewˆrnew

ˆrTˆr = rTnewM−1rnew rTM−1r . Pre-multiplying (li) by E−1, from (l) and (xlix) we get

d=r+βd.

Finally, defining the vector

z :=M−1r and the scalar γ:=rTz, α and β can be re-written as

α= rTz

dTv and β= γnew γ . IV. Complete Numerical Results

For experiments we use two-class classification problems shown in Table II. Except for yahookr andyahoojp, others are available from LIBSVM Data Sets (2007). Table II shows the statistics of each data set used in our experiments. For a fair evaluation, all different settings are implemented based on the LIBLINEAR package2. To simulate the practical use we conduct a five-fold cross-validation to select the regularization parameterCBest that

2. Exceptions are for Scikit and ScikitArmijo of Figure 1 from the original paper, where Scikit-learn is used.

In particular, to train aLogisticRegressionmodel, the user should select thenewton-cgoption for the parametersolver. By default Scikit-learn solves a slightly different optimization problem with a bias term. To solve (i), the bias term is thus ignored by setting the optionfit intercepttoFalse.

(13)

achieves the best validation accuracy (see Table II for the resultingC). Then in experiments we consider C =CBest× {1,100}.

In figures below we show the total number of CG steps needed to solve the training problem (i). On the y axis we report the relative reduction of the function value, computed by

f(wwwk)−f(www) f(www) ,

wherewww is the optimal solution3 of (i). The four horizontal lines in figures below indicate places where the following stopping condition is met respectively with={10−1,10−2,10−3,10−4}

kgggkk ≤min{#pos,#neg}

l · kggg1k, (lii)

where #pos, #neg are the numbers of positive- and negative-labeled instances respectively, and >0 is the thresholding constant that controls the precision of the training procedure.

Note that (lii) with= 10−2 is the outer stopping condition employed in LIBLINEAR. Thus the behavior before 10−1 and after 10−4 is not crucial, since the training would be stopped too early or too late. For the Armijo line search, we follow Hsia et al. (2017) to haveδ = 0.5 for deciding the step sizeα∈ {1, δ, δ2, . . .}and γ = 0.01 in the sufficient decrease condition (vi).

IV.1 Selection of the Forcing Sequence in Truncation Criteria See complete results in Figures i and ii.

IV.2 Selection of the Constant Threshold in the Inner Stopping Condition In this section, results in Figures iii and iv are obtained by comparing

ˆ rescons: ratio= (xl),ηk= (xliii), c0= 0.1;

ˆ rescons05: ratio= (xl),ηk= (xliii), c0 = 0.5;

ˆ rescons09: ratio= (xl),ηk= (xliii), c0 = 0.9;

ˆ quadcons: ratio= (xlii),ηk= (xliii), c0= 0.1;

ˆ quadcons05: ratio= (xlii), ηk= (xliii), c0 = 0.5;

ˆ quadcons09: ratio= (xlii), ηk= (xliii), c0 = 0.9.

By using the same color (red) for all the settings with a residual ratio(xl) and one other (blue) for all those with a quadratic ratio(xlii), in Figures iii and iv we want to observe the

robustness of the tworatiosw.r.t. a modification of the constant thresholdc0:

ˆ When C=CBest, there is no evidence of one implementation that is more robust than the other.

3. In practicewwwis not available, so we run enough iterations to get a good approximate solution.

(14)

ˆ When C= 100CBest, the quadratic ratiois more sensible on covtype,rcv1 and also on url, while the residual one is more sensible on yahoojp, yahookr, kdda, kddb, kdd12and also on real-sim.

The value C = 100CBest leads to a more difficult optimization problem and it is thus reasonable that the modification of the constant threshold c0 has here a stronger effect on the convergence speed. From the observations above we can conclude that the quadratic ratiois more robust than the residual one, especially on difficult optimization problems.

IV.3 Comparison Between Adaptive and Constant Forcing Sequences In this section, results in Figures v and vi are obtained by comparing

ˆ rescons05: ratio= (xl),ηk= (xliii), c0 = 0.5;

ˆ resada: ratio= (xl),ηk= (xliv);

ˆ quadcons05: ratio= (xlii), ηk= (xliii), c0 = 0.5;

ˆ quadada: ratio= (xlii), ηk= (xliv).

From Figures v and vi we can observe that the constantforcing sequence and the adaptive one are (almost) always leading to an identical convergence speed. This means that the constant c1= 0.5 is (almost) always smaller thanc2k∇f(wwwk)kc3, withc2 = 1 andc3 = 0.5.

In fact, this result depends also on the choice of c1, c2 and c3, since, for instance, with a bigger initial threshold c1 = 0.9 this overlapping effect would not be so common. On the other side, the setting (c1 = 0.5, c2 = 1, c3 = 0.5) is the one used in Scikit-learn and suggested also in Nocedal and Wright (1999). Moreover, experiments with many other settings ofc1, c2 andc3 have been carried out and there is no evidence of configurations that work consistently better than the standard one.

IV.4 Comparison Between l2-Norm and l1-Norm and with TR

See complete results in Figures vii and viii. On the dataset covtype with C = 100CBest both the setting quadada and quadada l1 are facing slow convergence issues because of the threshold c1. Comparing with resada it seems that in early iterations the Newton equation needed to be solved more accurately. Since in Section IV.2 it is not easy to investigate the choice of the thresholdc1 (because of the single color blue) we decided to deepen the analysis on quadada in Section IV.9.

IV.5 Analysis via Conditions for Global Convergence

In the convergence proof from Section I.2 one of the key properties for proving global convergence is obtaining (xi) and (xii). The following combination of the two requirements is called angle condition

−gggTksssk≥ckgggkkkssskk. (liii) In fact, most of the line-search based optimization methods (see Section 3.2 from Nocedal and Wright (1999)) requires the directionsssk to satisfy (liii) to ensure global convergence4.

4. Note anyway that global convergence is always guaranteed for Algorithm II as proved in Section I.2, so the discussion in this section is only pointing out an undesired issue that might come out in practice.

(15)

This suggests us the idea of analyzing different inner stopping criteria on this value cos(−gggk, sssk) = −gggTksssk

kgggkkkssskk. (liv) The term (liv) is the cosine of the angle between the directionsssk and the anti-gradient−gggk. Inequality (liii) is indeed called angle condition because it requires the cosine of the above angle to be greater or equal to a positive c, withc <1. This means that the new direction sssk cannot be completely orthogonal to the anti-gradient. Otherwise, sssk is not a descent direction and we cannot use it to decrease the function value. We suspect that a good inner stopping condition should make (liii) more easily satisfied. Detailed experiments in Figures ix and x confirm this result.

The condition (liii) is also useful to investigate the relationship between the regularization parameterC and the difficulty of the optimization problem (i). Many past experiments have shown that whenC is large, the optimization problem is more difficult and slow convergence more frequently occurs Hsia et al. (2017, 2018). In particular, when the Hessian of the objective function is positive definite and we approximatesssk≈Hk−1gggk we have

g

ggTksssk≈ −gggTkHk−1gggk≤ − kgggkk2 λmax(Hk) kssskk ≈ kHk−1gggkk ≤ kgggkk

λmin(Hk).

(lv)

Now, from (lv) and (ix) we have

−gggTksssk≥ M2

M1kgggkkkssskk. (lvi) This means that when the regularization parameter C is large (e.g., C = 100CBest), (lvi) shows that the lower bound of the cosine value in (liv) we can derive is smaller. Thus the angle condition (liii) is harder to be satisfied. Note that equation (liii) is a property needed in the convergence proof, which requires the existence of a constant c. In fact, there is no control on the right-hand side of (liii), while what we hope is that the left-hand side of (liii) is large so that it can be easily satisfied. Unfortunately (lvi) shows that for large C,

the left-hand side of (liii) tends to be not that large. In conclusion, even if theoretically convergence is not an issue, the above discussion still gives us a hint on what might actually happen numerically when C is large.

We now show experiments in which we investigate the cosine between the anti-gradient

−gggk and the resulting directionsssk. In Figures ix and x we compare

ˆ quadada: ratio= (xlii), ηk= (xliv);

ˆ rescons09: ratio= (xlii),ηk= (xliii) c0= 0.9;

ˆ resada: ratio= (xl),ηk= (xliv).

Cosines have been calculated until the outer termination criterion (lii) is satisfied with = 10−4. In Figures ix and x, we present the cosine value at each Newton iteration. Further, two horizontal lines corresponding to cosine = 0.1 and cosine = 0.2 have been drawn. From the comparison between Figures ix-x and i-ii we can make the following observations:

(16)

ˆ Algorithms having low cosine values (for example below 0.1) are more likely to converge slower to the solution. In particular, resada has a cosine more frequently below 0.1 when C = 100CBest. For such configurations, Figures i and ii have shown that it behaves worse in terms of convergence speed.

Long continuous regions of low cosine in Figures ix-x might correspond to regions of slow convergence in Figures i-ii (see for example rescons09 onyahookr andkdd2010a when C= 100CBest, below the 0.2 threshold).

ˆ The cosine is often smaller when C is higher. This result confirms the theoretical analysis detailed in this section on the difficulty of solving (i) as C changes.

ˆ In Figures i-ii, quadada converges much faster than resada. From Figures ix-x we observe that quadada’s cosine is generally higher than resada’s. This indicates that in the early Newton iterations, the resulting directionsssk of quadada is closer to the anti-gradient. We have mentioned above in this section that this is a desired property.

Since early CG iterates are very close to the anti-gradient, the inner stopping condition in quadada effectively stops the CG procedure early to have this property.

We conclude that in early outer iterations a suitable criterion should avoid over-solving to obtain directions having a higher cosine with respect to the anti-gradient.

IV.6 Preconditioning: Effect of Preconditioning on Residual and Quadratic Ratio

See complete results in Figures xi and xii.

IV.7 Preconditioning: Comparison with Trust Region and Other Rules See complete results in Figures xiii and xiv.

IV.8 Additional Results in the Preconditioned Case: Comparison Between Adaptive and Constant Forcing Sequences

In this section, results in Figures xv and xvi are obtained by comparing

ˆ rescons05 p: ratio= (xl), ηk= (xliii), c0 = 0.5, preconditioned;

ˆ resada p: ratio= (xl),ηk= (xliv), preconditioned;

ˆ quadcons05 p: ratio= (xlii), ηk= (xliii), c0 = 0.5, preconditioned;

ˆ quadada p: ratio= (xlii), ηk= (xliv), preconditioned.

Also in this case, like in the non-preconditioned one of Section IV.3, the constant forcing sequence and the adaptive one are (almost) always leading to an identical convergence speed.

IV.9 Additional Results in the Preconditioned Case: Selection of the c1

Threshold in the Quadratic Adaptive Rule

In this section, results in Figures xvii and xviii are obtained by comparing

(17)

ˆ quadada01: ratio= (xlii),ηk= (xliv),c1 = 0.1, c2 = 1, c3= 0.5;

ˆ quadada01 p: ratio= (xlii),ηk= (xliv) c1= 0.1, c2= 1, c3 = 0.5, preconditioned;

ˆ quadada: ratio= (xlii), ηk= (xliv), c1= 0.5, c2 = 1, c3 = 0.5;

ˆ quadada p: ratio= (xlii), ηk= (xliv), c1= 0.5, c2 = 1, c3 = 0.5 preconditioned.

From Figures xvii and xviii we can make the following observations:

ˆ Using a smallerc1 threshold (c1= 0.1) is solving the issues encountered by quadada on covtypewith C= 100CBest in Figure viii.

ˆ Both the settings quadada01 and quadada01 p are generally slower than quadada and quadada p.

ˆ The slow convergence issue encountered by quadada oncovtype with C= 100CBest in Figure viii is also solved by the use of preconditioning (quadada p).

IV.10 Additional Results in the Preconditioned Case: Comparison Against a Residual Constant Rule with Higher Constant Threshold

In this section, results in Figures xix and xx are obtained by comparing

ˆ rescons09: ratio= (xl),ηk= (xliii), c0 = 0.9;

ˆ rescons09 p: ratio= (xl), (xliii), c0 = 0.9, preconditioned;

ˆ quadada: ratio= (xlii), ηk= (xliv);

ˆ quadada p: ratio= (xlii), ηk= (xliv), preconditioned.

From Figures xix and xx we can make the following observations:

ˆ The settings rescons09 p and quadada p are often obtaining a very similar speed of convergence.

ˆ There are differences on C = CBest in which rescons09 p is faster than quadada p (rcv1,real-sim) and quadada p is faster than rescons09 p (covtype and criteoon

the fourth line), but the detachment is less than 20 CG steps.

ˆ By looking at yahookr with C = CBest, we can see that rescons09 p is noticeably faster than quadada p (at the third line is around twice faster, by 150 CG steps). The opposite situation is instead happening on yahookrwith C= 100CBest, where being twice faster on the third line means saving around 750 CG steps.

(18)

IV.11 Additional Results in the Preconditioned Case: Comparison Between l2-Norm and l1-Norm

In this section, results in Figures xxi and xxii are obtained by comparing

ˆ resada l1: ratio= (xli), ηk= (xlv);

ˆ rescons l1 p: ratio= (xli), (xlv), preconditioned;

ˆ resada: ratio= (xl),ηk= (xliv);

ˆ resada p: ratio= (xl),ηk= (xliv), preconditioned.

From Figures xxi and xxii we can notice that rescons l1 p is often slower than resada p, so also in the preconditioned case there is no reason to employ L1-norm instead of L2-norm.

IV.12 Results for L2-loss SVM

In this section, results in Figures xxiii and xxiv are obtained by employing theξL2 loss as defined in (ii) and we compare:

ˆ tr rescons: ratio= (xl),ηk= (xliii),c0= 0.1, Trust Region instead of line search;

ˆ rescons09: ratio= (xl),ηk= (xliii), c0 = 0.9;

ˆ resada: ratio= (xl),ηk= (xliv);

ˆ quadada: ratio= (xlii), ηk= (xliv).

IV.13 Results for L2-loss SVM in the Preconditioned Case

In this section, results in Figures xxv and xxvi are obtained by employing the ξL2 loss as defined in (ii) and we compare:

ˆ tr rescons p: ratio= (xl), ηk= (xliii), c0 = 0.1, Trust Region instead of line search, preconditioned;

ˆ rescons09 p: ratio= (xl), ηk= (xliii), c0 = 0.9, preconditioned;

ˆ resada p: ratio= (xl),ηk= (xliv), preconditioned;

ˆ quadada p: ratio= (xlii), ηk= (xliv), preconditioned.

IV.14 Comparison with Other State-of-the-Art Software

In this section we are comparing the proposed TNCG implementation with the two state- of-the-art software SAG (Schmidt et al., 2017) and SAGA (Defazio et al., 2014). A handy implementation of the two methods is provided by Scikit-learn. All the settings are the same described in Footnote 1 of Section IV, except for the fact that for training a LogisticRegression model with SAG or SAGA the user should select the sag or saga option for the parametersolver. Few lines of python/cython code have been added in the Scikit implementation of SAG/SAGA to compute the function value and measure the CPU

(19)

time (with the function time.time()of the library time). This additional computation has not been considered in the total time employed. On the other side, in the case of the LIBLINEAR implementation the time has been measured with the function clock()of the librarytime.h. For both the implementations the starting time has been taken immediately before the main optimization cycle. Experiments are conducted on a machine with an AMD EPYC 7401 CPU and 128 GB of memory. Results in Figures xxvii and xxviii are obtained by comparing quadada p (ratio= (xlii),ηk= (xliv), preconditioned), SAG and SAGA.

Appendix A. Gradient and Hessian Details of the Losses from (ii) LetX = [xxx1, . . . , xxxl]T be the data matrix.

A.1 Logistic Loss

In this section we compute the gradient and the Hessian of f when the loss employed isξLR. Letσ(t) := (1 +e−t)−1.

The gradient of f is

∇f(www) =www+C

l

X

i=1

(σ(yiwwwTxxxi)−1)yixxxi

The Hessian off is

2f(www) =I+CXTDX

with Da diagonal matrix such that Dii(www) =σ(yiwwwTxxxi)(1−σ(yiwwwTxxxi)).

A.2 L2-loss SVM

In this section we compute the gradient and the Hessian of f when the loss employed is ξL2. Letyyy= [y1, . . . , yl]T. The gradient off is

∇f(www) =www+ 2CXI,:T(XI,:www−yyyI), where I=

i∈ {1, . . . l}|1−yiwwwTxxxi >0 . Now we define

φ(t) = max(0,1−t)2, φ0(t) =

(−2(1−t) if 1−t >0,

0 otherwise,

Bφ0(t) =





2 if 1−t >0, 0 if 1−t <0, {0,2} if 1−t= 0,

∂φ0(t) = conv{∂Bφ0(t)}=





2 if 1−t >0, 0 if 1−t <0, µwith µ∈[0,2] if 1−t= 0,

where conv{·} is the convex hull of a set. Therefore, the generalized Hessian off is

∂∇f =I+CXTDX, with Da diagonal matrix such that Dii(www) =∂φ0(yiwwwTxxxi).

(20)

An easy way to calculate∂∇f in practice is to replace the originalDii(www) with the following D¯ii(www) =

(2, if 1−yiwwwTxxxi>0, 0, otherwise.

A.3 Cost of Armijo, Wolfe and Exact Line Searches

Here we recall the cost of Armijo (Section 2.1 of Hsia et al. (2017)) and exact line search (Section 3.3 of Hsia et al. (2017)), and we show how to obtain the same save of operations

even for Wolfe line search.

ˆ Exact requires O(l) operations for each trialω,

ˆ Armijo requiresO(l) operations for each trial ω,

ˆ Wolfe requiresO(l) operations for each trial ω.

To show the trick in computing the Wolfe condition, we first recall the same trick for Armijo (vi). For computing f at each ω, we can store eachxxxTiwwwk,xxxTissskwwwTksssk,wwwkTwwwk andssskTsssk respectively in these scalars x¯iw,x¯is,ws,¯ ww¯ and ¯ss and computef at (wwwk+ωsssk) in the following way

f(wwwk+ωsssk) = 1

2(wwwk+ωsssk)T(wwwk+ωsssk) +C

l

X

i=1

ξ(yi(wwwk+ωsssk)Txxxi)

= 1

2( ¯ww+ 2ωws¯ +ω2ss) +¯ C

l

X

i=1

ξ(yi( ¯xiw+ωx¯is)).

Thus the cost of Armijo line search isO(l)×(line search steps). The cost of Wolfe condition (vii) is also O(l)×(line search steps), because we don’t have to compute ∇f(wwwk+ωsssk) at each ω, but we can directly compute∇f(wwwk+ωsssk)Tsssk. Thus, again by storing everything as above, we have:

∇f(wwwk+ωsssk)Tsssk =wwwTksssk+ωsssTksssk+C

l

X

i=1

ξ0(yi(wwwk+ωsssk)Txxxi)·yi·xxxTisssk

= ¯ww+ωss¯ +C

l

X

i=1

ξ0(yi( ¯xiw+ωx¯is))·yi·x¯is.

(21)

Symbol Meaning

n number of instances

l number of features

k outer iteration index (TN)

j inner step index (CG)

i index of the dataset instance w vector of weights (∈Rn)

f objective function

g or∇f gradient of f (∈Rn) H or ∇2f Hessian of f (∈R(n×n))

y label

x feature vector (∈Rl)

ξ loss function

X matrix vector (∈R(l×n))

s TN direction (∈Rn)

C regularization parameter Qk(·) quadratic approximation off

I identity matrix

η forcing sequence

ω step length

γ constant of Armijo line search

L level set

d d

d conjugate direction (∈Rn)

B a positive definite matrix (∈R(n×n)) M1, M2 bounding constants

r Newton equation residual (∈Rn)

B∇f B-subdifferential of ∇f

∂∇f generalized Hessian of f k · k l2-norm

k · k1 l1-norm

c0, c1, c2, c3 constants in the forcing sequences constant of the outer stopping condition M preconditioning matrix (∈R(n×n)) E a factor matrix of M =EET (∈R(n×n)) ˆ

s preconditioned TN direction (∈Rn) rˆ preconditioned residual (∈Rn) ˆ

g preconditioned gradient (∈Rn) Hˆ preconditioned Hessian (∈R(n×n)) k · kA norm induced by the matrix A λmin(·), λmax(·) min and max eigenvalue

CBest C achieving the best validation accuracy N(w, δ) neighbor of w with radius δ

Table I: Notation table.

Références

Documents relatifs

Such an alliance can be effective in disseminating and promoting a new role for surgery that targets wider sections of the population, as well as participating in attaining the

For the first time, the method of fungi ecology, PCR-DGGE was used as a new traceability analytical tool to identify the geographical origin of tropical timber from four

This note describes a one dimensional problem that is related to the study of the symmetry properties of minimisers for the best constant in the trace inequality in a ball..

Figure 5: Convergence speed for near-surface application test case: steepest descent (red), l-BFGS (green), Gauss-Newton (blue), Newton (purple). Figure 6: Near-surface test

Russell used acquaintance to that effect: ÒI say that I am acquainted with an object when I have a direct cognitive relation to that object, that is when I am directly aware of

Because of operations discussed in Section II.4.3 and II.4.4, we decide to record linear indices in each Z in,i corresponding to the selected elements, rather than linear indices in

Based on the obtained results we can conclude that considering complete document does not always increase the classification accuracy5. Instead, the accuracy depends on the nature

Abstract This article presents a family of semidefinite programming bounds, obtained by Lagrangian duality, for 0-1 quadratic optimization problems with linear or quadratic