Supplementary materials for “A Study on Truncated Newton Methods for Linear Classification”
Leonardo Galli [email protected]
Department of Information Engineering University of Florence
Via Santa Marta 3, Firenze, Italia
Chih-Jen Lin [email protected]
Department of Computer Science National Taiwan University Taipei, Taiwan, 106
We report here some equations from the original paper that might be useful in the rest of the supplementary materials file. In Table I we collected the notations we used in this paper.
The linear classification problem:
minwww f(www) = 1
2wwwTwww+CXl
i=1ξ(yiwwwTxxxi), (i) where (yi, xxxi), i= 1, ..., l are the training data, yi =±1 is the label, xxxi ∈ Rn is a feature vector,wwwTwww/2 is the regularization term, C >0 is a regularization parameter andξ(yiwwwTxxxi) is a generic LC1 (continuously differentiable with locally Lipschitz continuous gradient) convex loss function. WithwwwTwww, f is a LC1 strongly convex function and the minimumwww∗ of f exists and is unique.
The following two losses respectively correspond to logistic regression (C2, i.e. twice continuously differentiable) and l2-loss linear SVM (LC1)
ξLR(ywwwTxxx) = log 1 + exp −ywwwTxxx
ξL2(ywwwTxxx) = (max(0,1−ywwwTxxx))2. (ii) The gradientand theHessian of the objective functionf:
g
gg:=∇f(www) =www+CXl
i=1ξ0(yiwwwTxxxi)yixxxi, H:=∇2f(www) =I+CXTDX,
(iii) where I is the identity matrix, X = [xxx1, . . . , xxxl]T is the data matrix and D is a diagonal matrix withDii=ξ00(yiwwwTxxxi). The Newton equation:
∇2f(wwwk)sss=−∇f(wwwk). (iv) Generictruncation criterion:
ratio(sssjk)≤ηk, (v) Armijo condition:
f(wwwk+ωksssk)≤f(wwwk) +γωkgggTksssk, with 0< γ <1. (vi)
Algorithm I: Conjugate Gradient
Input: sss1k = 000∈Rn,ddd1k=−∇Q1=−gggk
1 for j=1, 2, . . . do
2 αjk= k∇Qjk2
d d djkTHkdddjk 3 sssj+1k =sssjk+αjkdddjk
4 ∇Qj+1=∇Qj +αjkHkdddjk
5 if (v) is satified then
6 sssk =sssj+1k
7 returnsssk
8 βj = k∇Qk∇Qj+1k2
jk2
9 dddj+1k =−∇Qj+1+βjdddjk Wolfe condition:
∇f(wwwk+ωksssk)Tsssk≥γgggkTsssk, with 0< γ <1. (vii) Level set:
L1 :={www∈Rn:f(www)≤f(www1)}. (viii) This lemma can be obtained directly from the structure of the losses and it will be needed in Section I.
Lemma I The Hessian matrix (iii) is bounded, i.e. there exist two constants M1, M2 >0 such that
M1 ≥ kHkk ≥M2 ∀k. (ix)
Proof By the definition of the losses functions (ii) we have that 0≤Dii≤2 (see Appendix A for details). Thus from (iii) we have that ∀k Hkis bounded. Furthermore, with the matrix I in (iii),Hk is bounded above zero.
I. Global and Local Convergence of TNCG In this section we relax f to be anyf ∈ LC1.
I.1 Global Convergence by Treating TNCG as a Common-Directions Algorithm
In Algorithm I we detail the instructions of the conjugate gradient method for solving a generic quadratic strongly convex function of the shapeQ(sss) =gggTksss+ 12sssTHksss. We define
∇Qj :=∇Q(sssjk), where∇Q(sss) =gggk+Hksss.
Lemma II Letsssk be the direction returned by Algorithm I and assume that it terminates aftermkiterations. Thensssk=Pmk
j=1αjkdddjk, wherePk={ddd1k, . . . dddmkk}are conjugate directions
Algorithm II:Truncated Newton Conjugate Gradient Input: www1 ∈Rn starting point
1 for k= 1,2, . . . do
2 compute the directionsssk by approximately solving (iv) with a CG method, until (v) is satisfied
3 compute a step lengthωk
4 wwwk+1=wwwk+ωksssk
with respect toHk and{α1k, . . . , αmkk} are scalars defined to minimize minαααk
g
ggTksssk+1
2sssTkHksssk s.t. sssk=Pkαααk,
(x)
where Pk:= [ddd1k|. . .|dddmk]∈Rn×m andgggk∈ {ddd1k, . . . , dddmk}.
Proof From instructions of Algorithm I we directly get that sssk=sssmkk+1 =
mk
X
j=1
αjkdddjk.
In addition, from Theorem 5.2 of Nocedal and Wright (1999) and the fact thatsss1k= 000 we have thatsssk is the minimizer of (x) on the span of the conjugate directions {ddd1k, . . . dddmkk}.
This concludes the proof.
We now point out that in the framework (Lee et al., 2017), the total number of common- directions m is assumed to be fixed through the whole optimization procedure. Since the termination criterion (v) is employed in Algorithm I, in the TNCG case this number mk depends on the iterationk. On the other hand, it is possible to see that in the proofs of Lee et al. (2017), m is never really required to be fixed. In fact, as long as mis bounded, the problem (x) is well defined and all the properties required for global convergence follow from the fact thatwwwk is its solution, together with the fact thatgggk ∈ {ddd1k, . . . dddmkk}. Note that problem (x) is highly dependent onk, and the dimensionality ofαααkcan be seen as a detail of the internal procedure for obtainingwwwk. For these reasons, we can relax the assumption that m is a fixed constant, and we instead require that it is bounded. In the case of Algorithm I, Theorem 5.1 from Nocedal and Wright (1999) ensures that the procedure terminates within niterations, so we have that mk is always bounded byn.
I.2 Global Convergence Following a More Classical Path
We first need to prove that the directionsssk obtained by Algorithm I is able to satisfy (xi) and (xii). In this section we relax the assumption that Hk employed inQ(sss) of Algorithm I is the Hessian (or the generalized Hessian) off. In fact, it is interesting to point out that for obtaining global convergence of{wwwk},Hk can be a generic positive definite matrix. Indeed in general, second-order information off are not needed to obtain global convergence to a stationary point.
Proposition III Let f ∈LC1. In addition, assume that Hk is positive definite and ∀k Hk is bounded as in (ix). Letsssk be generated by Algorithm I. Then there exist two constants a1>0 anda2 >0 such that
∇f(wwwk)Tsssk≤ −a1k∇f(wwwk)k2, (xi)
kssskk ≤a2k∇f(wwwk)k. (xii) Proof If we pre-multiply
∇Qj =∇Q1+
j−1
X
i=1
αiHkdddik
bydddjk we obtain
d d
djkT∇Qj =dddjkT∇Q1, (xiii) sincedddjk are all mutually conjugated. From Theorem 5.1 of Hestenes and Stiefel (1952) we have that
k∇Qjk2 =−∇QiTdddjk ∀1≤i≤j, (xiv) which together with (xiii) let us obtain that
αj = k∇Qjk2 dddjkTHkdddjk
=−∇QjTdddjk dd
djkTHkdddjk
=−∇Q1Tdddjk dddjkTHkdddjk
. (xv)
From (xv) we get that s s
sk=sssj+1k =
j
X
i=1
αidddik =−
j
X
i=1
∇Q1Tdddik d
d
dikTHkdddikdddik. (xvi) Thus,
∇f(wwwk)Tsssk =−
j
X
i=1
∇f(wwwk)Tdddik2
dddikTHkdddik
≤ −
∇f(wwwk)T∇f(wwwk)2
d
dd1kTHkddd1k ,
where the equality follows from the fact that∇Q1 =∇f(wwwk) and the inequality follows from the fact thatHk is positive definite and from the fact that in the algorithm initialization we haveddd1k=∇f(wwwk). Thus,sssk is a descent direction and we get
|∇f(wwwk)Tsssk| ≥ k∇f(wwwk)k4
k∇f(wwwk)k2kHkk ≥ 1
M1k∇f(wwwk)k2,
where the last inequality follows from upper-boundedness ofH with the constant M1 >0.
This proves that (xi) is satisfied with a1 = M1
1.Now to prove (xii) we first observe that Hk
is also lower bounded, i.e.
dddjkTHkdddjk≥M2dddjkTIdddjk ≥M2kdddjkk2, (xvii) where I is the identity matrix. Thus, from (xvi) and (xvii) we can obtain
kssskk=ksssj+1k k=
j
X
i=1
∇Q1Tdddik d
d
dikTHkdddik
kdddikk
≤
j
X
i=1
kdddikk2k∇Q1k
|dddikTHkdddik|
=
j
X
i=1
kdddikk2
dddikTHkdddik
k∇f(wwwk)k
≤ j
M2k∇f(wwwk)k
≤ n M2
k∇f(wwwk)k,
where the last inequality follows from the fact that Algorithm I always terminates in a number of steps that is bounded by n (see Theorem 5.1 of Nocedal and Wright (1999)).
Thus, (xii) is proved to be satisfied witha2 = Mn
2.
From Proposition III we have thatsssk is a descent direction and this means that the Armijo line search procedure terminates finitely (see for instance Nocedal and Wright (1999) for details). Then we obtain global convergence by exploiting Armijo line search properties. In the following theorem we relax the assumption thatsssk is obtained by Algorithm I. In fact, the Theorem IV is still valid for any procedure that yields a directionsssk that satisfies (xi) and (xii).
Theorem IV Let f ∈LC1 and let L1 defined in (viii) be compact. Let {wwwk} be a sequence generated by Algorithm II, where at each stepk the direction in Step 3 is obtained by any procedure that yields asssk that satisfies (xi)and (xii). Then,
k→∞lim k∇f(wwwk)k= 0. (xviii)
Proof From the Armijo line search instruction we get that
f(wwwk+1)≤f(wwwk) +γωk∇f(wwwk)Tsssk, (xix)
and from this we obtain that
f(www1)−f(wwwk) =
k
X
i=1
f(wwwi)−f(wwwi+1)
≥ −γ
k
X
i=1
ωi∇f(wwwi)Tsssi
≥γ
k
X
i=1
ωia1k∇f(wwwi)k2,
(xx)
where the last inequality follows from (xi). By contradiction, there exists a subsequence K ⊂ {1,2. . .} for which
k∇f(wwwk)k ≥ >0, ∀k∈K. (xxi) From Armijo line search properties we get thatwwwk∈ L1 and thus the left part of inequality (xx) is bounded from above, which means that also the right part of (xx) is bounded from
above. Note that the right part of (xx) is also positive, thus,
k→∞lim γ
k
X
i=1
ωia1k∇f(wwwi)k<∞.
Since k∇f(wwwk)k is bounded from below because of (xxi), we get that
k→∞lim
k∈K k
X
i=1
ωi <∞, which, in turn, brings to
k→∞lim
k∈K
ωk= 0. (xxii)
Because of (xxii), it is not possible to have an infinite subsequence of K in which Armijo always terminates withωk = 1.Thus, from Armijo line search properties there exists a ˆkfor which ∀k >k, kˆ ∈K Armijo always terminates with a ωk such that
f(wwwk+ωk
δ sssk)> f(wwwk) +γωk
δ ∇f(wwwk)Tsssk.
From the Mean Value Theorem, applied on the above inequality, we obtain that
∇f(zzzk)Tsssk> γ∇f(wwwk)Tsssk, (xxiii) wherezzzk =wwwk+θkωδksssk andθk∈[0,1]. Since {wwwk}K ∈ L1, fromK we can extract another infinite subsequence (redefinedK) for which
k→∞lim
k∈K
ww
wk= ˆwww. (xxiv)
Now, again from the fact that {wwwk}K ∈ L1, we have that there exists a constant M >0 such thatk∇f(wwwk)k< M ∀wwwk∈ L1 and thus, together with (xii) we get
kssskk ≤a2k∇f(wwwk)k ≤a2M. (xxv) Thus, from (xxiv), (xxii) and (xxv) we get that
k→∞lim
k∈K
zzzk= lim
k→∞k∈K
wwwk+θk
ωksssk
δ = ˆwww. (xxvi)
From (xxv) we have that {sssk}K is limited, so from K we can extract another infinite subsequence (redefined K) for which
k→∞lim
k∈K
s
ssk= ˆsss. (xxvii)
Moreover, from (xi) and (xxi) we have
∇f( ˆwww)Tsˆss≤ −a1k∇f( ˆwww)k2≤ −a12<0. (xxviii) Finally, from (xxiii), (xxvi) and (xxvii) we get that
γ∇f( ˆwww)Tsˆss= lim
k→∞k∈K
γ∇f(wwwk)Tsssk ≤ lim
k→∞k∈K
∇f(zzzk)Tsssk=∇f( ˆwww)Tˆsss, which is absurd, because from (xxviii) we have ∇f( ˆwww)Tsˆss <0 and γ <1.
I.3 Q-Superlinear Local Convergence
In this subsection we are going to show details on how to obtain Q-Superlinear local convergence for Algorithm II. We study local convergence by assuming that the ratio employed in the stopping rule (v) is (xl). Note that iff is the function defined in (i), the strong convexity implies BD-regularity, and, in turn, BD-regularity1 implies the following property (see Lemma 2.6 of Qi (1993)).
Lemma V If ∇f is BD-regular at a generic pointwww¯ and ∇f is locally Lipschitz, then there exist a constant δ > 0, a neighborhood N( ¯www, δ) := {www ∈ Rn : kwww−wwwk ≤¯ δ} ofwww¯ and a constant M >0 such that H is nonsingular and
kH−1k ≤M ∀H ∈∂B∇f(www) ∀www∈N( ¯www, δ). (xxix) From semismoothness we obtain the following (see Proposition 1 of Pang and Qi (1993)).
Lemma VI If ∇f is semismooth atwww∈Rn, then
slim
s s→0 H∈∂∇f(www+sss)
k∇f(www+sss)− ∇f(www)−Hsssk
ksssk = 0. (xxx)
1. Note that in Qi (1993) this property is called strong BD-regularity, but later on it was simply called BD-regularity, even by the author of Qi (1993), for instance in Mart´ınez and Qi (1995).
In addition, if ∇f is semismooth, also a second order Taylor expansion is available.
Proposition VII If ∇f is semismooth on Rn, then f(uuu) =f(www) +∇f(www)T(uuu−www) +1
2(uuu−www)TH(uuu−www), where H ∈∂f(zzz) for some zzz∈(uuu, www).
The following proposition is an extension of Lemma VI (see Proposition 2.4 from Facchinei (1995)).
Proposition VIII Assume ∇f to be semismooth on Rn and let {θk} be any sequence of numbers such thatθk∈(0,1] ∀k. Then, for every sequence {wwwk} converging to a pointwww∗ and for every sequence{Hk} such that Hk∈∂∇f(www∗+θk(wwwk−www∗)), we have
∇f(www∗) =∇f(wwwk)−Hk(wwwk−www∗) +o(kwwwk−www∗k).
In the proof of Lemma below (called Lemma 1 in the main paper), we first provide the result for a genericforcing sequence to clarify that when (xliii) is employed andf ∈LC1 Q-L convergence is still an open question. If we instead directly assume ηk→0, the following proof would be more similar to the one given in Theorem 5.3 of Qi and Sun (2006).
Lemma IX Let ∇f be semismooth and BD-regular. Let {wwwk} be a generic sequence con- vergent to a critical pointwww∗. Further at eachwwwk we generate asssk by solving the Newton linear equation (iv)to satisfy the condition (v) where theratio employed is (xl). Then there exist δ >0, M >0 such that for any wwwk∈N(www∗, δ) we have
kwwwk+sssk−www∗k ≤o(kwwwk−www∗k) +M ηkkwwwk−www∗k. (xxxi) Further if ηk →0 in generating sssk, then we have
k→∞lim
kwwwk+sssk−www∗k
kwwwk−www∗k = 0. (xxxii) Proof Since∇f is semismooth, by Lemma V we obtain that there exist constantsM1, δ1 >0 such that ifkwww−www∗k ≤δ1 and H∈∂B∇f(www) thenH is nonsingular and
kH−1k ≤M1. (xxxiii)
From semismoothness of ∇f we also have local Lipschitz continuity of∇f. This, together with stationarity ofwww∗ means that there exist two constants L,δ∈(0, δ1) such that
k∇f(www)k ≤Lkwww−www∗k ∀www∈N(www∗, δ). (xxxiv) By (xxxiv), (xxxiii), Lemma VI, (xl) and the stationarity ofwww∗, for anywwwk∈N(www∗, δ) we have:
kwwwk+sssk−www∗k=kwwwk−www∗−(Hk)−1∇f(wwwk) + (Hk−1)(Hksssk+∇f(wwwk))k
≤ k(Hk)−1k · kHksssk+∇f(wwwk)k+k∇f(wwwk)− ∇f(www∗)−Hk(wwwk−www∗)k
≤o(kwwwk−www∗k) +ηkM1k∇f(wwwk)k
≤o(kwwwk−www∗k) +ηkMkwwwk−www∗k,
which proves (xxxi) with M :=M1L. Now, ifηk→0, from (xxxi), we also get (xxxii).
The following is called Theorem 2 in the main paper.
Theorem X Let f be strongly convex and∇f be semismooth. Let L1 defined in (viii)be compact. Let {wwwk} be generated by Algorithm II with γ ∈(0,12) and ηk → 0. Then:
(1) there exists kˆ such that (vi) is satisfied withωk= 1∀k≥k.ˆ (2) {wwwk} is Q-SL convergent to a critical pointwww∗.
Proof Let {wwwk} be a sequence generated by Algorithm II in which γ ∈ (0,12), ηk → 0.
Since ∇f is semismooth we have that f ∈LC1, and since L1 is compact we also have that the (generalized) Hessian is bounded. Furthermore, since Hk employed in Algorithm II is the actual (generalized) Hessian of f and f is strongly convex, we have that Hk is positive definite and bounded for allwww ∈ L1. Thus, all the assumptions of Proposition III of the supplementary are satisfied and we obtain (xi) and (xii). Now, from Theorem IV of the supplementary we obtain limk→∞k∇f(wwwk)k= 0, which together with the strict convexity off implies that{wwwk}is convergent to a critical pointwww∗. This (together with the fact that strong convexity off implies BD-regularity of ∇f) means that we can apply Lemma IX on {wwwk}, since it is a sequence converging to a critical pointwww∗ in whichsssk is generated by solving the Newton linear equation (iv) to satisfy the condition (v) where theratioemployed is (xl).
Since ηk→0, we can now apply (xxxii), on both kwwwk+sssk−www∗k
kwwwk−www∗k ≥ kwwwk−www∗k
kwwwk−www∗k− kssskk
kwwwk−www∗k ≥1− kssskk kwwwk−www∗k and
kwwwk+sssk−www∗k
kwwwk−www∗k ≥ kssskk
kwwwk−www∗k−kwwwk−www∗k
kwwwk−www∗k ≥ kssskk
kwwwk−www∗k−1 and obtain
k→∞lim
kssskk
kwwwk−www∗k = 1. (xxxv) From this and (xxxii) we can obtain
k→∞lim
kwwwk+sssk−www∗k
kssskk = lim
k→∞
kwwwk+sssk−www∗k
kwwwk−www∗k ·kwwwk−www∗k kssskk = 0, which leads to
kwwwk+sssk−www∗k=o(kssskk). (xxxvi) Since ∇f is semismooth, we get that H is locally upper-bounded in N(www∗, δ) for any δ.
That is, there exists a M >0 such that
M >kHk, ∀H ∈∂∇f(www) ∀www∈N(www∗, δ). (xxxvii) From (xxxii) and the fact that {wk} →w∗,there exist aδ >0 and ak(δ) such that
wk∈N(w∗, δ) andwk+sk∈N(w∗, δ)∀k≥k(δ). (xxxviii) By Proposition VII and since ∇f(www∗) = 000, we can write
f(wwwk+sssk) =f(www∗) +1
2(wwwk+sssk−www∗)THk(wwwk+sssk−www∗) f(wwwk) =f(www∗) +1
2(wwwk−www∗)TRk(wwwk−www∗),
whereHk∈∂∇f(zzzk) for somezzzk∈(www∗, wwwk+sssk) andRk∈∂∇f(uuuk) for someuuuk∈(www∗, wwwk).
Subtracting the two above equalities we have f(wwwk+sssk)−f(wwwk) =−1
2(wwwk−www∗)Rk(wwwk−www∗) +o(kssskk2),
where the last term follows from (xxxvi), (xxxvii) and (xxxviii). By subtracting 12∇f(wwwk)Tsssk, we get
f(wwwk+sssk)−f(wwwk)−1
2∇f(wwwk)Tsssk
=−1
2(wwwk−www∗)TRk(wwwk−www∗)−1
2∇f(wwwk)Tsssk+o(kssskk2)
=−1
2(wwwk+sssk−www∗)TRk(wwwk−www∗) + 1
2sssTkRk(wwwk−www∗)−1
2∇f(wwwk)Tsssk+o(kssskk2)
=−1
2sssTk (∇f(wwwk)−Rk(wwwk−www∗)) +o(kssskk2)
=−1
2sssTk (∇f(wwwk)− ∇f(www∗)−Rk(wwwk−www∗)) +o(kssskk2)
=o(kssskk2),
where the third equality follows from (xxxv), (xxxvi), and (xxxvii), while the last equality follows from Proposition VIII and again (xxxv). Thus, from the last equation we have
f(wwwk+sssk)−f(wwwk) = 1
2∇f(wwwk)Tsssk+o(kssskk2) +γ∇f(wwwk)Tsssk−γ∇f(wwwk)Tsssk
≤γ∇f(wwwk)Tsssk+o(kssskk2)− 1
2 −γ
a1k∇f(wwwk)k2
≤γ∇f(wwwk)Tsssk+o(kssskk2)− 1
2 −γ a1
a22kssskk2,
(xxxix)
where the first inequality follows from (xi), while the second inequality from (xii). The inequality (xxxix) proves that there exists a ˆkfor which (vi) is satisfied withωk = 1,∀ k >ˆk and γ ∈ (0,12). This means that ∀k > kˆ we can replacewwwk+sssk with wwwk+1 in the limit (xxxii) and obtain that{wwwk} Q-superlinearly converges towww∗.
II. Details on Truncation Criteria in TNCG We first recall all the ratio:
residual:
ratio= kgggk+Hksssjkk
kgggkk . (xl)
residuall1:
ratio= kgggk+Hksssjkk1
kgggkk1 . (xli)
Algorithm III:PCG for solving (xlvi). AssumeM has been factorized to EET.
1 Given ηk<1,∆k>0, let ˆs=0,rˆ= ˆd=−E−1gk, γ= ˆrTrˆ
2 while True do
3 ˆv←E−1(Hk(E−Td)),ˆ
4 α← kˆrk2/(ˆdTˆv)
5 ˆs←ˆs+αdˆ
6 ˆr←rˆ−αvˆ
7 if kˆrk< ηkkˆgkk then
8 returnsk=E−Tsˆ
9 γnew←rˆTrˆ
10 β←γnew/γ
11 ˆd←ˆr+βdˆ
12 γ←γnew
quadratic:
ratio= (Qj −Qj−1)
Qj/j , (xlii)
Now we recall all the forcing sequences:
constant:
ηk=c0 withc0 ∈(0,1). (xliii)
adaptive:
ηk= min{c1;c2kgggkkc3}, with c1∈(0,1), c2 >0, c3 ∈(0,1]. (xliv)
adaptivel1:
ηk= min{c1;c2kgggkkc13}, with c1∈(0,1), c2 >0, c3 ∈(0,1]. (xlv) III. Preconditioning
In this section we will show that Algorithm 2 from the original paper is equivalent to applying Algorithm III and then obtaining sssfromsss=E−1sˆss. For sake of simplicity in this section and in Algorithm III we will not use the iteration counterj and k will only be reported on Hk andgggk. Let us write the instructions of the original CG method (Algorithm I) applied to the system
E−1HkE−Tˆsss=−E−1gggk. (xlvi) At the internal generic CG step we have
ˆ
s= ˆs+αd,ˆ (xlvii)
where
α= kˆrk2 dˆTE−1HkE−Tdˆ
with
ˆ
r=−E−1HkE−Tˆs−E−1gk =E−1 −HkE−Tˆs−gk
. (xlviii)
Pre-multiplying (xlvii) by E−1, using the transformations=E−1ˆsand the definition
d:=E−1dˆ (xlix)
we get
s=s+αd.
Again using the transformation s=E−1ˆs, (xlviii) can be re-written to get
rˆ=E−1(−Hks−gk) =E−1r, (l) which together with (xlix) brings to
α= rTM−1r
dTv , wherev=Hkd.
Moreover we have that
dˆ = ˆr+βd,ˆ (li)
where
β = rˆTnewˆrnew
ˆrTˆr = rTnewM−1rnew rTM−1r . Pre-multiplying (li) by E−1, from (l) and (xlix) we get
d=r+βd.
Finally, defining the vector
z :=M−1r and the scalar γ:=rTz, α and β can be re-written as
α= rTz
dTv and β= γnew γ . IV. Complete Numerical Results
For experiments we use two-class classification problems shown in Table II. Except for yahookr andyahoojp, others are available from LIBSVM Data Sets (2007). Table II shows the statistics of each data set used in our experiments. For a fair evaluation, all different settings are implemented based on the LIBLINEAR package2. To simulate the practical use we conduct a five-fold cross-validation to select the regularization parameterCBest that
2. Exceptions are for Scikit and ScikitArmijo of Figure 1 from the original paper, where Scikit-learn is used.
In particular, to train aLogisticRegressionmodel, the user should select thenewton-cgoption for the parametersolver. By default Scikit-learn solves a slightly different optimization problem with a bias term. To solve (i), the bias term is thus ignored by setting the optionfit intercepttoFalse.
achieves the best validation accuracy (see Table II for the resultingC). Then in experiments we consider C =CBest× {1,100}.
In figures below we show the total number of CG steps needed to solve the training problem (i). On the y axis we report the relative reduction of the function value, computed by
f(wwwk)−f(www∗) f(www∗) ,
wherewww∗ is the optimal solution3 of (i). The four horizontal lines in figures below indicate places where the following stopping condition is met respectively with={10−1,10−2,10−3,10−4}
kgggkk ≤min{#pos,#neg}
l · kggg1k, (lii)
where #pos, #neg are the numbers of positive- and negative-labeled instances respectively, and >0 is the thresholding constant that controls the precision of the training procedure.
Note that (lii) with= 10−2 is the outer stopping condition employed in LIBLINEAR. Thus the behavior before 10−1 and after 10−4 is not crucial, since the training would be stopped too early or too late. For the Armijo line search, we follow Hsia et al. (2017) to haveδ = 0.5 for deciding the step sizeα∈ {1, δ, δ2, . . .}and γ = 0.01 in the sufficient decrease condition (vi).
IV.1 Selection of the Forcing Sequence in Truncation Criteria See complete results in Figures i and ii.
IV.2 Selection of the Constant Threshold in the Inner Stopping Condition In this section, results in Figures iii and iv are obtained by comparing
rescons: ratio= (xl),ηk= (xliii), c0= 0.1;
rescons05: ratio= (xl),ηk= (xliii), c0 = 0.5;
rescons09: ratio= (xl),ηk= (xliii), c0 = 0.9;
quadcons: ratio= (xlii),ηk= (xliii), c0= 0.1;
quadcons05: ratio= (xlii), ηk= (xliii), c0 = 0.5;
quadcons09: ratio= (xlii), ηk= (xliii), c0 = 0.9.
By using the same color (red) for all the settings with a residual ratio(xl) and one other (blue) for all those with a quadratic ratio(xlii), in Figures iii and iv we want to observe the
robustness of the tworatiosw.r.t. a modification of the constant thresholdc0:
When C=CBest, there is no evidence of one implementation that is more robust than the other.
3. In practicewww∗is not available, so we run enough iterations to get a good approximate solution.
When C= 100CBest, the quadratic ratiois more sensible on covtype,rcv1 and also on url, while the residual one is more sensible on yahoojp, yahookr, kdda, kddb, kdd12and also on real-sim.
The value C = 100CBest leads to a more difficult optimization problem and it is thus reasonable that the modification of the constant threshold c0 has here a stronger effect on the convergence speed. From the observations above we can conclude that the quadratic ratiois more robust than the residual one, especially on difficult optimization problems.
IV.3 Comparison Between Adaptive and Constant Forcing Sequences In this section, results in Figures v and vi are obtained by comparing
rescons05: ratio= (xl),ηk= (xliii), c0 = 0.5;
resada: ratio= (xl),ηk= (xliv);
quadcons05: ratio= (xlii), ηk= (xliii), c0 = 0.5;
quadada: ratio= (xlii), ηk= (xliv).
From Figures v and vi we can observe that the constantforcing sequence and the adaptive one are (almost) always leading to an identical convergence speed. This means that the constant c1= 0.5 is (almost) always smaller thanc2k∇f(wwwk)kc3, withc2 = 1 andc3 = 0.5.
In fact, this result depends also on the choice of c1, c2 and c3, since, for instance, with a bigger initial threshold c1 = 0.9 this overlapping effect would not be so common. On the other side, the setting (c1 = 0.5, c2 = 1, c3 = 0.5) is the one used in Scikit-learn and suggested also in Nocedal and Wright (1999). Moreover, experiments with many other settings ofc1, c2 andc3 have been carried out and there is no evidence of configurations that work consistently better than the standard one.
IV.4 Comparison Between l2-Norm and l1-Norm and with TR
See complete results in Figures vii and viii. On the dataset covtype with C = 100CBest both the setting quadada and quadada l1 are facing slow convergence issues because of the threshold c1. Comparing with resada it seems that in early iterations the Newton equation needed to be solved more accurately. Since in Section IV.2 it is not easy to investigate the choice of the thresholdc1 (because of the single color blue) we decided to deepen the analysis on quadada in Section IV.9.
IV.5 Analysis via Conditions for Global Convergence
In the convergence proof from Section I.2 one of the key properties for proving global convergence is obtaining (xi) and (xii). The following combination of the two requirements is called angle condition
−gggTksssk≥ckgggkkkssskk. (liii) In fact, most of the line-search based optimization methods (see Section 3.2 from Nocedal and Wright (1999)) requires the directionsssk to satisfy (liii) to ensure global convergence4.
4. Note anyway that global convergence is always guaranteed for Algorithm II as proved in Section I.2, so the discussion in this section is only pointing out an undesired issue that might come out in practice.
This suggests us the idea of analyzing different inner stopping criteria on this value cos(−gggk, sssk) = −gggTksssk
kgggkkkssskk. (liv) The term (liv) is the cosine of the angle between the directionsssk and the anti-gradient−gggk. Inequality (liii) is indeed called angle condition because it requires the cosine of the above angle to be greater or equal to a positive c, withc <1. This means that the new direction sssk cannot be completely orthogonal to the anti-gradient. Otherwise, sssk is not a descent direction and we cannot use it to decrease the function value. We suspect that a good inner stopping condition should make (liii) more easily satisfied. Detailed experiments in Figures ix and x confirm this result.
The condition (liii) is also useful to investigate the relationship between the regularization parameterC and the difficulty of the optimization problem (i). Many past experiments have shown that whenC is large, the optimization problem is more difficult and slow convergence more frequently occurs Hsia et al. (2017, 2018). In particular, when the Hessian of the objective function is positive definite and we approximatesssk≈Hk−1gggk we have
g
ggTksssk≈ −gggTkHk−1gggk≤ − kgggkk2 λmax(Hk) kssskk ≈ kHk−1gggkk ≤ kgggkk
λmin(Hk).
(lv)
Now, from (lv) and (ix) we have
−gggTksssk≥ M2
M1kgggkkkssskk. (lvi) This means that when the regularization parameter C is large (e.g., C = 100CBest), (lvi) shows that the lower bound of the cosine value in (liv) we can derive is smaller. Thus the angle condition (liii) is harder to be satisfied. Note that equation (liii) is a property needed in the convergence proof, which requires the existence of a constant c. In fact, there is no control on the right-hand side of (liii), while what we hope is that the left-hand side of (liii) is large so that it can be easily satisfied. Unfortunately (lvi) shows that for large C,
the left-hand side of (liii) tends to be not that large. In conclusion, even if theoretically convergence is not an issue, the above discussion still gives us a hint on what might actually happen numerically when C is large.
We now show experiments in which we investigate the cosine between the anti-gradient
−gggk and the resulting directionsssk. In Figures ix and x we compare
quadada: ratio= (xlii), ηk= (xliv);
rescons09: ratio= (xlii),ηk= (xliii) c0= 0.9;
resada: ratio= (xl),ηk= (xliv).
Cosines have been calculated until the outer termination criterion (lii) is satisfied with = 10−4. In Figures ix and x, we present the cosine value at each Newton iteration. Further, two horizontal lines corresponding to cosine = 0.1 and cosine = 0.2 have been drawn. From the comparison between Figures ix-x and i-ii we can make the following observations:
Algorithms having low cosine values (for example below 0.1) are more likely to converge slower to the solution. In particular, resada has a cosine more frequently below 0.1 when C = 100CBest. For such configurations, Figures i and ii have shown that it behaves worse in terms of convergence speed.
Long continuous regions of low cosine in Figures ix-x might correspond to regions of slow convergence in Figures i-ii (see for example rescons09 onyahookr andkdd2010a when C= 100CBest, below the 0.2 threshold).
The cosine is often smaller when C is higher. This result confirms the theoretical analysis detailed in this section on the difficulty of solving (i) as C changes.
In Figures i-ii, quadada converges much faster than resada. From Figures ix-x we observe that quadada’s cosine is generally higher than resada’s. This indicates that in the early Newton iterations, the resulting directionsssk of quadada is closer to the anti-gradient. We have mentioned above in this section that this is a desired property.
Since early CG iterates are very close to the anti-gradient, the inner stopping condition in quadada effectively stops the CG procedure early to have this property.
We conclude that in early outer iterations a suitable criterion should avoid over-solving to obtain directions having a higher cosine with respect to the anti-gradient.
IV.6 Preconditioning: Effect of Preconditioning on Residual and Quadratic Ratio
See complete results in Figures xi and xii.
IV.7 Preconditioning: Comparison with Trust Region and Other Rules See complete results in Figures xiii and xiv.
IV.8 Additional Results in the Preconditioned Case: Comparison Between Adaptive and Constant Forcing Sequences
In this section, results in Figures xv and xvi are obtained by comparing
rescons05 p: ratio= (xl), ηk= (xliii), c0 = 0.5, preconditioned;
resada p: ratio= (xl),ηk= (xliv), preconditioned;
quadcons05 p: ratio= (xlii), ηk= (xliii), c0 = 0.5, preconditioned;
quadada p: ratio= (xlii), ηk= (xliv), preconditioned.
Also in this case, like in the non-preconditioned one of Section IV.3, the constant forcing sequence and the adaptive one are (almost) always leading to an identical convergence speed.
IV.9 Additional Results in the Preconditioned Case: Selection of the c1
Threshold in the Quadratic Adaptive Rule
In this section, results in Figures xvii and xviii are obtained by comparing
quadada01: ratio= (xlii),ηk= (xliv),c1 = 0.1, c2 = 1, c3= 0.5;
quadada01 p: ratio= (xlii),ηk= (xliv) c1= 0.1, c2= 1, c3 = 0.5, preconditioned;
quadada: ratio= (xlii), ηk= (xliv), c1= 0.5, c2 = 1, c3 = 0.5;
quadada p: ratio= (xlii), ηk= (xliv), c1= 0.5, c2 = 1, c3 = 0.5 preconditioned.
From Figures xvii and xviii we can make the following observations:
Using a smallerc1 threshold (c1= 0.1) is solving the issues encountered by quadada on covtypewith C= 100CBest in Figure viii.
Both the settings quadada01 and quadada01 p are generally slower than quadada and quadada p.
The slow convergence issue encountered by quadada oncovtype with C= 100CBest in Figure viii is also solved by the use of preconditioning (quadada p).
IV.10 Additional Results in the Preconditioned Case: Comparison Against a Residual Constant Rule with Higher Constant Threshold
In this section, results in Figures xix and xx are obtained by comparing
rescons09: ratio= (xl),ηk= (xliii), c0 = 0.9;
rescons09 p: ratio= (xl), (xliii), c0 = 0.9, preconditioned;
quadada: ratio= (xlii), ηk= (xliv);
quadada p: ratio= (xlii), ηk= (xliv), preconditioned.
From Figures xix and xx we can make the following observations:
The settings rescons09 p and quadada p are often obtaining a very similar speed of convergence.
There are differences on C = CBest in which rescons09 p is faster than quadada p (rcv1,real-sim) and quadada p is faster than rescons09 p (covtype and criteoon
the fourth line), but the detachment is less than 20 CG steps.
By looking at yahookr with C = CBest, we can see that rescons09 p is noticeably faster than quadada p (at the third line is around twice faster, by 150 CG steps). The opposite situation is instead happening on yahookrwith C= 100CBest, where being twice faster on the third line means saving around 750 CG steps.
IV.11 Additional Results in the Preconditioned Case: Comparison Between l2-Norm and l1-Norm
In this section, results in Figures xxi and xxii are obtained by comparing
resada l1: ratio= (xli), ηk= (xlv);
rescons l1 p: ratio= (xli), (xlv), preconditioned;
resada: ratio= (xl),ηk= (xliv);
resada p: ratio= (xl),ηk= (xliv), preconditioned.
From Figures xxi and xxii we can notice that rescons l1 p is often slower than resada p, so also in the preconditioned case there is no reason to employ L1-norm instead of L2-norm.
IV.12 Results for L2-loss SVM
In this section, results in Figures xxiii and xxiv are obtained by employing theξL2 loss as defined in (ii) and we compare:
tr rescons: ratio= (xl),ηk= (xliii),c0= 0.1, Trust Region instead of line search;
rescons09: ratio= (xl),ηk= (xliii), c0 = 0.9;
resada: ratio= (xl),ηk= (xliv);
quadada: ratio= (xlii), ηk= (xliv).
IV.13 Results for L2-loss SVM in the Preconditioned Case
In this section, results in Figures xxv and xxvi are obtained by employing the ξL2 loss as defined in (ii) and we compare:
tr rescons p: ratio= (xl), ηk= (xliii), c0 = 0.1, Trust Region instead of line search, preconditioned;
rescons09 p: ratio= (xl), ηk= (xliii), c0 = 0.9, preconditioned;
resada p: ratio= (xl),ηk= (xliv), preconditioned;
quadada p: ratio= (xlii), ηk= (xliv), preconditioned.
IV.14 Comparison with Other State-of-the-Art Software
In this section we are comparing the proposed TNCG implementation with the two state- of-the-art software SAG (Schmidt et al., 2017) and SAGA (Defazio et al., 2014). A handy implementation of the two methods is provided by Scikit-learn. All the settings are the same described in Footnote 1 of Section IV, except for the fact that for training a LogisticRegression model with SAG or SAGA the user should select the sag or saga option for the parametersolver. Few lines of python/cython code have been added in the Scikit implementation of SAG/SAGA to compute the function value and measure the CPU
time (with the function time.time()of the library time). This additional computation has not been considered in the total time employed. On the other side, in the case of the LIBLINEAR implementation the time has been measured with the function clock()of the librarytime.h. For both the implementations the starting time has been taken immediately before the main optimization cycle. Experiments are conducted on a machine with an AMD EPYC 7401 CPU and 128 GB of memory. Results in Figures xxvii and xxviii are obtained by comparing quadada p (ratio= (xlii),ηk= (xliv), preconditioned), SAG and SAGA.
Appendix A. Gradient and Hessian Details of the Losses from (ii) LetX = [xxx1, . . . , xxxl]T be the data matrix.
A.1 Logistic Loss
In this section we compute the gradient and the Hessian of f when the loss employed isξLR. Letσ(t) := (1 +e−t)−1.
The gradient of f is
∇f(www) =www+C
l
X
i=1
(σ(yiwwwTxxxi)−1)yixxxi
The Hessian off is
∇2f(www) =I+CXTDX
with Da diagonal matrix such that Dii(www) =σ(yiwwwTxxxi)(1−σ(yiwwwTxxxi)).
A.2 L2-loss SVM
In this section we compute the gradient and the Hessian of f when the loss employed is ξL2. Letyyy= [y1, . . . , yl]T. The gradient off is
∇f(www) =www+ 2CXI,:T(XI,:www−yyyI), where I=
i∈ {1, . . . l}|1−yiwwwTxxxi >0 . Now we define
φ(t) = max(0,1−t)2, φ0(t) =
(−2(1−t) if 1−t >0,
0 otherwise,
∂Bφ0(t) =
2 if 1−t >0, 0 if 1−t <0, {0,2} if 1−t= 0,
∂φ0(t) = conv{∂Bφ0(t)}=
2 if 1−t >0, 0 if 1−t <0, µwith µ∈[0,2] if 1−t= 0,
where conv{·} is the convex hull of a set. Therefore, the generalized Hessian off is
∂∇f =I+CXTDX, with Da diagonal matrix such that Dii(www) =∂φ0(yiwwwTxxxi).
An easy way to calculate∂∇f in practice is to replace the originalDii(www) with the following D¯ii(www) =
(2, if 1−yiwwwTxxxi>0, 0, otherwise.
A.3 Cost of Armijo, Wolfe and Exact Line Searches
Here we recall the cost of Armijo (Section 2.1 of Hsia et al. (2017)) and exact line search (Section 3.3 of Hsia et al. (2017)), and we show how to obtain the same save of operations
even for Wolfe line search.
Exact requires O(l) operations for each trialω,
Armijo requiresO(l) operations for each trial ω,
Wolfe requiresO(l) operations for each trial ω.
To show the trick in computing the Wolfe condition, we first recall the same trick for Armijo (vi). For computing f at each ω, we can store eachxxxTiwwwk,xxxTissskwwwTksssk,wwwkTwwwk andssskTsssk respectively in these scalars x¯iw,x¯is,ws,¯ ww¯ and ¯ss and computef at (wwwk+ωsssk) in the following way
f(wwwk+ωsssk) = 1
2(wwwk+ωsssk)T(wwwk+ωsssk) +C
l
X
i=1
ξ(yi(wwwk+ωsssk)Txxxi)
= 1
2( ¯ww+ 2ωws¯ +ω2ss) +¯ C
l
X
i=1
ξ(yi( ¯xiw+ωx¯is)).
Thus the cost of Armijo line search isO(l)×(line search steps). The cost of Wolfe condition (vii) is also O(l)×(line search steps), because we don’t have to compute ∇f(wwwk+ωsssk) at each ω, but we can directly compute∇f(wwwk+ωsssk)Tsssk. Thus, again by storing everything as above, we have:
∇f(wwwk+ωsssk)Tsssk =wwwTksssk+ωsssTksssk+C
l
X
i=1
ξ0(yi(wwwk+ωsssk)Txxxi)·yi·xxxTisssk
= ¯ww+ωss¯ +C
l
X
i=1
ξ0(yi( ¯xiw+ωx¯is))·yi·x¯is.
Symbol Meaning
n number of instances
l number of features
k outer iteration index (TN)
j inner step index (CG)
i index of the dataset instance w vector of weights (∈Rn)
f objective function
g or∇f gradient of f (∈Rn) H or ∇2f Hessian of f (∈R(n×n))
y label
x feature vector (∈Rl)
ξ loss function
X matrix vector (∈R(l×n))
s TN direction (∈Rn)
C regularization parameter Qk(·) quadratic approximation off
I identity matrix
η forcing sequence
ω step length
γ constant of Armijo line search
L level set
d d
d conjugate direction (∈Rn)
B a positive definite matrix (∈R(n×n)) M1, M2 bounding constants
r Newton equation residual (∈Rn)
∂B∇f B-subdifferential of ∇f
∂∇f generalized Hessian of f k · k l2-norm
k · k1 l1-norm
c0, c1, c2, c3 constants in the forcing sequences constant of the outer stopping condition M preconditioning matrix (∈R(n×n)) E a factor matrix of M =EET (∈R(n×n)) ˆ
s preconditioned TN direction (∈Rn) rˆ preconditioned residual (∈Rn) ˆ
g preconditioned gradient (∈Rn) Hˆ preconditioned Hessian (∈R(n×n)) k · kA norm induced by the matrix A λmin(·), λmax(·) min and max eigenvalue
CBest C achieving the best validation accuracy N(w, δ) neighbor of w with radius δ
Table I: Notation table.