Supplementary materials for “A Study on Truncated Newton Methods for Linear Classification”

(1)

Supplementary materials for “A Study on Truncated Newton Methods for Linear Classification”

Department of Information Engineering University of Florence

Via Santa Marta 3, Firenze, Italia

Department of Computer Science National Taiwan University Taipei, Taiwan, 106

We report here some equations from the original paper that might be useful in the rest of the supplementary materials file. In Table I we collected the notations we used in this paper.

The linear classification problem:

minwww f(www) = 1

2www^Twww+CXl

i=1ξ(y_iwww^Txxx_i), (i) where (y_i, xxx_i), i= 1, ..., l are the training data, y_i =±1 is the label, xxx_i ∈ Rⁿ is a feature vector,www^Twww/2 is the regularization term, C >0 is a regularization parameter andξ(yiwww^Txxxi) is a generic LC¹ (continuously differentiable with locally Lipschitz continuous gradient) convex loss function. Withwww^Twww, f is a LC¹ strongly convex function and the minimumwww^∗ of f exists and is unique.

The following two losses respectively correspond to logistic regression (C², i.e. twice continuously differentiable) and l2-loss linear SVM (LC¹)

ξLR(ywww^Txxx) = log 1 + exp −ywww^Txxx

ξL2(ywww^Txxx) = (max(0,1−ywww^Txxx))². (ii) The gradientand theHessian of the objective functionf:

g

gg:=∇f(www) =www+CXl

i=1ξ⁰(y_iwww^Txxx_i)y_ixxx_i, H:=∇²f(www) =I+CX^TDX,

(iii) where I is the identity matrix, X = [xxx₁, . . . , xxx_l]^T is the data matrix and D is a diagonal matrix withDii=ξ⁰⁰(yiwww^Txxxi). The Newton equation:

∇²f(www_k)sss=−∇f(www_k). (iv) Generictruncation criterion:

ratio(sss^j_k)≤η_k, (v) Armijo condition:

f(www_k+ω_ksss_k)≤f(www_k) +γω_kggg^T_ksss_k, with 0< γ <1. (vi)

(2)

Algorithm I: Conjugate Gradient

Input: sss¹_k = 000∈Rⁿ,ddd¹_k=−∇Q₁=−ggg_k

1 for j=1, 2, . . . do

2 α^j_k= ^k∇Q^j^k²

d d d^j_k^THkddd^j_k 3 sss^j+1_k =sss^j_k+α^j_kddd^j_k

4 ∇Q_j+1=∇Q_j +α^j_kH_kddd^j_k

5 if (v) is satified then

6 sssk =sss^j+1_k

7 returnsss_k

8 βj = ^k∇Q_k∇Q^j+1^k²

jk²

9 ddd^j+1_k =−∇Q_j+1+βjddd^j_k Wolfe condition:

∇f(www_k+ω_ksss_k)^Tsss_k≥γggg_k^Tsss_k, with 0< γ <1. (vii) Level set:

L₁ :={www∈Rⁿ:f(www)≤f(www₁)}. (viii) This lemma can be obtained directly from the structure of the losses and it will be needed in Section I.

Lemma I The Hessian matrix (iii) is bounded, i.e. there exist two constants M1, M2 >0 such that

M₁ ≥ kH_kk ≥M₂ ∀k. (ix)

Proof By the definition of the losses functions (ii) we have that 0≤D_ii≤2 (see Appendix A for details). Thus from (iii) we have that ∀k H_kis bounded. Furthermore, with the matrix I in (iii),H_k is bounded above zero.

I. Global and Local Convergence of TNCG In this section we relax f to be anyf ∈ LC¹.

I.1 Global Convergence by Treating TNCG as a Common-Directions Algorithm

In Algorithm I we detail the instructions of the conjugate gradient method for solving a generic quadratic strongly convex function of the shapeQ(sss) =ggg^T_ksss+ ¹₂sss^TH_ksss. We define

∇Q_j :=∇Q(sss^j_k), where∇Q(sss) =ggg_k+H_ksss.

Lemma II Letsssk be the direction returned by Algorithm I and assume that it terminates afterm_kiterations. Thensss_k=Pmk

j=1α^j_kddd^j_k, whereP_k={ddd¹_k, . . . ddd^m_k^k}are conjugate directions

(3)

Algorithm II:Truncated Newton Conjugate Gradient Input: www1 ∈Rⁿ starting point

1 for k= 1,2, . . . do

2 compute the directionsss_k by approximately solving (iv) with a CG method, until (v) is satisfied

3 compute a step lengthω_k

4 www_k+1=www_k+ω_ksss_k

with respect toH_k and{α¹_k, . . . , α^m_k^k} are scalars defined to minimize minαααk

g

gg^T_ksss_k+1

2sss^T_kH_ksss_k s.t. sss_k=P_kααα_k,

(x)

where Pk:= [ddd¹_k|. . .|ddd^m_k]∈R^n×m andgggk∈ {ddd¹_k, . . . , ddd^m_k}.

Proof From instructions of Algorithm I we directly get that sss_k=sss^m_k^k⁺¹ =

mk

X

j=1

α^j_kddd^j_k.

In addition, from Theorem 5.2 of Nocedal and Wright (1999) and the fact thatsss¹_k= 000 we have thatsssk is the minimizer of (x) on the span of the conjugate directions {ddd¹_k, . . . ddd^m_k^k}.

This concludes the proof.

We now point out that in the framework (Lee et al., 2017), the total number of common- directions m is assumed to be fixed through the whole optimization procedure. Since the termination criterion (v) is employed in Algorithm I, in the TNCG case this number m_k depends on the iterationk. On the other hand, it is possible to see that in the proofs of Lee et al. (2017), m is never really required to be fixed. In fact, as long as mis bounded, the problem (x) is well defined and all the properties required for global convergence follow from the fact thatwwwk is its solution, together with the fact thatgggk ∈ {ddd¹_k, . . . ddd^m_k^k}. Note that problem (x) is highly dependent onk, and the dimensionality ofααα_kcan be seen as a detail of the internal procedure for obtainingwww_k. For these reasons, we can relax the assumption that m is a fixed constant, and we instead require that it is bounded. In the case of Algorithm I, Theorem 5.1 from Nocedal and Wright (1999) ensures that the procedure terminates within niterations, so we have that m_k is always bounded byn.

I.2 Global Convergence Following a More Classical Path

We first need to prove that the directionsss_k obtained by Algorithm I is able to satisfy (xi) and (xii). In this section we relax the assumption that Hk employed inQ(sss) of Algorithm I is the Hessian (or the generalized Hessian) off. In fact, it is interesting to point out that for obtaining global convergence of{www_k},H_k can be a generic positive definite matrix. Indeed in general, second-order information off are not needed to obtain global convergence to a stationary point.

(4)

Proposition III Let f ∈LC¹. In addition, assume that H_k is positive definite and ∀k H_k is bounded as in (ix). Letsssk be generated by Algorithm I. Then there exist two constants a₁>0 anda₂ >0 such that

∇f(www_k)^Tsss_k≤ −a₁k∇f(www_k)k², (xi)

ksss_kk ≤a₂k∇f(www_k)k. (xii) Proof If we pre-multiply

∇Q_j =∇Q₁+

j−1

X

i=1

αiH_kdddⁱ_k

byddd^j_k we obtain

d d

d^j_k^T∇Q_j =ddd^j_k^T∇Q₁, (xiii) sinceddd^j_k are all mutually conjugated. From Theorem 5.1 of Hestenes and Stiefel (1952) we have that

k∇Q_jk² =−∇Q_i^Tddd^j_k ∀1≤i≤j, (xiv) which together with (xiii) let us obtain that

α^j = k∇Q_jk² ddd^j_k^TH_kddd^j_k

=−∇Q_j^Tddd^j_k dd

d^j_k^TH_kddd^j_k

=−∇Q₁^Tddd^j_k ddd^j_k^TH_kddd^j_k

. (xv)

From (xv) we get that s s

s_k=sss^j+1_k =

j

X

i=1

αⁱdddⁱ_k =−

j

X

i=1

∇Q₁^Tdddⁱ_k d

d

dⁱ_k^TH_kdddⁱ_kdddⁱ_k. (xvi) Thus,

∇f(www_k)^Tsss_k =−

j

X

i=1

∇f(www_k)^Tdddⁱ_k2

dddⁱ_k^TH_kdddⁱ_k

≤ −

∇f(www_k)^T∇f(www_k)2

d

dd¹_k^TH_kddd¹_k ,

where the equality follows from the fact that∇Q₁ =∇f(www_k) and the inequality follows from the fact thatHk is positive definite and from the fact that in the algorithm initialization we haveddd¹_k=∇f(www_k). Thus,sss_k is a descent direction and we get

|∇f(www_k)^Tsss_k| ≥ k∇f(wwwk)k⁴

k∇f(www_k)k²kH_kk ≥ 1

M₁k∇f(www_k)k²,

(5)

where the last inequality follows from upper-boundedness ofH with the constant M₁ >0.

This proves that (xi) is satisfied with a1 = _M¹

1.Now to prove (xii) we first observe that Hk

is also lower bounded, i.e.

ddd^j_k^TH_kddd^j_k≥M₂ddd^j_k^TIddd^j_k ≥M₂kddd^j_kk², (xvii) where I is the identity matrix. Thus, from (xvi) and (xvii) we can obtain

ksss_kk=ksss^j+1_k k=

j

X

i=1

∇Q₁^Tdddⁱ_k d

d

dⁱ_k^TH_kdddⁱ_k

kdddⁱ_kk

≤

j

X

i=1

kdddⁱ_kk²k∇Q₁k

|dddⁱ_k^THkdddⁱ_k|

=

j

X

i=1

kdddⁱ_kk²

dddⁱ_k^TH_kdddⁱ_k

k∇f(www_k)k

≤ j

M₂k∇f(www_k)k

≤ n M2

k∇f(wwwk)k,

where the last inequality follows from the fact that Algorithm I always terminates in a number of steps that is bounded by n (see Theorem 5.1 of Nocedal and Wright (1999)).

Thus, (xii) is proved to be satisfied witha₂ = _Mⁿ

2.

From Proposition III we have thatsss_k is a descent direction and this means that the Armijo line search procedure terminates finitely (see for instance Nocedal and Wright (1999) for details). Then we obtain global convergence by exploiting Armijo line search properties. In the following theorem we relax the assumption thatsss_k is obtained by Algorithm I. In fact, the Theorem IV is still valid for any procedure that yields a directionsss_k that satisfies (xi) and (xii).

Theorem IV Let f ∈LC¹ and let L₁ defined in (viii) be compact. Let {www_k} be a sequence generated by Algorithm II, where at each stepk the direction in Step 3 is obtained by any procedure that yields asss_k that satisfies (xi)and (xii). Then,

k→∞lim k∇f(www_k)k= 0. (xviii)

Proof From the Armijo line search instruction we get that

f(www_k+1)≤f(www_k) +γω_k∇f(www_k)^Tsss_k, (xix)

(6)

and from this we obtain that

f(www1)−f(www_k) =

k

X

i=1

f(wwwi)−f(wwwi+1)

≥ −γ

k

X

i=1

ω_i∇f(www_i)^Tsss_i

≥γ

k

X

i=1

ω_ia₁k∇f(www_i)k²,

(xx)

where the last inequality follows from (xi). By contradiction, there exists a subsequence K ⊂ {1,2. . .} for which

k∇f(www_k)k ≥ >0, ∀k∈K. (xxi) From Armijo line search properties we get thatwww_k∈ L₁ and thus the left part of inequality (xx) is bounded from above, which means that also the right part of (xx) is bounded from

above. Note that the right part of (xx) is also positive, thus,

k→∞lim γ

k

X

i=1

ω_ia₁k∇f(www_i)k<∞.

Since k∇f(wwwk)k is bounded from below because of (xxi), we get that

k→∞lim

k∈K k

X

i=1

ω_i <∞, which, in turn, brings to

k→∞lim

k∈K

ωk= 0. (xxii)

Because of (xxii), it is not possible to have an infinite subsequence of K in which Armijo always terminates withω_k = 1.Thus, from Armijo line search properties there exists a ˆkfor which ∀k >k, kˆ ∈K Armijo always terminates with a ωk such that

f(www_k+ω_k

δ sss_k)> f(www_k) +γω_k

δ ∇f(www_k)^Tsss_k.

From the Mean Value Theorem, applied on the above inequality, we obtain that

∇f(zzzk)^Tsssk> γ∇f(wwwk)^Tsssk, (xxiii) wherezzz_k =www_k+θ_k^ω_δ^ksss_k andθ_k∈[0,1]. Since {www_k}_K ∈ L₁, fromK we can extract another infinite subsequence (redefinedK) for which

k→∞lim

k∈K

ww

wk= ˆwww. (xxiv)

(7)

Now, again from the fact that {www_k}_K ∈ L₁, we have that there exists a constant M >0 such thatk∇f(wwwk)k< M ∀wwwk∈ L₁ and thus, together with (xii) we get

ksss_kk ≤a₂k∇f(www_k)k ≤a₂M. (xxv) Thus, from (xxiv), (xxii) and (xxv) we get that

k→∞lim

k∈K

zzzk= lim

k→∞k∈K

wwwk+θk

ωksssk

δ = ˆwww. (xxvi)

From (xxv) we have that {sss_k}_K is limited, so from K we can extract another infinite subsequence (redefined K) for which

k→∞lim

k∈K

s

ss_k= ˆsss. (xxvii)

Moreover, from (xi) and (xxi) we have

∇f( ˆwww)^Tsˆss≤ −a₁k∇f( ˆwww)k²≤ −a₁²<0. (xxviii) Finally, from (xxiii), (xxvi) and (xxvii) we get that

γ∇f( ˆwww)^Tsˆss= lim

k→∞k∈K

γ∇f(www_k)^Tsss_k ≤ lim

k→∞k∈K

∇f(zzz_k)^Tsss_k=∇f( ˆwww)^Tˆsss, which is absurd, because from (xxviii) we have ∇f( ˆwww)^Tsˆss <0 and γ <1.

I.3 Q-Superlinear Local Convergence

In this subsection we are going to show details on how to obtain Q-Superlinear local convergence for Algorithm II. We study local convergence by assuming that the ratio employed in the stopping rule (v) is (xl). Note that iff is the function defined in (i), the strong convexity implies BD-regularity, and, in turn, BD-regularity¹ implies the following property (see Lemma 2.6 of Qi (1993)).

Lemma V If ∇f is BD-regular at a generic pointwww¯ and ∇f is locally Lipschitz, then there exist a constant δ > 0, a neighborhood N( ¯www, δ) := {www ∈ Rⁿ : kwww−wwwk ≤¯ δ} ofwww¯ and a constant M >0 such that H is nonsingular and

kH⁻¹k ≤M ∀H ∈∂_B∇f(www) ∀www∈N( ¯www, δ). (xxix) From semismoothness we obtain the following (see Proposition 1 of Pang and Qi (1993)).

Lemma VI If ∇f is semismooth atwww∈Rⁿ, then

slim

s s→0 H∈∂∇f(www+sss)

k∇f(www+sss)− ∇f(www)−Hsssk

ksssk = 0. (xxx)

1. Note that in Qi (1993) this property is called strong BD-regularity, but later on it was simply called BD-regularity, even by the author of Qi (1993), for instance in Mart´ınez and Qi (1995).

(8)

In addition, if ∇f is semismooth, also a second order Taylor expansion is available.

Proposition VII If ∇f is semismooth on Rⁿ, then f(uuu) =f(www) +∇f(www)^T(uuu−www) +1

2(uuu−www)^TH(uuu−www), where H ∈∂f(zzz) for some zzz∈(uuu, www).

The following proposition is an extension of Lemma VI (see Proposition 2.4 from Facchinei (1995)).

Proposition VIII Assume ∇f to be semismooth on Rⁿ and let {θ_k} be any sequence of numbers such thatθk∈(0,1] ∀k. Then, for every sequence {wwwk} converging to a pointwww^∗ and for every sequence{H_k} such that H_k∈∂∇f(www^∗+θ_k(www_k−www^∗)), we have

∇f(www^∗) =∇f(www_k)−H_k(www_k−www^∗) +o(kwww_k−www^∗k).

In the proof of Lemma below (called Lemma 1 in the main paper), we first provide the result for a genericforcing sequence to clarify that when (xliii) is employed andf ∈LC¹ Q-L convergence is still an open question. If we instead directly assume ηk→0, the following proof would be more similar to the one given in Theorem 5.3 of Qi and Sun (2006).

Lemma IX Let ∇f be semismooth and BD-regular. Let {www_k} be a generic sequence convergent to a critical pointwww^∗. Further at eachwww_k we generate asss_k by solving the Newton linear equation (iv)to satisfy the condition (v) where theratio employed is (xl). Then there exist δ >0, M >0 such that for any www_k∈N(www^∗, δ) we have

kwww_k+sss_k−www^∗k ≤o(kwww_k−www^∗k) +M η_kkwww_k−www^∗k. (xxxi) Further if ηk →0 in generating sssk, then we have

k→∞lim

kwww_k+sss_k−www^∗k

kwwwk−www^∗k = 0. (xxxii) Proof Since∇f is semismooth, by Lemma V we obtain that there exist constantsM₁, δ₁ >0 such that ifkwww−www^∗k ≤δ₁ and H∈∂_B∇f(www) thenH is nonsingular and

kH⁻¹k ≤M₁. (xxxiii)

From semismoothness of ∇f we also have local Lipschitz continuity of∇f. This, together with stationarity ofwww^∗ means that there exist two constants L,δ∈(0, δ1) such that

k∇f(www)k ≤Lkwww−www^∗k ∀www∈N(www^∗, δ). (xxxiv) By (xxxiv), (xxxiii), Lemma VI, (xl) and the stationarity ofwww^∗, for anywww_k∈N(www^∗, δ) we have:

kwww_k+sss_k−www^∗k=kwww_k−www^∗−(H_k)⁻¹∇f(www_k) + (H_k⁻¹)(H_ksss_k+∇f(www_k))k

≤ k(H_k)⁻¹k · kH_ksss_k+∇f(www_k)k+k∇f(www_k)− ∇f(www^∗)−H_k(www_k−www^∗)k

≤o(kwww_k−www^∗k) +η_kM1k∇f(www_k)k

≤o(kwwwk−www^∗k) +ηkMkwwwk−www^∗k,

which proves (xxxi) with M :=M₁L. Now, ifη_k→0, from (xxxi), we also get (xxxii).

The following is called Theorem 2 in the main paper.

(9)

Theorem X Let f be strongly convex and∇f be semismooth. Let L₁ defined in (viii)be compact. Let {wwwk} be generated by Algorithm II with γ ∈(0,¹₂) and ηk → 0. Then:

(1) there exists kˆ such that (vi) is satisfied withωk= 1∀k≥k.ˆ (2) {www_k} is Q-SL convergent to a critical pointwww^∗.

Proof Let {www_k} be a sequence generated by Algorithm II in which γ ∈ (0,¹₂), η_k → 0.

Since ∇f is semismooth we have that f ∈LC¹, and since L₁ is compact we also have that the (generalized) Hessian is bounded. Furthermore, since Hk employed in Algorithm II is the actual (generalized) Hessian of f and f is strongly convex, we have that H_k is positive definite and bounded for allwww ∈ L₁. Thus, all the assumptions of Proposition III of the supplementary are satisfied and we obtain (xi) and (xii). Now, from Theorem IV of the supplementary we obtain limk→∞k∇f(www_k)k= 0, which together with the strict convexity off implies that{www_k}is convergent to a critical pointwww^∗. This (together with the fact that strong convexity off implies BD-regularity of ∇f) means that we can apply Lemma IX on {www_k}, since it is a sequence converging to a critical pointwww^∗ in whichsss_k is generated by solving the Newton linear equation (iv) to satisfy the condition (v) where theratioemployed is (xl).

Since η_k→0, we can now apply (xxxii), on both kwww_k+sss_k−www^∗k

kwwwk−www^∗k ≥ kwww_k−www^∗k

kwwwk−www^∗k− ksss_kk

kwwwk−www^∗k ≥1− ksss_kk kwwwk−www^∗k and

kwww_k+sss_k−www^∗k

kwww_k−www^∗k ≥ ksss_kk

kwww_k−www^∗k−kwww_k−www^∗k

kwww_k−www^∗k ≥ ksss_kk

kwww_k−www^∗k−1 and obtain

k→∞lim

ksss_kk

kwww_k−www^∗k = 1. (xxxv) From this and (xxxii) we can obtain

k→∞lim

kwwwk+sssk−www^∗k

ksss_kk = lim

k→∞

kwwwk+sssk−www^∗k

kwww_k−www^∗k ·kwwwk−www^∗k ksss_kk = 0, which leads to

kwww_k+sss_k−www^∗k=o(ksss_kk). (xxxvi) Since ∇f is semismooth, we get that H is locally upper-bounded in N(www^∗, δ) for any δ.

That is, there exists a M >0 such that

M >kHk, ∀H ∈∂∇f(www) ∀www∈N(www^∗, δ). (xxxvii) From (xxxii) and the fact that {w_k} →w^∗,there exist aδ >0 and ak(δ) such that

w_k∈N(w^∗, δ) andw_k+s_k∈N(w^∗, δ)∀k≥k(δ). (xxxviii) By Proposition VII and since ∇f(www^∗) = 000, we can write

f(www_k+sss_k) =f(www^∗) +1

2(www_k+sss_k−www^∗)^TH_k(www_k+sss_k−www^∗) f(www_k) =f(www^∗) +1

2(www_k−www^∗)^TR_k(www_k−www^∗),

(10)

whereH_k∈∂∇f(zzz_k) for somezzz_k∈(www^∗, www_k+sss_k) andR_k∈∂∇f(uuu_k) for someuuu_k∈(www^∗, www_k).

Subtracting the two above equalities we have f(www_k+sss_k)−f(www_k) =−1

2(www_k−www^∗)R_k(www_k−www^∗) +o(ksss_kk²),

where the last term follows from (xxxvi), (xxxvii) and (xxxviii). By subtracting ¹₂∇f(wwwk)^Tsssk, we get

f(www_k+sss_k)−f(www_k)−1

2∇f(www_k)^Tsss_k

=−1

2(www_k−www^∗)^TR_k(www_k−www^∗)−1

2∇f(www_k)^Tsss_k+o(ksss_kk²)

=−1

2(www_k+sss_k−www^∗)^TR_k(www_k−www^∗) + 1

2sss^T_kR_k(www_k−www^∗)−1

2∇f(www_k)^Tsss_k+o(ksss_kk²)

=−1

2sss^T_k (∇f(www_k)−R_k(www_k−www^∗)) +o(ksss_kk²)

=−1

2sss^T_k (∇f(wwwk)− ∇f(www^∗)−Rk(wwwk−www^∗)) +o(kssskk²)

=o(ksss_kk²),

where the third equality follows from (xxxv), (xxxvi), and (xxxvii), while the last equality follows from Proposition VIII and again (xxxv). Thus, from the last equation we have

f(www_k+sss_k)−f(www_k) = 1

2∇f(www_k)^Tsss_k+o(ksss_kk²) +γ∇f(www_k)^Tsss_k−γ∇f(www_k)^Tsss_k

≤γ∇f(www_k)^Tsss_k+o(ksss_kk²)− 1

2 −γ

a1k∇f(www_k)k²

≤γ∇f(www_k)^Tsss_k+o(ksss_kk²)− 1

2 −γ a₁

a₂²ksss_kk²,

(xxxix)

where the first inequality follows from (xi), while the second inequality from (xii). The inequality (xxxix) proves that there exists a ˆkfor which (vi) is satisfied withω_k = 1,∀ k >ˆk and γ ∈ (0,¹₂). This means that ∀k > kˆ we can replacewww_k+sss_k with www_k+1 in the limit (xxxii) and obtain that{www_k} Q-superlinearly converges towww^∗.

II. Details on Truncation Criteria in TNCG We first recall all the ratio:

residual:

ratio= kggg_k+H_ksss^j_kk

kggg_kk . (xl)

residual_l1:

ratio= kggg_k+H_ksss^j_kk₁

kggg_kk₁ . (xli)

(11)

Algorithm III:PCG for solving (xlvi). AssumeM has been factorized to EE^T.

1 Given ηk<1,∆k>0, let ˆs=0,rˆ= ˆd=−E⁻¹g_k, γ= ˆr^Trˆ

2 while True do

3 ˆv←E⁻¹(Hk(E^−Td)),ˆ

4 α← kˆrk²/(ˆd^Tˆv)

5 ˆs←ˆs+αdˆ

6 ˆr←rˆ−αvˆ

7 if kˆrk< η_kkˆg_kk then

8 returnsk=E^−Tsˆ

9 γ^new←rˆ^Trˆ

10 β←γ^new/γ

11 ˆd←ˆr+βdˆ

12 γ←γ^new

quadratic:

ratio= (Qj −Qj−1)

Q_j/j , (xlii)

Now we recall all the forcing sequences:

constant:

η_k=c₀ withc₀ ∈(0,1). (xliii)

adaptive:

η_k= min{c₁;c2kggg_kk^c³}, with c1∈(0,1), c2 >0, c3 ∈(0,1]. (xliv)

adaptivel1:

η_k= min{c₁;c₂kggg_kk^c₁³}, with c₁∈(0,1), c₂ >0, c₃ ∈(0,1]. (xlv) III. Preconditioning

In this section we will show that Algorithm 2 from the original paper is equivalent to applying Algorithm III and then obtaining sssfromsss=E⁻¹sˆss. For sake of simplicity in this section and in Algorithm III we will not use the iteration counterj and k will only be reported on H_k andggg_k. Let us write the instructions of the original CG method (Algorithm I) applied to the system

E⁻¹HkE^−Tˆsss=−E⁻¹gggk. (xlvi) At the internal generic CG step we have

ˆ

s= ˆs+αd,ˆ (xlvii)

where

α= kˆrk² dˆ^TE⁻¹H_kE^−Tdˆ

(12)

with

ˆ

r=−E⁻¹H_kE^−Tˆs−E⁻¹g_k =E⁻¹ −H_kE^−Tˆs−g_k

. (xlviii)

Pre-multiplying (xlvii) by E⁻¹, using the transformations=E⁻¹ˆsand the definition

d:=E⁻¹dˆ (xlix)

we get

s=s+αd.

Again using the transformation s=E⁻¹ˆs, (xlviii) can be re-written to get

rˆ=E⁻¹(−H_ks−g_k) =E⁻¹r, (l) which together with (xlix) brings to

α= r^TM⁻¹r

d^Tv , wherev=H_kd.

Moreover we have that

dˆ = ˆr+βd,ˆ (li)

where

β = rˆ^T_newˆr_new

ˆr^Tˆr = r^T_newM⁻¹r_new r^TM⁻¹r . Pre-multiplying (li) by E⁻¹, from (l) and (xlix) we get

d=r+βd.

Finally, defining the vector

z :=M⁻¹r and the scalar γ:=r^Tz, α and β can be re-written as

α= r^Tz

d^Tv and β= γ^new γ . IV. Complete Numerical Results

For experiments we use two-class classification problems shown in Table II. Except for yahookr andyahoojp, others are available from LIBSVM Data Sets (2007). Table II shows the statistics of each data set used in our experiments. For a fair evaluation, all different settings are implemented based on the LIBLINEAR package². To simulate the practical use we conduct a five-fold cross-validation to select the regularization parameterC_Best that

2. Exceptions are for Scikit and ScikitArmijo of Figure 1 from the original paper, where Scikit-learn is used.

In particular, to train aLogisticRegressionmodel, the user should select thenewton-cgoption for the parametersolver. By default Scikit-learn solves a slightly different optimization problem with a bias term. To solve (i), the bias term is thus ignored by setting the optionfit intercepttoFalse.

(13)

achieves the best validation accuracy (see Table II for the resultingC). Then in experiments we consider C =CBest× {1,100}.

In figures below we show the total number of CG steps needed to solve the training problem (i). On the y axis we report the relative reduction of the function value, computed by

f(www_k)−f(www^∗) f(www^∗) ,

wherewww^∗ is the optimal solution³ of (i). The four horizontal lines in figures below indicate places where the following stopping condition is met respectively with={10⁻¹,10⁻²,10⁻³,10⁻⁴}

kgggkk ≤min{#pos,#neg}

l · kggg1k, (lii)

where #pos, #neg are the numbers of positive- and negative-labeled instances respectively, and >0 is the thresholding constant that controls the precision of the training procedure.

Note that (lii) with= 10⁻² is the outer stopping condition employed in LIBLINEAR. Thus the behavior before 10⁻¹ and after 10⁻⁴ is not crucial, since the training would be stopped too early or too late. For the Armijo line search, we follow Hsia et al. (2017) to haveδ = 0.5 for deciding the step sizeα∈ {1, δ, δ², . . .}and γ = 0.01 in the sufficient decrease condition (vi).

IV.1 Selection of the Forcing Sequence in Truncation Criteria See complete results in Figures i and ii.

IV.2 Selection of the Constant Threshold in the Inner Stopping Condition In this section, results in Figures iii and iv are obtained by comparing

rescons: ratio= (xl),η_k= (xliii), c₀= 0.1;

rescons05: ratio= (xl),η_k= (xliii), c₀ = 0.5;

rescons09: ratio= (xl),ηk= (xliii), c0 = 0.9;

quadcons: ratio= (xlii),η_k= (xliii), c₀= 0.1;

quadcons05: ratio= (xlii), η_k= (xliii), c0 = 0.5;

quadcons09: ratio= (xlii), η_k= (xliii), c₀ = 0.9.

By using the same color (red) for all the settings with a residual ratio(xl) and one other (blue) for all those with a quadratic ratio(xlii), in Figures iii and iv we want to observe the

robustness of the tworatiosw.r.t. a modification of the constant thresholdc0:

When C=C_Best, there is no evidence of one implementation that is more robust than the other.

3. In practicewww^∗is not available, so we run enough iterations to get a good approximate solution.

(14)

When C= 100C_Best, the quadratic ratiois more sensible on covtype,rcv1 and also on url, while the residual one is more sensible on yahoojp, yahookr, kdda, kddb, kdd12and also on real-sim.

The value C = 100CBest leads to a more difficult optimization problem and it is thus reasonable that the modification of the constant threshold c0 has here a stronger effect on the convergence speed. From the observations above we can conclude that the quadratic ratiois more robust than the residual one, especially on difficult optimization problems.

IV.3 Comparison Between Adaptive and Constant Forcing Sequences In this section, results in Figures v and vi are obtained by comparing

rescons05: ratio= (xl),η_k= (xliii), c0 = 0.5;

resada: ratio= (xl),ηk= (xliv);

quadcons05: ratio= (xlii), ηk= (xliii), c0 = 0.5;

quadada: ratio= (xlii), η_k= (xliv).

From Figures v and vi we can observe that the constantforcing sequence and the adaptive one are (almost) always leading to an identical convergence speed. This means that the constant c1= 0.5 is (almost) always smaller thanc2k∇f(wwwk)k^c³, withc2 = 1 andc3 = 0.5.

In fact, this result depends also on the choice of c1, c2 and c3, since, for instance, with a bigger initial threshold c₁ = 0.9 this overlapping effect would not be so common. On the other side, the setting (c1 = 0.5, c2 = 1, c3 = 0.5) is the one used in Scikit-learn and suggested also in Nocedal and Wright (1999). Moreover, experiments with many other settings ofc₁, c₂ andc₃ have been carried out and there is no evidence of configurations that work consistently better than the standard one.

IV.4 Comparison Between l2-Norm and l1-Norm and with TR

See complete results in Figures vii and viii. On the dataset covtype with C = 100C_Best both the setting quadada and quadada l1 are facing slow convergence issues because of the threshold c₁. Comparing with resada it seems that in early iterations the Newton equation needed to be solved more accurately. Since in Section IV.2 it is not easy to investigate the choice of the thresholdc1 (because of the single color blue) we decided to deepen the analysis on quadada in Section IV.9.

IV.5 Analysis via Conditions for Global Convergence

In the convergence proof from Section I.2 one of the key properties for proving global convergence is obtaining (xi) and (xii). The following combination of the two requirements is called angle condition

−ggg^T_ksss_k≥ckggg_kkksss_kk. (liii) In fact, most of the line-search based optimization methods (see Section 3.2 from Nocedal and Wright (1999)) requires the directionsss_k to satisfy (liii) to ensure global convergence⁴.

4. Note anyway that global convergence is always guaranteed for Algorithm II as proved in Section I.2, so the discussion in this section is only pointing out an undesired issue that might come out in practice.

(15)

This suggests us the idea of analyzing different inner stopping criteria on this value cos(−ggg_k, sss_k) = −ggg^T_ksss_k

kgggkkkssskk. (liv) The term (liv) is the cosine of the angle between the directionsss_k and the anti-gradient−ggg_k. Inequality (liii) is indeed called angle condition because it requires the cosine of the above angle to be greater or equal to a positive c, withc <1. This means that the new direction sss_k cannot be completely orthogonal to the anti-gradient. Otherwise, sss_k is not a descent direction and we cannot use it to decrease the function value. We suspect that a good inner stopping condition should make (liii) more easily satisfied. Detailed experiments in Figures ix and x confirm this result.

The condition (liii) is also useful to investigate the relationship between the regularization parameterC and the difficulty of the optimization problem (i). Many past experiments have shown that whenC is large, the optimization problem is more difficult and slow convergence more frequently occurs Hsia et al. (2017, 2018). In particular, when the Hessian of the objective function is positive definite and we approximatesss_k≈H_k⁻¹ggg_k we have

g

gg^T_ksss_k≈ −ggg^T_kH_k⁻¹ggg_k≤ − kggg_kk² λmax(Hk) ksss_kk ≈ kH_k⁻¹ggg_kk ≤ kggg_kk

λ_min(H_k).

(lv)

Now, from (lv) and (ix) we have

−ggg^T_ksssk≥ M2

M₁kgggkkkssskk. (lvi) This means that when the regularization parameter C is large (e.g., C = 100C_Best), (lvi) shows that the lower bound of the cosine value in (liv) we can derive is smaller. Thus the angle condition (liii) is harder to be satisfied. Note that equation (liii) is a property needed in the convergence proof, which requires the existence of a constant c. In fact, there is no control on the right-hand side of (liii), while what we hope is that the left-hand side of (liii) is large so that it can be easily satisfied. Unfortunately (lvi) shows that for large C,

the left-hand side of (liii) tends to be not that large. In conclusion, even if theoretically convergence is not an issue, the above discussion still gives us a hint on what might actually happen numerically when C is large.

We now show experiments in which we investigate the cosine between the anti-gradient

−ggg_k and the resulting directionsss_k. In Figures ix and x we compare

quadada: ratio= (xlii), η_k= (xliv);

rescons09: ratio= (xlii),ηk= (xliii) c0= 0.9;

resada: ratio= (xl),η_k= (xliv).

Cosines have been calculated until the outer termination criterion (lii) is satisfied with = 10⁻⁴. In Figures ix and x, we present the cosine value at each Newton iteration. Further, two horizontal lines corresponding to cosine = 0.1 and cosine = 0.2 have been drawn. From the comparison between Figures ix-x and i-ii we can make the following observations:

(16)

Algorithms having low cosine values (for example below 0.1) are more likely to converge slower to the solution. In particular, resada has a cosine more frequently below 0.1 when C = 100C_Best. For such configurations, Figures i and ii have shown that it behaves worse in terms of convergence speed.

Long continuous regions of low cosine in Figures ix-x might correspond to regions of slow convergence in Figures i-ii (see for example rescons09 onyahookr andkdd2010a when C= 100CBest, below the 0.2 threshold).

The cosine is often smaller when C is higher. This result confirms the theoretical analysis detailed in this section on the difficulty of solving (i) as C changes.

In Figures i-ii, quadada converges much faster than resada. From Figures ix-x we observe that quadada’s cosine is generally higher than resada’s. This indicates that in the early Newton iterations, the resulting directionsssk of quadada is closer to the anti-gradient. We have mentioned above in this section that this is a desired property.

Since early CG iterates are very close to the anti-gradient, the inner stopping condition in quadada effectively stops the CG procedure early to have this property.

We conclude that in early outer iterations a suitable criterion should avoid over-solving to obtain directions having a higher cosine with respect to the anti-gradient.

IV.6 Preconditioning: Effect of Preconditioning on Residual and Quadratic Ratio

See complete results in Figures xi and xii.

IV.7 Preconditioning: Comparison with Trust Region and Other Rules See complete results in Figures xiii and xiv.

IV.8 Additional Results in the Preconditioned Case: Comparison Between Adaptive and Constant Forcing Sequences

In this section, results in Figures xv and xvi are obtained by comparing

rescons05 p: ratio= (xl), η_k= (xliii), c₀ = 0.5, preconditioned;

resada p: ratio= (xl),ηk= (xliv), preconditioned;

quadcons05 p: ratio= (xlii), η_k= (xliii), c₀ = 0.5, preconditioned;

quadada p: ratio= (xlii), η_k= (xliv), preconditioned.

Also in this case, like in the non-preconditioned one of Section IV.3, the constant forcing sequence and the adaptive one are (almost) always leading to an identical convergence speed.

IV.9 Additional Results in the Preconditioned Case: Selection of the c1

Threshold in the Quadratic Adaptive Rule

In this section, results in Figures xvii and xviii are obtained by comparing

(17)

quadada01: ratio= (xlii),η_k= (xliv),c₁ = 0.1, c₂ = 1, c₃= 0.5;

quadada01 p: ratio= (xlii),η_k= (xliv) c1= 0.1, c2= 1, c3 = 0.5, preconditioned;

quadada: ratio= (xlii), ηk= (xliv), c1= 0.5, c2 = 1, c3 = 0.5;

quadada p: ratio= (xlii), η_k= (xliv), c₁= 0.5, c₂ = 1, c₃ = 0.5 preconditioned.

From Figures xvii and xviii we can make the following observations:

Using a smallerc₁ threshold (c₁= 0.1) is solving the issues encountered by quadada on covtypewith C= 100CBest in Figure viii.

Both the settings quadada01 and quadada01 p are generally slower than quadada and quadada p.

The slow convergence issue encountered by quadada oncovtype with C= 100C_Best in Figure viii is also solved by the use of preconditioning (quadada p).

IV.10 Additional Results in the Preconditioned Case: Comparison Against a Residual Constant Rule with Higher Constant Threshold

In this section, results in Figures xix and xx are obtained by comparing

rescons09 p: ratio= (xl), (xliii), c0 = 0.9, preconditioned;

quadada: ratio= (xlii), η_k= (xliv);

quadada p: ratio= (xlii), ηk= (xliv), preconditioned.

From Figures xix and xx we can make the following observations:

The settings rescons09 p and quadada p are often obtaining a very similar speed of convergence.

There are differences on C = CBest in which rescons09 p is faster than quadada p (rcv1,real-sim) and quadada p is faster than rescons09 p (covtype and criteoon

the fourth line), but the detachment is less than 20 CG steps.

By looking at yahookr with C = C_Best, we can see that rescons09 p is noticeably faster than quadada p (at the third line is around twice faster, by 150 CG steps). The opposite situation is instead happening on yahookrwith C= 100CBest, where being twice faster on the third line means saving around 750 CG steps.

(18)

IV.11 Additional Results in the Preconditioned Case: Comparison Between l2-Norm and l1-Norm

In this section, results in Figures xxi and xxii are obtained by comparing

resada l1: ratio= (xli), η_k= (xlv);

rescons l1 p: ratio= (xli), (xlv), preconditioned;

resada: ratio= (xl),η_k= (xliv);

resada p: ratio= (xl),η_k= (xliv), preconditioned.

From Figures xxi and xxii we can notice that rescons l1 p is often slower than resada p, so also in the preconditioned case there is no reason to employ L1-norm instead of L2-norm.

IV.12 Results for L2-loss SVM

In this section, results in Figures xxiii and xxiv are obtained by employing theξL2 loss as defined in (ii) and we compare:

tr rescons: ratio= (xl),η_k= (xliii),c₀= 0.1, Trust Region instead of line search;

resada: ratio= (xl),ηk= (xliv);

quadada: ratio= (xlii), η_k= (xliv).

IV.13 Results for L2-loss SVM in the Preconditioned Case

In this section, results in Figures xxv and xxvi are obtained by employing the ξ_L2 loss as defined in (ii) and we compare:

tr rescons p: ratio= (xl), η_k= (xliii), c0 = 0.1, Trust Region instead of line search, preconditioned;

rescons09 p: ratio= (xl), η_k= (xliii), c0 = 0.9, preconditioned;

resada p: ratio= (xl),η_k= (xliv), preconditioned;

quadada p: ratio= (xlii), η_k= (xliv), preconditioned.

IV.14 Comparison with Other State-of-the-Art Software

In this section we are comparing the proposed TNCG implementation with the two state- of-the-art software SAG (Schmidt et al., 2017) and SAGA (Defazio et al., 2014). A handy implementation of the two methods is provided by Scikit-learn. All the settings are the same described in Footnote 1 of Section IV, except for the fact that for training a LogisticRegression model with SAG or SAGA the user should select the sag or saga option for the parametersolver. Few lines of python/cython code have been added in the Scikit implementation of SAG/SAGA to compute the function value and measure the CPU

(19)

time (with the function time.time()of the library time). This additional computation has not been considered in the total time employed. On the other side, in the case of the LIBLINEAR implementation the time has been measured with the function clock()of the librarytime.h. For both the implementations the starting time has been taken immediately before the main optimization cycle. Experiments are conducted on a machine with an AMD EPYC 7401 CPU and 128 GB of memory. Results in Figures xxvii and xxviii are obtained by comparing quadada p (ratio= (xlii),η_k= (xliv), preconditioned), SAG and SAGA.

Appendix A. Gradient and Hessian Details of the Losses from (ii) LetX = [xxx₁, . . . , xxx_l]^T be the data matrix.

A.1 Logistic Loss

In this section we compute the gradient and the Hessian of f when the loss employed isξ_LR. Letσ(t) := (1 +e^−t)⁻¹.

The gradient of f is

∇f(www) =www+C

l

X

i=1

(σ(yiwww^Txxxi)−1)yixxxi

The Hessian off is

∇²f(www) =I+CX^TDX

with Da diagonal matrix such that D_ii(www) =σ(y_iwww^Txxx_i)(1−σ(y_iwww^Txxx_i)).

A.2 L2-loss SVM

In this section we compute the gradient and the Hessian of f when the loss employed is ξL2. Letyyy= [y₁, . . . , y_l]^T. The gradient off is

∇f(www) =www+ 2CXI,:T(XI,:www−yyyI), where I=

i∈ {1, . . . l}|1−y_iwww^Txxx_i >0 . Now we define

φ(t) = max(0,1−t)², φ⁰(t) =

(−2(1−t) if 1−t >0,

0 otherwise,

∂Bφ⁰(t) =







2 if 1−t >0, 0 if 1−t <0, {0,2} if 1−t= 0,

∂φ⁰(t) = conv{∂_Bφ⁰(t)}=







2 if 1−t >0, 0 if 1−t <0, µwith µ∈[0,2] if 1−t= 0,

where conv{·} is the convex hull of a set. Therefore, the generalized Hessian off is

∂∇f =I+CX^TDX, with Da diagonal matrix such that Dii(www) =∂φ⁰(yiwww^Txxxi).

(20)

An easy way to calculate∂∇f in practice is to replace the originalD_ii(www) with the following D¯_ii(www) =

(2, if 1−yiwww^Txxxi>0, 0, otherwise.

A.3 Cost of Armijo, Wolfe and Exact Line Searches

Here we recall the cost of Armijo (Section 2.1 of Hsia et al. (2017)) and exact line search (Section 3.3 of Hsia et al. (2017)), and we show how to obtain the same save of operations

even for Wolfe line search.

Exact requires O(l) operations for each trialω,

Armijo requiresO(l) operations for each trial ω,

Wolfe requiresO(l) operations for each trial ω.

To show the trick in computing the Wolfe condition, we first recall the same trick for Armijo (vi). For computing f at each ω, we can store eachxxx^T_iwww_k,xxx^T_isss_kwww^T_ksss_k,www_k^Twww_k andsss_k^Tsss_k respectively in these scalars x¯_iw,x¯_is,ws,¯ ww¯ and ¯ss and computef at (www_k+ωsss_k) in the following way

f(www_k+ωsss_k) = 1

2(www_k+ωsss_k)^T(www_k+ωsss_k) +C

l

X

i=1

ξ(y_i(www_k+ωsss_k)^Txxx_i)

= 1

2( ¯ww+ 2ωws¯ +ω²ss) +¯ C

l

X

i=1

ξ(yi( ¯xiw+ωx¯is)).

Thus the cost of Armijo line search isO(l)×(line search steps). The cost of Wolfe condition (vii) is also O(l)×(line search steps), because we don’t have to compute ∇f(wwwk+ωsssk) at each ω, but we can directly compute∇f(www_k+ωsss_k)^Tsss_k. Thus, again by storing everything as above, we have:

∇f(wwwk+ωsssk)^Tsssk =www^T_ksssk+ωsss^T_ksssk+C

l

X

i=1

ξ⁰(yi(wwwk+ωsssk)^Txxxi)·yi·xxx^T_isssk

= ¯ww+ωss¯ +C

l

X

i=1

ξ⁰(y_i( ¯x_iw+ωx¯_is))·y_i·x¯_is.

(21)

Symbol Meaning

n number of instances

l number of features

k outer iteration index (TN)

j inner step index (CG)

i index of the dataset instance w vector of weights (∈Rⁿ)

f objective function

g or∇f gradient of f (∈Rⁿ) H or ∇²f Hessian of f (∈R^(n×n))

y label

x feature vector (∈R^l)

ξ loss function

X matrix vector (∈R^(l×n))

s TN direction (∈Rⁿ)

C regularization parameter Q_k(·) quadratic approximation off

I identity matrix

η forcing sequence

ω step length

γ constant of Armijo line search

L level set

d d

d conjugate direction (∈Rⁿ)

B a positive definite matrix (∈R^(n×n)) M₁, M₂ bounding constants

r Newton equation residual (∈Rⁿ)

∂_B∇f B-subdifferential of ∇f

∂∇f generalized Hessian of f k · k l2-norm

k · k₁ l1-norm

c₀, c₁, c₂, c₃ constants in the forcing sequences constant of the outer stopping condition M preconditioning matrix (∈R^(n×n)) E a factor matrix of M =EE^T (∈R^(n×n)) ˆ

s preconditioned TN direction (∈Rⁿ) rˆ preconditioned residual (∈Rⁿ) ˆ

g preconditioned gradient (∈Rⁿ) Hˆ preconditioned Hessian (∈R^(n×n)) k · k_A norm induced by the matrix A λ_min(·), λ_max(·) min and max eigenvalue

CBest C achieving the best validation accuracy N(w, δ) neighbor of w with radius δ

Table I: Notation table.