
Supplementary Materials for

“A Study on Conjugate Gradient Stopping Criteria in Truncated Newton Methods for Linear Classification”

For our experiments we use the two-class classification data sets shown in Table I. Except for yahookr and yahoojp, all of them are available from the LIBSVM Data Sets (2007). Table I lists the statistics of each data set.

Data set      #instances     #features     density      log2(C_best)
                                                        LR      L2
HIGGS         11,000,000     28            92.1057%     -6      -12
w8a           49,749         300           3.8834%       8       2
rcv1          20,242         47,236        0.1568%       6       0
real-sim      72,309         20,958        0.2448%       3      -1
news20        19,996         1,355,191     0.0336%       9       3
url           2,396,130      3,231,962     0.0036%      -7      -10
yahoojp       176,203        832,026       0.0160%       3      -1
yahookr       460,554        3,052,939     0.0111%       6       1
webspam       350,000        16,609,143    0.0220%       2      -3
kdda          8,407,752      20,216,831    0.0001%      -3      -5
kddb          19,264,097     29,890,095    0.0001%      -1      -4
criteo        45,840,617     1,000,000     0.0039%      -15     -12
kdd12         149,639,105    54,686,452    0.00002%     -4      -11

Table I: Data statistics. The density is the average percentage of non-zero features per instance. C_best is the regularization parameter selected by cross validation.
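For reference, the density column of Table I can be reproduced directly from a data file in LIBSVM format. The following is a minimal sketch, assuming scikit-learn is available; "rcv1_train.binary" is a placeholder path for a file downloaded from the LIBSVM Data Sets page, not something prescribed by the paper.

    import numpy as np
    from sklearn.datasets import load_svmlight_file

    # Load a data set in LIBSVM format; X is a sparse CSR matrix, y the labels.
    X, y = load_svmlight_file("rcv1_train.binary")

    n_instances, n_features = X.shape
    # Density as reported in Table I: the average percentage of non-zero
    # features per instance, i.e. nnz / (l * n) * 100.
    density = 100.0 * X.nnz / (n_instances * n_features)
    print(f"{n_instances} instances, {n_features} features, "
          f"density = {density:.4f}%")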

I. Additional Experimental Results

In this section we present complete results for linear classification problems with either

• LR-loss (logistic regression) or

• L2-loss (squared hinge).

The figures compare several inner (CG) stopping rules; a sketch of how such rules enter a truncated Newton solver is given below.
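The sketch below is a minimal illustration of where the inner stopping condition sits inside a truncated Newton iteration; it is not the paper's implementation. The parameter eta plays the role of the constant threshold (the 01/09 suffixes in the figure legends suggest 0.1 and 0.9), the residual test ||r|| <= eta*||grad f(w)|| is the standard Res-style rule, and the quadratic-decrease test shown is a Martens-style stand-in for the Quad rules, whose exact form is defined in the main paper. The callables grad and hess_vec are assumed to be supplied by the user.

    import numpy as np

    def truncated_newton(grad, hess_vec, w0, eta=0.1, rule="res",
                         outer_tol=1e-6, max_outer=100, max_cg=250):
        """Truncated Newton: each outer step solves H s = -g inexactly by CG.

        rule="res":  stop CG once ||r_i|| <= eta * ||g||          (Res-style)
        rule="quad": stop CG once the quadratic model
                     Q(s) = 0.5 * s'Hs + g's decreases too slowly
                     (a Martens-style stand-in for the Quad rules).
        """
        w = np.asarray(w0, dtype=float).copy()
        for _ in range(max_outer):
            g = grad(w)
            g_norm = np.linalg.norm(g)
            if g_norm <= outer_tol:
                break
            s = np.zeros_like(w)
            r = -g.copy()                  # residual of H s = -g at s = 0
            d = r.copy()
            q_prev = 0.0                   # Q(0) = 0
            for i in range(1, max_cg + 1):
                Hd = hess_vec(w, d)
                alpha = (r @ r) / (d @ Hd)
                s = s + alpha * d
                r_new = r - alpha * Hd
                # Quadratic model value; a real implementation updates Q
                # incrementally instead of re-evaluating s'Hs.
                q = 0.5 * (s @ hess_vec(w, s)) + g @ s
                if rule == "res" and np.linalg.norm(r_new) <= eta * g_norm:
                    break
                # (q - q_prev)/q is the fraction of the total model decrease
                # contributed by this CG step (numerator and q are both <= 0).
                if rule == "quad" and i * (q - q_prev) / q <= eta:
                    break
                d = r_new + ((r_new @ r_new) / (r @ r)) * d
                r, q_prev = r_new, q
            w = w + s   # a real solver adds a line search or TR safeguard here
        return w

For the problems of Table I, the objective is f(w) = w'w/2 + C * sum_i xi(w; x_i, y_i); with the LR loss the Hessian is I + C X'DX for a diagonal D, so hess_vec can be implemented with two sparse matrix-vector products per call and never forms the Hessian explicitly.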

References

LIBSVM Data Sets, 2007. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets.

[Plots omitted. Sub-figures: (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. Curves: ResConst01, ResConst09, QuadConst01, QuadConst09.]

Figure i: A comparison of different constant thresholds in the inner stopping condition. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations; the y-axis shows (f − f*)/f*. Loss=LR and C = C_best in each sub-figure.
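All convergence figures use the same y-axis quantity, the relative difference to the optimal function value. A one-line helper makes the definition explicit; f_star stands for a reference optimum f*, assumed here to come from a separate long, high-accuracy run.

    import numpy as np

    def relative_diff(f_vals, f_star):
        """Y-axis of the convergence figures: (f - f*) / f* per outer iterate.

        f_vals : array of objective values f(w_k) along the run
        f_star : reference optimum, e.g. from a long high-accuracy run
        """
        f_vals = np.asarray(f_vals, dtype=float)
        return (f_vals - f_star) / f_star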

[Plots omitted. Sub-figures: (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. Curves: ResConst01, ResConst09, QuadConst01, QuadConst09.]

Figure ii: A comparison of different constant thresholds in the inner stopping condition. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations; the y-axis shows (f − f*)/f*. Loss=LR and C = 100C_best in each sub-figure.

[Plots omitted. Sub-figures: (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. Curves: ResConst09, ResAdaL2, QuadConst09, QuadAdaL2.]

Figure iii: A comparison between adaptive and constant forcing sequences. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations; the y-axis shows (f − f*)/f*. Loss=LR and C = C_best in each sub-figure.
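The Ada curves above use adaptive forcing sequences: the inner threshold η_k changes across outer iterations instead of staying constant. The exact ResAda/QuadAda rules are defined in the main paper; purely as a generic illustration of the idea, the following sketch implements the classical Eisenstat–Walker "choice 2" forcing term, which tightens η_k as the gradient norm decreases.

    def eisenstat_walker_eta(g_norm, g_norm_prev, eta_prev,
                             gamma=0.9, alpha=2.0, eta_max=0.9):
        """Eisenstat-Walker 'choice 2' forcing term for inexact Newton.

        eta_k = gamma * (||g_k|| / ||g_{k-1}||)^alpha, safeguarded so the
        sequence cannot tighten too abruptly or exceed eta_max.
        """
        eta = gamma * (g_norm / g_norm_prev) ** alpha
        safeguard = gamma * eta_prev ** alpha
        if safeguard > 0.1:          # apply the safeguard only when relevant
            eta = max(eta, safeguard)
        return min(eta, eta_max)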

[Plots omitted. Sub-figures: (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. Curves: ResConst09, ResAdaL2, QuadConst09, QuadAdaL2.]

Figure iv: A comparison between adaptive and constant forcing sequences. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations; the y-axis shows (f − f*)/f*. Loss=LR and C = 100C_best in each sub-figure.

[Plots omitted. Sub-figures: (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. Curves: ResAdaL1, ResAdaL2, QuadAdaL2, QuadAdaL1, TR.]

Figure v: An investigation of the robustness of adaptive rules and a comparison with the trust region approach. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations; the y-axis shows (f − f*)/f*. Loss=LR and C = C_best in each sub-figure.

[Plots omitted. Sub-figures: (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. Curves: ResAdaL1, ResAdaL2, QuadAdaL2, QuadAdaL1, TR.]

Figure vi: An investigation of the robustness of adaptive rules and a comparison with the trust region approach. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations; the y-axis shows (f − f*)/f*. Loss=LR and C = 100C_best in each sub-figure.

[Plots omitted. Sub-figures: (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. Curves: QuadAdaL2, ResConst09, ResAdaL2.]

Figure vii: Cosines between the anti-gradient −∇f(w_k) and the resulting direction s_k, using different truncation rules. Iterations in which the cosine falls below the 0.1 threshold are marked with a cross. The x-axis shows the cumulative number of CG iterations; the y-axis shows the cosine. Loss=LR and C = C_best in each sub-figure.
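Figures vii and viii report, for each outer iteration, the cosine of the angle between the steepest-descent direction −∇f(w_k) and the step s_k returned by truncated CG; values near 1 mean the truncated step still points strongly downhill. A minimal helper (illustrative only):

    import numpy as np

    def descent_cosine(g, s):
        """Cosine between the anti-gradient -g = -grad f(w_k) and step s_k.

        Values close to 1 mean s_k is nearly parallel to steepest descent;
        values below a threshold such as 0.1 flag a poorly aligned step.
        """
        return float(-g @ s / (np.linalg.norm(g) * np.linalg.norm(s)))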

[Plots omitted. Sub-figures: (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. Curves: QuadAdaL2, ResConst09, ResAdaL2.]

Figure viii: Cosines between the anti-gradient −∇f(w_k) and the resulting direction s_k, using different truncation rules. Iterations in which the cosine falls below the 0.1 threshold are marked with a cross. The x-axis shows the cumulative number of CG iterations; the y-axis shows the cosine. Loss=LR and C = 100C_best in each sub-figure.

[Plots omitted. Sub-figures: (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. Curves: ResConst09_l2, ResAdaL2_l2, QuadConst09_l2, QuadAdaL2_l2, ResConst01_l2.]

Figure ix: A comparison between adaptive and constant forcing sequences. We show the convergence of a truncated Newton method for linear SVM. The x-axis shows the cumulative number of CG iterations; the y-axis shows (f − f*)/f*. Loss=L2 and C = C_best in each sub-figure.

[Plots omitted. Sub-figures: (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. Curves: ResConst09_l2, ResAdaL2_l2, QuadConst09_l2, QuadAdaL2_l2, ResConst01_l2.]

Figure x: A comparison between adaptive and constant forcing sequences. We show the convergence of a truncated Newton method for linear SVM. The x-axis shows the cumulative number of CG iterations; the y-axis shows (f − f*)/f*. Loss=L2 and C = 100C_best in each sub-figure.
