Supplementary Materials for
“A Study on Conjugate Gradient Stopping Criteria in Truncated Newton Methods for Linear Classification”
For our experiments we use the two-class classification data sets listed in Table I, which also gives the statistics of each set. Except for yahoojp and yahookr, all of them are available from LIBSVM Data Sets (2007).
Data set     #instances     #features      density      log2(C_best), LR   log2(C_best), L2
HIGGS        11,000,000     28             92.1057%      -6                 -12
w8a          49,749         300             3.8834%       8                   2
rcv1         20,242         47,236          0.1568%       6                   0
real-sim     72,309         20,958          0.2448%       3                  -1
news20       19,996         1,355,191       0.0336%       9                   3
url          2,396,130      3,231,962       0.0036%      -7                 -10
yahoojp      176,203        832,026         0.0160%       3                  -1
yahookr      460,554        3,052,939       0.0111%       6                   1
webspam      350,000        16,609,143      0.0220%       2                  -3
kdda         8,407,752      20,216,831      0.0001%      -3                  -5
kddb         19,264,097     29,890,095      0.0001%      -1                  -4
criteo       45,840,617     1,000,000       0.0039%     -15                 -12
kdd12        149,639,105    54,686,452      0.00002%     -4                 -11
Table I: Data statistics. The density is the average ratio of non-zero feature values per instance. C_best is the regularization parameter selected by cross validation, reported as log2(C_best) for the LR loss and the L2 loss.
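The C_best values in Table I are obtained by a cross-validation search over powers of two. As an illustration only, the Python sketch below shows one way to run such a search with scikit-learn; the file name, grid range, five-fold setting, and choice of solver are our assumptions and are not taken from the paper.

import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Load a data set stored in LIBSVM (svmlight) format.
# "rcv1_train.binary" is a placeholder file name.
X, y = load_svmlight_file("rcv1_train.binary")

# Search C over powers of two, matching the log2(C_best) columns of Table I.
param_grid = {"C": [2.0 ** k for k in range(-16, 11)]}

# L2-regularized logistic regression solved by a Newton-CG method,
# similar in spirit to (but not identical with) the solver studied in the paper.
base = LogisticRegression(penalty="l2", solver="newton-cg", max_iter=1000)

search = GridSearchCV(base, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X, y)

print("log2(C_best) =", np.log2(search.best_params_["C"]))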
I. Additional Experimental Results
In this section we present the complete results for linear classification problems with either
• the logistic regression (LR) loss or
• the L2 loss;
the two objective functions are recalled below.
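Throughout, "LR" and "L2" refer to the two standard L2-regularized objectives; we state them here in the usual LIBLINEAR-style form as a reminder of the common convention rather than as a quotation of the paper's exact notation. Given training instances (x_i, y_i), i = 1, ..., l, with y_i ∈ {−1, +1} and regularization parameter C,

  f_{LR}(w) = \frac{1}{2} w^{\top} w + C \sum_{i=1}^{l} \log\bigl(1 + e^{-y_i w^{\top} x_i}\bigr),

  f_{L2}(w) = \frac{1}{2} w^{\top} w + C \sum_{i=1}^{l} \max\bigl(0,\, 1 - y_i w^{\top} x_i\bigr)^{2}.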
References

LIBSVM Data Sets (2007). https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets.
[Figure i: panels (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12; each plots (f − f*)/f* on a log scale against the cumulative number of CG iterations, with curves ResConst01, ResConst09, QuadConst01, and QuadConst09.]
Figure i: A comparison of different constant thresholds in the inner stopping condition. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations. Loss = LR and C = C_best in each sub-figure.
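In Figures i and ii, ResConst and QuadConst denote two families of inner (CG) stopping tests with a constant threshold, evaluated at σ = 0.1 and σ = 0.9. As a rough illustration of where such tests enter a truncated Newton step, the sketch below uses the classical residual test ||r_i|| ≤ σ ||∇f(w_k)|| and a Nash-style quadratic-model test; the paper's exact definitions of these rules may differ in detail.

import numpy as np

def truncated_cg(H_mv, g, sigma=0.1, rule="residual", max_iter=250):
    """Conjugate gradient for the Newton system H s = -g, truncated early.

    H_mv  : function returning the Hessian-vector product H @ v
    g     : current gradient grad f(w_k)
    sigma : constant threshold (0.1 or 0.9, as in ResConst01/09, QuadConst01/09)
    rule  : "residual"  -> stop when ||r_i|| <= sigma * ||g||
            "quadratic" -> stop when i * (Q_i - Q_{i-1}) / Q_i <= sigma,
                           where Q(s) = g^T s + 0.5 s^T H s (Nash-style test).
    """
    n = g.shape[0]
    s = np.zeros(n)
    r = -g.copy()            # residual b - H s with b = -g and s = 0
    d = r.copy()
    gnorm = np.linalg.norm(g)
    Q_prev = 0.0             # Q(s_0) = 0 for s_0 = 0
    for i in range(1, max_iter + 1):
        Hd = H_mv(d)
        alpha = (r @ r) / (d @ Hd)
        s = s + alpha * d
        r_new = r - alpha * Hd
        if rule == "residual":
            # ResConst: stop once the residual is small relative to ||grad||.
            if np.linalg.norm(r_new) <= sigma * gnorm:
                break
        else:
            # QuadConst: stop once the per-iteration relative decrease of the
            # quadratic model becomes small.  (Real implementations update Q
            # incrementally; the extra Hessian-vector product here just keeps
            # the sketch short.)
            Q = g @ s + 0.5 * (s @ H_mv(s))
            if Q < 0 and i * (Q - Q_prev) / Q <= sigma:
                break
            Q_prev = Q
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d
        r = r_new
    return s

Here H_mv can be any symmetric positive definite Hessian-vector product, e.g. lambda v: A @ v for an explicit matrix A.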
[Figure ii: panels (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12; each plots (f − f*)/f* on a log scale against the cumulative number of CG iterations, with curves ResConst01, ResConst09, QuadConst01, and QuadConst09.]
Figure ii: A comparison of different constant thresholds in the inner stopping condition. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations. Loss = LR and C = 100C_best in each sub-figure.
[Figure iii: panels (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12; each plots (f − f*)/f* on a log scale against the cumulative number of CG iterations, with curves ResConst09, ResAdaL2, QuadConst09, and QuadAdaL2.]
Figure iii: A comparison between adaptive and constant forcing sequences. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations. Loss = LR and C = C_best in each sub-figure.
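The "Ada" rules in Figures iii and iv replace the constant threshold σ with an adaptive forcing sequence that tightens the inner solve as the outer iterates converge. Purely as an illustration of the idea (the adaptive rules evaluated in the paper may be defined differently), the sketch below implements one widely used Eisenstat-Walker-style choice of the forcing term.

import numpy as np

def adaptive_forcing(grad_norm, prev_grad_norm, prev_eta,
                     gamma=0.9, alpha=2.0, eta_max=0.9):
    """Eisenstat-Walker-style forcing term: the inner CG solve at outer
    iteration k is truncated once ||r_i|| <= eta_k * ||grad f(w_k)||.
    Shown only as an example of an adaptive (vs. constant) threshold."""
    eta = gamma * (grad_norm / prev_grad_norm) ** alpha
    # Standard safeguard: if the previous threshold was not yet small,
    # prevent eta from shrinking too abruptly.
    if gamma * prev_eta ** alpha > 0.1:
        eta = max(eta, gamma * prev_eta ** alpha)
    return min(eta, eta_max)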
[Figure iv: panels (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12; each plots (f − f*)/f* on a log scale against the cumulative number of CG iterations, with curves ResConst09, ResAdaL2, QuadConst09, and QuadAdaL2.]
Figure iv: A comparison between adaptive and constant forcing sequences. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations. Loss = LR and C = 100C_best in each sub-figure.
[Figure v: panels (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12; each plots (f − f*)/f* on a log scale against the cumulative number of CG iterations, with curves ResAdaL1, ResAdaL2, QuadAdaL2, QuadAdaL1, and TR.]
Figure v: An investigation on the robustness of adaptive rules and a comparison with the trust region approach. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations. Loss = LR and C = C_best in each sub-figure.
[Figure vi: panels (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12; each plots (f − f*)/f* on a log scale against the cumulative number of CG iterations, with curves ResAdaL1, ResAdaL2, QuadAdaL2, QuadAdaL1, and TR.]
Figure vi: An investigation on the robustness of adaptive rules and a comparison with the trust region approach. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations. Loss = LR and C = 100C_best in each sub-figure.
[Figure vii: panels (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12; each plots the cosine measure on a log scale against the cumulative number of CG iterations, with curves QuadAdaL2, ResConst09, and ResAdaL2.]
Figure vii: Cosines between the anti-gradient −∇f(w_k) and the resulting direction s_k, using different truncation rules. Iterations in which the cosine is below the 0.1 threshold are marked. The x-axis shows the cumulative number of CG iterations. Loss = LR and C = C_best in each sub-figure.
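The quantity plotted in Figures vii and viii is the cosine of the angle between the computed direction s_k and the anti-gradient −∇f(w_k); a small cosine means the direction is nearly orthogonal to the steepest-descent direction, which typically forces small line-search steps. A minimal computation of this measure (our own illustration) is:

import numpy as np

def descent_cosine(grad, s, threshold=0.1):
    """Cosine between the anti-gradient -grad f(w_k) and the direction s_k.
    Returns the cosine and whether it falls below the flagging threshold
    used in Figures vii and viii."""
    cos = -(grad @ s) / (np.linalg.norm(grad) * np.linalg.norm(s))
    return cos, cos < threshold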
[Figure viii: panels (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12; each plots the cosine measure on a log scale against the cumulative number of CG iterations, with curves QuadAdaL2, ResConst09, and ResAdaL2.]
Figure viii: Cosines between the anti-gradient −∇f(w_k) and the resulting direction s_k, using different truncation rules. Iterations in which the cosine is below the 0.1 threshold are marked. The x-axis shows the cumulative number of CG iterations. Loss = LR and C = 100C_best in each sub-figure.
[Figure ix: panels (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12; each plots (f − f*)/f* on a log scale against the cumulative number of CG iterations, with curves ResConst09_l2, ResAdaL2_l2, QuadConst09_l2, QuadAdaL2_l2, and ResConst01_l2.]
Figure ix: A comparison between adaptive and constant forcing sequences. We show the convergence of a truncated Newton method for linear SVM. The x-axis shows the cumulative number of CG iterations. Loss = L2 and C = C_best in each sub-figure.
[Figure x: panels (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, plus one further panel; each plots (f − f*)/f* on a log scale against the cumulative number of CG iterations, with curves ResConst09_l2, ResAdaL2_l2, QuadConst09_l2, QuadAdaL2_l2, and ResConst01_l2. Caption truncated in this excerpt.]