
Supplementary Materials for

“A Study on Conjugate Gradient Stopping Criteria in Truncated Newton Methods for Linear Classification”

For our experiments we use the two-class classification data sets shown in Table I. Except for yahookr and yahoojp, all of them are available from the LIBSVM Data Sets (2007). Table I lists the statistics of each data set.

Data set      #instances     #features     density      log2(C_best)
                                                        LR      L2
HIGGS         11,000,000     28            92.1057%     -6      -12
w8a           49,749         300           3.8834%       8       2
rcv1          20,242         47,236        0.1568%       6       0
real-sim      72,309         20,958        0.2448%       3      -1
news20        19,996         1,355,191     0.0336%       9       3
url           2,396,130      3,231,962     0.0036%      -7      -10
yahoojp       176,203        832,026       0.0160%       3      -1
yahookr       460,554        3,052,939     0.0111%       6       1
webspam       350,000        16,609,143    0.0220%       2      -3
kdda          8,407,752      20,216,831    0.0001%      -3      -5
kddb          19,264,097     29,890,095    0.0001%      -1      -4
criteo        45,840,617     1,000,000     0.0039%      -15     -12
kdd12         149,639,105    54,686,452    0.00002%     -4      -11

Table I: Data statistics. The density is the average percentage of non-zero features per instance. C_best is the regularization parameter selected by cross validation.
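For reference, the density column of Table I can be reproduced directly from a data file in LIBSVM format. The following is a minimal sketch, assuming scikit-learn is available; "rcv1_train.binary" is a placeholder path for a file downloaded from the LIBSVM Data Sets page, not something prescribed by the paper.

    import numpy as np
    from sklearn.datasets import load_svmlight_file

    # Load a data set in LIBSVM format; X is a sparse CSR matrix, y the labels.
    X, y = load_svmlight_file("rcv1_train.binary")

    n_instances, n_features = X.shape
    # Density as reported in Table I: the average percentage of non-zero
    # features per instance, i.e. nnz / (l * n) * 100.
    density = 100.0 * X.nnz / (n_instances * n_features)
    print(f"{n_instances} instances, {n_features} features, "
          f"density = {density:.4f}%")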

I. Additional Experimental Results

In this section we present complete results for linear classification problems with either

• LR-loss (logistic regression) or

• L2-loss (squared hinge).

The figures compare several inner (CG) stopping rules; a sketch of how such rules enter a truncated Newton solver is given below.
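The sketch below is a minimal illustration of where the inner stopping condition sits inside a truncated Newton iteration; it is not the paper's implementation. The parameter eta plays the role of the constant threshold (the 01/09 suffixes in the figure legends suggest 0.1 and 0.9), the residual test ||r|| <= eta*||grad f(w)|| is the standard Res-style rule, and the quadratic-decrease test shown is a Martens-style stand-in for the Quad rules, whose exact form is defined in the main paper. The callables grad and hess_vec are assumed to be supplied by the user.

    import numpy as np

    def truncated_newton(grad, hess_vec, w0, eta=0.1, rule="res",
                         outer_tol=1e-6, max_outer=100, max_cg=250):
        """Truncated Newton: each outer step solves H s = -g inexactly by CG.

        rule="res":  stop CG once ||r_i|| <= eta * ||g||          (Res-style)
        rule="quad": stop CG once the quadratic model
                     Q(s) = 0.5 * s'Hs + g's decreases too slowly
                     (a Martens-style stand-in for the Quad rules).
        """
        w = np.asarray(w0, dtype=float).copy()
        for _ in range(max_outer):
            g = grad(w)
            g_norm = np.linalg.norm(g)
            if g_norm <= outer_tol:
                break
            s = np.zeros_like(w)
            r = -g.copy()                  # residual of H s = -g at s = 0
            d = r.copy()
            q_prev = 0.0                   # Q(0) = 0
            for i in range(1, max_cg + 1):
                Hd = hess_vec(w, d)
                alpha = (r @ r) / (d @ Hd)
                s = s + alpha * d
                r_new = r - alpha * Hd
                # Quadratic model value; a real implementation updates Q
                # incrementally instead of re-evaluating s'Hs.
                q = 0.5 * (s @ hess_vec(w, s)) + g @ s
                if rule == "res" and np.linalg.norm(r_new) <= eta * g_norm:
                    break
                # (q - q_prev)/q is the fraction of the total model decrease
                # contributed by this CG step (numerator and q are both <= 0).
                if rule == "quad" and i * (q - q_prev) / q <= eta:
                    break
                d = r_new + ((r_new @ r_new) / (r @ r)) * d
                r, q_prev = r_new, q
            w = w + s   # a real solver adds a line search or TR safeguard here
        return w

For the problems of Table I, the objective is f(w) = w'w/2 + C * sum_i xi(w; x_i, y_i); with the LR loss the Hessian is I + C X'DX for a diagonal D, so hess_vec can be implemented with two sparse matrix-vector products per call and never forms the Hessian explicitly.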

References

LIBSVM Data Sets, 2007. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets.

[Plots omitted. Sub-figures: (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. Curves: ResConst01, ResConst09, QuadConst01, QuadConst09.]

Figure i: A comparison of different constant thresholds in the inner stopping condition. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations; the y-axis shows (f − f*)/f*. Loss=LR and C = C_best in each sub-figure.
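All convergence figures use the same y-axis quantity, the relative difference to the optimal function value. A one-line helper makes the definition explicit; f_star stands for a reference optimum f*, assumed here to come from a separate long, high-accuracy run.

    import numpy as np

    def relative_diff(f_vals, f_star):
        """Y-axis of the convergence figures: (f - f*) / f* per outer iterate.

        f_vals : array of objective values f(w_k) along the run
        f_star : reference optimum, e.g. from a long high-accuracy run
        """
        f_vals = np.asarray(f_vals, dtype=float)
        return (f_vals - f_star) / f_star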

[Plots omitted. Sub-figures: (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. Curves: ResConst01, ResConst09, QuadConst01, QuadConst09.]

Figure ii: A comparison of different constant thresholds in the inner stopping condition. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations; the y-axis shows (f − f*)/f*. Loss=LR and C = 100C_best in each sub-figure.

[Plots omitted. Sub-figures: (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. Curves: ResConst09, ResAdaL2, QuadConst09, QuadAdaL2.]

Figure iii: A comparison between adaptive and constant forcing sequences. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations; the y-axis shows (f − f*)/f*. Loss=LR and C = C_best in each sub-figure.
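The Ada curves above use adaptive forcing sequences: the inner threshold η_k changes across outer iterations instead of staying constant. The exact ResAda/QuadAda rules are defined in the main paper; purely as a generic illustration of the idea, the following sketch implements the classical Eisenstat–Walker "choice 2" forcing term, which tightens η_k as the gradient norm decreases.

    def eisenstat_walker_eta(g_norm, g_norm_prev, eta_prev,
                             gamma=0.9, alpha=2.0, eta_max=0.9):
        """Eisenstat-Walker 'choice 2' forcing term for inexact Newton.

        eta_k = gamma * (||g_k|| / ||g_{k-1}||)^alpha, safeguarded so the
        sequence cannot tighten too abruptly or exceed eta_max.
        """
        eta = gamma * (g_norm / g_norm_prev) ** alpha
        safeguard = gamma * eta_prev ** alpha
        if safeguard > 0.1:          # apply the safeguard only when relevant
            eta = max(eta, safeguard)
        return min(eta, eta_max)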

[Plots omitted. Sub-figures: (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. Curves: ResConst09, ResAdaL2, QuadConst09, QuadAdaL2.]

Figure iv: A comparison between adaptive and constant forcing sequences. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations; the y-axis shows (f − f*)/f*. Loss=LR and C = 100C_best in each sub-figure.

[Plots omitted. Sub-figures: (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. Curves: ResAdaL1, ResAdaL2, QuadAdaL2, QuadAdaL1, TR.]

Figure v: An investigation of the robustness of adaptive rules and a comparison with the trust region approach. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations; the y-axis shows (f − f*)/f*. Loss=LR and C = C_best in each sub-figure.

[Plots omitted. Sub-figures: (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. Curves: ResAdaL1, ResAdaL2, QuadAdaL2, QuadAdaL1, TR.]

Figure vi: An investigation of the robustness of adaptive rules and a comparison with the trust region approach. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations; the y-axis shows (f − f*)/f*. Loss=LR and C = 100C_best in each sub-figure.

[Plots omitted. Sub-figures: (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. Curves: QuadAdaL2, ResConst09, ResAdaL2.]

Figure vii: Cosines between the anti-gradient −∇f(w_k) and the resulting direction s_k, using different truncation rules. Iterations in which the cosine falls below the 0.1 threshold are marked with a cross. The x-axis shows the cumulative number of CG iterations; the y-axis shows the cosine. Loss=LR and C = C_best in each sub-figure.
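Figures vii and viii report, for each outer iteration, the cosine of the angle between the steepest-descent direction −∇f(w_k) and the step s_k returned by truncated CG; values near 1 mean the truncated step still points strongly downhill. A minimal helper (illustrative only):

    import numpy as np

    def descent_cosine(g, s):
        """Cosine between the anti-gradient -g = -grad f(w_k) and step s_k.

        Values close to 1 mean s_k is nearly parallel to steepest descent;
        values below a threshold such as 0.1 flag a poorly aligned step.
        """
        return float(-g @ s / (np.linalg.norm(g) * np.linalg.norm(s)))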

[Plots omitted. Sub-figures: (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. Curves: QuadAdaL2, ResConst09, ResAdaL2.]

Figure viii: Cosines between the anti-gradient −∇f(w_k) and the resulting direction s_k, using different truncation rules. Iterations in which the cosine falls below the 0.1 threshold are marked with a cross. The x-axis shows the cumulative number of CG iterations; the y-axis shows the cosine. Loss=LR and C = 100C_best in each sub-figure.

[Plots omitted. Sub-figures: (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. Curves: ResConst09_l2, ResAdaL2_l2, QuadConst09_l2, QuadAdaL2_l2, ResConst01_l2.]

Figure ix: A comparison between adaptive and constant forcing sequences. We show the convergence of a truncated Newton method for linear SVM. The x-axis shows the cumulative number of CG iterations; the y-axis shows (f − f*)/f*. Loss=L2 and C = C_best in each sub-figure.

[Plots omitted. Sub-figures: (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. Curves: ResConst09_l2, ResAdaL2_l2, QuadConst09_l2, QuadAdaL2_l2, ResConst01_l2.]

Figure x: A comparison between adaptive and constant forcing sequences. We show the convergence of a truncated Newton method for linear SVM. The x-axis shows the cumulative number of CG iterations; the y-axis shows (f − f*)/f*. Loss=L2 and C = 100C_best in each sub-figure.
