A Note on Diagonal Preconditioner in Large-scale Logistic Regression

Wei-Sheng Chin, Bo-Wen Yuan, Meng-Yuan Yang, and Chih-Jen Lin

Department of Computer Science, National Taiwan University, Taipei 10617, Taiwan

Preconditioning is a common technique to accelerate conjugate gradient (CG) methods. Although [2] conducted some comparisons, they only report the total number of CG iterations needed to reach the following stopping condition:

\|\nabla f(w)\| \le 0.001. \qquad (1)

Their conclusion is that a diagonal preconditioner may not always be useful. However, we find that (1) may be too tight in practice, so the effectiveness of diagonal preconditioning remains unclear if the training program stops earlier. To this end, we follow the settings in [2, Section 5] to redo their experiments with more details provided.
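To make the object of study concrete, the following is a minimal NumPy sketch, written by us for illustration rather than taken from the experiment programs, of CG with an optional diagonal preconditioner applied to the Newton system ∇²f(w) d = -∇f(w) of L2-regularized logistic regression. All function names and the dense-matrix setting are our own choices; the actual implementation operates on sparse data and differs in many details.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# L2-regularized logistic regression:
#   f(w) = w'w / 2 + C * sum_i log(1 + exp(-y_i * x_i' w))
# Its Hessian is H = I + C * X' D X with D_ii = s_i (1 - s_i),
# where s_i = sigmoid(y_i * x_i' w).  The diagonal preconditioner is diag(H).

def gradient(X, y, w, C):
    s = sigmoid(y * (X @ w))
    return w + C * (X.T @ ((s - 1.0) * y))

def hess_vec(X, y, w, C, v):
    s = sigmoid(y * (X @ w))
    D = s * (1.0 - s)
    return v + C * (X.T @ (D * (X @ v)))

def hess_diag(X, y, w, C):
    s = sigmoid(y * (X @ w))
    D = s * (1.0 - s)
    return 1.0 + C * ((X ** 2).T @ D)

def pcg(Hv, grad, diag_H=None, cg_tol=0.1, max_iter=250):
    """Solve H d = -grad by conjugate gradient; if diag_H is given,
    M = diag(H) is used as the preconditioner."""
    d = np.zeros_like(grad)
    r = -grad                                     # residual of H d = -grad
    z = r / diag_H if diag_H is not None else r   # z = M^{-1} r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Hp = Hv(p)
        alpha = rz / (p @ Hp)
        d += alpha * p
        r -= alpha * Hp
        if np.linalg.norm(r) <= cg_tol * np.linalg.norm(grad):
            break
        z = r / diag_H if diag_H is not None else r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return d
```

The inner solver is the same regardless of how the outer step is chosen; the trust-region and line-search settings compared below differ only in how the Newton step length is decided.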

We consider the same six data sets and the same C = 16 used in [2]. In addition to the trust-region Newton method proposed in [2], we consider another setting that replaces the trust-region technique with line search. For each data set, we check the relation between the number of accumulated CG iterations and the relative difference to the optimal function value

\frac{f - f^*}{f^*},

where f is the function value at a specific number of accumulated CG iterations and f^* is the optimal function value. For each data set, two horizontal lines are drawn to respectively indicate where the stopping condition (1) and the default stopping condition (shown below) in LIBLINEAR [1] are satisfied:

\|\nabla f(w)\|_2 \le \frac{\min(l_+,\, l_-)}{100\, l}\, \|\nabla f(w^{\mathrm{init}})\|_2, \qquad (2)

where l_+ is the number of positive instances, l_- is the number of negative instances, l is the total number of instances, and w^init stands for the initial solution. Note that following [2], we add a bias term to the model and the regularization term. That is,

x_i^T \leftarrow [x_i^T,\ 1], \qquad w^T \leftarrow [w^T,\ b].
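For concreteness, the following small helpers, again our own illustrative code rather than part of LIBLINEAR or the released experiment programs, implement the bias augmentation, the right-hand side of stopping condition (2), and the relative difference to the optimal function value plotted below.

```python
import numpy as np

def add_bias_term(X, w, b=0.0):
    """Bias augmentation: x_i^T <- [x_i^T, 1] and w^T <- [w^T, b]."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([X, ones]), np.append(w, b)

def default_stopping_tolerance(y, grad_init):
    """Right-hand side of (2): min(l+, l-) / (100 l) * ||grad f(w_init)||_2,
    where y holds the labels (+1 / -1) and grad_init is the gradient at w_init."""
    l_pos = int(np.sum(y == 1))
    l_neg = int(np.sum(y == -1))
    l = len(y)
    return min(l_pos, l_neg) / (100.0 * l) * np.linalg.norm(grad_init)

def progress_curve(cg_counts, f_values, f_opt):
    """Pair accumulated CG iterations with (f - f*) / f* for plotting,
    given per-outer-iteration CG counts and function values."""
    accumulated = np.cumsum(cg_counts)
    relative = (np.asarray(f_values, dtype=float) - f_opt) / f_opt
    return list(zip(accumulated, relative))
```

Training stops once \|\nabla f(w)\|_2 falls below this tolerance, which, as the experiments below show, happens far earlier than condition (1).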

Our results are shown in Figure 1. We have the following observations.

• The stopping condition (1) used in [2] is much tighter than the one that is more often used in practice.

• In the early stage, PCG reduces the function value faster than, or at least comparably to, the unpreconditioned version, but it may not be better in a later stage.

In summary, our results are consistent with the conclusion in [2], which is based on results obtained under a strict stopping condition. However, in practice the Newton method stops much earlier. In such a situation, our experiments show that PCG is useful, or at least not harmful, for training logistic regression. Programs used for these experiments are available at

www.csie.ntu.edu.tw/~cjlin/papers/logistic.


[Figure 1: six panels, (a) a9a, (b) news20, (c) rcv1, (d) real-sim, (e) yahoo-japan, (f) yahoo-korea. x-axis: accumulated CG iterations; y-axis: difference to the optimal function value (log scale); curves: LS, LS-P, TR, TR-P.]

Figure 1: Effects of diagonal preconditioning for truncated Newton methods. TR: the trust-region technique is used. LS: line search is used. Note that "P" indicates the cases with preconditioning. Dashed horizontal line: (2) is satisfied. Solid horizontal line: (1) is satisfied.

References

[1] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

[2] C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. Journal of Machine Learning Research, 9:627–650, 2008.

