Supplementary Materials for “The Common-Directions Method for Regularized Empirical Risk Minimization”
Po-Wei Wang b97058@csie.ntu.edu.tw
Department of Computer Science National Taiwan University Taipei 106, Taiwan
Ching-pei Lee ching-pei@cs.wisc.edu
Department of Computer Sciences University of Wisconsin-Madison Madison, WI 53706-1613, USA
Chih-Jen Lin cjlin@csie.ntu.edu.tw
Department of Computer Science National Taiwan University Taipei 106, Taiwan
1. More Experimental Results
In this supplementary document, we provide more experimental results. We first give an additional experiment that adds a bias term to the logistic regression problem, and then compare our method with some representative variance-reduction stochastic methods.
1.1 Logistic Regression with a Bias Term
In the main paper, we conducted experiments on logistic regression without a bias term. In this section, we provide the results with a bias term in Figures 1-12. In general, the same trend as in Section 6 of the main paper is observed.
1.2 Comparison with Variance Reduction Stochastic Methods
We compare the proposed method with two representative variance-reduction stochastic methods in this section. We implemented the stochastic L-BFGS (SLBFGS) method of Moritz et al. (2016), whose special setting of using zero historical vectors reduces to the SVRG method of Johnson and Zhang (2013). One concern is that most of the data sets we used are relatively sparse, so without careful consideration, a plain implementation conducts dense vector additions at every iteration, resulting in a much higher cost per data pass than the other methods compared in this work. Implementation tricks like those for plain stochastic gradient on regularized problems are applicable to SVRG to deal with this issue; see, for example, the appendix of Reddi et al. (2015). However, these tricks do not apply to Moritz et al. (2016): the L-BFGS update always costs at least O(n), because it either multiplies the gradient by an n × n dense matrix or runs a two-loop procedure that forms weighted summations of several dense vectors according to their inner products with the current gradient. In summary, the SLBFGS method is practical only when the data set is dense, while SVRG can be made efficient for sparse data with such implementation tricks. Nevertheless, a preliminary comparison on relatively dense data sets already suffices to show the deficiency of these methods, as we will see below, so the implementation trick is not added.

Data      Training size (l)  Features (n)  Density (#nnz/(ln))
w8a       49,749             300           3.8834%
real-sim  72,309             20,958        0.2448%
gisette   6,000              5,000         99.1000%

Table 1: Statistics of the additional data sets.
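To make the O(n) claim concrete, below is a minimal sketch of the standard L-BFGS two-loop recursion (our own illustration, not the code used in the experiments). Every pass touches full dense n-vectors s_i and y_i, so each update costs O(mn) for memory size m no matter how sparse the current gradient is.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

using Vec = std::vector<double>;

static double dot(const Vec& a, const Vec& b) {
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
}

// Two-loop recursion: returns q ~ H * grad from m stored pairs
// S[i] = x_{i+1} - x_i and Y[i] = g_{i+1} - g_i (oldest first).
// All loops run over dense n-vectors, hence the O(mn) cost.
Vec two_loop(const Vec& grad, const std::vector<Vec>& S,
             const std::vector<Vec>& Y) {
    const std::size_t m = S.size(), n = grad.size();
    Vec q = grad;
    std::vector<double> alpha(m), rho(m);
    for (std::size_t i = m; i-- > 0; ) {        // backward pass, newest first
        rho[i] = 1.0 / dot(Y[i], S[i]);
        alpha[i] = rho[i] * dot(S[i], q);
        for (std::size_t j = 0; j < n; ++j) q[j] -= alpha[i] * Y[i][j];
    }
    const double gamma =                        // initial metric H0 = gamma*I
        m > 0 ? dot(S[m - 1], Y[m - 1]) / dot(Y[m - 1], Y[m - 1]) : 1.0;
    for (std::size_t j = 0; j < n; ++j) q[j] *= gamma;
    for (std::size_t i = 0; i < m; ++i) {       // forward pass, oldest first
        const double beta = rho[i] * dot(Y[i], q);
        for (std::size_t j = 0; j < n; ++j) q[j] += (alpha[i] - beta) * S[i][j];
    }
    return q;                                   // search direction is -q
}
```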
Another disadvantage of these stochastic methods is that they usually require step-size tuning, and different regularization parameters may require different step sizes to ensure convergence, or fast convergence. Therefore, even if these methods perform well when the parameters are right, finding the right parameters takes a long time, and the parameter-tuning time should not be ignored. In terms of the total running time, including parameter tuning, batch methods that need no such tuning are preferable in practice.
For the reasons above, we compare with the SLBFGS method of Moritz et al. (2016) on the rather dense data sets epsilon, covtype, and a9a, and on the additional small data sets w8a, real-sim, and gisette, with the default parameters suggested in either their paper or their experiment code at https://github.com/pcmoritz/slbfgs/. The statistics of the latter three data sets are shown in Table 1.
Note that for a fair comparison, we reimplemented their algorithm in C++, strictly following the algorithm described in their paper, which possesses a convergence guarantee; their experiment code is written in Julia, and we reimplemented it in C++ to exclude the performance gap resulting from different implementation languages. We also compared with the special case of this algorithm in which it reduces to SVRG, though we acknowledge that the running time for SVRG could be shortened by the implementation trick for sparse data mentioned above.
The parameters we used for SLBFGS are: step size 10^{-4}, memory size of the L-BFGS matrix 10, number of samples for the minibatch stochastic gradient 20, number of samples for Hessian estimation 200, number of stochastic steps before updating the L-BFGS matrix 10, and number of stochastic steps before recomputing the full gradient l/20, where l is the number of data points. Regarding the parameters for SVRG, we follow the suggestion of Johnson and Zhang (2013) to use a minibatch size of 1 and 2l stochastic steps before recomputing a full gradient. For the step size, there was no suggestion in the paper, so we followed the same setting as SLBFGS and used 10^{-4}.
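For reference, the following is a minimal dense sketch of how these SVRG parameters enter the algorithm of Johnson and Zhang (2013) with minibatch size 1; the function names and the use of the last iterate as the next snapshot are our own choices.

```cpp
#include <cstddef>
#include <random>
#include <vector>

using Vec = std::vector<double>;

// Minimal SVRG for min_w (1/l) * sum_i f_i(w): every 2l stochastic steps we
// recompute the full gradient at a snapshot (one data pass) and use it to
// variance-reduce each subsequent stochastic step with step size eta.
template <class GradI, class FullGrad>
Vec svrg(Vec w, std::size_t l, double eta, int outer_iters,
         GradI grad_i, FullGrad full_grad) {
    std::mt19937 rng(0);
    std::uniform_int_distribution<std::size_t> pick(0, l - 1);
    for (int s = 0; s < outer_iters; ++s) {
        const Vec snapshot = w;                  // last iterate as snapshot
        const Vec mu = full_grad(snapshot);      // one full data pass
        for (std::size_t t = 0; t < 2 * l; ++t) {
            const std::size_t i = pick(rng);
            const Vec g = grad_i(w, i), g0 = grad_i(snapshot, i);
            for (std::size_t j = 0; j < w.size(); ++j)
                w[j] -= eta * (g[j] - g0[j] + mu[j]);  // dense update
        }
    }
    return w;
}
```

The dense inner update is exactly the cost issue discussed above; the trick of Reddi et al. (2015) delays the dense terms so that each step touches only the nonzero coordinates of x_i.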
                               C = 10^{-3}          C = 1                C = 10^{3}
                               SLBFGS    SVRG       SLBFGS    SVRG       SLBFGS    SVRG
epsilon: logistic regression   10^{-4}   10^{-4}    10^{-4}   10^{-5}    10^{-4}   10^{-8}
covtype: logistic regression   10^{-10}  10^{-10}   10^{-13}  10^{-13}   10^{-16}  10^{-16}
a9a: logistic regression       10^{-4}   10^{-4}    10^{-4}   10^{-5}    10^{-4}   10^{-8}
w8a: logistic regression       10^{-4}   10^{-4}    10^{-4}   10^{-4}    10^{-6}   10^{-7}
real-sim: logistic regression  10^{-4}   10^{-4}    10^{-4}   10^{-4}    10^{-4}   10^{-6}
gisette: logistic regression   10^{-4}   10^{-4}    10^{-4}   10^{-6}    10^{-8}   10^{-8}
epsilon: L2-loss SVM           10^{-4}   10^{-4}    10^{-5}   10^{-6}    10^{-9}   10^{-9}
covtype: L2-loss SVM           10^{-10}  10^{-11}   10^{-13}  10^{-14}   10^{-16}  10^{-17}
a9a: L2-loss SVM               10^{-4}   10^{-4}    10^{-5}   10^{-6}    10^{-9}   10^{-9}
w8a: L2-loss SVM               10^{-4}   10^{-4}    10^{-4}   10^{-7}    10^{-8}   10^{-10}
real-sim: L2-loss SVM          10^{-4}   10^{-4}    10^{-4}   10^{-5}    10^{-6}   10^{-8}
gisette: L2-loss SVM           10^{-4}   10^{-4}    10^{-8}   10^{-8}    10^{-11}  10^{-11}

Table 2: Step sizes of SLBFGS and SVRG for different C and different data sets.
Whenever the current step size yields, at the tenth iteration, an objective value higher than the initial objective, we divide the step size by 10, retain all other settings, and rerun the algorithm. This step-size tuning procedure is adopted for both SVRG and SLBFGS. The results are shown in Figures 13-17. Note that for the sparse data sets, the running time of SVRG could be improved by a more efficient implementation; the results in terms of data passes nevertheless provide sufficient information.
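In code, this tuning rule amounts to the following loop (a hypothetical driver; run_ten_iterations stands for running ten outer iterations of SVRG or SLBFGS from the initial point with the given step size and returning the final objective value):

```cpp
#include <functional>

// Divide the step size by 10 and rerun until ten iterations no longer end
// with an objective above the initial objective value f_init.
double tune_step_size(double eta_init, double f_init,
                      const std::function<double(double)>& run_ten_iterations) {
    double eta = eta_init;                       // 1e-4 in our experiments
    while (run_ten_iterations(eta) > f_init)     // objective got worse: retry
        eta /= 10.0;
    return eta;
}
```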
The step sizes we eventually used are shown in Table 2. Clearly, tuning the step sizes for these stochastic methods takes quite some time; for example, we needed to decrease the step size at least four times on gisette with C = 10^{3} and more than ten times on covtype with C = 10^{3}. From the figures, we see that although SVRG often outperforms our method when the problem is rather easy (when C is small), for more difficult problems like covtype, w8a, real-sim, and gisette, especially when C is large, SVRG fails to converge after reaching some low accuracy level. Further tuning of the step size might partially fix this, but it also means that more time must be spent. On the other hand, SLBFGS is often the slowest, even in terms of data passes, and the convergence problem observed for SVRG also appears for SLBFGS.
Although it is possible to apply acceleration to SVRG on problems with a larger condition number to partially address this issue, a difficulty is that those accelerated methods require an estimate of the Lipschitz constant.
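To illustrate the difficulty, assume the regularized logistic regression objective takes the standard form f(w) = w^T w / 2 + C Σ_i log(1 + exp(-y_i w^T x_i)) (our assumption about the notation). Even bounding the gradient's Lipschitz constant involves an extreme eigenvalue of the data Gram matrix:

```latex
% X has rows x_i^T; \sigma_i(w) = 1 / (1 + \exp(-y_i w^\top x_i)).
\nabla^2 f(w) = I + C\,X^\top D(w)\,X, \qquad
D(w)_{ii} = \sigma_i(w)\bigl(1 - \sigma_i(w)\bigr) \le \tfrac14
\quad\Longrightarrow\quad
L = \sup_w \lambda_{\max}\!\bigl(\nabla^2 f(w)\bigr)
  \le 1 + \tfrac{C}{4}\,\lambda_{\max}\!\bigl(X^\top X\bigr).
```

Estimating λ_max(X^T X) for large sparse data is itself costly, and a loose estimate forces an overly conservative step size in the accelerated methods.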
References
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

Philipp Moritz, Robert Nishihara, and Michael I. Jordan. A linearly-convergent stochastic L-BFGS algorithm. In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Statistics, 2016.

Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alexander J. Smola. On variance reduction in stochastic gradient descent and its asynchronous variants. In Advances in Neural Information Processing Systems, pages 2647–2655, 2015.
[Figure 1 here: log-log plots of relative function difference versus training time in seconds for the Single Inner Iteration and Multiple Inner Iterations variants, on panels (a) a9a, (b) covtype, (c) news20, (d) url, (e) yahoo-japan, (f) yahoo-korea, (g) webspam, (h) rcv1t, (i) epsilon, and (j) KDD2010-b.]

Figure 1: Comparison between the single inner iteration and multiple inner iterations variants of the common-directions method. We present training time (in log scale) of logistic regression with a bias term and C = 10^{-3}.
[Figure 2 here: same layout and panels (a)-(j) as Figure 1.]

Figure 2: Comparison between the single inner iteration and multiple inner iterations variants of the common-directions method. We present training time (in log scale) of logistic regression with a bias term and C = 10^{1}.
[Figure 3 here: same layout as Figure 1, with panels (a)-(i) only (no KDD2010-b).]

Figure 3: Comparison between the single inner iteration and multiple inner iterations variants of the common-directions method. We present training time (in log scale) of logistic regression with a bias term and C = 10^{3}.
[Figure 4 here: log-log plots of relative function difference versus data passes for the Single Inner Iteration and Multiple Inner Iterations variants, on panels (a)-(j) as in Figure 1.]

Figure 4: Comparison between the single inner iteration and multiple inner iterations variants of the common-directions method. We present data passes (in log scale) of logistic regression with a bias term and C = 10^{-3}.
[Figure 5 here: same layout and panels as Figure 4.]

Figure 5: Comparison between the single inner iteration and multiple inner iterations variants of the common-directions method. We present data passes (in log scale) of logistic regression with a bias term and C = 10^{1}.
[Figure 6 here: same layout as Figure 4, with panels (a)-(i) only (no KDD2010-b).]

Figure 6: Comparison between the single inner iteration and multiple inner iterations variants of the common-directions method. We present data passes (in log scale) of logistic regression with a bias term and C = 10^{3}.
[Figure 7 here: log-log plots of relative function difference versus training time in seconds for CommDir, L-BFGS30, BFGS, NEWTON, and AG, on panels (a)-(j) as in Figure 1.]

Figure 7: Training time of logistic regression with a bias term and C = 10^{-3}.
[Figure 8 here: same layout and panels as Figure 7.]

Figure 8: Training time of logistic regression with a bias term and C = 1.
[Figure 9 here: same layout as Figure 7, with panels (a)-(i) only (no KDD2010-b).]

Figure 9: Training time of logistic regression with a bias term and C = 10^{3}.
[Figure 10 here: log-log plots of relative function difference versus data passes for CommDir, L-BFGS30, BFGS, NEWTON, and AG, on panels (a)-(j) as in Figure 1.]

Figure 10: Number of data passes of logistic regression with a bias term and C = 10^{-3}.
[Figure 11 here: same layout and panels as Figure 10.]

Figure 11: Number of data passes of logistic regression with a bias term and C = 1.
[Figure 12 here: same layout as Figure 10, with panels (a)-(i) only (no KDD2010-b).]

Figure 12: Number of data passes of logistic regression with a bias term and C = 10^{3}.
[Figure 13 here: log-log plots of relative function difference for CommDir, NEWTON, SVRG, and SLBFGS on epsilon. For each of logistic regression and the L2-loss SVM, one row of plots is versus training time in seconds and one row is versus data passes, with columns corresponding to the three values of C in Table 2.]

Figure 13: Comparison with the stochastic methods on epsilon.
[A further figure in the same format as Figure 13 (columns C = 10^{-3}, C = 1, and C = 10^{3}; logistic regression and L2-loss SVM; training time and data passes; CommDir, NEWTON, SVRG, and SLBFGS) begins here; the source is truncated before its caption.]