• Aucun résultat trouvé

Index of /~cjlin/papers/similarity_learning

N/A
N/A
Protected

Academic year: 2022

Partager "Index of /~cjlin/papers/similarity_learning"

Copied!
1
0
0

Texte intégral

(1)

Supplementary Materials for

Efficient Optimization Methods for Extreme Similarity Learning with Nonlinear Embeddings

Bowen Yuan

National Taiwan University [email protected]

Yu-Sheng Li

National Taiwan University [email protected]

Pengrui Quan

University of California, Los Angeles [email protected]

Chih-Jen Lin

National Taiwan University [email protected]

1 MORE IMPLEMENTATION DETAILS

In evaluating𝐿(𝜽), the required𝑷,𝑸,𝑷ˆ𝑐,𝑸ˆ𝑐,𝑷𝑐,𝑸𝑐, and Frobenius inner products can be directly computed by the forward and other built-in functions. However, when we use GPUs for large data sets, we face two difficulties. First, as the memory consumption of a forward process is proportional to the number of data, calculating the whole𝑷 and𝑸at once may be infeasible for GPUs. Second, for later computations, we need to cache𝑷and𝑸inO ( (𝑚+𝑛)𝑘) space, which may also be infeasible.

For the first difficulty, fortunately, as the computation of each row of𝑷and𝑸is independent of others, we follow [2] to have a mini-batch setting. Specifically, by splittingUandVinto multiple subsets, we sequentially calculate rows of𝑷or𝑸corresponding to indices in each subset. For ˜𝑷𝑐,𝑸˜𝑐,𝑷ˆ𝑐,𝑸ˆ𝑐,𝑷𝑐, and𝑸𝑐, each of which is the sum of𝑚or𝑛matrices of size𝑘×𝑘, we can calculate the partial results on each subset and accumulate them for the final output.

To overcome the second difficulty, we apply a heterogeneous computing scheme by storing𝑷and𝑸in CPU, which has a larger memory capacity than GPU. After finishing the above computation in GPU on each subset, we move the resulting partial𝑷 and𝑸 to the CPU memory. Some subsequent sparse operations such as 𝐿+(𝜽)in (10) are conducted in CPU. As the advantage of GPU over CPU is more significant on dense operations, our setting of having sparse operations on CPU is appropriate. On the contrary, for ˜𝑷𝑐,𝑸˜𝑐,𝑷ˆ𝑐,𝑸ˆ𝑐,𝑷𝑐, and𝑸𝑐, we store these𝑘×𝑘matrices in the GPU memory. Then we compute𝐿(𝜽)involving dense operations of Frobenius inner products on GPU.

To compute∇𝐿(𝜽)in (27), similar to𝐿+(𝜽),𝑿 𝑸and𝑿𝑷are computed on CPUs. Then by applying the mini-batch setting, for each subset, we feed the corresponding part of𝑿 𝑸,𝑿𝑷,𝑨,𝑩,𝑷, 𝑸, ˜𝑷and ˜𝑸into GPUs. With the cached ˆ𝑷𝑐,𝑸ˆ𝑐,𝑷𝑐, and𝑸𝑐, we first conduct dense operations to compute𝑿 𝑸+𝜔(𝑨𝑷 𝑸𝑐−𝑨𝑷˜𝑸ˆ𝑐)and 𝑿𝑷+𝜔(𝑩𝑸 𝑷𝑐−𝑩𝑸˜𝑷ˆ𝑐)on GPU. Then we call the automatic dif- ferentiation function (e.g.,tf.gradientsin TensorFlow) to finish the transposed Jacobian-vector products.

To compute𝑮𝒅in (40), as we mentioned, the Jacobian-vector products can be implemented with the forward-mode automatic differentiation. However, for compatibility with older versions of libraries where the forward-mode automatic differentiation is not supported, we apply the trick provided in [1] to compute𝑾and 𝑯with the reverse-mode automatic differentiation. Similar to𝑷

and𝑸, we apply the mini-batch setting and cache𝑾and𝑯in the CPU memory. Then with the cached𝑾,𝑯,𝑷, and𝑸, we compute 𝒁 𝑸and𝒁𝑷on CPUs. Next, similar to∇𝐿(𝜽), by the mini-batch setting, for each subset, we feed the required partial matrices to compute𝑨𝑾 𝑸𝑐+𝑨𝑷 𝑯𝑐and𝑩𝑯 𝑷𝑐+𝑩𝑸𝑾𝑐in GPU. Finally, we call the automatic differentiation function to finish the remaining transposed Jacobian-vector products.

REFERENCES

[1] Jamie Townsend. A new trick for calculating Jacobian vector products, 2017.

[2] Chien-Chih Wang, Kent-Loong Tan, Chun-Ting Chen, Yu-Hsiang Lin, S. Sathiya Keerthi, Dhruv Mahajan, Sellamanickam Sundararajan, and Chih-Jen Lin. Dis- tributed Newton methods for deep learning.Neural Comput., 30:1673–1724, 2018.

Références

Documents relatifs

For C = 1, 000, L-CommDir-BFGS is the fastest on most data sets, and the only exception is KDD2010-b, on which L-CommDir-BFGS is slightly slower than L-CommDir- Step, but faster

To apply the linear convergence result in Tseng and Yun (2009), we show that L1-regularized logistic regression satisfies the conditions in their Theorem 2(b) if the loss term L(w)

In binary classification, squared hinge loss (L2 loss) is a common alter- native to L1 loss, but surprisingly we have not found any paper studying details of Crammer and Singer’s

Under random or cyclic selections, [2] explained that CD algorithms converge much faster if the dual problem does not contain a linear constraint (i.e., the bias term is

The x-axis is the cumulative number of O(n) operations, while the y-axis is the relative difference to the optimal function value defined in (4.19)... The x-axis is the

However, we find that (1) may be too tight in practice, so the effectiveness of diagonal preconditioning is still unclear if the training program stops earlier.. To this end, we

In Section 2, we introduce Newton methods for large- scale linear classification, and detailedly discuss two strategies (line search and trust region) for deciding the step size..

In the topic of multi-label classification for medical code prediction, one influential paper conducted a proper parameter selection on a set, but when moving to a sub- set