Distributed Newton Methods for Deep Neural Networks

Chien-Chih Wang,¹ Kent Loong Tan,¹ Chun-Ting Chen,¹ Yu-Hsiang Lin,² S. Sathiya Keerthi,³ Dhruv Mahajan,⁴ S. Sundararajan,³ Chih-Jen Lin¹

¹Department of Computer Science, National Taiwan University, Taipei 10617, Taiwan
²Department of Physics, National Taiwan University, Taipei 10617, Taiwan
³Microsoft
⁴Facebook Research

Abstract

Deep learning involves a difficult non-convex optimization problem with a large number of weights between any two adjacent layers of a deep structure. To handle large data sets or complicated networks, distributed training is needed, but the calculation of function, gradient, and Hessian is expensive. In particular, the communication and the synchronization cost may become a bottleneck. In this paper, we focus on situations where the model is distributedly stored, and propose a novel distributed Newton method for training deep neural networks. By variable and feature-wise data partitions, and some careful designs, we are able to explicitly use the Jacobian matrix for matrix-vector products in the Newton method. Some techniques are incorporated to reduce the running time as well as the memory consumption. First, to reduce the communication cost, we propose a diagonalization method such that an approximate Newton direction can be obtained without communication between machines. Second, we consider subsampled Gauss-Newton matrices for reducing the running time as well as the communication cost. Third, to reduce the synchronization cost, we terminate the process of finding an approximate Newton direction even though some nodes have not finished their tasks. Details of some implementation issues in distributed environments are thoroughly investigated. Experiments demonstrate that the proposed method is effective for the distributed training of deep neural networks. Compared with stochastic gradient methods, it is more robust and may give better test accuracy.

Keywords: Deep Neural Networks, Distributed Newton methods, Large-scale classification, Subsampled Hessian.

1 Introduction

Recently deep learning has emerged as a useful technique for data classification as well as finding feature representations. We consider the scenario of multi-class classification. A deep neural network maps each feature vector to one of the class labels by the connection of nodes in a multi-layer structure. Between two adjacent layers, a weight matrix maps the inputs (values in the previous layer) to the outputs (values in the current layer). Assume the training set includes $(y_i, x_i)$, $i = 1, \ldots, l$, where $x_i \in \mathbb{R}^{n_0}$ is the feature vector and $y_i \in \mathbb{R}^{K}$ is the label vector. If $x_i$ is associated with label $k$, then

$$y_i = [\underbrace{0, \ldots, 0}_{k-1}, 1, 0, \ldots, 0]^T \in \mathbb{R}^{K},$$

where $K$ is the number of classes and $\{1, \ldots, K\}$ are the possible labels. After collecting all weights and biases as the model vector $\theta$ and having a loss function $\xi(\theta; x, y)$, a neural-network problem can be written as the following optimization problem:

$$\min_{\theta} f(\theta), \qquad (1)$$

where

$$f(\theta) = \frac{1}{2C} \theta^T \theta + \frac{1}{l} \sum_{i=1}^{l} \xi(\theta; x_i, y_i). \qquad (2)$$

The regularization term $\theta^T\theta/2$ avoids overfitting the training data, while the parameter $C$ balances the regularization term and the loss term. The function $f(\theta)$ is non-convex because of the connection between weights in different layers. This non-convexity and the large number of weights have caused tremendous difficulties in training large-scale deep neural networks. To apply an optimization algorithm for solving (2), the calculation of function, gradient, and Hessian can be expensive. Currently, stochastic gradient (SG) methods are the most commonly used way to train deep neural networks (e.g., Bottou, 1991; LeCun et al., 1998b; Bottou, 2010; Zinkevich et al., 2010; Dean et al., 2012; Moritz et al., 2015). In particular, some expensive operations can be efficiently conducted in GPU environments (e.g., Ciresan et al., 2010; Krizhevsky et al., 2012; Hinton et al., 2012). Besides stochastic gradient methods, some works such as Martens (2010); Kiros (2013); He et al. (2016) have considered a Newton method that uses Hessian information. Other optimization methods such as ADMM have also been considered (Taylor et al., 2016).

When the model or the data set is large, distributed training is needed. Following the design of the objective function in (2), we note it is easy to achieve data parallelism: if data instances are stored in different computing nodes, then each machine can calculate the local sum of training losses independently.¹ However, achieving model parallelism is more difficult because of the complicated structure of deep neural networks. In this work, by considering that the model is distributedly stored, we propose a novel distributed Newton method for deep learning. By variable and feature-wise data partitions, and some careful designs, we are able to explicitly use the Jacobian matrix for matrix-vector products in the Newton method. Some techniques are incorporated to reduce the running time as well as the memory consumption. First, to reduce the communication cost, we propose a diagonalization method such that an approximate Newton direction can be obtained without communication between machines. Second, we consider subsampled Gauss-Newton matrices for reducing the running time as well as the communication cost. Third, to reduce the synchronization cost, we terminate the process of finding an approximate Newton direction even though some nodes have not finished their tasks.

¹Training deep neural networks with data parallelism has been considered in SG, Newton, and other optimization methods. For example, He et al. (2015) implement a parallel Newton method by letting each node store a subset of instances.

To keep the discussion focused, among the various types of neural networks, we consider the standard feedforward networks in this work. We do not consider other types such as the convolutional networks that are popular in computer vision.

This work is organized as follows. Section 2 introduces existing Hessian-free Newton methods for deep learning. In Section 3, we propose a distributed Newton method for training neural networks. We then develop novel techniques in Section 4 to reduce running time and memory consumption. In Section 5 we analyze the cost of the proposed algorithm. Additional implementation techniques are given in Section 6. Then Section 7 reviews some existing optimization methods, while experiments in Section 8 demonstrate the effectiveness of the proposed method. Programs used for experiments in this paper are available at

http://csie.ntu.edu.tw/~cjlin/papers/dnn.

Supplementary materials, including a list of symbols and additional experiments, can be found at the same web address.

2 Hessian-free Newton Method for Deep Learning

In this section, we begin by introducing feedforward neural networks and then review existing Hessian-free Newton methods for solving the optimization problem.

2.1 Feedforward Networks

A multi-layer neural network maps each feature vector to a class vector via the connection of nodes. There is a weight vector between two adjacent layers to map the input vector (the previous layer) to the output vector (the current layer). The network in Figure 1 is an example. Let $n_m$ denote the number of nodes at the $m$th layer. We use $n_0$(input)-$n_1$-$\cdots$-$n_L$(output) to represent the structure of the network.³

Figure 1: An example of feedforward neural networks.²

²This figure is modified from the example at http://www.texample.net/tikz/examples/neural-network.

³Note that $n_0$ is the number of features and $n_L = K$ is the number of classes.

The weight matrix $W^m$ and the bias vector $b^m$ at the $m$th layer are

$$W^m = \begin{bmatrix} w^m_{11} & w^m_{12} & \cdots & w^m_{1 n_m} \\ w^m_{21} & w^m_{22} & \cdots & w^m_{2 n_m} \\ \vdots & \vdots & \ddots & \vdots \\ w^m_{n_{m-1} 1} & w^m_{n_{m-1} 2} & \cdots & w^m_{n_{m-1} n_m} \end{bmatrix}_{n_{m-1} \times n_m} \quad \text{and} \quad b^m = \begin{bmatrix} b^m_1 \\ b^m_2 \\ \vdots \\ b^m_{n_m} \end{bmatrix}_{n_m \times 1}.$$

Let

$$s^{0,i} = z^{0,i} = x_i$$

be the feature vector of the $i$th instance, and let $s^{m,i}$ and $z^{m,i}$ denote the vectors of the $i$th instance at the $m$th layer, respectively. We can use

$$s^{m,i} = (W^m)^T z^{m-1,i} + b^m, \quad m = 1, \ldots, L,\; i = 1, \ldots, l,$$
$$z^{m,i}_j = \sigma(s^{m,i}_j), \quad j = 1, \ldots, n_m,\; m = 1, \ldots, L,\; i = 1, \ldots, l \qquad (3)$$

to derive the values of the next layer, where $\sigma(\cdot)$ is the activation function.
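To make the forward process in (3) concrete, the following is a minimal NumPy sketch of a single-machine forward pass over a batch of instances. The toy 2-5-3 structure, the array names, and the choice of a linear output layer (which the paper adopts later in (42)) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(X, W, b):
    """Forward pass of (3): X is l x n0 (one row per instance),
    W[m] is n_{m-1} x n_m and b[m] has length n_m for m = 0, ..., L-1.
    The output layer is kept linear, matching (42) later in the paper."""
    Z = X                        # z^{0,i} = x_i
    L = len(W)
    for m in range(L):
        S = Z @ W[m] + b[m]      # s^{m,i} = (W^m)^T z^{m-1,i} + b^m, row-wise
        Z = S if m == L - 1 else sigmoid(S)
    return Z                     # l x n_L matrix whose rows are z^{L,i}

# Toy 2-5-3 network with random weights (for illustration only).
rng = np.random.default_rng(0)
sizes = [2, 5, 3]
W = [rng.standard_normal((sizes[m], sizes[m + 1])) for m in range(len(sizes) - 1)]
b = [np.zeros(sizes[m + 1]) for m in range(len(sizes) - 1)]
X = rng.standard_normal((4, sizes[0]))   # l = 4 instances
print(forward(X, W, b).shape)            # (4, 3)
```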

If $W^m$'s columns are concatenated to the following vector

$$w^m = \begin{bmatrix} w^m_{11} & \ldots & w^m_{n_{m-1}1} & w^m_{12} & \ldots & w^m_{n_{m-1}2} & \ldots & w^m_{1 n_m} & \ldots & w^m_{n_{m-1} n_m} \end{bmatrix}^T,$$

then we can define

$$\theta = \begin{bmatrix} w^1 \\ b^1 \\ \vdots \\ w^L \\ b^L \end{bmatrix}$$

as the weight vector of a whole deep neural network. The total number of parameters is

$$n = \sum_{m=1}^{L} \left( n_{m-1} \times n_m + n_m \right).$$

Because $z^{L,i}$ is the output vector of the $i$th data, by a loss function to compare it with the label vector $y_i$, a neural network solves the following regularized optimization problem:

$$\min_{\theta} f(\theta), \quad \text{where} \quad f(\theta) = \frac{1}{2C} \theta^T \theta + \frac{1}{l} \sum_{i=1}^{l} \xi(z^{L,i}; y_i), \qquad (4)$$

$C > 0$ is a regularization parameter, and $\xi(z^{L,i}; y_i)$ is a convex function of $z^{L,i}$. Note that we rewrite the loss function $\xi(\theta; x_i, y_i)$ in (2) as $\xi(z^{L,i}; y_i)$ because $z^{L,i}$ is decided by $\theta$ and $x_i$. In this work, we consider the following loss function

$$\xi(z^{L,i}; y_i) = \| z^{L,i} - y_i \|^2. \qquad (5)$$
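Given a forward pass such as the sketch above, evaluating the objective (4) with the loss (5) only requires the squared errors and the regularization term. The sketch below is a hedged illustration; the function name and the way $\theta$ is passed as a list of arrays are our own choices, not the paper's code.

```python
import numpy as np

def objective(theta_parts, Z_L, Y, C, l):
    """f(theta) in (4) with the loss (5):
    (1/2C) theta^T theta + (1/l) * sum_i ||z^{L,i} - y_i||^2.
    theta_parts is a list of weight/bias arrays; Z_L and Y are l x n_L."""
    reg = sum(np.sum(p ** 2) for p in theta_parts) / (2.0 * C)
    loss = np.sum((Z_L - Y) ** 2) / l
    return reg + loss

# Example with the toy arrays from the forward-pass sketch:
# f_val = objective(W + b, forward(X, W, b), Y, C=1.0, l=X.shape[0])
```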

The gradient of $f(\theta)$ is

$$\nabla f(\theta) = \frac{1}{C}\theta + \frac{1}{l} \sum_{i=1}^{l} (J^i)^T \nabla_{z^{L,i}} \xi(z^{L,i}; y_i), \qquad (6)$$

where

$$J^i = \begin{bmatrix} \dfrac{\partial z^{L,i}_1}{\partial \theta_1} & \cdots & \dfrac{\partial z^{L,i}_1}{\partial \theta_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial z^{L,i}_{n_L}}{\partial \theta_1} & \cdots & \dfrac{\partial z^{L,i}_{n_L}}{\partial \theta_n} \end{bmatrix}_{n_L \times n}, \quad i = 1, \ldots, l, \qquad (7)$$

is the Jacobian of $z^{L,i}$, which is a function of $\theta$. The Hessian matrix of $f(\theta)$ is

$$\nabla^2 f(\theta) = \frac{1}{C} I + \frac{1}{l} \sum_{i=1}^{l} (J^i)^T B^i J^i + \frac{1}{l} \sum_{i=1}^{l} \sum_{j=1}^{n_L} \frac{\partial \xi(z^{L,i}; y_i)}{\partial z^{L,i}_j} \begin{bmatrix} \dfrac{\partial^2 z^{L,i}_j}{\partial \theta_1 \partial \theta_1} & \cdots & \dfrac{\partial^2 z^{L,i}_j}{\partial \theta_1 \partial \theta_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 z^{L,i}_j}{\partial \theta_n \partial \theta_1} & \cdots & \dfrac{\partial^2 z^{L,i}_j}{\partial \theta_n \partial \theta_n} \end{bmatrix}, \qquad (8)$$

where $I$ is the identity matrix and

$$B^i_{ts} = \frac{\partial^2 \xi(z^{L,i}; y_i)}{\partial z^{L,i}_t \, \partial z^{L,i}_s}, \quad t = 1, \ldots, n_L,\; s = 1, \ldots, n_L. \qquad (9)$$

From now on, for simplicity, we let

$$\xi_i \equiv \xi(z^{L,i}; y_i).$$

2.2 Hessian-free Newton Method

For standard Newton methods, at the $k$th iteration, we find a direction $d^k$ minimizing the following second-order approximation of the function value:

$$\min_{d} \; \frac{1}{2} d^T H^k d + \nabla f(\theta^k)^T d, \qquad (10)$$

where $H^k = \nabla^2 f(\theta^k)$ is the Hessian matrix of $f(\theta^k)$. To solve (10), first we calculate the gradient vector by a backward process based on (3) through the following equations:

$$\frac{\partial \xi_i}{\partial s^{m,i}_j} = \frac{\partial \xi_i}{\partial z^{m,i}_j} \sigma'(s^{m,i}_j), \quad i = 1, \ldots, l,\; m = 1, \ldots, L,\; j = 1, \ldots, n_m, \qquad (11)$$

$$\frac{\partial \xi_i}{\partial z^{m-1,i}_t} = \sum_{j=1}^{n_m} \frac{\partial \xi_i}{\partial s^{m,i}_j} w^m_{tj}, \quad i = 1, \ldots, l,\; m = 1, \ldots, L,\; t = 1, \ldots, n_{m-1}, \qquad (12)$$

$$\frac{\partial f}{\partial w^m_{tj}} = \frac{1}{C} w^m_{tj} + \frac{1}{l} \sum_{i=1}^{l} \frac{\partial \xi_i}{\partial s^{m,i}_j} z^{m-1,i}_t, \quad m = 1, \ldots, L,\; j = 1, \ldots, n_m,\; t = 1, \ldots, n_{m-1}, \qquad (13)$$

$$\frac{\partial f}{\partial b^m_j} = \frac{1}{C} b^m_j + \frac{1}{l} \sum_{i=1}^{l} \frac{\partial \xi_i}{\partial s^{m,i}_j}, \quad m = 1, \ldots, L,\; j = 1, \ldots, n_m. \qquad (14)$$

Note that formally the summation in (13) should be

$$\sum_{i=1}^{l} \sum_{i'=1}^{l} \frac{\partial \xi_i}{\partial s^{m,i'}_j} z^{m-1,i'}_t,$$

but it is simplified because $\xi_i$ is associated with only $s^{m,i}_j$.
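The backward equations (11)-(14) can be sketched directly in NumPy. This is a plain single-machine illustration assuming sigmoid hidden layers, a linear output layer (the choice the paper adopts later in (42)), and the squared loss (5); all names are ours, and code layer index m corresponds to the paper's layer m+1.

```python
import numpy as np

def gradient(X, Y, W, b, C):
    """Gradient of (4) via (11)-(14) for sigmoid hidden layers and a
    linear output layer with the squared loss (5)."""
    l = X.shape[0]
    L = len(W)
    # Forward pass, keeping every z^{m,i} (rows are instances).
    Z = [X]
    for m in range(L):
        S = Z[-1] @ W[m] + b[m]
        Z.append(S if m == L - 1 else 1.0 / (1.0 + np.exp(-S)))
    gW, gb = [None] * L, [None] * L
    d_z = 2.0 * (Z[-1] - Y)                      # dxi_i/dz^{L,i} for loss (5)
    for m in reversed(range(L)):
        # sigma'(s) = z(1-z) for sigmoid layers, 1 for the linear output layer.
        sigma_prime = 1.0 if m == L - 1 else Z[m + 1] * (1.0 - Z[m + 1])
        d_s = d_z * sigma_prime                  # (11)
        gW[m] = W[m] / C + (Z[m].T @ d_s) / l    # (13)
        gb[m] = b[m] / C + d_s.sum(axis=0) / l   # (14)
        d_z = d_s @ W[m].T                       # (12)
    return gW, gb
```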

If $H^k$ is positive definite, then (10) is equivalent to solving the following linear system:

$$H^k d = -\nabla f(\theta^k). \qquad (15)$$

Unfortunately, for the optimization problem (10), it is well known that the objective function may be non-convex and therefore $H^k$ is not guaranteed to be positive definite.

Following Schraudolph (2002), we can use the Gauss-Newton matrix as an approximation of the Hessian. That is, we remove the last term in (8) and obtain the following positive-definite matrix:

$$G = \frac{1}{C} I + \frac{1}{l} \sum_{i=1}^{l} (J^i)^T B^i J^i. \qquad (16)$$

Note that from (9), each $B^i$, $i = 1, \ldots, l$, is positive semi-definite if we require that $\xi(z^{L,i}; y_i)$ is a convex function of $z^{L,i}$. Therefore, instead of using (15), we solve the following linear system to find a $d^k$ for deep neural networks:

$$(G^k + \lambda_k I) d = -\nabla f(\theta^k), \qquad (17)$$

where $G^k$ is the Gauss-Newton matrix at the $k$th iteration and we add a term $\lambda_k I$ because of considering the Levenberg-Marquardt method (see details in Section 4.5).

For deep neural networks, because the total number of weights may be very large, it is hard to store the Gauss-Newton matrix. Therefore, Hessian-free algorithms have been applied to solve (17). Examples include Martens (2010); Ngiam et al. (2011). Specifically, conjugate gradient (CG) methods are often used so that a sequence of Gauss-Newton matrix-vector products are conducted. Martens (2010); Wang et al. (2015) use the R-operator (Pearlmutter, 1994) to implement the product without storing the Gauss-Newton matrix.

Because the use of R-operators for the Newton method is not the focus of this work, we leave some detailed discussion in Sections II-III of the supplementary materials.
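As an illustration of the Hessian-free idea, the sketch below runs CG on (17) given only a function that returns $(G^k + \lambda_k I)v$. The CG loop is standard; the matvec closure shown builds on dense per-instance Jacobians for a tiny case and uses $B^i = 2I$ (which follows from (9) for the squared loss (5)), purely for demonstration, since the point of the following sections is to avoid forming $J^i$ explicitly.

```python
import numpy as np

def cg(matvec, g, max_iter=250, tol=1e-3):
    """Approximately solve (G + lambda I) d = -g by conjugate gradient,
    touching the matrix only through matvec(v)."""
    d = np.zeros_like(g)
    r = -g - matvec(d)            # residual of (17) at d = 0
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        d += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(g):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

def make_matvec(J, C, lam):
    """Toy closure for (16)-(17): J is a list of dense n_L x n Jacobians J^i,
    and B^i = 2I for the squared loss (5)."""
    l = len(J)
    def matvec(v):
        Gv = v / C + lam * v
        for Ji in J:
            Gv += Ji.T @ (2.0 * (Ji @ v)) / l
        return Gv
    return matvec
```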

3 Distributed Training by Variable Partition

The main computational bottleneck in a Hessian-free Newton method is the sequence of matrix-vector products in the CG procedure. To reduce the running time, parallel matrix-vector multiplications should be conducted. However, the R-operator discussed in Section 2 and Section II of the supplementary materials is inherently sequential: in a forward process, results in the current layer must be finished before those in the next. In this section, we propose an effective distributed algorithm for training deep neural networks.

3.1 Variable Partition

Instead of using the R-operator to calculate the matrix-vector product, we consider the whole Jacobian matrix and directly use the Gauss-Newton matrix in (16) for the matrix-vector products in the CG procedure. This setting is possible because of the following reasons.

1. A distributed environment is used.

2. With some techniques we do not need to explicitly store every element of the Jacobian matrix.

Details will be described in the rest of this paper. To begin, we split each $J^i$ into $P$ partitions

$$J^i = \begin{bmatrix} J^i_1 & \cdots & J^i_P \end{bmatrix}.$$

Because the number of columns in $J^i$ is the same as the number of variables in the optimization problem, essentially we partition the variables into $P$ subsets. Specifically, we split the neurons in each layer into several groups. Then weights connecting one group of the current layer to one group of the next layer form a subset of our variable partition.

For example, assume we have a 150-200-30 neural network in Figure 2. By splitting the three layers into 3, 2, 3 groups, we have a total number of partitions $P = 12$. The partition $(A_0, A_1)$ in Figure 2 is responsible for a $50 \times 100$ sub-matrix of $W^1$. In addition, we distribute the variable $b^m$ to partitions corresponding to the first neuron sub-group of the $m$th layer. For example, the 200 variables of $b^1$ are split into 100 in the partition $(A_0, A_1)$ and 100 in the partition $(A_0, B_1)$.

Figure 2: An example of splitting the variables in Figure 1 to 12 partitions by a split structure of 3-2-3. Each circle corresponds to a neuron sub-group in a layer, while each square is a partition corresponding to weights connecting one neuron sub-group in a layer to one neuron sub-group in the next layer.
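The following small sketch shows how a 3-2-3 split of a 150-200-30 network induces the 12 variable partitions of Figure 2. The data structure (a list of dictionaries recording each partition's neuron sub-groups) is our own bookkeeping illustration, not the paper's code.

```python
import numpy as np

def make_partitions(layer_sizes, groups_per_layer):
    """Split the neurons of each layer into contiguous groups and enumerate
    one partition per (sub-group in layer m-1) x (sub-group in layer m)."""
    splits = [np.array_split(np.arange(n), g)
              for n, g in zip(layer_sizes, groups_per_layer)]
    partitions = []
    for m in range(1, len(layer_sizes)):
        for gin, T_prev in enumerate(splits[m - 1]):
            for gout, T_cur in enumerate(splits[m]):
                partitions.append({
                    "layer": m,
                    "rows": T_prev,        # T_{m-1}: input neuron sub-group
                    "cols": T_cur,         # T_m: output neuron sub-group
                    # b^m is assigned to partitions with the first input sub-group
                    "owns_bias": gin == 0,
                })
    return partitions

parts = make_partitions([150, 200, 30], [3, 2, 3])
print(len(parts))                                      # 12 partitions, as in Figure 2
print(parts[0]["rows"].size, parts[0]["cols"].size)    # 50 x 100 block of W^1
```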

By the variable partition, we achieve model parallelism. Further, because $z^{0,i} = x_i$ from (2.1), our data points are split in a feature-wise way to nodes corresponding to partitions between layers 0 and 1. Therefore, we have data parallelism.

With the variable partition, the second term in the Gauss-Newton matrix (16) for the $i$th instance can be represented as

$$(J^i)^T B^i J^i = \begin{bmatrix} (J^i_1)^T B^i J^i_1 & \cdots & (J^i_1)^T B^i J^i_P \\ \vdots & \ddots & \vdots \\ (J^i_P)^T B^i J^i_1 & \cdots & (J^i_P)^T B^i J^i_P \end{bmatrix}.$$

In the CG procedure to solve (17), the product between the Gauss-Newton matrix and a vector $v$ is

$$G v = \begin{bmatrix} \frac{1}{l} \sum_{i=1}^{l} (J^i_1)^T B^i \left( \sum_{p=1}^{P} J^i_p v_p \right) + \frac{1}{C} v_1 \\ \vdots \\ \frac{1}{l} \sum_{i=1}^{l} (J^i_P)^T B^i \left( \sum_{p=1}^{P} J^i_p v_p \right) + \frac{1}{C} v_P \end{bmatrix}, \quad \text{where } v = \begin{bmatrix} v_1 \\ \vdots \\ v_P \end{bmatrix} \qquad (18)$$

is partitioned according to our variable split.
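The product (18) can be simulated on one machine by looping over partitions: every partition first forms its local $J^i_p v_p$, the $P$ pieces are summed (an allreduce in the distributed setting), and each partition then applies $(J^i_p)^T B^i$ to the sum. A hedged NumPy sketch with dense per-partition Jacobian blocks, using $B^i = 2I$ for the squared loss (5) as derived just below:

```python
import numpy as np

def gauss_newton_matvec(Jp, v_parts, C):
    """Compute Gv of (18). Jp[i][p] is the n_L x n_p block J^i_p,
    v_parts[p] is the local piece v_p, and B^i = 2I for the loss (5)."""
    l, P = len(Jp), len(v_parts)
    out = [v_parts[p] / C for p in range(P)]
    for i in range(l):
        # sum_p J^i_p v_p: one allreduce across partitions per instance i.
        Jv = sum(Jp[i][p] @ v_parts[p] for p in range(P))
        BJv = 2.0 * Jv                        # B^i (.) for the squared loss (5)
        for p in range(P):
            out[p] += (Jp[i][p].T @ BJv) / l
    return out
```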

From (9) and the loss function defined in (5),

$$B^i_{ts} = \frac{\partial^2 \sum_{j=1}^{n_L} \left(z^{L,i}_j - y^i_j\right)^2}{\partial z^{L,i}_t \, \partial z^{L,i}_s} = \frac{\partial \, 2\left(z^{L,i}_t - y^i_t\right)}{\partial z^{L,i}_s} = \begin{cases} 2 & \text{if } t = s, \\ 0 & \text{otherwise.} \end{cases}$$

However, after the variable partition, each $J^i_p$ may still be a huge matrix. The total space for storing $J^i_p, \forall i$, is roughly

$$n_L \times \frac{n}{P} \times l.$$

If $l$, the number of data instances, is so large that

$$\frac{l \times n_L}{P} > n,$$

then storing $J^i_p, \forall i$, requires more space than the $n \times n$ Gauss-Newton matrix. To reduce the memory consumption, we will propose effective techniques in Sections 3.3, 4.3, and 6.1.


With the variable partition, function, gradient, and Jacobian calculations become complicated. We discuss details in Sections 3.2 and 3.3.

3.2 Distributed Function Evaluation

From (3) we know how to evaluate the function value on a single machine, but the implementation in a distributed environment is not trivial. Here we check the details from the perspective of an individual partition. Consider a partition that involves neurons in sets $T_{m-1}$ and $T_m$ from layers $m-1$ and $m$, respectively. Thus

$$T_{m-1} \subset \{1, \ldots, n_{m-1}\} \quad \text{and} \quad T_m \subset \{1, \ldots, n_m\}.$$

Because (3) is a forward process, we assume that

$$s^{m-1,i}_t, \quad i = 1, \ldots, l,\; \forall t \in T_{m-1}$$

are available at the current partition. The goal is to generate

$$s^{m,i}_j, \quad i = 1, \ldots, l,\; \forall j \in T_m$$

and pass them to partitions between layers $m$ and $m+1$. To begin, we calculate

$$z^{m-1,i}_t = \sigma(s^{m-1,i}_t), \quad i = 1, \ldots, l \text{ and } t \in T_{m-1}. \qquad (19)$$

Then, from (3), the following local values can be calculated for $i = 1, \ldots, l$, $j \in T_m$:

$$\begin{cases} \sum_{t \in T_{m-1}} w^m_{tj} z^{m-1,i}_t + b^m_j & \text{if } T_{m-1} \text{ is the first neuron sub-group of layer } m-1, \\ \sum_{t \in T_{m-1}} w^m_{tj} z^{m-1,i}_t & \text{otherwise.} \end{cases} \qquad (20)$$

After the local sum in (20) is obtained, we must sum up values in partitions between layers $m-1$ and $m$:

$$s^{m,i}_j = \sum_{T_{m-1} \in \mathcal{P}_{m-1}} \left( \text{local sum in (20)} \right), \qquad (21)$$

where $i = 1, \ldots, l$, $j \in T_m$, and

$$\mathcal{P}_{m-1} = \{ T_{m-1} \mid T_{m-1} \text{ is any sub-group of neurons at layer } m-1 \}.$$

The resulting $s^{m,i}_j$ values should be broadcasted to partitions between layers $m$ and $m+1$ that correspond to the neuron subset $T_m$. We explain details of (21) and the broadcast operation in Section 3.2.1.

3.2.1 Allreduce and Broadcast Operations

The goal of (21) is to generate and broadcast $s^{m,i}_j$ values to some partitions between layers $m$ and $m+1$, so a reduce operation seems to be sufficient. However, we will explain in Section 3.3 that for the Jacobian evaluation, and then the product between the Gauss-Newton matrix and a vector, the partitions between layers $m-1$ and $m$ corresponding to $T_m$ also need $s^{m,i}_j$ for calculating

$$z^{m,i}_j = \sigma(s^{m,i}_j), \quad i = 1, \ldots, l,\; j \in T_m. \qquad (22)$$

To this end, we consider an allreduce operation so that not only are values reduced from some partitions between layers $m-1$ and $m$, but also the result is broadcasted to them. After this is done, we make the same result $s^{m,i}_j$ available in partitions between layers $m$ and $m+1$ by choosing the partition corresponding to the first neuron sub-group of layer $m-1$ to conduct a broadcast operation. Note that for partitions between layers $L-1$ and $L$ (i.e., the last layer), a broadcast operation is not needed.

Consider the example in Figure 2. For partitions $(A_1, A_2)$, $(A_1, B_2)$, and $(A_1, C_2)$, all of them must get $s^{1,i}_j, j \in A_1$, calculated via (21):

$$s^{1,i}_j = \underbrace{\sum_{t \in A_0} w^1_{tj} z^{0,i}_t + b^1_j}_{(A_0, A_1)} + \underbrace{\sum_{t \in B_0} w^1_{tj} z^{0,i}_t}_{(B_0, A_1)} + \underbrace{\sum_{t \in C_0} w^1_{tj} z^{0,i}_t}_{(C_0, A_1)}. \qquad (23)$$

The three local sums are available at partitions $(A_0, A_1)$, $(B_0, A_1)$, and $(C_0, A_1)$, respectively. We first conduct an allreduce operation so that $s^{1,i}_j, j \in A_1$, are available at partitions $(A_0, A_1)$, $(B_0, A_1)$, and $(C_0, A_1)$. Then we choose $(A_0, A_1)$ to broadcast values to $(A_1, A_2)$, $(A_1, B_2)$, and $(A_1, C_2)$.
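To illustrate the allreduce in (21)/(23), the sketch below sums the three local partial sums held by $(A_0, A_1)$, $(B_0, A_1)$, and $(C_0, A_1)$ and makes the result available to all of them. A real run would use an MPI-style allreduce between three processes; here the communication is only simulated within one process, and the random arrays are placeholders for the actual local sums of (20).

```python
import numpy as np

rng = np.random.default_rng(1)
l, nA1 = 4, 100                                # number of instances, |A_1|

# Local sums of (20) held by the three partitions between layers 0 and 1;
# random placeholders stand in for sum_t w^1_{tj} z^{0,i}_t (+ b^1_j).
local = {
    "(A0,A1)": rng.standard_normal((l, nA1)),
    "(B0,A1)": rng.standard_normal((l, nA1)),
    "(C0,A1)": rng.standard_normal((l, nA1)),
}

# Allreduce: every partition in the group ends up with the full s^{1,i}_j of (23).
s1 = sum(local.values())
after_allreduce = {name: s1 for name in local}

# Broadcast: (A0,A1), the partition with the first input sub-group, then sends
# s^{1,i}_j to the partitions between layers 1 and 2 that involve A_1.
received = {name: s1 for name in ["(A1,A2)", "(A1,B2)", "(A1,C2)"]}
```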

Depending on the system configurations, suitable ways can be considered for implementing the allreduce and the broadcast operations (Thakur et al., 2005). In Section IV of the supplementary materials we give details of our implementation.

To derive the loss value, we need one final reduce operation. For the example in Figure 2, in the end we have $z^{2,i}_j, j \in A_2, B_2, C_2$, respectively available in partitions $(A_1, A_2)$, $(A_1, B_2)$, and $(A_1, C_2)$. We then need the following reduce operation

$$\| z^{2,i} - y_i \|^2 = \sum_{j \in A_2} (z^{2,i}_j - y^i_j)^2 + \sum_{j \in B_2} (z^{2,i}_j - y^i_j)^2 + \sum_{j \in C_2} (z^{2,i}_j - y^i_j)^2 \qquad (24)$$

and let $(A_1, A_2)$ have the loss term in the objective value.

We have discussed the calculation of the loss term in the objective value, but we also need to obtain the regularization term $\theta^T\theta/2$. One possible setting is that before the loss-term calculation we run a reduce operation to sum up all local regularization terms. For example, in one partition corresponding to neuron sub-groups $T_{m-1}$ at layer $m-1$ and $T_m$ at layer $m$, the local value is

$$\sum_{t \in T_{m-1}} \sum_{j \in T_m} (w^m_{tj})^2. \qquad (25)$$

On the other hand, we can embed the calculation into the forward process for obtaining the loss term. The idea is that we append the local regularization term in (25) to the vector in (20) for an allreduce operation in (21). The cost is negligible because we only increase the length of each vector by one. After the allreduce operation, we broadcast the resulting vector to partitions between layers $m$ and $m+1$ that correspond to the neuron sub-group $T_m$. We cannot let each partition collect the broadcasted value for subsequent allreduce operations because regularization terms in previous layers would be calculated several times. To this end, we allow only the partition corresponding to $T_m$ in layer $m$ and the first neuron sub-group in layer $m+1$ to collect the value and include it with the local regularization term for the subsequent allreduce operation. By continuing the forward process, in the end we get the whole regularization term.

We use Figure 2 to give an illustration. The allreduce operation in (23) now also calculates

$$\underbrace{\sum_{t \in A_0} \sum_{j \in A_1} (w^1_{tj})^2 + \sum_{j \in A_1} (b^1_j)^2}_{(A_0, A_1)} + \underbrace{\sum_{t \in B_0} \sum_{j \in A_1} (w^1_{tj})^2}_{(B_0, A_1)} + \underbrace{\sum_{t \in C_0} \sum_{j \in A_1} (w^1_{tj})^2}_{(C_0, A_1)}. \qquad (26)$$

The resulting value is broadcasted to $(A_1, A_2)$, $(A_1, B_2)$, and $(A_1, C_2)$. Then only $(A_1, A_2)$ collects the value and generates the following local sum:

$$(26) + \sum_{t \in A_1} \sum_{j \in A_2} (w^2_{tj})^2 + \sum_{j \in A_2} (b^2_j)^2.$$

In the end we have

1. $(A_1, A_2)$ contains regularization terms from
$(A_0, A_1)$, $(B_0, A_1)$, $(C_0, A_1)$, $(A_1, A_2)$, $(A_0, B_1)$, $(B_0, B_1)$, $(C_0, B_1)$, $(B_1, A_2)$.

2. $(A_1, B_2)$ contains regularization terms from
$(A_1, B_2)$, $(B_1, B_2)$.

3. $(A_1, C_2)$ contains regularization terms from
$(A_1, C_2)$, $(B_1, C_2)$.

We can then extend the reduce operation in (24) to generate the final value of the regularization term.

3.3 Distributed Jacobian Calculation

From (7) and similar to the way of calculating the gradient in (11)-(14), the Jacobian matrix satisfies the following properties:

$$\frac{\partial z^{L,i}_u}{\partial w^m_{tj}} = \frac{\partial z^{L,i}_u}{\partial s^{m,i}_j} \frac{\partial s^{m,i}_j}{\partial w^m_{tj}}, \qquad (28)$$

$$\frac{\partial z^{L,i}_u}{\partial b^m_j} = \frac{\partial z^{L,i}_u}{\partial s^{m,i}_j} \frac{\partial s^{m,i}_j}{\partial b^m_j}, \qquad (29)$$

where $i = 1, \ldots, l$, $u = 1, \ldots, n_L$, $m = 1, \ldots, L$, $j = 1, \ldots, n_m$, and $t = 1, \ldots, n_{m-1}$. However, these formulations do not reveal how they are calculated in a distributed setting. Similar to Section 3.2, we check details from the perspective of any variable partition. Assume the current partition involves neurons in sets $T_{m-1}$ and $T_m$ from layers $m-1$ and $m$, respectively. Then we aim to obtain the following Jacobian components:

$$\frac{\partial z^{L,i}_u}{\partial w^m_{tj}} \ \text{and}\ \frac{\partial z^{L,i}_u}{\partial b^m_j}, \quad \forall t \in T_{m-1},\; \forall j \in T_m,\; u = 1, \ldots, n_L,\; i = 1, \ldots, l.$$

Before showing how to calculate them, we first get from (3) that

$$\frac{\partial z^{L,i}_u}{\partial s^{m,i}_j} = \frac{\partial z^{L,i}_u}{\partial z^{m,i}_j} \frac{\partial z^{m,i}_j}{\partial s^{m,i}_j} = \frac{\partial z^{L,i}_u}{\partial z^{m,i}_j} \sigma'(s^{m,i}_j), \qquad (30)$$

$$\frac{\partial s^{m,i}_j}{\partial w^m_{tj}} = z^{m-1,i}_t \quad \text{and} \quad \frac{\partial s^{m,i}_j}{\partial b^m_j} = 1, \qquad (31)$$

$$\frac{\partial z^{L,i}_u}{\partial z^{L,i}_j} = \begin{cases} 1 & \text{if } j = u, \\ 0 & \text{otherwise.} \end{cases} \qquad (32)$$

From (28)-(32), the elements of the local Jacobian matrix can be derived by

$$\frac{\partial z^{L,i}_u}{\partial w^m_{tj}} = \frac{\partial z^{L,i}_u}{\partial z^{m,i}_j} \frac{\partial z^{m,i}_j}{\partial s^{m,i}_j} \frac{\partial s^{m,i}_j}{\partial w^m_{tj}} = \begin{cases} \dfrac{\partial z^{L,i}_u}{\partial z^{m,i}_j} \sigma'(s^{m,i}_j) z^{m-1,i}_t & \text{if } m < L, \\ \sigma'(s^{L,i}_u) z^{L-1,i}_t & \text{if } m = L,\; j = u, \\ 0 & \text{if } m = L,\; j \neq u, \end{cases} \qquad (33)$$

and

$$\frac{\partial z^{L,i}_u}{\partial b^m_j} = \frac{\partial z^{L,i}_u}{\partial z^{m,i}_j} \frac{\partial z^{m,i}_j}{\partial s^{m,i}_j} \frac{\partial s^{m,i}_j}{\partial b^m_j} = \begin{cases} \dfrac{\partial z^{L,i}_u}{\partial z^{m,i}_j} \sigma'(s^{m,i}_j) & \text{if } m < L, \\ \sigma'(s^{L,i}_u) & \text{if } m = L,\; j = u, \\ 0 & \text{if } m = L,\; j \neq u, \end{cases} \qquad (34)$$

where $u = 1, \ldots, n_L$, $i = 1, \ldots, l$, $t \in T_{m-1}$, and $j \in T_m$.

We discuss how to have the values in the right-hand side of (33) and (34) available at the current computing node. From (19), we have

$$z^{m-1,i}_t, \quad \forall i = 1, \ldots, l,\; \forall t \in T_{m-1}$$

available from the forward process of calculating the function value. Further, in (21)-(22), to obtain $z^{m,i}_j$ for layers $m$ and $m+1$, we use an allreduce operation rather than a reduce operation so that

$$s^{m,i}_j, \quad \forall i = 1, \ldots, l,\; \forall j \in T_m$$

are available at the current partition between layers $m-1$ and $m$. Therefore, $\sigma'(s^{m,i}_j)$ in (33)-(34) can be obtained. The remaining issue is to generate $\partial z^{L,i}_u / \partial z^{m,i}_j$. We will show that they can be obtained by a backward process. Because the discussion assumes that we are currently at a partition between layers $m-1$ and $m$, we show details of generating $\partial z^{L,i}_u / \partial z^{m-1,i}_t$ and dispatching them to partitions between layers $m-2$ and $m-1$.

From (3) and (30), $\partial z^{L,i}_u / \partial z^{m-1,i}_t$ can be calculated by

$$\frac{\partial z^{L,i}_u}{\partial z^{m-1,i}_t} = \sum_{j=1}^{n_m} \frac{\partial z^{L,i}_u}{\partial s^{m,i}_j} \frac{\partial s^{m,i}_j}{\partial z^{m-1,i}_t} = \sum_{j=1}^{n_m} \frac{\partial z^{L,i}_u}{\partial z^{m,i}_j} \sigma'(s^{m,i}_j) w^m_{tj}. \qquad (35)$$

Therefore, we consider a backward process of using $\partial z^{L,i}_u / \partial z^{m,i}_j$ to generate $\partial z^{L,i}_u / \partial z^{m-1,i}_t$. In a distributed system, from (32) and (35),

$$\frac{\partial z^{L,i}_u}{\partial z^{m-1,i}_t} = \begin{cases} \displaystyle\sum_{T_m \in \mathcal{P}_m} \sum_{j \in T_m} \frac{\partial z^{L,i}_u}{\partial z^{m,i}_j} \sigma'(s^{m,i}_j) w^m_{tj} & \text{if } m < L, \\[1ex] \displaystyle\sum_{T_m \in \mathcal{P}_m} \sigma'(s^{L,i}_u) w^L_{tu} & \text{if } m = L, \end{cases} \qquad (36)$$

where $i = 1, \ldots, l$, $u = 1, \ldots, n_L$, $t \in T_{m-1}$, and

$$\mathcal{P}_m = \{ T_m \mid T_m \text{ is any sub-group of neurons at layer } m \}. \qquad (37)$$

Clearly, each partition calculates the local sum over $j \in T_m$. Then a reduce operation is needed to sum up values in all corresponding partitions between layers $m-1$ and $m$. Subsequently, we discuss details of how to transfer data to partitions between layers $m-2$ and $m-1$.

Consider the example in Figure 2. The partition $(A_0, A_1)$ must get

$$\frac{\partial z^{L,i}_u}{\partial z^{1,i}_t}, \quad t \in A_1,\; u = 1, \ldots, n_L,\; i = 1, \ldots, l.$$

From (36),

$$\frac{\partial z^{L,i}_u}{\partial z^{1,i}_t} = \underbrace{\sum_{j \in A_2} \frac{\partial z^{L,i}_u}{\partial z^{2,i}_j} \sigma'(s^{2,i}_j) w^2_{tj}}_{(A_1, A_2)} + \underbrace{\sum_{j \in B_2} \frac{\partial z^{L,i}_u}{\partial z^{2,i}_j} \sigma'(s^{2,i}_j) w^2_{tj}}_{(A_1, B_2)} + \underbrace{\sum_{j \in C_2} \frac{\partial z^{L,i}_u}{\partial z^{2,i}_j} \sigma'(s^{2,i}_j) w^2_{tj}}_{(A_1, C_2)}. \qquad (38)$$

Note that these three sums are available at partitions $(A_1, A_2)$, $(A_1, B_2)$, and $(A_1, C_2)$, respectively. Therefore, (38) is a reduce operation. Further, the values obtained in (38) are needed in partitions not only $(A_0, A_1)$ but also $(B_0, A_1)$ and $(C_0, A_1)$. Therefore, we need a broadcast operation so values can be available in the corresponding partitions.

For details of implementing reduce and broadcast operations, see Section IV of the supplementary materials. Algorithm 2 summarizes the backward process to calculate $\partial z^{L,i}_u / \partial z^{m,i}_j$.
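The backward step (36) for one partition can be sketched as follows: given $\partial z^{L,i}_u/\partial z^{m,i}_j \, \sigma'(s^{m,i}_j)$ for $j \in T_m$ (the stored quantity (41) described in Section 3.3.1), each partition forms its local contribution to $\partial z^{L,i}_u/\partial z^{m-1,i}_t$, and a reduce over the partitions sharing the same $T_{m-1}$ gives the full value. The communication is only simulated here, and the array names and toy sizes are ours.

```python
import numpy as np

def backward_dz_local(dz_sprime, W_block):
    """Local part of (36) for one partition (T_{m-1}, T_m), one instance i:
    dz_sprime[u, j] = dz^{L,i}_u/dz^{m,i}_j * sigma'(s^{m,i}_j)  (j in T_m),
    W_block[t, j]   = w^m_{tj} for t in T_{m-1}, j in T_m.
    Returns an n_L x |T_{m-1}| matrix: the partition's local sum over j."""
    return dz_sprime @ W_block.T

# Toy sizes: n_L = 3 outputs, |T_{m-1}| = 4, layer m split into two sub-groups.
rng = np.random.default_rng(2)
nL, nTprev, group_sizes = 3, 4, [5, 6]
local_sums = [backward_dz_local(rng.standard_normal((nL, g)),
                                rng.standard_normal((nTprev, g)))
              for g in group_sizes]

# Reduce over the partitions that share T_{m-1}: this yields dz^{L,i}/dz^{m-1,i}_t,
# which is then broadcast to the partitions between layers m-2 and m-1.
dz_prev = sum(local_sums)
print(dz_prev.shape)   # (3, 4)
```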

3.3.1 Memory Requirement

We have mentioned in Section 3.1 that storing all elements in the Jacobian matrix may not be viable. In the distributed setting, if we store all Jacobian elements corresponding to the current partition, then

$$|T_{m-1}| \times |T_m| \times n_L \times l \qquad (39)$$

space is needed. We propose a technique to save space by noting that (28) can be written as the product of two terms. From (30)-(31), the first term is related to only $T_m$, while the second is related to only $T_{m-1}$:

$$\frac{\partial z^{L,i}_u}{\partial w^m_{tj}} = \left[ \frac{\partial z^{L,i}_u}{\partial s^{m,i}_j} \right] \left[ \frac{\partial s^{m,i}_j}{\partial w^m_{tj}} \right] = \left[ \frac{\partial z^{L,i}_u}{\partial z^{m,i}_j} \sigma'(s^{m,i}_j) \right] \left[ z^{m-1,i}_t \right]. \qquad (40)$$

Both terms are available from our earlier calculation. Specifically, we allocate space to receive $\partial z^{L,i}_u / \partial z^{m,i}_j$ from previous layers. After obtaining the values, we replace them with

$$\frac{\partial z^{L,i}_u}{\partial z^{m,i}_j} \sigma'(s^{m,i}_j) \qquad (41)$$

for future use. Therefore, the Jacobian matrix is not explicitly stored. Instead, we use the two terms in (40) for the product between the Gauss-Newton matrix and a vector in the CG procedure. See details in Section 4.2. Note that we also need to calculate and store the local sum before the reduce operation in (36) for getting $\partial z^{L,i}_u / \partial z^{m-1,i}_t, \forall t \in T_{m-1}, \forall u, \forall i$. Therefore, the memory consumption is proportional to

$$l \times n_L \times \left( |T_{m-1}| + |T_m| \right).$$

This setting significantly reduces the memory consumption of directly storing the Jacobian matrix in (39).
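A quick back-of-the-envelope check of (39) versus the factored storage: for one partition with $|T_{m-1}| = 50$, $|T_m| = 100$, $n_L = 30$, and $l = 10{,}000$ (numbers chosen only for illustration), storing every Jacobian element needs $|T_{m-1}|\,|T_m|\,n_L\,l$ values, while keeping the two factors of (40) needs only $l\,n_L\,(|T_{m-1}| + |T_m|)$.

```python
# Illustrative memory estimate (double precision, 8 bytes per value).
T_prev, T_cur, n_L, l = 50, 100, 30, 10_000

full_jacobian = T_prev * T_cur * n_L * l            # (39)
factored      = l * n_L * (T_prev + T_cur)          # the two factors in (40)

print(f"full Jacobian: {full_jacobian * 8 / 1e9:.1f} GB")   # 12.0 GB
print(f"factored:      {factored * 8 / 1e9:.1f} GB")        #  0.4 GB
```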

3.3.2 Sigmoid Activation Function

In the discussion so far, we consider a general differentiable activation function $\sigma(s^{m,i}_j)$. In the implementation in this paper, we consider the sigmoid function except for the output layer:

$$z^{m,i}_j = \sigma(s^{m,i}_j) = \begin{cases} \dfrac{1}{1 + e^{-s^{m,i}_j}} & \text{if } m < L, \\[1ex] s^{m,i}_j & \text{if } m = L. \end{cases} \qquad (42)$$

Then,

$$\sigma'(s^{m,i}_j) = \begin{cases} \dfrac{e^{-s^{m,i}_j}}{\left(1 + e^{-s^{m,i}_j}\right)^2} = z^{m,i}_j \left(1 - z^{m,i}_j\right) & \text{if } m < L, \\[1ex] 1 & \text{if } m = L, \end{cases}$$

and (33)-(34) become

$$\frac{\partial z^{L,i}_u}{\partial w^m_{tj}} = \begin{cases} \dfrac{\partial z^{L,i}_u}{\partial z^{m,i}_j} z^{m,i}_j \left(1 - z^{m,i}_j\right) z^{m-1,i}_t & \text{if } m < L, \\ z^{L-1,i}_t & \text{if } m = L,\; j = u, \\ 0 & \text{if } m = L,\; j \neq u, \end{cases} \qquad \frac{\partial z^{L,i}_u}{\partial b^m_j} = \begin{cases} \dfrac{\partial z^{L,i}_u}{\partial z^{m,i}_j} z^{m,i}_j \left(1 - z^{m,i}_j\right) & \text{if } m < L, \\ 1 & \text{if } m = L,\; j = u, \\ 0 & \text{if } m = L,\; j \neq u, \end{cases}$$

where $u = 1, \ldots, n_L$, $i = 1, \ldots, l$, $t \in T_{m-1}$, and $j \in T_m$.

3.4 Distributed Gradient Calculation

For the gradient calculation, from (4),

$$\frac{\partial f}{\partial w^m_{tj}} = \frac{1}{C} w^m_{tj} + \frac{1}{l} \sum_{i=1}^{l} \frac{\partial \xi_i}{\partial w^m_{tj}} = \frac{1}{C} w^m_{tj} + \frac{1}{l} \sum_{i=1}^{l} \sum_{u=1}^{n_L} \frac{\partial \xi_i}{\partial z^{L,i}_u} \frac{\partial z^{L,i}_u}{\partial w^m_{tj}}, \qquad (43)$$

where $\partial z^{L,i}_u / \partial w^m_{tj}, \forall t, \forall j$, are components of the Jacobian matrix; see also the matrix form in (6). From (33), we have known how to calculate $\partial z^{L,i}_u / \partial w^m_{tj}$. Therefore, if $\partial \xi_i / \partial z^{L,i}_u$ is passed to the current partition, we can easily obtain the gradient vector via (43). This can be finished in the same backward process of calculating the Jacobian matrix.

On the other hand, in the technique that will be introduced in Section 4.3, we only consider a subset of instances to construct the Jacobian matrix as well as the Gauss-Newton matrix. That is, by selecting a subset $S \subset \{1, \ldots, l\}$, only $J^i, \forall i \in S$, are considered. Thus we do not have all the needed $\partial z^{L,i}_u / \partial w^m_{tj}$ for (43). In this situation, we can separately consider a backward process to calculate the gradient vector. From a derivation similar to (33),

$$\frac{\partial \xi_i}{\partial w^m_{tj}} = \frac{\partial \xi_i}{\partial z^{m,i}_j} \sigma'(s^{m,i}_j) z^{m-1,i}_t, \quad m = 1, \ldots, L. \qquad (44)$$

By considering $\partial \xi_i / \partial z^{m,i}_j$ to be like $\partial z^{L,i}_u / \partial z^{m,i}_j$ in (36), we can apply the same backward process so that each partition between layers $m-2$ and $m-1$ must wait for $\partial \xi_i / \partial z^{m-1,i}_j$ from partitions between layers $m-1$ and $m$:

$$\frac{\partial \xi_i}{\partial z^{m-1,i}_t} = \sum_{T_m \in \mathcal{P}_m} \sum_{j \in T_m} \frac{\partial \xi_i}{\partial z^{m,i}_j} \sigma'(s^{m,i}_j) w^m_{tj}, \qquad (45)$$

where $i = 1, \ldots, l$, $t \in T_{m-1}$, and $\mathcal{P}_m$ is defined in (37). For the initial $\partial \xi_i / \partial z^{L,i}_j$ in the backward process, from the loss function defined in (5),

$$\frac{\partial \xi_i}{\partial z^{L,i}_j} = 2 \left( z^{L,i}_j - y^i_j \right).$$

From (43), a difference from the Jacobian calculation is that here we obtain a sum over all instances $i$. Earlier we separately maintained terms related to $T_{m-1}$ and $T_m$ to avoid storing all Jacobian elements. With the summation over $i$, we can afford to store

$$\frac{\partial f}{\partial w^m_{tj}} \ \text{and}\ \frac{\partial f}{\partial b^m_j}, \quad \forall t \in T_{m-1},\; \forall j \in T_m.$$
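For one partition $(T_{m-1}, T_m)$, the gradient entries (43)-(44) summed over instances, plus the regularization part, can be sketched as below. Here `dxi_dz` and `sprime` stand for $\partial \xi_i/\partial z^{m,i}_j$ and $\sigma'(s^{m,i}_j)$, which the backward process of (45) would deliver to this partition; all names and the dense array layout are illustrative assumptions.

```python
import numpy as np

def local_gradient(Z_prev, dxi_dz, sprime, W_block, b_block, C, l):
    """Partition-local gradient of (43) using (44):
    Z_prev[i, t]  = z^{m-1,i}_t           for t in T_{m-1},
    dxi_dz[i, j]  = dxi_i / dz^{m,i}_j    for j in T_m,
    sprime[i, j]  = sigma'(s^{m,i}_j),
    W_block[t, j] = w^m_{tj},  b_block[j] = b^m_j."""
    d_s = dxi_dz * sprime                      # dxi_i/ds^{m,i}_j, as in (11)
    gW = W_block / C + (Z_prev.T @ d_s) / l    # (44) summed over i, plus w/C
    gb = b_block / C + d_s.sum(axis=0) / l     # only if this partition owns b^m
    return gW, gb
```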

4 Techniques to Reduce Computational, Communication, and Synchronization Cost

In this section we propose some novel techniques to make the distributed Newton method a practical approach for deep neural networks.

4.1 Diagonal Gauss-Newton Matrix Approximation

In (18) for the Gauss-Newton matrix-vector products in the CG procedure, we notice that the communication occurs for reducing $P$ vectors

$$J^i_1 v_1, \ldots, J^i_P v_P,$$

each with size $O(n_L)$, and then broadcasting the sum to all nodes. To avoid the high communication cost in some distributed systems, we may consider the diagonal blocks

of the Gauss-Newton matrix as its approximation:

$$\hat{G} = \frac{1}{C} I + \begin{bmatrix} \frac{1}{l} \sum_{i=1}^{l} (J^i_1)^T B^i J^i_1 & & \\ & \ddots & \\ & & \frac{1}{l} \sum_{i=1}^{l} (J^i_P)^T B^i J^i_P \end{bmatrix}. \qquad (49)$$

Then (17) becomes $P$ independent linear systems

$$\left( \frac{1}{l} \sum_{i=1}^{l} (J^i_1)^T B^i J^i_1 + \frac{1}{C} I + \lambda_k I \right) d^k_1 = -g^k_1,$$
$$\vdots \qquad\qquad (50)$$
$$\left( \frac{1}{l} \sum_{i=1}^{l} (J^i_P)^T B^i J^i_P + \frac{1}{C} I + \lambda_k I \right) d^k_P = -g^k_P,$$

where $g^k_1, \ldots, g^k_P$ are the local components of the gradient:

$$\nabla f(\theta^k) = \begin{bmatrix} g^k_1 \\ \vdots \\ g^k_P \end{bmatrix}.$$

The matrix-vector product becomes

$$G v \approx \hat{G} v = \begin{bmatrix} \frac{1}{l} \sum_{i=1}^{l} (J^i_1)^T B^i J^i_1 v_1 + \frac{1}{C} v_1 \\ \vdots \\ \frac{1}{l} \sum_{i=1}^{l} (J^i_P)^T B^i J^i_P v_P + \frac{1}{C} v_P \end{bmatrix}, \qquad (51)$$

in which each $(\hat{G} v)_p$ can be calculated using only local information because we have independent linear systems. For the CG procedure at any partition, it is terminated if the following relative stopping condition holds

$$\left\| \frac{1}{l} \sum_{i=1}^{l} (J^i_p)^T B^i J^i_p v_p + \left( \frac{1}{C} + \lambda_k \right) v_p + g^k_p \right\| \leq \sigma \| g^k_p \|, \qquad (52)$$

or the number of CG iterations reaches a pre-specified limit. Here $\sigma$ is a pre-specified tolerance. Unfortunately, partitions may finish their CG procedures at different times, a situation that results in significant waiting time. To address this synchronization cost, we propose some novel techniques in Section 4.4.

Some past works have considered using diagonal blocks as the approximation of the Hessian. For logistic regression, Bian et al. (2013) consider diagonal elements of the Hessian to solve several one-variable sub-problems in parallel. Mahajan et al. (2017) study a more general setting in which using diagonal blocks is a special case.
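Under the diagonal approximation (49)-(50), each partition can work on its own small system with no communication. The sketch below forms the local matrix explicitly and solves it directly, which is only feasible for a tiny example; the paper instead runs CG per partition with the stopping rule (52). The use of $B^i = 2I$ follows from the squared loss (5), and all names are ours.

```python
import numpy as np

def local_newton_direction(Jp, gp, C, lam):
    """Solve one block of (50):
    (1/l * sum_i (J^i_p)^T B^i J^i_p + (1/C + lam) I) d = -g_p,
    with B^i = 2I for the squared loss (5). Jp is a list of n_L x n_p blocks."""
    l, n_p = len(Jp), gp.size
    A = (1.0 / C + lam) * np.eye(n_p)
    for Ji in Jp:
        A += 2.0 * (Ji.T @ Ji) / l
    return np.linalg.solve(A, -gp)

# Each partition p would call local_newton_direction on its own (Jp, g_p), so the
# P directions d^k_1, ..., d^k_P of (50) are obtained without communication.
```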

4.2 Product Between Gauss-Newton Matrix and a Vector

In the CG procedure the main computational task is the matrix-vector product. We present techniques for its efficient calculation. From (51), for the $p$th partition, the product between the local diagonal block of the Gauss-Newton matrix and a vector $v_p$ takes the following form:

$$(J^i_p)^T B^i J^i_p v_p.$$

Assume the $p$th partition involves neuron sub-groups $T_{m-1}$ and $T_m$, respectively, in layers $m-1$ and $m$, and this partition is not responsible for handling the bias term $b^m_j, \forall j \in T_m$. Then

$$J^i_p \in \mathbb{R}^{n_L \times (|T_{m-1}| \times |T_m|)} \quad \text{and} \quad v_p \in \mathbb{R}^{(|T_{m-1}| \times |T_m|) \times 1}.$$

Let $\text{mat}(v_p) \in \mathbb{R}^{|T_{m-1}| \times |T_m|}$ be the matrix representation of $v_p$. From (40), the $u$th component $(J^i_p v_p)_u$ is

$$\sum_{t \in T_{m-1}} \sum_{j \in T_m} \frac{\partial z^{L,i}_u}{\partial w^m_{tj}} \left( \text{mat}(v_p) \right)_{tj} = \sum_{t \in T_{m-1}} \sum_{j \in T_m} \frac{\partial z^{L,i}_u}{\partial s^{m,i}_j} z^{m-1,i}_t \left( \text{mat}(v_p) \right)_{tj}. \qquad (53)$$
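Equation (53) shows why the two stored factors of (40) suffice: for each instance, $(J^i_p v_p)_u$ is a double sum that factorizes into the $n_L \times |T_m|$ matrix holding $\partial z^{L,i}_u / \partial s^{m,i}_j$ (the replacement (41)) and the length-$|T_{m-1}|$ vector $z^{m-1,i}$. A hedged sketch with a small consistency check against the explicitly formed Jacobian:

```python
import numpy as np

def Jp_times_v(dz_ds, z_prev, mat_v):
    """(53) for one instance i:
    dz_ds[u, j] = dz^{L,i}_u / ds^{m,i}_j  (n_L x |T_m|, stored via (41)),
    z_prev[t]   = z^{m-1,i}_t              (length |T_{m-1}|),
    mat_v[t, j] = mat(v_p)_{tj}            (|T_{m-1}| x |T_m|).
    Returns the length-n_L vector J^i_p v_p."""
    # First sum_t z^{m-1,i}_t mat(v_p)_{tj}, then contract over j with dz/ds.
    return dz_ds @ (z_prev @ mat_v)

# Toy check against the explicit Jacobian built element-wise from (40).
rng = np.random.default_rng(3)
nL, Tp, Tc = 3, 4, 5
dz_ds = rng.standard_normal((nL, Tc))
z_prev = rng.standard_normal(Tp)
mat_v = rng.standard_normal((Tp, Tc))
J_full = np.einsum('uj,t->utj', dz_ds, z_prev).reshape(nL, Tp * Tc)
assert np.allclose(Jp_times_v(dz_ds, z_prev, mat_v), J_full @ mat_v.reshape(-1))
```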
