
Let y be an output vector of a DNN, corresponding to an observation vector o. As before (see Formula (2.2)), [W, b] denotes a set of DNN parameters to be estimated from training samples T = {(o_m, y_m) : 0 ≤ m ≤ M}, where o_m is the m-th observation vector, and y_m is the corresponding target vector. The process of [W, b] parameter estimation (or DNN training) is characterized by a training criterion and a learning algorithm.

2.3.1 Training criteria

In order to estimate the parameters of a DNN, a suitable training criterion (or loss function) should be specified. Two aspects should be taken into account when choosing a loss function F:

1. Simplicity of evaluation.

2. Correlation with the final goal of the task.

Several popular training criteria are used for DNN training:

• Mean square error (MSE):

F_MSE(W, b; o, y) = (1/2) ‖h^L − y‖² ,

where h^L is the DNN output vector. This criterion is used for regression tasks.

• Cross-entropy (CE):

F_CE(W, b; o, y) = − Σ_i y_i log h^L_i , (2.15)

where y_i = P_target(i|o) is the empirical probability, observed in the training set, that the observation vector o belongs to class i, and h^L_i = P_DNN(i|o) is the same probability estimated by the DNN. This criterion is used for classification tasks, where y is a probability distribution. Minimizing the CE criterion is equivalent to minimizing the Kullback-Leibler divergence (KLD) between the empirical probability distribution (of targets) and the probability distribution estimated by the DNN. In many tasks, hard class labels are used as targets:

y_i = 1 if i = c_o, and y_i = 0 otherwise,

where c_o is the class label of the observation vector o in the training set. In this case the CE criterion (2.15) becomes the negative log-likelihood (NLL) criterion:

F_NLL(W, b; o, y) = − log h^L_{c_o} . (2.17)
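The three criteria above can be sketched in a few lines of numpy for a single observation; the function and variable names below are illustrative, not from the thesis (h stands for the DNN output h^L, y for the target):

```python
import numpy as np

def mse(h, y):
    # Mean square error: F_MSE = 1/2 * ||h - y||^2
    return 0.5 * np.sum((h - y) ** 2)

def cross_entropy(h, y):
    # F_CE = -sum_i y_i * log h_i, with y a probability distribution
    return -np.sum(y * np.log(h))

def nll(h, c):
    # Negative log-likelihood for a hard class label c: F_NLL = -log h_c
    return -np.log(h[c])

h = np.array([0.7, 0.2, 0.1])        # softmax outputs of the DNN
y_hard = np.array([1.0, 0.0, 0.0])   # one-hot target for class c_o = 0

# With one-hot targets, CE reduces to NLL, as in Formula (2.17):
assert np.isclose(cross_entropy(h, y_hard), nll(h, 0))
```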

2.3.2 Learning algorithm

Given the training criterion, the parameters [W, b] of a DNN can be discriminatively trained (DT) with the error backpropagation (BP) algorithm [Rumelhart et al., 1986] by propagating derivatives of the loss function. In the simplest version, the DNN parameters can be improved using first-order gradient information as follows [Yu and Deng, 2014]:

W^l_{t+1} ← W^l_t − ε ΔW^l_t (2.18)

and

b^l_{t+1} ← b^l_t − ε Δb^l_t , (2.19)

where

ΔW^l_t = (1/M_b) Σ_{m=1}^{M_b} ∇_{W^l} F(W, b; o_m, y_m) and Δb^l_t = (1/M_b) Σ_{m=1}^{M_b} ∇_{b^l} F(W, b; o_m, y_m)

are the average weight matrix gradient and the average bias vector gradient, respectively, at iteration t, computed on M_b samples, and ∇_x F is the gradient of F with respect to x.

The parameter updates in Formulas (2.18) and (2.19) are estimated on a batch of training samples. The choice of the batch size influences the final result of the training, as well as the convergence speed. In the simplest approach, referred to as batch training, the batch is the whole training set. An alternative approach is the stochastic gradient descent (SGD) algorithm [Bishop, 2006; Bottou, 1998], where gradients are estimated from a single sample. However, the most common technique is to compute the derivatives on small, randomly chosen mini-batches of training samples. In this thesis we will use the last approach, which is also referred to as the mini-batch SGD algorithm in the literature [Li et al., 2014c].
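As a concrete sketch, the mini-batch SGD loop might look as follows; a toy linear least-squares model stands in for the DNN, and all names here are illustrative rather than taken from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))                # observations o_m
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true                               # noiseless targets y_m

w = np.zeros(3)                              # parameters to train
eps, batch_size = 0.1, 16                    # learning rate, mini-batch size M_b

for epoch in range(50):
    idx = rng.permutation(len(X))            # fresh random mini-batches each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        # average MSE gradient over the mini-batch, cf. the update (2.18)
        grad = (X[b].T @ (X[b] @ w - Y[b])) / len(b)
        w -= eps * grad

assert np.allclose(w, w_true, atol=1e-3)     # SGD recovers the true parameters
```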

There are some practical considerations that have to be taken into account for efficient DNN training [Ruder, 2016; Yu and Deng, 2014]. One important question is the choice of the learning rate ε. If the learning rate is too small, the learning algorithm converges too slowly; on the contrary, if it is too large, it can prevent the learning from converging to an optimal solution. Learning rate schedules [Darken and Moody, 1991] regulate the learning rate during training: according to a fixed schedule, depending on the changes in the objective function, or component-wise, depending on the geometry of the observed data and the parameter sparseness. Popular gradient descent optimization algorithms include [Ruder, 2016]: momentum, Nesterov accelerated gradient (NAG) [Nesterov, 1983], Adagrad [Duchi et al., 2011], Adadelta [Zeiler, 2012], adaptive moment estimation (Adam) [Kingma and Ba, 2014], natural gradient descent [Amari, 1998], and RMSprop [Tieleman and Hinton, 2012].
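Two common fixed learning-rate schedules can be sketched as follows; the formulas are generic textbook choices, not taken from the thesis:

```python
import math

def exponential_decay(eps0, t, k=0.01):
    # eps_t = eps0 * exp(-k * t): smooth decay over iterations t
    return eps0 * math.exp(-k * t)

def step_decay(eps0, epoch, drop=0.5, every=10):
    # multiply the learning rate by `drop` every `every` epochs
    return eps0 * (drop ** (epoch // every))

assert step_decay(0.1, 0) == 0.1      # unchanged during the first 10 epochs
assert step_decay(0.1, 10) == 0.05    # halved after epoch 10
```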

The batch normalization technique, proposed in [Ioffe and Szegedy, 2015], allows significant acceleration of DNN training. It consists of a normalization step that fixes the means and variances of layer inputs.
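The normalization step can be sketched as below, following the convention of [Ioffe and Szegedy, 2015] with learned scale and shift parameters gamma and beta (the function name and toy data are mine):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                  # per-feature mean over the mini-batch
    var = x.var(axis=0)                  # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta          # rescale and shift the normalized input

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 4))
y = batch_norm(x, gamma=1.0, beta=0.0)

# After normalization, each feature has approximately zero mean, unit variance:
assert np.allclose(y.mean(axis=0), 0.0, atol=1e-6)
assert np.allclose(y.std(axis=0), 1.0, atol=1e-2)
```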

Below we consider some other important issues related to the DNN training procedure, such as momentum, pre-training, and regularization techniques.

2.3.3 Momentum

In order to speed up the training and to smooth the parameter updates, information about the previous updates is included into the gradient update in the form of momentum [Qian, 1999; Rumelhart et al., 1985]. When momentum is applied, Formulas (2.18) and (2.19) are replaced with:

W^l_{t+1} ← W^l_t − ε ΔW^l_t + α Δ_{t−1}(W^l) (2.22)

and

b^l_{t+1} ← b^l_t − ε Δb^l_t + α Δ_{t−1}(b^l), (2.23)

where

Δ_t(W^l) = W^l_{t+1} − W^l_t (2.24)

and

Δ_t(b^l) = b^l_{t+1} − b^l_t , (2.25)

and 0 < α < 1 is a momentum coefficient (factor).
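The momentum update of Formulas (2.22) and (2.24) can be sketched for a single weight matrix W (layer index dropped; function and variable names are mine):

```python
import numpy as np

def momentum_step(W, grad, prev_update, eps=0.1, alpha=0.9):
    new_W = W - eps * grad + alpha * prev_update   # Formula (2.22)
    update = new_W - W                             # Formula (2.24)
    return new_W, update

W = np.zeros(2)
update = np.zeros(2)                 # Delta_{t-1} is zero at the first step
grad = np.array([1.0, -1.0])         # constant gradient, for illustration only
for _ in range(3):
    W, update = momentum_step(W, grad, update)

# With a constant gradient the step sizes grow geometrically:
# eps, eps(1 + alpha), eps(1 + alpha + alpha^2), ...
assert np.isclose(W[0], -0.1 * (1 + 1.9 + 2.71))
```

This illustrates why momentum accelerates progress along directions where the gradient is consistent: successive updates reinforce each other up to a factor of 1/(1 − α).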

2.3.4 Regularization

Overfitting can be a serious problem in DNN training due to the large number of estimated parameters. There are different ways to control and prevent overfitting in DNN training.

One solution to reduce overfitting is to apply regularization terms to the DNN parameter updates. Regularization aims to incorporate some prior information into the training criterion, to prevent the model from learning undesirable configurations. The common way is to add a complexity term (penalty) to the loss function to penalize certain configurations.

The most commonly used regularization terms include:

1. L1-penalty:

R_1(W) = Σ_l Σ_{i,j} |W^l_{ij}| ,

2. L2-penalty:

R_2(W) = Σ_l Σ_{i,j} (W^l_{ij})² ,

i.e. the L1 and squared L2 norms of the vector obtained by concatenating all the columns of the matrices W^l. These regularization terms are often referred to as weight decay in the literature. When regularization is applied, the training criterion is changed to:

F_reg(W, b; T) = F(W, b; T) + λ R_p(W), (2.28)

where p ∈ {1, 2}, depending on the chosen penalty type, and F(W, b; T) is a CE or MSE loss function that optimizes the empirical loss on the training set T.
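The regularized criterion (2.28) can be sketched as follows; the toy weight matrix and the function names are illustrative, not from the thesis:

```python
import numpy as np

def l1_penalty(Ws):
    return sum(np.abs(W).sum() for W in Ws)    # R_1: sum of |w_ij| over all layers

def l2_penalty(Ws):
    return sum((W ** 2).sum() for W in Ws)     # R_2: sum of w_ij^2 over all layers

def regularized_loss(base_loss, Ws, lam=0.01, p=2):
    R = l1_penalty(Ws) if p == 1 else l2_penalty(Ws)
    return base_loss + lam * R                 # Formula (2.28)

Ws = [np.array([[1.0, -2.0], [0.0, 3.0]])]    # a single toy weight matrix
assert l1_penalty(Ws) == 6.0                  # |1| + |-2| + |0| + |3|
assert l2_penalty(Ws) == 14.0                 # 1 + 4 + 0 + 9
```

Note how the L1 penalty grows linearly in the weight magnitudes (encouraging exact zeros, i.e. sparsity), while the L2 penalty grows quadratically (encouraging many small weights).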

2.3.5 Dropout

Dropout is an alternative powerful regularization technique [Dahl et al., 2013; Hinton et al., 2012b; Srivastava et al., 2014], which consists of removing a specified percentage of randomly selected hidden units or inputs during training or pre-training in order to improve generalization.

When a hidden neuron is dropped during training, its activation is set to 0. During training with dropout, a random subset of units must be repeatedly sampled, which slows down the training process. A fast dropout training algorithm, based on using a Gaussian approximation instead of doing Monte Carlo optimization, was proposed in [Wang and Manning, 2013].
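The sampling step can be sketched as below; the rescaling of the surviving units ("inverted dropout") is the common practice for matching train- and test-time expectations, not a detail quoted from the thesis:

```python
import numpy as np

def dropout(h, p, rng):
    # Zero each hidden activation with probability p ...
    mask = rng.random(h.shape) >= p
    # ... and rescale the survivors so the expected activation is unchanged.
    return (h * mask) / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones(10000)                     # toy layer of constant activations
out = dropout(h, p=0.5, rng=rng)

assert np.mean(out == 0.0) > 0.45      # roughly half the units are dropped
assert abs(out.mean() - 1.0) < 0.05    # expected activation is preserved
```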