
State of the art

This chapter presents the state of the art in neural network training. Several neural network architectures are described in Section 8.1, and Sections 8.2.2, 8.2.3, and 8.2.4 present three state-of-the-art approaches to training neural networks.

8.1 Neural network architectures

Perceptrons

Perceptrons were developed in the 1950s and 1960s [Rosenblatt, 1958]. Figure 8.1 depicts a model of a perceptron. A perceptron takes several binary inputs $z_1, z_2, \ldots$ and produces a single binary output $x$. The output is either 0 or 1, determined by whether the weighted sum $\sum_j w_j z_j$ is less than or greater than some threshold value $\mathrm{thre}$. A basic mathematical model of one neuron is given by

$$x = \begin{cases} 0 & \text{if } \sum_j w_j z_j \le \mathrm{thre} \\ 1 & \text{if } \sum_j w_j z_j > \mathrm{thre} \end{cases} \qquad (8.1)$$

where the weights $w_j$ are real numbers expressing the importance of the respective inputs in the output.
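To make Equation (8.1) concrete, here is a minimal Python sketch; the input, weight, and threshold values are arbitrary illustrative choices, not values from the text.

```python
import numpy as np

def perceptron(z, w, thre):
    """Binary perceptron output following Equation (8.1)."""
    weighted_sum = np.dot(w, z)            # sum_j w_j * z_j
    return 1 if weighted_sum > thre else 0

# Example with arbitrary binary inputs, weights and threshold.
z = np.array([1, 0, 1])                    # binary inputs z_1, z_2, z_3
w = np.array([0.5, -1.0, 2.0])             # real-valued weights w_j
print(perceptron(z, w, thre=1.0))          # -> 1, since 0.5 + 2.0 = 2.5 > 1.0
```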

Figure 8.1: The model of a perceptron.

Figure 8.2: Model of a multilayer perceptron with four layers: an input layer, two hidden layers, and an output layer.

Multilayer perceptron

A multilayer perceptron (MLP) [Rosenblatt, 1958] is a feedforward neural network consisting of $N$ layers of fully connected perceptrons, as shown in Figure 8.2. Let $k_n$ be the number of elements (neurons) in the $n$-th layer, $p$ the data index, and $P$ the number of data samples. We denote by $z^{(n)}_{jp}$ the output of the $j$-th element of the $n$-th layer for the $p$-th sample, which serves as input to the next layer. Let $w^{(n)}_{ij}$ be the weight from the $j$-th element to the $i$-th element and $u^{(n)}_i$ the $i$-th bias term between the $n$-th and the $(n+1)$-th layer. The neural network can be defined as

$$x^{(n+1)}_{ip} = \sum_{j=1}^{k_n} w^{(n)}_{ij} z^{(n)}_{jp} + u^{(n)}_i \qquad (8.2)$$

$$z^{(n+1)}_{ip} = f\left(x^{(n+1)}_{ip}\right) \qquad (8.3)$$

where $f$ represents a nonlinear activation function. Possible activation functions include the sigmoid

$$f(x) = \frac{1}{1 + \exp(-x)} \qquad (8.4)$$

the hyperbolic tangent

$$f(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)} \qquad (8.5)$$

or the rectified linear unit [Zeiler et al., 2013]

$$f(x) = \max(0, x). \qquad (8.6)$$
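As an illustration of Equations (8.2)-(8.6), the following NumPy sketch propagates a batch of samples through fully connected layers; the layer sizes, the random parameters, and the helper names `mlp_forward` and `sigmoid` are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))            # Equation (8.4)

def mlp_forward(z, weights, biases, f=sigmoid):
    """Forward pass through fully connected layers (Equations (8.2)-(8.3)).

    z       : array of shape (P, k_1), the input layer activations
    weights : list of arrays, weights[n] has shape (k_{n+1}, k_n)
    biases  : list of arrays, biases[n] has shape (k_{n+1},)
    """
    for W, u in zip(weights, biases):
        x = z @ W.T + u                         # pre-activation, Equation (8.2)
        z = f(x)                                # activation, Equation (8.3)
    return z

# Illustrative 3-2-1 network with random parameters.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((2, 3)), rng.standard_normal((1, 2))]
biases = [np.zeros(2), np.zeros(1)]
print(mlp_forward(rng.standard_normal((4, 3)), weights, biases).shape)  # (4, 1)
```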

Softmax output layer

When the outputs of a network are interpretable as posterior probabilities for a categorical target variable, it is highly desirable for those outputs to lie between zero and one and to sum to one.


Figure 8.3: Model of an RBM.

A softmax output layer is then used as the output layer in order to convert a $K$-dimensional pre-activation vector $\mathbf{x}$ into an output vector $\mathbf{z}$ in the range $(0,1)$:

$$z_i = \frac{\exp(x_i)}{\sum_{j=1}^{K} \exp(x_j)}. \qquad (8.7)$$
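A short sketch of Equation (8.7) follows; subtracting the maximum pre-activation before exponentiating is a common numerical-stability trick added here and does not change the result.

```python
import numpy as np

def softmax(x):
    """Map a K-dimensional pre-activation vector x to probabilities (Equation (8.7))."""
    e = np.exp(x - np.max(x))      # shift by max(x) for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # entries lie in (0, 1) and sum to 1
```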

Maxout network

The maxout model [Goodfellow et al., 2013] is a feedforward architecture, such as a multilayer perceptron or a convolutional neural network, that uses an activation function called the maxout unit. The maxout unit is given by

$$f_i(x) = \max_{j \in [1,k]} x_{ij} \qquad (8.8)$$

where

$$x_{ij} = \mathbf{w}_{ij}^{T}\mathbf{z} + u_{ij} \qquad (8.9)$$

where $\mathbf{z}$ is the input vector, the $\mathbf{w}_{ij}$ are trainable weight vectors, and the $u_{ij}$ are trainable biases.
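A possible NumPy sketch of a maxout layer following Equations (8.8)-(8.9); the number of units, the number of pieces $k$, and the helper name `maxout_layer` are illustrative assumptions.

```python
import numpy as np

def maxout_layer(z, W, u):
    """Maxout activations following Equations (8.8)-(8.9).

    z : input vector of shape (d,)
    W : weight vectors w_ij stacked into shape (num_units, k, d)
    u : biases u_ij of shape (num_units, k)
    """
    x = W @ z + u                  # x_ij = w_ij^T z + u_ij, shape (num_units, k)
    return x.max(axis=1)           # f_i = max_j x_ij

rng = np.random.default_rng(0)
z = rng.standard_normal(5)
W = rng.standard_normal((3, 2, 5))     # 3 maxout units with k = 2 pieces each
u = np.zeros((3, 2))
print(maxout_layer(z, W, u).shape)     # (3,)
```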

Deep belief net

Deep belief nets (DBN) were first proposed by Hinton [Hinton et al., 2006]. A DBN is a generative type of deep neural network in which each layer is constructed from a restricted Boltzmann machine (RBM). An RBM, as shown in Figure 8.3, is a generative stochastic artificial neural network that can learn a probability distribution over its inputs. It can also be viewed as an undirected graphical model with one visible layer and one hidden layer, with connections between the visible units and the hidden units but no connections among the visible units or among the hidden units themselves.

Given training data, training is achieved in two steps. In the first step, the RBM parameters are adjusted using the so-called contrastive divergence criterion such that the probability distribution represented by the RBM fits the training data as well as possible. Because this training process does not require labels, it is a form of unsupervised training, also called "pre-training". This pre-training is then repeated greedily for all layers from the first hidden layer (after the input) to the last hidden layer (before the output).

Pre-training in deep neural networks refers to unsupervised training with RBMs. The joint distribution of the hidden layer $\mathbf{h}$ and the visible layer $\mathbf{v}$ can be written as

$$p(\mathbf{v},\mathbf{h}) = \frac{1}{Z}\exp\bigl(-E(\mathbf{v},\mathbf{h})\bigr) \qquad (8.10)$$

where $Z$ is a normalization constant and $E(\mathbf{v},\mathbf{h})$ is an energy function. For Bernoulli RBMs, the energy function is:

$$E(\mathbf{v},\mathbf{h}) = -\sum_{i}\sum_{j} w_{ij} v_i h_j - \sum_{j} a_j h_j - \sum_{i} b_i v_i$$

where $w_{ij}$ denotes the weight of the undirected edge connecting visible node $v_i$ and hidden node $h_j$, and $a$ and $b$ are the bias terms for the hidden and visible units, respectively. For Gaussian RBMs, assuming that the visible units have zero mean and unit variance, the energy function is:

$$E(\mathbf{v},\mathbf{h}) = \sum_{i} \frac{(v_i - b_i)^2}{2} - \sum_{i}\sum_{j} w_{ij} v_i h_j - \sum_{j} a_j h_j.$$

An RBM is pre-trained generatively to maximize the data log-likelihood $\log \sum_{\mathbf{h}} p(\mathbf{v},\mathbf{h})$ by using so-called contrastive divergence.
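The sketch below illustrates one contrastive divergence update (CD-1) for a Bernoulli RBM using the notation above; the learning rate, the single Gibbs step, and the function name `cd1_step` are choices made for the example rather than prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, lr=0.01):
    """One CD-1 update for a Bernoulli RBM.

    v0 : binary visible vector of shape (num_visible,)
    W  : weights w_ij of shape (num_visible, num_hidden)
    a  : hidden biases of shape (num_hidden,)
    b  : visible biases of shape (num_visible,)
    """
    # Positive phase: sample the hidden units given the data.
    ph0 = sigmoid(v0 @ W + a)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step to obtain a reconstruction.
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + a)
    # Updates approximate the gradient of the data log-likelihood.
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    a += lr * (ph0 - ph1)
    b += lr * (v0 - v1)
    return W, a, b
```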

When pre-training is finished for one hidden layer, one forward pass of this RBM is executed to obtain the input values of the next hidden layer. When all layers have been pre-trained, an output layer is added and, given target labels, the network is fine-tuned in a supervised fashion using the back-propagation algorithm [Rumelhart et al., 1986]. This procedure is usually applied to DBNs or MLPs with sigmoid units in order to obtain a better initialization.
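A high-level sketch of this greedy layer-wise procedure is given below, assuming a hypothetical `train_rbm` routine (for instance built on the CD-1 step above); the supervised fine-tuning stage is only indicated by a comment.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def greedy_pretrain(data, hidden_sizes, train_rbm):
    """Stack RBMs layer by layer (illustrative sketch, hypothetical train_rbm).

    data         : training samples of shape (P, num_visible)
    hidden_sizes : number of hidden units for each layer
    train_rbm    : function (data, num_hidden) -> (W, a, b), e.g. trained with CD
    """
    z, stack = data, []
    for num_hidden in hidden_sizes:
        W, a, b = train_rbm(z, num_hidden)   # unsupervised pre-training of one RBM
        stack.append((W, a))                 # keep weights and hidden biases
        z = sigmoid(z @ W + a)               # forward pass feeds the next RBM
    # An output layer is then added and the whole network is fine-tuned
    # with back-propagation on labelled data.
    return stack
```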

Convolutional neural network

Convolutional neural networks (CNN), or ConvNets, were first proposed by LeCun [LeCun et al., 1998]. A CNN consists of several layers, each of which can be a convolutional layer, a pooling layer, or a fully connected layer.

In the convolutional layer, given an $m \times n$ matrix of inputs, a small $d \times h$ rectangle of units called a local receptive field is connected to a unit of the next layer through $d \times h$ trainable weights plus a bias, passed through a nonlinear activation function ($d \ll m$, $h \ll n$). Intuitively, this can be understood as a small MLP. It extracts information from a group of neighboring units of the previous layer, exploiting the fact that neighboring units are highly correlated. The rectangle of $d \times h$ trainable weights, called a 'window', moves across the entire input with no overlap between window positions, and each position produces one unit in the next layer. This means that all neighborhoods of units in the previous layer share the same weights.
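The following sketch implements the non-overlapping window operation described above, with a single $d \times h$ weight window shared across all positions; the window size, input size, and ReLU activation are illustrative assumptions.

```python
import numpy as np

def conv_nonoverlap(inp, W, bias, f=lambda x: np.maximum(0.0, x)):
    """Slide a d x h weight window over an m x n input without overlap.

    Every window position uses the same trainable weights W and bias
    (weight sharing) and produces one unit of the next layer.
    """
    m, n = inp.shape
    d, h = W.shape
    out = np.empty((m // d, n // h))
    for i in range(m // d):
        for j in range(n // h):
            patch = inp[i * d:(i + 1) * d, j * h:(j + 1) * h]  # local receptive field
            out[i, j] = f(np.sum(W * patch) + bias)
    return out

rng = np.random.default_rng(0)
print(conv_nonoverlap(rng.standard_normal((8, 8)),
                      rng.standard_normal((2, 2)), 0.0).shape)   # (4, 4)
```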
