
1.3 Artificial neural network architectures

1.3.2 Convolutional neural networks (CNNs)

In this subsection we will discuss the Convolutional Neural Network (CNN), which is a specialized neural network architecture initially proposed for image processing.

Like MLPs / DNNs, CNNs contain neurons connected by learnable weights and biases, and the operations performed by the neurons (units) are the same as in MLPs: each neuron receives some inputs, computes a dot product and then applies a nonlinearity to the result. The entire system is also differentiable (like MLPs), uses a loss function, and the typical optimization procedures (stochastic gradient descent, backpropagation) apply, as will be discussed in Section 1.5. Although mostly used for image processing, in recent years different CNN variants have been extended to many more domains, e.g. speech recognition [205] or natural language processing [82]. Our presentation is mostly inspired by the excellent introductions in [10] and [12]. To simplify the presentation, we will assume that the inputs to the CNN are images; extensions to CNNs processing text or other types of inputs are straightforward, and many of the ideas and motivations carry over naturally to those cases.

The first significant difference between CNNs and MLPs is the pattern of connectivity of each neuron. While in MLPs the neurons of a layer are fully connected to all the neurons of the next layer, in CNNs the pattern of connectivity is sparse and local. This is inspired by work in neuroscience, where Hubel and Wiesel [113] discovered that neurons in the visual cortex of cats act as local filters over their input space. The small sub-region of the input each cell is sensitive to is called a receptive field, a term also used to denote the connectivity pattern of CNN neurons.

Another conceptual difference between CNNs and MLPs is that in CNNs the neurons / units are replicated across 2D arrays. A 2D array of replicated units shares the same trainable parameters and is called a feature map. This means that the weights and biases of multiple neurons have the same values. The shared trainable parameters are also referred to as a filter. As is also the case for RNNs, and as discussed in more detail in Section 1.5, the gradient with regard to a shared trainable parameter is the sum of the gradients with regard to all the parameters being shared.
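To make this last point concrete with a short worked equation (in generic notation, not tied to any particular layer): if a single trainable weight w is replicated at P positions, i.e. w_1 = ... = w_P = w, then by the chain rule the gradient of the loss L with regard to w accumulates the contributions of all the replicas:

∂L/∂w = ∂L/∂w_1 + ∂L/∂w_2 + · · · + ∂L/∂w_P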

The replicated units and replicated filters allow features to be detected regardless of the position in the input image where they appear. This property is called equivariance and roughly means that the resulting feature map changes in the same way the input image does: if the image is translated one pixel to the left, so is the resulting feature map after applying the filter.

Similarly to MLPs, CNNs are deep neural networks, with multiple layers. A CNN is obtained by repeatedly stacking blocks of convolutional and subsampling layers, which will be discussed below. For classification, the task of most interest for this thesis, the repeated blocks of convolutional and subsampling layers are most commonly followed by fully-connected layers and a softmax layer.

For clarity, we’ll discuss how CNNs work when a single example (image) is provided as input. Modern CNNs are usually trained on mini-batches containing multiple images. The discussion here can easily be extended to the mini-batch case, by adding an extra dimension (corresponding to the mini-batch) to every tensor.
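As a minimal illustration of this extra dimension, the following NumPy sketch shows the tensor shapes involved (the (batch, channels, height, width) layout used here is one common convention, not the only one):

```python
import numpy as np

# A single RGB image: (channels, height, width).
image = np.random.randn(3, 32, 32)

# A mini-batch of 16 such images: add a leading mini-batch dimension,
# giving a 4D tensor of shape (batch, channels, height, width).
batch = np.random.randn(16, 3, 32, 32)

print(image.shape)  # (3, 32, 32)
print(batch.shape)  # (16, 3, 32, 32)
```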

1.3.2.1 Convolutional layer

For simplicity, we will only discuss stride 1 convolution here, which is the most relevant for the work in this thesis. We will also not discuss zero padding settings. A discussion on strides and paddings for convolutional layers is provided in [12].

The feature maps discussed in the previous subsection are obtained in a CNN in the following manner. The input image is convolved with a linear filter, a bias term is added and then a nonlinear function f is applied. The convolution can be interpreted as repeatedly applying the same function across sub-regions of the entire image. To allow for a richer representation of the data, each convolutional or subsampling hidden layer contains multiple feature maps h_k, k ∈ {0, ..., K}. We will denote the k-th feature map by h_k, the corresponding filter, composed of trainable weights and biases, by W_k and b_k, and the input image by x. A nonlinear function is then applied element-wise to every pixel in every feature map. The corresponding equation for pixel [i, j] in feature map h_k is:

h_k[i, j] = f((W_k ∗ x)[i, j] + b_k)    (1.14)

We have introduced convolution (denoted here by ∗) in Section 1.2.1.2. f can be any of the nonlinearities presented in Section 1.2.2, with ReLU being a popular modern choice.
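As an illustration of Equation 1.14, the following is a minimal NumPy sketch (not the implementation used in this thesis) of a stride-1, unpadded convolution producing one feature map from a grayscale input; the function and variable names are illustrative only. Note that, as in most deep learning software, the sliding window below is in fact a cross-correlation; the kernel flip of a true convolution is omitted because the filter weights are learned anyway.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def conv_feature_map(x, W_k, b_k, f=relu):
    """One feature map h_k as in Eq. (1.14): h_k[i, j] = f((W_k * x)[i, j] + b_k).

    x   : 2D input image, shape (H, W)
    W_k : 2D filter of trainable weights, shape (kh, kw)
    b_k : scalar bias, shared across the whole feature map
    """
    H, W = x.shape
    kh, kw = W_k.shape
    h = np.empty((H - kh + 1, W - kw + 1))   # stride 1, no zero padding
    for i in range(h.shape[0]):
        for j in range(h.shape[1]):
            h[i, j] = np.sum(x[i:i + kh, j:j + kw] * W_k) + b_k
    return f(h)

# Example: a 5x5 image and a 3x3 filter give a 3x3 feature map.
x = np.random.randn(5, 5)
h_k = conv_feature_map(x, W_k=np.random.randn(3, 3), b_k=0.1)
print(h_k.shape)  # (3, 3)
```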

The discussion above would apply exactly for a CNN with a single convolutional layer. Similarly to the theoretical results obtained for deep MLPs [158], deep CNNs are more expressive than shallow ones [149]. We will now extend the discussion of the convolution operation to two consecutive hidden layers, m−1 and m. We will denote feature map k in layer m by h^m_k. The input image can be interpreted as layer 0.

For an RGB image, it would contain three feature maps (one for each of the red, green and blue channels), while a grayscale image would contain a single feature map.

The layers h^m can be convolutional (including the input image) or subsampling (the subsampling layer is presented in the next subsection). By aggregating the weights W_k for every feature map h_k, we obtain a 4D tensor which contains elements for every destination feature map h^m_k, every source feature map h^{m−1}_l, source vertical position i and source horizontal position j. W_{kl}[i, j] denotes the weights connecting each pixel of the k-th feature map at layer m with the pixel at coordinates [i, j] of the l-th feature map at layer m−1. The biases b can be represented as a vector indexed by the destination feature maps; b^m_k denotes the bias corresponding to destination feature map h^m_k.
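The following is a minimal NumPy sketch of this layer-to-layer convolution, indexing the 4D weight tensor as described above (the shapes and names are illustrative assumptions, not the thesis's implementation):

```python
import numpy as np

def conv_layer(h_prev, W, b, f=lambda z: np.maximum(z, 0.0)):
    """Convolutional layer between layers m-1 and m (stride 1, no padding).

    h_prev : feature maps of layer m-1, shape (L, H, W)
    W      : 4D weight tensor, shape (K, L, kh, kw); W[k, l] holds the weights
             connecting source feature map l to destination feature map k
    b      : bias vector, shape (K,); one bias per destination feature map
    Returns the K feature maps of layer m, shape (K, H - kh + 1, W - kw + 1).
    """
    L, H, Wd = h_prev.shape
    K, _, kh, kw = W.shape
    h = np.zeros((K, H - kh + 1, Wd - kw + 1))
    for k in range(K):            # destination feature maps
        for l in range(L):        # sum the contributions of all source maps
            for i in range(h.shape[1]):
                for j in range(h.shape[2]):
                    h[k, i, j] += np.sum(h_prev[l, i:i + kh, j:j + kw] * W[k, l])
        h[k] += b[k]
    return f(h)

# Layer 0 of an RGB image has L = 3 feature maps (red, green, blue).
x = np.random.randn(3, 32, 32)
h1 = conv_layer(x, W=np.random.randn(8, 3, 5, 5), b=np.zeros(8))
print(h1.shape)  # (8, 28, 28)
```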

1.3.2.2 Subsampling layer

Many different proposals for subsampling layers exist, but we will discuss here the max-pooling layer, which is the most successful and most widely-used subsampling layer. Other notable approaches to subsampling are the average pooling layer [53] and the use of a stride larger than 1 in the preceding convolutional layer [177]. What is common to these layers is that, when applied to a group of pixels of a feature map, they first optionally perform a transformation of the pixels, then select one of them as the result. What is specific to max-pooling is that it selects the pixel with the maximum value (of the group).

The most common choice is to divide the pixels into non-overlapping groups, typically of width 2 and height 2. When used with max-pooling, this is called 2 × 2 max-pooling and is the most widely used setting for a pooling layer.

A more detailed discussion of various types of pooling layers, with different modes to group pixels (with potential overlapping and various subsampling options, as well as the sizes of the corresponding resulting feature maps) is provided in [12].

Given input feature map i, the output feature map o is given by:

o[m, n] = max(i[2m, 2n], i[2m+1, 2n], i[2m, 2n+1], i[2m+1, 2n+1])    (1.15)
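A minimal NumPy sketch of Equation 1.15, for a single feature map with even height and width (illustrative only):

```python
import numpy as np

def max_pool_2x2(fmap):
    """Non-overlapping 2 x 2 max-pooling of one feature map (Eq. 1.15).

    fmap : 2D feature map with even height and width, shape (H, W)
    Returns o of shape (H // 2, W // 2), where o[m, n] is the maximum of the
    2 x 2 group of pixels at rows 2m, 2m+1 and columns 2n, 2n+1.
    The layer has no trainable parameters.
    """
    H, W = fmap.shape
    assert H % 2 == 0 and W % 2 == 0
    # Reshape so that each 2 x 2 group gets its own pair of axes, then reduce.
    return fmap.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fmap))
# [[ 5.  7.]
#  [13. 15.]]
```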

The main motivation behind the max-pooling layer is to provide some translation invariance and robustness to small image distortions. The resulting reduction in feature map size (both the width and the height are divided by 2 in the case of 2 × 2 max-pooling) also helps to reduce the computational requirements of the CNN.

An interesting feature of the max-pooling layer is that it doesn’t contain any trainable parameters.

1.3.2.3 Output layer for classification

When the CNN is used for classification, all the pixels in the feature maps resulting from the last layer (convolutional, nonlinear or subsampling) are reshaped to a 1D array. Fully-connected layers and a softmax layer can then be added to perform classification.
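A minimal NumPy sketch of this output layer, using a single fully-connected layer followed by softmax (names and shapes are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify(feature_maps, W_fc, b_fc):
    """Reshape the last feature maps to a 1D array, then apply one
    fully-connected layer and a softmax layer.

    feature_maps : output of the last layer, shape (K, H, W)
    W_fc         : fully-connected weights, shape (n_classes, K * H * W)
    b_fc         : biases, shape (n_classes,)
    Returns a vector of class probabilities, shape (n_classes,).
    """
    v = feature_maps.reshape(-1)          # flatten all pixels to a 1D array
    return softmax(W_fc @ v + b_fc)

h = np.random.randn(8, 7, 7)              # e.g. 8 feature maps of size 7 x 7
probs = classify(h, W_fc=np.random.randn(10, 8 * 7 * 7), b_fc=np.zeros(10))
print(probs.shape, probs.sum())            # (10,) and a sum of 1.0
```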

Other possibilities exist in the literature to obtain classification scores from 2D feature maps (e.g. [177]), but we won’t discuss them here.

1.3.2.4 CNN design patterns

In this section we will very briefly mention some of the most influential CNN architectures in deep learning. The most influential task on which CNNs are trained is object recognition on the ImageNet dataset [23].

Our simplified description of a CNN above is most in line with the VGG CNN architecture [171], since it is the one closest to the CNNs we have used in our work.

This architecture is very homogeneous, containing 3 × 3 convolutional layers and 2 × 2 max-pooling layers. It was introduced in the context of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, where it was the runner-up.
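For illustration, the following is a minimal sketch of such a homogeneous block written with the PyTorch library (a generic VGG-style block, not the exact configuration from [171]):

```python
import torch.nn as nn

def vgg_style_block(in_channels, out_channels, num_convs=2):
    """A homogeneous block: 3 x 3 convolutions (padding 1 keeps the spatial
    size) followed by 2 x 2 max-pooling, which halves height and width."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
                   nn.ReLU()]
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2))
    return nn.Sequential(*layers)

# Stacking several such blocks, followed by fully-connected layers and a
# softmax layer for classification, yields a VGG-like CNN.
block = vgg_style_block(3, 64)
```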

One of the early influential CNN architectures trained with backpropagation was LeNet-5, introduced by Yann LeCun for the task of digit recognition [130], which we discuss in more detail in Section 2.1.1.

AlexNet [126] is the architecture which popularized the use of CNNs in Computer Vision, after winning the ILSVRC challenge in 2012. It also popularized the use of GPUs for training neural networks and led to deep learning becoming the dominant paradigm in Computer Vision.

GoogLeNet [185] won the ILSVRC 2014 challenge and introduced Inception Modules, which significantly reduced the number of trainable parameters (compared to e.g. AlexNet). The GoogLeNet architecture has gone through several iterations, the most recent one being v4 [184].

Residual CNNs (ResNets) [104] won ILSVRC 2015 and introduced skip connections to ease the training of very deep networks, by reducing the problem of vanishing gradients (see Subsection 1.5.2.5). As of 2018, ResNet variants are often the default choice for using CNNs in practice.

We have only briefly mentioned some of the most famous CNN architectures. A more detailed discussion regarding the architectures mentioned above is provided in [12].

We will discuss in more detail some CNNs which have been influential in the context of handwriting recognition (including LeNet) in Chapter 2.