
Proposed approach and network design

3.2.1 Convolutional Neural Networks

3.2.1.1 Biological similarity and basic principle

CNNs are biologically inspired by our visual cortex, which contains small localized groups of cells that respond to particular regions of the visual perceptive field. This biological analogy was highlighted by a famous experiment by Hubel and Wiesel in 1962 [41], which showed that certain neurons in the brain respond only when horizontal, vertical, or otherwise specifically oriented edges are present in the visualized scene.

Moreover, the experimenters found that these individual neuronal cells are arranged in a columnar structure and that, together, they carry out the cognitive tasks underlying human visual perception. This principle of delegating specific tasks to different components of a composite system maps naturally onto machine capabilities, and it became the basis behind Convolutional Neural Networks.

The biological inspiration behind the network's design and its working process is summarized in table 3.1. CNNs embody the hierarchical basis of DL through the way they learn descriptors: as we go deeper in the network, successive layers of the CNN learn features with increasing levels of abstraction.

| Neuronal Cells | Artificial Neurons |
|---|---|
| Individual neurons in the human brain react only in the presence of specific edges with particular orientations (horizontal, vertical, etc.) in the visible visual field. | Artificial neurons are only active when particular low level features are present in the input image. |
| These neurons, organized in a columnar architecture, are capable of producing human perception. | Artificial neurons are organized in layers and, together, they ensure an understanding of the input image content. |

Table 3.1: Biological Neuron vs Artificial Neuron

Generally, the features that can be extracted from a CNN's hidden layers belong to three types:

Figure 3.1: Low, Mid and High level features

Low level features (figure 3.1 (1))

Features of this first type are very basic shapes, such as horizontal, vertical or diagonal edges, curves, colors, etc.

Mid level features (figure 3.1 (2))

Mid level features are somewhat more complex than the previous type, since they are obtained by combining several basic level features together. Examples are circles, triangles, or other components of images.

High level features (figure 3.1 (3))

These descriptors have the highest level of abstraction, and they are the last parts evaluated to decide on the image's nature and on the objects it contains. Such features might lead us, for example, to decide whether the input sample contains handwritten text, a particular object, an animal, or a person.

3.2.1.2 Architectures and common layers

A simple Convolutional Neural Network is a succession of layers, each of which, as explained in section 1.2.3.3, converts one volume of activations to another through a differentiable function. Three crucial types of layers are found in a CNN: the Convolutional Layer, the Pooling Layer, and the Fully-Connected Layer (figure 3.2).

Figure 3.2: Typical CNN architecture

These layer types are present in essentially every CNN; what varies from one network to another is their number, which determines the network's depth. Other layers, such as the dropout layer, may also appear for optimization reasons. In order to understand this convolutional architecture well, the rest of this section briefly explains how these layers extract features and perform image classification.
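As a purely illustrative sketch, and not the architecture proposed in this work, such a conv/pool/FC stack could be declared as follows; the use of PyTorch, the 32x32 RGB input, the filter counts and the 10 output classes are all assumptions made for the example:

```python
import torch.nn as nn

# A minimal sketch of the typical CNN stack described above; all sizes
# (32x32 RGB input, filter counts, 10 classes) are hypothetical.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer
    nn.ReLU(),
    nn.MaxPool2d(2),                              # pooling layer: halves H and W
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # a deeper convolutional layer
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # fully-connected layer -> class scores
)
```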

First Convolutional Layer:

As its name indicates, the most basic layer of a Convolutional Network is the convolutional one. It is not only the first layer we come across in a CNN, but it is also generally introduced more than once throughout the architecture in order to obtain a truly deep network. We first focus on the first convolutional layer and explain it separately, since it has its particularities compared to the mid and final level ones.

In order to simulate this layer's entire process, assume an input image of n x m pixels with a color depth noted l (3 in the case of RGB channels).

Consequently, the input is a volume of size n x m x l (see figure 3.3). The principle of the convolution lies in a filter of fixed size that convolves over the input image through regions called receptive fields.

First, the filter is placed over the first receptive field. The original pixel values of that field are multiplied element-wise with the filter values, summed up, and the result is stored in the first position of the output layer associated with the filter being used.

Next, the filter is shifted by one step and assigned to the second receptive field. This process repeats until the whole image has been swept with the considered filter.
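The following is a minimal sketch of this sliding process for a single-channel image, with stride 1 and no padding; this is a simplification of the volume-based convolution described later:

```python
import numpy as np

# A toy sketch of the sliding-filter process: single-channel image,
# stride 1, no padding (real layers operate on volumes).
def convolve2d(image, filt):
    ih, iw = image.shape
    k = filt.shape[0]                         # assume a square k x k filter
    out = np.zeros((ih - k + 1, iw - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            field = image[i:i + k, j:j + k]   # current receptive field
            out[i, j] = np.sum(field * filt)  # multiply, sum, store
    return out
```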

Figure 3.3: Inputs and outputs of a first convolution layer

The filters used to convolve the image are simply arrays of numbers, called weights in artificial neural terms (figure 3.5 (a)). In convolution terminology, the filters are the components responsible for detecting the particular features introduced while explaining the basis behind CNNs. In fact, a single layer of the network usually contains hundreds of filters, each of which acts as an identifier of a particular shape or form (figure 3.5 (b)). This multiple use of filters basically aims to extract as many features as possible.

Figure 3.4: The output neuron of the convolution of one filter with one receptive field

As mentioned in section 2.2.1.1, the first layers of the network are designed to extract features with a low abstraction level. This means that the filters of the input convolution layer considered here act as detectors of basic descriptors.


The question that comes to mind is how we can look for a shape or form in the image using a filter, which is nothing more than a set of numerical values. The following example illustrates briefly and simply how this works.

Figure 3.5: Basic convolution filter representation

Consider the example shown in figure 3.5: assume that the input image is a mouse and that the region inside the yellow box is the current receptive field. We then try to find out whether the feature of figure 3.4 exists in this mouse image by simulating the working process of the convolution layer.

Figure 3.6: Convolution when receptive field and filter are similar

If we compute the multiplications and the sum as already explained, we obtain 6600 as a result, which is stored in the appropriate location of the output hidden layer corresponding to the filter used. This high value is due to the similarity between the filter shape and the image region appearing in the receptive field being used.

We should hence keep in mind that a convolution resulting in a high value means that the feature referenced by the filter has a strong probability of appearing in the locality of the image referenced by the corresponding position of the output layer. Next, since the filter has to sweep the whole image, let us see what happens when moving to a different receptive field.

Figure 3.7: Convolution when receptive field and filter are different

The convolution of the same filter with a different region of the image, modeled in figure 3.7, which clearly does not contain the shape we are looking for, returns zero. Thus, contrary to the previous case, a convolution of a filter with an image region that returns a low or null value means that the feature does not appear in that region of the input sample.
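As a toy numerical check of these two rules (the values below are hypothetical and not those of the mouse example):

```python
import numpy as np

# Hypothetical filter and receptive fields illustrating the two rules above.
filt = np.array([[0, 0, 30],
                 [0, 30, 0],
                 [30, 0, 0]])           # a diagonal-shape detector

matching_field = np.array([[0, 0, 30],
                           [0, 30, 0],
                           [30, 0, 0]])  # region containing the shape
empty_field = np.zeros((3, 3))           # region without the shape

print(np.sum(matching_field * filt))     # 2700: high value -> feature present
print(np.sum(empty_field * filt))        # 0: null value -> feature absent
```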

Those two simple rules summarize how filters and the convolution theory work for feature detection. We should, however, point out that in this example the image and the filter are highly simplified; in realistic cases where CNNs are used, they do not look this way. A set of low level features as extracted from natural input images is given in figure 3.8.

So far we have said that the convolution results are stored in the output hidden layer. The latter is called a feature map (or activation map), and it is one of the most important concepts in Convolutional Neural Networks.

A feature map is an array of numerical values that corresponds to a specific filter and thus to a particular feature. Each location (or value) of this volume corresponds to a particular, limited region of the input image. Besides, the appearance of the extracted descriptor is encoded through the values of that layer: high values signify that the feature exists there, while low or null values mean that it does not. The fact that it references the localities of the descriptors is what gives the feature map its name.

The last detail that should not be overlooked is that in a real application of CNNs we do not use a single filter per layer. Recall also that each activation map corresponds to one and only one filter.

Figure 3.8: Low level features

Consequently, the convolutional layer does not output a single feature map: the number of output maps depends on the filters defined and used for feature detection, and it also defines what is called the depth of the output hidden layer. Figure 3.9 shows an example using 6 convolving filters to obtain 6 output feature maps; a small sketch follows the figure.

Figure 3.9: Six convolution filters output
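Reusing the convolve2d sketch from above, the stacking of one feature map per filter can be illustrated as follows; the filter values and sizes are hypothetical:

```python
import numpy as np

# One feature map per filter: 6 filters on a 32x32 image give an output
# volume of depth 6 (reuses convolve2d defined in the earlier sketch).
rng = np.random.default_rng(0)
filters = [rng.standard_normal((5, 5)) for _ in range(6)]
image = rng.standard_normal((32, 32))

maps = np.stack([convolve2d(image, f) for f in filters])
print(maps.shape)   # (6, 28, 28): depth equals the number of filters
```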

For compatibility reasons, and in order to define the network's parameters well, it is important to track the sizes across its output layers. The size of a feature map is given by the following formula:

O = (W − K + 2P)/S + 1 (3.1)

where O is the feature map height/width, W the input image height/width, K the filter size, P the padding, and S the stride. The padding and the stride are called hyperparameters and will be explained further in a later section.
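For instance, equation 3.1 can be checked with a small helper; the example sizes below are hypothetical:

```python
# Feature map height/width from equation 3.1 (integer division assumes
# the chosen hyperparameters divide evenly).
def feature_map_size(w, k, p, s):
    return (w - k + 2 * p) // s + 1

# e.g. a 32x32 input, 5x5 filter, no padding, stride 1 -> a 28x28 map
print(feature_map_size(32, 5, 0, 1))   # 28
```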

Second, third and Nth Convolutional Layers:

We have already seen that the first convolution layer takes the original image as input and detects low level features such as multi-directional edges and curves.

The reason behind having more than one convolution layer is to enable the network to detect mid level and high level features in addition to the low level ones, and thus to ensure the hierarchical basis on which feature learning models are built.

Figure 3.10: Inputs and outputs of a mid-placed convolution layer

The convolutional layers that come next do not take images as input, as was the case for the first layer studied. This time, the input is the output hidden layer of the previous stage, i.e. the feature map introduced earlier, as shown in figure 3.10. This fact slightly changes the notions explained before.

Actually, those layers placed deeper in the network have, similarly to the first layer, feature maps as output. The process of convolving the input with a filter does not change either; we simply substitute the input image with the output feature map of the previous layer. However, the most significant difference to take into consideration is that the output layer will not be a map for the filter used in the convolution alone. It will instead be a map for the descriptor given by the combination of the feature corresponding to the filter used in the current convolution and the feature corresponding to the filter of the input feature map.

That is exactly how features with a higher level of abstraction are built. We thus end up with middle-placed convolution layers detecting mid level features (figure 3.11 (a)) and last convolution layers enabling the network to extract high level features such as faces or objects (figure 3.11 (b)). A sketch of this multi-channel convolution follows figure 3.11.

Figure 3.11: (a) Mid level features. (b) High level features
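A sketch of the deeper-layer convolution, where each filter now spans the full depth of the incoming stack of feature maps, might look as follows; stride 1, no padding, and the shapes are hypothetical:

```python
import numpy as np

# Deeper layers convolve a stack of feature maps: the filter has shape
# (d, k, k) and the sum also runs over the depth axis.
def convolve_volume(feature_maps, filt):
    d, ih, iw = feature_maps.shape
    k = filt.shape[1]                                  # filt: (d, k, k)
    out = np.zeros((ih - k + 1, iw - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            field = feature_maps[:, i:i + k, j:j + k]  # 3D receptive field
            out[i, j] = np.sum(field * filt)           # sum over depth too
    return out
```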

Pooling Layer:

In a CNN architecture it is common to regularly insert a Pooling layer between two successive convolutional layers. Its role is to progressively diminish the spatial dimensions of the representation and thus decrease the number of weights to learn and the computational cost of learning them. The pooling layer acts separately on each depth slice of the input volume and reduces its spatial size, most commonly using a max pooling operation.

Its most common form uses 2×2 filters combined with a stride (step) of two. This down-samples each slice of the input volume by two along the height and the width, hence discarding 75% of the activation values, while the depth dimension remains unaffected. More precisely, the max pooling method consists in acting on each 2×2-sized region of the input and keeping only the maximum of the four values.

In addition to the max pooling operation, other ways of sub-sampling the volume slices exist, such as L2-norm pooling and average pooling. The latter was frequently used historically, yet it has lately fallen out of favor compared to the max pooling function, which has been shown to perform better in practice.

The example on the left of figure 3.12 shows an input volume of size 224×224×64 that is down-sampled with a filter of size 2×2. This operation results in an output volume of size 112×112×64, with the depth preserved.

The working process of the max pooling function, which consists in keeping only the maximum value within each region of 4 pixels, is illustrated in the right example of figure 3.12; a small sketch follows the figure.

Figure 3.12: Pooling operation model
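A minimal sketch of 2×2 max pooling with stride 2 on one depth slice, assuming even height and width:

```python
import numpy as np

# 2x2 max pooling with stride 2 on one depth slice (even H and W assumed):
# each non-overlapping 2x2 block is reduced to its maximum value.
def max_pool_2x2(x):
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

slice_ = np.array([[1, 3, 2, 1],
                   [4, 6, 5, 7],
                   [8, 9, 3, 0],
                   [1, 2, 4, 5]])
print(max_pool_2x2(slice_))
# [[6 7]
#  [9 5]]
```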

Many practitioners dislike the pooling function and believe it can be dispensed with. Instead, in order to decrease the size of the representations, they choose to use convolutional layers with larger stride values. Giving up pooling layers has also been found to be important in training well-performing generative models, such as generative adversarial networks (GANs) [42] or variational auto-encoders (VAEs) [43].

Fully Connected layer :

After a series of convolutions and pooling operations, we can consider that high level feature extraction has been successfully achieved.

We now have to add the layers responsible for interpreting those descriptors and turning them into a classification tool. The fully connected layers are attached at the end of a CNN, as shown in figure 3.13, in order to perform this task.

FC layers evidently take as input a volume of neurons, which is the output of the preceding pooling or convolutional layer. They return as output a vector of N dimensions, where N is the number of final classes to recognize. Each value in this vector indicates the probability that the input sample belongs to the corresponding output class.

Actually, the FC layer observes the output feature maps of the previous layers, decides which features correspond to a specific class, and assigns its parameters (weights) accordingly. Thus, applying a dot product between those weights and the previous feature maps yields the scores of the distinct classes; a small sketch follows figure 3.13.

Figure 3.13: Fully connected set of layers attached to the convolution block
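The dot-product view of the FC layer can be sketched as follows; the 512 flattened features and 10 classes are hypothetical sizes:

```python
import numpy as np

# FC layer as a dot product between learned weights and the flattened
# feature maps (512 features and 10 classes are hypothetical sizes).
rng = np.random.default_rng(0)
features = rng.standard_normal(512)       # flattened output of the last layer
weights = rng.standard_normal((10, 512))  # one weight row per class
bias = np.zeros(10)

scores = weights @ features + bias        # one raw score per class
print(scores.shape)                       # (10,)
```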

SoftMax layer:

Another way to produce the output classification result is to use the SoftMax family of layers. In fact, classification tasks can profit from the fact that their classes are mutually exclusive, a property that can be integrated directly into the neural network architecture.

An ideal representation of the prediction would be obtained if the output layer function assigned 1 to a unique output node (the appropriate class) and zero to the rest. The most straightforward mechanism of this kind is the max layer, which assigns a probability of 1 to the maximum value in the output of the previous layer and turns the rest of the nodes off to zero. Yet this max function is not differentiable, which would be a problem later during the training process.

If we instead employ a Softmax layer as output, it behaves nearly like a max function while being free of this problem, since it is a differentiable function. Figure 3.14 shows the structure of a softmax layer with three outputs.
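A minimal, numerically stable softmax for three classes might look as follows; the logit values are hypothetical:

```python
import numpy as np

# Numerically stable softmax: subtracting the max does not change the
# result but avoids overflow in exp.
def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # hypothetical raw outputs of the FC layer
print(softmax(logits))               # approx [0.659, 0.242, 0.099]
```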

Figure 3.14: Softmax layer architecture

3.2.1.3 Over-fitting problem

As detailed in the first chapter and in section 2.2.1.2, deep neural networks such as CNNs are built as structures involving several non-linear hidden layers, which is what makes them powerful and expressive in learning good representations of the highly complicated relations linking the input to the output data. However, when they are used on a problem for which not enough training data is available, many of those relations will result from sampling noise: they will be particular to the training data and inappropriate for generalization to other cases.

This problem is common and serious in network training, and it is called over-fitting. Figure 3.15 shows the difference between a well trained model and an over-fitted network.

Figure 3.15: Well tuned vs Over-fitted model

The pooling layer, with its sub-sampling task, reduces the over-fitting risk, but another type of layer, called dropout and shown in figure 3.16, can also be attached to CNNs to deal with it. Dropout layers help when the weights are so tuned to the training samples that the network cannot perform well once new test examples are introduced.


The idea behind this layer is simply to drop out a random set of neurons by setting them to zero during training. This simple idea pushes the model to be redundant and thus to give the right prediction even if some weights are given up; a small sketch follows figure 3.16.

Figure 3.16: Dropout layer structure
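A sketch of inverted dropout at training time, where each neuron is kept with probability p and the survivors are rescaled; the activation values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Inverted dropout at training time: keep each neuron with probability p
# and rescale so the expected activation is unchanged at test time.
def dropout(activations, p=0.5):
    mask = rng.random(activations.shape) < p   # randomly turn neurons off
    return activations * mask / p

a = np.array([0.2, 1.5, 0.7, 0.9])
print(dropout(a))
```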
