
In their most general form, neural networks can be interpreted as computational graphs composed of primitive operations. Computational graphs allow for a richer set of primitive operations than those described here, but we will restrict ourselves to the most commonly used and most successful ones.

1.2.1 Matrix vector multiplication

Matrix vector multiplication is probably the most widely used deep learning computational primitive. It is a linear operation (it has no nonlinear effect) and is used as a component in all of the most successful neural network architectures, including those used in this thesis: multilayer perceptrons, convolutional neural networks and recurrent neural networks.

In the most common setting, the vector $x \in \mathbb{R}^n$ represents information previously processed by the neural network and/or unprocessed (input) information. The matrix $W \in \mathbb{R}^{m \times n}$ is dense and all of its entries are modifiable (trainable); see Subsection 1.5. The result of this operation is the vector $y \in \mathbb{R}^m$:

$$y = Wx \qquad (1.1)$$

Usually, a vector of biases $b \in \mathbb{R}^m$ is added to the matrix vector multiplication result, so that the previous equation becomes:

$$y = Wx + b \qquad (1.2)$$

We can obtain the same result by appending the value 1 to the end of the column vector $x$ and appending $b$ as an extra column of $W$, so that we can keep the form of Eq. 1.1.
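The following minimal numpy sketch (purely illustrative, not part of any architecture described here) checks the equivalence between Eq. 1.2 and the augmented form of Eq. 1.1:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.normal(size=(m, n))   # trainable weight matrix, W in R^{m x n}
b = rng.normal(size=m)        # trainable bias vector, b in R^m
x = rng.normal(size=n)        # input (or previously processed) vector, x in R^n

# Eq. (1.2): y = W x + b
y = W @ x + b

# Equivalent augmented form recovering Eq. (1.1):
# append 1 to x and append b as an extra column of W.
W_aug = np.hstack([W, b[:, None]])   # shape (m, n + 1)
x_aug = np.append(x, 1.0)            # shape (n + 1,)
y_aug = W_aug @ x_aug

assert np.allclose(y, y_aug)
```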

In the next subsections we will introduce various primitives which can be interpreted as matrix vector multiplications, where the matrices are factorized in different manners.

1.2.1.1 Element-wise multiplication

Element-wise multiplication can be interpreted as matrix vector multiplication where the matrix is diagonal. Here again, the vector $x \in \mathbb{R}^n$ represents information previously processed by the neural network and/or unprocessed (input) information, while the diagonal matrix $D \in \mathbb{R}^{n \times n}$ contains the trainable parameters. Reshaping the diagonal of $D$ into the vector $d \in \mathbb{R}^n$, we can write:

$$y = Dx = d \odot x \qquad (1.3)$$

where $\odot$ denotes element-wise multiplication.

Biases can be added analogously to the previous subsection.
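As an illustrative numpy sketch, the equivalence between multiplication by a diagonal matrix and element-wise multiplication (Eq. 1.3) can be checked as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
d = rng.normal(size=n)   # trainable parameters (diagonal of D)
x = rng.normal(size=n)   # input vector

# Eq. (1.3): multiplying by the diagonal matrix D is the same as
# element-wise multiplication by its diagonal d.
D = np.diag(d)
assert np.allclose(D @ x, d * x)

# Only O(n) parameters and O(n) operations are needed,
# versus O(n^2) for a dense matrix.
```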

Element-wise multiplication is notably used in state-of-the-art Long Short-Term Memory (LSTM; see Section 1.3.3.2) recurrent neural networks. The convolution operation, the computational primitive we describe next, can also be factorized into Discrete Fourier Transforms and element-wise multiplications, but we will not go into more detail here (see [9]).

1.2.1.2 Convolution

The convolution operation is widely used in state-of-the-art neural network architectures, particularly Convolutional Neural Networks (CNNs). We will only describe discrete convolution applied to functions with finite support (vectors in the 1D case and images in the 2D case), as these are the most relevant for neural networks.

Discrete convolution can be reformulated as matrix vector multiplication [8].

We will consider two vectors $f$ and $g$ with indices in $\{0, \dots, N-1\}$. The result of their 1D convolution can be written as:

$$o[n] = f[n] \circledast g[n] = \sum_{u=0}^{N-1} f[n-u] \, g[u] \qquad (1.4)$$

where $\circledast$ denotes the convolution operator.
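As an illustrative sketch, Eq. 1.4 can be implemented directly, assuming (as is usual for finite support) that indices outside $\{0, \dots, N-1\}$ correspond to zero values; under this assumption it matches the first $N$ outputs of numpy's linear convolution:

```python
import numpy as np

def conv1d(f, g):
    """Direct implementation of Eq. (1.4), assuming f and g both have
    length N and are zero outside the index range {0, ..., N-1}."""
    N = len(f)
    o = np.zeros(N)
    for n in range(N):
        for u in range(N):
            if 0 <= n - u < N:
                o[n] += f[n - u] * g[u]
    return o

rng = np.random.default_rng(0)
f = rng.normal(size=8)
g = rng.normal(size=8)

# The first N outputs of numpy's full linear convolution coincide
# with this zero-padded reading of Eq. (1.4).
assert np.allclose(conv1d(f, g), np.convolve(f, g)[:len(f)])
```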

This can be extended to 2D arrays as follows:

$$o[m, n] = f[m, n] \circledast g[m, n] = \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} f[m-u, n-v] \, g[u, v] \qquad (1.5)$$
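Similarly, an illustrative direct implementation of Eq. 1.5, under the same zero-padding assumption, matches the top-left $M \times N$ block of scipy's full 2D linear convolution:

```python
import numpy as np
from scipy.signal import convolve2d

def conv2d(f, g):
    """Direct implementation of Eq. (1.5) for M x N arrays, assuming both
    arrays are zero outside their index ranges."""
    M, N = f.shape
    o = np.zeros((M, N))
    for m in range(M):
        for n in range(N):
            for u in range(M):
                for v in range(N):
                    if 0 <= m - u < M and 0 <= n - v < N:
                        o[m, n] += f[m - u, n - v] * g[u, v]
    return o

rng = np.random.default_rng(0)
f = rng.normal(size=(4, 5))
g = rng.normal(size=(4, 5))

# Matches the top-left M x N block of scipy's full 2D linear convolution.
assert np.allclose(conv2d(f, g), convolve2d(f, g)[:4, :5])
```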

We will describe the intuition behind the convolution operation when we discuss Convolutional Neural Networks in Section 1.3.2.

1.2.1.3 Factorized matrix vector multiplication

The previously introduced matrix vector multiplication is the most popular linear operation in neural networks, but it can be quite expensive in computation time and memory. For a vector of size $N$ and a matrix of size $N \times N$, both the computational and memory complexities are $O(N^2)$. In this subsection we briefly present several methods which factorize the matrix in the matrix vector multiplication, in order to reduce the computational and/or memory complexity. We introduce a similar approach, which factorizes the matrix (and uses quantum computation), in Chapter 6. For this reason, we focus here on some of the methods most similar and relevant to our own.

[147] replaced the matrix $W$ in fully-connected layers with the matrix product $A C D C^{-1}$, with $A$ and $D$ diagonal matrices, $C$ the discrete cosine transform and $C^{-1}$ the inverse discrete cosine transform, reducing the computational complexity to $O(N \log N)$ and the number of trainable parameters to $O(N)$, while maintaining comparable statistical performance for the task of object recognition on the ImageNet dataset.
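As a purely illustrative numpy sketch (the exact normalization and composition details of [147] may differ), such a structured layer can be applied with two fast transforms and only $2N$ trainable diagonal entries:

```python
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(0)
N = 8
a = rng.normal(size=N)   # diagonal of A (trainable)
d = rng.normal(size=N)   # diagonal of D (trainable)
x = rng.normal(size=N)

# Structured layer y = A C D C^{-1} x, with C the DCT and C^{-1} its inverse.
# Only the 2N diagonal entries are trainable, and the cost is dominated by
# the two transforms, i.e. O(N log N). (Normalization and composition
# details may differ from the exact layer in [147].)
y = a * dct(d * idct(x, norm='ortho'), norm='ortho')
```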

[38] proposed a similar factorization, with $O(N \log N)$ computational complexity and $O(N)$ trainable parameters, for the hidden-to-hidden transform of a recurrent neural network (see Section 1.3.3.1). The resulting transform is the product of multiple unitary matrices, some of which represent the Discrete Fourier Transform and the Inverse Discrete Fourier Transform. This RNN parameterization obtained state-of-the-art results, at the time of its proposal, on several long-term dependency tasks.
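The following simplified numpy sketch illustrates the idea (the actual parameterization in [38] composes additional unitary factors, such as permutations and reflections, which are omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
x = rng.normal(size=N) + 1j * rng.normal(size=N)   # complex hidden state

# Diagonal unitary matrices parameterized by phases (O(N) parameters each).
theta1, theta2 = rng.uniform(0, 2 * np.pi, size=(2, N))
d1, d2 = np.exp(1j * theta1), np.exp(1j * theta2)

# A product of diagonal matrices and (inverse) Fourier transforms, applied
# in O(N log N) without ever forming an N x N matrix.
y = d2 * np.fft.ifft(d1 * np.fft.fft(x, norm='ortho'), norm='ortho')

# The composed transform is unitary: it preserves the norm of x.
assert np.isclose(np.linalg.norm(y), np.linalg.norm(x))
```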

Our proposal in Chapter 6 also decomposes the matrix $W$ into a product of multiple unitary matrices, some of which represent Fourier transforms, but potentially reduces the computational and memory complexities even further, due to the use of quantum computation.

[121] introduced a Kronecker parameterization of the matrix implementing the hidden-to-hidden transform of a recurrent neural network (see Section 1.3.3.1), showing that, at least for certain tasks, the number of parameters in the hidden-to-hidden part of the RNN can be drastically reduced, from $O(N^2)$ to $O(N \log N)$, while the computational complexity is also reduced. In Chapter 6, we propose a neural network architecture which makes use of quantum computation and, using a similar Kronecker matrix factorization, can dramatically reduce not only the number of trainable parameters but also the computational complexity (to $O(\log(N)^2)$), while maintaining comparable performance.
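As an illustrative numpy sketch of the underlying idea, a Kronecker-factored matrix vector product can be computed without ever materializing the full matrix (only two factors are used here, for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 4
A = rng.normal(size=(p, p))
B = rng.normal(size=(q, q))
x = rng.normal(size=p * q)

# Naive approach: form the full (p*q) x (p*q) matrix W = A kron B.
W = np.kron(A, B)
y_dense = W @ x

# Kronecker trick: never materialize W. Reshape x to a p x q matrix,
# multiply by the two small factors, and flatten back. The parameter
# count drops from (p*q)^2 to p^2 + q^2.
X = x.reshape(p, q)
y_factored = (A @ X @ B.T).reshape(-1)

assert np.allclose(y_dense, y_factored)
```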

1.2.2 Nonlinear activation functions

All previously described computational primitives are linear. Nonlinear operations are also necessary: otherwise, a machine learning system containing only linear operations would not be expressive enough, no matter how many linear operations were composed.

Intuitively, no matter how many linear operations are composed, the entire system is no more powerful than a simple linear regression. On the other hand, even the composition with a single nonlinear operation makes neural networks universal approximators of continuous functions [73].

In neural networks, nonlinearity is introduced using the concept of an activation function, which is applied element-wise to the input.

Historically, the most popular activation function used to be the sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (1.6)$$

Another activation function with a long history is the tanh function:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (1.7)$$

Currently, one of the most successful activation functions is the Rectified linear unit (ReLU) [86]:

$$\mathrm{ReLU}(x) = \begin{cases} x, & \text{if } x > 0 \\ 0, & \text{otherwise} \end{cases} \qquad (1.8)$$

The interest of ReLU activation functions is that they suffer less from vanishing gradients, because they saturate less than the logistic (sigmoid) function.
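As an illustrative numpy sketch, Eqs. 1.6-1.8 can be implemented as simple element-wise functions:

```python
import numpy as np

def sigmoid(x):
    # Eq. (1.6)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Eq. (1.8)
    return np.maximum(x, 0.0)

x = np.linspace(-5.0, 5.0, 11)

# All activation functions are applied element-wise.
print(sigmoid(x))
print(np.tanh(x))   # Eq. (1.7)
print(relu(x))

# Saturation: the sigmoid's derivative sigma(x) * (1 - sigma(x)) vanishes
# for large |x|, whereas the ReLU's derivative is exactly 1 for all x > 0.
```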

Leaky rectified linear units (LReLUs) [140] have been found by some authors [140, 200] to either match or surpass ReLUs in performance:

$$\mathrm{LReLU}(x) = \begin{cases} x, & \text{if } x > 0 \\ ax, & \text{otherwise} \end{cases} \qquad (1.9)$$

where $a$ is a fixed scaling factor. Parametrized Rectified Linear Units (PReLU) [105] are another rectified activation function, described by the same equation as the LReLU.

The difference, though, is that $a$ is a trainable parameter (through gradient-based optimization). A different $a$ can be chosen for each neuron, or the $a$ values can be 'tied' so that several neurons (see the next section for an introduction to the concept of artificial neuron) share the same value. This helps reduce the number of trainable parameters and can thus prevent overfitting.
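An illustrative numpy sketch of Eq. 1.9 follows (the default slope value is arbitrary, not a value prescribed by [140] or [105]):

```python
import numpy as np

def leaky_relu(x, a=0.01):
    """Eq. (1.9). For LReLU, a is a fixed hyperparameter; for PReLU, a would
    instead be a trainable parameter updated by gradient descent (possibly
    one a per neuron, or 'tied' across several neurons).
    a = 0.01 is an arbitrary illustrative default."""
    return np.where(x > 0, x, a * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))          # fixed slope, as in LReLU
print(leaky_relu(x, a=0.25))  # a different slope, as a PReLU might learn
```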

The softmax nonlinearity is commonly used in the setting of supervised learning with discrete labels (classes). In this setting, it is most often combined with the cross-entropy loss function (see Subsection 1.5.1.1). Given $K$ distinct classes, the softmax takes as input a vector $x$ of $K$ real values (this vector can be obtained from a vector of inputs or previously processed values through e.g. matrix vector multiplication).

The softmax then processes these $K$ values to provide the probabilities corresponding to the $K$ classes. For each class $k \in \{1, \dots, K\}$, its corresponding probability is:

$$p(k) = \frac{e^{x_k}}{\sum_{j=1}^{K} e^{x_j}} \qquad (1.10)$$
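As an illustrative numpy sketch, Eq. 1.10 can be implemented with the usual max-subtraction trick for numerical stability:

```python
import numpy as np

def softmax(x):
    # Eq. (1.10), with max-subtraction for numerical stability
    # (the result is unchanged, since the shift cancels out).
    z = x - np.max(x)
    e = np.exp(z)
    return e / np.sum(e)

scores = np.array([2.0, 1.0, 0.1])   # K = 3 pre-activation values
p = softmax(scores)
print(p)                             # probabilities for the K classes
assert np.isclose(p.sum(), 1.0)
```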