
3.5 The Generalized Eigenvalue Problems in Machine Learning

3.5.4 Gaussian Information Bottleneck Method - GIBM

The Information Bottleneck Method (IBM) is a general and powerful model of learning that has its foundation in information theory. The IBM was originally proposed as a way to extract the relevant information from a random variable X with respect to another random variable Y, while compressing X as much as possible. With such a broad formulation, the IBM can be applied to both supervised and unsupervised problems (for the unsupervised case, Y can simply be thought of as a hidden variable). The compressed representation that we extract from X can be thought of as another random variable, which we call T. It can be shown that T is an approximation of a minimal sufficient statistic of X with respect to Y.

Definition 34 (Sufficient Statistics [61, 77]). Let Y ∈ Y be an unknown parameter and X ∈ X be a random variable with conditional probability distribution function p(x|y).

Given a function f : X → S, the random variable S = f(X) is called a sufficient statistic for Y if:

∀x ∈ X, y ∈ Y : P(X = x | Y = y, S = s) = P(X = x | S = s). (3.24)

The meaning of Definition 34 is that S captures all the information about Y that is available in X. As a consequence, we can state the following theorem:

Theorem 35. Let S be a probabilistic function of X. Then S is a sufficient statistic for Y if and only if:

I(S; Y) = I(X; Y)

Here, I(X; Y) denotes the mutual information between the random variables X and Y. It is simple to note that, according to this definition, the identity function makes the random variable X a sufficient statistic for Y. To make something useful out of this definition, we should introduce a way of compressing the information contained in X. For this, the definition of a minimal sufficient statistic comes to our aid.
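For concreteness, here is a standard textbook illustration of Definition 34 and Theorem 35 (a classical example, not taken from the thesis), for a Bernoulli sample with unknown bias:

```latex
% Standard textbook example (not from the thesis). Let Y = \theta be the
% unknown bias of a coin and X = (X_1,\dots,X_n) be i.i.d. Bernoulli(\theta).
% The count S = f(X) = \sum_{i=1}^n X_i is a sufficient statistic, since
\[
  P(X = x \mid Y = \theta, S = s)
  = \frac{\theta^{s}(1-\theta)^{n-s}}{\binom{n}{s}\,\theta^{s}(1-\theta)^{n-s}}
  = \binom{n}{s}^{-1}
  = P(X = x \mid S = s),
\]
% which does not depend on \theta. By Theorem 35 this is equivalent to
% I(S;Y) = I(X;Y): the n-bit sample is compressed to a single count
% without losing any information about Y.
```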

Definition 36 (Minimal Sufficient Statistic). A sufficient statistic S is said to be minimal if it is a function of all other sufficient statistics, i.e.:

∀T ; T is a sufficient statistic ⇒ ∃g ; S = g(T). (3.25)

The relation between minimal sufficient statistics and information theory is stated in the following theorem:

Theorem 37 ([77]). Let X be a sample drawn from a distribution function that is determined by the random variable Y. The statistic S is a minimal sufficient statistic (MSS) for Y if and only if it is a solution of the optimization problem

min_{T : T is a sufficient statistic} I(X; T). (3.26)

Using Theorem 35, we can restate the previous optimization problem as

min_{T : I(T;Y) = I(X;Y)} I(X; T). (3.27)

The goal of the IBM is to find a compressed representation of X that preserves as much information as possible about the random variable Y. In this sense, T is a minimal sufficient statistic of X with respect to Y. The trade-off is obtained using a Lagrange multiplier, which we call β. Note that we are interested in the stochastic maps from X to T and from T to Y. Using Theorem 35, we can restate the previous optimization problem as:

Definition 38 (Information Bottleneck).

min_{p(x̃|x)} I(X̃; X) − β I(X̃; Y) (3.28)

The first term is meant to measure the compression, and the second term is meant to measure the relevance of the sufficient statistic T with respect to Y. In the IBM, T is meant to be a representation of X in a “meaningful semantic space”. The Lagrange multiplier β ∈ [0, ∞) decides the trade-off between relevance and compression. Surprisingly enough, the parameter β allows one to train models that are explicit with respect to the bias/variance trade-off. This optimization problem is closely related to Shannon's rate-distortion theory, which describes the trade-off between the compression rate and the distortion it introduces. For further information, we refer to [146, 134].

Far from being just a theoretical tool to analyze learning, the IBM has been used to solve a variety of problems in machine learning [84, 135]. Depending on the problem under consideration, the IB can be computed via a plethora of different algorithms [134]. There is one case where the solution of the IB can be computed analytically: under the assumption that the random variables X, Y are jointly multivariate Gaussian. Let the cross-covariance matrices be defined as usual, i.e. Σ_XY := X^T Y and Σ_YX := Y^T X, where we assume that the matrices X and Y are scaled to have features with zero mean and unit variance. We also define the conditional covariance, or canonical correlation matrix, as Σ_X|Y = Σ_X − Σ_XY Σ_Y^{-1} Σ_YX. In the Gaussian Information Bottleneck, we need to access the left singular vectors of:

A = Σ_X|Y Σ_X^{-1}

The number of eigenvectors to extract is specified by the trade-off parameter β. It is straightforward to see that GIB and CCA are two ways of looking at the same problem. Recall that the GEP formulation of CCA in Equation 3.22 is:

Σ_XY Σ_YY^{-1} Σ_YX w_x = λ² Σ_XX w_x (3.29)

The matrix A can be rewritten as I − Σ_XY Σ_Y^{-1} Σ_YX Σ_XX^{-1}. Adding or removing the identity matrix only shifts the spectrum, so we can consider the left singular vectors of Σ_XY Σ_Y^{-1} Σ_YX Σ_XX^{-1}. Taking the transpose of this matrix, we see that it is exactly the matrix of the GEP in Equation 3.29.
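To make the correspondence concrete, the following NumPy sketch builds the GIB matrix A = Σ_X|Y Σ_X^{-1} from synthetic, standardized data and checks that its eigenvalues are exactly 1 − λ², where λ² are the generalized eigenvalues of the CCA problem in Equation 3.29. The toy data and variable names are illustrative choices of ours, not part of the thesis.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: n samples with d_x and d_y features, correlated by construction.
n, d_x, d_y = 2000, 5, 3
Z = rng.standard_normal((n, d_y))
X = np.hstack([Z + 0.5 * rng.standard_normal((n, d_y)),
               rng.standard_normal((n, d_x - d_y))])
Y = Z + 0.5 * rng.standard_normal((n, d_y))

# Scale the features to zero mean and unit variance, as assumed in the text.
X = (X - X.mean(axis=0)) / X.std(axis=0)
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)

S_xx = X.T @ X / n
S_yy = Y.T @ Y / n
S_xy = X.T @ Y / n
S_yx = S_xy.T

# Conditional covariance and the GIB matrix A = Sigma_{X|Y} Sigma_X^{-1}.
S_x_given_y = S_xx - S_xy @ np.linalg.solve(S_yy, S_yx)
A = S_x_given_y @ np.linalg.inv(S_xx)

# In GIB we access the left singular vectors of A.
U, sigma, Vt = np.linalg.svd(A)

# CCA as a GEP (Equation 3.29): Sigma_XY Sigma_YY^{-1} Sigma_YX w = lambda^2 Sigma_XX w,
# i.e. the eigenvalues of Sigma_XX^{-1} Sigma_XY Sigma_YY^{-1} Sigma_YX.
M = S_xy @ np.linalg.solve(S_yy, S_yx)
lam2 = np.linalg.eigvals(np.linalg.solve(S_xx, M)).real

# Spectrum-shift check: the eigenvalues of A are exactly 1 - lambda^2.
print(np.sort(np.linalg.eigvals(A).real))
print(np.sort(1.0 - lam2))
```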

Part I

Quantum Algorithms for Machine Learning


Chapter 4

Quantum slow feature analysis and classification

4.1 Introduction to Slow Feature Analysis

Slow Feature Analysis (SFA) is a dimensionality reduction technique proposed in the context of computational neurosciences as a way to model part of the visual cortex of humans.

In the last decades, it has been applied in various areas of machine learning. In this chapter we propose a quantum algorithm for slow feature analysis, and detail its application to performing dimensionality reduction on a real dataset. We also simulate the random error that the quantum algorithms might incur. We show that, despite the error caused by the algorithm, the estimate of the model that we obtain is good enough to reach high accuracy on a standard dataset widely used as a benchmark in machine learning. Before providing more details on this result, we give a brief description of dimensionality reduction and introduce the model of slow feature analysis in this context.

SFA has been shown to model a kind of neuron (called a complex cell) situated in the cortical layer in the primary visual cortex (called V1) [22]. SFA can be used in machine learning as a DR algorithm, and it has been successfully applied to enhance the performance of classifiers [156, 23]. SFA was originally proposed as an online, nonlinear, and unsupervised algorithm [152]. Its task was to learn slowly varying features from generic input signals that vary rapidly over time [23, 152]. SFA has been motivated by the temporal slowness principle, which postulates that while the primary sensory receptors (like the retinal receptors in an animal's eye) are sensitive to very small changes in the environment and thus vary on a very fast time scale, the internal representation of the environment in the brain varies on a much slower time scale. The slowness principle is a hypothesis for the functional organization of the visual cortex and possibly other sensory areas of the brain [153], and it has been introduced as a way to model the transformation invariance in natural image sequences [156]. SFA is an algorithm that formalizes the slowness principle as a nonlinear optimization problem. In [28, 138], SFA has been used to do nonlinear blind source separation. Although SFA has been developed in the context of computational neurosciences, there have been many applications of the algorithm to solve ML-related tasks. A prominent advantage of SFA compared to other algorithms is that it is almost hyperparameter-free. The only parameters to choose are in the preprocessing of the data, e.g. the initial PCA dimension and the nonlinear expansion, which consists of a choice of a polynomial of (usually low) degree p. Another advantage is that it is guaranteed to find the optimal solution within the considered function space [55]. For a detailed description of the algorithm, we suggest [137]. With appropriate preprocessing, SFA can be used in conjunction with a supervised algorithm to acquire classification capabilities. For instance


it has been used for pattern recognition to classify images of digits in the famous MNIST database [23]. SFA can be adapted to solve complex tasks in supervised learning, like face and human action recognition [75, 156, 142].

We can use SFA for classification in the following way. One can think of the training set as a set of vectors x_i ∈ R^d. Each x_i belongs to one of K different classes, and a class T_k has |T_k| vectors in it. The goal is to learn K − 1 functions g_j(x_i), j ∈ [K − 1], such that the output y_i = [g_1(x_i), ..., g_{K−1}(x_i)] is very similar for training samples of the same class and largely different for samples of different classes. Once these functions are learned, they are used to map the training set into a low-dimensional vector space. When a new data point arrives, it is mapped to the same vector space, where classification can be done with higher accuracy. SFA projects the data points onto the subspace spanned by the eigenvectors associated with the k smallest eigenvalues of the derivative covariance matrix of the data, which we define in the next section.
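The following is a minimal classical sketch of this pipeline in NumPy. It is not the quantum algorithm of this chapter, and since the derivative covariance matrix is only defined in the next section, the within-class finite-difference construction used here (as well as the quadratic polynomial expansion) should be read as standard SFA choices assumed for illustration.

```python
import numpy as np

def sfa_for_classification(X, labels, k, expand=True):
    """Minimal classical SFA sketch for classification.

    Assumptions (standard SFA choices, not taken verbatim from the thesis):
    a quadratic polynomial expansion, and a derivative covariance built from
    finite differences between consecutive samples of the same class.

    X      : (n, d) matrix with one training vector per row.
    labels : (n,) array of class labels.
    k      : number of slow features to keep (typically K - 1).
    """
    # 1. Nonlinear expansion: append all degree-2 monomials x_i * x_j.
    if expand:
        outer = np.einsum('ni,nj->nij', X, X)
        iu = np.triu_indices(X.shape[1])
        X = np.hstack([X, outer[:, iu[0], iu[1]]])

    # 2. Whitening: zero mean and (approximately) identity covariance.
    X = X - X.mean(axis=0)
    cov = X.T @ X / X.shape[0]
    evals, evecs = np.linalg.eigh(cov)
    evals = np.maximum(evals, 1e-12)          # guard against tiny eigenvalues
    W = evecs / np.sqrt(evals)                # whitening matrix
    Z = X @ W

    # 3. Derivative covariance from within-class differences.
    diffs = [np.diff(Z[labels == c], axis=0) for c in np.unique(labels)]
    D = np.vstack(diffs)
    dcov = D.T @ D / D.shape[0]

    # 4. Eigenvectors of the k smallest eigenvalues of the derivative covariance.
    _, evecs_d = np.linalg.eigh(dcov)         # eigh sorts eigenvalues ascending
    slow = evecs_d[:, :k]

    # 5. The learned map g(x): whiten, then project onto the slow directions.
    return Z @ slow, W, slow
```

On MNIST-like data one would typically first reduce the input dimension with PCA, call this routine with k = K − 1, and then classify the projected points with a simple supervised method.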

4.2 The computational problem and the SFA
