
less restrictive alternative to positive definite matrices), and avoid trivial solutions like repeated rows. They then proposed a conjugate gradient-based algorithm and showed that it outperforms K-SVD-based image denoising while being much faster.

This line of work was later extended, e.g., in [19, 20] by proposing a closed-form solution (after slightly changing the objective function), and in [21] by extending the concept to over-complete transforms.

2.2.3 Synthesis versus analysis priors

Now that we have reviewed the two basic prior models, a natural question arises as to how these models compare. While they have been shown to be related, and even equivalent under some (restrictive) conditions, there is no clear answer to this question in general terms.

A useful work shedding light on the relation between analysis and synthesis priors is [22], although it leaves the general answer open.

So which of these two basic prior models should we finally choose? The choice is largely subjective and should be made depending on the setup and the application at hand.

In this thesis, we build on top of both of these models and adapt them to our general objective of having compressed and discrete representations. While we do not give a decisive answer to this question, we point out that the analysis model is more compatible with neural network architectures, for which a large body of practical knowledge has been built over the last several years. Therefore, when it comes to building composite models out of basic ones, we tend to favor analysis models, since they can benefit from back-propagation.7

2.3 Composite models

Not all efforts to address our problems of interest follow the structures we have seen in the previous section. In fact, some of the state-of-the-art results reported during the last several years deviate from the category of basic synthesis or analysis models. These models can be considered composite structures made up of basic ones, for which the inference procedure is more involved.

As discussed earlier in section 2.1.2.1, it is expected that more complex priors lift the limitations of basic ones by providing a richer space of parameters. On the other hand, their required sample complexity is not minimal; moreover, their analysis and interpretation are not straightforward.

In section 2.3.1, we make an effort to understand these models by decomposing them into simpler parts. Then in section 2.3.2, without attempting to provide a structured review, we recount some highlights from the literature, in particular the application of complex models to solving inverse problems.

7On the other hand, one can argue that synthesis models are more compatible with the Expectation-Maximization framework and can benefit from EM-like solutions.

2.3.1 Decomposition of priors

Priors need not be limited to the basic forms we saw in the previous section. For example, the image distribution may be different from the Boltzmann-like distribution of Eq. 2.8, or may be composed in a different way, perhaps from some simpler elements.

This can consist of different levels of interaction with the data, conditioning on the results of previous stages, or conditioning on different parameter sets. While it is not straightforward to model the structure of such composite priors in general, let us next describe them in a very abstract way.

Denote the set of parameters of the model as θ ∈ Θ. For example, the analysis prior of Eq. 2.8 was parametrized by the projection A, while the synthesis model of Eq. 2.3 was parametrized by the dictionary C. So, for both models, we can decompose the prior symbolically as p(f; θ).8 One can think of going beyond this rigid prior and consider composite models. For example, we can consider an L-stage decomposition of the prior, where each stage is conditioned on the previous stages of prior modeling as:

$$p(\mathbf{f};\theta) = p\big(\mathbf{f}^{[0]};\theta^{[1]}\big)\, p\big(\mathbf{f}^{[1]} \,\big|\, \mathbf{f}^{[0]}, \theta^{[1]};\theta^{[2]}\big) \cdots p\big(\mathbf{f}^{[L-1]} \,\big|\, \mathbf{f}^{[L-2]}, \ldots, \mathbf{f}^{[0]}, \theta^{[L-1]}, \ldots, \theta^{[1]};\theta^{[L]}\big),$$

where f[0] = f is the given data, and f[1], ..., f[L−1] are the inputs to the second through the last stages of processing. These stages are parameterized by the parameter sets θ[1], ..., θ[L], respectively.

This is a very general and intricate decomposition of the prior that may lead to over-complicated structures. Simplifications can be imposed by relaxing the conditioning between the stages.

One possible simplification is to impose a sort of Markovian assumption. This leads to the decomposition of the prior as:

$$p(\mathbf{f};\theta) = p\big(\mathbf{f}^{[0]};\theta^{[1]}\big)\, p\big(\mathbf{f}^{[1]} \,\big|\, \mathbf{f}^{[0]};\theta^{[2]}\big) \cdots p\big(\mathbf{f}^{[L-1]} \,\big|\, \mathbf{f}^{[L-2]};\theta^{[L]}\big), \tag{2.13}$$

which can be realized in various ways, as we see next.

8When learning these parameters, these models are referred to as parametric models in the statistical machine learning literature, since they have a clear parameter set. This is in contrast with non-parametric models, whose parameter set can adapt to the data. However, we prefer not to use this terminology here and do not make an explicit distinction between the two, as the distinction can be vague in some cases.


Feed-forward neural networks

A prominent instance of the prior decomposition of Eq. 2.13 is the family of feed-forward neural networks. These networks are characterized as:

$$\begin{aligned} \mathbf{f}^{[1]} &= \sigma^{[1]}\big(\mathbf{A}^{[1]}\mathbf{f}^{[0]} + \mathbf{b}^{[1]}\big),\\ &\;\;\vdots\\ \mathbf{f}^{[L]} &= \sigma^{[L]}\big(\mathbf{A}^{[L]}\mathbf{f}^{[L-1]} + \mathbf{b}^{[L]}\big), \end{aligned} \tag{2.14}$$

where A[1], ..., A[L] are the projection matrices of the first to the last layers, b[1], ..., b[L] are bias vectors, and σ[1](·), ..., σ[L](·) are non-linear but differentiable functions applied to the respective projections.
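To make the notation of Eq. 2.14 concrete, the following is a minimal sketch of the forward pass of such a feed-forward network in Python/NumPy; the layer sizes, the ReLU non-linearity and the random initialization are illustrative assumptions rather than choices made in the text.

```python
import numpy as np

def relu(x):
    # One possible choice of non-linearity sigma(.)
    return np.maximum(x, 0.0)

def forward(f0, params):
    """Forward pass of Eq. 2.14: f[l] = sigma[l](A[l] f[l-1] + b[l])."""
    f = f0
    for A, b in params:
        f = relu(A @ f + b)
    return f

# Illustrative three-layer network with assumed sizes 64 -> 32 -> 16 -> 8.
rng = np.random.default_rng(0)
sizes = [64, 32, 16, 8]
params = [(0.1 * rng.standard_normal((m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

f0 = rng.standard_normal(sizes[0])   # the input f[0]
fL = forward(f0, params)             # the output f[L]
```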

These networks are trained by first forming a cost function composed of f[L] and, perhaps, a set of labels in supervised scenarios, or f[0] in unsupervised cases and in autoencoder networks, along with optional regularization consisting of norms on the projection matrices. The cost function is then minimized, usually using variants of gradient descent, by differentiating w.r.t. the parameters of the network, i.e., A[1], ..., A[L] and b[1], ..., b[L]. The differentiation is performed using the back-propagation technique, which is perhaps the most important element behind the success of neural networks. This is essentially the chain rule of multivariate calculus, which is applicable thanks to the structure of the prior in Eq. 2.13.
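As an illustration of this training procedure, the sketch below assumes an unsupervised (autoencoder-like) setting with a squared reconstruction cost, a two-layer network with a linear last layer, and plain gradient descent; the back-propagation step is simply the chain rule written out by hand.

```python
import numpy as np

rng = np.random.default_rng(1)
d, h = 32, 16                        # assumed input and hidden sizes
A1 = 0.1 * rng.standard_normal((h, d)); b1 = np.zeros(h)
A2 = 0.1 * rng.standard_normal((d, h)); b2 = np.zeros(d)

relu = lambda x: np.maximum(x, 0.0)
lr = 1e-2

for step in range(1000):
    f0 = rng.standard_normal(d)      # a training sample (stand-in for an image patch)

    # Forward pass (Eq. 2.14 with L = 2 and a linear last layer).
    z1 = A1 @ f0 + b1
    f1 = relu(z1)
    f2 = A2 @ f1 + b2

    # Reconstruction cost 0.5 * ||f[2] - f[0]||^2 (unsupervised case).
    err = f2 - f0

    # Back-propagation: the chain rule applied layer by layer.
    dA2 = np.outer(err, f1); db2 = err
    df1 = A2.T @ err
    dz1 = df1 * (z1 > 0)             # derivative of the ReLU
    dA1 = np.outer(dz1, f0); db1 = dz1

    # Gradient-descent update of the parameters A[l] and b[l].
    A1 -= lr * dA1; b1 -= lr * db1
    A2 -= lr * dA2; b2 -= lr * db2
```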

The configuration of the network is very flexible. Depending on the choice of the projection matrices and the non-linearities, the feed-forward structure can take different forms. For example, A[l] can be an unstructured matrix, which leads to the so-called perceptron layer, or a cyclic convolution matrix, which leads to a convolutional layer.
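The claim about cyclic convolution matrices can be checked directly: multiplying by a circulant matrix built from a filter is the same as circularly convolving with that filter. A small sketch (the filter length and the signal are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
h = np.zeros(n); h[:3] = rng.standard_normal(3)     # a short filter, zero-padded
f = rng.standard_normal(n)                          # an arbitrary signal

# Circulant matrix whose (i, j) entry is h[(i - j) mod n].
A = np.column_stack([np.roll(h, j) for j in range(n)])

conv_via_matrix = A @ f
conv_via_fft = np.real(np.fft.ifft(np.fft.fft(h) * np.fft.fft(f)))

assert np.allclose(conv_via_matrix, conv_via_fft)
```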

The non-linearities also play a crucial role in the network. Popular choices for σ(·) are the sigmoid, tanh, and ReLU functions.

Residual networks

As the number of layers increases, the training of feed-forward networks becomes increasingly involved. This is due to a set of factors leading to the vanishing (or exploding) gradient phenomenon. A remedy was proposed in [23] by introducing the so-called skip-connections which, along with the output of layer l, redirect the output of layer l−1 to the input of layer l+1 of the network. This is equivalent to relaxing the Markovian simplification of Eq. 2.14 to conditioning on the outputs of other layers, apart from the previous layer.
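The following is a minimal sketch of a layer with such a skip-connection, in the spirit of [23]; the dimensions and the ReLU non-linearity are illustrative assumptions.

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

def plain_layer(f_prev, A, b):
    # Markovian layer of Eq. 2.14: the output depends only on the previous output.
    return relu(A @ f_prev + b)

def residual_layer(f_prev, A, b):
    # Skip-connection: the layer's input is added back to its output, so the
    # next layer effectively receives both f[l] and the redirected f[l-1].
    return f_prev + relu(A @ f_prev + b)

rng = np.random.default_rng(3)
d = 16
f = rng.standard_normal(d)
A = 0.1 * rng.standard_normal((d, d))   # square so that f[l-1] and f[l] have matching shapes
b = np.zeros(d)

out = residual_layer(f, A, b)
```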

This evolution has been very successful and made the training of networks with many layers feasible.

The literature on neural networks and deep learning is filled with different recipes and practical insights on how to better train these complex learning machines. This is because of their very complicated structure, for which there does not exist enough theoretical understanding and explanation.

This field has received an unprecedented amount of attention from academia and industry, and research and practice in it have expanded explosively during the last decade.

We refer the reader to [24], which reviews the key achievements in this field.

2.3.2 Literature review

In section 2.2, we have seen the basic formulation of MAP estimation and how the basic prior models can be added to the objective function, e.g., as in Eq. 2.4 or Eq. 2.9.

For the composite models, on the other hand, this can be done in many different ways.

Since these models are very flexible, and since they can benefit more from the availability of data than the basic models, one can consider many different scenarios.

Denote the equivalent network architecture of Eq. 2.14 as f[L] = Nθ(f[0]), where Nθ(·) is a network with parameters symbolized as θ. One can generate a large number of degraded-clean pairs of images and learn to map the degraded images to the clean ones by fitting a network f = Nθ(q) on the training pairs. This, however, requires learning one network for each degradation level. Attempts to learn one network for all degradations require learning very complex networks. One such effort is [25], where a 30-layer network is trained to denoise several contamination levels simultaneously. Another effort is [26], where a very complex model consisting of 80 layers with memory units is trained.
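As an illustration of the "one network per degradation level" setup, the (degraded, clean) pairs can be generated as in the sketch below, assuming additive Gaussian noise at a single fixed level σ; the patch size and the values used are arbitrary, and the denoiser Nθ itself could be any network such as the feed-forward sketch above.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 25.0 / 255.0                 # one fixed, assumed contamination level

def make_pairs(clean_images, sigma):
    """Build (degraded, clean) pairs q = f + noise for a single noise level."""
    return [(f + sigma * rng.standard_normal(f.shape), f) for f in clean_images]

# Stand-in for a set of clean training images (here, 8x8 patches in [0, 1]).
clean = [rng.random((8, 8)) for _ in range(1000)]
train_pairs = make_pairs(clean, sigma)

# A network f_hat = N_theta(q) would then be fit on these pairs; in this setup a
# separate network has to be trained for every other value of sigma.
```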

Another possibility is to train q̂ = Nθ1(q) on the set of degraded images and then train f̂ = Nθ2(f) on the set of clean images. A third network can then be learned to map between the parameter sets θ1 and θ2, i.e., θ2 = Nθ3(θ1). An example of such an effort is [27], for the task of single-image super-resolution.

One other possibility is to learn the network parameters θ on a large set of clean images, e.g., using an autoencoder network, i.e., a network trained with the reconstruction distortion as its cost function, and to use the trained f̂ = Nθ(f) as the prior term p(f; θ) in Eq. 2.2. Examples of these efforts can be found in [28, 29].
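A rough sketch of how a pre-trained network could enter the MAP objective of Eq. 2.2 as a prior-like term is given below; the quadratic data term, the penalty ‖f − Nθ(f)‖², and the gradient approximation are illustrative assumptions and not the exact formulations of [28, 29].

```python
import numpy as np

def map_estimate(q, H, net, lam=0.1, lr=0.05, n_iter=500):
    """Gradient descent on 0.5*||q - H f||^2 + lam * 0.5*||f - net(f)||^2.

    The gradient of the penalty is approximated by (f - net(f)), i.e. the
    Jacobian of the frozen, pre-trained network is ignored -- a simplification
    assumed here to keep the sketch short.
    """
    f = H.T @ q                           # simple initialization
    for _ in range(n_iter):
        grad_data = H.T @ (H @ f - q)     # gradient of the data-fidelity term
        grad_prior = f - net(f)           # pushes f toward what the network reproduces
        f = f - lr * (grad_data + lam * grad_prior)
    return f

# Illustrative use: a toy linear degradation and a stand-in for a trained autoencoder.
rng = np.random.default_rng(5)
n, m = 64, 32
H = rng.standard_normal((m, n)) / np.sqrt(m)
f_true = rng.standard_normal(n)
q = H @ f_true + 0.01 * rng.standard_normal(m)
f_hat = map_estimate(q, H, net=lambda f: 0.9 * f)
```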

Another very interesting line of work tries to benefit from the learning capabilities of composite models to enrich the solutions of the basic models. The idea is to take the solution structure of basic models and implement it using a neural network. The parameters of the model are trained using input-output pairs provided by the basic model. This is done using the idea of “unfolding”, which expands the iterative solution into several iteration steps. Notable examples of this line of work are [30], which unfolds the ISTA solution of Eq. 2.6b into several time-steps, and [31], which unfolds the IHT of Eq. 2.6a and solves it with a neural network, lifting some limitations of IHT w.r.t. dictionary coherence.
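To illustrate the unfolding idea on ISTA, the sketch below fixes the number of iterations and treats the quantities of each step as (potentially trainable) parameters, in the spirit of [30]; the initialization shown simply reproduces plain ISTA, and the dimensions are arbitrary.

```python
import numpy as np

def soft(x, t):
    # Soft-thresholding, the proximal operator of the l1 norm used by ISTA.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def unfolded_ista(q, W_e, S, theta, n_layers=10):
    """ISTA unfolded into a fixed number of 'layers'.

    Each layer computes x <- soft(W_e q + S x, theta); in the unfolding idea the
    per-layer W_e, S and theta then become trainable parameters.
    """
    x = np.zeros(S.shape[0])
    for _ in range(n_layers):
        x = soft(W_e @ q + S @ x, theta)
    return x

# Initialization that reproduces plain ISTA for a dictionary C (illustrative sizes).
rng = np.random.default_rng(6)
m, n = 32, 64
C = rng.standard_normal((m, n)) / np.sqrt(m)
L = np.linalg.norm(C, 2) ** 2            # Lipschitz constant of the data-term gradient
lam = 0.1
W_e = C.T / L
S = np.eye(n) - (C.T @ C) / L
q = C @ rng.standard_normal(n)           # a toy observation
x_hat = unfolded_ista(q, W_e, S, theta=lam / L)
```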

The recent work of [32] reveals a very interesting fact about neural structures used in image processing. They show that, contrary to the common understanding, the success of
