Nonlinear transform learning: model, applications and algorithms


Thesis

Reference

Nonlinear transform learning: model, applications and algorithms

KOSTADINOV, Dimche

Abstract

The principles of modeling nonlinearities are essential for many real-world problems. Their treatment plays a central role and influences not only the quality of the solution but also the computational complexity and the gains in the trade-offs possibly involved, all of which are in high demand across a variety of applications, such as active content fingerprinting, image restoration, supervised and unsupervised learning of discriminative representations for image recognition tasks, and clustering methods. In this thesis, a novel generalized nonlinear transform model is proposed and studied. Our main interest and core element is the nonlinear transform expressed by a two-step operation consisting of a linear mapping followed by an element-wise nonlinearity. To that end, depending on the application considered, probabilistic interpretations are developed, and generalizations and special cases are proposed and considered.

[...]

KOSTADINOV, Dimche. Nonlinear transform learning: model, applications and algorithms. Thèse de doctorat : Univ. Genève, 2018, no. Sc. 5335

URN : urn:nbn:ch:unige-1185338

DOI : 10.13097/archive-ouverte/unige:118533

Available at:

http://archive-ouverte.unige.ch/unige:118533

Disclaimer: layout of this document may differ from the published version.


UNIVERSITÉ DE GENÈVE FACULTÉ DES SCIENCES

Département d’Informatique Professeur S. Voloshynovskiy

Nonlinear Transform Learning:

Model, Applications and Algorithms

THESIS

presented to the Faculty of Science of the University of Geneva for the degree of Doctor of Science (Docteur ès sciences), specialization in computer science

by

Dimche Kostadinov

from

Strumica (Macedonia)

Thesis No. 5335

GENÈVE

Repro-Mail - Université de Genève

2018


NONLINEAR TRANSFORM LEARNING:

MODEL, APPLICATIONS AND ALGORITHMS

A DISSERTATION

SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE OF UNIVERSITY OF GENEVA

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Dimche Kostadinov May 2019


© Copyright by Dimche Kostadinov 2019. All Rights Reserved.


In memory of my father

To all that I care about, with love and eternal appreciation


Acknowledgements

I would like to thank my supervisor, Prof. Sviatoslav Voloshynovskiy, for giving me the opportunity to work on this PhD thesis, and for his encouragement, consideration and involvement at all times, the discussions, the insights and all the rest of his support. I would like to thank my jury members, Prof. Karen Egiazarian, Prof. Teddy Furon, Prof. Sylvain Sardy and Prof. Stéphane Marchand-Maillet, for their careful reading, valuable suggestions and comments.

I would like to thank Taras Holotyak for the time he spent discussing and reading my draft concepts, as well as for providing valuable comments. I would like to thank Sohrab Ferdowsi for taking the time to participate in discussions regarding my presentations, elaborations and outlines of many of my ideas on the board in our office. I am thankful to Maurits Diephuis for commenting on the English writing of many papers. In addition, I would like to thank Behrooz Razeghi for his interest and enthusiasm in some of the concepts over several works and for providing comments on certain details and clarifications. I also want to mention the rest of my colleagues from our Stochastic Information Processing Group, who have directly or indirectly provided support in the form of discussions, comments and suggestions related to the work in this thesis.

I would like to thank the head of the Computer Vision and Multimedia Laboratory, Prof. Thierry Pun, for his supportive attitude and for enabling a great working environment, as well as Prof. Stéphane Marchand-Maillet and Prof. Alexandros Kalousis for providing me with insights and suggestions related to some machine learning aspects.

I would like to thank Fokko Beekhof and Farzad Farhadzadeh for their initial help during my relocation to Geneva. I would like to thank Boris Petrov Lambrev for his help with the translation of the abstract. I would also like to thank the rest of the colleagues of the Computer Vision and Multimedia Laboratory, with whom I have spent a wonderful time and formed professional as well as personal bonds. I am thankful to the many colleagues with whom I shared coffee breaks with interesting and re-energizing conversations. I would like to thank Edgar Francisco Roman, Sohrab Ferdowsi, Ke Sun, Majid Yazdani and Michal Muszynski for the delightful company and the great hangouts.

Finally, I would like to thank my mother, my two sisters and their families for their never-ending support, and I would like to thank my wife.


Abstract

Modeling of nonlinearities is essential for many real-world problems, where its treatment plays a central role and impacts not only the quality of the solution but also the computational complexity. It is relevant to a variety of applications, including active content fingerprinting, image restoration, supervised and unsupervised discriminative representation learning for image recognition tasks, and clustering.

In this thesis, we introduce and study a novel generalized nonlinear transform model. In particular, our main focus and core element is the nonlinear transform that is expressible as a two-step operation consisting of a linear mapping followed by an element-wise nonlinearity. To that end, depending on the considered application, we develop probabilistic interpretations, propose generalizations and extensions, and take into account special cases.

An approximation to the empirical likelihood of our nonlinear transform model provides a learning objective, for which we not only identify and analyze the corresponding trade-offs but also give information-theoretic and empirical risk connections for the objectives addressed in the respective problem formulations. We introduce a generalization that extends an integrated maximum marginal principle over the approximation to the empirical likelihood, which allows us to address the optimal parameter estimation. In this scope, depending on the modeled assumptions w.r.t. an application objective, the implementation of the maximum marginal principle enables us to estimate the model parameters efficiently: we propose approximate and exact closed-form solutions and present iterative algorithms with convergence guarantees.

Numerical experiments empirically validate the nonlinear transform model, the learning principle, and the algorithms for active content fingerprinting, image denoising, estimation of robust and discriminative nonlinear transform representations for image recognition tasks, and our clustering method, which is performed in the nonlinear transform domain. At the time of thesis preparation, our numerical results demonstrate advantages over the state-of-the-art methods of the corresponding category in terms of learning time, run time and quality of the solution.


Résumé

Les principes de la modélisation de non-linéarités sont essentiels pour maints problèmes de la vie réelle. Leur traitement joue un rôle central et influence non seulement la qualité de la solution, mais aussi la complexité computationnelle et les gains dans les compromis possiblement impliqués, qui sont tous hautement demandés dans une variété d’applications, comme la prise du contenu des empreintes digitales active, la reconstitution des images, l’apprentissage supervisé et non-supervisé des représentations discriminatives pour des tâches de reconnaissance d’image et les méthodes de regroupement.

Dans la thèse présente un modèle de transformation non-linéaire généralisé novateur est proposé et étudié. Notre intérêt principal et élément de base est la transformation non linéaire exprimée par une double opération qui consiste en une modélisation linéaire suivi d’une non-linéarité par éléments. Pour ce faire, selon l’application considérée, des interprétations probabilistes sont développées et des généralisations et des cas particuliers sont proposées et considérées.

Une approximation à la probabilité empirique de la transformation non-linéaire assure l’objectif d’apprentissage où non seulement les compromis correspondants sont identifiés et analysés, mais les connexions à risque d’un point de vue informative-théorique, ainsi qu’empirique sont proposé en considérant les objectifs adressés dans les formulations respectives du problème. L’introduction d’une généralisation qui étend un principe maximal intégré marginal sur l’approximation de la probabilité empirique permet d’adresser l’estimation optimale du paramètre. Dans cet esprit, selon les hypothèses modelées par rapport à un objectif d’application la réalisation du principe marginal maximal, permet d’estimer de manière efficace les paramètres du modèle où des solutions analytiques approximatives et exactes sont proposées, ainsi que des algorithmes itératifs avec des garanties convergentes.

Des expériences numériques confirment la validité de notre modèle NT, le principe d’apprentissage, les algorithmes pour la prise du contenu des empreintes digitales active, l’enlèvement du bruit des images, l’estimation d’une représentation de transformation non-linéaire robuste et discriminative pour des tâches de reconnaissance d’image et la méthode de regroupement exécuté dans le domaine de transformation non-linéaire. Lors de la préparation de la thèse nos résultats numériques montrent des avantages, comparés aux méthodes de pointe correspondants, concernant le temps d’apprentissage, la durée de fonctionnement et la qualité de la solution.


List of figures

1.1 The problems addressed in this thesis are based on the unified concept of nonlinear transform modeling and learning.
3.1 Local ACFP framework.
3.2 Local ACFP framework with approximation to the linear map.
3.3 A general scheme for joint ACFP modulation and linear feature map learning.
3.4 A general scheme for ACFP-LR using a latent representation, extractor and reconstructor functions.
4.1 The evolution of a) the transform error $\|AX - Y\|_F^2$, b) the $-\mathrm{Tr}\{AXY^T\}$ and its lower bound approximation $-\mathrm{Tr}\{AG\}$, c) the conditioning number and d) the expected mutual coherence $\mu(A)$ while learning the transform matrix $A$ on overlapping 8×8 noisy image blocks (equivalently $N = 64$) from the Cameraman image, where $M$ was set to 80 and the sparsity level was set to 36.
4.2 The evolution of the normalized transform error $\frac{\|AX - Y\|_F^2}{L}$, where $L$ is the total number of samples $x_l$, $l \in \{1, \dots, L\}$, under a) sparsity levels $s \in \{4, 10, 16, 22, 28, 24, 40, 46, 52, 58, 64, 70\}$ and b) amounts of data expressed in percentage from the total amount of data while learning the transform matrix $A$ on overlapping 8×8 noisy image blocks (equivalently $N = 64$) from the Cameraman image, where $M$ was set to 80 and the sparsity level was set to 36.
5.1 The illustration of the clustering over local blocks.
5.2 An illustration of the code $s_{j,k}$ construction for subject $k$ at image patch location $j$.
5.3 An illustration of the recognition based on aggregation over local bag-of-word decisions, that use the local NT representations.
5.4 Recognition results under basic fusion with $\ell_1/\ell_2$ constrained projection, soft thresholding and hard thresholding and weighted fusion with $\ell_1/\ell_2$ constrained projection, soft thresholding and hard thresholding.
5.5 Recognition results under varying number of training samples and varying number of codebook codes.
5.6 Comparative recognition results using Extended Yale B and AR.
5.7 Comparative recognition results using PUT and FARET data sets.
5.8 Comparative recognition results under random corruption and continuous occlusion.
5.9 An illustration of the idea about our NT transform, where we used different colors to denote the spaces of the data samples from different classes in the original and transform domain. The goal of our NT is to achieve discrimination by taking into account a minimum information loss on the linear map and discrimination prior with a discrimination measure defined on the support intersection for the NT representations.
5.10 The evolution of the approximations $C_1 = R^P_{\ell_1}(X)$ and $C_2 = D^P_{\ell_1}(X)$, their ratio $C_1/C_2$ and the discrimination power $\log(C_1/C_2) = I^t$ during the learning of the nonlinear transform with transform dimension $M = 9100$.
5.11 The conditioning number $\kappa_n(A) = C_n(A)$ and the expected mutual coherence $\mu(A)$ for the learned linear map $A$ in the NT at different transform dimensionality $M \in Q$.
5.12 The approximation $C_2 = R^P_{\ell_1}(X)$ and the discrimination power $I^t$ on a subset of the transform data using learned NT at different transform dimensionality $M \in Q$.
5.13 The recognition results and the discrimination power on the Extended Yale B and MNIST databases, respectively, using a NT with different dimensionality $M$ and linear SVM classifier on top of the transform representation.
5.14 The expected loss $E\left[\frac{\|z^{c,k}\|_2^2}{M}\right] = E\left[\frac{\|Ax^{c,k} - y^{c,k}\|_2^2}{M}\right]$ and the discrimination power on the Extended Yale B and MNIST databases, respectively. The transform representation $Y$ is obtained by using a nonlinear transform $T_P$ with different dimensionality.
5.15 The expected mutual coherence $\mu(A)$ and the conditioning number $\kappa_n(A) = \frac{\lambda_{\max}}{\lambda_{\min}}$ for the learned transform matrix $A$ at dimensionality $M \in Q_U$.
5.16 The evolution of the discrimination power $I^t$ for 100 algorithm iterations on a subset of the transform data using UNT at transform dimension $M = 5884$.
5.17 An illustration of the idea about our NT transform with self-collaboration relations that takes discrimination specific objective into account.
6.1 An illustration of the cluster assignment based on a similarity measure $d(\cdot, \cdot)$ between $x_i$ and the clusters $d_j$, $j = \{1, \dots, 8\}$, i.e., $\hat{j} = 7 = \arg\min_j m(x_i, d_j)$.
6.2 An illustration of the proposed simultaneous cluster and NT representation assignment. $q_i = Ax_i$ is the linear transform representation, $y|_{c_1, c_2}$ is the NT representation, and $\{\boldsymbol{\tau}_{c_1}, \boldsymbol{\nu}_{c_2}\}$ are element-wise nonlinearity parameters with a discrimination role. There are in total 4 NT representations, determined by all pairs $\{c_1, c_2\} \in \{1, 2\} \times \{1, 2\}$. Simultaneously, the data point $x_i$ is assigned to cluster index $c = 2(c_1 - 1) + c_2 = (2 - 1)2 + 2 = 4$ and the NT representation is estimated as $y_i = y|_{2,2}$ based on the discriminating min-max similarity/dissimilarity score.
6.3 The evolution of a) the objective related to the problem of simultaneous cluster and NT representation assignment, b) the expected NT error and c) the expected discrimination min-max functional score per iteration for the proposed algorithm on the ORL [132], COIL [105], E-YALE-B [47] and AR [96] database.


List of tables

2.1 The nonlinear transform model and the applications considered in this Thesis.
3.1 The $p_e$ using PCFP under varying AWGN noise, varying JPEG compression levels and projective transformation with QF level of 5.
3.2 DWR and $p_e$ using varying ACFP modulation under varying AWGN noise, varying JPEG compression levels and affine transformation with QF level of 5.
3.3 DWR and $p_e$ under PCFP using varying AWGN noise, JPEG quality factor and affine transformation with QF=5 for the feature maps $F$, $F^{\dagger}$, $F^r$ and $F_I$.
3.4 DWR and $p_e$ using varying ACFP modulation under varying AWGN noise, JPEG quality factor and projective transformation with QF=5 for the feature maps $F$ and $F_I$.
3.5 DWR and $p_e$ using varying ACFP modulation under varying AWGN noise, JPEG quality factor and affine transformation with QF=5 for the feature maps $F$, $F^{\dagger}$ and $F^r$.
3.6 DWR and $p_e$ using varying ACFP modulation under varying AWGN noise, JPEG quality factor and affine transformation with QF=5.
3.7 DWR and $p_e$ under different Additive White Gaussian Noise (AWGN) levels.
3.8 DWR and $p_e$ under different JPEG Quality Factor (QF).
3.9 DWR and $p_e$ under JPEG quality factor QF=5 and affine transformation.
3.10 DWR and $p_e$ for ACFP-LR under affine transform and extremely low QF and high AWGN levels.
4.1 Denoising performance in PSNR, where $\sigma$ is the noise standard deviation.
4.2 The execution time in minutes and the percentage of the used image data.
4.3 The PSNR for the $\varepsilon$CAT algorithm learned on percentage of the available noisy image data with noise level $\sigma = 10$.
5.1 Computational complexity in big $O(\cdot)$ notation.
5.2 Memory usage.
5.3 a) The conditioning number $\kappa_n(A)$ and the expected mutual coherence $\mu(A)$ for the learned linear map $A$ in the NT, and the execution time $t_e$ [min] in minutes of the proposed algorithm for 28 iterations using NT with dimensionality $M = 19000$. b) The discrimination power in the original domain $I^O$, after a transform with random linear map $I^{RT}$, after using learned sparsifying transform $I^{ST}$ and after using learned NT $I^{NT}$.
5.4 a) The conditioning number $\kappa_n(A)$ and the expected mutual coherence $\mu(A)$ for the learned linear map $A$ in the NT, and the execution time $t_e$ [min] in minutes of the proposed algorithm for 28 iterations using NT with dimensionality $M = 19000$. b) The discrimination power in the original domain $I^O$, after a transform with random linear map $I^{RT}$, after using learned sparsifying transform $I^{ST}$ and after using learned NT $I^{NT}$.
5.5 a) The discrimination power $I$ for the methods DLSI [118], FDDL [162], COPAR [148], LRSDL [149] and the proposed NT, and the recognition results using a nonlinear transform with different dimensionality $M$ and linear SVM classifier on top of the transform representation for the Extended Yale B and MNIST database.
5.6 The recognition results and the learning time in hours on all databases, respectively. We show the k-NN accuracy on the original data (OD) representation and the k-NN accuracy on the NT representation where the UNT is learned in the unsupervised case and has dimension $M = 5884$.
5.7 a) The discrimination power for the sparse representations of the methods DLSI [118], FDDL [162], COPAR [148] and LRSDL [149] and the proposed method UNT, b), c) The recognition results on the Extended Yale B and MNIST for the methods DLSI [118], FDDL [162], COPAR [148] and LRSDL [149] compared to the kNN results on the UNT representations learned in the unsupervised setup.
5.8 Recognition accuracy comparison between state-of-the-art methods and 1) k-nearest neighbor (kNN) search and 2) linear SVM [61] (l-svm) that use the Sparsifying Nonlinear Transform (sNT) representations from our model on extracted HOG [32] image features. We use our algorithm to learn the model on the HOG features. Then we get the sNT representations with dimensionality 7300 for the respective training and test sets. The training sNT representations are used to estimate the SVM parameters and the recognition is performed using the learned SVM on the test sNT representations.
5.9 The cumulative expected mutual coherence $\frac{1}{L}\sum_l \mu(A_l)$ and the cumulative conditioning number $\frac{1}{L}\sum_l \kappa_n(A_l)$ for the linear maps $A_l$, $l \in \{1, \dots, 6\}$, with dimensions $6570 \times N$, where $N$ is the dimensionality of the input data.
5.10 The learning time in hours on the databases AR, YALE B, COIL20 and NORB using our model with dimension $LM = 6570$, number of self-collaboration components $L = 9$, and dimension per self-collaboration component $M = 730$.
5.11 The discrimination power in the original domain, after random transform, after learned sparsifying transform and after learned self-collaborating target specific nonlinear transform with dimension $M = 6570$.
5.12 The recognition results on the databases AR, YALE B, COIL20 and NORB, using k-NN on the raw image data (raw) and the sparse representations from our model (p) with dimension $M = 6570$.
5.13 The discrimination power and the recognition results on the Extended Yale B and MNIST databases for the methods DLSI [118], FDDL [162], COPAR [148], LRSDL [149], the proposed model on raw image data p and the proposed model on extracted HOG [32] image features HOG-p.
5.14 Recognition accuracy comparison between state-of-the-art methods and 1) k-nearest neighbor (k-nn) search and 2) linear SVM [61] (l-svm) that use the Sparsifying NT (sNT) representations from our model on extracted HOG [32] image features. We use our algorithm to learn the model on the HOG features. Then we get the sNT representations with dimensionality 9800 for the respective training and test sets. Considering the obtained result for database SVHN, we note that the unlabeled training data from the respective database was not used during the learning of the corresponding model.
6.1 The computational efficiency per iteration $t$ [sec] for the proposed algorithm, the conditioning number $\kappa_n(A)$ and the expected mutual coherence $\mu$ for the linear map $A$.
6.2 The clustering performance over the databases COIL, ORL, E-YALE-B and AR evaluated using the Cluster Accuracy (CA) and the Normalized Mutual Information (NMI) metrics.
6.3 Comparative results between state-of-the-art [89], [168], [164], [54] and [73], and the proposed method (∗).
6.4 The k-NN accuracy results using assigned NT representations and original data (OD) representation.


Chapter 1 Introduction

"A perspective from a transform through projection, one may see it as a formal notion, others, as the essence of life..."

D. Kostadinov

Nowadays, in many areas, including signal processing, machine learning, artificial intelligence and computer vision, a commonly encountered issue is the presence of data uncertainty in the form of noise or data variability, caused by the inevitable imperfections in the data acquisition process. It is therefore essential to have convenient data representations, mappings and transforms that make it possible to process, analyze, modulate, recognize, classify and cluster the data accurately and efficiently.

In general, we can distinguish two main types of mappings and transforms. The first type covers transforms that, when applied to the data, introduce small changes subject to defined constraints. This approach covers a range of applications in which content modulation prior to content distribution/reproduction is appropriate, such as content authentication, identification and recognition. The second type covers transforms that do not change the original data but, when applied to it, yield transform representations that satisfy task-specific objectives. The latter type is widely used in applications like image denoising, recognition, classification and clustering.

Nonetheless, in both cases, one aims to efficiently express the original data representation with another convenient data representation, which can result from:

- A carefully and appropriately chosen analytic and predefined transform, or
- A data-adaptive learned transform.

The advantage of the latter over the former is its ability to adapt to the given data. Using the former requires stricter statistical properties of the data to be known in advance, which may limit its practical usability.

Nonetheless, in order to model and efficiently estimate a task-relevant, useful and information-preserving transform representation that satisfies certain properties, prior knowledge in the form of "looser" assumptions usually has to be taken into account. One of the fundamental concepts widely exploited in the past decade for data-adaptive processing and data analysis is the sparse data representation. That is, given a data sample $x \in \Re^N$ and a set of vectors $D = [d_1, \dots, d_M] \in \Re^{N \times M}$ (formally known as a frame¹), a sparse representation $y \in \Re^M$ for $x$ over $D$ is one that uses a sparse (small) number of vectors $d_i \in \Re^N$ from $D$ to represent $x$.
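The definition above can be made concrete with a small numerical sketch. The sizes `N`, `M`, `s` and the random frame below are illustrative assumptions and are not taken from the thesis; the snippet only shows what it means for a sample to be represented by a few frame vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, s = 8, 16, 3  # data dimension, number of frame vectors, sparsity level (all illustrative)

# Overcomplete frame D in R^{N x M} with unit-norm columns d_i (random, for illustration only).
D = rng.standard_normal((N, M))
D /= np.linalg.norm(D, axis=0, keepdims=True)

# Sparse representation y in R^M: only s of its M entries are nonzero.
y = np.zeros(M)
support = rng.choice(M, size=s, replace=False)
y[support] = rng.standard_normal(s)

# The data sample x is synthesized from the s selected frame vectors only.
x = D @ y
x_from_support = D[:, support] @ y[support]
```

In this sketch the representation is known by construction. In practice, estimating `y` from `x` and `D` is an inverse problem, which is where the computational cost of the synthesis model comes from.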

Although sparsity is crucial for modeling and solving many inverse problems that are encountered across different signal processing, machine learning and artificial intelligence tasks, the sparsity assumption alone is not enough to encompass the full extent of the requirements in applications like active content fingerprinting (ACFP) and recognition.

On the other hand, even if we rely only on the most widely used synthesis [2] sparse model, its computational complexity can be a disadvantage, since it is high when the input data dimension or the sparse representation dimension is large.

To address the aforementioned challenges, extensions and alternative models have to be taken into account, and additional priors and assumptions on the representation properties have to be considered, modeled and explored in order to fulfill task specific demands like:

- Low complexity estimate
- Optimal trade-offs
- Robustness
- Low estimate variation w.r.t. a task specific objective
- Discrimination.

All of these are very important for active content fingerprinting (ACFP), image denoising, the estimation of discriminative representations in image recognition/classification tasks and clustering methods. In this thesis, to address the above open issues:

- We introduce a novel generalized nonlinear transform (NT) model

- We demonstrate the usefulness of the NT model across several applications.

¹ A set of $M$ orthonormal vectors with vector dimensionality $N$ equal to $M$ is said to form a basis set for that vector space. A frame of an inner product space is a generalization of a basis of a vector space to sets that may be linearly dependent.


Fig. 1.1 The problems addressed in this thesis are based on the unified concept of nonlinear transform modeling and learning.

Our model not only makes it possible to address different task-specific constraints on the transform representations, but also offers a probabilistic interpretation, provides information-theoretic connections, and allows the considered approximation of the model log-likelihood to be related to the empirical risk in the corresponding learning objective.

Our parametric model will vary depending on the different application-driven assumptions and prior constraints. At the basic level, a common component is the low-complexity nonlinear transform, which is expressible as a linear mapping followed by an element-wise nonlinearity. Regarding the estimation of the NT representation, the key difference between our NT model and the commonly used synthesis model with constraints is that we do not explicitly address the reconstruction of the data by a sparse linear combination. Rather, we address a constrained projection problem and estimate the NT representation as its solution. Our approach has a number of advantages that will be presented and explained in this Thesis.
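As a rough illustration of the two-step operation described above, the following sketch applies a linear map and then an element-wise thresholding. It assumes one particular constraint, a sparsity level `s` on the representation, so that the constrained projection problem is min_y ||Ax - y||_2^2 subject to ||y||_0 at most s. The function names, the sizes and the random matrix `A` are hypothetical and chosen for illustration; the thesis considers a more general family of constraints.

```python
import itertools
import numpy as np

def nonlinear_transform(A, x, s):
    """Linear map followed by an element-wise nonlinearity (assumes 1 <= s <= M).

    Keeps the s largest-magnitude entries of q = A x and zeroes the rest,
    which solves  min_y ||A x - y||_2^2  subject to  ||y||_0 at most s.
    """
    q = A @ x                              # step 1: linear mapping
    y = np.zeros_like(q)
    keep = np.argsort(np.abs(q))[-s:]      # indices of the s largest magnitudes
    y[keep] = q[keep]                      # step 2: element-wise hard thresholding
    return y

def brute_force_projection(A, x, s):
    """Reference solution that tries every support of size s (feasible only for tiny M)."""
    q = A @ x
    best, best_err = None, np.inf
    for support in itertools.combinations(range(q.size), s):
        y = np.zeros_like(q)
        y[list(support)] = q[list(support)]
        err = np.sum((q - y) ** 2)
        if best_err > err:
            best, best_err = y, err
    return best

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))  # M = 6, N = 4 (illustrative sizes)
x = rng.standard_normal(4)
y = nonlinear_transform(A, x, s=2)
```

The point of the comparison with the brute-force search is that the representation is obtained in closed form from one matrix-vector product and a sort, with no iterative solver and no reconstruction of `x`.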

1.1 Scope of the Thesis

In the scope of this Thesis, using special cases and extensions of our nonlinear transform model, we address:

- The active content fingerprinting (ACFP) problem

- The image denoising, as one particular representative of the restoration problems

- The estimation of sparse and discriminative NT representations useful for image recognition/classification tasks

- Nonlinear transform domain clustering.

The addressed problems in this Thesis are summarized in Figure 1.1.

In the ACFP, we use a model that represents a special case of our generalized nonlinear transform model. Since the focus is on a special type of ACFP modulation, the NT representation appears in the linear system, which has to be solved in order to estimate the optimal distortion component that has to be added to the original data.

In image denoising, the sparsifying transform model is another special case of our generalized nonlinear transform model. To study the optimal solution for the sparsifying transform model with a non-structured overcomplete transform matrix, we focus on a problem formulation that addresses a trade-off between (a) the alignment of the gradients of the approximate objective and the original objective and (b) the tightness of the lower bound to the original objective. Exploiting this trade-off accelerates the local convergence and leads to a satisfactory solution under a small amount of training data.

Sparsity alone does not guarantee that the resulting representation will be discriminative. To the best of our knowledge, we provide the first work that extends the sparsifying model for learning sparse and discriminative representations while offering a high degree of freedom in modeling² and imposing constraints other than the sparsity constraint on the representation.

Considering an estimate with low variability w.r.t. a discrimination-specific objective that is found through the use of self-collaboration, we extend our generalized nonlinear transform model and explore a discrimination-centered, collaboration-structured and sparse modeling.

In the final part of this Thesis, we jointly model and learn multiple NT transforms with explicit consideration of discrimination specific parameters. We propose a novel clustering principle, where we focus on measures that reflect a notion of joint similarity and dissimilarity score between a data point and a set of data points. Finally, we develop a concept that allows unsupervised discrimination and clustering not in the original data domain, but instead in a nonlinear transform domain, where a nonlinear transform model is used.

1.2 Thesis Outline

In Chapter 2, we present the commonly used synthesis model, give its probabilistic interpretation and provide the related inverse problem for the estimation of the synthesis representation. Afterwards, we introduce our generalized nonlinear transform model, which is the unified base for all our NT modeling across the applications considered in this thesis. In addition, we introduce the related direct problem, i.e., the constrained projection problem, which has a central role in the estimation of the NT representation.

² Many nonlinearities, e.g., ReLU, $p$-norms, elastic net-like, $\ell_1/\ell_2$-norm ratio, binary encoding, ternary encoding, etc., can be modeled as a generalized nonlinear transform representation.

In Chapter 3, we introduce and describe the active content fingerprinting concept and explain the differences compared to passive content fingerprinting (PCFP). Then, by taking an approximation of the negative logarithm of a special case of our NT model, we introduce the generalized problem formulation as a form of min-max problem. Under a linear modulation and predefined linear feature map, we show a reduction to a constrained projection problem and provide the optimal solution. We also address an approximation of the predefined linear feature map in order to find appropriate trade-offs between the modulation distortion and feature robustness. Afterwards, we address a problem formulation, where we jointly learn the linear map and estimate the modulation distortion in order to attain a low modulation distortion and high feature robustness. In our numerical evaluation, our efficient solution demonstrates significant improvements compared to PCFP. Finally, we extend the basic concept of ACFP by focusing on a redundant content representation and include extractor and reconstructor functions, which we name as ACFP-LR. In the latter, we present the problem formulation and show a reduction to a constrained projection problem, which has an efficient solution. Our numerical evaluation shows that ACFP-LR has superior performance compared to the rest of the analyzed schemes.

In Chapter 4, we address the learning problem for a data-adaptive transform that provides a sparse representation in a space whose dimension is larger than or equal to that of the original space. We show that the sparsifying transform model is one reduced form of the generalized NT model. We present an iterative, alternating algorithm that has two steps: (i) transform update and (ii) sparse coding. In the transform update step, we focus on a novel problem formulation based on a lower bound of the objective function that addresses a trade-off between (a) how well the gradients of the approximate objective and the original objective are aligned and (b) how close the lower bound is to the original objective. This not only allows us to propose an approximate closed form solution, but also makes it possible to find an update that can lead to accelerated local convergence and to estimate an update that achieves a satisfactory solution under a small amount of data. Since the approximate closed form solution in the transform update step preserves the gradient and the sparse coding step uses the exact closed form solution, we show that the resulting algorithm is convergent. On the practical side, we evaluate our algorithm in an image denoising application. We demonstrate promising performance together with advantages in training data requirements, accelerated local convergence and computational complexity.

Chapter 5 consists of three major sections. In the first section of Chapter 5, we consider the face recognition problem from both machine learning and information coding perspectives, adopting an alternative way of visual information encoding and decoding through estimation of a robust NT representation. Our model for recognition is based on multilevel vector quantization (MVQ), which is conceptually equivalent to a bag-of-words method and bears similarity to a convolutional neural network (CNN). We introduce an alternative aggregation method over local bag-of-words decisions from locally estimated robust NT representations w.r.t. learned centroids over local blocks. Moreover, we relate the local NT representation to a corresponding likelihood vector. We present a generalization of a sparse likelihood approximation, give connections to the maximum a posteriori (MAP) estimate, and show connections to common techniques such as hard and soft thresholding. We evaluate our approach by extensive numerical simulation on face image databases, where we show improvements and competitive results w.r.t. the state-of-the-art methods, while having low computational complexity.

In the second section of Chapter 5, we describe and explain our NT model, where we consider the modeling and learning of a nonlinear transform that is parameterized by a linear map and a generalized element-wise nonlinearity. In our modeling of the NT, we introduce the minimum information loss and discriminative priors for the respective linear map and sparse representations. During training, we estimate the model parameters by minimizing an approximation of the negative logarithm of the model. We propose an efficient iterative algorithm with a convergence guarantee that alternates between two steps, which have approximate and exact closed form solutions. Given a test data sample, we estimate a sparse representation using the learned model parameters, which is the solution to a low-complexity constrained projection problem. The efficiency of the proposed approach, together with the potential usefulness of the NT representations, is validated by numerical experiments in supervised and unsupervised image recognition setups. The evaluation demonstrates advantages in comparison to the state-of-the-art methods of the same category regarding the learning time, the discriminative quality and the recognition accuracy.

In the third section of Chapter 5, we present an extension of our base NT model to another NT model for learning collaboration-structured, discriminative and sparse representations. The idea is to model a collaboration corrective functionality between multiple nonlinear transforms in order to reduce the uncertainty in the estimate. The focus is on the joint estimation of data-adaptive NTs that take into account a collaboration component w.r.t. a discrimination target. The joint model includes the minimum information loss, collaboration corrective and discriminative priors. The model parameters are learned by minimizing the negative logarithm of the learning model, where we propose an efficient solution by an iterative coordinate descent algorithm. Numerical experiments validate the potential of the proposed learning principle. The preliminary results show advantages in comparison to the state-of-the-art methods.

In Chapter 6, we present a novel clustering concept based on (i) jointly learned NTs with minimum information loss and discriminative priors and (ii) min-max assignment over NT representations. In common clustering algorithms, a data point in the original data space is assigned to clusters based on a similarity correspondence. In contrast, we propose a simultaneous cluster and NT representation assignment principle based on evaluating a min-max score that approximates the discriminative log-likelihood in the transform domain.

Numerical experiments on an image clustering task validate the potential of the proposed approach. The evaluation shows advantages in comparison to the state-of-the-art clustering methods regarding the learning time and the clustering performance measures used.

In Chapter 7, the conclusions summarize this Thesis.

1.3 Main Contributions

The main contributions of this thesis are summarized as follows:

- We introduce a generalized nonlinear transform model that, contrary to the synthesis model, which is based on data reconstruction, relies on a constrained data projection.

- We generalize the integrated marginal maximization (IMM) principle by taking into consideration the negative logarithm of the learning model for estimation of the NT model parameters, which enables efficient solutions for a number of applications.

- We propose several novel active content fingerprinting (ACFP) schemes under linear modulation and linear feature maps, where we show optimal closed form solutions and efficient algorithms with convergence guarantees by utilizing a special case of our NT model.

- We study the sparsifying transform model with an overcomplete transform matrix for an image denoising application. The considered model represents a reduction and a special case of our NT model. We not only propose an alternating algorithm with approximate and exact closed form solutions and convergence guarantees, but also introduce a novel problem formulation that addresses a trade-off between accelerated local convergence and a satisfactory solution under a small amount of data.

- We propose novel strategies for learning discriminative and robust NT representations that are useful for image recognition tasks in supervised and unsupervised setups. In addition, we consider task-centric self-collaboration. The estimation of the NT model parameters is based on our generalized IMM principle, which allows an efficient solution with a convergence guarantee to be implemented by an iterative alternating algorithm.

- We present a novel clustering concept based on (i) jointly learned NTs with minimum information loss and discriminative priors and (ii) min-max assignment over NT representations, where we introduce the simultaneous cluster and NT representation assignment principle, which is based on evaluating a score that approximates the discriminative log-likelihood in the transform domain.


Chapter 2 Modeling and Estimation of Nonlinear Transform

In this chapter, we outline the well-known synthesis model and its corresponding inverse problem. Then, we introduce our nonlinear transform model and its corresponding direct problem. Along the way, we also highlight the differences in the modeling approaches and the corresponding problems.

2.1 Sparse Synthesis Model vs Nonlinear Transform Model

2.1.1 Sparse Synthesis Model

As the name suggests, the main idea behind this model, which is used in many areas, is to synthesize a data vector from a set of predefined vectors that form a dictionary.

Deterministic Formulation. In the most general case, according to the synthesis model, a data sample x_i ∈ ℜ^N of dimensionality N is approximated by a linear combination of a few words (frame vectors) from a dictionary (frame¹) D ∈ ℜ^{N×M}, where the vector of combination coefficients y_i ∈ ℜ^M (referred to as a sparse data representation) satisfies ∥y_i∥₀ ≪ M, as:

$$x_i = D y_i + v_i, \tag{2.1}$$

where v_i ∈ ℜ^N denotes the approximation error, which is usually assumed to be Gaussian.

¹ A matrix D ∈ ℜ^{N×M} is said to be overcomplete if M > N. Equivalently, if the number M of columns d_m ∈ ℜ^N in D is bigger than the dimensionality N of d_m, i.e., M > N, one might also say that the set of vectors {d₁, d₂, ..., d_M} is linearly dependent and that this set forms a frame.


Probabilistic Formulation. In a probabilistic sense, we consider that x_i and y_i are random vectors and that D is a random matrix. The conditional probability distribution of x_i given the dictionary D can be expressed as:

$$p(x_i|D) = \int_{y_i \in \Re^M} p(x_i, y_i|D)\, dy_i = \int_{y_i \in \Re^M} p(x_i|y_i, D)\, p(y_i|D)\, dy_i, \tag{2.2}$$

where p(x_i|y_i,D) models the relation (2.1), i.e.:

$$p(x_i|y_i,D) \propto \exp\left(-\frac{1}{\beta_0}\|x_i - D y_i\|_2^2\right), \tag{2.3}$$

where β₀ is a scaling parameter. In the prior term p(y_i|D), it is usually assumed that y_i is independent of D, i.e., p(y_i) = p(y_i|D). Moreover, assuming that the entries of the representation y_i are i.i.d. and follow a Laplace distribution, we have that:

$$p(y_i) \propto \exp\left(-\frac{1}{\beta_1}\|y_i\|_1\right), \tag{2.4}$$

where ∥.∥₁ denotes the ℓ₁-norm and β₁ is a scaling parameter.

Learning the Model Parameters. Given CK data samples X = [x₁, ..., x_CK], we model the conditional probability p(X|D), which under the independence assumption between the data samples x_i decomposes as:

$$p(X|D) = \prod_{i=1}^{CK} p(x_i|D). \tag{2.5}$$

Moreover, instead of working only with p(X|D), we can use Bayes' rule and consider an approximate posterior:

$$p(D|X) \propto p(X|D)\, p(D), \tag{2.6}$$

where we disregard the term p(X), while the prior p(D) on the dictionary D is defined as:

$$p(D) \propto \exp(-\Omega_S(D)), \tag{2.7}$$

where Ω_S(.) is the prior measure that defines the properties of the dictionary D.

Under the above considerations, the maximum a posteriori (MAP) estimates of D and Y = [y₁, ..., y_CK] can be expressed as:

$$\{\hat{Y}, \hat{D}\} = \arg\max_{Y,D}\, p(D|X) \simeq \arg\max_{Y,D}\, p(X|D)\, p(D), \tag{2.8}$$

or, equivalently, by taking the negative logarithm of p(X|D)p(D), the problem reduces to:

$$\begin{aligned}
\{\hat{Y}, \hat{D}\} &\simeq \arg\min_{Y,D} \left[-\log p(X|D) - \log p(D)\right] = \arg\min_{Y,D} \left[-\sum_{i=1}^{CK} \log p(x_i|D) - \log p(D)\right] \\
&= \arg\min_{Y,D} \left[-\sum_{i=1}^{CK} \log \int_{y_i \in \Re^M} p(x_i|y_i,D)\, p(y_i)\, dy_i - \log p(D)\right].
\end{aligned} \tag{2.9}$$

The estimation of Ŷ and D̂ is still difficult to compute due to the integration over y_i. If we replace p(y_i, x_i|D) with its extreme value, then we end up with the following problem:

$$\{\hat{Y}, \hat{D}\} = \arg\min_{D} \sum_{i=1}^{CK} \min_{y_i} \left(\frac{1}{2}\|x_i - D y_i\|_2^2 + \lambda_1 \|y_i\|_1\right) + \Omega_S(D), \tag{2.10}$$

where we assumed that {1/β₀, 1/β₁} = {1/2, λ₁}.
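The step from (2.9) to (2.10) can be made explicit with a short worked calculation. The following is a sketch in which the constant c, collecting the normalization terms of (2.3) and (2.4) that depend on neither y_i nor D, is notation introduced here for illustration. Substituting (2.3) and (2.4) into the negative logarithm of the integrand in (2.9) gives:

$$-\log\left[p(x_i|y_i,D)\, p(y_i)\right] = \frac{1}{\beta_0}\|x_i - D y_i\|_2^2 + \frac{1}{\beta_1}\|y_i\|_1 + c.$$

Replacing the integral over y_i by the largest value of its integrand corresponds to minimizing this expression over y_i. With 1/β₀ = 1/2 and 1/β₁ = λ₁, and with the constant c dropped, this yields the inner minimization in (2.10).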

− Sparse Representation Estimation. Assuming that the dictionary D is given, (2.10) reduces, per individual sparse representation y_i, to:

$$\hat{y}_i = \arg\min_{y_i} \frac{1}{2}\|x_i - D y_i\|_2^2 + \lambda_1 \|y_i\|_1, \tag{2.11}$$

which represents an inverse problem w.r.t. y_i that is also known as a constrained regression problem.
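To make the iterative nature of this inverse problem concrete, the following minimal sketch solves (2.11) with the iterative shrinkage-thresholding algorithm (ISTA). ISTA is one standard solver chosen here for illustration and is not a method prescribed by the thesis. The function names, the toy dictionary, the value of λ₁ and the iteration count are all assumptions made for the example.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major: D[n][m], with N rows and M columns


def matvec(D: Matrix, y: Vector) -> Vector:
    """Return the product D y."""
    return [sum(d * v for d, v in zip(row, y)) for row in D]


def rmatvec(D: Matrix, r: Vector) -> Vector:
    """Return the product D^T r."""
    return [sum(D[n][m] * r[n] for n in range(len(D))) for m in range(len(D[0]))]


def soft_threshold(v: Vector, t: float) -> Vector:
    """Element-wise sign(v) * max(|v| - t, 0)."""
    return [math.copysign(max(abs(a) - t, 0.0), a) for a in v]


def objective(D: Matrix, x: Vector, y: Vector, lam: float) -> float:
    """Value of 0.5 * ||x - D y||_2^2 + lam * ||y||_1, as in (2.11)."""
    r = [a - b for a, b in zip(x, matvec(D, y))]
    return 0.5 * sum(v * v for v in r) + lam * sum(abs(v) for v in y)


def ista(D: Matrix, x: Vector, lam: float, n_iter: int = 500) -> Vector:
    """Approximate the minimizer of (2.11) by proximal gradient steps."""
    # The squared Frobenius norm of D upper-bounds the largest eigenvalue of
    # D^T D, so its inverse is a safe (conservative) step size.
    step = 1.0 / sum(d * d for row in D for d in row)
    y = [0.0] * len(D[0])
    for _ in range(n_iter):
        residual = [a - b for a, b in zip(matvec(D, y), x)]  # D y - x
        grad = rmatvec(D, residual)                          # D^T (D y - x)
        y = soft_threshold([yi - step * g for yi, g in zip(y, grad)], step * lam)
    return y


# Toy overcomplete dictionary (N = 2, M = 3); all values are illustrative.
D = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
x = [1.0, 1.0]
y_hat = ista(D, x, lam=0.1)
print(y_hat, objective(D, x, y_hat, 0.1))
```

Each iteration costs two matrix-vector products, and many iterations are needed in general, which is the computational burden that motivates the direct problem considered next.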

2.1.2 Nonlinear Transform Model

In this thesis, we focus on a model that describes a generalized nonlinear transform representation y_i for the data sample x_i.

Deterministic Formulation. We express our nonlinear transform model as:

$$A x_i = y_i + z_i, \tag{2.12}$$

where A ∈ ℜ^{M×N} is the linear map of the nonlinear transform, y_i is the nonlinear transform representation and z_i ∈ ℜ^M is the nonlinear transform error vector. In contrast to the synthesis model, in the nonlinear transform model one assumes that the representation y_i results from applying to A x_i a generalized element-wise nonlinearity that is parameterized by θ, i.e.,

$$y_i = f_{\boldsymbol{\theta}}(A x_i), \tag{2.13}$$

where the parameters θ allow us not only to consider a notion of sparsity, but also to take into account robustness or discrimination.
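The relations (2.12) and (2.13) can be sketched in a few lines of code. This is a minimal illustration under stated assumptions: the helper names, the toy linear map A, the sample x and the two nonlinearities used as f_θ (ReLU and hard thresholding with an assumed threshold of 0.5) are choices made for the example, not values taken from the thesis.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # row-major: A[m][n], with M rows and N columns


def relu(q: Vector) -> Vector:
    """Element-wise max(q, 0)."""
    return [max(v, 0.0) for v in q]


def hard_threshold(lam: float) -> Callable[[Vector], Vector]:
    """Build an element-wise nonlinearity that zeroes entries with |q| <= lam."""
    def f(q: Vector) -> Vector:
        return [v if abs(v) > lam else 0.0 for v in q]
    return f


def nt_representation(
    A: Matrix, x: Vector, f: Callable[[Vector], Vector]
) -> Tuple[Vector, Vector]:
    """Return (y, z) with y = f(A x) as in (2.13) and z = A x - y as in (2.12)."""
    q = [sum(a * v for a, v in zip(row, x)) for row in A]  # linear map A x
    y = f(q)                                               # element-wise nonlinearity
    z = [qi - yi for qi, yi in zip(q, y)]                  # transform error vector
    return y, z


# Toy overcomplete linear map (M = 3, N = 2); all values are illustrative.
A = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, -1.0]]
x = [0.2, 1.0]

y_relu, z_relu = nt_representation(A, x, relu)
y_hard, z_hard = nt_representation(A, x, hard_threshold(0.5))
print(y_relu, z_relu)
print(y_hard, z_hard)
```

In both cases the representation is obtained by one matrix-vector product followed by an element-wise operation, and the error vector z_i is simply what the nonlinearity removes from A x_i.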

− Examples. One simple example of such a transform is the sparsifying transform model [119], where the parameter is θ = λ1 ∈ ℜ^M, λ ∈ ℜ, with a hard thresholding function f_h that acts as the nonlinear transform, i.e.:

$$f_h(y_i(m)) = \begin{cases} y_i(m), & \text{if } |y_i(m)| > \lambda, \\ 0, & \text{otherwise.} \end{cases} \tag{2.14}$$

Another example is the soft thresholding function, i.e.:

$$f_s(y_i(m)) = \begin{cases} y_i(m) - \lambda, & \text{if } y_i(m) > \lambda, \\ y_i(m) + \lambda, & \text{if } -y_i(m) > \lambda, \\ 0, & \text{otherwise,} \end{cases} \tag{2.15}$$

which can be compactly expressed as:

$$f_s(y_i) = \mathrm{sign}(A y_i) \odot \max(|A y_i| - \lambda \mathbf{1}, 0), \tag{2.16}$$

where sign is the sign function and ⊙ is the Hadamard product. The third example is the ternary encoding:

$$f_t(y_i) = \mathrm{sign}(\max(|A y_i| - \lambda \mathbf{1}, 0)). \tag{2.17}$$

Another interesting example is the ReLU activation function, which is commonly used in deep neural networks, i.e.:

$$f_{\mathrm{ReLU}}(y_i) = \max(A y_i, 0). \tag{2.18}$$

Probabilistic Formulation. To introduce a probabilistic interpretation of the nonlinear transform model, we consider the marginal probability distribution, which we express as:

$$p(y_i|x_i,A) = \int_{\boldsymbol{\theta}} p(y_i, \boldsymbol{\theta}|x_i, A)\, d\boldsymbol{\theta}. \tag{2.19}$$

Furthermore, we can use the chain rule, which leads to:

[...] (2.20)

We are interested in modeling p(θ, y_i|x_i, A), where, under Bayes' rule, we focus on the proportional form that we express as:

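$$p(y_i, \boldsymbol{\theta}|x_i, A) \propto p(x_i|\boldsymbol{\theta}, y_i, A)\, p(y_i, \boldsymbol{\theta}|A). \tag{2.21}$$

In the simplest case, p(x_i|θ, y_i, A) models the residual vector z_i = A x_i − y_i as:

$$p(x_i|\boldsymbol{\theta}, y_i, A) \propto \exp\left(-\frac{1}{\beta_0}\|A x_i - y_i\|_2^2\right), \tag{2.22}$$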

where β₀ is a scaling parameter. We note that any additional knowledge about the residual vector z_i can be used and added to the model p(x_i|θ, y_i, A).

In order to simplify the consideration, we neglect the dependence on A by assuming that:

$$p(y_i, \boldsymbol{\theta}|A) = p(y_i, \boldsymbol{\theta}) \propto \exp\left(-\frac{1}{\beta_1} m(\boldsymbol{\theta}, y_i)\right), \tag{2.23}$$

where m(.) : ℜ^M × ℜ^M → ℜ is a measure and β₁ is a scaling parameter. The motivation behind the use of such a parametric prior p(y_i, θ) on y_i is to accommodate a class of assumptions related to sparsity, robustness and/or discrimination.

Learning the Model Parameters. Given CK data samples X = [x₁, ..., x_CK], we consider the following learning model:

$$\begin{aligned}
p(Y, A|X) = p(Y|A, X)\, p(A|X) &= \prod_{i=1}^{CK} \left[\int_{\boldsymbol{\theta}} p(\boldsymbol{\theta}, y_i|x_i, A)\, d\boldsymbol{\theta}\right] p(A|x_i) \\
&\propto \prod_{i=1}^{CK} \left[\int_{\boldsymbol{\theta}} p(x_i|\boldsymbol{\theta}, y_i, A)\, p(y_i, \boldsymbol{\theta})\, d\boldsymbol{\theta}\right] p(A|x_i),
\end{aligned} \tag{2.24}$$

where we use a simplification for the prior on the linear map A, i.e., p(A|x_i) = p(A), and we define it as:

$$p(A) \propto \exp(-\Omega(A)), \tag{2.25}$$

with Ω(.) denoting the prior measure which defines the properties of the rows of A.

Minimizing the exact negative logarithm of our learning model (2.24) over Y, θ and A is difficult, since we have to integrate in order to compute the marginal and the partition function of the prior p(y_i, θ). Instead of minimizing the exact negative logarithm of the marginal ∫_θ p(x_i|θ, y_i, A) p(y_i, θ) dθ, we consider minimizing the negative logarithm of its maximum point-wise estimate, i.e.,

$$\int_{\boldsymbol{\theta}_{est}} p(x_i|\boldsymbol{\theta}_{est}, y_i, A)\, p(y_i, \boldsymbol{\theta}_{est})\, d\boldsymbol{\theta}_{est} \leq D\, p(x_i|\boldsymbol{\theta}, y_i, A)\, p(y_i, \boldsymbol{\theta}), \tag{2.26}$$

where we assume that θ are the parameters for which p(x_i|θ_est, y_i, A) p(y_i, θ_est) attains its maximum value and D is a constant. Furthermore, we use the proportional relation (2.21) and, by disregarding the partition function related to the prior p(y_i, θ), we end up with the following problem formulation:

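$$\{\hat{Y}, \hat{\boldsymbol{\theta}}, \hat{A}\} = \arg\min_{Y, \boldsymbol{\theta}, A} \sum_{i=1}^{CK} \Big( \overbrace{\tfrac{1}{2}\|A x_i - y_i\|_2^2}^{-\log p(x_i|\boldsymbol{\theta}, y_i, A)} + \overbrace{\lambda_1 m(\boldsymbol{\theta}, y_i)}^{-\log p(y_i, \boldsymbol{\theta})} \Big) + \overbrace{\Omega(A)}^{-\log p(A)}, \tag{2.27}$$

where {2, 1/λ₁} = {β₀, β₁}.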

We note that, in general, depending on the measures used to describe p(x_i|θ, y_i, A) and p(θ, y_i), even the exact minimization w.r.t. the point-wise estimate might still be difficult to compute. The reason is that, in order for p(x_i|θ, y_i, A) and p(θ, y_i) to be proper probability distributions, they have to contain partition functions, which can be exactly evaluated only by integrating over the involved parameters x_i, y_i, θ and A.

Alternatively, the maximization of our learning model ∏_{i=1}^{CK} ∫_θ p(x_i|θ, y_i, A) p(y_i, θ) dθ p(A|x_i) over any of the variables y_i, θ and A can be seen as an approximative form of integrated marginal maximization (IMM) [116] of this model over the respective y_i, θ or A, which can be summarized by the following steps:

- Approximative maximization of p(x_i|y_i, θ, A) with prior p(θ, y_i) over y_i,

- Approximative maximization of p(x_i|y_i, θ, A) with prior p(θ, y_i) over θ and

- Approximative maximization of ∏_{i=1}^{CK} p(x_i|y_i, θ, A) with prior p(A) over A.

− NT Representation Estimation. Assuming that the linear map A and the parameters θ are given, the exact estimation of y_i w.r.t. our model is equivalent to computing the minimum of the negative logarithm of p(x_i|θ, y_i, A) p(θ, y_i), i.e.:

$$\hat{y}_i = \arg\min_{y_i} \left[-\log p(x_i|y_i, \boldsymbol{\theta}, A) - \log p(\boldsymbol{\theta}, y_i)\right], \tag{2.28}$$

where we again point out that this might be difficult to compute, depending on the chosen measures that describe p(x_i|y_i, θ, A) and p(θ, y_i), which involve a possible integration in the corresponding partition functions.
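As an illustration of how this direct problem can become a low-complexity one, consider the special case m(θ, y_i) = ∥y_i∥₁ with the partition functions disregarded as in (2.27). This particular choice of measure is an assumption made here for the example, not the general model. Since A x_i is then a fixed vector, the problem separates over the entries of y_i:

$$\hat{y}_i = \arg\min_{y_i} \frac{1}{2}\|A x_i - y_i\|_2^2 + \lambda_1 \|y_i\|_1 = \arg\min_{y_i} \sum_{m=1}^{M} \left[\frac{1}{2}\big((A x_i)(m) - y_i(m)\big)^2 + \lambda_1 |y_i(m)|\right],$$

and each scalar problem has the well-known closed form solution given by soft thresholding:

$$\hat{y}_i(m) = \mathrm{sign}\big((A x_i)(m)\big)\, \max\big(|(A x_i)(m)| - \lambda_1, 0\big), \quad m = 1, \ldots, M.$$

In contrast to the inverse problem (2.11), no iterative solver is required in this case: the estimate costs one matrix-vector product followed by an element-wise operation.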

