Alternating Conditional Expectation (ACE) applied to
classification and recommendation problems
by
Fabian Ariel Kozynski Waserman
Master of Science, Massachusetts Institute of Technology, 2013
Telecommunications Engineer, Universidad ORT Uruguay, 2011
Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of
Electrical Engineer
in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology
June 2018
© 2018 Massachusetts Institute of Technology. All Rights Reserved. The author hereby grants to M.I.T. permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole and in part in any medium now known or hereafter created.
Signature of Author: Signature redacted
Fabian Kozynski
Department of Electrical Engineering and Computer Science
May 23, 2018
Certified by: Signature redacted
Lizhong Zheng
Professor of Electrical Engineering and Computer Science
Thesis Supervisor
Accepted by: Signature redacted
Leslie A. Kolodziejski
Professor of Electrical Engineering and Computer Science
Chair, Departmental Committee for Graduate Students
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
JUN 18 2018
Alternating Conditional Expectation (ACE) applied to classification and
recommendation problems
by Fabian Ariel Kozynski Waserman
Submitted to the Department of Electrical Engineering and Computer Science on May 29, 2018
in partial fulfillment of the requirements for the degree of Electrical Engineer
Abstract
In this thesis, a geometric framework for describing relevant information in a collection of data is applied to the general problem of selecting informative features (dimension reduction) from high-dimensional data. The framework can be used in an unsupervised manner, extracting universal features that can be used later for general classification of data.

This framework is derived by applying local approximations on the space of probability distributions and a small perturbation approach. With this approach, different information theoretic results can be interpreted as linear algebra optimizations based on the norms of vectors in a linear space, which are, in general, easier to carry out. Fundamentally, using known procedures such as Singular Value Decomposition (SVD) and Principal Component Analysis (PCA), dimension reduction for maximizing power can be achieved in a straightforward manner. Using the geometric framework, we relate the calculation of the SVD of a particular matrix related to a probabilistic channel to the application of Alternating Conditional Expectation (ACE) in the problem of optimal regression.

The key takeaway of this method is that such problems can be studied in the space of distributions of the data and not the space of outcomes. This geometric framework allows us to give an operational meaning to information metrics in the context of data analysis and feature selection. Additionally, it provides a method to obtain universal classification functions without knowledge of the important features of the problem. This framework is then applied to the problem of data classification and analysis, with satisfactory results.
Thesis Supervisor: Lizhong Zheng
Acknowledgments
I would like to thank Prof. Lizhong Zheng for his guidance throughout my research and the years working alongside him. He helped me go through many difficult stages during my time at MIT. I would like to thank Prof. Munther Dahleh for his support and advice. Additionally, I'd like to thank Anuran Makur, David Qiu, Mohamed AlHajri, Dr. Mina Karzand and Prof. Guy Bresler for their company and many discussions during my time at MIT.
Special thanks go to Jenny, who has been with me during the good and the bad times, helping me along the way with her continuous support and encouragement.
These past years at MIT would not have been possible without my friends here. Thanks go to Boris, Dan, Rachael, Steven, Bernhard, Steph, Ahmed, and many more who have been a steady source of wisdom, entertainment and continuous support. It was my pleasure meeting you and enjoying so many joyful times. Special thanks go to the wonderful people I've worked with at SP during my time living here, who make this community a great place to live in.
Last but not least, I want to thank my friends Ari, Rogelio, Frederick, Daniel, my parents Déborah and Oscar, and my sister Mariel, who, even though they are far away, feel really close.
Contents
Abstract 3
Acknowledgments 4
List of Figures 9
List of Tables 11
Introduction 13
1 Background of Maximal Rényi correlation and ACE 15
1.1 Hirschfeld-Gebelein-Rényi correlation 15
1.2 Alternating Conditional Expectation Algorithm 16
2 Geometric framework and its relationship to ACE 19
2.1 Geometric framework 19
2.2 Relationship between ACE, the DTM and Singular Value Decomposition 22
2.3 Sample complexity and the estimation of P_{XY} 22
2.4 Application to Classification problems 23
3 Analysis and classification experiments 25
3.1 Analysis of the problem 25
3.2 Description of the data and experiments 27
4 Discussion and future work directions 33
List of Figures

3.1 Examples of images of ones and twos from MNIST 26
3.2 Error percentage as a function of the convex coefficient used to combine pixel and pairs scores 30
List of Tables
Introduction
The Hirschfeld-Gebelein-Rényi [1] maximal correlation is a well-known measure of statistical dependence between two (possibly categorical) random variables. In inference problems, the functions found to be optimal in this correlation can be viewed as features of observed data that carry the largest amount of information about some latent variables. These features are in general non-linear functions, and are particularly useful in processing high-dimensional observed data and reducing it to a more manageable number of dimensions. The Alternating Conditional Expectations (ACE) algorithm is an efficient way to compute the maximal correlation functions associated with the Rényi maximal correlation, as a particular case of finding optimal regression functions.
In this thesis, we present a geometric framework for working with small perturbations of discrete distributions and the effect of noisy channels applied to data sampled from said distributions. Additionally, we provide an information theoretic approach to interpreting the ACE algorithm as computing the singular value decomposition of a linear map between spaces of probability distributions. With this approach, the information theoretic optimality of the ACE algorithm can be demonstrated, its convergence rate and sample complexity can be analyzed, and, more importantly, it can be generalized to compute multiple pairs of correlation functions from samples. One of the main results is the improvement in sample complexity over other methods.
From the geometric approach presented, algorithms are proposed that are flexible enough to handle diverse data types, in particular multi-modal data with different types, time-scales, and qualities. The use of such algorithms is demonstrated by applying them to real-world data: pattern recognition of handwritten digits from the MNIST database [2]. This new geometric method provides a new analytical tool for the processing of partial information, and systematic understandings of many existing data analysis solutions.
This thesis is organized as follows. In chapter 1, past work in the area is summarized. This is expanded in chapter 2, where we present our model and the different insights it provides, connecting it with the topics discussed in chapter 1. In chapter 3, we describe the classification problem to analyze, and the framework is tested via simulations in Python. Finally, in chapter 4, conclusions about the work are presented along with directions for future work and extensions.
Chapter 1
Background of Maximal Rényi correlation and ACE
When working with datasets, a natural goal is to fit correlated data by establishing a function (or functions) between the variables, in order to predict the relationship between new data arising from the same distribution. The simplest example is that of fitting a dataset (X, Y) with a linear function f(X) = aX + b that minimizes the quadratic error E[|f(X) − Y|²]. This least mean square estimation is known to be optimal in the case of jointly normally distributed data subject to additive Gaussian uncorrelated noise.

In many situations, however, the data does not present those characteristics. Data can come from a finite support dataset or a discrete (categorical) one. Additionally, it can be subject to other sources of randomness that are not Additive White Gaussian Noise (AWGN). In order to measure the strength of the dependence of two numerical random variables, the correlation between them is defined as
cor(X, Y) = cov(X, Y) / (σ_X σ_Y)

which takes values in the interval [−1, 1], with the extremes signifying that the variables are linearly dependent, and 0 indicating that they are linearly uncorrelated (but not necessarily independent). Despite this definition being simple and straightforward, and the value easy to calculate in many cases, it only indicates one type of correlation (linear) between the variables. If their relationship could be explained by a different kind of correlation, this measure is not sufficient to indicate that.
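As a quick numerical illustration of this last point (not part of the thesis), the following sketch estimates the sample correlation for a perfectly linear relationship and for a deterministic but non-linear one; the data and the `corr` helper are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def corr(x, y):
    """Sample Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).mean() / (x.std() * y.std())

x = rng.standard_normal(100_000)
y_lin = 2 * x + 1   # linear relationship: correlation is 1
y_sq = x ** 2       # deterministic but non-linear: correlation is near 0

print(corr(x, y_lin))   # close to 1
print(corr(x, y_sq))    # close to 0, even though Y is a function of X
```

Even though Y = X² is fully determined by X, the linear correlation is essentially zero, which is exactly the limitation the maximal correlation addresses.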
1.1 Hirschfeld-Gebelein-Rényi correlation
In [1], the author introduces the Hirschfeld-Gebelein-Rényi maximal correlation (from here on, Rényi correlation). For a pair of random (non-constant) variables X and Y, possibly categorical, the Rényi correlation is defined as
ρ = sup E[f(X) g(Y)]

where the supremum is taken over functions f: X → R, g: Y → R such that E[f(X)] = E[g(Y)] = 0 and E[f²(X)] = E[g²(Y)] = 1.
The conditions required of f and g are such as to normalize and scale the functions. In the case where we can find optimal functions (the supremum is a maximum), we will denote them f* and g*. This measure of correlation satisfies the following properties:
" The correlation is bounded to the interval [0, 1].
* The correlation p = 0 if and only if X and Y are independent random variables. " The correlation p = 1 if a(X) = O(Y) for some functions a, /.
Another important consequence of the Rényi correlation is that the optimal functions (if they exist) satisfy the following expectation equations

f*(X) ∝ E[g*(Y) | X]     (1.1)
g*(Y) ∝ E[f*(X) | Y]

where the proportionality constants are such that the second moment condition on the functions f*, g* is satisfied.

1.2 Alternating Conditional Expectation Algorithm
Determining the optimal (or close to optimal, in the supremum case) functions that maximize the Rényi correlation cannot be done exactly for general random variables. However, making use of the property established in (1.1), [3] presents an algorithm that iteratively approximates the optimal functions.
This algorithm, called Alternating Conditional Expectation (ACE), is presented as a way to perform regression on p independent variables and 1 dependent variable. Given predictor random variables X1, ..., Xp and a response random variable Y, the algorithm provides as output real functions g, f1, ..., fp of the random variables that attempt to minimize the quantity

e²(g, f1, ..., fp) = E[(g(Y) − Σ_i f_i(X_i))²] / E[(g(Y))²]
These functions are then the optimal functional relationship between the predictor random variables and the response, in terms of minimizing quadratic error. As no constraints besides the first and second moment are imposed on these functions, no particular structure is to be expected. However, in the case of Gaussian random variables, the resulting functions are linear in the variables (as expected from the LMS regression).
This algorithm, in the case p = 1, converges in its output to the functions f* and g* that maximize the Rényi correlation.

The algorithm presented in [3] is carried out by iteratively fixing p of the functions and estimating the remaining one using the corresponding conditional expectation, renormalizing where appropriate. The paper also presents a proof of existence of the optimal functions and convergence of the algorithm under mild assumptions.
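The iteration described above can be sketched for the case p = 1 with categorical samples. This is a minimal illustrative implementation, not the original authors' code; the function name `ace`, the random initialization, and the fixed number of passes are choices made for the example.

```python
import numpy as np

def ace(x, y, n_iter=100, seed=0):
    """Sketch of ACE for p = 1 with categorical samples x, y: alternately
    set f(x) = E[g(Y) | X = x] and g(y) = E[f(X) | Y = y], renormalizing
    to zero mean and unit variance after each conditional expectation."""
    rng = np.random.default_rng(seed)
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    g = rng.standard_normal(yi.max() + 1)

    def normalize(h, idx):
        h = h - h[idx].mean()
        return h / h[idx].std()

    for _ in range(n_iter):
        f = np.bincount(xi, weights=g[yi]) / np.bincount(xi)   # E[g(Y)|X]
        f = normalize(f, xi)
        g = np.bincount(yi, weights=f[xi]) / np.bincount(yi)   # E[f(X)|Y]
        g = normalize(g, yi)
    rho = (f[xi] * g[yi]).mean()   # empirical maximal correlation
    return f, g, rho

# Y is a deterministic relabeling of X (its parity), so rho should be 1
rng = np.random.default_rng(1)
x = rng.integers(0, 4, 5000)
f, g, rho = ace(x, x % 2)
print(rho)   # close to 1
```

Because Y = X mod 2 satisfies α(X) = β(Y) with α the parity function, the Rényi correlation is 1, and the iteration recovers it.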
Chapter 2
Geometric framework and its relationship to ACE
In this chapter we describe a geometric framework that is applied to random variables in order to interpret their relationship in the space of distributions. This framework provides a different interpretation of some information theoretic quantities. In addition, a clear connection with the Rényi correlation and the ACE algorithm seen in chapter 1 will be explored.
The basics of the framework were developed in collaboration with Anuran Makur and Prof. Lizhong Zheng.
2.1 Geometric framework
Throughout this section, we will be working with finite spaces and we will associate real functions on those spaces with the associated vectors, indexed by the corresponding element.
Consider a random variable X over the finite sample space X = {1, ..., |X|} with distribution P_X. Unless specified otherwise, E_X corresponds to the expected value over the (base) distribution P_X.

Consider now the family of probability distributions (when valid)

F_X = { P_φ : P_φ(x) = P_X(x) + √(P_X(x)) φ(x), φ: X → R }

where Σ_x √(P_X(x)) φ(x) = 0, in order to have a valid distribution. These conditions mean that the vectors φ (as indexed by the elements of X) belong to the linear space orthogonal to the vector [√(P_X(x))]_x. The set of valid vectors are those in this space that translate to distributions in the simplex (valid probability distributions). We will call these vectors perturbation vectors.

This definition by itself is interesting, as we have provided a linear space that corresponds to distributions that are close (in ℓ1 norm, for example) to a given distribution.
It becomes more interesting when we incorporate additional distributions or random variables.
2.1.1 Connection with K-L divergence
Consider two distributions P1 and P2 from the same family F_X, with corresponding perturbation vectors εφ1 and εφ2. We can easily verify that

D(P1‖P2) = (ε²/2) ‖φ1 − φ2‖² + o(ε²)

where ‖·‖ is the Euclidean norm. This provides a nice relationship between linear algebra and information theoretic concepts. In particular, divergence minimization can be translated to distance minimization.
2.1.2 Connection with log likelihood

Consider a probability distribution P1 in the family F_X, associated with the small perturbation vector φ1. We can see that the log likelihood L1 between that distribution and the base distribution P_X satisfies

L1(x) = log(P1(x)/P_X(x)) ≈ φ1(x)/√(P_X(x)).

Then, we can define the vector f1(x) = φ1(x)/√(P_X(x)), which satisfies E_X[f1(X)] = 0. We will call this vector the score function associated with the perturbation vector φ1 (or with the distribution P1), and we will restrict score functions to be normalized with respect to P_X, meaning that E_X[f1²(X)] = 1. This is equivalent to restricting φ1 to be of unit (Euclidean) norm.

Hypothesis testing
Consider now two distributions P1, P2 with associated perturbation vectors φ1, φ2 respectively. Assume we collected n independent observations from one of the two distributions, yielding an empirical distribution P̂(x) = P_X(x) + √(P_X(x)) ψ(x) ∈ F_X for some perturbation vector ψ. In this case, the log likelihood ratio between the candidate distributions, averaged over the samples, is

(1/n) Σ_i log(P1(x_i)/P2(x_i)) ≈ ⟨ψ, φ1 − φ2⟩.

This means that evaluating the log likelihood is equivalent to projecting the perturbation vector associated with the observations onto the line given by the perturbation vectors of the distributions of interest. This "direction of information" will be important later in classification problems.
2.1.3 Divergence Transition Matrix
Let X be a random variable defined on a finite X as before, with distribution P_X, and let Y be a random variable defined on a finite Y. Finally, assume there is an observation model P_{Y|X}. Given a distribution P(x) = P_X(x) + √(P_X(x)) φ(x) ∈ F_X, the effect of the channel on this distribution yields

Q(y) = Σ_x P_{Y|X}(y|x) P(x) = Q_Y(y) + √(Q_Y(y)) [Bφ](y)

where Q_Y = P_{Y|X} ∘ P_X is the output distribution of the channel under the base distribution, the matrix B ∈ R^{|X|×|Y|} satisfies

B(x, y) = P_{XY}(x, y) / (√(P_X(x)) √(Q_Y(y))),

and [Bφ](y) = Σ_x B(x, y) φ(x) is the product of the matrix B with the vector representation of φ.

We call this matrix B the Divergence Transition Matrix (DTM) associated with the base distribution P_X and the channel P_{Y|X}. Another way of writing this matrix is

B = [√P_X] · W · [√Q_Y]^{-1}

where W represents the conditional probability matrix associated with the channel, and the two [·] terms are diagonal matrices whose non-zero entries are those of the corresponding vectors.
This DTM is a linear transformation that maps the space of distributions that are close to P_X into the space of distributions that are close to Q_Y, by mapping the corresponding perturbation vectors.
An important observation is that √(P_X(x)) is a singular vector of the matrix B, corresponding to the singular value 1, paired with the singular vector √(Q_Y(y)) on the output side. These vectors do not satisfy the orthogonality constraint of the perturbation vectors, and correspond to constant score functions.

Another important observation is that the matrix Bᵀ is the DTM of the pair (Y, X), corresponding to the reverse channel transforming Y into X.
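These observations are easy to verify numerically. The sketch below builds the DTM from an invented base distribution and channel, and checks that the largest singular value is 1, with √P_X and √Q_Y as the corresponding singular vector pair:

```python
import numpy as np

px = np.array([0.5, 0.3, 0.2])            # invented base distribution P_X
w = np.array([[0.8, 0.2],                 # invented channel P_{Y|X}; rows sum to 1
              [0.3, 0.7],
              [0.5, 0.5]])
qy = px @ w                               # output distribution Q_Y

# B = [sqrt(P_X)] W [sqrt(Q_Y)]^{-1}, i.e. B(x,y) = P_XY(x,y)/sqrt(P_X(x) Q_Y(y))
b = np.diag(np.sqrt(px)) @ w @ np.diag(1.0 / np.sqrt(qy))

s = np.linalg.svd(b, compute_uv=False)
print(s)                    # all singular values at most 1; the largest is 1
print(np.sqrt(px) @ b)      # equals sqrt(Q_Y): the singular-value-1 pair
```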
The K-L divergence between P ∈ F_X and P_X is given by D(P‖P_X) ≈ (1/2)‖φ‖². Using the Data Processing Inequality, we can see that D(Q‖Q_Y) ≈ (1/2)‖Bφ‖² ≤ (1/2)‖φ‖². This indicates that all the singular values of the matrix B associated with the vectors that span the space orthogonal to the previously mentioned vectors are no greater than 1. In particular, letting σ1 be the second greatest singular value, the Rényi maximal correlation satisfies ρ = σ1,
where the maximizing functions are given by

f*(x) = φ1(x)/√(P_X(x))   and   g*(y) ∝ [Bφ1](y)/√(Q_Y(y))

where φ1 and Bφ1 are the singular vectors associated with σ1 [3, 4]. This relationship converts finding the Rényi correlation, or running ACE on samples, into the problem of finding the Singular Value Decomposition (SVD) of the corresponding DTM matrix.
2.2 Relationship between ACE, the DTM and Singular Value Decomposition
As mentioned in the previous section, the Rényi correlation, ACE and the SVD of the DTM are closely related.
In particular, consider a score function f(x) = φ(x)/√(P_X(x)) associated with a distribution P ∈ F_X, and let g(y) = [Bφ](y)/√(Q_Y(y)) be the score function of the corresponding output distribution Q ∈ F_Y. Then we have

g(y) = (1/√(Q_Y(y))) Σ_x B(x, y) φ(x)
     = Σ_x (P_{XY}(x, y) / (√(P_X(x)) Q_Y(y))) φ(x)
     = Σ_x P_{X|Y}(x|y) f(x)
     = E[f(X) | Y = y].

Therefore, applying the DTM to a perturbation vector φ corresponds to calculating the conditional expectation of the associated score function. This means that the ACE algorithm described in [3] is calculating the second largest singular value and the singular vectors of the associated DTM. In particular, finding the "optimal" score functions (in the sense that the ratio of K-L divergences given by the Data Processing Inequality is highest), which is equivalent to finding the SVD for the second largest singular value, can be done by running ACE on the data. We can also see that multiplying the perturbation vectors by Bᵀ corresponds to taking the conditional expectation with respect to X of the g score functions. Therefore, an iteration of the ACE algorithm corresponds to multiplying the perturbation vector φ by BᵀB, or the perturbation vector ψ by BBᵀ.

2.3 Sample complexity and the estimation of P_{XY}
When applying these ideas on real datasets, there are two main considerations:
1. We do not have the actual distribution P_{XY}, but an empirical one.

2. Estimating some elements of P_{XY} may be hard if they are too small.
The first issue, while not critical, requires that we study the usefulness of the score functions obtained by applying ACE to the empirical distribution instead of the real one. Remember that we want to apply these functions to data that comes from the true P_{XY}.
The second issue is of utmost importance, as it dictates the number of samples needed. In general, to distinguish the non-zero elements of the distribution matrix, we need at least a number of samples on the order of the inverse of the smallest non-zero value, which decays with the cardinality of the data as 1/(|X||Y|). As we only need the (second) highest singular value and its associated vectors in order to obtain score functions to perform classification, we may be able to skip estimating the whole probability distribution in lieu of running ACE on the data.
These two issues are studied in [4], with the conclusion that, under certain conditions, the empirical distribution can be used in ACE instead of the true distribution, yielding good results, and that running ACE, instead of trying to accurately estimate the probability distribution, achieves similar performance with a reduced number of samples.
2.4 Application to Classification problems
Consider the three-variable Markov chain U → X → Y, where (U, X, Y) takes values in a space U × X × Y of finite cardinality. We can consider U to be a parameter of the distribution of X of interest, and Y to be observations of X after passing through a noisy channel. We can interpret this model as having P_{X|U}(·|u) ∈ F_X, with a different perturbation vector φ_u for each U = u.
Given a dataset (X_i, Y_i) from the previous Markov chain (where we may or may not have direct access to X), classifying the data into classes corresponds to determining the value of the hidden parameter U for each data point. In order to do this, we use the previous ideas to generate score functions.
We want to extract the directions in the space of perturbations around P_X that convey the most information, in order to assign each datum (x_i, y_i) to its correct U = u. To do this, we can submit X to different noisy channels and observe the output. For a randomly applied noise that shares no correlation with the data, we would expect the main directions in which the data is differentiated to be propagated, and to correspond to directions of large change. Therefore, we can use the score functions obtained by this process to differentiate and classify the data.
Notice that these functions may not be self-evident from observing the data, as we do not know in advance which features of the data we should be making observations on. This allows us to do "universal feature extraction".
2.4.1 Multiple score functions
The process described above can be generalized to obtaining multiple score functions of decreasing predicting power (correlation). This can be done by obtaining the successive singular vectors of the DTM (by decreasing singular value). Each pair of singular vectors will provide score functions that are uncorrelated with the others.
By taking several of these score functions, instead of just the one associated with the (second) highest singular value, we can decide how much predicting power we want to compute.
The algorithm to determine these multi-level score functions can be performed by modifying the ACE algorithm to run in parallel.
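One way to sketch such a parallel variant is orthogonalized power iteration on BᵀB, which keeps the successive directions uncorrelated. The function below is an illustrative implementation with an invented joint distribution, not the thesis code; `top_k_scores` and its parameters are names chosen for the example.

```python
import numpy as np

def top_k_scores(pxy, k, n_iter=500, seed=0):
    """Sketch of a 'parallel ACE': orthogonalized power iteration on B^T B
    that returns k score functions of decreasing correlation."""
    px, qy = pxy.sum(axis=1), pxy.sum(axis=0)
    b = pxy / np.sqrt(np.outer(px, qy))            # DTM
    rng = np.random.default_rng(seed)
    phi = rng.standard_normal((len(px), k + 1))
    phi[:, 0] = np.sqrt(px)                        # known singular-value-1 direction
    for _ in range(n_iter):
        phi = b @ (b.T @ phi)                      # one double ACE-style iteration
        phi, _ = np.linalg.qr(phi)                 # keep the directions uncorrelated
    sigma = np.linalg.norm(b.T @ phi, axis=0)      # converged singular values
    fs = phi[:, 1:] / np.sqrt(px)[:, None]         # drop the trivial direction
    return fs, sigma[1:]

pxy = np.array([[0.10, 0.05, 0.05, 0.05],
                [0.05, 0.15, 0.05, 0.05],
                [0.05, 0.05, 0.20, 0.15]])
pxy = pxy / pxy.sum()
fs, sigma = top_k_scores(pxy, 2)
print(sigma)   # second and third singular values, in decreasing order
```

The QR step plays the role of the renormalization in ACE while also enforcing that each new score function is uncorrelated with the previous ones.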
Chapter 3
Analysis and classification
experiments
In this chapter we present experiments used to test the application of the geometric framework outlined in chapter 2. These experiments consist of the unsupervised classification of handwritten images from the MNIST dataset [2]. This database has thousands of handwritten representations of the digits 0 to 9, and is extensively used as a classification benchmark.
In the experiments presented here, we will study the effect of a particular strategy for summarizing the data obtained from multiple score functions applied to a single object. This is due to the fact that each of our objects (images) is composed of many pixels, each of which will have its own score function.
In the first section we present a brief analysis of the problem and the experiments at an abstract level. Following that, we present a description of the data and the experiments. Finally, we present the results of said experiments.
The design of the experiments and the code for their realization are of my sole authorship.
3.1 Analysis of the problem
Consider a random variable U that takes values in a subset of the digits. For each value of U = u, we have a collection of images X_i^u, where i is an index for each image. Each X_i^u is a handwritten representation of the digit. In Figure 3.1 we can see a selection of 100 of these images, and we can see that they are quite different, even within each of the two digits shown. More importantly, there are some ones that look like twos, and there is no evident score function that we could use to distinguish them, at least not one we can readily program.

We want to use the framework presented in chapter 2 to generate score functions to distinguish the images of each digit. We will perform unsupervised classification, meaning
Figure 3.1. Collection of 50 images of handwritten ones and 50 images of handwritten twos. We can appreciate the similarities between the same numbers, but there are some that are hard to distinguish visually.
that for each image, the label U indicating the digit it represents is not available at the moment of finding our score functions (and obviously not at the moment of applying them). Instead, what we will do is try to find the perturbation vectors φ_u (and the associated score functions) for each U = u corresponding to the distribution

P_{X|U}(x|u) = P_X(x) + √(P_X(x)) φ_u(x)

where P_X is the mixture distribution of all the images that we are considering for the training. In our case, this is all the available images for the digits we wish to classify. This means that for each digit u we will have a different direction of perturbation, and we assume that the images corresponding to that digit are distributed i.i.d. with respect to that conditional distribution.
In the case of two digits u1 and u2, and assuming that the random variable U is uniform (in our case this holds, as the mixture distribution is always calculated using roughly equal numbers of images from each class), the perturbation vectors will satisfy φ_{u1} = −φ_{u2}. This means that our ideal score function should be in the direction of this perturbation line (the direction of φ_{u1}), with one of the classes generating positive score values and the other negative. In practice this will not be exactly the case, but we expect to find a direction that represents the maximum change when going from the class of elements with U = u1 to the class of elements with U = u2, and that separates the data the most.

Another way to phrase this is that we desire to assign to each image a point in R^k where applying a known classification method will yield the best results.

In order to obtain this perturbation direction, we will apply some noise W to the images, generating a collection of images Y_i. Notice that we lose the class index u, as we will not keep track of it for the classification, though we will make use of it when calculating our probability of error. Once we have our noise model, we can find the functions f and g that maximize the Rényi correlation and correspond to our score functions for the classification problem. These functions are independent of the images to classify, and can be found offline for a given noise matrix.
Finally, we apply these functions to the images to classify and do one of two things:

* If we know the number of images from each class (but not their actual labels), we can separate them into two groups of those sizes based on their scores.

* If we don't know the number of images from each class, we can apply some clustering algorithm like K-means.
3.2 Description of the data and experiments
The data downloaded from MNIST corresponds to thousands of individual handwritten images for each digit. Each image is a 19×19 pixel matrix, and we quantize it so that each pixel can take one of four values according to its original brightness (quantizing the integer interval {0, ..., 255} into 4 regions). This image data is grouped according to the digit each image represents.

As each image has 361 total pixels, a distribution for the whole image would require 4^361 ≈ 10^217 distinct values, which is intractable. In order to simplify the problem, we will store the following distributions per set of images to study. Note that we have to select which digits we will consider for classification before taking this step, as we want to compile the mixture distributions. For the collection of training images for classification we collect the following:
* P_{X_α}, the probability distribution for each pixel α. Note that this requires storage of at most 4 · 361 = 1444 values.

* P_{X_α,X_β}, the probability distribution for each pair of adjacent pixels α, β. Note that this requires at most 2 · 16 · 361 = 11552 values.
The reason for storing the link distributions is to introduce some spatial dependency between the pixels, as we know that adjacent pixels will be highly correlated.
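The preprocessing above can be sketched as follows. The images here are random stand-ins for MNIST (the real data would be loaded instead), and only horizontal links are collected, for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, side = 1000, 19
images = rng.integers(0, 256, size=(n_images, side * side))   # stand-in for MNIST

# quantize {0, ..., 255} into 4 equal regions
q = images // 64                                   # pixel values in {0, 1, 2, 3}

# per-pixel marginals P_{X_a}: 361 distributions over 4 values
pixel_dist = np.stack([np.bincount(q[:, a], minlength=4) / n_images
                       for a in range(side * side)])
print(pixel_dist.shape)                            # (361, 4)

# per-link joints P_{X_a, X_b}, horizontal neighbours only for brevity
pairs = [(a, a + 1) for a in range(side * side) if (a + 1) % side != 0]
link_dist = np.stack([np.bincount(4 * q[:, a] + q[:, b], minlength=16) / n_images
                      for a, b in pairs]).reshape(len(pairs), 4, 4)
print(link_dist.shape)                             # (342, 4, 4)
```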
After this preprocessing step, each experiment consists of the following steps:
1. Select a subset X1, ..., Xn of images for classification from those collections of digits for which we have a mixture distribution (generated above).

2. Select a noise matrix W. The noise will be independently applied to each pixel, generating noisy images Y1, ..., Yn.
3. For each pixel and link we find the corresponding DTM matrix B. From it we obtain the first k singular vectors (in decreasing singular value order), either directly or by iterating multiplication by BᵀB, per pixel and per pair of adjacent pixels. Note that because the DTM depends on each of the mixture distributions, the obtained perturbation vectors, as well as the singular values, will be different for each pixel.
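Step 2 above, applying the noise independently to each pixel, can be sketched as follows. The matrix W here is an invented symmetric channel, not one of the matrices from Appendix A, and the images are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# invented symmetric noise channel W on the 4 quantized levels; rows sum to 1
w = np.full((4, 4), 0.1) + 0.6 * np.eye(4)

def apply_channel(q_images, w, rng):
    """Apply the noise matrix W independently to every pixel of every image."""
    noisy = np.empty_like(q_images)
    for level in range(w.shape[0]):
        mask = q_images == level
        noisy[mask] = rng.choice(w.shape[0], size=int(mask.sum()), p=w[level])
    return noisy

q = rng.integers(0, 4, size=(100, 361))   # quantized images (stand-in)
y = apply_channel(q, w, rng)
print((y == q).mean())   # roughly 0.7, since W keeps each pixel with probability 0.7
```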
After this process, we have k score functions per pixel (and k score functions per pair of adjacent pixels). In order to obtain a single score per level of power (order of singular value) per image, we will collect the scores using one of the following procedures:
a. Simply add all the pixel scores in the level, disregarding the link scores.

b. Utilize message passing to combine the link scores and the node scores to obtain a single score. This may not converge, due to the graph being loopy.

c. Add all the pixel scores in the level, weighted by the corresponding singular value.

d. Use a. or c. on the pixels and the pairs, and take a convex combination of the two. This requires a way of finding the best coefficient to combine them.
For the simulations shown in the following section, the following design decisions were taken:
" We realize unsupervised classification of images of the digits one and two.
" We will classify a set composed of the same number of images of each digit, randomly
selected from the bank of all images.
" We consider different error matrices W with WxV = Pylx(ylx). The matrices considered can be found in Appendix A
" We collect the scores per level using procedure d. detailed above, scaling each
corresponding score function by its corresponding singular value. We determine the convex combination coefficient to use by minimizing the probability of error per level of singular value.
" We calculate the probability of error for each classification by ordering the scores
and splitting them into two sorted groups of the same size. There are two possible assignments, with one having probability of error less that 1, when comparing2' against the original digit of the image.
3.2.1 Experiment results
We show the results for 9 experiments utilizing the setup described above and different noise matrices, showing that this choice only affects the score functions found, but not the average classification accuracy. In each experiment, the images to classify are selected randomly from the bank of images.
We compute the first two scores (by decreasing singular value) for each pixel and each pair of adjacent pixels. For each experiment, we plot the misclassification error against the different values of the coefficient used to combine the pixel and pair scores in Figure 3.2. For the optimal value of this coefficient, we show a plot of the final score per image in R2, color coded by the original digit of the image, in Figure 3.3. In order to show the accuracy of the classification, we run K-means on the data and show the cluster centers, as well as the best separation line between the two classes (SVM).
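The K-means step on the two-dimensional scores can be sketched with a minimal Lloyd iteration (a stand-in illustration; the actual experiments could equally use scikit-learn's KMeans):

```python
def dist2(p, q):
    # Squared Euclidean distance between two 2-D points.
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def centroid(g):
    # Mean of a non-empty list of 2-D points.
    return (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))

def two_means(points, iters=50):
    """Lloyd's algorithm for K = 2 with a crude deterministic initialization."""
    c0, c1 = points[0], points[-1]
    for _ in range(iters):
        g0 = [p for p in points if dist2(p, c0) <= dist2(p, c1)]
        g1 = [p for p in points if dist2(p, c0) > dist2(p, c1)]
        if g0:
            c0 = centroid(g0)
        if g1:
            c1 = centroid(g1)
    return c0, c1
```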
In Table 3.1 we present the error percentages obtained from K-means and SVM for each experiment. Naturally, SVM performs better, since it is labeled classification; it is not something we could do from the image scores alone, but it gives an indication of the efficacy of the score functions. The table is in the same order as the plots in Figure 3.2 and Figure 3.3.
We can see that the score functions obtained in each experiment provide a good representation of each image as a point in R2, as using K-means or SVM on these points results in a satisfactory classification.
Experiment   K-means error   SVM error
    1            0.14           0.06
    2            0.10           0.06
    3            0.14           0.07
    4            0.10           0.06
    5            0.10           0.10
    6            0.07           0.05
    7            0.07           0.06
    8            0.07           0.07
    9            0.14           0.09

Table 3.1. Errors of classification methods applied to the image scores
30 CHAPTER 3. ANALYSIS AND CLASSIFICATION EXPERIMENTS
(Panels of Figure 3.2: one panel per experiment, each showing the error curves for the first and second singular vector scores against the convex-combination coefficient.)
Figure 3.2. Each graph represents one experiment. For each experiment, we combine the total score for the pixels with the total score for pairs of adjacent pixels using a convex combination (for each image), where the horizontal axis is the coefficient of the pixel score and the vertical axis is the error percentage. This percentage is calculated using only the fact that we know how many images are in each class (the same number of items in each class). We do this for the scores corresponding to the two highest singular values (dismissing the trivial one), with blue corresponding to the highest value and green to the second highest. We see that in all cases, for most of the range, the highest singular value performs better.
Figure 3.3. Each plot represents one experiment. For each experiment we plot the first and second score as a point in R2, using the optimal convex combination coefficient. In order to study the performance, we calculate the K-means (for K = 2) cluster centers associated with the scores, as well as the optimal separating line between the two classes (SVM).

Chapter 4
Discussion and future work directions
In this thesis we presented a geometric framework for interpreting the space of probability distributions that arise from perturbing a given distribution as a linear space. This framework allows us to work with linear transformations and norms instead of probability operators and information-theoretic quantities.
We showed how this model connects with the Alternating Conditional Expectation (ACE) algorithm and how it can be used to determine optimal score functions for the task of classification. These functions are related to the geometric framework described here through the Singular Value Decomposition of a particular matrix that describes a noise channel as applied to distributions close to a fixed base distribution. Due to their connection to SVD, we can obtain a sequence of score functions of decreasing predictive power.
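The ACE iteration referred to above can be sketched on a finite joint pmf as follows (a simplified illustration assuming both marginals have full support; this is not the thesis's implementation):

```python
def _normalize(f, w):
    # Center and scale f to zero mean and unit variance under the weights w.
    mean = sum(wi * fi for wi, fi in zip(w, f))
    f = [fi - mean for fi in f]
    var = sum(wi * fi * fi for wi, fi in zip(w, f))
    return [fi / var ** 0.5 for fi in f]

def ace(P, iters=100):
    """Alternate g(y) = E[f(X) | Y = y] and f(x) = E[g(Y) | X = x] on a joint
    pmf P[x][y], re-normalizing each time; the iteration converges to the
    maximal-correlation score functions (second singular vectors of the
    channel matrix). Returns (f, g, rho) with rho the achieved correlation."""
    nx, ny = len(P), len(P[0])
    Px = [sum(P[x]) for x in range(nx)]
    Py = [sum(P[x][y] for x in range(nx)) for y in range(ny)]
    f = _normalize([float(x) for x in range(nx)], Px)  # non-constant start
    g = [0.0] * ny
    for _ in range(iters):
        g = _normalize([sum(P[x][y] * f[x] for x in range(nx)) / Py[y]
                        for y in range(ny)], Py)
        f = _normalize([sum(P[x][y] * g[y] for y in range(ny)) / Px[x]
                        for x in range(nx)], Px)
    rho = sum(P[x][y] * f[x] * g[y] for x in range(nx) for y in range(ny))
    return f, g, rho
```

Each pass applies one conditional expectation in each direction; the normalized fixed point gives the pair of score functions with maximal correlation, which is the non-trivial singular pair discussed in the text.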
In order to test this framework as a method of classification, we apply it to the task of unsupervised classification of images. This process requires offline training, obtaining statistics from a mixture of images collected from all desired labels; this collection need not be labeled, but it should have a similar number of items per label. After this preprocessing is done, finding the optimal score functions can be done independently of the data to classify.
For each experiment, we generate the best two score functions, giving rise to a two-dimensional score for each image. In order to evaluate these scoring functions, we study the error given by K-means (unlabeled data) and SVM (labeled data). We see that in both cases this error is small, considering that we reduced images with 361 quaternary pixels to two real values.
We can conclude that the framework is useful for obtaining universal classification functions that do not require knowledge of useful features. As the problem grows more complex, we can obtain more and more score functions, up to the dimensionality of the problem, each of decreasing predictive power.
and the back-propagation algorithm used in its training. In the future, application to other classification problems and the comparison between neural networks and the naive classification can be studied.
Appendix A
Noise matrices
[[ 0.6  0.4  0.2  0.1  ]
 [ 0.2  0.2  0.4  0.05 ]
 [ 0.2  0.   0.2  0.05 ]
 [ 0.   0.4  0.2  0.8  ]]

[[ 0.8  0.2  0.1  0.05  ]
 [ 0.1  0.6  0.2  0.025 ]
 [ 0.   0.2  0.6  0.025 ]
 [ 0.1  0.   0.1  0.9   ]]

[[ 0.8   0.2  0.1  0.05  ]
 [ 0.1   0.6  0.2  0.025 ]
 [ 0.05  0.1  0.6  0.025 ]
 [ 0.05  0.1  0.1  0.9   ]]

[[ 0.9   0.1  0.05  0.025  ]
 [ 0.05  0.8  0.1   0.0125 ]
 [ 0.05  0.   0.8   0.0125 ]
 [ 0.    0.1  0.05  0.95   ]]

[[ 0.8  0.2  0.1  0.05  ]
 [ 0.1  0.6  0.2  0.025 ]
 [ 0.1  0.   0.6  0.025 ]
 [ 0.   0.2  0.1  0.9   ]]

[[ 0.6  0.4  0.2  0.1  ]
 [ 0.2  0.2  0.4  0.05 ]
 [ 0.1  0.2  0.2  0.05 ]
 [ 0.1  0.2  0.2  0.8  ]]

[[ 0.9   0.1  0.05  0.025  ]
 [ 0.05  0.8  0.1   0.0125 ]
 [ 0.    0.1  0.8   0.0125 ]
 [ 0.05  0.   0.05  0.95   ]]

[[ 0.9    0.1   0.05  0.025  ]
 [ 0.05   0.8   0.1   0.0125 ]
 [ 0.025  0.05  0.8   0.0125 ]
 [ 0.025  0.05  0.05  0.95   ]]

[[ 0.6  0.4  0.2  0.1  ]
 [ 0.2  0.2  0.4  0.05 ]
 [ 0.   0.4  0.2  0.05 ]
 [ 0.2  0.   0.2  0.8  ]]

Bibliography
[1] Alfréd Rényi. On measures of dependence. Acta Mathematica Academiae Scientiarum Hungaricae, 10(3-4):441-451, 1959.

[2] L. Deng. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141-142, Nov 2012. ISSN 1053-5888. doi: 10.1109/MSP.2012.2211477.

[3] Leo Breiman and Jerome H. Friedman. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80(391):580-598, September 1985.

[4] A. Makur, F. Kozynski, S.-L. Huang, and L. Zheng. An efficient algorithm for information decomposition and extraction. In 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 972-979, Sept 2015.

[5] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

[6] Shao-Lun Huang, Anuran Makur, Fabián Kozynski, and Lizhong Zheng. Efficient statistics: Extracting information from iid observations. In Proceedings of the 52nd Annual Allerton Conference on Communication, Control, and Computing, Allerton House, UIUC, Illinois, USA, October 1-3 2014.

[7] Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, and Tom Mitchell. Text classification from labeled and unlabeled documents using EM. Mach. Learn., 39(2-3):103-134, 2000. ISSN 0885-6125. doi: 10.1023/A:1007692713085.