Dimension reduction for clustered high-dimensional data

SZEKELY, Eniko-Melinda

Abstract

Recent times have witnessed the transition towards a significantly larger scale both in the number of samples and the number of attributes characterising data collections. It is this latter aspect, the dimensionality of the data, that is at the center of the present thesis. We first analyse the evolution of the distance contrast and emphasise its dual character: absolute vs. relative. The second focus is on clustered structures, still in the context of high-dimensional data. Our purpose is to find low-dimensional embeddings with strong discriminative power. In this direction, we propose two methods, the High-Dimensional Multimodal Distribution Embedding – a distance-based embedding method that exploits distance distributions in high dimensions – and the Cluster Space – which projects points in the space of the clusters using the probabilities obtained from a Gaussian mixture model.

SZEKELY, Eniko-Melinda. Dimension reduction for clustered high-dimensional data. Thèse de doctorat : Univ. Genève, 2011, no. Sc. 4333

URN : urn:nbn:ch:unige-173436

DOI : 10.13097/archive-ouverte/unige:17343

Available at:

http://archive-ouverte.unige.ch/unige:17343

Disclaimer: layout of this document may differ from the published version.


UNIVERSITÉ DE GENÈVE — FACULTÉ DES SCIENCES — Département d'informatique — Dr. Stéphane Marchand-Maillet

Dimension reduction for clustered high-dimensional data

THÈSE

présentée à la Faculté des sciences de l'Université de Genève pour obtenir le grade de Docteur ès sciences, mention informatique

par

Enikő Melinda SZÉKELY

de

Tîrgu Mureş (Roumanie)

Thèse N° 4333

GENÈVE 2011


Abstract

We are drowning in information, but starving for knowledge.

John Naisbitt

The need for knowledge discovery from data has always represented a major challenge in many domains. Recent times have witnessed the transition towards a significantly larger scale both in the number of samples and the number of attributes characterising real-world data. It is this latter aspect, the dimensionality of the data, that is at the centre of the present thesis.

In high-dimensional representation spaces, data exhibits specific behaviours that are not common in spaces of lower dimensionality. In this thesis, we first concentrate on studying the behaviour of distances in high dimensions. We analyse the evolution of the distance contrast with increasing dimensionality and emphasise its dual character: absolute vs. relative. This duality is a key aspect of high-dimensional data and plays an important role in understanding and assessing the significance of distances in high dimensions.

Our second focus is on clustered structures, still in the context of high-dimensional data. Clustering high-dimensional data is a challenging task due to the complex structure of real-world data. In this context, dimension reduction emerged as a powerful solution to the analysis of high-dimensional data.

Instead of performing the analysis in the high-dimensional space, a lower-dimensional space is found through dimension reduction and further analysis is performed in this reduced space. Our aim in this thesis is to find low-dimensional embeddings with strong discriminative power. In a first contribution, the High-Dimensional Multimodal Distribution Embedding, we exploit distance distributions in high dimensions and propose a distance-based embedding method. The strength of the method is due to a distance transformation that we apply prior to embedding and that increases the cluster preservation capability of the low-dimensional space. Next, the Cluster Space projects points in the space of the clusters using the probabilities obtained from a Gaussian mixture model. The intrinsic relationship of the Cluster Space with the quadratic discriminant function justifies its strong discriminative power.

Overall, the current thesis addresses the study of the behaviour of distances in high dimensions and dimension reduction as a solution for finding discriminant, structure-preserving low-dimensional embeddings.


Keywords: data analysis, clustering, high-dimensional data, dimension reduction, visualisation, neighbourhood graph


Résumé

We are drowning in information, but starving for knowledge.

[author's translation of the epigraph]

Knowledge discovery from data has always represented a major challenge in many domains. Recently, we have witnessed a transition in scale, both in the number of samples and in the number of attributes that characterise real-world data. It is this latter aspect, the dimensionality of the data, that is at the centre of this thesis.

In high-dimensional representation spaces, data exhibits specific behaviours that are not common in spaces of lower dimensionality. In this thesis, we first study the behaviour of distances in high-dimensional spaces. We analyse the evolution of the contrast between distances as the dimensionality increases and highlight the dual character of this contrast: absolute vs. relative. This duality plays an important role in understanding and assessing the significance of distances in high-dimensional spaces.

In a second stage, we concentrate on clustered structures (data organised into groups), still in the context of high dimensions. Clustering/partitioning high-dimensional data is a challenge owing to the structural complexity of real-world data. In this context, dimension reduction has emerged as an effective solution for the analysis of high-dimensional data. Rather than carrying out the analysis in the high-dimensional space, a lower-dimensional space is first found by dimension reduction methods and the analysis is performed in this reduced space. Our goal in this thesis is to find transformations towards low-dimensional spaces (projections) that are strongly discriminant. We propose a first solution, the High-Dimensional Multimodal Distribution Embedding, in which we exploit the distributions of distances in high dimensions and propose a distance-based dimension reduction method. The strength of this method is due to a distance transformation that we apply before the projection, which increases the capability of the reduced space to preserve the groups of data (clusters). Our second contribution, the Cluster Space, projects the points into the space of the clusters using the probabilities of a Gaussian mixture model. The intrinsic relationship of the Cluster Space with quadratic discriminant analysis justifies its strong discriminative power.

Overall, this thesis addresses the study of the behaviour of distances in high dimensions and dimension reduction as a solution for finding projections towards reduced-dimensional spaces that are discriminant.

Keywords: data analysis, data clustering, high-dimensional data, dimension reduction, visualisation, neighbourhood graph


Contents

List of Notations
List of Figures
List of Tables

1 Introduction
  1.1 Context and motivation
  1.2 Thesis outline
  1.3 Contributions

2 On the nature of high-dimensional data
  2.1 Distances in high dimensions
    2.1.1 Attribute weighting in Minkowski metrics
  2.2 State of the art on relative and absolute contrast
    2.2.1 Relative contrast
    2.2.2 Absolute contrast
    2.2.3 The concentration of the Euclidean norm for iid dimensions
    2.2.4 The concentration of the cosine similarity for iid dimensions
  2.3 Correlation in high dimensions
    2.3.1 Minkowski distance distribution
    2.3.2 Relative and absolute contrast
  2.4 Are high-dimensional points really equidistant?
  2.5 The duality of the notion of nearest neighbours
    2.5.1 Experiments on real data
  2.6 Summary

3 Related work
  3.1 What is a meaningful low-dimensional representation?
  3.2 Dimension reduction
    3.2.1 Taxonomy of dimension reduction methods
    3.2.2 Optimisation solutions in dimension reduction
    3.2.3 Techniques for dimension reduction
  3.3 Clustering
    3.3.1 Clustering in high dimensions
    3.3.2 Subspace clustering
  3.4 Summary

4 High-Dimensional Multimodal Distribution Embedding
  4.1 Distance distribution of clustered data
    4.1.1 Evolution of relative distance contrast in real datasets
  4.2 The model
    4.2.1 Distance transformation
    4.2.2 Scaling function
    4.2.3 Embedding
  4.3 Effect of scaling
  4.4 Cluster-based similarity
  4.5 High-Dimensional Multimodal Distribution Embedding
  4.6 Properties of the transformed distance
  4.7 Experimental results
    4.7.1 Evaluation
    4.7.2 Parameters λ and k
    4.7.3 Visualisation
    4.7.4 MAP and k-means
  4.8 Related work
  4.9 Summary

5 The cluster space
  5.1 The model
    5.1.1 Derivation of the cluster space
    5.1.2 Relationship with QDA
    5.1.3 The algorithm
    5.1.4 High-dimensional data
  5.2 Experiments
    5.2.1 Artificial data
    5.2.2 Real-world data
  5.3 Related work
  5.4 Summary

6 Conclusion
  6.1 Review of contributions
  6.2 Open issues and future perspectives
    6.2.1 Direct perspectives of this thesis
    6.2.2 Future directions
  6.3 Conclusion

Appendices
A [Beyer et al., 1999]
  A.1 Relative contrast
  A.2 Absolute contrast
B Data
  B.1 MNIST handwritten digits
  B.2 20 Newsgroups
  B.3 UCI Machine Learning Repository
C Derivation of the gradient for HDME
  C.1 Relationship between LE and HDME

Publications
Bibliography


List of Notations

N          Size of the dataset
X          The original dataset
x_i        Data sample x_i in the original space
Y          The embedded dataset
y_i        Data sample y_i in the embedded space
i, j       Indices used for data points
D          Dimensionality of the dataset in the original high-dimensional space
d          Dimensionality of the dataset in the embedded low-dimensional space
l, l′      Indices used for dimensions
δ_i^min    Distance to the nearest neighbour of x_i
δ_i^max    Distance to the farthest neighbour of x_i
δ_ij       Distance between any two points x_i and x_j in the dataset
L_p        Minkowski metric of order p
k          The number of nearest neighbours
C          The number of clusters


List of Figures

1.1 The process of dimension reduction in data mining
2.1 The “curse of dimensionality”
2.2 Minkowski distances
2.3 Weighting in Minkowski distances
2.5 The Minkowski metric L_p
2.4 Distribution of Minkowski distances
2.6 Absolute contrast and correlation
2.7 Relative contrast and correlation
2.8 Ratio between standard deviation and mean
2.9 Means and standard deviations
2.10 Accuracy of kNN (Breastcancer)
2.11 Accuracy of kNN (Ionosphere)
2.12 Accuracy of kNN (Iris)
2.13 Accuracy of kNN (Wine)
2.14 Accuracy of kNN (MNIST)
2.15 Accuracy of kNN (20 Newsgroups)
3.1 Spherical embedding
3.2 The process of dimension reduction
3.3 Taxonomy of dimension reduction methods
3.4 Optimisation solutions in dimension reduction
3.5 The Swiss roll
3.6 The geodesic distance in high dimensions
4.1 Multimodality of clustered data distance distributions
4.2 Distance ratios for clustered real datasets
4.3 Distance ratios in the embedding
4.4 The sigmoid function
4.5 Similarity-based distance transformation in HDME
4.6 Influence of the scaling factor λ
4.7 Distance distributions (MNIST)
4.8 Distance ratios after distance transformation
4.9 Unsupervised cluster-based HDME
4.10 Influence of the number of nearest neighbours k
4.11 Influence of the parameters λ and k
4.12 The evolution of stress in HDME with k and λ (MNIST)
4.13 Visualisation (MNIST)
4.14 Distance distribution for 20 Newsgroups
4.15 MAP and k-means (MNIST)
4.16 MAP and k-means (20 Newsgroups)
4.17 MAP and k-means (20 Newsgroups) for different dimensionalities
4.18 Newsgroups N8, N9, N10, N11
4.19 Newsgroups N1, N5, N10, N15
4.20 Visualisation Newsgroups
4.21 Iris dataset: MAP and k-means accuracy
4.22 Wine dataset: MAP and k-means accuracy
5.1 Mahalanobis distance in the cluster space
5.2 Cluster space (artificial data)
5.3 Influence of the number of clusters C in the cluster space
5.4 Cluster space (Wine)
5.5 Cluster space (Iris)
5.6 Cluster space (MNIST 1)
5.7 Cluster space (MNIST 2)
B.1 MNIST handwritten digits
B.2 Stem frequency (20 Newsgroups)

List of Tables

2.1 Minkowski distances
2.2 Absolute and relative contrast
2.3 Attribute means for UCI Machine Learning Repository
4.1 Datasets
5.1 Cluster space: the algorithm
5.2 Evaluation of the Wine dataset in the cluster space
B.1 MAP (20 Newsgroups)
B.2 Datasets from UCI Machine Learning Repository


Chapter 1

Introduction

Data mining is the process of acquiring, processing and modelling data, and aims to discover hidden knowledge that is not immediately evident in the data itself. Modern times have witnessed an increase in the volume of data analysed in real-world applications; this increase is twofold:

• in the number of samples, that is, in the number of observations acquired in the initial step of an application, e.g. information retrieval, gene expression analysis, social networks, customer market analysis, weather forecasting. Such applications are typically termed large-scale applications.

• in the number of features, that is, in the number of attributes gathered to characterise each of the samples of the collection. The dimensionality of the feature spaces has increased significantly in many applications, e.g. text and image retrieval, recommender systems, gene expression analysis, giving birth to so-called high-dimensional data.

The patterns hidden in the data together constitute the underlying knowledge extractable from the available data. Structures, and especially clusters, constitute one such type of knowledge as, in many applications, data is naturally organised into groups. The datasets are typically described by a multitude of variables whose cardinality represents the dimensionality of the space – the original space – in which each sample of the dataset is represented. This dimensionality has increased significantly, giving rise to high-dimensional data.

1.1 Context and motivation

High-dimensional data is receiving increasing attention due to its occurrence in more and more application fields. But data in high dimensions no longer obeys the intuitive behaviours we are used to from low-dimensional spaces. Responsible for these specific behaviours of data in high dimensions is the “curse of dimensionality”, a term coined by [Bellman, 1961].

Distances are particularly influenced by the specific geometry of high-dimensional spaces. To assess the significance of distance measurements in high dimensions, the notion of distance contrast has been introduced. This contrast has a dual character: absolute vs. relative. The absolute contrast measures the absolute difference between two distance values, while the relative contrast measures the ratio between two distance values. With the increase in dimensionality, the mean of the distance distribution tends to dominate the standard deviation and the relative contrast thus tends to disappear. The distances are said to concentrate. This behaviour has a negative influence on many analysis methods that use distances computed directly in the original space. The quality of the results is often poor, especially when searching for structures and clusters in the data.

Figure 1.1: The process of dimension reduction in data mining.

In this high-dimensional context, dimension reduction methods have emerged as successful tools for overcoming, to a certain extent, the problems of the curse of dimensionality. The process of dimension reduction in data mining is illustrated in figure 1.1. In the first step of the data mining process, the initial data is described in a machine-interpretable format. The common way of representing data items is by using feature representation methods. The derived feature spaces are often high-dimensional. In these high-dimensional vector spaces, dimension reduction methods are used to find lower-dimensional representations of the original data. For structure preservation, structure identification means are employed to derive the low-dimensional data representation. Further analysis – clustering, retrieval, classification – is performed on this low-dimensional data with the purpose of discovering useful knowledge and patterns hidden in the data.

As a general definition, dimension reduction is defined as a process meant to find meaningful low-dimensional representations of high-dimensional data. It is motivated by:

• the assumption that data lies in spaces of lower dimensionality than the original spaces;

• the need to visualise data;

• the need to reduce the computational load (time and/or memory complexity) of high-dimensional data processing.

Dimension reduction methods search for lower-dimensional embeddings that preserve the most important features or properties of the data. Still, the preservation of cluster information during embedding has received little attention, despite its importance in numerous fields. Nevertheless, cluster preservation matters because, apart from an inspection of the reduced space, many real-world applications rely on the recovery of structures from the original data. Moreover, clustering directly in high dimensions has itself been recognised as a difficult problem, since existing algorithms do not cope well with the complex structure and sparsity of real-world high-dimensional data.


The above challenges – high dimensionality and clustered data – are inherent to information retrieval tasks, where:

• the data is represented in high-dimensional spaces, e.g. text collections are represented by tens of thousands of attributes (words), equivalent to the size of a vocabulary;

• the data is generally organised in a clustered structure with multiple topics or concepts summarising the content of the data collection.

Given this scenario, high dimensionality and the clustered data structure together constitute the focus of the current thesis. Overall, we seek answers to several questions:

• What are the specific behaviours of high-dimensional data? How do distances evolve with increasing dimensionality? Given the correlated nature of real-world data, what is the impact of correlation on distance behaviour? Is the nearest neighbour meaningful in high dimensions?

• What are the implications of high dimensionality for clustering and dimension reduction methods? Can dimension reduction be used to improve clustering results? Which techniques are best adapted to the high dimensionality of real-world data?

• How can dimension reduction methods be designed to find low-dimensional embeddings with strong discriminative power? How can existing dimension reduction methods be adapted to clustered data to provide efficient results?

• And finally, what new dimension reduction methods can be proposed to deal with the increasing dimensionality of the data?

1.2 Thesis outline

This thesis begins with an analysis of distances in high dimensions in chapter 2. We extend the analysis beyond the current state of the art by considering the correlation among dimensions. The impact of high dimensionality on dimension reduction is detailed in the related work in chapter 3. We then introduce our contributions in the field of dimension reduction in chapters 4 and 5. We conclude our work in chapter 6. In more detail:

• Chapter 2: On the nature of high-dimensional data first analyses the behaviour of the distance contrast, in absolute and relative terms, with increasing dimensionality. State-of-the-art research has focused on the study of the Minkowski distance, mainly due to its wide utilisation, especially the Euclidean distance. We discuss the distance concentration phenomenon and emphasise its occurrence in relationship with the relative contrast, as opposed to the absolute contrast. Current research results have concentrated on the analysis of data with independent and identically distributed (iid) attributes. We extend the analysis to correlated variables and show that, contrary to iid variables, correlation introduces significant differentiation in the data in terms of distances. The result is important for real-world data where the attributes or a subset of the attributes are often correlated, especially locally. From the dual character of the distance contrast, we derive the dual character, relative vs. absolute, of the notion of nearest neighbours. k-nearest neighbours has an absolute approach, while ε-nearest neighbours has a relative approach to nearest neighbour search. The absolute approach of kNN makes it the appropriate method to employ when searching for neighbours in high dimensions.

• Chapter 3: Related work presents the most common dimension reduction methods in the literature. We use dimension reduction techniques to find low-dimensional embeddings of the original space and then perform the analysis in the reduced space. Our scenario is that of clustered high-dimensional data and, in this context, we discuss the adaptability of each method to such data. Overall, local methods, i.e. methods that use local information such as neighbourhoods, are better adapted to the complex structure of high-dimensional data and generate results of better quality.

• Chapter 4: High-Dimensional Multimodal Distribution Embedding (HDME) is an unsupervised distance-based dimension reduction method whose aim is to find low-dimensional representations of high-dimensional data that preserve and emphasise cluster information. The difficulty of analysing high-dimensional data arises from the fact that, in high-dimensional representation spaces, all pairwise distances between points tend to become equal in relative terms (chapter 2). This leads to the generation of spherical embeddings with low or no cluster separability. To overcome the problem of equal distances, HDME proposes a distance transformation. The choice of the transformation function is driven by insights into the distribution of distances of clustered datasets, i.e. they are characterised by multimodality. Using such transformations prior to the actual dimension reduction, we help cluster preservation during embedding. One of the most reliable pieces of information in high-dimensional spaces is the k-nearest neighbour (chapter 2). Our distance transformation function is thus based on the kNN information estimated in the original space. Once the new distances are computed, HDME embeds them in a lower-dimensional space using a distance-based embedding method. The resulting space proves to have strong discriminative power when compared to state-of-the-art methods.

• Chapter 5: The Cluster Space also addresses the problem of finding low-dimensional embeddings with strong discriminative power. While HDME is a distance-based embedding method, the cluster space approaches dimension reduction from a model perspective. The data is assumed to be generated from a mixture of Gaussians. The coordinates of the points in the space of the clusters are estimated from the parameters of a Gaussian mixture model. This makes it possible to capture not only the information among data points, but also among clusters, in the same embedding space. We justify the discriminant capability of the cluster space by showing its intrinsic relationship with quadratic discriminant analysis.



1.3 Contributions

We summarise the main contributions of this thesis here:

• We show the positive effect of correlation among dimensions (non-iid) on the distance contrast, both in absolute and relative terms. The correlation introduces a significant discrimination among distances, thus increasing the significance of distance estimations in high dimensions.

• We show that the Euclidean distance never concentrates in absolute terms. The absolute contrast remains constant when the dimensions are independent and starts to increase with the dimensionality as soon as correlation among dimensions is introduced.

• We derive the duality of the notion of nearest neighbours from the dual character of the notion of distance contrast. While k-nearest neighbours has an absolute approach to nearest neighbour search, ε-nearest neighbours uses a relative approach. The absolute approach of kNN makes it the appropriate method to be employed in high dimensions. Empirical results also show the high accuracy of kNN (∼90%), making it one of the most reliable operations on real-world high-dimensional data.

• We propose an unsupervised distance-based embedding method: High-Dimensional Multimodal Distribution Embedding. The aim throughout the thesis is to find low-dimensional embeddings with strong discriminative power. From the insights into the behaviour of distance distributions in high dimensions for clustered data, we propose a distance transformation. Applied prior to embedding, the transformation significantly increases the cluster separability of the representation space proposed.

• We propose a second unsupervised dimension reduction method to find an embedding space with strong discriminative power: the Cluster Space. The estimation of the embedding coordinates relies on the estimation of the parameters of a Gaussian mixture model. We show the intrinsic relationship of the cluster space with quadratic discriminant analysis, thus justifying the discriminant capability of the reduced space.


Chapter 2

On the nature of high-dimensional data

The term “high-dimensional data” is employed to designate data points represented by a large number of attributes. The range of values for which the dimensionality is considered high varies from tens to thousands of dimensions and even more. The analysis of such datasets, whether it is for retrieval, clustering, exploration or visualisation, is difficult due to the so-called “curse of dimensionality” [Bellman, 1961]. High-dimensional data is present in domains as varied as multimedia (e.g. text, images, videos), bioinformatics (e.g. gene expressions) or finance and climate (e.g. time series).

The “curse of dimensionality” is tightly related to the “empty space phenomenon”: high-dimensional spaces are inherently sparse. To illustrate the curse of dimensionality and the empty space phenomenon, we consider as an example the sampling of a probability space (figure 2.1). Imagine drawing samples in the probability space – sampled with M units on each dimension – so as to obtain a given accuracy.

In a one-dimensional space, one sample covers 100/M % of the probability space, so M points would be required for the given accuracy. In a two-dimensional space, one sample covers 100/M² % of the probability space, so M² points would be needed to obtain the same accuracy. In a D-dimensional space, one sample covers 100/M^D % of the probability space, so M^D samples would be required for the accuracy.

Therefore, the number of samples required to achieve the same given accuracy grows exponentially with the dimensionality of the space, giving rise to the “curse of dimensionality”.
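To make the exponential growth concrete, the following minimal Python sketch (illustrative only, not part of the thesis; it uses only the standard library) prints, for the M = 10 of figure 2.1, the fraction of the probability space covered by a single sample and the number of samples M^D needed to keep that per-sample coverage constant.

    # Samples required to keep one sample per cell when each of the D
    # dimensions is sampled with M units (the example of figure 2.1).
    M = 10  # units per dimension

    for D in (1, 2, 3, 5, 10):
        coverage = 100.0 / M ** D       # percentage of the space covered by one sample
        needed = M ** D                 # samples needed for the same accuracy
        print(f"D={D:2d}: one sample covers {coverage:.8f}% "
              f"of the space -> {needed} samples needed")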

With the growth in the number of attributes describing the data in current real-world applications, data mining and machine learning analysis tools are increasingly affected by the curse of dimensionality. The dimensionality of the data is increasing rapidly, often due to the use of automated tools for feature extraction, and many cases arise where the dimensionality of the data, D, exceeds the number of samples, N (D > N).

The inherent properties of high-dimensional spaces make the commonly intuitive behaviours from low-dimensional spaces no longer applicable in high dimensions. Pairwise similarity and dissimilarity measures are particularly influenced by the specific geometry of high-dimensional spaces. The wide use of these measures in nearly any data processing tool makes their analysis especially interesting.

Research efforts have focused on the study of Minkowski pairwise distances in high dimensions [Beyer et al., 1999; Hinneburg et al., 2000; Aggarwal et al., 2001; Aggarwal and Yu, 2002; Francois et al., 2007]. The findings state that the ratios between distance values converge to one with the increase in dimensionality (see section 2.2.1). Distances are said to concentrate. As a consequence of this concentration phenomenon, the information captured by distance values in high dimensions raised many questions concerning its discriminating capability and therefore its usefulness. Throughout this chapter, we consider the concentration phenomenon and identify the conditions under which it occurs. The concentration is mainly a consequence of the fact that the standard deviation of the distance distribution varies more slowly than the average distance. It is therefore observed when the ratio of distances is considered and not necessarily when the absolute difference of distances is used.

Figure 2.1: Illustration of the “curse of dimensionality” (M = 10), shown for D = 1, 2 and 3. To achieve a better visualisation, when D = 3 only a part of the space is shown.

The analysis of the behaviour of distances with increasing dimensionality has focused on data with independent coordinates/features/attributes/dimensions. The independence assumption implies the absence of correlation among dimensions. However, given the frequent occurrence of correlation among data attributes in real-world scenarios, we propose to extend the analysis of distance behaviour in high dimensions to correlated data. We consider different degrees of correlation among data attributes and show that correlation has a positive effect on the notion of distance in terms of discriminative power. The presence of correlation diminishes the concentration phenomenon by increasing the contrast among distances.

After a brief introduction of the notion of distance, and particularly of the Minkowski distance, in section 2.1, the current chapter continues with a presentation of the findings concerning the behaviour of distances in high dimensions in section 2.2. The dual character – absolute vs. relative – of the notion of contrast in distances is analysed in the context of high-dimensional data. Most results presented in the literature are valid under the independent and identically distributed (iid) assumption. Given the correlated nature of real-world data, we consider the correlation assumption in high dimensions in section 2.3 and show its positive influence on the discriminative power of the notion of distance. Section 2.4 reconsiders the duality of the notion of distance contrast and its relationship with the distance concentration phenomenon. In section 2.5 we show the analogy between the distance contrast and the notion of nearest neighbour and derive the dual character of the latter: k-nearest neighbours vs. ε-nearest neighbour. This duality and its impact on the nearest neighbour search in high dimensions is also discussed. Experiments on real-world data show the high accuracy of the k-nearest neighbours for different Minkowski metrics. A summary of the main findings concludes the chapter in section 2.6.

2.1 Distances in high dimensions

A distance function or a metric is a measure of the distance between two points. Given a set X of points and the distance function δ: X × X → R, δ_ij is the distance between two D-dimensional points x_i and x_j. A function is called a metric if it satisfies the following four conditions:

1. δ_ij ≥ 0 (positiveness)
2. δ_ij = 0 ⇔ x_i = x_j (identity)
3. δ_ij = δ_ji (symmetry)
4. δ_ih ≤ δ_ij + δ_jh (triangle inequality)

The most common distance is the Minkowski distance of order p, L_p:

L_p(x_i, x_j) = [ ∑_{l=1}^{D} |x_i^l − x_j^l|^p ]^{1/p}    (2.1)

The definitions of the three main Minkowski distances are given in table 2.1. The Minkowski distance is defined for natural values of the order p = 1, 2, ... When extended to positive real orders of p, 1 ≤ p < ∞, it remains a metric. However, when fractional orders are considered, 0 < p < 1, the triangle inequality no longer holds and these distances are semi-metrics, not metrics. The triangle inequality is lost due to the loss of the convexity property. This is illustrated in figure 2.2, together with the contour lines of constant value for a few extended Minkowski distances in a two-dimensional space.

The contour lines reflect the positions in the space of the points equally distant from a given point, i.e. in this case, equally distant from the origin.

The increase in dimensionality is associated with the addition of extra information through the new dimensions. The added information translates into higher distance values, except in the specific case where the same information is added by the new dimensions for all data points. With the increase in the absolute value of the distances, the contour lines – around which the distance values gather – move away from the data point, i.e. here, from the origin. The speed at which the absolute values of the distances increase – the represented contour lines move away – is generally faster than the speed at which the absolute differences between distances increase, leading to what was called the “distance concentration phenomenon” in high dimensions. Despite the general opinion that this phenomenon always occurs in high dimensions, it is actually observed in relation to the relative contrast induced by distances and only rarely in relation to the absolute contrast. The analysis of the absolute and relative contrast for high-dimensional pairwise distances is at the centre of the current chapter.



Figure 2.2: The Minkowski distance for different orders of p (p = 0.25, 0.5, 1, 2, 3, 100).


p        Name                    Distance
1        Manhattan (Cityblock)   ∑_{l=1}^{D} |x_i^l − x_j^l|
2        Euclidean               [ ∑_{l=1}^{D} |x_i^l − x_j^l|² ]^{1/2}
p → ∞    Chebyshev (Dominance)   lim_{p→∞} [ ∑_{l=1}^{D} |x_i^l − x_j^l|^p ]^{1/p} = max_l |x_i^l − x_j^l|

Table 2.1: The most common Minkowski distances.

2.1.1 Attribute weighting in Minkowski metrics

Different Minkowski metrics weight the attributes/dimensions differently in the computation of the distance. The analysis of the weighting of attributes requires considering only the terms under the sum, while neglecting the power 1/p, i.e. the latter has no influence on the weighting. Let the new weighting variable be denoted W_p:

W_p(x_i, x_j) = ∑_{l=1}^{D} |x_i^l − x_j^l|^p    (2.2)

Figure 2.3: Weighting in L_p.

The terms under the sum, |x_i^l − x_j^l|^p, belong to the family of power functions (figure 2.3) and therefore they act differently depending on the value of p:

• for large values of p, a small variation in difference on one attribute leads to a large variation in value for the terms under the sum;

• for small values of p, a large variation in difference on one attribute leads to a small variation in value for the terms under the sum.

Thus, an increase in the value of p leads to an increase in the importance given to large-valued differences |x_i^l − x_j^l|. On the contrary, a decrease in p leads to a uniformisation/normalisation of the importance given to the attribute differences. When p → 0, all terms |x_i^l − x_j^l|^p → 1 and the value of the distance tends to reflect just the number of attributes.¹

¹ For simplicity, in figure 2.3 we consider only one attribute/dimension in the illustration. Our purpose is to show the way different values of the differences |x_i^l − x_j^l| are weighted by different metrics. If, for example, the scales of two attributes are totally different, the attribute with the largest scale will generate significantly larger differences. Thus, this large-scale attribute will tend to dominate the value of the distance even more as p increases. We exemplify the weighting of attributes in the Minkowski metric with the following example. Let us consider the description of a person by two attributes: age and height. We consider two samples (persons) with the following characteristics: x_1 = [60, 1.63] and x_2 = [35, 1.70]. We then have: W_0.5 = |60−35|^0.5 + |1.63−1.70|^0.5 = 5 + 0.26 = 5.26; W_1 = |60−35|^1 + |1.63−1.70|^1 = 25 + 0.07 = 25.07; W_2 = |60−35|^2 + |1.63−1.70|^2 = 625 + 0.0049 = 625.0049; W_3 = |60−35|^3 + |1.63−1.70|^3 = 15625 + 3.43×10^{-4}. In the different weighting schemes, we observe the difference in the weighting scale for the two attributes: age and height. The higher the value of p, the higher the weight given to the largest attribute (the age). From this example, we deduce the importance of normalisation in the Minkowski metric when the scales of the attributes are significantly different.


By construction, Minkowski metrics weight the attributes differently and consequently vary the importance of these attributes in the final value of the distances. Experiments performed later in this chapter show the impact of this weighting on distance computation in real-world practical applications (see section 2.5.1).

2.2 State of the art on relative and absolute contrast

The contrast reflected by distances is an indicator of how similar or how different multiple data items are. Two types of contrast were identified in the literature: absolute and relative. Their analysis has been performed in the context of the nearest neighbour search [Aggarwal et al., 2001; Beyer et al., 1999; Hinneburg et al., 2000] and they were consequently defined in terms of the distances from a query data point, x_i, to its nearest (δ_i^min) and farthest (δ_i^max) neighbours, as in the following.

Definition 1. The absolute contrast is the absolute difference between the distances to the nearest and farthest neighbours from a query data point x_i:

δACmaxi −δmini (2.3)

Definition 2. The relative contrast is the ratio between the absolute contrast and the absolute value of the shortest distance:

δ_RC = δ_AC / δ_i^min = (δ_i^max − δ_i^min) / δ_i^min = δ_i^max / δ_i^min − 1    (2.4)

From equation (2.4) we observe that analysing the behaviour of the relative contrast is equivalent to analysing the behaviour of the ratio between the two distance values.
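A small NumPy experiment (our own illustration, on an arbitrary iid uniform dataset, not data from the thesis) lets one observe definitions (2.3) and (2.4) empirically: as D grows, the relative contrast δ_RC shrinks towards zero while the absolute Euclidean contrast δ_AC does not.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 1000

    for D in (2, 10, 100, 1000):
        X = rng.random((N, D))                    # iid uniform data
        q = X[0]                                  # query point
        d = np.linalg.norm(X[1:] - q, axis=1)     # Euclidean distances to the query
        d_min, d_max = d.min(), d.max()
        ac = d_max - d_min                        # absolute contrast, eq. (2.3)
        rc = ac / d_min                           # relative contrast, eq. (2.4)
        print(f"D={D:5d}: delta_min={d_min:7.3f}  delta_max={d_max:7.3f}  "
              f"AC={ac:6.3f}  RC={rc:6.3f}")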

2.2.1 Relative contrast

[Beyer et al., 1999] proved in their work that the relative contrast converges to zero with the increase in dimensionality, given that a precondition on the distribution of distances is fulfilled. Both the precondition and the result are summarised in the following theorem, which can be found in more detail in the original work:

Theorem 1. [Beyer et al., 1999] If

lim_{D→∞} var( δ_{ij}^p / E[δ_{ij}^p] ) = 0    (2.5)

with p constant, 0 < p < ∞, then for every ε > 0:

lim_{D→∞} P[ δ_i^max ≤ (1 + ε) δ_i^min ] = 1    (2.6)



Reformulating the result from equation (2.6) of the theorem in terms of the ratio between the distances to the closest and farthest neighbour from the query point, we obtain the weak convergence of the ratio between the two most far away neighbours:

lim_{D→∞} P[ δ_i^max / δ_i^min ≤ 1 + ε ] = 1    (2.7)

Thus, the ratio tends to converge to one in probability. As δ_i^max ≥ δ_i^min, it results that the relative contrast (equation (2.4)) decreases with the increase in dimensionality, converging to zero:

(δ_i^max − δ_i^min) / δ_i^min → 0 (in probability) when D → ∞    (2.8)

In summary, the theorem shows that the ratio of the distances decreases with the increase in dimensionality and tends to converge to one. The result was proven for the ratio of the distances to the two most far away points from the query, δ_i^min and δ_i^max. By interpolation, the ratio of the distances to any two points from the query converges to one.

The precondition

The above result is preconditioned by equation (2.5), which can be rewritten as follows:

lim_{D→∞} var( δ_{ij}^p / E[δ_{ij}^p] ) = lim_{D→∞} var(δ_{ij}^p) / (E[δ_{ij}^p])² = 0    (2.9)

The precondition tests whether the variance of the distance distribution divided by the squared expectation converges to zero or not. In effect, it tests the speed at which the variance changes with respect to the squared mean value. If the variance varies more slowly than the squared mean with the increase in dimensionality, the precondition is fulfilled and the relative contrast tends to disappear.
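The precondition (2.5)/(2.9) can also be checked numerically; the sketch below (illustrative only, for iid Gaussian data and the Euclidean case p = 2) estimates var(δ_ij^p)/(E[δ_ij^p])² over all pairwise distances and shows it decreasing towards zero as D grows.

    import numpy as np

    rng = np.random.default_rng(1)
    N, p = 500, 2

    for D in (2, 10, 100, 1000):
        X = rng.standard_normal((N, D))
        sq = (X ** 2).sum(axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared Euclidean distances
        d = np.sqrt(np.clip(d2, 0.0, None))[np.triu_indices(N, k=1)]
        v = d ** p                                       # delta_ij^p
        ratio = v.var() / v.mean() ** 2                  # precondition (2.9)
        print(f"D={D:5d}: var(delta^p) / (E[delta^p])^2 = {ratio:.5f}")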

While the theorem of [Beyer et al., 1999] tests the behaviour of the variance in relative terms, it does not reveal its behaviour in absolute terms. This analysis was performed by [Hinneburg et al., 2000] and is presented in the following.

2.2.2 Absolute contrast

The evolution of the absolute contrast between the distances to the closest and farthest neighbour from the query point was investigated by [Hinneburg et al., 2000]. The analysis is performed for the Minkowski distance considering a D-dimensional data distribution with iid coordinates. Without loss of generality, the query point is considered to be the origin. For this data distribution the authors prove the following theorem:

Theorem 2. [Hinneburg et al., 2000] Let F_D be an arbitrary distribution such that each coordinate is independently drawn from a one-dimensional data distribution F. Let the distance function be an L_p metric. Then,

C_p ≤ lim_{D→∞} E[ (δ_max^p − δ_min^p) / D^{1/p − 1/2} ] ≤ (N − 1) C_p    (2.10)


where δ_min^p and δ_max^p are the Minkowski distances of order p, L_p, to the closest and the farthest neighbour, and C_p is a constant dependent on the order p.

The above result shows a direct dependency between the absolute difference and the L_p metric for arbitrarily distributed points with iid dimensions. The absolute contrast (δ_max^p − δ_min^p) grows as D^{1/p − 1/2}. Thus, for the Manhattan metric L_1 the absolute difference increases with the dimensionality, for the Euclidean metric L_2 it converges to a constant, while for the other Minkowski metrics L_p with p > 2 it decreases with the dimensionality.

The increase in the absolute difference with the dimensionality for the L_1 metric inspired the expansion of the study of the absolute contrast to fractional distances in [Aggarwal et al., 2001]. Fractional distances are extensions of the Minkowski metrics L_p for fractional orders of p, 0 < p < 1 (see also section 2.1). As already mentioned, the fractional distances are not metrics as they do not fulfil the triangle inequality.

The results of theorem 2 can also be applied to fractional distances. Thus, the rate of increase of the absolute difference is again D^{1/p − 1/2}, where p is here fractional. From both the works of [Hinneburg et al., 2000] and [Aggarwal et al., 2001] it results that the absolute contrast for data with iid coordinates depends on the order p of the extended Minkowski metric.
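The D^{1/p − 1/2} scaling of theorem 2 can be observed with a short simulation (our own illustration, iid uniform coordinates): the average δ_max − δ_min grows for p = 1, stays roughly constant for p = 2 and slowly shrinks for p = 3 (the p = 3 trend is slow and only becomes clear for large D).

    import numpy as np

    rng = np.random.default_rng(2)
    N, n_queries = 500, 50

    def lp_dist(A, q, p):
        # L_p distances from every row of A to the query q.
        return (np.abs(A - q) ** p).sum(axis=1) ** (1.0 / p)

    for p in (1, 2, 3):
        for D in (10, 100, 1000):
            X = rng.random((N, D))
            acs = []
            for i in range(n_queries):
                d = np.delete(lp_dist(X, X[i], p), i)   # distances to the other points
                acs.append(d.max() - d.min())           # absolute contrast for this query
            print(f"p={p}, D={D:5d}: mean absolute contrast = {np.mean(acs):.3f}")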

[Hinneburg et al., 2000] consider the origin as query point. The distances to the origin from any other point represent the norms of the respective vectors. The behaviour of the norm with respect to the dimensionality had already been discussed in [Demartines, 1994] and is presented in the following.

2.2.3 The concentration of the Euclidean norm for iid dimensions

[Demartines, 1994] analyses the dependency on the dimensionality of the expectation and variance of the Euclidean norm of a random vector, i.e. with iid components, and proves the following result:

Theorem 3. [Demartines, 1994] Let x ∈ R^D be a random vector with iid components: x^l ∼ F. Then,

E(‖x‖_2) = √(aD − b) + O(1/√D)    (2.11)

var(‖x‖_2) = b + O(1/√D)    (2.12)

where a and b are constants that do not depend on the dimension.

The theorem proves that the expectation of the Euclidean norm L_2 for iid data increases as the square root of the dimension, while the variance remains constant. Therefore, the expectation tends to dominate the variance with the increase in dimensionality and the relative contrast converges to zero.
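The behaviour described by theorem 3 is easy to reproduce; the following sketch (illustrative, with iid uniform components, not the thesis's own code) estimates the mean and standard deviation of the Euclidean norm for growing D: the mean grows roughly like √D while the standard deviation stays nearly constant.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 2000   # number of random vectors per dimensionality

    for D in (10, 100, 1000, 5000):
        X = rng.random((n, D))                 # iid components drawn from U(0, 1)
        norms = np.linalg.norm(X, axis=1)
        print(f"D={D:5d}: E[||x||] = {norms.mean():8.3f}   "
              f"std[||x||] = {norms.std():.4f}")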

Different data distributions generate different values of the parameters a and b, but the theorem remains valid whatever the data distribution. More important to the results are the relationships between dimensions/coordinates. According to the author, the results remain valid even when the condition of the theorem is broken, i.e. for non-iid dimensions/coordinates. In this case, the theorem stays true given that the original dimension D is replaced by the actual number of “degrees of freedom”, i.e. the intrinsic dimensionality.

2.2.4 The concentration of the cosine similarity for iid dimensions

The wide use of the cosine similarity in the field of information retrieval motivated the investigation of the concentration phenomenon for this similarity measure in [Radovanovic et al., 2010]. The analysis was performed by considering two random D-dimensional vectors x and y with iid components. The following two theorems have been proven:

Theorem 4. [Radovanovic et al., 2010]

lim_{D→∞} E(cos(x, y)) = const.    (2.13)

Theorem 5. [Radovanovic et al., 2010]

lim_{D→∞} √(var(cos(x, y))) = 0    (2.14)

The theorems show that the expectation of the cosine similarity converges to a constant with increasing dimensionality, while the square root of the variance converges to 0. The relative contrast thus converges to 0 and the cosine similarity is said to concentrate.
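A quick numeric check of theorems 4 and 5 (our own illustration, with iid standard normal components): the mean of cos(x, y) stays constant (here around zero) while its standard deviation shrinks with D.

    import numpy as np

    rng = np.random.default_rng(4)
    n_pairs = 2000

    for D in (10, 100, 1000, 5000):
        x = rng.standard_normal((n_pairs, D))
        y = rng.standard_normal((n_pairs, D))
        cos = (x * y).sum(axis=1) / (np.linalg.norm(x, axis=1)
                                     * np.linalg.norm(y, axis=1))
        print(f"D={D:5d}: E[cos] = {cos.mean():+.4f}   "
              f"sqrt(var[cos]) = {cos.std():.4f}")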

2.3 Correlation in high dimensions

The theoretical results from [Demartines, 1994; Hinneburg et al., 2000; Aggarwal et al., 2001] considered a data distribution with point coordinates independently drawn from an arbitrary distribution. The independence is a strong assumption and is rarely encountered in practical applications, where the coordinates or a subset of the coordinates are often correlated, thus dependent.

In this section we consider the correlation assumption and analyse the evolution of the Minkowski metric with increasing dimensionality and different degrees of correlation.

2.3.1 Minkowski distance distribution

We start by considering the following experiment: X is a D-dimensional dataset with N = 1000 points drawn from a Gaussian distribution with zero mean and unit variance on all dimensions. The extended Minkowski metric for different values of p = {0.25, 0.5, 1, 2, 3, 100} is used to compute the pairwise distances among points from the dataset with increasing dimensionality D = {1, 10, 20, 30, ..., 200}. The degree of correlation Σ_{ll′} between dimensions l and l′ ² is allowed to vary from totally uncorrelated to totally correlated:

1. Σ_{ll′} = 0, l ≠ l′;

² In our experiment, all dimensions have unit standard deviation (unit variance) and the correlations equal the covariances.


2. Σ_{ll′} = 0.5, l ≠ l′;

3. Σ_{ll′} = 1, l ≠ l′.

The distance distributions are generated for the three varying parameters:

1. p: the order of the Minkowski metric;

2. D: the dimensionality;

3. Σ_{ll′}: the degree of correlation.

Results are presented in figure 2.4, and the analysis of the impact of each of these parameters on the behaviour of the distance distributions is presented in the following.
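A condensed version of this experiment can be written as follows. This is an illustrative sketch of the protocol only, not the thesis's original code: equicorrelated Gaussian dimensions with correlation ρ are generated from a shared factor (which gives exactly unit variances and pairwise correlation ρ), and the pairwise L_p distance statistics are collected for each (ρ, p, D) setting. N is reduced from the 1000 points used in the thesis to keep the example light.

    import numpy as np

    rng = np.random.default_rng(5)
    N = 200   # reduced from the thesis's N = 1000 to limit memory use

    def sample_correlated_gaussian(n, d, rho):
        # Zero-mean, unit-variance dimensions with pairwise correlation rho.
        shared = rng.standard_normal((n, 1))      # common factor
        indep = rng.standard_normal((n, d))
        return np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * indep

    def pairwise_lp(X, p):
        diffs = np.abs(X[:, None, :] - X[None, :, :]) ** p
        return diffs.sum(axis=-1)[np.triu_indices(len(X), k=1)] ** (1.0 / p)

    for rho in (0.0, 0.5, 1.0):
        for p in (0.5, 1, 2):
            for D in (2, 50, 200):
                X = sample_correlated_gaussian(N, D, rho)
                d = pairwise_lp(X, p)
                print(f"rho={rho}, p={p}, D={D:3d}: mean={d.mean():9.3f}  "
                      f"std={d.std():8.3f}  std/mean={d.std() / d.mean():.3f}")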

Influence of the metric Lp

Let x_i and x_j be two points in D dimensions. For illustration, we consider that on every dimension the differences between coordinates are equal to a constant M, M ≥ 0. The distance between the two points becomes:

L_p(x_i, x_j) = [ ∑_{l=1}^{D} |x_i^l − x_j^l|^p ]^{1/p} = [ ∑_{l=1}^{D} M^p ]^{1/p} = [D M^p]^{1/p} = M D^{1/p}    (2.15)

Thus, given a fixed dimensionality D, an increase in p leads to a decrease in the magnitude of the L_p metric, converging to the constant M when p → ∞:

lim_{p→∞} L_p = lim_{p→∞} M D^{1/p} = M    (2.16)

The decrease in L_p with the increase in p is reflected in both figures 2.2 and 2.4. A consequence is that the means of the distance distributions decrease with the increase in p.

Influence of the dimensionality

The increase in dimensionality naturally leads to an increase in the value of the distance for fixed values of p. Fixing p and varying D in equation (2.15), we obtain:

lim_{D→∞} L_p = lim_{D→∞} M D^{1/p} = ∞    (2.17)

Thus, for fixed p, the values of the L_p metric always increase with the dimensionality as they belong to the family of power functions (f(x) = x^a, where a is a constant). Their rate of increase depends on the value of p (see figure 2.5): for p < 1, L_p increases at an increasing rate, for p = 1 it increases at a constant rate, while for p > 1 it increases at a decreasing rate.


Figure 2.4: Distribution of the Minkowski distance L_p for different dimensionalities D and different degrees of correlation Σ_{ll′}; rows correspond to p ∈ {0.25, 0.5, 1, 2, 3, 100} and columns to Σ_{ll′} ∈ {0, 0.5, 1}.


Influence of the degree of correlation

Under the independence assumption of the dimensions, according to the Central Limit Theorem, the distribution of the L_p metric approaches the normal distribution with increasing dimensionality D (see the first column of figure 2.4). For p < 1 we have seen that the L_p metric increases at an increasing rate (figure 2.5). Hence, the distance distribution moves away from the origin at an increasing rate with increasing dimensionality. It results that for low dimensionalities, e.g. D = 2 or D = 50, the distance distributions are very close to the origin when compared to higher dimensionalities, e.g. D = 200 (figure 2.4(a, d)). This behaviour fades away with the increase in p and the means of the distance distributions tend to get closer (figure 2.4(g, j, m, p)).

How does the presence of correlation among dimensions influence the distance distribution? First of all, the absence of the independence assumption among dimensions makes the Central Limit Theorem inapplicable: the distance distribution no longer approaches the normal distribution (see the second and the third columns of figure 2.4).³ In the case of independent dimensions, due to the convergence to the normal distribution, the use of the first two standardised moments – mean and standard deviation – was sufficient to characterise the distance distribution. For correlated data, the distributions are asymmetric and therefore have a non-vanishing skewness, the third standardised moment. The skewness is positive – skewness to the right – and increases with the correlation. A positive skewness indicates, in our case, that small distance values are more frequent than large distance values. This behaviour is observed independently of the Minkowski metric used or the dimensionality.

In the following we shall see that the presence of correlation has a positive effect on the discriminative power of the distances by significantly increasing both the absolute and relative contrast.
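The positive skewness that appears under correlation can be quantified with the third standardised moment; the sketch below (illustrative only, reusing the equicorrelated construction from the earlier sketch) compares the skewness of the Euclidean distance distribution for ρ = 0 and ρ = 0.5.

    import numpy as np

    rng = np.random.default_rng(6)
    N, D = 300, 100

    def sample_correlated_gaussian(n, d, rho):
        shared = rng.standard_normal((n, 1))
        indep = rng.standard_normal((n, d))
        return np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * indep

    def skewness(d):
        # Third standardised moment of the sample d.
        return np.mean((d - d.mean()) ** 3) / d.std() ** 3

    for rho in (0.0, 0.5):
        X = sample_correlated_gaussian(N, D, rho)
        diffs = X[:, None, :] - X[None, :, :]
        dist = np.sqrt((diffs ** 2).sum(-1))[np.triu_indices(N, k=1)]
        print(f"rho={rho}: skewness of the Euclidean distance distribution = "
              f"{skewness(dist):+.3f}")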

2.3.2 Relative and absolute contrast

Initially, the definitions of the absolute and relative contrast took into account only the contrast between the nearest and farthest neighbours, as they were analysed in the context of the nearest neighbour search. Here we are interested in the analysis of the distances beyond just the contrast between the most far away points. Hence the means and standard deviations of the distance distributions become the tools used in the following to investigate the behaviour of the L_p metric for different degrees of correlation and increasing dimensionalities:

• σ: the standard deviation of the distance distribution captures the spread of the distances in absolute terms (absolute contrast);

• σ/µ: the ratio between the standard deviation σ and the mean µ of the distance distribution captures the spread of the distances with respect to the absolute value of the average distance, i.e. the mean (relative contrast).⁴

³ Note that the scale of distance values changes between columns, that is, for different correlations.

⁴ In theorem 1 the square of the ratio σ/µ is considered: var(δ_ij) / (E[δ_ij])² = (σ/µ)².
