On the Spectrum of Random Features Maps of High Dimensional Data
Zhenyu Liao 1 Romain Couillet 1 2
Abstract
Random feature maps are ubiquitous in modern statistical machine learning, where they generalize random projections by means of powerful, yet often difficult to analyze, nonlinear operators. In this paper, we leverage the "concentration" phenomenon induced by random matrix theory to perform a spectral analysis on the Gram matrix of these random feature maps, here for Gaussian mixture models of simultaneously large dimension and size. Our results are instrumental to a deeper understanding of the interplay between the nonlinearity and the statistics of the data, thereby allowing for a better tuning of random feature-based techniques.
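As a concrete illustration of the object studied here, the short NumPy sketch below draws data from a two-class Gaussian mixture, applies a random feature map, and computes the eigenvalues of the resulting Gram matrix. It is our own toy setup with arbitrary sizes and a ReLU nonlinearity, not the paper's asymptotic analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, N = 256, 512, 1024          # data dimension, sample size, number of random features

# Two-class Gaussian mixture in dimension p (means +/- mu, identity covariance).
mu = np.zeros(p); mu[0] = 2.0
X = np.concatenate([rng.standard_normal((n // 2, p)) + mu,
                    rng.standard_normal((n // 2, p)) - mu]).T   # p x n

# Random feature map: sigma(W X) with Gaussian W and an entry-wise nonlinearity.
W = rng.standard_normal((N, p)) / np.sqrt(p)
Sigma = np.maximum(W @ X, 0.0)     # ReLU chosen here; other nonlinearities work too

# Gram matrix of the random features and its spectrum.
G = Sigma.T @ Sigma / N            # n x n
eigvals = np.linalg.eigvalsh(G)
print("largest eigenvalues:", np.round(eigvals[-5:], 3))
```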
However, the loss of information due to dimensionality reduction leads to distortions in the resulting projection [Aup07]. These multidimensional scaling artifacts are challenging in terms of interpretation and trust [CRMH12] for analysts who want to make inferences on HD data from the projection. There are several algorithms [Jol02], [BSL∗08], [PSPM12], [IMO09], [JPC∗11] and optimization criteria available for projecting high-dimensional data into a low-dimensional space. Many metrics and static visualization techniques exist to evaluate the optimization quality (stress) [Kru64], [Ven07], [BW96], [BCLC97], [SSK10], [SvLB10], [LA11] of the resulting projection and its visual quality [TBB∗10], [STMT12] when the class labels are known.
this Δ, compute and display the two distances d_1(Z), d_2(Z) of an instance Z to the chosen π_ν.
The article is organized as follows. In the Introduction, we already discussed the motivation for using a distance-dependent approach for both the visualization and classification of high-dimensional data. Next, we define and list the various common distance measures we may use. This is followed by the description of several possible class representations and class proximity measures. In particular, we introduce, discuss and compare four major categories for representing and positioning an instance in a Class Proximity (CP) plane. In the Results and Discussion section we first illustrate in detail, on a high-dimensional biomedical (metabolomic) dataset (¹H NMR spectra of a biofluid), several feasible possibilities and processes based on concepts of the Class Proximity Projection approach. We repeat this process for four datasets from the UCI Repository. We conclude with general observations and a summary.
targeted parameter (the algorithm is described in Section ). The authors show that the greedy C-TMLE algorithm exhibits superior relative performance in analyses of sparse data, at the cost of an increase in time complexity. For instance, in a problem with p baseline covariates, one would construct and select from p candidate estimators of the nuisance parameter, yielding a time complexity of order O(p²). Despite a criterion for early termination, the algorithm does not scale to large-scale, high-dimensional data. The aim of this article is to develop novel C-TMLE algorithms that overcome these serious practical limitations without compromising finite sample or asymptotic performance.
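To make the O(p²) cost concrete, the sketch below shows a generic greedy forward-selection loop in Python. It is a hypothetical illustration assuming a user-supplied `score` criterion, not the actual C-TMLE algorithm, whose candidate construction and selection criterion are more involved.

```python
# Minimal sketch (not the actual C-TMLE procedure): a generic greedy forward
# selection over p candidate covariates. Each outer step scores every remaining
# candidate, so roughly p + (p-1) + ... + 1 = O(p^2) evaluations are needed.
def greedy_forward_selection(candidates, score, max_steps=None):
    """`score(subset)` is a user-supplied (hypothetical) criterion to maximize;
    it must also accept the empty subset."""
    selected, remaining = [], list(candidates)
    best_score = score(selected)
    for _ in range(max_steps or len(candidates)):
        # One pass over all remaining candidates: O(p) evaluations per step.
        trials = [(score(selected + [c]), c) for c in remaining]
        new_score, best_c = max(trials, key=lambda t: t[0])
        if new_score <= best_score:        # early-termination criterion
            break
        selected.append(best_c)
        remaining.remove(best_c)
        best_score = new_score
    return selected, best_score
```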
Abstract
This paper presents the R package HDclassif, which is devoted to the clustering and the discriminant analysis of high-dimensional data. The classification methods proposed in the package result from a new parametrization of the Gaussian mixture model which combines the idea of dimension reduction with model constraints on the covariance matrices. The supervised classification method using this parametrization is called high-dimensional discriminant analysis (HDDA). In a similar manner, the associated clustering method is called high-dimensional data clustering (HDDC) and uses the expectation-maximization algorithm for inference. In order to correctly fit the data, both methods estimate the specific subspace and the intrinsic dimension of the groups. Due to the constraints on the covariance matrices, the number of parameters to estimate is significantly lower than in other model-based methods, which allows the methods to be stable and efficient in high dimensions. Two introductory examples illustrated with R code allow the user to discover the hdda and hddc functions. Experiments on simulated and real datasets also compare HDDC and HDDA with existing classification methods on high-dimensional datasets. HDclassif is free software distributed under the General Public License, as part of the R software project.
Abstract— Mapping high-dimensional data into a low-dimensional space, for example for visualization, is a problem of increasingly major concern in data analysis. This paper presents DD-HDS, a nonlinear mapping method that follows the line of the Multi-Dimensional Scaling (MDS) approach, based on the preservation of distances between pairs of data. It improves the performance of existing competitors with respect to the representation of high-dimensional data in two ways. It introduces i) a specific weighting of distances between data taking into account the concentration of measure phenomenon, and ii) a symmetric handling of short distances in the original and output spaces, avoiding false neighbor representations while still allowing some necessary tears in the original distribution. More precisely, the weighting is set according to the effective distribution of distances in the data set, with the exception of a single user-defined parameter setting the trade-off between local neighborhood preservation and global mapping. The optimization of the stress criterion designed for the mapping is realized by "Force Directed Placement". The mappings of low- and high-dimensional data sets are presented as illustrations of the features and advantages of the proposed algorithm. The weighting function specific to high-dimensional data and the symmetric handling of short distances can be easily incorporated into most distance-preservation-based nonlinear dimensionality reduction methods.
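As a rough illustration of the kind of weighted stress such methods optimize, the Python sketch below minimizes a generic distance-preservation stress with a simple Gaussian weighting of the original distances by plain gradient descent. The weighting, the symmetric treatment of short distances and the Force Directed Placement scheme of DD-HDS itself are different and more elaborate, so treat this only as a conceptual sketch with assumed parameter choices.

```python
import numpy as np

def weighted_mds(X, n_iter=500, lr=0.05, sigma=None, seed=0):
    """Minimize a weighted distance-preservation stress by gradient descent.

    Illustrative only: the weighting w_ij = exp(-delta_ij^2 / (2 sigma^2)) used
    here merely emphasizes short original distances; DD-HDS derives its own
    weighting from the empirical distance distribution.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    delta = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # original distances
    sigma = sigma or np.median(delta[delta > 0])
    W = np.exp(-delta**2 / (2 * sigma**2))                          # weight short distances more
    np.fill_diagonal(W, 0.0)

    Y = rng.normal(scale=1e-2, size=(n, 2))                         # 2-D embedding
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]
        D = np.linalg.norm(diff, axis=-1) + 1e-12                   # embedded distances
        # Gradient of the stress S = sum_{i<j} w_ij (D_ij - delta_ij)^2 w.r.t. Y.
        coef = 2 * W * (D - delta) / D
        grad = (coef[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad
    return Y
```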
Abstract
Clustering is a data mining technique intensively used for data analytics, with applications to marketing, security, text/document analysis, and sciences like biology and astronomy, among many others. The Dirichlet Process Mixture (DPM) is a model used for multivariate clustering with the advantage of discovering the number of clusters automatically, and it offers other favorable characteristics. However, in the case of high-dimensional data, it faces important numerical and theoretical pitfalls. The advantages of DPM come at the price of prohibitive running times, which impair its adoption and make centralized DPM approaches inefficient, especially with high-dimensional data. We propose HD4C (High Dimensional Data Distributed Dirichlet Clustering), a parallel clustering solution that addresses the curse of dimensionality in two ways. First, it gracefully scales to massive datasets by distributed computing, while remaining DPM-compliant. Second, it performs clustering of high-dimensional data such as time series (as a function of time), hyperspectral data (as a function of wavelength), etc. Our experiments, on both synthetic and real-world data, illustrate the high performance of our approach.
High-Dimensional Data Analysis
Since computer power enables massive computations, data analysis has been driven by the necessity to produce algorithms able to recover the structure of datasets from input points, e.g. using manifold reconstruction and metric approximation [19]. Many methods developed in the last decades make extensive use of distances and metrics in order to produce structures able to capture the manifold underlying the dataset, e.g. Delaunay triangulations [20], marching cubes [21], manifold reconstruction [22]. Moreover, many of those methods strongly rely on a partition of the space, e.g. Voronoi diagrams in order to produce Delaunay triangulations [20], and kd-trees [23] for nearest-neighbour search. These methods and techniques (e.g. use of distances, partition of the space, nearest-neighbour search) are affected in high dimensions by the curse of dimensionality [24], and consequently algorithms based on them may not be efficient. The term curse of dimensionality, introduced by Bellman in 1961 [25], [26], is nowadays used to refer to the class of phenomena that occur in high dimensions in contrast with the low-dimensional scenario. Important examples are the tendency of data to become very sparse in high dimensions [24], [27], and the concentration of distances. Usually dimensions d ≤ 6 are considered low. A high-dimensional regime has to be considered when the dimension d ≥ 10 [11].
Figure 1: Distribution of pairwise Euclidean distances in the artificial dataset.
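The concentration of distances mentioned above is easy to observe numerically. The short Python sketch below is a hypothetical toy experiment (i.i.d. standard Gaussian points, not the artificial dataset of Figure 1); it shows that the relative spread between the nearest and farthest pairwise Euclidean distances shrinks as the dimension grows.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n = 200  # number of points

for d in (2, 10, 100, 1000, 10000):
    X = rng.standard_normal((n, d))
    dists = pdist(X)                       # all n(n-1)/2 pairwise Euclidean distances
    # Relative contrast (max - min) / min shrinks as d grows:
    # in high dimension all points tend to be roughly equidistant.
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d = {d:5d}   relative contrast = {contrast:.3f}")
```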
2 Kernel PCA projection for visualizing high-dimensional data
Kernel PCA is the kernelized version of PCA, a popular projection method. It operates on a kernel matrix (i.e. a positive semi-definite similarity matrix) and extracts the non-linear principal manifolds underlying the similarity matrix (see Appendix A for details). The method maps these manifolds onto a vector space: we can thus build approximate, non-linear, 2D projections of high-dimensional data by selecting the 2 dominant eigen-dimensions and the values taken by the data elements on them.
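A minimal sketch of this use of kernel PCA for 2D visualization, using scikit-learn's KernelPCA; the dataset (a simple 3-D toy manifold standing in for high-dimensional data) and the kernel parameters are our own assumptions, not the paper's setup.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

# Synthetic data lying on a non-linear manifold.
X, color = make_swiss_roll(n_samples=1000, random_state=0)

# RBF kernel PCA: builds the kernel (similarity) matrix internally and keeps the
# 2 dominant eigen-dimensions as a non-linear 2D projection of the data.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.01)
X_2d = kpca.fit_transform(X)

print(X_2d.shape)          # (1000, 2): one 2D coordinate per data element
```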
Abstract. In many domains, the data objects are described in terms of a large number of features. The pipelined data mining approach introduced in [12], using two clustering algorithms in combination with rough sets and extended with genetic programming, is investigated with the purpose of discovering important subsets of attributes in high-dimensional data. Their classification ability is described in terms of both collections of rules and analytic functions obtained by genetic programming (gene expression programming). The Leader and several k-means algorithms are used as procedures for attribute set simplification of the information systems later presented to rough sets algorithms. Visual data mining techniques, including virtual reality, were used for inspecting results. The data mining process is set up using high-throughput distributed computing techniques. This approach was applied to Breast Cancer gene expression data and led to subsets of genes with high discrimination power with respect to the decision classes.
High-Dimensional Data Streams
Abstract: In this thesis, we propose a new deep learning approach for the classification of high-dimensional data streams. Over the past few years, neural networks have become the reference in various machine learning applications. However, most neural-network-based methods are designed to solve static learning problems, and performing deep learning online is a difficult task. The main difficulty is that neural-network-based classifiers generally rely on the assumption that the sequence of data batches used during training is stationary; in other words, that the class distribution of the data is the same for all batches (the i.i.d. assumption). When this assumption does not hold, neural networks tend to forget the concepts temporarily unavailable in the stream. In the scientific literature, this phenomenon is usually called catastrophic forgetting. The approaches we propose aim to guarantee the i.i.d. nature of each batch coming from the stream and to compensate for the absence of historical data. To this end, we train generative and pseudo-generative models capable of producing synthetic samples from the classes that are absent or under-represented in the stream, and complete the stream batches with these samples. We test our approaches in an incremental learning scenario and in a specific type of continual learning. Our approaches perform classification on dynamic data streams with an accuracy close to the results obtained in the static classification setting where all data are available for the duration of training. Furthermore, we demonstrate the ability of our methods to adapt to unseen data classes and to new instances of already known data categories, while avoiding forgetting previously acquired knowledge.
for reducing space needs and speeding up similarity search. This can be classically achieved by hashing techniques which map data onto lower-dimensional representations.
We recall from Chapter 2 that two hashing paradigms exist: data-independent and data-dependent hashing methods. On the one hand, Locality-Sensitive Hashing (LSH) [Andoni and Indyk, 2008] and its variants [Terasawa and Tanaka, 2007, Andoni et al., 2015b, Yu et al., 2014] belong to the data-independent paradigm. They rely on some random projection onto a c-lower-dimensional space followed by a scalar quantization returning the nearest vertex from the set {−1, 1}^c to get the binary codes (e.g. the sign function is applied point-wise). On the other hand, data-dependent methods [Wang et al., 2018] learn this projection from data instead and have been found to be more accurate for computing similarity-preserving binary codes. Among them, the unsupervised data-dependent Hypercubic hashing methods, embodied by ITerative Quantization (ITQ) [Gong et al., 2013], use Principal Component Analysis (PCA) to reduce the data dimensionality to c: the data are projected onto the first c principal components, chosen as the ones with the highest explained variance since they carry more information on variability. If we then directly mapped each resulting direction to one bit, each of them would get represented by the same volume of binary code (1 bit), although the c-th direction carries less information than the first one. This gives an intuition of why applying the PCA projection alone leads to poor performance of the obtained binary codes in the NN search task. This is why the data are often mixed through an isometry after the PCA projection so as to balance variance over the kept directions. See Figure 4.1, recalled from Section 2.3.2, for the general scheme of Hypercubic hashing methods.
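The two paradigms are easy to contrast in a few lines of NumPy. The sketch below is our own illustration with hypothetical sizes; it implements plain sign-of-projection coding, not the full ITQ optimization, which learns the rotation iteratively. It produces c-bit codes either from a data-independent random projection (LSH-style) or from a PCA projection followed by a random orthogonal mixing that balances variance over the kept directions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 5000, 256, 32          # points, input dimension, number of bits
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d)) / np.sqrt(d)  # correlated toy data
X -= X.mean(axis=0)

# --- Data-independent (LSH-style): random projection, then point-wise sign. ---
W = rng.standard_normal((d, c))
codes_lsh = np.sign(X @ W)                       # entries in {-1, +1}

# --- Data-dependent (Hypercubic-style): project onto the first c principal
# components, then mix them with a random orthogonal matrix so that variance is
# balanced over the kept directions before binarization.
# (ITQ would instead *learn* this rotation to minimize quantization error.)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_c = Vt[:c].T                                   # d x c, top-c principal directions
R, _ = np.linalg.qr(rng.standard_normal((c, c))) # random isometry of the c-dim space
codes_pca = np.sign(X @ V_c @ R)

print(codes_lsh.shape, codes_pca.shape)          # (5000, 32) each
```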
Editor: David Blei
Abstract
There is a widespread need for statistical methods that can analyze high-dimensional datasets without imposing restrictive or opaque modeling assumptions. This paper describes a domain-general data analysis method called CrossCat. CrossCat infers multiple non-overlapping views of the data, each consisting of a subset of the variables, and uses a separate nonparametric mixture to model each view. CrossCat is based on approximately Bayesian inference in a hierarchical, nonparametric model for data tables. This model consists of a Dirichlet process mixture over the columns of a data table in which each mixture component is itself an independent Dirichlet process mixture over the rows; the inner mixture components are simple parametric models whose form depends on the types of data in the table. CrossCat combines strengths of mixture modeling and Bayesian network structure learning. Like mixture modeling, CrossCat can model a broad class of distributions by positing latent variables, and produces representations that can be efficiently conditioned and sampled from for prediction. Like Bayesian networks, CrossCat represents the dependencies and independencies between variables, and thus remains accurate when there are multiple statistical signals. Inference is done via a scalable Gibbs sampling scheme; this paper shows that it works well in practice. This paper also includes empirical results on heterogeneous tabular data of up to 10 million cells, such as hospital cost and quality measures, voting records, unemployment rates, gene expression measurements, and images of handwritten digits. CrossCat infers structure that is consistent with accepted findings and common-sense knowledge in multiple domains and yields predictive accuracy competitive with generative, discriminative, and model-free alternatives.
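The nesting of the two Dirichlet process mixtures can be made concrete with a short generative sketch. The code below is our own simplified illustration: the concentration parameters and the omission of per-cell likelihoods are assumptions, not the paper's specification. It partitions columns into views with a Chinese restaurant process (CRP) and then, independently within each view, partitions rows into categories with another CRP.

```python
import numpy as np

def crp_partition(n_items, alpha, rng):
    """Sample a partition of n_items via the Chinese restaurant process."""
    assignments = [0]                      # first item opens the first table
    counts = [1]
    for _ in range(1, n_items):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)               # open a new table
        else:
            counts[table] += 1
        assignments.append(table)
    return np.array(assignments)

rng = np.random.default_rng(0)
n_rows, n_cols = 100, 12

# Outer DP mixture: partition the columns into views.
view_of_col = crp_partition(n_cols, alpha=1.0, rng=rng)

# Inner DP mixtures: one independent row partition per view.
row_partition_of_view = {
    v: crp_partition(n_rows, alpha=1.0, rng=rng)
    for v in np.unique(view_of_col)
}
# A full generative model would now draw each cell (row, col) from a simple
# parametric component indexed by (view_of_col[col], row category in that view).
print(view_of_col)
```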
An alternative and recent way of dealing with the problem of high-dimensional data classification is to model and classify the data in low-dimensional class-specific subspaces. The Gaussian models for high-dimensional data and their associated classification method HDDA (High-Dimensional Discriminant Analysis), defined in [4] and under study in the present paper, make it possible to efficiently model and classify complex high-dimensional data in a parsimonious way, since the model complexity is controlled by the intrinsic dimensions of the classes. In contrast to other generative methods which incorporate dimensionality reduction or variable selection for dealing with high-dimensional data [14, 20, 23], HDDA does not reduce the dimension but models the data of each class in a specific low-dimensional subspace. Thus, no information loss due to dimensionality reduction is to be deplored and all the available information is used to discriminate the classes. Furthermore, several submodels are defined by introducing constraints on the parameters in order to be able to model different types of data. The choice between these submodels can be made using classical model selection tools such as cross-validation or penalized likelihood criteria [17]. An additional advantage is that HDDA models require the tuning of only one parameter, which contributes to the selection of the dimension of each class-specific subspace (see Section 2.4). Finally, HDDA presents several numerical advantages compared to other generative classification methods: an explicit formulation of the inverse covariance matrix and the possibility of building the classifier when the number of learning observations is smaller than the dimension.
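To convey the flavor of such class-specific subspace modeling, here is a simplified sketch of our own, not the HDDA estimator of [4]: the submodel constraints, intrinsic-dimension selection and exact parametrization differ. Each class is modeled by its top-d_k principal directions plus an isotropic noise term, which yields a closed-form inverse covariance and log-density in the original space.

```python
import numpy as np

def fit_class_subspace(Xk, dk):
    """Gaussian model for one class: top-dk principal directions + isotropic noise."""
    mu = Xk.mean(axis=0)
    Xc = Xk - mu
    eigval, eigvec = np.linalg.eigh(Xc.T @ Xc / len(Xk))   # empirical covariance
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    Q = eigvec[:, :dk]                    # class-specific subspace
    lam = eigval[:dk]                     # signal variances
    b = max(eigval[dk:].mean(), 1e-8)     # shared noise variance outside the subspace
    return mu, Q, lam, b

def log_density(x, model, d):
    """Gaussian log-density with covariance Q diag(lam) Q^T + b (I - Q Q^T),
    whose inverse and determinant are explicit (no d x d inversion needed)."""
    mu, Q, lam, b = model
    xc = x - mu
    proj = Q.T @ xc                                   # coordinates in the subspace
    resid2 = xc @ xc - proj @ proj                    # squared norm outside the subspace
    maha = (proj**2 / lam).sum() + resid2 / b
    logdet = np.log(lam).sum() + (d - len(lam)) * np.log(b)
    return -0.5 * (maha + logdet + d * np.log(2 * np.pi))

# Toy usage: two classes in dimension 50, each with intrinsic dimension 3.
rng = np.random.default_rng(0)
d, dk = 50, 3

def make_class(shift):
    return (rng.standard_normal((200, dk)) @ rng.standard_normal((dk, d))
            + 0.1 * rng.standard_normal((200, d)) + shift)

X0, X1 = make_class(0.0), make_class(1.0)
models = [fit_class_subspace(X0, dk), fit_class_subspace(X1, dk)]

x = X1[0]
pred = int(np.argmax([log_density(x, m, d) for m in models]))
print("predicted class:", pred)
```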
d go to infinity on ‖X‖_p has surprising consequences, which may dramatically affect high-dimensional data processing. This effect is usually referred to as the distance concentration phenomenon in the computational learning literature. Despite a growing interest in this important question, previous work has essentially characterized the problem in terms of numerical experiments and incomplete mathematical statements. In the present paper, we solidify some of the arguments which previously appeared in the literature and offer new insights into the phenomenon.
d. These subsets form a partition of the positive unit sphere S^{d−1}_+. Regarding question (Q1), this partition allows us to deal with high-dimensional data. Indeed, for β ∈ P_d^* with cardinality b, the subset C_β can be seen as part of the sphere S^{b−1}_+. Therefore, as soon as b is moderate compared to d, the use of C_β reduces the dimension of the study. The idea is then to provide methods to learn on which of these subsets the spectral measure puts mass. This is developed in Chapter 2 and Chapter 3. All the approaches introduced in this section rely on asymptotic results for multivariate random vectors (see Proposition 1.2.2 and Proposition 1.2.3). However, in a statistical context we only have a finite data set at our disposal. Therefore the convergences that appear in the aforementioned propositions become approximations. In particular, Equation (1.2.17) can be used to study the behavior of the spectral vector Θ as soon as the threshold t is "large enough". This is why particular attention should be paid to the choice of this threshold t, or equivalently to the number of data points considered to be extreme (see question (Q3)). One way to deal with this issue is to use model selection to identify for which threshold t the approximation is the most accurate.
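A rough numerical counterpart of this idea is sketched below. It is our own toy illustration with hypothetical choices of norm, threshold and tolerance, not the estimator developed in Chapters 2 and 3: it selects the k most extreme observations, normalizes them onto the positive unit sphere, and records which subsets of coordinates carry their angular mass.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
n, d = 20_000, 5
# Heavy-tailed non-negative toy data; real applications would use observed data.
X = np.abs(rng.standard_t(df=2.5, size=(n, d)))

# Threshold t chosen so that exactly k observations are considered extreme.
norms = X.sum(axis=1)                       # L1 norm
k = 200
t = np.partition(norms, n - k)[n - k]
extremes = X[norms >= t]

# Angular components Theta on the positive unit sphere (here, the L1 simplex).
Theta = extremes / extremes.sum(axis=1, keepdims=True)

# Subset beta of coordinates on which each angular component puts noticeable mass.
eps = 0.1
subsets = [tuple(np.flatnonzero(theta > eps)) for theta in Theta]

# Empirical proportion of extremes falling in each C_beta: a crude picture of
# where the spectral measure puts mass; rerunning with several values of t
# mimics the threshold-selection question (Q3).
for beta, count in Counter(subsets).most_common(5):
    print(list(beta), round(count / len(subsets), 3))
```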
methods hardly scale to high-dimensional data [11], [12]. The recent rise in performance of generative models such as generative adversarial networks (GAN) [13] or variational autoencoders (VAE) [14], [15] has made them very attractive models to perform DA. GANs have already seen wide use in many fields of application [16], [17], [18], [19], [20], including medicine [21]. For instance, GANs were used on magnetic resonance images (MRI) [22], [23], computed tomography (CT) [24], [25], X-ray [26], [27], [28], positron emission tomography (PET) [29], mass spectroscopy data [30], dermoscopy [31] or mammography [32], [33] and demonstrated promising results. Nonetheless, most of these studies involved either a quite large training set (above 1000 training samples) or quite low-dimensional data, whereas in everyday medical applications it remains very challenging to gather such large cohorts of labeled patients. As a consequence, as of today, the case of high-dimensional data combined with a very low sample size remains poorly explored. When compared to GANs, VAEs have only seen very marginal interest for performing DA and were mostly used for speech applications [34], [35], [36]. Some attempts to use such generative models on medical data either for classification [37], [38] or segmentation tasks [39], [40], [41] can nonetheless be noted. The main limitation to a wider use of these models is that they most of the time produce blurry and fuzzy samples. This undesirable effect is even more pronounced when they are trained with a small number of samples, which makes them very hard to use in practice to perform DA in the high-dimensional (very) low sample size (HDLSS) setting.
matrix with O(d²) entries (where d is the data dimension) to account for correlation between pairs of features. For high-dimensional data (say, d > 10⁴), this is problematic for at least three reasons: (i) training the metric is computationally expensive (quadratic or cubic in d), (ii) the matrix may not even fit in memory, and (iii) learning so many parameters is likely to lead to severe overfitting, especially for sparse data where some features are rarely observed.
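A quick back-of-the-envelope computation (our own illustration; the choice of float64 storage and these particular values of d are just assumptions) makes points (i) and (ii) concrete:

```python
# Storage and parameter count of a full d x d Mahalanobis metric matrix.
for d in (10**3, 10**4, 10**5):
    n_params = d * d                       # O(d^2) entries to learn
    gib = n_params * 8 / 2**30             # float64 storage in GiB
    print(f"d = {d:>6}: {n_params:.2e} parameters, ~{gib:,.1f} GiB")
# d =   1000: 1.00e+06 parameters, ~0.0 GiB
# d =  10000: 1.00e+08 parameters, ~0.7 GiB
# d = 100000: 1.00e+10 parameters, ~74.5 GiB
```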
or mouse social defeat) have been shown to be significantly associated with overall survival. It is even more surprising that many random signatures can outperform most breast cancer signatures [8]. Several authors have suggested that the selected sets of genes are not unique and are strongly influenced by the subset of patients included in the training cohort [9, 10] and by the variable selection procedures [11–14]. For low-dimensional data, the reference method to study associations with time-to-event endpoints is the Cox proportional hazards model. In the context of high-dimensional data (number of covariates >> number of observations), the Cox model may be non-identifiable. Extensions based on boosting or penalized regression are proposed in the literature to overcome these hurdles [15–18], as they shrink the regression coefficients towards zero. As an alternative to the Cox extensions, methods based on random forests have been adapted for survival analysis [19]. This nonparametric method, random survival forest (RSF), combines multiple decision trees built on randomly selected subsets of variables. Since feature selection methods are questioned, it seems important to thoroughly assess and compare existing strategies that are significant components in prognostic signature development. Many studies were interested in the false discovery rates or prognostic performances achieved by multiple variable selection methods and compared them on simulated or real datasets [20–23]. However, the impact of the training set on the stability of the results was only assessed by Michiels et al. [9], on a binary endpoint with a selection based on Pearson's correlation, and did not evaluate most recent approaches. The main objective of this publication is to compare six typical feature selection methods which are commonly used for high-dimensional data in the context of survival analysis. For this purpose, and as recommended in the literature [24], a simulation study is performed, with a special focus on variable selection and prediction performance according to multiple data configurations (sample size of the training set, number of genes associated with survival). Feature selection methods are then applied on published data to explore stability and prognostic performances in a real breast cancer dataset.
finite mixture model | flow cytometry | multivariate skew distribution
Flow cytometry transformed clinical immunology and hematology over 2 decades ago by allowing the rapid interrogation of cell surface determinants and, more recently, by enabling the analysis of intracellular events using fluorophore-conjugated antibodies or markers. Although flow cytometry initially allowed the investigation of only a single fluorophore, recent advances allow close to 20 parallel channels for monitoring different determinants (1–4). These advances have now surpassed our ability to interpret manually the resulting high-dimensional data and have led to growing interest and recent activity in the development of new computational tools and approaches (5–8).