On the Spectrum of Random Features Maps of High Dimensional Data
Zhenyu Liao, Romain Couillet
Abstract
Random feature maps are ubiquitous in modern statistical machine learning, where they generalize random projections by means of powerful, yet often difficult to analyze nonlinear operators. In this paper, we leverage the "concentration" phenomenon induced by random matrix theory to perform a spectral analysis on the Gram matrix of these random feature maps, here for Gaussian mixture models of simultaneously large dimension and size. Our results are instrumental to a deeper understanding of the interplay of the nonlinearity and the statistics of the data, thereby allowing for a better tuning of random feature-based techniques.
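The object the abstract studies, the Gram matrix of a random feature map applied to a Gaussian mixture, can be reproduced in a few lines. A minimal sketch; the sizes, the two-class mixture, and the ReLU nonlinearity are illustrative assumptions, not the paper's exact setting:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, N = 256, 512, 1024           # data dimension, sample size, number of random features

# Two-class Gaussian mixture with opposite means (illustrative statistics)
mu = np.zeros(p); mu[0] = 2.0
X = np.concatenate([rng.normal(0, 1, (n // 2, p)) - mu,
                    rng.normal(0, 1, (n // 2, p)) + mu]).T / np.sqrt(p)

W = rng.normal(0, 1, (N, p))       # random projection matrix
Sigma = np.maximum(W @ X, 0)       # nonlinear feature map (ReLU, one possible choice)
G = Sigma.T @ Sigma / N            # Gram matrix of the random feature map

eigvals = np.linalg.eigvalsh(G)    # spectrum, in ascending order
print(eigvals[-5:])                # the largest (isolated) eigenvalues carry class structure
```

Varying the nonlinearity in `Sigma` (e.g. `np.tanh` instead of ReLU) and re-inspecting the spectrum is a quick way to see the interplay between the nonlinearity and the data statistics that the paper analyzes.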

However, the loss of information due to dimensionality reduction leads to distortions in the resulting projection [Aup07]. These multidimensional scaling artifacts are challenging in terms of interpretation and trust [CRMH12] for analysts who want to make inferences on HD data from the projection. There are several algorithms [Jol02], [BSL∗08], [PSPM12], [IMO09], [JPC∗11] and optimization criteria available for projecting high-dimensional data in a low-dimensional space. Many metrics and static visualization techniques exist to evaluate the optimization quality (stress) [Kru64], [Ven07], [BW96], [BCLC97], [SSK10], [SvLB10], [LA11] of the resulting projection and its visual quality [TBB∗10], [STMT12] when the class labels are known.

this Δ, compute and display the two distances d1(Z) and d2(Z) of an instance Z to the chosen πν.
The article is organized as follows. In the Introduction, we already discussed the motivation for using a distance-dependent approach for both the visualization and classification of high-dimensional data. Next, we define and list the various common distance measures we may use. This is followed by the description of several possible class representations and class proximity measures. In particular, we introduce, discuss and compare four major categories for representing and positioning an instance in a Class Proximity (CP) plane. In the Results and Discussion section we first illustrate in detail, on a high-dimensional biomedical (metabolomic) dataset (¹H NMR spectra of a biofluid), several feasible possibilities and processes based on concepts of the Class Proximity Projection approach. We repeat this process for four datasets from the UCI Repository. We conclude with general observations and a summary.


targeted parameter (the algorithm is described in Section ). The authors show the greedy C-TMLE algorithm exhibits superior relative performance in analyses of sparse data, at the cost of an increase in time complexity. For instance, in a problem with p baseline covariates, one would construct and select from p candidate estimators of the nuisance parameter, yielding a time complexity of order O(p²). Despite a criterion for early termination, the algorithm does not scale to large-scale and high-dimensional data. The aim of this article is to develop novel C-TMLE algorithms that overcome these serious practical limitations without compromising finite sample or asymptotic performance.


Abstract
This paper presents the R package HDclassif, which is devoted to the clustering and the discriminant analysis of high-dimensional data. The classification methods proposed in the package result from a new parametrization of the Gaussian mixture model which combines the idea of dimension reduction and model constraints on the covariance matrices. The supervised classification method using this parametrization is called high dimensional discriminant analysis (HDDA). In a similar manner, the associated clustering method is called high dimensional data clustering (HDDC) and uses the expectation-maximization algorithm for inference. In order to correctly fit the data, both methods estimate the specific subspace and the intrinsic dimension of the groups. Due to the constraints on the covariance matrices, the number of parameters to estimate is significantly lower than in other model-based methods, which allows the methods to be stable and efficient in high dimensions. Two introductory examples illustrated with R code allow the user to discover the hdda and hddc functions. Experiments on simulated and real datasets also compare HDDC and HDDA with existing classification methods on high-dimensional datasets. HDclassif is free software distributed under the General Public License, as part of the R software project.


Abstract— Mapping high-dimensional data in a low-dimensional space, for example for visualization, is a problem of increasingly major concern in data analysis. This paper presents DD-HDS, a nonlinear mapping method that follows the line of the Multidimensional Scaling (MDS) approach, based on the preservation of distances between pairs of data. It improves the performance of existing competitors with respect to the representation of high-dimensional data in two ways. It introduces i) a specific weighting of distances between data taking into account the concentration of measure phenomenon, and ii) a symmetric handling of short distances in the original and output spaces, avoiding false neighbor representations while still allowing some necessary tears in the original distribution. More precisely, the weighting is set according to the effective distribution of distances in the data set, with the exception of a single user-defined parameter setting the trade-off between local neighborhood preservation and global mapping. The optimization of the stress criterion designed for the mapping is realized by "Force Directed Placement". The mappings of low- and high-dimensional data sets are presented as illustrations of the features and advantages of the proposed algorithm. The weighting function specific to high-dimensional data and the symmetric handling of short distances can be easily incorporated in most distance-preservation-based nonlinear dimensionality reduction methods.


Abstract
Clustering is a data mining technique intensively used for data analytics, with applications to marketing, security, text/document analysis, or sciences like biology, astronomy, and many more. The Dirichlet Process Mixture (DPM) is a model used for multivariate clustering with the advantage of discovering the number of clusters automatically and offering favorable characteristics. However, in the case of high-dimensional data, it faces an important challenge with numerical and theoretical pitfalls. The advantages of DPM come at the price of prohibitive running times, which impair its adoption and make centralized DPM approaches inefficient, especially with high-dimensional data. We propose HD4C (High Dimensional Data Distributed Dirichlet Clustering), a parallel clustering solution that addresses the curse of dimensionality in two ways. First, it gracefully scales to massive datasets by distributed computing, while remaining DPM-compliant. Second, it performs clustering of high-dimensional data such as time series (as a function of time), hyperspectral data (as a function of wavelength), etc. Our experiments, on both synthetic and real-world data, illustrate the high performance of our approach.


Figure 1: Distribution of pairwise Euclidean distances in the artificial dataset.
2 Kernel PCA projection for visualizing high-dimensional data
Kernel PCA is the kernelized version of PCA, a popular projection method. It operates on a kernel matrix (i.e. a positive semi-definite similarity matrix), and extracts nonlinear principal manifolds underlying the similarity matrix (see Appendix A for details). The method maps these manifolds onto a vector space: thus, we can build approximate, nonlinear, 2D projections of high-dimensional data by selecting the 2 dominant eigen-dimensions and the values taken by the data elements on them.
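The two-step recipe above (build a kernel matrix, then keep the 2 dominant eigen-dimensions) can be sketched as follows. The RBF kernel and all parameter values are illustrative assumptions; any positive semi-definite kernel would do:

```python
import numpy as np

def kernel_pca_2d(X, gamma=1.0):
    """Project rows of X onto the 2 dominant eigen-dimensions of a
    double-centered RBF kernel matrix (a minimal kernel PCA sketch)."""
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))  # RBF kernel matrix
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                           # center the kernel in feature space
    vals, vecs = np.linalg.eigh(Kc)          # eigendecomposition, ascending order
    top_vecs = vecs[:, -2:][:, ::-1]         # 2 dominant eigenvectors
    top_vals = np.maximum(vals[-2:][::-1], 0)
    return top_vecs * np.sqrt(top_vals)      # 2D coordinates of the data elements

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 50))               # 100 points in 50 dimensions
Y = kernel_pca_2d(X, gamma=0.01)
print(Y.shape)                               # (100, 2)
```

The resulting `Y` can be fed directly to a 2D scatter plot, which is exactly the visualization use case this section describes.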


Abstract. In many domains, data objects are described in terms of a large number of features. The pipelined data mining approach introduced in [12], using two clustering algorithms in combination with rough sets and extended with genetic programming, is investigated with the purpose of discovering important subsets of attributes in high-dimensional data. Their classification ability is described in terms of both collections of rules and analytic functions obtained by genetic programming (gene expression programming). The Leader and several k-means algorithms are used as procedures for attribute set simplification of the information systems later presented to rough sets algorithms. Visual data mining techniques including virtual reality were used for inspecting results. The data mining process is set up using high-throughput distributed computing techniques. This approach was applied to breast cancer gene expression data and it led to subsets of genes with high discrimination power with respect to the decision classes.


for reducing space needs and speeding up similarity search. This can be classically achieved by hashing techniques which map data onto lower-dimensional representations.
We recall from Chapter 2 that two hashing paradigms exist: data-independent and data-dependent hashing methods. On the one hand, Locality-Sensitive Hashing (LSH) [Andoni and Indyk, 2008] and its variants [Terasawa and Tanaka, 2007, Andoni et al., 2015b, Yu et al., 2014] belong to the data-independent paradigm. They rely on some random projection onto a c-lower-dimensional space followed by a scalar quantization returning the nearest vertex from the set {−1, 1}^c for getting the binary codes (e.g. the sign function is applied point-wise). On the other hand, data-dependent methods [Wang et al., 2018] learn this projection from data instead and have been found to be more accurate for computing similarity-preserving binary codes. Among them, the unsupervised data-dependent Hypercubic hashing methods, embodied by ITerative Quantization (ITQ) [Gong et al., 2013], use Principal Component Analysis (PCA) to reduce data dimensionality to c: the data is projected onto the first c principal components, chosen as the ones with the highest explained variance as they carry more information on variability. If we then directly mapped each resulting direction to one bit, each of them would get represented by the same volume of binary code (1 bit), although the c-th direction should carry less information than the first one. Thus, one can intuitively understand why applying the PCA projection alone leads to poor performance of the obtained binary codes in the NN search task. This is why the data often get mixed through an isometry after the PCA projection, so as to balance variance over the kept directions. See Figure 4.1, a recall from Section 2.3.2, for the general scheme of Hypercubic hashing methods.
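The data-independent scheme described first (random projection onto c dimensions, then point-wise sign quantization onto {−1, 1}^c) can be sketched in a few lines. Sizes are illustrative; the learned rotation used by ITQ-style methods is deliberately not implemented here:

```python
import numpy as np

rng = np.random.default_rng(0)
d, c, n = 128, 16, 1000            # input dimension, code length, number of points

X = rng.normal(size=(n, d))        # toy dataset
R = rng.normal(size=(d, c))        # data-independent random projection (LSH-style)
codes = np.sign(X @ R)             # point-wise sign -> nearest vertex of {-1, 1}^c

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return int(np.sum(a != b))

# Hamming distance between codes serves as a cheap proxy for similarity search
print(hamming(codes[0], codes[1]), "of", c, "bits differ")
```

Data-dependent methods keep the same quantization step but replace the random `R` with a projection learned from `X` (e.g. the top-c principal components followed by a balancing rotation, as in ITQ).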


Editor: David Blei
Abstract
There is a widespread need for statistical methods that can analyze high-dimensional datasets without imposing restrictive or opaque modeling assumptions. This paper describes a domain-general data analysis method called CrossCat. CrossCat infers multiple non-overlapping views of the data, each consisting of a subset of the variables, and uses a separate nonparametric mixture to model each view. CrossCat is based on approximately Bayesian inference in a hierarchical, nonparametric model for data tables. This model consists of a Dirichlet process mixture over the columns of a data table in which each mixture component is itself an independent Dirichlet process mixture over the rows; the inner mixture components are simple parametric models whose form depends on the types of data in the table. CrossCat combines strengths of mixture modeling and Bayesian network structure learning. Like mixture modeling, CrossCat can model a broad class of distributions by positing latent variables, and produces representations that can be efficiently conditioned and sampled from for prediction. Like Bayesian networks, CrossCat represents the dependencies and independencies between variables, and thus remains accurate when there are multiple statistical signals. Inference is done via a scalable Gibbs sampling scheme; this paper shows that it works well in practice. This paper also includes empirical results on heterogeneous tabular data of up to 10 million cells, such as hospital cost and quality measures, voting records, unemployment rates, gene expression measurements, and images of handwritten digits. CrossCat infers structure that is consistent with accepted findings and common-sense knowledge in multiple domains and yields predictive accuracy competitive with generative, discriminative, and model-free alternatives.


An alternative and recent way of dealing with the problem of high-dimensional data classification is to model and classify the data in low-dimensional class-specific subspaces. The Gaussian models for high-dimensional data and their associated classification method HDDA (High-Dimensional Discriminant Analysis), defined in [4] and under study in the present paper, allow one to efficiently model and classify complex high-dimensional data in a parsimonious way, since the model complexity is controlled by the intrinsic dimensions of the classes. In contrast to other generative methods which incorporate dimensionality reduction or variable selection for dealing with high-dimensional data [14, 20, 23], HDDA does not reduce the dimension while modeling the data of each class in a specific low-dimensional subspace. Thus, no information loss due to data dimensionality reduction is incurred, and all the available information is used to discriminate the classes. Furthermore, several submodels are defined by introducing constraints on the parameters in order to be able to model different types of data. The choice between these submodels can be made using classical model selection tools such as cross-validation or penalized likelihood criteria [17]. An additional advantage is that HDDA models require the tuning of only one parameter, which contributes to the selection of the dimension of each class-specific subspace (see Section 2.4). Finally, HDDA presents several numerical advantages compared to other generative classification methods: an explicit formulation of the inverse covariance matrix, and the possibility of building the classifier when the number of learning observations is smaller than the dimension.


d go to infinity on ‖X‖_p has surprising consequences, which may dramatically affect high-dimensional data processing. This effect is usually referred to as the distance concentration phenomenon in the computational learning literature. Despite a growing interest in this important question, previous work has essentially characterized the problem in terms of numerical experiments and incomplete mathematical statements. In the present paper, we solidify some of the arguments which previously appeared in the literature and offer new insights into the phenomenon.
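The distance concentration phenomenon is easy to observe numerically. A minimal sketch; the uniform data model and the max/min contrast measure are illustrative choices, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(d, n=1000):
    """(Dmax - Dmin) / Dmin over the Euclidean norms of n points
    drawn uniformly from the unit cube [0, 1]^d."""
    X = rng.uniform(size=(n, d))
    norms = np.linalg.norm(X, axis=1)
    return (norms.max() - norms.min()) / norms.min()

for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 3))   # contrast shrinks as d grows
```

As d increases, the norms concentrate around sqrt(d/3) and the relative contrast collapses toward 0: nearest and farthest neighbors become nearly indistinguishable, which is precisely what makes high-dimensional data processing difficult.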


d. These subsets form a partition of the positive unit sphere S^{d−1}_+. Regarding question (Q1), this partition allows one to deal with high-dimensional data. Indeed, for β ∈ P_d^∗ with cardinality b, the subset C_β can be seen as part of the sphere S^{b−1}_+. Therefore, as soon as b is moderate compared to d, the use of C_β reduces the dimension of the study. The idea is then to provide methods to learn on which of these subsets the spectral measure puts mass. This is developed in Chapter 2 and Chapter 3. All the approaches introduced in this section rely on asymptotic results for multivariate random vectors (see Proposition 1.2.2 and Proposition 1.2.3). However, in a statistical context we only have a finite data set at our disposal. Therefore the convergences that appear in the aforementioned propositions become approximations. In particular, Equation (1.2.17) can be used to study the behavior of the spectral vector Θ as soon as the threshold t is "large enough". This is why particular attention should be paid to the choice of this threshold t, or equivalently to the number of data points considered to be extreme (see question (Q3)). One way to deal with this issue is to use model selection to identify for which threshold t the approximation is the most accurate.


methods hardly scale to high-dimensional data [11], [12]. The recent rise in performance of generative models such as generative adversarial networks (GAN) [13] or variational autoencoders (VAE) [14], [15] has made them very attractive models to perform DA. GANs have already seen wide use in many fields of application [16], [17], [18], [19], [20], including medicine [21]. For instance, GANs were used on magnetic resonance images (MRI) [22], [23], computed tomography (CT) [24], [25], X-ray [26], [27], [28], positron emission tomography (PET) [29], mass spectroscopy data [30], dermoscopy [31] or mammography [32], [33], and demonstrated promising results. Nonetheless, most of these studies involved either a quite large training set (above 1000 training samples) or quite low-dimensional data, whereas in everyday medical applications it remains very challenging to gather such large cohorts of labeled patients. As a consequence, as of today, the case of high-dimensional data combined with a very low sample size remains poorly explored. When compared to GANs, VAEs have only seen a very marginal interest for performing DA and were mostly used for speech applications [34], [35], [36]. Some attempts to use such generative models on medical data, either for classification [37], [38] or segmentation tasks [39], [40], [41], can nonetheless be noted. The main limitation to a wider use of these models is that most of the time they produce blurry and fuzzy samples. This undesirable effect is even more emphasized when they are trained with a small number of samples, which makes them very hard to use in practice to perform DA in the high-dimensional (very) low sample size (HDLSS) setting.


matrix with O(d²) entries (where d is the data dimension) to account for correlation between pairs of features. For high-dimensional data (say, d > 10⁴), this is problematic for at least three reasons: (i) training the metric is computationally expensive (quadratic or cubic in d), (ii) the matrix may not even fit in memory, and (iii) learning so many parameters is likely to lead to severe overfitting, especially for sparse data where some features are rarely observed.
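A quick back-of-the-envelope sketch makes reason (ii) concrete, and shows the standard workaround of restricting the metric matrix to a diagonal. The d value and the random weights are illustrative assumptions:

```python
import numpy as np

d = 10_000                         # a high-dimensional setting, as in the text
full_params = d * d                # entries of a full metric matrix M
memory_gb = full_params * 8 / 1e9  # float64 storage
print(f"full matrix: {full_params:,} parameters, ~{memory_gb:.1f} GB as float64")

# Common workaround: a diagonal metric needs only d parameters instead of d^2
rng = np.random.default_rng(0)
x, y = rng.normal(size=d), rng.normal(size=d)
w = np.abs(rng.normal(size=d))     # nonnegative diagonal weights (illustrative, not learned)
dist = np.sqrt(np.sum(w * (x - y) ** 2))   # diagonal Mahalanobis-style distance
print(round(dist, 3))
```

The diagonal restriction trades expressiveness (no cross-feature correlations) for O(d) storage and training cost, which is often the only feasible option at this scale.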


or mouse social defeat) have been shown to be significantly associated to overall survival. It is even more surprising that many random signatures can outperform most breast cancer signatures [8]. Several authors have suggested that the selected sets of genes are not unique and are strongly influenced by the subset of patients included in the training cohort [9, 10] and by the variable selection procedures [11–14]. For low-dimensional data, the reference method to study associations with time-to-event endpoints is the Cox proportional hazards model. In the context of high-dimensional data (number of covariates >> number of observations), the Cox model may be nonidentifiable. Extensions, based on boosting or penalized regression, are proposed in the literature to overcome these hurdles [15–18], as they shrink the regression coefficients towards zero. As an alternative to the Cox extensions, methods based on random forests have been adapted for survival analysis [19]. This nonparametric method, random survival forest (RSF), combines multiple decision trees built on randomly selected subsets of variables. Since feature selection methods are questioned, it seems important to thoroughly assess and compare existing strategies that are significant components in prognostic signature development. Many studies were interested in false discovery rates or prognostic performances achieved by multiple variable selection methods and compared them on simulated or real datasets [20–23]. However, the impact of the training set on the stability of the results was only assessed by Michiels et al. [9] on a binary endpoint with a selection based on Pearson's correlation, and did not evaluate most recent approaches. The main objective of this publication is to compare six typical different feature selection methods which are commonly used for high-dimensional data in the context of survival analysis.
For this purpose and as recommended in the literature [24], a simulation study is performed, with special focus on variable selection and prediction performance according to multiple data configurations (sample size of the training set, number of genes associated with survival). Feature selection methods are then applied on published data to explore stability and prognostic performances in a real breast cancer dataset.


finite mixture model | flow cytometry | multivariate skew distribution
Flow cytometry transformed clinical immunology and hematology over 2 decades ago by allowing the rapid interrogation of cell surface determinants and, more recently, by enabling the analysis of intracellular events using fluorophore-conjugated antibodies or markers. Although flow cytometry initially allowed the investigation of only a single fluorophore, recent advances allow close to 20 parallel channels for monitoring different determinants (1–4). These advances have now surpassed our ability to interpret manually the resulting high-dimensional data and have led to growing interest and recent activity in the development of new computational tools and approaches (5–8).
