High dimensional data

On the Spectrum of Random Features Maps of High Dimensional Data

Zhenyu Liao, Romain Couillet. Abstract: Random feature maps are ubiquitous in modern statistical machine learning, where they generalize random projections by means of powerful, yet often difficult to analyze, nonlinear operators. In this paper, we leverage the “concentration” phenomenon induced by random matrix theory to perform a spectral analysis of the Gram matrix of these random feature maps, here for Gaussian mixture models of simultaneously large dimension and size. Our results are instrumental to a deeper understanding of the interplay between the nonlinearity and the statistics of the data, thereby allowing for a better tuning of random feature-based techniques.
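The abstract studies the spectrum of the Gram matrix of nonlinear random features on Gaussian mixture data. The sketch below is only a rough numerical illustration of that object, not the paper's analysis: it draws a two-class Gaussian mixture, applies a random ReLU feature map, and computes the eigenvalues of the resulting Gram matrix. All dimensions, the mean separation, and the choice of nonlinearity are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-class Gaussian mixture in dimension p
p, n_per_class = 512, 256
means = [np.zeros(p), np.full(p, 2.0 / np.sqrt(p))]
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(n_per_class, p)) for m in means])  # (n, p)

# Random feature map sigma(W x) with a ReLU nonlinearity
N = 1024                                   # number of random features
W = rng.normal(size=(N, p)) / np.sqrt(p)   # random projection matrix
Sigma = np.maximum(W @ X.T, 0.0)           # (N, n) matrix of random features

# Gram matrix of the random features and its eigenvalue spectrum
G = Sigma.T @ Sigma / N                    # (n, n)
eigvals = np.linalg.eigvalsh(G)
print("five largest eigenvalues:", np.round(eigvals[-5:], 3))
```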

ProxiLens: Interactive Exploration of High-Dimensional Data using Projections

However, the loss of information due to dimensionality reduction leads to distortions in the resulting projection [Aup07]. These multidimensional scaling artifacts are challenging in terms of interpretation and trust [CRMH12] for analysts who want to make inferences about HD data from the projection. There are several algorithms [Jol02], [BSL∗08], [PSPM12], [IMO09], [JPC∗11] and optimization criteria available for projecting high-dimensional data into a low-dimensional space. Many metrics and static visualization techniques exist to evaluate the optimization quality (stress) [Kru64], [Ven07], [BW96], [BCLC97], [SSK10], [SvLB10], [LA11] of the resulting projection, and its visual quality [TBB∗10], [STMT12] when the class labels are known.
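One of the most common "optimization quality (stress)" measures cited above is Kruskal-style normalized stress between pairwise distances in the original and projected spaces. The sketch below computes it for a crude random linear projection; the data and the projection are arbitrary assumptions, not anything produced by ProxiLens.

```python
import numpy as np
from scipy.spatial.distance import pdist

def normalized_stress(X_high, X_low):
    """Normalized stress between pairwise distances in original and projected spaces (0 = perfect)."""
    d_high = pdist(X_high)
    d_low = pdist(X_low)
    return np.sqrt(np.sum((d_high - d_low) ** 2) / np.sum(d_high ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                   # high-dimensional data
P = X @ rng.normal(size=(50, 2)) / np.sqrt(50)   # a crude 2D linear projection
print(f"stress = {normalized_stress(X, P):.3f}")
```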

Class proximity measures—dissimilarity-based classification and display of high-dimensional data

this Δ, compute and display the two distances d1(Z), d2(Z) of an instance Z to the chosen πν. The article is organized as follows. In the Introduction, we already discussed the motivation for using a distance-dependent approach for both the visualization and classification of high-dimensional data. Next, we define and list the various common distance measures we may use. This is followed by the description of several possible class representations and class proximity measures. In particular, we introduce, discuss and compare four major categories for representing and positioning an instance in a Class Proximity (CP) plane. In the Results and Discussion section we first illustrate in detail, on a high-dimensional biomedical (metabolomic) dataset (¹H NMR spectra of a biofluid), several feasible possibilities and processes based on concepts of the Class Proximity Projection approach. We repeat this process for four datasets from the UCI Repository. We conclude with general observations and a summary.
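A minimal sketch of the Class Proximity idea: an instance Z is placed in the CP plane at coordinates given by its distances to two class representatives. Here, purely as an assumption for illustration, each class is represented by its centroid and the Euclidean distance is used; the article discusses several other representations and distance measures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes of high-dimensional instances
X1 = rng.normal(0.0, 1.0, size=(100, 500))
X2 = rng.normal(0.5, 1.0, size=(100, 500))

# One possible class representation: the class centroid
c1, c2 = X1.mean(axis=0), X2.mean(axis=0)

def cp_coordinates(Z, c1, c2):
    """Coordinates of instance Z in a Class Proximity plane: (distance to class 1, distance to class 2)."""
    return np.linalg.norm(Z - c1), np.linalg.norm(Z - c2)

d1, d2 = cp_coordinates(X1[0], c1, c2)
print(f"d1(Z) = {d1:.2f}, d2(Z) = {d2:.2f}")  # plot Z at (d1, d2); the smaller distance suggests the class
```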

Scalable Collaborative Targeted Learning for High-Dimensional Data

targeted parameter (the algorithm is described in Section ). The authors show that the greedy C-TMLE algorithm exhibits superior relative performance in analyses of sparse data, at the cost of an increase in time complexity. For instance, in a problem with p baseline covariates, one would construct and select from p candidate estimators of the nuisance parameter, yielding a time complexity of order O(p²). Despite a criterion for early termination, the algorithm does not scale to large-scale and high-dimensional data. The aim of this article is to develop novel C-TMLE algorithms that overcome these serious practical limitations without compromising finite-sample or asymptotic performance.

HDclassif: an R Package for Model-Based Clustering and Discriminant Analysis of High-Dimensional Data

Abstract: This paper presents the R package HDclassif, which is devoted to the clustering and the discriminant analysis of high-dimensional data. The classification methods proposed in the package result from a new parametrization of the Gaussian mixture model which combines the idea of dimension reduction with model constraints on the covariance matrices. The supervised classification method using this parametrization is called high dimensional discriminant analysis (HDDA). In a similar manner, the associated clustering method is called high dimensional data clustering (HDDC) and uses the expectation-maximization algorithm for inference. In order to correctly fit the data, both methods estimate the specific subspace and the intrinsic dimension of the groups. Due to the constraints on the covariance matrices, the number of parameters to estimate is significantly lower than in other model-based methods, which allows the methods to be stable and efficient in high dimensions. Two introductory examples, illustrated with R code, allow the user to discover the hdda and hddc functions. Experiments on simulated and real datasets also compare HDDC and HDDA with existing classification methods on high-dimensional datasets. HDclassif is free software distributed under the General Public License as part of the R software project.

DD-HDS: A method for visualization and exploration of high-dimensional data.

Abstract—Mapping high-dimensional data into a low-dimensional space, for example for visualization, is a problem of increasingly major concern in data analysis. This paper presents DD-HDS, a nonlinear mapping method that follows the line of the Multi-Dimensional Scaling (MDS) approach, based on the preservation of distances between pairs of data points. It improves the performance of existing competitors with respect to the representation of high-dimensional data in two ways: it introduces (i) a specific weighting of distances between data points that takes into account the concentration of measure phenomenon, and (ii) a symmetric handling of short distances in the original and output spaces, avoiding false neighbor representations while still allowing some necessary tears in the original distribution. More precisely, the weighting is set according to the effective distribution of distances in the data set, with the exception of a single user-defined parameter setting the trade-off between local neighborhood preservation and global mapping. The optimization of the stress criterion designed for the mapping is realized by force-directed placement. Mappings of low- and high-dimensional data sets are presented as illustrations of the features and advantages of the proposed algorithm. The weighting function specific to high-dimensional data and the symmetric handling of short distances can easily be incorporated into most distance-preservation-based nonlinear dimensionality reduction methods.

High Dimensional Data Clustering by means of Distributed Dirichlet Process Mixture Models

Abstract: Clustering is a data mining technique intensively used for data analytics, with applications to marketing, security, text/document analysis, and sciences like biology, astronomy, and many more. The Dirichlet Process Mixture (DPM) is a model used for multivariate clustering with the advantage of discovering the number of clusters automatically and offering favorable characteristics. However, in the case of high-dimensional data, it becomes challenging, with numerical and theoretical pitfalls. The advantages of DPM come at the price of prohibitive running times, which impair its adoption and make centralized DPM approaches inefficient, especially with high-dimensional data. We propose HD4C (High Dimensional Data Distributed Dirichlet Clustering), a parallel clustering solution that addresses the curse of dimensionality in two ways. First, it gracefully scales to massive datasets through distributed computing while remaining DPM-compliant. Second, it performs clustering of high-dimensional data such as time series (as a function of time), hyperspectral data (as a function of wavelength), etc. Our experiments, on both synthetic and real-world data, illustrate the high performance of our approach.
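For readers unfamiliar with Dirichlet process mixtures, the sketch below shows a centralized (non-distributed) baseline: scikit-learn's truncated Dirichlet process Gaussian mixture, which discovers the effective number of clusters automatically by leaving unneeded components with negligible weight. This is not HD4C, whose distributed implementation is precisely the paper's contribution; the data and the truncation level are arbitrary assumptions.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Three well-separated Gaussian clusters in 20 dimensions
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(150, 20)) for m in (-2.0, 0.0, 2.0)])

# Truncated Dirichlet process mixture: n_components is only an upper bound,
# unused components end up with negligible weight
dpm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    max_iter=500,
    random_state=0,
).fit(X)

effective_k = int(np.sum(dpm.weights_ > 1e-2))
print("clusters effectively used:", effective_k)   # expected: about 3
```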

Kernel methods for high dimensional data analysis

High Dimensional Data Analysis. Since computer power enables massive computations, data analysis has been driven by the necessity to produce algorithms able to recover the structure of datasets from input points, e.g. using manifold reconstruction and metric approximation [19]. Many methods developed in the last decades make extensive use of distances and metrics in order to produce structures able to capture the manifold underlying the dataset, e.g. Delaunay triangulations [20], marching cubes [21], and manifold reconstruction [22]. Moreover, many of those methods strongly rely on a partition of the space, e.g. Voronoi diagrams in order to produce Delaunay triangulations [20], and kd-trees [23] for nearest-neighbour search. These methods and techniques (e.g. the use of distances, partitions of the space, nearest-neighbour search) are affected in high dimensions by the curse of dimensionality [24], and consequently algorithms based on them may not be efficient. The term curse of dimensionality, introduced by Bellman in 1961 [25], [26], is nowadays used to refer to the class of phenomena that occur in high dimensions in contrast with the low-dimensional scenario. Important examples are the tendency of data to become very sparse in high dimensions [24], [27], and the concentration of distances. Usually dimensions d ≤ 6 are considered low; a high-dimensional regime has to be considered when the dimension d ≥ 10 [11].
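The concentration of distances mentioned above is easy to observe numerically: as the dimension grows, the largest and smallest pairwise distances in an i.i.d. sample become relatively close to each other. The sketch below uses uniform data and arbitrary sample sizes purely for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n = 200
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(n, d))
    dist = pdist(X)                                    # all pairwise Euclidean distances
    contrast = (dist.max() - dist.min()) / dist.min()  # relative contrast shrinks as d grows
    print(f"d={d:5d}  relative contrast = {contrast:.2f}")
```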

On the visualization of high-dimensional data

Figure 1: Distribution of pairwise Euclidean distances in the artificial dataset.

2. Kernel PCA projection for visualizing high-dimensional data. Kernel PCA is the kernelized version of PCA, a popular projection method. It operates on a kernel matrix (i.e. a positive semi-definite similarity matrix) and extracts the non-linear principal manifolds underlying the similarity matrix (see Appendix A for details). The method maps these manifolds onto a vector space: thus, we can build approximate, non-linear, 2D projections of high-dimensional data by selecting the 2 dominant eigen-dimensions and the values taken by the data elements on these.
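A minimal sketch of this projection step, assuming scikit-learn's KernelPCA with an RBF kernel on a synthetic swiss-roll dataset; the kernel choice and its bandwidth are illustrative assumptions, not those of the paper.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

# Non-linear structure embedded in a higher-dimensional ambient space
X, color = make_swiss_roll(n_samples=500, random_state=0)

# Kernel PCA with an RBF kernel: keep the 2 dominant eigen-dimensions as 2D coordinates
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.05)
X_2d = kpca.fit_transform(X)
print(X_2d.shape)   # (500, 2) coordinates ready for a scatter plot, colored by `color`
```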

Relevant Attribute Discovery in High Dimensional Data: Application to Breast Cancer Gene Expressions

Abstract. In many domains, data objects are described in terms of a large number of features. The pipelined data mining approach introduced in [12], using two clustering algorithms in combination with rough sets and extended with genetic programming, is investigated with the purpose of discovering important subsets of attributes in high-dimensional data. Their classification ability is described in terms of both collections of rules and analytic functions obtained by genetic programming (gene expression programming). The Leader and several k-means algorithms are used as procedures for attribute set simplification of the information systems later presented to rough sets algorithms. Visual data mining techniques, including virtual reality, were used for inspecting results. The data mining process is set up using high-throughput distributed computing techniques. This approach was applied to breast cancer gene expression data and led to subsets of genes with high discrimination power with respect to the decision classes.

Continual forgetting-free deep learning from high-dimensional data streams

High-dimensional Data Streams. Abstract: In this thesis, we propose a new deep learning approach for the classification of high-dimensional data streams. In recent years, neural networks have become the reference in various machine learning applications. However, most neural-network-based methods are designed to solve static learning problems, and performing deep learning online is a difficult task. The main difficulty is that neural-network-based classifiers generally rely on the assumption that the sequence of data batches used during training is stationary, or in other words, that the distribution of data classes is the same for all batches (the i.i.d. assumption). When this assumption does not hold, neural networks tend to forget the concepts that are temporarily unavailable in the stream. In the scientific literature, this phenomenon is generally called catastrophic forgetting. The approaches we propose aim to guarantee the i.i.d. nature of each batch coming from the stream and to compensate for the absence of historical data. To do so, we train generative and pseudo-generative models capable of producing synthetic samples from the classes that are absent or under-represented in the stream, and complete the stream's batches with these samples. We test our approaches in an incremental learning scenario and in a specific type of continual learning. Our approaches perform classification on dynamic data streams with an accuracy close to the results obtained in the static classification setting, where all the data are available for the duration of training. In addition, we demonstrate the ability of our methods to adapt to unseen data classes and to new instances of already-known data categories, while avoiding forgetting previously acquired knowledge.

Contributions to unsupervised learning from massive high-dimensional data streams : structuring, hashing and clustering

for reducing space needs and speeding up similarity search. This can be classically achieved by hashing techniques which map data onto lower-dimensional representations. We recall from Chapter 2 that two hashing paradigms exist: data-independent and data-dependent hashing methods. On the one hand, Locality-Sensitive Hashing (LSH) [Andoni and Indyk, 2008] and its variants [Terasawa and Tanaka, 2007, Andoni et al., 2015b, Yu et al., 2014] belong to the data-independent paradigm. They rely on a random projection onto a lower-dimensional space of dimension c, followed by a scalar quantization returning the nearest vertex from the set {−1, 1}^c to obtain the binary codes (e.g. the sign function is applied point-wise). On the other hand, data-dependent methods [Wang et al., 2018] learn this projection from the data instead and have been found to be more accurate for computing similarity-preserving binary codes. Among them, the unsupervised data-dependent Hypercubic hashing methods, embodied by ITerative Quantization (ITQ) [Gong et al., 2013], use Principal Component Analysis (PCA) to reduce the data dimensionality to c: the data is projected onto the first c principal components, chosen as the ones with the highest explained variance since they carry more information on variability. If we then directly mapped each resulting direction to one bit, each of them would be represented by the same volume of binary code (1 bit), although the c-th direction should carry less information than the first one. Thus, one can intuitively understand why applying the PCA projection alone leads to poor performance of the obtained binary codes in the NN search task. This is why the data are often mixed through an isometry after the PCA projection, so as to balance variance over the kept directions. See Figure 4.1, a recall from Section 2.3.2, for the general scheme of Hypercubic hashing methods.
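A minimal sketch of the data-independent paradigm described above: a random projection followed by sign quantization yields c-bit codes whose Hamming distances roughly preserve similarities in the original space. ITQ's PCA step and learned rotation are not shown, and all sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 1000, 256, 32                  # n points, original dimension d, code length c
X = rng.normal(size=(n, d))

# LSH-style hashing: random projection, then sign quantization to vertices of {-1, +1}^c
W = rng.normal(size=(d, c))
codes = np.sign(X @ W)                   # c-bit binary codes (stored as +/- 1)

def hamming(a, b):
    """Hamming distance between two binary codes."""
    return int(np.count_nonzero(a != b))

# A small Hamming distance indicates a small angle between the original vectors
print("Hamming(code 0, code 1):", hamming(codes[0], codes[1]))
```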

CrossCat: A fully Bayesian nonparametric method for analyzing heterogeneous, high dimensional data

Abstract: There is a widespread need for statistical methods that can analyze high-dimensional datasets without imposing restrictive or opaque modeling assumptions. This paper describes a domain-general data analysis method called CrossCat. CrossCat infers multiple non-overlapping views of the data, each consisting of a subset of the variables, and uses a separate nonparametric mixture to model each view. CrossCat is based on approximately Bayesian inference in a hierarchical, nonparametric model for data tables. This model consists of a Dirichlet process mixture over the columns of a data table in which each mixture component is itself an independent Dirichlet process mixture over the rows; the inner mixture components are simple parametric models whose form depends on the types of data in the table. CrossCat combines strengths of mixture modeling and Bayesian network structure learning. Like mixture modeling, CrossCat can model a broad class of distributions by positing latent variables, and produces representations that can be efficiently conditioned and sampled from for prediction. Like Bayesian networks, CrossCat represents the dependencies and independencies between variables, and thus remains accurate when there are multiple statistical signals. Inference is done via a scalable Gibbs sampling scheme; this paper shows that it works well in practice. This paper also includes empirical results on heterogeneous tabular data of up to 10 million cells, such as hospital cost and quality measures, voting records, unemployment rates, gene expression measurements, and images of handwritten digits. CrossCat infers structure that is consistent with accepted findings and common-sense knowledge in multiple domains and yields predictive accuracy competitive with generative, discriminative, and model-free alternatives.

Gaussian mixture models for the classification of high-dimensional vibrational spectroscopy data

An alternative and recent way of dealing with the problem of high-dimensional data classification is to model and classify the data in low-dimensional class-specific subspaces. The Gaussian models for high-dimensional data and their associated classification method HDDA (High-Dimensional Discriminant Analysis), defined in [4] and under study in the present paper, make it possible to efficiently model and classify complex high-dimensional data in a parsimonious way, since the model complexity is controlled by the intrinsic dimensions of the classes. In contrast to other generative methods which incorporate dimensionality reduction or variable selection for dealing with high-dimensional data [14, 20, 23], HDDA does not reduce the dimension but models the data of each class in a specific low-dimensional subspace. Thus, no information loss due to dimensionality reduction is to be deplored and all the available information is used to discriminate the classes. Furthermore, several submodels are defined by introducing constraints on the parameters in order to be able to model different types of data. The choice between these submodels can be made using classical model selection tools such as cross-validation or penalized likelihood criteria [17]. An additional advantage is that HDDA models require the tuning of only one parameter, which contributes to the selection of the dimension of each class-specific subspace (see Section 2.4). Finally, HDDA presents several numerical advantages compared to other generative classification methods: an explicit formulation of the inverse covariance matrix and the possibility of building the classifier when the number of learning observations is smaller than the dimension.
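The sketch below is a deliberately simplified caricature of the class-specific subspace idea, not the actual HDDA model: each class gets its own low-dimensional PCA subspace, and a new observation is assigned to the class whose subspace reconstructs it best. HDDA's real decision rule additionally models the within-subspace and noise variances and the submodels mentioned above; the data, intrinsic dimension, and decision criterion here are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

def fit_class_subspaces(X_by_class, intrinsic_dim=5):
    """Fit a separate low-dimensional PCA subspace to each class."""
    return {k: PCA(n_components=intrinsic_dim).fit(Xk) for k, Xk in X_by_class.items()}

def predict(models, Z):
    """Assign Z to the class whose subspace gives the smallest reconstruction error."""
    errors = {}
    for k, pca in models.items():
        Z_hat = pca.inverse_transform(pca.transform(Z.reshape(1, -1))).ravel()
        errors[k] = np.linalg.norm(Z - Z_hat)
    return min(errors, key=errors.get)

# Two 200-dimensional classes, each concentrated on its own 5-dimensional subspace
d = 200
X_by_class = {
    0: rng.normal(size=(100, 5)) @ rng.normal(size=(5, d)),
    1: 3.0 + rng.normal(size=(100, 5)) @ rng.normal(size=(5, d)),
}
models = fit_class_subspaces(X_by_class)
print(predict(models, X_by_class[1][0]))   # expected: 1
```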

High-dimensional $p$-norms

d go to infinity on ‖X‖_p has surprising consequences, which may dramatically affect high-dimensional data processing. This effect is usually referred to as the distance concentration phenomenon in the computational learning literature. Despite a growing interest in this important question, previous work has essentially characterized the problem in terms of numerical experiments and incomplete mathematical statements. In the present paper, we solidify some of the arguments which previously appeared in the literature and offer new insights into the phenomenon.
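The phenomenon is easy to reproduce numerically: for i.i.d. coordinates, the relative fluctuation of ‖X‖_p shrinks as the dimension d grows, for any fixed p. The distribution, sample sizes, and values of p below are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# std/mean of ||X||_p over repeated draws shrinks as the dimension d grows
for d in (10, 100, 1000, 10000):
    X = rng.uniform(size=(1000, d))        # 1000 i.i.d. samples in dimension d
    for p in (1, 2, 4):
        norms = np.linalg.norm(X, ord=p, axis=1)
        print(f"d={d:6d} p={p}  std/mean = {norms.std() / norms.mean():.4f}")
```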

High-dimensional Learning for Extremes

d. These subsets form a partition of the positive unit sphere S_+^{d−1}. Regarding question (Q1), this partition makes it possible to deal with high-dimensional data. Indeed, for β ∈ P_d^* with cardinality b, the subset C_β can be seen as part of the sphere S_+^{b−1}. Therefore, as soon as b is moderate compared to d, the use of C_β reduces the dimension of the study. The idea is then to provide methods to learn on which of these subsets the spectral measure puts mass. This is developed in Chapter 2 and Chapter 3. All the approaches introduced in this section rely on asymptotic results for multivariate random vectors (see Proposition 1.2.2 and Proposition 1.2.3). However, in a statistical context we only have a finite data set at our disposal. Therefore the convergences that appear in the aforementioned propositions become approximations. In particular, Equation (1.2.17) can be used to study the behavior of the spectral vector Θ as soon as the threshold t is "large enough". This is why particular attention should be paid to the choice of this threshold t, or equivalently to the number of data points considered to be extreme (see question (Q3)). One way to deal with this issue is to use model selection to identify for which threshold t the approximation is the most accurate.

Data Augmentation in High Dimensional Low Sample Size Setting Using a Geometry-Based Variational Autoencoder

methods hardly scale to high-dimensional data [11], [12]. The recent rise in performance of generative models such as generative adversarial networks (GANs) [13] or variational autoencoders (VAEs) [14], [15] has made them very attractive models for performing DA. GANs have already seen wide use in many fields of application [16], [17], [18], [19], [20], including medicine [21]. For instance, GANs were used on magnetic resonance images (MRI) [22], [23], computed tomography (CT) [24], [25], X-ray [26], [27], [28], positron emission tomography (PET) [29], mass spectroscopy data [30], dermoscopy [31] and mammography [32], [33], and demonstrated promising results. Nonetheless, most of these studies involved either a quite large training set (above 1000 training samples) or quite low-dimensional data, whereas in everyday medical applications it remains very challenging to gather such large cohorts of labeled patients. As a consequence, as of today, the case of high-dimensional data combined with a very low sample size remains poorly explored. Compared to GANs, VAEs have only seen very marginal interest for performing DA and were mostly used for speech applications [34], [35], [36]. Some attempts to use such generative models on medical data, either for classification [37], [38] or segmentation tasks [39], [40], [41], can nonetheless be noted. The main limitation to a wider use of these models is that most of the time they produce blurry and fuzzy samples. This undesirable effect is even more emphasized when they are trained with a small number of samples, which makes them very hard to use in practice for DA in the high dimensional (very) low sample size (HDLSS) setting.

Similarity Learning for High-Dimensional Sparse Data

matrix with O(d²) entries (where d is the data dimension) to account for correlations between pairs of features. For high-dimensional data (say, d > 10⁴), this is problematic for at least three reasons: (i) training the metric is computationally expensive (quadratic or cubic in d), (ii) the matrix may not even fit in memory, and (iii) learning so many parameters is likely to lead to severe overfitting, especially for sparse data where some features are rarely observed.
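A small sketch of the memory argument above, together with one generic workaround (a diagonal metric, i.e. per-feature weights) that keeps the number of parameters linear in d and operates directly on sparse vectors. This is only an illustration of the problem, not the similarity learning method proposed in the paper; the weights here are random assumptions rather than learned.

```python
import numpy as np
from scipy import sparse

d = 10_000
# A full Mahalanobis matrix has d^2 float64 entries: ~0.8 GB for d = 10^4
print(f"full metric: {d * d * 8 / 1e9:.1f} GB")

rng = np.random.default_rng(0)
w = rng.uniform(0.5, 1.5, size=d)             # per-feature weights (would normally be learned)
x = sparse.random(1, d, density=0.001, random_state=0, format="csr")
y = sparse.random(1, d, density=0.001, random_state=1, format="csr")

diff = (x - y).toarray().ravel()
dist = float(np.sqrt(np.sum(w * diff ** 2)))  # weighted Euclidean = diagonal Mahalanobis distance
print(f"diagonal-metric distance: {dist:.3f}")
```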

Comparison of Variable Selection Methods for Time-to-Event Data in High-Dimensional Settings

or mouse social defeat) have been shown to be significantly associated with overall survival. It is even more surprising that many random signatures can outperform most breast cancer signatures [8]. Several authors have suggested that the selected sets of genes are not unique and are strongly influenced by the subset of patients included in the training cohort [9, 10] and by the variable selection procedures [11–14]. For low-dimensional data, the reference method for studying associations with time-to-event endpoints is the Cox proportional hazards model. In the context of high-dimensional data (number of covariates >> number of observations), the Cox model may be non-identifiable. Extensions based on boosting or penalized regression have been proposed in the literature to overcome these hurdles [15–18], as they shrink the regression coefficients towards zero. As an alternative to the Cox extensions, methods based on random forests have been adapted for survival analysis [19]. This nonparametric method, random survival forest (RSF), combines multiple decision trees built on randomly selected subsets of variables. Since feature selection methods are questioned, it seems important to thoroughly assess and compare existing strategies that are significant components in prognostic signature development. Many studies were interested in false discovery rates or prognostic performances achieved by multiple variable selection methods and compared them on simulated or real datasets [20–23]. However, the impact of the training set on the stability of the results was only assessed by Michiels et al. [9], on a binary endpoint with a selection based on Pearson's correlation, and did not evaluate the most recent approaches. The main objective of this publication is to compare six different feature selection methods which are commonly used for high-dimensional data in the context of survival analysis. For this purpose, and as recommended in the literature [24], a simulation study is performed, with special focus on variable selection and prediction performance according to multiple data configurations (sample size of the training set, number of genes associated with survival). The feature selection methods are then applied to published data to explore stability and prognostic performance on a real breast cancer dataset.
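As an illustration of the penalized-regression extensions of the Cox model mentioned above, here is a sketch of a lasso-penalized Cox fit on synthetic gene-expression-like data using the lifelines package; the data-generating process, penalty strength, and selection threshold are all arbitrary assumptions, and this is not the comparison protocol of the study.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n, p = 200, 50                               # n patients, p candidate covariates ("genes")
X = rng.normal(size=(n, p))
risk = 1.0 * X[:, 0] - 0.8 * X[:, 1]         # only the first two covariates carry signal
T = rng.exponential(scale=np.exp(-risk))     # survival times with proportional hazards
E = (rng.uniform(size=n) < 0.8).astype(int)  # ~80% observed events, the rest censored

df = pd.DataFrame(X, columns=[f"g{j}" for j in range(p)])
df["T"], df["E"] = T, E

# L1-penalized Cox regression: the penalty shrinks most coefficients to (near) zero
cph = CoxPHFitter(penalizer=0.1, l1_ratio=1.0)
cph.fit(df, duration_col="T", event_col="E")
selected = cph.params_[cph.params_.abs() > 1e-3]
print(selected)                              # covariates retained by the penalized fit
```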

Automated high-dimensional flow cytometric data analysis

finite mixture model | flow cytometry | multivariate skew distribution. Flow cytometry transformed clinical immunology and hematology over two decades ago by allowing the rapid interrogation of cell surface determinants and, more recently, by enabling the analysis of intracellular events using fluorophore-conjugated antibodies or markers. Although flow cytometry initially allowed the investigation of only a single fluorophore, recent advances allow close to 20 parallel channels for monitoring different determinants (1–4). These advances have now surpassed our ability to interpret the resulting high-dimensional data manually and have led to growing interest and recent activity in the development of new computational tools and approaches (5–8).
