Simulated data are widely used to assess optimisation methods, because they make it possible to evaluate aspects of the methods under study that cannot be examined with real data sets. In the context of convex optimisation, the exact solution of the minimisation problem is never known with real data, and obtaining it proves difficult even with simulated data. We propose to generalise an approach originally proposed by Nesterov for LASSO regression to a broader family of penalised regressions. We would thus like to generate simulated data for which we know the exact solution of the optimised function. The inputs are: the minimiser β∗ and a candidate data set X0 (n × p),
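For the plain LASSO, one simple way to realise this idea is to construct a residual r satisfying the KKT conditions at β∗ and set y = X0β∗ + r. The sketch below is our own minimal illustration of that construction (all variable names and sizes are illustrative; this is not necessarily Nesterov's exact recipe, and it assumes p ≤ n so that XᵀX is invertible):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 20, 0.1

# Candidate design X0 and chosen sparse minimiser beta_star.
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:3] = [1.5, -2.0, 0.7]

# Subgradient s of the l1 norm at beta_star:
# s_j = sign(beta_star_j) on the support, any value in [-1, 1] elsewhere.
s = np.sign(beta_star)

# Pick a residual r with X^T r = lam * s (requires p <= n here).
r = X @ np.linalg.solve(X.T @ X, lam * s)
y = X @ beta_star + r

# KKT condition for min_b 0.5*||y - X b||^2 + lam*||b||_1:
# X^T (y - X beta_star) = lam * s, so beta_star is an exact minimiser.
kkt = X.T @ (y - X @ beta_star)
print(np.allclose(kkt, lam * s))  # True
```

Since the problem is convex, the KKT condition is sufficient, so β∗ is guaranteed to be an exact solution of the constructed problem.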
Several statistical contributions have tried to address heteroscedastic models in high-dimensional regression. Most works have relied on an exponential representation of the variance (the log-variance being modeled as a linear combination of the features), leading to non-convex objective functions. Solvers considered for such approaches require alternate minimization [Kolar and Sharpnack, 2012], possibly in an iterative fashion [Daye et al., 2012], a notable difference with a jointly convex formulation, for which one can control global optimality with duality-gap certificates as proposed here. Similarly, Wagener and Dette [2012] estimate the variance with a preliminary adaptive Lasso step, and correct the data-fitting term in a second step. Here, we propose the multi-task Smoothed Generalized Concomitant Lasso (SGCL), an estimator that can handle data from different origins in a high-dimensional sparse regression model by jointly estimating the regression coefficients and the noise levels of each modality or data source. Contrary to other heteroscedastic Lasso estimators such as ScHeDs (a second-order cone program) [Dalalyan et al., 2013], its computational cost is comparable to the Lasso, as it can benefit from coordinate descent solvers [Tseng,
Eugene Belilovsky · Andreas Argyriou ·
Gaël Varoquaux · Matthew Blaschko
Abstract We study the problem of statistical estimation with a signal known to be sparse, spatially contiguous, and containing many highly correlated variables. We take inspiration from the recently introduced k-support norm, which has been successfully applied to sparse prediction problems with correlated features, but lacks any explicit structural constraints commonly found in machine learning and image processing. We address this problem by incorporating a total variation penalty in the k-support framework. We introduce the (k, s) support total variation norm as the tightest convex relaxation of the intersection of a set of sparsity and total variation constraints. We show that this norm leads to an intractable combinatorial graph optimization problem, which we prove to be NP-hard. We then introduce a tractable relaxation with approximation guarantees that scale well for grid-structured graphs. We devise several first-order optimization strategies for statistical parameter estimation with the described penalty. We demonstrate the effectiveness of this penalty on classification in the low-sample regime, classification with M/EEG neuroimaging data, and background-subtracted image recovery tasks with synthetic and real data. We extensively analyse the application of our penalty on the complex task of identifying predictive regions from low-sample high-dimensional fMRI brain data, and we show that our method is particularly useful compared to existing methods in terms of accuracy, interpretability, and stability.
1) CONESTA has a convergence rate of O(1/ε), an improvement over FISTA with fixed smoothing, whose rate is O(1/ε) + O(1/√ε).
2) CONESTA outperformed (in terms of execution time and precision of the solution) several state-of-the-art optimization algorithms on both simulated and neuroimaging data. 3) CONESTA is a robust solver that resolves practical problems affecting other solvers in the very high dimensions encountered in neuroimaging. For instance: (i) The EGM allows neither true sparsity nor complex loss functions. (ii) The Inexact FISTA converges faster (in terms of the number of iterations) than CONESTA. However, as observed on the high-dimensional MRI dataset, after hundreds of iterations, solving the subproblem with an inner FISTA loop makes it much slower (e.g., it took 4 times longer to reach ε < 10⁻³) than CONESTA.
proach performs sparse kernel regression based on a sparsity-inducing prior on the weight parameters within a Bayesian framework. Unlike the commonly used Elastic-Net or Lasso approaches (based on the L1 norm, a.k.a. a Laplacian prior), the RVM method does not require setting any regularization parameters through cross-validation. Instead, it automatically estimates the noise level in the input data and performs a trade-off between the number of basis functions (complexity of the representation) and the ability to represent the signal. Furthermore, unlike SVM regression or Elastic-Net, it provides a posterior probability for each estimated quantity, which is reasonably meaningful if that quantity is similar to the training set. In our setting, we used Gaussian kernels for the non-linear regression, whose bandwidth parameter needs to be defined. The RVM regression only selects the input BSPM feature set that can best explain the activation map in the training set, thus limiting the risk of overfitting. RVM is a multivariate but single-valued approach and therefore the regression was directly performed on the reduced space of section 2.3: only 400 regressions are needed to estimate the 14K activation times. We used a Gaussian radial basis function with a kernel bandwidth of 1e4 (from cross-validation). On an EliteBook Intel Core i7, a regression with 1000 training samples and 1235 features runs in 40 s.
Keywords: stability; high-dimensional estimators; machine learning; brain imaging; clustering
I. INTRODUCTION
Using machine learning on neuroimaging data, brain regions can be linked with external variables. In particular, linear predictive models are interesting because their coefficients form brain maps that can be interpreted. However, because of the high dimensionality of brain imaging data, their estimation is an ill-posed problem, and in order to find a feasible solution, some constraints must be imposed on the estimator. A popular way to solve that problem is to use a sparsity constraint, as it isolates putatively relevant features. In practice, the high correlation between neighboring voxels leads to selecting too few features, and hinders the estimators' ability to recover a stable support. The estimation instability causes a high variance of both the prediction scores and the model coefficients, and therefore may result in non-reproducible findings. To mitigate this instability, stability selection adds randomization to sparsity for feature selection. More generally, two main classes of methods are known to stabilize models with highly correlated features. Feature-clustering-based methods reduce both the local correlations and the dimensionality of the data. Model aggregation methods, such as bagging, generate new training sets from an original one with data perturbation schemes. They build multiple estimators from the perturbed data and combine them into an estimate with reduced variance. Stability selection methods are variants that use sparsity to perform feature selection.
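As a toy illustration of the aggregation idea (our own minimal sketch, not the implementation used in any of the cited works; the solver, data, and thresholds are illustrative), one can refit a lasso on many subsamples and keep the features whose selection frequency is high:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Minimal ISTA solver for 0.5/n*||y - X b||^2 + lam*||b||_1 (illustrative)."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n        # Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        z = beta - grad / L
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return beta

rng = np.random.default_rng(0)
n, p = 100, 30
X = rng.standard_normal((n, p))
y = X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(n)   # two true features

# Stability selection: refit on perturbed (subsampled) data and record
# how often each feature enters the support.
freq = np.zeros(p)
n_boot = 50
for _ in range(n_boot):
    idx = rng.choice(n, size=n // 2, replace=False)
    b = lasso_ista(X[idx], y[idx], lam=0.2)
    freq += np.abs(b) > 1e-8
freq /= n_boot

stable = np.where(freq >= 0.8)[0]   # features selected in >= 80% of refits
print(stable)
```

Averaging the selection indicator over perturbed data sets is what reduces the variance of the selected support compared to a single sparse fit.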
However, when the covariate dimension is much higher than the response dimension, GLLiM may result in erroneous clusters at the low dimension, leading to potentially inaccurate predictions. Specifically, when the clustering is conducted at a high joint dimension, the distance at low dimension between two members of the same cluster could remain large. As a result, a mixture component might contain several sub-clusters and/or outliers, violating the Gaussian assumption of the model. This results in a model misspecification effect that can seriously impact prediction performance. We demonstrate this phenomenon with a numerical example in Section 2. A natural way to lessen this effect is to increase the number of components in the mixture, making each linear mapping even more local. But this practice also increases the number of parameters to be estimated. Estimating parameters in a parsimonious manner is required to avoid over-parameterization. In addition, increasing the number of clusters could isolate some data points or lead to singular covariance matrices. Hence, a robust estimation procedure for model stability is also necessary.
Neural networks are increasingly used as statistical models. The performance of the multilayer perceptron (MLP) and that of linear regression (LR) were compared, with regard to the quality of prediction and estimation and the robustness to deviations from underlying assumptions of normality, homoscedasticity and independence of errors. Taking into account those deviations, five designs were constructed, and, for each of them, 3000 data points were simulated. The comparison between connectionist and linear models was achieved by graphic means, including prediction intervals, as well as by classical criteria, including goodness-of-fit and relative errors. The empirical distribution of estimations and the stability of MLP and LR were studied by re-sampling methods. MLP and linear regression had comparable performance and robustness. Despite the flexibility of connectionist models, their predictions were stable. The empirical variances of weight estimations result from the distributed representation of the information among the processing elements. This emphasizes the major role of variances of weight estimations in the interpretation of neural networks. This needs, however, to be confirmed by further studies. Therefore MLP could be useful statistical models, as long as convergence conditions are respected.
Abstract—Principal component analysis (PCA) is an exploratory tool widely used in data analysis to uncover dominant patterns of variability within a population. Despite its ability to represent a data set in a low-dimensional space, PCA's interpretability remains limited. Indeed, the components produced by PCA are often noisy or exhibit no visually meaningful patterns. Furthermore, the fact that the components are usually non-sparse may also impede interpretation, unless arbitrary thresholding is applied. However, in neuroimaging, it is essential to uncover clinically interpretable phenotypic markers that would account for the main variability in the brain images of a population. Recently, some alternatives to the standard PCA approach, such as Sparse PCA, have been proposed, their aim being to limit the density of the components. Nonetheless, sparsity alone does not entirely solve the interpretability problem in neuroimaging, since it may yield scattered and unstable components. We hypothesized that the incorporation of prior information regarding the structure of the data may lead to improved relevance and interpretability of brain patterns. We therefore present a simple extension of the popular PCA framework that adds structured sparsity penalties on the loading vectors in order to identify the few stable regions in the brain images that capture most of the variability. Such structured sparsity can be obtained by combining, e.g., ℓ1 and total variation (TV) penalties, where the TV regularization encodes information on the underlying structure of the data. This paper presents the structured sparse PCA (denoted SPCA-TV) optimization framework and its resolution. We demonstrate SPCA-TV's effectiveness and versatility on three different data sets. It can be applied to any kind of structured data, such as, e.g., N-dimensional array images or meshes of cortical surfaces.
The gains of SPCA-TV over unstructured approaches (such as Sparse PCA and ElasticNet PCA) or structured approaches (such as GraphNet PCA) are significant, since SPCA-TV reveals the variability within a data set in the form of intelligible brain patterns that are easier to interpret and more stable across different samples.
Compression and selection increase prediction accuracy. We now assess the importance of compression and variable selection for prediction performance. We consider the prediction accuracy, evaluated through the prediction error rate. A first interesting point is that the prediction performance of compression methods is improved by the addition of a selection step: logit-SPLS, SGPLS and SPLS-DA perform better than logit-PLS, GPLS and PLS-DA, respectively (cf. Tab. 3). In addition, sparse PLS approaches also present a lower classification error rate than the GLMNET method, which performs variable selection only. These two points support our claim that, in any case, compression and selection should both be considered for prediction. Similar results are observed for other configurations of simulated data (cf. Supp. Mat. section A.5.2). All the SPLS-based approaches show similar prediction performance, even methods that do not converge (SPLS-log or SGPLS) compared to our adaptive approach logit-SPLS. Thus, checking prediction accuracy only may not be a sufficient criterion to assess the relevance of a method. The GPLS method is a good example of a non-convergent method (cf. Tab. 3 and Tab. A.2 in Supp. Mat.) that presents high variability and poor performance regarding prediction.
; 2 × 2 × 2-mm voxels; 0.5-mm gap). Realignment, normalization to MNI space, and General Linear Model (GLM) fitting were performed with the SPM5 software (http://www.fil.ion.ucl.ac.uk/spm/software/spm5). The normalization is the conventional one of SPM (implying affine and non-linear transformations) and not the one using unified segmentation. The normalization parameters are estimated on the basis of an additionally acquired whole-head EPI, and are then applied to the partial EPI volumes. The data are not smoothed. In the GLM, the effect of each of the 12 stimuli convolved with a standard hemodynamic response function was modeled separately, while accounting for serial auto-correlation with an AR(1) model and removing low-frequency drift terms using a high-pass filter with a cut-off of 128 s. The GLM is fitted separately in each session for each subject, and we used in the present work the resulting session-wise parameter estimate images (the β-maps are used as rows of X). The four different shapes of objects were pooled for each of the three sizes, and we are interested in finding discriminative information between sizes. This reduces to a regression problem, in which our goal is to predict a simple scalar factor (the size of an object). All the analyses are performed without any prior selection of regions of interest, and use the whole acquired volume.
return β^(t), θ^(t)
When using a squared ℓ2 loss, the curvature of the loss is constant: for the Lasso and multitask Lasso, the Hessian does not depend on the current iterate. This is however not true for other GLM data-fitting terms, e.g., logistic regression, for which taking into account second-order information proves to be very useful for fast convergence (Hsieh et al., 2014). To leverage this information, we can use a prox-Newton method (Lee et al., 2012; Scheinberg and Tang, 2013) as the inner solver; an advantage of dual extrapolation is that it can be combined with any inner solver, as we detail below. For reproducibility and completeness, we first briefly detail the prox-Newton procedure used. In the following and in Algorithms 4 to 6 we focus on a single subproblem optimization, so for lighter notation we assume that the design matrix X is already restricted to the features in the working set. The reader should be aware that in the rest of this section, β, X and p in fact refer to β_{W^(t)}, X_{W^(t)}, and p^(t).
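As a rough sketch of such an inner prox-Newton solver (our own illustrative implementation, not the procedure of Algorithms 4 to 6: it uses a unit step size with no line search, and all names and constants are assumptions), each outer iteration builds a quadratic model of the logistic loss at the current iterate and solves the resulting lasso-like subproblem with cyclic coordinate descent:

```python
import numpy as np

def prox_newton_l1_logreg(X, y, lam, n_outer=10, n_cd=20):
    """Sketch of a prox-Newton solver for l1-regularised logistic regression:
    min_b (1/n) sum_i log(1 + exp(-y_i x_i^T b)) + lam * ||b||_1, with y in {-1, +1}.
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_outer):
        z = y * (X @ beta)
        sig = 1.0 / (1.0 + np.exp(z))            # sigmoid(-z)
        grad = -(X.T @ (y * sig)) / n            # gradient of the smooth loss
        w = sig * (1.0 - sig)                    # Hessian weights: X^T diag(w) X / n
        # Solve the quadratic subproblem with cyclic coordinate descent:
        #   min_d grad^T d + 0.5 d^T (X^T diag(w) X / n) d + lam * ||beta + d||_1
        d, Xd = np.zeros(p), np.zeros(n)
        for _ in range(n_cd):
            for j in range(p):
                hjj = (w * X[:, j] ** 2).sum() / n               # curvature along e_j
                cj = grad[j] + (w * Xd) @ X[:, j] / n - hjj * d[j]
                u = beta[j] - cj / hjj                           # unpenalised minimiser
                new = np.sign(u) * max(abs(u) - lam / hjj, 0.0)  # soft-threshold
                Xd += (new - beta[j] - d[j]) * X[:, j]           # keep Xd = X @ d
                d[j] = new - beta[j]
        beta = beta + d      # unit step; a real solver would add a line search
    return beta

# Toy usage on synthetic data (names illustrative)
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((80, 10))
b0 = np.zeros(10); b0[:2] = [1.0, -1.0]
y_demo = np.where(X_demo @ b0 + 0.3 * rng.standard_normal(80) > 0, 1.0, -1.0)
beta_hat = prox_newton_l1_logreg(X_demo, y_demo, lam=0.05)
```

Because the Hessian weights w change with β, the subproblem must be re-formed at every outer iteration, which is exactly the extra cost that the constant-curvature squared ℓ2 case avoids.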
Wavelet thresholding of spectra has to be handled with care when the spectra are the predictors of a regression problem. Indeed, blind thresholding of the signal followed by a regression method often leads to deteriorated predictions. The scope of this paper is to show that sparse regression methods, applied in the wavelet domain, perform an automatic thresholding: the most relevant wavelet coefficients are selected to optimize the prediction of a given target of interest. This approach can be seen as a joint thresholding designed for a predictive purpose.
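A minimal numerical illustration of this idea (our own sketch, with an orthonormal Haar transform and a plain ISTA lasso solver; the synthetic "spectra", target, and constants are all illustrative assumptions): expressing the predictors in the wavelet domain and fitting a lasso selects exactly the wavelet coefficients that matter for prediction.

```python
import numpy as np

def haar_matrix(m):
    """Orthonormal Haar wavelet transform matrix (m must be a power of 2)."""
    if m == 1:
        return np.array([[1.0]])
    h = haar_matrix(m // 2)
    top = np.kron(h, [1.0, 1.0])               # coarse averages
    bot = np.kron(np.eye(m // 2), [1.0, -1.0]) # finest details
    return np.vstack([top, bot]) / np.sqrt(2.0)

rng = np.random.default_rng(0)
n, m = 60, 64
# Synthetic "spectra": smooth bumps plus noise, observed at m wavelengths.
t = np.linspace(0, 1, m)
X = np.exp(-((t - rng.uniform(0.2, 0.8, (n, 1))) ** 2) / 0.01) \
    + 0.05 * rng.standard_normal((n, m))
H = haar_matrix(m)
Z = X @ H.T                      # predictors expressed in the wavelet domain

# The target depends on only two wavelet coefficients; the lasso in the
# wavelet domain should keep (at least) those (prediction-driven thresholding).
beta_true = np.zeros(m); beta_true[1] = 2.0; beta_true[3] = -1.0
y = Z @ beta_true + 0.1 * rng.standard_normal(n)

# Plain ISTA for 0.5/n*||y - Z b||^2 + lam*||b||_1.
lam, L = 0.05, np.linalg.norm(Z, 2) ** 2 / n
b = np.zeros(m)
for _ in range(2000):
    z = b - Z.T @ (Z @ b - y) / (n * L)
    b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)

selected = np.flatnonzero(np.abs(b) > 1e-8)  # retained wavelet coefficients
print(selected)
```

The zeroed coordinates of b play the role of thresholded wavelet coefficients, but the threshold is tuned jointly with the prediction task rather than blindly on the signal.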
of weights) than the function α : x_G ↦ W x_G + β_I. Furthermore, Fig. 2 shows that genetic data x_G tend to express through W, and thereby participate in the modulation of the vector α(x_G).
We compared our approach to [13, 15], for which code is available. The features selected by [13, 15] are similar to ours for each modality taken separately. For instance, for [13] and the task "AD versus CN", the SNPs with the largest weights are in the genes APOE (rs429358), BZW1 (rs3815501) and MGMT (rs7071424). However, the genetic parameter vector learnt by [13] or [15] is not sparse, unlike ours. Furthermore, for [15], the weight for the imaging kernel is nine times larger than the weight for the genetic kernel. These experiments show that the additive model with adapted penalties for each modality provides better performance than [15], but our additive, multiplicative and multilevel models provide similar performance.
The first and maybe most important hyperparameter is K, the number of intervals in the coefficient functions from the prior. Because of the discretization of the rainfall, and the number of observations, the value of K should stay small to remain parsimonious. Because of the size of the dataset, we have set the hyperparameter a to obtain a prior probability of being in the support of about 0.5. The results are given in Figure 8. As can be seen on the left of this figure, the error variance σ² decreases when K increases, because models of higher dimension can more easily fit the data. The main question is when do they overfit the data? In this case, the Bayesian Information Criterion selects the model with K = 2 intervals, see Section 3.5 of the Supplementary Materials (Grollemund et al., 2018). Given the small number of observations (n = 25), the values of BIC have to be carefully interpreted. Otherwise, looking at the right panel of Figure 8, we can consider how the posterior probability α(t | D) depends on the value of K and choose a reasonable value. First, for K = 1 or 2, the posterior probability is high during a first long period of time until August of year n − 1 and falls to much lower values after that. Thus, these small values of K provide a rough picture of the dependency. Secondly, for K = 4, 5 or 6, the posterior probability α(t | D) varies between 0.2 and 0.7 and shows doubtful variations after November of year n − 1 and other strong variations during the summer of year n − 1 that are also doubtful. Hence we decided to rely on K = 3, although this choice is rather subjective.
To limit the set of possible solutions, prior hypotheses on the nature of the source distributions are necessary. The minimum-norm estimates (MNE), for instance, are based on ℓ2 Tikhonov regularization, which leads to a linear solution. An ℓ1 norm penalty
was also proposed by Uutela et al., modeling the underlying neural pattern as a sparse collection of focal dipolar sources, hence their name "Minimum Current Estimates" (MCE). These methods have inspired a series of contributions in source localization techniques relying on noise normalization, such as dSPM and sLORETA [11, 52], to correct for the depth bias, or on block-sparse norms such as MxNE and TF-MxNE to leverage the spatio-temporal dynamics of MEG signals. If other imaging data are available, such as fMRI [50, 70] or diffusion MRI, they can be used as prior information, for example in hierarchical Bayesian models. While such techniques have had some success, source estimation in the presence of complex multi-dipole configurations remains a challenge. To address it, one idea is to leverage the anatomical and functional diversity of multi-subject datasets to improve localization results.
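Concretely, the minimum-norm (ℓ2 Tikhonov) solution mentioned above is linear in the measurements: x̂ = Gᵀ(GGᵀ + λI)⁻¹y, where G is the lead-field (gain) matrix. A toy numerical sketch (the dimensions, λ, and random lead field are illustrative assumptions, not taken from the cited works):

```python
import numpy as np

rng = np.random.default_rng(0)
n_sensors, n_sources = 30, 500

G = rng.standard_normal((n_sensors, n_sources))   # toy lead-field matrix
x_true = np.zeros(n_sources)
x_true[[10, 200]] = [1.0, -1.5]                   # two focal dipolar sources
y = G @ x_true + 0.01 * rng.standard_normal(n_sensors)

# Minimum-norm estimate: argmin_x ||y - G x||^2 + lam * ||x||^2
# has the closed form x_hat = G^T (G G^T + lam*I)^{-1} y (linear in y).
lam = 1.0
x_mne = G.T @ np.linalg.solve(G @ G.T + lam * np.eye(n_sensors), y)

# Unlike the l1-penalised MCE, which yields a sparse, focal estimate,
# the l2 solution spreads energy over (almost) all sources.
print((np.abs(x_mne) > 0).mean())   # fraction of nonzero sources
```

The small (n_sensors × n_sensors) system is what makes this estimate cheap despite the very large number of candidate sources.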
Our proposed continuation algorithm addresses the two main aforementioned deficiencies. Indeed, CONESTA (i) is relevant in the context of any smooth convex loss function, because it only requires the computation of the gradient, and (ii) estimates weights that are strictly sparse, because it does not require smoothing the sparsity-inducing penalties. Additionally, CONESTA does not require solving any linear systems in P dimensions, or inverting very large matrices (XX⊤ is inverted in the gap, but is assumed to be small in the N ≪ P paradigm), and can easily be applied with a variety of convex smooth loss functions and many different complex convex penalties.
metric learning algorithms and design experiments on them. We divide this chapter into two parts. In the first part, we consider the case of multiple relational tables between multiple entity tables. In response to this situation, we start with the selection of similar and dissimilar sets. The relational side information is used to construct a correlation-strength function that evaluates the degree of similarity between entities. Then, according to the evaluation results, the samples are assigned to a similar set or a dissimilar set and finally used in a classical metric distance learning algorithm. In the second part, we focus on a large entity table with multiple relational tables between samples. In view of this situation, we propose two schemes: one scheme is to combine multiple relational adjacency matrices into a relational tensor, perform a RESCAL tensor decomposition [NTK11], treat the decomposed matrix as a new feature space, and apply metric distance learning on this new feature space; the other scheme is to directly accumulate the loss functions of the multiple relational adjacency matrices, construct a comprehensive loss function that considers both supervised and unsupervised information, and then optimize this function. It is worth mentioning that the algorithm proposed in the first part can also easily be extended to the second case. Finally, we conducted an experimental evaluation of these three relational metric learning algorithms, compared them with other metric distance learning algorithms on real data sets, and achieved excellent results on some of them.
The main message of [15, 16] is that the EWA with a properly chosen prior is able to deal with the sparsity issue. In particular, [15, 16] prove that such an EWA satisfies a sparsity oracle inequality (SOI), which is more powerful than the best known SOIs for other common procedures of sparse recovery. An important point is that almost no assumption on the Gram matrix is required. In the present work we extend this analysis in two directions. First, we prove a sharp PAC-Bayesian bound for a large class of noise distributions, which is valid for a temperature parameter depending only on the noise distribution. We impose no restriction on the values of the regression function. This result is presented in Section 2. The consequences in the context of linear regression under a sparsity assumption are discussed in Section 3.
By sweeping through various values of the quantization step, we obtain the R-D curves for the Tree K-SVD method and the state-of-the-art ones. The Tree K-SVD dictionary outperforms the "flat" dictionaries when a small number of atoms is used in the representation, for complete (Fig. 3) and overcomplete (Fig. 4) dictionaries. The rate of Tree K-SVD with adaptive sparse coding (AdSC) is penalized by a flag in the bitstream of 1 bit per atom selected in the representation, indicating whether the next atom is selected at the same level or at the next one. However, this adaptive sparse coding method allows reaching a better quality (Fig. 3), but is effective when more than 2 atoms are selected in the representation. That is why Tree K-SVD and Tree K-SVD AdSC reach about the same PSNR in Fig. 4. The rate of TSITD is a bit lower, but this method does not reach the same quality of reconstruction as Tree K-SVD and Tree K-SVD AdSC, especially for small dictionaries (Figs. 3 and 4).