1 Introduction
Simulated data are widely used to assess optimisation methods, because they make it possible to evaluate aspects of the methods under study that cannot be examined with real data sets. In the context of convex optimisation, the exact solution of the minimisation problem is never known with real data, and finding it proves difficult even with simulated data. We propose to generalise an approach originally proposed by Nesterov [4] for LASSO regression to a broader family of penalised regressions: the goal is to generate simulated data for which the exact solution of the optimised function is known. The inputs are: the minimiser β∗, a candidate data set X0 (n × p), …
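For the plain Lasso, the construction can be sketched as follows: fix the minimiser and a sub-gradient of the penalty at that point, then choose the residual so that the KKT conditions hold exactly at β∗. This is a minimal sketch assuming n ≥ p and a full-column-rank design (Nesterov's construction handles the general case differently); the sizes, λ and the off-support sub-gradient values are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))

# Target minimiser: a sparse vector with known support.
beta_star = np.zeros(p)
beta_star[:3] = [1.5, -2.0, 0.8]

lam = 0.1 * n  # penalty strength in the "sum of squares" scaling
# Subgradient of lam*||.||_1 at beta_star: the sign on the support,
# strictly inside (-1, 1) off the support (this ensures uniqueness).
s = np.where(beta_star != 0, np.sign(beta_star),
             0.5 * rng.uniform(-1, 1, p))

# Choose a residual r with X^T r = lam * s (solvable when n >= p and
# X has full column rank), then set y = X beta_star + r so that the
# KKT conditions of min 0.5||y - Xb||^2 + lam||b||_1 hold at beta_star.
r = X @ np.linalg.solve(X.T @ X, lam * s)
y = X @ beta_star + r

# sklearn's Lasso minimises (1/2n)||y - Xb||^2 + alpha||b||_1,
# hence alpha = lam / n in its scaling.
fit = Lasso(alpha=lam / n, fit_intercept=False,
            tol=1e-12, max_iter=100_000).fit(X, y)
print(np.max(np.abs(fit.coef_ - beta_star)))  # should be tiny
```

Running a solver on the generated pair (X, y) should recover β∗ up to solver tolerance, which is exactly what makes such data useful as a benchmark.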

Several statistical contributions have tried to address heteroscedastic models in high-dimensional regression. Most works have relied on an exponential representation of the variance (the log-variance being modeled as a linear combination of the features), leading to non-convex objective functions. Solvers considered for such approaches require alternate minimization [Kolar and Sharpnack, 2012], possibly in an iterative fashion [Daye et al., 2012], a notable difference with a jointly convex formulation, for which one can control global optimality with duality gap certificates as proposed here. Similarly, Wagener and Dette [2012] estimate the variance with a preliminary adaptive Lasso step, and correct the data-fitting term in a second step. Here, we propose the multi-task Smoothed Generalized Concomitant Lasso (SGCL), an estimator that can handle data from different origins in a high-dimensional sparse regression model by jointly estimating the regression coefficients and the noise levels of each modality or data source. Contrary to other heteroscedastic Lasso estimators such as ScHeDs (a second-order cone program) [Dalalyan et al., 2013], its computational cost is comparable to the Lasso, as it can benefit from coordinate descent solvers [Tseng, …]

Eugene Belilovsky · Andreas Argyriou ·
Gaël Varoquaux · Matthew Blaschko
Abstract We study the problem of statistical estimation with a signal known to be sparse, spatially contiguous, and containing many highly correlated variables. We take inspiration from the recently introduced k-support norm, which has been successfully applied to sparse prediction problems with correlated features, but lacks any explicit structural constraints commonly found in machine learning and image processing. We address this problem by incorporating a total variation penalty in the k-support framework. We introduce the (k, s) support total variation norm as the tightest convex relaxation of the intersection of a set of sparsity and total variation constraints. We show that this norm leads to an intractable combinatorial graph optimization problem, which we prove to be NP-hard. We then introduce a tractable relaxation with approximation guarantees that scale well for grid-structured graphs. We devise several first-order optimization strategies for statistical parameter estimation with the described penalty. We demonstrate the effectiveness of this penalty on classification in the low-sample regime, classification with M/EEG neuroimaging data, and background-subtracted image recovery with synthetic and real data. We extensively analyse the application of our penalty to the complex task of identifying predictive regions from low-sample, high-dimensional fMRI brain data, and we show that our method is particularly useful compared to existing methods in terms of accuracy, interpretability, and stability.

1) CONESTA has a convergence rate of O(1/ε), an improvement over FISTA with fixed smoothing, whose rate is O(1/ε) + O(1/√ε) [5], [9].
2) CONESTA outperformed (in terms of execution time and precision of the solution) several state-of-the-art optimization algorithms on both simulated and neuroimaging data.
3) CONESTA is a robust solver that resolves practical problems affecting other solvers in the very high dimensions encountered in neuroimaging. For instance: (i) the EGM allows neither true sparsity nor complex loss functions; (ii) the inexact FISTA converges faster (in terms of the number of iterations) than CONESTA; however, as observed on the high-dimensional MRI dataset, after hundreds of iterations, solving the subproblem with an inner FISTA loop makes it much slower (e.g., it took 4 times longer to reach ε < 10^-3) than CONESTA.
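The continuation idea behind such solvers can be illustrated on a toy problem. The sketch below smooths an ℓ1 penalty with a Huber-type approximation and runs FISTA on a sequence of problems with a geometrically decreasing smoothing parameter μ; in CONESTA itself the smoothed term is the structured penalty (e.g. TV) while the ℓ1 part is kept exact through its proximal operator, and the μ schedule is driven by the duality gap rather than fixed halving. All sizes and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 40, 15, 2.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def grad_smoothed(beta, mu):
    # Gradient of 0.5||y - Xb||^2 + lam * huber_mu(b), where huber_mu
    # smooths |.|: derivative is b/mu inside [-mu, mu], sign(b) outside.
    g_pen = np.clip(beta / mu, -1.0, 1.0)
    return X.T @ (X @ beta - y) + lam * g_pen

def fista(beta, mu, iters):
    # FISTA on the smoothed (fully differentiable) objective.
    L = np.linalg.norm(X, 2) ** 2 + lam / mu  # Lipschitz constant of the gradient
    z, t = beta.copy(), 1.0
    for _ in range(iters):
        beta_new = z - grad_smoothed(z, mu) / L
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        z = beta_new + (t - 1) / t_new * (beta_new - beta)
        beta, t = beta_new, t_new
    return beta

# Continuation: solve a sequence of smoothed problems, warm-starting
# each from the previous solution, with geometrically decreasing mu.
beta, mu = np.zeros(p), 1.0
for _ in range(12):
    beta = fista(beta, mu, 200)
    mu /= 2.0
```

As μ shrinks, the smoothed problem approaches the non-smooth one, and the warm starts keep each inner solve cheap; this is the mechanism that continuation schemes exploit.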

This approach performs sparse kernel regression based on a sparsity-inducing prior on the weight parameters within a Bayesian framework. Unlike the commonly used Elastic-Net or Lasso approaches (based on the L1 norm, a.k.a. a Laplacian prior), the RVM method does not require setting any regularization parameter through cross-validation. Instead, it automatically estimates the noise level in the input data and performs a trade-off between the number of basis functions (complexity of the representation) and the ability to represent the signal. Furthermore, unlike SVM regression or Elastic-Net, it provides a posterior probability for each estimated quantity, which is reasonably meaningful if that quantity is similar to the training set. In our setting, we used Gaussian kernels for the non-linear regression, whose bandwidth parameter needs to be defined. The RVM regression selects only the input BSPM features that best explain the activation map in the training set, thus limiting the risk of overfitting. RVM is a multivariate but single-valued approach, and the regression was therefore performed directly on the reduced space of Section 2.3: only 400 regressions are needed to estimate the 14K activation times. We used a Gaussian radial basis function with a kernel bandwidth of 1e4 (from cross-validation). On an EliteBook Intel Core i7, a regression with 1000 training samples and 1235 features runs in 40 s.
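Tipping's RVM is not shipped with scikit-learn, but since the RVM is essentially automatic relevance determination (ARD) applied to a kernel basis, a close stand-in can be sketched with `ARDRegression` on an RBF kernel design matrix. The data, the bandwidth γ and the pruning threshold below are illustrative, not the values from the study.

```python
import numpy as np
from sklearn.linear_model import ARDRegression
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(120, 1))
y = np.sinc(X[:, 0]) + 0.05 * rng.standard_normal(120)

# RBF kernel design matrix: one Gaussian basis function per training point.
gamma = 0.5  # kernel bandwidth (assumed value)
Phi = rbf_kernel(X, X, gamma=gamma)

# The ARD prior prunes most basis functions, leaving a sparse set of
# "relevance vectors"; the noise level is estimated automatically, so no
# cross-validation of a regularization parameter is needed.
rvm = ARDRegression(threshold_lambda=1e4).fit(Phi, y)
n_relevance = np.sum(np.abs(rvm.coef_) > 1e-6)
print(n_relevance, "relevance vectors out of", len(y))
```

New points are predicted with `rvm.predict(rbf_kernel(X_new, X, gamma=gamma))`, and `rvm.predict(..., return_std=True)` exposes the posterior predictive uncertainty the excerpt refers to.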

Keywords: Stability; high-dimensional estimators; machine learning; brain imaging; clustering
I. INTRODUCTION
Using machine learning on neuroimaging data, brain regions can be linked with external variables [1]. In particular, linear predictive models are interesting because their coefficients form brain maps that can be interpreted. However, because of the high dimensionality of brain imaging data, their estimation is an ill-posed problem, and some constraints must be imposed on the estimator in order to find a feasible solution. A popular way to solve that problem is to use a sparsity constraint, as it isolates putatively relevant features. In practice, the high correlation between neighboring voxels leads to selecting too few features and hinders the estimators' ability to recover a stable support. The estimation instability causes a high variance of both the prediction scores and the model coefficients, and may therefore result in non-reproducible findings [2], [3]. To mitigate this instability, stability selection [2] adds randomization to sparsity for feature selection. More generally, two main classes of methods are known to stabilize models with highly correlated features. Feature-clustering-based methods reduce both the local correlations and the dimensionality of the data [4], [5]. Model aggregation methods, such as bagging, generate new training sets from an original one with data perturbation schemes; they build multiple estimators from the perturbed data and combine them into an estimate with reduced variance. Stability selection methods [2], [6] are variants that use sparsity to perform feature selection. In [5], …
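The randomization-plus-sparsity idea of stability selection can be sketched as follows: fit a Lasso on many random subsamples (here with random feature rescaling, in the spirit of the randomized Lasso) and keep the features whose selection frequency exceeds a threshold. The sizes, α and the 0.8 threshold are illustrative choices, not those of the cited papers.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 2.0  # five truly relevant features
y = X @ beta + rng.standard_normal(n)

B, alpha = 100, 0.2
freq = np.zeros(p)
for _ in range(B):
    idx = rng.choice(n, n // 2, replace=False)  # subsample half the data
    w = rng.uniform(0.5, 1.0, p)                # random per-feature rescaling
    coef = Lasso(alpha=alpha, fit_intercept=False).fit(X[idx] * w, y[idx]).coef_
    freq += np.abs(coef) > 1e-8                 # record which features were selected
freq /= B

stable = np.where(freq >= 0.8)[0]  # features selected in >= 80% of the runs
print(stable)
```

Truly relevant features survive the perturbations almost every time, while spurious selections driven by correlated noise rarely reach the frequency threshold; this is the variance-reduction mechanism the paragraph describes.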

However, when the covariate dimension is much higher than the response dimension, GLLiM may result in erroneous clusters at the low dimension, leading to potentially inaccurate predictions. Specifically, when the clustering is conducted at a high joint dimension, the distance at low dimension between two members of the same cluster could remain large. As a result, a mixture component might contain several sub-clusters and/or outliers, violating the Gaussian assumption of the model. This results in a model misspecification effect that can seriously impact prediction performance. We demonstrate this phenomenon with a numerical example in Section 2. A natural way to lessen this effect is to increase the number of components in the mixture, making each linear mapping even more local. But this practice also increases the number of parameters to be estimated, so estimating parameters in a parsimonious manner is required to avoid over-parameterization. In addition, increasing the number of clusters could isolate some data points or lead to singular covariance matrices. Hence, a robust estimation procedure for model stability is also necessary.

Neural networks are used increasingly as statistical models. The performance of the multilayer perceptron (MLP) and that of linear regression (LR) were compared, with regard to the quality of prediction and estimation and the robustness to deviations from the underlying assumptions of normality, homoscedasticity and independence of errors. Taking into account those deviations, five designs were constructed, and, for each of them, 3000 data were simulated. The comparison between connectionist and linear models was achieved by graphic means, including prediction intervals, as well as by classical criteria, including goodness-of-fit and relative errors. The empirical distribution of estimations and the stability of MLP and LR were studied by re-sampling methods. MLP and linear regression had comparable performance and robustness. Despite the flexibility of connectionist models, their predictions were stable. The empirical variances of weight estimations result from the distributed representation of the information among the processing elements. This emphasizes the major role of variances of weight estimations in the interpretation of neural networks. This needs, however, to be confirmed by further studies. Therefore MLPs could be useful statistical models, as long as convergence conditions are respected.
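A minimal version of such a comparison can be sketched with scikit-learn, here on a linear homoscedastic design with the same simulated-sample size of 3000; the architecture, split and seeds are arbitrary choices, not those of the study.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(8)
n = 3000  # same simulated-sample size as in the study
X = rng.standard_normal((n, 3))
# Linear signal with homoscedastic, independent Gaussian errors.
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)

# Train on the first 2000 points, evaluate R^2 on the held-out 1000.
lr = LinearRegression().fit(X[:2000], y[:2000])
mlp = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000,
                   random_state=0).fit(X[:2000], y[:2000])
print(lr.score(X[2000:], y[2000:]), mlp.score(X[2000:], y[2000:]))
```

When the generative model really is linear, both test scores land close to the noise ceiling, which matches the study's finding that MLP and LR have comparable performance; the interesting differences appear once the design deviates from the classical assumptions.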

Abstract—Principal component analysis (PCA) is an exploratory tool widely used in data analysis to uncover dominant patterns of variability within a population. Despite its ability to represent a data set in a low-dimensional space, PCA's interpretability remains limited. Indeed, the components produced by PCA are often noisy or exhibit no visually meaningful patterns. Furthermore, the fact that the components are usually non-sparse may also impede interpretation, unless arbitrary thresholding is applied. However, in neuroimaging, it is essential to uncover clinically interpretable phenotypic markers that would account for the main variability in the brain images of a population. Recently, some alternatives to the standard PCA approach, such as Sparse PCA, have been proposed, their aim being to limit the density of the components. Nonetheless, sparsity alone does not entirely solve the interpretability problem in neuroimaging, since it may yield scattered and unstable components. We hypothesized that the incorporation of prior information regarding the structure of the data may lead to improved relevance and interpretability of brain patterns. We therefore present a simple extension of the popular PCA framework that adds structured sparsity penalties on the loading vectors in order to identify the few stable regions in the brain images that capture most of the variability. Such structured sparsity can be obtained by combining, e.g., ℓ1 and total variation (TV) penalties, where the TV regularization encodes information on the underlying structure of the data. This paper presents the structured sparse PCA (denoted SPCA-TV) optimization framework and its resolution. We demonstrate SPCA-TV's effectiveness and versatility on three different data sets. It can be applied to any kind of structured data, such as N-dimensional array images or meshes of cortical surfaces. The gains of SPCA-TV over unstructured approaches (such as Sparse PCA and ElasticNet PCA) or a structured approach (such as GraphNet PCA) are significant, since SPCA-TV reveals the variability within a data set in the form of intelligible brain patterns that are easier to interpret and more stable across different samples.

Compression and selection increase prediction accuracy. We now assess the importance of compression and variable selection for prediction performance. We consider the prediction accuracy, evaluated through the prediction error rate. A first interesting point is that the prediction performance of compression methods is improved by the addition of a selection step: logit-SPLS, SGPLS and SPLS-DA perform better than logit-PLS, GPLS and PLS-DA respectively (cf. Tab. 3). In addition, sparse PLS approaches also present a lower classification error rate than the GLMNET method, which performs variable selection only. These two points support our claim that compression and selection should in any case both be considered for prediction. Similar results are observed for other configurations of simulated data (cf. Supp. Mat. Section A.5.2). All the different SPLS-based approaches show similar prediction performance, even methods that do not converge (SPLS-log or SGPLS) compared to our adaptive approach logit-SPLS. Thus, checking prediction accuracy alone may not be a sufficient criterion to assess the relevance of a method. The GPLS method is a good example of a non-convergent method (cf. Tab. 3 and Tab. A.2 in Supp. Mat.) that presents high variability and poor prediction performance.

; 2 × 2 × 2-mm voxels; 0.5-mm gap). Realignment, normalization to MNI space, and the General Linear Model (GLM) fit were performed with the SPM5 software (http://www.fil.ion.ucl.ac.uk/spm/software/spm5). The normalization is the conventional one of SPM (implying affine and non-linear transformations) and not the one using unified segmentation. The normalization parameters are estimated on the basis of a whole-head EPI acquired in addition, and are then applied to the partial EPI volumes. The data are not smoothed. In the GLM, the effect of each of the 12 stimuli convolved with a standard hemodynamic response function was modeled separately, while accounting for serial auto-correlation with an AR(1) model and removing low-frequency drift terms using a high-pass filter with a cut-off of 128 s. The GLM is fitted separately in each session for each subject, and we used in the present work the resulting session-wise parameter estimate images (the β-maps are used as rows of X). The four different shapes of objects were pooled for each one of the three sizes, and we are interested in finding discriminative information between sizes. This reduces to a regression problem, in which our goal is to predict a simple scalar factor (the size of an object). All the analyses are performed without any prior selection of regions of interest, and use the whole acquired volume.

return β^(t), θ^(t)
5.2 Newton-Celer
When using a squared ℓ2 loss, the curvature of the loss is constant: for the Lasso and the multitask Lasso, the Hessian does not depend on the current iterate. This is however not true for other GLM data-fitting terms, e.g., logistic regression, for which taking second-order information into account proves very useful for fast convergence (Hsieh et al., 2014). To leverage this information, we can use a prox-Newton method (Lee et al., 2012; Scheinberg and Tang, 2013) as the inner solver; an advantage of dual extrapolation is that it can be combined with any inner solver, as we detail below. For reproducibility and completeness, we first briefly detail the prox-Newton procedure used. In the following and in Algorithms 4 to 6 we focus on a single subproblem optimization, so for lighter notation we assume that the design matrix X is already restricted to the features in the working set. The reader should be aware that in the rest of this section, β, X and p in fact refer to β_{W(t)}, X_{W(t)}, and p^{(t)}.
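A bare-bones version of the idea (no working sets, no line search): at each outer iteration, form the local quadratic model of the logistic loss, whose Hessian is XᵀDX with D = diag(σ(xᵢᵀβ)(1 − σ(xᵢᵀβ))), and solve the resulting ℓ1-penalised weighted least-squares subproblem. This sketch uses scikit-learn's Lasso for the subproblem instead of the paper's inner coordinate-descent solver; sizes, λ and the data are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p, lam = 200, 10, 5.0
X = rng.standard_normal((n, p))
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X[:, 0]))).astype(float)

beta = np.zeros(p)
for _ in range(20):  # prox-Newton / IRLS-style outer loop
    z = X @ beta
    prob = 1 / (1 + np.exp(-z))
    d = prob * (1 - prob)          # Hessian of the logistic loss is X^T diag(d) X
    w = np.sqrt(d)
    r = z + (y - prob) / d         # working response of the local quadratic model
    # The penalised quadratic subproblem is a weighted Lasso:
    # min 0.5 ||sqrt(d)*(r - X b)||^2 + lam ||b||_1
    sub = Lasso(alpha=lam / n, fit_intercept=False, tol=1e-10)
    beta = sub.fit(w[:, None] * X, w * r).coef_
```

Each outer iteration is exactly one weighted-Lasso solve, which is why the curvature information comes almost for free once a fast Lasso solver is available; a production prox-Newton adds a line search to guarantee descent.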

Wavelet thresholding of spectra has to be handled with care when the spectra are the predictors of a regression problem. Indeed, a blind thresholding of the signal followed by a regression method often leads to deteriorated predictions. The scope of this paper is to show that sparse regression methods, applied in the wavelet domain, perform an automatic thresholding: the most relevant wavelet coefficients are selected to optimize the prediction of a given target of interest. This approach can be seen as a joint thresholding designed for a predictive purpose.
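A minimal sketch of this joint, prediction-driven thresholding: transform each spectrum into an orthonormal wavelet basis (a hand-rolled Haar matrix here; any wavelet family, e.g. via PyWavelets, would do) and run a Lasso on the coefficients, so that the penalty keeps only the wavelet coefficients useful for predicting the target. The simulated "spectra" and α are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def haar_matrix(m):
    """Orthonormal Haar transform matrix, for m a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < m:
        top = np.kron(H, [1.0, 1.0])                  # averaging rows
        bot = np.kron(np.eye(H.shape[0]), [1.0, -1.0])  # detail rows
        H = np.vstack([top, bot]) / np.sqrt(2)
    return H

rng = np.random.default_rng(5)
n, m = 100, 64                      # n spectra sampled at m wavelengths
spectra = rng.standard_normal((n, m)).cumsum(axis=1)   # smooth-ish curves
y = spectra[:, 10] - spectra[:, 40] + 0.1 * rng.standard_normal(n)

H = haar_matrix(m)
W = spectra @ H.T                   # wavelet coefficients of each spectrum
fit = Lasso(alpha=0.1).fit(W, y)    # sparse regression = joint thresholding
print(np.sum(fit.coef_ != 0), "of", m, "wavelet coefficients kept")
```

The coefficients zeroed by the Lasso are exactly the "thresholded" ones, but the threshold is driven by predictive relevance rather than by signal energy, which is the paper's point.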

of weights) than the function α : x_G ↦ W x_G + β_I. Furthermore, Fig. 2 shows that genetic data x_G tend to express through W, and thereby participate in the modulation of the vector α(x_G).
We compared our approach to [13, 15], for which code is available. The features selected by [13, 15] are similar to ours for each modality taken separately. For instance, for [13] and the task "AD versus CN", the SNPs with the most important weights are in the genes APOE (rs429358), BZW1 (rs3815501) and MGMT (rs7071424). However, the genetic parameter vector learnt by [13] or [15] is not sparse, unlike ours. Furthermore, for [15], the weight for the imaging kernel is nine times larger than the weight for the genetic kernel. These experiments show that the additive model with penalties adapted to each modality performs better than [15], while our additive, multiplicative and multilevel models provide similar performances.

The first and maybe most important hyperparameter is K, the number of intervals in the coefficient functions from the prior. Because of the discretization of the rainfall and the number of observations, the value of K should stay small to remain parsimonious. Because of the size of the dataset, we have set the hyperparameter a to obtain a prior probability of being in the support of about 0.5. The results are given in Figure 8. As can be seen on the left of this figure, the error variance σ² decreases when K increases, because models of higher dimension can more easily fit the data. The main question is when they start to overfit the data. In this case, the Bayesian Information Criterion selects the model with K = 2 intervals; see Section 3.5 of the Supplementary Materials (Grollemund et al., 2018). Given the small number of observations (n = 25), the values of BIC have to be carefully interpreted. Otherwise, looking at the right panel of Figure 8, we can consider how the posterior probability α(t | D) depends on the value of K and choose a reasonable value. First, for K = 1 or 2, the posterior probability is high during a first long period of time until August of year n − 1 and falls to much lower values after that. Thus, these small values of K provide a rough picture of the dependency. Secondly, for K = 4, 5 or 6, the posterior probability α(t | D) varies between 0.2 and 0.7 and shows doubtful variations after November of year n − 1 as well as other strong variations during the summer of year n − 1 that are also doubtful. Hence we decided to rely on K = 3, although this choice is rather subjective.
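The BIC trade-off used here (fit improves with dimension while the log n penalty pushes back) can be illustrated generically on a toy polynomial regression with the same small sample size, n = 25. This is not the paper's Bayesian functional model, just the criterion itself; the data and degrees are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 25                                  # same small sample size as the study
x = rng.uniform(-1, 1, n)
y = 1.0 + 2.0 * x + 0.3 * rng.standard_normal(n)   # truly linear signal

def bic(k):
    # Least-squares polynomial fit of degree k (k + 1 coefficients),
    # scored by BIC = n * log(RSS / n) + (#params) * log(n).
    A = np.vander(x, k + 1)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ coef) ** 2)
    return n * np.log(rss / n) + (k + 1) * np.log(n)

scores = {k: bic(k) for k in range(1, 7)}
best = min(scores, key=scores.get)
print(best)  # BIC should tend to prefer the small (linear) model
```

With only 25 observations the penalty log(25) ≈ 3.2 per extra parameter is substantial, which is exactly why the excerpt warns that BIC values at this sample size must be interpreted carefully.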

To limit the set of possible solutions, prior hypotheses on the nature of the source distributions are necessary. The minimum-norm estimates (MNE), for instance, are based on ℓ2 Tikhonov regularization, which leads to a linear solution [25]. An ℓ1 norm penalty was also proposed by Uutela et al. [63], modeling the underlying neural pattern as a sparse collection of focal dipolar sources, hence their name "Minimum Current Estimates" (MCE). These methods have inspired a series of contributions in source localization techniques relying on noise normalization, such as dSPM [11] and sLORETA [11, 52], to correct for the depth bias [2], or on block-sparse norms, such as MxNE [56] and TF-MxNE [24], to leverage the spatio-temporal dynamics of MEG signals. If other imaging data are available, such as fMRI [50, 70] or diffusion MRI [12], they can be used as prior information, for example in hierarchical Bayesian models [55]. While such techniques have had some success, source estimation in the presence of complex multi-dipole configurations remains a challenge. To address it, one idea is to leverage the anatomical and functional diversity of multi-subject datasets to improve localization results.
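The ℓ2 (MNE) case has the closed form below; the toy forward matrix and regularization value are illustrative, not a real gain matrix. Note that the estimate is linear in the measurements and spatially spread, which is exactly what the ℓ1 (MCE) and block-sparse variants are designed to sharpen.

```python
import numpy as np

rng = np.random.default_rng(6)
n_sensors, n_sources = 20, 500
G = rng.standard_normal((n_sensors, n_sources))  # forward (gain) matrix
x_true = np.zeros(n_sources)
x_true[42] = 1.0                                 # one focal source
m = G @ x_true + 0.01 * rng.standard_normal(n_sensors)

# Minimum-norm estimate: the closed-form solution of
#   min_x ||m - G x||^2 + lam ||x||^2
# is x = G^T (G G^T + lam I)^{-1} m  (only an n_sensors-sized system).
lam = 1.0
x_mne = G.T @ np.linalg.solve(G @ G.T + lam * np.eye(n_sensors), m)
```

Even with a single true focal source, the MNE spreads energy over many sources (every entry of `x_mne` is generically nonzero); an ℓ1 penalty would instead return a few nonzero dipoles.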

Our proposed continuation algorithm addresses the two main aforementioned deficiencies. Indeed, CONESTA (i) is relevant in the context of any smooth convex loss function, because it only requires the computation of the gradient, and (ii) estimates weights that are strictly sparse, because it does not require smoothing the sparsity-inducing penalties. Additionally, CONESTA does not require solving any linear systems in P dimensions, or inverting very large matrices (XXᵀ is inverted in the gap, but is assumed to be small in the N ≪ P paradigm), and can easily be applied with a variety of convex smooth loss functions and many different complex convex penalties.
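The gap computation mentioned here can be sketched for the plain Lasso: any primal candidate β together with any dual-feasible point θ yields a computable upper bound P(β) − D(θ) on the suboptimality of β, and the standard choice for θ is the rescaled residual. Problem sizes and λ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, lam = 50, 100, 5.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def duality_gap(beta):
    # Primal: P(b) = 0.5 ||y - X b||^2 + lam ||b||_1.
    r = y - X @ beta
    primal = 0.5 * r @ r + lam * np.abs(beta).sum()
    # Dual: D(th) = 0.5 ||y||^2 - 0.5 ||y - th||^2,
    # feasible iff ||X^T th||_inf <= lam; scale the residual into that set.
    theta = r * min(1.0, lam / np.max(np.abs(X.T @ r)))
    dual = 0.5 * y @ y - 0.5 * (y - theta) @ (y - theta)
    return primal - dual

print(duality_gap(np.zeros(p)))  # certified suboptimality of beta = 0
```

The gap is always nonnegative and shrinks to zero at the optimum, which is what makes it usable both as a stopping criterion and, in CONESTA, as the driver of the smoothing schedule.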

metric learning algorithms and design experiments on them. We divided this chapter into two parts. In the first part, we consider the case of multiple relational tables between multiple entity tables. For this situation, we start with the selection of similar and dissimilar sets. The relational side information is used to construct a correlation-strength function that evaluates the degree of similarity between pairs of entities. Then, according to the evaluation results, samples are assigned to a similar set or a dissimilar set and finally used in a classical metric distance learning algorithm. In the second part, we focus on a large entity table with multiple relational tables between samples. For this situation, we propose two schemes: one scheme is to combine the multiple relational adjacency matrices into a relational tensor, perform a RESCAL tensor decomposition [NTK11], treat the decomposed matrix as a new feature space, and apply metric distance learning in this new feature space; the other scheme is to directly accumulate the loss functions of the multiple relational adjacency metrics, construct a comprehensive loss function that considers both supervised and unsupervised information, and then optimize that function. It is worth mentioning that the algorithm proposed in the first part can also easily be extended to the second case. Finally, we conducted an experimental evaluation of these three relational metric learning algorithms, compared them with other metric distance learning algorithms on real data sets, and achieved excellent results on some of them.


The main message of [15, 16] is that the EWA with a properly chosen prior is able to deal with the sparsity issue. In particular, [15, 16] prove that such an EWA satisfies a sparsity oracle inequality (SOI), which is more powerful than the best known SOIs for other common procedures of sparse recovery. An important point is that almost no assumption on the Gram matrix is required. In the present work we extend this analysis in two directions. First, we prove a sharp PAC-Bayesian bound for a large class of noise distributions, which is valid for a temperature parameter depending only on the noise distribution. We impose no restriction on the values of the regression function. This result is presented in Section 2. The consequences in the context of linear regression under a sparsity assumption are discussed in Section 3.
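A minimal sketch of the EWA itself, with a uniform prior over a finite dictionary of candidate predictors and a temperature of order 4σ², a common choice in this theory; the dictionary, noise level and temperature are illustrative, and the PAC-Bayesian analysis covers far more general priors.

```python
import numpy as np

rng = np.random.default_rng(10)
n, M = 60, 200
F = rng.standard_normal((M, n))          # M candidate predictors (dictionary)
sigma2 = 0.25
y = F[3] + np.sqrt(sigma2) * rng.standard_normal(n)  # truth close to candidate 3

temp = 4 * sigma2                        # temperature beta ~ 4 sigma^2
risks = np.sum((F - y) ** 2, axis=1) / n # empirical risk of each candidate

# Exponential weights with a uniform prior:
#   theta_j  propto  exp(-||y - f_j||^2 / temp)
logw = -n * risks / temp
logw -= logw.max()                       # stabilise before exponentiating
theta = np.exp(logw)
theta /= theta.sum()
f_ewa = theta @ F                        # exponentially weighted aggregate
```

Because the weights decay exponentially in the empirical risk, the aggregate concentrates on the few good candidates, which is the mechanism behind the oracle inequalities the excerpt discusses.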

By sweeping through various values of the quantization step, we obtain the R-D curves for the Tree K-SVD method and the state-of-the-art ones. The Tree K-SVD dictionary outperforms the "flat" dictionaries when a small number of atoms is used in the representation, for complete (Fig. 3) and overcomplete (Fig. 4) dictionaries. The rate of Tree K-SVD with adaptive sparse coding (AdSC) is penalized by a flag in the bitstream of 1 bit per atom selected in the representation, indicating whether the next atom is selected at the same level or at the next one. However, this adaptive sparse coding method allows reaching a better quality (Fig. 3), and is effective when more than 2 atoms are selected in the representation. That is why Tree K-SVD and Tree K-SVD AdSC reach about the same PSNR in Fig. 4. The rate of TSITD is a bit lower, but this method does not reach the same quality of reconstruction as Tree K-SVD and Tree K-SVD AdSC, especially for small dictionaries (Figs. 3 and 4).
