• Aucun résultat trouvé

From optimization to algorithmic differentiation: a graph detour

N/A
N/A
Protected

Academic year: 2021

Partager "From optimization to algorithmic differentiation: a graph detour"

Copied!
71
0
0

Texte intégral

(1)

HAL Id: tel-03159975

https://tel.archives-ouvertes.fr/tel-03159975

Submitted on 4 Mar 2021

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de

graph detour

Samuel Vaiter

To cite this version:

Samuel Vaiter. From optimization to algorithmic differentiation: a graph detour. Optimization and Control [math.OC]. Université de Bourgogne, 2021. �tel-03159975�

(2)

´Ecole Doctorale Carnot–Pasteur

D e l’ o p t i m i sat i o n `a l a d i f f ´e r e n t i at i o n

d

’ a l g o r i t h m e s : u n d ´e t o u r pa r l e s g r a p h e s

F ro m o p t i m i z at i o n t o a l g o r i t h m i c

d i f f e r e n t i a t i o n

: a g r a p h d e t o u r

H a b i l i tat i o n `a

D i r i g e r d e s R e c h e rc h e s

S p ´e c i a l i t ´e M at h ´e m at i q u e s A p p l i q u ´e e s

Pr´esent´ee par

Samuel VAITER

Soutenue publiquement le 14 janvier 2021 apr`es avis des rapporteurs Alexandre d’ A s p r e m o n t CNRS & ´Ecole Normale Sup´erieure Martin B u r g e r FAU Erlangen-N ¨urnberg

J´er ˆome M a l i c k CNRS & Universit´e Grenoble Alpes et devant le jury compos´e de

Alexandre d’ A s p r e m o n t CNRS & ´Ecole Normale Sup´erieure Francis Bac h INRIA Paris & ´Ecole Normale Sup´erieure Martin B u r g e r FAU Erlangen-N ¨urnberg

Herv´e C a r d o t Universit´e de Bourgogne Marco C u t u r i Google Brain & ENSAE Alexandre G r a m f o r t INRIA Saclay

J´er ˆome M a l i c k CNRS & Universit´e Grenoble Alpes Joseph S a l m o n Universit´e de Montpellier

(3)

This manuscript highlights the work of the author since he was nominated as “Charg´e de Recherche” (research scientist) at Centre national de la recherche scientifique (CNRS) in 2015. In particular, the author shows a thematic and chronological evolution of his research interests:

(i) The first part, following his post-doctoral work, is concerned with the develop-ment of new algorithms for non-smooth optimization.

(ii) The second part is the heart of his research in 2020. It is focused on the analysis of machine learning methods for graph (signal) processing.

(iii) Finally, the third and last part, oriented towards the future, is concerned with (automatic or not) differentiation of algorithms for learning and signal processing.

Keywords: Convex optimization; Automatic differentiation; high dimensional data; graph signals

esum´

e

Ce manuscript pr´esente le travail de l’auteur depuis sa nomination comme “Charg´e de Recherche” au Centre national de la recherche scientifique (CNRS) en 2015. En particulier, l’auteur dresse une ´evolution th´ematique et chronologique de ses int´erˆets de recherche :

(i) La premier bloc, continuit´e de son travail post-doctoral, concerne le d´eveloppement de nouveaux algorithmes pour l’optimisation non-lisse.

(ii) Le second lui est le coeur de recherche en 2020 pour l’auteur, `a savoir l’analyse de m´ethodes d’apprentissage automatique pour le traitement de (signaux sur) graphes.

(iii) Enfin, le troisi`eme et dernier bloc, tourn´e vers le futur, concerne la diff´erentiation (automatique ou non) d’algorithmes en apprentissage et traitement du signal.

Mots-cl´es : Optimisation convexe; Diff´erentiation automatique; Grande dimension; Signaux sur graphes

(4)

Introduction 1

1 Past: Non-smooth First-order Optimization 8

1.1 Convex problems and first-order schemes . . . 8

1.2 Alternate minimization for Dykstra-like problems . . . 9

1.3 Dual extrapolation for sparse-like problems . . . 13

1.4 Support Vector Regression with linear constraints . . . 15

2 Present: Graphs and Machine Learning 24 2.1 Graphs and signals on graphs . . . 24

2.2 Geometry of Graph Total-Variation . . . 26

2.3 Oracle inequalities for graph signal estimators . . . 28

2.4 Guarantees for the dynamic stochastic block model . . . 32

2.5 Graph Convolutional Networks on Large Random Graphs . . . 38

3 Future: Algorithmic Differentiation 48 3.1 Differentiation of an algorithm . . . 48

3.2 Covariant refitting of estimators . . . 49

3.3 Parameter selection for the Lasso . . . 56

(5)

Remerciements

En relisant mes remerciements contenus dans mon manuscrit de doctorat, je me rends compte que ceux-ci restent parfaitement d’actualit´e `a ce jour. Mais sept ans plus tard, d’autres figures ont fait leur apparition.

Je tiens `a remercier tout d’abord mes trois rapporteurs Alexandre d’Aspremont, Martin Burger et J´er ˆome Malick d’avoir accept´e de relire le pr´esent manuscrit. Chacun d’entre eux est une r´ef´erence en optimisation pour l’imagerie ou l’apprentissage statistique, et leur pr´esence est un honneur. Merci ´egalement aux membres de ce jury. Francis Bach est l’unique point fixe entre ma th`ese et l’HDR, et mon admiration reste enti`ere pour son travail. Herv´e Cardot a accept´e de co-encadrer mon premier doctorant en me laissant toute libert´e et de pr´esider ce jury. Marco Cuturi n’est pas un coll`egue avec lequel j’ai interagi, mais son travail est source d’inspiration pour moi. Alexandre Gramfort et Joseph Salmon ont ´et´e `a tour de r ˆole coachs et collaborateurs. Coach car d’une aide pr´ecieuse par leur conseil pour rentrer dans le monde acad´emique, jamais avare de leur temps, et collaborateurs derni`erement vu qu’une partie de mon travail a impliqu´e nos ´etudiants.

Au-del`a de ce jury, je tiens `a remercier plusieurs coll`egues et collaborateurs : Gabriel Peyr´e (avec qui tout c¸a a commenc´e), Jalal Fadili (et son suivi minutieux), Charles Dos-sal (et ses explications g´eom´etriques lumineuses), Charles Deledalle (et cette joie de travailler en bin ˆome), Stephan De Bi`evre (et les ´echanges lillois), Antonin Chambolle (et son intuition sans ´equivalent), Pauline Tan (et notre d´esarroi devant les intuitions d’Antonin), Jean-Franc¸ois Aujol (et son accueil bordelais), Nicolas Papadakis (pour les nombreuses bi`eres et id´ees ´echang´ees), Abdessamad Barbara et Abderrahim Jourani (pour ma premi`ere exp´erience de recherche certifi´ee 100% bourguignonne), Pierre Bel-lec (et ma premi`ere d´ecouverte de la collaboration `a distance en 2017, si utile cette ann´ee), Xavier Dupuis (et ses dessins de maison `a cinq nœuds), mais aussi Bernard Bonnard, Jonas Lampart, Paolo Rossi, S´ebastien Mazzarese, Guido Carlet, Michele Tri-estino, Marielle Simon, Thomas Chambrion (pour faire vivre l’IMB et Dijon), Yann Traonmilin (pour me convaincre de faire des calculs d’angle solide), R´emi Gribonval (comme rapporteur d’abord, comme conseil ensuite puis comme collaborateur), Nicolas Keriven (pour son nombre infini non-d´enombrable d’id´ees), Nelly Pustelnik et Patrice Abry (et ma red´ecouverte de la physique), Mathieu Blondel (et son expertise), Alberto Bietti (pour m’avoir fait faire un peu de science durant le premier confinement). Mes remerciements vont ´egalement `a mes deux doctorants, Quentin Klopfenstein (qui m’a initi´e `a l’encadrement) et Hashem Ghanem (qui commence dans des conditions si difficiles). Je n’oublie pas les ´etudiants que je n’encadre pas, mais avec qui j’ai la chance d’interagir fr´equemment : Mathurin Massias, Quentin Bertrand et Barbara Pascal. Enfin, merci `a ma sœur et `a mes parents pour leur soutien, et merci `a Simona et ´Elie pour notre quotidien, sources de joie.

(6)

Organization. This Habilitation `a Diriger des Recherches manuscript is organized into three chapters and this introduction. The three chapter are both thematical and “chrono-logical”1

:

• Chapter1is concerned with non-smooth optimization which is the area of exper-tise that I developed at the end of my Ph.D. and postdoc. It is in some sense my “past”.

• Chapter 2presents my contribution to some problems arising in the context of graph (signal) processing. It is my main area of research at writing time. I will call it my “present”.

• Chapter3is a perilous mix of my works related to algorithmic differentiation. It is not yet an area of research that I explore systematically, nevertheless I believe it will be my “future”.

It is possible to read each chapter in an almost independent way. Almost because for instance the Lasso is defined in chapter1and re-used in chapter3.

Disclaimer on my previous works. This introduction is dedicated to a quick overview of my research contributions since 2015 starting from my postdoctoral activity. During my Ph.D. thesis (SV-PhD1) my main focus was the analysis of variational methods for low complexity regularization such as sparse regularizations, low-rank minimiza-tion, etc. It was concerned with recovery guarantees and sensitivity analysis of convex optimization problems by combining a data fidelity and a regularizing functional promoting solutions conforming to some notion of low complexity related to their non-smoothness points.

Several publications in journal (SV-C11;SV-J11;SV-J10;SV-J9) or conferences / work-shops (SV-C8;SV-C7;SV-C9;SV-C10;SV-C11;SV-J11;SV-C5;SV-C6) are the byproduct of this doctoral work. Two papers (SV-J8;SV-J4) associated to this theme have been pub-lished during the period 2015–2020, but are not described in details in this manuscript since most of the content can be found in my Ph.D. thesis (SV-PhD1) or in the review chapter (SV-BC1).

I believe this area of research is still very interesting, and I follow with attention the work from one side by Jingwei Liang and Clarice Poon on extension of the use of partial

1

(7)

smoothness, and on the other side the work by Franck Iutzeler, Guillaume Garrigos and J´er ˆome Malick (in collaboration with Jalal Fadili and Gabriel Peyr´e) focused on mirror-stratifiable regularization.

I also made the choice of not discussing my collaboration (SV-C3;SV-C4) with R´emi Gribonval and Yann Traonmilin on designing “good” regularization functionals be-cause it is an ongoing work which is not mature enough to be summarized in a thesis chapter.

Chapter

1

– Past: Optimization for sparse-like models

My 1-year postdoc with Antonin Chambolle was focused on improving my knowledge on optimization, and more specifically on how to derive new algorithms to solve con-crete problems. The zoology of optimization techniques, even for first-order methods, is nowadays quite dense, and it is out of the scope of this document to try to provide a unified point of view. Instead, I will present three directions:

(i) Nesterov’s acceleration for alternating minimization ;

(ii) Better primal-dual gap estimation through dual extrapolation ;

(iii) Revisiting the Support Vector Regression to include constraints, with a focus on oncological applications.

Alternate minimization. With Antonin Chambolle and Pauline Tan, we proposed ( SV-J6) a method to accelerate (in the Nesterov (2004) sense) alternate minimization algo-rithms which involve two variables coupled by a quadratic penalization. This kind of problem arises when one try to evaluate proximity operator of function of the form

f = f1◦ A + f2◦ B wheref1,f2 are (strongly) convex functions andA,B linear

opera-tors. Since the work of Boyle and Dykstra (1986), we know that performing alternate minimization in the dual space allows to compute such proximity operators. Our contribution was to show that we can accelerate this minimization using FISTA-type overrelaxation (Beck and Teboulle,2009) as proposed by Chambolle and Pock (2015) in the case wheref1,f2are strongly convex with enjoying a linear rate of convergence. The

main application was to show that we can parallelize on GPU a modified version of the Total Variation (Rudin, Osher, and Fatemi,1992) to achieve very fast performance.

Extrapolation techniques for coordinate descent. With Alexandre Gramfort, Math-urin Massias and Joseph Salmon, we tackled in (SV-J2) the issue of correctly estimating the dual gap used as a stopping criterion in coordinate descent algorithms applied to sparse generalized linear models. Using Anderson (1965) acceleration methods as re-cently advocated by Scieur, d’Aspremont, and Bach (2020), we showed that it is possible to significantly improve the estimation of the lack of optimality both from a theoretical

(8)

and practical point of view. A significant contribution (done by M. Massias) of this work is to provide a drop-in scikit-learn (Pedregosa et al., 2011) Lasso estimator class which include this extrapolation method along with working-set and safe-rule improvements.

Support Vector Regression and immuno-oncology. With my Ph.D. student Quentin Klopfenstein, we are currently working with medical researchers in immuno-oncology. It led us (SV-P4) to consider the addition of linear constraints to the Support Vector Regression (Drucker et al.,1996) estimator. We showed that the popular Sequential Minimal Optimization (SMO) algorithm, proposed by Platt (1998), can be adapted to this setting. Preliminary results on immuno-oncology dataset is provided in the context of a simplex constraint, along with synthetic results on isotonic and non-negative constraints.

Chapter

2

– Present: Graphs and signals on graph

The intersection between graph theory and statistics / machine learning is one of my major research activity at the moment. Many methods proposed in the literature do not take into account the fine structures (geometric or not) behind the underlying data. Such structures can often be modeled by graphs. A refined analysis of the underlying graph influence is still missing and most of the literature neglects, for simplicity, the underlying graph structure, or uses linear estimators to overcome these issues. Here, I mainly focus on the use of robust non-linear regularizations to deal with inverse problems or classification tasks on such signals. More precisely, I have at the moment four lines of research in this area:

(i) Oracle properties of non-linear regularization for graph signal retrieval ; (ii) Geometric analysis of these regularizations ;

(iii) Analysis of the spectral clustering in a dynamic setting ; (iv) Convergence and stability of Graph Convolutional Networks.

Oracle properties of Graph-Slope. With Pierre Bellec and Joseph Salmon, we pro-posed an estimator (SV-J5) coined Graph-Slope which is an adaptation to the graph setting of the SLOPE (Bogdan et al.,2015) estimator, also known as ordered`1 regular-ization (Zeng and Figueiredo,2014) in the signal community. Our main contribution was to show that the optimal denoising rate of Graph-Slope was better than the one already proved by H ¨utter and Rigollet (2016) for Graph-TV. This analysis also provides a way to choose in a principled way the regularization parameters. We also show em-pirical performance on simulated data based on a splitting method (forward–backward on the dual).

(9)

Geometry of sparse analysis regularization a.k.a, Graph-Lasso. With Abdessamad Barbara and Abderrahim Jourani, we (SV-J3) start the investigation of the geometric structure of the solution set of Graph-TV when there is no uniqueness. We showed that a “largest” solution (i.e., less edge-sparse) are in fact a typical solution, and that a primal-dual interior point method allows to retrieve one. We performed a more refined analysis (SV-P3) with Xavier Dupuis where we connect the sparsity level of a solution with the corresponding face of the solution set (which is a polytope). It could be seen as a particular, but more precise, result of the work of Boyer et al. (2019), or more generally as a representer theorem.

Spectral clustering for dynamic stochastic block model. In a different context, we studied a dynamic stochastic block model with Nicolas Keriven (SV-P1), and how one can improve the standard spectral clustering with such prior. It follows the line of work of Lei and Rinaldo (2015a) for the regular stochastic block model and Pensky and Zhang (2019a) with a different time smoothing. By the way, we provided the first (to our knowledge) bound on normalized Laplacian matrix concentration, which is a probabilistic result of interest in itself.

Convergence and stability of Graph Convolutional Networks. Leveraging our work ( SV-P1), Alberto Bietti, Nicolas Keriven and I studied (SV-C2) properties of Graph Convo-lutional Networks (GCNs) by analyzing their behavior on standard models of random graphs, where nodes are represented by random latent variables and edges are drawn according to a similarity kernel. This allows us to overcome the difficulties of deal-ing with discrete notions such as isomorphisms on very large graphs, by considerdeal-ing instead more natural geometric aspects. We obtained the convergence of GCNs to their continuous counterpart as the number of nodes grows. Our results are fully non-asymptotic and are valid for relatively sparse graphs with an average degree that grows logarithmically with the number of nodes. We then analyze the stability of GCNs to small deformations of the random graph model.

Chapter

3

– Future: Differentiated algorithmic

Differentiable programming has attracted a lot of attention recently. The hype for this word started in 2018 following the Facebook’s comment of Yann LeCun

OK, Deep Learning has outlived its usefulness as a buzz-phrase. Deep Learning est mort. Vive Differentiable Programming!

I use the term differentiated algorithm (often coined adjoint program in the automatic differentiation community) here to focus on the fact that we do not take advantage of the automatic part of automatic differentiation which is at the core of many of the ideas of differentiable programming.

(10)

With my collaborators, we used the differentiation of algorithms towards two main goals:

(i) refitting of estimators to reduce their bias ;

(ii) selection of hyperparameters of regularized models.

Algorithmic refitting. It is well known that convex methods such as the Lasso or total variation regularization induced a bias which is seen as a contraction of the large coefficients for sparse models towards zero. Together with Charles Deledalle, Nicolas Papadakis and Joseph Salmon, we proposed (SV-J7) a systematic way to perform the debiasing of such methods along the computation of the estimator instead of relying on a two-step procedure. In order to achieve this single step procedure, we compute the differentiation of the algorithm with respect to the observation. With the same co-authors, such strategy was further extended (SV-J1) to the analysis group Lasso to obtained stronger guarantees on the refitted estimator.

(Hyper)parameters selection. Concerning the selection of hyperparameters, I ex-plored two different approaches (with two different teams) both taking their sources in the analysis of differentiated algorithms. The first line (SV-C1), in collaboration with Quentin Bertrand, Mathieu Blondel, Alexandre Gramfort, Quentin Klopfenstein and Joseph Salmon, is focused on the differentiation of a block coordinate descent algorithm to solve the Lasso problem. We showed that we can take advantage of the row and column-sparse structure of the Jacobian to improve the running time of hypergradient method to select the trade-off parameter. The other line (SV-P2), in collaboration with Patrice Abry, Barbara Pascal and Nelly Pustelnik, is an extension of a line of work ( SV-J10) we started in 2014 with Charles Deledalle, Jalal Fadili and my Ph.D. advisor Gabriel Peyr´e on risk estimation via Stein lemma (Stein,1981). We showed that SUGAR can be adapted to correlated noise model, and we applied it to a texture segmentation prob-lem. I do not develop this work in this manuscript because the rigorous presentation of the tools needs more space than other contributions.

Publications

Preprints

(SV-P1) Keriven, Nicolas and Samuel Vaiter (2020). Sparse and Smooth: improved guarantees for Spectral Clustering in the Dynamic Stochastic Block Model. Tech. rep. eprint:arXiv:2002.02892.

(SV-P2) Pascal, Barbara, Samuel Vaiter, Nelly Pustelnik, and Patrice Abry (2020). Automated data-driven selection of the hyperparameters for total-variation based texture segmentation. Tech. rep. eprint:arXiv:2004.09434.

(11)

(SV-P3) Dupuis, Xavier and Samuel Vaiter (2019). The Geometry of Sparse Analysis Regularization. Tech. rep. eprint:arXiv:1907.01769.

(SV-P4) Klopfenstein, Quentin and Samuel Vaiter (2019). Linear Support Vector Re-gression with Linear Constraints. Tech. rep. eprint:arXiv:1911.02306.

Journal papers

(SV-J1) Deledalle, Charles-Alban, Nicolas Papadakis, Joseph Salmon, and Samuel Vaiter (2020). “Block based refitting in`12sparse regularisation”. In: J Math

Imaging Vis (to appear). eprint:arXiv:1910.11186.

(SV-J2) Massias, Mathurin, Samuel Vaiter, Alexandre Gramfort, and Joseph Salmon (2020). “Dual Extrapolation for Sparse Generalized Linear Models”. In: 21.234, pp. 1–33. eprint: arXiv:1907.05830.

(SV-J3) Barbara, Abdessamad, Abderrahim Jourani, and Samuel Vaiter (2019). “Max-imal Solutions of Sparse Analysis Regularization”. In: J Optim Theory Appl 180.2, pp. 371–396. eprint:arXiv:1703.00192.

(SV-J4) Vaiter, Samuel, Gabriel Peyr´e, and Jalal Fadili (2018). “Model Consistency of Partly Smooth Regularizers”. In: IEEE Trans Inform Theory 64.3, pp. 1725– 1737. eprint:arXiv:1405.1004.

(SV-J5) Bellec, Pierre, Joseph Salmon, and Samuel Vaiter (2017). “A Sharp Ora-cle Inequality for Graph-Slope”. In: Electron J Statist 11.2, pp. 4851–4870. eprint:arXiv:1706.06977.

(SV-J6) Chambolle, Antonin, Pauline Tan, and Samuel Vaiter (2017). “Accelerated Alternating Descent Methods for Dykstra-like problems”. In: J Math Imag-ing Vis 59.3, pp. 481–497.

(SV-J7) Deledalle, Charles-Alban, Nicolas Papadakis, Joseph Salmon, and Samuel Vaiter (2017). “CLEAR: Covariant LEAst-square Re-fitting with applica-tions to image restoration”. In: SIAM J Imaging Sci 10.1, pp. 243–284. eprint: arXiv:1606.05158.

(SV-J8) Vaiter, Samuel, Charles-Alban Deledalle, Gabriel Peyr´e, Jalal Fadili, and Charles Dossal (2017). “The Degrees of Freedom of Partly Smooth Regular-izers”. In: Ann Inst Stat Math 69.4, pp. 791–832. eprint:arXiv:1404.5557. (SV-J9) Vaiter, Samuel, Mohammad Golbabaee, Jalal Fadili, and Gabriel Peyr´e (2015). “Model Selection with Low Complexity Priors”. In: Inf. Inference 4.3, pp. 230–287. eprint:arXiv:1307.2342.

(SV-J10) Deledalle, Charles-Alban, Samuel Vaiter, Jalal Fadili, and Gabriel Peyr´e (2014). “Stein Unbiased GrAdient estimator of the Risk (SUGAR) for multi-ple parameter selection”. In: SIAM J Imaging Sci 7.4, pp. 2448–2487. eprint: arXiv:1405.1164.

(SV-J11) Vaiter, Samuel, Charles-Alban Deledalle, Gabriel Peyr´e, Charles Dossal, and Jalal Fadili (2013). “Local Behavior of Sparse Analysis Regulariza-tion: Applications to Risk Estimation”. In: Appl Comput Harmon Anal 35.3, pp. 433–451. eprint:arXiv:1204.3212.

(12)

Book chapter

(SV-BC1) Vaiter, Samuel, Gabriel Peyr´e, and Jalal Fadili (2015). “Low Complexity Regularization of Linear Inverse Problems”. In: Sampling Theory, a Renais-sance, pp. 103–153. eprint:arXiv:1407.1598.

Conferences or workshops

(SV-C1) Bertrand, Quentin, Quentin Klopfenstein, Mathieu Blondel, Samuel Vaiter, Alexandre Gramfort, and Joseph Salmon (2020). “Implicit differentiation of Lasso-type models for hyperparameter optimization”. In: ICML. eprint: arXiv:2002.08943.

(SV-C2) Keriven, Nicolas, Alberto Bietti, and Samuel Vaiter (2020). “Convergence and Stability of Graph Convolutional Networks on Large Random Graphs”. In: NeurIPS. eprint:arXiv:2006.01868.

(SV-C3) Traonmilin, Yann and Samuel Vaiter (2018). “Optimality of 1-norm regular-ization among weighted 1-norms for sparse recovery: a case study on how to find optimal regularizations”. In: NCMIP. eprint:arXiv:1803.00773. (SV-C4) Traonmilin, Yann, Samuel Vaiter, and R´emi Gribonval (2018). “Is the

1-norm the best convex sparse regularization?” In: iTWIST. eprint: arXiv: 1806.08690.

(SV-C5) Vaiter, Samuel, Gabriel Peyr´e, and Jalal Fadili (2013a). “Robust Polyhedral Regularization”. In: SAMPTA. eprint:arXiv:1304.6033.

(SV-C6) — (2013b). “Robustesse au bruit des r´egularisations polyh´edrales”. In: GRETSI.

(SV-C7) Deledalle, Charles-Alban, Samuel Vaiter, Gabriel Peyr´e, Jalal Fadili, and Charles Dossal (2012a). “Proximal Splitting Derivatives for Risk Estima-tion”. In: NCMIP.

(SV-C8) — (2012b). “Risk estimation for matrix recovery with spectral regularization”. In: ICML (sparsity workshop). eprint:arXiv:1205.1482. (SV-C9) — (2012c). “Unbiased Risk Estimation for Sparse Analysis

Reg-ularization”. In: ICIP.

(SV-C10) Vaiter, Samuel, Charles-Alban Deledalle, Gabriel Peyr´e, Jalal Fadili, and Charles Dossal (2012). “The Degrees of Freedom of the Group Lasso”. In: ICML (sparsity workshop). eprint:arXiv:1205.1481.

(SV-C11) Vaiter, Samuel, Gabriel Peyr´e, Charles Dossal, and Jalal Fadili (2012). “Ro-bust Sparse Analysis Regularization”. In: PICOF.

Thesis

(SV-PhD1) Vaiter, Samuel (2014). “Low Complexity Regularizations of Inverse Prob-lems”. PhD thesis.

(13)

1

Past: Non-smooth First-order Optimization

This chapter covers the following contributions:

• (SV-J6): Antonin Chambolle, Pauline Tan, and Samuel Vaiter (2017). “Accelerated Alternating Descent Methods for Dykstra-like problems”. In: J Math Imaging Vis 59.3, pp. 481–497.

• (SV-J2): Mathurin Massias et al. (2020). “Dual Extrapolation for Sparse General-ized Linear Models”. In: 21.234, pp. 1–33. eprint:arXiv:1907.05830.

• (SV-P4): Quentin Klopfenstein and Samuel Vaiter (2019). Linear Support Vector Regression with Linear Constraints. Tech. rep. eprint:arXiv:1911.02306.

1.1 Convex problems and first-order schemes

This chapter is concerned with convex minimization problems in finite dimension of the form

argmin

x∈Rd

F(x) + G(x), (1.1)

whereF,G∈ Γ0(Rd)are two lower-semicontinuous (lsc), convex and proper real-valued

functions onRd. Typically, the first functionFwill enjoy nice smoothness properties

such asC1,1 (continuously differentiable functions with Lipschitz gradients) regularity whereasGwill does not share such smoothness assumption.

(14)

Solving a problem as (1.1) without additional assumptions is possible through the use of the so-called a Forward–Backward scheme (Lions and Mercier,1979) which use iterates of the form

xk+1=proxγf(xk− γ∇F(xk)),

with0 < γ < 2/βwhereβis the Lipschitz constant of∇Fand

proxγf(x)def.= argmin

x∈Rd

1

2γ||z − x||

2

2+ f(x) (1.2)

is the proximity operator of f. However, computing proxγf(x) in closed-form is po-tentially as hard as solving the initial problem! The two following sections study two specific cases where using the structure of the problem we are able to achieve a better splitting strategy than merely separating smooth and non-smooth terms.

1.2 Alternate minimization for Dykstra-like problems

This section describes the content of the journal article (SV-J6) written in collaboration with Antonin Chambolle and Pauline Tan, published in 2017 in J. Math. Imaging Vis. In several applications, such as total variation regularization or disparity estimation, one may be concerned with a problem (1.1) of the form

argmin x∈Rd F(x) + k X i=1 fi(Aix) | {z } def. = G(x) ,

wherefiare “simple” convex functions andAiare linear operators (not necessarly with

the same codomain). Most of the time, computing this proximity operator in closed form is tedious, but assuming that we know how to compute the proximity operator of eachfi, Dykstra splitting allow to evaluate the proximity operator offin an efficient

way. Boyle and Dykstra (1986) algorithm idea is to perform alternative minimization on a dual problem of (1.2) which have the specific form

argmin (y1,...,yk)∈Rn1×···×Rnk 1 2|| k X i=1 Aiyi− c|| + k X i=1 gi(yi). (1.3)

From now on, I present our result for the casek = 2to simplify the exposition, i.e., , we consider the problem

argmin

(x,y)∈Rn×Rm

E(x,y)def.= 1

(15)

wheref,gare convex, proper, lower-semicontinuous functions,A,Btwo linear operators. We also considerM,Ntwo symmetric positive semidefinite operators which represent metrics on which we compute the proximal step.

Our main contribution was to show (theorecally and pratically) that in order to solve (1.4), it is possible to alternate betweenK> 1proximal step onxandL> 1proximal step on

y, instead of simply performing alternate step (K = L = 1). Moreover, it is possible to accelerate this multistep alternating minimization with a FISTA-like acceleration (Beck and Teboulle,2009). This scheme is described in Algorithm1.

For this algorithm, we were able to prove a O(1/t2)rate, more precisely we have the following theorem

Th e o r e m 1 . 1 Let (xt,yt) be computed using Algorithm 1 starting from initial

point (x0,y0), using the acceleration = True, and let (x?,y?) be a minimizer of E. Then, one has the global rate

E(xt,yt) −E(x?,y?) 6 2||x ?− x0||2 M/K+||y ?− y0||2 N/L+B∗B (t + 1)2 . (1.5)

If it turns out thatg1andg2are strongly convex, it is possible to slighlty adapt

Algo-rithm1in order to the get a linear rate of convergence. We do not enter into the details here to avoid technicalities, and refer to (SV-J6) for more details.

O p e n Q u e s t i o n 1 . 1 Algorithm1performs overrelaxation only on one variable.

Empirically, it is possible to do it on every variables. Is is possible to analyze this variant from a theoretical point of view? More precisely,

(i) Is it possible to prove aO(1/t2)rate of convergence? (ii) If yes, do we improve the constant on the bound?

Moreover, it is assumed that the computation of the proximal steps are exact. Since, there is a lot of inner iterations, is it possible to prove such as result in the context of inexact minimization?

We applied this algorithm to a slight modification1

of the standard isotropic discretiza-tion of the Total Variadiscretiza-tion (Rudin, Osher, and Fatemi,1992). The basic idea in dimension

1

(16)

Figure 1.1: Even–odd decomposition.

2 is to consider separately the set of pixels(i,j) +{0,1}2 whenever(i,j) are even and odd. More precisely, given an image u= (ui,j)∈Rn×m, we define for(i,j)∈ [n] × [m]

tvi,j(u) =

2(ui+1,j− ui,j)2+ (ui+1,j+1− ui,j+1)2

+ (ui+1,j+1− ui+1,j)2+ (ui,j+1− ui,j)21/2.

This quantity can be seen as a “4-pixels cyclic Total Variation”. If one wants to enjoy the strongly convex case, it is also possible to smooth it in a Huber fashion by using forε > 0, tvεi,j(u) =  tvi,j(u) − ε if tvi,j(u)> 2ε tvi,j(u)2 4ε otherwise.

It is also possible to adapt it to tensors to take into account multiple channels (color images). To simplify the exposition, we keep the discussion on the single channel case. Using this tv, we define the regularization

f(u) = d(n−1)/2eX i=1 d(m−1)/2eX j=1 tv(ε)2i,2j(u) + dn/2e−1X i=1 dm/2e−1X j=1 tv(ε)2i+1,2j+1(u).

We will denote by Je(u) the first sum above, and by Jo(u) the second one. This de-composition is depicited in Figure1.1. The problem minu||uu0||2+ f(u)has a dual of the form (1.4) which allows to use Algorithm 1. Indeed, given i,j, we denote by

Di+1/2,ju = ui+1,j− ui,j if 1 6 i 6 n − 1, 1 6 j 6 m, andDi,j+1/2u = ui,j+1− ui,j if

16 i 6 n,16 j 6 m − 1. Then, we callDouthe ‘odd’ part ofDuandDeuthe even part,

that is

Dou= ((Di+1/2,ju,Di,j+1/2u,Di+1/2,j+1u,Di+1,j+1/2u))i,j odd

andDeuis define in the same way but for even indicesi,j. It follows that

Joε(u) =sup{hξ,Doui −ε2|ξ| 2

(17)

0.0 0.2 0.4 0.6 0.8 1.0 ² 4 6 8 10 12 14 ms Newton Descent 5 10 15 20 25 30 35 λ 5 10 15 ms Newton ² = 0 Descent ² = 0 Newton ² = 0. 1 Descent ² = 0. 1

Figure 1.2: Left: influence of ε. Right: influence ofλ.

and the same holds forJe, replacingDowithDeand ‘odd’ with ‘even’. We will denote

ξo= ((ξi+1/2,j,ξi,j+1/2,ξi+1/2,j+1,ξi+1,j+1/2))i,j odd,

ξe= ((ξi+1/2,ji,j+1/2i+1/2,j+1i+1,j+1/2))i,j even.

Hence, the dual problem reads min

(ξeo)kD

o,∗ξo

+ De,∗ξe−u†k2

+ f(ξe) + g(ξo), (1.6)

whereD•,∗is the adjoint ofD•,

f(ξe) = 

ε

2λ|ξ

e|2 if for alli,jeven,||(ξ

i+1/2,j,ξi,j+1/2,ξi+1/2,j+1,ξi+1,j+1/2)||26 2λ2,

+∞ else

andg(ξo)is defined similarly.

O p e n Q u e s t i o n 1 . 2 The cyclic property on the grid allows to parallelize the

computation on 4-bytes block of memory (for grayscale images). An interesting perpespective is to understand of which kind of graph (see chapter2) it is possible to parallelize such scheme.

The GPGPU code for this article2

is available online. We used a standard image of size

512× 512which a dynamic inside the range[0,255]. Our stopping criterion is as before

by checking that the square root of the dual over the size of the image is less than

0.1which is an upper bound of the root mean-square error (RMSE). The dual gap is 2

(18)

computed at each iteration. If such a bound is not obtained after 10000 iterations, we stop the alternating minimization. In term of distributed computing, we choose to use thread blocks of size16× 16.

The use of Huber-TV induces better performances, in term of execution time or raw number of iterations. We first study the influence ofεin Figure1.2We compare both the case where the inner iterations are done with a Newton step and with a simple descent, both with 5 steps. For every experience in the following, we consider 20 repetitions of the experiment, and average the time obtained. Moreover, all time benchmarked are reported minus the memory initialization time. We fix the value ofλ = 30.0. Note that choosingεtoo big is however problematic in term of quality of approximation of the true Total Variation regularization.

A similar study can be performed for the influence of λ, see Figure 1.2. Again, we compare both the case where the inner iterations are done with a Newton step and with a descent, both with 5 steps. We let varyλover[1,36]and fix the value ofε = 0 (exact-TV) and alsoε = 0.1. Note that the execution time scales nicely with the dimension of the image. For instance, running our algorithm forε = 0.1andλ = 20.0took 800ms for

a2048× 2048image and 4s for a4096× 4096image.

1.3 Dual extrapolation for sparse-like problems

This section describes the content of the preprint (SV-J2) written in collaboration with Alexandre Gramfort, Mathurin Massias and Joseph Salmon, and submitted to J. Mach. Learn. Res.

We consider a sparse problem of the form

ˆ x∈argmin x∈Rp n X i=1 fi(x>ϕi) + λ||x||1 | {z } def. =P(x) , (1.7)

where all fi are closed, convex and proper functions. They are moreover assumed

to be differentiable with Lipschitz gradients with a common 1/γ Lipschitz constant. The two more used instances of Equation (1.7) are the Lasso (Tibshirani,1996), where

fi is the quadratic loss fi(t) = 12(yi− t)2 with Lipschitz constant γ = 1, and Sparse

Logistic regression (Koh, Kim, and Boyd, 2007), wherefi are the logistic lossfi(t) = log(1 +exp(−yit))with Lipschitz constantγ = 4.

(19)

A dual problem of Equation (1.7) reads: ˆ θ =argmax θ∈∆Φ − n X i=1 f∗i(−λθi) ! | {z } def. =D(θ) , (1.8) where∆Φ =  θRn| ||Φ>θ||

∞6 1 . The KKT conditions read:

∀i ∈ [n], θˆi= −fi0(xˆ>ϕi)/λ (1.9)

∀j ∈ [p], x>j θˆ ∈ ∂| · |(xˆj) (1.10)

Using Slater’s condition, for any(x,θ)∈Rp× ∆

Φ, one hasD(θ) 6 P(x), andD(θ) =ˆ P(x)ˆ .

The duality gap G(x,θ)def.=P(x) − D(θ)can thus be used as an upper bound for the sub-optimality of a primal vectorx: for anyε > 0, anyxRp, and any feasibleθ∈ ∆Φ:

G(x,θ) =P(x) − D(θ) 6 ε ⇒ P(x) − P(x)ˆ 6 ε. (1.11)

Thus, using the link equation (1.9), a natural way (Mairal,2010) to construct a dual feasible pointθ(t)∈ ∆Φ at iterationt, when only a primal vectorx(t) is available, is to

use the “scaled residuals”:

θ(t)res def.= −∇F(Φx(t))/max(λ,||Φ>∇F(Φx(t))||). (1.12)

Our contribution was to “improve” this control of sub-optimality by using a dual extrapolation based on properties of Vector AutoRegressive (VAR) sequences following the work of Scieur, d’Aspremont, and Bach (2020). Let K > 0 a fixed integer. For

K coordinate descent epochs, let r(t) = y − Φx(t) be the residuals at epoch t of the algorithm. We define the extrapolated residuals as

r(t)acc=        r(t), ift6 K, K X k=1 ckr(t+1−k), ift > K. (1.13) wherec = (c1,. . .,cK)> ∈RKis defined3as ˆ c = (U (t)>U(t))−11 K 1> K(U(t)>U(t))−11K . (1.14)

withU(t) = [r(t+1−K)− r(t−K),. . .,r(t)− r(t−1)] Rn×K. We proved the following

re-sult

3

(20)

Th e o r e m 1 . 2 Assume that Problem (1.7) has a unique solution. Then, the dual accelerated iterates(r(t)acc)t∈Ndefined by Algorithm2converges linearly to its limit.

The extrapolated feasible point is then

θ(t)accdef.= −∇F(r(t)acc)/max(λ,||Φ>∇F(r(t)acc)||∞). (1.15)

Additionally, to impose monotonicity of the dual objective, and guarantee a behavior at least as good atθres, we use as dual point at iterationt:

θ(t)= argmax

θ∈{θ(t−1)(t) acc,θ(t)res}

D(θ). (1.16)

O p e n Q u e s t i o n 1 . 3 Theorem1.2proof is based on the fact that the dual iterates

are an asymptotic VAR sequence. It would be interesting to study the VAR property of more general estimators such as multitask Lasso or generic partly smooth regular-izations (Lewis,2002) e.g.,`regularization, in order to exploit Aitken/Anderson (An-derson,1965) acceleration as in Scieur, d’Aspremont, and Bach (2020).

The estimator-specific λmax refers to the smallest value giving a null solution (for in-stance λmax = ||Φ>y||∞ in the Lasso case and λmax = ||Φ>y||∞/2 for sparse logistic regression. For the Lasso (Figure1.3a) and Logistic regression (Figure1.3b), we illus-trate the applicability of dual extrapolation. Monotonicity of the duality gap computed with extrapolation is enforced via the construction of Equation (1.16). For all problems, the figures show thatθaccgives a better dual objective after sign identification, with a duality gap sometimes even matching the suboptimality gap. They also show that the behavior is stable before identification.

1.4 Support Vector Regression with linear constraints

This section is focused on the work of my Ph.D. student Quentin Klopfenstein (SV-P4). Quentin was my Master 2 student in 2015 when he began an internship at Centre Georges-Franc¸ois Leclerc specialized in oncology research. He was exposed to a medi-cal field medi-called immuno-oncology. Tumor tissue is a complex microenvironment largely invaded by multiple immune cells. The complexity of this microenvironment is still not fully addressed. Knowing the global immune composition of tumors is thus of major importance. The development of new technologies like single cell approaches makes it possible to get a better view of the heterogeneity of tumors. To get use of

(21)

P(β(t))− D(θ(t) res) P(β(t))− D(θ(t)acc) P(β(t))− P( ˆβ) 0 200 400 600 800 1000 1200 1400 CD epoch t 10−8 10−4

(a) Lasso, on news20 for λ = λmax/5.

0 25 50 75 100 125 150

CD epoch t

10−5

101

(b) Log. reg., on rcv1 (train) for λ = λmax/20.

Figure 1.3: Dual objectives with classical and proposed approach, for Lasso (left), Logistic re-gression (right). The dashed line marks sign identification (support identification for Multitask Lasso).

this information, inverse problem methods for transcriptomic data were reported to allow the estimation of the abundance of member cell types in a mixed cell population. The modelization done is that the RNA extracted from the tumor is seen as a mixed signal composed of different pure signals coming from the different types of cells. This signal can be unmixed knowing the different pure RNA signal of the different types of cells. In other words,ywill be the RNA signal coming from a tumor andΦwill be the design matrix composed of the RNA signal from the isolated cells. The number of rows represent the number of genes that we have access to and the number of columns ofΦis the number of cell populations that we would like to quantify. The hypothesis is that there is a linear relationship betweenΦand y. As said above, we want to estimate proportions which means that the estimator has to belong to the probability simplex

∆n={x : xi> 0,

P

ixi= 1}.

In immuno-oncology, the current state-of-the-art method is proposed by Newman et al. (2015) which is an unconstrained ν-SVR (Sch ¨olkopf et al., 1999) followed by the projection onto the non-negative orthant and then followed by a`1projection.

We proposed to impose these constraints to the original optimization problem, and more generally to consider linear constaints using the primal form:

min x,x0,ξi,ξ∗i,ε 1 2||x|| 2 + C(νε + 1 n n X i=1 (ξi+ ξ∗i)) subject to xTϕi+ x0− yi6 ε + ξi yi− xTϕi− x06 ε + ξ∗i ξi,ξ∗i > 0,ε> 0 Ax6 b Γ x = d, (1.17)

(22)

whereARk1×p,Γ Rk2×p,βRp,ξ,ξRn andβ

0,ε,∈R. We recover

• isotonic constraints withAthe incidence matrix of a directed acyclic graph,Γ = 0,

b = 0andd = 0;

• non-negative constraints withA = −Ip,b = 0,C = 0andd = 0;

• simplex constraints withA = −Ip,b = 0,Γ =1andd = 1.

The algorithm (algorithm3) that we propose uses the structure of the dual problem of (1.17). If the set{x ∈Rn,Ax6 b,Γ x = d} is not empty then, one observes that strong duality holds for (1.17). Moreover, the dual problem of (1.17) is

min α,α∗,γ,µ 1 2  (α − α∗)TQ(α − α∗) + γTAATγ + µTΓ ΓTµ + 2 n X i=1 (αi− α∗i)γ T Aϕi − 2 n X i=1 (αi− α∗i)µTΓ ϕi− 2γTAΓTµ  + yT(α − α∗) + γTb − µTd subject to 06 α(∗)i 6 C n 1T(α + α) 6 Cν 1T(α − α∗) = 0 γj> 0, (1.18)

and the equation link between primal and dual is

x = −

n

X

i=1

(αi− α∗i)ϕi− ATγ + ΓTµ.

The objective functionfwhich we will write in the stacked form as:

f(θ) = θTQθ + l¯ Tθ, where θ =     α α∗ γ µ     , l =     y −y b −d    ∈R 2n+k1+k2, ¯ Q =     Q −Q XAT −XΓT −Q Q −XAT XΓT AXT −AXT AAT −AΓT −Γ XT Γ XT −Γ AT Γ ΓT     =     X −X A −Γ     XT −XT AT −ΓT

is a square matrix of size2n + k1+ k2

(23)

Th e o r e m 1 . 3 For any given τ > 0 the sequence of iterates {θk}, defined by the

generalized SMO algorithm, converges to an optimal solution of the optimization problem (1.18).

We letfas the objective function of Problem (1.18) and ∇f ∈ R2n+k1+k2 its gradient.

We will also say that(i,j)is a violating pair of variables if one of these two conditions is satisfied:

i∈ Iup(α),j∈ Ilow(α)and∇αif <∇αjf

i∈ Ilow(α),j∈ Iup(α)and∇αif >∇αjf.

We will say that jis a τ-violating variable for the blockγ if ∇γjf + τ < 0. We will say

thatjis aτ-violating variable for the blockµif|∇µjf| > τ.

Studiying the optimality condition of (1.18), we define the update between iterate k and iteratek + 1of the generalized SMO algorithm to be:

(i) if the blockα is selected and(i,j)is the most violating pair of variable then the update will be as follows:

αk+1i = αki + t∗ αk+1j = αkj − t∗,

where t∗ = min(max(I1,−

(∇αif−∇αjf) (Qii−2Qij+Qjj)),I2) with I1 = max(−α k i,αkj − C n) and I2=min(αkj,Cn− αki).

(ii) if the block α∗ is selected and(i∗,j∗)is the most violating pair of variable then the update will be as follows:

(α∗i)k+1= (α∗i)k+ t∗ (α∗j)k+1= (α∗j)k− t∗,

where t∗ = min(max(I1,−

(∇α∗ if−∇α∗jf) (Qii−2Qij+Qjj)),I2) withI1 = max(−(α ∗ i)k,(α∗j)k− C n) andI2 =min((α∗j)k,Cn− (α∗i))k.

(iii) if the blockγ is selected andiis the index of the most violating variable in this block then the update will be as follows:

γk+1i =max(− ∇γif

(AAT) ii

+ γki,0).

(iv) if the blockµis selected andiis the index of the most violating variable in this block then the update will be as follows:

µk+1i = − ∇µif

(Γ ΓT) ii

(24)

LSSVR SOLS Cibersort inf 30 25 20 15 10 5 2 1 0 SNR 0.15 0.20 0.25 0.30 0.35 RMSE LSSVR Cibersort SOLS Gaussian noise inf 30 25 20 15 10 5 2 1 0 SNR 0.15 0.20 0.25 0.30 0.35 RMSE LSSVR Cibersort SOLS Laplacian noise

Figure 1.4: The Root Mean Squared Error (RMSE) as a function of the Signal to Noise Ration (SNR) is presented on a real dataset where noise was manually added. Two different noise distribution were tested: Gaussian and Laplacian. Each point of the curve is the mean RMSE of 12 different response vectors and we repeated the process four times for each level of noise. This would be equivalent to having 48 different repetitions.

O p e n Q u e s t i o n 1 . 4 We have two lines of open questions with Quentin on this

research direction.

• From a theoretical perspective, we would like to derive a rate of convergence of this Generalized SMO algorithm. More generally, studying properties of the greedy (block) coordinate descent seems interesting, even if most practical applications use nowadays cyclic or random CD.

• From an applicative perspective, we would like to investigate why the SVR seems to outperform other regression methods in the context of immuno-oncological data.

The code, written by Quentin Klopfenstein, for the different regression settings is available on a GitHub repository4

, each setting is wrapped up in a package and is fully compatible with scikit learn (Pedregosa et al., 2011) BaseEstimator class. We present here some results for the simplex regression.

We compared the RMSE of our estimator to the Simplex Ordinary Least Squares (SOLS) and to the estimator proposed in the biostatics litterature that is called Cibersort on a real biological dataset where the real quantities of cells to obtain were known. The dataset can be found on the GEO website under the accession code GSE111035

. For this example n = 584 andp = 4 and we have access to 12 different samples that are

4

https://github.com/Klopfe/LSVR

5

The dataset can be downloaded from thehttps://www.ncbi.nlm.nih.gov/geo/Gene Expression Om-nibus website under the accession code GSE11103.

(25)

our repetitions. Following the same idea than previous benchmark performed in this field of application, we increased the level of noise in the data and compared the RMSE of the different estimators. Gaussian and Laplacian distributions of noise were added to the data. The choice of the two hyperparameters Cand νwas done using 5-folds cross validation on a grid of possible pairs. The values ofCwere taken evenly spaced in the log10base between[−5,−3], we considered 10 different values. The interval ofC

is different than the simulated data because of the difference in the range value of the dataset. The values ofνwere taken evenly spaced in the linear space between[0.05,1.0]

and we also considered 10 possible values.

We see that when there is no noise in the data (SNR =∞) both Cibersort and SSVR estimator perform equally. The SOLS estimator already has a higher RMSE than the two others estimator probably due to the noise already present in the data. As the level of noise increases, the SSVR estimator remains the estimator with the lowest RMSE in both gaussian and laplacian noise settings.

(26)

Algorithm 1 M u lt i s t e p A lt e r nat i n g M i n i m i z at i o n

input :MetricM,N, number of inner loopsK,L> 1,(x0,y0)an initialization, accelerate a boolean

t← 0

θ0← 0

¯

x0← x0, ¯y0← y0, ˚y0← y0

while stopping criterion unsatisfied do

// Perform K proximal steps on x

ˆ xt+10 x¯t fork = 0,. . .,K − 1do ˆ xt+1k+1argmin x∈Rn g1(x) + 1 2||Ax + By˚ t − c||22+1 2||x −xˆ t+1 k || 2 M ˆ xt+1←xˆt+1K ˜ xt+1 1 K K X k=1 ˆ xt+1k

// Perform L proximal steps on y

ˆ yt+10 ←y¯t forl = 0,. . .,L − 1do ˆ yt+1l+1 argmin y∈Rm g2(y) + 1 2||Ax˜ t+1+ By − c||2 2+ 1 2||y −yˆ t+1 l || 2 N ˆ yt+1← yt+1 L ˜ yt+1← 1 L L X l=1 ˆ yt+1l if accelerate=True then

// Over-relaxation of y θt+1← (1 +p1 + 4(θt)2)/2 ¯ xt+1 =x˜t+1+θθtt+1−1(x˜ t+1x˜t) + θt θt+1(xˆ t+1x˜t+1) ¯ yt+1 =y˜t+1+θt−1 θt+1(y˜t+1−y˜t) + θt θt+1(yˆt+1−y˜t+1) ˚ yt+1 =y˜t+1+θθtt+1−1(y˜t+1−y˜t) else ¯ xt+1 =x˜t+1 ¯ yt+1 =y˜t+1 ˚ yt+1 =y˜t+1 t← t + 1 returnx˜t, ˜yt

(27)

Algorithm 2 c yc l i c C D f o r P ro b l e m 1. 7 w i t h d ua l e x t r a p o l a t i o n input :Φ,y,λ,x(0),ε param :T,K = 5,fdual= 10 Φx← Φx(0),θ(0)← −∇F(Φx(0))/max(λ,||Φ>∇F(Φx(0))|| ∞) fort = 1,. . .,T do

ift = 0 modfdualthen// compute θ and gap every f epoch only

t0← t/fdual ;// dual point indexing

r(t0)← Φx

computeθ(tres0) andθacc(t0) with eqs. (1.12) and (1.15)

θ(t0)argmax D(θ) | θ ∈ {θ(t0−1),θ(t0) acc ,θ (t0) res }

;// robust dual extr. with (1.16)

ifP(x(t)) −D(θ(t0)) < εthenbreak; forj = 1,. . .,pdo x(t+1)j STx(t)j −γx > j∇F(Φx) ||xj||2 ), γλ ||xj||2  Φx← Φx + (x(t+1)j − x(t)j )xj returnx(t),θ(t0)

(28)

Algorithm 3 G e n e r a l i z e d S M O a l g o r i t h m

input :Φ,ν,y param :τ > 0

Initializingα0Rn,(α∗)0Rn,γ0Rk1 andµ0Rk2 inF and setk = 0

while∆ > τdo i←argmin i∈Iup ∇αif j←argmax i∈Ilow ∇αjf i∗argmin i∈I∗ up ∇α∗if j∗←argmax i∈I∗ low ∇α∗jf ∆1← ∇αjf −∇αif ∆2← ∇α∗ jf −∇α∗if ∆3← − min j∈{1,...,k1} ∇γjf ∆4← max j∈{1,...,k2} |∇µjf|

max(∆1,∆2,∆3,∆4)// Select the maximal violating variables

if∆ = ∆1 then

αk+1←Solution of subproblem forαiandαj

if∆ = ∆2 then

(α∗)k+1←Solution of subproblem forαi∗ andαj

if∆ = ∆3 then

u = argmin

i∈{1,...,k1}

∇γif

γk+1Solution of subproblem forγ u

else

u = argmax

i∈{1,...,k2}

|∇µif|

µk+1←Solution of subproblem forµu

k← k + 1

(29)

2

Present: Graphs and Machine Learning

This chapter is an overview of my work on graph (signal) processing. In particular, it is extracted from the following works:

• (SV-J3): Abdessamad Barbara, Abderrahim Jourani, and Samuel Vaiter (2019). “Maximal Solutions of Sparse Analysis Regularization”. In: J Optim Theory Appl

180.2, pp. 371–396. eprint:arXiv:1703.00192.

• (SV-P3): Xavier Dupuis and Samuel Vaiter (2019). The Geometry of Sparse Analysis Regularization. Tech. rep. eprint:arXiv:1907.01769.

• (SV-J5): Pierre Bellec, Joseph Salmon, and Samuel Vaiter (2017). “A Sharp Ora-cle Inequality for Graph-Slope”. In: Electron J Statist 11.2, pp. 4851–4870. eprint: arXiv:1706.06977.

• (SV-P1): Nicolas Keriven and Samuel Vaiter (2020). Sparse and Smooth: improved guarantees for Spectral Clustering in the Dynamic Stochastic Block Model. Tech. rep. eprint:arXiv:2002.02892.

• (SV-C2): Nicolas Keriven, Alberto Bietti, and Samuel Vaiter (2020). “Convergence and Stability of Graph Convolutional Networks on Large Random Graphs”. In: NeurIPS. eprint:arXiv:2006.01868

2.1 Graphs and signals on graphs

Graphs and matrices associated to a graph Let G = (V,E) be an undirected graph onnvertices, meaning that up to an isomorphismV = [n], andpedges, i.e., 2-set ofV

(30)

• Adjacency matrix. The adjacency matrix A ofGis defined asARn×nsuch that

Aij= 1ifiandjare connected, and 0 otherwise. Note thatAis symmetric.

• Incidence matrix. The edge-vertex incidence matrix∆> ∈Rp×n is defined as

(∆>)e,v=        +1, ifv =min(i,j) −1, ifv =max(i,j) 0, otherwise, (2.1) wheree ={i,j}.

• Combinatorial Laplacian matrix. The matrixLC(A) = ∆∆> is the so-called

com-binatorial graph Laplacian ofG. The LaplacianLCis invariant under a change of

orientation of the graph. It is also defined as LC(A) = D(A) − A where D(A)is

the (diagonal) degree matrix

D(A) =diag((di)ni=1) where di=

n

X

j=1

Aij.

• Normalized Laplacian matrix. An alternative Laplacian is commonly used for instance in community detection

LN(A) =Id− D(A)−

1

2AD(A)−

1 2,

the so-called “Normalized Laplacian”. We use the convention that0−12 = 0in the

notationD(A)−12.

Inverse problems on a graph we consider the following inverse problem for a signal over a graph. Assume that each vertex i ∈ [n] of the graph carries a signal x?i. One observes the vectoryRq and aims to estimatex?Rn, i.e.,

y = Φx?+ ε , (2.2)

whereε∼ N(0,σ2Idq)is a noise vector. We will say that an edgee ={i,j}of the graph

carries the signal (∆>x?)e. A signal x? ∈ Rn has few discontinuities if∆>x? has few

nonzero coefficients, i.e.,||∆>x?||0 is small, or equivalently if most edges of the graph

carry the constant signal. In particular, if ||∆>x?||0 = s, we say that x? is a vector of

(31)

2.2 Geometry of Graph Total-Variation

This section describes the content of (SV-J3) written in collaboration with Abessamad Barbara and Aberrahim Jourani, published in J. Optim. Theory. Appl.. It also covers ( SV-P3), a work in collaboration with Xavier Dupuis to appear in SIAM J. Optim..

We focus here on a convex regularization promoting edge-sparsity in the context of a linear inverse problem/regression problem where the regularization reads:

min x∈Rn 1 2n||y − Φx|| 2 2+ λ||∆>x||1 (2.3)

wherey∈Rqis a observation/response vector,Φ :Rn Rqis the sensing/acquisition

linear operator andλ > 0the hyper-parameter used as a trade-off between fidelity and regularization. Note that at this point, we do not make any assumption on the incidence matrix∆or the acquisition operatorΦ.

In a serie of works, I was interested with my coauthors on the following issue: When the solution set of eq. (2.3) is not reduced to a singleton, what is its “geome-try”?

In (SV-J3), we provided a geometrical interpretation of a solution with a maximal∆> -support, namely the fact that such a solution lives in the relative interior of the solution set. More precisely, we are concerned with the characterization of a vector of maximal

∆>-support, i.e., a solution of (2.3) such that for every x∈ X,||∆>x||0 6 ||∆>x+||0. We denote by S the set of solution of (2.3) which have maximalD-support. Clearly this set is well-defined and contained in X. Our result is the following.

Th e o r e m 2 . 1 Let ¯x∈X. Then ¯xis a maximally∆>-supported solution if, and only if, ¯xri X (or equivalently if ¯xri S). In other words,

S=ri S=ri X.

In (SV-P3), we refined this analysis to understand better the geometry of the polytope of the solutions at the price of more complicated statement.

In the spirit of (Boyer et al.,2019), we first considered the compact polyhedron(Ker>)⊥∩

B1, which is isomorphic to the projection ofB1 onto the quotient of the ambient space

by the lineality space KerD∗. We showed that the convex polyhedron (KerD∗)⊥ B1 is compact (i.e. is a convex polytope). It admits extreme points that belong to

(KerD∗)⊥∩ ∂B1. And it is possible to do an “algebraic test”: given ¯x∈ (KerD∗)⊥∩ ∂B1,

¯

s =sign(D∗x)¯ , and ¯J =cosupp(D∗x)¯ , ¯

(32)

The analysis of the level set(Ker∆>)⊥∩ B1allows to derive the following result.

Th e o r e m 2 . 2 Let ¯xri(X), ¯s =sign(∆>x)¯ , and ¯F = Br∩ {x ∈Rn:h∆s¯,xi = r}with

r =k∆>x¯k1. Then X= (x +¯ KerΦ)F¯. It follows that X= (x +¯ KerΦ)∩ {x ∈Rn :sign(∆>x)s¯}, ri(X) = (x +¯ KerΦ)∩ {x ∈Rn :sign(∆>x) =s¯}, dir(X) =KerΦKer∆>¯J (where ¯J =cosupp(s)¯ ).

Moreover, the faces of X are exactly the sets of the form{x ∈ X: J cosupp(∆>x)}

with ¯J ⊂ J; their relative interior is given by {x ∈ X : J = cosupp(∆>x)} and their direction by KerΦKer∆>J.

As an application, we show that “most” intersection of affine subspaces with the unit ball can been seen as a solution set of (2.3).

C o ro l l a r y 2 . 1 Letr > 0andAbe an affine subspace such that∅ 6= A ∩ Br⊂ ∂Br.

Then there existΦ,yandλ > 0such that the solution set of (2.3) is X = A∩ Br and KerΦ =dir(A).

From a practical point of view, these results add another argument towards the need for a good choice of regularizer/dictionary when a user seeks a robust and unique solution to its optimization problem. This work is mainly of theoretical interest since numerical applications should deal with exponential algorithms with respect to the signal dimension. Note however that in the case of the expected sparsity level of the maximal solution is logarithmic in the dimension, the enumeration problem is in this case tractable. We believe that these results will help other theoretical works around sparse analysis regularization, such as performing sensitivity analysis with respect to the dictionary used in the regularization.

O p e n Q u e s t i o n 2 . 1 Extension of our results to non-convex sparse analysis

pe-nalizations such as k · kp with0 < p < 1is an interesting research direction, where

face decomposition of the polytope unit-ball needs to be replaced with stratification of semi-algebraic sets.

(33)

2.3 Oracle inequalities for graph signal estimators

This section describes the content of (SV-J5) written in collaboration with Pierre Bellec and Joseph Salmon, published in Electron. J. Statist. in 2017.

We consider a denoising problem i.e.,q = nandΦ =Id. We consider here the so-called Graph-Slope variational scheme:

ˆ x :=βˆGSargmin x∈Rp 1 2n||y − x|| 2 +||∆>x||[λ] , (2.4) where ||∆>x|| [λ]= p X j=1 λj|∆>x|↓j , (2.5)

with λ = (λ1,. . .,λp) ∈ Rp satisfying λ1 > λ2 > · · · > λp > 0, and using for any

vector θ Rp the notation(|θ|1↓,. . .,|θ|↓p) for the non-increasing rearrangement of its

amplitudes (|θ1|,. . .,|θp|). According to (Bogdan et al., 2015),|| · ||[λ] is a norm over Rp if and only ifλ

1 > λ2 >· · · > λp > 0with at least one strict inequality. This is a

consequence of the observation that ifλ1 > λ2 >· · · > λp > 0then one can rewrite the

Slope-norm ofθas the maximum over allτ∈ Sp (the set of permutations over[p]), of

the quantityPpi=1λj|θτ(j)|:

||θ||[λ]= max τ∈Sp p X j=1 λj|θτ(j)| = p X j=1 λj|θ|↓j . (2.6)

If $\lambda_1 = \lambda_2 = \cdots = \lambda_p$ then $\|\theta\|_{[\lambda]} = \lambda_1 \|\theta\|_1$ for all $\theta \in \mathbb{R}^p$, so that the minimization problems (2.3) and (2.4) are the same. On the other hand, if $\lambda_j > \lambda_{j+1}$ for some $j = 1, \ldots, p-1$, then the optimization problems (2.3) and (2.4) differ. For instance, if $\lambda_1 > \lambda_2 > 0$, all coefficients of $\Delta^\top x$ are equally penalized in the Graph-Lasso (2.3), while coefficients of $\Delta^\top x$ are not uniformly penalized in the Graph-Slope optimization problem (2.4). Our result on this estimator is the following. For any integer $s$ and weights $\lambda = (\lambda_1, \ldots, \lambda_p)$, define
$$\Lambda(\lambda, s) = \Big( \sum_{j=1}^{s} \lambda_j^2 \Big)^{1/2}. \qquad (2.7)$$

Theorem 2.3 Assume that the Graph-Slope weights $\lambda_1 \geq \cdots \geq \lambda_p > 0$ are such that the event
$$\Big\{ \frac{1}{n}\,\big\|(\Delta^\top)^{\dagger}\varepsilon\big\|^{*}_{[\lambda]} \;\leq\; \frac{1}{2} \Big\}, \qquad (2.8)$$
where $\|\cdot\|^{*}_{[\lambda]}$ denotes the dual norm of $\|\cdot\|_{[\lambda]}$,

has probability at least $1/2$. Then, for any $\delta \in (0,1)$, we have with probability at least $1 - 2\delta$
$$\frac{1}{n}\|\hat{x} - x^\star\|^2 \;\leq\; \inf_{s \in [p]} \left\{ \inf_{\substack{x \in \mathbb{R}^n \\ \|\Delta^\top x\|_0 \leq s}} \left\{ \frac{1}{n}\|x - x^\star\|^2 + \frac{1}{2n}\left( \frac{3n\Lambda(\lambda,s)}{2\kappa(s)} + \sigma + \frac{2\sigma\sqrt{2\log(1/\delta)}}{\sqrt{n}} \right)^{\!2} \right\} \right\}, \qquad (2.9)$$
where $\Lambda(\cdot,\cdot)$ is defined in (2.7) and the compatibility factor $\kappa(s)$ is defined as
$$\kappa(s) := \inf_{\substack{v \in \mathbb{R}^n :\ 3\Lambda(\lambda,s)\,\|\Delta^\top v\|_2 \,\geq\, \sum_{j=s+1}^{p} \lambda_j |\Delta^\top v|^{\downarrow}_{j}}} \left\{ \frac{\|v\|}{\|\Delta^\top v\|_2} \right\}. \qquad (2.10)$$

Theorem 2.3 does not provide an explicit choice for the weights $\lambda_1 \geq \cdots \geq \lambda_p$. These weights should be large enough so that the event (2.8) has probability at least $1/2$. We discussed in our paper (SV-J5) an MCMC approach to ensure this event, and also a theoretical approach. We detail here only the second approach, based on the following result. Let us first write
$$\rho(G) = \max_{j \in [p]} \big\| (\Delta^\top)^{\dagger} e_j \big\|,$$
following the notation in (Hütter and Rigollet, 2016).
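For moderate graph sizes, $\rho(G)$ can be computed directly from a dense pseudo-inverse; a minimal NumPy sketch (the function name is ours, and this brute-force computation is only meant for small graphs):

```python
import numpy as np

def inverse_scaling_factor(Delta):
    """rho(G) = max_{j in [p]} ||(Delta^T)^dagger e_j||, i.e. the largest column norm."""
    pinv = np.linalg.pinv(Delta.T)              # pseudo-inverse of Delta^T, shape (n, p)
    return float(np.linalg.norm(pinv, axis=0).max())
```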

Corollary 2.2 Assume that the Graph-Slope weights $\lambda_1 \geq \cdots \geq \lambda_p > 0$ satisfy, for any $j \in [p]$,
$$n\lambda_j \geq 8\sigma\rho(G)\sqrt{\log(2p/j)}. \qquad (2.11)$$
Then, for any $\delta \in (0,1)$, the oracle inequality (2.9) holds with probability at least $1 - 2\delta$.

Under the same hypotheses as in Theorem 2.3, but with the special choice $n\lambda_j = 8\sigma\rho(G)\sqrt{\log(2p/j)}$ for any $j \in [p]$, we have, for any $\delta \in (0,1)$, with probability at least $1 - 2\delta$,
$$\frac{1}{n}\|\hat{x} - x^\star\|^2 \;\leq\; \inf_{\substack{s \in [p],\ x \in \mathbb{R}^n \\ \|\Delta^\top x\|_0 \leq s}} \left\{ \frac{1}{n}\|x - x^\star\|^2 + \frac{\sigma^2}{n}\,\frac{48\,\rho^2(G)\,s}{\kappa^2(s)}\,\log\!\Big(\frac{2ep}{s}\Big) + \frac{\sigma^2}{n}\Big(2 + 16\log\frac{1}{\delta}\Big) \right\}. \qquad (2.12)$$

Note that if $\lambda_1 = \cdots = \lambda_p = \lambda$, then the event (2.8) reduces to $\|(\Delta^\top)^{\dagger}\varepsilon\|_\infty \leq n\lambda/2$. The coordinates of $(\Delta^\top)^{\dagger}\varepsilon$ are centered Gaussian random variables with variance at most $\sigma^2\rho(G)^2$, so that (2.8) has probability at least $1/2$ provided that $\lambda$ is of order $(\rho(G)\sigma/n)\sqrt{\log p}$.
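To make the last claim concrete, the standard Gaussian maximal inequality gives (a sketch with generic constants, not the sharp constants of (SV-J5)):
$$\mathbb{P}\Big( \big\|(\Delta^\top)^{\dagger}\varepsilon\big\|_\infty > t \Big) \;\leq\; 2p\,\exp\!\Big( -\frac{t^2}{2\sigma^2\rho(G)^2} \Big),$$
so that choosing $t = \sigma\rho(G)\sqrt{2\log(4p)}$ makes the right-hand side at most $1/2$; the event (2.8) thus holds with probability at least $1/2$ as soon as $n\lambda/2 \geq \sigma\rho(G)\sqrt{2\log(4p)}$, i.e., as soon as $\lambda$ is at least of order $(\rho(G)\sigma/n)\sqrt{\log p}$.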

Open Question 2.2 Extending Theorem 2.3 and Corollary 2.2 to the context of inverse problems instead of denoising is an open and difficult problem. A first step would be to extend the results of (Hütter and Rigollet, 2016) to the case of general inverse problems.

Corollary 2.2 is an improvement w.r.t. the bound provided in (Hütter and Rigollet, 2016, Theorem 2) for the TV denoiser (also sometimes referred to as the Generalized Lasso) relying on the $\ell_1$ regularization defined in Eq. (2.3). Indeed, the contribution of the second term in Corollary 2.2 is reduced from $\log(ep/\delta)$ (in Hütter and Rigollet, 2016, Theorem 2) to $\log(2ep/s)$. Thus the dependence of the right-hand side of the oracle inequality on the confidence level $\delta$ is significantly reduced compared to the result of (Hütter and Rigollet, 2016, Theorem 2). A bound similar to Corollary 2.2 could be obtained for $\ell_1$ regularization by adapting the proof of (Bellec, Lecué, and Tsybakov, 2018, Theorem 4.3). However, such an improved bound would require a choice of regularization parameter relying on the $\Delta^\top$-sparsity of the signal. Graph-Slope does not rely on such a quantity, and is thus adaptive to the unknown $\Delta^\top$-sparsity of the signal.

We used FISTA on the dual problem to solve the Graph-Slope denoising problem. To illustrate the behavior of Graph-Slope, we first propose two synthetic experiments in moderate dimension. The first one is concerned with the so-called "Caveman" graph and the second one with the 1D path graph.
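The sketch below is not the FISTA-on-the-dual implementation used in (SV-J5); it is only a compact CVXPY reformulation that can serve as a reference solver, based on writing the Slope norm as $\sum_{k=1}^{p}(\lambda_k - \lambda_{k+1})$ times the sum of the $k$ largest amplitudes (with $\lambda_{p+1} = 0$); the names are ours:

```python
import numpy as np
import cvxpy as cp

def graph_slope_reference(y, Delta, lam):
    """Reference solver for the Graph-Slope problem (2.4) via a DCP reformulation."""
    n, p = Delta.shape
    x = cp.Variable(n)
    z = cp.abs(Delta.T @ x)
    lam_ext = np.append(lam, 0.0)               # lambda_{p+1} = 0
    # Slope norm = sum_k (lam_k - lam_{k+1}) * (sum of the k largest entries of z).
    penalty = sum((lam_ext[k] - lam_ext[k + 1]) * cp.sum_largest(z, k + 1)
                  for k in range(p))
    cp.Problem(cp.Minimize(cp.sum_squares(y - x) / (2 * n) + penalty)).solve()
    return x.value
```

For the moderate sizes considered in this section, such a generic formulation is enough to cross-check a dedicated first-order solver.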

For these two scenarios, we analyze the performance following the same protocol. For a given noise level $\sigma$, we use the bounds derived in Theorem 2.3 (we dropped the constant term 8) and in (Hütter and Rigollet, 2016), i.e.,
$$\lambda_{\mathrm{GL}} = \frac{\rho(G)\,\sigma\,\sqrt{2\log(p)}}{n} \quad \text{and} \quad (\lambda_{\mathrm{GS}})_j = \frac{\rho(G)\,\sigma\,\sqrt{2\log(p/j)}}{n} \quad \forall j \in [p]. \qquad (2.13)$$
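In code, and under our reading of (2.13) (the helper name is ours), these two choices read:

```python
import numpy as np

def tuning_weights(rho, sigma, n, p):
    """Graph-Lasso scalar weight and Graph-Slope weight sequence from (2.13)."""
    lam_gl = rho * sigma * np.sqrt(2 * np.log(p)) / n
    j = np.arange(1, p + 1)
    lam_gs = rho * sigma * np.sqrt(2 * np.log(p / j)) / n
    return lam_gl, lam_gs
```

Note that the last Graph-Slope weight vanishes (since $\log(p/p) = 0$), which is allowed: only one strict inequality in the non-increasing weight sequence is needed for (2.5) to define a norm.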

For every $n_0$ between $0$ and $p$, we generate 1000 signals as follows. We draw $J$ uniformly at random among all the subsets of $[p]$ of size $n_0$. Then, we let $\Pi_J$ be the projection onto $\operatorname{Ker}\Delta^\top_J$ and generate a vector $g \sim \mathcal{N}(0, \operatorname{Id}_n)$. We then construct $x^\star = c(\operatorname{Id} - \Pi_J)g$ where $c$ is a given constant (here $c = 8$). This constrains the signal $x^\star$ to be of $\Delta^\top$-sparsity at most $p - n_0$.
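A NumPy sketch of this generation procedure (the helper name is ours; the projection onto $\operatorname{Ker}\Delta^\top_J$ is computed through a pseudo-inverse, which is one possible implementation among others):

```python
import numpy as np

def random_cosparse_signal(Delta, n0, c=8.0, seed=None):
    """Draw x* = c (Id - Pi_J) g with J of size n0, following the protocol above."""
    rng = np.random.default_rng(seed)
    n, p = Delta.shape
    J = rng.choice(p, size=n0, replace=False)
    A = Delta[:, J].T                               # Delta_J^T, shape (n0, n)
    Pi_J = np.eye(n) - np.linalg.pinv(A) @ A        # orthogonal projector onto Ker(Delta_J^T)
    g = rng.standard_normal(n)
    return c * (np.eye(n) - Pi_J) @ g, J
```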

We corrupt the signals by adding a zero-mean Gaussian noise with variance $\sigma^2$, and run both the Graph-Lasso estimator and the Graph-Slope estimator. We then compute


the mean of the mean-squared error (MSE), the false detection rate (FDR) and the true detection rate (TDR). To clarify our vocabulary, given an estimator $\hat{x}$ and a ground truth $x^\star$, the MSE reads $(1/n)\|x^\star - \hat{x}\|^2$, while the FDR and TDR read, respectively,
$$\operatorname{FDR}(\hat{x}, x^\star) = \begin{cases} \dfrac{|\{j \in [p] \,:\, j \in \operatorname{supp}(\Delta^\top \hat{x}) \text{ and } j \notin \operatorname{supp}(\Delta^\top x^\star)\}|}{|\operatorname{supp}(\Delta^\top \hat{x})|} & \text{if } \Delta^\top \hat{x} \neq 0, \\ 0 & \text{if } \Delta^\top \hat{x} = 0, \end{cases} \qquad (2.14)$$
and
$$\operatorname{TDR}(\hat{x}, x^\star) = \begin{cases} \dfrac{|\{j \in [p] \,:\, j \in \operatorname{supp}(\Delta^\top \hat{x}) \text{ and } j \in \operatorname{supp}(\Delta^\top x^\star)\}|}{|\operatorname{supp}(\Delta^\top x^\star)|} & \text{if } \Delta^\top x^\star \neq 0, \\ 0 & \text{if } \Delta^\top x^\star = 0, \end{cases} \qquad (2.15)$$
where for any $z \in \mathbb{R}^p$, $\operatorname{supp}(z) = \{j \in [p] \,:\, z_j \neq 0\}$.
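Translated into NumPy (the numerical tolerance used to define the supports is an assumption on our side):

```python
import numpy as np

def fdr_tdr(Delta, x_hat, x_star, tol=1e-10):
    """False and true detection rates of (2.14)-(2.15)."""
    supp_hat = np.abs(Delta.T @ x_hat) > tol
    supp_star = np.abs(Delta.T @ x_star) > tol
    fdr = np.sum(supp_hat & ~supp_star) / supp_hat.sum() if supp_hat.any() else 0.0
    tdr = np.sum(supp_hat & supp_star) / supp_star.sum() if supp_star.any() else 0.0
    return float(fdr), float(tdr)
```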

Example on Caveman The caveman model was introduced to model the small-world phenomenon in sociology. Here we consider its relaxed version, which is a graph formed by $l$ cliques of size $k$ (hence $n = lk$), such that with probability $q \in [0,1]$ an edge of a clique is rewired towards a different clique. In our experiment, we set $l = 4$, $k = 10$ (so $n = 40$) and $q = 0.1$. We provide a visualisation of such a graph in Figure 2.1a. For this realization, we have $p = 180$. The rewired edges are indicated in blue in Figure 2.1a, whereas the edges inherited from the complete graph on 10 nodes are in black. The signals are generated as random vectors of given $\Delta^\top$-sparsity with a noise level of $\sigma = 0.2$. Figure 2.1b shows the decay of the weights.
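Such a graph, together with the incidence matrix $\Delta$ used throughout this section, can be generated for instance with NetworkX (the seed below is arbitrary):

```python
import networkx as nx

l, k, q = 4, 10, 0.1
G = nx.relaxed_caveman_graph(l, k, q, seed=0)            # n = l * k nodes
Delta = nx.incidence_matrix(G, oriented=True).toarray()  # shape (n, p): one column per edge
n, p = Delta.shape
```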

Figures 2.1c–2.1e represent the evolution of the MSE and TDR as a function of the level of $\Delta^\top$-sparsity. We observe that while the MSE is close between the Graph-Lasso and the Graph-Slope estimator at low sparsity levels, the TDR is vastly improved in the case of Graph-Slope, at a small price in terms of FDR (a bit more for the Monte Carlo choice of the weights). Hence, empirically, Graph-Slope makes more discoveries than Graph-Lasso without impacting the overall FDR/MSE, and even improves it.

Example on a path: 1D Total Variation The classical 1D Total Variation corresponds to the Graph-Lasso estimator $\hat{\beta}^{\mathrm{GL}}$ when $G$ is the path graph over $n$ vertices, hence with $p = n - 1$ edges. In our experiments, we take $n = 100$, $\sigma = 0.6$ and a very sparse gradient ($s = 4$). According to these values, and taking a random amplitude for each step, we generate a piecewise-constant signal. We display a typical realization of such a signal in Figure 2.2a. Figure 2.2b shows the decay of the weights. Note that in this case, the shape of the Monte Carlo weights differs from the one in the previous experiment. Indeed, they are adapted to the underlying graph, contrary to the theoretical weights $\lambda_{\mathrm{GS}}$ which depend only on the size of the graph. Figures 2.2c–2.2e represent the evolution of the MSE and TDR as a function of the level of $\Delta^\top$-sparsity. Here, Graph-Slope does not improve the MSE significantly. However, as for the caveman experiments, Graph-Slope makes more discoveries than Graph-Lasso.
