
On the dynamics of descent algorithms in high-dimensional planted models

HAL Id: tel-03166020

https://tel.archives-ouvertes.fr/tel-03166020

Submitted on 11 Mar 2021

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


On the dynamics of descent algorithms in high-dimensional planted models

Stefano Sarao Mannelli

To cite this version:

Stefano Sarao Mannelli. On the dynamics of descent algorithms in high-dimensional planted models. Disordered Systems and Neural Networks [cond-mat.dis-nn]. Université Paris-Saclay, 2020. English. NNT : 2020UPASP014. tel-03166020


Thèse de doctorat
NNT : 2020UPASP014

On the dynamics of descent algorithms in high-dimensional planted models

Thèse de doctorat de l’Université Paris-Saclay

École doctorale n° 564, Physique de l'Île-de-France (PIF)

Spécialité de doctorat: Physique

Unité de recherche: Université Paris-Saclay, CNRS, CEA, Institut de physique théorique, 91191, Gif-sur-Yvette, France. Référent : Faculté des sciences d'Orsay

Thèse présentée et soutenue en vidéoconférence, le 02 Octobre 2020, par

Stefano SARAO MANNELLI

Composition du jury :

Président : Silvio Franz, Professeur, Université Paris-Sud (Laboratoire de Physique Statistique et Modèles Statistiques)
Rapporteur : Gérard Ben Arous, Professeur, New York University (Courant Institute)
Rapporteur : Federico Ricci-Tersenghi, Professeur, Università Sapienza (Dipartimento di Fisica)
Examinateur : Andrew Saxe, Chargé de recherche, University of Oxford (Department of Experimental Psychology)
Directrice : Lenka Zdeborová, Chargée de recherche, CEA, Université Paris-Saclay, CNRS


Abstract

Optimization of high-dimensional non-convex models has always been a difficult and fascinating problem. Since our minds tend to apply notions that we experienced and naturally learned in low dimensions, our intuition is often led astray.

Those problems appear naturally and become more and more relevant, in particular in an era where an increasingly large amount of data is available. Most of the information that we receive is useless, and identifying what is relevant is an intricate problem.

Machine learning problems and inference problems often fall in these settings. In both cases we have a cost function that depends on a large number of parameters that should be optimized. A rather simple, but common, choice is the use of local algorithms based on the gradient, which descend the cost function trying to identify good solutions. If the cost function is convex, then under mild conditions on the descent rate, we are guaranteed to find the good solution. However, the cost is often not convex. Understanding what happens in the dynamics of these non-convex high-dimensional problems is the main goal of this project.

In this thesis we will range from Bayesian inference to machine learning in the attempt to build a theory that describes how the algorithmic dynamics evolve and when they are doomed to fail. Prototypical models of machine learning and inference problems are intimately related. Another interesting connection, known for a long time, is the link between inference problems and the disordered systems studied by statistical physicists. The techniques and the results developed in the latter form the true skeleton of this work.

In this dissertation we characterize the algorithmic limits of gradient descent and Langevin dynamics. We analyze the structure of the landscape and find the counter-intuitive result that, in general, an exponential number of spurious solutions does not prevent vanilla gradient descent initialized at random from finding the only good solution. Finally, we build a theory that explains the underlying phenomenon quantitatively and qualitatively.


Acknowledgements

I express my deepest gratitude to Pierfrancesco, who has always helped and supported me during my doctorate. He always found the time to discuss science with me, from conversations on the chief systems to the tiniest details of the computations. With his love for fundamental questions and lack of interest in simplistic answers, he has been to me a great example of scientific integrity.

I’d like to thank Lenka for giving me the opportunity to study these problems and to get to know a vibrant community of researchers on both sides of the ocean. I am very grateful for the possibility I had to travel and explore different realities. In those travels I rarely was alone, and I must thank the members of our big group for sharing those experiences with me: Alia, Andre, Antoine, Benjamin, Bruno, Cedric, Christian, Gabriele, Hugo, Jonathan, Luca, Francesca, Federica, Maria, Marylou, Ruben, Sebastian, Stéphane, Thibault. In some cases our overlap in the group has been short, but I loved the time spent with all of you.

I have to thank Giampaolo, who worked on a closely related PhD project and often had to face difficulties similar to the ones that I encountered. I really enjoyed the collaboration we started, which allowed me to spend some valuable time in Sapienza. A large part of what I understood about the physics of my problem is due to the conversations we had.

I’d like to thank Chiara, Eric, Florent and Giulio for the instructive collaborations we had.

I must thank my family, and in particular Giulia, for always being supportive in whatever idea or activity I wanted to start; even when it was the (unsuccessful) construction of a drone or the decision to regularly practice slacklining with me in the rainy Parisian autumn.

I thank all my friends from PCS who stayed in Paris for their PhD, Antonio, Bartolo, Ivan, Luca and Orazio, for making Paris feel like home from the very first day.

A special thanks also goes to Camille, Carine, Laure and Sylvie for making my interaction with the nightmarish administrative system of CEA and CNRS (almost) effortless.

My research wouldn’t have been possible without the financial help of CEA and Université Paris-Saclay who funded my project.

Finally, I must thank again Francesca and Pierfrancesco for providing me with precious feedback on the draft of this thesis.


Contents

I Introduction and Methods
1 Overview
  1.1 Sampling algorithms vs approximate algorithms
  1.2 Gradient flow and geometry
  1.3 Structure of the manuscript
  1.4 Contributions
2 Overview en français
  2.1 Algorithmes d'échantillonnage vs algorithmes approximatifs
  2.2 GF et géométrie
3 Methods
  3.1 Tensor PCA
  3.2 Approximate Message Passing
    3.2.1 Approximate Message Passing and Bethe free entropy
    3.2.2 State evolution equations
  3.3 Langevin dynamics
    3.3.1 Dynamical Mean-Field Equations
    3.3.2 General aging ansatz
  3.4 Replica method
    3.4.1 Monasson's method
  3.5 Kac-Rice

II Building the Theory
4 The spiked matrix-tensor model
5 Bayesian estimation
  5.1 Approximate message passing
    5.1.1 Analysis of the state evolution equation
  5.2 Langevin algorithm and its analysis
    5.2.1 Behavior of the Langevin algorithm
    5.2.2 Glassy nature of the Langevin-hard phase
    5.2.3 Probing dynamically the threshold
    5.2.4 BBP on TAP
6 Gradient descent dynamics
  6.1 Maximum-likelihood approximate message passing
    6.1.1 ML-AMP & stationary points of the loss
    6.1.2 State evolution
  6.2 Landscape characterization
  6.3 Gradient descent flow analysis
  6.4 A theory for gradient descent flow
    6.4.1 Probing the landscape by the Kac-Rice method
    6.4.2 Probing the gradient descent flow dynamics

III Beyond p-spins
7 Phase retrieval
  7.1 The need for a third way
  7.2 BBP on the threshold states
    7.2.1 Theory for the BBP threshold
    7.2.2 Further numerical justifications
  7.3 Characterization of threshold states
  7.4 Final remarks
8 Conclusions

Acronyms
Bibliography
A CHSCK from Martin-Siggia-Rose formalism
  A.0.1 Derivation of CHSCK Equations
B Numerical integration of CHSCK equations
  B.1 Fixed time-grid (2+p)-spin
  B.2 Dynamical time-grid (2+p)-spin
  B.3 Numerical checks on the dynamical algorithm
  B.5 Initial conditions
C Quenched complexity of spiked matrix-tensor model
  C.1 Summary of the Kac-Rice complexity
  C.2 Derivation of the quenched complexity
    C.2.1 Joint probability density
    C.2.2 Expected value of the Hessian
    C.2.3 Complexities: putting pieces together
D Numerical simulations in the spiked matrix-tensor

Introduction and Methods


Chapter 1

Overview

Machine learning (ML) has achieved astonishing success across real-world problems, such as image classification, speech recognition, and text processing, and across physical problems, from quantum physics [CT17], to astrophysics [BB10], to high-energy physics [RWR+18]. Despite these practical successes, a large number of aspects still lack theoretical understanding. Practitioners identified several prescriptions to construct working machine learning applications, but it is often unclear why those recipes are effective. Consider a typical classification task, where a dataset consisting of pictures of cats and dogs is provided to the machine with the correct labels. What follows is the minimization of a cost function. Given new images of pets, the goal of the machine is to be able to correctly classify them into cats and dogs, thus successfully generalizing from what it has seen.

The optimization process itself is puzzling. In general, the cost function is high-dimensional and non-convex. Intuition would suggest that a random initialization would lead to some local spurious, non-informative minimum, with very little hope of achieving good generalization. Instead, in practice even the use of vanilla gradient descent often leads to good generalization. Part of the computer science community analysed the problem geometrically by studying the properties of the cost function [BBV16, Kaw16, GLM16, GJZ17, DLT+18]. They consider generative models where a signal is observed through a noisy channel, and it is possible to tune its strength with respect to the strength of the noise, the signal-to-noise ratio (SNR), and thereby change the landscape. They showed that in a variety of problems, all the minima become equally good or the spurious minima disappear as the SNR becomes sufficiently large, thus making the landscape trivial.

The goal of this Ph.D. project is to contribute to the understanding of the learning dynamics using the tools developed in the context of the statistical physics of disordered systems. We will first consider inference problems using a Bayesian approach [MBC+20] and then move to optimization [SMKUZ19, MBC+19]. The relation between the two approaches becomes evident from the point of view of Bayesian statistics. Let $X$ be the guess on the hidden signal and $Y$ the observation; we can express how plausible it is to observe $Y$ given our guess with the likelihood $P[Y|X]$. Bayes' formula allows us to invert the likelihood into the posterior probability $P[X|Y] \propto P[Y|X]\, P[X]$, which also includes prior information on the guess, such as sparsity or norm constraints. We can write an approximate expression

$$P[X|Y] \overset{\beta \simeq 1}{\approx} \frac{1}{P[Y]}\, \overbrace{P[Y|X]^{\beta}}^{\text{-cost}}\, P[X] \doteq \frac{1}{Z[Y]}\, e^{-\beta \mathcal{H}(X,Y)}. \tag{1.1}$$

In the last equality we identify the terms with a Gibbs distribution with inverse temperature β. Given the posterior we can estimate the signal by considering the expected value:

$$\hat{x}_\beta = \mathbb{E}_{X|Y;\beta}[X]. \tag{1.2}$$

Observe that when the inverse temperature parameter equals 1, Eq. (1.1) is the posterior probability of the problem. As β tends to infinity, the cost dominates and the estimator Eq. (1.2) will be given by the maximum likelihood optimization problem.
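As a toy illustration of Eqs. (1.1)-(1.2), and not of the models studied later, the sketch below evaluates the estimator on a one-dimensional Gaussian channel by brute force on a grid: at $\beta = 1$ it is the posterior mean, while at large $\beta$ it concentrates on the maximum-likelihood point. All variable names and parameter values are illustrative.

```python
import numpy as np

# Toy 1D model: y = x* + Gaussian noise, Gaussian prior on x (illustrative values).
rng = np.random.default_rng(0)
x_star, delta = 0.7, 0.5
y = x_star + np.sqrt(delta) * rng.standard_normal()

xs = np.linspace(-3, 3, 2001)                 # grid over the scalar "signal"
log_lik = -(y - xs) ** 2 / (2 * delta)        # log P[Y|X]
log_prior = -xs ** 2 / 2                      # log P[X], standard Gaussian prior

def estimator(beta):
    """Gibbs measure of Eq. (1.1) at inverse temperature beta and its mean, Eq. (1.2)."""
    log_post = beta * log_lik + log_prior
    w = np.exp(log_post - log_post.max())     # unnormalized Gibbs weights
    w /= w.sum()                              # Z[Y] by numerical normalization
    return np.sum(w * xs)

print("beta = 1   :", estimator(1.0))          # Bayes posterior mean
print("beta = 100 :", estimator(100.0))        # concentrates on the best-fitting point
print("ML point   :", xs[np.argmax(log_lik)])  # maximum-likelihood value
```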

However, the exact computation of this expected value is prohibitive in high dimension; in fact, it is as complex as evaluating the partition function. In order to avoid such complications, numerous ingenious techniques have been considered in the past to obtain an approximate estimation. The two main approaches consist of approximating the posterior and sampling from the posterior*.

• The idea of adapting the approximations proposed in disordered systems to computer science problems is not recent, and early works appeared in the 80s and 90s [MPV87, EVdB01]. Ideas from physics were transferred to problems in signal processing and optimisation, providing both theoretical understanding and practical algorithms based on the cavity method and its variations [MM09, ZK16, LKZ17]. Those methods have the advantage of being at the same time algorithms and analytical tools. In many problems they were proved to be asymptotically optimal [Mio17, LML+17, BKM+19a], in the sense that information-theoretically they achieve the best performance in polynomial time.

• Algorithms that sample the posterior are Monte Carlo and the Langevin algorithm. Studies of the Langevin algorithm in disordered systems have their roots in the late 70s [MSR73a, DD78, KT87, CHS93]. Although the dynamics was understood for some recurrent neural networks in the long-time regime [SCS88, Coo00], generalizing and solving the corresponding equations is very difficult even in the simplest models of statistical inference [ABUZ18]. Consequently, the analysis of the performance of gradient-based algorithms such as the Langevin algorithm remains an open problem. Progress on this question was made in the course of this Ph.D. project [MBC+20, SMKUZ19, MBC+19, SMBC+20].

*The approaches that we consider are based on mean-field techniques that become exact in the high-dimensional limit.

In the thesis we analyze in detail two models: the spiked matrix-tensor model in Chapters 4, 5, and 6, and phase retrieval in Chapter 7. In the rest of the chapter we give a non-technical overview of the main results, considering the spiked matrix-tensor model as an example, but without entering into the details of the derivation. All the details will be provided in the related parts of the manuscript.

In a nutshell, the spiked matrix-tensor model considered in the following is an inference problem where a hidden signal $x^*$ is observed via two channels with additive noise; the first channel gives a matrix and the second an order-3 tensor. The noise is assumed to be Gaussian with zero mean and variances $\Delta_2$ and $\Delta_3$, respectively.
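For concreteness, a minimal sketch of how such synthetic observations can be generated is given below. The precise channel scalings follow one common convention and are an assumption of this sketch rather than quoted from the thesis; the sizes are illustrative.

```python
import numpy as np

def spiked_matrix_tensor(N, delta2, delta3, seed=0):
    """Planted signal on the sphere plus its two noisy observations (matrix and order-3 tensor).

    The channel scalings below (x* x*^T / sqrt(N) and sqrt(2)/N * x* outer x* outer x*) are
    one common convention and should be treated as an assumption of this sketch; noise
    symmetrization is omitted for brevity."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(N)
    x *= np.sqrt(N) / np.linalg.norm(x)                # spherical constraint |x|^2 = N
    Y = np.outer(x, x) / np.sqrt(N) + np.sqrt(delta2) * rng.standard_normal((N, N))
    T = np.sqrt(2) / N * np.einsum('i,j,k->ijk', x, x, x) \
        + np.sqrt(delta3) * rng.standard_normal((N, N, N))
    return x, Y, T

x_star, Y, T = spiked_matrix_tensor(N=50, delta2=0.7, delta3=1.0)
print(Y.shape, T.shape)
```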

1.1 Sampling algorithms vs approximate algorithms

We are going to consider an algorithmic version of the cavity method as an example of an approximation algorithm [MM09]. This algorithm was developed independently in the information theory and Bayesian inference communities under the name of belief propagation [Gal62, Pea82, Pea86]. In the case of fully connected models, belief propagation can be simplified by assuming a Gaussian structure in the beliefs, leading to the approximate message passing (AMP) algorithm [DMM09, LKZ17]. The algorithm appears as an iterative algorithm that progressively updates the estimator of the signal and its variance. AMP presents numerous remarkable features: it provably achieves optimal performance in many problems, including the spiked matrix-tensor model and phase retrieval [Mio17, LML+17, BKM+19a, MBC+20], and its average behaviour can be analytically followed by a set of equations called state evolution [JM13]. State evolution equations can be used to portray the phase diagram of a ML model under examination. In classical physics the phase diagram is a graphical representation of the thermodynamic states of a substance under different conditions (e.g. the phase of matter of H2O observed for different temperatures and pressures); likewise, in ML the phase diagram represents the different performance of an algorithm as the parameters of the model are varied. In Fig. 1.1 we show the example of the spiked matrix-tensor model [MBC+20], where the two axes represent the intensity of the two sources of noise in the model and the phases are:

• easy (green), the algorithm starting from a random initialization converges to the optimal solution with high probability;

• hard (yellow), there is a solution that is better than random, but with high probability the algorithm will not find it if initialized at random;

• impossible (red), the best solution that any algorithm can find coincides with random guessing.

Figure 1.1: Phase diagram of the spiked matrix-tensor model. As the variances of the noise in the matrix, $\Delta_2$, and of the noise in the tensor, $\Delta_3$, change, different phases appear. We can distinguish the easy (green) phase where AMP can detect the signal, the impossible (red) phase where it is information-theoretically impossible to detect the signal, and the hard (yellow) phase where the signal can in principle be detected, but this is expected to take exponential time as it requires jumping over an energy barrier. The grey lines in the easy phase represent the algorithmic transition of the Langevin algorithm for $\beta = 1$, $\beta = 1.25$, and $\beta = \infty$. For a fixed $\beta$, the Langevin algorithm starts to detect the signal above the respective grey line. The plus marks and the cross marks are the extrapolation of the Langevin threshold from the numerical study of the dynamical equations. We can observe good agreement with the analytical prediction. The purple dashed line characterizes the trivialization transition; above that line the energy landscape does not present any spurious minima.


The phase diagram can now be compared with the behaviour of the sampling algorithms.

In order to sample from the posterior probability it is necessary to design a dynamics that has the posterior probability as its stationary measure at large times. A typical sampling algorithm with this objective is the Langevin algorithm. Given a Hamiltonian $\mathcal{H}$ of a physical system, Langevin dynamics describes the evolution of the system coupled with a thermal bath at temperature $T = 1/\beta$,

$$\frac{d}{dt} x_i(t) = -\frac{\partial \mathcal{H}(x)}{\partial x_i}(t) + \eta_i(t), \tag{1.3}$$

where $\eta_i(t)$ is the Langevin noise with $\mathbb{E}[\eta_i(t)] = 0$ and $\mathbb{E}[\eta_i(t)\eta_j(t')] = \frac{2}{\beta}\,\delta_{ij}\,\delta(t-t')$. In the late 70s, techniques [DD78] for the study of Langevin dynamics were adapted to disordered systems, providing a set of PDEs for the evolution of a few relevant observables. More recently, the results of these techniques have been proved with mathematical rigour in the mixed p-spin model [BADG06, DS20]. Those methods have been generalized to the study of planted systems [CB12] and applied to the present problem in [MBC+20, MBC+19]. Two variants of the dynamical mean-field theory were used to derive the corresponding equations: the dynamical cavity method was used in [MBC+20], and the generating functional formalism in [MBC+19].
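As a minimal illustration of Eq. (1.3), and not of the dynamical mean-field analysis discussed here, the sketch below integrates the Langevin equation with an Euler-Maruyama scheme; the toy double-well Hamiltonian and all parameter values are placeholders.

```python
import numpy as np

def langevin(grad_H, x0, beta, dt, n_steps, seed=0):
    """Euler-Maruyama discretization of Eq. (1.3):
    x(t + dt) = x(t) - grad_H(x) * dt + sqrt(2 * dt / beta) * standard Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(n_steps):
        x += -grad_H(x) * dt + np.sqrt(2 * dt / beta) * rng.standard_normal(x.shape)
        traj.append(x.copy())
    return np.array(traj)

# Toy usage: a one-dimensional double-well Hamiltonian H(x) = (x^2 - 1)^2.
grad = lambda x: 4 * x * (x ** 2 - 1)
traj = langevin(grad, x0=[0.1], beta=5.0, dt=1e-2, n_steps=10_000)
print("final state:", traj[-1])
```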

The equations obtained characterize the evolution of: the alignment of the system with the hidden signal, $m(t) = \lim_{N\to\infty} \frac{1}{N}\mathbb{E}\sum_i x_i(t)\, x_i^*$; the self-alignment at different times, $C(t, t') = \lim_{N\to\infty} \frac{1}{N}\mathbb{E}\sum_i x_i(t)\, x_i(t')$; and the response to a perturbation of the spins at a previous time, $R(t, t') = \lim_{N\to\infty} \frac{1}{N}\mathbb{E}\sum_i \delta x_i(t)/\delta \eta_i(t')$. The expected value is taken over the Langevin noise and the noise intrinsic to the system; in the example of the spiked matrix-tensor model the latter is due to the noise in the observation channels. The spiked matrix-tensor model has the nice feature of having a closed form for these equations, allowing an easier evaluation of the numerical solution by propagation from the initial conditions. In [MBC+20, SMKUZ19] the limits of Langevin dynamics and gradient descent (respectively) have been evaluated numerically by extrapolation from the numerical solutions, see Fig. 1.1. In general the dynamical equations do not close, thus a self-consistent loop is necessary in order to evaluate a numerical solution, limiting the times accessible in the numerics [RBBC19].

An alternative can be derived from the work [CK93], where the authors proposed an ansatz for the large-time behaviour of the p-spin model which assumes two time scales. The authors also showed that the dynamics is attracted by states, called threshold states, characterized by a Hessian that displays marginality, i.e. its spectrum touches zero. In [MBC+19], these two ideas are used to derive the analytical threshold of the Langevin dynamics and gradient descent, by assuming that initially the dynamics tends to the threshold states and that at later times it increases the alignment with the ground truth. The growth is exponential and the rate is

$$\Lambda(\Delta_2, \Delta_3; \beta) = \frac{1}{2}\sqrt{\frac{1}{\Delta_2} + \frac{2(1-\Delta_2)}{\Delta_3}};$$

the phase transition occurs when the exponent crosses the null value. Analytical and numerical results are shown in Fig. 1.1, giving perfect agreement.
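The numerical thresholds marked in Fig. 1.1 come from this kind of extrapolation: one measures how fast the overlap grows and locates where the growth rate vanishes. A hypothetical sketch of the fitting step is shown below on synthetic data; the function names and values are illustrative.

```python
import numpy as np

def growth_rate(t, m):
    """Least-squares estimate of Lambda from overlap samples m(t) ~ m(0) * exp(Lambda * t),
    obtained by fitting log m(t) linearly in t (early-time, small-overlap regime)."""
    slope, _ = np.polyfit(t, np.log(m), deg=1)
    return slope

# Synthetic example: exponential growth with rate 0.05 plus small multiplicative noise.
rng = np.random.default_rng(1)
t = np.linspace(0, 100, 200)
m = 1e-3 * np.exp(0.05 * t) * np.exp(0.02 * rng.standard_normal(t.size))
print("estimated rate:", growth_rate(t, m))   # close to 0.05
```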

Figure 1.2: Comparison of the evolution of the overlap with the signal in Langevin dynamics and AMP (inset), for $\Delta_2 = 0.70$ and several values of $\Delta_3$.

The results suggest that sampling algorithms have a worse algorithmic threshold than AMP. This idea was foreseen in [AFUZ19], where the authors used a large-deviation analysis [Mon95] to find exponentially many atypical glassy states in the landscape. They conjectured that the presence of these atypical glassy states may block the dynamics of sampling algorithms. The same analysis was also performed in the spiked matrix-tensor model, confirming their findings [MBC+20]. Another signature of the different transitions appears in the evolution of AMP and Langevin dynamics, Fig. 1.2. For a fixed value of $\Delta_2$ (with $\Delta_2 < 1$), we can compare the dynamics for different values of $\Delta_3$. As the system gets closer to the transition, the time to find the solution increases. We can thus observe that AMP maintains the same typical time to find the solution for the different values of $\Delta_3$, whereas the typical time of the Langevin dynamics increases exponentially as $\Delta_3$ becomes smaller. This illustrates the counter-intuitive finding that making the problem simpler by decreasing the noise in the tensor actually harms the Langevin evolution.


Figure 1.3: The cartoon represents the energy landscape for an arbitrary value of $\Delta_3 > 1$ (regimes shown, from low to high SNR: impossible, gradient-descent hard, easy with spurious minima, easy trivial). The good minimum is drawn in blue. $1/\Delta_2$ plays the role of the SNR. Starting from low SNR, in the impossible region it is thermodynamically impossible to distinguish between good and bad minima. Increasing the SNR, the good minimum becomes energetically favored, but the exponential number of spurious minima stops the dynamics. At larger SNR the threshold minima become saddles pointing toward the good minimum and gradient descent starts to find the solution. Finally the SNR becomes larger than the trivialization threshold and only the good minimum survives.

1.2 Gradient flow and geometry

It was already clear in [CK93] that $\beta$ enters in a smooth way in the dynamical equations; thus, studying the limit $\beta \to \infty$, we can derive the behaviour of gradient descent dynamics. In ML, gradient descent and its several variations (e.g. stochastic gradient descent) are usually used to minimize the cost function. Currently very few problems are amenable to an analytical study of the dynamics.

In the 80s [BM81] and in the early 2000s [CLPR03, CGPM03, CLPR04] there was an effort to understand the geometrical structure of the energy landscape in disordered models. Given the number of critical points of the model, $N_c$, the studies focused on the annealed (and quenched) complexity, defined as $\log \mathbb{E}[N_c]$ (and $\mathbb{E}[\log N_c]$, respectively). The authors used an expression that enumerates the number of critical points, namely the Kac-Rice formula [AT09], computed using replica theory. Recently another approach for the evaluation of the Kac-Rice formula has been proposed that uses random matrix theory, giving fruitful results in the p-spin model (planted and unplanted) [ABAČ13, BAMMN17, RBABC19]. In [SMKUZ19] the analysis was generalized to the spiked matrix-tensor model, allowing to distinguish between regions where exponentially many minima are present and regions where only the good minima appear. The line that separates them is the trivialization transition line. When gradient descent is run above this line, provided that the time discretization is fine enough, we have a guarantee of finding the good minimum. Many papers [Kaw16, GLM16, GJZ17, DLT+18] have focused on this aspect, showing in several problems that when the SNR is large enough the bad minima disappear, or all the minima become equally good. In the spiked matrix-tensor model, geometrical trivialization and the gradient descent transition can both be pinpointed on the phase diagram. The results, Fig. 1.1, show that gradient descent starts to detect the signal before the trivialization transition has occurred. Although the algorithmic threshold of gradient descent cannot occur after the trivialization transition, it might appear counter-intuitive that the two lines do not coincide and that a distance of order $O(1)$ separates them. In [MBC+19] the puzzle was solved, Fig. 1.3. The authors showed that, moving from a low-SNR region where the algorithm fails, the algorithmic transition of gradient descent appears when the dominant minima (the threshold states or threshold minima) develop an instability, a Baik-Ben Arous-Péché (BBP) instability [BBAP05], becoming saddles with a single negative direction that points toward the signal. In this region there are still exponentially many minima that do not carry information on the signal; nevertheless the dynamics is first attracted by the saddles at the threshold, which shield the system from the bad minima and point in the right direction.
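The BBP mechanism itself can be visualized in the simplest spiked-matrix setting: a rank-one perturbation of a random symmetric matrix produces an outlier eigenvalue, with an eigenvector correlated with the spike, only when the perturbation strength exceeds a critical value. The sketch below illustrates this textbook effect; it is not the Hessian computation of [MBC+19], and the sizes are illustrative.

```python
import numpy as np

def spiked_goe_outlier(N, snr, seed=0):
    """Top eigenvalue and its overlap with the spike for a rank-one perturbation of a GOE matrix.
    Below snr = 1 the top eigenvalue sticks to the bulk edge (about 2) and the overlap is tiny;
    above it, the eigenvalue pops out to snr + 1/snr and the eigenvector aligns with the spike."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(N)
    v /= np.linalg.norm(v)                     # unit-norm spike direction
    G = rng.standard_normal((N, N)) / np.sqrt(N)
    W = (G + G.T) / np.sqrt(2)                 # GOE matrix with semicircle bulk in [-2, 2]
    eigval, eigvec = np.linalg.eigh(W + snr * np.outer(v, v))
    return eigval[-1], abs(eigvec[:, -1] @ v)

for snr in [0.5, 1.5, 3.0]:
    lam, overlap = spiked_goe_outlier(N=1000, snr=snr)
    print(f"snr={snr}: top eigenvalue={lam:.2f}, overlap with spike={overlap:.2f}")
```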

1.3 Structure of the manuscript

In the rest of the manuscript we will work out the details of the theory described in this chapter. In Chapter 3 we discuss the techniques. In Part II we apply those methods to the spiked matrix-tensor model and develop the theory for the algorithmic threshold of Langevin dynamics and gradient descent. Finally, in Part III we extend the analysis to small neural networks, namely to the phase retrieval problem.

1.4 Contributions

The introductory chapters (the present one and the next one in French) are based on

• [MZ20] Stefano Sarao Mannelli and Lenka Zdeborová. Thresholds of descending algorithms in inference problems. Journal of Statistical Mechanics: Theory and Experiment, 2020(3):034004, 2020: a review on the works [MBC+20, SMKUZ19, MBC+19].

Chapters 5 and 6 are based on

• [MBC+20] Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, Pierfrancesco Urbani, and Lenka Zdeborová. Marvels and pitfalls of the Langevin algorithm in noisy high-dimensional inference. Physical Review X, 10(1):011057, 2020: where we adapted the dynamical mean-field theory to inference problems, and we drew the phase diagram of the spiked matrix-tensor model for the Langevin algorithm and AMP.

• [SMKUZ19] Stefano Sarao Mannelli, Florent Krzakala, Pierfrancesco Urbani, and Lenka Zdeborová. Passed & spurious: Descent algorithms and local minima in spiked matrix-tensor models. In International Conference on Machine Learning, pages 4333–4342, 2019: where we studied the maximum a posteriori estimator of the spiked matrix-tensor model with a maximum-a-posteriori version of AMP and gradient descent flow. We also used the Kac-Rice formula to evaluate the trivialization transition. In the work, we identified a region of the phase diagram where gradient descent from random initialization recovers the global minimum with high probability even though an exponential number of spurious minima is present.

• [MBC+19] Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, and Lenka Zdeborová. Who is afraid of big bad minima? Analysis of gradient-flow in spiked matrix-tensor models. In Advances in Neural Information Processing Systems, pages 8676–8686, 2019: where we found a qualitative explanation for the convergence of gradient descent flow in the presence of a rough landscape, and we analytically evaluated the algorithmic threshold of gradient descent.

Chapter 7 is based on

• [SMBC+20] Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, Pierfrancesco Urbani, and Lenka Zdeborová. Complex dynamics in simple neural networks: Understanding gradient flow in phase retrieval. arXiv preprint arXiv:2006.06997, 2020: where we tested the theory developed in the previous works [MBC+20, SMKUZ19, MBC+19] in the case of neural networks, in particular for the problem of phase retrieval.

The works reported below were also obtained during this Ph.D. project, but I decided not to cover their results in order to maintain the coherence of the dissertation.

• [SMVEZ20] Stefano Sarao Mannelli, Eric Vanden-Eijnden, and Lenka Zdeborová. Optimization and generalization of shallow neural networks with quadratic activation functions. arXiv preprint arXiv:2006.15459, 2020: in this work we study the algorithmic transition of gradient-based algorithms in a teacher-student problem where teacher and student are one-hidden-layer neural networks with different numbers of hidden units. This allows the study of over-parametrization by taking a student with a larger number of hidden units with respect to the teacher. We characterized the average trivialization transition and showed numerically that gradient descent achieves this bound. We analyzed the behavior of the training in the population loss, identifying different phases of the learning dynamics and characterizing the convergence rate.

• [dBB+20] Luca dall’Asta, Antoine Baker, Indaco Biazzo, Alfredo Braunstein, Giovanni Catania, Alessandro Ingrosso, Florent Krzakala, Fabio Mazza, Marc Mézard, Anna Paola Muntoni, Maria Refinetti, Stefano Sarao Mannelli, and Lenka Zdeborová. Epidemic mitigation by statistical inference from contact tracing data. arXiv preprint arXiv:2009.09422, 2020: in this work we used a probabilistic approach to characterize the individual risk of being infected by SARS-CoV-2 given partial observations of the population. The population is given by a contact network obtained using smartphone applications. We designed two algorithms: the first one is based on belief propagation (BP), assuming an SIR epidemic model of infection. The second one is a coarse-grained version of BP that has the advantages of being computationally efficient and easily implementable using current privacy-preserving protocols. Although the algorithms are developed considering SIR epidemic models, we showed that their performance is robust to model mismatch. The algorithms outperform the standard contact-tracing strategy and reduce by at least 10% the number of app users needed to control the spread in the population.


Chapter 2

Overview en français

Le ML a remporté un succès étonnant sur des problèmes du monde réel, tels que la classification d’images, la reconnaissance vocale, le traitement de texte, ainsi que sur des problèmes physiques, de la physique quantique [CT17] à l’astrophysique [BB10], en passant par la physique des hautes énergies [RWR+18]. Malgré ces succès pratiques, un grand nombre d’aspects manquent encore de compréhension théorique. Les praticiens ont identifié plusieurs recommandations pour construire des applications d’apprentissage automatique fonctionnelles, mais on ne sait souvent pas pourquoi ces recettes sont efficaces. Considérez une tâche de classification typique, où un ensemble de données constitué d’images de chats et de chiens est fourni à la machine avec les étiquettes correctes. Ce qui suit est la minimisation d’une fonction de coût. Compte tenu de nouvelles images d’animaux de compagnie, le but de la machine est de pouvoir les classer correctement en chats et en chiens, généralisant ainsi avec succès ce qu’elle a vu.

Le processus d’optimisation lui-même est déroutant. En général, la fonction de coût est de grande dimension et non convexe. L’intuition suggérerait qu’une initialisation aléatoire conduirait à un minimum local parasite, non informatif, avec très peu d’espoir de parvenir à une bonne généralisation. Au lieu de cela, dans la pratique, même l’utilisation de la descente de gradient vanille conduit souvent à une bonne généralisation. Une partie de la communauté informatique a analysé le problème géométriquement en étudiant les propriétés de la fonction de coût [BBV16, Kaw16, GLM16, GJZ17, DLT+18]. Ils considèrent des modèles génératifs où un signal est observé à travers un canal bruyant et il est possible d’ajuster sa force par rapport à la force du bruit, le SNR, et de changer le paysage énergétique. Ils ont montré que dans une variété de problèmes, tous les minima deviennent également bons ou les minima parasites disparaissent à mesure que le SNR devient suffisamment grand, rendant ainsi le paysage trivial.


Le but de ce doctorat est de contribuer à la compréhension de la dynamique d’apprentissage en utilisant les outils développés dans le cadre de la physique statistique des systèmes désordonnés. Nous examinerons d’abord les problèmes d’inférence en utilisant une approche bayésienne [MBC+20], puis passerons à l’optimisation [SMKUZ19, MBC+19]. La relation entre les deux approches devient évidente du point de vue des statistiques bayésiennes. Soit $X$ la supposition sur le signal caché et $Y$ l’observation ; nous pouvons exprimer à quel point il est plausible d’observer $Y$ étant donné notre estimation avec la vraisemblance $P[Y|X]$. La formule de Bayes permet d’inverser la vraisemblance en probabilité postérieure $P[X|Y] \propto P[Y|X]\, P[X]$, qui inclut également des informations préalables sur la supposition, telles que la parcimonie ou les contraintes de norme. On peut écrire une expression approximative

$$P[X|Y] \overset{\beta \simeq 1}{\approx} \frac{1}{P[Y]}\, \overbrace{P[Y|X]^{\beta}}^{\text{-cost}}\, P[X] \doteq \frac{1}{Z[Y]}\, e^{-\beta \mathcal{H}(X,Y)}. \tag{2.1}$$

Dans la dernière égalité, nous identifions les termes avec une distribution de Gibbs à température inverse β. Compte tenu du postérieur, nous pouvons estimer le signal en considérant la valeur attendue:

$$\hat{x}_\beta = \mathbb{E}_{X|Y;\beta}[X]. \tag{2.2}$$

Observez que lorsque le paramètre de température inverse vaut 1, Eq. (2.1) est la probabilité postérieure du problème. Comme β tend vers l’infini, le coût domine et l’estimateur Eq. (2.2) sera donné par le problème d’optimisation du maximum de vraisemblance.

Cependant, le calcul exact de cette valeur attendue est prohibitif en grande dimension, en fait il est aussi complexe que l’évaluation de la fonction de partition. Afin d’éviter une telle complication, de nombreuses techniques ingénieuses ont été envisagées dans le passé pour obtenir une estimation approximative. Deux approches principales consistent à approximer le postérieur, et à échantillonner le postérieur*.

• L’idée d’adapter les approximations proposées dans les systèmes désordonnés à des problèmes informatiques n’est pas récente, et les premiers travaux sont apparus dans les années 80 et 90 [MPV87, EVdB01]. Les idées de la physique ont été transférées à des problèmes de traitement du signal et d’optimisation, fournissant à la fois une compréhension théorique et des algorithmes pratiques basés sur la méthode de la cavité et ses variations [MM09, ZK16, LKZ17]. Ces méthodes ont l’avantage d’être à la fois des algorithmes et des outils d’analyse. Dans de nombreux problèmes, ils se sont avérés être asymptotiquement optimaux [Mio17, LML+17, BKM+19a], dans le sens où, en théorie, ils obtiennent les meilleures performances en temps polynomial.

• Les algorithmes qui échantillonnent le postérieur sont Monte Carlo et l’algorithme de Langevin. Les études sur l’algorithme de Langevin dans les systèmes désordonnés ont leur origine à la fin des années 70 [MSR73a, DD78, KT87, CHS93]. Bien que la dynamique ait été comprise pour certains réseaux de neurones récurrents en régime de temps long [SCS88, Coo00], il est très difficile de généraliser et de résoudre les équations correspondantes même dans les modèles les plus simples d’inférence statistique [ABUZ18]. Par conséquent, l’analyse des performances des algorithmes basés sur le gradient comme l’algorithme de Langevin reste un problème ouvert. Un progrès sur cette question a été réalisé au cours de ce projet de doctorat [MBC+20, SMKUZ19, MBC+19, SMBC+20].

*Les approches que nous considérons sont basées sur des techniques de champ moyen qui deviennent exactes dans la limite de haute dimension.

Dans la thèse, nous analysons en détail deux modèles : le modèle matriciel-tensoriel à pointes dans les chapitres 4, 5 et 6, et la récupération de phase dans le chapitre 7. Dans la suite du chapitre, nous donnons un aperçu non technique des principaux résultats, en prenant comme exemple le modèle matriciel-tenseur à pointes, mais sans entrer dans les détails de la dérivation. Tous les détails seront fournis dans les parties correspondantes du manuscrit.

En un mot, le modèle matriciel-tensoriel à pointes considéré dans ce qui suit est un problème d’inférence où un signal caché $x^*$ est observé via deux canaux avec un bruit additif ; le premier canal donne une matrice et le second un tenseur d’ordre 3. Le bruit est supposé gaussien de moyenne nulle et de variances $\Delta_2$ et $\Delta_3$ respectivement.

2.1 Algorithmes d’échantillonnage vs algorithmes approximatifs

Nous allons considérer une version algorithmique de la méthode de la cavité comme exemple d’algorithme d’approximation [MM09]. Cet algorithme a été développé indépendamment dans les communautés de la théorie de l’information et de l’inférence bayésienne sous le nom de propagation de croyances [Gal62, Pea82, Pea86]. Dans le cas de modèles entièrement connectés, la propagation des croyances peut être simplifiée en supposant une structure gaussienne dans les croyances, conduisant à l’algorithme AMP [DMM09, LKZ17]. L’algorithme apparaît comme un algorithme itératif qui met à jour progressivement l’estimateur du signal et sa variance. AMP présente de nombreuses fonctionnalités remarquables : il atteint des performances optimales dans de nombreux problèmes, y compris le modèle matrice-tenseur à pointes et la récupération de phase [Mio17, LML+17, BKM+19a, MBC+20], et son comportement moyen peut être suivi analytiquement par un ensemble d’équations appelées évolution de l’état [JM13]. Les équations d’évolution d’état peuvent être utilisées pour représenter le diagramme de phase d’un modèle de ML en cours d’examen. En physique classique, le diagramme de phase est une représentation graphique des états thermodynamiques d’une substance dans différentes conditions (par exemple la phase de la matière de H2O observée pour différentes températures et pressions) ; de même, en ML, le diagramme de phase représente les différentes performances d’un algorithme lorsque l’on fait varier les paramètres du modèle. Dans la Fig. 2.1 nous montrons l’exemple du modèle matriciel-tensoriel à pointes [MBC+20] où les deux axes représentent l’intensité des deux sources de bruit dans le modèle et les phases sont :

• easy (verte), l’algorithme partant d’une initialisation aléatoire converge vers la solution optimale à forte probabilité;

• hard (jaune), il existe une solution meilleure que le hasard, mais avec une forte probabilité, l’algorithme ne la trouvera pas s’il est initialisé au hasard ;

• impossible (rouge), la meilleure solution qu’un algorithme puisse trouver coïncide avec une estimation aléatoire.

Le diagramme de phase peut maintenant être comparé au comportement des algorithmes d’échantillonnage.

Afin d’échantillonner à partir de la probabilité postérieure, il est nécessaire de concevoir une dynamique qui a la probabilité postérieure comme mesure stationnaire aux grands temps. Un algorithme d’échantillonnage typique avec cet objectif est l’algorithme de Langevin. Étant donné un hamiltonien $\mathcal{H}$ d’un système physique, la dynamique de Langevin décrit l’évolution du système couplé à un bain thermique à température $T = 1/\beta$,

$$\frac{d}{dt} x_i(t) = -\frac{\partial \mathcal{H}(x)}{\partial x_i}(t) + \eta_i(t). \tag{2.3}$$

avec $\eta_i(t)$ le bruit de Langevin, $\mathbb{E}[\eta_i(t)] = 0$ et $\mathbb{E}[\eta_i(t)\eta_j(t')] = \frac{2}{\beta}\,\delta_{ij}\,\delta(t-t')$. À la fin des années 70, les techniques [DD78] pour l’étude de la dynamique de Langevin ont été adaptées aux systèmes désordonnés, fournissant un ensemble de PDE sur

Figure 2.1: Diagramme de phase du modèle matriciel-tensoriel enrichi. Lorsque les variances du bruit dans la matrice, $\Delta_2$, et du bruit dans le tenseur, $\Delta_3$, changent, différentes phases apparaissent. On distingue la phase easy (verte) où AMP peut détecter le signal, la phase impossible (rouge) où il est impossible, au sens de la théorie de l’information, de détecter le signal, et la phase hard (jaune) où le signal peut en principe être détecté, mais cela devrait prendre un temps exponentiel car il faut sauter par-dessus une barrière d’énergie. Les lignes grises de la phase facile représentent la transition algorithmique de l’algorithme de Langevin pour $\beta = 1$, $\beta = 1.25$ et $\beta = \infty$. Pour un $\beta$ fixé, l’algorithme de Langevin commence à détecter le signal au-dessus de la ligne grise correspondante. Les points positifs et les croix sont l’extrapolation du seuil de Langevin à partir de l’étude numérique des équations dynamiques. Nous pouvons observer un bon accord avec la prédiction analytique. La ligne pointillée violette caractérise la transition de banalisation ; au-dessus de cette ligne le paysage énergétique ne présente pas de minima parasites.

l’évolution de quelques observables pertinents. Plus récemment, les résultats de ces techniques ont été prouvés avec une rigueur mathématique dans le modèle p-spin mixte [BADG06, DS20]. Ces méthodes ont été généralisées à l’étude des systèmes plantés [CB12] et appliquées au problème actuel dans [MBC+20, MBC+19]. Deux variantes de la théorie du champ moyen dynamique ont été utilisées pour dériver les équations correspondantes : la méthode de la cavité dynamique a été utilisée dans [MBC+20], et le formalisme de la fonctionnelle génératrice dans [MBC+19].

Les équations obtenues caractérisent l’évolution de : l’alignement du système avec le signal caché, $m(t) = \lim_{N\to\infty} \frac{1}{N}\mathbb{E}\sum_i x_i(t)\, x_i^*$ ; l’auto-alignement à des instants différents, $C(t, t') = \lim_{N\to\infty} \frac{1}{N}\mathbb{E}\sum_i x_i(t)\, x_i(t')$ ; et la réponse à une perturbation des spins à un instant précédent, $R(t, t') = \lim_{N\to\infty} \frac{1}{N}\mathbb{E}\sum_i \delta x_i(t)/\delta \eta_i(t')$. La valeur attendue est prise sur le bruit de Langevin et le bruit intrinsèque du système ; dans l’exemple du modèle matriciel-tenseur à pointes, ce dernier est dû au bruit dans les canaux d’observation. Le modèle matriciel-tenseur enrichi a la particularité d’avoir une forme fermée pour ces équations, permettant une évaluation plus facile de la solution numérique par propagation à partir des conditions initiales. Dans [MBC+20, SMKUZ19] les limites de Langevin et de descente de gradient (respectivement) ont été évaluées numériquement par extrapolation à partir des solutions numériques, voir Fig. 2.1. En général les équations dynamiques ne se ferment pas, donc une boucle auto-cohérente est nécessaire pour évaluer une solution numérique, ce qui limite les temps accessibles dans les simulations numériques [RBBC19].

Une alternative peut être dérivée du travail [CK93] où les auteurs ont proposé un ansatz pour le comportement aux grands temps du modèle p-spin, qui suppose deux échelles de temps. Les auteurs ont également montré que la dynamique est attirée par des états, appelés états seuils, caractérisés par un Hessien qui affiche une marginalité, c’est-à-dire que son spectre touche le zéro. Dans [MBC+19], ces deux idées sont utilisées pour dériver le seuil analytique de la dynamique de Langevin et de la descente de gradient, en supposant qu’initialement la dynamique tendra vers les états seuils et que plus tard elle augmentera l’alignement avec la vérité terrain. La croissance est exponentielle et le taux est

$$\Lambda(\Delta_2, \Delta_3; \beta) = \frac{1}{2}\sqrt{\frac{1}{\Delta_2} + \frac{2(1-\Delta_2)}{\Delta_3}} ;$$

la transition de phase se produit lorsque l’exposant croise la valeur nulle. Les résultats analytiques et numériques sont présentés sur la Fig. 2.1, donnant un accord parfait.

Figure 2.2: Comparaison de l’évolution du recouvrement avec le signal en dynamique de Langevin et AMP (inset), pour $\Delta_2 = 0.70$ et plusieurs valeurs de $\Delta_3$.

Les résultats suggèrent que les algorithmes d’échantillonnage ont un seuil algorithmique moins bon que AMP. Cette idée a été prévue dans [AFUZ19], où les auteurs ont utilisé une analyse de grandes déviations [Mon95] pour trouver un nombre exponentiel d’états vitreux atypiques dans le paysage. Ils ont conjecturé que la présence de ces états vitreux atypiques pouvait bloquer la dynamique des algorithmes d’échantillonnage. La même analyse a également été effectuée dans le modèle de matrice-tenseur enrichi, confirmant leurs résultats [MBC+20].

Une autre signature des différentes transitions apparaît dans l’évolution de la dynamique d’AMP et de Langevin, Fig. 2.2. Pour une valeur fixe de $\Delta_2$ (avec $\Delta_2 < 1$), nous pouvons comparer la dynamique pour différentes valeurs de $\Delta_3$. À mesure que le système se rapproche de la transition, le temps nécessaire pour trouver la solution augmente. On peut ainsi observer qu’AMP maintient le même temps typique pour trouver la solution pour les différentes valeurs de $\Delta_3$, alors que le temps typique de la dynamique de Langevin augmente exponentiellement à mesure que $\Delta_3$ devient plus petit. Cela illustre la constatation contre-intuitive selon laquelle simplifier le problème en diminuant le bruit dans le tenseur nuit en fait à l’évolution de Langevin.

2.2 GF et géométrie

Figure 2.3: Le dessin animé représente le paysage énergétique pour une valeur arbitraire de $\Delta_3 > 1$ (régimes représentés, du bas SNR au haut SNR : impossible, gradient-descent hard, easy avec minima parasites, easy trivial). Le bon minimum est dessiné en bleu. $1/\Delta_2$ joue le rôle du SNR. En partant de bas SNR, dans la région impossible, il est thermodynamiquement impossible de faire la distinction entre les bons et les mauvais minima. En augmentant le SNR, le bon minimum devient énergétiquement favorisé mais le nombre exponentiel de minima parasites arrête la dynamique. À plus grand SNR, les minima de seuil deviennent des selles pointant vers le bon minimum et la descente de gradient commence à trouver la solution. Enfin, le SNR devient plus grand que le seuil de banalisation et seul le bon minimum survit.

Il était déjà clair dans [CK93] que $\beta$ entre de manière lisse dans les équations dynamiques ; ainsi, en étudiant la limite $\beta \to \infty$, nous pouvons dériver le comportement de la dynamique de descente de gradient. En ML, la descente de gradient et ses diverses variations (par exemple la descente de gradient stochastique) sont généralement utilisées pour minimiser la fonction de coût. Actuellement, très peu de problèmes se prêtent à une étude analytique de la dynamique.

Dans les années 80 [BM81] et au début des années 2000 [CLPR03, CGPM03, CLPR04] il y a eu un effort pour comprendre la structure géométrique du paysage énergétique dans des modèles désordonnés. Étant donné le nombre de points critiques du modèle, $N_c$, les études se sont concentrées sur la complexité recuite (et trempée) définie comme $\log \mathbb{E}[N_c]$ (et $\mathbb{E}[\log N_c]$, respectivement). Les auteurs ont utilisé une expression qui énumère le nombre de points critiques, à savoir la formule de Kac-Rice [AT09], calculée en utilisant la théorie des répliques. Récemment, une autre approche pour l’évaluation de la formule de Kac-Rice a été proposée, qui utilise la théorie des matrices aléatoires, donnant des résultats fructueux dans le modèle p-spin (planté et non planté) [ABAČ13, BAMMN17, RBABC19]. Dans [SMKUZ19], l’analyse a été généralisée au modèle matriciel-tenseur enrichi, permettant de distinguer les régions où un nombre exponentiel de minima est présent et les régions où seuls les bons minima apparaissent. La ligne qui les sépare est la ligne de transition de trivialisation. Lorsque la descente de gradient est exécutée au-dessus de cette ligne, à condition que la discrétisation temporelle soit suffisamment fine, nous avons la garantie de trouver le bon minimum. De nombreux articles [Kaw16, GLM16, GJZ17, DLT+18] se sont concentrés sur cet aspect, montrant dans plusieurs problèmes que, lorsque le SNR est assez grand, les mauvais minima disparaissent, ou tous les minima deviennent également bons. Dans le modèle matriciel-tenseur enrichi, la banalisation géométrique et la transition de descente de gradient peuvent toutes deux être identifiées sur le diagramme de phase. Les résultats, Fig. 2.1, montrent que la descente de gradient commence à détecter le signal avant la transition de banalisation. Bien que le seuil algorithmique de descente de gradient ne puisse pas se produire après la transition de banalisation, il peut sembler contre-intuitif que les deux lignes ne coïncident pas et qu’une distance d’ordre $O(1)$ les sépare. Dans [MBC+19] le puzzle a été résolu, Fig. 2.3. Les auteurs ont montré qu’en partant d’une région de bas SNR où l’algorithme échoue, la transition algorithmique de descente de gradient apparaît lorsque les minima dominants (les états seuils ou minima seuils) développent une instabilité, une instabilité BBP [BBAP05], devenant des selles avec une seule direction négative qui pointe vers le signal. Dans cette région, il existe encore un nombre exponentiel de minima qui ne portent pas d’informations sur le signal ; néanmoins la dynamique est d’abord attirée par les selles au seuil, qui protègent le système des mauvais minima et pointent dans la bonne direction.


Chapter 3

Methods

In this chapter we develop the technical framework that we are going to use in the rest of the thesis. In particular: we derive the AMP algorithm and its state evolution [LKZ17, LML+17]; we explore Langevin dynamics using the dynamical cavity method and analyse its large-time regimes [CS92, CK94, CK95, CC05]; we consider the replica method [MPV87, Nis01, Zam10, CC05] to analyse the free energy of the model; and we study the energetic (loss) landscape of a problem with the Kac-Rice formula [CC05, Fyo04, AT09, ABAČ13, RBABC19].

We remark that the results and methods reported in this chapter were already present in the literature, possibly in a slightly different form, and are here collected and applied to a simple model. All the methods presented here were either derived or reviewed in the literature cited above.

For the sake of clarity we apply these methods to an inference problem called order-3 tensor principal component analysis (PCA), or planted 3-spin.

3.1 Tensor PCA

Tensor PCA [JL09, RM14, DM14] is a higher-order generalization of the standard principal component analysis used on matrices. In the matricial counterpart, we aim to extract a low-dimensional structure from a given matrix, and this can be done by looking at the eigenvalues and selecting the dominant ones. If we assume that the observed matrix $Y$ is the noisy observation of a rank-1 matrix, then we have the structure $Y_{ij} = x^*_i x^*_j + \xi_{ij}$. In this case $x^*$ is the vector that we want to estimate and $\xi$ is noise. Although the generalization of the spectral method to tensors is not straightforward, we can imagine generalizing the structure that we observe in matrix PCA to tensors: we assume that an observed order-3 tensor has a low-rank structure but is corrupted by Gaussian noise, i.e.

$$T_{ijk} = \frac{\sqrt{2}}{N}\, x^*_i x^*_j x^*_k + \xi_{ijk} \tag{3.1}$$

with $x^* \in \mathcal{S}^{N-1}(\sqrt{N})$ and $\xi_{ijk} \sim_{\text{i.i.d.}} \mathcal{N}(0, \Delta)$. The scaling in $N$ is chosen in order to have a problem that is neither trivially easy nor trivially hard. Without loss of generality we consider the noise to be symmetric with respect to permutations of the indices.

We can write the likelihood and define the associated generalized posterior probability

$$P(x|T) = \frac{1}{Z(T)} \left[\prod_{i=1}^N e^{-\mu x_i^2/2}\right] \prod_{i<j<k} e^{-\frac{\beta}{2\Delta}\left(T_{ijk} - \frac{\sqrt{2}}{N} x_i x_j x_k\right)^2}, \tag{3.2}$$

with $\mu$ a Lagrange multiplier that enforces the spherical constraint and $Z(T)$ ensuring the correct normalization, called the partition function in statistical physics. As explained in the introduction, the posterior probability can be interpreted as a Gibbs measure of a physical system with spherical spins subjected to the Hamiltonian

$$\mathcal{H}(x|T) = -\frac{\sqrt{2}}{N\Delta} \sum_{i<j<k} \xi_{ijk}\, x_i x_j x_k - \frac{N}{3\Delta} \left(\frac{\sum_{i=1}^N x_i x_i^*}{N}\right)^3, \tag{3.3}$$

which is equivalent to a planted version of a known disordered-systems model called the p-spin model [GM84]. In the expression above we simplified the terms in the exponent of the likelihood and removed constant contributions. We can recognize the appearance of the term $m = \sum_{i=1}^N x_i x_i^*/N$, called the magnetization, that represents the overlap between the estimator $x$ and the target $x^*$ and gives a good measure of the quality of the reconstruction. In the Hamiltonian we can therefore recognize two contributions: the first one, characterized by the disorder, scales as $1/\sqrt{\Delta}$, and the second one, which tends to align the system to the target, scales as $1/\Delta$. As $\Delta$ gets large enough, the second term is suppressed and the energetic landscape is dominated by randomness. Vice versa, if $\Delta$ is small, the second term dominates and we expect the system to align toward the target*.

*We will see that actually the system will never be completely dominated by the signal in this scaling in $N$. Even for small, but finite, values of $\Delta$ the landscape of the problem will always show the presence of minima that are generated by noise and do not carry information on the target $x^*$.
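As a concrete, intentionally naive illustration of descending this landscape, the sketch below draws a planted tensor as in Eq. (3.1) and runs gradient descent on the cost appearing in the exponent of Eq. (3.2), projecting back onto the sphere after every step. The O(N^3) loops, the step size, and the system size are illustrative choices of this sketch, not the procedures analyzed in the thesis.

```python
import numpy as np

def planted_tensor(N, delta, seed=0):
    """Order-3 spiked tensor of Eq. (3.1); only entries with i < j < k are used below."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(N)
    x *= np.sqrt(N) / np.linalg.norm(x)                     # spherical constraint |x|^2 = N
    T = np.sqrt(2) / N * np.einsum('i,j,k->ijk', x, x, x) \
        + np.sqrt(delta) * rng.standard_normal((N, N, N))
    return x, T

def grad_neg_log_likelihood(x, T, delta):
    """Gradient of the cost in the exponent of Eq. (3.2) (beta = 1), naive O(N^3) loops."""
    N, c = len(x), np.sqrt(2) / len(x)
    grad = np.zeros(N)
    for i in range(N):
        for j in range(i + 1, N):
            for k in range(j + 1, N):
                r = T[i, j, k] - c * x[i] * x[j] * x[k]
                grad[i] -= r * c * x[j] * x[k] / delta
                grad[j] -= r * c * x[i] * x[k] / delta
                grad[k] -= r * c * x[i] * x[j] / delta
    return grad

N, delta, lr = 30, 0.08, 0.05                               # illustrative values only
x_star, T = planted_tensor(N, delta)
rng = np.random.default_rng(1)
x = rng.standard_normal(N)
x *= np.sqrt(N) / np.linalg.norm(x)                         # random initialization on the sphere
for step in range(150):
    x -= lr * grad_neg_log_likelihood(x, T, delta)
    x *= np.sqrt(N) / np.linalg.norm(x)                     # project back onto the sphere
print("final overlap m =", x @ x_star / N)
```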


3.2 Approximate Message Passing

While matrix PCA can be easily solved with spectral analysis, in tensor PCA we need more complex algorithms. The first that we analyze is AMP, a powerful iterative algorithm to compute the local magnetizations $\langle x_i \rangle$ given the observed tensor. It is rooted in the cavity method of the statistical physics of disordered systems [TAP77, MPV87] and it has been recently developed in the context of statistical inference [DMM09]. Among the properties that make AMP extremely useful is the fact that its performance can be analyzed in the thermodynamic limit. Indeed, in such a limit its dynamical evolution is described by the so-called state evolution equations [DMM09], which can be used to portray the phase diagram of the problem. In the following we derive the AMP equations for tensor PCA.

3.2.1 Approximate Message Passing and Bethe free entropy

AMP can be obtained as a relaxed Gaussian closure of the BP algorithm. The derivation that we present follows the lines of [LML+17, LKZ17]. The posterior probability can be represented as a factor graph where all the variables are represented by circles and are linked to squares representing the interactions [MM09].

Figure 3.1: The factor graph representation of the posterior measure of the model, Eq. (3.2). The variable nodes, represented with white circles, are the components of the signal, while the black squares are factor nodes that denote interactions between the variable nodes as they appear in the interaction terms of the distribution. There are two types of factor nodes: $P_X$ is the prior that depends on a single variable, and $P_T$ is the probability of observing a tensor element $T_{ijk}$. The posterior, apart from the normalization factor, is simply given by the product of all the factor nodes.

This representation is very convenient to write down the BP equations. In the BP algorithm we iteratively update until convergence a set of variables, which are beliefs on the (cavity) magnetization of the nodes. The reasoning behind how BP works is the following: given the current state of the variable nodes, take a factor node and exclude one node among its neighbors. The remaining neighbors, through the factor node, express a belief on the state of the excluded node. This belief is mathematically described by a probability distribution called a message, $\tilde{t}_{ijk\to i}(x_i)$. At the same time, another belief on the state of the excluded node is given by the rest of the network, excluding the factor node previously taken into account, $t_{i\to ijk}(x_i)$. These messages travel in the factor graph of Fig. 3.1 carrying partial information on the real magnetizations of the single nodes, and they are iterated until convergence. The iterative scheme is described by the following equations

$$\tilde{t}^{\,t}_{ijk\to i}(x_i) \propto \int \mathrm{d}x_j\, t^{t}_{j\to ijk}(x_j)\, \mathrm{d}x_k\, t^{t}_{k\to ijk}(x_k)\, P_T\!\left(T_{ijk}\,\Big|\, \frac{\sqrt{2}}{N}\, x_i x_j x_k\right),$$
$$t^{t+1}_{i\to ijk}(x_i) \propto P_X(x_i)\prod_{(j',k')\neq(j,k),\; j'<k'} \tilde{t}^{\,t}_{ij'k'\to i}(x_i), \qquad (3.4)$$

where the normalization constants that guarantee that the messages are probability distributions have to be determined. When the messages converge to a fixed point, the estimation of the local magnetizations can be obtained through the computation of the marginal probability distribution of the variables, given by

$$\rho_i(x_i) = P_X(x_i)\prod_{j<k}\tilde{t}^{\,t}_{ijk\to i}(x_i). \qquad (3.5)$$

We note that the computational cost to produce an iteration of BP scales as $O(N^3)$. Furthermore, Eqs. (3.4) are iterative equations for continuous functions and are therefore extremely hard to solve when dealing with continuous variables. The advantage of AMP is to reduce drastically the computational complexity of the algorithm by closing the equations on a Gaussian ansatz for the messages. This is justified in tensor PCA since the factor graph is fully connected and therefore each iteration of the algorithm involves sums of a large number of independent random variables that give rise to Gaussian distributions. Gaussian random variables are characterized by their means and covariances, which are readily obtained for $N\gg 1$ by expanding the factor nodes for small $\omega_{ijk} = \sqrt{2}\, x_i x_j x_k / N$.

Once the BP equations are relaxed on Gaussian messages, the final step to obtain the AMP algorithm is the so-called TAPyfication procedure [LKZ17, TAP77], which exploits the fact that removing one node or one factor produces only a weak perturbation to the real marginals, and that this perturbation can be described in terms of the real marginals of the variable nodes themselves. By applying this scheme we obtain the AMP equations, which are described by a set of auxiliary variables $A$ and $B_i$ and by the mean $\hat{x}_i = \langle x_i\rangle$ and variance $\sigma_i$ of the marginals of the variable nodes. The AMP iterative equations are

$$B_i^t = \frac{\sqrt{2}}{\Delta N}\sum_{j<k} T_{ijk}\,\hat{x}_j^t\hat{x}_k^t - \frac{2}{\Delta}\left(\frac{1}{N}\sum_k \sigma_k^t\right)\left(\frac{1}{N}\sum_k \hat{x}_k^t\hat{x}_k^{t-1}\right)\hat{x}_i^{t-1}; \qquad (3.6)$$
$$A^t = \frac{1}{\Delta}\left[\frac{1}{N}\sum_k \left(\hat{x}_k^t\right)^2\right]^2; \qquad (3.7)$$
$$\hat{x}_i^{t+1} = f(A, B_i); \qquad (3.8)$$
$$\sigma_i^{t+1} = \partial_B f(A, B_i), \qquad (3.9)$$
$$f(A, B) \equiv \int \mathrm{d}x\, \frac{1}{\mathcal{Z}(A, B)}\, x\, P_X(x)\, e^{Bx - \frac{1}{2}Ax^2} = \frac{B}{1+A}. \qquad (3.10)$$
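For concreteness, the following is a minimal sketch of how the iteration of Eqs. (3.6)-(3.10) can be implemented, assuming the tensor $T$ is stored as a symmetric numpy array with zero diagonal, the prior is standard Gaussian so that $f(A,B) = B/(1+A)$ as in Eq. (3.10), and the function and variable names are purely illustrative.

import numpy as np

def amp_tensor_pca(T, Delta, x_init, n_iter=100):
    # T: symmetric N x N x N array (zero diagonal assumed), Delta: noise variance.
    N = x_init.shape[0]
    x_new, x_old = x_init.copy(), np.zeros(N)
    sigma = np.ones(N)                                 # variances of the marginals
    for _ in range(n_iter):
        # sum_{j<k} T_ijk x_j x_k, written as half of the full contraction
        Txx = 0.5 * np.einsum('ijk,j,k->i', T, x_new, x_new)
        # Onsager correction of Eq. (3.6), involving the previous estimate
        onsager = (2.0 / Delta) * sigma.mean() * np.mean(x_new * x_old) * x_old
        B = np.sqrt(2.0) / (Delta * N) * Txx - onsager
        A = np.mean(x_new ** 2) ** 2 / Delta           # Eq. (3.7)
        x_old = x_new
        x_new = B / (1.0 + A)                          # Eq. (3.8), Gaussian prior
        sigma = np.full(N, 1.0 / (1.0 + A))            # Eq. (3.9)
    return x_new, sigma

The tensor contraction still costs $O(N^3)$ per sweep, but the functional messages of BP are replaced by scalar means and variances.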

It can be shown that these equations can be obtained as saddle point equations from the so-called Bethe free entropy, defined as $\Phi_{\text{Bethe}} = \log Z_{\text{Bethe}}(T)/N$, where $Z_{\text{Bethe}}$ is the Bethe approximation to the partition function, which is defined as the normalization of the posterior measure. The expression of the Bethe free entropy per variable can be computed in a standard way (see [MM09]) and it is given by

$$\Phi_{\text{Bethe}} = \frac{1}{N}\left[\sum_i \log Z_i + \sum_{i\le j\le k}\log Z_{ijk} - \sum_{i,(ijk)}\log Z_{i(ijk)}\right], \qquad (3.11)$$
where
$$Z_i = \int \mathrm{d}x_i\, P_X(x_i)\prod_{(j,k)}\tilde{t}_{ijk\to i}(x_i),$$
$$Z_{ijk} = \int \mathrm{d}x_i\, t_{i\to ijk}(x_i)\,\mathrm{d}x_j\, t_{j\to ijk}(x_j)\,\mathrm{d}x_k\, t_{k\to ijk}(x_k)\, e^{-\frac{1}{2\Delta}\left(T_{ijk} - \frac{\sqrt{2}}{N}x_i x_j x_k\right)^2},$$
$$Z_{i(ijk)} = \int \mathrm{d}x_i\, t_{i\to ijk}(x_i)\,\tilde{t}_{ijk\to i}(x_i),$$

are a set of normalization factors. Using the Gaussian approximation for the messages and employing the same TAPyfication procedure used to get the AMP equations, we obtain the Bethe free entropy density as

$$\Phi_{\text{Bethe}} = \frac{1}{N}\sum_i \log\mathcal{Z}(A, B_i) + \frac{2}{3}\,\frac{1}{N}\sum_i\left[A\,\frac{\hat{x}_i^2 + \sigma_i}{2} - B_i\hat{x}_i\right] + \frac{2}{6\Delta}\left(\frac{\sum_i \hat{x}_i^2}{N}\right)^2\frac{\sum_i \sigma_i}{N}, \qquad (3.12)$$
where we used the variables defined in Eqs. (3.6-3.7) for the sake of compactness and $\mathcal{Z}(A, B)$ is defined as
$$\mathcal{Z}(A, B) = \int \mathrm{d}x\, P_X(x)\, e^{Bx - \frac{Ax^2}{2}} = \frac{1}{\sqrt{A+1}}\, e^{\frac{B^2}{2(A+1)}}. \qquad (3.13)$$
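As a sanity check, Eq. (3.12) is straightforward to evaluate numerically at an AMP fixed point. The sketch below assumes the Gaussian-prior expression (3.13) for $\mathcal{Z}(A,B)$ and takes as inputs the quantities produced by the AMP iteration (the per-variable estimates, variances and fields, and the scalar $A$); the function name is illustrative.

import numpy as np

def bethe_free_entropy(x_hat, sigma, A, B, Delta):
    # log Z(A, B_i) from Eq. (3.13), valid for a standard Gaussian prior
    log_Z = -0.5 * np.log(A + 1.0) + B ** 2 / (2.0 * (A + 1.0))
    term1 = np.mean(log_Z)
    term2 = (2.0 / 3.0) * np.mean(A * (x_hat ** 2 + sigma) / 2.0 - B * x_hat)
    term3 = (2.0 / (6.0 * Delta)) * np.mean(x_hat ** 2) ** 2 * np.mean(sigma)
    return term1 + term2 + term3

Comparing the value of $\Phi_{\text{Bethe}}$ reached from different initializations is one way to discriminate between multiple AMP fixed points.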



Figure 3.2: Fixed points of the state evolution equation of order-3 tensor PCA. On the vertical axis we show the overlap between the estimator and the target; on the horizontal axis the inverse of the variance, which plays the role of signal-to-noise ratio. We observe the presence of at most three fixed points: the orange line that does not carry information, the blue line that is informative, and the blue dashed line that has information on the target but is unstable. The vertical dashed line at $1/\Delta = 4$ indicates the transition that separates the impossible phase, where no informative solution exists, from the region where an informative fixed point appears.

3.2.2 State evolution equations

The state evolution equations are a set of iterative equations that follow the average dynamical evolution of the algorithm, Eqs. (3.6-3.9), via two order parameters: $m^t = \sum_i \hat{x}_i^t x_i^*/N$, the overlap with the ground truth, and $q^t = \sum_i \hat{x}_i^t \hat{x}_i^t/N$, the self-overlap of the estimator. In the algorithm, the estimator Eq. (3.8) and its variance Eq. (3.9) depend on $A$ and $\{B_i\}_i$; the idea is to characterize the distribution of $A$ and $\{B_i\}_i$ in the large $N$ limit. By the central limit theorem we expect them to behave as Gaussian random variables, and therefore it suffices to study their first and second moments to characterize their distributions.

$$\mathbb{E}[B_i^t] = \frac{\sqrt{2}}{\Delta N}\sum_{j<k}\int \mathrm{d}P(T_{ijk})\, T_{ijk}\,\hat{x}_j^t\hat{x}_k^t + o(1/N) = \frac{1}{\Delta}(m^t)^2 x_i^*, \qquad (3.14)$$
$$\mathrm{Var}[B_i^t] = \frac{2}{\Delta^2 N^2}\sum_{j<k}\sum_{j'<k'}\int \mathrm{d}P(\xi_{ijk})\,\mathrm{d}P(\xi_{ij'k'})\,\xi_{ijk}\xi_{ij'k'}\,\hat{x}_j^t\hat{x}_k^t\hat{x}_{j'}^t\hat{x}_{k'}^t + o(1/N^2) = \frac{1}{\Delta}(q^t)^2, \qquad (3.15)$$
and analogously we find that $\mathbb{E}[A] = (q^t)^2/\Delta$ while its variance is subleading in $N$. Substituting these results into the algorithm we obtain

$$\hat{x}_i^{t+1} = f\!\left(\frac{(q^t)^2}{\Delta},\; \frac{(m^t)^2}{\Delta}x_i^* + \frac{q^t}{\sqrt{\Delta}}\,W\right), \qquad (3.16)$$
$$\sigma^{t+1} = \partial_B f\!\left(\frac{(q^t)^2}{\Delta},\; \frac{(m^t)^2}{\Delta}x_i^* + \frac{q^t}{\sqrt{\Delta}}\,W\right) = \frac{1}{1 + (q^t)^2/\Delta}, \qquad (3.17)$$

with $W$ a standard Gaussian variable. Finally, substituting the expressions just found into the definitions of the order parameters, we obtain

$$m^{t+1} = \mathbb{E}_{\bm{x}^*,W}\!\left[\sum_i f\!\left(\frac{(q^t)^2}{\Delta},\; \frac{(m^t)^2}{\Delta}x_i^* + \frac{q^t}{\sqrt{\Delta}}\,W\right)x_i^*/N\right] = \frac{(m^t)^2/\Delta}{1 + (q^t)^2/\Delta}, \qquad (3.18)$$
$$q^{t+1} = \mathbb{E}_{\bm{x}^*,W}\!\left[\sum_i f\!\left(\frac{(q^t)^2}{\Delta},\; \frac{(m^t)^2}{\Delta}x_i^* + \frac{q^t}{\sqrt{\Delta}}\,W\right)^{\!2}/N\right] = \frac{(m^t)^4/\Delta^2 + (q^t)^2/\Delta}{\left[1 + (q^t)^2/\Delta\right]^2}. \qquad (3.19)$$
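The recursion (3.18)-(3.19) is a two-dimensional map that can be iterated directly. A minimal sketch, with an initial overlap of order $1/\sqrt{N}$ mimicking a random initialization (the function name and default values are illustrative):

def state_evolution(Delta, m0=1e-3, q0=1e-3, n_iter=1000):
    # Iterate the state evolution map of Eqs. (3.18)-(3.19)
    m, q = m0, q0
    for _ in range(n_iter):
        denom = 1.0 + q ** 2 / Delta
        m_new = (m ** 2 / Delta) / denom                             # Eq. (3.18)
        q_new = (m ** 4 / Delta ** 2 + q ** 2 / Delta) / denom ** 2  # Eq. (3.19)
        m, q = m_new, q_new
    return m, q

Running it with a small initial overlap, the iteration converges to $m = q = 0$ for any $\Delta$, while starting from a sufficiently large overlap with $\Delta < 1/4$ it converges to the informative fixed point; this is the content of the discussion below.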

If all the information about the generating model is known, we are in the so-called Bayes-optimal setting. In this situation the Nishimori conditions hold, i.e. given $k$ random variables sampled from the posterior distribution, $\bm{x}^{(1)}, \bm{x}^{(2)}, \ldots, \bm{x}^{(k)}$, and a random variable sampled from the prior, $\bm{x}^*$, then for any function $f$

$$\mathbb{E}\!\left[f\!\left(\bm{X}^{(1)}, \bm{X}^{(2)}, \ldots, \bm{X}^{(k)}\right)\right] = \mathbb{E}_{P(\bm{X}|\bm{T})P(\bm{T})}\!\left[\mathbb{E}_{P(\bm{X}^{(1)},\ldots,\bm{X}^{(k-1)}|\bm{T},\bm{X})}\!\left[f\!\left(\bm{X}, \bm{X}^{(1)}, \ldots, \bm{X}^{(k-1)}\right)\right]\right] = \mathbb{E}_{P(\bm{X}|\bm{T})P(\bm{T})}\!\left[\mathbb{E}_{P(\bm{X}^{(1)},\ldots,\bm{X}^{(k-1)}|\bm{T})}\!\left[f\!\left(\bm{X}, \bm{X}^{(1)}, \ldots, \bm{X}^{(k-1)}\right)\right]\right] = \mathbb{E}\!\left[f\!\left(\bm{X}^*, \bm{X}^{(1)}, \ldots, \bm{X}^{(k-1)}\right)\right], \qquad (3.20)$$
where we used that $(\bm{X}^{(1)}, \ldots, \bm{X}^{(k-1)})$ and $\bm{X}$ are independent when conditioned on $\bm{T}$, and that we are in the Bayes-optimal setting (i.e. $P(\bm{X}) = P(\bm{X}^*)$).

As a corollary, if we consider $f(\bm{x}^{(1)}, \bm{x}^{(2)}) = \bm{x}^{(1)}\cdot\bm{x}^{(2)}/N$ we obtain $m = q$, namely that the average overlap between the target $\bm{x}^*$ and a random configuration sampled from the posterior is equal to the overlap of two configurations sampled from the posterior.

Observe that using the Nishimori condition Eq. (3.20) we have $m^t = q^t$ at every step, and the equations collapse into a single equation
$$m^{t+1} = \frac{(m^t)^2}{\Delta + (m^t)^2}. \qquad (3.21)$$

From this equation we can notice that $m = 0$ is always a solution for arbitrary $\Delta > 0$. Another fixed point exists if $\Delta < 1/4$, but from the algorithmic point of view the evolution starts very close to zero, $m^{t=0} = O(1/\sqrt{N})$. Therefore in the large $N$ limit the dynamics will always get stuck at the uninformative solution $m = 0$. The phase diagram of the model can be found in Fig. 3.2, where we can see the fixed points of Eq. (3.21) for different values of $\Delta$. We can notice that although at $\Delta = 1/4$ an informative solution appears (in blue), the non-informative branch (in orange) never disappears.
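For completeness, the threshold $\Delta = 1/4$ quoted above (the dashed line at $1/\Delta = 4$ in Fig. 3.2) follows from solving the fixed-point condition of Eq. (3.21) explicitly:
$$m = \frac{m^2}{\Delta + m^2} \quad\Longleftrightarrow\quad m = 0 \;\;\text{or}\;\; m^2 - m + \Delta = 0 \quad\Longleftrightarrow\quad m = 0 \;\;\text{or}\;\; m_{\pm} = \frac{1 \pm \sqrt{1 - 4\Delta}}{2},$$
so the informative branch $m_+$ and the unstable branch $m_-$ are real only for $\Delta \le 1/4$.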

We will see later in the chapter that the fixed point equations of AMP are equivalent to the solution of the replica symmetric (RS) saddle point equations. In fact, it can be noticed that the solution of Eq. (3.21) is equivalent to the solution of Eq. (3.59).

