Multilinear compressive sensing and an application to convolutional linear networks


HAL Id: hal-01494267

https://hal.archives-ouvertes.fr/hal-01494267v4

Submitted on 3 May 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.



François Malgouyres, Joseph Landsberg

To cite this version:

François Malgouyres, Joseph Landsberg. Multilinear compressive sensing and an application to convolutional linear networks. SIAM Journal on Mathematics of Data Science, Society for Industrial and Applied Mathematics, 2019, 1 (3), pp. 446-475. hal-01494267v4


Abstract. We study a deep linear network endowed with the following structure: a matrix X is obtained by multiplying K matrices (called factors and corresponding to the action of the layers). The action of each layer (i.e., factor) is obtained by applying a fixed linear operator to a vector of parameters satisfying a constraint. The number of layers is not limited. Assuming that X is given and factors have been estimated, the error between the product of the estimated factors and X (i.e., the reconstruction error) is either the statistical or the empirical risk.

We provide necessary and sufficient conditions on the network topology under which a stability property holds. The stability property requires that the error on the parameters defining the near-optimal factors scales linearly with the reconstruction error (i.e., the risk). Therefore, under these conditions on the network topology, any successful learning task leads to stably defined features that can be interpreted.

In order to do so, we first evaluate how the Segre embedding and its inverse distort distances. Then, we show that any deep structured linear network can be cast as a generic multilinear problem that uses the Segre embedding. This is the tensorial lifting. Using the tensorial lifting, we provide necessary and sufficient conditions for the identifiability of the factors up to a scale rearrangement. We finally provide a necessary and sufficient condition, called the deep-Null Space Property (because of the analogy with the usual Null Space Property in the compressed sensing framework), which guarantees that the stability property holds.

We illustrate the theory with a practical example where the deep structured linear network is a convolutional linear network. We obtain a condition on the scattering of the supports which is strong but not empty. A simple test on the network topology can be implemented to test if the condition holds.

Key words. Interpretable learning, stable recovery, matrix factorization, deep linear networks, convolutional networks.

AMS subject classifications. 68T05; 90C99; 15-02

1. Introduction.

1.1. The aim of the paper. Deep learning has led to many practical breakthroughs and to significant improvements and state-of-the-art performance in many fields such as computer vision, natural language processing, signal processing, robotics, etc. The range of applications grows at a strong pace. Despite these empirical successes, the theory supporting deep learning is still far from satisfactory. For instance, sharp and accurate answers to the most natural questions are still missing: on the efficiency of optimization algorithms when applied to the objective function minimized in deep learning ([55, 37, 22, 23, 68, 10, 11, 44, 79]); on the expressiveness of the networks ([12, 3, 31, 26, 45, 27, 63, 76]); and on guarantees on the statistical risk ([42, 34, 80, 69]) for learned neural networks. This makes it difficult to optimize and configure neural networks. Moreover, the absence of answers to these questions prevents certification that systems built with deep learning algorithms are robust.

The reasons explaining the outcome of a neural network are often difficult to highlight ([8, 62, 41]). Even worse, despite the settings described in [6, 4, 14, 53, 71, 83], the instability of the parameters optimizing the deep learning objective does not allow the interpretation of the features defined by these parameters. This last problem is the one we investigate in this work.

Submitted to the editors 7/3/2018.

Funding: Joseph Landsberg is supported by NSF DMS-1405348 and AF1814254.

Institut de Mathématiques de Toulouse; UMR5219 Université de Toulouse; CNRS UPS IMT; F-31062 Toulouse Cedex 9, France; ([email protected])

Department of Mathematics; Mailstop 3368; Texas A&M University; College Station, TX

Our goal in this paper is to evaluate how far the architectures used in applications are from architectures for which we can guarantee that the parameters returned by the algorithm, and therefore the features defined using these parameters, are stably defined. To do so, we consider two families of networks and establish necessary and sufficient conditions on their topology guaranteeing that the features learned by the algorithm are stably defined.

More precisely, we establish statements of the following form for two families of deep networks. Below, the action of the network parameterized by $h$ is denoted $f_h$.

Informal theorem 1.1. Stability guarantee.

We assume a known parameterized family of functions $f_h$ and a metric $d$ between parameter pairs. We establish a necessary and sufficient condition on the family $f_h$ guaranteeing that:

There exists a constant $C > 0$ such that for any input/output pairs $I$, $X$ and any pair of parameters $h^*$, $h$ for which

$\delta = \|X - f_{h^*}(I)\|$ and $\eta = \|X - f_h(I)\|$

are sufficiently small, we have

(1.1) $d(h, h^*) \leq C(\delta + \eta)$.

Considering a regression problem, the values $\delta$ and $\eta$ can be interpreted as the statistical or the empirical risk for the parameters $h$ and $h^*$. The inequality (1.1) therefore guarantees that the set made of the parameters leading to a small risk has a small diameter. The features defined using such parameters are therefore stably defined. This seems to be the minimal condition allowing the interpretation of the features. The condition on the family of functions $f_h$ is typically a condition on the topology of the network.

In the Informal Theorem 1.1, $h$ and $h^*$ might have different roles. For instance, if we know that the input/output pairs have been generated using a particular $h$, possibly up to some error as modeled by $\delta$, then (1.1) guarantees that $h^*$ is close to $h$ and provides a way to control the statistical risk.

The existing stability guarantees [6, 4, 14, 53, 71, 83, 60] consider this setting and describe both a network topology and an algorithm whose output $h^*$ is guaranteed to be close to $h$. In this study, we do not make any assumption on the construction of $h$ and $h^*$ and our objective is more modest. With regard to their objective, giving a necessary and sufficient condition of stability plays the same role as a complexity theory statement saying that a particular configuration is NP-hard. It rules out some network topologies.

Notice that, when $I$ and $X$ are such that it is possible to have $\delta = \eta = 0$, the above stability guarantee implies that the minimizer of the network objective function is unique. Theorem 6.5 and Theorem 6.4 will show that this uniqueness condition is strongly related to the level of over-specification of the network. The simplified and intuitive statement is that optimal solutions of overspecified networks are not unique and are unstable. This explains the instability observed in applications. The theorems analyze this property in detail. This might be viewed as a negative result since overspecification is currently the main hypothesis of statements guaranteeing the success of the neural network optimization.

1.2. The considered deep networks.

1.2.1. Overview. We consider two kinds of deep networks: a general family of deep structured linear networks$^2$ in Sections 6 and 7; and a family of convolutional linear networks in Section 8. The formal statements for the deep structured linear networks are in Theorems 7.2 and 7.3. The statements for convolutional linear networks are in Theorem 8.4. Below, we describe the deep structured linear networks.

1.2.2. Deep structured linear networks. The term deep linear network usually corresponds to fully-connected feed-forward networks, without bias, in which the activation function is the identity. In the general results described in this paper we consider deep linear networks and provide two means to enforce some structure on the network. As we describe below, the structures can be used to include: feed-forward linear networks; convolutional linear networks (as is done in Section 8); the action of a ReLU activation function; sparse networks; non-negative networks; and combinations of the above. The family also includes most matrix factorization problems.

We model a deep structured linear network as a product of matrices called factors. The factors depend linearly on parameters in $\mathbb{R}^S$, for $S \in \mathbb{N}$.

More precisely, consider an arbitrary depth parameter $K \geq 1$. The number of layers is $K+1$ and the layers are enumerated in such a way that the layer receiving the input is $K+1$ and the layer returning the output is 1. We consider sizes $m_1, \ldots, m_{K+1} \in \mathbb{N}$ and write $m_1 = m$, $m_{K+1} = n$. We consider, for $k = 1, \ldots, K$, the linear map

(1.2) $M_k : \mathbb{R}^S \longrightarrow \mathbb{R}^{m_k \times m_{k+1}}, \quad h \longmapsto M_k(h)$.

Given some parameters $h_1, \ldots, h_K \in \mathbb{R}^S$, the action of the deep structured linear network is the product

$M_1(h_1) \cdots M_K(h_K)$.

The factor $M_K(h_K)$ might involve the inputs of the samples by considering $M_K(h_K) = M'_K(h_K) I$ for a linear map $M'_K$ and a matrix $I$ whose columns contain the inputs. Given outputs $X \in \mathbb{R}^{m \times n}$, the optimization of the parameters $h_1, \ldots, h_K$ defining the network aims at getting

$M_1(h_1) \cdots M_K(h_K) \simeq X$.
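As an illustration of this definition, the following minimal NumPy sketch builds a small deep structured linear network and checks that rescaling the layers does not change its action. The sizes, the random `basis` tensors and the function names `factor` and `network` are hypothetical choices made for this example; the only property taken from the text is that each factor depends linearly on its parameter vector.

```python
import numpy as np

rng = np.random.default_rng(0)
S, K = 4, 3                # parameters per layer, number of factors
sizes = [5, 6, 7, 8]       # m_1, ..., m_{K+1} (hypothetical sizes)

# Hypothetical fixed linear operators: M_k(h) = sum_s h[s] * basis[k][s].
basis = [rng.standard_normal((S, sizes[k], sizes[k + 1])) for k in range(K)]

def factor(k, h):
    """Linear map M_k from R^S to matrices of size m_k x m_{k+1}."""
    return np.tensordot(h, basis[k], axes=1)

def network(hs):
    """Action of the network: the product M_1(h_1) ... M_K(h_K)."""
    out = factor(0, hs[0])
    for k in range(1, len(hs)):
        out = out @ factor(k, hs[k])
    return out

hs = [rng.standard_normal(S) for _ in range(K)]
X = network(hs)

# Scale ambiguity: layer-wise scalars whose product is 1 leave X unchanged.
lam = [2.0, 0.25, 2.0]
rescaled = [l * h for l, h in zip(lam, hs)]
X_rescaled = network(rescaled)
```

Because each factor is linear in its parameters, the two parameter sets produce the same matrix, which is why identifiability can only be expected up to such rescalings.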

To model feed-forward linear networks, the mappings $M_k$, $k = 1, \ldots, K-1$ (and $M'_K$) construct the matrix by placing the entry of $h_k$ corresponding to an edge in the network in the corresponding entry in $M_k(h_k)$.

For convolutional layers, $M_k$ and $M'_K$ concatenate convolution matrices$^3$ defined by a portion of the entries in $h_k$. Each convolution matrix is at the location corresponding to a prescribed edge.

$^2$ We call this family deep structured linear networks because the family is endowed with tools to impose structures. We analyze the impact of the structure on the stability property. However, these tools might be used to define the usual deep linear network.

$^3$ Depending on the situation: Toeplitz, block-Toeplitz, circulant or block-circulant matrices. The matrices often involve downsampling.
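To make the convolutional case concrete, here is a minimal NumPy sketch of a single circulant convolution matrix. It assumes periodic boundary conditions and no downsampling; the helper name `circulant` and the sizes are hypothetical. The sketch checks that the matrix implements a circular convolution and that it depends linearly on the kernel, as required of the maps $M_k$.

```python
import numpy as np

def circulant(kernel, n):
    """n x n circulant matrix of a kernel zero-padded to length n."""
    col = np.zeros(n)
    col[: len(kernel)] = kernel
    # Column j is the padded kernel cyclically shifted by j positions.
    return np.stack([np.roll(col, j) for j in range(n)], axis=1)

rng = np.random.default_rng(0)
n = 8
g = rng.standard_normal(3)   # two hypothetical kernels with support size 3
h = rng.standard_normal(3)
x = rng.standard_normal(n)

C_h = circulant(h, n)

# Same circular convolution computed independently with the FFT.
h_padded = np.pad(h, (0, n - len(h)))
conv_fft = np.real(np.fft.ifft(np.fft.fft(h_padded) * np.fft.fft(x)))

# Linearity in the kernel: M(2g + h) = 2 M(g) + M(h).
lhs = circulant(2 * g + h, n)
rhs = 2 * circulant(g, n) + circulant(h, n)
```

A full convolutional factor would concatenate several such blocks, one per prescribed edge; this sketch only shows one block.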

The main argument for studying deep structured linear networks is their strong connection to non-linear networks that use the rectified linear unit (ReLU) activation function. We explain it in detail. The action of the ReLU activation function at the layer $k$ treats every entry independently of the other entries and multiplies it by either 1 (the entry is kept) or 0 (the entry is canceled). More precisely, denoting $h = (h_k)_{k=1..K}$, the action of the ReLU activation function on the layer $k$ is to apply the map $A_k : \mathbb{R}^{m_k \times n} \longrightarrow \mathbb{R}^{m_k \times n}$ (where $m_k \times n$ is the size of the data in the layer $k$) such that:

$(A_k M)_{i,j} = a_k(h)_{i,j} M_{i,j}$, for $(i,j) \in \{1, \ldots, m_k\} \times \{1, \ldots, n\}$,

where $a_k(h) \in \{0,1\}^{m_k \times n}$ is defined by

$a_k(h)_{i,j} = \begin{cases} 1 & \text{if } \big(M_{k+1}(h_{k+1}) A_{k+1} M_{k+2}(h_{k+2}) \cdots A_{K-1} M_K(h_K)\big)_{i,j} \geq 0, \\ 0 & \text{otherwise.} \end{cases}$

The function

$a_k : \mathbb{R}^{S \times K} \longrightarrow \{0,1\}^{m_k \times n}, \quad h \longmapsto a_k(h)$

is piecewise constant because $\{0,1\}^{m_k \times n}$ is finite. (This has already been used in [68].) As a consequence, the parameter space $\mathbb{R}^{S \times K}$ is partitioned into subsets such that on every subset $a_k$ is constant, for all $k = 1..K$. Therefore, on every subset the action of the non-linear network coincides with the action of a deep structured linear network that groups at every layer $A_k$ and $M_{k+1}$. Further, the landscape of the objective function of the non-linear neural network that uses ReLU coincides, on every part of the partition, with the landscape of a deep structured linear network. This is a strong argument in favor of the study of deep structured linear networks.
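The argument can be checked numerically on a toy two-layer network. The sketch below is only an illustration under assumed, randomly drawn weights `W1` and `W2` and inputs `I` (all hypothetical): it verifies that applying ReLU equals an entry-wise multiplication by a 0/1 pattern, and that this pattern stays constant under a sufficiently small perturbation of the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 6))   # factor of the output layer (hypothetical)
W2 = rng.standard_normal((6, 5))   # factor of the layer receiving the input
I = rng.standard_normal((5, 10))   # inputs stored as columns

Z = W2 @ I
a = (Z >= 0).astype(float)         # activation pattern, entries in {0, 1}

relu_out = W1 @ np.maximum(Z, 0.0)     # non-linear network
masked_linear_out = W1 @ (a * Z)       # linear network with a fixed mask

# A perturbation small enough that no entry of Z changes sign:
# |t * (E @ I)| <= 0.5 * min|Z| entry-wise by construction.
E = rng.standard_normal(W2.shape)
t = 0.5 * np.abs(Z).min() / np.abs(E @ I).max()
a_perturbed = ((W2 + t * E) @ I >= 0).astype(float)
```

On the region of parameter space where the pattern `a` is constant, the ReLU network and the masked linear network are the same function of the parameters.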

Notice that deep structured linear networks have also been obtained in [22, 23, 44] by modeling the action of the activation function as random, independent of the input and when considering the expectation of the network action. However, these assumptions are not satisfied by the deep networks used in applications (see [23]) and it is not clear that this link can be exploited to obtain theoretical guarantees for realistic deep networks.

In addition to the structure induced by the operators $M_k$, we also consider structure imposed on the vectors $h$. We assume that we know a collection of models $\mathcal{M} = (\mathcal{M}^L)_{L \in \mathbb{N}}$ with the property that, for every $L$, $\mathcal{M}^L \subset \mathbb{R}^{S \times K}$ is a given subset. We will assume that the parameters $h \in \mathbb{R}^{S \times K}$ defining the factors are such that there exists $L \in \mathbb{N}$ such that $h \in \mathcal{M}^L$. For instance, the constraint $h \in \mathcal{M}^L$ might be used to impose sparsity, grouped sparsity or co-sparsity. One might also use the constraint $h \in \mathcal{M}^L$ to impose non-negativity, orthogonality, equality, compactness, etc. Generally speaking, $\mathcal{M}$ is used to impose some prior or some form of regularity or to compress the parameter space and obtain better bounds [7]. The models might also be used to alleviate ambiguities. For instance, if the operators $M_k$ and $M_{k+1}$ allow permutations (i.e., there exist $(h_k, h_{k+1}) \neq (g_k, g_{k+1})$ and a permutation matrix $C$ such that $M_k(h_k) = M_k(g_k) C$ and $M_{k+1}(h_{k+1}) = C^{-1} M_{k+1}(g_{k+1})$), we can use a complete ordering of the parameter space $\mathbb{R}^{S \times K}$ and impose, using $\mathcal{M}$, that only the largest of all the equivalent versions of a parameter be considered.
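The permutation ambiguity, and the idea of removing it by selecting one representative, can be sketched as follows. The matrices `A` and `B` stand for two consecutive factors, and the rule used in `canonical` (ordering the hidden units by the first row of `A`) is an assumption made for this example; it is one simple instance of an ordering, not the specific ordering used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5))   # plays the role of M_k(g_k)
B = rng.standard_normal((5, 3))   # plays the role of M_{k+1}(g_{k+1})

perm = rng.permutation(5)
C = np.eye(5)[:, perm]            # permutation matrix

A_perm = A @ C                    # M_k(h_k) = M_k(g_k) C
B_perm = C.T @ B                  # C^{-1} = C^T for a permutation matrix

def canonical(A, B):
    """Select one representative: sort hidden units by the first row of A."""
    order = np.argsort(-A[0])
    return A[:, order], B[order, :]

A_can, B_can = canonical(A, B)
A_can2, B_can2 = canonical(A_perm, B_perm)
```

Both pairs give the same product, and after the ordering rule is applied they collapse to the same representative, which is the role the text assigns to the constraint $h \in \mathcal{M}^L$.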

1.3. Bibliography.

1.3.1. Other matrix factorization and compressed sensing. The content of this paper is strongly related to and can be considered as an extension of the research field usually named compressed sensing. Because of the importance of this field of research and to simplify the reading for readers whose main interest is in deep learning, we have separated this part of the bibliography and placed it in Section 2. Notice that the statement of the Informal Theorem 1.1 can be interpreted in the context of signal recovery. In particular, the results on deep structured linear networks can probably be specialized to be applicable to matrix factorization problems for which stability properties have not been established [77, 20, 21, 59, 58, 46, 56]. We have not investigated this potential.

1.3.2. Tensors and deep networks. The analysis conducted in this paper is based on a connection, named tensorial lifting, between deep structured linear networks and a tensor problem (see Section 5). The tensorial lifting has already been described in [60] but other connections between tensor and network problems have been described by other authors. In particular, in [26, 27, 45], the authors define a score function using a tensor. They highlight a network topology that computes the score function defined by a tensor decomposable using a CP-decomposition, a Hierarchical Tucker [26, 27] or a tensor train decomposition [45]. They then deduce the expressive power of the network topology from the connections between the tensor decompositions. These results highlight and analyze why deep networks are more expressive than shallow ones. Tensors and tensor decompositions have also been used to represent the cross-moment and construct a solver [42], to encode the convolution layers with a tensor of order 4 and manipulate this tensor to improve the network [49, 65, 82], and to represent a tensor layer [73, 81].

1.3.3. Stability property for neural networks. To the best of our knowledge, the articles establishing stability properties are [4, 14, 53, 71, 83, 60].

Among them, [14, 53, 83] consider a family of networks of depth 1 or 2 (depending on the article, the definition of the depth may vary). The article [71] contains a study on deep networks (the depth can be large), but the study only focuses on the recovery of one layer. The articles [4, 60] consider networks without depth limitation.

In [14], the authors consider the minimization of the statistical risk (not the empirical risk). The input is assumed Gaussian and the output is generated by a network involving one linear layer followed by ReLU and a mean. The number of intermediate nodes is smaller than the input size. They provide conditions guaranteeing that, with high probability, a randomly initialized gradient descent algorithm converges to the true parameters. The authors of [53] consider a feed-forward network made of one unknown linear layer, followed by ReLU and a sum. The size of the intermediate layer equals the size of the data, and the size of the output is 1. Again, they assume Gaussian input data and consider the minimization of the risk (not the empirical risk). They show that the stochastic gradient descent converges to the true solution. In [83], the authors consider a non-linear layer followed by a linear layer. The size of the intermediate layer is smaller than the size of the input and the size of the output is 1. They describe an initialization algorithm based on a tensor decomposition such that, with high probability, the gradient algorithm minimizing the empirical risk converges to the true parameters that generated the data.

The authors of [71] consider a feed-forward neural network and show that, if the input is Gaussian or its distribution is known, a method based on moments and sparse dictionary learning can retrieve the parameters defining the first layer. Nothing is said about the stability or the estimation of the other layers.

The authors of [4] consider deep feed-forward networks which are very sparse and randomly generated. They show that they can be learned with high probability one layer after another. However, very sparse and randomly generated networks are not used in practice and one might want to study more versatile structures. The article [60] studies deep structured linear networks (without the models $\mathcal{M}$) and uses the same tensorial lifting we use here. However, in [60] the function $d$ measuring the error between parameters is only defined using the $\ell^\infty$ norm and is not a metric. The transversality condition of [60] is sufficient to guarantee the stability but is not necessary. All these weaknesses are corrected in this extended version. The general result is also specialized to deep convolutional linear networks.

1.4. Organization of the paper. Because it is strongly related, we give an extensive bibliography on compressive sensing and stable recovery properties for matrix factorization problems in Section 2. We describe the framework of the paper and our notations in Section 3.

The main contributions of this paper are:

• In Section 4, we investigate and recall several results on tensors, tensor rank and the Segre embedding. In particular, we investigate how the Segre embedding distorts distances.
• In Section 5, we describe the tensorial lifting. It expresses any deep structured linear network in a generic multilinear format. The latter composes a linear lifting operator and the Segre embedding.
• When $\delta = \eta = 0$ (see Section 6):
  – We establish a simple geometric condition on the intersection of two sets which is necessary and sufficient to guarantee the identifiability of the parameters up to scale ambiguity (Proposition 6.3).
  – We provide simpler conditions which involve the rank of the lifting operator (defined in Section 5) such that:
    ∗ Under-specified case: If the lifting operator rank is large (e.g., larger than $2K(S-1)+2$, when $\mathcal{M} = \mathbb{R}^{S \times K}$) and the lifting operator is random, then for almost every lifting operator, the solution of $M_1(h_1) \cdots M_K(h_K) = X$ is identifiable (Theorem 6.4).
    ∗ Over-specified case: If the lifting operator rank is small (e.g., smaller than $2S-1$, when $\mathcal{M} = \mathbb{R}^{S \times K}$), the solution of $M_1(h_1) \cdots M_K(h_K) = X$ is not identifiable (Theorem 6.5).
  – We also provide a simple algorithm to compute the rank of the lifting operator (Proposition 5.3).
• Stability guarantee statements for deep structured linear networks are in Section 7:
  – We define the deep-Null Space Property (Definition 7.1): a generalization of the usual Null Space Property [25] that also applies to the deep problems.
  – We establish that the deep-Null Space Property is a necessary and sufficient condition to guarantee stability (see the informal statement above or Theorem 7.2 and Theorem 7.3).
• We specialize the results to convolutional linear networks in Section 8 and establish a simple condition that can be computed (see Algorithm 8.1). The condition is such that (see Theorem 8.4):
  – If the condition is satisfied, the convolutional linear networks can be stably recovered;
  – If the condition is not satisfied, the convolutional linear network is not identifiable.
  In simple words, the condition holds when the supports of the convolution kernels are sufficiently scattered. This is not satisfied by the convolutional kernels used in applications and explains their instability.
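The tensorial lifting mentioned in the list above can be illustrated on a toy example with $K = 3$. The sketch below is written under assumptions: the factors are taken of the hypothetical form $M_k(h) = \sum_s h_s B_k[s]$ with random `B`, and the names `segre` and `lift` are chosen for this example. It checks that the product of factors is a linear function of the rank-one tensor $h_1 \otimes h_2 \otimes h_3$.

```python
import numpy as np

rng = np.random.default_rng(0)
S = 3
sizes = [2, 3, 4, 2]   # m_1, ..., m_4, so K = 3 (hypothetical sizes)
B = [rng.standard_normal((S, sizes[k], sizes[k + 1])) for k in range(3)]
h1, h2, h3 = (rng.standard_normal(S) for _ in range(3))

# Direct evaluation of M_1(h_1) M_2(h_2) M_3(h_3).
product = (np.tensordot(h1, B[0], axes=1)
           @ np.tensordot(h2, B[1], axes=1)
           @ np.tensordot(h3, B[2], axes=1))

# Segre embedding: the rank-one tensor h_1 (x) h_2 (x) h_3.
segre = np.einsum("a,b,c->abc", h1, h2, h3)

# Lifting operator: lift[a, b, c] = B_1[a] @ B_2[b] @ B_3[c].
lift = np.einsum("aij,bjk,ckl->abcil", B[0], B[1], B[2])

# The network action is linear in the lifted variable.
lifted_product = np.einsum("abcil,abc->il", lift, segre)

# The rank-one tensor is blind to layer-wise rescalings of product 1.
segre_rescaled = np.einsum("a,b,c->abc", 2.0 * h1, 0.5 * h2, h3)

# Rank of the lifting operator seen as a matrix acting on vectorized tensors.
lift_rank = np.linalg.matrix_rank(lift.reshape(S**3, -1))
```

All the non-linearity of the problem is thus concentrated in the map from the parameters to the rank-one tensor, while the dependence on the network structure is carried by a linear operator whose rank can be computed.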

2. Bibliography on matrix factorization and compressed sensing. Before describing the bibliography on compressed sensing, we interpret the stability statement of the Informal Theorem 1.1 in the context of signal processing. In signal processing, we usually know that $h$ exists and $\delta$ represents the sum of a modeling error and noise. The inequality (1.1) guarantees that, when the condition is satisfied, even an approximate minimizer of

(2.1) $\operatorname{argmin}_{L \in \mathbb{N},\, (h_k)_{k=1..K} \in \mathcal{M}^L} \|M_1(h_1) \cdots M_K(h_K) - X\|^2$

leads to a solution $h^*$ close to $h$. This property is often named a stable recovery guarantee. When $\delta = 0$ (i.e., the data exactly fits the model and is not noisy) and $\eta = 0$ (i.e., (2.1) is perfectly solved) this is an identifiability guarantee. This is a necessary condition of stable recovery.

In this section, we distinguish the cases $K = 1$, $K = 2$ and $K \geq 3$.

2.1. $K = 1$: Linear inverse problems. The simplest version consists of a model with one layer (i.e., $K = 1$) and $\mathcal{M} = \mathbb{R}^{S \times K}$. Recovering $h_1$ from $X$ is a linear inverse problem. The data $X$ can be vectorized to form a column vector and the operator $M_1$ simply multiplies the column vector $h_1$ by a fixed (rectangular) matrix. Typically, when the linear inverse problem is over-determined, the latter matrix has more rows than columns; the uniqueness of a solution to (2.1) then depends on the column rank of the matrix, and the stable recovery constant depends on the smallest singular value of $M_1$.
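For this over-determined case the stability inequality (1.1) can be verified directly. In the sketch below, the matrix `A` is a hypothetical stand-in for the fixed matrix representing $M_1$, and the noise level is arbitrary. Writing $A(h - h^*) = (x - A h^*) - (x - A h)$ and using $\|A v\| \geq \sigma_{\min}(A) \|v\|$ gives $\|h - h^*\| \leq (\delta + \eta)/\sigma_{\min}(A)$ whenever `A` has full column rank.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))      # fixed matrix: vec(M_1(h)) = A @ h
h = rng.standard_normal(5)            # parameters that generated the data
x = A @ h + 0.01 * rng.standard_normal(20)   # noisy vectorized data

# h_star: a minimizer of the reconstruction error (least squares).
h_star = np.linalg.lstsq(A, x, rcond=None)[0]

eta = np.linalg.norm(x - A @ h)          # error of h, as in (1.1)
delta = np.linalg.norm(x - A @ h_star)   # error of h_star

sigma_min = np.linalg.svd(A, compute_uv=False)[-1]
C = 1.0 / sigma_min                      # stable recovery constant
param_error = np.linalg.norm(h - h_star)
```

The constant `C` degrades as the smallest singular value approaches zero, which is the quantitative counterpart of the loss of column rank discussed next.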

When the matrix is not full column rank, the identifiability and stable recovery for this problem have been intensively studied for many constraints $\mathcal{M}$. In particular, for sparsity constraints this is the compressed/compressive sensing problem (see the seminal articles [15, 30]). Some compressed sensing statements (especially the ones guaranteeing that any minimizer of the $\ell^0$ problem stably recovers the unknown) are special cases ($K = 1$) of the statements provided in this paper. We will not give a complete review on compressed sensing but would like to highlight the Null Space Property described in [25]. The fundamental limits of compressed sensing (for a solution of the $\ell^0$ problem) have been analyzed in detail in [13].

Although the main novelty of the paper is to investigate stable recovery properties for any $K \geq 1$, we specialize the statements made for $K \geq 1$ to the case $K = 1$ in order to illustrate the new statements and to provide a way of comparison with well known results.

2.2. $K = 2$: Bilinear inverse problems and bilinear parameterizations. When $K \geq 2$, the problem becomes non-linear because of the product in (2.1). This significantly complicates the analysis. What follows are the main instances studied in the literature when $K = 2$.

Non-negative matrix factorization (NMF) and low rank prior. In non-negative matrix factorization [50], $M_1$ and $M_2$ map the entries in $h_1$ and $h_2$ at prescribed locations in the factors (say, one column after another). The constraint $\mathcal{M}$ imposes that all the entries in $h_1$ and $h_2$ are non-negative. The NMF has been widely used for many applications.

Conditions guaranteeing that the factors provided by the NMF identify$^5$ the correct factors (up to rescaling and permutation) were first established in the pioneering work [29]. To the best of our knowledge, this is the first paper addressing recovery guarantees for a problem of depth $K = 2$. It emphasizes a separability condition that guarantees identifiability. The proof is purely geometric and relies on the analysis of inclusions of simplicial cones. This result is significantly extended in [48]. In this paper, the continuity of the NMF estimator is established. Concerning computational aspects, NMF is NP-complete [78]. However, under the separability hypothesis of [29], the solution of the NMF problem can be computed in polynomial time [5].

We can slightly generalize$^6$ the problem and introduce a linear degradation operator

$H : \mathbb{R}^{m \times n} \longrightarrow \mathbb{R}^{m \times n}$.

Use the same mappings $M_1$ and $M_2$ as for the NMF, with $\mathcal{M} = \mathbb{R}^{S \times K}$, but with a small number of rows (resp. columns) in $M_2(h_2)$ (resp. $M_1(h_1)$). Any solution of the problem

$(h_1^*, h_2^*) \in \operatorname{argmin}_{(h_1, h_2) \in \mathbb{R}^{S \times K}} \|H(M_1(h_1) M_2(h_2)) - X\|^2$

leads to a low rank approximation $M_1(h_1^*) M_2(h_2^*)$ of an inverse of $H$, at $X$. Again, a large corpus of literature exists on the low rank prior [66, 17, 32, 19].
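In the special case where the degradation operator $H$ is the identity (an assumption made only for this sketch), a minimizer of the problem above is given by the truncated singular value decomposition, by the Eckart-Young theorem. The sizes and the names `F1` and `F2`, which stand for $M_1(h_1^*)$ and $M_2(h_2^*)$, are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 8, 10, 2     # r: number of columns of M_1(h_1) and rows of M_2(h_2)

# Data close to a rank-r matrix, plus a small perturbation.
X = (rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
     + 0.01 * rng.standard_normal((m, n)))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
F1 = U[:, :r] * s[:r]          # m x r factor
F2 = Vt[:r, :]                 # r x n factor

err = np.linalg.norm(F1 @ F2 - X)             # Frobenius reconstruction error
optimal_err = np.sqrt(np.sum(s[r:] ** 2))     # Eckart-Young optimal value
```

The factors are again only defined up to an invertible $r \times r$ change of basis, which is the kind of ambiguity the constraint $\mathcal{M}$ is meant to control.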

Phase retrieval. Phase retrieval fits the framework described in the present paper when we take

$M_1(h_1) = \operatorname{diag}(\mathcal{F} h_1), \qquad M_2(h_2) = (\mathcal{F} h_2)^*$

and

$\mathcal{M} = \{(h, h) \in \mathbb{R}^{S \times K} \mid h \in \mathbb{R}^S\}$,

where $S$ is the size of the signal, $\mathcal{F}$ computes $N$ linear measurements of any element in $\mathbb{R}^S$ (typically Fourier measurements), $\operatorname{diag}(\cdot)$ creates an $N \times N$ diagonal matrix whose diagonal contains the input and $^*$ is the (entry-wise) complex conjugate.
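A short numerical check of this parameterization, assuming that the measurements are the first $N$ discrete Fourier coefficients of the zero-padded signal (one possible choice among the "typically Fourier" measurements; sizes are hypothetical): the product $M_1(h) M_2(h)$ is the vector of squared magnitudes, so the phases, and here the global sign of the real signal, are lost.

```python
import numpy as np

rng = np.random.default_rng(0)
S, N = 6, 16
F = np.fft.fft(np.eye(N))[:, :S]   # N x S matrix: DFT of a zero-padded signal

h = rng.standard_normal(S)

M1 = np.diag(F @ h)                # N x N diagonal matrix
M2 = np.conj(F @ h)                # entry-wise complex conjugate
measurements = M1 @ M2             # equals |F h|^2 entry-wise

# The same measurements are obtained from -h: a global sign ambiguity.
measurements_neg = np.diag(F @ (-h)) @ np.conj(F @ (-h))
```

This is the $K = 2$ instance in which both factors share the same parameter vector, which is what the constraint set $\mathcal{M}$ above encodes.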

The tensorial lifting at the core of the present paper generalizes the lifting used in the inspiring work on PhaseLift [52, 18, 16]. As is often the case when $K = 2$, PhaseLift is a semidefinite program that can be efficiently solved when the unknown is of moderate size. These papers also provide conditions on the measurements guaranteeing that the phases are stably recovered by PhaseLift.

The benefit of the generalization introduced with the tensorial lifting is that it applies to any multilinear inverse problem.

Self-calibration and de-mixing. Measuring operators often depend linearly on parameters that are not perfectly known. The estimation of these parameters is crucial to restore the data measured by the device. This is the self-calibration problem. This naturally fits the setting of this article: we let $h_1$ be the parameters defining the sensing matrix and $M_1(h_1)$ be the sensing matrix. Then $h_2$ defines the signal (or signals) contained in the column(s) of $M_2(h_2)$.

$^5$ Stable recovery is not established.

$^6$ The interested readers can check that this generalization only leads to a small change of the lifting operator introduced in Section 5. It is therefore done at no cost.

357

Many instances of this problem have been studied and much progress has been 358

made to obtain algorithms that can be applied to problems of larger and larger size. 359

This leads to a very interesting line of research. 360

To the best of our knowledge, the first stable recovery statements concern the blind-deconvolution problem. In [2], the authors use a lifting to transform the blind-deconvolution problem into a semidefinite program with an unknown whose size is the product of the sizes of $h_1$ and $h_2$. Such problems can be solved for unknowns of moderate size. The authors of [2] provide explicit conditions guaranteeing the stable recovery with high probability. This idea has been generalized and applied to other similar problems in [24, 9]. The authors of [54] consider a significantly more general calibration model. In this model, $M_1(h_1)$ is diagonal and its diagonal contains the entries of $h_1$. $M_2(h_2)$ simply multiplies $h_2$ by a fixed known matrix (the theorems consider a random matrix). The constraint imposes $h_2$ to be sparse. For this problem, they prove that with high probability the numerical method called SparseLift is stable with a controlled accuracy. SparseLift returns the left and right singular vectors of the solutions of an $\ell^1$ optimization problem whose unknown is the same as in [2]. However, solving an $\ell^1$ minimization problem is much simpler than solving a semidefinite program. This is a very significant practical improvement.
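The lifting idea behind these methods can be sketched for the calibration model just described, $y = \operatorname{diag}(h_1) A h_2$. The sketch below is not SparseLift itself: it omits the sparsity constraint and the $\ell^1$ optimization, and the sizes and the matrix `A` are hypothetical. It only shows that the bilinear measurements are linear in the rank-one matrix $h_1 h_2^T$ and that the factors can be read from its singular vectors, up to scale.

```python
import numpy as np

rng = np.random.default_rng(0)
N, S2 = 12, 5
A = rng.standard_normal((N, S2))   # fixed known matrix (hypothetical)
h1 = rng.standard_normal(N)        # calibration parameters
h2 = rng.standard_normal(S2)       # signal

# Bilinear measurements: M_1(h_1) = diag(h_1), M_2(h_2) = A h_2.
y_bilinear = np.diag(h1) @ (A @ h2)

# Lifted variable: the rank-one matrix h_1 h_2^T.
Z = np.outer(h1, h2)

# The measurements are linear in Z: y_i = sum_j A[i, j] * Z[i, j].
y_lifted = np.sum(A * Z, axis=1)

# Factors recovered from the leading singular vectors, up to scale and sign.
U, s, Vt = np.linalg.svd(Z)
h1_rec = np.sqrt(s[0]) * U[:, 0]
h2_rec = np.sqrt(s[0]) * Vt[0]
```

The price of the linearization is the size of the lifted unknown `Z`, which has as many entries as the product of the sizes of $h_1$ and $h_2$.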

As emphasized in [51] to motivate a non-convex approach, the only drawback of the numerical methods described in [2, 54] is their complexity. The extra complexity is due to the fact that they optimize a variable in the product space $\mathbb{R}^{S \times S}$ and then deduce an approximate solution of the un-lifted problem. This is what motivates the authors of [51] to propose a non-convex approach. The constructed algorithm provably stably recovers the sensing parameters and the signals with a geometric convergence rate.

Sparse coding and dictionary learning. Sparse coding and dictionary learning is another kind of bilinear problem (see [67] for an overview). In that framework, the columns of $X$ contain the data. Most often, people consider two layers: $K = 2$. The layer $M_1(h_1)$ is an optimized dictionary of atoms defined by the parameters $h_1$ and each column of $M_2(h_2)$ contains the code (or coordinates) of the corresponding column in $X$. Most often, $h_2$ is assumed sparse.

The identifiability and stable recovery of the factors have been studied in many dictionary learning contexts, providing guarantees on the approximate recovery of both an incoherent dictionary and sparse coefficients when the number of samples is sufficiently large (i.e., in our setting when $n$ is large). In [36], the authors developed local optimality conditions in the noiseless case, as well as sample complexity bounds for local recovery when $M_1(h_1)$ is square and the entries of $M_2(h_2)$ are iid Bernoulli-Gaussian. This was extended to overcomplete dictionaries in [33] (see also [70] for tight frames) and to the noisy case in [43]. The authors of [74] provide exact recovery results for dictionary learning, when the coefficient matrix has Bernoulli-Gaussian entries and the dictionary matrix has full column rank. This was extended to overcomplete dictionaries in [1] and in [6] but only for approximate recovery. Finally, [35] provides such guarantees under general conditions which cover many practical settings.

Contributions in these frameworks. The present article considers the identifiability and stability of the recovery for any $K \ge 1$ in a general and unifying framework. As was already mentioned, we do not investigate computational issues. As will appear later in the paper, the analogue of the lifting at the core of the algorithms described in the above papers (in particular the papers on phase retrieval and self-calibration) is a tensorial lifting (see Section 5) and involves tensors that cannot be manipulated in practice. Even when we are able to manipulate the tensors, the computation of the best rank 1 approximation of such tensors is an open non-convex problem. Therefore, there is no numerically efficient and reliable way to extract the un-lifted parameters from an optimized tensor. Because of that, we have not yet pursued the construction of a numerical scheme based on the tensorial lifting when $K \ge 3$. As was already mentioned, at this writing, the success of algorithms for $K \ge 3$ is mostly supported by empirical evidence. Proving their efficiency is a wide open problem (see [55,37,22,23,68,10,11,44,79]). The purpose of the paper is to provide guarantees on the stability of the solution when such an empirical success occurs.

The specialization of the presented results to problems with $K = 2$ leads to necessary and sufficient conditions for the stable recovery. This is slightly different from the usual approach. Usually, authors provide sufficient conditions and argue their sharpness by comparing the number of samples required by their method and the information theoretic limit (typically, the number of independent variables of the problem).

It would of course be interesting to see how far it is possible to unify the different problems with $K = 2$ using the framework of this paper. We have however not pursued this route and instead focus on the situation $K \ge 3$.

2.3. $K \ge 3$. The difficulties, when $K \ge 3$, come from the fact that tools used for problems with $K = 2$ are not applicable. In particular, we cannot use the usual lifting, the singular value decomposition or the sin-θ theorem in [28]. Often, these tools are replaced by analogous objects involving tensors. This complicates the analysis and prohibits the use of numerical schemes that manipulate lifted variables.

To the best of our knowledge, little is known concerning the identifiability and the stability of matrix factorization when $K \ge 3$. The uniqueness of the factorization corresponding to the Fast Fourier Transform was proved in [57]. Other results consider the identifiability of factors which are sparse and random [64]. The authors of the present paper have announced preliminary versions of the results described here in [60]. They are significantly extended here.

3. Notation and summary of the hypotheses. We continue to use the notation introduced in the introduction. For an integer $k \in \mathbb{N}$, set $[k] = \{1, \dots, k\}$.

We consider $K \ge 1$ and $S \ge 2$ and real-valued tensors of order $K$ whose axes are of size $S$, denoted $T \in \mathbb{R}^{S\times\cdots\times S}$. The space of tensors is abbreviated $\mathbb{R}^{S^K}$. The entries of $T$ are denoted $T_{i_1,\dots,i_K}$, where $(i_1,\dots,i_K) \in [S]^K$. For $\mathbf{i} \in [S]^K$, the entries of $\mathbf{i}$ are $\mathbf{i} = (i_1,\dots,i_K)$ (for $\mathbf{j} \in [S]^K$ we let $\mathbf{j} = (j_1,\dots,j_K)$, etc). We either write $T_{\mathbf{i}}$ or $T_{i_1,\dots,i_K}$.

To simplify notations, from now on, the parameters defining the factors are gathered in a single matrix and are denoted with bold fonts $\mathbf{h} \in \mathbb{R}^{S\times K}$. The $k$th vector containing the parameters for the layer $k$ is denoted $\mathbf{h}_k \in \mathbb{R}^S$. The $i$th entry of the $k$th vector is denoted $\mathbf{h}_{k,i} \in \mathbb{R}$. A vector not related to an element in $\mathbb{R}^{S\times K}$ is denoted $h \in \mathbb{R}^S$ (i.e., using a light font). Throughout the paper we assume

$$\mathcal{M} = (\mathcal{M}^L)_{L\in\mathbb{N}}, \quad \text{with } \mathcal{M}^L \subset \mathbb{R}^{S\times K}.$$

We also assume that, for all $L \in \mathbb{N}$, $\mathcal{M}^L \neq \emptyset$. They can however be equal or constant after a given $L'$.

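All the vector spaces $\mathbb{R}^{S^K}$, $\mathbb{R}^{S\times K}$, $\mathbb{R}^S$ etc. are equipped with the usual Euclidean norm. This norm is denoted $\|\cdot\|$ and the scalar product $\langle \cdot, \cdot\rangle$. In the particular case of matrices, $\|\cdot\|$ corresponds to the Frobenius norm. We also use the usual $p$ norm, for $p \in [1, \infty]$, and denote it by $\|\cdot\|_p$. In particular, for $\mathbf{h} \in \mathbb{R}^{S\times K}$ and $T \in \mathbb{R}^{S^K}$, we have for $p < +\infty$

$$\|\mathbf{h}\|_p = \Big(\sum_{k=1}^{K}\sum_{i=1}^{S} |\mathbf{h}_{k,i}|^p\Big)^{1/p}, \qquad \|T\|_p = \Big(\sum_{\mathbf{i}\in[S]^K} |T_{\mathbf{i}}|^p\Big)^{1/p}$$

and

$$\|\mathbf{h}\|_\infty = \max_{k\in[K],\, i\in[S]} |\mathbf{h}_{k,i}|, \qquad \|T\|_\infty = \max_{\mathbf{i}\in[S]^K} |T_{\mathbf{i}}|.$$

Set

(3.1) $$\mathbb{R}^{S\times K}_* = \{\mathbf{h} \in \mathbb{R}^{S\times K} \mid \forall k \in [K],\ \|\mathbf{h}_k\| \neq 0\}.$$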

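Define an equivalence relation on $\mathbb{R}^{S\times K}_*$: for any $\mathbf{h}, \mathbf{g} \in \mathbb{R}^{S\times K}_*$, $\mathbf{h} \sim \mathbf{g}$ if and only if there exist $(\lambda_k)_{k\in[K]} \in \mathbb{R}^K$ such that

(3.2) $$\prod_{k=1}^{K} \lambda_k = 1 \quad\text{and}\quad \mathbf{h}_k = \lambda_k \mathbf{g}_k,\ \forall k \in [K].$$

Denote the equivalence class of $\mathbf{h} \in \mathbb{R}^{S\times K}_*$ by $\langle\mathbf{h}\rangle$.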

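The zero tensor is of rank 0. A non-zero tensor $T \in \mathbb{R}^{S^K}$ is of rank 1 (or decomposable) if and only if there exists $\mathbf{h} \in \mathbb{R}^{S\times K}_*$ such that $T$ is the outer product of the vectors $\mathbf{h}_k$, for $k \in [K]$. That is, for any $\mathbf{i} \in [S]^K$,

$$T_{\mathbf{i}} = \mathbf{h}_{1,i_1} \cdots \mathbf{h}_{K,i_K}.$$

Let $\Sigma_1 \subset \mathbb{R}^{S^K}$ denote the set of tensors of rank 0 or 1.

The rank of a tensor $T \in \mathbb{R}^{S^K}$ is

$$\operatorname{rk}(T) = \min\{r \in \mathbb{N} \mid \text{there exist } T_1, \dots, T_r \in \Sigma_1 \text{ such that } T = T_1 + \cdots + T_r\}.$$

For $r \in \mathbb{N}$, let

$$\Sigma_r = \{T \in \mathbb{R}^{S^K} \mid \operatorname{rk}(T) \le r\}.$$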

The $*$ superscript refers to optimal solutions. A set with a $*$ subscript means that 0 is ruled out of the set. In particular, $\Sigma_{1,*}$ denotes the non-zero tensors of rank 1. Attention should be paid to $\mathbb{R}^{S\times K}_*$ (see (3.1)).

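4. Facts on the Segre embedding and tensors of rank 1 and 2. Parametrize $\Sigma_1 \subset \mathbb{R}^{S^K}$ by the map

(4.1) $$P : \mathbb{R}^{S\times K} \longrightarrow \Sigma_1 \subset \mathbb{R}^{S^K}, \qquad \mathbf{h} \longmapsto (\mathbf{h}_{1,i_1}\mathbf{h}_{2,i_2}\cdots \mathbf{h}_{K,i_K})_{\mathbf{i}\in[S]^K}.$$

The map $P$ is called the Segre embedding and is often denoted $\widehat{Seg}$ in the algebraic geometry literature.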
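As an illustration of (4.1), the following minimal sketch evaluates $P(\mathbf{h})$ with NumPy and checks that rescaling the columns by factors whose product is 1 leaves the tensor unchanged. The function name `segre` and the sizes are choices made for this example only; they do not come from the paper.

```python
from functools import reduce
import numpy as np

def segre(h):
    """Return P(h): the order-K outer product of the columns of h (shape S x K)."""
    # Column k of h holds the parameters h_k of layer k.
    return reduce(np.multiply.outer, [h[:, k] for k in range(h.shape[1])])

rng = np.random.default_rng(0)
S, K = 3, 4                                # illustrative sizes
h = rng.standard_normal((S, K))

# Rescale the columns by factors whose product is 1: g is equivalent to h.
lam = np.array([2.0, -0.5, 4.0, -0.25])    # 2 * (-0.5) * 4 * (-0.25) = 1
g = h * lam                                # g_k = lam_k * h_k

T = segre(h)
print(T.shape)                             # (3, 3, 3, 3)
print(np.allclose(T, segre(g)))            # True
```

The tensor has $S^K$ entries, which is why this explicit form is only usable for small $S$ and $K$.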

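Standard Facts:

1. Identifiability of $\langle\mathbf{h}\rangle$ from $P(\mathbf{h})$: For $\mathbf{h}$ and $\mathbf{g} \in \mathbb{R}^{S\times K}_*$, $P(\mathbf{h}) = P(\mathbf{g})$ if and only if $\langle\mathbf{h}\rangle = \langle\mathbf{g}\rangle$.

2. Geometrical description of $\Sigma_{1,*}$: $\Sigma_{1,*}$ is a smooth (i.e., $C^\infty$) manifold of dimension $K(S-1)+1$ (see, e.g., [47], chapter 4, pp. 103).

3. Geometrical description of $\Sigma_2$: We recall that the singular locus $(\overline{\Sigma_2})_{sing}$ of the closure $\overline{\Sigma_2}$ of $\Sigma_2$ has dimension strictly less than that of $\overline{\Sigma_2}$ and that $\overline{\Sigma_2} \setminus (\overline{\Sigma_2})_{sing}$ is a smooth manifold. The dimension of $\overline{\Sigma_2} \setminus (\overline{\Sigma_2})_{sing}$ is $2K(S-1)+2$ when $K > 2$, and is $4(S-1)$ when $K = 2$ (see, e.g., [47], chapter 5).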

We can improve Standard Fact 1 and obtain a stability result guaranteeing that if we know a rank 1 tensor sufficiently close to $P(\mathbf{h})$, we approximately know $\langle\mathbf{h}\rangle$. In order to state this, we need to define a metric on $\mathbb{R}^{S\times K}_*/\sim$ (where $\sim$ is defined by (3.2)). This has to be considered with care since, whatever $\mathbf{h} \in \mathbb{R}^{S\times K}_*$, the subset $\{\mathbf{h}' \mid \mathbf{h}' \in \langle\mathbf{h}\rangle\}$ is not compact. In particular, considering

$$\mathbf{h}'_k = \begin{cases} \lambda\, \mathbf{h}_k & \text{if } k = 1, \\ \lambda^{-\frac{1}{K-1}}\, \mathbf{h}_k & \text{otherwise,} \end{cases}$$

when $\lambda$ goes to infinity, we easily construct examples that make the standard metric on equivalence classes useless (see footnote 8).

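This leads us to consider

$$\mathbb{R}^{S\times K}_{diag} = \{\mathbf{h} \in \mathbb{R}^{S\times K}_* \mid \forall k \in [K],\ \|\mathbf{h}_k\|_\infty = \|\mathbf{h}_1\|_\infty\}.$$

The interest in this set comes from the fact that, whatever $\mathbf{h} \in \mathbb{R}^{S\times K}_*$, the set $\langle\mathbf{h}\rangle \cap \mathbb{R}^{S\times K}_{diag}$ is finite. Indeed, if $\mathbf{g} \in \langle\mathbf{h}\rangle \cap \mathbb{R}^{S\times K}_{diag}$, the $(\lambda_k)_{k\in[K]} \in \mathbb{R}^K$ such that, for all $k \in [K]$, $\mathbf{h}_k = \lambda_k \mathbf{g}_k$ must all satisfy $|\lambda_k| = 1$, i.e., $\lambda_k = \pm 1$.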

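Definition 4.1. For any $p \in [1, \infty]$, we define the mapping $d_p : (\mathbb{R}^{S\times K}_*/\sim) \times (\mathbb{R}^{S\times K}_*/\sim) \to \mathbb{R}$ by

$$d_p(\langle\mathbf{h}\rangle, \langle\mathbf{g}\rangle) = \inf_{\substack{\mathbf{h}' \in \langle\mathbf{h}\rangle \cap \mathbb{R}^{S\times K}_{diag} \\ \mathbf{g}' \in \langle\mathbf{g}\rangle \cap \mathbb{R}^{S\times K}_{diag}}} \|\mathbf{h}' - \mathbf{g}'\|_p, \qquad \forall \mathbf{h}, \mathbf{g} \in \mathbb{R}^{S\times K}_*.$$

Proposition 4.2. For any $p \in [1, \infty]$, $d_p$ is a metric on $\mathbb{R}^{S\times K}_*/\sim$.

The proof is in Appendix 10.1.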
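Because $\langle\mathbf{h}\rangle \cap \mathbb{R}^{S\times K}_{diag}$ is finite, $d_p$ can be evaluated by brute force for small $K$. The sketch below is an illustration under assumptions of ours, not a procedure from the paper: each column is rescaled to the geometric mean of the sup norms (so that the scalings multiply to 1), and only the sign patterns of $\mathbf{g}'$ are enumerated, which suffices since $\|\varepsilon\mathbf{h}' - \varepsilon'\mathbf{g}'\|_p = \|\mathbf{h}' - \varepsilon\varepsilon'\mathbf{g}'\|_p$ for column-wise signs. The names `to_diag` and `d_p` are hypothetical.

```python
import itertools
import numpy as np

def to_diag(h):
    """Rescale the columns of h (S x K, no zero column) to a common sup norm."""
    norms = np.abs(h).max(axis=0)          # ||h_k||_inf for each k
    c = np.exp(np.log(norms).mean())       # geometric mean: scalings multiply to 1
    return h * (c / norms)

def d_p(h, g, p=2):
    """Brute-force d_p(<h>, <g>) over the sign patterns with product +1."""
    K = h.shape[1]
    hd, gd = to_diag(h), to_diag(g)
    best = np.inf
    for signs in itertools.product([1.0, -1.0], repeat=K - 1):
        s = np.array(signs + (np.prod(signs),))   # last sign forces product +1
        best = min(best, np.linalg.norm((hd - gd * s).ravel(), ord=p))
    return best

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 3))
g = h * np.array([2.0, -0.5, -1.0])        # same equivalence class as h
print(d_p(h, g) < 1e-12)                   # True
print(d_p(h, 2.0 * h) > 0)                 # True: 2h is not equivalent to h
```

The enumeration costs $2^{K-1}$ norm evaluations, so it is meant for checking small examples only.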

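Notice that the equivalence relationship and metric defined above are not adapted to operators $M_k$ allowing invariances such as permutations. More precisely, for some operators $M_k$, there exist $\mathbf{h}$, $\mathbf{g}$ and a permutation matrix $C$ such that $(\mathbf{h}_k, \mathbf{h}_{k+1}) \neq (\mathbf{g}_k, \mathbf{g}_{k+1})$ and $M_k(\mathbf{h}_k) = M_k(\mathbf{g}_k)C$ and $M_{k+1}(\mathbf{h}_{k+1}) = C^{-1}M_{k+1}(\mathbf{g}_{k+1})$. In such a case, we have $d_p(\langle\mathbf{h}\rangle, \langle\mathbf{g}\rangle) \neq 0$. However, the features defined at the layer $k$ are just permuted and can still be interpreted. As already said, in such a case, it is possible to use the models $\mathcal{M}$ to select one of the equally interpretable $\mathbf{h}$.

8. For instance, if $\mathbf{h}$ and $\mathbf{g} \in \mathbb{R}^{S\times K}_*$ are such that $\mathbf{h}_1 = \mathbf{g}_1$, we have
$$\inf_{\mathbf{h}'\in\langle\mathbf{h}\rangle,\ \mathbf{g}'\in\langle\mathbf{g}\rangle} \|\mathbf{h}' - \mathbf{g}'\|_p = 0$$
even though we might have $\mathbf{h}_2 \neq \mathbf{g}_2$ (and therefore $\langle\mathbf{h}\rangle \neq \langle\mathbf{g}\rangle$). This does not define a metric. Also, when $\mathbf{h}$ and $\mathbf{g}$ are such that $\mathbf{h}_k \neq \mathbf{g}_k$, whatever $k \in [K]$, we have
$$\sup_{\mathbf{h}'\in\langle\mathbf{h}\rangle}\ \inf_{\mathbf{g}'\in\langle\mathbf{g}\rangle} \|\mathbf{h}' - \mathbf{g}'\|_p = +\infty.$$
Therefore, the Hausdorff distance between $\langle\mathbf{h}\rangle$ and $\langle\mathbf{g}\rangle$ is infinite for almost every pair $(\mathbf{h}, \mathbf{g})$. This metric is therefore not very useful in the present context.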

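Using the above metric, we can state that not only is $\langle\mathbf{h}\rangle$ uniquely determined by $P(\mathbf{h})$, but this operation is stable.

Theorem 4.3. Stability of $\langle\mathbf{h}\rangle$ from $P(\mathbf{h})$
Let $\mathbf{h}$ and $\mathbf{g} \in \mathbb{R}^{S\times K}_*$ be such that $\|P(\mathbf{g}) - P(\mathbf{h})\|_\infty \le \frac{1}{2}\max(\|P(\mathbf{h})\|_\infty, \|P(\mathbf{g})\|_\infty)$. For all $p, q \in [1, \infty]$,

(4.2) $$d_p(\langle\mathbf{h}\rangle, \langle\mathbf{g}\rangle) \le 7 (KS)^{\frac{1}{p}} \min\Big(\|P(\mathbf{h})\|_\infty^{\frac{1}{K}-1}, \|P(\mathbf{g})\|_\infty^{\frac{1}{K}-1}\Big)\, \|P(\mathbf{h}) - P(\mathbf{g})\|_q.$$

The proof of the theorem is in Appendix 10.2.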

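In the final result, the bound established in Theorem 4.3 plays a role similar to the sin-θ Theorem of [28] in [54,18,2].

The following proposition shows that the upper bound in (4.2) cannot be improved by a significant factor, in particular when $q$ is large.

Proposition 4.4. There exist $\mathbf{h}$ and $\mathbf{g} \in \mathbb{R}^{S\times K}_*$ such that $\|P(\mathbf{g})\|_\infty \le \|P(\mathbf{h})\|_\infty$, $\|P(\mathbf{g}) - P(\mathbf{h})\|_\infty \le \frac{1}{2}\|P(\mathbf{h})\|_\infty$ and

$$7 (KS)^{\frac{1}{p}}\, \|P(\mathbf{h})\|_\infty^{\frac{1}{K}-1}\, \|P(\mathbf{h}) - P(\mathbf{g})\|_q \le C_q\, d_p(\langle\mathbf{h}\rangle, \langle\mathbf{g}\rangle),$$

where

$$C_q = \begin{cases} 28\,(KS)^{\frac{1}{q}} & \text{if } q < +\infty, \\ 28 & \text{if } q = +\infty. \end{cases}$$

The proof of the proposition is in Appendix 10.3.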

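As stated in the following theorem, we have a more valuable upper bound in the general case.

Theorem 4.5. "Lipschitz continuity" of $P$
For any $q \in [1, \infty]$ and any $\mathbf{h}$ and $\mathbf{g} \in \mathbb{R}^{S\times K}_*$,

(4.3) $$\|P(\mathbf{h}) - P(\mathbf{g})\|_q \le S^{\frac{K-1}{q}} K^{1-\frac{1}{q}} \max\Big(\|P(\mathbf{h})\|_\infty^{1-\frac{1}{K}}, \|P(\mathbf{g})\|_\infty^{1-\frac{1}{K}}\Big)\, d_q(\langle\mathbf{h}\rangle, \langle\mathbf{g}\rangle).$$

The theorem is proved in Appendix 10.4.

Notice that, considering $\mathbf{h}$ and $\mathbf{g} \in \mathbb{R}^{S\times K}$ such that $\mathbf{h}_{k,i} = 1$ and $\mathbf{g}_{k,i} = \varepsilon$, for all $k \in [K]$ and $i \in [S]$ and for a $0 < \varepsilon \ll 1$, we easily calculate

$$S^{\frac{K-1}{q}} K^{1-\frac{1}{q}} \max\Big(\|P(\mathbf{h})\|_\infty^{1-\frac{1}{K}}, \|P(\mathbf{g})\|_\infty^{1-\frac{1}{K}}\Big)\, d_q(\langle\mathbf{h}\rangle, \langle\mathbf{g}\rangle) \le K\, \|P(\mathbf{h}) - P(\mathbf{g})\|_q.$$

As a consequence, the upper bound in Theorem 4.5 is tight up to at most a factor $K$.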
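The calculation behind this remark can be written out as follows. This is a worked sketch for $q < +\infty$ that uses only the definitions of Sections 3 and 4; it relies on the observations that both $\mathbf{h}$ and $\mathbf{g}$ already belong to $\mathbb{R}^{S\times K}_{diag}$ and that flipping signs can only increase the distance between them.

```latex
\begin{align*}
\|P(\mathbf{h}) - P(\mathbf{g})\|_q
  &= \Big(\sum_{\mathbf{i}\in[S]^K} |1-\varepsilon^K|^q\Big)^{1/q}
   = S^{K/q}\,(1-\varepsilon^K),\\
d_q(\langle\mathbf{h}\rangle,\langle\mathbf{g}\rangle)
  &= \Big(\sum_{k=1}^{K}\sum_{i=1}^{S} |1-\varepsilon|^q\Big)^{1/q}
   = (KS)^{1/q}\,(1-\varepsilon),\\
S^{\frac{K-1}{q}} K^{1-\frac1q}\,\max\big(1,\varepsilon^{K-1}\big)\,
  d_q(\langle\mathbf{h}\rangle,\langle\mathbf{g}\rangle)
  &= K\,S^{K/q}\,(1-\varepsilon)
   \;\le\; K\,S^{K/q}\,(1-\varepsilon^K)
   = K\,\|P(\mathbf{h})-P(\mathbf{g})\|_q .
\end{align*}
```

The last inequality holds because $\varepsilon^K \le \varepsilon$ for $0 < \varepsilon < 1$.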

5. The tensorial lifting. The following proposition is clear (it can be shown by induction on $K$):

Proposition 5.1. Let $M_k$, $k \in [K]$, be as in (1.2). The entries of the matrix $M_1(\mathbf{h}_1)M_2(\mathbf{h}_2)\cdots M_K(\mathbf{h}_K)$ are multivariate polynomials whose variables are the entries of $\mathbf{h} \in \mathbb{R}^{S\times K}$. Moreover, every entry is the sum of monomials of degree $K$. Each monomial is a constant times $\mathbf{h}_{1,i_1}\cdots \mathbf{h}_{K,i_K}$, for some $\mathbf{i} \in [S]^K$.

Notice that any monomial $\mathbf{h}_{1,i_1}\cdots \mathbf{h}_{K,i_K}$ is the entry $P(\mathbf{h})_{\mathbf{i}}$ in the tensor $P(\mathbf{h})$. Therefore every polynomial in the previous proposition takes the form $\sum_{\mathbf{i}\in[S]^K} c_{\mathbf{i}} P(\mathbf{h})_{\mathbf{i}}$ for some constants $(c_{\mathbf{i}})_{\mathbf{i}\in[S]^K}$ independent of $\mathbf{h}$. In words, every entry of the matrix $M_1(\mathbf{h}_1)M_2(\mathbf{h}_2)\cdots M_K(\mathbf{h}_K)$ is obtained by applying a linear form to $P(\mathbf{h})$. Moreover, the polynomial coefficients defining the linear form are uniquely determined by the linear maps $M_1, \dots, M_K$. This leads to the following statement.

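Corollary 5.2. Tensorial Lifting
Let $M_k$, $k \in [K]$, be as in (1.2). The map

$$(\mathbf{h}_1, \dots, \mathbf{h}_K) \longmapsto M_1(\mathbf{h}_1)M_2(\mathbf{h}_2)\cdots M_K(\mathbf{h}_K)$$

uniquely determines a linear map

$$\mathcal{A} : \mathbb{R}^{S^K} \longrightarrow \mathbb{R}^{m\times n}$$

such that for all $\mathbf{h} \in \mathbb{R}^{S\times K}$

(5.1) $$M_1(\mathbf{h}_1)M_2(\mathbf{h}_2)\cdots M_K(\mathbf{h}_K) = \mathcal{A}P(\mathbf{h}).$$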

We call (5.1) and its use the tensorial lifting. When $K = 1$, we simply have $\mathcal{A} = M_1$. When $K = 2$ it corresponds to the usual lifting already exploited to establish stability results for phase recovery, blind-deconvolution, self-calibration, sparse coding, etc. Notice that, when $K \ge 2$, it may be difficult to provide a closed form expression for the operator $\mathcal{A}$. We can however determine simple properties of $\mathcal{A}$. In most reasonable cases, $\mathcal{A}$ is sparse. If the operators $M_k$ simply embed the values of $\mathbf{h}$ in a matrix, the matrix representing $\mathcal{A}$ only contains zeros and ones. Since the operators $M_k$ are known, we can compute $\mathcal{A}P(\mathbf{h})$, for any $\mathbf{h} \in \mathbb{R}^{S\times K}$, using (5.1). Said differently, we can compute $\mathcal{A}$ for any rank 1 tensor. Therefore, since $\mathcal{A}$ is linear, we can compute $\mathcal{A}T$ for any low rank tensor $T$. If the dimensions of the problem permit, one can manipulate $\mathcal{A}$ in a basis of $\mathbb{R}^{S^K}$.
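For very small dimensions, the matrix of $\mathcal{A}$ in the canonical basis of $\mathbb{R}^{S^K}$ can be assembled by evaluating the product on canonical basis vectors, since each basis tensor is of rank 1. The following sketch assumes hypothetical operators $M_k(v) = \sum_i v_i B_{k,i}$ with fixed random matrices $B_{k,i}$; the array `B`, the sizes and the helper names are illustrative and not taken from the paper.

```python
import itertools
from functools import reduce
import numpy as np

rng = np.random.default_rng(0)
S, K, d = 2, 3, 3                          # illustrative sizes: d x d factors

# Hypothetical linear operators M_k(v) = sum_i v[i] * B[k, i].
B = rng.standard_normal((K, S, d, d))

def M(k, v):
    return np.tensordot(v, B[k], axes=1)   # shape (d, d)

def product(h):
    """M_1(h_1) M_2(h_2) ... M_K(h_K) for h of shape (S, K)."""
    return reduce(np.matmul, [M(k, h[:, k]) for k in range(K)])

def segre(h):
    return reduce(np.multiply.outer, [h[:, k] for k in range(K)])

# Column of A for the multi-index (i_1, ..., i_K): the product evaluated at
# the canonical basis vectors e_{i_1}, ..., e_{i_K}.
E = np.eye(S)
A = np.zeros((d * d, S ** K))
for col, idx in enumerate(itertools.product(range(S), repeat=K)):
    A[:, col] = reduce(np.matmul, [M(k, E[i]) for k, i in enumerate(idx)]).ravel()

h = rng.standard_normal((S, K))
print(np.allclose(A @ segre(h).ravel(), product(h).ravel()))   # True, i.e. (5.1)
```

The columns are enumerated in the same row-major order that `ravel()` uses for the tensor, which is what makes the matrix-vector product agree with (5.1).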

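Since $\operatorname{rk}(\mathcal{A})$ is an important quantity, we emphasize that $\operatorname{rk}(\mathcal{A}) \le mn$. It is also possible to compute $\operatorname{rk}(\mathcal{A})$, when $mn$ is not too large, using the following proposition.

Proposition 5.3. For $R$ independent random $\mathbf{h}^r$, with $r = 1..R$, drawn according to the normal distribution in $\mathbb{R}^{S\times K}$, we have with probability 1

(5.2) $$\dim\big(\operatorname{Span}((\mathcal{A}P(\mathbf{h}^r))_{r=1..R})\big) = \begin{cases} R & \text{if } R \le \operatorname{rk}(\mathcal{A}), \\ \operatorname{rk}(\mathcal{A}) & \text{otherwise.} \end{cases}$$

The proof is in Appendix 10.5.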
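Proposition 5.3 suggests a simple numerical procedure: draw Gaussian parameters, evaluate the matrix product (which equals $\mathcal{A}P(\mathbf{h}^r)$ by (5.1)), and compute the rank of the stacked samples. The sketch below assumes hypothetical operators $M_k(v) = \sum_i v_i B_{k,i}$ and, for comparison only, also builds the explicit matrix of $\mathcal{A}$, which is feasible here because $S^K$ is tiny. Numerical ranks depend on the tolerance of `np.linalg.matrix_rank`.

```python
import itertools
from functools import reduce
import numpy as np

rng = np.random.default_rng(1)
S, K, d = 2, 3, 3                          # illustrative sizes
B = rng.standard_normal((K, S, d, d))      # hypothetical operators M_k(v) = sum_i v[i] * B[k, i]

def product(h):
    """M_1(h_1) ... M_K(h_K), i.e. A P(h) by (5.1), flattened to a vector."""
    factors = [np.tensordot(h[:, k], B[k], axes=1) for k in range(K)]
    return reduce(np.matmul, factors).ravel()

def sampled_rank(R):
    """Dimension of the span of R samples A P(h^r) with Gaussian h^r."""
    samples = np.stack([product(rng.standard_normal((S, K))) for _ in range(R)])
    return np.linalg.matrix_rank(samples)

# Reference value: rank of the explicit matrix of A (canonical basis tensors).
E = np.eye(S)
A = np.stack([product(E[:, list(idx)])
              for idx in itertools.product(range(S), repeat=K)], axis=1)
rank_A = np.linalg.matrix_rank(A)

for R in (1, 4, 20):
    print(R, sampled_rank(R), min(R, rank_A))   # the last two columns should agree
```

In practice one would increase $R$ until the sampled rank stops growing, without ever forming the matrix of $\mathcal{A}$.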

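Using Corollary 5.2, when (2.1) has a minimizer, we rewrite it in the form

(5.3) $$\mathbf{h}^* \in \operatorname{argmin}_{L\in\mathbb{N},\ \mathbf{h}\in\mathcal{M}^L} \|\mathcal{A}P(\mathbf{h}) - X\|^2.$$

We now decompose this problem into two sub-problems: a least-squares problem

(5.4) $$T^* \in \operatorname{argmin}_{T\in\mathbb{R}^{S^K}} \|\mathcal{A}T - X\|^2$$

and a non-convex problem

(5.5) $$\mathbf{h}'^* \in \operatorname{argmin}_{L\in\mathbb{N},\ \mathbf{h}\in\mathcal{M}^L} \|\mathcal{A}(P(\mathbf{h}) - T^*)\|^2.$$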

Proposition 5.4. Let $X$ and $\mathcal{A}$ be such that (2.1) has a minimizer:

1. Let $\mathbf{h}^*$ be a solution of (5.3). Then, for any solution $T^*$ of (5.4), $\mathbf{h}^*$ also minimizes (5.5).

2. Let $T^*$ be a solution of (5.4) and $\mathbf{h}'^*$ a solution of (5.5). Then, $\mathbf{h}'^*$ also minimizes (5.3).

The proof is in Appendix 10.6.

From now on, because of the equivalence between solutions of (5.5) and (5.3), we stop using the notation $\mathbf{h}'^*$ and write $\mathbf{h}^* \in \operatorname{argmin}_{L\in\mathbb{N},\ \mathbf{h}\in\mathcal{M}^L} \|\mathcal{A}(P(\mathbf{h}) - T^*)\|^2$.

6. Identifiability (error free case). Throughout this section, we assume that $X$ is such that there exist $L$ and $\mathbf{h} \in \mathcal{M}^L$ such that

(6.1) $$X = M_1(\mathbf{h}_1)\cdots M_K(\mathbf{h}_K).$$

Under this assumption, $X = \mathcal{A}P(\mathbf{h})$, so

$$P(\mathbf{h}) \in \operatorname{argmin}_{T\in\mathbb{R}^{S^K}} \|\mathcal{A}T - X\|^2.$$

Moreover, we trivially have $P(\mathbf{h}) \in \Sigma_1$ and therefore $\mathbf{h}$ minimizes (5.5), (2.1) and (5.3). As a consequence, (2.1) has a minimizer.

We ask whether there exist guarantees that the resolution of (2.1) allows one to recover $\mathbf{h}$ up to the usual uncertainties.

In this regard, for any $\mathbf{h}' \in \langle\mathbf{h}\rangle$, we have $P(\mathbf{h}') = P(\mathbf{h})$ and therefore $\mathcal{A}P(\mathbf{h}') = \mathcal{A}P(\mathbf{h}) = X$. Thus, unless we make further assumptions on $\mathbf{h}$, we cannot expect to distinguish any particular element of $\langle\mathbf{h}\rangle$ using only $X$. In other words, recovering $\langle\mathbf{h}\rangle$ is the best we can hope for.

Definition 6.1. Identifiability
We say that $\langle\mathbf{h}\rangle$ is identifiable if the elements of $\langle\mathbf{h}\rangle$ are the only solutions of (2.1).
We say that $\mathcal{M}$ is identifiable if for every $L \in \mathbb{N}$ and every $\mathbf{h} \in \mathcal{M}^L$, $\langle\mathbf{h}\rangle$ is identifiable.

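Proposition 6.2. Characterization of the global minimizers
For any $L^* \in \mathbb{N}$ and any $\mathbf{h}^* \in \mathcal{M}^{L^*}$,

$$(L^*, \mathbf{h}^*) \in \operatorname{argmin}_{L\in\mathbb{N},\ \mathbf{h}\in\mathcal{M}^L} \|\mathcal{A}P(\mathbf{h}) - X\|^2$$

if and only if

$$P(\mathbf{h}^*) \in P(\mathbf{h}) + \operatorname{Ker}(\mathcal{A}).$$

The proposition is proved in Appendix 10.7.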

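In order to state the following proposition, we define for any $L$ and $L' \in \mathbb{N}$

$$P(\mathcal{M}^L) - P(\mathcal{M}^{L'}) := \big\{P(\mathbf{h}) - P(\mathbf{g}) \mid \mathbf{h} \in \mathcal{M}^L \text{ and } \mathbf{g} \in \mathcal{M}^{L'}\big\} \subset \mathbb{R}^{S^K}.$$

Proposition 6.3. Necessary and sufficient conditions of identifiability

1. For any $L$ and $\mathbf{h} \in \mathcal{M}^L$: $\langle\mathbf{h}\rangle$ is identifiable if and only if for any $L' \in \mathbb{N}$
$$\big(P(\mathbf{h}) + \operatorname{Ker}(\mathcal{A})\big) \cap P(\mathcal{M}^{L'}) \subset \{P(\mathbf{h})\}.$$

2. $\mathcal{M}$ is identifiable if and only if for any $L$ and $L' \in \mathbb{N}$
(6.2) $$\operatorname{Ker}(\mathcal{A}) \cap \big(P(\mathcal{M}^L) - P(\mathcal{M}^{L'})\big) \subset \{0\}.$$

The proposition is proved in Appendix 10.8.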

In the context of the usual compressed sensing (i.e., when $K = 1$, $\mathcal{M}^L$ contains $L$-sparse signals, $\mathcal{A}$ is a rectangular matrix with full row rank and $X$ is a vector), the proposition is already stated in Lemma 3.1 of [25].

In reasonably small cases and when $P(\mathcal{M})$ is algebraic, one can use tools from numerical algebraic geometry such as those described in [39,40] to check whether the condition (6.2) holds or not. The drawback of Proposition 6.3 is that, given a deep structured linear network as described by $\mathcal{A}$, the condition (6.2) might be difficult to verify.

We therefore establish simpler conditions related to the identifiability of $\mathcal{M}$. First we establish a condition such that for almost every $\mathcal{A}$ satisfying it, $\mathcal{M}$ is identifiable. The main benefit of this condition is that its constituents can be computed in many practical situations.

Before that, we recall a few facts of algebraic geometry. For $X, Y \subset \mathbb{R}^N$, the join of $X$ and $Y$ (see, e.g., [38, Ex. 8.1]) is

$$J(X, Y) := \{sx + ty \mid x \in X,\ y \in Y,\ s, t \in \mathbb{R}\}.$$

If for all $L \in \mathbb{N}$, $\mathcal{M}^L$ is Zariski closed and invariant under rescaling (e.g., if they are all linear spaces), then $P(\mathcal{M}^L) - P(\mathcal{M}^{L'})$ is a Zariski open subset of $J(P(\mathcal{M}^L), P(\mathcal{M}^{L'}))$. In general, it is contained in this join.

Recall the following fact (*): for complex algebraic varieties $X, Y \subset \mathbb{C}^N$, any component $Z$ of $X \cap Y$ has $\dim(Z) \ge \dim(X) + \dim(Y) - N$, and equality holds generically (we make "generically" precise in our context below). Moreover, if $X, Y$ are invariant under rescaling, since $0 \in X \cap Y$, we have $X \cap Y \neq \emptyset$. (See, e.g., [72, §I.6.2].)

This intersection result indicates that if there exist $L, L'$ such that

$$\operatorname{rk}(\mathcal{A}) < \dim\big(P(\mathcal{M}^L) - P(\mathcal{M}^{L'})\big)$$

we expect to have non-identifiability; and if the rank is larger, for all pairs $L, L'$, we expect identifiability. More precisely:

Theorem 6.4. Almost surely sufficient condition for Identifiability
For almost every $\mathcal{A}$ such that

$$\operatorname{rk}(\mathcal{A}) \ge \dim J\big(P(\mathcal{M}^L), P(\mathcal{M}^{L'})\big), \quad \text{for all } L, L',$$

$\mathcal{M}$ is identifiable.

The theorem is proved in Appendix 10.9.

Since $\dim J(P(\mathcal{M}^L), P(\mathcal{M}^{L'})) \le \dim P(\mathcal{M}^L) + \dim P(\mathcal{M}^{L'})$, if $D_{max}$ is the maximum dimension of $P(\mathcal{M}^L)$ over all $L$, one has the same conclusion if $\operatorname{rk}(\mathcal{A}) \ge 2D_{max}$.

When $K = 1$, we illustrate this result by interpreting it in the context of compressive sensing, where $\mathbf{h}$ is a vector, $X$ is a vector, $\mathcal{A}$ is a rectangular sampling matrix of full row rank and $\operatorname{Ker}(\mathcal{A})$ is large. The statement analogous to Theorem 6.4 in the compressive sensing framework takes the form: "For almost every sampling matrix, any $L$-sparse signal $\mathbf{h}$ can be recovered from $\mathcal{A}\mathbf{h}$ as soon as $2L \le \operatorname{rk}(\mathcal{A})$." Moreover, the constituents of the $\ell^0$ minimization model used to recover the signal are also the constituents of (5.3). Again, the main novelty is to extend this result to the identifiability of the factors of deep matrix products.

In order to establish a necessary condition for identifiability, first note that if we extend $P(\mathcal{M}^L) - P(\mathcal{M}^{L'})$ to be scale invariant, this will not affect whether or not it intersects $\operatorname{Ker}(\mathcal{A})$ outside of the origin. We immediately conclude that, in the complex setting where $\mathcal{M}^L$ and $\mathcal{M}^{L'}$ are both Zariski closed, $\mathcal{M}$ is non-identifiable whenever $\operatorname{rk}(\mathcal{A}) < \dim\big(P(\mathcal{M}^L) - P(\mathcal{M}^{L'})\big)$. This indicates that we should always expect non-identifiability whenever $\operatorname{rk}(\mathcal{A}) < \dim\big(P(\mathcal{M}^L) - P(\mathcal{M}^{L'})\big)$, but it is not adequate to prove it because real algebraic varieties need not satisfy (*). However it is true for real linear spaces, so we immediately conclude the following weak result:

Theorem 6.5. Necessary condition for Identifiability
Let $C\big(P(\mathcal{M}^L) - P(\mathcal{M}^{L'})\big)$ be the set of all points on all lines through the origin intersecting $P(\mathcal{M}^L) - P(\mathcal{M}^{L'})$, and let $q$ be the maximal dimension of a linear space on $C\big(P(\mathcal{M}^L) - P(\mathcal{M}^{L'})\big)$. Then if $q > \operatorname{rk}(\mathcal{A})$, $\mathcal{M}$ is not identifiable. In particular, when the $P(\mathcal{M}^L)$'s contain linear spaces and if we let $S'$ be the largest dimension of these vector spaces, if $2S' > \operatorname{rk}(\mathcal{A})$, then $\mathcal{M}$ is not identifiable.

Let us illustrate the theorems by considering a deep feed-forward ReLU network and the structured linear network obtained by fixing the action of ReLU (as is done in Section 1.2.2). The matrix $X$ contains the outputs, and the operator $M_K$ multiplies the matrix containing the inputs by the weights between the first and the second layer. For every input/output pair, the action of ReLU is different and removes paths from the input entries to the output entries. We assume however that every entry of every output is reached by at least one path in the network starting at a non-zero entry of the input. In that case, it is not difficult to see that $\mathcal{A}$ is a surjection and therefore $\operatorname{rk}(\mathcal{A}) = mn$, where $m$ is the size of the output and $n$ is the number of learning samples.

The condition in Theorem 6.4 becomes

$$mn \ge \dim(\Sigma_2) = 2K(S-1) + 2,$$

and $KS$ is typically the number of parameters of the network. The intuition behind Theorem 6.4 is that, if the action of ReLU is sufficiently random and if the above inequality holds, we can expect the network to be identifiable with high probability (see footnote 9). This situation corresponds to an under-parameterized case (favorable for identifiability).

The condition in Theorem 6.5 is

$$2S > mn.$$

When this inequality holds the network is not identifiable. It corresponds to an over-parameterized configuration. In the intermediate situation, when $2S \le mn < 2K(S-1) + 2$, and when the action of the activation function does not introduce sufficient randomness, the theorems are inconclusive.
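The three regimes of this example can be summarized by a small helper. This is only a sketch of the arithmetic above: it assumes, as in the text, that $\mathcal{A}$ is surjective so that $\operatorname{rk}(\mathcal{A}) = mn$, and the function name and return labels are ours.

```python
def identifiability_regime(m, n, K, S):
    """Compare rk(A) = m * n with the thresholds of Theorems 6.4 and 6.5.

    Assumes A is surjective (rk(A) = m * n), as in the ReLU example above.
    """
    rank = m * n
    if rank >= 2 * K * (S - 1) + 2:
        # Theorem 6.4: identifiable for almost every A of this rank.
        return "generically identifiable"
    if 2 * S > rank:
        # Theorem 6.5: not identifiable.
        return "not identifiable"
    return "inconclusive"

print(identifiability_regime(m=10, n=100, K=3, S=50))   # generically identifiable
print(identifiability_regime(m=10, n=20, K=3, S=50))    # inconclusive
print(identifiability_regime(m=1, n=10, K=3, S=50))     # not identifiable
```

Since $2K(S-1)+2 \ge 2S$ whenever $K \ge 1$ and $S \ge 1$, the first two branches can never both apply.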

Notice that such networks can also be analyzed using Proposition 6.3. It is indeed not difficult to see that if there exist two paths that: (1) start from the same entry of the input layer; (2) end at the same entry of the output layer; (3) are both present (despite the action of ReLU) for every input/output pair; then (6.2) does not hold (see footnote 10) and $\mathbb{R}^{S\times K}$ is not identifiable. It is not clear at this point that the conditions 1, 2, 3 are met by all non-identifiable structured linear feed-forward networks. However, removing paths from the network (as is done by ReLU and Dropout) is a way to prevent conditions 1, 2, 3 from being met.

9. This statement gives the intuition behind Theorem 6.4 but it should be made precise, as emphasized in the perspectives of this paper.

10. Simply consider two rank one tensors, each tensor being a Dirac at the position corresponding to one of the two paths.

7. Stability guarantee. In this section, we consider errors of different natures. We assume that there exist $L$ and $L^* \in \mathbb{N}$, $\mathbf{h} \in \mathcal{M}^L$ and $\mathbf{h}^* \in \mathcal{M}^{L^*}$, such that

(7.1) $$\|M_1(\mathbf{h}_1)\cdots M_K(\mathbf{h}_K) - X\| \le \delta,$$

and

(7.2) $$\|M_1(\mathbf{h}^*_1)\cdots M_K(\mathbf{h}^*_K) - X\| \le \eta,$$

for $\delta$ and $\eta$ typically small.

Again, this corresponds to existing unknown parameters $\mathbf{h}$ that we estimate from a noisy observation $X$, using an inaccurate solution $\mathbf{h}^*$ of (2.1) (as in [13] where the case $K = 1$ is studied). Otherwise, $\mathbf{h}$ and $\mathbf{h}^*$ shall be interpreted as different learned parameters; $\delta$ and $\eta$ are the corresponding risks.

Notice that the above hypothesis does not even require (2.1) to have a solution. Algorithms which do not come with a guarantee sometimes manage to reach small $\delta$ and $\eta$ values. In those cases, the analysis we conduct in this section yields a stability guarantee, despite the lack of a guarantee for the algorithm. Finally, the hypotheses (7.1) and (7.2) enable one to obtain guarantees for algorithms that, instead of minimizing (2.1), minimize an objective function which approximates the one in (2.1). This is particularly relevant for machine learning applications, where (2.1) can be an empirical risk that needs to be regularized or is not truly minimized (for instance, when using dropout [75]).

A necessary and sufficient condition for the identifiability of $\mathcal{M}$ is stated in Proposition 6.3. The condition is on the way $\operatorname{Ker}(\mathcal{A})$ and $P(\mathcal{M}^L) - P(\mathcal{M}^{L'})$ intersect. In order to get a stability guarantee, we need a stronger condition on the geometry of this intersection to hold for every $L$ and $L' \in \mathbb{N}$. This condition is provided in the next definition.

Definition 7.1. Deep-Null Space Property
Let $\gamma > 0$ and $\rho > 0$. We say that $\operatorname{Ker}(\mathcal{A})$ satisfies the deep-Null Space Property (deep-NSP) with respect to the collection of models $\mathcal{M}$ with constants $(\gamma, \rho)$ if for any $L$ and $L' \in \mathbb{N}$, any $T \in P(\mathcal{M}^L) - P(\mathcal{M}^{L'})$ satisfying $\|\mathcal{A}T\| \le \rho$ and any $T' \in \operatorname{Ker}(\mathcal{A})$, we have

(7.3) $$\|T\| \le \gamma \|T - T'\|.$$

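The deep-NSP implies that, for $T \in P(\mathcal{M}^L) - P(\mathcal{M}^{L'})$ close to $\operatorname{Ker}(\mathcal{A})$ in the sense that $\|\mathcal{A}T\| \le \rho$, we must have, by decomposing $T = T' + T''$, with $T' \in \operatorname{Ker}(\mathcal{A})$ and $T''$ in its orthogonal complement,

$$\|T\| \le \gamma\|T - T'\| = \gamma\|T''\| \le \frac{\gamma}{\sigma_{min}}\|\mathcal{A}T''\| \le \frac{\gamma}{\sigma_{min}}\rho,$$

where $\sigma_{min}$ is the smallest non-zero singular value of $\mathcal{A}$. In words, $\|T\|$ must be small. We can conclude that under the deep-NSP, $P(\mathcal{M}^L) - P(\mathcal{M}^{L'})$ and $\{T \in \mathbb{R}^{S^K} \mid \|\mathcal{A}T\| \le \rho\}$ intersect at most in the neighborhood of 0.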
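The step $\|T''\| \le \|\mathcal{A}T''\| / \sigma_{min}$ used in this chain is the standard property of the component orthogonal to the kernel. A minimal numerical sketch, with a random matrix standing in for the operator $\mathcal{A}$ (an assumption made only for the illustration), is:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 10))            # stand-in for the flattened operator
U, s, Vt = np.linalg.svd(A, full_matrices=True)
r = int(np.sum(s > 1e-12 * s[0]))           # numerical rank
sigma_min = s[r - 1]                        # smallest non-zero singular value

T = rng.standard_normal(10)
T_perp = Vt[:r].T @ (Vt[:r] @ T)            # T'': component orthogonal to Ker(A)
T_ker = T - T_perp                          # T': component in Ker(A)

print(np.allclose(A @ T_ker, 0))                                            # True
print(np.linalg.norm(T_perp) <= np.linalg.norm(A @ T) / sigma_min + 1e-12)  # True
```

The second check uses $\mathcal{A}T = \mathcal{A}T''$, since $T'$ lies in the kernel.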

Additionally, (7.3) implies that in the neighborhood of 0, $\operatorname{Ker}(\mathcal{A})$ and $P(\mathcal{M}^L) - P(\mathcal{M}^{L'})$ are not tangential, i.e., their intersection is transverse.

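If $\operatorname{Ker}(\mathcal{A})$ satisfies the deep-NSP with respect to the collection of models $\mathcal{M}$ with constants $(\gamma, \rho)$, then for all $T' \in \operatorname{Ker}(\mathcal{A})$ and all $T \in P(\mathcal{M}^L) - P(\mathcal{M}^{L'})$ satisfying $\|\mathcal{A}T\| \le \rho$,

$$\|T'\| \le \|T\| + \|T' - T\| \le (\gamma + 1)\|T' - T\|.$$

Therefore,

(7.4) $$\forall T' \in \operatorname{Ker}(\mathcal{A}), \quad \|T'\| \le (\gamma + 1)\, d_{loc}\big(T', P(\mathcal{M}^L) - P(\mathcal{M}^{L'})\big),$$

where we have set for any $C \subset \mathbb{R}^{S^K}$

$$d_{loc}(T', C) = \inf_{T\in C,\ \|\mathcal{A}T\|\le\rho} \|T' - T\|.$$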

The converse is also true: if $\operatorname{Ker}(\mathcal{A})$ satisfies (7.4), it satisfies the deep-NSP with respect to the collection of models $\mathcal{M}$ with appropriate constants. In the context of the usual compressed sensing (i.e., when $K = 1$, $\mathcal{M}^L$ contains $L$-sparse signals, $\mathcal{A}$ is a rectangular matrix with full row rank and $X$ is a vector), the localization appearing in $d_{loc}$ can be discarded since the inequality must hold when $T'$ is small and since in this case this localization has no effect. Therefore, in the compressed sensing context, (7.4) (and therefore deep-NSP) is the usual Null Space Property with respect to $L$-sparse vectors, as defined in [25]. However, deep-NSP is generalized to take into account deep structured linear networks. This motivates the name.

In the general case, the deep-NSP can be understood as a local version of the generalized-NSP for $\mathcal{A}$ relative to $P(\cup_{L\in\mathbb{N}}\mathcal{M}^L) - P(\cup_{L\in\mathbb{N}}\mathcal{M}^L)$, as defined in [13]. Our interest in locality (as imposed by the constraint $\|\mathcal{A}T\| \le \rho$) is motivated by the fact that we want to use the deep-NSP when the signal to noise ratio is controlled (i.e., the hypotheses of Theorem 4.3 are satisfied). The condition for the stability property therefore includes such hypotheses.

We have not adapted the robust-NSP defined in [13]. The benefit in not using this definition is to obtain a simpler definition for deep-NSP. In particular, (7.3) does not involve the geometry of $\mathcal{A}$ in the orthogonal complement of $\operatorname{Ker}(\mathcal{A})$. Looking in detail at the benefit of this adaptation is of great interest.

Finally, we trivially have the following facts:

• If $\operatorname{Ker}(\mathcal{A}) = \{0\}$, then $\operatorname{Ker}(\mathcal{A})$ satisfies the deep-NSP with respect to the model $\mathbb{R}^{S\times K}$ with constants $(1, +\infty)$.

• For any $\gamma' \ge \gamma$: If $\operatorname{Ker}(\mathcal{A})$ satisfies the deep-NSP with respect to the collection of models $\mathcal{M}$ with constants $(\gamma, \rho)$, then $\operatorname{Ker}(\mathcal{A})$ satisfies the deep-NSP with respect to the collection of models $\mathcal{M}$ with constants $(\gamma', \rho)$.

• For any $\mathcal{M}' \subset \mathcal{M}$: If $\operatorname{Ker}(\mathcal{A})$ satisfies the deep-NSP with respect to the collection of models $\mathcal{M}$ with constants $(\gamma, \rho)$, then $\operatorname{Ker}(\mathcal{A})$ satisfies the deep-NSP with respect to the collection of models $\mathcal{M}'$ with constants $(\gamma, \rho)$. In particular, if $\operatorname{Ker}(\mathcal{A})$ satisfies the deep-NSP with respect to the model $\mathbb{R}^{S\times K}$ with constants $(\gamma, \rho)$, it satisfies the deep-NSP with respect to any collection of models, with constants $(\gamma, \rho)$.

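Theorem 7.2. Sufficient condition for the stability property
Assume $\operatorname{Ker}(\mathcal{A})$ satisfies the deep-NSP with respect to the collection of models $\mathcal{M}$ and with the constants $(\gamma, \rho)$. For any $\mathbf{h}^*$ as in (7.2) with $\eta$ and $\delta$ (see (7.2) and (7.1)) such that $\delta + \eta \le \rho$, we have

$$\|P(\mathbf{h}^*) - P(\mathbf{h})\| \le \frac{\gamma}{\sigma_{min}}(\delta + \eta),$$

where $\sigma_{min}$ is the smallest non-zero singular value of $\mathcal{A}$. Moreover, if $\mathbf{h} \in \mathbb{R}^{S\times K}_*$ and $\frac{\gamma}{\sigma_{min}}(\delta + \eta) \le \frac{1}{2}\max\big(\|P(\mathbf{h})\|_\infty, \|P(\mathbf{h}^*)\|_\infty\big)$, then

(7.5) $$d_p(\langle\mathbf{h}^*\rangle, \langle\mathbf{h}\rangle) \le 7 (KS)^{\frac{1}{p}} \frac{\gamma}{\sigma_{min}} \min\Big(\|P(\mathbf{h})\|_\infty^{\frac{1}{K}-1}, \|P(\mathbf{h}^*)\|_\infty^{\frac{1}{K}-1}\Big)(\delta + \eta).$$

The first part of the proof is very similar to standard proofs in the Compressed Sensing and stable recovery literature. The second part simply uses Theorem 4.3. The theorem is proved in Appendix 10.10.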
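To make the role of each constant concrete, the right-hand side of (7.5) can be evaluated as below. This is a sketch of the formula only: the function name and argument names are ours, the deep-NSP constants $(\gamma, \rho)$ and $\sigma_{min}$ are assumed to be known, and both sup norms are assumed to be positive.

```python
import math

def stability_bound(K, S, p, gamma, sigma_min, rho, norm_Ph, norm_Phstar, delta, eta):
    """Right-hand side of (7.5), or None when a hypothesis of Theorem 7.2 fails.

    norm_Ph and norm_Phstar stand for ||P(h)||_inf and ||P(h*)||_inf (both > 0).
    """
    err = delta + eta
    tensor_bound = gamma / sigma_min * err          # bound on ||P(h*) - P(h)||
    if err > rho or tensor_bound > 0.5 * max(norm_Ph, norm_Phstar):
        return None
    expo = 1.0 / K - 1.0
    scale = min(norm_Ph ** expo, norm_Phstar ** expo)
    return 7.0 * (K * S) ** (1.0 / p) * scale * tensor_bound

# p = infinity gives (KS)^(1/p) = 1.
print(stability_bound(2, 2, math.inf, 1.0, 1.0, 1.0, 1.0, 1.0, 0.1, 0.1))
# Errors larger than rho fall outside the theorem.
print(stability_bound(2, 2, math.inf, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6, 0.6))   # None
```

The bound scales linearly with $\delta + \eta$ and with $\gamma/\sigma_{min}$, which is the sense in which the deep-NSP constants control the stability.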

Theorem 7.2 provides a sufficient condition to obtain stability. The only significant hypothesis made on the deep structured linear network is that $\operatorname{Ker}(\mathcal{A})$ satisfies the deep-NSP with respect to the collection of models $\mathcal{M}$. One might ask whether this hypothesis is sharp or not. The next theorem shows that the answer to this question is positive.

Theorem 7.3. Necessary condition for the stability property
Assume the stability property holds: There exist $C$ and $\delta > 0$ such that for any $L \in \mathbb{N}$, $\mathbf{h} \in \mathcal{M}^L$, any $X = \mathcal{A}P(\mathbf{h}) + e$, with $\|e\| \le \delta$, any $L^* \in \mathbb{N}$ and any $\mathbf{h}^* \in \mathcal{M}^{L^*}$ such that

$$\|\mathcal{A}P(\mathbf{h}^*) - X\|^2 \le \|e\|$$

we have

$$d_2(\langle\mathbf{h}^*\rangle, \langle\mathbf{h}\rangle) \le C \min\Big(\|P(\mathbf{h})\|_\infty^{\frac{1}{K}-1}, \|P(\mathbf{h}^*)\|_\infty^{\frac{1}{K}-1}\Big)\|e\|.$$

Then, $\operatorname{Ker}(\mathcal{A})$ satisfies the deep-NSP with respect to the collection of models $\mathcal{M}$ with constants

$$(\gamma, \rho) = \big(C S^{\frac{K-1}{2}} \sqrt{K}\, \sigma_{max},\ \delta\big),$$

where $\sigma_{max}$ is the spectral radius of $\mathcal{A}$.

The first part of the proof is inspired by and similar to the proof of the analogous converse statement in [25]. The second part simply uses Theorem 4.5. The theorem is proved in Appendix 10.11.

The sharpness of the known results when $K = 2$ is usually argued by comparing the number of samples necessary for the recovery and the information theoretic limit of the problem. As far as the authors know, the above theorem is therefore new even when $K = 2$.

As is usually the case with the Null Space Property or the Restricted Isometry Property, it will often be difficult or impossible to establish that a particular operator $\mathcal{A}$ satisfies the deep-NSP with respect to the collection of models $\mathcal{M}$. To find favorable cases, we need to consider random operators $\mathcal{A}$ such that the distribution of $\mathcal{A}$ enables one to establish that the deep-NSP holds with high probability, when in the right configurations (see the bibliography in Section 2, whose references contain many examples of such arguments). The most common distributions for the analog of $\mathcal{A}$ include operators/matrices whose coefficients are Gaussian or Bernoulli. Collections of models inducing sparsity, non-negativity, low-rank constraints etc. are the most studied. In this regard, the fact that there exists a low complexity test guaranteeing that the networks considered in Section 8 can be stably recovered is an exception.

8. Application to convolutional linear network. We consider a convolutional linear network as depicted in Figure 1. The network typically aims at performing a linear analysis or synthesis of a signal living in $\mathbb{R}^N$. The considered convolutional linear network is defined from a rooted directed acyclic graph $G(E, N)$ composed of

Fig. 1. Example of the considered convolutional linear network. To every edge is attached a convolution kernel
