HAL Id: hal-01494267
https://hal.archives-ouvertes.fr/hal-01494267v4
Submitted on 3 May 2019
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of
sci-entific research documents, whether they are
pub-lished or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destinée au dépôt et à la diffusion de documents
scientifiques de niveau recherche, publiés ou non,
émanant des établissements d’enseignement et de
recherche français ou étrangers, des laboratoires
publics ou privés.
Multilinear compressive sensing and an application to
convolutional linear networks
François Malgouyres, Joseph Landsberg
To cite this version:
François Malgouyres, Joseph Landsberg. Multilinear compressive sensing and an application to
con-volutional linear networks. SIAM Journal on Mathematics of Data Science, Society for Industrial and
Applied Mathematics, 2019, 1 (3), pp.446-475. �hal-01494267v4�
TO CONVOLUTIONAL LINEAR NETWORKS 2
FRANC¸ OIS MALGOUYRES † AND JOSEPH LANDSBERG ‡
3
Abstract. We study a deep linear network endowed with the following structure: a matrix X
4
is obtained by multiplying K matrices (called factors and corresponding to the action of the layers).
5
The action of each layer (i.e. factor) is obtained by applying a fixed linear operator to a vector of
6
parameters satisfying a constraint. The number of layers is not limited. Assuming that X is given
7
and factors have been estimated, the error between the product of the estimated factors and X (i.e.
8
the reconstruction error) is either the statistical or the empirical risk.
9
We provide necessary and sufficient conditions on the network topology under which a stability
10
property holds. The stability property requires that the error on the parameters defining the
near-11
optimal factors scales linearly with the reconstruction error (i.e. the risk). Therefore, under these
12
conditions on the network topology, any successful learning task leads to stably defined features that
13
can be interpreted.
14
In order to do so, we first evaluate how the Segre embedding and its inverse distort distances.
15
Then, we show that any deep structured linear network can be cast as a generic multilinear problem
16
that uses the Segre embedding. This is the tensorial lifting. Using the tensorial lifting, we provide a
17
necessary and sufficient conditions for the identifiability of the factors up to a scale rearrangement.
18
We finally provide necessary and sufficient condition called deep-Null Space Property (because of the
19
analogy with the usual Null Space Property in the compressed sensing framework) which guarantees
20
that the stability property holds.
21
We illustrate the theory with a practical example where the deep structured linear network is a
22
convolutional linear network. We obtain a condition on the scattering of the supports which is strong
23
but not empty. A simple test on the network topology can be implemented to test if the condition
24
holds.
25
Key words. Interpretable learning, stable recovery, matrix factorization, deep linear networks,
26
convolutional networks.
27
AMS subject classifications.68T05; 90C99; 15-02
28
1. Introduction. 29
1.1. The aim of the paper. Deep learning has led to many practical break-30
throughs and has led to significant improvements and state of the art performances in 31
many fields such as computer vision, natural language processing, signal processing, 32
robotics etc. The range of applications grows at a strong pace. Despite these empir-33
ical successes, the theory supporting deep learning is still far from satisfactory. For 34
instance, sharp and accurate answers to the most natural questions on: the efficiency 35
of optimization algorithms when applied to the objective function minimized in deep 36
learning ([55,37, 22,23, 68,10, 11,44, 79]); and the expressiveness of the networks 37
([12,3,31,26,45,27,63,76]); and guarantees on the statistical risk ([42,34,80,69]) 38
for learned neural networks are still missing. This makes it difficult to optimize and 39
configure neural networks. Moreover, the absence of answers to these questions pre-40
vents certification that systems built with deep learning algorithms are robust. 41
The reasons explaining the outcome of a neural network are often difficult to 42
highlight ([8, 62, 41]). Even worse, despite the settings described in [6, 4, 14, 53, 43
∗Submitted to the editors 7/3/2018.
Funding: Joseph Landsberg is supported by NSF DMS-1405348 and AF1814254
†Institut de Math´ematiques de Toulouse ; UMR5219 Universit´e de Toulouse ; CNRS UPS IMT;
F-31062 Toulouse Cedex 9, France; ([email protected])
‡Department of Mathematics; Mailstop 3368;Texas A& M University; College Station, TX
71,83], the instability of the parameters optimizing the deep learning objective does 44
not allow the interpretation of the features defined by these parameters. This last 45
problem is the one we investigate in this work. 46
Our goal in this paper is to evaluate how far the architectures used in applications 47
are from architectures for which we can guarantee that the parameters returned by 48
the algorithm, and therefore the features defined using these parameters, are stably 49
defined. To do so, we consider two families of networks and establish necessary and 50
sufficient conditions on their topology guaranteeing that the features learned by the 51
algorithm are stably defined. 52
More precisely, we establish statements of the following form for two families of 53
deep networks. Below, the action of the network parameterized by h is denoted fh.
54
Informal theorem 1.1. Stability guarantee 55
We assume a known parameterized family of functions fhand a metric1d between
56
parameter pairs. We establish a necessary and sufficient condition on the family fh
57
guaranteeing that: 58
There exists a constant C > 0 such that for any input/output pairs I, X and any 59
pair of parameters h∗, h for which
60 δ =kX − fh∗(I)k, 61 and 62 η =kX − fh(I)k, 63
are sufficiently small, we have 64
(1.1) d(h, h∗)≤ C(δ + η).
65
Considering a regression problem, the values δ and η can be interpreted as the 66
statistical or the empirical risk for the parameters h and h∗. The inequality (1.1)
67
therefore guarantees that the set made of the parameters leading to a small risk has 68
a small diameter. The features defined using such parameters are therefore stably 69
defined. This seems to be the minimal condition allowing the interpretation of the 70
features. The condition on the family of functions fhis typically a condition on the
71
topology of the network. 72
In the Informal Theorem1.1, h and h∗ might have different roles. For instance, 73
if we know that the input/output pairs have been generated using a particular h, 74
possibly up to some error as modeled by δ, then (1.1) guarantees that h∗ is close to 75
h and provides a way to control the statistical risk. 76
The existing stability guarantees [6,4,14,53,71,83,60] consider this setting and 77
describe both a network topology and an algorithm whose output h∗ is guaranteed
78
to be close to h. In this study, we do not make any assumption on the construction 79
of h and h∗ and our objective is more modest. With regard to their objective, giving
80
a necessary and sufficient condition of stability plays the same role as a complexity 81
theory statement saying that a particular configuration is NP-hard. It rules out some 82
network topologies. 83
Notice that, when I and X are such that it is possible to have δ = η = 0, the 84
above stability guarantee implies that the minimizer of the network objective function 85
is unique. Theorem 6.5 and Theorem 6.4 will show that this uniqueness condition 86
is strongly related to the level of over-specification of the network. The simplified 87
1
and intuitive statement is that optimal solutions of overspecified networks are not 88
unique and are unstable. This explains the instability observed in applications. The 89
theorems analyze this property in detail. This might be viewed as a negative result 90
since overspecification is currently the main hypothesis of statements guaranteeing 91
the success of the neural network optimization. 92
1.2. The considered deep networks. 93
1.2.1. Overview. We consider two kinds of deep networks: A general family of 94
deep structured linear networks2 in Sections 6 and 7; and a family of convolutional
95
linear networks in Section8. The formal statements for the deep structured linear net-96
works are in Theorems7.2and7.3. The statements for convolutional linear networks 97
are in Theorem8.4. Below, we describe the deep structured linear networks. 98
1.2.2. Deep structured linear networks. The term deep linear network usu-99
ally corresponds to fully-connected feed-forward networks, without bias, in which the 100
activation function is the identity. In the general results described in this paper we 101
consider deep linear networks and provide two means to enforce some structure to the 102
network. As we describe below, the structures can be used to include: feed-forward 103
linear networks; convolutional linear networks (as is done in Section8); the action of a 104
ReLU activation function; sparse networks; non-negative networks; and combinations 105
of the above. The family also includes most matrix factorization problems. 106
We model a deep structured linear network as a product of matrices called factors. 107
The factors depend linearly on parameters in RS, for S
∈ N. 108
More precisely, consider an arbitrary depth parameter K ≥ 1. The number of 109
layers is K +1 and the layers are enumerated in such a way that the layer receiving the 110
input is K +1 and the layer returning the output is 1. We consider sizes m1. . . mK+1∈
111
N, write m1= m, mK+1= n. We consider, for k = 1 . . . K, the linear map 112 Mk : RS −→ Rmk×mk+1 (1.2) 113 h7−→ Mk(h) 114
Given some parameters, h1, . . . , hK ∈ RS, the action of the deep structured linear
115
network is the product 116
M1(h1)· · · MK(hK)
117
The factor MK(hK) might involve the inputs of the samples by considering MK(hK) =
118 M′
K(hK)I for: a linear map MK′ ; and for a matrix I whose columns contain the inputs.
119
Given outputs X∈ Rm×n, the optimization of the parameters h1, . . . , hKdefining the
120
network aims at getting 121
M1(h1)· · · MK(hK)≃ X.
122
To model feed-forward linear networks, the mappings Mk, k = 1 . . . K− 1 (and
123
MK′ ) construct the matrix by placing the entry of hk corresponding to an edge in the
124
network in the corresponding entry in Mk(hk).
125
For convolutional layers, Mk and MK′ concatenate convolution matrices3defined
126
by a portion of the entries in hk. Each convolution matrix is at the location
corre-127
sponding to a prescribed edge. 128
2
We call this family deep structured linear networks because the family is endowed with tools to impose structures. We analyze the impact of the structure on the stability property. However, these tools might be used to define the usual deep linear network.
3
Depending on the situation: Toeplitz, block-Toeplitz, circulant or block-circulant matrices. The matrices often involve downsampling.
The main argument for studying deep structured linear networks is due to their 129
strong connection to non-linear networks that uses the rectified linear unit (ReLU)4
130
activation function. We explain it in detail. The action of the ReLU activation 131
function at the layer k treats every entry independently of the other entries and 132
multiplies it by either 1 (the entry is kept) or 0 (the entry is canceled). More precisely, 133
denoting h = (hk)k=1..K, the action of the ReLU activation function on the layer k is
134
to apply the map Ak: Rmk×n7−→ Rmk×n(where mk× n is the size data in the layer
135
k) such that: 136
(AkM )i,j = ak(h)i,jMi,j, for (i, j)∈ {1, . . . , mk} × {1, . . . , n}
137 where ak(h)∈ {0, 1}mk×nis defined by 138 ak(h)i,j= ( 1 ifMk+1(hk+1)Ak+1Mk+2(hk+2)· · · AK−1MK(hK) i,j≥ 0 0 otherwise. 139 The function 140 ak : RS×K −→ {0, 1}mk×n 141 h7−→ ak(h) 142
is piecewise constant because {0, 1}mk×n is finite. (This has already been used in
143
[68].) As a consequence, the parameter space RS×K is partitioned into subsets such
144
that on every subset ak is constant, for all k = 1..K. Therefore, on every subset
145
the action of the non-linear network coincides with the action of a deep structured 146
linear network that groups at every layer Ak and Mk+1. Further, the landscape of
147
the objective function of the non-linear neural network that uses ReLU coincides, on 148
every part of the partition, with the landscape of a deep structured linear network. 149
This is a strong argument in favor of the study of deep structured linear networks. 150
Notice that deep structured linear networks have also been obtained in [22,23,44] 151
by modeling the action of the activation function as random, independent of the 152
input and when considering the expectation of the network action. However, these 153
assumptions are not satisfied by the deep networks used in applications (see [23]) 154
and it is not clear that this link can be exploited to obtain theoretical guarantees for 155
realistic deep networks. 156
In addition to the structure induced by the operators Mk, we also consider
struc-157
ture imposed on the vectors h. We assume that we know a collection of models 158
M = (ML)
L∈N with the property that for every L,ML ⊂ RS×K is a given subset.
159
We will assume that the parameters h ∈ RS×K defining the factors are such that
160
there exists L∈ N such that h ∈ ML. For instance, the constraint h
∈ ML might
161
be used to impose sparsity, grouped sparsity or co-sparsity. One might also use the 162
constraint h ∈ ML to impose non-negativity, orthogonality, equality, compactness,
163
etc. Generally speakingM is used to impose some prior or some form of regularity or 164
to compress the parameter space and obtain better bounds [7]. The models might also 165
be used to alleviate ambiguities. For instance, if the operators Mk and Mk+1 allow
166
permutations (i.e. there exists (hk, hk+1)6= (gk, gk+1) and a permutation matrix C
167
such that Mk(hk) = Mk(gk)C and Mk+1(hk+1) = C−1Mk+1(gk+1)), we can use a
168
complete ordering of the parameter space RK×S and impose, usingM, the largest of 169
all the equivalent versions of a parameters to be considered. 170
4
1.3. Bibliography. 171
1.3.1. Other matrix factorization and compressed sensing. The content 172
of this paper is strongly related to and can be considered as an extension of the 173
research field usually named compressed sensing. Because of the importance of this 174
field of research and to simplify the reading for readers whose main interest is in deep 175
learning, we have separated this part of the bibliography and placed it in Section2. 176
Notice that the statement of the informal theorem1.1can be interpreted in the context 177
of signal recovery. In particular, the results on deep structured linear networks can 178
probably be specialized to be applicable to matrix factorization problems for which 179
stability properties have not been established [77,20,21,59,58,46,56]. We have not 180
investigated this potential. 181
1.3.2. Tensors and deep networks. The analysis conducted in this paper 182
is based on a connection, named tensorial lifting, between deep structured linear 183
networks and a tensor problem (see Section5). The tensorial lifting has already been 184
described in [60] but other connections between tensor and network problems have 185
been described by other authors. In particular, in [26, 27, 45], the authors define 186
a score function using a tensor. They highlight a network topology that computes 187
the score function defined by a tensor decomposable using a CP-decomposition, a 188
Hierarchical Tucker [26, 27] or a tensor train decomposition[45]. They then deduce 189
the expressive power of the network topology from the connections between the tensor 190
decompositions. These results highlight and analyze why deep networks are more 191
expressive than shallow ones. Tensors and tensor decomposition have also been used 192
to represent the cross-moment and construct a solver [42], encode the convolution 193
layers with a tensor of order 4 and manipulate this tensor to improve the network 194
[49,65,82], to represent a tensor layer [73,81]. 195
1.3.3. Stability property for neural networks. To the best of our knowl-196
edge, the articles establishing stability properties are [4,14,53,71, 83, 60]. 197
Among them, [14,53,83] consider a family of networks of depth 1 or 2 (depending 198
on the article, the definition of the depth may vary). The article [71] contains a study 199
on deep networks (the depth can be large), but the study only focuses on the recovery 200
of one layer. The articles [4, 60] consider networks without depth limitation. 201
In [14], the authors consider the minimization of the statistical risk (not the em-202
pirical risk). The input is assumed Gaussian and the output is generated by a network 203
involving one linear layer followed by ReLU and a mean. The number of intermediate 204
nodes is smaller than the input size. They provide conditions guaranteeing that, with 205
high probability, a randomly initialized gradient descent algorithm converges to the 206
true parameters. The authors of [53] consider a feed forward network made of one 207
unknown linear layer, followed by ReLU and a sum. The size of the intermediate layer 208
equals the size of the data, the size of the output is 1. Again, they assume Gaussian 209
input data and consider the minimization of the risk (not the empirical risk). They 210
show that the stochastic gradient descent converges to the true solution. In [83], the 211
authors consider a non-linear layer followed by a linear layer. The size of the interme-212
diate layer is smaller than the size of the input and the size of the output is 1. They 213
describe an initialization algorithm based on a tensor decomposition such that with 214
high probability, the gradient algorithm minimizing the empirical risk converges to 215
the true parameters that generated the data. 216
The authors of [71] consider a feed-forward neural network and show that, if the 217
input is Gaussian or its distribution is known, a method based on moments and sparse 218
dictionary learning can retrieve the parameers defining the first layer. Nothing is said 219
about the stability or the estimation of the other layers. 220
The authors of [4] consider deep feed-forward networks which are very sparse and 221
randomly generated. They show that they can be learned with high probability one 222
layer after another. However, very sparse and randomly generated networks are not 223
used in practice and one might want to study more versatile structures. The article 224
[60] studies deep structured linear networks (without the models M) and uses the 225
same tensorial lifting we use here. However, in [60] the function d measuring the 226
error between parameters is only defined using the ℓ∞ norm and ii not a metric.
227
The transversality condition of [60] is sufficient to guarantee the stability but is not 228
necessary. All these weakness are corrected in this extended version. The general 229
result is also specialized to deep convolutional linear networks. 230
1.4. Organization of the paper. Because it is strongly related, we give an ex-231
tensive bibliography on compressive sensing and stable recovery properties for matrix 232
factorization problems in Section2. We describe the framework of the paper and our 233
notations in Section3. 234
The main contributions of this paper are: 235
• In Section4, we investigate and recall several results on tensors, tensor rank 236
and the Segre embedding. In particular, we investigate how the Segre em-237
bedding distorts distances. 238
• In Section5, we describe the tensorial lifting. It expresses any deep structured 239
linear networks in a generic multilinear format. The latter composes a linear 240
lifting operator and the Segre embedding. 241
• When δ = η = 0 (see Section6): 242
– We establish a simple geometric condition on the intersection of two sets 243
which is necessary and sufficient to guarantee the identifiability of the 244
parameters up to scale ambiguity (Proposition6.3). 245
– We provide simpler conditions which involve the rank of the lifting op-246
erator (defined in Section5) such that: 247
∗ Under-specified case: If the lifting operator rank is large (e.g. larger 248
than 2K(S− 1) + 2, when M = RS×K) and the lifting operator is
249
random, for almost every lifting operator, the solution of 250
M1(h1)· · · MK(hK) = X
251
is identifiable (Theorem6.4). 252
∗ Over-specified case: If the lifting operator rank is small (e.g. smaller 253
than 2S− 1, when M = RS×K), the solution of
254
M1(h1)· · · MK(hK) = X
255
is not identifiable (Theorem6.5); 256
– We also provide a simple algorithm to compute the rank of the lifting 257
operator (Proposition5.3). 258
• Stability guarantee statements for deep structured linear networks are in Sec-259
tion7: 260
– We define the deep-Null Space Property (Definition 7.1): a generaliza-261
tion of the usual Null Space Property [25] that also applies to the deep 262
problems. 263
– We establish that the deep-Null Space Property is a necessary and suffi-264
cient condition to guarantee stability (see the informal statement above 265
or Theorem7.2and Theorem 7.3). 266
• We specialize the results to convolutional linear networks in Section 8 and 267
establish a simple condition that can be computed (see Algorithm8.1). The 268
is such that (see Theorem8.4) 269
– If the condition is satisfied the convolutional linear networks can be 270
stably recovered; 271
– If the condition is not satisfied, the convolutional linear network is not 272
identifiable. 273
In simple words, the condition holds when the supports of the convolution 274
kernels are sufficiently scattered. This is not satisfied by the convolutional 275
kernel used in applications and explains their instability. 276
2. Bibliography on matrix factorization and compressed sensing. Be-277
fore describing the bibliography on compressed sensing, we interpret this stability 278
statement of the Informal Theorem1.1 in the context of signal processing. In signal 279
processing, we usually know that h exists and δ represents the sum of a modeling 280
error and noise. The inequality (1.1) guarantees that, when the condition is satisfied, 281
even an approximative minimizer of 282
(2.1) argminL∈N,(hk)k=1..K∈ML kM1(h1)· · · MK(hK)− Xk
2.
283
leads to a solution h∗close h. This property is often named: stable recovery guarantee. 284
When δ = 0 (i.e., the data exactly fits the model and is not noisy) and η = 0 285
(i.e., (2.1) is perfectly solved) this is an identifiability guarantee. This is a necessary 286
condition of stable recovery. 287
In this section, we distinguish the cases K = 1, K = 2 and K≥ 3. 288
2.1. K = 1: Linear inverse problems. The simplest version consists of a 289
model with one layer (i.e., K = 1) andM = RS×K. Recovering h
1from X is a linear
290
inverse problem. The data X can be vectorized to form a column vector and the 291
operator M1simply multiplies the column vector h1 by a fixed (rectangular) matrix.
292
Typically, when the linear inverse problem is over-determined, the latter matrix has 293
more rows than columns, the uniqueness of a solution to (2.1) depends on the column 294
rank of the matrix and the stable recovery constant depends on the smallest singular 295
value of M1.
296
When the matrix is not full colum rank, the identifiability and stable recovery 297
for this problem has been intensively studied for many constraintsM. In particular, 298
for sparsity constraints this is the compressed/compressive sensing problem (see the 299
seminal articles [15, 30]). Some compressed sensing statements (especially the ones 300
guaranteeing that any minimizer of the ℓ0problem stably recovers the unknown prob-301
lem) are special cases (K = 1) of the statements provided in this paper. We will not 302
give a complete review on compressed sensing but would like to highlight the Null 303
Space Property described in [25]. The fundamental limits of compressed sensing (for 304
a solution of the ℓ0 problem) have been analyzed in detail in [13].
305
Although the main novelty of the paper is to investigate stable recovery properties 306
for any K ≥ 1, we specialize the statements made for K ≥ 1 to the case K = 1 in 307
order to illustrate the new statements and to provide a way of comparison with well 308
known results. 309
2.2. K = 2: Bilinear inverse problems and bilinear parameterizations. 310
When k≥ 2, the problem becomes non-linear because of the product in (2.1). This 311
significantly complicates the analysis. What follows are the main instances studied in 312
the literature when K = 2. 313
Non-negative Matrix factorization (NMF) and low rank prior:. In non-negative 314
matrix factorization [50], M1 and M2 map the entries in h1 and h2 at prescribed
315
locations in the factors (say, one column after another). The constraintsM imposes 316
that all the entries in h1 and h2 are non-negative. The NMF has been widely used
317
for many applications. 318
Conditions guaranteeing that the factors provided by the NMF identify5the
cor-319
rect factors (up to rescaling and permutation) were first established in the pioneering 320
work [29]. To the best of our knowledge, this is the first paper addressing recovery 321
guarantees for a problem of depth K = 2. It emphasizes a separability condition that 322
guarantees identifiability. The proof is purely geometric and relies on the analysis 323
of inclusions of simplicial cones. This result is significantly extended in [48]. In this 324
paper, the continuity of the NMF estimator is established. Concerning computational 325
aspects, NMF is NP-complete [78]. However, under the separability hypothesis of [29], 326
the solution of the NMF problem can be computed in polynomial time [5]. 327
We can slightly generalize6the problem and introduce a linear degradation
oper-328 ator 329
H : Rm×n−→ Rm×n. 330
Use the same mapping M1 and M2 as for the NMF, with M = RS×K, but with a
331
small number of lines (resp columns) in M2(h2) (resp. M1(h1)). Any solution of the
332 problem 333 (h∗1, h∗2)∈ argmin(h1,h2)∈RS×K kH(M1(h1)M2(h2))− Xk 2 334
leads to a low rank approximation M1(h∗1)M2(h∗2) of an inverse of H, at X. Again, a
335
large corpus of literature exists on the low rank prior [66,17,32,19]. 336
Phase Retrieval:. Phase retrieval fits the framework described in the present pa-337
per when we take 338 M1(h1) = diag (Fh1) M2(h2) = (Fh2)∗ 339 and 340 M = {(h, h) ∈ RS×K | h ∈ RS} 341
where S is the size of the signal,F computes N linear measurements of any element 342
in RS (typically Fourier measurements), diag (.) creates an N× N diagonal matrix
343
whose diagonal contains the input and∗ is the (entry-wise) complex conjugate.
344
The tensorial lifting at the core of the present paper generalizes the lifting used 345
in the inspiring work on PhaseLift [52, 18, 16]. As is often the case when K = 2, 346
PhaseLift is a semidefinite program that can be efficiently solved when the unknown 347
is of moderate size. These papers also provide conditions on the measurements guar-348
anteeing that the phases are stably recovered by PhaseLift. 349
The benefit of the generalization introduced with the tensorial lifting is that it 350
applies to any multilinear inverse problem. 351
Self-calibration and de-mixing. Measuring operators often depend linearly on pa-352
rameters that are not perfectly known. The estimation of these parameters is crucial 353
to restore the data measured by the device. This is the self-calibration problem. This 354
5
Stable recovery is not established.
6
The interested readers can check that this generalization only leads to a small change of the Lifting operator introduced in Section5. It is therefore done at no cost.
naturally fits the setting of this article: - we let h1 be the parameters defining the
355
sensing matrix and M1(h1) be the sensing matrix. Then h2 defines the signal (or
356
signals) contained in the column(s) of M2(h2).
357
Many instances of this problem have been studied and much progress has been 358
made to obtain algorithms that can be applied to problems of larger and larger size. 359
This leads to a very interesting line of research. 360
To the best of our knowledge, the first stable recovery statements concern the 361
blind-deconvolution problem. In [2], the authors use a lifting to transform the blind-362
deconvolution problem into a semidefinite program with an unknown whose size is 363
the product of the sizes7 of h
1 and h2. Such problems can be solved for unknowns of
364
moderate size. The authors of [2] provide explicit conditions guaranteeing the stable 365
recovery with high probability. This idea has been generalized and applied to other 366
similar problems in [24,9]. The authors of [54] consider a significantly more general 367
calibration model. In this model, M1(h1) is diagonal and its diagonal contains the
368
entries of h1. M2(h2) simply multiplies h2 by a fixed known matrix (the theorems
369
consider a random matrix). The constraint imposes h2to be sparse. For this problem,
370
they prove that with high probability the numerical method called SparseLift is stable 371
with a controlled accuracy. SparseLift returns the left and right singular vectors of 372
the solutions of an ℓ1 optimization problem whose unknown is the same as in [2].
373
However, solving an ℓ1 minimization problem is much simpler than a semi-definite
374
problem. This is a very significant practical improvement. 375
As emphasized in [51], in order to motivate its non-convex approach, the only 376
drawback of the numerical methods described in [2,54] is their complexity. The extra 377
complexity is due to the fact that they optimize a variable in the product space RS×S
378
and then deduce an approximate solution of the un-lifted problem. This is what 379
motivates the authors of [51] to propose a non-convex approach. The constructed 380
algorithm provably stably recovers the sensing parameters and the signals with a 381
geometric onvergence rate. 382
Sparse coding and dictionary learning:. Sparse coding and dictionary learning is 383
another kind of bilinear problem (see [67] for an overview). In that framework, the 384
columns of X contain the data. Most often, people consider two layers: K = 2. 385
The layer M1(h1) is an optimized dictionary of atoms defined by the parameters h1
386
and each column of M2(h2) contains the code (or coordinates) of the corresponding
387
column in X. Most often, h2is assumed sparse.
388
The identifiability and stable recovery of the factors has been studied in many 389
dictionary learning contexts and provides guarantees on the approximate recovery of 390
both an incoherent dictionary and sparse coefficients when the number of samples is 391
sufficiently large (i.e., in our setting when n is large). In [36], the authors developed 392
local optimality conditions in the noiseless case, as well as sample complexity bounds 393
for local recovery when M1(h1) is square and M2(h2) are iid Bernoulli-Gaussian. This
394
was extended to overcomplete dictionaries in [33] (see also [70] for tight frames) and to 395
the noisy case in [43]. The authors of [74] provide exact recovery results for dictionary 396
learning, when the coefficient matrix has Bernoulli-Gaussian entries and the dictionary 397
matrix has full column rank. This was extended to overcomplete dictionaries in [1] 398
and in [6] but only for approximate recovery. Finally, [35] provides such guarantees 399
under general conditions which cover many practical settings. 400
Contributions in these frameworks. The present article considers the identifiabil-401
ity and stability of the recovery for any K≥ 1 in a general and unifying framework. As 402
7
was already mentioned, we do not investigate computational issues. As will appear, 403
later the paper, the analogue of the lifting at the core of the algorithms described 404
in the above papers (in particular the papers on phase retrieval and self-calibration) 405
is a tensorial lifting (see Section5) and involves tensors that cannot be manipulated 406
in practice. Even when we are able to manipulate the tensors, the computation 407
of the best rank 1 approximation of such tensors is an open non-convex problem. 408
Therefore, there is no numerically efficient and reliable way to extract the un-lifted 409
parameters from an optimized tensor. Because of that, we have not yet pursued the 410
construction of a numerical scheme based on the tensorial lifting when K ≥ 3. As 411
was already mentioned, at this writing, the success of algorithms for K≥ 3 is mostly 412
supported by empirical evidence. Proving their efficiency is a wide open problem (see 413
[55,37,22,23,68,10,11,44,79]). The purpose of the paper is to provide guarantees 414
on the stability of the solution when such an empirical success occurs. 415
The specialization of the presented results to problems with K = 2 leads to 416
necessary and sufficient conditions for the stable recovery. This is slightly different 417
from the usual approach. Usually, authors provide sufficient conditions and argue 418
their sharpness by comparing the number of samples required by their method and 419
the information theoretic limit (typically, the number of independent variables of the 420
problem). 421
It would of course be interesting to see how far it is possible to unify the different 422
problems with K = 2 using the framework of this paper. We have however not 423
pursued this route and instead focus on the situation K≥ 3. 424
2.3. K≥ 3. The difficulties, when K ≥ 3, come from the fact that tools used for 425
problems with K = 2 are not applicable. In particular, we cannot use the usual lifting, 426
the singular value decomposition or the sin-θ theorem in [28]. Often, these tools are 427
replaced by analogous objects involving tensors. This complicates the analysis and 428
prohibits the use of numerical schemes that manipulate lifted variables. 429
To the best of our knowledge, little is known concerning the identifiability and 430
the stability of matrix factorization when K≥ 3. The uniqueness of the factorization 431
corresponding to the Fast Fourier Transform was proved in [57]. Other results consider 432
the identifiability of the factors which are sparse and random [64]. The authors of 433
the present paper have announced preliminary versions of the results described here 434
in [60]. They are significantly extended here. 435
3. Notation and summary of the hypotheses. We continue to use the no-436
tation introduced in the introduction. For an integer k∈ N, set [k] = {1, · · · , k}. 437
We consider K ≥ 1 and S ≥ 2 and real-valued tensors of order K whose axes 438
are of size S, denoted T ∈ RS×···×S. The space of tensors is abbreviated RSK
. The 439
entries of T are denoted Ti1,··· ,iK, where (i1,· · · , iK)∈ [S]
K. For i∈ [S]K, the entries
440
of i are i = (i1,· · · , iK) (for j∈ [S]K we let j = (j1,· · · , jK), etc). We either write Ti
441
or Ti1,··· ,iK.
442
To simplify notations, from now on, the parameters defining the factors are gath-443
ered in a single matrix and are denoted with bold fonts h∈ RS×K. The kth vector
444
containing the parameters for the layer k is denoted hk ∈ RS. The ith entry of the
445
kth vector is denoted hk,i∈ R. A vector not related to an element in RS×K is denoted
446
h∈ RS (i.e., using a light font). Throughout the paper we assume
447
M = (ML)L∈N, withML⊂ RS×K. 448
We also assume that, for all L∈ N, ML
6= ∅. They can however be equal or constant 449
after a given L′.
450
All the vector spaces RSK
, RS×K, RS etc. are equipped with the usual Euclidean
451
norm. This norm is denotedk.k and the scalar product h., .i. In the particular case 452
of matrices, k.k corresponds to the Frobenius norm. We also use the usual p norm, 453
for p∈ [1, ∞], and denote it by k.kp. In particular, for h∈ RS×K and T ∈ RS
K , we 454 have for p < +∞ 455 khkp= K X k=1 S X i=1 |hk,i|p !1/p , kT kp= X i∈[S]K |Ti|p 1/p 456 and 457 khk∞= max k∈[K] i∈[S] |hk,i| , kT k∞= max i∈[S]K|Ti|. 458 Set 459 (3.1) RS×K ∗ ={h ∈ RS×K | ∀k ∈ [K], khkk 6= 0}. 460
Define an equivalence relation on RS×K
∗ : for any h, g∈ RS×K, h∼ g if and only if
461
there exist (λk)k∈[K]∈ RK such that
462 (3.2) K Y k=1 λk = 1 and hk = λkgk,∀k ∈ [K]. 463
Denote the equivalence class of h∈ RS×K
∗ byhhi.
464
The zero tensor is of rank 0. A non-zero tensor T ∈ RSK
is of rank 1 (or decom-465
posable) if and only if there exists h∈ RS×K
∗ such that T is the outer product of the
466
vectors hk, for k∈ [K]. That is, for any i ∈ [S]K,
467
Ti= h1,i1· · · hK,iK.
468
Let Σ1⊂ RS
K
denote the set of tensors of rank 0 or 1. 469
The rank of a tensor T ∈ RSK
is 470
rk (T ) = min{r ∈ N | there exists T1,· · · , Tr∈ Σ1 such that T = T1+· · · + Tr}.
471 For r∈ N, let 472 Σr={T ∈ RS K | rk (T ) ≤ r}. 473
The∗ superscript refers to optimal solutions. A set with a ∗ subscript means that 474
0 is ruled out of the set. In particular, Σ1,∗ denotes the non-zero tensors of rank 1.
475
Attention should be paid to RS×K∗ (see (3.1)). 476
4. Facts on the Segre embedding and tensors of rank 1 and 2. Parametrize 477 Σ1⊂ RS K by the map 478 (4.1) P : RS×K −→ Σ1⊂ RS K
h 7−→ (h1,i1h2,i2· · · hK,iK)i∈[S]K.
479
The map P is called the Segre embedding and is often denoted dSeg in the algebraic 480
geometry literature. 481
Standard Facts: 482
1. Identifiability of hhi from P (h): For h and g ∈ RS×K
∗ , P (h) = P (g) if
483
and only ifhhi = hgi. 484
2. Geometrical description of Σ1,∗: Σ1,∗ is a smooth (i.e., C∞) manifold
485
of dimension K(S− 1) + 1 (see, e.g., [47], chapter 4, pp. 103). 486
3. Geometrical description of Σ2: We recall that the singular locus (Σ2)sing
487
of the closure Σ2 of Σ2 has dimension strictly less than that of Σ2 and that
488
Σ2\(Σ2)sing is a smooth manifold. The dimension of Σ2\(Σ2)sing is 2K(S−
489
1) + 2 when K > 2, and is 4(S− 1) when K = 2 (see, e.g., [47], chapter 5). 490
We can improve Standard Fact1 and obtain a stability result guaranteeing that if we know a rank 1 tensor sufficiently close to P (h), we approximately knowhhi. In order to state this, we need to define a metric on RS×K
∗ / ∼ (where ∼ is defined by
(3.2)). This has to be considered with care since, whatever h ∈ RS×K
∗ , the subset
{h | h ∈ hhi} is not compact. In particular, considering h′ k = λ h k if k = 1 λ−K−11 h k otherwise
when λ goes to infinity, we easily construct examples that make the standard metric 491
on equivalence classes useless8.
492
This leads us to consider 493 RS×K diag ={h ∈ R S×K ∗ | ∀k ∈ [K], khkk∞=kh1k∞}. 494
The interest in this set comes from the fact that, whatever h∈ RS×K
∗ , the sethhi ∩
495
RS×K
diag is finite. Indeed, if g ∈ hhi ∩ RS×Kdiag the (λk)k∈[K] ∈ R
K such that, for all
496
k∈ [K], hk = λkgk must all satisfy|λk| = 1, i.e., λk=±1.
497
Definition 4.1. For any p ∈ [1, ∞], we define the mapping dp : (RS×K
∗ / ∼
498
×RS×K
∗ /∼) → R by
499
dp(hhi, hgi) = inf h′ ∈hhi∩RS×K diag g′ ∈hgi∩RS×Kdiag kh′− g′kp ∀h, g ∈ RS×K∗ . 500
Proposition4.2. For any p∈ [1, ∞], dp is a metric on RS×K ∗ /∼.
501
The proof is in Appendix10.1. 502
Notice that the equivalence relationship and metric defined above are not adapted 503
to operators Mk allowing invariance such as permutations. More precisely, for some
504
operators Mk, there exists h, g and a permutation matrix C such that (hk, hk+1)6=
505
(gk, gk+1) and Mk(hk) = Mk(gk)C and Mk+1(hk+1) = C−1Mk+1(gk+1). In such a
506
case, we have dp(hhi, hgi) 6= 0. However, the features defined at the layer k are just
507
8
For instance, if h and g ∈ RS×K∗ are such that h1= g1, we have
inf
h′∈hhi,g′∈hgikh
′− g′k p= 0
even though we might have h26= g2(and therefore hhi 6= hgi). This does not define a metric.
Also, when h and g are such that hk6= gk, whatever k ∈ [K], we have
sup h′∈hhi inf g′∈hgikh ′− g′k p= +∞.
Therefore, the Hausdorff distance between hhi and hgi is infinite for almost every pair (h, g). This metric is therefore not very useful in the present context.
permuted and can still be interpreted. As already said, in such a case, it is possible 508
to use the modelsM to select one the equally interpretable h. 509
Using the above metric, we can state that not onlyhhi is uniquely determined by 510
P (h), but this operation is stable. 511
Theorem 4.3. Stability of hhi from P (h) 512
Let h and g∈ RS×K
∗ be such thatkP (g)−P (h)k∞≤ 12 max (kP (h)k∞,kP (g)k∞).
513 For all p, q∈ [1, ∞], 514 (4.2) dp(hhi, hgi) ≤ 7(KS) 1 pmin kP (h)kK1−1 ∞ ,kP (g)k 1 K−1 ∞ kP (h) − P (g)kq. 515
The proof of the theorem is in Appendix10.2. 516
In the final result, the bound established in Theorem4.3 plays a role similar to 517
the sin− θ Theorem of [28] in [54, 18,2]. 518
The following proposition shows that the upper bound in (4.2) cannot be improved 519
by a significant factor, in particular when q is large. 520
Proposition4.4. There exist h and g∈ RS×K
∗ such thatkP (g)k∞≤ kP (h)k∞, 521 kP (g) − P (h)k∞≤12 kP (h)k∞ and 522 7(KS)1pkP (h)k 1 K−1 ∞ kP (h) − P (g)kq≤ Cq dp(hhi, hgi), 523 where 524 Cq= 28(KS)1q if q < +∞, 28 if q = +∞. 525
The proof of the theorem is in Appendix10.3. 526
As stated in the following theorem, we have a more valuable upper bound in the 527
general case. 528
Theorem 4.5. “Lipschitz continuity”of P 529
For any q∈ [1, ∞] and any h and g ∈ RS×K
∗ , 530 (4.3) kP (h) − P (g)kq ≤ S K−1 q K1− 1 qmax kP (h)k1−K1 ∞ ,kP (g)k1− 1 K ∞ dq(hhi, hgi). 531
The theorem is proved in Appendix 10.4. 532
Notice that, considering h and g∈ RS×K such that h
k,i= 1 and gk,i= ε, for all
533
k∈ [K] and i ∈ [S] and for a 0 < ε ≪ 1, we easily calculate 534 SK−1q K1− 1 qmax kP (h)k1−K1 ∞ ,kP (g)k1− 1 K ∞ dq(hhi, hgi) ≤ KkP (h) − P (g)kq. 535
As a consequence, the upper bound in Theorem4.5is tight up to at most a factor K. 536
5. The tensorial lifting . The following proposition is clear (it can be shown 537
by induction on K): 538
Proposition5.1. Let Mk, k∈ [K] be as in (1.2). The entries of the matrix M1(h1)M2(h2)· · · MK(hK)
are multivariate polynomials whose variables are the entries of h∈ RS×K. Moreover,
539
every entry is the sum of monomials of degree K. Each monomial is a constant times 540
h1,i1· · · hK,iK, for some i∈ [S]
K.
Notice that any monomial h1,i1· · · hK,iK is the entry P (h)i in the tensor P (h).
542
Therefore every polynomial in the previous proposition takes the formPi∈[S]KciP (h)i
543
for some constants (ci)i∈[S]K independent of h. In words, every entry of the matrix
544
M1(h1)M2(h2)· · · MK(hK) is obtained by applying a linear form to P (h). Moreover,
545
the polynomial coefficients defining the linear form are uniquely determined by the 546
linear maps M1,· · · , MK. This leads to the following statement.
547
Corollary 5.2. Tensorial Lifting 548
Let Mk, k∈ [K] be as in (1.2). The map
(h1, . . . , hK)7−→ M1(h1)M2(h2)· · · MK(hK),
uniquely determines a linear map 549
A : RSK
−→ Rm×n,
550
such that for all h∈ RS×K
551
(5.1) M1(h1)M2(h2)· · · MK(hK) =AP (h).
552
We call (5.1) and its use the tensorial lifting. When K = 1, we simply have 553
A = M1. When K = 2 it corresponds to the usual lifting already exploited to
estab-554
lish stability results for phase recovery, blind-deconvolution, self-calibration, sparse 555
coding, etc. Notice that, when K ≥ 2, it may be difficult to provide a closed form 556
expression for the operator A. We can however determine simple properties of A. 557
In most reasonable cases, A is sparse. If the operators Mk simply embed the values
558
of h in a matrix, the matrix representing A only contains zeros and ones. Since the 559
operators Mk are known, we can compute AP (h), for any h ∈ RS×K, using (5.1).
560
Said differently, we can computeA for any rank 1 tensor. Therefore, since A is linear, 561
we can compute AT for any low rank tensor T . If the dimensions of the problem 562
permit, one can manipulateA in a basis of RSK
. 563
Since rk (A) is an important quantity, we emphasize that rk (A) ≤ mn. It is also 564
possible to compute rk (A), when mn is not too large, using the following proposition. 565
Proposition5.3. For R independent random hr, with r = 1..R, according to 566
the normal distribution in RS×K, we have with probability 1
567 (5.2) dim(Span ((AP (hr))r=1..R)) = R if R≤ rk (A) rk (A) otherwise. 568
The proof is in Appendix10.5. 569
Using Corollary5.2, when (2.1) has a minimizer, we rewrite in the form 570
(5.3) h∗∈ argmin
L∈N,h∈ML kAP (h) − Xk2.
571
We now decompose this problem into two sub-problems: A least-squares problem 572
(5.4) T∗∈ argminT ∈RSK kAT − Xk
2
573
and a non-convex problem 574
(5.5) h′∗∈ argminL∈N,h∈MLkA(P (h) − T∗)k2.
575
Proposition5.4. Let X and A be such that (2.1) has a minimizer: 576
1. Let h∗ be a solution of (5.3). Then, for any solution T∗ of (5.4), h∗ also
577
minimizes (5.5). 578
2. Let T∗ be a solution of (5.4) and h′∗ a solution of (5.5). Then, h′∗ also
579
minimizes (5.3). 580
The proof is in Appendix10.6. 581
From now on, because of the equivalence between solutions of (5.5) and (5.3), we 582
stop using the notation h′∗and write h∗∈ argmin
L∈N,h∈MLkA(P (h) − T∗)k2.
583
6. Identifiability (error free case). Throughout this section, we assume that 584
X is such that there exists L and h∈ ML such that
585
(6.1) X = M1(h1)· · · MK(hK).
586
Under this assumption, X =AP (h), so 587
P (h)∈ argminT ∈RSKkAT − Xk
2.
588
Moreover, we trivially have P (h) ∈ Σ1 and therefore h minimizes (5.5), (2.1) and
589
(5.3). As a consequence, (2.1) has a minimizer. 590
We ask whether there exist guarantees that the resolution of (2.1) allows one to 591
recover h up to the usual uncertainties. 592
In this regard, for any h ∈ hhi, we have P (h) = P (h) and therefore AP (h) = 593
AP (h) = X. Thus unless we make further assumptions on h, we cannot expect to 594
distinguish any particular element of hhi using only X. In other words, recovering 595
hhi is the best we can hope for. 596
Definition 6.1. Identifiability 597
We say that hhi is identifiable if the elements of hhi are the only solutions of 598
(2.1). 599
We say that M is identifiable if for every L ∈ N and every h ∈ ML, hhi is
600
identifiable. 601
Proposition6.2. Characterization of the global minimizers 602
For any L∗∈ N and any h∗ ∈ ML∗
, (L∗, h∗)∈ argmin L∈N,h∈MkAP (h) − Xk2 603 if and only if 604 P (h∗)∈ P (h) + Ker (A) . 605
The Proposition is proved in Appendix10.7. 606
In order to state the following proposition, we define for any L and L′∈ N
607 P (ML) − P (ML′ ) :=nP (h)− P (g) | h ∈ ML and g ∈ ML′o ⊂ RSK . 608
Proposition6.3. Necessary and sufficient conditions of identifiability 609
1. For any L and h∈ ML: hhi is identifiable if and only if for any L ∈ N
610
P (h) + Ker (A)∩ P (ML)
⊂ {P (h)}. 611
2. M is identifiable if and only if for any L and L′∈ N
612
(6.2) Ker (A) ∩ P (ML)− P (ML′)⊂ {0}. 613
The Proposition is proved in Appendix10.8. 614
In the context of the usual compressed sensing (i.e., when K = 1, M contains 615
L-sparse signals,A is a rectangular matrix with full row rank and X is a vector), the 616
proposition is already stated in Lemma 3.1 of [25]. 617
In reasonably small cases and when P (M) is algebraic, one can use tools from 618
numerical algebraic geometry such as those described in [39,40] to check whether the 619
condition (6.2) holds or not. The drawback of Proposition6.3 is that, given a deep 620
structured linear network as described byA, the condition (6.2) might be difficult to 621
verify. 622
We therefore establish simpler conditions related to the identifiability ofM. First 623
we establish a condition such that for almost everyA satisfying it, M is identifiable. 624
The main benefit of this condition is that its constituents can be computed in many 625
practical situations. 626
Before that, we recall a few facts of algebraic geometry, for X, Y ⊂ RN, the join
of X and Y (see, e.g., [38, Ex. 8.1]) is
J(X, Y ) :={sx + ty | x ∈ X, y ∈ Y, s, t ∈ R}.
If for all L∈ N, MLis Zariski closed and invariant under rescaling (e.g., if they are all
627
linear spaces), then P (ML)
−P (ML′
) is a Zariski open subset of J(P (ML), P (
ML′
)). 628
In general, it is contained in this join. 629
Recall the following fact (*): for complex algebraic varieties X, Y ⊂ CN, any
630
component Z of X ∩ Y has dim (Z) ≥ dim (X) + dim (Y ) − N, and equality holds 631
generically (we make “generically” precise in our context below). Moreover, if X, Y 632
are invariant under rescaling, since 0 ∈ X ∩ Y , we have X ∩ Y 6= ∅. (See, e.g., [72, 633
§I.6.2].) 634
This intersection result indicates that if there exists L, L′ such that
rk(A) < dimP (ML)− P (ML′)
we expect to have non-identifiability; and if the rank is larger, for all pair L, L′, we
635
expect identifiability. More precisely: 636
Theorem 6.4. Almost surely sufficient condition for Identifiability 637
For almost everyA such that
rk(A) ≥ dimJ(P (ML), P (
ML′
)), for all L, L′, M is identifiable.
638
The theorem is proved in Appendix 10.9. 639
Since dimJ(P (ML), P (ML′
)) ≤ dim P (ML)+ dimP (ML′
), if Dmax is
640
the maximum dimension of P (ML) over all L, one has the same conclusion if rk(A) ≥
641
2Dmax.
642
When K = 1, we illustrate this result by interpreting it in the context of com-643
pressive sensing, where h is a vector, X is a vector, A is a rectangular sampling 644
matrix of full row rank and Ker (A) is large. The statement analogous to Theorem 645
6.4in the compressive sensing framework takes the form: “For almost every sampling 646
matrix, any L sparse signal h can be recovered from Ah as soon as 2L ≤ rk (A).” 647
Moreover, the constituent of the ℓ0minimization model used to recover the signal are
648
also the constituents of (5.3). Again, the main novelty is to extend this result to the 649
identifiability of the factors of deep matrix products. 650
In order to establish a necessary condition for identifiability, first note that if we 651
extend P (ML)
− P (ML′
) to be scale invariant, this will not affect whether or not it 652
intersects ker(A) outside of the origin. We immediately conclude that in the complex 653
setting whereML,
ML′
are both Zariski closed, thatM is non-identifiable whenever 654
rk(A) < dimP (ML)− P (ML′
). This indicates that we should always expect non-655
identifiability whenever rk(A) < dimP (ML)
− P (ML′
) but is not adequate to 656
prove it because real algebraic varieties need not satisfy (*). However it is true for 657
real linear spaces, so we immediately conclude the following weak result: 658
Theorem 6.5. Necessary condition for Identifiability 659
Let C(P (ML)
− P (ML′
)) be the set of all points on all lines through the origin 660
intersecting P (ML)
− P (ML′
), and let q be the maximal dimension of a linear space 661
on C(P (ML)
− P (ML′
)). Then if q > rk(A), M is not identifiable. In particular 662
when the P (ML)’s contain linear space and if we let S′ the be the largest dimension
663
of these vector space, if 2S′> rk(A), then M is not identifiable.
664
Let us illustrate the theorems by considering a deep feed-forward ReLU network 665
and consider the structured linear network obtained by fixing the action of ReLU 666
(as is done in Section1.2.2). The matrix X contains the outputs, the operator MK
667
multiplies the matrix containing the inputs by the weights between the first and the 668
second layer. For every input/output pair, the action of ReLU is different and removes 669
paths from the input entries to the output entries. We assume however that every 670
entry of every output is reached by at least one path in the network starting at a non-671
zero entry of the input. In that case, it is not difficult to see that A is a surjection 672
and therefore rk(A) = mn, where m is the size of the output and n is the number of 673
learning samples. 674
The condition in Theorem6.4becomes 675
mn≥ dim (Σ2) = 2K(S− 1) + 2,
676
and KS is typically the number of parameters of the network. The intuition behind 677
Theorem 6.4 is that, if the action of ReLU is sufficiently random and if the above 678
inequality holds, we can expect the network to be identifiable with high probability9.
679
This situation corresponds to an under-parameterized case (favorable for identifiabil-680
ity). 681
The condition in Theorem6.5is 682
2S > mn. 683
When this inequality holds the network is not identifiable. It corresponds to an 684
over-parameterized configuration. In the intermediate situation, when 2S ≤ mn < 685
2K(S− 1) + 2, and when the action of the activation function does not introduce 686
sufficiently randomness, the theorems are inconclusive. 687
Notice that such networks can also be analyzed using Proposition6.3. It is indeed 688
not difficult to see that if there exists two paths that: 1/ start from the same entry 689
of the input layer; 2/ end at the same entry of the output layer; 3/ if both paths are 690
present (despite the action of ReLU) for every input/output pair; then (6.2) does not 691
hold10and RS×K is not identifiable. It is not clear at this point that the conditions 1,
692
9
This statement gives the intuition behind Theorem6.4but it should be made precise, as em-phasized in the perspectives of this paper.
10
Simply consider two rank one tensors, each tensor being a Dirac at the position corresponding to one of the two paths.
2, 3 are met by all non-identifiable structured linear feed-forward networks. However, 693
removing paths from the network (as is done by ReLU and Dropout) is a way to avoid 694
conditions 1, 2, 3 to be met. 695
7. Stability guarantee. In this section, we consider errors of different natures. 696
We assume that there exists L and L∗∈ N, h ∈ ML and h∗∈ ML∗
, such that 697 (7.1) kM1(h1)· · · MK(hK)− Xk ≤ δ, 698 and 699 (7.2) kM1(h∗1)· · · MK(h∗K)− Xk ≤ η, 700
for δ and η typically small. 701
Again, this corresponds to existing unknown parameters h that we estimate from 702
a noisy observation X, using an inaccurate solution h∗ of (2.1) (as in [13] where the
703
case K = 1 is studied). Otherwise, h and h∗ shall be interpreted as different learned
704
parameters; δ and η are the corresponding risks. 705
Notice that the above hypothesis does not even require (2.1) to have a solution. 706
Algorithms which do not come with a guarantee sometimes manage to reach small 707
δ and η values. In those cases, the analysis we conduct in this section permits to 708
get the stability guarantee, despite the lack of a guarantee of the algorithm. Finally, 709
the hypotheses (7.1) and (7.2) enable one to obtain guarantees for algorithms that, 710
instead of minimizing (2.1), minimize an objective function which approximates the 711
one in (2.1). This is particularly relevant for machine learning applications when (2.1) 712
can be an empirical risk that needs to be regularized or is not truely minimized (for 713
instance, when using dropout [75]). 714
A necessary and sufficient condition for the identifiability ofM is stated in Propo-715
sition6.3. The condition is on the way Ker (A) and P (ML)
− P (ML′
) intersect. In 716
order to get a stability guarantee, we need a stronger condition on the geometry of 717
this intersection to hold for every L and L′ ∈ N. This condition is provided in the
718
next definition. 719
Definition 7.1. Deep-Null Space Property 720
Let γ > 0 and ρ > 0. We say that Ker (A) satisfies the deep-Null Space Prop-721
erty (deep-NSP) with respect to the collection of models M with constants (γ, ρ) if 722
for any L and L′ ∈ N, any T ∈ P (ML)− P (ML′
) satisfying kAT k ≤ ρ and any 723
T′∈ Ker (A), we have
724
(7.3) kT k ≤ γkT − T′k.
725
The deep-NSP implies that, for T ∈ P (ML)
− P (ML′
) close to Ker (A) in the sense 726
that kAT k ≤ ρ we must have, by decomposing T = T′+ T′′, with T′ ∈ Ker (A) and
727
T′′in its orthogonal complement
728 kT k ≤ γkT − T′k = γkT′′k ≤ γ σminkAT ′′k ≤ γ σmin ρ, 729
where σmin is the smallest non-zero singular value of A. In words, kT k must be
730
small. We can conclude that under the deep-NSP, P (ML)
− P (ML′
) and{T ∈ RSK
| 731
kAT k ≤ ρ} intersect at most in the neighborhood of 0. 732
Additionally, (7.3) implies that in the neighborhood of 0, Ker (A) and P (ML)
− 733
P (ML′
) are not tangential, i.e., their intersection is transverse. 734
If Ker (A) satisfies the deep-NSP with respect to the collection of models M with 735
constants (γ, ρ), then for all T′ ∈ Ker (A) and all T ∈ P (ML)
− P (ML′ ) satisfying 736 kAT k ≤ ρ, 737 kT′k ≤ kT k + kT′− T k ≤ (γ + 1)kT′− T k. 738 Therefore, 739 (7.4) ∀T′∈ Ker (A) , kT′k ≤ (γ + 1)d loc(T′, P (ML)− P (ML ′ )), 740
where we have set for any C⊂ RSK
741
dloc(T′, C) = inf T ∈C,kAT k≤ρkT
′− T k.
742
The converse is also true: if Ker (A) satisfies (7.4), it satisfies the deep-NSP with 743
respect to the collection of models M with appropriate constants. In the context of 744
the usual compressed sensing (i.e., when K = 1,MLcontains L-sparse signals,A is a
745
rectangular matrix with full row rank and X is a vector), the localization appearing in 746
dloccan be discarded since the inequality must hold when T′is small and since in this
747
case this localization has no effect. Therefore, in the compressed sensing context, (7.4) 748
(and therefore deep-NSP) is the usual Null Space Property with respect to L-sparse 749
vectors, as defined in [25]. However, deep-NSP is generalized to take into account 750
deep structured linear network. This motivates the name. 751
In the general case, the deep-NSP can be understood as a local version of the 752
generalized-NSP for A relative to P (∪L∈NML)
− P (∪L∈NML), as defined in [13].
753
Our interest in locality (as imposed by the constraintkAT k ≤ ρ) is motivated by the 754
fact that we want to use the deep-NSP when the signal to noise ratio is controlled 755
(i.e., the hypotheses of Theorem 4.3 are satisfied). The condition for the stability 756
property therefore includes such hypotheses. 757
We have not adapted the robust-NSP defined in [13]. The benefit in not using 758
this definition is to obtain a simpler definition for deep-NSP. In particular (7.3) does 759
not involve the geometry ofA in the orthogonal complement of Ker (A). Looking in 760
detail at the benefit of this adaptation is of great interest. 761
Finally, we trivially have the following facts: 762
• If Ker (A) = {0}, then Ker (A) satisfies the deep-NSP with respect to the 763
model RS×K with constant (1, +∞).
764
• For any γ′ ≥ γ: If Ker (A) satisfies the deep-NSP with respect to the collection
765
of modelsM with constants (γ, ρ), then Ker (A) satisfies the deep-NSP with 766
respect to the collection of modelsM with constants (γ′, ρ).
767
• For any M′ ⊂ M: If Ker (A) satisfies the deep-NSP with respect to the 768
collection of modelsM with constant (γ, ρ), then Ker (A) satisfies the deep-769
NSP with respect to the collection of models M′ with constants (γ, ρ). In 770
particular, if Ker (A) satisfies the deep-NSP with respect to the model RS×K
771
with constant (γ, ρ), it satisfies the deep-NSP with respect to any collection 772
of models, with constants (γ, ρ). 773
Theorem 7.2. Sufficient condition for the stability property 774
Assume Ker (A) satisfies the deep-NSP with respect to the collection of models 775
M and with the constants (γ, ρ). For any h∗ as in (7.2) with η and δ (see (7.2) and
776
(7.1)) such that δ + η≤ ρ, we have 777
kP (h∗)− P (h)k ≤ γ σmin
(δ + η), 778
where σmin is the smallest non-zero singular value of A. Moreover, if h ∈ RS×K∗ and 779 γ σmin (δ + η)≤ 1 2 max kP (h)k∞,kP (h∗)k∞ then 780 (7.5) dp(hh∗i, hhi) ≤ 7(KS)p1γ σmin minkP (h)kK1−1 ∞ ,kP (h∗)k 1 K−1 ∞ (δ + η). 781
The first part of the proof is very similar to standard proofs in the Compressed 782
Sensing and stable recovery literature. The second part simply uses Theorem 4.3. 783
The Theorem is proved in Appendix10.10. 784
Theorem 7.2provides a sufficient condition to obtain stability. The only signifi-785
cant hypothesis made on the deep structured linear network is that Ker (A) satisfies 786
the deep-NSP with respect to the collection of modelsM. One might ask whether 787
this hypothesis is sharp or not. The next theorem shows that the answer to this 788
question is positive. 789
Theorem 7.3. Necessary condition for the stability property 790
Assume the stability property holds: There exists C and δ > 0 such that for any 791
L∈ N, h ∈ ML, any X =AP (h) + e, with kek ≤ δ, any L∗∈ N and any h∗∈ ML∗
792 such that 793 kAP (h∗)− Xk2 ≤ kek 794 we have 795 d2(hh∗i, hhi) ≤ C min kP (h)kK1−1 ∞ ,kP (h∗)k 1 K−1 ∞ kek. 796
Then, Ker (A) satisfies the deep-NSP with respect to the collection of models M with constants
(γ, ρ) = (CSK−12 √K σ
max, δ)
where σmax is the spectral radius ofA.
797
The first part of the proof is inspired by and similar to the proof of the analogous 798
converse statement in [25]. The second part simply uses Theorem 4.5. The Theorem 799
is proved in Appendix 10.11. 800
The sharpness of the known results when K = 2 is usually argued by comparing 801
the number of samples necessary for the recovery and the information theoretic limit 802
of the problem. As far as the authors know, the above theorem is therefore new even 803
when K = 2. 804
As is usually the case with Null Space Property or Restricted Isometry Property, 805
it will often be difficult or impossible to establish that a particular operator A sat-806
isfies the deep-NSP with respect to the collection of models M. To find favorable 807
cases, we need to consider random operators A such that the distribution of A en-808
ables one to establish that the deep-NSP holds with high probability, when in the 809
right configurations (see the bibliography in Section2 whose references contain many 810
examples of such arguments). The most common distribution for the analog of A 811
include operators/matrices whose coefficients are Gaussian or Bernoulli. Collections 812
of models inducing sparsity, non-negativity, low-rank constraints etc. are the most 813
studied. In this regard, the fact that there exists a low complexity test guaranteeing 814
that the networks considered in Section8can be stably recovered is an exception. 815
8. Application to convolutional linear network. We consider a convolu-816
tional linear network as depicted in Figure1. The network typically aims at perform-817
ing a linear analysis or synthesis of a signal living in RN. The considered convolutional
818
linear network is defined from a rooted directed acyclic graph G(E, N ) composed of 819