
HAL Id: hal-03124591

https://hal.archives-ouvertes.fr/hal-03124591

Preprint submitted on 28 Jan 2021


A finite sample analysis of the benign overfitting phenomenon for ridge function estimation

Emmanuel Caron, Stéphane Chrétien

To cite this version:

Emmanuel Caron, Stéphane Chrétien. A finite sample analysis of the benign overfitting phenomenon for ridge function estimation. 2021. hal-03124591


A finite sample analysis of the benign overfitting phenomenon for ridge function estimation

Emmanuel Caron Stéphane Chrétien

Abstract

Recent extensive numerical experiments in large-scale machine learning have uncovered a quite counterintuitive phase transition, as a function of the ratio between the sample size and the number of parameters in the model. As the number of parameters p approaches the sample size n, the generalisation error (a.k.a. testing error) increases, but in many cases it starts decreasing again past the threshold p = n. This surprising phenomenon, brought to the theoretical community's attention in [5], has been thoroughly investigated lately, more specifically for simpler models than deep neural networks, such as the linear model when the parameter is taken to be the minimum norm solution to the least-squares problem, mostly in the asymptotic regime where p and n tend to +∞; see e.g. [13]. In the present paper, we propose a finite sample analysis of non-linear models of ridge type, in which we investigate the overparametrised regime of the double descent phenomenon for both the estimation problem and the prediction problem. Our results provide a precise analysis of the distance of the best estimator from the true parameter as well as a generalisation bound which complements the recent works of [4] and [8]. Our analysis is based on efficient but elementary tools closely related to the continuous Newton method [22].

Index Terms: Double descent, ridge function estimation, finite sample analysis.

I. INTRODUCTION

The tremendous achievements of deep learning models in the solution of complex prediction tasks have been the focus of great attention in the applied Computer Science, Artificial Intelligence and Statistics communities in recent years.

Many success stories related to the use of Deep Neural Networks have even been reported in the media, and no data scientist can ignore the Deep Learning tools available via open-source machine learning libraries such as TensorFlow, Keras, PyTorch and many others.

One of the key ingredients in their success is the huge number of parameters involved in all current architectures, a very counterintuitive approach that defies traditional statistical wisdom. Indeed, as intuition suggests, overparametrisation often results in interpolation, i.e. zero training error, and the expected outcome of this approach should be very poor generalisation performance. However, the main surprise came from the observation that interpolating networks can still generalise well, as shown in the following table (footnote 1).

model                              # param. p   p/n   train loss   test error
CudaConvNet                          145,578    2.9       0            23
CudaConvNet with regularization      145,578    2.9       0.34         18
MicroInception                     1,649,402     33       0            14
ResNet                             2,401,440     48       0            13

Belkin et al. [5] recently addressed the problem of resolving this paradox, and brought some new light on the relationship between interpolation and generalisation to unseen data. In the particular instance of kernel ridge regression, [17] proved that interpolation can coexist with good generalisation. In a subsequent line of work, connections between kernel regression and wide neural networks were extensively studied in [15], [12], [1], [6], providing additional motivation for a deeper understanding of the double descent phenomenon for kernel methods. Further motivation is provided by Chizat and Bach [9], who consider any nonlinear model of the form

$$\mathbb{E}[Y \mid X] = f(X;\theta)$$

E. Caron is with the Laboratoire de Mathématiques et ses Applications (LMA), Université d’Avignon Campus Jean-Henri Fabre 301, rue Baruch de Spinoza, BP 21239 F-84 916 AVIGNON Cedex 9. Most of this work was done while the author was with the ERIC Laboratory, University of Lyon 2, Lyon, France.

S. Chrétien is with the ERIC Laboratory and UFR ASSP, University of Lyon 2, 5 avenue Mendes France, 69676 Bron Cedex. He is invited researcher at the Mathematical Institute, University of Oxford, Oxford OX2 6GG, UK, the National Physical Laboratory, 1 Hampton Road, Teddington, TW11 0LW, UK, The Alan Turing Institute, British Library, 96 Euston Road, London NW1 2DB, UK, and FEMTO-ST Institute/Optics Department, 15B Avenue des Montboucons, F-25030 Besançon, France.

Footnote 1: the database is CIFAR-10; the setup has p parameters, n = 50,000 training samples and 10 classes [18].


with parameter $\theta \in \mathbb{R}^p$. As elegantly summarised in the introduction of [19], if we assume that p is so large that training by gradient flow moves each parameter by only a small amount with respect to some random initialisation $\theta_0 \in \mathbb{R}^p$, linearising the model around $\theta_0$ gives

$$\mathbb{E}[Y \mid X] \approx f(X;\theta_0) + \nabla_\theta f(X;\theta_0)^t\,\beta,$$

which leads, in the Empirical Risk Minimisation setting, to a simpler linear regression problem with high dimensional random features $\nabla_\theta f(X_i;\theta_0)$, $i = 1,\dots,n$, which owe their randomness to the randomness of the initialisation $\theta_0$. This approximation is now well known to miss the main features of deep neural networks [10], but it is still a good test bench for new methods of analysing the double descent phenomenon.
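To make the linearisation concrete, the following sketch (a minimal NumPy illustration of ours, not code from the paper; the one-hidden-layer network, its width and the data are arbitrary choices) builds the Jacobian features $\nabla_\theta f(X_i;\theta_0)$ at a random initialisation $\theta_0$ and fits the linear coefficient $\beta$ by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 10, 500          # samples, input dim, hidden width (p = m*d + m parameters)

# One-hidden-layer network f(x; W, a) = a^t tanh(W x), with parameters theta = (W, a).
W0 = rng.normal(size=(m, d)) / np.sqrt(d)   # random initialisation theta_0
a0 = rng.normal(size=m) / np.sqrt(m)

def jacobian_features(x):
    """Gradient of f(x; theta) with respect to theta, evaluated at theta_0."""
    h = np.tanh(W0 @ x)
    grad_a = h                                  # d f / d a
    grad_W = (a0 * (1 - h**2))[:, None] * x     # d f / d W, shape (m, d)
    return np.concatenate([grad_W.ravel(), grad_a])

X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)      # toy regression targets

Phi = np.stack([jacobian_features(x) for x in X])   # n x p random feature matrix
f0 = np.array([a0 @ np.tanh(W0 @ x) for x in X])    # network output at theta_0
beta, *_ = np.linalg.lstsq(Phi, y - f0, rcond=None) # linearised model: f0 + Phi beta
print("linearised model fitted, ||beta||_2 =", np.linalg.norm(beta))
```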

In this paper, we consider a statistical model of the form

$$\mathbb{E}[Y_i \mid X_i] = f(X_i^t\theta^*), \quad i = 1,\dots,n, \tag{1}$$

where $\theta^* \in \mathbb{R}^p$ and the function f is assumed increasing. The data $X_1,\dots,X_n$ will be assumed isotropic and subGaussian, and the observation errors will be assumed subGaussian as well. When the estimation of $\theta^*$ is performed using Empirical Risk Minimisation, i.e. by solving

$$\hat\theta = \operatorname{argmin}_{\theta\in\Theta}\ \frac{1}{n}\sum_{i=1}^n \ell\big(Y_i - f(X_i^t\theta)\big) \tag{2}$$

for a given smooth loss function $\ell$, we show that the double descent phenomenon takes place, and we give precise orders of dependency with respect to all the intrinsic parameters of the model, such as the dimensions n and p and various bounds on the derivatives of f and of the loss function $\ell$.

Our contribution is the first non-asymptotic analysis of the double descent phenomenon for non-linear models. Our results precisely characterise the proximity in $\ell_2$ of a certain solution $\hat\theta$ of (2) to $\theta^*$, from which the performance of the minimum norm solution follows naturally. Our proofs are very elementary, as they utilise an elegant continuous Newton method argument initially promoted by Neuberger in a series of papers intended to provide a new proof of the Nash-Moser theorem [7], [22].

II. MAIN RESULTS

In this section, we describe our mathematical model and set the notation.

A. Mathematical presentation of the problem

We assume that (1) holds and that f and the loss function $\ell : \mathbb{R} \to \mathbb{R}$ satisfy the following properties: f is increasing, $\ell'(0) = 0$, and $\ell''$ is upper bounded by a constant $C_{\ell''} > 0$.

Concerning the statistical data, we will assume that:

- the random vectors $X_1,\dots,X_n$ are independent subGaussian vectors in $\mathbb{R}^p$ with $\psi_2$-norm upper bounded by $K_X$, and are such that the matrix $X^t = [X_1,\dots,X_n]$ is full rank with probability one;
- for all $i = 1,\dots,n$, the random vectors $X_i$ are assumed
  – to have a second moment matrix equal to the identity, $\mathbb{E}[X_iX_i^t] = I_p$ (i.e. the $X_i$, $i = 1,\dots,n$, are isotropic),
  – to have $\ell_2$-norm equal to $\sqrt{p}$ (notice that this is different from the usual regression model, where the columns are assumed to be normalised);
- the errors $\varepsilon_i = Y_i - \mathbb{E}[Y_i \mid X_i]$ are independent subGaussian centered random variables with $\psi_2$-norm upper bounded by $K_\varepsilon$.

The performance of an estimator is often measured by the theoretical risk $R : \Theta \to \mathbb{R}$ defined by

$$R(\theta) = \mathbb{E}\big[\ell\big(Y - f(X^t\theta)\big)\big].$$

In order to estimate $\theta^*$, the Empirical Risk Minimizer $\hat\theta$ is defined as a solution to the following optimisation problem:

$$\hat\theta \in \operatorname{argmin}_{\theta\in\Theta}\ \hat R_n(\theta) \tag{3}$$

with

$$\hat R_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ell\big(Y_i - f(X_i^t\theta)\big).$$
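For concreteness, the following sketch (our own toy simulation, not part of the paper; the link $f(z) = z + \tanh(z)$, the quadratic loss and the dimensions are arbitrary choices that satisfy the assumptions above: f is increasing, $\ell'(0) = 0$ and $\ell''$ is bounded) generates data from model (1) and minimises $\hat R_n$ numerically.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 500, 20                                   # underparametrised example
theta_star = rng.normal(size=p) / np.sqrt(p)

f = lambda z: z + np.tanh(z)                     # increasing link, f' bounded by 2
X = rng.normal(size=(n, p))                      # isotropic subGaussian design
eps = 0.1 * rng.normal(size=n)                   # subGaussian noise
Y = f(X @ theta_star) + eps

def empirical_risk(theta):                       # hat R_n with quadratic loss
    return 0.5 * np.mean((Y - f(X @ theta)) ** 2)

theta_hat = minimize(empirical_risk, x0=np.zeros(p), method="BFGS").x
print("||theta_hat - theta_star||_2 =", np.linalg.norm(theta_hat - theta_star))
```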


B. Statement of our main theorems

Our main results are the following; their proofs are given in Section B and Section C of the appendix.

Theorem II-B.1 (Underparametrised setting). Assume that p and n are such that

$$C_{K_X}^2\, p < (1-\alpha)^2\, n.$$

Let

$$r = 6\sqrt{C}\, C_{\ell''}\, C_{f'}\, K_\varepsilon\, \delta^{-1}\, \frac{\sqrt{p}}{(1-\alpha)\sqrt{n} - C_{K_X}\sqrt{p}},$$

where C is a positive absolute constant. Assume that $f'(z) \le C_{f'}$ and that $\ell$ and f are such that

$$\big|\ell''(w)\, f'(z)^2 - \ell'(w)\, f''(z)\big| \ \ge\ \delta \ >\ 0$$

for all w and all z in $X B_2(\theta^*, r)$. Then, with probability at least

$$1 - 2\exp\big(-c_{K_X}\alpha^2 n\big) - \exp\Big(-\frac{p}{2}\Big),$$

the unique solution $\hat\theta$ of the optimisation problem (3) satisfies

$$\|\hat\theta - \theta^*\|_2 \le r.$$

Theorem II-B.2 (Overparametrised setting). Assume that p and n are such that

$$C_{K_X}^2\, n < (1-\alpha)^2\, p.$$

Let

$$r = 6\sqrt{C}\, C_{\ell''}\, C_{f'}\, K_\varepsilon\, \delta^{-1}\, \frac{\sqrt{n}}{(1-\alpha)\sqrt{p} - C_{K_X}\sqrt{n}},$$

where C is a positive absolute constant. Assume that $f'(z) \le C_{f'}$ and that $\ell$ and f are such that

$$\big|\ell''(w)\, f'(z)^2 - \ell'(w)\, f''(z)\big| \ \ge\ \delta \ >\ 0$$

for all w and all z in $X B_2(\theta^*, r)$. Then, there exists a first order stationary point $\hat\theta$ of the optimisation problem (3) such that, with probability larger than or equal to

$$1 - 2\exp\big(-c_{K_X}\alpha^2 p\big) - \exp\Big(-\frac{n}{2}\Big),$$

we have

$$\|\hat\theta - \theta^*\|_2 \le r.$$

We now establish the following generalisation bound.

Theorem II-B.3 (Generalisation bound). Let the assumptions of Theorem II-B.2 hold. Assume that the distribution of the data $X_i$, $i = 1,\dots,n$, has a compact support with $\epsilon\sqrt{p}$-covering number $N_X(\epsilon)$ for any $\epsilon > 0$. Let $X_{n+1}$ denote a random isotropic vector with $\ell_2$-norm $\sqrt{p}$ following the same distribution as $X_i$, $i = 1,\dots,n$, and let $p_{\min}$ denote the smallest probability, for that same distribution, among all the balls in the $\epsilon$-covering. Assume that

$$n \ \ge\ p_{\min}^{-1}\, \sum_{k=1}^{N_X(\epsilon)} \frac{\binom{N_X(\epsilon)}{k}}{k} \ +\ t\, \sqrt{2\, \sum_{k=1}^{N_X(\epsilon)} \frac{\binom{N_X(\epsilon)}{k}}{k^2}}$$

for some $t > 0$. Then, for any $\epsilon > 0$, there exists a minimum norm first order stationary point $\hat\theta^{\sharp}$ of problem (3) which satisfies

$$\left|\frac{1}{\sqrt{p}}\, X_{n+1}^t\hat\theta^{\sharp} - \frac{1}{\sqrt{p}}\, X_{n+1}^t\theta^*\right| \ \le\ (1+4\epsilon)\, \frac{6\sqrt{C}\, C_{\ell''}\, C_{f'}\, K_\varepsilon\, \delta^{-1}\, \sqrt{n}}{(1-\alpha)\sqrt{p} - C_{K_X}\sqrt{n}} \ +\ 4\epsilon\, \|\theta^*\|_2,$$

with probability larger than or equal to

$$1 - 2\exp\big(-c_{K_X}\alpha^2 p\big) - \exp\Big(-\frac{n}{2}\Big) - \frac{1}{t^2}.$$

Many recent works have studied the intrinsic dimension of various data sets arising in machine learning [11], [14], [2], [3], [21]. From this intriguing point of view, the sample distribution may have a much lower covering number than the dimension of the ambient space into which the data is embedded. Interesting work on empirical estimation of the packing number of a sample in machine learning can be found in [16].

C. The case of linear regression

In the linear case where $f(X_i^t\theta) = X_i^t\theta$ and the loss is quadratic, $\ell(z) = \frac{1}{2}z^2$, the optimisation problem (3) reads

$$\hat\theta = \operatorname{argmin}_{\theta\in\mathbb{R}^p}\ \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\big(Y_i - X_i^t\theta\big)^2.$$

We have $C_{f'} = 1$ and $C_{f^{(k)}} = 0$ for $k \ge 2$. The quadratic loss satisfies $\ell(z) = \frac{1}{2}z^2$, $\ell'(z) = z$, $\ell''(z) = 1$ and $C_{\ell^{(k)}} = 0$ for $k \ge 3$. We therefore have the following corollaries.

Corollary II-C.1 (Underparametrised case: linear model). Assume that p and n are such that

$$C_{K_X}^2\, p < (1-\alpha)^2\, n.$$

Let

$$r = \frac{6\sqrt{C}\, K_\varepsilon\, \sqrt{p}}{(1-\alpha)\sqrt{n} - C_{K_X}\sqrt{p}},$$

where C is a positive absolute constant. Then, with probability at least $1 - 2\exp(-c_{K_X}\alpha^2 n) - \exp(-p/2)$, the unique solution $\hat\theta$ of the optimisation problem (3) satisfies

$$\|\hat\theta - \theta^*\|_2 \le r.$$

Proof. Apply Theorem II-B.1 and use the fact that δ = 1 in the linear case.

Corollary II-C.2 (Overparametrised case: linear model). Assume that p and n are such that

$$C_{K_X}^2\, n < (1-\alpha)^2\, p.$$

Let

$$r = \frac{6\sqrt{C}\, K_\varepsilon\, \sqrt{n}}{(1-\alpha)\sqrt{p} - C_{K_X}\sqrt{n}},$$

where C is a positive absolute constant. Then, there exists a solution $\hat\theta$ of the optimisation problem (3) such that, with probability larger than or equal to $1 - 2\exp(-c_{K_X}\alpha^2 p) - \exp(-n/2)$, we have

$$\|\hat\theta - \theta^*\|_2 \le r.$$
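To make Corollary II-C.2 concrete, the sketch below (our own toy simulation; the dimensions and noise level are arbitrary) exhibits an interpolating solution of (3) close to $\theta^*$: among all interpolants of $X\theta = Y$, the one closest to $\theta^*$ is $\theta^* + X^\dagger\varepsilon$, whose distance to $\theta^*$ is $\|X^\dagger\varepsilon\|_2$, which is of the same order as the bound r of the corollary (absolute constants dropped).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 100, 2000, 0.5                      # overparametrised linear model
theta_star = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))                       # isotropic subGaussian rows
eps = sigma * rng.normal(size=n)
Y = X @ theta_star + eps

# Among all interpolants X theta = Y (global minimisers of (3) here),
# the one closest to theta_star is theta_star + X^+ eps.
theta_close = theta_star + np.linalg.pinv(X) @ eps

print("interpolates?            ", np.allclose(X @ theta_close, Y))
print("||theta_hat - theta*||_2 =", np.linalg.norm(theta_close - theta_star))
print("scale sigma*sqrt(n)/(sqrt(p)-sqrt(n)) =",
      sigma * np.sqrt(n) / (np.sqrt(p) - np.sqrt(n)))   # constants dropped
```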

Corollary II-C.3 (Generalisation bound). Let the assumptions of Corollary II-C.2 hold. Assume that the distribution of the data $X_i$, $i = 1,\dots,n$, has a compact support with $\epsilon\sqrt{p}$-covering number $N_X(\epsilon)$ for any $\epsilon > 0$. Let $X_{n+1}$ denote a random isotropic vector with $\ell_2$-norm $\sqrt{p}$ following the same distribution as $X_i$, $i = 1,\dots,n$, and let $p_{\min}$ denote the smallest probability, for that same distribution, among all the balls in the $\epsilon$-covering. Assume that

$$n \ \ge\ p_{\min}^{-1}\, \sum_{k=1}^{N_X(\epsilon)} \frac{\binom{N_X(\epsilon)}{k}}{k} \ +\ t\, \sqrt{2\, \sum_{k=1}^{N_X(\epsilon)} \frac{\binom{N_X(\epsilon)}{k}}{k^2}}$$

for some $t > 0$. Then, for any $\epsilon > 0$, there exists a minimum norm first order stationary point $\hat\theta^{\sharp}$ of problem (3) which satisfies

$$\left|\frac{1}{\sqrt{p}}\, X_{n+1}^t\hat\theta^{\sharp} - \frac{1}{\sqrt{p}}\, X_{n+1}^t\theta^*\right| \ \le\ (1+4\epsilon)\, \frac{6\sqrt{C}\, K_\varepsilon\, \sqrt{n}}{(1-\alpha)\sqrt{p} - C_{K_X}\sqrt{n}} \ +\ 4\epsilon\, \|\theta^*\|_2,$$

with probability larger than or equal to

$$1 - 2\exp\big(-c_{K_X}\alpha^2 p\big) - \exp\Big(-\frac{n}{2}\Big) - \frac{1}{t^2}.$$

D. Discussion of the results and new implications for some classical models

Theorem II-B.1 and Theorem II-B.2 provide a new finite sample analysis of the problem of estimating ridge functions in both underparametrised and overparametrised regimes, i.e. where the number of parameters is smaller (resp. larger) than the sample size.

Our analysis of the underparametrised setting shows that we can obtain an error of order less than or equal to $\sqrt{p/n}$ for all p up to an arbitrary fraction of n.

In the overparametrised setting, we get that the error bound is of order $\sqrt{n/p}$ and decreases as p grows to +∞. We therefore recover the "double descent phenomenon".
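The sketch below (our own illustration; the absolute constants, which the corollaries do not specify, are set to 1 and α to 0.1 purely for display) evaluates the shape of the error bounds of Corollaries II-C.1 and II-C.2 as p varies for fixed n, showing the blow-up near the interpolation threshold p ≈ n and the second descent beyond it.

```python
import numpy as np

n, alpha = 200, 0.1
C_abs, C_KX, K_eps = 1.0, 1.0, 1.0     # unspecified absolute constants set to 1 for display only

def r_bound(p):
    """Shape of r in Corollaries II-C.1 (p < n) and II-C.2 (p > n); blows up near p = n."""
    if p < n:
        denom = (1 - alpha) * np.sqrt(n) - C_KX * np.sqrt(p)
    else:
        denom = (1 - alpha) * np.sqrt(p) - C_KX * np.sqrt(n)
    if denom <= 0:
        return np.inf
    num = np.sqrt(p) if p < n else np.sqrt(n)
    return 6 * np.sqrt(C_abs) * K_eps * num / denom

for p in [20, 100, 180, 220, 500, 2000, 20000]:
    print(f"p = {p:6d}   r(p) ~ {r_bound(p):8.3f}")
```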

Similar but simpler bounds also hold in the linear model setting, as presented in Corollary II-C.1 and Corollary II-C.2.

Bounds on the generalisation error are provided in Theorem II-B.3 and Corollary II-C.3.

E. Comparison with previous results

Our results are based on a new zero-finding approach inspired by [22], and we obtain precise quantitative results in the finite sample setting for linear and non-linear models. Following the initial discovery of the "double descent phenomenon" in [5], many authors have addressed the question of precisely characterising the error decay as a function of the number of parameters, in the linear and non-linear settings (mostly based on random feature models). Some of the latest works [19] address the problem in the asymptotic regime. Recently, the finite sample analysis has been addressed in the very interesting works [4] and [8], for the linear model only. The works [4] and [8] give very precise upper and lower bounds on the prediction risk for general covariate covariance matrices under the subGaussian assumption.

In the present work, we show that similarly precise results can be obtained for non-linear models of the ridge function class, using elementary perturbation results and some (now) standard random matrix theory. Our results provide an explicit control of the distance between some empirical estimators and the ground truth in terms of the subGaussian norms of the error and of the covariate vector, in the case where the covariate vectors are assumed isotropic (more general results can easily be recovered by a simple change of variable). Our analysis is made very elementary by using Neuberger's result [22] and the subGaussian isotropic assumption, which allows us to leverage previous results of [24] on finite-dimensional random matrices with subGaussian rows or columns, depending on the setting (underparametrised vs overparametrised).

III. CONCLUSION AND PERSPECTIVES

This work presents a precise, quantitative, finite sample analysis of the double descent phenomenon in the estimation of linear and non-linear models. We make use of a zero-finding result of Neuberger [22] which can be applied to a large number of settings in machine learning.

Extending our work to the case of Deep Neural Networks is an exciting avenue for future research. We are currently working on the analysis of the double descent phenomenon in the case of Residual Neural Networks and we expect to post our new findings in the near future. Another possible direction is to include penalisation, which can be treated using the same techniques via the Karush-Kuhn-Tucker conditions. This applies to Ridge Regression and $\ell_1$-penalised estimation and makes a promising avenue for future investigations. Weakening the assumptions on our data, which are here of subGaussian type, could also lead to interesting new results; this could be achieved by utilising, e.g., the work of [20].

REFERENCES

[1] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.

[2] Laurent Amsaleg, Oussama Chelly, Teddy Furon, Stéphane Girard, Michael E. Houle, Ken-ichi Kawarabayashi, and Michael Nett. Estimating local intrinsic dimensionality. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38, 2015.

[3] Alessio Ansuini, Alessandro Laio, Jakob H. Macke, and Davide Zoccolan. Intrinsic dimension of data representations in deep neural networks. In Advances in Neural Information Processing Systems, pages 6109–6119, 2019.

[4] Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 2020.

[5] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.

[6] Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. arXiv preprint arXiv:1802.01396, 2018.

[7] Alfonso Castro and J. W. Neuberger. An inverse function theorem via continuous Newton's method. 2001.

[8] Geoffrey Chinot and Matthieu Lerasle. Benign overfitting in the large deviation regime. arXiv preprint arXiv:2003.05838, 2020.

[9] Lenaic Chizat and Francis Bach. A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956, 2018.

[10] Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. In Advances in Neural Information Processing Systems, pages 2933–2943, 2019.

[11] Jose A. Costa, Abhishek Girotra, and A. O. Hero. Estimating local intrinsic dimension with k-nearest neighbor graphs. In IEEE/SP 13th Workshop on Statistical Signal Processing, pages 417–422. IEEE, 2005.

[12] Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.

[13] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.

[14] Matthias Hein and Jean-Yves Audibert. Intrinsic dimensionality estimation of submanifolds in R^d. In Proceedings of the 22nd International Conference on Machine Learning, pages 289–296, 2005.

[15] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.

[16] Balázs Kégl. Intrinsic dimension estimation using packing numbers. In Advances in Neural Information Processing Systems, pages 697–704, 2003.

[17] Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel "ridgeless" regression can generalize. arXiv preprint arXiv:1808.00387, 2018.

[18] M. Soltanolkotabi. Towards demystifying overparameterization in deep learning. https://imaging-in-paris.github.io/semester2019/slides/w3/soltanolkotabi.pdf. Online; accessed 25 June 2020.

[19] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.

[20] Shahar Mendelson. Extending the small-ball method. arXiv preprint arXiv:1709.00843, 2017.

[21] Marc Mezard. Artificial intelligence: success, limits, myths and threats. Statistical physics and statistical inference (lecture 3). Video link, 2020.

[22] John W. Neuberger. The continuous Newton's method, inverse functions, and Nash-Moser. The American Mathematical Monthly, 114(5):432–437, 2007.

[23] Philippe Rigollet and Jan-Christian Hütter. High dimensional statistics. Lecture notes.

[24] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

[25] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

APPENDIX

A. Common framework: Chasing close-to-ideal solutions using Neuberger’s quantitative perturbation theorem

1) Neuberger’s theorem: The following theorem of Neuberger [22] will be instrumental in our study of the ERM. In our context, this theorem can be restated as follows.

Theorem A.1 (Neuberger's theorem for ERM). Suppose that $r > 0$, that $\theta^* \in \mathbb{R}^p$, and that the Jacobian $D\hat R_n(\cdot)$ is a continuous map on $B_r(\theta^*)$, with the property that for each $\theta$ in $B_r(\theta^*)$ there exists a vector d in $B_r(0)$ such that

$$\lim_{t\downarrow 0}\ \frac{D\hat R_n(\theta + td) - D\hat R_n(\theta)}{t} \ =\ -D\hat R_n(\theta^*). \tag{4}$$

Then there exists u in $B_r(\theta^*)$ such that $D\hat R_n(u) = 0$.

2) Computing the second derivative: Since the loss is twice differentiable, the empirical risk $\hat R_n$ is itself twice differentiable. The gradient of the empirical risk is given by

$$\nabla\hat R_n(\theta) = -\frac{1}{n}\sum_{i=1}^n \ell'\big(Y_i - f(X_i^t\theta)\big)\, f'(X_i^t\theta)\, X_i,$$

so that, at $\theta = \theta^*$, where $Y_i - f(X_i^t\theta^*) = \varepsilon_i$,

$$\nabla\hat R_n(\theta^*) = -\frac{1}{n}\, X^t D(\nu)\, \ell'(\varepsilon),$$

where $\ell'(\varepsilon)$ is to be understood componentwise and $D(\nu)$ is the diagonal matrix with entries $\nu_i = f'(X_i^t\theta^*)$.

The Hessian is given by

$$\nabla^2\hat R_n(\theta) = \frac{1}{n}\sum_{i=1}^n \Big[\ell''\big(Y_i - f(X_i^t\theta)\big)\, f'(X_i^t\theta)^2 - \ell'\big(Y_i - f(X_i^t\theta)\big)\, f''(X_i^t\theta)\Big]\, X_i X_i^t.$$

The condition we have to satisfy in order to use Neuberger's theorem, i.e. the version of (4) associated with our setting, is the following:

$$\nabla^2\hat R_n(\theta)\, d = -\nabla\hat R_n(\theta^*).$$

The Hessian matrix can be rewritten as

$$\nabla^2\hat R_n(\theta) = \frac{1}{n}\, X^t D(\mu)\, X, \tag{5}$$

where $D(\mu)$ is the diagonal matrix with entries

$$\mu_i = \ell''\big(Y_i - f(X_i^t\theta)\big)\, f'(X_i^t\theta)^2 - \ell'\big(Y_i - f(X_i^t\theta)\big)\, f''(X_i^t\theta).$$
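As a sanity check of these formulas, the sketch below (our own code; the link f, the quadratic loss and the dimensions are arbitrary choices) evaluates $\nabla\hat R_n$ and $\nabla^2\hat R_n$ through the expressions above and compares the gradient with a finite-difference approximation.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 8
X = rng.normal(size=(n, p))
theta_star = rng.normal(size=p)
f   = lambda z: z + np.tanh(z)                       # increasing link
fp  = lambda z: 1 + 1 / np.cosh(z)**2                # f'
fpp = lambda z: -2 * np.tanh(z) / np.cosh(z)**2      # f''
Y = f(X @ theta_star) + 0.1 * rng.normal(size=n)

# Quadratic loss: ell(z) = z^2/2, ell'(z) = z, ell''(z) = 1.
def risk(theta):
    return 0.5 * np.mean((Y - f(X @ theta)) ** 2)

def grad(theta):
    z = X @ theta
    res = Y - f(z)
    return -(1.0 / n) * X.T @ (res * fp(z))          # -(1/n) X^t D(nu) ell'(residual)

def hessian(theta):
    z = X @ theta
    res = Y - f(z)
    mu = fp(z) ** 2 - res * fpp(z)                   # ell''(res) f'^2 - ell'(res) f''
    return (1.0 / n) * X.T @ (mu[:, None] * X)       # (1/n) X^t D(mu) X

theta = rng.normal(size=p)
g_fd = np.array([(risk(theta + 1e-6 * e) - risk(theta - 1e-6 * e)) / 2e-6 for e in np.eye(p)])
print("max |grad - finite diff| =", np.abs(grad(theta) - g_fd).max())
print("Hessian shape:", hessian(theta).shape)
```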

B. Proof of Theorem II-B.1: The underparametrised case

1) A technical lemma:

Lemma B.1. For all $i = 1,\dots,n$, the variable $\ell'(\varepsilon_i)$ is subGaussian, with variance proxy $\|\ell'(\varepsilon_i)\|_{\psi_2} \le C_{\ell''}\,\|\varepsilon_i\|_{\psi_2}$.

Proof. Let us compute

$$\|\ell'(\varepsilon_i)\|_{\psi_2} = \sup_{\gamma\ge 1}\ \gamma^{-1/2}\,\mathbb{E}\big[|\ell'(\varepsilon_i)|^\gamma\big]^{1/\gamma}.$$

The Lipschitz property of $\ell'$ implies that

$$|\ell'(\varepsilon_i) - \ell'(0)| \ \le\ C_{\ell''}\,|\varepsilon_i - 0|,$$

and since $\ell'(0) = 0$, we get $|\ell'(\varepsilon_i)| \le C_{\ell''}|\varepsilon_i|$, which implies that

$$\|\ell'(\varepsilon_i)\|_{\psi_2} \ \le\ C_{\ell''}\ \sup_{\gamma\ge 1}\ \gamma^{-1/2}\,\mathbb{E}\big[|\varepsilon_i|^\gamma\big]^{1/\gamma}.$$

Thus,

$$\|\ell'(\varepsilon_i)\|_{\psi_2} \ \le\ C_{\ell''}\,\|\varepsilon_i\|_{\psi_2} \ <\ +\infty,$$

as announced.

2) Key lemma: Using Corollary E.2, we have $s_{\min}(X) > 0$ as long as

$$C_{K_X}^2\, p < (1-\alpha)^2\, n \tag{6}$$

for some positive constant $C_{K_X}$ depending on the subGaussian norm $K_X$ of $X_1,\dots,X_n$.

Lemma B.2. With probability larger than or equal to

$$1 - \exp\Big(-\frac{p}{2}\Big), \tag{7}$$

we have

$$\|d\|_2 \ \le\ \frac{6\sqrt{C}\, C_{\ell''}\, C_{f'}\, K_\varepsilon\, \sqrt{p}}{\delta\, s_{\min}(X)}.$$

Proof. We recall that the Jacobian vector is

$$\nabla\hat R_n(\theta^*) = -\frac{1}{n}\, X^t D(\nu)\,\ell'(\varepsilon),$$

where $\ell'(\varepsilon)$ is to be understood componentwise and $D(\nu)$ is the diagonal matrix with $\nu_i = f'(X_i^t\theta^*)$ for all $i = 1,\dots,n$. The Hessian matrix is

$$\nabla^2\hat R_n(\theta) = \frac{1}{n}\, X^t D(\mu)\, X,$$

where $D(\mu)$ is the diagonal matrix given, for all $i = 1,\dots,n$, by

$$\mu_i = \ell''\big(Y_i - f(X_i^t\theta)\big)\, f'(X_i^t\theta)^2 - \ell'\big(Y_i - f(X_i^t\theta)\big)\, f''(X_i^t\theta).$$

We now compute the solution d of Neuberger's equation (4), $\nabla^2\hat R_n(\theta)\, d = -\nabla\hat R_n(\theta^*)$, i.e.

$$X^t D(\mu)\, X\, d \ =\ X^t D(\nu)\,\ell'(\varepsilon).$$

The singular value decomposition of X gives $X = U\Sigma V^t$, where $U\in\mathbb{R}^{n\times p}$ is a matrix whose columns form an orthonormal family, $V\in\mathbb{R}^{p\times p}$ is an orthogonal matrix, and $\Sigma\in\mathbb{R}^{p\times p}$ is diagonal and invertible. Thus, we obtain the equivalent equation

$$V\Sigma U^t D(\mu)\, U \Sigma V^t\, d \ =\ V\Sigma U^t D(\nu)\,\ell'(\varepsilon).$$

Using the invertibility of V and $\Sigma$, this is equivalent to

$$U^t D(\mu)\, U\, \Sigma V^t d \ =\ U^t D(\nu)\,\ell'(\varepsilon),$$

and thus

$$V^t d \ =\ \Sigma^{-1}\big(U^t D(\mu) U\big)^{-1} U^t D(\nu)\,\ell'(\varepsilon).$$

Then, as $\|d\|_2 = \|V^t d\|_2$,

$$\|d\|_2 = \big\|\Sigma^{-1}\big(U^t D(\mu) U\big)^{-1} U^t D(\nu)\,\ell'(\varepsilon)\big\|_2 \ \le\ \big\|\Sigma^{-1}\big\|_2\, \big\|\big(U^t D(\mu) U\big)^{-1} U^t D(\nu)\,\ell'(\varepsilon)\big\|_2 \ \le\ \frac{1}{s_{\min}(X)}\, \big\|U^t D(\mu)^{-1} D(\nu)\,\ell'(\varepsilon)\big\|_2. \tag{8}$$

Using maximal inequalities, we can now bound the deviation probability of

$$\big\|U^t D(\mu)^{-1} D(\nu)\,\ell'(\varepsilon)\big\|_2 = \max_{\|u\|_2 = 1}\ u^t U^t D(\mu)^{-1} D(\nu)\,\ell'(\varepsilon),$$

with $u\in\mathbb{R}^p$. For this purpose, we first prove that $U^t D(\mu)^{-1} D(\nu)\,\ell'(\varepsilon)$ is a subGaussian vector and provide an explicit upper bound on its norm. Since U is an $n\times p$ matrix whose columns form an orthonormal family, $u^t U^t$ is a unit $\ell_2$-norm vector of size n, which we denote by w. Then,

$$u^t U^t D(\mu)^{-1} D(\nu)\,\ell'(\varepsilon) = \sum_{i=1}^n w_i\, \frac{\nu_i}{\mu_i}\, \ell'(\varepsilon_i),$$

and since $\ell'(\varepsilon_i)$ is centered and subGaussian for all $i = 1,\dots,n$, Vershynin's [25, Proposition 2.6.1] gives

$$\Big\|u^t U^t D(\mu)^{-1} D(\nu)\,\ell'(\varepsilon)\Big\|_{\psi_2}^2 \ \le\ C\ \sum_{i=1}^n \Big\|w_i\, \frac{\nu_i}{\mu_i}\, \ell'(\varepsilon_i)\Big\|_{\psi_2}^2.$$

Since

$$\Big\|w_i\, \frac{\nu_i}{\mu_i}\, \ell'(\varepsilon_i)\Big\|_{\psi_2}^2 \ \le\ |w_i|^2\ \max_{i' = 1,\dots,n} \Big\|\frac{\nu_{i'}}{\mu_{i'}}\, \ell'(\varepsilon_{i'})\Big\|_{\psi_2}^2,$$

we have

$$\Big\|u^t U^t D(\mu)^{-1} D(\nu)\,\ell'(\varepsilon)\Big\|_{\psi_2}^2 \ \le\ C\, \|w\|_2^2\ \max_{i' = 1,\dots,n} \Big\|\frac{\nu_{i'}}{\mu_{i'}}\, \ell'(\varepsilon_{i'})\Big\|_{\psi_2}^2,$$

for some absolute constant $C > 0$. As $\|w\|_2 = 1$, we get that for all $u\in\mathbb{R}^p$ with $\|u\|_2 = 1$,

$$\Big\|u^t U^t D(\mu)^{-1} D(\nu)\,\ell'(\varepsilon)\Big\|_{\psi_2}^2 \ \le\ C\ \max_{i' = 1,\dots,n} \Big\|\frac{\nu_{i'}}{\mu_{i'}}\, \ell'(\varepsilon_{i'})\Big\|_{\psi_2}^2.$$

We deduce that $U^t D(\mu)^{-1} D(\nu)\,\ell'(\varepsilon)$ is a subGaussian random vector with variance proxy

$$C\ \max_{i' = 1,\dots,n} \Big\|\frac{\nu_{i'}}{\mu_{i'}}\, \ell'(\varepsilon_{i'})\Big\|_{\psi_2}^2.$$

Using the maximal inequality from [23, Theorem 1.19] and the subGaussian properties described in [25, Proposition 2.5.2], we get, with probability at least $1 - \exp\big(-\frac{\eta^2}{2}\big)$,

$$\big\|U^t D(\mu)^{-1} D(\nu)\,\ell'(\varepsilon)\big\|_2 \ \le\ 2\sqrt{C}\ \max_{i' = 1,\dots,n} \Big\|\frac{\nu_{i'}}{\mu_{i'}}\, \ell'(\varepsilon_{i'})\Big\|_{\psi_2}\ \big(2\sqrt{p} + \eta\big). \tag{9}$$

Following Lemma B.1, we deduce that $\frac{\nu_{i'}}{\mu_{i'}}\, \ell'(\varepsilon_{i'})$ is a subGaussian random variable with variance proxy

$$\Big\|\frac{\nu_{i'}}{\mu_{i'}}\, \ell'(\varepsilon_{i'})\Big\|_{\psi_2} \ \le\ \frac{\nu_{i'}}{\mu_{i'}}\, C_{\ell''}\, \|\varepsilon_{i'}\|_{\psi_2} \ \le\ C_{\ell''}\, \frac{\max_{i' = 1,\dots,n} \nu_{i'}}{\min_{i' = 1,\dots,n} \mu_{i'}}\, K_\varepsilon.$$

Then, restricting a priori $\theta$ to lie in the ball $B_2(\theta^*, r)$, which is coherent with the assumptions of Neuberger's Theorem A.1, and using the assumption that $|\ell''(w)\, f'(z)^2 - \ell'(w)\, f''(z)| \ge \delta$ for all w and all $z\in X B_2(\theta^*, r)$, together with the boundedness of $f'$, we get

$$\frac{\max_{i' = 1,\dots,n} \nu_{i'}}{\min_{i' = 1,\dots,n} \mu_{i'}} \ \le\ \frac{C_{f'}}{\delta}.$$

Subsequently, the bound (9) becomes

$$\big\|U^t D(\mu)^{-1} D(\nu)\,\ell'(\varepsilon)\big\|_2 \ \le\ 2\sqrt{C}\, C_{\ell''}\, \frac{\max_{i'} \nu_{i'}}{\min_{i'} \mu_{i'}}\, K_\varepsilon\, \big(2\sqrt{p} + \eta\big) \ \le\ \frac{2\sqrt{C}\, C_{\ell''}\, C_{f'}\, K_\varepsilon\, \big(2\sqrt{p} + \eta\big)}{\delta},$$

with probability at least $1 - \exp\big(-\frac{\eta^2}{2}\big)$. Taking $\eta = \sqrt{p}$, equation (8) yields

$$\|d\|_2 = \Big\|\big(\nabla^2\hat R_n(\theta)\big)^{-1}\, \nabla\hat R_n(\theta^*)\Big\|_2 \ \le\ \frac{6\sqrt{C}\, C_{\ell''}\, C_{f'}\, K_\varepsilon\, \sqrt{p}}{\delta\, s_{\min}(X)},$$

with probability at least $1 - \exp\big(-\frac{p}{2}\big)$.
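The following sketch (our own numerical illustration with arbitrary toy choices of f, loss and dimensions) solves the equation $X^t D(\mu) X d = X^t D(\nu)\ell'(\varepsilon)$ through the SVD of X, exactly as in the computation above, and compares $\|d\|_2$ with the factor $\sqrt{p}/s_{\min}(X)$ that controls it.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 400, 40                                   # underparametrised regime n > p
X = rng.normal(size=(n, p))
theta_star = rng.normal(size=p)
f   = lambda z: z + np.tanh(z)
fp  = lambda z: 1 + 1 / np.cosh(z)**2
fpp = lambda z: -2 * np.tanh(z) / np.cosh(z)**2
eps = 0.1 * rng.normal(size=n)
Y = f(X @ theta_star) + eps

theta = theta_star                               # evaluate the Neuberger direction at theta*
z = X @ theta
nu = fp(z)                                       # D(nu) entries
mu = fp(z)**2 - (Y - f(z)) * fpp(z)              # D(mu) entries (quadratic loss)
rhs_vec = nu * eps                               # D(nu) ell'(eps), with ell'(e) = e

# Solve X^t D(mu) X d = X^t D(nu) ell'(eps) via the thin SVD X = U S V^t.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
M = U.T @ (mu[:, None] * U)                      # U^t D(mu) U  (p x p)
d = Vt.T @ (np.linalg.solve(M, U.T @ rhs_vec) / S)

print("residual of the equation:",
      np.linalg.norm(X.T @ (mu * (X @ d)) - X.T @ rhs_vec))
print("||d||_2 =", np.linalg.norm(d),
      "   sqrt(p)/s_min(X) =", np.sqrt(p) / S.min())
```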

3) End of the proof of Theorem II-B.1: In order to complete the proof of Theorem II-B.1, note that, using Corollary E.2, we have, with probability $1 - 2\exp(-c_{K_X}\alpha^2 n)$,

$$s_{\min}(X) \ \ge\ (1-\alpha)\sqrt{n} - C_{K_X}\sqrt{p}.$$

Therefore, with probability larger than or equal to

$$1 - 2\exp\big(-c_{K_X}\alpha^2 n\big) - \exp\Big(-\frac{p}{2}\Big),$$

we have

$$\|d\|_2 \ \le\ \frac{6\sqrt{C}\, C_{\ell''}\, C_{f'}\, K_\varepsilon\, \sqrt{p}}{\delta\big((1-\alpha)\sqrt{n} - C_{K_X}\sqrt{p}\big)}.$$

Hence, the proof of Theorem II-B.1 is completed.

C. Proof of Theorem II-B.2: The overparametrised case

1) Key lemma: Using Corollary E.4, we have $s_{\min}(X^t) > 0$ as long as

$$C_{K_X}^2\, n < (1-\alpha)^2\, p$$

for some positive constant $C_{K_X}$ depending on the subGaussian norm $K_X$ of $X_1,\dots,X_n$.

Lemma C.1. With probability larger than or equal to

$$1 - \exp\Big(-\frac{n}{2}\Big),$$

we have

$$\|d\|_2 \ \le\ \frac{6\sqrt{C}\, C_{\ell''}\, C_{f'}\, K_\varepsilon\, \sqrt{n}}{\delta\, s_{\min}(X^t)}.$$

Proof. We recall that the Jacobian vector is

$$\nabla\hat R_n(\theta^*) = -\frac{1}{n}\, X^t D(\nu)\,\ell'(\varepsilon),$$

where $\ell'(\varepsilon)$ is to be understood componentwise and $D(\nu)$ is the diagonal matrix with $\nu_i = f'(X_i^t\theta^*)$. The Hessian matrix is

$$\nabla^2\hat R_n(\theta) = \frac{1}{n}\, X^t D(\mu)\, X,$$

where $D(\mu)$ is the diagonal matrix given by

$$\mu_i = \ell''\big(Y_i - f(X_i^t\theta)\big)\, f'(X_i^t\theta)^2 - \ell'\big(Y_i - f(X_i^t\theta)\big)\, f''(X_i^t\theta).$$

As in the underparametrised case, we have to solve (4), i.e.

$$\frac{1}{n}\, X^t D(\mu)\, X\, d \ =\ \frac{1}{n}\, X^t D(\nu)\,\ell'(\varepsilon),$$

which can be solved by finding the least norm solution of the interpolation problem

$$D(\mu)\, X\, d \ =\ D(\nu)\,\ell'(\varepsilon),$$

i.e.

$$d \ =\ X^{\dagger}\, D(\mu)^{-1} D(\nu)\,\ell'(\varepsilon).$$

Given the compact SVD of $X = U\Sigma V^t$, where $U\in O(n)$ and $V\in\mathbb{R}^{p\times n}$ has orthonormal columns, we get

$$d \ =\ V\Sigma^{-1} U^t D(\mu)^{-1} D(\nu)\,\ell'(\varepsilon).$$

We then have

$$\|d\|_2 \ =\ \big\|V\Sigma^{-1} U^t D(\mu)^{-1} D(\nu)\,\ell'(\varepsilon)\big\|_2.$$

As $\|Vx\|_2 = \|x\|_2$ for all $x\in\mathbb{R}^n$,

$$\|d\|_2 \ =\ \big\|\Sigma^{-1} U^t D(\mu)^{-1} D(\nu)\,\ell'(\varepsilon)\big\|_2 \ \le\ \frac{1}{s_{\min}(X^t)}\, \big\|U^t D(\mu)^{-1} D(\nu)\,\ell'(\varepsilon)\big\|_2.$$

Then, using the same bound as in the underparametrised case,

$$\big\|U^t D(\mu)^{-1} D(\nu)\,\ell'(\varepsilon)\big\|_2 \ \le\ \frac{2\sqrt{C}\, C_{\ell''}\, C_{f'}\, K_\varepsilon\, \big(2\sqrt{n} + \eta\big)}{\delta},$$

with probability at least $1 - \exp\big(-\frac{\eta^2}{2}\big)$. Taking $\eta = \sqrt{n}$, we get

$$\|d\|_2 \ \le\ \frac{6\sqrt{C}\, C_{\ell''}\, C_{f'}\, K_\varepsilon\, \sqrt{n}}{\delta\, s_{\min}(X^t)}, \tag{10}$$

with probability at least $1 - \exp\big(-\frac{n}{2}\big)$.
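Analogously, in the overparametrised regime the sketch below (again our own toy illustration, with arbitrary choices of f, loss and dimensions) computes the least norm solution $d = X^{\dagger} D(\mu)^{-1} D(\nu)\ell'(\varepsilon)$ of the interpolation problem with the pseudo-inverse and checks that it also solves the normal equation $\frac{1}{n}X^t D(\mu) X d = \frac{1}{n}X^t D(\nu)\ell'(\varepsilon)$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 400                                   # overparametrised regime p > n
X = rng.normal(size=(n, p))
theta_star = rng.normal(size=p) / np.sqrt(p)
f   = lambda z: z + np.tanh(z)
fp  = lambda z: 1 + 1 / np.cosh(z)**2
fpp = lambda z: -2 * np.tanh(z) / np.cosh(z)**2
eps = 0.1 * rng.normal(size=n)
Y = f(X @ theta_star) + eps

z = X @ theta_star
nu = fp(z)
mu = fp(z)**2 - (Y - f(z)) * fpp(z)              # quadratic loss: ell'(e) = e, ell'' = 1
target = (nu / mu) * eps                         # D(mu)^{-1} D(nu) ell'(eps)

d = np.linalg.pinv(X) @ target                   # least norm solution of D(mu) X d = D(nu) ell'(eps)

print("interpolation residual  :", np.linalg.norm(mu * (X @ d) - nu * eps))
print("normal-equation residual:",
      np.linalg.norm(X.T @ (mu * (X @ d)) / n - X.T @ (nu * eps) / n))
print("||d||_2 =", np.linalg.norm(d))
```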

2) End of the proof of Theorem II-B.2: In order to complete the proof of Theorem II-B.2, note that, using Corollary E.4, we have, with probability $1 - 2\exp(-c_{K_X}\alpha^2 p)$,

$$s_{\min}(X^t) \ \ge\ (1-\alpha)\sqrt{p} - C_{K_X}\sqrt{n}. \tag{11}$$

Therefore, with probability larger than or equal to

$$1 - 2\exp\big(-c_{K_X}\alpha^2 p\big) - \exp\Big(-\frac{n}{2}\Big),$$

we have

$$\|d\|_2 \ \le\ \frac{6\sqrt{C}\, C_{\ell''}\, C_{f'}\, K_\varepsilon\, \sqrt{n}}{\delta\big((1-\alpha)\sqrt{p} - C_{K_X}\sqrt{n}\big)}.$$

Hence the proof of Theorem II-B.2 is completed.

D. Proof of Theorem II-B.3

By the bounds on the expectation and the variance for the coupon collector problem recalled in Section F, with probability larger than $1 - 1/t^2$, it is sufficient to consider a sample of size n satisfying

$$n \ \ge\ p_{\min}^{-1}\, \sum_{k=1}^{N_X(\epsilon)} \frac{\binom{N_X(\epsilon)}{k}}{k} \ +\ t\, \sqrt{2\, \sum_{k=1}^{N_X(\epsilon)} \frac{\binom{N_X(\epsilon)}{k}}{k^2}}$$

in order to ensure that each ball in the $\epsilon\sqrt{p}$-covering of the support of the distribution of X contains at least one of the samples $X_1,\dots,X_n$. On the other hand, any additional measurement $X_{n+1}$ is at most $2\epsilon\sqrt{p}$ close to one of the samples $X_1,\dots,X_n$, say $X_{i_{n+1}}$. Now, let $\hat\theta^{\sharp}$ denote the minimum norm solution to $X\theta = X\hat\theta$, i.e.

$$\hat\theta^{\sharp} \ =\ \operatorname{argmin}_{\theta}\ \|\theta\|_2 \quad \text{subject to} \quad X\theta = X\hat\theta. \tag{12}$$

Then,

$$\big|X_{n+1}^t(\hat\theta^{\sharp} - \hat\theta)\big| \ \le\ \big|X_{i_{n+1}}^t(\hat\theta^{\sharp} - \hat\theta)\big| + \big|(X_{n+1} - X_{i_{n+1}})^t(\hat\theta^{\sharp} - \hat\theta)\big| \ \le\ \underbrace{\big|X_{i_{n+1}}^t(\hat\theta^{\sharp} - \hat\theta)\big|}_{= 0 \text{ by } (12)} + 2\epsilon\sqrt{p}\, \|\hat\theta^{\sharp} - \hat\theta\|_2,$$

which gives, given that $\|\hat\theta^{\sharp}\|_2 \le \|\hat\theta\|_2$,

$$\big|X_{n+1}^t(\hat\theta^{\sharp} - \hat\theta)\big| \ \le\ 4\epsilon\sqrt{p}\, \|\hat\theta\|_2.$$

Moreover, since

$$\|\hat\theta\|_2 \ \le\ \|\theta^*\|_2 + \|\hat\theta - \theta^*\|_2$$

and since, by Theorem II-B.2,

$$\|\hat\theta - \theta^*\|_2 \ \le\ \frac{6\sqrt{C}\, C_{\ell''}\, C_{f'}\, K_\varepsilon\, \sqrt{n}}{\delta\big((1-\alpha)\sqrt{p} - C_{K_X}\sqrt{n}\big)}$$

with probability at least

$$1 - 2\exp\big(-c_{K_X}\alpha^2 p\big) - \exp\Big(-\frac{n}{2}\Big),$$

we get

$$\big|X_{n+1}^t(\hat\theta^{\sharp} - \hat\theta)\big| \ \le\ 4\epsilon\sqrt{p}\, \left(\|\theta^*\|_2 + \frac{6\sqrt{C}\, C_{\ell''}\, C_{f'}\, K_\varepsilon\, \sqrt{n}}{\delta\big((1-\alpha)\sqrt{p} - C_{K_X}\sqrt{n}\big)}\right)$$

with probability

$$1 - 2\exp\big(-c_{K_X}\alpha^2 p\big) - \exp\Big(-\frac{n}{2}\Big) - \frac{1}{t^2}.$$

Using the triangle inequality

$$\big|X_{n+1}^t(\hat\theta^{\sharp} - \theta^*)\big| \ \le\ \big|X_{n+1}^t(\hat\theta^{\sharp} - \hat\theta)\big| + \big|X_{n+1}^t(\hat\theta - \theta^*)\big|$$

and the fact that, since $\|X_{n+1}\|_2 \le \sqrt{p}$,

$$\big|X_{n+1}^t(\hat\theta - \theta^*)\big| \ \le\ \sqrt{p}\ \frac{6\sqrt{C}\, C_{\ell''}\, C_{f'}\, K_\varepsilon\, \sqrt{n}}{\delta\big((1-\alpha)\sqrt{p} - C_{K_X}\sqrt{n}\big)},$$

the proof is completed.
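The minimum norm solution of (12) has the closed form $\hat\theta^{\sharp} = X^{\dagger} X \hat\theta$, the orthogonal projection of $\hat\theta$ onto the row space of X; the sketch below (our own check, with arbitrary dimensions and an arbitrary $\hat\theta$) verifies the two properties used above, $X\hat\theta^{\sharp} = X\hat\theta$ and $\|\hat\theta^{\sharp}\|_2 \le \|\hat\theta\|_2$.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 40, 400
X = rng.normal(size=(n, p))
theta_hat = rng.normal(size=p)                      # stands in for a first order stationary point

theta_sharp = np.linalg.pinv(X) @ (X @ theta_hat)   # min-norm solution of X theta = X theta_hat

print("X theta_sharp == X theta_hat ?", np.allclose(X @ theta_sharp, X @ theta_hat))
print("||theta_sharp||_2 =", np.linalg.norm(theta_sharp),
      "  <=  ||theta_hat||_2 =", np.linalg.norm(theta_hat))
```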

E. Classical bounds on the extreme singular values of finite dimensional random matrices

1) Random matrices with independent rows: Recall that the matrix X is composed of n i.i.d. subGaussian random vectors in $\mathbb{R}^p$, with $K_X = \max_i \|X_i\|_{\psi_2}$. In the underparametrised case, we have $n > p$. Let us recall the following bound on the singular values of a matrix with independent subGaussian rows.

Theorem E.1 ([24, Theorem 5.39]). Let X be an $n\times p$ matrix whose rows $X_i$, $i = 1,\dots,n$, are independent sub-Gaussian isotropic random vectors in $\mathbb{R}^p$. Then for every $t \ge 0$, with probability at least $1 - 2\exp(-c_{K_X} t^2)$, one has

$$\sqrt{n} - C_{K_X}\sqrt{p} - t \ \le\ s_{\min}(X) \ \le\ s_{\max}(X) \ \le\ \sqrt{n} + C_{K_X}\sqrt{p} + t,$$

where $C_{K_X}, c_{K_X} > 0$ depend only on the subGaussian norm $K_X = \max_i \|X_i\|_{\psi_2}$ of the rows.

In the main text, we use the following corollary.

Corollary E.2. Suppose that $t \ge \alpha\sqrt{n}$ with $\alpha > 0$. Under the same assumptions as in Theorem E.1, with probability equal to or larger than $1 - 2\exp(-c_{K_X}\alpha^2 n)$,

$$s_{\min}(X) \ \ge\ (1-\alpha)\sqrt{n} - C_{K_X}\sqrt{p}, \qquad s_{\max}(X) \ \le\ (1+\alpha)\sqrt{n} + C_{K_X}\sqrt{p}.$$

2) Random matrices with independent columns: In the overparametrised case, the following theorem of Vershynin will be instrumental.

Theorem E.3 ([24, Theorem 5.58]). Let $X^t$ be a $p\times n$ matrix with $p \ge n$ whose columns $X_i$, $i = 1,\dots,n$, are independent sub-Gaussian isotropic random vectors in $\mathbb{R}^p$ with $\|X_i\|_2 = \sqrt{p}$ a.s. Then for every $t \ge 0$, with probability at least $1 - 2\exp(-c_{K_X} t^2)$, one has

$$\sqrt{p} - C_{K_X}\sqrt{n} - t \ \le\ s_{\min}(X^t) \ \le\ s_{\max}(X^t) \ \le\ \sqrt{p} + C_{K_X}\sqrt{n} + t,$$

where $C_{K_X}, c_{K_X} > 0$ depend only on the subGaussian norm $K_X = \max_i \|X_i\|_{\psi_2}$ of the columns.

In the main text, we use the following corollary.

Corollary E.4. Suppose that $t \ge \alpha\sqrt{p}$ with $\alpha > 0$. Under the same assumptions as in Theorem E.3, with probability equal to or larger than $1 - 2\exp(-c_{K_X}\alpha^2 p)$,

$$s_{\min}(X^t) \ \ge\ (1-\alpha)\sqrt{p} - C_{K_X}\sqrt{n}, \qquad s_{\max}(X^t) \ \le\ (1+\alpha)\sqrt{p} + C_{K_X}\sqrt{n}.$$
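A quick empirical check of these bounds (our own simulation, using standard Gaussian vectors, which are subGaussian and isotropic):

```python
import numpy as np

rng = np.random.default_rng(7)
for n, p in [(2000, 200), (200, 2000)]:           # underparametrised, then overparametrised
    X = rng.normal(size=(n, p))                   # i.i.d. standard Gaussian entries
    s = np.linalg.svd(X, compute_uv=False)        # nonzero singular values of X (and of X^t)
    print(f"n={n:5d}, p={p:5d}:  s_min={s.min():6.1f}  "
          f"(about |sqrt(n)-sqrt(p)| = {abs(np.sqrt(n)-np.sqrt(p)):.1f}),  "
          f"s_max={s.max():6.1f}  (about sqrt(n)+sqrt(p) = {np.sqrt(n)+np.sqrt(p):.1f})")
```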

F. The coupon collector problem

There are K coupons in a collection. A collector purchases coupons at random until he completes the collection. At each round, coupon k is revealed with probability $p_k$, $k = 1,\dots,K$. Let N be the number of coupons he will need to collect before he has at least one coupon of each type. Then the expectation and the second moment of N are given by

$$\mathbb{E}[N] = \int_0^{\infty}\Big(1 - \prod_{k=1}^K\big(1 - e^{-p_k t}\big)\Big)\, dt,$$

which can be written

$$\mathbb{E}[N] = \sum_k \frac{1}{p_k} - \sum_{k<k'} \frac{1}{p_k + p_{k'}} + \cdots + (-1)^{K-1}\, \frac{1}{p_1 + \cdots + p_K},$$

and

$$\mathbb{E}[N^2] = \int_0^{\infty} 2t\, \mathbb{P}(N > t)\, dt = \int_0^{\infty} 2t\, \Big(1 - \prod_{k=1}^K\big(1 - e^{-p_k t}\big)\Big)\, dt,$$

which can be written

$$\mathbb{E}[N^2] = 2\left(\sum_k \frac{1}{p_k^2} - \sum_{k<k'} \frac{1}{(p_k + p_{k'})^2} + \cdots + (-1)^{K-1}\, \frac{1}{(p_1 + \cdots + p_K)^2}\right).$$

Define $p_{\min} = \min_{k=1,\dots,K} p_k$ and $p_{\max} = \max_{k=1,\dots,K} p_k$. Then we have

$$\mathbb{E}[N] \ \le\ p_{\min}^{-1}\, \sum_{k=1}^K \frac{\binom{K}{k}}{k} \qquad \text{and} \qquad \mathrm{Var}(N) \ \le\ \mathbb{E}[N^2] \ \le\ 2\, p_{\min}^{-2}\, \sum_{k=1}^K \frac{\binom{K}{k}}{k^2}.$$

As a result, Chebyshev's inequality gives

$$\mathbb{P}\left(N \ \ge\ p_{\min}^{-1}\, \sum_{k=1}^K \frac{\binom{K}{k}}{k} + t\, \sqrt{2\, \sum_{k=1}^K \frac{\binom{K}{k}}{k^2}}\right) \ \le\ \frac{1}{t^2}.$$
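A small simulation (our own; the number of coupons and the probabilities $p_k$ are arbitrary choices) comparing the empirical mean of N with the upper bound $p_{\min}^{-1}\sum_k \binom{K}{k}/k$, which the simulation confirms is valid, if loose:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(8)
probs = np.array([0.3, 0.25, 0.2, 0.1, 0.1, 0.05])   # arbitrary coupon probabilities (sum to 1)
K = len(probs)

def collect_all():
    """Draw coupons until every type has been seen; return the number of draws N."""
    seen, draws = set(), 0
    while len(seen) < K:
        seen.add(rng.choice(K, p=probs))
        draws += 1
    return draws

samples = [collect_all() for _ in range(2000)]
bound = (1 / probs.min()) * sum(comb(K, k) / k for k in range(1, K + 1))
print("empirical E[N] ~", np.mean(samples), "   upper bound:", bound)
```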
