On the stability and accuracy of least squares approximations

Albert Cohen, Mark A. Davenport and Dany Leviatan

November 13, 2011

Abstract

We consider the problem of reconstructing an unknown function $f$ on a domain $X$ from samples of $f$ at $n$ randomly chosen points with respect to a given measure $\rho_X$. Given a sequence of linear spaces $(V_m)_{m>0}$ with $\dim(V_m) = m \le n$, we study the least squares approximations from the spaces $V_m$. It is well known that such approximations can be inaccurate when $m$ is too close to $n$, even when the samples are noiseless. Our main result provides a criterion on $m$ that describes the needed amount of regularization to ensure that the least squares method is stable and that its accuracy, measured in $L^2(X,\rho_X)$, is comparable to the best approximation error of $f$ by elements from $V_m$. We illustrate this criterion for various approximation schemes, such as trigonometric polynomials, with $\rho_X$ being the uniform measure, and algebraic polynomials, with $\rho_X$ being either the uniform or Chebyshev measure.

For such examples we also prove similar stability results using deterministic samples that are equispaced with respect to these measures.

1 Introduction and main results

Let $X$ be a domain of $\mathbb{R}^d$ and $\rho_X$ be a probability measure on $X$. We consider the problem of estimating an unknown function $f : X \to \mathbb{R}$ from samples $(y_i)_{i=1,\dots,n}$ which are either noiseless or noisy observations of $f$ at the points $(x_i)_{i=1,\dots,n}$, where the $x_i$ are i.i.d. with respect to $\rho_X$. We measure the error between $f$ and its estimator $\tilde f$ in the $L^2(X,\rho_X)$ norm
$$\|v\| := \Big(\int_X |v(x)|^2\, d\rho_X(x)\Big)^{1/2},$$
and we denote by $\langle\cdot,\cdot\rangle$ the associated inner product.

We are given a fixed sequence $(V_m)_{m\ge 1}$ of finite dimensional subspaces of $L^2(X,\rho_X)$ such that $\dim(V_m) = m$, and we would like to compute the best approximation of $f$ in $V_m$. This is given by the $L^2(X,\rho_X)$ orthogonal projector onto $V_m$, which we denote by $P_m$:
$$P_m f := \operatorname*{argmin}_{v\in V_m} \|f - v\|.$$
We let $e_m(f) = \|f - P_m f\|$ denote the best approximation error.

This research has been partially supported by the ANR Defi08 “ECHANGE” and by the US NSF grant DMS-1004718. Portions of this work were completed while M.A.D. and D.L. were visitors at Université Pierre et Marie Curie.


In general, we may not have access to either $\rho_X$ or any information about $f$ aside from the observations at the points $(x_i)_{i=1,\dots,n}$. In this case we cannot explicitly compute $P_m f$. A natural approach in this setting is to consider the solution of the least squares problem
$$w = \operatorname*{argmin}_{v\in V_m} \sum_{i=1}^n |y_i - v(x_i)|^2.$$

Typically, we are interested in the case where $m \le n$, which is the regime where this problem may admit a unique solution.

In the noiseless case $y_i = f(x_i)$, and hence $w$ may be viewed as the application of the least squares projection operator onto $V_m$ to $f$, i.e., we can write
$$w = P_m^n f := \operatorname*{argmin}_{v\in V_m} \|f - v\|_n,$$
where
$$\|v\|_n := \Big(\frac{1}{n}\sum_{i=1}^n |v(x_i)|^2\Big)^{1/2}$$
is the $L^2$ norm with respect to the empirical measure and, analogously, $\langle\cdot,\cdot\rangle_n$ the associated empirical inner product.

It is well known that least squares approximations may be inaccurate even when the measured samples are noiseless. For example, if $V_m$ is the space $\mathbb{P}_{m-1}$ of algebraic polynomials of degree $m-1$ over the interval $[-1,1]$ and if we choose $m = n$, this corresponds to Lagrange interpolation, which is known to be highly unstable, failing to converge towards $f$ when given values at uniformly spaced samples, even when $f$ is infinitely smooth (the “Runge phenomenon”). Regularization by taking $m$ substantially smaller than $n$ may therefore be needed even in a noise-free context. The goal of this paper is to provide a mathematical analysis of the exact amount of regularization that is needed.

Stability of the least squares problem. The solution of the least squares problem can be computed by solving an $m\times m$ system: specifically, if $(L_1,\dots,L_m)$ is an arbitrary basis for $V_m$, then we can write
$$w = \sum_{j=1}^m u_j L_j,$$
where $u = (u_j)_{j=1,\dots,m}$ is the solution of the $m\times m$ system
$$\mathbf{G}u = \mathbf{f}, \qquad (1.1)$$
with $\mathbf{G} := (\langle L_j, L_k\rangle_n)_{j,k=1,\dots,m}$ and $\mathbf{f} = \big(\frac{1}{n}\sum_{i=1}^n y_i L_k(x_i)\big)_{k=1,\dots,m}$. In the noiseless case $y_i = f(x_i)$, so that we can also write $\mathbf{f} := (\langle f, L_k\rangle_n)_{k=1,\dots,m}$. In the event that $\mathbf{G}$ is singular, we simply set $w = 0$.
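To make the computation concrete, here is a minimal numerical sketch of solving the system (1.1): it assembles $\mathbf{G}$ and $\mathbf{f}$ from a sample and solves for the coefficient vector $u$. The choice of basis (Legendre polynomials on $[-1,1]$, orthonormal for the uniform measure), the sample size, and the sampled function are illustrative assumptions, not prescribed by the paper.

```python
import numpy as np
from numpy.polynomial import legendre

def least_squares_coeffs(x, y, m):
    """Solve the m x m system G u = f of (1.1) for the sample (x_i, y_i).

    Illustrative basis: L_j = sqrt(2j-1) * P_{j-1}, the Legendre polynomials
    normalized to be orthonormal for the uniform probability measure dx/2 on [-1, 1].
    """
    n = len(x)
    B = np.column_stack([np.sqrt(2 * j - 1) * legendre.Legendre.basis(j - 1)(x)
                         for j in range(1, m + 1)])   # B[i, j-1] = L_j(x_i)
    G = (B.T @ B) / n                                  # G[j, k] = <L_j, L_k>_n
    f = (B.T @ y) / n                                  # f[k] = (1/n) sum_i y_i L_k(x_i)
    try:
        return np.linalg.solve(G, f)
    except np.linalg.LinAlgError:
        return np.zeros(m)                             # convention: w = 0 when G is singular

# Example: noiseless samples of a smooth function at n = 200 random uniform points.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 1.0 / (1.0 + 25.0 * x**2)
u = least_squares_coeffs(x, y, m=15)
print(u[:4])
```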

For the purposes of our analysis, suppose that the basis $(L_1,\dots,L_m)$ is orthonormal in the sense of $L^2(X,\rho_X)$.¹ In this case we have
$$\mathbb{E}(\mathbf{G}) = (\langle L_j, L_k\rangle)_{j,k=1,\dots,m} = \mathbf{I}.$$

¹While such a basis is generally not accessible when $\rho_X$ is unknown, we require it only for the analysis. The actual computation of the estimator can be made using any known basis of $V_m$, since the solution $w$ is independent of the basis used in computing it.


Our analysis requires an understanding of how the random matrix $\mathbf{G}$ deviates from its expectation $\mathbf{I}$ in probability. Towards this end, we introduce the quantity
$$K(m) := \sup_{x\in X}\sum_{j=1}^m |L_j(x)|^2.$$
Note that the function $\sum_{j=1}^m |L_j(x)|^2$ is invariant with respect to a rotation applied to $(L_1,\dots,L_m)$ and therefore independent of the choice of the orthonormal basis: it only depends on the space $V_m$ and on the measure $\rho_X$, and hence $K(m)$ also depends only on $V_m$. Also note that
$$K(m) \ge \sum_{j=1}^m \|L_j\|^2 = m.$$
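When no closed form is available, $K(m)$ can be approximated by evaluating $\sum_{j=1}^m |L_j(x)|^2$ on a fine grid and taking the maximum; the supremum is then only approximated. A hedged sketch, again with an assumed Legendre basis, for which Section 3 gives the exact value $K(m) = m^2$:

```python
import numpy as np
from numpy.polynomial import legendre

def K_estimate(m, grid):
    """Approximate K(m) = sup_x sum_j |L_j(x)|^2 by a maximum over a grid.

    Illustrative basis: Legendre polynomials orthonormal for dx/2 on [-1, 1],
    for which the exact value is K(m) = m^2 (see Section 3).
    """
    total = np.zeros_like(grid)
    for j in range(1, m + 1):
        Lj = np.sqrt(2 * j - 1) * legendre.Legendre.basis(j - 1)(grid)
        total += Lj**2
    return total.max()

grid = np.linspace(-1.0, 1.0, 20001)
print(K_estimate(10, grid))   # close to 10^2 = 100 (attained at x = +-1)
```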

We also will use the notation
$$|||\mathbf{M}||| := \max_{v\ne 0}\frac{|\mathbf{M}v|}{|v|}$$
for the spectral norm of a matrix.

Our first result is a probabilistic estimate of the comparability of the norms $\|\cdot\|$ and $\|\cdot\|_n$ uniformly over the space $V_m$. This is equivalent to the proximity of the matrices $\mathbf{G}$ and $\mathbf{I}$ in spectral norm, since we have that for all $\delta\in[0,1]$,
$$|||\mathbf{G}-\mathbf{I}||| \le \delta \iff \big|\,\|v\|_n^2 - \|v\|^2\,\big| \le \delta\|v\|^2, \quad v\in V_m.$$

Theorem 1. For $0<\delta<1$, one has the estimate
$$\Pr\{|||\mathbf{G}-\mathbf{I}|||>\delta\} = \Pr\big\{\exists v\in V_m :\ \big|\,\|v\|_n^2 - \|v\|^2\,\big| > \delta\|v\|^2\big\} \le 2m\exp\Big(-\frac{c_\delta\, n}{K(m)}\Big), \qquad (1.2)$$
where $c_\delta := \delta + (1-\delta)\log(1-\delta) > 0$.

The proof of Theorem 1 is a simple application of tail bounds for sums of random matrices obtained in [1]. A consequence of this result is that the norms $\|\cdot\|$ and $\|\cdot\|_n$ are comparable with high probability if $K(m)$ is smaller than $n$ by a logarithmic factor: for example, taking $\delta = \frac12$, we find that for any $r>0$,
$$\Pr\Big\{|||\mathbf{G}-\mathbf{I}|||>\frac12\Big\} = \Pr\Big\{\exists v\in V_m :\ \big|\,\|v\|_n^2 - \|v\|^2\,\big| > \frac12\|v\|^2\Big\} \le 2n^{-r}, \qquad (1.3)$$
if $m$ is such that
$$K(m) \le \kappa\,\frac{n}{\log n}, \quad \text{with}\quad \kappa := \frac{c_{1/2}}{1+r} = \frac{1-\log 2}{2+2r}. \qquad (1.4)$$
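Once $K(m)$ is known or bounded, condition (1.4) translates into an explicit cap on $m$ for given $n$ and $r$. A small sketch of this computation; the default growth rule $K(m) = m$ is an illustrative assumption corresponding to the most favorable examples of Section 3:

```python
import numpy as np

def max_admissible_m(n, r, K=lambda m: m):
    """Largest m with K(m) <= kappa * n / log(n), kappa = (1 - log 2) / (2 + 2r).

    K maps m to K(m); the default K(m) = m corresponds to the "optimal"
    examples of Section 3 (illustrative assumption).
    """
    kappa = (1.0 - np.log(2.0)) / (2.0 + 2.0 * r)
    budget = kappa * n / np.log(n)
    m = 0
    while K(m + 1) <= budget:
        m += 1
    return m

print(max_admissible_m(n=1000, r=1))                    # K(m) = m
print(max_admissible_m(n=1000, r=1, K=lambda m: m**2))  # Legendre case K(m) = m^2
```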

The above condition thus ensures that $\mathbf{G}$ is well conditioned with high probability. It can also be thought of as ensuring that the least squares problem is stable with high probability. Indeed, the right side of the least squares system can be written as $\mathbf{f} = \mathbf{M}y$ with
$$\mathbf{M} = \frac{1}{n}\big(L_j(x_i)\big)_{j,i\in\{1,\dots,m\}\times\{1,\dots,n\}},$$
an $m\times n$ matrix. Observing that $\langle\mathbf{G}v,v\rangle = n|\mathbf{M}^T v|^2$, we find that
$$|||\mathbf{M}||| = |||\mathbf{M}^T||| = \Big(\frac{1}{n}|||\mathbf{G}|||\Big)^{1/2}.$$

Therefore, if $|||\mathbf{G}-\mathbf{I}||| \le \frac12$, then we have that for any data vector $y$ the solution $w = \sum_{j=1}^m u_j L_j$ satisfies
$$\|w\| = |u| \le |||\mathbf{G}^{-1}|||\cdot|||\mathbf{M}|||\cdot|y| \le \frac{2}{\sqrt{n}}\sqrt{\frac{3}{2}}\,|y|,$$
which thus gives the stability estimate
$$\|w\| \le C\Big(\frac{1}{n}\sum_{i=1}^n |y_i|^2\Big)^{1/2}, \quad C = \sqrt{6}.$$

In the noiseless case, this can be written as $\|P_m^n f\| \le C\|f\|_n$, i.e., the least squares projection is stable between the norms $\|\cdot\|_n$ and $\|\cdot\|$.
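The probability appearing in (1.3) can also be estimated empirically by Monte Carlo: draw the sample, assemble $\mathbf{G}$ in an orthonormal basis, and record how often its spectral distance to $\mathbf{I}$ exceeds $\frac12$. The following sketch assumes the uniform measure on $[-1,1]$ and the Legendre basis; the sample sizes are arbitrary choices.

```python
import numpy as np
from numpy.polynomial import legendre

def gram_deviation(n, m, rng):
    """Draw x_1,...,x_n i.i.d. uniform on [-1,1] and return |||G - I|||."""
    x = rng.uniform(-1.0, 1.0, size=n)
    B = np.column_stack([np.sqrt(2 * j - 1) * legendre.Legendre.basis(j - 1)(x)
                         for j in range(1, m + 1)])
    G = (B.T @ B) / n
    return np.linalg.norm(G - np.eye(m), ord=2)   # spectral norm

rng = np.random.default_rng(1)
n, m, trials = 1000, 10, 200
failures = sum(gram_deviation(n, m, rng) > 0.5 for _ in range(trials))
print(f"empirical Pr(|||G - I||| > 1/2) ~ {failures / trials:.3f}")
```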

Let us mention that similar probabilistic bounds have been previously obtained, see in particular §5.2 in [2]. These earlier results allow us to obtain the bound (1.3), however relying on the stronger condition
$$K(m) \lesssim \Big(\frac{n}{\log n}\Big)^{1/2}.$$
The numerical results for polynomial least squares that we present in §3 hint that the weaker condition $K(m) \lesssim \frac{n}{\log n}$ is sharp. The quantity $K(m)$ was also used in [3] in order to control the $L^\infty(X)$ norm and the $L^2(X,\rho_X)$ norm.

Accuracy of least squares approximation. As an application, we can derive an estimate for the error of least squares approximation in expectation. Here, we make the assumption that a uniform bound
$$|f(x)| \le L \qquad (1.5)$$
holds for almost every $x$ with respect to $\rho_X$. For $m\le n$, we consider the truncated least squares estimator
$$\tilde f := T_L(w),$$
where $T_L(t) := \operatorname{sign}(t)\min\{L,|t|\}$.
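Numerically, the truncation $T_L$ is just an entrywise clipping of the least squares prediction to $[-L,L]$; a minimal sketch (the bound $L = 1$ below is an arbitrary illustration):

```python
import numpy as np

def truncate(values, L):
    """T_L(t) = sign(t) * min(L, |t|), applied entrywise."""
    return np.clip(values, -L, L)

# Example: clipping least squares predictions with the known bound L = 1.
print(truncate(np.array([-3.2, 0.4, 1.7]), L=1.0))   # [-1.   0.4  1. ]
```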

Our first result deals with the noiseless case.

Theorem 2. In the noiseless case, for any $r>0$, if $m$ is such that the condition (1.4) holds, then
$$\mathbb{E}(\|f-\tilde f\|^2) \le (1+\varepsilon(n))\,e_m(f)^2 + 8L^2 n^{-r}, \qquad (1.6)$$
where $\varepsilon(n) := \frac{4\kappa}{\log n} \to 0$ as $n\to+\infty$, with $\kappa$ as in (1.4).

At this point a few remarks are in order regarding the implications of this result for the convergence rate of the estimator. In the general setting of regression on a random design, one observes independent samples $z_i = (x_i, y_i)$ of a variable $z = (x,y)$ of law $\rho$ over $X\times Y$ and marginal law $\rho_X$ over $X$, and one estimates the regression function defined as the conditional expectation
$$f(x) := \mathbb{E}(y\,|\,x).$$
Assuming that $f$ satisfies the uniform bound (1.5), one computes the truncated least squares estimator now with $y_i$ in place of $f(x_i)$. A typical convergence bound for this estimator, see for example Theorem 11.4 in [5], is
$$\mathbb{E}(\|f-\tilde f\|^2) \le C\Big(e_m(f)^2 + \max\{L^2,\sigma^2\}\frac{m\log n}{n}\Big), \qquad (1.7)$$


where
$$\sigma^2 := \max_{x\in X}\mathbb{E}\big(|y-f(x)|^2\,\big|\,x\big) \qquad (1.8)$$
is the maximal conditional variance of the noise. Convergence rates may be found after balancing the two terms, but they are limited by the optimal learning rate $n^{-1}$, and this limitation persists even in the noiseless case $\sigma^2 = 0$ due to the presence of $L^2$ on the right side of (1.7). In contrast, Theorem 2 yields fast convergence rates, provided that the approximation error $e_m$ has fast decay and that the value of $m$ satisfying (1.4) can be chosen large enough.

One motivation for studying the noiseless case is the numerical treatment of parameter dependent PDEs of the general form
$$F(f,x) = 0,$$
where $x$ is a vector of parameters in some compact set $P \subset \mathbb{R}^d$. We can consider the solution map $x\mapsto f(x)$ either as giving the exact solution to the PDE for the given value of the parameter vector $x$, or as the exact result of a numerical solver for this value of $x$. In the stochastic PDE context, $x$ is random and obeys a certain law which may be known or unknown. From a random draw $(x_i)_{i=1,\dots,n}$, we obtain solutions $f_i = f(x_i)$ which are noiseless observations of the solution map, and we are interested in reconstructing this map. In instances such as elliptic problems with parameters in the diffusion coefficients, the solution map can be well approximated by polynomials in $x$ (see [4]). In this context, an initial study of the needed amount of regularization was given in [6], however specifically targeted towards polynomial least squares.

For the noisy regression problem, our analysis can also be adapted in order to derive the following result.

Theorem 3. For any $r>0$, if $m$ is such that the condition (1.4) holds, then
$$\mathbb{E}(\|f-\tilde f\|^2) \le (1+2\varepsilon(n))\,e_m(f)^2 + 8L^2 n^{-r} + 8\sigma^2\frac{m}{n}, \qquad (1.9)$$
with $\varepsilon(n)$ as in Theorem 2 and $\sigma^2$ the maximal conditional variance given by (1.8).

In the noiseless case, the bound in Theorem 2 suggests that $m$ should be chosen as large as possible under the constraint that (1.4) holds. In the noisy case, the value of $m$ minimizing the bound in Theorem 3 also depends on the decay of $e_m$, which is generally unknown. In such a situation, a classical way of choosing the value of $m$ is by a model selection procedure, such as adding a complexity penalty in the least squares or using an independent validation sample. Such procedures can also be of interest in the noiseless case when the measure $\rho_X$ is unknown, since the maximal value of $m$ such that (1.4) holds is then also unknown.
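As one possible instantiation of such a model selection procedure (not the paper's prescription), the value of $m$ can be chosen by fitting on one half of the sample and keeping the $m$ with the smallest empirical error on the held-out half. A hedged sketch in an assumed polynomial setting; note that, as remarked in the footnote of Section 1, any convenient basis can be used for the fit itself:

```python
import numpy as np
from numpy.polynomial import legendre

def fit(x, y, m):
    """Least squares fit of a polynomial of degree m-1 (coefficients in the Legendre basis)."""
    B = np.column_stack([legendre.Legendre.basis(j)(x) for j in range(m)])
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return coef

def evaluate(coef, x):
    return legendre.legval(x, coef)

def select_m_by_validation(x, y, candidates, rng):
    """Pick the m minimizing the empirical error on an independent hold-out half."""
    idx = rng.permutation(len(x))
    train, valid = idx[: len(x) // 2], idx[len(x) // 2 :]
    errs = {m: np.mean((y[valid] - evaluate(fit(x[train], y[train], m), x[valid]))**2)
            for m in candidates}
    return min(errs, key=errs.get)

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 400)
y = 1.0 / (1.0 + 25.0 * x**2)
print(select_m_by_validation(x, y, candidates=range(2, 40), rng=rng))
```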

The rest of our paper is organized as follows: we give the proofs of the above results in §2 and we present in §3 examples of applications to classical approximation schemes such as piecewise constants, trigonometric polynomials, or algebraic polynomials. For such examples, we study the range of $m$ such that (1.4) holds and show that this range is in accordance with stability results that can be proved for deterministic sampling. Numerical illustrations are given for algebraic polynomial approximation.

2 Proofs

Proof of Theorem 1: The matrix $\mathbf{G}$ can be written as $\mathbf{G} = \mathbf{X}_1 + \cdots + \mathbf{X}_n$, where the $\mathbf{X}_i$ are i.i.d. copies of the random matrix
$$\mathbf{X} = \frac{1}{n}\big(L_j(x)L_k(x)\big)_{j,k=1,\dots,m},$$
where $x$ is distributed according to $\rho_X$. We use the following Chernoff bound from [7], originally obtained in [1]: if $\mathbf{X}_1,\dots,\mathbf{X}_n$ are independent $m\times m$ random self-adjoint and positive matrices satisfying
$$\lambda_{\max}(\mathbf{X}_i) = |||\mathbf{X}_i||| \le R \quad \text{almost surely},$$
then with
$$\mu_{\min} := \lambda_{\min}\Big(\sum_{i=1}^n \mathbb{E}(\mathbf{X}_i)\Big) \quad\text{and}\quad \mu_{\max} := \lambda_{\max}\Big(\sum_{i=1}^n \mathbb{E}(\mathbf{X}_i)\Big),$$
one has
$$\Pr\Big\{\lambda_{\min}\Big(\sum_{i=1}^n \mathbf{X}_i\Big) \le (1-\delta)\mu_{\min}\Big\} \le m\,\Big[\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\Big]^{\mu_{\min}/R}, \quad 0\le\delta<1,$$
and
$$\Pr\Big\{\lambda_{\max}\Big(\sum_{i=1}^n \mathbf{X}_i\Big) \ge (1+\delta)\mu_{\max}\Big\} \le m\,\Big[\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\Big]^{\mu_{\max}/R}, \quad \delta\ge 0.$$
In our present case, we have $\sum_{i=1}^n \mathbb{E}(\mathbf{X}_i) = n\,\mathbb{E}(\mathbf{X}) = \mathbf{I}$, so that $\mu_{\min} = \mu_{\max} = 1$. It is easily checked that
$$\frac{e^{\delta}}{(1+\delta)^{1+\delta}} \le \frac{e^{-\delta}}{(1-\delta)^{1-\delta}}, \quad 0<\delta<1,$$
and therefore
$$\Pr\{|||\mathbf{G}-\mathbf{I}|||>\delta\} \le 2m\,\Big[\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\Big]^{1/R} = 2m\exp\Big(-\frac{c_\delta}{R}\Big).$$
We next use the fact that a rank 1 symmetric matrix $ab^T = (b_j a_k)_{j,k=1,\dots,m}$ has its spectral norm equal to the product of the Euclidean norms of the vectors $a$ and $b$, and therefore
$$|||\mathbf{X}||| \le \frac{1}{n}\sum_{j=1}^m |L_j(x)|^2 \le \frac{K(m)}{n},$$
almost surely. We may therefore take $R = \frac{K(m)}{n}$, which concludes the proof. $\Box$

Proof of Theorem 2: We denote by $d\rho_X^n := \otimes^n d\rho_X$ the probability measure of the draw. We also denote by $\Omega$ the set of all possible draws, which we divide into the set $\Omega_+$ of all draws such that

$$|||\mathbf{G}-\mathbf{I}||| \le \frac{1}{2},$$
and the complement set $\Omega_- := \Omega\setminus\Omega_+$. According to (1.3), we have
$$\Pr\{\Omega_-\} = \int_{\Omega_-} d\rho_X^n \le 2n^{-r}, \qquad (2.1)$$
under the condition (1.4). This leads to
$$\mathbb{E}(\|f-\tilde f\|^2) = \int_{\Omega} \|f-\tilde f\|^2\, d\rho_X^n \le \int_{\Omega_+} \|f-P_m^n f\|^2\, d\rho_X^n + 8L^2 n^{-r},$$
where we have used $\|f-\tilde f\|^2 \le 4L^2$, as well as the fact that $T_L$ is a contraction that preserves $f$.

It remains to prove that the first term on the above right side is bounded by $(1+\varepsilon(n))\,e_m(f)^2$. With $g := f - P_m f$, we observe that
$$f - P_m^n f = f - P_m f + P_m^n P_m f - P_m^n f = g - P_m^n g.$$

Since $g$ is orthogonal to $V_m$, we thus have
$$\|f-P_m^n f\|^2 = \|g\|^2 + \|P_m^n g\|^2 = \|g\|^2 + \sum_{j=1}^m |a_j|^2,$$
where $a = (a_j)_{j=1,\dots,m}$ is the solution of the system
$$\mathbf{G}a = b,$$
with $b := (\langle g, L_k\rangle_n)_{k=1,\dots,m}$. When the draw belongs to $\Omega_+$, we have $|||\mathbf{G}^{-1}||| \le 2$ and therefore
$$\sum_{j=1}^m |a_j|^2 \le 4\sum_{k=1}^m |\langle g, L_k\rangle_n|^2.$$
It follows that
$$\int_{\Omega_+} \|f-P_m^n f\|^2\, d\rho_X^n \le \int_{\Omega_+}\Big(\|g\|^2 + 4\sum_{k=1}^m |\langle g, L_k\rangle_n|^2\Big)\, d\rho_X^n \le \|g\|^2 + 4\sum_{k=1}^m \mathbb{E}\big(|\langle g, L_k\rangle_n|^2\big).$$
We estimate each of the $\mathbb{E}(|\langle g, L_k\rangle_n|^2)$ as follows:
$$\begin{aligned}
\mathbb{E}\big(|\langle g, L_k\rangle_n|^2\big) &= \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \mathbb{E}\big(g(x_i)g(x_j)L_k(x_i)L_k(x_j)\big)\\
&= \frac{1}{n^2}\Big(n(n-1)\,|\mathbb{E}(g(x)L_k(x))|^2 + n\,\mathbb{E}(|g(x)L_k(x)|^2)\Big)\\
&= \Big(1-\frac{1}{n}\Big)|\langle g, L_k\rangle|^2 + \frac{1}{n}\int_X |g(x)|^2|L_k(x)|^2\, d\rho_X\\
&= \frac{1}{n}\int_X |g(x)|^2|L_k(x)|^2\, d\rho_X,
\end{aligned}$$
where we have used the fact that $g$ is orthogonal to $V_m$ and thus to $L_k$. Summing over $k$, we obtain
$$\sum_{k=1}^m \mathbb{E}\big(|\langle g, L_k\rangle_n|^2\big) \le \frac{K(m)}{n}\|g\|^2 \le \frac{\kappa}{\log n}\|g\|^2,$$
where we have used (1.4). We have thus proven that
$$\int_{\Omega_+} \|f-P_m^n f\|^2\, d\rho_X^n \le \Big(1+\frac{4\kappa}{\log n}\Big)\|g\|^2 = (1+\varepsilon(n))\,e_m(f)^2,$$
which concludes the proof. $\Box$

Proof of Theorem 3: We define the additive noise in the sample by writing
$$y_i = f(x_i) + \eta_i,$$
and thus the $\eta_i$ are i.i.d. copies of the variable
$$\eta = y - f(x).$$
Note that $\eta$ and $x$ are not assumed to be independent. However, we have
$$\mathbb{E}(\eta\,|\,x) = 0,$$
which implies the decorrelation property
$$\mathbb{E}(\eta\, h(x)) = 0$$
for any function $h$. As in the proof of Theorem 2, we split $\Omega$ into $\Omega_+$ and $\Omega_-$ and find that
$$\mathbb{E}(\|f-\tilde f\|^2) \le \int_{\Omega_+} \|f-w\|^2\, d\rho_X^n + 8L^2 n^{-r},$$
where $w$ now stands for the solution to the least squares problem with noisy data $(y_1,\dots,y_n)$. With the same definition of $g = f - P_m f$, we can write
$$f - w = g - P_m^n g - \tilde w,$$
where $\tilde w$ stands for the solution to the least squares problem for the noise data $(\eta_1,\dots,\eta_n)$. Therefore
$$\|f - w\|^2 = \|g\|^2 + \|P_m^n g + \tilde w\|^2 \le \|g\|^2 + 2\|P_m^n g\|^2 + 2\|\tilde w\|^2 = \|g\|^2 + 2\sum_{j=1}^m |a_j|^2 + 2\sum_{j=1}^m |d_j|^2,$$
where $a = (a_j)_{j=1,\dots,m}$ is as in the proof of Theorem 2 and $d = (d_j)_{j=1,\dots,m}$ is the solution of the system
$$\mathbf{G}d = \mathbf{n},$$
with $\mathbf{n} := \big(\frac{1}{n}\sum_{i=1}^n \eta_i L_k(x_i)\big)_{k=1,\dots,m} = (n_k)_{k=1,\dots,m}$. By the same arguments as in the proof of Theorem 2, we thus obtain
$$\mathbb{E}(\|f-\tilde f\|^2) \le (1+2\varepsilon(n))\,e_m(f)^2 + 8L^2 n^{-r} + 8\sum_{k=1}^m \mathbb{E}(|n_k|^2).$$
We are left to show that $\sum_{k=1}^m \mathbb{E}(|n_k|^2) \le \sigma^2\frac{m}{n}$. For this we simply write
$$\mathbb{E}(|n_k|^2) = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \mathbb{E}\big(\eta_i L_k(x_i)\eta_j L_k(x_j)\big).$$
For $i\ne j$, we have
$$\mathbb{E}\big(\eta_i L_k(x_i)\eta_j L_k(x_j)\big) = \big(\mathbb{E}(\eta L_k(x))\big)^2 = 0.$$
For $i=j$, we have
$$\begin{aligned}
\mathbb{E}\big(|\eta_i L_k(x_i)|^2\big) &= \mathbb{E}\big(|\eta L_k(x)|^2\big)\\
&= \int_X \mathbb{E}\big(|\eta L_k(x)|^2\,\big|\,x\big)\, d\rho_X\\
&= \int_X \mathbb{E}\big(|\eta|^2\,\big|\,x\big)|L_k(x)|^2\, d\rho_X\\
&\le \sigma^2\int_X |L_k(x)|^2\, d\rho_X = \sigma^2.
\end{aligned}$$
It follows that $\mathbb{E}(|n_k|^2) \le \frac{\sigma^2}{n}$, which concludes the proof. $\Box$


3 Examples and numerical illustrations

We now give several examples of approximation schemes for which one can compute the quantityK(m) and therefore estimate the range ofm such that the condition (1.4) holds. For each of these examples, we also exhibit a deterministic sampling (x1, . . . , xn) for which the stability property

9G−I9≤ 1 2, or equivalently

kvk2n− kvk2 ≤1

2kvk2, v∈Vm,

is ensured for the same range ofm(actually slightly better by a logarithmic factor). For the sake of simplic- ity, we work in the one dimensional setting, withX a bounded interval.

Piecewise constant functions. Here $X = [a,b]$ and $V_m$ is the space of piecewise constant functions over a partition of $X$ into intervals $I_1,\dots,I_m$. In such a case, an orthonormal basis with respect to $L^2(X,\rho_X)$ is given by the normalized characteristic functions $L_k := (\rho_X(I_k))^{-1/2}\chi_{I_k}$, and therefore
$$K(m) = \max_{k=1,\dots,m}\big(\rho_X(I_k)\big)^{-1}.$$
Given a measure $\rho_X$, the partition that minimizes $K(m)$, and therefore allows (1.4) to be fulfilled for the largest range of $m$, is one that evenly distributes the measure $\rho_X$, i.e., $\rho_X(I_k) = \frac{1}{m}$ for all $k$. With such partitions, $K(m)$ reaches its minimal value
$$K(m) = m,$$
and (1.4) can be achieved with $m \sim \frac{n}{\log n}$.

If we now choose $n = m$ deterministic points $x_1,\dots,x_m$ with $x_k \in I_k$, we clearly have
$$\|v\|_n^2 = \|v\|^2, \quad v\in V_m.$$
Therefore the stability of the least squares problem can be ensured with $m$ up to the value $n$ using a deterministic sample.
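A partition that evenly distributes $\rho_X$ is obtained directly from its quantile function, so that $\rho_X(I_k) = \frac{1}{m}$ and $K(m) = m$. A minimal sketch; the Chebyshev (arcsine) measure used below is only an illustrative example of a known $\rho_X$:

```python
import numpy as np

def equal_measure_partition(quantile, m):
    """Breakpoints of a partition I_1,...,I_m with rho_X(I_k) = 1/m.

    `quantile` is the inverse cumulative distribution function of rho_X.
    With this partition, K(m) = max_k rho_X(I_k)^{-1} = m, its minimal value.
    """
    return quantile(np.linspace(0.0, 1.0, m + 1))

# Illustrative example: the Chebyshev (arcsine) measure on [-1, 1],
# whose quantile function is t -> -cos(pi * t).
breaks = equal_measure_partition(lambda t: -np.cos(np.pi * t), m=5)
print(breaks)   # endpoints of 5 intervals of equal Chebyshev measure 1/5
```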

Trigonometric polynomials and uniform measure. Without loss of generality, we take $X = [-\pi,\pi]$, and we consider for odd $m = 2p+1$ the space $V_m$ of trigonometric polynomials of degree $p$, which is spanned by the functions $L_k(x) = e^{ikx}$ for $k = -p,\dots,p$. Assuming that $\rho_X$ is the uniform measure, this is an orthonormal basis with respect to $L^2(X,\rho_X)$. In this example, we again obtain the minimal value
$$K(m) = m.$$
Therefore (1.4) can be achieved with $m \sim \frac{n}{\log n}$.

We now consider the deterministic uniform sampling $x_i := -\pi + \frac{2\pi i}{n}$ for $i = 1,\dots,n$. With such a sampling, one has the identity
$$\int_{-\pi}^{\pi} v(x)\, d\rho_X = \frac{1}{2\pi}\int_{-\pi}^{\pi} v(x)\, dx = \frac{1}{n}\sum_{i=1}^n v(x_i),$$
for all trigonometric polynomials $v$ of degree $n-1$ (this is easily seen by checking the identity on every basis element). When $v\in V_m$ with $m = 2p+1$, we know that $|v|^2$ is a trigonometric polynomial of degree $2p$. We thus find that
$$\|v\|_n^2 = \|v\|^2, \quad v\in V_m,$$
provided that $2p \le n-1$, or equivalently $m \le n$. Therefore the stability of the least squares problem can be ensured with $m$ up to the value $n$ using a deterministic sample.
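This exactness is easy to verify numerically: for a random trigonometric polynomial of degree $p$ with $m = 2p+1 \le n$, the empirical norm at $n$ equispaced points matches the $L^2(X,\rho_X)$ norm given by Parseval, up to rounding error. A small check (the specific $p$, $n$, and random coefficients are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 7, 16                      # m = 2p + 1 = 15 <= n
c = rng.standard_normal(2 * p + 1) + 1j * rng.standard_normal(2 * p + 1)

def v(x):
    """Random trigonometric polynomial v(x) = sum_{k=-p}^{p} c_k e^{ikx}."""
    return sum(ck * np.exp(1j * k * x) for ck, k in zip(c, range(-p, p + 1)))

x = -np.pi + 2 * np.pi * np.arange(1, n + 1) / n          # equispaced sample
norm_n_sq = np.mean(np.abs(v(x)) ** 2)                    # ||v||_n^2
norm_sq = np.sum(np.abs(c) ** 2)                          # ||v||^2 by Parseval
print(abs(norm_n_sq - norm_sq))                           # ~ machine precision
```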

Algebraic polynomials and uniform measure. Without loss of generality, we take $X = [-1,1]$, and we consider $V_m = \mathbb{P}_{m-1}$, the space of algebraic polynomials of degree $m-1$. When $\rho_X$ is the uniform measure, an orthonormal basis is given by defining $L_k$ as the Legendre polynomial of degree $k-1$ with the normalization
$$\|L_k\|_{L^\infty([-1,1])} = |L_k(1)| = \sqrt{2k-1},$$
and thus
$$K(m) = \sum_{k=1}^m (2k-1) = m^2.$$
Therefore (1.4) can be achieved with $m \sim \sqrt{\frac{n}{\log n}}$, which is a smaller range than in the previous examples.

We now consider the deterministic sampling obtained by partitioning $X$ into $n$ intervals $(I_1,\dots,I_n)$ of equal length $\frac{2}{n}$, and picking one point $x_i$ in each $I_i$. For any $v\in V_m$, we may write
$$\begin{aligned}
\Big|\int_{I_i} |v(x)|^2\, d\rho_X - \frac{1}{n}|v(x_i)|^2\Big| &= \frac{1}{2}\Big|\int_{I_i} |v(x)|^2\, dx - \frac{2}{n}|v(x_i)|^2\Big|\\
&= \frac{1}{2}\Big|\int_{I_i} \big(|v(x)|^2 - |v(x_i)|^2\big)\, dx\Big|\\
&\le \frac{1}{2}\int_{I_i} \big||v(x)|^2 - |v(x_i)|^2\big|\, dx\\
&\le \frac{1}{2}|I_i|\int_{I_i} |(v^2)'(x)|\, dx\\
&= \frac{2}{n}\int_{I_i} |v'(x)v(x)|\, d\rho_X.
\end{aligned}$$
Summing over $i$, it follows that
$$\big|\,\|v\|_n^2 - \|v\|^2\,\big| \le \frac{2}{n}\int_X |v'(x)v(x)|\, d\rho_X \le \frac{2}{n}\|v'\|\,\|v\| \le \frac{2(m-1)^2}{n}\|v\|^2,$$
where we have used the Cauchy–Schwarz and Markov inequalities. Therefore the stability of the least squares problem can be ensured with $m$ up to the value $\frac{\sqrt{n}}{2}+1$ using a deterministic sample.
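The deterministic stability claim can be probed numerically by computing $|||\mathbf{G}-\mathbf{I}|||$ for one point per interval (here the midpoints, an arbitrary choice) and increasing $m$; the deviation should remain moderate for $m$ of the order of $\sqrt{n}$ and degrade for larger $m$. A hedged sketch:

```python
import numpy as np
from numpy.polynomial import legendre

def deterministic_gram_deviation(n, m):
    """|||G - I||| for the midpoints of n equal-length intervals of [-1, 1]
    and the Legendre basis orthonormal for the uniform measure dx/2."""
    x = -1.0 + (2.0 * np.arange(n) + 1.0) / n            # one point per interval I_i
    B = np.column_stack([np.sqrt(2 * j + 1) * legendre.Legendre.basis(j)(x)
                         for j in range(m)])
    G = (B.T @ B) / n
    return np.linalg.norm(G - np.eye(m), ord=2)

n = 400
for m in (5, 10, 11, 20, 40):                            # sqrt(n)/2 + 1 = 11 here
    print(m, deterministic_gram_deviation(n, m))
```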

Algebraic polynomials and Chebyshev measure. Consider again algebraic polynomials of degree $m-1$ on $X = [-1,1]$, now equipped with the measure
$$d\rho_X = \frac{dx}{\pi\sqrt{1-x^2}}.$$
Then an orthonormal basis is given by defining $L_k$ as the Chebyshev polynomial of degree $k-1$, with $L_1 = 1$ and
$$L_k(x) = \sqrt{2}\cos\big((k-1)\arccos x\big)$$
for $k>1$, and thus
$$K(m) = 2m-1.$$
Therefore (1.4) can be achieved with $m \sim \frac{n}{\log n}$, which expresses the fact that least squares approximations are stable for higher polynomial degrees when working with the Chebyshev measure rather than with the uniform measure.

We now consider the deterministic sampling obtained by partitioning $X$ into $n$ intervals $(I_1,\dots,I_n)$ of equal Chebyshev measure $\rho_X(I_i) = \frac{1}{n}$, and picking one point $x_i$ in each $I_i$. For any $v\in V_m$, we may write
$$\begin{aligned}
\Big|\int_{I_i} |v(x)|^2\, d\rho_X - \frac{1}{n}|v(x_i)|^2\Big| &= \Big|\int_{I_i} \big(|v(x)|^2 - |v(x_i)|^2\big)\, d\rho_X\Big|\\
&\le \int_{I_i} \big||v(x)|^2 - |v(x_i)|^2\big|\, d\rho_X\\
&\le \rho_X(I_i)\int_{I_i} |(v^2)'(x)|\, dx\\
&= \frac{1}{n}\int_{I_i} |v'(x)v(x)|\, dx.
\end{aligned}$$
Summing over $i$, it follows that
$$\big|\,\|v\|_n^2 - \|v\|^2\,\big| \le \frac{1}{n}\int_X |v'(x)v(x)|\, dx \le \frac{1}{n}\|v\|\Big(\pi\int_X |v'(x)|^2\sqrt{1-x^2}\, dx\Big)^{1/2}.$$
Using the change of variable $x = \cos t$, it is easily seen that the inverse estimate
$$\int_X |v'(x)|^2\sqrt{1-x^2}\, dx \le (m-1)^2\int_X |v(x)|^2\frac{1}{\sqrt{1-x^2}}\, dx$$
holds for any $v\in V_m$. Therefore
$$\big|\,\|v\|_n^2 - \|v\|^2\,\big| \le \frac{1}{n}\int_X |v'(x)v(x)|\, dx \le \frac{\pi(m-1)}{n}\|v\|^2,$$
which shows that the stability of the least squares problem can be ensured with $m$ up to the value $\frac{n}{2\pi}+1$ using a deterministic sample.

Numerical illustration. We conclude with a brief numerical illustration of our theoretical results for the setting of algebraic polynomials. Specifically, we consider the smooth function $f_1(x) = \frac{1}{1+25x^2}$, originally considered by Runge to illustrate the instability of polynomial interpolation at equispaced points, and the non-smooth function $f_2(x) = |x|$, both restricted to the interval $[-1,1]$.

For both functions, we take $n$ i.i.d. samples $x_1,\dots,x_n$ with respect to a measure $\rho_X$ on $X = [-1,1]$ and compute the noise-free observations $y_i = f(x_i)$. We consider either the uniform measure $d\rho_X := \frac{dx}{2}$ or the Chebyshev measure $d\rho_X := \frac{dx}{\pi\sqrt{1-x^2}}$. In both cases, we compute the least squares approximating polynomial of degree $m$ using these points, for a range of different values of $m \le n$. We then numerically compute the error in the $L^2(X,\rho_X)$ norm, with $\rho_X$ the measure from which the samples have been drawn, using the adaptive Simpson quadrature rule.

Figure 3.1: The $L^2(X,\rho_X)$ error as $m$ varies, (a) for $f_1$ and (b) for $f_2$, for points drawn from the uniform and Chebyshev measures. [Plots omitted.]

Figure 3.2: Optimal values $m(n)$ as $n$ varies, (a) for $f_1$ (comparison with $0.7n$ and $2.5\sqrt{n}$) and (b) for $f_2$ (comparison with $0.1n$ and $0.4\sqrt{n}$), for points drawn from the uniform and Chebyshev measures. [Plots omitted.]

Figure 3.1 shows the results of this simulation using $n_1 = 200$ samples for estimating $f_1$ and $n_2 = 1000$ samples for estimating $f_2$. We observe that, in all cases, as $m$ approaches $n$ the solutions become highly inaccurate due to the inherent instability of the problem. However, as expected, $m$ can be taken much larger before instability starts to develop when the points are drawn with respect to the Chebyshev measure.
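A sketch that approximately reproduces this experiment for $f_1$ (uniform versus Chebyshev sampling, error as a function of $m$) is given below. It replaces the adaptive Simpson rule by a simple fixed-grid discretization of the $L^2(X,\rho_X)$ error, and all specific sizes are illustrative; it is not the exact code behind Figures 3.1 and 3.2.

```python
import numpy as np
from numpy.polynomial import legendre

def ls_error(f, x, m, quad_x, quad_w):
    """Fit a degree m-1 polynomial by least squares at the points x and return
    its weighted-L2 error, computed with the quadrature rule (quad_x, quad_w)."""
    B = np.column_stack([legendre.Legendre.basis(j)(x) for j in range(m)])
    coef, *_ = np.linalg.lstsq(B, f(x), rcond=None)
    diff = f(quad_x) - legendre.legval(quad_x, coef)
    return np.sqrt(np.sum(quad_w * diff**2))

f1 = lambda x: 1.0 / (1.0 + 25.0 * x**2)                  # Runge's example
rng = np.random.default_rng(4)
n = 200
x_unif = rng.uniform(-1.0, 1.0, n)                        # uniform measure dx/2
x_cheb = np.cos(np.pi * rng.uniform(0.0, 1.0, n))         # Chebyshev measure

# Quadrature nodes/weights for the two L2(X, rho_X) norms (crude discretizations).
t = np.linspace(-1.0 + 1e-6, 1.0 - 1e-6, 5001)
w_unif = np.full(t.size, 0.5 * (t[1] - t[0]))
w_cheb = (t[1] - t[0]) / (np.pi * np.sqrt(1.0 - t**2))

for m in (10, 30, 60, 90):
    e_u = ls_error(f1, x_unif, m, t, w_unif)
    e_c = ls_error(f1, x_cheb, m, t, w_cheb)
    print(f"m = {m:3d}   uniform: {e_u:.2e}   chebyshev: {e_c:.2e}")
```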

Next we consider the effect of $n$ on the best choice of $m$. Specifically, for any given sample of points we can compute the value $m(n)$ that corresponds to the polynomial degree for which we obtain the best approximation to $f_1$ or $f_2$, and examine how this behaves as a function of $n$. This is shown in Figure 3.2, which displays, as a function of $n$, the average value of $m(n)$ over 50 realizations of the sample, for both measures and both functions $f_1$ and $f_2$ (the averaging has the effect of reducing oscillation in the curve $n \mapsto m(n)$, making it more readable). We vary the sample size from $n = 1$ to $1000$ for $f_2$, but only from $n = 1$ to $200$ for the smooth function $f_1$, since in that case the $L^2(X,\rho_X)$ error drops below machine precision for larger values of $n$ with $m$ in the regime where the least squares problem is stable, and therefore the minimizing value $m(n)$ cannot be precisely located.

We observe that, in accordance with our theoretical results, $m(n)$ behaves like $\sqrt{n}$ when the points are drawn with respect to the uniform measure, while it behaves almost linearly in $n$ when the points are drawn with respect to the Chebyshev measure.

References

[1] Ahlswede, R. and A. Winter, Strong converse for identification via quantum channels, IEEE Trans. Information Theory 48, 569–579, 2002.

[2] Baraud, Y., Model selection for regression on a random design, ESAIM Prob. Stat. 6, 127–146, 2002.

[3] Birgé, L. and P. Massart, Minimum contrast estimators on sieves: exponential bounds and rates of convergence, Bernoulli 4, 329–375, 1998.

[4] Cohen, A., R. DeVore and C. Schwab, Analytic regularity and polynomial approximation of parametric elliptic PDE's, Analysis and Applications 9, 11–47, 2011.

[5] Györfi, L., M. Kohler, A. Krzyżak and H. Walk, A distribution-free theory of nonparametric regression, Springer, Berlin, 2002.

[6] Migliorati, G., F. Nobile, E. von Schwerin and R. Tempone, Analysis of the point collocation method, preprint MOX, Politecnico di Milano, 2011.

[7] Tropp, J., User friendly tail bounds for sums of random matrices, to appear in J. FoCM, 2011.

Albert Cohen
Laboratoire Jacques-Louis Lions
Université Pierre et Marie Curie
4, Place Jussieu, 75005 Paris, France
cohen@ann.jussieu.fr

Mark Davenport
Department of Statistics
Stanford University
Stanford, CA 94035, USA
markad@stanford.edu

Dany Leviatan
Raymond and Beverly Sackler School of Mathematics
Tel Aviv University
69978 Tel Aviv, Israel
leviatan@post.tau.ac.il
