
HAL Id: hal-01292382

https://hal.archives-ouvertes.fr/hal-01292382

Submitted on 7 Apr 2016


Yu. Golubev, Th. Zimolo

To cite this version:

Yu. Golubev, Th. Zimolo. Tikhonov-Phillips Regularizations in Linear Models with Blurred Design. Mathematical Methods of Statistics, Allerton Press, Springer, 2016, 25 (1), pp.1-25. hal-01292382


Tikhonov-Phillips Regularizations in Linear Models with Blurred Design

Yu. Golubev^∗ and Th. Zimolo^†

Abstract

The paper deals with recovering an unknown vector $\beta \in \mathbb{R}^p$ based on the observations $Y = X\beta + \epsilon\xi$ and $Z = X + \sigma\zeta$, where $X$ is an unknown $n \times p$ matrix with $n \ge p$, $\xi \in \mathbb{R}^n$ is a standard white Gaussian noise, $\zeta$ is an $n \times p$ matrix with i.i.d. standard Gaussian entries, and $\epsilon, \sigma \in \mathbb{R}^+$ are known noise levels. It is assumed that $X$ has a large condition number and $p$ is large. Therefore, in order to estimate $\beta$, the simple Tikhonov-Phillips regularization (ridge regression) with a data-driven regularization parameter is used. For this estimation method, we study the effect of the noise $\sigma\zeta$ on the quality of recovering $X\beta$ using concentration inequalities for the prediction error.

Keywords: ridge regression, Tikhonov-Phillips method, data-driven regularization, blurred design, prediction mean squared error, concentration inequality.

2000 Mathematics Subject Classification: Primary 62C99; secondary 62C10, 62C20, 62J05.

1 Introduction and main results

This paper deals with recovering an unknown vector $\beta \in \mathbb{R}^p$ based on the noisy observations

\[
Y = X\beta + \epsilon\xi, \qquad Z = X + \sigma\zeta, \tag{1}
\]

where $X$ is an unknown $n \times p$ matrix with $n \ge p$, $\xi \in \mathbb{R}^n$ is a standard white Gaussian noise, and $\zeta$ is an $n \times p$ matrix with i.i.d. standard Gaussian entries. For simplicity, it is assumed that the noise levels $\epsilon, \sigma > 0$ are known.

Despite the very simple probabilistic structure of this model, estimating $\beta$ is a non-trivial statistical problem even in the simple case $p = n = 1$ (see [11]), but real difficulties arise when the condition number of $X$ and $p$ are large. In this case, the maximum likelihood estimate of $\beta$ typically has a large risk, and to avoid this drawback one has to make use of available a priori information about $\beta$ and $X$. Recently, some new methods of estimating $\beta$ in this situation have been proposed. These approaches rely on specific statistical interpretations of (1), ranging from "smooth" models for $\beta$ [6], [13], [15], [5] to "sparse" ones [12], [1], [2].

∗ CNRS, Aix-Marseille Université, I2M, UMR 7373, 13453 Marseille, France, and Institute for Information Transmission Problems, Moscow, Russia.

† Aix-Marseille Université, I2M, UMR 7373, 13453 Marseille, France.

In this paper, we focus solely on estimates that are commonly used in practice, namely those based on the Tikhonov-Phillips regularization technique [17], often called ridge regression in statistics. This method relies on the hypothesis that the Euclidean norm of $\beta$ is bounded, and it is usually used when $X$ is known precisely ($\sigma = 0$). The standard ridge regression estimate of $\beta$ is computed as follows:

\[
\hat\beta^{\alpha}(Y, X) = \arg\min_{\beta} \bigl\{ \|Y - X\beta\|^2 + \alpha\|\beta\|^2 \bigr\} = (X^\top X + \alpha I)^{-1} X^\top Y,
\]

where here and below $\|\cdot\|$ is the Euclidean norm, $I$ is the identity matrix, and $\alpha \in \mathbb{R}^+$ is the regularization parameter. We emphasize that despite the well-known drawbacks of this regularization technique (see e.g. [7]), it remains widely used in practice because of its very low numerical complexity.
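The closed form of the ridge estimate can be illustrated with a minimal NumPy sketch. The function name `ridge`, the random data, and the chosen sizes are assumptions made only for this illustration; the check at the end confirms that the closed form satisfies the first-order condition of the penalized least-squares criterion.

```python
import numpy as np

def ridge(Y, X, alpha):
    """Ridge estimate (X^T X + alpha I)^{-1} X^T Y, computed by a linear solve."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ Y)

# Illustrative random data (not taken from the paper).
rng = np.random.default_rng(0)
n, p, alpha = 8, 3, 0.5
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
beta_hat = ridge(Y, X, alpha)

# Gradient of ||Y - X b||^2 + alpha ||b||^2 at b = beta_hat; it should vanish.
grad = -2 * X.T @ (Y - X @ beta_hat) + 2 * alpha * beta_hat
```

Solving the linear system instead of forming the inverse explicitly is a standard numerical choice and does not change the estimate.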

In applications, the regularization parameter is usually data-driven and very often [8] it is computed as a minimizer of the unbiased risk estimate of the mean square prediction risk, i.e.

\[
\hat\alpha(Y, X) = \arg\min_{\alpha \ge 0} \Bigl\{ \bigl\|Y - X\hat\beta^{\alpha}(Y, X)\bigr\|^2 + 2\epsilon^2 \operatorname{tr}\bigl[H^{\alpha}(X)\bigr] \Bigr\}, \tag{2}
\]

where

\[
H^{\alpha}(X) = X(X^\top X + \alpha I)^{-1} X^\top.
\]

Thus, we estimate $\beta$ with the help of $\hat\beta^{\hat\alpha(Y,X)}(Y, X)$. From a mathematical viewpoint, the most elegant and interesting fact about the performance of this method goes back to Kneip [14].

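Theorem 1 For any $x \ge 1$

\[
P\Bigl\{ \bigl\|X\hat\beta^{\hat\alpha(Y,X)}(Y, X) - X\beta\bigr\| \ge \sqrt{\min_{\alpha \ge 0} R^{\alpha}(\beta, X)} + \epsilon x \Bigr\} \le \exp(-Kx^2), \tag{3}
\]

where here and in what follows $K$ stands for generic constants and

\[
R^{\alpha}(\beta, X) \stackrel{\mathrm{def}}{=} E\bigl\|X\hat\beta^{\alpha}(Y, X) - X\beta\bigr\|^2 = \bigl\|[I - H^{\alpha}(X)]X\beta\bigr\|^2 + \epsilon^2 \operatorname{tr}\bigl[H^{\alpha}(X)^\top H^{\alpha}(X)\bigr]
\]

is the mean square prediction risk.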

In order to understand why this theorem is really surprising, let us choose $\alpha$ as a minimizer of the mean square prediction risk, i.e.

\[
\alpha(\beta, X) = \arg\min_{\alpha \ge 0} R^{\alpha}(\beta, X). \tag{4}
\]
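To make the oracle parameter concrete, the following sketch evaluates the risk $R^{\alpha}(\beta, X)$ from its bias and variance terms and minimizes it over a finite grid. The helper names `prediction_risk` and `oracle_alpha`, the log-spaced grid (which replaces the continuous minimization over $\alpha \ge 0$), and the random design are all assumptions of this illustration.

```python
import numpy as np

def prediction_risk(alpha, X, beta, eps):
    """R^alpha(beta, X) = ||(I - H) X beta||^2 + eps^2 tr(H^T H)."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T)
    bias2 = np.linalg.norm((np.eye(n) - H) @ X @ beta) ** 2
    return bias2 + eps**2 * np.trace(H.T @ H)

def oracle_alpha(X, beta, eps, alphas):
    """Grid approximation of the oracle parameter and the attained risk."""
    risks = np.array([prediction_risk(a, X, beta, eps) for a in alphas])
    k = int(np.argmin(risks))
    return alphas[k], risks[k]

# Illustrative ill-conditioned design (not taken from the paper).
rng = np.random.default_rng(4)
X = rng.standard_normal((10, 4)) @ np.diag([3.0, 1.0, 0.3, 0.1])
beta = np.array([1.0, -0.5, 0.25, 2.0])
alphas = np.geomspace(1e-4, 1e4, 401)
alpha_star, risk_star = oracle_alpha(X, beta, eps=0.5, alphas=alphas)
```

Since the computation needs the unknown $\beta$, such a routine is useful only in simulations, for instance as a benchmark for data-driven choices.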

Notice that $\alpha(\beta, X)$ depends on $\beta$, so it is not a statistical estimate in the ordinary sense and it cannot be used in practice. It can be interpreted solely as an oracle regularization parameter. One can check with rather simple algebra that for sufficiently large $x$

\[
P\Bigl\{ \bigl\|X\hat\beta^{\alpha(\beta,X)}(Y, X) - X\beta\bigr\| \ge \sqrt{R^{\alpha(\beta,X)}(\beta, X)} + \epsilon x \Bigr\} \ge \exp(-K'x^2), \tag{5}
\]

where $K'$ is a constant. So, comparing (5) and (3), we see that both prediction errors $\|X\hat\beta^{\hat\alpha(Y,X)}(Y, X) - X\beta\|$ and $\|X\hat\beta^{\alpha(\beta,X)}(Y, X) - X\beta\|$ are close to $\sqrt{\min_{\alpha \ge 0} R^{\alpha}(\beta, X)}$ at the same parametric rate $\epsilon$. This is why (3) is often called an oracle inequality.

We emphasize that if the risk of an estimate $\tilde\beta(Y, X)$ is measured by $\|\tilde\beta(Y, X) - \beta\|$, then the mathematical analysis of this risk becomes rather complicated. For instance, to our knowledge nothing similar to Theorem 1 is known about the distribution of $\|\hat\beta^{\hat\alpha(Y,X)}(Y, X) - \beta\|$, despite the fact that $\hat\beta^{\hat\alpha(Y,X)}(Y, X)$ behaves reasonably well in numerous applications [8]. The principal difficulties in obtaining good upper bounds for $\|\tilde\beta^{\alpha}(Y, X) - \beta\|$ are due to the data-driven choice of the regularization parameter. In recent years, several new approaches to this fundamental statistical problem have been proposed (see e.g. [3], [9]), but unfortunately these methods need the Singular Value Decomposition of $X$, which makes them hard to compute compared to (2).

The main goal of this paper is to study the case where the design matrix $X$ is blurred by a white Gaussian noise and to find out how this noise affects the quality of recovering $X\beta$. In order to estimate $\beta$ in this situation, we make use of the simple plug-in ridge regression estimate

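\[
\hat\beta^{\alpha}(Y, Z) = \arg\min_{\beta} \bigl\{ \|Y - Z\beta\|^2 + \alpha\|\beta\|^2 \bigr\} = (Z^\top Z + \alpha I)^{-1} Z^\top Y \tag{6}
\]

combined with the data-driven regularization parameter

\[
\hat\alpha(Y, Z) = \arg\min_{\alpha \ge \alpha_\circ} \Bigl\{ \bigl\|Y - Z\hat\beta^{\alpha}(Y, Z)\bigr\|^2 + 2\epsilon^2 \operatorname{tr}\bigl[H^{\alpha}(Z)\bigr] \Bigr\}, \tag{7}
\]

where $\alpha_\circ > 0$ is given, and $H^{\alpha}(Z) = Z(Z^\top Z + \alpha I)^{-1} Z^\top$.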
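The plug-in procedure defined by (6) and (7) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `plug_in_ridge` is hypothetical, the continuous minimization over $\alpha \ge \alpha_\circ$ is replaced by a search over a finite log-spaced grid, and a thin SVD of $Z$ is used purely for convenience so that the criterion is cheap to evaluate for many values of $\alpha$.

```python
import numpy as np

def plug_in_ridge(Y, Z, eps, alpha_min, grid_size=200):
    """Plug-in ridge estimate with alpha chosen by the unbiased risk estimate.

    The search over alpha >= alpha_min is restricted to a finite grid,
    which is an assumption of this sketch.
    """
    # Thin SVD Z = U diag(d) V^T, hence H^alpha(Z) = U diag(d^2/(alpha+d^2)) U^T.
    U, d, Vt = np.linalg.svd(Z, full_matrices=False)
    s = d**2
    y = U.T @ Y
    resid_out = Y @ Y - y @ y            # squared norm of Y outside the range of Z
    alphas = np.geomspace(alpha_min, alpha_min * 1e6, grid_size)
    best_crit, alpha_hat = np.inf, alphas[0]
    for a in alphas:
        h = s / (a + s)
        # ||Y - H Y||^2 + 2 eps^2 tr(H)
        crit = np.sum(((1 - h) * y) ** 2) + resid_out + 2 * eps**2 * np.sum(h)
        if crit < best_crit:
            best_crit, alpha_hat = crit, a
    # (Z^T Z + alpha I)^{-1} Z^T Y written in the SVD basis
    beta_hat = Vt.T @ (d / (alpha_hat + s) * y)
    return alpha_hat, beta_hat

# Illustrative simulation of model (1) (sizes and noise levels are arbitrary).
rng = np.random.default_rng(5)
n, p, eps, sigma = 30, 5, 0.1, 0.05
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
Y = X @ beta + eps * rng.standard_normal(n)
Z = X + sigma * rng.standard_normal((n, p))
alpha_hat, beta_hat = plug_in_ridge(Y, Z, eps, alpha_min=1e-3)
```

Only $Y$, $Z$, $\epsilon$ and $\alpha_\circ$ enter the computation, which reflects the low numerical complexity of the method.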

In what follows, we are interested in the concentration properties of the empirical analog of the prediction error, $\|Z\hat\beta^{\alpha}(Y, Z) - X\beta\|$.

The next theorem shows that when $Z$ is given, this random variable is concentrated in the vicinity of $\sqrt{\min_{\alpha \ge \alpha_\circ} R^{\alpha}(\beta, Z)}$, where

\[
R^{\alpha}(\beta, Z) \stackrel{\mathrm{def}}{=} E_Z\bigl\|Z\hat\beta^{\alpha}(Y, Z) - X\beta\bigr\|^2 = \bigl\|[I - H^{\alpha}(Z)]X\beta\bigr\|^2 + \epsilon^2 \operatorname{tr}\bigl[H^{\alpha}(Z)^\top H^{\alpha}(Z)\bigr].
\]

Here and in what follows $E_Z$ and $P_Z$ stand for the expectation and probability measure generated by the observations $\{Y, Z\}$ from (1) when $Z$ is a given matrix.

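Theorem 2 For any $\lambda \in \mathbb{R}^+$ and any $\alpha \ge \alpha_\circ$

\[
E_Z \exp\Bigl\{ \lambda \bigl\|Z\hat\beta^{\hat\alpha(Y,Z)}(Y, Z) - X\beta\bigr\| \Bigr\} \le \exp\Bigl\{ \lambda\sqrt{R^{\alpha}(\beta, Z)} + K\epsilon^2\lambda^2 \Bigr\}. \tag{8}
\]

This theorem makes it possible to control the concentration of the empirical prediction error of the data-driven ridge regression estimate via the concentration of $R^{\alpha}(\beta, Z)$ for given $\alpha$. More precisely, it says that for any $\alpha \ge \alpha_\circ$

\[
\bigl\|Z\hat\beta^{\hat\alpha(Y,Z)}(Y, Z) - X\beta\bigr\| \stackrel{P}{\le} \sqrt{R^{\alpha}(\beta, Z)} + K\epsilon\xi_+,
\]

where $\xi_+$ is a positive sub-Gaussian random variable and the symbol $\stackrel{P}{\le}$ means that this inequality is interpreted similarly to (3).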

Therefore, our next step is to study the concentration of $R^{\alpha}(\beta, Z)$. Unfortunately, in spite of its apparent simplicity, this problem is rather difficult from a mathematical viewpoint. Therefore, in this paper, we focus solely on the second order approximation (with respect to $\sigma$) of $R^{\alpha}(\beta, Z)$. In other words, we will focus on the random variable $\widetilde R^{\alpha}_{\sigma}(\beta, X, \zeta)$ defined by

\[
R^{\alpha}(\beta, X + \sigma\zeta) = \widetilde R^{\alpha}_{\sigma}(\beta, X, \zeta) + o(\sigma^2), \quad \text{as } \sigma \to 0. \tag{9}
\]

In order to make concentration inequalities simpler and more transparent, we will assume the following condition.

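Condition 1 There exists a constant $Q \ge 1$ such that for any $\alpha \in \mathbb{R}^+$

\[
\operatorname{tr}\bigl[H^{\alpha}(X)\bigr] \le Q \operatorname{tr}\bigl[H^{\alpha}(X)^\top H^{\alpha}(X)\bigr].
\]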

Roughly speaking, this condition means that eigenvalues λ_k(X^⊤X) of X^⊤X decrease rather rapidly with k. We emphasize that only in this case Tikhonov’s regularization may have significant advantages with respect to the maximum likelihood estimate.

We begin our study of $\widetilde R^{\alpha}_{\sigma}(\beta, X, \zeta)$ with the simple case, assuming that $X$ in (1) is a diagonal matrix. So, we have at our disposal the noisy observations

\[
Y_i = X_i\beta_i + \epsilon\xi_i, \qquad Z_i = X_i + \sigma\zeta_i, \qquad i = 1, \dots, p, \tag{10}
\]

where $\zeta, \xi$ are independent standard white Gaussian noises, and we estimate $\beta \in \mathbb{R}^p$ with the help of the ridge regression method.

Notice that this statistical model may be viewed as a simple probabilistic model describing the famous blind deconvolution problem (see e.g. [4, 11]).


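Theorem 3 Suppose $X$ in (1) is a diagonal matrix and Condition 1 holds. Then for any given $\alpha \ge \alpha_\circ$ and any $\lambda \in \mathbb{R}^+$

\[
E\exp\Bigl\{ \lambda\sqrt{\bigl[\widetilde R^{\alpha}_{\sigma}(\beta, X, \zeta)\bigr]_+} \Bigr\}
\le \exp\Bigl\{ \lambda\Bigl[\Bigl(1 + \frac{KQ\sigma^2}{\alpha}\Bigr) R^{\alpha}(\beta, X)\Bigr]^{1/2}
+ \frac{9\sigma^2\lambda^2}{8} \max\Bigl\{\frac{\epsilon^2}{\alpha}, \|\beta\|_\infty^2\Bigr\} \Bigr\}. \tag{11}
\]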

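Let us briefly discuss this result. If we choose

\[
\alpha = \alpha^\circ(\beta, X) = \arg\min_{\alpha \ge \alpha_\circ} R^{\alpha}(\beta, X),
\]

then we obtain from (11) the following inequality:

\[
E\exp\Bigl\{ \lambda\sqrt{\bigl[\widetilde R^{\alpha^\circ}_{\sigma}(\beta, X, \zeta)\bigr]_+} \Bigr\}
\le \exp\Bigl\{ \lambda\Bigl[\Bigl(1 + \frac{KQ\sigma^2}{\alpha_\circ}\Bigr) R^{\alpha^\circ}(\beta, X)\Bigr]^{1/2}
+ \frac{9\sigma^2\lambda^2}{8} \max\Bigl\{\frac{\epsilon^2}{\alpha_\circ}, \|\beta\|_\infty^2\Bigr\} \Bigr\}. \tag{12}
\]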

Thus, we see that compared to (8) the noise in $X$ results in

• an increase of the oracle risk by the factor $1 + KQ\sigma^2/\alpha_\circ$;

• a shift of the "mean" prediction error by

\[
\frac{3}{2} \max\Bigl\{\frac{\epsilon}{\sqrt{\alpha_\circ}}, \|\beta\|_\infty\Bigr\} \sigma\xi_+,
\]

where $\xi_+$ is a positive sub-Gaussian random variable.

Remark 1. The parameter $\alpha_\circ$ (see (7)) plays a rather important role when the design is blurred. It follows from (12) that in order to guarantee a global concentration rate similar to (8), $\alpha_\circ$ must be of order $\epsilon^2$ and $\sigma^2$ of order $\epsilon^2$.

In the general case, the concentration inequality for $[\widetilde R^{\alpha}_{\sigma}(\beta, X, \zeta)]_+$ has a slightly more complicated form.

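Theorem 4 Suppose Condition 1 holds. Then for any given $\alpha \ge \alpha_\circ$ and any $\lambda \in \mathbb{R}^+$

\[
E\exp\Bigl\{ \lambda\sqrt{\bigl[\widetilde R^{\alpha}_{\sigma}(\beta, X, \zeta)\bigr]_+} \Bigr\}
\le \exp\Bigl\{ \lambda\Bigl[\Bigl(1 + \frac{KQp\sigma^2}{\alpha}\Bigr) R^{\alpha}(\beta, X) + \frac{\sigma^2 p\, S^{\alpha}(\beta, X)}{\alpha^2}\Bigr]^{1/2}
+ \frac{K\lambda^2\sigma^2}{\alpha}\Bigl[R^{\alpha}(\beta, X) + \epsilon^2 + \frac{S^{\alpha}(\beta, X)}{\alpha}\Bigr] \Bigr\}, \tag{13}
\]

where

\[
S^{\alpha}(\beta, X) = \bigl\|X^\top[I - H^{\alpha}(X)]X\beta\bigr\|^2.
\]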

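Remark 2. Since $I - H^{\alpha}(X) = \alpha(\alpha I + XX^\top)^{-1}$, it is easy to see that $S^{\alpha}(\beta, X)$ can be bounded from above as follows:

\[
S^{\alpha}(\beta, X) = \alpha^2\bigl\|X^\top(\alpha I + XX^\top)^{-1}X\beta\bigr\|^2 \le \alpha^2\|\beta\|^2.
\]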
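The bound of Remark 2 is easy to check numerically. The sketch below uses an arbitrary random design (an assumption of this illustration) and relies on the identity $I - H^{\alpha}(X) = \alpha(\alpha I + XX^\top)^{-1}$ to evaluate $S^{\alpha}(\beta, X)$ in two ways before comparing it with $\alpha^2\|\beta\|^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, alpha = 7, 4, 0.3
X = rng.standard_normal((n, p))      # illustrative design, not from the paper
beta = rng.standard_normal(p)

# H^alpha(X) = X (X^T X + alpha I)^{-1} X^T
H = X @ np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T)

# S^alpha(beta, X) from its definition
S_def = np.linalg.norm(X.T @ (np.eye(n) - H) @ X @ beta) ** 2

# The same quantity using I - H = alpha (alpha I + X X^T)^{-1}
S_alt = alpha**2 * np.linalg.norm(
    X.T @ np.linalg.solve(alpha * np.eye(n) + X @ X.T, X @ beta)) ** 2

bound = alpha**2 * np.linalg.norm(beta) ** 2
```

The inequality holds because the matrix $X^\top(\alpha I + XX^\top)^{-1}X$ has eigenvalues $\lambda_k/(\alpha + \lambda_k) \le 1$.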

Remark 3. Equation (13) reveals the principal drawback of the naive plug-in approach. This method works well solely when $p\sigma^2$ is small, more precisely, when this value is of the order of the oracle regularization parameter (4). However, we emphasize once again that an important advantage of this method is its low numerical complexity.

In order to improve the poor concentration of the plug-in approach, we have to incorporate a priori information about $X$ into the estimate of $(X^\top X)^{-1}$. This may be done in different ways. For instance, in [4] it is assumed that

\[
X = \sum_{k=1}^{p} \sqrt{b_k}\, \phi_k \phi_k^{*\top},
\]

where

• $b_k \in \mathbb{R}^+$ are unknown, but such that $b_1 \ge b_2 \ge \dots \ge b_p$;

• $\phi_k$ and $\phi^*_k$ are known orthonormal systems in $\mathbb{R}^n$ and $\mathbb{R}^p$, respectively.

It is easy to check that this assumption reduces (1) to the diagonal model (10).

Notice also that this model is equivalent to very specific deconvolution problems (see [11]). For classical statistical linear models, where usually n≫p, the above hypothesis looks rather restrictive.

A less restrictive assumption would be to suppose

\[
X^\top X = \sum_{k=1}^{p} b_k \phi_k \phi_k^\top, \tag{14}
\]

where $b_k \in \mathbb{R}^+$ are unknown and $\phi_k$ is a known orthonormal system in $\mathbb{R}^p$.

2 Proofs

2.1 Proof of Theorem 2

The proof of the next simple auxiliary lemma can be found in [10].

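Lemma 1 I) If for some $U, u \ge 0$

\[
U \le u + \min_{\gamma \in (0, F]} \Bigl\{ \frac{x^2}{\gamma} + \gamma U \Bigr\},
\]

then

\[
\sqrt{U} \le \sqrt{u} + |x| \max\Bigl\{2, \sqrt{\frac{2}{F}}\Bigr\}.
\]

II) If for some $U, u \ge 0$

\[
U \le u + \min_{\gamma \in (0, F]} \Bigl\{ \frac{x^2}{\gamma} + \gamma u \Bigr\},
\]

then

\[
\sqrt{U} \le \sqrt{u} + |x| \max\Bigl\{1, \sqrt{\frac{2}{F}}\Bigr\}.
\]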

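2.1.1 Proof of the theorem

Let $s_k(Z) \in \mathbb{R}^+$, $\phi_k(Z) \in \mathbb{R}^p$, $k = 1, \dots, p$, be the eigenvalues and eigenvectors of $Z^\top Z$, i.e.,

\[
Z^\top Z \phi_k(Z) = s_k(Z)\phi_k(Z).
\]

It is easy to check that

\[
\phi^*_k(Z) = \frac{Z\phi_k(Z)}{\sqrt{s_k(Z)}}, \qquad k = 1, \dots, p,
\]

are orthonormal vectors in $\mathbb{R}^n$ and

\[
Z = \sum_{k=1}^{p} \sqrt{s_k(Z)}\, \phi^*_k(Z)\phi_k^\top(Z), \qquad
ZZ^\top = \sum_{k=1}^{p} s_k(Z)\, \phi^*_k(Z)\phi_k^{*\top}(Z).
\]

Therefore

\[
H^{\alpha}(Z) = (\alpha I + ZZ^\top)^{-1} ZZ^\top = \sum_{k=1}^{p} h^{\alpha}_k(Z)\, \phi^*_k(Z)\phi_k^{*\top}(Z),
\]

where

\[
h^{\alpha}_k(Z) = \frac{s_k(Z)}{\alpha + s_k(Z)}.
\]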
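The spectral representation of $H^{\alpha}(Z)$ can be verified numerically. The sketch below, with an arbitrary random $Z$ (an assumption made for illustration), builds $\phi^*_k(Z)$ from the eigenvectors of $Z^\top Z$ and compares the spectral sum with the matrix definition $Z(Z^\top Z + \alpha I)^{-1}Z^\top$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, alpha = 6, 3, 0.7
Z = rng.standard_normal((n, p))        # illustrative matrix, full column rank a.s.

s, phi = np.linalg.eigh(Z.T @ Z)       # eigenvalues s_k and eigenvectors phi_k (columns)
phi_star = Z @ phi / np.sqrt(s)        # phi*_k = Z phi_k / sqrt(s_k), as columns
h = s / (alpha + s)                    # h_k = s_k / (alpha + s_k)

H_spec = (phi_star * h) @ phi_star.T   # sum_k h_k phi*_k phi*_k^T
H_def = Z @ np.linalg.solve(Z.T @ Z + alpha * np.eye(p), Z.T)
gram = phi_star.T @ phi_star           # should be the p x p identity
```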

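Projecting the observations $Y$ onto the linear space spanned by the vectors $\{\phi^*_k(Z), k = 1, \dots, p\}$, we arrive at estimating

\[
\bar\mu_k(Z) \stackrel{\mathrm{def}}{=} \bigl\langle X\beta, \phi^*_k(Z) \bigr\rangle, \qquad k = 1, \dots, p,
\]

based on the observations

\[
\bar Y_k(Z) \stackrel{\mathrm{def}}{=} \bigl\langle Y, \phi^*_k(Z) \bigr\rangle = \bar\mu_k(Z) + \epsilon\xi'_k, \qquad k = 1, \dots, p, \tag{15}
\]

where $\xi'_k$ are i.i.d. standard Gaussian random variables.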

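It is also easy to check that

\[
\bigl\|Z\hat\beta^{\alpha}(Y, Z) - X\beta\bigr\|^2 = \sum_{k=1}^{p} \bigl[\bar\mu_k(Z) - h^{\alpha}_k(Z)\bar Y_k(Z)\bigr]^2,
\]

\[
\bigl\|Z\hat\beta^{\alpha}(Y, Z) - Y\bigr\|^2 = \sum_{k=1}^{p} \bigl[\bar Y_k(Z) - h^{\alpha}_k(Z)\bar Y_k(Z)\bigr]^2.
\]

Therefore, in view of (15), we have

\[
\bigl\|Z\hat\beta^{\alpha}(Y, Z) - X\beta\bigr\|^2 = L[\bar\mu(Z), h^{\alpha}(Z)]
- 2\epsilon \sum_{k=1}^{p} \bigl[1 - h^{\alpha}_k(Z)\bigr] h^{\alpha}_k(Z)\bar\mu_k(Z)\xi'_k
+ \epsilon^2 \sum_{k=1}^{p} \bigl[h^{\alpha}_k(Z)\bigr]^2 \bigl[(\xi'_k)^2 - 1\bigr] \tag{16}
\]

and

\[
\begin{aligned}
\bigl\|Z\hat\beta^{\alpha}(Y, Z) - Y\bigr\|^2 ={}& L[\bar\mu(Z), h^{\alpha}(Z)] - 2\epsilon^2 \sum_{k=1}^{p} h^{\alpha}_k(Z) + \epsilon^2 \sum_{k=1}^{p} (\xi'_k)^2 \\
&+ 2\epsilon \sum_{k=1}^{p} \bigl[1 - h^{\alpha}_k(Z)\bigr]^2 \bar\mu_k(Z)\xi'_k
+ \epsilon^2 \sum_{k=1}^{p} \Bigl\{\bigl[h^{\alpha}_k(Z)\bigr]^2 - 2h^{\alpha}_k(Z)\Bigr\} \bigl[(\xi'_k)^2 - 1\bigr],
\end{aligned} \tag{17}
\]

where here and in what follows

\[
L[\bar\mu(Z), h^{\alpha}(Z)] = \sum_{k=1}^{p} \bigl[1 - h^{\alpha}_k(Z)\bigr]^2 \bar\mu_k^2(Z) + \epsilon^2 \sum_{k=1}^{p} \bigl[h^{\alpha}_k(Z)\bigr]^2.
\]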

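In view of the definition of $\hat\alpha(Y, Z)$ (see (7)), we have for any given $\alpha \ge \alpha_\circ$

\[
\bigl\|Z\hat\beta^{\hat\alpha(Y,Z)}(Y, Z) - Y\bigr\|^2 + 2\epsilon^2 \sum_{k=1}^{p} h^{\hat\alpha(Y,Z)}_k(Z)
\le \bigl\|Z\hat\beta^{\alpha}(Y, Z) - Y\bigr\|^2 + 2\epsilon^2 \sum_{k=1}^{p} h^{\alpha}_k(Z),
\]

and using (17), we obtain the following equivalent form of this inequality:

\[
\begin{aligned}
& L\bigl[\bar\mu(Z), h^{\hat\alpha(Y,Z)}(Z)\bigr]
+ 2\epsilon \sum_{k=1}^{p} \bigl[1 - h^{\hat\alpha(Y,Z)}_k(Z)\bigr]^2 \bar\mu_k(Z)\xi'_k
+ \epsilon^2 \sum_{k=1}^{p} \Bigl\{\bigl[h^{\hat\alpha(Y,Z)}_k(Z)\bigr]^2 - 2h^{\hat\alpha(Y,Z)}_k(Z)\Bigr\} \bigl[(\xi'_k)^2 - 1\bigr] \\
&\qquad \le L[\bar\mu(Z), h^{\alpha}(Z)]
+ 2\epsilon \sum_{k=1}^{p} \bigl[1 - h^{\alpha}_k(Z)\bigr]^2 \bar\mu_k(Z)\xi'_k
+ \epsilon^2 \sum_{k=1}^{p} \Bigl\{\bigl[h^{\alpha}_k(Z)\bigr]^2 - 2h^{\alpha}_k(Z)\Bigr\} \bigl[(\xi'_k)^2 - 1\bigr].
\end{aligned}
\]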

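Let $\gamma, \gamma' \in (0, 3/16)$. Then we can rewrite this inequality as follows:

\[
\begin{aligned}
L\bigl[\bar\mu(Z), h^{\hat\alpha(Y,Z)}(Z)\bigr] \le{}& L\bigl[\bar\mu(Z), h^{\alpha}(Z)\bigr] + \rho_1(\gamma) + \gamma L\bigl[\bar\mu(Z), h^{\hat\alpha(Y,Z)}(Z)\bigr] \\
&+ \rho_2(\gamma) + \gamma L\bigl[\bar\mu(Z), h^{\hat\alpha(Y,Z)}(Z)\bigr] + \rho_3(\gamma') + \gamma' L\bigl[\bar\mu(Z), h^{\alpha}(Z)\bigr] \\
&+ \rho_4(\gamma') + \gamma' L\bigl[\bar\mu(Z), h^{\alpha}(Z)\bigr],
\end{aligned} \tag{18}
\]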

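where

\[
\rho_1(\gamma) = -2\epsilon \sum_{k=1}^{p} \bigl[1 - h^{\hat\alpha(Y,Z)}_k(Z)\bigr]^2 \bar\mu_k(Z)\xi'_k - \gamma \sum_{k=1}^{p} \bigl[1 - h^{\hat\alpha(Y,Z)}_k(Z)\bigr]^2 \bar\mu_k^2(Z),
\]

\[
\rho_2(\gamma) = -\epsilon^2 \sum_{k=1}^{p} \Bigl\{\bigl[h^{\hat\alpha(Y,Z)}_k(Z)\bigr]^2 - 2h^{\hat\alpha(Y,Z)}_k(Z)\Bigr\} \bigl[(\xi'_k)^2 - 1\bigr] - \epsilon^2\gamma \sum_{k=1}^{p} \bigl[h^{\hat\alpha(Y,Z)}_k(Z)\bigr]^2,
\]

\[
\rho_3(\gamma') = 2\epsilon \sum_{k=1}^{p} \bigl[1 - h^{\alpha}_k(Z)\bigr]^2 \bar\mu_k(Z)\xi'_k - \gamma' \sum_{k=1}^{p} \bigl[1 - h^{\alpha}_k(Z)\bigr]^2 \bar\mu_k^2(Z),
\]

and

\[
\rho_4(\gamma') = \epsilon^2 \sum_{k=1}^{p} \Bigl\{\bigl[h^{\alpha}_k(Z)\bigr]^2 - 2h^{\alpha}_k(Z)\Bigr\} \bigl[(\xi'_k)^2 - 1\bigr] - \gamma'\epsilon^2 \sum_{k=1}^{p} \bigl[h^{\alpha}_k(Z)\bigr]^2.
\]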

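By Lemma 10 we obtain for any integer $m \ge 1$

\[
\bigl\{E\rho_l^m(\gamma)\bigr\}^{1/m} \le \frac{K\epsilon^2 m}{\gamma}, \qquad l = 1, \dots, 4.
\]

Therefore with (18) we arrive at

\[
\begin{aligned}
\Bigl\{E L\bigl[\bar\mu(Z), h^{\hat\alpha(Y,Z)}(Z)\bigr]^m\Bigr\}^{1/m} \le{}& L\bigl[\bar\mu(Z), h^{\alpha}(Z)\bigr] + \frac{K\epsilon^2 m}{\gamma} + \gamma\Bigl\{E L\bigl[\bar\mu(Z), h^{\hat\alpha(Y,Z)}(Z)\bigr]^m\Bigr\}^{1/m} \\
&+ \frac{K\epsilon^2 m}{\gamma'} + \gamma' L\bigl[\bar\mu(Z), h^{\alpha}(Z)\bigr].
\end{aligned}
\]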

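So, minimizing the right-hand side of this inequality in $\gamma, \gamma' \in (0, 3/16)$ and using Lemma 1, we obtain that for any integer $m$

\[
\Bigl\{E\Bigl[\sqrt{L\bigl[\bar\mu(Z), h^{\hat\alpha(Y,Z)}(Z)\bigr]}\Bigr]^m\Bigr\}^{1/m} \le \sqrt{L\bigl[\bar\mu(Z), h^{\alpha}(Z)\bigr]} + K\epsilon\sqrt{m}. \tag{19}
\]

A similar technique is used in proving (see (16)) the following inequality:

\[
\Bigl\{E_Z\bigl\|Z\hat\beta^{\hat\alpha(Y,Z)}(Y, Z) - X\beta\bigr\|^m\Bigr\}^{1/m} \le \Bigl\{E_Z\Bigl[\sqrt{L\bigl[\bar\mu(Z), h^{\hat\alpha(Y,Z)}(Z)\bigr]}\Bigr]^m\Bigr\}^{1/m} + K\epsilon\sqrt{m}.
\]

Finally, combining this inequality with (19), we finish the proof.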

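Proof of Theorem 3. Let us first compute $\widetilde R^{\alpha}_{\sigma}(\beta, X, \zeta)$. We will make use of the following formula:

\[
R^{\alpha}(\beta, Z) = L[\beta, H^{\alpha}(Z)] \stackrel{\mathrm{def}}{=} \sum_{k=1}^{p} \Bigl\{ \bigl[1 - H^{\alpha}_k(Z)\bigr]^2 X_k^2\beta_k^2 + \epsilon^2 \bigl[H^{\alpha}_k(Z)\bigr]^2 \Bigr\},
\]

where

\[
H^{\alpha}_k(Z) = \frac{Z_k^2}{\alpha + Z_k^2}.
\]

We obviously have

\[
\frac{\partial}{\partial Z_k} L[\beta, H^{\alpha}(Z)] = \Bigl\{ 2\epsilon^2 H^{\alpha}_k(Z) - 2\bigl[1 - H^{\alpha}_k(Z)\bigr] X_k^2\beta_k^2 \Bigr\} \frac{\partial}{\partial Z_k} H^{\alpha}_k(Z),
\]

\[
\frac{\partial^2}{\partial Z_k^2} L[\beta, H^{\alpha}(Z)] = \Bigl\{ 2\epsilon^2 H^{\alpha}_k(Z) - 2\bigl[1 - H^{\alpha}_k(Z)\bigr] X_k^2\beta_k^2 \Bigr\} \frac{\partial^2}{\partial Z_k^2} H^{\alpha}_k(Z) + \bigl[2X_k^2\beta_k^2 + 2\epsilon^2\bigr] \Bigl[\frac{\partial}{\partial Z_k} H^{\alpha}_k(Z)\Bigr]^2,
\]

and

\[
\frac{\partial}{\partial Z_k} H^{\alpha}_k(Z) = \frac{2\alpha Z_k}{(\alpha + Z_k^2)^2}, \qquad
\frac{\partial^2}{\partial Z_k^2} H^{\alpha}_k(Z) = \frac{2\alpha(\alpha - 3Z_k^2)}{(\alpha + Z_k^2)^3}.
\]

Therefore with the help of Taylor's expansion we arrive at (see (9))

\[
\widetilde R^{\alpha}_{\sigma}(\beta, X, \zeta) = R^{\alpha}(\beta, X) + \sigma\Delta^{\alpha}_1(\beta, X, \zeta) + \frac{\sigma^2}{2}\Delta^{\alpha}_2(\beta, X, \zeta),
\]

where

\[
\Delta^{\alpha}_1(\beta, X, \zeta) = \sum_{k=1}^{p} \zeta_k \frac{\partial}{\partial X_k} L[\beta, H^{\alpha}(X)]
= 4\sum_{k=1}^{p} \zeta_k \Bigl\{ \epsilon^2 H^{\alpha}_k(X) - \bigl[1 - H^{\alpha}_k(X)\bigr] X_k^2\beta_k^2 \Bigr\} \bigl[1 - H^{\alpha}_k(X)\bigr] \frac{X_k}{\alpha + X_k^2}, \tag{20}
\]

and, substituting the above derivatives of $H^{\alpha}_k$ at $Z = X$,

\[
\begin{aligned}
\Delta^{\alpha}_2(\beta, X, \zeta) &= \sum_{k=1}^{p} \zeta_k^2 \frac{\partial^2}{\partial X_k^2} L[\beta, H^{\alpha}(X)] \\
&= 4\sum_{k=1}^{p} \zeta_k^2 X_k^2\beta_k^2 \bigl[1 - H^{\alpha}_k(X)\bigr]^2 \frac{6H^{\alpha}_k(X) - 1}{\alpha + X_k^2}
+ 12\epsilon^2 \sum_{k=1}^{p} \zeta_k^2 H^{\alpha}_k(X) \bigl[1 - H^{\alpha}_k(X)\bigr] \frac{1 - 2H^{\alpha}_k(X)}{\alpha + X_k^2}.
\end{aligned} \tag{21}
\]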
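The derivative formulas used in this Taylor expansion can be sanity-checked by finite differences. The sketch below works with a single coordinate $k$, takes arbitrary numerical values for $X_k$, $\beta_k$, $\epsilon$, $\alpha$ (assumptions of this illustration), and compares a central second difference of $z \mapsto [1 - H(z)]^2 X_k^2\beta_k^2 + \epsilon^2 H(z)^2$ at $z = X_k$ with the chain-rule expression, in which the factor multiplying $(\partial H/\partial Z_k)^2$ is taken to be $2X_k^2\beta_k^2 + 2\epsilon^2$.

```python
import numpy as np

def H(z, alpha):
    """H_k^alpha(z) = z^2 / (alpha + z^2) for a single coordinate."""
    return z**2 / (alpha + z**2)

def L_k(z, x, b, eps, alpha):
    """One summand of L[beta, H^alpha(Z)] in the diagonal model."""
    return (1 - H(z, alpha)) ** 2 * x**2 * b**2 + eps**2 * H(z, alpha) ** 2

def d2L_chain(x, b, eps, alpha):
    """Second derivative in z at z = x obtained from the chain rule."""
    h = H(x, alpha)
    dH = 2 * alpha * x / (alpha + x**2) ** 2
    d2H = 2 * alpha * (alpha - 3 * x**2) / (alpha + x**2) ** 3
    return ((2 * eps**2 * h - 2 * (1 - h) * x**2 * b**2) * d2H
            + (2 * x**2 * b**2 + 2 * eps**2) * dH**2)

# Arbitrary test point and step size (illustrative values only).
x, b, eps, alpha, t = 0.8, 1.3, 0.5, 0.4, 1e-4
d2L_fd = (L_k(x + t, x, b, eps, alpha) - 2 * L_k(x, x, b, eps, alpha)
          + L_k(x - t, x, b, eps, alpha)) / t**2
d2L_exact = d2L_chain(x, b, eps, alpha)
```

Agreement up to the discretization error of the central difference supports the expressions for the first and second derivatives of $H^{\alpha}_k$.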

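By (20) we have

\[
\begin{aligned}
E\bigl[\Delta^{\alpha}_1(\beta, X, \zeta)\bigr]^2
&= \sum_{k=1}^{p} \Bigl\{ \epsilon^2 H^{\alpha}_k(X) - \bigl[1 - H^{\alpha}_k(X)\bigr] X_k^2\beta_k^2 \Bigr\}^2 \bigl[1 - H^{\alpha}_k(X)\bigr]^2 \frac{X_k^2}{(\alpha + X_k^2)^2} \\
&\le \epsilon^4 \sum_{k=1}^{p} \bigl[H^{\alpha}_k(X)\bigr]^2 \bigl[1 - H^{\alpha}_k(X)\bigr]^2 \frac{X_k^2}{(\alpha + X_k^2)^2}
+ \sum_{k=1}^{p} \bigl[1 - H^{\alpha}_k(X)\bigr]^4 (X_k^2\beta_k^2)^2 \frac{X_k^2}{(\alpha + X_k^2)^2} \\
&\le \frac{\epsilon^2}{\alpha}\, \epsilon^2 \sum_{k=1}^{p} \bigl[H^{\alpha}_k(X)\bigr]^2
+ \max_k \Bigl\{ \bigl[1 - H^{\alpha}_k(X)\bigr]^2 X_k^2\beta_k^2 \frac{X_k^2}{(\alpha + X_k^2)^2} \Bigr\}
\times \sum_{k=1}^{p} \bigl[1 - H^{\alpha}_k(X)\bigr]^2 X_k^2\beta_k^2 \\
&\le \max\Bigl\{\frac{\epsilon^2}{\alpha}, \|\beta\|_\infty^2\Bigr\} R^{\alpha}(\beta, X).
\end{aligned}
\]

Since $\Delta^{\alpha}_1(\beta, X, \zeta)$ is a zero mean Gaussian random variable, we obtain from the above inequality

\[
\begin{aligned}
E\exp\Bigl[ \lambda\sqrt{R^{\alpha}(\beta, X) + \sigma\bigl[\Delta^{\alpha}_1(\beta, X, \zeta)\bigr]_+} \Bigr]
&\le \exp\Bigl[\lambda\sqrt{R^{\alpha}(\beta, X)}\Bigr]\, E\exp\Bigl\{ \frac{\lambda\sigma}{2} \max\Bigl\{\frac{\epsilon}{\sqrt{\alpha}}, \|\beta\|_\infty\Bigr\} \xi_+ \Bigr\} \\
&\le \exp\Bigl\{ \lambda\sqrt{R^{\alpha}(\beta, X)} + \frac{\lambda^2\sigma^2}{8} \max\Bigl\{\frac{\epsilon^2}{\alpha}, \|\beta\|_\infty^2\Bigr\} \Bigr\}.
\end{aligned} \tag{22}
\]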

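Next (see (21)),

\[
\bigl[\Delta^{\alpha}_2(\beta, X, \zeta)\bigr]_+ \le K \sum_{k=1}^{p} \Bigl\{ \epsilon^2 H^{\alpha}_k(X) + \bigl[1 - H^{\alpha}_k(X)\bigr]^2 X_k^2\beta_k^2 \Bigr\} \frac{\zeta_k^2}{\alpha + X_k^2}.
\]

Therefore by Condition 1

\[
E\bigl[\Delta^{\alpha}_2(\beta, X, \zeta)\bigr]_+ \le \frac{KQ}{\alpha} R^{\alpha}(\beta, X). \tag{23}
\]

Next we make use of the following simple algebraic fact:

\[
\max_{x \ge 0} \Bigl\{ \frac{\alpha^2 x}{(\alpha + x)^3} + \frac{\alpha x}{(\alpha + x)^2} \Bigr\} = \frac{2}{3\sqrt{3}} < \frac{1}{2}.
\]

With this equation we get

\[
\begin{aligned}
\max_k \Bigl\{ \frac{\epsilon^2 H^{\alpha}_k(X)}{\alpha + X_k^2} + \frac{\bigl[1 - H^{\alpha}_k(X)\bigr]^2 X_k^2}{\alpha + X_k^2} \beta_k^2 \Bigr\}
&\le \max\Bigl\{\frac{\epsilon^2}{\alpha}, \|\beta\|_\infty^2\Bigr\} \max_k \Bigl\{ \frac{\alpha X_k^2}{(\alpha + X_k^2)^2} + \frac{\alpha^2 X_k^2}{(\alpha + X_k^2)^3} \Bigr\} \\
&\le \max\Bigl\{\frac{\epsilon^2}{\alpha}, \|\beta\|_\infty^2\Bigr\} \max_{x \ge 0} \Bigl\{ \frac{\alpha x}{(\alpha + x)^2} + \frac{\alpha^2 x}{(\alpha + x)^3} \Bigr\}
\le \frac{1}{2} \max\Bigl\{\frac{\epsilon^2}{\alpha}, \|\beta\|_\infty^2\Bigr\}
\end{aligned}
\]

and we obtain by (23) and Lemma 11

\[
E\exp\Bigl[ \lambda\sqrt{\sigma^2 \bigl[\Delta^{\alpha}_2(\beta, X, \zeta)\bigr]_+} \Bigr]
\le \exp\Bigl\{ \lambda\sqrt{\frac{KQ\sigma^2 R^{\alpha}(\beta, X)}{\alpha}} + \frac{\sigma^2\lambda^2}{2} \max\Bigl\{\frac{\epsilon^2}{\alpha}, \|\beta\|_\infty^2\Bigr\} \Bigr\}.
\]

Finally, combining this inequality and (22) with the help of Lemma 12, we complete the proof.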
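The algebraic fact used in the proof above can be confirmed numerically. With $t = x/(\alpha + x)$ the function equals $t(1 - t)(2 - t)$, whose maximum on $[0, 1]$ is attained at $t = 1 - 1/\sqrt{3}$, that is, at $x = \alpha(\sqrt{3} - 1)$. The sketch below evaluates the function at this point and on a fine grid; the value of $\alpha$ and the grid are arbitrary choices made for illustration.

```python
import numpy as np

def f(x, alpha):
    """alpha^2 x / (alpha + x)^3 + alpha x / (alpha + x)^2."""
    return alpha**2 * x / (alpha + x) ** 3 + alpha * x / (alpha + x) ** 2

alpha = 2.5                                    # arbitrary positive value
x_star = alpha * (np.sqrt(3) - 1)              # maximizer: x/(alpha+x) = 1 - 1/sqrt(3)
exact = 2 / (3 * np.sqrt(3))                   # claimed maximum, about 0.385

xs = alpha * np.geomspace(1e-6, 1e6, 200001)   # log-spaced grid covering 12 decades
grid_max = f(xs, alpha).max()
value_at_star = f(x_star, alpha)
```

The maximum does not depend on $\alpha$, which is why the bound by $1/2$ holds uniformly in the regularization parameter.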

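Proof of Theorem 4. First, notice (see (6)) that $\hat\beta^{\alpha}(Y, Z)$ is a solution to

\[
Z^\top\bigl[Y - Z\hat\beta^{\alpha}(Y, Z)\bigr] = \alpha\hat\beta^{\alpha}(Y, Z). \tag{24}
\]

Since we are interested in $Z\hat\beta^{\alpha}(Y, Z)$, we denote for brevity

\[
\hat\mu^{\alpha}(Y, Z) = Z\hat\beta^{\alpha}(Y, Z)
\]

and we obtain from (24)

\[
ZZ^\top\bigl[Y - \hat\mu^{\alpha}(Y, Z)\bigr] = \alpha\hat\mu^{\alpha}(Y, Z),
\]

or, equivalently,

\[
\hat\mu^{\alpha}(Y, Z) = H^{\alpha}(Z)Y, \quad \text{where } H^{\alpha}(Z) = (\alpha I + ZZ^\top)^{-1} ZZ^\top.
\]

Thus, our problem is reduced to estimating $\mu = X\beta \in \mathbb{R}^n$ based on the noisy data

\[
Y = \mu + \epsilon\xi, \qquad Z = X + \sigma\zeta
\]

with the help of the plug-in ridge regression estimates

\[
\hat\mu^{\alpha}(Y, Z) = H^{\alpha}(Z)Y, \qquad \alpha \ge \alpha_\circ.
\]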

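In order to compute $\widetilde R^{\alpha}_{\sigma}(\beta, X, \zeta)$, we will need the second order approximation for $H^{\alpha}(Z)$, which can be obtained with the help of the following simple formulas:

\[
\begin{aligned}
H^{\alpha}(Z) - H^{\alpha}(X) &= \alpha(\alpha I + XX^\top)^{-1} - \alpha(\alpha I + ZZ^\top)^{-1} \\
&= (\alpha I + XX^\top)^{-1}\bigl[ZZ^\top - XX^\top\bigr]\bigl[I - H^{\alpha}(Z)\bigr] \\
&= \alpha^{-1}\bigl[I - H^{\alpha}(X)\bigr]\bigl[ZZ^\top - XX^\top\bigr]\bigl[I - H^{\alpha}(Z)\bigr]
\end{aligned}
\]

and similarly

\[
\begin{aligned}
H^{\alpha}(Z) - H^{\alpha}(X) &= \alpha(\alpha I + XX^\top)^{-1} - \alpha(\alpha I + ZZ^\top)^{-1} \\
&= (\alpha I + ZZ^\top)^{-1}\bigl[ZZ^\top - XX^\top\bigr]\bigl[I - H^{\alpha}(X)\bigr] \\
&= \alpha^{-1}\bigl[I - H^{\alpha}(Z)\bigr]\bigl[ZZ^\top - XX^\top\bigr]\bigl[I - H^{\alpha}(X)\bigr].
\end{aligned}
\]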

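From the above equations we obtain

\[
\begin{aligned}
H^{\alpha}(Z) ={}& H^{\alpha}(X) + \alpha^{-1}\bigl[I - H^{\alpha}(X)\bigr]\bigl[ZZ^\top - XX^\top\bigr]\bigl[I - H^{\alpha}(Z)\bigr] \\
={}& H^{\alpha}(X) + \alpha^{-1}\bigl[I - H^{\alpha}(X)\bigr]\bigl[ZZ^\top - XX^\top\bigr]\bigl[I - H^{\alpha}(X)\bigr] \\
&- \alpha^{-1}\bigl[I - H^{\alpha}(X)\bigr]\bigl[ZZ^\top - XX^\top\bigr]\bigl[H^{\alpha}(Z) - H^{\alpha}(X)\bigr] \\
={}& G^{\alpha}(X, \zeta) + \frac{\sigma^2}{\alpha}\bigl[I - H^{\alpha}(X)\bigr]\zeta\zeta^\top\bigl[I - H^{\alpha}(X)\bigr] \\
&- \alpha^{-2}\bigl[I - H^{\alpha}(X)\bigr]\bigl[ZZ^\top - XX^\top\bigr]\bigl[I - H^{\alpha}(Z)\bigr]\bigl[ZZ^\top - XX^\top\bigr]\bigl[I - H^{\alpha}(X)\bigr],
\end{aligned}
\]

where

\[
G^{\alpha}(X, \zeta) \stackrel{\mathrm{def}}{=} H^{\alpha}(X) + \frac{\sigma}{\alpha}\bigl[I - H^{\alpha}(X)\bigr]\zeta(X)\bigl[I - H^{\alpha}(X)\bigr] \tag{25}
\]

is the first order approximation of $H^{\alpha}(Z)$ and

\[
\zeta(X) = X\zeta^\top + \zeta X^\top.
\]

So, we arrive at the following second order approximation of $H^{\alpha}(Z)$:

\[
\widetilde H^{\alpha}(X, \zeta) = G^{\alpha}(X, \zeta) + \frac{\sigma^2}{\alpha}\bigl[I - H^{\alpha}(X)\bigr]\zeta\zeta^\top\bigl[I - H^{\alpha}(X)\bigr]
- \frac{\sigma^2}{\alpha^2}\bigl[I - H^{\alpha}(X)\bigr]\zeta(X)\bigl[I - H^{\alpha}(X)\bigr]\zeta(X)^\top\bigl[I - H^{\alpha}(X)\bigr]. \tag{26}
\]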
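The orders of the two approximations can be observed numerically: halving $\sigma$ should divide the error of the first order approximation (25) by about $4$ and the error of the second order approximation (26) by about $8$. The sketch below is an illustration under assumptions: the helper names `hat`, `approximations`, `errors`, the random matrices, and the values of $\sigma$ and $\alpha$ are arbitrary choices, and the Frobenius norm is used to measure the error.

```python
import numpy as np

def hat(A, alpha):
    """H^alpha(A) = (alpha I + A A^T)^{-1} A A^T."""
    n = A.shape[0]
    return np.linalg.solve(alpha * np.eye(n) + A @ A.T, A @ A.T)

def approximations(X, zeta, sigma, alpha):
    """First order (25) and second order (26) approximations of H^alpha(X + sigma zeta)."""
    n = X.shape[0]
    HX = hat(X, alpha)
    M = np.eye(n) - HX                      # I - H^alpha(X)
    zX = X @ zeta.T + zeta @ X.T            # zeta(X), a symmetric n x n matrix
    G = HX + sigma / alpha * M @ zX @ M
    H2 = (G + sigma**2 / alpha * M @ zeta @ zeta.T @ M
          - sigma**2 / alpha**2 * M @ zX @ M @ zX.T @ M)
    return G, H2

def errors(X, zeta, sigma, alpha):
    """Frobenius errors of the two approximations."""
    G, H2 = approximations(X, zeta, sigma, alpha)
    HZ = hat(X + sigma * zeta, alpha)
    return np.linalg.norm(HZ - G), np.linalg.norm(HZ - H2)

# Illustrative random matrices (not taken from the paper).
rng = np.random.default_rng(3)
X = rng.standard_normal((6, 4))
zeta = rng.standard_normal((6, 4))
e1_a, e2_a = errors(X, zeta, 1e-3, 1.0)
e1_b, e2_b = errors(X, zeta, 5e-4, 1.0)
ratio_first, ratio_second = e1_a / e1_b, e2_a / e2_b
```

A ratio close to $8$ for the second approximation is consistent with a remainder of order $\sigma^3$.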

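In order to simplify calculations, we will make use of the SVD of $X$. Let $\lambda_k(X) \in \mathbb{R}^+$, $e_k(X) \in \mathbb{R}^p$, $k = 1, \dots, p$, be the eigenvalues and eigenvectors of $X^\top X$, i.e.,

\[
X^\top X e_k(X) = \lambda_k(X) e_k(X), \qquad k = 1, \dots, p.
\]

Define the orthonormal vectors in $\mathbb{R}^n$

\[
e^*_k(X) = \frac{X e_k(X)}{\sqrt{\lambda_k(X)}}, \qquad k = 1, \dots, p.
\]

It is well known and easy to check that

\[
X = \sum_{k=1}^{p} \sqrt{\lambda_k(X)}\, e^*_k(X) e_k^\top(X),
\]

and so,

\[
XX^\top = \sum_{k=1}^{p} \lambda_k(X)\, e^*_k(X) e_k^{*\top}(X), \qquad
H^{\alpha}(X) = \sum_{k=1}^{p} \frac{\lambda_k(X)}{\alpha + \lambda_k(X)}\, e^*_k(X) e_k^{*\top}(X). \tag{27}
\]

Therefore, denoting

\[
h^{\alpha}_k(X) = \frac{\lambda_k(X)}{\alpha + \lambda_k(X)}, \tag{28}
\]

we have

\[
H^{\alpha}(X) = \sum_{k=1}^{p} h^{\alpha}_k(X)\, e^*_k(X) e_k^{*\top}(X), \qquad
I - H^{\alpha}(X) = \sum_{k=1}^{p} \bigl[1 - h^{\alpha}_k(X)\bigr] e^*_k(X) e_k^{*\top}(X).
\]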

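We are ready to obtain the spectral representation of $G^{\alpha}(X, \zeta)$. Let

\[
\bar G^{\alpha}_{ij}(X, \zeta) = e_i^{*\top}(X)\, G^{\alpha}(X, \zeta)\, e^*_j(X).
\]

Then we get from (25)

\[
\bar G^{\alpha}_{ij}(X, \zeta) = h^{\alpha}_i(X)\delta_{ij} + \frac{\sigma}{\alpha}\bigl[1 - h^{\alpha}_i(X)\bigr]\bigl[1 - h^{\alpha}_j(X)\bigr]\Xi_{ij}(X),
\]

where

\[
\Xi_{ij}(X) = e_i^{*\top}(X)\, \zeta(X)\, e^*_j(X).
\]

The covariance function of this Gaussian random matrix is computed in the following lemma.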

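Lemma 2

\[
E\,\Xi_{ks}(X)\Xi_{k's'}(X) = \bigl[\lambda_k(X) + \lambda_s(X)\bigr]\bigl(\delta_{ss'}\delta_{kk'} + \delta_{sk'}\delta_{ks'}\bigr). \tag{29}
\]

Proof. Let $\zeta$ be a random matrix with i.i.d. standard Gaussian entries and $B$ be a non-random matrix. Then

\[
E\zeta B\zeta = B^\top
\]

since

\[
\sum_{k,l} B_{kl} E\zeta_{ik}\zeta_{lj} = \sum_{k,l} B_{kl}\delta_{il}\delta_{kj} = B_{ji}.
\]

Similarly

\[
E\zeta^\top B\zeta^\top = B^\top, \qquad E\zeta B\zeta^\top = I\operatorname{tr}(B), \qquad E\zeta^\top B\zeta = I\operatorname{tr}(B).
\]

Next, noticing that

\[
X = \sum_{l=1}^{p} \sqrt{\lambda_l(X)}\, e^*_l(X) e_l^\top(X),
\]

we get

\[
\Xi_{ij}(X) = e_i^{*\top}(X)\bigl[X\zeta^\top + \zeta X^\top\bigr]e^*_j(X)
= \sqrt{\lambda_i(X)}\, e_i^\top(X)\zeta^\top e^*_j(X) + \sqrt{\lambda_j(X)}\, e_i^{*\top}(X)\zeta e_j(X).
\]

Therefore

\[
\begin{aligned}
E\,\Xi_{ij}(X)\Xi_{kl}(X) ={}& E\Bigl[\sqrt{\lambda_i(X)}\, e_i^\top(X)\zeta^\top e^*_j(X) + \sqrt{\lambda_j(X)}\, e_i^{*\top}(X)\zeta e_j(X)\Bigr] \\
&\times \Bigl[\sqrt{\lambda_k(X)}\, e_k^\top(X)\zeta^\top e^*_l(X) + \sqrt{\lambda_l(X)}\, e_k^{*\top}(X)\zeta e_l(X)\Bigr] \\
={}& \sqrt{\lambda_i(X)\lambda_k(X)}\, e_i^\top(X) E\bigl[\zeta^\top e^*_j(X) e_k^\top(X)\zeta^\top\bigr] e^*_l(X) \\
&+ \sqrt{\lambda_i(X)\lambda_l(X)}\, e_i^\top(X) E\bigl[\zeta^\top e^*_j(X) e_k^{*\top}(X)\zeta\bigr] e_l(X) \\
&+ \sqrt{\lambda_j(X)\lambda_k(X)}\, e_i^{*\top}(X) E\bigl[\zeta e_j(X) e_k^\top(X)\zeta^\top\bigr] e^*_l(X) \\
&+ \sqrt{\lambda_j(X)\lambda_l(X)}\, e_i^{*\top}(X) E\bigl[\zeta e_j(X) e_k^{*\top}(X)\zeta\bigr] e_l(X) \\
={}& \sqrt{\lambda_i(X)\lambda_k(X)}\,\delta_{ik}\delta_{jl} + \sqrt{\lambda_i(X)\lambda_l(X)}\,\delta_{jk}\delta_{il} \\
&+ \sqrt{\lambda_j(X)\lambda_k(X)}\,\delta_{il}\delta_{jk} + \sqrt{\lambda_j(X)\lambda_l(X)}\,\delta_{ik}\delta_{jl},
\end{aligned}
\]

which coincides with (29).

In what follows we will need two concentration inequalities related to $\Xi(X)$.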
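The covariance formula (29) can be checked exactly, without simulation, because each $\Xi_{ij}(X)$ is a linear functional of $\zeta$: writing $\Xi_{ij}(X) = \langle C^{ij}, \zeta\rangle$ with $C^{ij} = \sqrt{\lambda_i}\, e^*_j e_i^\top + \sqrt{\lambda_j}\, e^*_i e_j^\top$, the covariance of $\Xi_{ij}$ and $\Xi_{kl}$ equals the Frobenius inner product $\langle C^{ij}, C^{kl}\rangle$ when $\zeta$ has i.i.d. standard Gaussian entries. The sketch below implements this check; the function names and the random design are assumptions of this illustration.

```python
import numpy as np

def xi_covariance(X):
    """Exact covariance tensor E[Xi_ij Xi_kl] for i.i.d. N(0,1) entries of zeta."""
    n, p = X.shape
    # Thin SVD X = U diag(d) V^T: lambda_k = d_k^2, e_k = V[:, k], e*_k = U[:, k].
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt.T
    C = np.empty((p, p, n, p))               # C[i, j] is the n x p coefficient matrix of Xi_ij
    for i in range(p):
        for j in range(p):
            C[i, j] = (d[i] * np.outer(U[:, j], V[:, i])
                       + d[j] * np.outer(U[:, i], V[:, j]))
    cov = np.einsum('ijab,klab->ijkl', C, C)  # Frobenius inner products <C[i,j], C[k,l]>
    return d**2, cov

def lemma2_formula(lam):
    """(lam_i + lam_j) (delta_ik delta_jl + delta_il delta_jk)."""
    p = len(lam)
    d = np.eye(p)
    pair_sum = lam[:, None] + lam[None, :]
    deltas = np.einsum('ik,jl->ijkl', d, d) + np.einsum('il,jk->ijkl', d, d)
    return pair_sum[:, :, None, None] * deltas

rng = np.random.default_rng(6)
X = rng.standard_normal((5, 3))               # illustrative design
lam, cov = xi_covariance(X)
formula = lemma2_formula(lam)
```

In particular, the diagonal entries satisfy $E\,\Xi_{ii}^2(X) = 4\lambda_i(X)$, in agreement with the formula.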

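Lemma 3 Let $D$ be a $p \times p$ diagonal matrix and $u \in \mathbb{R}^p$. Then for any $\lambda \ge 0$

\[
E\exp\bigl\{\lambda\|D\Xi(X)u\|\bigr\}
\le \exp\Bigl\{ \lambda\sqrt{E\|D\Xi(X)u\|^2}
+ \lambda^2 \max_j \Bigl[ D_{jj}^2\lambda_j(X)\|u\|^2 + D_{jj}^2\bigl\|\sqrt{\lambda(X)}\,u\bigr\|^2 \Bigr]
+ 2\lambda^2\|D\lambda(X)u\| \times \|Du\| \Bigr\}, \tag{30}
\]

where $\lambda(X)$ denotes the diagonal matrix with entries $\lambda_k(X)$.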

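Proof. For the Gaussian random vector $v$ with components

\[
v_i = D_{ii} \sum_{k=1}^{p} \Xi_{ik}(X) u_k
\]

one easily obtains (see (29))

\[
\begin{aligned}
R_{ij} = E v_i v_j &= D_{ii}D_{jj} \sum_{k,l=1}^{p} E\,\Xi_{ik}(X)\Xi_{jl}(X) u_k u_l \\
&= D_{ii}D_{jj} \sum_{k,l=1}^{p} \bigl[\lambda_i(X) + \lambda_k(X)\bigr]\bigl(\delta_{ij}\delta_{kl} + \delta_{il}\delta_{jk}\bigr) u_k u_l \\
&= D_{ii}^2\delta_{ij} \sum_{k=1}^{p} \bigl[\lambda_i(X) + \lambda_k(X)\bigr] u_k^2 + D_{ii}D_{jj}\bigl[\lambda_i(X) + \lambda_j(X)\bigr] u_i u_j.
\end{aligned}
\]

Thus

\[
E\|v\|^2 = \sum_{k=1}^{p} R_{kk} = \sum_{k=1}^{p} D_{kk}^2\lambda_k(X) \sum_{s=1}^{p} u_s^2 + \sum_{k=1}^{p} D_{kk}^2 \sum_{s=1}^{p} u_s^2\lambda_s(X) + 2\sum_{k=1}^{p} D_{kk}^2\lambda_k(X) u_k^2.
\]

Next, we make use of the formula

\[
\|v\|^2 = \sum_{k=1}^{p} s_k(R)\xi_k^2,
\]

where $\xi_k$ are i.i.d. standard Gaussian random variables and $s_k(R)$ are the eigenvalues of $R$, i.e.

\[
R\varphi_k = s_k(R)\varphi_k, \qquad k = 1, \dots, p.
\]

Therefore by Lemma 11

\[
E\exp\bigl\{\lambda\|v\|\bigr\} \le \exp\Bigl\{ \lambda\sqrt{E\|v\|^2} + \lambda^2 s_1(R) \Bigr\}. \tag{31}
\]

In order to bound $s_1(R)$ from above, notice that for any $\varphi \in \mathbb{R}^p$

\[
[R\varphi]_i = D_{ii}^2\lambda_i(X)\|u\|^2\varphi_i + D_{ii}^2\varphi_i \sum_{k=1}^{p} \lambda_k(X) u_k^2
+ D_{ii}\lambda_i(X) u_i \sum_{k=1}^{p} D_{kk} u_k\varphi_k + D_{ii} u_i \sum_{k=1}^{p} D_{kk}\lambda_k(X) u_k\varphi_k.
\]

Hence, by the Cauchy-Schwarz inequality we obtain from the above equation

\[
s_1(R) = \max_{\|\varphi\| \le 1} \|R\varphi\|
\le \max_j \Bigl\{ D_{jj}^2\lambda_j(X) \sum_{k=1}^{p} u_k^2 + D_{jj}^2 \sum_{k=1}^{p} \lambda_k(X) u_k^2 \Bigr\}
+ 2\Bigl[\sum_{k=1}^{p} D_{kk}^2 u_k^2\Bigr]^{1/2} \Bigl[\sum_{k=1}^{p} D_{kk}^2\lambda_k^2(X) u_k^2\Bigr]^{1/2}
\]

and so (30) follows from this inequality and (31).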


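Lemma 4 Let $D$ be a $p \times p$ diagonal matrix. Then for any $\lambda \in \mathbb{R}^+$

\[
E\exp\Bigl\{ \lambda\sqrt{\operatorname{tr}\bigl[D\Xi(X)D^2\Xi(X)D\bigr]} \Bigr\}
\le \exp\Bigl\{ \lambda\sqrt{E\operatorname{tr}\bigl[D\Xi(X)D^2\Xi(X)D\bigr]} + K\lambda^2 \max_{i,j} D_{ii}^2 D_{jj}^2\lambda_j(X) \Bigr\}. \tag{32}
\]

Proof. Notice that

\[
\operatorname{tr}\bigl[D\Xi(X)D^2\Xi(X)D\bigr] = \sum_{i,j=1}^{p} D_{ii}^2 D_{jj}^2\,\Xi_{ij}^2(X)
= \sum_{i=1}^{p} D_{ii}^4\,\Xi_{ii}^2(X) + 2\sum_{j<i} D_{ii}^2 D_{jj}^2\,\Xi_{ij}^2(X). \tag{33}
\]

It is easy to check with the help of (29) that

\[
E\,\Xi_{ii}(X)\Xi_{jj}(X) = 4\lambda_i(X)\delta_{ij}, \qquad E\,\Xi_{ij}(X)\Xi_{ij}(X) = \lambda_i(X) + \lambda_j(X), \quad i \ne j,
\]

and when $j < i$ and $k < l$

\[
E\,\Xi_{ij}(X)\Xi_{kl}(X) = \bigl[\lambda_i(X) + \lambda_j(X)\bigr]\delta_{il}\delta_{kj} = 0.
\]

So, let $\xi_{ij}$, $i, j = 1, \dots, p$, be i.i.d. standard Gaussian random variables. Then we have

\[
\sum_{i=1}^{p} D_{ii}^4\,\Xi_{ii}^2(X) \stackrel{P}{=} 4\sum_{i=1}^{p} D_{ii}^4\lambda_i(X)\xi_{ii}^2 \tag{34}
\]

and

\[
\sum_{j<i} D_{ii}^2 D_{jj}^2\,\Xi_{ij}^2(X) \stackrel{P}{=} \sum_{j<i} D_{ii}^2 D_{jj}^2\bigl[\lambda_i(X) + \lambda_j(X)\bigr]\xi_{ij}^2
= \sum_{j<i} D_{ii}^2 D_{jj}^2\lambda_i(X)\xi_{ij}^2 + \sum_{j<i} D_{ii}^2 D_{jj}^2\lambda_j(X)\xi_{ij}^2. \tag{35}
\]

Hence by Lemma 11 we obtain

\[
E\exp\Bigl\{ \lambda\Bigl[\sum_{i=1}^{p} D_{ii}^4\lambda_i(X)\xi_{ii}^2\Bigr]^{1/2} \Bigr\}
\le \exp\Bigl\{ \lambda\Bigl[\sum_{i=1}^{p} D_{ii}^4\lambda_i(X)\Bigr]^{1/2} + \lambda^2 \max_i D_{ii}^4\lambda_i(X) \Bigr\},
\]

\[
E\exp\Bigl\{ \lambda\Bigl[\sum_{j<i} D_{ii}^2 D_{jj}^2\lambda_i(X)\xi_{ij}^2\Bigr]^{1/2} \Bigr\}
\le \exp\Bigl\{ \lambda\Bigl[\sum_{j<i} D_{ii}^2 D_{jj}^2\lambda_i(X)\Bigr]^{1/2} + \lambda^2 \max_{i,j} D_{jj}^2 D_{ii}^2\lambda_i(X) \Bigr\},
\]

\[
E\exp\Bigl\{ \lambda\Bigl[\sum_{j<i} D_{jj}^2 D_{ii}^2\lambda_j(X)\xi_{ij}^2\Bigr]^{1/2} \Bigr\}
\le \exp\Bigl\{ \lambda\Bigl[\sum_{j<i} D_{jj}^2 D_{ii}^2\lambda_j(X)\Bigr]^{1/2} + \lambda^2 \max_{i,j} D_{ii}^2 D_{jj}^2\lambda_j(X) \Bigr\}.
\]

Therefore, combining the above inequalities and (33)-(35) with the help of Lemma 12, we arrive at (32).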

In the proof of Theorem 4 the following simple lemma plays an important role.

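Lemma 5 Suppose Condition 1 holds. Then for any given $\alpha \in \mathbb{R}^+$ and any $\mu \in \mathbb{R}^n$

\[
E L\bigl[\mu, G^{\alpha}(X, \zeta)\bigr] \le \Bigl(1 + \frac{KQp\sigma^2}{\alpha}\Bigr) L\bigl[\mu, H^{\alpha}(X)\bigr] + \frac{\sigma^2 p}{\alpha^2}\bigl\|X^\top\bigl[I - H^{\alpha}(X)\bigr]\mu\bigr\|^2, \tag{36}
\]

where

\[
L[\mu, H] = \|(I - H)\mu\|^2 + \epsilon^2 \operatorname{tr}\bigl[HH^\top\bigr].
\]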

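Proof. Since $L[\mu, H]$ is a quadratic functional in $H$, we have (see (25))

\[
\begin{aligned}
E L\bigl[\mu, G^{\alpha}(X, \zeta)\bigr] ={}& L[\mu, H^{\alpha}(X)]
+ \frac{\sigma^2}{\alpha^2} E\bigl\|[I - H^{\alpha}(X)]\zeta(X)[I - H^{\alpha}(X)]\mu\bigr\|^2 \\
&+ \frac{\sigma^2\epsilon^2}{\alpha^2} E\operatorname{tr}\Bigl\{ [I - H^{\alpha}(X)]\zeta(X)[I - H^{\alpha}(X)] [I - H^{\alpha}(X)]^\top\zeta(X)^\top[I - H^{\alpha}(X)]^\top \Bigr\}.
\end{aligned} \tag{37}
\]

Let $\bar\mu(X)$ be the vector in $\mathbb{R}^p$ with components $\bar\mu_k(X) = \langle\mu, e^*_k(X)\rangle$ and $h^{\alpha}(X)$ be the diagonal matrix with the entries $h^{\alpha}_k(X)$, $k = 1, \dots, p$, defined by (28).

Then, with the help of Lemma 2, we obtain

\[
\begin{aligned}
E\bigl\|[I - H^{\alpha}(X)]\zeta(X)[I - H^{\alpha}(X)]\mu\bigr\|^2
&= \bar\mu^\top(X)\bigl[I - h^{\alpha}(X)\bigr] E\Bigl\{\Xi^\top\bigl[I - h^{\alpha}(X)\bigr]^2\Xi\Bigr\}\bigl[I - h^{\alpha}(X)\bigr]\bar\mu(X) \\
&= \sum_{i=1}^{p} \bigl[1 - h^{\alpha}_i(X)\bigr]^2\bar\mu_i^2(X) \Bigl\{ 2\bigl[1 - h^{\alpha}_i(X)\bigr]^2\lambda_i(X) + \lambda_i(X) \sum_{k=1}^{p} \bigl[1 - h^{\alpha}_k(X)\bigr]^2 + \sum_{k=1}^{p} \bigl[1 - h^{\alpha}_k(X)\bigr]^2\lambda_k(X) \Bigr\} \\
&\le \alpha(2 + p) \sum_{i=1}^{p} \bigl[1 - h^{\alpha}_i(X)\bigr]^2\bar\mu_i^2(X) + p \sum_{i=1}^{p} \bigl[1 - h^{\alpha}_i(X)\bigr]^2\bar\mu_i^2(X)\lambda_i(X).
\end{aligned}
\]

With similar arguments and Condition 1 we arrive at

\[
\begin{aligned}
& E\operatorname{tr}\Bigl\{ [I - H^{\alpha}(X)]\zeta(X)[I - H^{\alpha}(X)][I - H^{\alpha}(X)]^\top\zeta(X)^\top[I - H^{\alpha}(X)]^\top \Bigr\} \\
&\qquad = \operatorname{tr}\Bigl\{ [I - h^{\alpha}(X)] E\bigl[\Xi[I - h^{\alpha}(X)]^2\Xi^\top\bigr] [I - h^{\alpha}(X)] \Bigr\} \\
&\qquad = \sum_{i=1}^{p} \bigl[1 - h^{\alpha}_i(X)\bigr]^2 \Bigl\{ 2\bigl[1 - h^{\alpha}_i(X)\bigr]^2\lambda_i(X) + \lambda_i(X) \sum_{k=1}^{p} \bigl[1 - h^{\alpha}_k(X)\bigr]^2 + \sum_{k=1}^{p} \bigl[1 - h^{\alpha}_k(X)\bigr]^2\lambda_k(X) \Bigr\} \\
&\qquad \le 4Q\alpha p \sum_{i=1}^{p} \bigl[h^{\alpha}_i(X)\bigr]^2.
\end{aligned}
\]

Thus (36) follows from (37) and the above inequalities.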

The following lemma controls the concentration of L[µ, G^α(X, ζ)].

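Lemma 6 For any given $\alpha > 0$ and $\lambda \ge 0$

\[
\begin{aligned}
E\exp\Bigl\{ \lambda\sqrt{L\bigl[\mu, G^{\alpha}(X, \zeta)\bigr]} \Bigr\}
\le \exp\Bigl\{ & \lambda\Bigl[\Bigl(1 + \frac{KQp\sigma^2}{\alpha}\Bigr) L\bigl[\mu, H^{\alpha}(X)\bigr] + \frac{\sigma^2 p}{\alpha^2}\bigl\|X^\top[I - H^{\alpha}(X)]\mu\bigr\|^2\Bigr]^{1/2} \\
&+ \frac{K\lambda^2\sigma^2}{\alpha}\Bigl[ L\bigl[\mu, H^{\alpha}(X)\bigr] + \epsilon^2 + \alpha^{-1}\bigl\|X^\top[I - H^{\alpha}(X)]\mu\bigr\|^2 \Bigr] \Bigr\}.
\end{aligned} \tag{38}
\]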

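Proof. Notice that $L[\mu, G^{\alpha}(X, \zeta)]$ admits the following decomposition:

\[
L\bigl[\mu, G^{\alpha}(X, \zeta)\bigr] = L\bigl[\mu, H^{\alpha}(X)\bigr] + \frac{2\sigma}{\alpha}\Delta^{\alpha}_1(\mu, X, \Xi) + \frac{\sigma^2}{\alpha^2}\Delta^{\alpha}_2(\mu, X, \Xi), \tag{39}
\]

where

\[
\Delta^{\alpha}_1(\mu, X, \Xi) = \bar\mu^\top[I - h^{\alpha}(X)]^2\,\Xi\,[I - h^{\alpha}(X)]^2\bar\mu + \epsilon^2 \operatorname{tr}\Bigl\{ h^{\alpha}(X)[I - h^{\alpha}(X)]\,\Xi\,[I - h^{\alpha}(X)] \Bigr\}
\]

and

\[
\Delta^{\alpha}_2(\mu, X, \Xi) = \bar\mu^\top[I - h^{\alpha}(X)]\,\Xi\,[I - h^{\alpha}(X)]^2\,\Xi^\top[I - h^{\alpha}(X)]\bar\mu + \epsilon^2 \operatorname{tr}\Bigl\{ [I - h^{\alpha}(X)]\,\Xi\,[I - h^{\alpha}(X)]^2\,\Xi^\top[I - h^{\alpha}(X)] \Bigr\}.
\]

Recall that $\bar\mu \in \mathbb{R}^p$ in the above equations has components $\bar\mu_k = \langle\mu, e^*_k(X)\rangle$.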